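## Dataset

| No. | Field | Type | Description |
| --- | --- | --- | --- |
| 1 | ID | Int | Unique customer identifier |
| 2 | age | Int | Customer age |
| 3 | job | String | Customer occupation |
| 4 | marital | String | Marital status |
| 5 | education | String | Education level |
| 6 | default | String | Whether the customer has a default record |
| 7 | balance | Int | Average yearly account balance |
| 8 | housing | String | Whether the customer has a housing loan |
| 9 | loan | String | Whether the customer has a personal loan |
| 10 | contact | String | Communication channel used to contact the customer |
| 11 | day | Int | Day of the month of the last contact |
| 12 | month | String | Month of the last contact |
| 13 | duration | Int | Duration of the last contact |
| 14 | campaign | Int | Number of contacts with this customer during the current campaign |
| 15 | pdays | Int | Time elapsed since the customer was last contacted in the previous campaign (documented as 999 for never contacted; the data loaded in this notebook uses -1) |
| 16 | previous | Int | Number of contacts with this customer before the current campaign |
| 17 | poutcome | String | Outcome of the previous campaign |
| 18 | y | Int | Target: whether the customer subscribes to a term deposit |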

In [1]:
import pandas as pd
import numpy as np

### Preview

In [2]:
train_data = pd.read_csv('../Data/train_set.csv')
print(train_data.shape)

train_data.head(3)

(25317, 18)


Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,1,43,management,married,tertiary,no,291,yes,no,unknown,9,may,150,2,-1,0,unknown,0
1,2,42,technician,divorced,primary,no,5076,yes,no,cellular,7,apr,99,1,251,2,other,0
2,3,47,admin.,married,secondary,no,104,yes,yes,cellular,14,jul,77,2,-1,0,unknown,0


In [3]:
test_data = pd.read_csv('../Data/test_set.csv')
print(test_data.shape)

test_data.head(3)

(10852, 17)


Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,25318,51,housemaid,married,unknown,no,174,no,no,telephone,29,jul,308,3,-1,0,unknown
1,25319,32,management,married,tertiary,no,6059,yes,no,cellular,20,nov,110,2,-1,0,unknown
2,25320,60,retired,married,primary,no,0,no,no,telephone,30,jul,130,3,-1,0,unknown


### Training set data types

| job | marital | education | default | housing | loan | contact | month | poutcome | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 

In [4]:
train_data.columns

Index(['ID', 'age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome', 'y'],
      dtype='object')

In [5]:
lst_strFeaNames = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

In [6]:
df_strFeaTypes = pd.DataFrame(columns=lst_strFeaNames, index=['Count', 'Types', 'Num'])

for n in lst_strFeaNames:
    ser = train_data[n].value_counts()
    # Use .at instead of chained indexing, which may write to a temporary copy
    df_strFeaTypes.at['Count', n] = len(ser)
    df_strFeaTypes.at['Types', n] = list(ser.index)
    df_strFeaTypes.at['Num', n] = list(ser)

df_strFeaTypes

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,poutcome
Count,12,3,4,2,2,2,3,12,4
Types,"[blue-collar, management, technician, admin., ...","[married, single, divorced]","[secondary, tertiary, primary, unknown]","[no, yes]","[yes, no]","[no, yes]","[cellular, unknown, telephone]","[may, jul, aug, jun, nov, apr, feb, jan, oct, ...","[unknown, failure, other, success]"
Num,"[5456, 5296, 4241, 2909, 2342, 1273, 884, 856,...","[15245, 7157, 2915]","[12957, 7447, 3848, 1065]","[24869, 448]","[14020, 11297]","[21258, 4059]","[16391, 7281, 1645]","[7655, 3937, 3482, 2968, 2243, 1669, 1464, 777...","[20677, 2735, 1070, 835]"
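The same per-feature summary (number of levels, the levels, and their frequencies) can be built without filling an empty frame cell by cell. The sketch below is only illustrative: `toy` is a hypothetical miniature of the training frame, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the training frame (values are illustrative only).
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced"],
    "housing": ["yes", "no", "yes", "yes"],
})

def summarize(col):
    """Return level count, levels, and frequencies for one column."""
    counts = toy[col].value_counts()
    return pd.Series(
        {"Count": len(counts),
         "Types": counts.index.tolist(),
         "Num": counts.tolist()},
        dtype=object,
    )

# One column per feature, with rows Count / Types / Num.
summary = pd.DataFrame({col: summarize(col) for col in toy.columns})
print(summary)
```

Building each column as an object-typed `Series` keeps the list-valued cells intact.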


In [7]:
lst_intFeaNames = ['age', 'balance', 'day','duration', 'campaign', 'pdays', 'previous']

In [8]:
df_intFeaNames = pd.DataFrame(columns=lst_intFeaNames, index=['Count', 'Types', 'Num'])

for n in lst_intFeaNames:
    ser = train_data[n].value_counts()
    # Use .at instead of chained indexing, which may write to a temporary copy
    df_intFeaNames.at['Count', n] = len(ser)
    df_intFeaNames.at['Types', n] = list(ser.index)
    df_intFeaNames.at['Num', n] = list(ser)

df_intFeaNames

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
Count,75,5736,31,1388,43,493,36
Types,"[32, 31, 33, 35, 34, 36, 30, 37, 39, 38, 40, 4...","[0, 1, 4, 2, 3, 5, 6, 8, 23, 7, 47, 10, 94, 25...","[20, 18, 21, 17, 5, 6, 28, 7, 8, 14, 19, 15, 2...","[124, 90, 111, 96, 82, 89, 106, 73, 72, 81, 15...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16...","[-1, 182, 92, 183, 91, 181, 370, 95, 94, 184, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 10, 12, 13,..."
Num,"[1209, 1134, 1110, 1039, 1025, 1020, 991, 946,...","[1936, 121, 82, 81, 72, 66, 52, 45, 42, 40, 37...","[1552, 1331, 1121, 1086, 1081, 1056, 1037, 103...","[119, 110, 104, 101, 101, 100, 99, 99, 98, 97,...","[9825, 7009, 3098, 1957, 989, 723, 406, 302, 1...","[20674, 93, 87, 77, 76, 66, 57, 49, 45, 44, 44...","[20674, 1544, 1178, 631, 413, 270, 152, 120, 7..."


In [9]:
max(df_intFeaNames['pdays']['Types'])

854
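The largest `pdays` value is 854, and -1 is by far the most frequent value (20674 rows, the same count as `previous == 0`). This suggests that -1, not 999, marks customers who were never contacted before. Assuming that reading, the sentinel can be separated from real day counts as sketched below; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the pdays column; -1 is assumed to mean "never contacted".
pdays = pd.Series([-1, 251, -1, 92, 854])

# Boolean flag for customers who were contacted in a previous campaign.
contacted_before = pdays.ne(-1)

# Days since last contact, with the sentinel replaced by NaN so that it does
# not distort statistics such as the mean or the maximum.
days_since = pdays.where(contacted_before, np.nan)

print(contacted_before.sum())
print(days_since.max())
```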

## Embedding-Word2Vec

`Plan`:
- Convert each str value to a 16-dimensional vector with Word2Vec

In [10]:
['age','balance','day','duration','campaign','pdays','previous']
big = ['balance',]    # larger is more favorable
small = ['day',]      # smaller is more favorable
mid = ['age',]        # mid-range is favorable

In [11]:
from copy import deepcopy

from gensim.models import Word2Vec                                         # word vectors


### Training 16-dimensional word vectors for str features

In [12]:
train_vec = pd.read_csv('../Data/SMSSpamCollection.csv', sep='\t', header=None, names=['label', 'sms_message'])

train_vec.head()

Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#### Word2Vec

In [13]:
d_strModelVec = {}
for n in lst_strFeaNames:
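    # Map each word to a `size`-dimensional vector (gensim 3.x API; gensim 4 renamed `size` to `vector_size`)
    # Only build_vocab() is called and train() is never run, so the vectors keep their random initial values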
    m = Word2Vec(min_count=1, size=16)
    m.build_vocab([df_strFeaTypes[n]['Types']])
    
    d_strModelVec[n] = m

In [16]:
d_strModelVec['job'].wv.vocab.keys()

dict_keys(['blue-collar', 'management', 'technician', 'admin.', 'services', 'retired', 'self-employed', 'entrepreneur', 'unemployed', 'housemaid', 'student', 'unknown'])

In [14]:
print('str keys:\t', d_strModelVec.keys())

print('shape:\t',d_strModelVec['job'].wv.vectors.shape)

print('Names:\t', d_strModelVec['job'].wv.vocab.keys())

str keys:	 dict_keys(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome'])
shape:	 (12, 16)
Names:	 dict_keys(['blue-collar', 'management', 'technician', 'admin.', 'services', 'retired', 'self-employed', 'entrepreneur', 'unemployed', 'housemaid', 'student', 'unknown'])


In [54]:
d_strModelVec['job'].wv['blue-collar']

array([ 0.00520166,  0.01996946, -0.01445977, -0.01011027,  0.00940884,
        0.01760333,  0.01011164,  0.02380495, -0.01202753,  0.02323542,
       -0.00147088,  0.03064089, -0.02397148,  0.00165667,  0.02042116,
       -0.01010898], dtype=float32)
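All components printed above lie within ±1/32 (that is, ±0.5/16), which is consistent with uniformly random initial weights rather than trained embeddings. Since `train()` is never called, each category effectively receives a fixed random vector. A gensim-free sketch of an equivalent lookup table follows; `make_random_table` is a hypothetical helper, and the initialization range is an assumption based on the printed values.

```python
import numpy as np

def make_random_table(tokens, dim=16, seed=0):
    """Assign each token a fixed random vector in [-0.5/dim, 0.5/dim)."""
    rng = np.random.default_rng(seed)
    return {
        tok: ((rng.random(dim) - 0.5) / dim).astype(np.float32)
        for tok in tokens
    }

table = make_random_table(["married", "single", "divorced"])
print(table["married"].shape)
```

Such vectors only act as arbitrary identifiers for the categories; they carry no similarity structure unless a model is actually trained on co-occurrence data.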

#### Converting str values to 16-dimensional vectors with Word2Vec

In [55]:
df_vecStr = deepcopy(train_data)

df_vecStr.head(3)

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,1,43,management,married,tertiary,no,291,yes,no,unknown,9,may,150,2,-1,0,unknown,0
1,2,42,technician,divorced,primary,no,5076,yes,no,cellular,7,apr,99,1,251,2,other,0
2,3,47,admin.,married,secondary,no,104,yes,yes,cellular,14,jul,77,2,-1,0,unknown,0


In [56]:
for n in d_strModelVec.keys():
    df_vecStr[n] = df_vecStr[n].apply(lambda v:d_strModelVec[n].wv[v])
    
df_vecStr.head(3)

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,1,43,"[0.018116662, 0.022473088, -0.021333665, 0.009...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.021311874, 0.024186863, 0.025521293, 0.005...","[-0.014931658, -0.004090762, -0.0220671, 0.005...",291,"[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.023616474, -0.019877823, 0.020715777, -0.0...",9,"[0.029583937, 0.016554277, 0.016240135, -0.001...",150,2,-1,0,"[-0.023616474, -0.019877823, 0.020715777, -0.0...",0
1,2,42,"[-0.00016407517, 0.019021202, 0.0011287049, -0...","[0.023925979, 0.0042528315, -0.024949398, -0.0...","[-0.014845046, 0.027101468, -0.01265675, 0.029...","[-0.014931658, -0.004090762, -0.0220671, 0.005...",5076,"[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.012542447, -0.0015956812, -0.025851712, 0....",7,"[0.009946998, -0.02076528, 0.011940169, -0.006...",99,1,251,2,"[0.014372752, -0.003626981, 0.030918641, -0.01...",0
2,3,47,"[-0.012270937, -0.030331438, 0.015268925, 0.00...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.029846007, 0.0009747704, -0.015414518, -0....","[-0.014931658, -0.004090762, -0.0220671, 0.005...",104,"[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.012542447, -0.0015956812, -0.025851712, 0....",14,"[0.011986204, -0.01250622, 0.0052852114, 0.024...",77,2,-1,0,"[-0.023616474, -0.019877823, 0.020715777, -0.0...",0


### Converting int to str and training 16-dimensional vectors

#### Word2Vec, int converted to str

In [57]:
d_intModelVec = {}
for n in lst_intFeaNames:
    # Map each word to a `size`-dimensional vector
    m = Word2Vec(min_count=1, size=16)
    l_s = [str(v) for v in df_intFeaNames[n]['Types']]
    m.build_vocab([l_s])
    
    d_intModelVec[n] = m

In [58]:
print('int keys:\t', d_intModelVec.keys())

print('shape:\t',d_intModelVec['age'].wv.vectors.shape)

print('Names:\t', d_intModelVec['age'].wv.vocab.keys())

int keys:	 dict_keys(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous'])
shape:	 (75, 16)
Names:	 dict_keys(['32', '31', '33', '35', '34', '36', '30', '37', '39', '38', '40', '41', '46', '42', '45', '43', '29', '44', '47', '48', '28', '51', '49', '50', '52', '27', '53', '26', '57', '55', '56', '58', '54', '59', '60', '25', '24', '23', '61', '22', '63', '21', '62', '70', '65', '64', '69', '73', '66', '72', '71', '77', '20', '67', '75', '74', '19', '76', '68', '78', '79', '83', '82', '80', '81', '84', '18', '86', '85', '89', '95', '87', '90', '92', '93'])


In [59]:
d_intModelVec['age'].wv['50']

array([-0.02442143, -0.03108581, -0.00496869,  0.02264105, -0.02612359,
       -0.01928991, -0.00250499,  0.01268252,  0.01747569,  0.01339408,
        0.00232439, -0.00047512,  0.00404299, -0.00101422, -0.01086552,
        0.01291022], dtype=float32)

#### Converting str(int) values to 16-dimensional vectors with Word2Vec

In [60]:
df_vecInt = deepcopy(df_vecStr)

df_vecInt.head(3)

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,1,43,"[0.018116662, 0.022473088, -0.021333665, 0.009...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.021311874, 0.024186863, 0.025521293, 0.005...","[-0.014931658, -0.004090762, -0.0220671, 0.005...",291,"[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.023616474, -0.019877823, 0.020715777, -0.0...",9,"[0.029583937, 0.016554277, 0.016240135, -0.001...",150,2,-1,0,"[-0.023616474, -0.019877823, 0.020715777, -0.0...",0
1,2,42,"[-0.00016407517, 0.019021202, 0.0011287049, -0...","[0.023925979, 0.0042528315, -0.024949398, -0.0...","[-0.014845046, 0.027101468, -0.01265675, 0.029...","[-0.014931658, -0.004090762, -0.0220671, 0.005...",5076,"[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.012542447, -0.0015956812, -0.025851712, 0....",7,"[0.009946998, -0.02076528, 0.011940169, -0.006...",99,1,251,2,"[0.014372752, -0.003626981, 0.030918641, -0.01...",0
2,3,47,"[-0.012270937, -0.030331438, 0.015268925, 0.00...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.029846007, 0.0009747704, -0.015414518, -0....","[-0.014931658, -0.004090762, -0.0220671, 0.005...",104,"[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.012542447, -0.0015956812, -0.025851712, 0....",14,"[0.011986204, -0.01250622, 0.0052852114, 0.024...",77,2,-1,0,"[-0.023616474, -0.019877823, 0.020715777, -0.0...",0


In [64]:
for n in d_intModelVec.keys():
    df_vecInt[n] = df_vecInt[n].apply(lambda v:d_intModelVec[n].wv[str(v)])
    
df_vecInt.head(3)

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,1,"[-0.019752193, 0.021300457, 0.0242151, -0.0046...","[0.018116662, 0.022473088, -0.021333665, 0.009...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.021311874, 0.024186863, 0.025521293, 0.005...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[0.005519224, 0.005089936, -0.018180123, 0.023...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.023616474, -0.019877823, 0.020715777, -0.0...","[-0.021098055, 0.0034826696, -0.019695386, -0....","[0.029583937, 0.016554277, 0.016240135, -0.001...","[-0.008920803, 0.021988817, 0.0039854147, 0.02...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[-0.021917518, 0.010373652, 0.0016121529, 0.00...","[0.02397878, -0.00014616139, -0.015256789, 0.0...","[-0.023616474, -0.019877823, 0.020715777, -0.0...",0
1,2,"[-0.010871594, 0.013240599, 5.5315537e-05, 0.0...","[-0.00016407517, 0.019021202, 0.0011287049, -0...","[0.023925979, 0.0042528315, -0.024949398, -0.0...","[-0.014845046, 0.027101468, -0.01265675, 0.029...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[0.023087101, -0.02101208, -0.007457626, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.012542447, -0.0015956812, -0.025851712, 0....","[-0.00278352, -0.025637574, 0.0032752936, 0.03...","[0.009946998, -0.02076528, 0.011940169, -0.006...","[-0.0016354138, -0.0056181, -0.004213516, -0.0...","[-0.0098251905, 0.026561484, -0.0011860528, 0....","[0.016412089, 0.008561323, -0.010666011, 0.026...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[0.014372752, -0.003626981, 0.030918641, -0.01...",0
2,3,"[0.012307459, -0.0059652026, -0.0008859696, 0....","[-0.012270937, -0.030331438, 0.015268925, 0.00...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.029846007, 0.0009747704, -0.015414518, -0....","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.02649962, 0.029122839, -0.001093726, 0.021...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.012542447, -0.0015956812, -0.025851712, 0....","[0.028220924, -0.0035779716, 0.01906455, -0.00...","[0.011986204, -0.01250622, 0.0052852114, 0.024...","[0.022317251, -0.01884651, -0.024422843, -0.02...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[-0.021917518, 0.010373652, 0.0016121529, 0.00...","[0.02397878, -0.00014616139, -0.015256789, 0.0...","[-0.023616474, -0.019877823, 0.020715777, -0.0...",0
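The lookup `wv[str(v)]` raises a `KeyError` for any value missing from the vocabulary, which was built from training values only. That becomes likely once the same mapping is applied to the test set, for example for a `balance` value that never occurs in training. Below is a sketch of a lookup with a fallback; `lookup_with_fallback` and the zero-vector default are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def lookup_with_fallback(table, value, dim=16):
    """Return the vector for str(value), or a zero vector if it is unseen."""
    key = str(value)
    if key in table:
        return table[key]
    return np.zeros(dim, dtype=np.float32)

# Hypothetical vocabulary built from training values only.
table = {"291": np.full(16, 0.01, dtype=np.float32)}

seen = lookup_with_fallback(table, 291)
unseen = lookup_with_fallback(table, 6059)   # treated here as absent from training
print(seen[:3], unseen[:3])
```

With a gensim model the membership test would be `key in model.wv` in place of the dictionary check.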


## Reassembling the full training set

In [69]:
from sklearn.model_selection import train_test_split

In [65]:
df_totalTrain = deepcopy(df_vecInt)
df_totalTrain = df_totalTrain.drop(columns=['ID','y'])

df_totalY =  deepcopy(df_vecInt[['ID','y']])

df_totalTrain.head(3)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,"[-0.019752193, 0.021300457, 0.0242151, -0.0046...","[0.018116662, 0.022473088, -0.021333665, 0.009...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.021311874, 0.024186863, 0.025521293, 0.005...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[0.005519224, 0.005089936, -0.018180123, 0.023...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.023616474, -0.019877823, 0.020715777, -0.0...","[-0.021098055, 0.0034826696, -0.019695386, -0....","[0.029583937, 0.016554277, 0.016240135, -0.001...","[-0.008920803, 0.021988817, 0.0039854147, 0.02...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[-0.021917518, 0.010373652, 0.0016121529, 0.00...","[0.02397878, -0.00014616139, -0.015256789, 0.0...","[-0.023616474, -0.019877823, 0.020715777, -0.0..."
1,"[-0.010871594, 0.013240599, 5.5315537e-05, 0.0...","[-0.00016407517, 0.019021202, 0.0011287049, -0...","[0.023925979, 0.0042528315, -0.024949398, -0.0...","[-0.014845046, 0.027101468, -0.01265675, 0.029...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[0.023087101, -0.02101208, -0.007457626, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.012542447, -0.0015956812, -0.025851712, 0....","[-0.00278352, -0.025637574, 0.0032752936, 0.03...","[0.009946998, -0.02076528, 0.011940169, -0.006...","[-0.0016354138, -0.0056181, -0.004213516, -0.0...","[-0.0098251905, 0.026561484, -0.0011860528, 0....","[0.016412089, 0.008561323, -0.010666011, 0.026...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[0.014372752, -0.003626981, 0.030918641, -0.01..."
2,"[0.012307459, -0.0059652026, -0.0008859696, 0....","[-0.012270937, -0.030331438, 0.015268925, 0.00...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.029846007, 0.0009747704, -0.015414518, -0....","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.02649962, 0.029122839, -0.001093726, 0.021...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.012542447, -0.0015956812, -0.025851712, 0....","[0.028220924, -0.0035779716, 0.01906455, -0.00...","[0.011986204, -0.01250622, 0.0052852114, 0.024...","[0.022317251, -0.01884651, -0.024422843, -0.02...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[-0.021917518, 0.010373652, 0.0016121529, 0.00...","[0.02397878, -0.00014616139, -0.015256789, 0.0...","[-0.023616474, -0.019877823, 0.020715777, -0.0..."


In [66]:
df_totalY.head(3)

Unnamed: 0,ID,y
0,1,0
1,2,0
2,3,0


### Combining the per-column 16\*1\*16 data into 16\*16\*1

In [83]:
df_totalTrain.iloc[:3]

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,"[-0.019752193, 0.021300457, 0.0242151, -0.0046...","[0.018116662, 0.022473088, -0.021333665, 0.009...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.021311874, 0.024186863, 0.025521293, 0.005...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[0.005519224, 0.005089936, -0.018180123, 0.023...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.023616474, -0.019877823, 0.020715777, -0.0...","[-0.021098055, 0.0034826696, -0.019695386, -0....","[0.029583937, 0.016554277, 0.016240135, -0.001...","[-0.008920803, 0.021988817, 0.0039854147, 0.02...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[-0.021917518, 0.010373652, 0.0016121529, 0.00...","[0.02397878, -0.00014616139, -0.015256789, 0.0...","[-0.023616474, -0.019877823, 0.020715777, -0.0..."
1,"[-0.010871594, 0.013240599, 5.5315537e-05, 0.0...","[-0.00016407517, 0.019021202, 0.0011287049, -0...","[0.023925979, 0.0042528315, -0.024949398, -0.0...","[-0.014845046, 0.027101468, -0.01265675, 0.029...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[0.023087101, -0.02101208, -0.007457626, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.012542447, -0.0015956812, -0.025851712, 0....","[-0.00278352, -0.025637574, 0.0032752936, 0.03...","[0.009946998, -0.02076528, 0.011940169, -0.006...","[-0.0016354138, -0.0056181, -0.004213516, -0.0...","[-0.0098251905, 0.026561484, -0.0011860528, 0....","[0.016412089, 0.008561323, -0.010666011, 0.026...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[0.014372752, -0.003626981, 0.030918641, -0.01..."
2,"[0.012307459, -0.0059652026, -0.0008859696, 0....","[-0.012270937, -0.030331438, 0.015268925, 0.00...","[0.011621209, -0.013714126, -0.024665294, 0.02...","[-0.029846007, 0.0009747704, -0.015414518, -0....","[-0.014931658, -0.004090762, -0.0220671, 0.005...","[-0.02649962, 0.029122839, -0.001093726, 0.021...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.02737304, 0.029297203, -0.022824943, -0.01...","[-0.012542447, -0.0015956812, -0.025851712, 0....","[0.028220924, -0.0035779716, 0.01906455, -0.00...","[0.011986204, -0.01250622, 0.0052852114, 0.024...","[0.022317251, -0.01884651, -0.024422843, -0.02...","[-0.014961863, -0.02937544, -0.028882926, 0.02...","[-0.021917518, 0.010373652, 0.0016121529, 0.00...","[0.02397878, -0.00014616139, -0.015256789, 0.0...","[-0.023616474, -0.019877823, 0.020715777, -0.0..."


In [88]:
np.array(list(df_totalTrain.iloc[0])).shape

(16, 16)

In [104]:
np.array([np.array(list(df_totalTrain.iloc[i])) for i in range(3)]).shape

(3, 16, 16)

In [105]:
# Full training set
int_trainLen = df_totalTrain.shape[0]
ary_totalTrain = np.array([np.array(list(df_totalTrain.iloc[i])) for i in range(int_trainLen)])

ary_totalTrain.shape, ary_totalTrain[0].shape

((25317, 16, 16), (16, 16))
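Building the array row by row with `iloc` works but loops over all 25317 rows in Python. A column-wise alternative is sketched below on a hypothetical toy frame (3 rows, 2 features, 4-dimensional vectors); it should produce the same `(rows, features, dim)` layout.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 3 rows, 2 feature columns, each cell a 4-dimensional vector.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": [rng.random(4) for _ in range(3)],
    "job": [rng.random(4) for _ in range(3)],
})

# Row-by-row construction, as in the cell above.
by_rows = np.array([np.array(list(toy.iloc[i])) for i in range(len(toy))])

# Column-wise construction: stack each column to (rows, dim), then join the
# columns along a new axis 1 to obtain (rows, features, dim).
by_cols = np.stack([np.stack(toy[c].to_numpy()) for c in toy.columns], axis=1)

print(by_rows.shape, by_cols.shape)
```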

In [None]:
# # Full target set
# int_testLen = te.shape[0]
# ary_totalTrain = np.array([np.array(list(df_totalTrain.iloc[i])) for i in range(int_trainLen)])

# ary_totalTrain.shape, ary_totalTrain[0].shape

### Splitting the word-vector dataset

In [108]:
X, y = ary_totalTrain, df_totalY['y']

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1,random_state=1)

In [110]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((22785, 16, 16), (2532, 16, 16), (22785,), (2532,))
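The split sizes follow from the sample count and `test_size=0.1`. The arithmetic below assumes scikit-learn rounds the test share up and gives the remainder to the training set, which matches the shapes printed above.

```python
import math

n_samples = 25317
test_size = 0.1

# Test share is rounded up; the training set receives the remaining samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```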

## CNN

In [111]:
from keras.models import Sequential                            # sequential model container

from keras.layers import Dense                                 # fully connected layer

from keras.layers import Conv2D, MaxPooling2D                  # convolution and pooling layers

from keras.layers import Flatten                               # flattens the convolution output

### Training

In [129]:
model_cnn = Sequential()
"""
Convolutions use 32 (then 64) 3*3 kernels, pooling is 2*2, activation is ReLU
"""
# Block 1: two convolutions + max pooling (input_shape is only needed on the first layer)
model_cnn.add(Conv2D(32, (3, 3), input_shape=(16, 16, 1), activation='relu'))
model_cnn.add(Conv2D(32, (3, 3), activation='relu'))
model_cnn.add(MaxPooling2D(pool_size=(2, 2)))

# Block 2: two convolutions + max pooling
model_cnn.add(Conv2D(64, (3, 3), activation='relu'))
model_cnn.add(Conv2D(64, (3, 3), activation='relu'))
model_cnn.add(MaxPooling2D(pool_size=(2, 2)))

# Flatten + two Dense layers
model_cnn.add(Flatten())
model_cnn.add(Dense(128, activation='relu'))
# A single-unit softmax always outputs 1.0; sigmoid is the matching activation for binary cross-entropy
model_cnn.add(Dense(1, activation='sigmoid'))

# Compile with binary cross-entropy loss and the Adam optimizer
model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_cnn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_25 (Conv2D)           (None, 14, 14, 32)        320       
_________________________________________________________________
conv2d_26 (Conv2D)           (None, 12, 12, 32)        9248      
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 6, 6, 32)          0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 4, 4, 64)          18496     
_________________________________________________________________
conv2d_28 (Conv2D)           (None, 2, 2, 64)          36928     
_________________________________________________________________
max_pooling2d_14 (MaxPooling (None, 1, 1, 64)          0         
_________________________________________________________________
flatten_7 (Flatten)          (None, 64)                0         
__________
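The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes stride-1 convolutions without padding and non-overlapping 2\*2 pooling, as in the model definition; `conv_out` and `conv_params` are hypothetical helper names.

```python
def conv_out(size, kernel=3):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1

def conv_params(kernel, in_ch, out_ch):
    """Kernel weights plus one bias per output channel."""
    return (kernel * kernel * in_ch + 1) * out_ch

size = 16
shapes = []
for _ in range(2):                    # two conv blocks
    size = conv_out(conv_out(size))   # two 3*3 convolutions in a row
    shapes.append(size)
    size //= 2                        # 2*2 max pooling halves the width
    shapes.append(size)

params = [conv_params(3, 1, 32), conv_params(3, 32, 32),
          conv_params(3, 32, 64), conv_params(3, 64, 64)]
flattened = size * size * 64          # features entering the first Dense layer

print(shapes, params, flattened)
```

The spatial size shrinks to 1\*1 after the second pooling step, so the dense layers see only 64 features per sample.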

In [130]:
X_train_shape = X_train.reshape(X_train.shape[0], 16, 16, 1).astype('float32')

X_train_shape.shape

(22785, 16, 16, 1)

In [131]:
X_test_shape = X_test.reshape(X_test.shape[0], 16, 16, 1).astype('float32')

X_test_shape.shape

(2532, 16, 16, 1)

In [132]:
# Train the model: training data, validation data, number of epochs, batch size, verbosity (1 shows a progress bar)
model_cnn.fit(X_train_shape, y_train, validation_data = (X_test_shape, y_test), epochs = 2, batch_size = 200, verbose = 1)

Train on 22785 samples, validate on 2532 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1836ebe0>
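The output layer has a single unit and is trained with binary cross-entropy. For such a layer sigmoid is the usual activation, whereas softmax over one unit normalizes a single value by itself and returns 1.0 for every sample, so predictions cannot vary. A small NumPy sketch of the difference, with made-up logit values:

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis, shifted for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits from a one-unit output layer, one row per sample.
logits = np.array([[-3.0], [0.0], [2.5]])

print(softmax(logits).ravel())   # the same value for every sample
print(sigmoid(logits).ravel())   # varies with the logit
```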