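## Introduction

`Training set`: 16 features, plus `ID` and the target `y`

| No. | Field | Type | Description |
| --- | --- | --- | --- |
| 1 | ID | Int | Unique customer identifier |
| 2 | age | Int | Customer age |
| 3 | job | String | Customer occupation |
| 4 | marital | String | Marital status |
| 5 | education | String | Education level |
| 6 | default | String | Whether the customer has a default on record |
| 7 | balance | Int | Average yearly account balance |
| 8 | housing | String | Whether the customer has a housing loan |
| 9 | loan | String | Whether the customer has a personal loan |
| 10 | contact | String | Channel used to contact the customer |
| 11 | day | Int | Date of the last contact (day of month) |
| 12 | month | String | Date of the last contact (month) |
| 13 | duration | Int | Duration of the last contact |
| 14 | campaign | Int | Number of contacts with this customer during the current campaign |
| 15 | pdays | Int | Time elapsed since the customer was last contacted in the previous campaign (the field description gives 999 for "never contacted"; the data shown below uses -1) |
| 16 | previous | Int | Number of contacts with this customer before the current campaign |
| 17 | poutcome | String | Outcome of the previous campaign |
| 18 | y | Int | Target: whether the customer subscribes to a term deposit |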

- Scored by `AUC`
- Submission format:

| ID | pred |
| --- | --- |
| 25318 | 0.123456 |
| 25319 | 0.654321 |
| 25320 | 0.799212 |
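
Because submissions are scored by AUC, only the ranking of the `pred` values matters, not their absolute scale. As a rough illustration, the sketch below computes AUC in plain Python as the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The name `pairwise_auc` and the toy labels and scores are made up for this example; on real data `sklearn.metrics.roc_auc_score` is the practical choice.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

The double loop is quadratic in the number of rows, so it is only meant to make the definition concrete on small inputs.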

## Dataset

In [1]:
import pandas as pd
import numpy as np

from copy import deepcopy

from sklearn.model_selection import train_test_split

### Preview

In [2]:
train_data = pd.read_csv('../Data/train_set.csv')

train_data.head()

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,1,43,management,married,tertiary,no,291,yes,no,unknown,9,may,150,2,-1,0,unknown,0
1,2,42,technician,divorced,primary,no,5076,yes,no,cellular,7,apr,99,1,251,2,other,0
2,3,47,admin.,married,secondary,no,104,yes,yes,cellular,14,jul,77,2,-1,0,unknown,0
3,4,28,management,single,secondary,no,-994,yes,yes,cellular,18,jul,174,2,-1,0,unknown,0
4,5,42,technician,divorced,secondary,no,2974,yes,no,unknown,21,may,187,5,-1,0,unknown,0


In [3]:
train_data[train_data['pdays']==-1].shape

(20674, 18)

In [4]:
train_data[train_data['poutcome']=='unknown'].shape

(20677, 18)

In [5]:
# Rows whose poutcome is 'unknown' even though pdays records an earlier contact
train_data[(train_data['poutcome'] == 'unknown') & (train_data['pdays'] != -1)]

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
9932,9933,37,management,married,secondary,no,209,no,no,cellular,14,oct,183,3,528,7,unknown,0
23674,23675,61,retired,married,tertiary,no,3140,yes,yes,cellular,6,aug,975,4,98,1,unknown,1
24601,24602,26,admin.,single,secondary,no,338,no,no,cellular,29,oct,209,1,188,2,unknown,1


In [6]:
test_data = pd.read_csv('../Data/test_set.csv')

test_data.head()

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,25318,51,housemaid,married,unknown,no,174,no,no,telephone,29,jul,308,3,-1,0,unknown
1,25319,32,management,married,tertiary,no,6059,yes,no,cellular,20,nov,110,2,-1,0,unknown
2,25320,60,retired,married,primary,no,0,no,no,telephone,30,jul,130,3,-1,0,unknown
3,25321,32,student,single,tertiary,no,64,no,no,cellular,30,jun,598,4,105,5,failure
4,25322,41,housemaid,married,secondary,no,0,yes,yes,cellular,15,jul,368,4,-1,0,unknown


In [7]:
train_data.shape, test_data.shape

((25317, 18), (10852, 17))

### Training-set data types

| job | marital | education | default | housing | loan | contact | month | poutcome | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 

In [8]:
train_data.columns

Index(['ID', 'age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome', 'y'],
      dtype='object')

In [9]:
lst_strFeaNames = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

In [10]:
list(train_data['job'].value_counts())

[5456, 5296, 4241, 2909, 2342, 1273, 884, 856, 701, 663, 533, 163]

In [11]:
df_strFeaTypes = pd.DataFrame(columns=lst_strFeaNames, index=['Count', 'Types', 'Num'])

for n in lst_strFeaNames:
    ser = train_data[n].value_counts()
    df_strFeaTypes[n]['Count'] = len(ser)
    df_strFeaTypes[n]['Types'] = list(ser.index)
    df_strFeaTypes[n]['Num'] = list(ser)

df_strFeaTypes

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,poutcome
Count,12,3,4,2,2,2,3,12,4
Types,"[blue-collar, management, technician, admin., ...","[married, single, divorced]","[secondary, tertiary, primary, unknown]","[no, yes]","[yes, no]","[no, yes]","[cellular, unknown, telephone]","[may, jul, aug, jun, nov, apr, feb, jan, oct, ...","[unknown, failure, other, success]"
Num,"[5456, 5296, 4241, 2909, 2342, 1273, 884, 856,...","[15245, 7157, 2915]","[12957, 7447, 3848, 1065]","[24869, 448]","[14020, 11297]","[21258, 4059]","[16391, 7281, 1645]","[7655, 3937, 3482, 2968, 2243, 1669, 1464, 777...","[20677, 2735, 1070, 835]"
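
The same three-row summary is built several times in this section with chained assignments of the form `df[n]['Count'] = ...`, a pattern that recent pandas versions may warn about or silently ignore. A minimal sketch of one alternative, assuming a helper named `summarize_value_counts` (hypothetical, not part of the notebook) that assembles the summary from one `pd.Series` per column:

```python
import pandas as pd


def summarize_value_counts(df, columns):
    """Per column: number of distinct values, the values, and their counts
    (most frequent first), laid out as rows 'Count', 'Types', 'Num'."""
    summary = {}
    for name in columns:
        counts = df[name].value_counts()
        summary[name] = pd.Series(
            {"Count": len(counts), "Types": list(counts.index), "Num": list(counts)}
        )
    return pd.DataFrame(summary)


# Toy frame (hypothetical data) standing in for train_data.
toy = pd.DataFrame({
    "loan": ["no", "no", "yes"],
    "marital": ["single", "married", "married"],
})
print(summarize_value_counts(toy, ["loan", "marital"]))
```

With the notebook's data the calls would be `summarize_value_counts(train_data, lst_strFeaNames)` and the equivalent for the integer columns and the test set.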


In [12]:
lst_intFeaNames = ['age', 'balance', 'day','duration', 'campaign', 'pdays', 'previous']

In [13]:
df_intFeaNames = pd.DataFrame(columns=lst_intFeaNames, index=['Count', 'Types', 'Num'])

for n in lst_intFeaNames:
    ser = train_data[n].value_counts()
    df_intFeaNames[n]['Count'] = len(ser)
    df_intFeaNames[n]['Types'] = list(ser.index)
    df_intFeaNames[n]['Num'] = list(ser)

df_intFeaNames

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
Count,75,5736,31,1388,43,493,36
Types,"[32, 31, 33, 35, 34, 36, 30, 37, 39, 38, 40, 4...","[0, 1, 4, 2, 3, 5, 6, 8, 23, 7, 47, 10, 94, 25...","[20, 18, 21, 17, 5, 6, 28, 7, 8, 14, 19, 15, 2...","[124, 90, 111, 96, 82, 89, 106, 73, 72, 81, 15...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16...","[-1, 182, 92, 183, 91, 181, 370, 95, 94, 184, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 10, 12, 13,..."
Num,"[1209, 1134, 1110, 1039, 1025, 1020, 991, 946,...","[1936, 121, 82, 81, 72, 66, 52, 45, 42, 40, 37...","[1552, 1331, 1121, 1086, 1081, 1056, 1037, 103...","[119, 110, 104, 101, 101, 100, 99, 99, 98, 97,...","[9825, 7009, 3098, 1957, 989, 723, 406, 302, 1...","[20674, 93, 87, 77, 76, 66, 57, 49, 45, 44, 44...","[20674, 1544, 1178, 631, 413, 270, 152, 120, 7..."


In [14]:
test_intFeaNames = pd.DataFrame(columns=lst_intFeaNames, index=['Count', 'Types', 'Num'])

for n in lst_intFeaNames:
    ser = test_data[n].value_counts()
    test_intFeaNames[n]['Count'] = len(ser)
    test_intFeaNames[n]['Types'] = list(ser.index)
    test_intFeaNames[n]['Num'] = list(ser)

test_intFeaNames

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
Count,70,3832,31,1102,39,397,30
Types,"[34, 32, 31, 33, 35, 30, 37, 36, 38, 39, 40, 4...","[0, 2, 1, 3, 4, 5, 8, 6, 23, 17, 14, -1, 86, 1...","[20, 21, 18, 6, 14, 17, 5, 8, 28, 19, 7, 13, 1...","[104, 85, 95, 76, 136, 75, 135, 71, 113, 114, ...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 12, 15...","[-1, 182, 92, 183, 181, 91, 370, 350, 364, 188...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
Num,"[493, 466, 464, 463, 462, 428, 411, 409, 345, ...","[877, 44, 39, 31, 31, 27, 21, 19, 19, 18, 18, ...","[673, 507, 505, 497, 460, 455, 453, 450, 426, ...","[56, 51, 50, 49, 48, 48, 47, 46, 46, 45, 45, 4...","[4207, 2980, 1329, 872, 402, 313, 174, 119, 85...","[8875, 39, 31, 30, 27, 27, 24, 23, 22, 19, 18,...","[8875, 674, 521, 282, 168, 105, 64, 42, 30, 20..."


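### Encoding the string columns

Plan

| job | marital | education | default | housing | loan | contact | month | poutcome | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| differentiated (one-hot) | undifferentiated (0, 1, -1) | differentiated (one-hot) | undifferentiated (1, -1) | undifferentiated (1, -1) | undifferentiated (1, -1) | undifferentiated (0, 1, -1) | differentiated (one-hot) | differentiated (one-hot) | 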

In [15]:
df_trainProcessed = deepcopy(train_data.iloc[:, 1:-1])
df_testProcessed = deepcopy(test_data.iloc[:, 1:])

#### The 5 undifferentiated columns (integer codes)

In [16]:
lst_noDiff = ['marital', 'default', 'housing', 'loan', 'contact']

d_marital = {'married':1, 'divorced':0, 'single':-1}
d_default = {'yes':1, 'no':-1}
d_housing = {'yes':1, 'no':-1}
d_loan    = {'yes':1, 'no':-1}
d_contact = {'unknown':0, 'cellular':-1, 'telephone':1}

In [17]:
# Training set
df_trainProcessed['marital'] = df_trainProcessed['marital'].apply(lambda n:d_marital[n])
df_trainProcessed['default'] = df_trainProcessed['default'].apply(lambda n:d_default[n])
df_trainProcessed['housing'] = df_trainProcessed['housing'].apply(lambda n:d_housing[n])
df_trainProcessed['loan'] = df_trainProcessed['loan'].apply(lambda n:d_loan[n])
df_trainProcessed['contact'] = df_trainProcessed['contact'].apply(lambda n:d_contact[n])

df_trainProcessed.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,43,management,1,tertiary,-1,291,1,-1,0,9,may,150,2,-1,0,unknown
1,42,technician,0,primary,-1,5076,1,-1,-1,7,apr,99,1,251,2,other
2,47,admin.,1,secondary,-1,104,1,1,-1,14,jul,77,2,-1,0,unknown
3,28,management,-1,secondary,-1,-994,1,1,-1,18,jul,174,2,-1,0,unknown
4,42,technician,0,secondary,-1,2974,1,-1,0,21,may,187,5,-1,0,unknown


In [18]:
# Test set
df_testProcessed['marital'] = df_testProcessed['marital'].apply(lambda n:d_marital[n])
df_testProcessed['default'] = df_testProcessed['default'].apply(lambda n:d_default[n])
df_testProcessed['housing'] = df_testProcessed['housing'].apply(lambda n:d_housing[n])
df_testProcessed['loan'] = df_testProcessed['loan'].apply(lambda n:d_loan[n])
df_testProcessed['contact'] = df_testProcessed['contact'].apply(lambda n:d_contact[n])

df_testProcessed.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,51,housemaid,1,unknown,-1,174,-1,-1,1,29,jul,308,3,-1,0,unknown
1,32,management,1,tertiary,-1,6059,1,-1,-1,20,nov,110,2,-1,0,unknown
2,60,retired,1,primary,-1,0,-1,-1,1,30,jul,130,3,-1,0,unknown
3,32,student,-1,tertiary,-1,64,-1,-1,-1,30,jun,598,4,105,5,failure
4,41,housemaid,1,secondary,-1,0,1,1,-1,15,jul,368,4,-1,0,unknown
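
The `apply(lambda n: d[n])` calls above raise a bare `KeyError` as soon as a column contains a category missing from its dictionary. As a sketch of a slightly more defensive variant, the hypothetical helper `encode_with_dict` below uses `Series.map` and reports every unmapped category at once; the toy series is invented for illustration.

```python
import pandas as pd

d_contact = {'unknown': 0, 'cellular': -1, 'telephone': 1}


def encode_with_dict(series, mapping):
    """Map categories to integer codes; fail loudly on categories absent from the mapping."""
    encoded = series.map(mapping)
    unseen = series[encoded.isna()].unique()
    if len(unseen) > 0:
        raise KeyError(f"unmapped categories: {list(unseen)}")
    return encoded.astype(int)


toy_contact = pd.Series(['unknown', 'cellular', 'telephone', 'cellular'])
print(encode_with_dict(toy_contact, d_contact).tolist())  # [0, -1, 1, -1]
```

The same helper could be applied to the training and test frames in a loop over `lst_noDiff`, given a dictionary of the five mappings.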


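#### The 4 differentiated columns (one-hot)
`Plan`: split each column into one indicator column per category, named with a column prefix, `_`, and the first 3 letters of the category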

|  -   | job | education | month | poutcome |
| --- | --- | --- | --- | --- |
| Operation | job_3b | education_3b | month_3b | poutcome_3b |

In [19]:
lst_diff = ['job', 'education', 'month','poutcome']

for n in lst_diff:
    print(df_strFeaTypes[n]['Types'])

['blue-collar', 'management', 'technician', 'admin.', 'services', 'retired', 'self-employed', 'entrepreneur', 'unemployed', 'housemaid', 'student', 'unknown']
['secondary', 'tertiary', 'primary', 'unknown']
['may', 'jul', 'aug', 'jun', 'nov', 'apr', 'feb', 'jan', 'oct', 'sep', 'mar', 'dec']
['unknown', 'failure', 'other', 'success']


In [20]:
df_trainProcessed.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,43,management,1,tertiary,-1,291,1,-1,0,9,may,150,2,-1,0,unknown
1,42,technician,0,primary,-1,5076,1,-1,-1,7,apr,99,1,251,2,other
2,47,admin.,1,secondary,-1,104,1,1,-1,14,jul,77,2,-1,0,unknown
3,28,management,-1,secondary,-1,-994,1,1,-1,18,jul,174,2,-1,0,unknown
4,42,technician,0,secondary,-1,2974,1,-1,0,21,may,187,5,-1,0,unknown


In [21]:
"""columns names"""
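# NOTE: 'Types' is in value_counts() (frequency) order, while pd.get_dummies
# emits its columns in alphabetical category order. Assigning these names by
# position in the next cells therefore mislabels the indicator columns: in the
# output below, row 0 (month 'may', poutcome 'unknown') shows a 1 under
# 'mon_oct' and 'pou_suc'.

# job: 12 columns
job_col = ['job_%.3s'%n for n in df_strFeaTypes['job']['Types']]

# education: 4 columns
edu_col = ['edu_%.3s'%n for n in df_strFeaTypes['education']['Types']]

# month: 12 columns
mon_col = ['mon_%.3s'%n for n in df_strFeaTypes['month']['Types']]

# poutcome: 4 columns
pou_col = ['pou_%.3s'%n for n in df_strFeaTypes['poutcome']['Types']]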

In [22]:
"""train columns values"""
# job
job_val = pd.get_dummies(df_trainProcessed['job'])
job_val.columns = job_col
df_trainProcessed[job_col] = job_val

# education
edu_val = pd.get_dummies(df_trainProcessed['education'])
edu_val.columns = edu_col
df_trainProcessed[edu_col] = edu_val

# month
mon_val = pd.get_dummies(df_trainProcessed['month'])
mon_val.columns = mon_col
df_trainProcessed[mon_col] = mon_val

# poutcome
pou_val = pd.get_dummies(df_trainProcessed['poutcome'])
pou_val.columns = pou_col
df_trainProcessed[pou_col] = pou_val

In [23]:
df_trainProcessed.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,43,management,1,tertiary,-1,291,1,-1,0,9,...,0,0,1,0,0,0,0,0,0,1
1,42,technician,0,primary,-1,5076,1,-1,-1,7,...,0,0,0,0,0,0,0,1,0,0
2,47,admin.,1,secondary,-1,104,1,1,-1,14,...,0,0,0,0,0,0,0,0,0,1
3,28,management,-1,secondary,-1,-994,1,1,-1,18,...,0,0,0,0,0,0,0,0,0,1
4,42,technician,0,secondary,-1,2974,1,-1,0,21,...,0,0,1,0,0,0,0,0,0,1


In [24]:
"""test columns values"""
# job
job_val = pd.get_dummies(df_testProcessed['job'])
job_val.columns = job_col
df_testProcessed[job_col] = job_val

# education
edu_val = pd.get_dummies(df_testProcessed['education'])
edu_val.columns = edu_col
df_testProcessed[edu_col] = edu_val

# month
mon_val = pd.get_dummies(df_testProcessed['month'])
mon_val.columns = mon_col
df_testProcessed[mon_col] = mon_val

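# poutcome
# Encode the test set's own poutcome column. The run recorded in this notebook
# used df_trainProcessed here by mistake, so the pou_* values in the outputs
# below come from the first 10852 training rows.
pou_val = pd.get_dummies(df_testProcessed['poutcome'])
pou_val.columns = pou_col
df_testProcessed[pou_col] = pou_val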

In [25]:
df_testProcessed.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,51,housemaid,1,unknown,-1,174,-1,-1,1,29,...,0,0,0,0,0,0,0,0,0,1
1,32,management,1,tertiary,-1,6059,1,-1,-1,20,...,0,0,0,1,0,0,0,1,0,0
2,60,retired,1,primary,-1,0,-1,-1,1,30,...,0,0,0,0,0,0,0,0,0,1
3,32,student,-1,tertiary,-1,64,-1,-1,-1,30,...,1,0,0,0,0,0,0,0,0,1
4,41,housemaid,1,secondary,-1,0,1,1,-1,15,...,0,0,0,0,0,0,0,0,0,1
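
`pd.get_dummies` orders its output columns by category (alphabetically for string data), whereas the column-name lists above follow `value_counts()` order, so renaming by position can attach a name to the wrong category: row 0 of the training output has month `may` but a 1 under `mon_oct`. The sketch below shows one way to derive each name from the category it actually encodes and to give the test set exactly the training columns. The helper `one_hot_aligned` and the toy series are hypothetical, and the sketch assumes the 3-letter abbreviations are unique within a column.

```python
import pandas as pd


def one_hot_aligned(train_col, test_col, prefix):
    """One-hot encode a train/test column pair with identical, correctly labelled columns."""
    train_dummies = pd.get_dummies(train_col, dtype=int)
    # Name each column after the category it encodes (prefix + first 3 letters).
    names = {cat: f"{prefix}_{str(cat)[:3]}" for cat in train_dummies.columns}
    train_dummies = train_dummies.rename(columns=names)

    test_dummies = pd.get_dummies(test_col, dtype=int).rename(columns=names)
    # Categories missing from the test set become all-zero columns;
    # categories never seen in training are dropped.
    test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
    return train_dummies, test_dummies


# Toy data (hypothetical): 'apr' never occurs in the test column.
toy_train = pd.Series(['may', 'apr', 'jul', 'may'])
toy_test = pd.Series(['jul', 'may'])
train_oh, test_oh = one_hot_aligned(toy_train, toy_test, 'mon')
print(train_oh)
print(test_oh)
```

Because the test frame is reindexed to the training columns, the two encoded frames always have the same width, even when a rare category is absent from one of them.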


#### Dropping the 4 original categorical columns

In [26]:
df_trainProcessed = df_trainProcessed.drop(columns=lst_diff)

df_trainProcessed.head()

Unnamed: 0,age,marital,default,balance,housing,loan,contact,day,duration,campaign,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,43,1,-1,291,1,-1,0,9,150,2,...,0,0,1,0,0,0,0,0,0,1
1,42,0,-1,5076,1,-1,-1,7,99,1,...,0,0,0,0,0,0,0,1,0,0
2,47,1,-1,104,1,1,-1,14,77,2,...,0,0,0,0,0,0,0,0,0,1
3,28,-1,-1,-994,1,1,-1,18,174,2,...,0,0,0,0,0,0,0,0,0,1
4,42,0,-1,2974,1,-1,0,21,187,5,...,0,0,1,0,0,0,0,0,0,1


In [27]:
df_testProcessed = df_testProcessed.drop(columns=lst_diff)

df_testProcessed.head()

Unnamed: 0,age,marital,default,balance,housing,loan,contact,day,duration,campaign,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,51,1,-1,174,-1,-1,1,29,308,3,...,0,0,0,0,0,0,0,0,0,1
1,32,1,-1,6059,1,-1,-1,20,110,2,...,0,0,0,1,0,0,0,1,0,0
2,60,1,-1,0,-1,-1,1,30,130,3,...,0,0,0,0,0,0,0,0,0,1
3,32,-1,-1,64,-1,-1,-1,30,598,4,...,1,0,0,0,0,0,0,0,0,1
4,41,1,-1,0,1,1,-1,15,368,4,...,0,0,0,0,0,0,0,0,0,1


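#### Filling the -1 values of `pdays` (v2.0)
- Each -1 is replaced by the mean of a window of 20 values (the row itself and the 19 rows after it, zero-padded at the end of the column), plus 1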

In [28]:
ave_size = 20

In [29]:
lst_fillMinus1 = deepcopy(list(train_data['pdays'])[:30])
print(len(lst_fillMinus1))

lst_fillMinus1.extend([0]*(ave_size-1))
print(len(lst_fillMinus1))

l_coop = lst_fillMinus1
l_ave = []

print(lst_fillMinus1)

30
49
[-1, 251, -1, -1, -1, -1, -1, -1, -1, -1, -1, 239, -1, -1, -1, -1, -1, -1, -1, -1, 182, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [30]:
for idx, v in enumerate(l_coop[:len(l_coop)-(ave_size-1)]):
    l = l_coop[idx:idx+ave_size]
    if v==-1:
        ave = sum(l)/ave_size+1
        l_ave.append(ave)
    else:
        l_ave.append(v)
print(len(l_ave))
print(l_ave)

30
[24.6, 251, 21.15, 21.15, 21.15, 21.15, 21.15, 21.15, 21.15, 21.15, 21.15, 239, 9.25, 9.3, 9.35, 9.4, 9.45, 9.5, 9.55, 9.6, 182, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
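
The windowed fill demonstrated above is repeated verbatim for the training and test columns in the next cells. As a sketch, the same logic can be wrapped in a single function; `fill_minus_one` is a hypothetical name, and the sample list reproduces the 30 `pdays` values printed above. Note that the filled value depends on the order of the rows, since each window looks at the rows that follow.

```python
def fill_minus_one(values, window=20):
    """Replace each -1 by the mean of the `window` values starting at that
    position (zero-padded past the end of the list), plus 1."""
    values = list(values)
    padded = values + [0] * (window - 1)
    filled = []
    for idx, v in enumerate(values):
        if v == -1:
            filled.append(sum(padded[idx:idx + window]) / window + 1)
        else:
            filled.append(v)
    return filled


# The first 30 pdays values of the training set, as printed above.
sample = [-1, 251] + [-1] * 9 + [239] + [-1] * 8 + [182] + [-1] * 9
filled = fill_minus_one(sample)
print(len(filled))
print([round(x, 2) for x in filled[:3]])
```

With this helper the two later cells would reduce to `df_trainProcessed['pdays'] = fill_minus_one(df_trainProcessed['pdays'])` and the same call for `df_testProcessed`.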


In [31]:
lst_fillMinus1 = deepcopy(list(df_trainProcessed['pdays']))
print(len(lst_fillMinus1))

lst_fillMinus1.extend([0]*(ave_size-1))
print(len(lst_fillMinus1))

l_coop = lst_fillMinus1
l_ave = []

for idx, v in enumerate(l_coop[:len(l_coop)-(ave_size-1)]):
    l = l_coop[idx:idx+ave_size]
    if v==-1:
        ave = sum(l)/ave_size+1
        l_ave.append(ave)
    else:
        l_ave.append(v)
print(len(l_ave))

df_trainProcessed['pdays'] = l_ave

df_trainProcessed.head()

25317
25336
25317


Unnamed: 0,age,marital,default,balance,housing,loan,contact,day,duration,campaign,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,43,1,-1,291,1,-1,0,9,150,2,...,0,0,1,0,0,0,0,0,0,1
1,42,0,-1,5076,1,-1,-1,7,99,1,...,0,0,0,0,0,0,0,1,0,0
2,47,1,-1,104,1,1,-1,14,77,2,...,0,0,0,0,0,0,0,0,0,1
3,28,-1,-1,-994,1,1,-1,18,174,2,...,0,0,0,0,0,0,0,0,0,1
4,42,0,-1,2974,1,-1,0,21,187,5,...,0,0,1,0,0,0,0,0,0,1


In [32]:
df_trainProcessed['pdays'].head(10)

0     24.60
1    251.00
2     21.15
3     21.15
4     21.15
5     21.15
6     21.15
7     21.15
8     21.15
9     21.15
Name: pdays, dtype: float64

In [33]:
lst_fillMinus1 = deepcopy(list(df_testProcessed['pdays']))
print(len(lst_fillMinus1))

lst_fillMinus1.extend([0]*(ave_size-1))
print(len(lst_fillMinus1))

l_coop = lst_fillMinus1
l_ave = []

for idx, v in enumerate(l_coop[:len(l_coop)-(ave_size-1)]):
    l = l_coop[idx:idx+ave_size]
    if v==-1:
        ave = sum(l)/ave_size+1
        l_ave.append(ave)
    else:
        l_ave.append(v)
print(len(l_ave))

df_testProcessed['pdays'] = l_ave

df_testProcessed.head()

10852
10871
10852


Unnamed: 0,age,marital,default,balance,housing,loan,contact,day,duration,campaign,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,51,1,-1,174,-1,-1,1,29,308,3,...,0,0,0,0,0,0,0,0,0,1
1,32,1,-1,6059,1,-1,-1,20,110,2,...,0,0,0,1,0,0,0,1,0,0
2,60,1,-1,0,-1,-1,1,30,130,3,...,0,0,0,0,0,0,0,0,0,1
3,32,-1,-1,64,-1,-1,-1,30,598,4,...,1,0,0,0,0,0,0,0,0,1
4,41,1,-1,0,1,1,-1,15,368,4,...,0,0,0,0,0,0,0,0,0,1


In [34]:
df_testProcessed['pdays'].head(10)

0     26.7
1     26.7
2     33.4
3    105.0
4     28.1
5     28.1
6     28.1
7     28.1
8     28.1
9     28.1
Name: pdays, dtype: float64

### Normalization (min-max scaling)

In [35]:
df_trainNor = deepcopy(df_trainProcessed)
df_testNor = deepcopy(df_testProcessed)

In [36]:
for n in df_trainProcessed.columns:
    maximum, minimum = max(df_trainProcessed[n]), min(df_trainProcessed[n])
    diff = maximum - minimum

    # Min-max scaling with offsets: the column minimum maps to 0.01
    # and the maximum stays below 1.01
    df_trainNor[n] = df_trainProcessed[n].apply(lambda v: (v-minimum) / (diff+1.01) + 0.01)
    
df_trainNor.head()

Unnamed: 0,age,marital,default,balance,housing,loan,contact,day,duration,campaign,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,0.330472,0.674452,0.01,0.085445,0.674452,0.01,0.342226,0.267981,0.04864,0.028179,...,0.01,0.01,0.507512,0.01,0.01,0.01,0.01,0.01,0.01,0.507512
1,0.317653,0.342226,0.01,0.128887,0.674452,0.01,0.01,0.203486,0.035502,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.507512,0.01,0.01
2,0.381747,0.674452,0.01,0.083747,0.674452,0.674452,0.01,0.42922,0.029835,0.028179,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.507512
3,0.138189,0.01,0.01,0.073778,0.674452,0.674452,0.01,0.55821,0.054822,0.028179,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.507512
4,0.317653,0.342226,0.01,0.109803,0.674452,0.01,0.342226,0.654953,0.058171,0.082714,...,0.01,0.01,0.507512,0.01,0.01,0.01,0.01,0.01,0.01,0.507512


In [37]:
for n in df_testProcessed.columns:
    # Note: the test set is scaled with its own minimum and maximum
    maximum, minimum = max(df_testProcessed[n]), min(df_testProcessed[n])
    diff = maximum - minimum

    df_testNor[n] = df_testProcessed[n].apply(lambda v: (v-minimum) / (diff+1.01) + 0.01)
    
df_testNor.head()

Unnamed: 0,age,marital,default,balance,housing,loan,contact,day,duration,campaign,...,mon_feb,mon_jan,mon_oct,mon_sep,mon_mar,mon_dec,pou_unk,pou_fai,pou_oth,pou_suc
0,0.438516,0.674452,0.01,0.043147,0.01,0.01,0.674452,0.912935,0.109258,0.044477,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.507512
1,0.191795,0.674452,0.01,0.113366,0.674452,0.01,0.01,0.622706,0.045449,0.027238,...,0.01,0.01,0.01,0.507512,0.01,0.01,0.01,0.507512,0.01,0.01
2,0.555384,0.674452,0.01,0.041071,0.01,0.01,0.674452,0.945182,0.051895,0.044477,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.507512
3,0.191795,0.01,0.01,0.041834,0.01,0.01,0.01,0.945182,0.202716,0.061715,...,0.507512,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.507512
4,0.308663,0.674452,0.01,0.041071,0.674452,0.674452,0.01,0.461467,0.128595,0.061715,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.507512
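
The two cells above scale the training and test sets with their own minima and maxima, so the same raw value can end up with different scaled values in the two sets. One common alternative, sketched here as an assumption rather than as the notebook's method, is to record the statistics on the training data and reuse them for the test data. The helpers `fit_minmax` and `apply_minmax` and the toy frames are hypothetical; the formula is the one used above.

```python
import pandas as pd


def fit_minmax(train_df):
    """Record each column's minimum and range on the training data only."""
    minimum = train_df.min()
    value_range = train_df.max() - minimum
    return minimum, value_range


def apply_minmax(df, minimum, value_range):
    """Apply (v - min) / (range + 1.01) + 0.01 column-wise with the given statistics."""
    return (df - minimum) / (value_range + 1.01) + 0.01


# Toy frames (hypothetical data): age 40 appears in both sets.
toy_train = pd.DataFrame({'age': [20, 40, 60]})
toy_test = pd.DataFrame({'age': [40, 80]})

minimum, value_range = fit_minmax(toy_train)
scaled_train = apply_minmax(toy_train, minimum, value_range)
scaled_test = apply_minmax(toy_test, minimum, value_range)
print(scaled_train['age'].tolist())
print(scaled_test['age'].tolist())
```

Test values outside the training range (80 in the toy data) fall outside the training interval after scaling, which is the usual trade-off of reusing training statistics.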


### Splitting the dataset

In [38]:
X = df_trainNor
y = train_data.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((18987, 44), (6330, 44), (18987,), (6330,))

In [39]:
X_train.iloc[0], y_train.iloc[0]

(age         0.407385
 marital     0.674452
 default     0.010000
 balance     0.085526
 housing     0.010000
 loan        0.010000
 contact     0.010000
 day         0.364724
 duration    0.029577
 campaign    0.028179
 pdays       0.060935
 previous    0.010000
 job_blu     0.507512
 job_man     0.010000
 job_tec     0.010000
 job_adm     0.010000
 job_ser     0.010000
 job_ret     0.010000
 job_sel     0.010000
 job_ent     0.010000
 job_une     0.010000
 job_hou     0.010000
 job_stu     0.010000
 job_unk     0.010000
 edu_sec     0.010000
 edu_ter     0.507512
 edu_pri     0.010000
 edu_unk     0.010000
 mon_may     0.010000
 mon_jul     0.507512
 mon_aug     0.010000
 mon_jun     0.010000
 mon_nov     0.010000
 mon_apr     0.010000
 mon_feb     0.010000
 mon_jan     0.010000
 mon_oct     0.010000
 mon_sep     0.010000
 mon_mar     0.010000
 mon_dec     0.010000
 pou_unk     0.010000
 pou_fai     0.010000
 pou_oth     0.010000
 pou_suc     0.507512
 Name: 20153, dtype: float64, 0)

## MLP-v1.0 (0.9005)

In [40]:
from sklearn.neural_network import MLPClassifier

### Model

In [41]:
%%time
# Timing notes for the hidden-layer settings tried:
# 1. 1000*100: 4min 3s
# 2. 20,10,5: 20.4 s
clf = MLPClassifier(solver='adam', alpha=1e-3, hidden_layer_sizes=(20,10,5), random_state=1)

clf.fit(X_train, y_train)

print('Weight matrix shape per layer:\n', [coef.shape for coef in clf.coefs_])

# clf.score reports accuracy on the held-out split, not the AUC used for scoring
print('Accuracy', clf.score(X_test, y_test))

Weight matrix shape per layer:
 [(44, 20), (20, 10), (10, 5), (5, 1)]
Accuracy 0.9279620853080569
Wall time: 23 s


### Prediction

In [42]:
result = clf.predict_proba(df_testNor)

len(result), df_testNor.shape

(10852, (10852, 44))

In [43]:
result[:10]

array([[9.93065712e-01, 6.93428814e-03],
       [9.98210276e-01, 1.78972386e-03],
       [9.99466563e-01, 5.33436610e-04],
       [5.48466216e-02, 9.45153378e-01],
       [9.98974145e-01, 1.02585543e-03],
       [9.99664197e-01, 3.35803475e-04],
       [9.99973488e-01, 2.65120872e-05],
       [9.99561814e-01, 4.38185742e-04],
       [9.97999352e-01, 2.00064820e-03],
       [9.41933413e-01, 5.80665869e-02]])

`Earlier prediction output`

```
array([[8.35496865e-01, 1.64503135e-01],
       [9.99997081e-01, 2.91907897e-06],
       [1.00000000e+00, 3.40281096e-18],
       [3.25647953e-08, 9.99999967e-01],
       [1.00000000e+00, 2.13275251e-11],
       [1.00000000e+00, 4.03494837e-10],
       [1.00000000e+00, 5.31897176e-13],
       [1.00000000e+00, 1.12170706e-15],
       [1.00000000e+00, 3.04603904e-13],
       [8.74722072e-01, 1.25277928e-01]])
```

In [44]:
res_end = [l[1] for l in result]

res_end[:10]

[0.006934288139459028,
 0.0017897238572075836,
 0.0005334366104188447,
 0.9451533783654511,
 0.0010258554294588435,
 0.0003358034745999318,
 2.6512087169471175e-05,
 0.00043818574157258546,
 0.00200064820021319,
 0.05806658685972797]

In [45]:
df_result = pd.DataFrame(columns=['ID','pred'])
df_result['ID'] = test_data['ID']
df_result['pred'] = res_end

df_result.head()

Unnamed: 0,ID,pred
0,25318,0.006934
1,25319,0.00179
2,25320,0.000533
3,25321,0.945153
4,25322,0.001026


In [46]:
df_result.to_csv('../Result/MLP_0625_9088.csv', sep=',', index=False)

## MLP-v2.0 (0.8556)

In [1]:
from sklearn.neural_network import MLPClassifier

### Model

In [2]:
%%time
# Timing notes for the hidden-layer settings tried:
# 1. 1000*100: 4min 3s
# 2. 20,10,5: 20.4 s
clf = MLPClassifier(solver='adam', alpha=1e-3, hidden_layer_sizes=(20,10,5), random_state=1)

clf.fit(X_train, y_train)

print('Weight matrix shape per layer:\n', [coef.shape for coef in clf.coefs_])

print('Accuracy', clf.score(X_test, y_test))

NameError: name 'X_train' is not defined

### Prediction

In [47]:
result = clf.predict_proba(X_test)

len(result), X_test.shape

(6330, (6330, 44))

In [48]:
res_end = [l[1] for l in result]

res_end[:10]

[0.22367188295932355,
 0.0023576353290809737,
 0.0006074816220319018,
 0.0016757393510683418,
 0.0004291204043540161,
 0.00025306177679327007,
 0.007032061064389484,
 0.00024974346267227557,
 0.049810667875505275,
 0.0052437228921600396]

In [49]:
from sklearn.metrics import roc_auc_score # AUC score

roc_auc_score(y_test, res_end)

0.9428595902257745

In [285]:
result = clf.predict_proba(df_testNor)

len(result), df_testNor.shape

(10852, (10852, 44))

In [286]:
res_end = [l[1] for l in result]

res_end[:10]

[0.006934288139459028,
 0.0017897238572075836,
 0.0005334366104188447,
 0.9451533783654511,
 0.0010258554294588435,
 0.0003358034745999318,
 2.6512087169471175e-05,
 0.00043818574157258546,
 0.00200064820021319,
 0.05806658685972797]

In [287]:
df_result = pd.DataFrame(columns=['ID','pred'])
df_result['ID'] = test_data['ID']
df_result['pred'] = res_end

df_result.head()

Unnamed: 0,ID,pred
0,25318,0.006934
1,25319,0.00179
2,25320,0.000533
3,25321,0.945153
4,25322,0.001026


In [288]:
df_result.to_csv('../Result/MLP_0625_9279.csv', sep=',', index=False)

## SVM.svc

In [80]:
from sklearn import svm

### Model

In [81]:
%%time
# clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr', probability=True)
clf = svm.SVC(probability=True)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))

0.8902053712480252
Wall time: 6min 3s


### Prediction

In [83]:
result = clf.predict_proba(df_testNor)

len(result), df_testNor.shape, result[:3]

(10852, (10852, 44), array([[0.88567247, 0.11432753],
        [0.61403621, 0.38596379],
        [0.9220408 , 0.0779592 ]]))

In [84]:
res_end = [l[1] for l in result]

res_end[:10]

[0.11432753305043525,
 0.3859637885899854,
 0.07795919778824387,
 0.15900977344893985,
 0.11559483684673251,
 0.08279614522042728,
 0.06134560603514815,
 0.15947304077744764,
 0.10964112003075072,
 0.3812859923042722]

In [54]:
df_result = pd.DataFrame(columns=['ID','pred'])
df_result['ID'] = test_data['ID']
df_result['pred'] = res_end

df_result.head()

Unnamed: 0,ID,pred
0,25318,0.114665
1,25319,0.395986
2,25320,0.077431
3,25321,0.160753
4,25322,0.115968
