Task 5: Tune the 5 models with grid search (using 5-fold cross-validation during tuning) and evaluate them. Remember to show the output of the code.
Time: 2 days
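
As a rough illustration of what the task asks for, the sketch below expands a parameter grid into its candidate settings and splits a sample index range into 5 folds, using only the standard library. The helper names `expand_grid` and `kfold_indices` are hypothetical and not part of scikit-learn. The folds here are plain contiguous blocks, whereas `GridSearchCV` with an integer `cv` and a classifier uses stratified folds, so treat this as a simplified picture of the bookkeeping rather than a reimplementation.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination of a dict-of-lists grid."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[n] for n in names))]


def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    # The first (n_samples % n_splits) folds get one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop


# Same shape as the LR grid used in this notebook: 7 values of C x 2 penalties.
grid = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
candidates = expand_grid(grid)

# 23 is an arbitrary illustrative sample count, not the size of the real data.
folds = list(kfold_indices(n_samples=23, n_splits=5))

# Every candidate is fitted once per fold.
n_fits = len(candidates) * len(folds)
print(len(candidates), 'candidates x', len(folds), 'folds =', n_fits, 'fits')
```

The count of fits grows multiplicatively with every parameter added to a grid, which is why the tree-based models below are tuned a few parameters at a time.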

In [2]:
import pickle
with open('feature.pkl','rb') as f:
    X = pickle.load(f, encoding = 'gbk')

In [3]:
with open('label.pkl','rb') as f:
    y = pickle.load(f)

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

### Model evaluation

In [5]:
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
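    # Accuracy (printed label: 准确率; 训练集 = training set, 测试集 = test set)
    print('[准确率]', end = ' ')
    print('训练集：', '%.4f'%accuracy_score(y_train, y_train_pred), end = ' ')
    print('测试集：', '%.4f'%accuracy_score(y_test, y_test_pred))
    
    # AUC, computed from the predicted probabilities with roc_auc_score
    # (sklearn.metrics.auc applied to an ROC curve would give the same value)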
    print('[auc值]', end = ' ')
    print('训练集：', '%.4f'%roc_auc_score(y_train, y_train_proba), end = ' ')
    print('测试集：', '%.4f'%roc_auc_score(y_test, y_test_proba))
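
To make the AUC metric concrete, here is a minimal pure-Python sketch. It relies on the fact that the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as one half. The function name `pairwise_auc` is hypothetical, and the quadratic loop is meant for illustration only. It also shows why the function above passes predicted probabilities rather than hard labels: the metric depends only on how the scores rank the samples.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative sample')

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 1]
print(pairwise_auc(y, [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly -> 0.75
print(pairwise_auc(y, [0.3, 0.3, 0.3, 0.3]))   # identical scores, all ties -> 0.5
```

A model that assigns the same score to every sample therefore gets an AUC of exactly 0.5, whatever its accuracy is.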

### Grid search tuning

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier

In [7]:
import warnings
warnings.filterwarnings('ignore')

### LR

In [8]:
# liblinear supports both the 'l1' and 'l2' penalties searched below
lr = LogisticRegression(solver='liblinear')
param = {'C': [1e-3,0.01,0.1,1,10,100,1e3], 'penalty':['l1', 'l2']}

gsearch = GridSearchCV(lr, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

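print('最佳参数：',gsearch.best_params_)  # best parameters
print('训练集的最佳分数：', gsearch.best_score_)  # best mean 5-fold CV AUC on the training set
print('测试集的最佳分数：', gsearch.score(X_test, y_test))  # AUC of the refit best model on the test set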

最佳参数： {'C': 0.1, 'penalty': 'l1'}
训练集的最佳分数： 0.7975393268016738
测试集的最佳分数： 0.7786600790513833


In [9]:
lr = LogisticRegression(C = 0.1, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)

[准确率] 训练集： 0.8067 测试集： 0.7958
[auc值] 训练集： 0.8213 测试集： 0.7793


### SVM

In [10]:
# Polynomial-kernel SVM; this cell raises the ValueError shown below
svm_poly = svm.SVC(kernel = 'poly', probability=True)
param = {'C':[0.01,0.1,1]}
gsearch = GridSearchCV(svm_poly, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('最佳参数：',gsearch.best_params_)
print('训练集的最佳分数：', gsearch.best_score_)
print('测试集的最佳分数：', gsearch.score(X_test, y_test))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [11]:
# Sigmoid-kernel SVM (takes a long time to fit)
svm_sigmoid = svm.SVC(kernel = 'sigmoid',probability=True)
param = {'C':[0.01,0.1,1]}
gsearch = GridSearchCV(svm_sigmoid, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('最佳参数：',gsearch.best_params_)
print('训练集的最佳分数：', gsearch.best_score_)
print('测试集的最佳分数：', gsearch.score(X_test, y_test))

最佳参数： {'C': 0.01}
训练集的最佳分数： 0.5
测试集的最佳分数： 0.5


In [12]:
svm_sigmoid=svm.SVC(kernel = 'sigmoid',probability = True, C = 0.01)
svm_sigmoid.fit(X_train,y_train)
model_metrics(svm_sigmoid, X_train, X_test, y_train, y_test)

[准确率] 训练集： 0.7378 测试集： 0.7699
[auc值] 训练集： 0.5000 测试集： 0.5000


In [None]:
# Gaussian (RBF) kernel SVM: the run time is too long, so it is not run here


### Decision tree

> 1) First, grid-search the maximum tree depth max_depth together with min_samples_split, the minimum number of samples required to split an internal node.

In [13]:
param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=8,min_samples_split=300,min_samples_leaf=20, max_features='sqrt',random_state =2333),
                       param_grid = param,scoring ='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'max_depth': 7, 'min_samples_split': 100}, 0.7478991290958494)

> 2) Next, tune min_samples_split together with min_samples_leaf, the minimum number of samples required at a leaf node.

In [14]:
param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=11,min_samples_split=100,min_samples_leaf=20, max_features='sqrt', random_state =2333),
                       param_grid = param,scoring ='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'min_samples_leaf': 60, 'min_samples_split': 150}, 0.7539002649214331)

> 3) Then grid-search the maximum number of features, max_features.

In [15]:
param = {'max_features':range(7,20,2)}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=11,min_samples_split=550,min_samples_leaf=80, max_features='sqrt', random_state =2333),
                       param_grid = param,scoring ='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'max_features': 7}, 0.7410852964065583)

In [16]:
dt = DecisionTreeClassifier(max_depth=11,min_samples_split=550,min_samples_leaf=80,max_features=19, random_state =2333)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)

[准确率] 训练集： 0.7593 测试集： 0.7690
[auc值] 训练集： 0.7731 测试集： 0.7130


### RF

In [18]:
param_test = {'n_estimators':range(20,200,20)}
gsearch = GridSearchCV(estimator = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50, 
                                                          min_samples_leaf=20, max_features = 9,random_state=2018), 
                       param_grid = param_test, scoring='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
gsearch.best_params_, gsearch.best_score_

({'n_estimators': 160}, 0.8112761622319916)

In [20]:
rf = RandomForestClassifier(n_estimators=160, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features = 9,oob_score=True, random_state=2018)
rf.fit(X_train, y_train)
print('袋外分数：', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)

袋外分数： 0.7941176470588235
[准确率] 训练集： 0.8318 测试集： 0.7900
[auc值] 训练集： 0.9055 测试集： 0.7649


### xgboost

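1. max_depth = 5: this parameter is best kept between 3 and 10. We start at 5, although other values also work; any starting value between 4 and 6 is a reasonable choice.

2. min_child_weight = 1: a small value is chosen here because this is a highly imbalanced classification problem, so some leaf nodes may contain only small groups of samples.

3. gamma = 0: another small starting value, between 0.1 and 0.2, would also work. This parameter is tuned later as well.

4. subsample, colsample_bytree = 0.8: this is the most common starting value. Typical values range from 0.5 to 0.9.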
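
Starting from initial values like these, the steps that follow tune one group of parameters at a time while holding the others fixed. The sketch below shows that stage-wise pattern in plain Python. The function `toy_cv_score` is a hypothetical stand-in for a cross-validated AUC with an arbitrary made-up optimum, and `tune_stage` is likewise an illustrative helper, so none of the numbers it prints are results from this notebook.

```python
from itertools import product


def toy_cv_score(params):
    """Hypothetical stand-in for a cross-validated AUC (made-up optimum)."""
    return (0.80
            - 0.010 * (params['max_depth'] - 4) ** 2
            - 0.002 * (params['min_child_weight'] - 3) ** 2
            - 0.050 * (params['gamma'] - 0.1) ** 2)


def tune_stage(current, stage_grid, score_fn):
    """Grid-search only the parameters in stage_grid, keeping the rest fixed."""
    names = list(stage_grid)
    best_params, best_score = dict(current), score_fn(current)
    for values in product(*(stage_grid[n] for n in names)):
        trial = dict(current, **dict(zip(names, values)))
        score = score_fn(trial)
        if score > best_score:  # strict '>' keeps the earlier setting on ties
            best_params, best_score = trial, score
    return best_params, best_score


# Initial values, then a coarse stage, a finer stage, and a gamma stage.
params = {'max_depth': 5, 'min_child_weight': 1, 'gamma': 0}
stages = [
    {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)},
    {'max_depth': [3, 4, 5], 'min_child_weight': [2, 3, 4]},
    {'gamma': [i / 10 for i in range(0, 6)]},
]
for stage in stages:
    params, score = tune_stage(params, stage, toy_cv_score)
    print(params, round(score, 4))
```

Tuning in stages keeps each grid small, at the cost of possibly missing interactions between parameters that are searched in different stages.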

In [21]:
# Default parameters
xgb0 = XGBClassifier()
xgb0.fit(X_train, y_train)

model_metrics(xgb0, X_train, X_test, y_train, y_test)

[准确率] 训练集： 0.8669 测试集： 0.7975
[auc值] 训练集： 0.9196 测试集： 0.7744


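1) Start with the step size (learning rate) and the number of boosting iterations (n_estimators).

We first fix a fairly small step size and grid-search the best number of iterations. Here the step size is initially set to 0.1 and the number of iterations is grid-searched.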
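
The link between step size and iteration count can be illustrated with a deliberately simplified toy model. The sketch assumes squared-error loss and a base learner that fits the current residual exactly, so each round removes a `learning_rate` share of the remaining error. Real boosted trees on real data do not behave this cleanly, and `rounds_needed` is a hypothetical helper, but the qualitative effect carries over: a smaller step size generally needs more rounds.

```python
def rounds_needed(learning_rate, tolerance=0.01, max_rounds=10_000):
    """Count boosting rounds until the remaining residual drops below tolerance.

    Toy setting: every round fits the current residual exactly and adds the
    fit scaled by learning_rate, so the residual shrinks by that share.
    """
    residual = 1.0  # start from the full (normalised) error
    for n_rounds in range(1, max_rounds + 1):
        residual -= learning_rate * residual  # shrunken update
        if residual < tolerance:
            return n_rounds
    raise RuntimeError('did not reach the tolerance within max_rounds')


for lr in (0.1, 0.01):
    print('learning_rate =', lr, '-> rounds:', rounds_needed(lr))
```

This is why n_estimators is searched again whenever the learning rate changes.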

In [22]:
param_test = {'n_estimators':range(20,200,20)}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=5, 
                                                  min_child_weight=1, gamma=0, subsample=0.8, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'n_estimators': 40}, 0.8111966945436094)

2) Tune max_depth and min_child_weight

In [23]:
param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}

gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=40, max_depth=5, 
                                                  min_child_weight=1, gamma=0, subsample=0.8, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'max_depth': 3, 'min_child_weight': 5}, 0.8132960506868538)

The search selects max_depth = 3 and min_child_weight = 5. We can now search more finely around these values to locate the best setting.

In [26]:
param_test = {'max_depth':[3,4,5], 'min_child_weight':[3,4,5]}

gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=40, max_depth=5, 
                                                  min_child_weight=1, gamma=0, subsample=0.8, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'max_depth': 5, 'min_child_weight': 4}, 0.8135445348536937)

3) Tune gamma

In [28]:
param_test = {'gamma':[i/10 for i in range(0,6)]}

gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=40, max_depth=5, 
                                                  min_child_weight=4, gamma=0, subsample=0.8, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'gamma': 0.0}, 0.8135445348536937)

4) Tune subsample and colsample_bytree

In [30]:
param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}

gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=40, max_depth=5, 
                                                  min_child_weight=4, gamma=0.0, subsample=0.8, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'colsample_bytree': 0.5, 'subsample': 0.7}, 0.8164973017921774)

This shows that the best subsample is 0.7 and the best colsample_bytree is 0.5. We now try values around these in steps of 0.05.
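
One way to build such a refined grid is sketched below. The helper `refine_grid` is hypothetical. It steps symmetrically around a coarse optimum, rounds away floating-point noise, and drops values outside the interval (0, 1], which is assumed here as the valid range for sampling ratios such as subsample and colsample_bytree.

```python
def refine_grid(center, step, n_steps=2, lower=0.0, upper=1.0):
    """Return values center +/- k*step (k = 0..n_steps) that lie in (lower, upper]."""
    values = []
    for i in range(-n_steps, n_steps + 1):
        value = round(center + i * step, 10)  # remove float noise such as 0.6499999
        if lower < value <= upper:
            values.append(value)
    return values


param_test = {
    'subsample': refine_grid(0.7, 0.05),
    'colsample_bytree': refine_grid(0.5, 0.05),
}
print(param_test)
```

With a center of 0.7 and 0.5 this yields the same candidate values as the `range`-based list comprehensions in the next cell.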

In [31]:
param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(40,61,5)]}

gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=40, max_depth=5, 
                                                  min_child_weight=4, gamma=0.0, subsample=0.8, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'colsample_bytree': 0.5, 'subsample': 0.65}, 0.8186885961030738)

5) Tune the regularization parameters

> #'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]

In [33]:
param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=40, max_depth=5, 
                                                  min_child_weight=4, gamma=0.0, subsample=0.65, 
                                                  colsample_bytree=0.5, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'reg_alpha': 1e-05}, 0.8195231485701571)

6) Return to step 1): lower the learning rate and re-tune the number of iterations

In [34]:
param_test = {'n_estimators':range(20,200,20)}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.01, n_estimators=40, max_depth=5, 
                                                  min_child_weight=4, gamma=0.0, subsample=0.65, reg_alpha=1e-05,
                                                  colsample_bytree=0.5, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27), 
                        param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'n_estimators': 180}, 0.814528324593074)

We keep the model with the smaller learning rate. The final result is:

In [42]:
xgb = XGBClassifier(learning_rate =0.01, n_estimators=180, max_depth=5, 
                                                  min_child_weight=4, gamma=0.0, subsample=0.65, reg_alpha=1e-05,
                                                  colsample_bytree=0.5, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, seed=27)
xgb.fit(X_train, y_train)

model_metrics(xgb, X_train, X_test, y_train, y_test)

[准确率] 训练集： 0.8522 测试集： 0.7950
[auc值] 训练集： 0.9086 测试集： 0.7653
