Source blog: [机器学习超详细实践攻略: 使用scikit-learn构建模型的通用模板](https://zhuanlan.zhihu.com/p/88729124)

# Loading the dataset

Assume that at this point you have already finished all of the data cleaning and feature engineering.

In [1]:
from sklearn.datasets import load_iris
data = load_iris()
x = data.data
y = data.target

In [24]:
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y,
                                                 test_size=0.1,
                                                 random_state=0, 
                                                 stratify = y
                                                 )

In [25]:
train_x[:10], train_y[:10]

(array([[5.7, 2.9, 4.2, 1.3],
        [6.7, 3.3, 5.7, 2.5],
        [6.2, 3.4, 5.4, 2.3],
        [4.4, 3. , 1.3, 0.2],
        [6.3, 2.5, 4.9, 1.5],
        [6.9, 3.1, 4.9, 1.5],
        [6.7, 3.1, 4.4, 1.4],
        [6.9, 3.1, 5.4, 2.1],
        [5. , 3.4, 1.5, 0.2],
        [5.2, 4.1, 1.5, 0.1]]), array([1, 2, 2, 0, 1, 1, 1, 2, 0, 0]))
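As a quick sanity check on the split above, the following sketch counts the samples per class in each part. It repeats the same `train_test_split` call so that it runs on its own; the variable names `train_counts` and `test_counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y
)

# stratify=y keeps the class proportions of y in both parts of the split.
train_counts = np.bincount(train_y)
test_counts = np.bincount(test_y)
print("train:", train_counts)
print("test: ", test_counts)
```

Because iris has 50 samples per class, a 10% stratified test split should leave 45 samples per class for training and 5 per class for testing.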

# Universal template V1.0: quickly build a basic model

![](https://pic2.zhimg.com/v2-efed6fb2d82738cd80f1554f9b619d4d_r.jpg)

In [26]:
### SVM classifier

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm_model = SVC()

# fit the model on the training data
svm_model.fit(train_x,train_y)

pred1 = svm_model.predict(train_x)
accuracy1 = accuracy_score(train_y,pred1)
print('Accuracy on the training set: %.4f'%accuracy1)

pred2 = svm_model.predict(test_x)
accuracy2 = accuracy_score(test_y,pred2)
print('Accuracy on the test set: %.4f'%accuracy2)

Accuracy on the training set: 0.9481
Accuracy on the test set: 1.0000


In [27]:
### LogisticRegression classifier

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score  # accuracy is used as the scoring metric

lr_model = LogisticRegression()

lr_model.fit(train_x,train_y)

pred1 = lr_model.predict(train_x)
accuracy1 = accuracy_score(train_y,pred1)
print('Accuracy on the training set: %.4f'%accuracy1)

pred2 = lr_model.predict(test_x)
accuracy2 = accuracy_score(test_y,pred2)
print('Accuracy on the test set: %.4f'%accuracy2)

Accuracy on the training set: 0.9704
Accuracy on the test set: 1.0000


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
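The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both together is shown below, assuming a `StandardScaler` in front of the classifier and an arbitrary `max_iter=1000`; neither choice comes from the original post, and the scores it produces need not match the ones printed above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y
)

# Scale the features, then fit logistic regression with a higher iteration cap.
# max_iter=1000 is an assumed value, not taken from the original post.
lr_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipeline.fit(train_x, train_y)

# n_iter_ reports how many solver iterations were actually used.
n_iter = lr_pipeline.named_steps["logisticregression"].n_iter_
test_accuracy = accuracy_score(test_y, lr_pipeline.predict(test_x))
print("solver iterations:", n_iter)
print("Accuracy on the test set: %.4f" % test_accuracy)
```

Wrapping the scaler and the model in one pipeline also means the scaler is fitted only on the data passed to `fit`, which matters once cross-validation is introduced.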


# Universal template V2.0: add cross-validation for more rigorous model evaluation

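With the V1.0 template, running the same program several times shows that the accuracy is not the same on every run but fluctuates within some range. This is because the data are randomly selected before being fed to the model, so the samples and their order differ on each run (unless the split is pinned with a fixed `random_state`, as in the split above). Even with identical code, the final model can therefore come out better or worse.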
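To see the fluctuation described above, the sketch below repeats the split with several different `random_state` values and records the test accuracy each time. The seed range `range(5)` and the names `test_accuracies`, `tr_x`, `te_x` are illustrative assumptions; the scores may or may not differ between seeds on a dataset as easy as iris.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)

test_accuracies = []
for seed in range(5):  # hypothetical seeds, one split per seed
    tr_x, te_x, tr_y, te_y = train_test_split(
        x, y, test_size=0.1, random_state=seed, stratify=y
    )
    model = SVC().fit(tr_x, tr_y)
    test_accuracies.append(accuracy_score(te_y, model.predict(te_x)))

# One accuracy per split; any spread here comes only from how the data were split.
print(test_accuracies)
```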

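Worse still, in some cases tuning the parameters brings the model to its best performance on the training set, yet the model turns out to be overfitted and performs poorly on the test set. The score obtained on the training set then no longer reflects the model's generalization performance.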

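To address both problems, we should carve a validation set out of the training set and combine it with cross-validation. First, hold out part of the training set as a validation set that takes no part in training and is used only to evaluate the model once training is done; then run the final evaluation on the test set.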
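The hold-out scheme just described can be sketched as a second `train_test_split` applied to the training set. The 20% validation fraction and the names `fit_x`, `val_x`, `fit_y`, `val_y` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y
)

# Carve a validation set out of the training set (assumed fraction: 20%).
fit_x, val_x, fit_y, val_y = train_test_split(
    train_x, train_y, test_size=0.2, random_state=0, stratify=train_y
)

model = SVC().fit(fit_x, fit_y)  # train only on the reduced training part
val_accuracy = accuracy_score(val_y, model.predict(val_x))    # used while developing
test_accuracy = accuracy_score(test_y, model.predict(test_x))  # final check only
print("validation: %.4f, test: %.4f" % (val_accuracy, test_accuracy))
```

Note how the model now sees only 108 of the 150 samples during training, which is the loss of data that motivates cross-validation.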

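However, this greatly reduces the number of samples available for learning, so we also train several times using cross-validation. The most common variant, k-fold cross-validation, is shown in the figure below. It splits the training set into k smaller subsets, trains the model on k-1 of them, and uses the remaining subset as the validation set. This takes k rounds of training, and the final score on the training set is the mean of the k validation scores.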

![](https://pic4.zhimg.com/80/v2-46ba9e927a578e55ae83c05b464d1cdb_720w.jpg)

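This way every sample in the training set takes part in training, and the repeated evaluations yield a more representative score. The only drawback is the cost: the amount of computation grows roughly k-fold.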
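The k-fold procedure described above can be written out by hand. The sketch below is a simplified illustration, not the way scikit-learn implements it: it shuffles the indices with NumPy and uses `np.array_split` to form k nearly equal folds, and for brevity it runs on the full iris data rather than on `train_x`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)

k = 5
rng = np.random.default_rng(0)       # fixed seed so the folds are reproducible
indices = rng.permutation(len(x))    # shuffle sample indices
folds = np.array_split(indices, k)   # k index groups of (nearly) equal size

scores = []
for i in range(k):
    val_idx = folds[i]  # fold i is the validation set this round
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = SVC().fit(x[train_idx], y[train_idx])  # fresh model on the other k-1 folds
    scores.append(accuracy_score(y[val_idx], model.predict(x[val_idx])))

# The reported score is the mean over the k validation folds.
print("fold scores:", scores)
print("mean accuracy: %.4f" % np.mean(scores))
```

Each sample is used for validation exactly once and for training k-1 times, which is why the cost is about k fits.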

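That is the principle, but the ideal is rosy and reality is harsh. When you implement it yourself, one big problem stands in the way: how do you split the training set evenly into k parts?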

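There is no need to think too hard about this. Remember that we are standing on the shoulders of giants: scikit-learn has already combined the even-splitting methods worked out by mathematicians with the ingenuity of programmers in the `cross_val_score()` function. Just call it; there is no need to design a splitting algorithm yourself or to write a for loop that trains the model repeatedly.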
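To make concrete what `cross_val_score()` saves you from writing, the sketch below compares it with an explicit loop. It relies on the documented behaviour that an integer `cv` with a classifier uses `StratifiedKFold` without shuffling; the names `library_scores` and `manual_scores` are illustrative, and the full iris data is used for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
model = SVC()

# One call: split, fit a clone per fold, score each fold.
library_scores = cross_val_score(model, x, y, cv=5, scoring="accuracy")

# The same thing written as an explicit loop.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(x, y):
    fold_model = clone(model).fit(x[train_idx], y[train_idx])
    # For classifiers, .score() returns accuracy.
    manual_scores.append(fold_model.score(x[val_idx], y[val_idx]))

print(library_scores)
print(np.array(manual_scores))
```

Both approaches should give the same five scores, since the splits and the model are deterministic here.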

Universal template V2.0 looks like this:

![](https://vdn1.vzuu.com/SD/d100bfe0-ec2d-11ea-acfd-5ab503a75443.mp4?disable_local_cache=1&auth_key=1619058817-0-0-4073046231bc83ed8116253c20b85d50&f=mp4&bu=pico&expiration=1619058817&v=hw)

**With cross-validation, metrics such as accuracy are reported as averages over the folds.**

In [41]:
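### SVM classifier

from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(train_x,train_y)   # not required by cross_val_score, which refits a clone on every fold

# random_state only takes effect with shuffle=True (recent scikit-learn versions raise an error otherwise)
# kfold is not used below; pass cv=kfold instead of cv=5 to use this splitter
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores1 = cross_val_score(svm_model,train_x,train_y,cv=5, scoring='accuracy')     # metrics other than accuracy can also be chosen

# print the mean accuracy and a rough interval (+/- 2 standard deviations)
print("Mean accuracy on the training set: %0.2f (+/- %0.2f)" % (scores1.mean(), scores1.std() * 2))

# note: this runs a separate 5-fold CV inside the 15-sample test set; it does not evaluate the model fitted on train_x
scores2 = cross_val_score(svm_model,test_x,test_y,cv=5, scoring='accuracy')

# print the mean accuracy and a rough interval (+/- 2 standard deviations)
print("Mean accuracy on the test set: %0.2f (+/- %0.2f)" % (scores2.mean(), scores2.std() * 2))

print(scores1)
print(scores2)

Mean accuracy on the training set: 0.95 (+/- 0.11)
Mean accuracy on the test set: 1.00 (+/- 0.00)
[1.         0.92592593 0.85185185 0.96296296 1.        ]
[1. 1. 1. 1. 1.]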
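The cell above notes that metrics other than accuracy can be used. One way to collect several at once is `cross_validate`, sketched below; the particular metric list is an assumption chosen for illustration, using scikit-learn's built-in scorer names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y
)

# Macro-averaged variants are used because iris has three classes.
metrics = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
results = cross_validate(SVC(), train_x, train_y, cv=5, scoring=metrics)

# Each metric comes back as an array with one value per fold under "test_<name>".
for name in metrics:
    fold_scores = results["test_" + name]
    print("%s: %0.2f (+/- %0.2f)" % (name, fold_scores.mean(), fold_scores.std() * 2))
```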




In [42]:
### LogisticRegression classifier

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression


lr_model = LogisticRegression()
lr_model.fit(train_x,train_y)   # not required by cross_val_score, which refits a clone on every fold

scores1 = cross_val_score(lr_model,train_x,train_y,cv=5, scoring='accuracy')
# print the mean accuracy and a rough interval (+/- 2 standard deviations)
print("Mean accuracy on the training set: %0.2f (+/- %0.2f)" % (scores1.mean(), scores1.std() * 2))

# note: this runs a separate 5-fold CV inside the 15-sample test set; it does not evaluate the model fitted on train_x
scores2 = cross_val_score(lr_model,test_x,test_y,cv=5, scoring='accuracy')
# print the mean accuracy and a rough interval (+/- 2 standard deviations)
print("Mean accuracy on the test set: %0.2f (+/- %0.2f)" % (scores2.mean(), scores2.std() * 2))


print(scores1)
print(scores2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Mean accuracy on the training set: 0.96 (+/- 0.07)
Mean accuracy on the test set: 1.00 (+/- 0.00)
[1.         0.92592593 0.92592593 0.92592593 1.        ]
[1. 1. 1. 1. 1.]


# Universal template V3.0: tune hyperparameters to push performance further

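Everything above trained the models with each algorithm's default parameters, but different datasets inevitably call for different settings. Designing a new algorithm ourselves is out of reach, so what is left is tuning parameters; tuning is the last shred of dignity for algorithm engineers everywhere. Besides, if you work on algorithms without tuning, wouldn't that disgrace the algorithm engineer's famous nickname, the "alchemy engineer"?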

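scikit-learn exposes a different set of tunable parameters for each algorithm. Covering them in detail would fill several more articles. Since the goal here is a universal template for building models, this section introduces only one general, automated tuning method. The meaning of each algorithm's parameters and techniques for manual tuning will be covered with concrete examples in later articles.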

![](https://pic4.zhimg.com/80/v2-7b53a61f956bc71328d5e482deb08f77_720w.jpg)

In [7]:
SVC().get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [48]:
### 1. SVM classifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

svm_model = SVC()

# each dict is one sub-grid; every combination inside a dict is tried
# note: 100 appears twice in the C lists (1000 may have been intended); the output below reflects the lists as written
params = [
        {'kernel': ['linear'], 'C': [1, 10, 100, 100]},
        {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
        {'kernel': ['rbf'], 'C': [1, 10, 100, 100], 'gamma':[1, 0.1, 0.01, 0.001]}
        ]

best_model = GridSearchCV(svm_model,           # estimator to tune
                          param_grid=params,   # parameter grid
                          cv = 5,              # 5-fold cross-validation
                          scoring = 'accuracy',
                          verbose = 1,
                          n_jobs = 2
                          )

best_model.fit(train_x,train_y)

Fitting 5 folds for each of 22 candidates, totalling 110 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 110 out of 110 | elapsed:    0.2s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=2,
             param_grid=[{'C': [1, 10, 100, 100], 'kernel': ['linear']},
                         {'C': [1], 'degree': [2, 3], 'kernel': ['poly']},
                         {'C': [1, 10, 100, 100],
                          'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=1)
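The log line "Fitting 5 folds for each of 22 candidates, totalling 110 fits" follows directly from the grid. A small sketch with `ParameterGrid` reproduces the count; the names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, including the repeated value 100.
params = [
    {'kernel': ['linear'], 'C': [1, 10, 100, 100]},
    {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
    {'kernel': ['rbf'], 'C': [1, 10, 100, 100], 'gamma': [1, 0.1, 0.01, 0.001]},
]

# Candidates per sub-grid: 4 (linear) + 1*2 (poly) + 4*4 (rbf).
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 5  # one fit per candidate per fold, cv=5
print(n_candidates, n_fits)  # 22 110
```

Duplicate values in a list are not merged, so the repeated `100` still counts as a separate candidate and costs extra fits.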

In [49]:
best_model.best_score_

0.962962962962963

In [50]:
best_model.best_params_ 

{'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}

In [51]:
best_model.best_estimator_ 

SVC(C=100, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
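The search above stops at the cross-validated score. A possible last step, sketched below, is to score the tuned model once on the held-out test set. It relies on `refit=True` (the default shown in the output above), which retrains the best parameter combination on the whole training set so that the search object can predict directly; the name `search` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y
)

params = [
    {'kernel': ['linear'], 'C': [1, 10, 100, 100]},
    {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
    {'kernel': ['rbf'], 'C': [1, 10, 100, 100], 'gamma': [1, 0.1, 0.01, 0.001]},
]

search = GridSearchCV(SVC(), param_grid=params, cv=5, scoring='accuracy')
search.fit(train_x, train_y)  # the test set is never seen during the search

# With refit=True, predict() delegates to best_estimator_.
test_accuracy = accuracy_score(test_y, search.predict(test_x))
print("best params:", search.best_params_)
print("cross-validated accuracy: %.4f" % search.best_score_)
print("Accuracy on the test set: %.4f" % test_accuracy)
```

Keeping the test set out of the search means the final number is not influenced by the parameter selection itself.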

In [52]:
best_model.cv_results_

{'mean_fit_time': array([0.00095983, 0.00117593, 0.00299048, 0.00099072, 0.00085754,
        0.00088077, 0.00229282, 0.00092692, 0.00353055, 0.00215845,
        0.00120778, 0.00102868, 0.00105143, 0.00113893, 0.00113878,
        0.00088267, 0.00080271, 0.00167861, 0.00158057, 0.00094194,
        0.00120282, 0.00121121]),
 'mean_score_time': array([0.0015595 , 0.00051808, 0.00058961, 0.00048409, 0.00094042,
        0.00048914, 0.00054021, 0.0005259 , 0.00061035, 0.00084376,
        0.00053945, 0.00070391, 0.00056367, 0.00055709, 0.0005928 ,
        0.00051947, 0.00051746, 0.00055542, 0.0014246 , 0.0008132 ,
        0.00056229, 0.00159359]),
 'mean_test_score': array([0.95555556, 0.94074074, 0.95555556, 0.95555556, 0.95555556,
        0.94814815, 0.94074074, 0.94814815, 0.92592593, 0.91111111,
        0.95555556, 0.94814815, 0.95555556, 0.93333333, 0.94814815,
        0.94814815, 0.95555556, 0.96296296, 0.94814815, 0.94814815,
        0.95555556, 0.96296296]),
 'param_C': masked_array(da