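#### Cross Validation
K-fold Cross Validation  
1. Split the whole dataset into K parts (folds).
2. Take each part in turn, without repetition, as the test set; train the model on the other K-1 parts, then compute that model's MSE on the test set.
3. Average the K MSE values to get the final MSE.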

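Purpose: make the model evaluation more reliable, since the score no longer depends on a single train/test split.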
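The three steps above can be sketched by hand with scikit-learn's `KFold`. This is a minimal illustration, not part of the original notebook: the diabetes dataset, the `LinearRegression` model, `k = 5`, and the names `fold_mse` and `cv_mse` are all assumptions, chosen so that MSE is a meaningful metric.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Assumption: a small bundled regression dataset, so that MSE makes sense.
X, y = load_diabetes(return_X_y=True)

# Step 1: split the data into K parts.
k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kfold.split(X):
    # Step 2: one part is the test set, the other K-1 parts train the model.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], y_pred))

# Step 3: average the K per-fold MSE values.
cv_mse = float(np.mean(fold_mse))
print(f"per-fold MSE: {np.round(fold_mse, 1)}")
print(f"cross-validated MSE: {cv_mse:.1f}")
```

Each sample lands in the test set exactly once, so every data point contributes to the final estimate.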

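#### Hyperparameter Search: Grid Search
Many parameters, such as the K value of KNN, are usually set by hand rather than learned from the data; these are called hyperparameters.  
Tuning them by hand is tedious. Grid search iterates over every candidate combination of hyperparameter values, evaluates the model built with each combination (typically by cross validation), and selects the best-performing one.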
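To make the idea concrete, the loop below performs a grid search by hand for a single hyperparameter. It is only a sketch: the candidate list `candidate_k`, the 5-fold setting, and the names `mean_scores` and `best_k` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical grid of values to try for the KNN hyperparameter n_neighbors.
candidate_k = [1, 3, 5, 7, 9]

mean_scores = {}
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    # Evaluate this setting with 5-fold cross validation (accuracy by default).
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Keep the setting with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(f"mean accuracy per k: {mean_scores}")
print(f"best k: {best_k}")
```

With more than one hyperparameter the loop would run over every combination of values, which is the bookkeeping that a grid search utility automates.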
#### API
~~~python
sklearn.model_selection.GridSearchCV
~~~
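Main parameters:
- estimator: the estimator object to tune
- param_grid: a dict mapping parameter names to the values to try, e.g. `{"n_neighbors": [1, 3, 5]}`
- cv: the number of cross-validation folds

Main attributes (available after `fit`):
- best_params_: the best parameter combination
- best_score_: the mean cross-validated score of the best combination
- best_estimator_: the estimator refit on the full training data with the best parameters (requires `refit=True`, the default)
- cv_results_: the detailed cross-validation results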
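Printing `cv_results_` directly produces a long dict of arrays, so it can help to read only a few of its keys. The sketch below assumes a small illustrative grid and fits on the full iris data for brevity; the variable names `search` and `results` are not from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumption: a deliberately small grid so the printed summary stays short.
search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7]},
    cv=5,
)
search.fit(X, y)

# cv_results_ is a dict of arrays with one entry per parameter combination.
results = search.cv_results_
for params, mean, rank in zip(
    results["params"], results["mean_test_score"], results["rank_test_score"]
):
    print(f"{params}: mean accuracy={mean:.3f}, rank={rank}")

print("best_params_:", search.best_params_)
print("best_score_:", search.best_score_)
```

`best_score_` is the entry of `mean_test_score` at the best-ranked combination, which ties the summary attributes back to the full results table.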

In [18]:
# iris grid search and cross validation
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
# 1. load iris data
iris = load_iris()

# 2. split dataset
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target)

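# 3. StandardScaler: fit on the training set only, then apply to both sets
# (reassign X_train / X_test so the later cells actually use the scaled data)
tf = StandardScaler()
X_train = tf.fit_transform(X_train)
X_test = tf.transform(X_test)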


In [19]:
# 4. KNN
estimator = KNeighborsClassifier()

# grid search
param_dict = {"n_neighbors":range(1,100)}

estimator = GridSearchCV(estimator,param_dict,cv=10)


estimator.fit(X_train,y_train)


print("best_params_:",estimator.best_params_)
print("best_score_:",estimator.best_score_)
print("best_estimator_:",estimator.best_estimator_)
print("cv_results_:",estimator.cv_results_)


best_params_: {'n_neighbors': 3}
best_score_: 0.9636363636363635
best_estimator_: KNeighborsClassifier(n_neighbors=3)
cv_results_: {'mean_fit_time': array([5.98073006e-04, 6.01410866e-04, 5.94377518e-04, 4.98938560e-04,
       7.00449944e-04, 5.98883629e-04, 4.95409966e-04, 8.03685188e-04,
       7.94339180e-04, 6.94704056e-04, 7.91430473e-04, 9.97781754e-04,
       3.98850441e-04, 6.01720810e-04, 9.97304916e-04, 7.99083710e-04,
       4.98604774e-04, 7.97700882e-04, 6.98161125e-04, 7.97724724e-04,
       5.98502159e-04, 4.98914719e-04, 4.98580933e-04, 5.98311424e-04,
       5.98502159e-04, 0.00000000e+00, 4.98509407e-04, 3.98945808e-04,
       7.97891617e-04, 5.94949722e-04, 2.99286842e-04, 5.92446327e-04,
       8.02111626e-04, 4.98890877e-04, 5.98192215e-04, 5.02657890e-04,
       5.00512123e-04, 4.95076180e-04, 5.95831871e-04, 8.06164742e-04,
       2.99096107e-04, 6.95896149e-04, 5.98216057e-04, 3.99041176e-04,
       4.98628616e-04, 7.94792175e-04, 1.39641762e-03, 5.94878197e-04,

In [20]:


# model evaluation
# 1. Compare the predicted values directly with the true values
y_predict = estimator.predict(X_test)  
print(f"y_predict={y_predict}")
print(f"y_test == y_predict:\n{y_test == y_predict}")

y_predict_proba = estimator.predict_proba(X_test)
print(f"y_predict_proba=\n{y_predict_proba}")

# 2. Compute the accuracy
score = estimator.score(X_test,y_test)
print(f"score:{score}")

y_predict=[1 1 1 1 1 2 2 2 2 1 1 1 2 0 0 1 1 1 0 0 2 0 1 2 1 2 2 1 2 0 1 0 2 2 2 1 2
 0]
y_test == y_predict:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True]
y_predict_proba=
[[0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         0.66666667 0.33333333]
 [0.         0.         1.        ]
 [0.         0.33333333 0.66666667]
 [0.         0.         1.        ]
 [0.         0.33333333 0.66666667]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         0.33333333 0.66666667]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.        ]
 [0.         1.         0.  