# Parameter reference

## Grid search

### GridSearchCV()

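* estimator: the estimator to tune (for example, a classifier)

>For example, estimator=RandomForestClassifier().

>Pass every parameter that is not being searched directly to this estimator. The estimator must provide a score method; otherwise scoring has to be given.

* param_grid: a dict, or a list of dicts

>The candidate values of each parameter to optimize, for example param_grid = {'n_estimators': range(10, 71, 10)}.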

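* scoring: the evaluation metric

>Defaults to None, which uses the estimator's own score method.

>It can also be a metric name such as scoring='roc_auc',

>or a callable with the signature scorer(estimator, X, y).

* cv: the cross-validation strategy

>Defaults to None, which means 3 folds in the scikit-learn version used for the examples here (5 folds since scikit-learn 0.22). An integer sets the number of folds, so each fold holds out 1/cv of the data. It can also be an iterable that yields train/test splits.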

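* refit:

>Defaults to True: once the parameter search (cross-validation) finishes, the estimator is fit again on the whole dataset with the best parameters found.

* iid:

>Defaults to True. When True, the data is assumed to be identically distributed across the folds, and the reported score is the total over all samples rather than the plain mean over folds. This parameter was deprecated in scikit-learn 0.22 and removed in 0.24.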

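* verbose:

>Log verbosity as an int: 0 prints nothing during training, 1 prints occasional progress, and values above 1 print a message for every sub-model.

* n_jobs:

>Number of parallel jobs as an int. -1 uses as many jobs as there are CPU cores; the default is a single job.

* pre_dispatch:

>Controls how many jobs are dispatched in total during parallel execution. When n_jobs is greater than 1, the data is copied for every dispatched job, which can lead to OOM (Out Of Memory).

>Setting pre_dispatch caps the number of jobs dispatched in advance, so the data is copied at most pre_dispatch times.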
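The parameters above can be put together roughly as follows. This is only a minimal sketch: the synthetic data from make_classification and the names X_demo, y_demo and demo_grid are assumptions made for illustration, and the code targets a current scikit-learn where GridSearchCV lives in sklearn.model_selection and iid no longer exists.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),  # non-searched parameters are fixed here
    param_grid={"n_estimators": [10, 30, 50]},         # candidate values to try
    scoring="roc_auc",        # or None, or a callable scorer(estimator, X, y)
    cv=3,                     # number of folds
    refit=True,               # refit the best candidate on all of the data
    verbose=0,                # no progress output
    n_jobs=1,                 # single job; -1 would use all cores
    pre_dispatch="2*n_jobs",  # cap on the number of jobs dispatched in advance
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))
```

Keeping n_jobs=1 makes the sketch cheap to run anywhere; on a real search, raising n_jobs is usually the first thing to try.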


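#### Commonly used methods and attributes

* grid.fit(): runs the grid search

* grid\_scores\_: the evaluation results for each parameter combination (removed in scikit-learn 0.20; cv\_results\_ replaces it)

* best\_params\_: the parameter combination that achieved the best result

* best\_score\_: the best score observed during the search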
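A short sketch of how these attributes might be read after a search. The data, the LogisticRegression grid over C, and the names X_demo, y_demo, grid and results are all illustrative assumptions; cv_results_ is used because grid_scores_ is no longer available in current scikit-learn.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=1)

grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_demo, y_demo)  # runs the grid search

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
results = pd.DataFrame(grid.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))

print(grid.best_params_)     # parameter combination with the best mean score
print(grid.best_score_)      # mean cross-validated score of that combination
print(grid.best_estimator_)  # estimator refit on all data (requires refit=True)
```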

# Example 1

## Source

https://blog.csdn.net/jiangjingxuan/article/details/54728601

In [25]:
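from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Build the samples. Provide enough of them, otherwise the search fails with a
# "not enough members per class" error, because GridSearchCV splits the data into folds.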
X = [
    "fuck you",
    "fuck you all",
    "hello everyone",
    "fuck me",
    "hello boy",
    "fuck you",
    "fuck you all",
    "hello everyone",
    "fuck me",
    "hello boy",
    "fuck you",
    "fuck you all",
    "hello everyone",
    "fuck me",
    "hello boy",
    "fuck you",
    "fuck you all",
    "hello everyone",
    "fuck me",
    "hello boy",
    "fuck you",
    "fuck you all",
    "hello everyone",
    "fuck me",
    "hello boy"]

y = [1,0,1,0,1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,1]

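# These are the steps to run. GridSearchCV fits the whole pipeline once per candidate
# and fold (in parallel when n_jobs allows it) and compares the scores.
# The step names 'vect' and 'clf' are arbitrary, but they must match the prefixes used in parameters.
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
    ])

# max_features must be a parameter of TfidfVectorizer; each listed value is one candidate to fit.
# The values and cv used here match the recorded output that follows this cell.
parameters = {
        'vect__max_features': list(range(1, 6))
        }

# 'accuracy' means the best candidate is chosen by accuracy
grid_search = GridSearchCV(pipeline, parameters, n_jobs = -1, verbose = 1, scoring = 'accuracy', cv = 2)
grid_search.fit(X, y)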

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameter combination: ')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('{name}: {value}'.format(
        name=param_name, 
        value=best_parameters[param_name]
        )
    )

Fitting 2 folds for each of 5 candidates, totalling 10 fits

[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.0s finished

GridSearchCV(cv=2, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__max_features': [1, 2, 3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=1)

Best score: 0.800
Best parameter combination: 
vect__max_features: 1
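The `vect__max_features` key above follows the `<step name>__<parameter name>` convention. One way to find which names a pipeline accepts is to list them with get_params(), as in this small sketch; the names demo_pipeline and searchable are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression()),
])

# Nested parameters are exposed as "<step name>__<parameter name>".
searchable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
print(searchable)

# set_params accepts the same names that a param_grid would use as keys.
demo_pipeline.set_params(vect__max_features=3, clf__C=0.5)
print(demo_pipeline.named_steps["vect"].max_features)
print(demo_pipeline.named_steps["clf"].C)
```

A misspelled prefix or parameter name makes set_params (and therefore GridSearchCV) raise an error, so printing this list first can save a failed run.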


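# Example 2 (Pipeline)

The intermediate steps of a Pipeline are scikit-learn-compatible transformers, and the last step is an estimator.

In the code below, for example, the StandardScaler and PCA transformers form the intermediate steps, and LogisticRegression is the final estimator.

When we call pipe_lr.fit(X_train, y_train), StandardScaler first runs fit and transform on the training set, and the transformed data is passed to the next step of the Pipeline, PCA(). Like StandardScaler, PCA runs fit and transform, and the transformed data is finally passed to LogisticRegression, which is fit on it.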
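To make the chaining concrete, the sketch below performs the same steps by hand and compares the predictions with those of an equivalent Pipeline. The synthetic data and all variable names are assumptions for illustration; svd_solver="full" is set only so that both PCA fits take the same deterministic code path.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Manual version: each transformer is fit, then its output feeds the next step.
scaler = StandardScaler()
pca = PCA(n_components=2, svd_solver="full")
clf = LogisticRegression(random_state=1)
X_scaled = scaler.fit_transform(X_demo)
X_reduced = pca.fit_transform(X_scaled)
clf.fit(X_reduced, y_demo)
# At prediction time only transform is called on the fitted transformers.
manual_pred = clf.predict(pca.transform(scaler.transform(X_demo)))

# Pipeline version: the same sequence behind a single fit / predict call.
pipe = Pipeline([
    ("sc", StandardScaler()),
    ("pca", PCA(n_components=2, svd_solver="full")),
    ("clf", LogisticRegression(random_state=1)),
])
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

print(np.array_equal(manual_pred, pipe_pred))
```

Both versions run the same computations in the same order, so the two prediction arrays should agree.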

## Source

https://blog.csdn.net/lanchunhui/article/details/50521648

In [29]:
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Breast Cancer Wisconsin dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'breast-cancer-wisconsin/wdbc.data', header=None)

X, y = df.values[:, 2:], df.values[:, 1]
# y holds string labels ('M' / 'B');
# LabelEncoder converts them to integers starting at 0
encoder = LabelEncoder()
y = encoder.fit_transform(y)
# encoder.transform(['M', 'B']) returns array([1, 0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
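As a side note on the label encoding step: LabelEncoder sorts the distinct labels and numbers them from 0, which is why 'B' maps to 0 and 'M' to 1. A tiny standalone sketch, using a made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(["M", "B", "B", "M"])

print(demo_encoder.classes_)                    # ['B' 'M']
print(encoded)                                  # [1 0 0 1]
print(demo_encoder.inverse_transform([0, 1]))   # ['B' 'M']
```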

In [31]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.990,10.38,122.80,1001.0,0.11840,0.27760,0.300100,0.147100,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.71190,0.26540,0.4601,0.11890
1,842517,M,20.570,17.77,132.90,1326.0,0.08474,0.07864,0.086900,0.070170,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.24160,0.18600,0.2750,0.08902
2,84300903,M,19.690,21.25,130.00,1203.0,0.10960,0.15990,0.197400,0.127900,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.45040,0.24300,0.3613,0.08758
3,84348301,M,11.420,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.105200,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.68690,0.25750,0.6638,0.17300
4,84358402,M,20.290,14.34,135.10,1297.0,0.10030,0.13280,0.198000,0.104300,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.40000,0.16250,0.2364,0.07678
5,843786,M,12.450,15.70,82.57,477.1,0.12780,0.17000,0.157800,0.080890,...,15.470,23.75,103.40,741.6,0.17910,0.52490,0.53550,0.17410,0.3985,0.12440
6,844359,M,18.250,19.98,119.60,1040.0,0.09463,0.10900,0.112700,0.074000,...,22.880,27.66,153.20,1606.0,0.14420,0.25760,0.37840,0.19320,0.3063,0.08368
7,84458202,M,13.710,20.83,90.20,577.9,0.11890,0.16450,0.093660,0.059850,...,17.060,28.14,110.60,897.0,0.16540,0.36820,0.26780,0.15560,0.3196,0.11510
8,844981,M,13.000,21.82,87.50,519.8,0.12730,0.19320,0.185900,0.093530,...,15.490,30.73,106.20,739.3,0.17030,0.54010,0.53900,0.20600,0.4378,0.10720
9,84501001,M,12.460,24.04,83.97,475.9,0.11860,0.23960,0.227300,0.085430,...,15.090,40.68,97.65,711.4,0.18530,1.05800,1.10500,0.22100,0.4366,0.20750


In [34]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))
                    ])
pipe_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))
#print('Test accuracy: {:.3f}'.format(pipe_lr.score(X_test, y_test)))



Pipeline(memory=None,
     steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

Test accuracy: 0.921
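The pipeline above fixes n_components=2. One possible way to combine it with the grid search from the first part is sketched here. It assumes the copy of the same dataset bundled with scikit-learn (load_breast_cancer), so no download is needed; note that this copy encodes the labels as 0 = malignant and 1 = benign, the opposite of the LabelEncoder result above. The candidate values and the names X_bc, pipe_demo and search_demo are illustrative choices, not part of the original example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the WDBC data (569 samples, 30 features).
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=0)

pipe_demo = Pipeline([
    ('sc', StandardScaler()),
    ('pca', PCA()),
    ('clf', LogisticRegression(random_state=1, max_iter=1000)),
])

# Search over the number of principal components and the regularization strength.
search_demo = GridSearchCV(
    pipe_demo,
    {'pca__n_components': [2, 5, 10], 'clf__C': [0.1, 1.0, 10.0]},
    cv=5,
)
search_demo.fit(X_bc_train, y_bc_train)

print(search_demo.best_params_)
# score() uses the refit best pipeline on data the search never saw.
print('Test accuracy: %.3f' % search_demo.score(X_bc_test, y_bc_test))
```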




# Example 3

In [40]:
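from sklearn import svm
# sklearn.datasets.samples_generator was removed in scikit-learn 0.24; import the function directly
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV
# Generate some test data
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)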

In [52]:
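# Feature selection (SelectKBest keeps the k highest-scoring features).
# f_regression is kept to match the recorded output; f_classif is the usual choice for class labels.
anova_filter = SelectKBest(f_regression, k=5)
# Classifier
clf = svm.SVC(kernel='linear')
# Build the pipeline
anova_svm = Pipeline([
    ('anova', anova_filter), 
    ('svc', clf)
    ]
)

# Parameter settings
# Single values
#anova_svm.set_params(
#    anova__k=10, 
#    svc__C=.1, 
#    svc__kernel='linear'  # or 'poly', 'rbf', 'sigmoid'
#)

# Parameter ranges
params = dict(
    anova__k=list(range(1, 11)), 
    svc__C=list(range(1, 11)), 
    svc__kernel=['linear', 'poly', 'rbf', 'sigmoid']
)

# Grid search
grid_search = GridSearchCV(anova_svm, params, n_jobs = -1, verbose = 2, scoring = 'accuracy', cv = 3)
# Train the model
grid_search.fit(X,y)
# Show the per-candidate results. The recorded output below came from grid_scores_,
# which was removed in scikit-learn 0.20; cv_results_ is its replacement.
grid_search.cv_results_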

print('Best score: {:.3f}'.format(grid_search.best_score_))
best = grid_search.best_params_
for param_name in sorted(params.keys()):
    print('{name}: {value}'.format(
        name=param_name, 
        value=best[param_name]
        )
    )

Fitting 3 folds for each of 400 candidates, totalling 1200 fits

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed:    2.1s finished

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('anova', SelectKBest(k=5, score_func=<function f_regression at 0x0000000008D1C158>)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'anova__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'svc__C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=2)

[mean: 0.63000, std: 0.03619, params: {'anova__k': 1, 'svc__C': 1, 'svc__kernel': 'linear'},
 mean: 0.56000, std: 0.04803, params: {'anova__k': 1, 'svc__C': 1, 'svc__kernel': 'poly'},
 mean: 0.66000, std: 0.03414, params: {'anova__k': 1, 'svc__C': 1, 'svc__kernel': 'rbf'},
 mean: 0.73000, std: 0.09445, params: {'anova__k': 1, 'svc__C': 1, 'svc__kernel': 'sigmoid'},
 mean: 0.63000, std: 0.03619, params: {'anova__k': 1, 'svc__C': 2, 'svc__kernel': 'linear'},
 mean: 0.56000, std: 0.04803, params: {'anova__k': 1, 'svc__C': 2, 'svc__kernel': 'poly'},
 mean: 0.66000, std: 0.03414, params: {'anova__k': 1, 'svc__C': 2, 'svc__kernel': 'rbf'},
 mean: 0.63000, std: 0.18041, params: {'anova__k': 1, 'svc__C': 2, 'svc__kernel': 'sigmoid'},
 mean: 0.63000, std: 0.03619, params: {'anova__k': 1, 'svc__C': 3, 'svc__kernel': 'linear'},
 mean: 0.56000, std: 0.04803, params: {'anova__k': 1, 'svc__C': 3, 'svc__kernel': 'poly'},
 mean: 0.66000, std: 0.03414, params: {'anova__k': 1, 'svc__C': 3, 'svc__kernel'

Best score: 0.840
anova__k: 2
svc__C: 5
svc__kernel: rbf


In [50]:
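# Refit with the best parameters found, then predict
anova_svm.set_params(
    anova__k= 2,
    svc__C= 5,
    svc__kernel= 'rbf'
)
anova_svm.fit(X,y)
anova_svm.predict(X)
# Accuracy on the same data used for fitting, so this is an optimistic estimate
anova_svm.score(X,y)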

Pipeline(memory=None,
     steps=[('anova', SelectKBest(k=2, score_func=<function f_regression at 0x0000000008D1C158>)), ('svc', SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

Pipeline(memory=None,
     steps=[('anova', SelectKBest(k=2, score_func=<function f_regression at 0x0000000008D1C158>)), ('svc', SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1])

0.87
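The 0.87 above is measured on the same samples the model was fit on. A possible variation, sketched under the assumption that a quarter of the data is held out with train_test_split, lets GridSearchCV refit the best pipeline and then scores it on samples it has never seen. The smaller grid, the use of f_classif, and the names X_demo, demo_pipe, demo_params and demo_search are illustrative choices rather than part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Same generator settings as Example 3, then hold out 25% for the final check.
X_demo, y_demo = make_classification(n_informative=5, n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

demo_pipe = Pipeline([("anova", SelectKBest(f_classif)), ("svc", SVC())])
demo_params = {
    "anova__k": [2, 5, 10],
    "svc__C": [1, 5, 10],
    "svc__kernel": ["linear", "rbf"],
}

# refit=True (the default) retrains the best candidate on all of X_tr.
demo_search = GridSearchCV(demo_pipe, demo_params, scoring="accuracy", cv=3)
demo_search.fit(X_tr, y_tr)

print(demo_search.best_params_)
print("CV accuracy: {:.3f}".format(demo_search.best_score_))
# score() and predict() delegate to best_estimator_, so no manual set_params is needed.
print("Held-out accuracy: {:.3f}".format(demo_search.score(X_te, y_te)))
```

Because the search object already holds the refit pipeline, the manual set_params and fit step becomes optional in this variant.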