# Pipeline

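The convenience of a Pipeline is that feature transformation, model training, and grid search can all be handled as a single unit. For this to work, every step of a Pipeline except the last must be a transformer, that is, it must implement `fit` and `transform`. This is easy to understand: the intermediate steps only transform the data, and only the final step is the estimator that gets trained on the transformed result.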
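
To make the "every step but the last is a transformer" rule concrete, here is a minimal pure-Python sketch of the chaining idea. It is not scikit-learn's implementation; `Center`, `PositiveSum`, and `ToyPipeline` are hypothetical names invented for this illustration.

```python
class Center:
    """Toy transformer: learns column means in fit, subtracts them in transform."""

    def fit(self, X, y=None):
        n_rows = len(X)
        self.means_ = [sum(col) / n_rows for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class PositiveSum:
    """Toy final estimator: predicts 1 when the row sum is positive, else 0."""

    def fit(self, X, y):
        return self  # nothing to learn in this toy example

    def predict(self, X):
        return [1 if sum(row) > 0 else 0 for row in X]


class ToyPipeline:
    """Chains transformers and hands the transformed data to the last step."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def fit(self, X, y):
        # Every step except the last must offer fit and transform.
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # The last step only needs fit.
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the already-fitted transformers are only applied.
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)


X_toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y_toy = [0, 0, 1]
toy = ToyPipeline(Center(), PositiveSum()).fit(X_toy, y_toy)
predictions = toy.predict(X_toy)
print(predictions)
```

The point of the design is that `fit` walks the data through each transformer in turn, so the final estimator never sees the raw features.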

In [3]:
from sklearn import svm
from sklearn.datasets import samples_generator
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

Generate a test dataset

In [4]:
X, y = samples_generator.make_classification(n_features=20, n_informative=3, n_redundant=0, n_classes=4,
    n_clusters_per_class=2)

In [7]:
print type(X), X.shape, type(y), y.shape
print X[:5]
print '-' * 40
print y[:30]

<type 'numpy.ndarray'> (100, 20) <type 'numpy.ndarray'> (100,)
[[-0.73136804 -0.74355319 -0.01178045  0.30384299  0.05416159  0.83556058
   1.21965891 -0.5159956   0.80068141 -1.2400836  -0.46595553  1.39015463
  -1.44124081 -0.14952096  0.16574752  2.53732546 -0.1073431  -2.01848883
  -0.87046776  0.88549295]
 [ 0.78498615  0.63812859  0.02711703  1.130632    0.84190661 -1.04800836
  -0.07744851 -0.34544734 -1.26283435 -2.02191239 -0.97783715  1.9862074
   0.90882809  0.88193662 -0.0666316  -0.06598415 -1.14603766  0.55598271
  -0.6523373  -2.26675137]
 [-0.1595581   0.45873226  0.64446618 -0.17646372 -0.51216779 -0.44728081
   1.08465256 -0.66170884  0.61371716 -0.6251336  -0.78562297 -1.6033022
  -0.17310238 -0.02423014 -0.77557861 -2.27777802 -0.61179307  1.80091506
   1.0074296  -1.13879568]
 [-0.87595809 -1.34724395  0.7711238   0.79973402  0.09797152  1.69439073
   0.06448806  0.06606908  0.65706678 -1.24618161  0.65872232  1.34841607
   0.25432785  0.31002789 -1.48403666  0.759

Feature selection filter

In [8]:
anova_filter = SelectKBest(f_regression, k=3)

In [10]:
print type(anova_filter)

<class 'sklearn.feature_selection.univariate_selection.SelectKBest'>
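
The filter can also be fitted on its own to see which columns it keeps. The sketch below assumes Python 3 and a recent scikit-learn, where `make_classification` is imported directly from `sklearn.datasets`. It uses `f_classif`, the ANOVA F-test intended for classification targets, rather than the `f_regression` used in the cell above, and the `random_state` value is an arbitrary choice made only for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Same dataset shape as in the notebook; random_state is an assumption.
X, y = make_classification(n_features=20, n_informative=3, n_redundant=0,
                           n_classes=4, n_clusters_per_class=2,
                           random_state=0)

# Fit the filter alone and keep the 3 highest-scoring features.
selector = SelectKBest(f_classif, k=3).fit(X, y)

selected = selector.get_support(indices=True)  # column indices that were kept
X_reduced = selector.transform(X)              # data restricted to those columns

print(selected)
print(selector.scores_[selected])
print(X_reduced.shape)
```

`scores_` holds one score per input feature, so indexing it with `selected` shows the scores of the features that survived the filter.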


Classifier

In [11]:
clf = svm.SVC(kernel='linear')

Pipeline flow: feature filter -> classifier. Chaining the two steps in this order is what forms the pipeline.

In [12]:
anova_svm = make_pipeline(anova_filter, clf)

In [18]:
print type(anova_svm)

<class 'sklearn.pipeline.Pipeline'>
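
Because the whole chain is a single estimator, it can be passed to a grid search, and the parameters of each step are addressed as `<step name>__<parameter>`. The sketch below assumes Python 3 and a recent scikit-learn (`GridSearchCV` lives in `sklearn.model_selection`). The grid values, `cv=5`, and `random_state=0` are arbitrary illustrative choices, and `f_classif` is used instead of `f_regression` because the target is a class label.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_features=20, n_informative=3, n_redundant=0,
                           n_classes=4, n_clusters_per_class=2,
                           random_state=0)

pipe = make_pipeline(SelectKBest(f_classif), SVC(kernel='linear'))

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Tune the filter and the classifier together; the values are illustrative.
param_grid = {
    'selectkbest__k': [1, 3, 5],
    'svc__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Searching over the pipeline rather than over the classifier alone means the feature filter is refitted inside every cross-validation fold, so the selection step never sees the held-out fold.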


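In [21]:
anova_svm.fit(X, y)
predict = anova_svm.predict(X)
# Scoring the predictions against themselves always gives 1.0
print anova_svm.score(X, predict)
# Accuracy against the true labels, measured on the training data itself
print anova_svm.score(X, y)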

1.0
0.9
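
Both numbers above are computed on the data the pipeline was trained on, and the first compares the predictions with themselves, so neither says much about generalization. A hedged sketch of a hold-out evaluation follows, assuming Python 3 and a recent scikit-learn. The sample count, split ratio, and `random_state` are arbitrary choices, and `f_classif` replaces `f_regression`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=2, random_state=0)

# Hold out 30% of the samples; stratify keeps the class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

pipe = make_pipeline(SelectKBest(f_classif, k=3), SVC(kernel='linear'))
pipe.fit(X_train, y_train)

# Predictions scored against themselves: 1.0 by construction.
self_score = pipe.score(X_test, pipe.predict(X_test))

# Accuracy on samples the pipeline has not seen during fit.
test_score = pipe.score(X_test, y_test)

print(self_score)
print(test_score)
```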


In [23]:
import sklearn
help(SelectKBest)

Help on class SelectKBest in module sklearn.feature_selection.univariate_selection:

class SelectKBest(_BaseFilter)
 |  Select features according to the k highest scores.
 |  
 |  Read more in the :ref:`User Guide <univariate_feature_selection>`.
 |  
 |  Parameters
 |  ----------
 |  score_func : callable
 |      Function taking two arrays X and y, and returning a pair of arrays
 |      (scores, pvalues).
 |  
 |  k : int or "all", optional, default=10
 |      Number of top features to select.
 |      The "all" option bypasses selection, for use in a parameter search.
 |  
 |  Attributes
 |  ----------
 |  scores_ : array-like, shape=(n_features,)
 |      Scores of features.
 |  
 |  pvalues_ : array-like, shape=(n_features,)
 |      p-values of feature scores.
 |  
 |  Notes
 |  -----
 |  Ties between features with equal scores will be broken in an unspecified
 |  way.
 |  
 |  See also
 |  --------
 |  f_classif: ANOVA F-value between labe/feature for classification tasks.
 |  chi2: