

=================================================
Concatenating multiple feature extraction methods
=================================================

In many real-world examples, there are many ways to extract features from a
dataset. Often it is beneficial to combine several methods to obtain good
performance. This example shows how to use ``FeatureUnion`` to combine
features obtained by PCA and univariate selection.

Combining features using this transformer has the benefit that it allows
cross validation and grid searches over the whole process.

The combination used in this example is not particularly helpful on this
dataset and is only used to illustrate the usage of ``FeatureUnion``.


Author: Andreas Mueller <amueller@ais.uni-bonn.de>

License: BSD 3 clause

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [None]:
iris = load_iris()

In [None]:
X, y = iris.data, iris.target

This dataset is way too high-dimensional. Better do PCA:

In [None]:
pca = PCA(n_components=2)
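To get a feel for what two components keep, one can fit a separate PCA and look at its `explained_variance_ratio_`. This is only a sketch for inspection; the names `pca_check`, `X_reduced` and `retained` are illustrative and not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Fit a throwaway PCA so the `pca` object used later stays untouched.
pca_check = PCA(n_components=2).fit(X)
X_reduced = pca_check.transform(X)

# Sum of the per-component variance fractions = total variance retained.
retained = pca_check.explained_variance_ratio_.sum()
print("Reduced shape:", X_reduced.shape)
print("Fraction of variance retained:", round(float(retained), 3))
```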

Maybe some original features were good, too?

In [None]:
selection = SelectKBest(k=1)
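`SelectKBest` scores each feature independently (with `f_classif` when no `score_func` is given) and keeps the `k` highest-scoring ones. The sketch below shows one way to see which original column survives; `selector`, `mask` and `chosen` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest

iris = load_iris()
X, y = iris.data, iris.target

# Fit a separate selector; the default score function is f_classif (ANOVA F-value).
selector = SelectKBest(k=1).fit(X, y)

# get_support() returns a boolean mask over the original columns.
mask = selector.get_support()
chosen = [name for name, keep in zip(iris.feature_names, mask) if keep]

print("F-scores per feature:", selector.scores_.round(1))
print("Selected feature:", chosen)
```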

Build an estimator from PCA and univariate selection:

In [None]:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

Use the combined features to transform the dataset:

In [None]:
X_features = combined_features.fit(X, y).transform(X)
print("Combined space has", X_features.shape[1], "features")
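Conceptually, `FeatureUnion` fits each transformer on the same input and stacks their outputs side by side. The following sketch checks this on the iris data by comparing against a manual `np.hstack`; the names `union`, `X_union` and `X_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Output of the union: 2 PCA components followed by 1 selected feature.
union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("univ_select", SelectKBest(k=1))])
X_union = union.fit_transform(X, y)

# The same result assembled by hand, column block by column block.
X_pca = PCA(n_components=2).fit_transform(X)
X_best = SelectKBest(k=1).fit_transform(X, y)
X_manual = np.hstack([X_pca, X_best])

print("Union shape:", X_union.shape)
print("Matches manual stacking:", np.allclose(X_union, X_manual))
```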

In [None]:
svm = SVC(kernel="linear")

Do a grid search over ``k``, ``n_components`` and ``C``:

In [None]:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
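Because feature extraction and the classifier now live in a single estimator, the whole process can be cross-validated in one call, with PCA and the selector refit on each training fold. A minimal sketch, assuming 5-fold cross validation; `pipe` and `scores` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same structure as the pipeline in this example, rebuilt so the snippet stands alone.
features = FeatureUnion([("pca", PCA(n_components=2)),
                         ("univ_select", SelectKBest(k=1))])
pipe = Pipeline([("features", features), ("svm", SVC(kernel="linear"))])

# Each fold fits the transformers on the training part only, then scores on the held-out part.
scores = cross_val_score(pipe, X, y, cv=5)
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", round(float(scores.mean()), 3))
```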

In [None]:
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
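The grid keys follow the `<step>__<sub-step>__<parameter>` convention: `features` is the pipeline step, `pca` or `univ_select` the transformer inside the union, and the last part the parameter itself. One way to check that the keys are spelled correctly is to compare them with `get_params()`, as in this sketch (`pipe` and `valid` are illustrative names).

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

features = FeatureUnion([("pca", PCA(n_components=2)),
                         ("univ_select", SelectKBest(k=1))])
pipe = Pipeline([("features", features), ("svm", SVC(kernel="linear"))])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

# Every grid key should appear among the pipeline's (nested) parameter names.
valid = pipe.get_params()
for name in param_grid:
    print(name, "->", "ok" if name in valid else "unknown parameter")

# 3 values * 2 values * 3 values = 18 candidate settings.
print("Number of candidates:", len(ParameterGrid(param_grid)))
```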

In [None]:
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
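Besides `best_estimator_`, the fitted search also exposes the winning parameter combination and its mean cross-validated score. The sketch below reruns the same search in a self-contained form, without verbose output and with an assumed `cv=5`; `pipe` and `grid` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

features = FeatureUnion([("pca", PCA(n_components=2)),
                         ("univ_select", SelectKBest(k=1))])
pipe = Pipeline([("features", features), ("svm", SVC(kernel="linear"))])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X, y)

# best_params_ uses the same double-underscore keys as param_grid.
print("Best parameters:", grid.best_params_)
print("Best mean CV accuracy:", round(float(grid.best_score_), 3))
print("Candidates evaluated:", len(grid.cv_results_["params"]))
```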