**Building machine learning pipelines**

The scikit-learn library provides tools for building machine learning pipelines. We just need to
specify the processing steps, and it will build a composite object that passes the data through the
whole pipeline. The steps can include preprocessing, feature selection,
supervised learning, unsupervised learning, and so on. In this recipe, we will build a
pipeline that takes the input feature vector, selects the top *k* features, and then classifies the
result using a random forest classifier.

In [1]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

In [2]:
# Generate some sample data
# This will generate 20-dimensional feature vectors (n_features=20)
X, y = make_classification(n_informative=4, n_features=20, n_redundant=0, random_state=5)

In [3]:
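# The first step of the pipeline selects the k best features
# before the data point is passed on to the classifier.
# In this case, let's set k to 10.
# Note: f_regression scores each feature with a univariate F-test;
# for class labels, f_classif is the more conventional score function.
# Feature selector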
selector_k_best = SelectKBest(f_regression, k=10)

In [4]:
# The next step is to use a random forest classifier to classify the data:
# Random forest classifier
classifier = RandomForestClassifier(n_estimators=50, max_depth=4)

In [5]:
# We are now ready to build the pipeline. The Pipeline class
# lets us chain the objects defined above into a single estimator

# Build the machine learning pipeline
pipeline_classifier = Pipeline([('selector', selector_k_best),
                                ('rf', classifier)])

# Each block in the pipeline is given a name. In the preceding
# line, we assign the name selector to our feature selector and
# the name rf to our random forest classifier. You are free to use any
# other names here!
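As a small illustration of how the step names are used, the sketch below rebuilds the same two-step pipeline and looks the steps up by name. The variable names `demo_pipeline` and `auto_pipeline` are introduced here for illustration only; `make_pipeline` is shown as an alternative that derives the names from the class names instead of asking for them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline, make_pipeline

# Same two steps as above, with explicit names (demo_pipeline is a hypothetical name)
demo_pipeline = Pipeline([
    ('selector', SelectKBest(f_regression, k=10)),
    ('rf', RandomForestClassifier(n_estimators=50, max_depth=4)),
])

# Each step can be retrieved by the name it was given
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)
print(type(demo_pipeline.named_steps['rf']).__name__)

# make_pipeline assigns lowercased class names automatically
auto_pipeline = make_pipeline(
    SelectKBest(f_regression, k=10),
    RandomForestClassifier(n_estimators=50, max_depth=4),
)
auto_names = [name for name, _ in auto_pipeline.steps]
print(auto_names)
```

Explicit names are usually worth the extra typing when the parameters will be changed later, since they keep the parameter prefixes short.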

In [6]:
# We can also update the parameters of each block as we go along.
# A parameter is addressed by the name assigned in the previous
# step, followed by a double underscore and the parameter name.
# For example, if we want to set k to 6 in the feature
# selector and set n_estimators to 25 in the random forest
# classifier, we can do it as in the following code. Note that
# the prefixes (selector, rf) are the names given in the previous step:

pipeline_classifier.set_params(selector__k=6, rf__n_estimators=25)

Pipeline(memory=None,
         steps=[('selector',
                 SelectKBest(k=6,
                             score_func=<function f_regression at 0x116f61050>)),
                ('rf',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=4, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=25, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
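The double-underscore convention can be checked with `get_params`, which lists every nested parameter under its `<step name>__<parameter name>` key. This is a minimal sketch on a fresh pipeline; `demo_pipeline` and `params` are illustrative names, not part of the recipe.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

# Hypothetical pipeline mirroring the one in the recipe
demo_pipeline = Pipeline([
    ('selector', SelectKBest(f_regression, k=10)),
    ('rf', RandomForestClassifier(n_estimators=50, max_depth=4)),
])

# Nested parameters are addressed as <step name>__<parameter name>
demo_pipeline.set_params(selector__k=6, rf__n_estimators=25)

# get_params exposes the same keys, so the update can be inspected
params = demo_pipeline.get_params()
print(params['selector__k'])
print(params['rf__n_estimators'])
print(params['rf__max_depth'])  # untouched parameters keep their values
```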
        

In [7]:
# Let's go ahead and train the classifier

pipeline_classifier.fit(X, y)

Pipeline(memory=None,
         steps=[('selector',
                 SelectKBest(k=6,
                             score_func=<function f_regression at 0x116f61050>)),
                ('rf',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=4, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=25, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
        

In [8]:
# Let's predict the outputs for the training data:
prediction = pipeline_classifier.predict(X)
print("Predictions: ")
print(prediction)

Predictions: 
[1 1 0 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1
 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 1
 1 0 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1]


In [9]:
# Print the selected features chosen by the selector
features_status = pipeline_classifier.named_steps['selector'].get_support()
selected_features = []
for count, item in enumerate(features_status):
    if item:
        selected_features.append(count)

print("Selected features (0-indexed): ", ', '.join([str(x) for x in selected_features]))

Selected features (0-indexed):  0, 5, 9, 10, 11, 15
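The loop above converts a boolean mask into a list of indices. As a possible shortcut, `get_support` also accepts `indices=True` and returns the indices directly. The sketch below fits a standalone selector on the same kind of generated data; `X_demo`, `y_demo`, `mask`, and `indices` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression

# Same generator settings as in the recipe
X_demo, y_demo = make_classification(n_informative=4, n_features=20,
                                     n_redundant=0, random_state=5)

selector = SelectKBest(f_regression, k=6).fit(X_demo, y_demo)

# Boolean mask: one True/False entry per input feature
mask = selector.get_support()

# Integer indices of the selected features, without a manual loop
indices = selector.get_support(indices=True)

print("Selected features (0-indexed):", ', '.join(str(i) for i in indices))
```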


**How it works...**

The advantage of selecting the *k* best features is that we can work with lower-dimensional
data, which reduces the computational cost of training and prediction. The way in which
we select the *k* best features is based on univariate feature selection. This performs univariate
statistical tests and then extracts the top-performing features from the feature vector. Univariate
statistical tests are analysis techniques that consider a single variable at a time.
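To make the univariate idea concrete, the sketch below calls the score function directly, obtaining one score per feature, and then picks the *k* largest scores by hand. Under the assumption that no two features have exactly the same score, this should agree with what `SelectKBest` selects. The names `X_demo`, `y_demo`, `scores`, and `top_k` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression

# Same generator settings as in the recipe
X_demo, y_demo = make_classification(n_informative=4, n_features=20,
                                     n_redundant=0, random_state=5)
k = 6

# One F-statistic and one p-value per feature, each computed
# from that feature alone against the target
scores, p_values = f_regression(X_demo, y_demo)

# Indices of the k highest-scoring features, sorted for readability
top_k = np.sort(np.argsort(scores, kind="stable")[-k:])

# SelectKBest stores the same scores and keeps the same features
selector = SelectKBest(f_regression, k=k).fit(X_demo, y_demo)

print(top_k)
print(selector.get_support(indices=True))
```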

Once these tests are performed, each feature in the feature vector is assigned a score.
Based on these scores, we select the top *k* features. We do this as a preprocessing step in
our classifier pipeline. Once we extract the top *k* features, a *k*-dimensional feature vector is
formed, and we use it as the input training data for the random forest classifier.
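The two stages described above can also be written out by hand, which shows what the pipeline does internally on `fit` and `predict`. This is a sketch rather than the recipe's own code: `random_state=0` is an assumption added so that both forests are built identically and can be compared, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_classification(n_informative=4, n_features=20,
                                     n_redundant=0, random_state=5)

# Manual version: score and reduce the features, then train on the reduced data
selector = SelectKBest(f_regression, k=6)
X_reduced = selector.fit_transform(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X_reduced, y_demo)
manual_pred = forest.predict(selector.transform(X_demo))

# Pipeline version: the same two stages chained into one estimator
pipe = Pipeline([
    ('selector', SelectKBest(f_regression, k=6)),
    ('rf', RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)),
])
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

print(X_reduced.shape)                         # k columns remain after selection
print(np.array_equal(manual_pred, pipe_pred))  # both routes give the same labels
```

The pipeline form is mainly a convenience: the selector is refit together with the classifier, so the same feature subset is applied consistently at training and prediction time.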