# Part 1: Introduction to Pipeline

Scikit-learn's `Pipeline` class is designed to apply a series of data transformations followed by a final estimator.

Benefits:
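- Convenience: the whole workflow is defined in a single, easy-to-understand object.
- Enforcement: the steps always run in the declared order, on training and test data alike.
- Reproducibility: the same fitted transformations are reapplied every time the pipeline is used.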

---
In this notebook, we will perform the following:

Build **3 pipelines**, each with a different **estimator**, using default hyperparameters:

- Logistic Regression
- Support Vector Machine
- Decision Tree

Give each **pipeline** the same **transform** steps, consisting of:

- Feature Scaling
- Dimensionality Reduction (PCA)

The transformed data is then used to fit each pipeline's final estimator.
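
To make the idea concrete, the sketch below compares a hand-chained workflow with the equivalent `Pipeline`. It is a minimal illustration rather than part of the original notebook; the names `manual_pred`, `pipe_pred`, and `same_predictions` are introduced here for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Manual workflow: fit every transformer on the training data only,
# then reuse the fitted transformers on the test data.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=42)

X_train_reduced = pca.fit_transform(scaler.fit_transform(X_train))
clf.fit(X_train_reduced, y_train)
manual_pred = clf.predict(pca.transform(scaler.transform(X_test)))

# Pipeline workflow: the same three steps, chained in one object.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both workflows should label the test samples identically.
same_predictions = np.array_equal(manual_pred, pipe_pred)
print("Identical predictions:", same_predictions)
```

The pipeline version removes the bookkeeping of calling `fit_transform` on the training data and `transform` on the test data for every step, which is where hand-written workflows tend to go wrong.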

In [1]:
# Importing the dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

- We will construct pipelines for Logistic Regression, Support Vector Machine, and Decision Tree.
- `StandardScaler()` standardizes each feature so that its mean is 0 and its standard deviation is 1.
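
As a quick check of that behavior, the following sketch standardizes the iris features and inspects the per-column statistics. It is an illustrative addition; `col_means` and `col_stds` are names chosen here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Fit the scaler and transform the data in one call.
X_scaled = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and standard deviation ~1
# (up to floating-point rounding).
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print("Column means:", np.round(col_means, 6))
print("Column standard deviations:", np.round(col_stds, 6))
```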

In [None]:
# Import the required packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree

# Construct the Pipelines
pipe_logreg = Pipeline([('scaler', StandardScaler()),
                        ('pca', PCA(n_components=2)),
                        ('clf', LogisticRegression(random_state=42))])

pipe_svm = Pipeline([('scaler', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', svm.SVC(random_state=42))])

pipe_tree = Pipeline([('scaler', StandardScaler()),
                      ('pca', PCA(n_components=2)),
                      ('clf', tree.DecisionTreeClassifier(random_state=42))])

In [None]:
# List of pipelines for ease of iteration
pipelines = [pipe_logreg, pipe_svm, pipe_tree]

In [None]:
# Dictionary of Pipelines and Classifiers for ease of reference
pipe_dict = {0: 'Logistic Regression', 1: 'Support Vector Machine', 2: 'Decision Tree'}

In [None]:
# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)

In [None]:
# Compare accuracies
for idx, val in enumerate(pipelines):
    print("{} pipeline test accuracy: {:.3f}".format(pipe_dict[idx], val.score(X_test, y_test)))

Logistic Regression pipeline test accuracy: 0.900
Support Vector Machine pipeline test accuracy: 0.900
Decision Tree pipeline test accuracy: 0.867


In [None]:
# Identify the most accurate model on test data
best_accuracy = 0.0
best_clf = 0
best_pipe = None

for idx, val in enumerate(pipelines):
    accuracy = val.score(X_test, y_test)
    # Strict '>' keeps the first pipeline when accuracies tie
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_pipe = val
        best_clf = idx

# Look up the name by the best index, not the last loop index
print("Classifier with best accuracy: {}".format(pipe_dict[best_clf]))

Classifier with best accuracy: Logistic Regression


# Part 2: Integrating Grid Search

Another simple yet powerful technique we can pair with pipelines to improve performance is **grid search**, which evaluates every combination of the candidate hyperparameter values we supply and keeps the best-scoring one.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import tree

# Construct Pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', tree.DecisionTreeClassifier(random_state=42))])

In [None]:
# Fit the Pipeline
pipe.fit(X_train, y_train)

In [None]:
# Pipeline Test Accuracy
print("Test Accuracy: {:.3f}".format(pipe.score(X_test, y_test)))

In [None]:
# Pipeline estimator parameters
# Estimator is stored as step 3 ([2]), second item ([1])
print("Model hyperparameters: \n{}".format(pipe.steps[2][1].get_params()))
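
Positional indexing works, but a step can also be retrieved by the name it was given. The sketch below is an illustrative addition (the variable names are chosen here) showing the equivalent lookups, along with the `<step name>__<parameter>` keys that `Pipeline.get_params()` exposes.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier(random_state=42))])

# Three ways to reach the final estimator.
by_position = pipe.steps[2][1]      # third step, second tuple item
by_name = pipe.named_steps['clf']   # lookup by step name
by_key = pipe['clf']                # indexing the pipeline directly

print(by_position is by_name and by_name is by_key)

# The pipeline also exposes nested parameters as '<step>__<param>'.
nested_params = pipe.get_params()
print('clf__max_depth' in nested_params)
```

Name-based access keeps working if a step is later inserted before the estimator, whereas a hard-coded position would then point at the wrong step.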

In summary, we applied **feature scaling (scaler)** and **dimensionality reduction (pca)**, then fit the **final estimator (clf)**.

## Part 2.1 Adding Grid Search to the Pipeline

The purpose of grid search is to find the hyperparameter values that maximize the model's cross-validated accuracy. Grid search will be applied to the following hyperparameters:

- **criterion** - This is the function used to evaluate the quality of a split; we will try **Gini impurity** (`gini`) and **information gain** (`entropy`)
- **min_samples_leaf** - This is the minimum number of samples required for a valid leaf node; we will use the integer range of 1 to 5
- **max_depth** - This is the maximum depth of the tree; we will use the integer range of 1 to 5
- **min_samples_split** - This is the minimum number of samples required in order to split an internal node; we will use the integer range of 2 to 5, since a split needs at least two samples
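
To get a feel for the size of this search, the sketch below enumerates the grid with scikit-learn's `ParameterGrid`. It is an illustrative addition: the `clf__` prefix assumes the estimator step is named `clf`, and the fold count of 10 is an assumption about the cross-validation setting.

```python
from sklearn.model_selection import ParameterGrid

param_range = [1, 2, 3, 4, 5]

# Keys follow the '<step name>__<parameter>' convention used by Pipeline.
grid_params = {'clf__criterion': ['gini', 'entropy'],
               'clf__min_samples_leaf': param_range,
               'clf__max_depth': param_range,
               # Integer values of min_samples_split must be at least 2.
               'clf__min_samples_split': param_range[1:]}

grid = ParameterGrid(grid_params)

# Number of hyperparameter combinations: 2 * 5 * 5 * 4.
n_candidates = len(grid)

# With k-fold cross-validation, every candidate is fitted k times.
n_folds = 10  # assumed fold count
n_fits = n_candidates * n_folds

print("Candidates:", n_candidates)
print("Model fits during the search:", n_fits)
```

The count grows multiplicatively with every hyperparameter added, which is worth checking before launching a search on a larger dataset.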

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

In [None]:
# Import the required packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import tree

# Construct Pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', tree.DecisionTreeClassifier(random_state=42))])

from sklearn.model_selection import GridSearchCV

# Define parameter range
param_range = [1, 2, 3, 4, 5]

# Set Grid Search
grid_params = [{'clf__criterion': ['gini', 'entropy'],
                'clf__min_samples_leaf': param_range,
                'clf__max_depth': param_range,
                'clf__min_samples_split': param_range[1:]}]

# Construct Grid Search
grid_search = GridSearchCV(estimator=pipe,
                           param_grid=grid_params,
                           scoring='accuracy',
                           cv=10) # 10-fold cross validation

# Fit using Grid Search
grid_search.fit(X_train, y_train)

# Best mean cross-validated accuracy (computed on the training folds)
print("Best accuracy: {:.3f}".format(grid_search.best_score_))

# Best Hyperparameters
print("Best hyperparameters:\n {}".format(grid_search.best_params_))
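
The score reported by `best_score_` is a mean cross-validated accuracy on the training data, so a natural follow-up is to score the tuned pipeline on the held-out test set. The sketch below is an illustrative addition, not the notebook's own cell: to run quickly it assumes a reduced grid (`small_grid`) and 5 folds instead of the full grid and 10 folds used above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier(random_state=42))])

# Reduced grid, assumed here only to keep the example fast.
small_grid = {'clf__criterion': ['gini', 'entropy'],
              'clf__max_depth': [1, 2, 3, 4, 5]}

grid_search = GridSearchCV(estimator=pipe,
                           param_grid=small_grid,
                           scoring='accuracy',
                           cv=5)
grid_search.fit(X_train, y_train)

# With the default refit=True, the best pipeline is retrained on all of
# the training data and stored in best_estimator_.
best_pipe = grid_search.best_estimator_
test_accuracy = grid_search.score(X_test, y_test)

print("Best CV accuracy: {:.3f}".format(grid_search.best_score_))
print("Test accuracy of tuned pipeline: {:.3f}".format(test_accuracy))
```

Because the scaler and PCA live inside the pipeline, they are refitted within every cross-validation fold, so the validation folds never influence the transformations applied to them.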