**References** :
* Pipelines 
  1. https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html#pipelining
  2. **basic pipeline** : https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
  3. **make_pipeline** : https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html


* Classification algorithms
  1. **LogisticRegression** : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  2. **SVM Classifier** : https://scikit-learn.org/stable/modules/svm.html#svm-classification
  3. **KNeighborsClassifier** : https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
  4. **Decision Tree Classifier** : https://scikit-learn.org/stable/modules/tree.html#classification 

* Hyperparameter tuning 
  1. **GridSearch CV** :  https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html





# What are Pipelines
**A `Pipeline` chains multiple estimators into one.** This is useful because data processing often follows a fixed sequence of steps, for example feature selection, normalization, and classification.

Pipeline serves two purposes here:
* **Convenience:** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
* **Joint parameter selection:** You can grid search over parameters of all estimators in the pipeline at once.


**Note** : **All estimators in a pipeline, except the last one, must be transformers** (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
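
As a minimal sketch of the convenience point, the snippet below fits a two-step pipeline and checks that it produces the same predictions as running the scaler and the classifier by hand. The step names `scale` and `clf`, the variable names, and `max_iter=1000` are illustrative choices, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Manual version: fit and apply each step yourself.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
manual_clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
manual_pred = manual_clf.predict(scaler.transform(X))

# Pipeline version: a single fit call runs fit_transform on each
# transformer and then fit on the final estimator.
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X, y)
pipe_pred = demo_pipe.predict(X)

same_predictions = np.array_equal(manual_pred, pipe_pred)
print("Same predictions:", same_predictions)
```

The pipeline also applies the fitted scaler automatically at prediction time, which removes a common source of mistakes (forgetting to transform new data).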

# Importing libraries and loading data

In [40]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np

In [None]:
iris_df=load_iris()
iris_df.data
iris_df.target


In [None]:
## Splitting data (70% train, 30% test)
X_train,X_test,y_train,y_test=train_test_split(iris_df.data,iris_df.target,test_size=0.3,random_state=0)
X_train

# Creating Pipelines
Each pipeline in this notebook chains the same three steps:

1. Data preprocessing --> StandardScaler
2. Dimensionality reduction --> PCA
3. Classification --> LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, or KNeighborsClassifier
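
Because the pipelines share their first two steps, they could also be produced by a small helper. This is only a sketch: `build_pipeline` is a hypothetical helper, not a scikit-learn function. It relies on `make_pipeline` (see the references), which names each step after its lowercased class name instead of taking explicit names.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_pipeline(classifier, n_components=2):
    """Hypothetical helper: scale, reduce with PCA, then classify."""
    return make_pipeline(StandardScaler(), PCA(n_components=n_components), classifier)


knn_pipe = build_pipeline(KNeighborsClassifier(n_neighbors=3))
dt_pipe = build_pipeline(DecisionTreeClassifier(random_state=0))

# make_pipeline derives the step names from the class names.
step_names = [name for name, _ in knn_pipe.steps]
print(step_names)
```

Explicit `Pipeline([...])` definitions, as used in this notebook, remain the better choice when custom step names are needed for a parameter grid.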

## LogisticRegression Classifier pipeline

In [4]:
pipeline_LR = Pipeline([ ( 'preprocess1', StandardScaler()) , ('pca1' , PCA(n_components=2) ) , ('LR_classifier' , LogisticRegression(random_state=0) ) ])
pipeline_LR

Pipeline(memory=None,
         steps=[('preprocess1',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca1',
                 PCA(copy=True, iterated_power='auto', n_components=2,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('LR_classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=0,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

## DecisionTree Classifier pipeline

In [5]:
pipeline_DT = Pipeline([ ( 'preprocess2', StandardScaler()) , ('pca2' , PCA(n_components=2) ) , ('DT_classifier' , DecisionTreeClassifier() ) ])
pipeline_DT

Pipeline(memory=None,
         steps=[('preprocess2',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca2',
                 PCA(copy=True, iterated_power='auto', n_components=2,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('DT_classifier',
                 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort='deprecated', random_state=None,
                                        spl

## RandomForest Classifier pipeline

In [6]:
pipeline_RF = Pipeline([ ( 'preprocess3', StandardScaler()) , ('pca3' , PCA(n_components=2) ) , ('RF_classifier' , RandomForestClassifier() ) ])
pipeline_RF

Pipeline(memory=None,
         steps=[('preprocess3',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca3',
                 PCA(copy=True, iterated_power='auto', n_components=2,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('RF_classifier',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estima

## KNeighbors Classifier Pipeline

In [7]:
pipeline_KNN = Pipeline([ ( 'preprocess4', StandardScaler()) , ('pca4' , PCA(n_components=2) ) , ('KNN_classifier' , KNeighborsClassifier(n_neighbors=3) ) ])
pipeline_KNN

Pipeline(memory=None,
         steps=[('preprocess4',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca4',
                 PCA(copy=True, iterated_power='auto', n_components=2,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('KNN_classifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=3, p=2,
                                      weights='uniform'))],
         verbose=False)

## Deciding which pipeline is best

In [8]:
pipelines = [pipeline_LR , pipeline_DT ,  pipeline_RF , pipeline_KNN]
classifiers_list = ['Logistic Regression', 'Decision Tree', 'RandomForest' , 'KNN Classifier']

In [9]:
best_accuracy = 0.0
best_pipeline = ""

In [10]:
for pipe in pipelines:
  pipe.fit(X_train , y_train)

In [11]:
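for i, model in enumerate(pipelines):
    acc = model.score(X_test, y_test)
    name = classifiers_list[i]
    print(name, " Accuracy :", acc)
    # Update the name only when this pipeline is at least as accurate;
    # with >=, ties go to the pipeline that appears later in the list.
    if acc >= best_accuracy:
        best_accuracy = acc
        best_pipeline = name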

Logistic Regression  Accuracy : 0.8666666666666667
Decision Tree  Accuracy : 0.9111111111111111
RandomForest  Accuracy : 0.9111111111111111
KNN Classifier  Accuracy : 0.9111111111111111


In [12]:
## Best classifier and accuracy
print(best_pipeline , " is best classifier with accuracy = " , best_accuracy)

KNN Classifier  is best classifier with accuracy =  0.9111111111111111
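
Three pipelines tie on this single test split, so the reported winner depends on how ties are broken. One possible way to get a steadier comparison is cross-validation on the training data. The sketch below assumes two of the classifiers and uses illustrative step names (`scale`, `pca`, `clf`); it is not the notebook's selection method.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "KNN Classifier": KNeighborsClassifier(n_neighbors=3),
}

cv_means = {}
for name, clf in candidates.items():
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", clf),
    ])
    # 5-fold cross-validated accuracy, computed on the training data only.
    scores = cross_val_score(pipe, X_tr, y_tr, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {cv_means[name]:.3f}")

best_name = max(cv_means, key=cv_means.get)
print("Best by cross-validation:", best_name)
```

Keeping the test set out of the selection step means it can still give an unbiased estimate for whichever pipeline is chosen.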


# Hyperparameter Tuning Using Pipelines
Here we use Grid Search CV (`GridSearchCV`) to tune the hyperparameters of the pipelines.

Part 1 : Grid Search CV on a single-classifier pipeline (logistic regression), followed by a second pipeline rebuilt with the best parameters <br/>
Part 2 : Grid Search CV on all the classifiers with a single pipeline
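
Grid search addresses parameters inside a pipeline with the `<step name>__<parameter name>` convention. A small sketch (the step names `pca` and `clf` are illustrative) that lists the nested names and sets two of them by hand:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo = Pipeline([("pca", PCA(n_components=2)), ("clf", LogisticRegression())])

# Every nested parameter is exposed as "<step>__<parameter>".
nested_names = sorted(k for k in demo.get_params() if "__" in k)
print(nested_names[:5])

# set_params accepts the same keys that a GridSearchCV parameter grid uses.
demo.set_params(pca__n_components=3, clf__C=0.5)
print(demo.named_steps["pca"].n_components, demo.named_steps["clf"].C)
```

The keys in the parameter grids in Part 1 and Part 2 follow this pattern, using the step names chosen for those pipelines.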



## Part 1 : Grid Search CV on a single-classifier pipeline (logistic regression)

In [34]:
## Building a single pipeline for LogisticRegression
pipe1 = Pipeline([ ( 'PP1', StandardScaler()) , ('PCA1' , PCA(n_components=2) ) , ('LR1' , LogisticRegression(random_state=0) ) ])

In [35]:
## Grid search on the above pipeline

# Parameter dictionary for Grid Search CV.
# Iris has only 4 features, so PCA n_components cannot exceed 4.
# C must be positive, so the negative values from the linspace are dropped.
param_grid1 = {'PCA1__n_components' : [2, 3, 4] ,
               'LR1__C' : [c for c in np.linspace(-4, 4, 4) if c > 0]}

# Performing Grid Search CV
grid_search1 = GridSearchCV(pipe1 , param_grid1 , scoring = 'accuracy' , n_jobs=-1)
grid_search1.fit(X_train,y_train)

# Best mean cross-validated accuracy and best parameters returned by Grid Search CV
GS_acc = grid_search1.best_score_
print('Grid search accuracy : ',GS_acc)
LR1_best_para = grid_search1.best_params_
print( " best_hyperparameters : " , LR1_best_para )

Grid search accuracy :  0.9428571428571428
 best_hyperparameters :  {'LR1__C': 1.333333333333333, 'PCA1__n_components': 3}


In [36]:
## Test the best parameters found above by rebuilding the pipeline with them

# New pipeline using the grid search result (n_components=3, C=1.333...)
final_pipe1 = Pipeline([
    ('final_PP1', StandardScaler()),
    ('final_PCA1', PCA(n_components=3)),
    ('final_LR1', LogisticRegression(random_state=0, C=1.333333333333333)),
])

# Score the tuned pipeline on the held-out test set
final_pipe1.fit(X_train , y_train)
final_LR1_acc = final_pipe1.score(X_test , y_test)
print( "Final Accuracy :" , final_LR1_acc )

Final Accuracy : 0.9777777777777777
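
With the default `refit=True`, `GridSearchCV` refits the best pipeline on the whole training set, so the tuned model can also be scored directly instead of rebuilding the pipeline by hand. The sketch below uses a deliberately small grid; the grid values, step names, and `max_iter=1000` are illustrative and are not the notebook's settings.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(random_state=0, max_iter=1000)),
])
small_grid = {"pca__n_components": [2, 3], "clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, small_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

# best_estimator_ is the full pipeline, already refit with best_params_.
tuned_pipe = search.best_estimator_
test_acc = tuned_pipe.score(X_te, y_te)
print(search.best_params_, test_acc)
```

Calling `search.score(X_te, y_te)` delegates to the same refit pipeline, so both routes give the same test accuracy.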


## Part 2 : Grid Search CV on all the classifiers with a single pipeline

In [39]:
## Initially, just create a pipeline with any one classifier as a placeholder
pipe = Pipeline(steps = [ ('classifier' , KNeighborsClassifier()) ])

In [43]:
## Now tune across all the classifiers above, each with its own hyperparameters

# List of parameter dictionaries: each one swaps in a classifier and lists its hyperparameters
grid_param = [
    # Only 'l2' is searched: the default 'lbfgs' solver does not support 'l1'
    # (use solver='saga' or 'liblinear' to add it). C must be positive,
    # so the 0 endpoint of the range is dropped.
    {"classifier": [LogisticRegression()],
     "classifier__penalty": ['l2'],
     "classifier__C": np.linspace(0, 4, 10)[1:]
     },

    {"classifier": [RandomForestClassifier()],
     "classifier__n_estimators": [10, 100, 1000],
     "classifier__max_depth": [5, 8, 15, 25, 30, None],
     "classifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
     "classifier__max_leaf_nodes": [2, 5, 10]
     },

    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [2, 3, 4, 5, 6, 7, 8],
     "classifier__n_jobs": [-1]
     },

    {"classifier": [DecisionTreeClassifier()],
     "classifier__min_samples_split": [2, 3, 4, 5, 6, 7, 8],
     },

    {"classifier": [SVC()],
     "classifier__C": [1, 10, 100, 1000],
     "classifier__kernel": ['linear']
     },

    {"classifier": [SVC()],
     "classifier__C": [1, 10, 100, 1000],
     "classifier__kernel": ['rbf'],
     "classifier__gamma": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
     }
]



In [44]:
# Set up the grid search over the pipeline (the fit happens in the next cell)
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0, n_jobs=-1)

In [47]:
## Fit the grid search on the training data
best_model = gridsearch.fit(X_train,y_train)   # Takes around 10-15 minutes because the parameter grid above has so many combinations
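
The run time grows with the number of candidate settings multiplied by the number of folds. `ParameterGrid` can count the candidates before the search is launched. The grid below is a cut-down stand-in (named `demo_grid` here) so the example stays short; it is not the full `grid_param`.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

demo_grid = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [2, 3, 4, 5, 6, 7, 8]},
    {"classifier": [SVC()],
     "classifier__C": [1, 10, 100, 1000],
     "classifier__kernel": ["rbf"],
     "classifier__gamma": [0.1, 0.5, 0.9]},
]

# Each dictionary contributes the product of its list lengths:
# 1 * 7 for KNN plus 1 * 4 * 1 * 3 for SVC.
n_candidates = len(ParameterGrid(demo_grid))
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, "candidates,", n_fits, "fits")
```

Running the same count on the real grid before fitting gives a rough idea of whether the search needs trimming.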


In [51]:
## Best parameters returned by Grid Search CV
GS_accuracy = best_model.best_score_
print('Grid search accuracy : ',GS_accuracy)
Best_hyperparameters = best_model.best_params_
print( " best hyperparameters : " , Best_hyperparameters )
#print(" best pipeline : " ,best_model.best_estimator_)

Grid search accuracy :  0.980952380952381
 best hyperparameters :  {'classifier': SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False), 'classifier__C': 10, 'classifier__gamma': 0.1, 'classifier__kernel': 'rbf'}


In [53]:
## Testing the best parameters (getting accuracy)
final_acc = best_model.score(X_test , y_test)
print( "Final Best model Accuracy :" , final_acc )

Final Best model Accuracy : 0.9777777777777777
