## This notebook demonstrates a simple scikit-learn pipeline and how to use GridSearchCV with it, using the iris dataset. The goal is not to build a powerful classifier but to demonstrate a scikit-learn pipeline in its simplest form. More complex examples will be added.

In [28]:
# import required packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings
# Silence all warnings; note this also hides convergence and fit-failure warnings
warnings.filterwarnings('ignore')

In [2]:
# Download the iris dataset
raw_data = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv')


In [3]:
# Quick look at the data
raw_data.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [4]:
# Check for any missing data
raw_data.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [8]:
# Split data into train (70%) and test (30%) sets, stratified by species
X_train, X_test, Y_train, Y_test = train_test_split(raw_data.drop(['species'], axis=1),
                                                    raw_data['species'],
                                                    test_size=0.3,
                                                    stratify=raw_data['species'],
                                                    random_state=42)
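
The effect of `stratify` in the split above can be checked with a small sketch. It assumes scikit-learn's bundled copy of iris (`load_iris`) instead of the CSV URL, so the names `X`, `y`, `train_counts`, and `test_counts` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the downloaded CSV.
X, y = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Iris has 50 rows per class, so a stratified 70/30 split keeps
# the three classes balanced in both the train and the test part.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts.tolist())  # [35, 35, 35]
print(test_counts.tolist())   # [15, 15, 15]
```

Without `stratify`, the per-class counts would generally differ between the two parts, which matters more on small or imbalanced datasets.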

## Pipeline code
1. Create a preprocessing pipeline that scales the features
2. Compose the preprocessing pipeline with a logistic regression classifier
3. Finally, fit the full pipeline on the training data and test the classifier on the held-out data
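
To make the three steps above concrete, here is a minimal sketch of what a pipeline does under the hood: it fits the scaler on the training data, transforms both splits with it, and passes the result to the classifier. The sketch assumes the bundled `load_iris` data and a flat two-step pipeline (`pipe_demo`) rather than the nested one used in this notebook; all names in it are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for the downloaded CSV.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Pipeline version: scaling and classification in one estimator.
pipe_demo = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe_demo.fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

# Manual version: the same steps written out by hand.
scaler = StandardScaler().fit(X_tr)            # statistics come from train only
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = model.predict(scaler.transform(X_te))

# Both routes should give identical predictions.
same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```

The pipeline form is shorter and, more importantly, keeps the scaler from ever seeing test data, which is easy to get wrong when the steps are written by hand.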

In [18]:

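# Preprocessing pipeline: standardize each feature to zero mean and unit variance
num_trans = Pipeline(steps=[
            ('scale', StandardScaler())
])

# Full pipeline: preprocessing followed by the classifier
pipe = Pipeline(steps=[
       ('preprocess', num_trans),
       ('classifier', LogisticRegression())
])
# Fit the scaler and the classifier on the training data only
pipe.fit(X_train, Y_train)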

In [21]:
pipe.score(X_test, Y_test)

0.9111111111111111

## GridSearchCV

In [23]:
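# Keys follow the '<step name>__<parameter>' convention to reach into the pipeline.
# Note: 'newton-cg' and 'lbfgs' do not support the 'l1' penalty, so those
# combinations fail to fit and get a NaN score under the default error_score.
parameters = {
            'classifier__penalty': ['l1', 'l2'],
            'classifier__C'      : np.logspace(-3,3,7), #array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03])
            'classifier__solver' : ['newton-cg', 'lbfgs', 'liblinear'],
}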
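
The double-underscore keys in the grid above address parameters of named pipeline steps. The sketch below shows where those names come from (`get_params()`), and one possible way to restrict the grid to penalty/solver pairs that `LogisticRegression` accepts by passing a list of dictionaries. The names `full_grid` and `split_grid` are hypothetical, and the split grid is a suggestion, not what this notebook runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same nested structure as the notebook's pipeline.
num_trans = Pipeline(steps=[("scale", StandardScaler())])
pipe = Pipeline(steps=[
    ("preprocess", num_trans),
    ("classifier", LogisticRegression()),
])

# Every tunable parameter is listed under '<step>__<parameter>';
# nested pipelines simply chain the step names.
param_names = pipe.get_params().keys()
print("classifier__C" in param_names)                 # True
print("preprocess__scale__with_mean" in param_names)  # True

C_values = np.logspace(-3, 3, 7)

# The notebook's grid: every penalty crossed with every solver.
full_grid = {
    "classifier__penalty": ["l1", "l2"],
    "classifier__C": C_values,
    "classifier__solver": ["newton-cg", "lbfgs", "liblinear"],
}

# Hypothetical alternative: one dict per group of compatible settings.
split_grid = [
    {"classifier__penalty": ["l1"],
     "classifier__C": C_values,
     "classifier__solver": ["liblinear"]},
    {"classifier__penalty": ["l2"],
     "classifier__C": C_values,
     "classifier__solver": ["newton-cg", "lbfgs", "liblinear"]},
]

n_full = len(ParameterGrid(full_grid))    # 2 * 7 * 3 = 42 candidates
n_split = len(ParameterGrid(split_grid))  # 7 + 21 = 28 candidates
print(n_full, n_split)
```

`GridSearchCV` accepts either form as `param_grid`; the list form skips the 14 `l1` candidates that `newton-cg` and `lbfgs` cannot fit.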

In [29]:
clf = GridSearchCV(pipe,
                  param_grid=parameters,
                  scoring='accuracy',
                  cv=10)

In [30]:
clf.fit(X_train, Y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('preprocess',
                                        Pipeline(steps=[('scale',
                                                         StandardScaler())])),
                                       ('classifier', LogisticRegression())]),
             param_grid={'classifier__C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'classifier__penalty': ['l1', 'l2'],
                         'classifier__solver': ['newton-cg', 'lbfgs',
                                                'liblinear']},
             scoring='accuracy')

In [31]:
clf.best_params_

{'classifier__C': 1.0,
 'classifier__penalty': 'l2',
 'classifier__solver': 'newton-cg'}

In [32]:
clf.score(X_test, Y_test)

0.9111111111111111

In [33]:
# Rebuild the pipeline with the best parameters found by the grid search
pipe2 = Pipeline(steps=[
       ('preprocess', num_trans),
       ('classifier', LogisticRegression(C=1.0, penalty='l2', solver='newton-cg'))
])

In [34]:
pipe2.fit(X_train, Y_train)

Pipeline(steps=[('preprocess', Pipeline(steps=[('scale', StandardScaler())])),
                ('classifier', LogisticRegression(solver='newton-cg'))])

In [35]:
pipe2.score(X_test, Y_test)

0.9111111111111111
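
Rebuilding the pipeline by hand, as in the last cells, is not strictly necessary: with the default `refit=True`, `GridSearchCV` refits the best configuration on the whole training set and exposes it as `best_estimator_`. The sketch below illustrates this under a few assumptions: it uses the bundled `load_iris` data, a flat pipeline, a grid over `classifier__C` only, `cv=5`, and `max_iter=1000`; none of these names or settings come from the notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for the downloaded CSV.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

pipe_demo = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Smaller grid than the notebook's, to keep the sketch quick.
search = GridSearchCV(
    pipe_demo,
    param_grid={"classifier__C": np.logspace(-3, 3, 7)},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

# The refitted best pipeline is available directly.
best_pipe = search.best_estimator_
search_score = search.score(X_te, y_te)
best_score = best_pipe.score(X_te, y_te)

# Manual route: copy the unfitted pipeline, apply best_params_, refit.
manual = clone(pipe_demo).set_params(**search.best_params_).fit(X_tr, y_tr)
manual_score = manual.score(X_te, y_te)

print(search.best_params_)
print(search_score, best_score, manual_score)
```

All three scores should agree, which is consistent with `pipe`, `clf`, and `pipe2` reporting the same test accuracy in this notebook: the search selected the classifier's default settings.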