This is a basic example of building a classification pipeline in which different classification algorithms can be tried out. Once the pipeline is built, hyperparameter tuning can be done using cross-validation.

In [2]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib 
import matplotlib.pyplot as plt
import seaborn as sns


## Reading the data

In [4]:
data = pd.read_csv('./iris-species/Iris.csv')
data.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
Id               150 non-null int64
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.1+ KB


In [6]:
data.drop('Id',axis=1,inplace=True)

## Exploratory Analysis

In [7]:
#cool visualization from https://www.kaggle.com/benhamner/python-data-visualizations

# `size` was renamed to `height` in newer seaborn releases
sns.pairplot(data, hue='Species', height=3)

<seaborn.axisgrid.PairGrid at 0x7f0161d0b410>

Petal length and petal width are highly correlated. When working on huge datasets, one feature from each highly correlated pair can be omitted during feature selection.
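The correlation seen in the pair plot can also be checked numerically. The sketch below is a minimal illustration, not part of the original notebook: it assumes scikit-learn's bundled copy of the Iris data (loaded with `load_iris`) instead of the CSV file, so the column names differ from the ones used above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data so the snippet
# runs without the CSV file; column names differ from Iris.csv.
iris = load_iris(as_frame=True)
features = iris.data

# Pearson correlation between every pair of numerical features
corr = features.corr()
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']

print(corr.round(2))
print('Petal length vs. petal width: %.2f' % petal_corr)
```

A value close to 1 for the petal pair supports what the plot suggests visually.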

## Building the Pipeline

Before proceeding, 'Species' must be encoded as an integer using `LabelEncoder()`. *(Could someone kindly shed light on whether transformations of 'y' can be part of the pipeline? Last time I checked, this was an open issue in sklearn.)*
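As a small illustration of what the encoding does, here is a sketch on a hand-written array of species names (the `species` array is made up for this example). Keeping the fitted encoder in a variable, rather than discarding it, makes it possible to map integer predictions back to species names later with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, only for illustration
species = np.array(['Iris-setosa', 'Iris-virginica',
                    'Iris-versicolor', 'Iris-setosa'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

# classes_ holds the original labels in sorted order;
# each label's position in it is its integer code
print(encoder.classes_)
print(encoded)

# Map integer codes back to the original species names
decoded = encoder.inverse_transform(encoded)
print(decoded)
```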

We are building a basic pipeline with two steps:

* Normalize numerical features with `StandardScaler()`
* Run the classifier, `KNeighborsClassifier()`


In [8]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data['Species'] = LabelEncoder().fit_transform(data['Species'])
data.iloc[[0,1,-2,-1],:]

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |


In [9]:
pipeline = Pipeline([
    ('normalizer', StandardScaler()), #Step1 - normalize data
    ('clf', KNeighborsClassifier(n_neighbors=3)) #step2 - classifier
])
pipeline.steps

[('normalizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('clf',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=1, n_neighbors=3, p=2,
             weights='uniform'))]

In [10]:
# Separate train and test data
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,:-1].values,
                                                   data['Species'],
                                                   train_size = 0.75,
                                                   test_size = 0.25,
                                                   random_state = 10)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(112, 4)
(38, 4)
(112,)
(38,)


Use Cross-validation to test the accuracy of the pipeline

In [11]:
from sklearn.model_selection import cross_validate

scores = cross_validate(pipeline, X_train, y_train)
scores

{'fit_time': array([ 0.00298214,  0.00139999,  0.00212908]),
 'score_time': array([ 0.00176597,  0.00107598,  0.00131106]),
 'test_score': array([ 0.86842105,  0.84210526,  0.97222222]),
 'train_score': array([ 1.        ,  0.95945946,  0.92105263])}
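The call above relies on the library defaults for the number of folds and for whether training scores are returned, and those defaults have changed between scikit-learn releases. The sketch below sets both explicitly. It is an illustration under assumptions: it rebuilds the pipeline and uses scikit-learn's bundled Iris data so that it runs on its own, and `cv=5` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data instead of the CSV file
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10)

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=3)),
])

# cv=5 requests five folds; return_train_score=True adds 'train_score'
scores = cross_validate(pipeline, X_train, y_train,
                        cv=5, return_train_score=True)

print('Mean validation accuracy: %.3f' % scores['test_score'].mean())
print('Mean training accuracy:   %.3f' % scores['train_score'].mean())
```

Note that `cross_validate` fits clones of the pipeline, so `pipeline` itself is still unfitted afterwards and needs its own `fit` call before it can predict.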

In [None]:
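# cross_validate fits clones of the pipeline, so fit it on the training data before predicting
pipeline.fit(X_train, y_train)
y_predict = pipeline.predict(X_test)
accuracy = accuracy_score(y_test,y_predict)
print('Accuracy of KNeighborsClassifier is %.3f%%' % (accuracy*100))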

## Using Other Classifiers

The classifier step of the pipeline can be replaced with any other classifier. Here I am trying out `SVC()` and `LogisticRegression()`.

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

pipeline.set_params(clf= SVC())

In [None]:
pipeline.fit(X_train, y_train)
y_predict = pipeline.predict(X_test)
accuracy = accuracy_score(y_test,y_predict)
print('Accuracy of SVC is %.3f%%' % (accuracy*100))

In [None]:
pipeline.set_params(clf= LogisticRegression())
pipeline.fit(X_train, y_train)
y_predict = pipeline.predict(X_test)
accuracy = accuracy_score(y_test,y_predict)
print('Accuracy of Logistic Regression is %.3f%%' % (accuracy*100))
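The fit-and-score cells above repeat the same steps for each classifier, so they can be condensed into a loop. This is a sketch under assumptions, not the notebook's own code: it uses scikit-learn's bundled Iris data so that it runs standalone, and the `candidates` and `results` dictionaries are names introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled Iris data instead of the CSV file
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10)

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=3)),
])

# Classifiers to try in the 'clf' step
candidates = {
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=3),
    'SVC': SVC(),
    'LogisticRegression': LogisticRegression(),
}

results = {}
for name, clf in candidates.items():
    pipeline.set_params(clf=clf)      # swap the classifier step
    pipeline.fit(X_train, y_train)    # refit scaler and classifier
    results[name] = accuracy_score(y_test, pipeline.predict(X_test))
    print('Accuracy of %s is %.3f%%' % (name, results[name] * 100))
```

With only 38 test samples, one misclassified flower moves the accuracy by more than 2.5 percentage points, so small differences between classifiers should be read with caution.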

## Cross-Validation and Hyper Parameters Tuning

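Cross-validation evaluates a model by training and scoring it on several different splits of the training data. Hyperparameter tuning uses it to find the best combination of parameters: the model is trained and evaluated for each combination, and the best-scoring one is kept.
`GridSearchCV` takes a pipeline and a grid of parameters as input and cross-validates every combination in the grid.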

In [None]:
from sklearn.model_selection import GridSearchCV
pipeline.steps

Here we try out different values for the `solver` parameter and the inverse regularization strength `C` of the logistic regression classifier.
To provide values for a parameter of a step in the pipeline, the syntax is `stepname__parameter` (step name, two underscores, parameter name).
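To find out which names are valid for this syntax, the pipeline can list them itself through `get_params()`. The following is a minimal sketch that rebuilds the two-step pipeline with `LogisticRegression()` so that it runs on its own.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Every name accepted by set_params and by a GridSearchCV param_grid
param_names = sorted(pipeline.get_params().keys())

# Keep only the parameters that belong to the 'clf' step
clf_params = [name for name in param_names if name.startswith('clf__')]
print(clf_params)

# The same syntax sets a nested parameter directly
pipeline.set_params(clf__C=0.8, clf__solver='lbfgs')
print(pipeline.named_steps['clf'].C)
```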

In [None]:
cv_grid = GridSearchCV(pipeline, param_grid = {
    'clf__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
    'clf__C' : [0.6,0.8,1,1.2,1.4]
})

cv_grid.fit(X_train, y_train)

The best combination of the parameters can be accessed from `best_params_`

In [None]:
cv_grid.best_params_
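Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` (the mean cross-validated score of the winning combination) and `cv_results_` (scores for every combination). The sketch below shows how these could be inspected. It makes several assumptions so that it runs standalone: scikit-learn's bundled Iris data, an explicit `cv=5`, and a reduced grid with only two solvers.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data instead of the CSV file
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10)

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Reduced grid (an assumption for this sketch): 5 values of C x 2 solvers
cv_grid = GridSearchCV(pipeline, param_grid={
    'clf__solver': ['newton-cg', 'lbfgs'],
    'clf__C': [0.6, 0.8, 1, 1.2, 1.4],
}, cv=5)
cv_grid.fit(X_train, y_train)

print(cv_grid.best_params_)
print('Mean CV accuracy of best combination: %.3f' % cv_grid.best_score_)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(cv_grid.cv_results_)
columns = ['params', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score').head())
```

Looking at the whole table rather than only the winner shows whether the best combination is clearly ahead or tied with several others.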

In [None]:
y_predict = cv_grid.predict(X_test)
accuracy = accuracy_score(y_test,y_predict)
print('Accuracy of Logistic Regression after CV is %.3f%%' % (accuracy*100))

The accuracy of logistic regression increased from 94.73% to 97.36% after changing the solver from `liblinear` to `newton-cg`.

I'll revisit and improve the pipeline in the future. Reviews and suggestions for improving this process are welcome.

