This is a basic example of building a classification pipeline in which different classification algorithms can be tried out. Once the pipeline is built, hyperparameter tuning can be done using cross-validation.

I've updated the notebook to follow [ML-mastery's process](https://machinelearningmastery.com/python-machine-learning-mini-course/), which is very useful for getting consistent results.

In [2]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib 
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

#from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

## Reading the data

In [4]:
data = pd.read_csv('./iris-species/Iris.csv')
data.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
Id               150 non-null int64
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.1+ KB


In [6]:
data.drop('Id',axis=1,inplace=True)

## Exploratory Analysis

In [7]:
#cool visualization from https://www.kaggle.com/benhamner/python-data-visualizations

sns.pairplot(data, hue='Species', height=3)  # 'size' was renamed to 'height' in seaborn 0.9

<seaborn.axisgrid.PairGrid at 0x7f0161d0b410>

Petal length and petal width are highly correlated. When working on huge datasets, one feature of a highly correlated pair can be omitted during feature selection.
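
To put a number on the correlation seen in the pair plot, here is a minimal sketch. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) as a stand-in for `Iris.csv`, so the column names differ from the ones used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled Iris dataset stands in for Iris.csv.
iris = load_iris(as_frame=True)
features = iris.data  # the four numeric measurement columns

# Pearson correlation between every pair of features
corr = features.corr()
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']

print(corr.round(2))
print('petal length vs petal width: %.3f' % petal_corr)
```

A value close to 1 for the petal pair is what the near-straight-line scatter in the pair plot suggests.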

## Building the Pipeline

Before proceeding, 'Species' must be encoded to an integer using `LabelEncoder()`. *(Could someone kindly shed light on whether transformations to 'y' can be part of the pipeline? Last time I checked, this was an open issue in sklearn.)*

We are building a basic pipeline with two steps:

* Normalize numerical features with `StandardScaler()`
* Run the classifier, `LogisticRegression()`


In [50]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data['Species'] = LabelEncoder().fit_transform(data['Species'])
data.iloc[[0,1,-2,-1],:]

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
148,6.2,3.4,5.4,2.3,2
149,5.9,3.0,5.1,1.8,2
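
The cell above discards the fitted encoder, so the integer codes cannot be mapped back to species names later. A small sketch of keeping the encoder around follows. The short `species` list is made up for illustration and is not the notebook's data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, only for illustration
species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)  # classes are sorted, then numbered from 0

# Mapping from class name to integer code
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}

# Turn integer codes (e.g. predictions) back into names
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(list(decoded))
```

As far as I know, scikit-learn classifiers also accept string labels directly, so the encoding step is optional for fitting and mainly helps when an integer target is wanted elsewhere.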


In [52]:
pipeline = Pipeline([
    ('normalizer', StandardScaler()), #Step1 - normalize data
    ('clf', LogisticRegression()) #step2 - classifier
])
pipeline.steps

[('normalizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('clf',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False))]

In [53]:
#Separate train and test data
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,:-1].values,
                                                   data['Species'],
                                                   test_size = 0.4,
                                                   random_state = 10)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(90, 4)
(60, 4)
(90,)
(60,)


### Trying Logistic Regression Classifier

Use Cross-validation to test the accuracy of the pipeline

In [54]:
from sklearn.model_selection import cross_validate

scores = cross_validate(pipeline, X_train, y_train)
scores

{'fit_time': array([ 0.00220585,  0.00263977,  0.001616  ]),
 'score_time': array([ 0.00113821,  0.00096512,  0.00048399]),
 'test_score': array([ 0.83870968,  0.87096774,  0.89285714]),
 'train_score': array([ 0.91525424,  0.88135593,  0.88709677])}

In [55]:
scores['test_score'].mean()

0.86751152073732729
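
The output above shows three folds and a `train_score` entry, which matches older scikit-learn defaults. In recent versions `cross_validate` defaults to five folds and omits train scores unless asked. A sketch that pins both choices explicitly is given below. It assumes the bundled Iris data instead of `Iris.csv`, and the shuffled `StratifiedKFold` is my own choice, so the scores will not match the ones above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data as a stand-in for Iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=10)

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Three stratified folds with a fixed seed, so repeated runs give the same splits
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=10)
scores = cross_validate(pipeline, X_train, y_train, cv=cv,
                        return_train_score=True)

print(scores['test_score'])
print(scores['test_score'].mean())
```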

## Average accuracy of the pipeline with Logistic Regression: 86.75%

## Spot Check Algorithms in the pipeline

#### Trying out the following classification algorithms

    -LogisticRegression
    -Support Vector Machines - linear and rbf
    -K-nearest Classifier
    -Decision Tree Classifier
    -Random Forest Classifier
    -Gradient Boosting Classifier

The classifier step of the pipeline is swapped for each candidate in turn, for example `SVC()` or `KNeighborsClassifier()`.

In [58]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

clfs = []
clfs.append(LogisticRegression())
clfs.append(SVC())                 # rbf kernel (the default)
clfs.append(SVC(kernel='linear'))  # linear kernel
clfs.append(KNeighborsClassifier(n_neighbors=3))
clfs.append(DecisionTreeClassifier())
clfs.append(RandomForestClassifier())
clfs.append(GradientBoostingClassifier())

for classifier in clfs:
    pipeline.set_params(clf = classifier)  # swap the classifier step
    scores = cross_validate(pipeline, X_train, y_train)
    print('-----------------------------------')
    print(str(classifier))
    print('-----------------------------------')
    # mean and standard deviation of each metric across the folds
    for key, values in scores.items():
        print(key, ' mean ', values.mean())
        print(key, ' std ', values.std())

-----------------------------------
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
-----------------------------------
score_time  mean  0.0015119711558
score_time  std  0.00067293877536
test_score  mean  0.867511520737
test_score  std  0.0222402952927
train_score  mean  0.894568981228
train_score  std  0.0148132638848
fit_time  mean  0.0037534236908
fit_time  std  0.000704138049282
-----------------------------------
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
-----------------------------------
score_time  mean  0.00143663088481
score_time  std  0.000794463423358
test_score  mean  0.955837173
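
The printed blocks get long quickly. One possible way to compare candidates side by side is to collect the cross-validation scores into a small table, sketched below. It assumes the bundled Iris data, a shortened candidate list, and three folds. The names `candidates`, `rows` and `summary` are mine, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled Iris data as a stand-in for Iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=10)

# Shortened, illustrative list of candidates
candidates = {
    'LogisticRegression': LogisticRegression(),
    'SVC (rbf)': SVC(kernel='rbf'),
    'SVC (linear)': SVC(kernel='linear'),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=3),
}

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', LogisticRegression()),
])

rows = []
for name, classifier in candidates.items():
    pipeline.set_params(clf=classifier)  # swap the classifier step
    scores = cross_validate(pipeline, X_train, y_train, cv=3)
    rows.append({
        'classifier': name,
        'mean': scores['test_score'].mean(),
        'std': scores['test_score'].std(),
    })

# Best mean validation accuracy first
summary = pd.DataFrame(rows).sort_values('mean', ascending=False)
summary = summary.reset_index(drop=True)
print(summary)
```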

### Among the classifiers, SVC has the highest accuracy (95.58%), hence choosing SVC

## Cross-Validation and Hyper Parameters Tuning

Hyperparameter tuning is the process of finding the best combination of parameters for the model. A grid search trains the model for each combination of the parameters and evaluates it with cross-validation.

In [59]:
from sklearn.model_selection import GridSearchCV
pipeline.set_params(clf= SVC())
pipeline.steps

[('normalizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]

Trying out different values for the kernel and the regularization strength 'C' of the SVC classifier.
To provide values for a parameter of a step in the pipeline, the syntax is *stepname__parameter*.
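
As a minimal sketch of the *stepname__parameter* syntax outside of a grid search: `get_params()` lists every name a pipeline accepts, and `set_params()` takes the same names. The pipeline here is rebuilt from scratch so the snippet runs on its own.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', SVC()),
])

# Names that belong to the 'clf' step, e.g. 'clf__C' and 'clf__kernel'
clf_params = sorted(name for name in pipeline.get_params()
                    if name.startswith('clf__'))
print(clf_params)

# Set two parameters of the classifier step through the pipeline
pipeline.set_params(clf__C=0.5, clf__kernel='linear')
print(pipeline.named_steps['clf'].C, pipeline.named_steps['clf'].kernel)
```

The keys of `param_grid` in `GridSearchCV` are these same names.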

In [60]:
cv_grid = GridSearchCV(pipeline, param_grid = {
    'clf__kernel' : ['linear', 'rbf'],
    'clf__C' : np.linspace(0.1,1.2,12)
})

cv_grid.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('normalizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__C': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,
        1.2]), 'clf__kernel': ['linear', 'rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

The best combination of the parameters can be accessed from `best_params_`

In [61]:
cv_grid.best_params_

{'clf__C': 0.20000000000000001, 'clf__kernel': 'linear'}

In [62]:
cv_grid.best_estimator_

Pipeline(memory=None,
     steps=[('normalizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', SVC(C=0.20000000000000001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [63]:
cv_grid.best_score_

0.97777777777777775
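
Besides the single best score, the fitted search keeps the score of every combination in `cv_results_`. The sketch below shows one way to look at it as a table. It assumes the bundled Iris data and three folds, so the numbers may differ from the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled Iris data as a stand-in for Iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=10)

pipeline = Pipeline([
    ('normalizer', StandardScaler()),
    ('clf', SVC()),
])

param_grid = {
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': np.linspace(0.1, 1.2, 12),
}
cv_grid = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
cv_grid.fit(X_train, y_train)

# One row per parameter combination (2 kernels x 12 values of C)
results = pd.DataFrame(cv_grid.cv_results_)
columns = ['param_clf__kernel', 'param_clf__C',
           'mean_test_score', 'std_test_score', 'rank_test_score']
top = results.sort_values('rank_test_score')[columns].head()
print(top)
```

Looking at the top few rows shows whether the best combination is a clear winner or one of several near ties.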

In [67]:
y_predict = cv_grid.predict(X_test)
accuracy = accuracy_score(y_test,y_predict)
print('Accuracy of SVC after CV is %.3f%%' % (accuracy*100))

Accuracy of SVC after CV is 95.000%


I'll revisit and make improvements to the pipeline in the future. Kindly provide reviews and suggestions to improve this process.

IMPROVEMENTS

- I've added cross-validation by using KFolds at each step, to get more meaningful results.
- Initially, I made the rookie mistake of checking the accuracy with the test dataset for each classifier. By using KFolds I've essentially created and used a validation dataset and saved the test dataset to score the best model.
- If the results are really accurate, it's a good idea to tweak the train/test split to get more test data.
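
When changing the split size, one optional refinement (not something this notebook does) is to pass `stratify` so each class keeps the same share in the train and test sets. A sketch, assuming the bundled Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data as a stand-in for Iris.csv
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both parts of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=10)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
print(train_counts, test_counts)
```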


Kindly upvote if you've found this notebook useful :)
