
# Pipelined Processing - Simplify the Workflow

A typical machine learning task generally involves data preparation to varying degrees. We won't get into the wide array of activities which make up data preparation here, but there are many. Such tasks are known for taking up a large proportion of time spent on any given machine learning task.

After a dataset is cleaned up from a potential initial state of massive disarray, there are still several less intensive yet no less important transformative data preprocessing steps, such as **feature extraction**, **feature scaling** and **dimensionality reduction**, to name just a few.

Maybe preprocessing requires only one of these transformations, such as some form of scaling.
But maybe we need to string a number of transformations together, and ultimately finish off with an estimator of some sort.
This is where Scikit-learn Pipelines can be helpful.

Scikit-learn's Pipeline class is designed as a manageable way to apply a series of data transformations followed by the application of an estimator. 

In fact, that's really all it is:

Pipeline of transforms with a final estimator.
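To make that description concrete, here is a minimal sketch comparing a hand-chained sequence of transforms with the same steps wrapped in a `Pipeline`. It assumes the Iris data and the scaler, PCA and logistic regression steps used later in this notebook; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Manual version: fit each transformer and pass its output to the next step.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()
X_scaled = scaler.fit_transform(X)
X_reduced = pca.fit_transform(X_scaled)
clf.fit(X_reduced, y)
manual_pred = clf.predict(pca.transform(scaler.transform(X)))

# Pipeline version: the same three steps held in a single object.
pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Both routes apply identical operations, so the predictions should agree.
same_predictions = np.array_equal(manual_pred, pipe_pred)
print('Manual and pipeline predictions agree:', same_predictions)
```

The pipeline version needs one `fit` and one `predict` call, and there is no way to forget a transform or apply the steps in the wrong order at prediction time.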

This simple tool is useful for:

- Convenience in creating a coherent and easy-to-understand workflow
- Enforcing workflow implementation and the desired order of step applications
- Reproducibility
- Value in persistence of entire pipeline objects (goes to reproducibility and convenience)

So let's have a quick look at Pipelines. Specifically, here is what we will do.

Build 3 pipelines, each with a different estimator (classification algorithm), using default hyperparameters:

**Logistic Regression**

**Support Vector Machine**

**Decision Tree**

To demonstrate pipeline transforms, we will perform:

**> feature scaling**

**> dimensionality reduction using PCA to project 4-dimensional data onto a 2-dimensional space**

**> fitting of our final estimators**

Afterwards, and almost completely unrelated, in order to make this a little more like a full-fledged workflow (it still isn't, but closer), we will:

**> Follow up by scoring the test data**

**> Compare pipeline model accuracies and identify the "best" model, meaning the one with the highest accuracy on our test data**

**> Persist (save to file) the entire pipeline of the "best" model generated**

Given that we will use default hyperparameters, this likely won't result in the most accurate possible models, but it will provide a sense of how to use simple pipelines. 



In [0]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline


In [0]:
# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

In [0]:
# Construct some pipelines
pipe_ss_pca_lr = Pipeline([('scl', StandardScaler()),
			('pca', PCA(n_components=2)),
			('clf', LogisticRegression(random_state=42))])



pipe_ss_pca_lr.fit(X_train, y_train)

# -----------------------------------------------------------------------------
# Test, either by predicting and scoring explicitly...
pred = pipe_ss_pca_lr.predict(X_test)

print('\n Accuracy SS_PCA_LR : %.3f \n' % accuracy_score(y_test, pred))

# -----------------------------------------------------------------------------
# ...or, more conveniently, with the score method of the Pipeline class
score = pipe_ss_pca_lr.score(X_test, y_test)
print('\n Accuracy SS_PCA_LR : %.3f \n' % score)



 Accuracy SS_PCA_LR : 0.933 


 Accuracy SS_PCA_LR : 0.933 
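A fitted pipeline also keeps each fitted step available for inspection. The sketch below rebuilds the same pipeline (under the illustrative name `pipe`) and looks at the PCA step through `named_steps`; it is one way to peek inside, not the only one.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])
pipe.fit(X_train, y_train)

# Each fitted step is reachable by the name given at construction time.
scaler_step = pipe.named_steps['scl']
pca_step = pipe.named_steps['pca']

explained = pca_step.explained_variance_ratio_
print('Share of variance kept by each component:', explained)

# Apply only the two transformers to see what the classifier receives.
X_test_2d = pca_step.transform(scaler_step.transform(X_test))
print('Shape of test data after scaling and PCA:', X_test_2d.shape)
```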






# Spot Check Many Such Pipelined Estimators/Models

### Now Let's Create a Pipeline for Each of Several Classifiers, Each with Its Own Data Preprocessing Steps

In [0]:
from sklearn import svm
from sklearn import tree

In [0]:
# Construct a number of pipelines
# Each pipeline wraps a different estimator together with its own data preprocessing steps
# In this example, all the estimators use the same preprocessing steps: standardize the data, then reduce dimensionality with PCA, before fitting the model.

# Pipeline for estimator: Logistic Regression
pipe_lr = Pipeline([('scl', StandardScaler()),
			('pca', PCA(n_components=2)),
			('clf', LogisticRegression(random_state=42))])

# Pipeline for estimator: Support Vector Machine
pipe_svm = Pipeline([('scl', StandardScaler()),
			('pca', PCA(n_components=2)),
			('clf', svm.SVC(random_state=42))])

# Pipeline for estimator: Decision Tree
pipe_dt = Pipeline([('scl', StandardScaler()),
			('pca', PCA(n_components=2)),
			('clf', tree.DecisionTreeClassifier(random_state=42))])
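Since the three pipelines above differ only in their final step, the repetition could be factored out. The following is a sketch only: `make_preprocessed_pipeline` and `named_pipelines` are hypothetical names introduced here, not part of Scikit-learn or of the rest of this notebook.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_preprocessed_pipeline(estimator):
    """Hypothetical helper: prepend scaling and 2-component PCA to an estimator."""
    # New transformer instances are created on every call, so the
    # resulting pipelines do not share any fitted state.
    return Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', estimator)])


# Keying the pipelines by a readable name keeps labels and models together.
named_pipelines = {
    'Logistic Regression': make_preprocessed_pipeline(LogisticRegression(random_state=42)),
    'Support Vector Machine': make_preprocessed_pipeline(SVC(random_state=42)),
    'Decision Tree': make_preprocessed_pipeline(DecisionTreeClassifier(random_state=42)),
}

for name, pipe in named_pipelines.items():
    print(name, '->', [step_name for step_name, _ in pipe.steps])
```

A dictionary like this can stand in for a separate list of pipelines plus an index-to-name lookup, at the cost of one extra helper function.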


### Create a List of Pipelines

In [0]:
# List of pipelines for ease of iteration
pipelines = [pipe_lr, pipe_svm, pipe_dt]

# Dictionary mapping each pipeline's list index to its classifier name, for ease of reference
pipe_dict = {0: 'Logistic Regression', 1: 'Support Vector Machine', 2: 'Decision Tree'}


### Iteratively Train all the Pipelines/Models and Test their Performances on the Test Set

In [0]:

# Fit the pipelines
for pipe in pipelines:
	pipe.fit(X_train, y_train)

# Compare accuracies
for idx, pipe in enumerate(pipelines):
	print('\n %s pipeline test accuracy: %.3f \n' % (pipe_dict[idx], pipe.score(X_test, y_test)))



 Logistic Regression pipeline test accuracy: 0.933 


 Support Vector Machine pipeline test accuracy: 0.900 


 Decision Tree pipeline test accuracy: 0.867 





### Find the Best Model by Comparing Their Accuracies

In [0]:
# Identify the most accurate model on test data
best_acc = 0.0
best_clf = 0
best_pipe = None

for idx, pipe in enumerate(pipelines):
	acc = pipe.score(X_test, y_test)  # score each pipeline only once
	if acc > best_acc:
		best_acc = acc
		best_pipe = pipe
		best_clf = idx
print('\n Classifier with best accuracy: %s \n' % pipe_dict[best_clf])



 Classifier with best accuracy: Logistic Regression 



# Model Persistence - Serialize the Best Model Object

We have two options:

1. pickle

2. joblib

# Option 1:
## Serialization with Pickle

In [0]:
import pickle

In [0]:
# Open the file in binary write mode; the with block closes it automatically
with open('pipeline_model.pkl', 'wb') as output:
	pickle.dump(best_pipe, output)
print('\n Serialized pipelined model to a disk file.... \n')



 Serialized pipelined model to a disk file.... 



### At some other time you may read the model from the disk and use it for prediction.....

In [0]:

# Open the file in binary read mode; the with block closes it automatically
with open('pipeline_model.pkl', 'rb') as pkl_file:
	model2 = pickle.load(pkl_file)

print(model2)
print('\n Model output %s \n' % model2.predict([[5, 3.5, 2, 1]]))



Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

 Model output [0] 
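A quick sanity check after persisting a model is to confirm that the restored pipeline behaves like the original. The sketch below does the round trip in memory with `pickle.dumps` and `pickle.loads` so that it does not touch any file; the pipeline and the name `restored` are illustrative stand-ins for `best_pipe` and `model2`.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])
pipe.fit(X_train, y_train)

# Serialize to bytes and back; the fitted scaler, PCA and classifier
# all travel together as one object.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

identical = np.array_equal(pipe.predict(X_test), restored.predict(X_test))
print('Restored pipeline reproduces the original predictions:', identical)
```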



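# Option 2:

## Serialize with Joblib

In [0]:
# Older scikit-learn releases exposed this as sklearn.externals.joblib;
# that alias was removed in scikit-learn 0.23, so import joblib directly
import joblib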

In [0]:
# Save pipeline to file
joblib.dump(best_pipe, 'best_pipeline.jbl', compress=1)

print('\n Serialized pipelined model to a disk file.... \n')



 Serialized pipelined model to a disk file.... 



### At some other time you may read the model from the disk and use it for prediction.....

In [0]:
# Load saved model from the file
model1 = joblib.load('best_pipeline.jbl')

print(model1)

print('\n Model output %s \n' % model1.predict([[5, 3.5, 2, 1]]))


Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

 Model output [0] 



## What Next?
This is a simple implementation of Scikit-learn pipelines.
In this particular case, our logistic regression-based pipeline with default parameters scored the highest accuracy.

However, these results likely don't represent our best efforts. 
What if we did want to test a series of different hyperparameters? 

Can we use grid search? 
Can we incorporate automated methods for tuning these hyperparameters? 
What about using cross-validation?

Try those...we have already learnt these concepts in our previous lectures.
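As one possible starting point, here is a sketch that combines grid search and cross-validation over a whole pipeline with `GridSearchCV`. The candidate values in `param_grid`, the 5-fold setting and `max_iter=1000` are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA()),
                 ('clf', LogisticRegression(random_state=42, max_iter=1000))])

# Parameters of a step are addressed as <step name>__<parameter name>.
# These candidate values are illustrative assumptions.
param_grid = {
    'pca__n_components': [2, 3],
    'clf__C': [0.1, 1.0, 10.0],
}

# Every candidate is evaluated with 5-fold cross-validation on the training
# data only; the scaler and PCA are refit inside each fold.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Mean cross-validated accuracy: %.3f' % search.best_score_)
print('Test accuracy of the refit pipeline: %.3f' % search.score(X_test, y_test))
```

Because the preprocessing steps live inside the pipeline, they are fitted only on the training portion of each fold, so the cross-validated scores are not inflated by information from the held-out portion.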