# Pipelines

At their core, pipelines are a way to pass data through a sequence of predefined steps.  At each step, the pipeline fits a transformer to the data, transforms it, and passes the transformed data on to the next step.  The last step does whatever you ask of the pipeline (e.g. if you call .predict(), it returns the final model's predictions).
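
To make the mechanics concrete, here is a minimal sketch of what a pipeline does under the hood. The toy data and the step names `scale` and `knn` are assumptions made up for illustration; they are not part of this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features, two well-separated classes
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
              [8.0, 80.0], [9.0, 90.0], [10.0, 100.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# With a pipeline: one fit call runs fit/transform on each step in order
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=3))])
pipe.fit(X, y)
pipe_predictions = pipe.predict(X)

# By hand: the same steps, chained manually
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)
manual_predictions = knn.predict(scaler.transform(X))

# Both routes should give the same predictions
print(np.array_equal(pipe_predictions, manual_predictions))
```

The pipeline version saves you from repeating `scaler.transform(...)` every time new data arrives.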




In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union
from sklearn.preprocessing import Binarizer, PolynomialFeatures, StandardScaler

In [11]:
# Load in the dataset
df = pd.read_csv('iris.csv')

In [13]:
df.species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [14]:
# Create a function that encodes each species name as an integer
def split_species(val):
    if val == 'Iris-setosa':
        return 0
    elif val == 'Iris-versicolor':
        return 1
    else:
        return 2

In [15]:
# Apply that function to the dataset
df['species'] = df['species'].apply(split_species)

df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


![](iris.png)

In [16]:
df.species.unique()

array([0, 1, 2])

In some ways, we can think of what we just did as one layer of a pipeline: we took the data, passed it through our function, and got transformed data as the output.
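
The same fit/transform idea can be seen with scikit-learn's `LabelEncoder`. It is not used in this notebook, but it should produce the same 0/1/2 encoding here because it numbers the class labels in sorted order. The short `species` series below is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the species column
species = pd.Series(['Iris-setosa', 'Iris-virginica',
                     'Iris-versicolor', 'Iris-setosa'])

encoder = LabelEncoder()
encoder.fit(species)                   # learn the sorted set of class labels
encoded = encoder.transform(species)   # map each label to its integer code

print(list(encoder.classes_))
print(encoded.tolist())
```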

In [17]:
# Split the data into predictor (X) / target (y) columns
y = df['species'].copy()
X = df[[column for column in df.columns if column != 'species']].copy()

In [19]:
# Split the data into training and test sets
# random_state fixes the shuffle seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2017)
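
As a quick check of what `random_state` does, the sketch below splits a stand-in array twice with the same seed and compares the results. The array is an assumption used in place of the iris data; it only shares the row count of 150. With the default `test_size` of 0.25, that gives 112 training rows and 38 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 150 rows, the size of the iris dataset
X = np.arange(150).reshape(-1, 1)
y = np.arange(150) % 3

# Two splits with the same seed
split_a = train_test_split(X, y, random_state=2017)
split_b = train_test_split(X, y, random_state=2017)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
train_rows, test_rows = len(split_a[0]), len(split_a[1])
print(same_split, train_rows, test_rows)
```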

Now that our data is ready to go into a pipeline, let's create one.

**We should perform three tasks in this pipeline:**

1. FeatureExtractor --> so as to select a single feature and return the values associated with that feature
2. Binarizer --> so as to create a cutoff
3. KNeighborsClassifier --> so as to classify the iris flowers.

**End Goal:** Essentially we want to take the petal_length feature, binarize it into a dummy variable with the median petal length as the cutoff point, and predict the flower species using that feature.
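
To see what the cutoff does, here is a small sketch of `Binarizer` on its own. The four petal lengths are made up, and 4.0 stands in for the median. Values strictly greater than the threshold become 1; everything else becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical petal lengths, shaped (n_samples, 1)
petal_length = np.array([[1.4], [4.0], [4.7], [5.1]])

# 4.0 is an assumed cutoff standing in for the median
binarizer = Binarizer(threshold=4.0)
dummy = binarizer.fit_transform(petal_length)

print(dummy.ravel().tolist())
```

Note that a value equal to the threshold (4.0 here) lands in the 0 group.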

Important note: every step in our pipeline is a tuple of a name and a transformer or estimator instance, like this:

           ('NameOfStep', TransformerOrEstimator())
           

In [27]:
# We need to setup our feature extractor

class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[[self.column]].values
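
A short usage sketch of the extractor on its own, assuming a made-up three-row DataFrame. The double brackets in `X[[self.column]]` matter: they keep the result two-dimensional, which is the shape scikit-learn transformers and estimators expect.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureExtractor(BaseEstimator, TransformerMixin):
    """Select one DataFrame column and return it as a 2D array."""
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X, y=None):
        return X[[self.column]].values

# Hypothetical three-row frame
frame = pd.DataFrame({'petal_length': [1.4, 4.7, 5.1],
                      'petal_width': [0.2, 1.4, 1.8]})

extracted = FeatureExtractor('petal_length').fit_transform(frame)
print(extracted.shape)
```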


## Build the Pipeline

In [30]:
steps = [('extract_petal_length', FeatureExtractor('petal_length')),
        ('cut_off_at_median', Binarizer(threshold=X_train['petal_length'].median())),
        ('predict_using_knneighbors', KNeighborsClassifier())]

In [31]:
pipeline1 = Pipeline(steps)

In [33]:
# Inspect the pipeline to confirm that all three steps are in place
pipeline1

Pipeline(steps=[('extract_petal_length', FeatureExtractor(column='petal_length')), ('cut_off_at_median', Binarizer(copy=True, threshold=4.0)), ('predict_using_knneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))])
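
If you do not need custom step names, `make_pipeline` (imported at the top of this notebook) builds the same kind of object and names each step after its lowercased class name. The sketch below is an illustration with an assumed threshold of 4.0, not a replacement for `pipeline1`.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

# make_pipeline generates the step names automatically
auto_named = make_pipeline(Binarizer(threshold=4.0), KNeighborsClassifier())

step_names = [name for name, _ in auto_named.steps]
print(step_names)

# Individual steps are reachable by name
print(auto_named.named_steps['binarizer'].threshold)
```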

In [35]:
# Fit the pipeline on the training set, then score it on that same data
pipeline1.fit(X_train, y_train)
print('Accuracy:', pipeline1.score(X_train, y_train))

Accuracy: 0.571428571429


The pipeline has already seen the data above. If we pass it new data (i.e. the test set), it applies every step again and returns an accuracy score, provided the new data has a column named 'petal_length'.

In [38]:
print('Accuracy:', pipeline1.score(X_test, y_test))
print('Predictions:', pipeline1.predict(X_test)[0:15])

Accuracy: 0.526315789474
Predictions: [1 1 1 1 1 1 1 1 1 1 1 1 0 0 1]
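
One observation about these scores, offered as a likely explanation rather than something this notebook establishes: after binarizing, the classifier sees a single feature with only two possible values, so it can produce at most two distinct predictions for a three-class problem. The 15 predictions shown above contain only 0s and 1s. The sketch below reproduces the effect on scikit-learn's bundled copy of the iris data (an assumption, since the notebook reads `iris.csv`), where column index 2 is petal length.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

iris = load_iris()
petal_length = iris.data[:, [2]]   # keep the 2D shape scikit-learn expects
target = iris.target               # three classes: 0, 1, 2

pipe = Pipeline([
    ('cut_off_at_median', Binarizer(threshold=np.median(petal_length))),
    ('knn', KNeighborsClassifier()),
])
pipe.fit(petal_length, target)

# Identical inputs always get identical predictions, and only two
# distinct inputs (0 and 1) ever reach the classifier
distinct_predictions = np.unique(pipe.predict(petal_length))
print(len(distinct_predictions), 'of', len(np.unique(target)), 'classes predicted')
```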


# Let's try a different Dataset

To understand pipelines better (and possibly predict with better accuracy), let's try a different dataset.

#### Abalones

An abalone is a type of marine snail.  This data, taken from the University of California, Irvine's Machine Learning Repository, is publicly available and is used for predicting the age of an abalone.  An abalone's age is tough to confirm manually, as you would need to cut open the shell and use a microscope to count the rings.  Let's automate this process so that future researchers can save time.

**For those interested, a picture of an abalone and its shell**
![](abalone.jpg)
![](abalone2.jpg)

In [39]:
abalone_columns = ['sex', 'length', 'diameter', 'height',
                  'whole_weight', 'shucked_weight', 'viscera_weight',
                  'shell_weight', 'rings']

abalone = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
                     names=abalone_columns)

abalone.head()

| | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


In [40]:
abalone.dtypes

sex                object
length            float64
diameter          float64
height            float64
whole_weight      float64
shucked_weight    float64
viscera_weight    float64
shell_weight      float64
rings               int64
dtype: object

In [41]:
# Set up predictor and target columns
X = abalone[[column for column in abalone.columns if column != 'rings']].copy()

# Binary target: 1 if the abalone has more rings than the average, else 0
rings = abalone['rings'].copy()
y = rings.apply(lambda x: 1 if x > rings.mean() else 0).copy()
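
For reference, the thresholding of `rings` can also be written as a vectorised comparison, which avoids recomputing the mean for every row. The sketch below compares the two forms on a made-up series of four ring counts (mean 10.25).

```python
import pandas as pd

# Hypothetical ring counts
rings = pd.Series([7, 9, 10, 15])

# Row-by-row version, as in the cell above
y_apply = rings.apply(lambda x: 1 if x > rings.mean() else 0)

# Vectorised version: boolean comparison cast to 0/1
y_vectorised = (rings > rings.mean()).astype(int)

print(y_apply.tolist(), y_vectorised.tolist())
```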


**Steps for Pipeline:**

1. FeatureExtractor --> so as to pull the diameter column
2. PolynomialFeatures --> so as to create a set of diameter, diameter^2, diameter^3
3. StandardScaler --> so as to standardize the data
4. LogisticRegression --> so as to create a model
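
Steps 2 and 3 can be tried in isolation. The sketch below uses three made-up diameters to show the columns that `PolynomialFeatures` produces and the effect of `StandardScaler` on them.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical diameters, shaped (n_samples, 1)
diameter = np.array([[1.0], [2.0], [3.0]])

# Columns: diameter, diameter^2, diameter^3 (no constant column)
poly = PolynomialFeatures(3, include_bias=False)
expanded = poly.fit_transform(diameter)
print(expanded)

# Each column is rescaled to mean 0 and unit variance
scaled = StandardScaler().fit_transform(expanded)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Scaling after the polynomial expansion matters because the cubed column would otherwise live on a very different scale from the original one.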

In [42]:
# Split the data into training (70%) and test (30%) sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2017)


In [49]:
# create the pipeline steps

steps = [('extract_diameter', FeatureExtractor('diameter')),
        ('create_polynomials', PolynomialFeatures(3, include_bias=False)),
        ('standardize', StandardScaler()),
        ('predict', LogisticRegression())]

In [50]:
# create the pipeline

pipeline2 = Pipeline(steps)

In [51]:
pipeline2.fit(X_train, y_train)

Pipeline(steps=[('extract_diameter', FeatureExtractor(column='diameter')), ('create_polynomials', PolynomialFeatures(degree=3, include_bias=False, interaction_only=False)), ('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('predict', LogisticRegression(C=1.0, class_weight=None, dual...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
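
Once a pipeline is fitted, its individual steps remain accessible through `named_steps`. The sketch below shows this on synthetic data standing in for the diameter column; the data, the 0.4 cutoff, and the seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the diameter column and a binary target
rng = np.random.RandomState(2017)
diameter = rng.uniform(0.1, 0.6, size=(200, 1))
target = (diameter.ravel() > 0.4).astype(int)

pipe = Pipeline([('create_polynomials', PolynomialFeatures(3, include_bias=False)),
                 ('standardize', StandardScaler()),
                 ('predict', LogisticRegression())])
pipe.fit(diameter, target)

# Pull out the fitted model and look at its learned weights:
# one weight per polynomial feature
model = pipe.named_steps['predict']
print(model.coef_.shape)
```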

In [54]:
# Train accuracy is close to the test accuracy below, so the model does not appear to be overfitting

print('Train Accuracy:', pipeline2.score(X_train, y_train))

Train Accuracy: 0.722203215874


In [55]:
print('Test Accuracy:', pipeline2.score(X_test, y_test))

Test Accuracy: 0.721690590112


In [58]:
from sklearn.metrics import confusion_matrix, classification_report

predictions = pipeline2.predict(X_test)

conf_matrix = pd.DataFrame(confusion_matrix(y_test, predictions),
                          columns=['Predicted 0', 'Predicted 1'],
                          index=['Actual 0', 'Actual 1'])

print(conf_matrix)

          Predicted 0  Predicted 1
Actual 0          427          187
Actual 1          162          478


In [59]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.72      0.70      0.71       614
          1       0.72      0.75      0.73       640

avg / total       0.72      0.72      0.72      1254
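
The numbers in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below does that with plain NumPy, using the four counts from that output.

```python
import numpy as np

# Confusion matrix from the output above: rows are actual, columns predicted
conf = np.array([[427, 187],
                 [162, 478]])

accuracy = np.trace(conf) / conf.sum()          # correct / all samples
precision = np.diag(conf) / conf.sum(axis=0)    # correct / all predicted as that class
recall = np.diag(conf) / conf.sum(axis=1)       # correct / all actually in that class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(precision.round(2), recall.round(2), f1.round(2))
```

The accuracy computed this way, 905 correct out of 1254, matches the test accuracy reported by `pipeline2.score`.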

