## Class 11 Agenda:

  * **Pipelines: Putting your entire ML workflow together**

In [25]:
#data handling, model creation/evaluation
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures
from sklearn import metrics
import scipy.stats as stats

# visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from yellowbrick import ClassBalance
#from yellowbrick.classifier.learning_curve import LearningCurveVisualizer
from yellowbrick.classifier import ROCAUC

### Pipelines: Putting An Entire Model Together End to End

OK, the last thing we are going to learn is how to combine every aspect of creating and using a supervised machine learning model:

1. Transforming your original data (removing skew, standard scaling, encoding categorical variables as numbers)
2. Training and validating a model on that data
3. Picking parameters for a given model to optimize accuracy/precision/recall/f1 score, etc.

Let's try to see how we would do this without a pipeline. Let's get some data:

In [4]:
columns = ["sex","length","diam","height","whole","shucked","viscera","shell","age"]
numeric_columns = columns[1:-1]
categorical_columns = columns[0]
target = columns[-1]

abalone_data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",names=columns)
abalone_data.head()

Unnamed: 0,sex,length,diam,height,whole,shucked,viscera,shell,age
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


Now let's preprocess it in the standard way I've shown you:

1. Let's convert the categorical column using one-hot encoding
2. Standard scale (Z-score) the numeric columns

In [5]:
#get categorical features
#drop the last dummy column because it is redundant (it is implied by the others)
X_categorical = pd.get_dummies(abalone_data[categorical_columns]).astype(int).iloc[:,:-1]

#get and transform numeric features
#copy first so we don't assign into a slice of abalone_data (avoids SettingWithCopyWarning)
X_numeric = abalone_data[numeric_columns].copy()
X_numeric[numeric_columns] = StandardScaler().fit_transform(X_numeric)

#get outcome variable
y = abalone_data[target]

#combine transformed categorical and numeric features
X_final = pd.concat((X_numeric,X_categorical),axis=1)


And now, let's do our standard 10-fold cross-validation scoring:

In [6]:
#create rf regressor and check 10-fold RMSE
rf = RandomForestRegressor()
cross_val_scores = np.abs(cross_val_score(rf,X_final,y,scoring = "neg_mean_squared_error", cv=10))
rmse_cross_val_scores = np.sqrt(cross_val_scores)
print("Mean 10-fold rmse: ", np.mean(rmse_cross_val_scores))
print("Std 10-fold rmse: ", np.std(rmse_cross_val_scores))



Mean 10-fold rmse:  2.2430106478558427
Std 10-fold rmse:  0.6262780240949671
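The sign handling in the cell above is easy to miss: with `scoring="neg_mean_squared_error"`, `cross_val_score` returns the MSE of each fold multiplied by -1, so the values have to be flipped before taking the square root. Here is a minimal sketch using made-up fold scores (the numbers are hypothetical and are not from the abalone run):

```python
import numpy as np

# Hypothetical per-fold scores, shaped like what cross_val_score returns for
# scoring="neg_mean_squared_error": negated MSE, so every value is <= 0.
neg_mse_scores = np.array([-4.0, -9.0, -16.0])

# Flip the sign to recover the MSE, then take the square root fold by fold.
rmse_per_fold = np.sqrt(-neg_mse_scores)

# The mean of the per-fold RMSEs (what the cell above reports) is not the
# same quantity as the square root of the mean MSE.
mean_of_rmse = rmse_per_fold.mean()
rmse_of_mean_mse = np.sqrt((-neg_mse_scores).mean())

print(rmse_per_fold)
print(mean_of_rmse, rmse_of_mean_mse)
```

Using `np.abs` as in the cell above gives the same result as negating, since every score is non-positive.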


Now we are going to do the same thing using scikit-learn's `Pipeline`. First, we need a transformer class that lets us select the subset of columns we want to work with. [See this example in the scikit-learn documentation as well](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html):

In [7]:
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to sklearn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
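To make the selector's behaviour concrete, here is a small sketch that applies it to a toy DataFrame. The class is repeated so the snippet runs on its own, and the frame `toy` and its values are made up for illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """Same selector as above, repeated so this snippet is self-contained."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


# Toy frame with made-up values; column names mirror the abalone data.
toy = pd.DataFrame({"sex": ["M", "F", "I"],
                    "length": [0.455, 0.530, 0.330],
                    "diam": [0.365, 0.420, 0.255]})

# A list key returns a 2D DataFrame, which is what downstream transformers expect.
numeric_part = ItemSelector(key=["length", "diam"]).fit_transform(toy)

# A bare string key returns a 1D Series instead.
sex_part = ItemSelector(key="sex").fit_transform(toy)

print(numeric_part.shape)  # (3, 2)
print(sex_part.shape)      # (3,)
```

This is why the pipelines in this notebook pass list keys such as `["sex_encoded"]`: scalers and encoders expect 2D input.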

Now, we are going to make the full pipeline, from start to finish, for the entire dataset:

In [8]:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

#encode the categorical column from strings to ints
le = LabelEncoder()
abalone_data["sex_encoded"] = abalone_data[[categorical_columns]].apply(le.fit_transform)

#extract the y
y = abalone_data.age

#create the feature union for the features
X_transformed_pipe = FeatureUnion(
        transformer_list=[
            # Pipeline for one hot encoding categorical column
            ('sexes', Pipeline([
                ('selector', ItemSelector(key=["sex_encoded"])),
                ('encoder', OneHotEncoder())                    
            ])),
            # Pipeline for pulling out numeric features and scaling them
            ('numeric', Pipeline([
                ('selector', ItemSelector(key=numeric_columns)),
                #('polyfeatures', PolynomialFeatures(degree=2,interaction_only=True)),
                ('scaler', StandardScaler()),
            ]))])
#create the full final pipeline
full_pipeline = Pipeline([("all_features",X_transformed_pipe),("rf_regressor",RandomForestRegressor(n_estimators=100))])

And now let's pass the whole pipeline to the `cross_val_score` function:

In [9]:
#pass the pipeline directly into cross_val_score
cross_val_scores = np.abs(cross_val_score(full_pipeline,abalone_data,y,cv=10,scoring="neg_mean_squared_error"))
rmse_cross_val_scores = np.sqrt(cross_val_scores)
print("Mean 10-fold rmse: ", np.mean(rmse_cross_val_scores))
print("Std 10-fold rmse: ", np.std(rmse_cross_val_scores))

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.

Mean 10-fold rmse:  2.1424927228661987
Std 10-fold rmse:  0.6232092271848529
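The message printed above says that `OneHotEncoder` can take the string categories directly, which makes the `LabelEncoder` step unnecessary. On scikit-learn 0.20 or later there is also `ColumnTransformer`, which selects columns by name and can stand in for the `ItemSelector` plus `FeatureUnion` combination. The sketch below is one possible way to write that, not a drop-in replacement for the cells above: it runs on a synthetic stand-in frame (`toy`, random values) rather than the real abalone data, so its scores are not comparable to the ones reported here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the abalone frame (random values, not the real data).
rng = np.random.RandomState(0)
n_rows = 120
toy = pd.DataFrame({
    "sex": rng.choice(["M", "F", "I"], size=n_rows),
    "length": rng.rand(n_rows),
    "diam": rng.rand(n_rows),
})
toy_y = 10 * toy["length"] + rng.rand(n_rows)

# Route columns by name: strings go straight into OneHotEncoder,
# numeric columns are standard scaled.
preprocess = ColumnTransformer([
    ("sexes", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ("numeric", StandardScaler(), ["length", "diam"]),
])

toy_pipeline = Pipeline([
    ("all_features", preprocess),
    ("rf_regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Same scoring recipe as above: negated MSE per fold, converted to RMSE.
neg_mse = cross_val_score(toy_pipeline, toy, toy_y, cv=5,
                          scoring="neg_mean_squared_error")
rmse = np.sqrt(-neg_mse)
print("Mean 5-fold rmse:", rmse.mean())
```

Because the encoder and scaler live inside the pipeline, they are refit on the training portion of every fold, so no information from the held-out fold leaks into the preprocessing.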


#### Exercise Time!

* Change the pipeline to perform PCA and keep only the first 6 components on the complete feature pipe (after standard scaling numeric features and encoding the categorical feature)

In [10]:
pass

Now let's do this on a slightly more involved example, where we will have to do some imputation (filling in of missing values).

Here the process will be as follows:

1. Encode categorical string columns as numbers using `LabelEncoder`
2. Impute missing categorical values (marked with 0 after encoding) with most frequent category using `Imputer`
3. One-hot encode the categorical columns using `OneHotEncoder`
4. Impute missing numerical values using the median value of each column using `Imputer`
5. Z-score/standardize each numeric column using `StandardScaler`
6. Combine both collections of columns (one-hot encoded categorical columns and standardized numeric columns) using `FeatureUnion`
7. Pass the whole collection to a `RandomForestClassifier` to build a Random Forest classification model.
8. Use `cross_val_score` with 10-fold cross-validation on the entire pipeline.

Ready? Let's start by loading in the data:

In [11]:
kidney_columns = ["age","bp","sg","al","su","rbc","pc","pcc","ba","bgr","bu","sc","sod","pot","hemo","pcv","wc","rc","htn","dm","cad","appet","pe","ane","class"]
kidney_data = pd.read_csv("data/chronic_kidney_disease.csv",
                          header=None,
                          na_values="?",
                          names=kidney_columns)
kidney_data.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,...,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35.0,7300.0,4.6,no,no,no,good,no,no,ckd


Let's rearrange the columns so that the numeric columns are together, followed by all of the categorical columns:

In [12]:
#rearrange kidney columns as before
kidney_columns = kidney_columns[:5]+kidney_columns[9:18]+kidney_columns[5:9]+kidney_columns[18:]
kidney_data = kidney_data[kidney_columns]
kidney_data.head()

Unnamed: 0,age,bp,sg,al,su,bgr,bu,sc,sod,pot,...,pc,pcc,ba,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,121.0,36.0,1.2,,,...,normal,notpresent,notpresent,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,18.0,0.8,,,...,normal,notpresent,notpresent,no,no,no,good,no,no,ckd
2,62.0,80.0,1.01,2.0,3.0,423.0,53.0,1.8,,,...,normal,notpresent,notpresent,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,abnormal,present,notpresent,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.01,2.0,0.0,106.0,26.0,1.4,,,...,normal,notpresent,notpresent,no,no,no,good,no,no,ckd


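Let's also clean up the null values in the `rbc` and `pc` columns by filling them with an explicit `"unknown"` category. First, a look at the categorical columns: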

In [13]:
kidney_data[kidney_columns[14:]]

Unnamed: 0,rbc,pc,pcc,ba,htn,dm,cad,appet,pe,ane,class
0,,normal,notpresent,notpresent,yes,yes,no,good,no,no,ckd
1,,normal,notpresent,notpresent,no,no,no,good,no,no,ckd
2,normal,normal,notpresent,notpresent,no,yes,no,poor,no,yes,ckd
3,normal,abnormal,present,notpresent,yes,no,no,poor,yes,yes,ckd
4,normal,normal,notpresent,notpresent,no,no,no,good,no,no,ckd
5,,,notpresent,notpresent,yes,yes,no,good,yes,no,ckd
6,,normal,notpresent,notpresent,no,no,no,good,no,no,ckd
7,normal,abnormal,notpresent,notpresent,no,yes,no,good,yes,no,ckd
8,normal,abnormal,present,notpresent,yes,yes,no,good,no,yes,ckd
9,abnormal,abnormal,present,notpresent,yes,yes,no,poor,no,yes,ckd


In [14]:
#fill both columns in one assignment instead of calling fillna(inplace=True) on a column selection
kidney_data[['rbc','pc']] = kidney_data[['rbc','pc']].fillna("unknown")


Let's encode the strings as numbers:

In [15]:
#convert strings to numbers
le = LabelEncoder()
kidney_data[kidney_columns[14:]] = kidney_data[kidney_columns[14:]].astype(str).apply(le.fit_transform)
#get the X and y
X = kidney_data[kidney_columns[:-1]]
y = kidney_data["class"]
kidney_data.head()

Unnamed: 0,age,bp,sg,al,su,bgr,bu,sc,sod,pot,...,pc,pcc,ba,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,121.0,36.0,1.2,,,...,1,1,1,2,3,1,0,1,1,0
1,7.0,50.0,1.02,4.0,0.0,,18.0,0.8,,,...,1,1,1,1,2,1,0,1,1,0
2,62.0,80.0,1.01,2.0,3.0,423.0,53.0,1.8,,,...,1,1,1,1,3,1,2,1,2,0
3,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,0,2,1,2,2,1,2,2,2,0
4,51.0,80.0,1.01,2.0,0.0,106.0,26.0,1.4,,,...,1,1,1,1,2,1,0,1,1,0


And here is the code for the entire pipeline:

In [16]:
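#note: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22;
#on newer versions use sklearn.impute.SimpleImputer (which has no axis argument)
from sklearn.preprocessing import Imputer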

X_transformed_pipe = FeatureUnion(
        transformer_list=[
            # Pipeline for filling in missing values, one hot encoding all categorical columns
            ('categoricals', Pipeline([
                ('selector', ItemSelector(key=kidney_columns[14:-1])),
                ('imputer', Imputer(missing_values=0,strategy="most_frequent",axis=0)),
                ('encoder', OneHotEncoder())                    
            ])),
            # Pipeline for pulling out numeric features, filling in missing values, and scaling them
            ('numeric', Pipeline([
                ('selector', ItemSelector(key=kidney_columns[:14])),
                ('imputer', Imputer(strategy="median",axis=0)),
                ('scaler', StandardScaler()),
            ]))])

full_pipeline = Pipeline([("all_features",X_transformed_pipe),("rf_classifier",RandomForestClassifier())])



In [17]:
cross_val_score(full_pipeline,X,y,cv=10)

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.

array([1.   , 0.975, 1.   , 0.95 , 0.95 , 1.   , 0.975, 0.975, 1.   ,
       1.   ])
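On scikit-learn 0.22 and later, `Imputer` no longer exists and `SimpleImputer` from `sklearn.impute` takes its place. With `strategy="most_frequent"` it can fill missing string values directly, so the "label encode first, then treat 0 as missing" step is not needed. The following is a sketch under those assumptions, using a small made-up frame (`toy`) that borrows a few column names from the kidney data; the values and the target are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame with missing values in a numeric and a categorical column.
rng = np.random.RandomState(0)
n_rows = 100
toy = pd.DataFrame({
    "age": rng.randint(20, 80, size=n_rows).astype(float),
    "bp": rng.choice([60.0, 70.0, 80.0, 90.0], size=n_rows),
    "rbc": rng.choice(["normal", "abnormal"], size=n_rows),
    "htn": rng.choice(["yes", "no"], size=n_rows),
})
toy_y = (toy["bp"] >= 80).astype(int)  # invented target, illustration only
toy.loc[::7, "age"] = np.nan
toy.loc[::9, "rbc"] = np.nan

# Categorical branch: fill NaN with the most frequent category, then one-hot encode.
categorical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric branch: fill NaN with the column median, then standard scale.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("categoricals", categorical_steps, ["rbc", "htn"]),
    ("numeric", numeric_steps, ["age", "bp"]),
])
toy_pipeline = Pipeline([
    ("all_features", preprocess),
    ("rf_classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])

scores = cross_val_score(toy_pipeline, toy, toy_y, cv=5)
print(scores)
```

The step names (`all_features`, `rf_classifier`) are kept the same as in the notebook's pipeline so that the parameter names used later for grid search would carry over.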

Each pipeline object contains a sequence of steps, which are stored in a list. Each step is a tuple, where the first element is the name you gave the given step, and the second element is the transformation or model you are applying at that step:

In [18]:
full_pipeline.steps

[('all_features', FeatureUnion(n_jobs=None,
         transformer_list=[('categoricals', Pipeline(memory=None,
       steps=[('selector', ItemSelector(key=['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane'])), ('imputer', Imputer(axis=0, copy=True, missing_values=0, strategy='most_frequent',
      verbose=0)), ('encoder', OneHotEncoder(cat...tegy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))]))],
         transformer_weights=None)),
 ('rf_classifier',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
              max_depth=None, max_features='auto', max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
              oob_score=False, random_state=None, verbose=0,
              warm_start=False))]

Let's take a look at a few steps:

In [19]:
print("The first step in the pipeline:\n",full_pipeline.steps[0])
print()
print("The second step in the pipeline: \n", full_pipeline.steps[1])
print()
print("The second step's transformation/model:\n", full_pipeline.steps[1][1])
#print("Since we know this is a random forest model, lets try to get the models feature importances:\n",full_pipeline.steps[1][1].feature_importances_)

The first step in the pipeline:
 ('all_features', FeatureUnion(n_jobs=None,
       transformer_list=[('categoricals', Pipeline(memory=None,
     steps=[('selector', ItemSelector(key=['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane'])), ('imputer', Imputer(axis=0, copy=True, missing_values=0, strategy='most_frequent',
    verbose=0)), ('encoder', OneHotEncoder(cat...tegy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))]))],
       transformer_weights=None))

The second step in the pipeline: 
 ('rf_classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=Fal

Remember, to get the feature importances or coefficients of a model, it needs to be trained first. Just like any other estimator in sklearn, a pipeline is fit by calling its `fit` method:

In [20]:
full_pipeline.fit(X,y)

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


Pipeline(memory=None,
     steps=[('all_features', FeatureUnion(n_jobs=None,
       transformer_list=[('categoricals', Pipeline(memory=None,
     steps=[('selector', ItemSelector(key=['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane'])), ('imputer', Imputer(axis=0, copy=True, missing_values=0, strategy='most_...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

Now that it's been fit, we can extract the feature importances as we wanted:

In [21]:
full_pipeline.steps[1][1].feature_importances_.round(3)

array([0.038, 0.008, 0.   , 0.   , 0.001, 0.002, 0.   , 0.   , 0.088,
       0.005, 0.   , 0.005, 0.014, 0.   , 0.   , 0.   , 0.   , 0.008,
       0.   , 0.   , 0.   , 0.015, 0.001, 0.027, 0.068, 0.004, 0.026,
       0.041, 0.104, 0.049, 0.001, 0.334, 0.115, 0.008, 0.038])
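Indexing with `steps[1][1]` depends on the position of the model in the pipeline. A fitted step can also be looked up by the name it was given through the pipeline's `named_steps` attribute. A minimal sketch on a tiny synthetic problem (the data and the two-step `toy_pipeline` are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic problem (random values), just to have something to fit.
rng = np.random.RandomState(0)
toy_X = rng.rand(60, 4)
toy_y = (toy_X[:, 0] > 0.5).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
toy_pipeline.fit(toy_X, toy_y)

# Look the fitted model up by the name given when the pipeline was built.
fitted_rf = toy_pipeline.named_steps["rf_classifier"]

# It is the same object that positional indexing returns.
print(fitted_rf is toy_pipeline.steps[1][1])
print(fitted_rf.feature_importances_.round(3))
```

Looking steps up by name keeps working if another step is later inserted before the model.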

In [22]:
clf = RandomForestClassifier()
clf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

What's really great about pipelines is that you can pass them to `GridSearchCV` and search across parameters to tune your models. To do so, add an entry to `param_grid` whose key is the name of the step (which you chose earlier), followed by a double underscore and the name of the parameter you want to test, and whose value is the list of settings to try:

In [23]:
# using GridSearchCV with Pipeline
from sklearn.model_selection import GridSearchCV
estimators_range = [20,50,100]
param_grid = dict(rf_classifier__n_estimators=estimators_range)
#n_jobs=-1 raised "BrokenProcessPool: A task has failed to un-serialize" in this notebook,
#most likely because the worker processes could not unpickle the notebook-defined ItemSelector.
#Run serially here, or move ItemSelector into an importable module to search in parallel.
grid = GridSearchCV(full_pipeline, param_grid, cv=20, scoring='accuracy',n_jobs=1)
grid.fit(X, y)
print("Best cross-validated accuracy: ",grid.best_score_)
print("Best parameter found: ",grid.best_params_)
print("Fitted_model: ",grid.best_estimator_.steps[1][1])
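To find out which keys are valid in `param_grid`, a pipeline's `get_params()` method lists every settable parameter using the step name, a double underscore, then the parameter name. The sketch below shows this on a small synthetic problem and then runs a serial grid search over parameters of two different steps. The data and the two-step `toy_pipeline` are assumptions made for illustration, not the kidney pipeline itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic classification problem (random values).
rng = np.random.RandomState(0)
toy_X = rng.rand(80, 4)
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 1.0).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier(random_state=0)),
])

# Every tunable name follows the <step name>__<parameter name> pattern.
tunable = [name for name in toy_pipeline.get_params() if "__" in name]
print(tunable[:5])

# Parameters from different steps can be searched together.
param_grid = {
    "rf_classifier__n_estimators": [10, 20],
    "scaler__with_mean": [True, False],
}
grid = GridSearchCV(toy_pipeline, param_grid, cv=4, scoring="accuracy")
grid.fit(toy_X, toy_y)
print(grid.best_params_)
```

For nested pipelines the names chain in the same way, one double underscore per level of nesting.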

#### Exercise Time!

* Add a PCA transformation step before training the classifier
* Search over the number of PCA components to keep using `GridSearchCV` (test whether to keep the first 5, the first 10, or all components)

In [None]:
pass