## Building predictive models with sklearn pipelines

**Objective of the competition**:

The competition dataset contains text from public-domain works of fiction by three spooky authors:
- Edgar Allan Poe (EAP)
- HP Lovecraft (HPL)
- Mary Wollstonecraft Shelley (MWS)

The objective is to accurately identify the author of the sentences in the test set.

**Objective of the notebook:**

The goal of this notebook is to break the pipeline process down into its parts, to make it easier to see how they all fit together.

General description and data are available on [Kaggle](https://www.kaggle.com/c/spooky-author-identification).

1. [Load dataset](#1)

2. [Preprocessing and Feature Engineering](#2)

3. [Creating a Pipeline](#3)

4. [Cross Validation To Find The Best Pipeline](#4)

5. [Final Predictions](#5)

## <a name="1">Load dataset</a>

In [1]:
import numpy as np
import pandas as pd

df=pd.read_csv("data/train.csv")

# dropna returns a new frame, so the result must be assigned back
df=df.dropna(axis=0)
df.set_index("id",inplace=True)

df.head()

| id | text | author |
|---|---|---|
| id26305 | This process, however, afforded me no means of... | EAP |
| id17569 | It never once occurred to me that the fumbling... | HPL |
| id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| id27763 | How lovely is spring As we looked from Windsor... | MWS |
| id12958 | Finding nothing else, not even gold, the Super... | HPL |


## <a name="2">Preprocessing and Feature Engineering</a>

Let's do some basic feature engineering. To make it easier to replicate on the submission data, we will encapsulate the logic into a function.

Note that all of this preprocessing is stateless: each feature is computed from a single row and does not depend on any other row, so it's OK to apply it to the whole dataset now. Things like count vectorization and numeric scaling learn parameters (a vocabulary, a mean and standard deviation) from the data they are fit on, so they must be fit on the training data only. We will get to that later.
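The distinction can be sketched on toy numbers. The arrays and the sentence below are made up for illustration and are not part of the competition data. A row-wise feature gives the same answer whichever other rows are present, whereas a scaler learns its mean and standard deviation from whatever it is fit on:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data, made up for illustration
train=np.array([[1.0],[2.0],[3.0]])
test=np.array([[10.0]])

# stateless feature: depends only on the row itself
sentence="It was a dark, stormy night, they say"
n_commas=sentence.count(",")

# fitted transform: the mean and std are learned from the training rows only,
# then reused unchanged on the test rows
scaler=StandardScaler().fit(train)
test_scaled=scaler.transform(test)

print(n_commas)
print(scaler.mean_,test_scaled)
```

Fitting the scaler on the test rows as well would let information about the test set leak into the features.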

For now, we will compute, for each row, the number of characters, the number of words, the number of non-stop words, the average length of the non-stop words, and the number of commas.

In [3]:
import re
from nltk.corpus import stopwords

# requires the NLTK stop word corpus: nltk.download("stopwords")
# named stop_words so that it does not shadow the imported stopwords module
stop_words=set(stopwords.words("english"))

def avg_non_stopword_length(x):
    # average length of the non-stop words, or 0 if there are none
    lengths=[len(t) for t in x.split(" ") if t not in stop_words]
    return np.mean(lengths) if lengths else 0

def processing(df):
    # lowercase and remove punctuation
    df["processed"]=df["text"].apply(lambda x: re.sub(r"[^\w\s]","",x.lower()))

    # numerical feature engineering
    # total number of characters in the processed sentence
    df["length"]=df["processed"].apply(len)
    # number of words
    df["words"]=df["processed"].apply(lambda x: len(x.split(" ")))
    # number of words that are not stop words
    df["words_not_stopword"]=df["processed"].apply(lambda x: len([t for t in x.split(" ") if t not in stop_words]))
    df["avg_word_length"]=df["processed"].apply(avg_non_stopword_length)
    # number of commas in the original (unprocessed) text
    df["commas"]=df["text"].apply(lambda x: x.count(","))

    return df

df=processing(df)

df.head()

| id | text | author | processed | length | words | words_not_stopword | avg_word_length | commas |
|---|---|---|---|---|---|---|---|---|
| id26305 | This process, however, afforded me no means of... | EAP | this process however afforded me no means of a... | 224 | 41 | 21 | 6.380952 | 4 |
| id17569 | It never once occurred to me that the fumbling... | HPL | it never once occurred to me that the fumbling... | 70 | 14 | 6 | 6.166667 | 0 |
| id11008 | In his left hand was a gold snuff box, from wh... | EAP | in his left hand was a gold snuff box from whi... | 195 | 36 | 19 | 5.947368 | 4 |
| id27763 | How lovely is spring As we looked from Windsor... | MWS | how lovely is spring as we looked from windsor... | 202 | 34 | 21 | 6.47619 | 3 |
| id12958 | Finding nothing else, not even gold, the Super... | HPL | finding nothing else not even gold the superin... | 170 | 27 | 16 | 7.1875 | 2 |


## <a name="3">Creating a Pipeline</a>

Sklearn's pipeline functionality makes it easier to repeat commonly occurring steps in our modeling process. Like the processing function, it packages code so that it can be fit to the training data and then applied to the test data without having to copy and paste everything.

So let's build the pipelines up from the bottom. Since pipelines can be made from other pipelines, it's useful to see how they build on each other.

The first step is to split the data into training and testing sets.

In [4]:
from sklearn.model_selection import train_test_split

features=[c for c in df.columns.values if c not in ["id","text","author"]]
numeric_features=[c for c in df.columns.values if c not in ["id","text","author","processed"]]
target="author"

X_train,X_test,y_train,y_test=train_test_split(df[features],df[target],test_size=1/3,random_state=42)
X_train.head()

| id | processed | length | words | words_not_stopword | avg_word_length | commas |
|---|---|---|---|---|---|---|
| id05314 | it was a passion that had grown with his growt... | 139 | 27 | 11 | 6.0 | 0 |
| id10083 | if this rule were always observed if no man al... | 310 | 51 | 25 | 7.16 | 4 |
| id05326 | a poet having very unusual pecuniary resources... | 263 | 45 | 22 | 7.090909 | 8 |
| id02106 | and because mere walls and windows must soon d... | 259 | 49 | 28 | 5.285714 | 1 |
| id03741 | the news spread through athens and the whole c... | 189 | 34 | 17 | 5.823529 | 5 |


Standard preprocessing applies the same transformation to the whole dataset, but that doesn't quite work when the data is heterogeneous, as here with one text column and several numeric columns. So the first thing to do is create a selector transformer that simply returns the one column in the dataset named by the key I pass.

I was having difficulty getting a single selector to play nicely, so I made two different selectors, one for text columns and one for numeric columns. The only difference is the return type: the text selector returns a one-dimensional Series, which is what TfidfVectorizer expects, while the numeric selector returns a one-column DataFrame, since scalers expect two-dimensional input.

In [5]:
from sklearn.base import BaseEstimator,TransformerMixin

class TextSelector(BaseEstimator,TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self,key):
        self.key=key
    
    def fit(self,X,y=None):
        return self
    
    def transform(self,X):
        return X[self.key]

class NumberSelector(BaseEstimator,TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self,key):
        self.key=key
    
    def fit(self,X,y=None):
        return self
    
    def transform(self,X):
        return X[[self.key]]
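
As a quick check of the two return types, here is a minimal sketch on a made-up two-row frame. The selectors are redefined without docstrings so the snippet runs on its own, and the `toy` frame is purely illustrative:

```python
import pandas as pd
from sklearn.base import BaseEstimator,TransformerMixin

class TextSelector(BaseEstimator,TransformerMixin):
    def __init__(self,key):
        self.key=key
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        return X[self.key]      # single brackets: 1-D Series

class NumberSelector(BaseEstimator,TransformerMixin):
    def __init__(self,key):
        self.key=key
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        return X[[self.key]]    # double brackets: one-column DataFrame

# toy frame, made up for illustration
toy=pd.DataFrame({"processed":["the raven","the old ones"],"length":[9,12]})

text_col=TextSelector(key="processed").fit_transform(toy)
num_col=NumberSelector(key="length").fit_transform(toy)

print(type(text_col).__name__,text_col.shape)
print(type(num_col).__name__,num_col.shape)
```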

To see how this is used, let's actually run it on one column.

I'm going to call it on the text column and transform it with another step. But again, pipelines are all about encapsulating several steps, so I'm going to make a mini pipeline that consists of two steps: first grab just that column from the dataset, then perform tf-idf on just that column and return the results.

To make a pipeline, just pass a list of tuples of the format (name, object). The first part is the name of the step, and the second is the actual transformer object. So this pipeline consists of "selecting" and then "tfidf-ing" a column.

To execute, use it just like any other transformer. Call text.fit() to fit to training data, text.transform() to apply it to training data, or text.fit_transform() to do both.

Since it's text, it will return a sparse matrix, but we can see that it works:

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

text=Pipeline([
    ("selector",TextSelector(key="processed")),
    ("tfidf",TfidfVectorizer(stop_words="english"))
])

text.fit_transform(X_train)

<13052x21455 sparse matrix of type '<class 'numpy.float64'>'
	with 147424 stored elements in Compressed Sparse Row format>
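
The fit / transform / fit_transform behaviour can be sketched on a few made-up sentences. The toy corpus and the `MaxAbsScaler` second step are assumptions chosen only to keep the example small; they are not part of the notebook's pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler

# toy corpus, made up for illustration
toy_train=["the raven croaked","the old ones wait","the raven waits"]
toy_test=["the raven and the monster"]

toy_text=Pipeline([
    ("tfidf",TfidfVectorizer()),
    ("scale",MaxAbsScaler())    # works on sparse input
])

# fit_transform is equivalent to fit followed by transform on the same data
a=toy_text.fit_transform(toy_train)
b=toy_text.fit(toy_train).transform(toy_train)

# new data is encoded with the vocabulary learned from the training data;
# unseen words ("and", "monster") are simply ignored
new=toy_text.transform(toy_test)

print(a.shape,new.shape)
print(sorted(toy_text.named_steps["tfidf"].vocabulary_))
```

The number of columns is fixed by the training vocabulary, which is why the test matrix has the same width as the training matrix.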

Since our data is heterogeneous, we might want to do something else on numeric data, so let's build a mini pipeline for that too.

This transformer will be a simple scaler. Since our data is mixed, we must apply it column by column. Let's make one to process the "length" variable I made above. Just like the text one, we combine two steps, first selecting the column, then transforming the column, like so:

In [8]:
from sklearn.preprocessing import StandardScaler

length=Pipeline([
    ("selector",NumberSelector(key="length")),
    ("standard",StandardScaler())
])

length.fit_transform(X_train)

array([[-0.06153889],
       [ 1.52952007],
       [ 1.09221147],
       ...,
       [-0.46162975],
       [-0.14527884],
       [-0.39649868]])
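
What the scaler computes can be checked by hand: each value has the column mean subtracted and is divided by the column's (population) standard deviation. The sketch below uses only the five "length" values visible in `X_train.head()`, so its numbers will differ from the full-data output above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# the five "length" values shown in X_train.head()
lengths=np.array([[139.0],[310.0],[263.0],[259.0],[189.0]])

scaled=StandardScaler().fit_transform(lengths)

# same computation by hand: (x - mean) / std, with numpy's default ddof=0
manual=(lengths-lengths.mean(axis=0))/lengths.std(axis=0)

print(np.allclose(scaled,manual))
```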

We can see that the transformer pipeline returns a matrix for the column it's called on, so now all that's left to do is join the results from several transformed variables into a single dataset.

First, let's transform all the numeric columns with the standard scaler:

In [9]:
words=Pipeline([
    ("selector",NumberSelector(key="words")),
    ("standard",StandardScaler())
])
words_not_stopword=Pipeline([
    ("selector",NumberSelector(key="words_not_stopword")),
    ("standard",StandardScaler())
])
avg_word_length=Pipeline([
    ("selector",NumberSelector(key="avg_word_length")),
    ("standard",StandardScaler())
])
commas=Pipeline([
    ("selector",NumberSelector(key="commas")),
    ("standard",StandardScaler())
])

To make a pipeline from all of our pipelines, we do the same thing, but now we use a FeatureUnion to join the feature processing pipelines.

The syntax is the same as for a regular pipeline: a list of tuples in the (name, object) format.

The feature union is itself a transformer, so it can be fit and applied directly. Here we also wrap it in a pipeline, again as a list of (name, object) tuples, to show that a union nests inside a pipeline just like any other step.

We can then apply all those transformations at once with a single fit, transform, or fit_transform call.

In [10]:
from sklearn.pipeline import FeatureUnion

feats=FeatureUnion([
    ("text",text),
    ("length",length),
    ("words",words),
    ("words_not_stopword",words_not_stopword),
    ("avg_word_length",avg_word_length),
    ("commas",commas)])

feature_processing=Pipeline([("feats",feats)])
feature_processing.fit_transform(X_train)

<13052x21460 sparse matrix of type '<class 'numpy.float64'>'
	with 212684 stored elements in Compressed Sparse Row format>
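
The output above is five columns wider than the tf-idf matrix alone (21460 versus 21455): the union places the results of its branches side by side. A minimal sketch of that column stacking, on a made-up three-row frame with compact copies of the selectors and only two numeric branches:

```python
import pandas as pd
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.pipeline import Pipeline,FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

class TextSelector(BaseEstimator,TransformerMixin):
    def __init__(self,key):
        self.key=key
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        return X[self.key]

class NumberSelector(BaseEstimator,TransformerMixin):
    def __init__(self,key):
        self.key=key
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        return X[[self.key]]

# toy frame, made up for illustration
toy=pd.DataFrame({
    "processed":["the raven croaked","the old ones wait","the raven waits"],
    "length":[17.0,17.0,15.0],
    "commas":[0.0,1.0,2.0],
})

toy_feats=FeatureUnion([
    ("text",Pipeline([("selector",TextSelector("processed")),("tfidf",TfidfVectorizer())])),
    ("length",Pipeline([("selector",NumberSelector("length")),("standard",StandardScaler())])),
    ("commas",Pipeline([("selector",NumberSelector("commas")),("standard",StandardScaler())])),
])

out=toy_feats.fit_transform(toy)
n_text=len(toy_feats.transformer_list[0][1].named_steps["tfidf"].vocabulary_)

# total width = tf-idf vocabulary size + one column per numeric branch
print(out.shape,n_text)
```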

To generate predictions as well, we can add a model as the final step of the pipeline. The syntax is, you guessed it, a list of tuples, combining the transformations with a model.

In [11]:
from sklearn.ensemble import RandomForestClassifier

pipeline=Pipeline([
    ("features",feats),
    ("classifier",RandomForestClassifier(random_state=42))
])

pipeline.fit(X_train,y_train)

y_pred=pipeline.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred)

0.675808181400337

## <a name="4">Cross Validation To Find The Best Pipeline</a>

What if I wanted to do cross validation on my pipeline? How many trees should I use on my classifier? How deep should I go? Or even more complicated, how many words should I use in my tf-idf transform? Should I include stop words? Pipelines allow us to do that with just a few more lines.

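Cross-validated grid search is all about figuring out which hyperparameter values work best: each candidate setting is scored on held-out folds of the training data, and the best-scoring one wins. To see which parameters can be tuned, list the pipeline's parameter names: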

In [12]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'features', 'classifier', 'features__n_jobs', 'features__transformer_list', 'features__transformer_weights', 'features__verbose', 'features__text', 'features__length', 'features__words', 'features__words_not_stopword', 'features__avg_word_length', 'features__commas', 'features__text__memory', 'features__text__steps', 'features__text__verbose', 'features__text__selector', 'features__text__tfidf', 'features__text__selector__key', 'features__text__tfidf__analyzer', 'features__text__tfidf__binary', 'features__text__tfidf__decode_error', 'features__text__tfidf__dtype', 'features__text__tfidf__encoding', 'features__text__tfidf__input', 'features__text__tfidf__lowercase', 'features__text__tfidf__max_df', 'features__text__tfidf__max_features', 'features__text__tfidf__min_df', 'features__text__tfidf__ngram_range', 'features__text__tfidf__norm', 'features__text__tfidf__preprocessor', 'features__text__tfidf__smooth_idf', 'features__text__tfidf__stop_words'
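
The names follow a simple pattern: step names joined to the parameter name with double underscores, one level per nesting. A minimal sketch on a flattened two-step pipeline (`toy_pipeline` is a made-up stand-in, so its names have one level less than `features__text__tfidf__...`):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

toy_pipeline=Pipeline([
    ("tfidf",TfidfVectorizer()),
    ("classifier",RandomForestClassifier(random_state=42))
])

# <step name>__<parameter name> reaches into the step and sets its parameter
toy_pipeline.set_params(tfidf__ngram_range=(1,2),classifier__max_depth=50)

print(toy_pipeline.named_steps["tfidf"].ngram_range)
print(toy_pipeline.get_params()["classifier__max_depth"])
```

These are the same keys that a grid search expects in its hyperparameter dictionary.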

In [None]:
from sklearn.model_selection import GridSearchCV

hyperparameters={"features__text__tfidf__max_df":[0.9,0.95],
                "features__text__tfidf__ngram_range":[(1,1),(1,2)],
                "classifier__max_depth":[50,70],
                "classifier__min_samples_leaf":[1,2]}

clf=GridSearchCV(pipeline,hyperparameters,cv=5)

# Fit and tune model
clf.fit(X_train,y_train)
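
Before launching a search like this, it helps to know how much work it implies. The sketch below counts the candidates in the same grid with `ParameterGrid`, and multiplies by the 5 folds; nothing is fit here:

```python
from sklearn.model_selection import ParameterGrid

hyperparameters={"features__text__tfidf__max_df":[0.9,0.95],
                "features__text__tfidf__ngram_range":[(1,1),(1,2)],
                "classifier__max_depth":[50,70],
                "classifier__min_samples_leaf":[1,2]}

# every combination of the listed values is one candidate
grid=ParameterGrid(hyperparameters)
n_candidates=len(grid)

# each candidate is fit once per fold (cv=5)
n_fits=n_candidates*5

print(n_candidates,n_fits)
```

With the default `refit=True` there is one further fit of the winning candidate on the full training data.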

In [14]:
clf.best_params_

{'classifier__max_depth': 70,
 'classifier__min_samples_leaf': 2,
 'features__text__tfidf__max_df': 0.9,
 'features__text__tfidf__ngram_range': (1, 1)}

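With the default refit=True, GridSearchCV automatically refits the pipeline on all of the training data with the best_params_ setting applied once the search finishes, so there is nothing more to call: clf can be used directly for predictions.

In [15]:
# clf was already refit on the entire training data using the best settings (refit=True)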

preds=clf.predict(X_test)
probs=clf.predict_proba(X_test)

np.mean(preds==y_test)

0.6633981921250192
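
What refit means on a GridSearchCV can be seen in a small sketch. The synthetic dataset and the decision tree below are stand-ins chosen to run quickly; they are not the competition data or the notebook's pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data
X,y=make_classification(n_samples=60,n_features=5,random_state=0)

search=GridSearchCV(DecisionTreeClassifier(random_state=0),{"max_depth":[1,3]},cv=3)
search.fit(X,y)

# refit is a constructor flag (True by default), not a method to call
print(search.refit)

# after fit, the best configuration has already been retrained on all of X, y
best=search.best_estimator_
print(best.get_params()["max_depth"]==search.best_params_["max_depth"])

# predict on the search object delegates to best_estimator_
preds_toy=search.predict(X)
```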

## <a name="5">Final Predictions</a>

To generate submission results, we just need to run the same preprocessing on the submission data, then call predict_proba on the fitted search object, since we want the probability of each class, not just the predicted label.

The only tricky part for the submission is that we need the class names as the column names.

In [23]:
submission=pd.read_csv("data/test.csv")

# preprocessing

submission=processing(submission)
predictions=clf.predict_proba(submission)

preds=pd.DataFrame(data=predictions,columns=clf.best_estimator_.named_steps["classifier"].classes_)

# generating a submission file
result=pd.concat([submission[["id"]],preds],axis=1)
result.set_index("id",inplace=True)
result.head()

# result.to_csv("output/submission.csv")