## Using a scikit-learn pipeline

In [19]:
## Import all relevant libraries to analyse the data
import pandas as pd
import numpy as np
import datetime
import math
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import nltk
import sklearn.pipeline  # load the submodule explicitly; it is used later as sklearn.pipeline.Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from nltk.tokenize import RegexpTokenizer
from sklearn.preprocessing import FunctionTransformer, LabelBinarizer
from sklearn_pandas import DataFrameMapper
tokenizer = RegexpTokenizer(r'\w+')

from nltk.probability import FreqDist

## This line makes sure that our graphs are rendered within the notebook
%matplotlib inline

The point of a pipeline is to bring all the pieces of the model building together in one place so that it can be taken into further analysis or plugged into another workflow or app. A data engineer can then take it and use a wrapper to plug all these components into another process.

The first step is to bring in all the functions created in the data preparation part and then put them together in one single process: a pipeline that calls them all. The following cells take you through this procedure.

#### Data Preparation

In [2]:
# Negative and positive rating flags
def get_rating_flags(score):
    def negative_rating(score):
        if score < 3:
            return '1'
        return '0'
    def positive_rating(score):
        if score > 3:
            return '1'
        return '0'
    negative = negative_rating(score)
    positive = positive_rating(score)
    return [negative, positive]
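
As a quick sanity check, the sketch below restates the same thresholds in a compact form and prints the flags for every score from 1 to 5. The name `rating_flags` is hypothetical and used only for this illustration; it is not part of the notebook.

```python
# Compact restatement of the thresholds used by get_rating_flags above.
# `rating_flags` is a hypothetical helper name, introduced only for this sketch.
def rating_flags(score):
    negative = '1' if score < 3 else '0'   # 1 or 2 stars
    positive = '1' if score > 3 else '0'   # 4 or 5 stars
    return [negative, positive]

# Print the [negative, positive] pair for each possible score
for score in range(1, 6):
    print(score, rating_flags(score))
```

Note that a score of exactly 3 is treated as neutral, so both flags are '0' for it.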

In [3]:
# getting word counts
def get_word_counts(text):
    tokenized_word=tokenizer.tokenize(text)
    fdist = FreqDist(tokenized_word)
    return [fdist.B(), fdist.N()]
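
To make the two returned numbers concrete, here is a minimal sketch that computes the same quantities with the standard library only: `fdist.B()` is the number of distinct tokens and `fdist.N()` is the total number of tokens. The function name `word_counts_stdlib` and the sample sentence are made up for this illustration, and `re.findall` with `collections.Counter` is a stand-in for the NLTK tokenizer and `FreqDist`, not the notebook's actual code.

```python
import re
from collections import Counter

# Standard-library stand-in for RegexpTokenizer(r'\w+') plus FreqDist.
# `word_counts_stdlib` is a hypothetical name used only in this sketch.
def word_counts_stdlib(text):
    tokens = re.findall(r'\w+', text)   # same \w+ pattern as the notebook's tokenizer
    counts = Counter(tokens)
    # [number of distinct tokens, total number of tokens]
    return [len(counts), sum(counts.values())]

print(word_counts_stdlib("good coffee, good price, would buy again"))  # -> [6, 7]
```

Tokens are not lowercased here, so "Good" and "good" would count as two distinct words.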

In [4]:
# Winter and summer flags
def get_season_flags(month):
    def winter_flag(month):
        if month in [10, 11, 12, 1, 2]:
            return '1'
        return '0'
    def summer_flag(month):
        if month in [7, 8]:
            return '1'
        return '0'
    winter = winter_flag(month)
    summer = summer_flag(month)
    return [winter, summer]

In [7]:
# Main function in charge of calling the functions created above and other data changes
def new_features(data):
    new_df = data.copy()
    #Score variables
    rating_flags = new_df.apply(
        lambda x: pd.Series(
            get_rating_flags(x.Score), 
            index = ["negative_rating_flag", "positive_rating_flag"]
        ),
        axis=1
    )
    new_df = pd.concat([new_df[:], rating_flags[:]], axis=1)
    
    #Text variables
    text_attributes = new_df.apply(
        lambda x: pd.Series(
            get_word_counts(x.Text), 
            index = ["n_distinct_words", "n_words"]
        ),
        axis=1
    )
    new_df = pd.concat([new_df[:], text_attributes[:]], axis=1)
    
    #Time variables
    new_df['Time'] = pd.to_datetime(new_df['Time'], unit='s')
    new_df['day_of_week'] = new_df['Time'].dt.weekday
    new_df['month'] = new_df['Time'].dt.month

    season_flags = new_df.apply(
        lambda x: pd.Series(
            get_season_flags(x.month), 
            index = ["winter_flag", "summer_flag"]
        ),
        axis=1
    )
    new_df = pd.concat([new_df[:], season_flags[:]], axis=1)
    
    #Product and Reviewer variables
    new_df['product_freq'] = new_df.groupby('ProductId')['ProductId'].transform('count')
    new_df['reviewer_freq'] = new_df.groupby('UserId')['UserId'].transform('count')
    
    model_df = new_df[['Score'
                        ,'negative_rating_flag'
                        ,'positive_rating_flag'
                        ,'day_of_week'
                        ,'month'
                        ,'winter_flag'
                        ,'summer_flag'
                        ,'n_words'
                        ,'product_freq'
                        ,'reviewer_freq']]

    return model_df
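
The row-wise `apply` calls above build a new `pd.Series` for every row, which can get slow on a large review file. As a possible alternative, the sketch below derives the same kind of flags with vectorized column operations. It is only an illustration under stated assumptions: the `toy` frame and its values are made up, and only the column names `Score` and `Time` are taken from the notebook.

```python
import pandas as pd

# Tiny made-up frame; only the column names match the notebook's data.
toy = pd.DataFrame({
    "Score": [1, 3, 5],
    "Time": [1354320000, 1303862400, 1219017600],  # Unix seconds: Dec 2012, Apr 2011, Aug 2008
})

# Mapping used to keep the flags as '1'/'0' strings, as the row-wise functions do
as_flag = {True: "1", False: "0"}

# Rating flags computed on the whole column at once
toy["negative_rating_flag"] = (toy["Score"] < 3).map(as_flag)
toy["positive_rating_flag"] = (toy["Score"] > 3).map(as_flag)

# Time features and season flags
toy["Time"] = pd.to_datetime(toy["Time"], unit="s")
toy["month"] = toy["Time"].dt.month
toy["winter_flag"] = toy["month"].isin([10, 11, 12, 1, 2]).map(as_flag)
toy["summer_flag"] = toy["month"].isin([7, 8]).map(as_flag)

print(toy)
```

The word counts still need a per-row function, since each review text has to be tokenized individually.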
       

#### Calling all preparations and models with a scikit learn pipeline

Once all the functions are in place, the following code ties everything together:
1. Reads the raw data
1. Defines the data set, in this case all reviews with a helpfulness denominator > 4
1. Defines the target: a review is flagged as useful when its helpfulness ratio is above 0.7
1. Calls the data preparation function through the FunctionTransformer called dataprep_transformer
1. Modifies the data further through a mapper, creating dummy variables for the categorical variables and standardising the numerical ones
1. Chains all the previous steps in a pipeline and adds the logistic regression model
1. Fits the model to the data
In [21]:
raw_data = pd.read_csv('Reviews.csv')

# Defining the data set: reviews with HelpfulnessDenominator > 4 (despite the "Biggerthan3" in the variable name)
DenominatorBiggerthan3_data = raw_data[raw_data['HelpfulnessDenominator']>4]

# Defining the target: useful_flag is '1' when the helpfulness ratio is above 0.7
def calculate_helpfulness_ratio(Numerator, Denominator):
    ratio = Numerator/float(Denominator)
    def target(ratio): 
        if ratio > 0.7: 
            return '1'
        return '0'
    useful_flag = target(ratio)
    #ratio = round(ratio,2)
    return [ratio, useful_flag]

helpfulness_ratio = DenominatorBiggerthan3_data.apply(
    lambda x: pd.Series(
        calculate_helpfulness_ratio(x.HelpfulnessNumerator, x.HelpfulnessDenominator), 
        index = ["helpfulness_ratio", "useful_flag"]
    ),
    axis=1
)
HelpfulnesswithTarget_df = pd.concat([DenominatorBiggerthan3_data[:], helpfulness_ratio[:]], axis=1)

dataprep_transformer = FunctionTransformer(new_features, validate=False)

mapper = DataFrameMapper([
    ('day_of_week', LabelBinarizer()),
    ('month', LabelBinarizer()),
    (['n_words'], StandardScaler()),
    (['product_freq'], StandardScaler()),
    (['reviewer_freq'], StandardScaler()),
    (['Score'], StandardScaler())
], default=None)

pipe = sklearn.pipeline.Pipeline([
    ('dataprep', dataprep_transformer),   
    ('featurize', mapper),
    ('log_reg', LogisticRegression())
])
model = pipe.fit(
    X=HelpfulnesswithTarget_df, 
    y=HelpfulnesswithTarget_df.useful_flag.values
)



In [27]:
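logreg = model.named_steps["log_reg"]
# Pair each feature name with its coefficient; list() is needed on Python 3, where zip is lazy.
# The expression comes last so that the notebook displays it as the cell output.
list(zip(
    model.named_steps['featurize'].transformed_names_, 
    logreg.coef_[0]
))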

[('day_of_week_0', 0.037339849694803376),
 ('day_of_week_1', -0.015561571195092359),
 ('day_of_week_2', 0.03531995804717539),
 ('day_of_week_3', -0.0315097126385658),
 ('day_of_week_4', 0.035748546663204143),
 ('day_of_week_5', 0.018097284214518766),
 ('day_of_week_6', 0.013389343790562391),
 ('month_1', 0.03045793616216519),
 ('month_2', 0.03294235144722164),
 ('month_3', 0.008589147856367416),
 ('month_4', -0.03147642030008585),
 ('month_5', -0.08654927060487967),
 ('month_6', 0.003075582847269026),
 ('month_7', 0.0019249979152510851),
 ('month_8', 0.07582113766184309),
 ('month_9', -0.003732301817545515),
 ('month_10', 0.0015949850669462365),
 ('month_11', -0.0009324543822984366),
 ('month_12', 0.061108006724264745),
 ('n_words', 0.30987650244242254),
 ('product_freq', -0.1702660201771724),
 ('reviewer_freq', -0.06803649921727592),
 ('Score', 0.6951813581961932),
 ('negative_rating_flag', 0.23687678233091417),
 ('positive_rating_flag', 0.8913147477523748),
 ('winter_flag', 0.1251708

Once this all runs smoothly, we can rerun the model and inspect the coefficients at the end. All the steps now live together in one place, so when the data is updated the only thing that needs to change is the raw data itself (assuming the column names don't change). Other parts can also be added to the pipeline, such as feature selection or cross-validation, to mention a few.
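
As a rough idea of what the cross-validation extension could look like, the sketch below passes a whole pipeline to `cross_val_score`. It is not the notebook's model: the data comes from `make_classification` and `cv_pipe` is a hypothetical two-step pipeline, both used only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the prepared review features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

cv_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("log_reg", LogisticRegression(max_iter=1000)),
])

# Each fold refits the whole pipeline, so the scaler only learns from
# the training part of that fold.
scores = cross_val_score(cv_pipe, X, y, cv=5)
print(scores.mean())
```

If the same idea were applied to the pipeline above, the preparation step would presumably also run once per fold, so count-based features such as product_freq and reviewer_freq would be recomputed on each subset.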

I think this is such a cool way to put all the statistical efforts together in one place :) 