<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Pipelines

In the last example you may recall that we read in the James Bond dataset and trained a model.

However, we ignored one of the columns: the actor who played Bond was dropped.

We had to do this because our pipeline couldn't scale and centre that column, as it contains categorical information rather than numerical data.

Ideally we'd have liked to process this column in a different way from the rest. 

In this tutorial we're going to build a pipeline that processes categorical and numerical data separately using a ColumnTransformer. We'll scale and centre the numerical columns, and create dummies for the categorical columns.

Remember that a Pipeline can have several steps that transform the data, but can only contain one ML model, which must be the final step. Again we'll be using Logistic Regression.
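As a quick illustration of "several transform steps, one model", here is a minimal sketch on a tiny made-up array. The data, the step names and the choice of `SimpleImputer` as a second transform step are assumptions for this sketch only; they are not part of the Bond example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two numeric features, one missing value.
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([0, 0, 1, 1])

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # transform step 1: fill the gap
    ("scaler", StandardScaler()),                 # transform step 2: scale and centre
    ("model", LogisticRegression()),              # final step: the single ML model
])

# fit() runs each transform step in order, then trains the model on the result.
toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(X_toy)

print([name for name, _ in toy_pipeline.steps])
```

Every step except the last transforms the data; only the last step is asked to predict.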

In [6]:
import numpy as np
import pandas as pd

In [7]:
bond_df = pd.read_csv('../../Data/JamesBond.csv')
bond_df = bond_df.drop('Movie', axis=1)
bond_df

Unnamed: 0,Year,Bond,US_Gross,US_Adj,World_Gross,World_Adj,Budget,Budget_Adj,Film_Length,Avg_User_IMDB,Avg_User_Rtn_Tom,Conquests,Martinis,BJB,Kills_Bond,Kills_Others,Top_100
0,1962,Sean Connery,16067035,123517,59567035,457928,1000,7688,110,7.3,7.7,3,2,1,4,8,0
1,1963,Sean Connery,24800000,188161,78900000,598624,2000,15174,115,7.5,8.0,4,0,0,11,16,0
2,1964,Sean Connery,51100000,382699,124900000,935404,3000,22468,110,7.8,8.4,2,1,2,9,68,1
3,1965,Sean Connery,63600000,468754,141200000,1040693,9000,66333,130,7.0,6.8,3,0,0,20,90,1
4,1967,Sean Connery,43100000,299591,111600000,775740,9500,66035,117,6.9,6.3,3,1,0,21,175,1
5,1969,George Lazenby,22800000,144234,82000000,518736,8000,50608,142,6.8,6.7,3,1,2,5,37,0
6,1971,Sean Connery,43800000,251083,116000000,664969,7200,41274,120,6.7,6.3,1,0,1,7,42,1
7,1973,Roger Moore,35400000,185105,161800000,846046,7000,36603,121,6.8,5.9,3,0,1,8,5,1
8,1974,Roger Moore,21000000,98894,97600000,459623,7000,32965,125,6.7,5.1,2,0,2,1,5,0
9,1977,Roger Moore,46800000,179297,185400000,710290,14000,53636,125,7.1,6.8,3,1,1,31,116,1


In [8]:
bond_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Year              24 non-null     int64  
 1   Bond              24 non-null     object 
 2   US_Gross          24 non-null     int64  
 3   US_Adj            24 non-null     int64  
 4   World_Gross       24 non-null     int64  
 5   World_Adj         24 non-null     int64  
 6   Budget            24 non-null     int64  
 7   Budget_Adj        24 non-null     int64  
 8   Film_Length       24 non-null     int64  
 9   Avg_User_IMDB     24 non-null     float64
 10  Avg_User_Rtn_Tom  24 non-null     float64
 11  Conquests         24 non-null     int64  
 12  Martinis          24 non-null     int64  
 13  BJB               24 non-null     int64  
 14  Kills_Bond        24 non-null     int64  
 15  Kills_Others      24 non-null     int64  
 16  Top_100           24 non-null     int64  
dtypes: float64(2), int64(14), object(1)

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


In [35]:
X = bond_df.drop('Top_100', axis=1)
y = bond_df['Top_100']


Note that because I'm using a pipeline I don't have to convert X and y to numpy arrays. I'm keeping X as a DataFrame and y as a Series.

This also allows me to reference columns by name...

In [39]:
numerical_transformer = Pipeline([('scaler', StandardScaler())])
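# handle_unknown='ignore' avoids an error if the test set contains an actor not seen during training
categorical_transformer = Pipeline([('one_hot', OneHotEncoder(handle_unknown='ignore'))])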

cat_features = ['Bond']
num_features = ['Year', 'US_Gross', 'US_Adj', 'World_Gross', 'World_Adj', 'Budget',
       'Budget_Adj', 'Film_Length', 'Avg_User_IMDB', 'Avg_User_Rtn_Tom',
       'Conquests', 'Martinis', 'BJB', 'Kills_Bond', 'Kills_Others']


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, num_features),
        ("cat", categorical_transformer, cat_features)
    ]
)


In [40]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor), 
            ('model', LogisticRegression())
            ])

In [46]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [47]:
pipeline.fit(X_train, y_train)


In [49]:
pipeline.score(X_test, y_test)

0.75

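Hooray, we've improved the accuracy of our model by adding in the actor who played Bond! Bear in mind that with only 24 films the test set holds just 8 of them, so this score is a rough estimate.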
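With so little data, one way to get a steadier comparison is to cross-validate the pipeline with and without the categorical column. The sketch below uses randomly generated data, not the Bond dataset. The column names `length` and `actor`, the helper `build_pipeline` and the choice of 5 folds are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: one numeric and one categorical column.
rng = np.random.default_rng(0)
n_rows = 40
cv_X = pd.DataFrame({
    "length": rng.normal(120, 10, size=n_rows),
    "actor": rng.choice(["A", "B", "C"], size=n_rows),
})
cv_y = (cv_X["length"] > cv_X["length"].median()).astype(int)


def build_pipeline(num_cols, cat_cols):
    """Hypothetical helper: preprocessing plus logistic regression."""
    transformers = [("num", StandardScaler(), num_cols)]
    if cat_cols:
        transformers.append(
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
        )
    # Columns not listed in any transformer are dropped by default.
    return Pipeline(steps=[
        ("preprocessor", ColumnTransformer(transformers=transformers)),
        ("model", LogisticRegression()),
    ])


# Each call refits the whole pipeline on every fold, so the scaler and the
# encoder only ever learn from that fold's training rows.
scores_without = cross_val_score(build_pipeline(["length"], []), cv_X, cv_y, cv=5)
scores_with = cross_val_score(build_pipeline(["length"], ["actor"]), cv_X, cv_y, cv=5)

print(scores_without.mean(), scores_with.mean())
```

Passing the whole pipeline to `cross_val_score` is the design choice that matters here: the preprocessing is refitted inside each fold instead of being fitted once on all of the data.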


# Further Reading

For a similar example see:
https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf

This example adds the step of varying the model at the end of the pipeline using a loop.

# Even More Reading

If you'd like to know how to debug a pipeline, there are several approaches you can take: 
https://www.google.com/search?q=debugging+a+scikit+learn+pipeline
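Two approaches I find handy are looking up a fitted step by name, and running every step except the model to see what the model actually receives. Here is a minimal sketch using six rows copied from the table above, trimmed to two columns. The `debug_` names are hypothetical, and this is only one of several ways to inspect a pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Six rows copied from the table above (Film_Length, Bond, Top_100).
debug_X = pd.DataFrame({
    "Film_Length": [110, 115, 142, 121, 125, 125],
    "Bond": ["Sean Connery", "Sean Connery", "George Lazenby",
             "Roger Moore", "Roger Moore", "Roger Moore"],
})
debug_y = pd.Series([0, 0, 0, 1, 0, 1])

debug_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Film_Length"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Bond"]),
    ])),
    ("model", LogisticRegression()),
])
debug_pipeline.fit(debug_X, debug_y)

# 1. Look up a fitted step by the name it was given in the pipeline.
fitted_model = debug_pipeline.named_steps["model"]
print(fitted_model.coef_.shape)

# 2. Slice off the model and run only the transform steps,
#    to inspect the data exactly as the model sees it.
model_input = debug_pipeline[:-1].transform(debug_X)
print(model_input.shape)
```

If the number of columns reaching the model isn't what you expected, the preprocessing step is the place to look.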


