<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Pipelines

In this example we're going to load in the James Bond Dataset. 

The target variable indicates whether the movie makes it into the IMDB Top_100. Because this is a yes/no outcome, we'll be using Logistic Regression. 

Before training our model we'll need to preprocess our data with 2 transformers: 
* we'll use SimpleImputer to fill in missing values (in reality there aren't any, but please play along!)
* we'll use StandardScaler to scale and centre our data (standardisation)

In order to do this we'll build a pipeline with 3 steps: 
1. Impute - fill in missing values
2. Standardise - i.e. scale and centre our data
3. Train a Logistic Regression Model

Note that Pipelines can have multiple steps that transform the data, but you can only train a single model - in this case we'll be using Logistic Regression.
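
As a rough illustration of what the three steps amount to, here is a minimal sketch that runs them by hand and then through a Pipeline. The toy data (`X_toy`, `y_toy`) is made up purely for this example and is not part of the James Bond Dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 6 rows, 2 features, one missing value
X_toy = np.array([[1.0, 200.0],
                  [2.0, np.nan],
                  [3.0, 180.0],
                  [8.0, 900.0],
                  [9.0, 950.0],
                  [10.0, 1000.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Manual version: each transformer is fitted, and its output feeds the next step
imputer = SimpleImputer(strategy='most_frequent')
scaler = StandardScaler()
model = LogisticRegression()

X_imputed = imputer.fit_transform(X_toy)
X_scaled = scaler.fit_transform(X_imputed)
model.fit(X_scaled, y_toy)
manual_pred = model.predict(X_scaled)

# Pipeline version: the same three steps behind a single fit/predict
toy_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
toy_pipeline.fit(X_toy, y_toy)
pipeline_pred = toy_pipeline.predict(X_toy)

print(manual_pred)
print(pipeline_pred)
```

Both approaches should give the same predictions. The pipeline just saves you from passing the intermediate arrays around yourself.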

In [17]:
import numpy as np
import pandas as pd

In [18]:
bond_df = pd.read_csv('../../Data/JamesBond.csv')
bond_df = bond_df.drop('Movie', axis=1)
bond_df

Unnamed: 0,Year,Bond,US_Gross,US_Adj,World_Gross,World_Adj,Budget,Budget_Adj,Film_Length,Avg_User_IMDB,Avg_User_Rtn_Tom,Conquests,Martinis,BJB,Kills_Bond,Kills_Others,Top_100
0,1962,Sean Connery,16067035,123517,59567035,457928,1000,7688,110,7.3,7.7,3,2,1,4,8,0
1,1963,Sean Connery,24800000,188161,78900000,598624,2000,15174,115,7.5,8.0,4,0,0,11,16,0
2,1964,Sean Connery,51100000,382699,124900000,935404,3000,22468,110,7.8,8.4,2,1,2,9,68,1
3,1965,Sean Connery,63600000,468754,141200000,1040693,9000,66333,130,7.0,6.8,3,0,0,20,90,1
4,1967,Sean Connery,43100000,299591,111600000,775740,9500,66035,117,6.9,6.3,3,1,0,21,175,1
5,1969,George Lazenby,22800000,144234,82000000,518736,8000,50608,142,6.8,6.7,3,1,2,5,37,0
6,1971,Sean Connery,43800000,251083,116000000,664969,7200,41274,120,6.7,6.3,1,0,1,7,42,1
7,1973,Roger Moore,35400000,185105,161800000,846046,7000,36603,121,6.8,5.9,3,0,1,8,5,1
8,1974,Roger Moore,21000000,98894,97600000,459623,7000,32965,125,6.7,5.1,2,0,2,1,5,0
9,1977,Roger Moore,46800000,179297,185400000,710290,14000,53636,125,7.1,6.8,3,1,1,31,116,1


In [19]:
bond_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Year              24 non-null     int64  
 1   Bond              24 non-null     object 
 2   US_Gross          24 non-null     int64  
 3   US_Adj            24 non-null     int64  
 4   World_Gross       24 non-null     int64  
 5   World_Adj         24 non-null     int64  
 6   Budget            24 non-null     int64  
 7   Budget_Adj        24 non-null     int64  
 8   Film_Length       24 non-null     int64  
 9   Avg_User_IMDB     24 non-null     float64
 10  Avg_User_Rtn_Tom  24 non-null     float64
 11  Conquests         24 non-null     int64  
 12  Martinis          24 non-null     int64  
 13  BJB               24 non-null     int64  
 14  Kills_Bond        24 non-null     int64  
 15  Kills_Others      24 non-null     int64  
 16  Top_100           24 non-null     int64  
dtypes: float64(2), int64(14), object(1)

To keep this simple, let's drop the Bond column. It's the only non-numeric (object) column, and the steps below expect numeric input. 

In [20]:
bond_df = bond_df.drop('Bond', axis=1)

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


In [22]:
X = bond_df.drop('Top_100', axis=1).values
y = bond_df['Top_100'].values


In [23]:
pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent', missing_values=np.nan)),  # np.NaN was removed in NumPy 2.0
            ('scaler', StandardScaler()), 
            ('model', LogisticRegression())
            ])
# There aren't any missing values, so we don't actually need the Imputer

In [24]:
logreg = pipeline.fit(X, y)


# Prediction

Now we can make a prediction for a movie we're going to make in 2030. 

The input below is actually just the data from the most recent movie (made in 2015), with the year set to 2030. So I guess my question is: do we need to increase the budget for the next movie? 

Arguably we should drop the US and World gross columns, since box-office takings are only known after release and so aren't genuine inputs for a movie that hasn't been made yet. But the aim here is to keep the example simple and focus on how Pipelines work. 
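
If you did want to leave the gross columns out, a minimal sketch might look like the following. The small `sample_df` is built from the first three rows of the table shown earlier, restricted to a handful of columns. The names `sample_df`, `gross_columns` and `inputs_df` are just for this illustration.

```python
import pandas as pd

# First three rows of the table shown earlier, with only a few of the columns
sample_df = pd.DataFrame({
    'Year': [1962, 1963, 1964],
    'US_Gross': [16067035, 24800000, 51100000],
    'US_Adj': [123517, 188161, 382699],
    'World_Gross': [59567035, 78900000, 124900000],
    'World_Adj': [457928, 598624, 935404],
    'Budget': [1000, 2000, 3000],
    'Film_Length': [110, 115, 110],
    'Top_100': [0, 0, 1],
})

# Drop the box-office columns along with the target to leave only the inputs
gross_columns = ['US_Gross', 'US_Adj', 'World_Gross', 'World_Adj']
inputs_df = sample_df.drop(columns=gross_columns + ['Top_100'])

print(list(inputs_df.columns))  # ['Year', 'Budget', 'Film_Length']
```

Any new row passed to `predict` would then need the same reduced set of columns, in the same order.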



In [25]:

newX = np.array([[2030,200074175,196647,879620923,864553,245000,240803,148,6.8,6.4,3,1,1,30,205]])

In [26]:
y_pred = logreg.predict(newX)

In [27]:
y_pred

array([1])

Hooray, we've made it into the Top 100!
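
A predicted class of 1 doesn't tell us how confident the model is. As a hedged sketch, `predict_proba` on a fitted pipeline returns the estimated probability of each class. To keep the example self-contained it uses just three columns (Year, Film_Length, Avg_User_IMDB) from the ten rows shown earlier, so the numbers will not match the full model above. The names `X_small`, `y_small`, `small_pipeline` and `new_movie` are for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Year, Film_Length and Avg_User_IMDB for the ten rows shown earlier
X_small = np.array([
    [1962, 110, 7.3], [1963, 115, 7.5], [1964, 110, 7.8], [1965, 130, 7.0],
    [1967, 117, 6.9], [1969, 142, 6.8], [1971, 120, 6.7], [1973, 121, 6.8],
    [1974, 125, 6.7], [1977, 125, 7.1],
])
y_small = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1])  # Top_100 for the same rows

small_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
small_pipeline.fit(X_small, y_small)

# Same year, film length and IMDB rating as the new movie above
new_movie = np.array([[2030, 148, 6.8]])
proba = small_pipeline.predict_proba(new_movie)

print(small_pipeline.named_steps['model'].classes_)  # column order of proba
print(proba)  # one row, one probability per class
```

Note that 2030 is well outside the range of years the model was trained on, so any probability it reports here is an extrapolation.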

# Splitting into Train and Test

One of the main benefits of building a pipeline is repeatability. 

Using a pipeline it's super-easy to apply the same steps to your training dataset and your test dataset.

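Note that it's important that some Transformers, e.g. StandardScaler, are fitted on the Training data only and then applied unchanged to the Test data, to simulate how the model would perform on new data in production. A pipeline handles this for you: fit learns from the Training set, while predict and score reuse what was learned. 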
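
Here is a small sketch of what the pipeline does with the scaler during a train/test split. It again uses three columns (Year, Film_Length, Avg_User_IMDB) from the ten rows shown earlier. The variable names are for illustration, and `stratify` is added here only so that this tiny training set is guaranteed to contain both classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Year, Film_Length and Avg_User_IMDB for the ten rows shown earlier
X_small = np.array([
    [1962, 110, 7.3], [1963, 115, 7.5], [1964, 110, 7.8], [1965, 130, 7.0],
    [1967, 117, 6.9], [1969, 142, 6.8], [1971, 120, 6.7], [1973, 121, 6.8],
    [1974, 125, 6.7], [1977, 125, 7.1],
])
y_small = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.3, random_state=42, stratify=y_small)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_tr, y_tr)

# The scaler's statistics come from the training rows only
fitted_scaler = pipe.named_steps['scaler']
print(fitted_scaler.mean_)
print(X_tr.mean(axis=0))

# score() transforms the test rows with the training mean and scale, then predicts
pipeline_score = pipe.score(X_te, y_te)
manual_score = pipe.named_steps['model'].score(fitted_scaler.transform(X_te), y_te)
print(pipeline_score, manual_score)
```

The two printed means should match, and so should the two scores: the test rows never influence the scaler.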

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [29]:
logreg = pipeline.fit(X_train, y_train)  # fit() returns the fitted pipeline, not predictions

In [30]:
pipeline.score(X_test, y_test)

0.625
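
With only 24 rows, a 30% test split leaves just 8 movies to score against, so the result can move a lot from one split to another. One common way to get a steadier estimate is cross-validation, which refits the whole pipeline (preprocessing included) on each fold. The sketch below uses a synthetic stand-in dataset of the same size from `make_classification`, not the Bond data, and the choice of 4 folds is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 24 rows, 5 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=24, n_features=5, random_state=0)

cv_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Each fold fits the imputer, scaler and model on its own training portion only
scores = cross_val_score(cv_pipeline, X_demo, y_demo, cv=4)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```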