<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Pipelines

In this example we're going to load in the James Bond Dataset. 

The target variable indicates whether the movie makes it into the IMDB Top_100. As this is a yes/no (binary) outcome, we'll be using Logistic Regression. 

Before training our model we'll need to preprocess our data with 2 transformers: 
* we'll use SimpleImputer to fill in missing values (in reality there aren't any, but please play along!)
* we'll use StandardScaler to scale and centre our data (standardisation)
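
To make the two transformers concrete, here is a minimal sketch on a tiny made-up array (the `toy` data is an assumption for illustration and has nothing to do with the James Bond Dataset). It shows the imputer replacing a missing value with the most frequent value in its column, and the scaler giving each column a mean of 0 and a standard deviation of 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only: column 0 has one missing value
toy = np.array([[1.0, 10.0],
                [1.0, 20.0],
                [np.nan, 30.0],
                [3.0, 40.0]])

# Step 1: replace the missing value with the most frequent value in its column
imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
toy_imputed = imputer.fit_transform(toy)

# Step 2: scale and centre each column
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy_imputed)

print(toy_imputed[2, 0])        # the value that replaced the NaN
print(toy_scaled.mean(axis=0))  # column means after scaling
print(toy_scaled.std(axis=0))   # column standard deviations after scaling
```

In column 0 the value 1.0 appears twice, so that is what the imputer fills in.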

In order to do this we'll build a pipeline with 3 steps: 
1. Impute - fill in missing values
2. Standardise - i.e. scale and centre our data
3. Train a Logistic Regression Model

Note that a Pipeline can have multiple steps that transform the data, but only the final step can be a model - in this case we'll be using Logistic Regression.

In [None]:
import numpy as np
import pandas as pd

In [None]:
bond_df = pd.read_csv('../../Data/JamesBond.csv')
bond_df = bond_df.drop('Movie', axis=1)
bond_df

In [None]:
bond_df.info()

In order to keep this simple let's drop the Bond column. 

In [None]:
bond_df = bond_df.drop('Bond', axis=1)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


In [None]:
X = bond_df.drop('Top_100', axis=1).values
y = bond_df['Top_100'].values


In [None]:
pipeline = Pipeline(steps=[
            # np.nan is used here because the np.NaN alias was removed in NumPy 2.0
            ('imputer', SimpleImputer(strategy='most_frequent', missing_values=np.nan)), 
            ('scaler', StandardScaler()), 
            ('model', LogisticRegression())
            ])
# There aren't any missing values, so we don't actually need the Imputer

In [None]:
logreg = pipeline.fit(X, y)
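
It can help to see what a single call to `fit` saves us from writing. The sketch below uses synthetic data from `make_classification` (an assumption, so that it runs without the CSV file) and hypothetical names such as `demo_pipeline`. It fits the same three steps once through a pipeline and once by hand, then checks that both give the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the James Bond Dataset
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)

# Version 1: let the pipeline run all three steps
demo_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)
pipeline_pred = demo_pipeline.predict(X_demo)

# Version 2: the same steps written out by hand
imputer = SimpleImputer(strategy='most_frequent')
scaler = StandardScaler()
model = LogisticRegression()
X_step = imputer.fit_transform(X_demo)   # fit and apply the imputer
X_step = scaler.fit_transform(X_step)    # fit and apply the scaler
model.fit(X_step, y_demo)                # train the model on the transformed data
manual_pred = model.predict(scaler.transform(imputer.transform(X_demo)))

same = np.array_equal(pipeline_pred, manual_pred)
print(same)
```

The manual version has to repeat every transform at prediction time, which is exactly the bookkeeping the pipeline does for us.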


# Prediction

Now we can make a prediction for a movie we're going to make in 2030. 

My input is actually just the same data as the most recent movie, made in 2015. So I guess my question is: do we need to increase the budget for the next movie? 

Arguably we should get rid of the World and US grossing columns, as these aren't known until after a movie is released and so aren't really input columns. But the aim here is to keep the example simple and focus on how Pipelines work. 



In [None]:

newX = np.array([[2030,200074175,196647,879620923,864553,245000,240803,148,6.8,6.4,3,1,1,30,205]])

In [None]:
y_pred = logreg.predict(newX)

In [None]:
y_pred

Hooray, we've made it into the Top 100!
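
A yes/no answer doesn't say how confident the model is. As a rough sketch, again on synthetic data with hypothetical names rather than the Bond data, a pipeline ending in Logistic Regression also offers `predict_proba`, which returns one probability per class for each row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the James Bond Dataset
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

new_row = X_demo[:1]                          # pretend this is a new, unseen row
label = demo_pipeline.predict(new_row)        # predicted class
proba = demo_pipeline.predict_proba(new_row)  # probability of each class

print(label)
print(proba)
```

Each row of `proba` sums to 1, and the columns follow the order of the model's `classes_` attribute.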

# Splitting into Train and Test

One of the main benefits of building a pipeline is repeatability. 

Using a pipeline it's super-easy to apply the same steps to your training dataset and your test dataset.

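Note that it's important that Transformers such as StandardScaler are fitted on the Training data only and then applied unchanged to the Test data, to simulate how the model would perform on new data in production. The pipeline takes care of this: `fit` learns the transformer settings from the Training data, while `predict` and `score` just apply them. 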

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
# fit() returns the fitted pipeline, not predictions, so we store it as logreg
logreg = pipeline.fit(X_train, y_train)

In [None]:
pipeline.score(X_test, y_test)
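
To check what the scaler inside a fitted pipeline has actually learned, we can look it up by its step name. The following is a sketch on synthetic data (the `demo_` and `Xd_`/`yd_` names are hypothetical): it confirms that the scaler's stored means come from the training rows only, and that `score` returns an accuracy between 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the James Bond Dataset
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

demo_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
demo_pipeline.fit(Xd_train, yd_train)

# Fetch the fitted scaler by the name we gave it in the pipeline
fitted_scaler = demo_pipeline.named_steps['scaler']

# There are no missing values here, so the imputer leaves the data unchanged
# and the scaler's means should match the training column means
uses_train_mean = np.allclose(fitted_scaler.mean_, Xd_train.mean(axis=0))

# score() transforms the test data with the already-fitted steps,
# then reports the mean accuracy of the model
test_accuracy = demo_pipeline.score(Xd_test, yd_test)

print(uses_train_mean)
print(test_accuracy)
```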