<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Pipelines

In the last example you may recall that we read in the James Bond dataset and trained a model.

However, we ignored one of the columns: the actor who played Bond was dropped.

We had to do this because our pipeline couldn't scale and centre that column, as it contains categorical information rather than numerical data.

Ideally we'd have liked to process this column in a different way from the rest. 

In this tutorial we're going to build a pipeline that processes categorical and numerical data separately using a ColumnTransformer. We'll scale and centre the numerical columns, and create dummies for the categorical columns.

Remember that a Pipeline can have several steps that transform the data, but can only contain one ML model. Again we'll be using Logistic Regression. 

In [None]:
import numpy as np
import pandas as pd

In [None]:
bond_df = pd.read_csv('../../Data/JamesBond.csv')
bond_df = bond_df.drop('Movie', axis=1)
bond_df

In [None]:
bond_df.info()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


In [None]:
X = bond_df.drop('Top_100', axis=1)
y = bond_df['Top_100']


Note that because I'm using a pipeline I don't have to convert X and y to NumPy arrays; I can keep them as pandas objects.

This also allows me to reference columns by name...

In [None]:
numerical_transformer = Pipeline([('scaler', StandardScaler())])
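# handle_unknown='ignore' stops transform() raising an error if the test split
# contains an actor that was not present in the training split
categorical_transformer = Pipeline([('one_hot', OneHotEncoder(handle_unknown='ignore'))])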

cat_features = ['Bond']
num_features = ['Year', 'US_Gross', 'US_Adj', 'World_Gross', 'World_Adj', 'Budget',
       'Budget_Adj', 'Film_Length', 'Avg_User_IMDB', 'Avg_User_Rtn_Tom',
       'Conquests', 'Martinis', 'BJB', 'Kills_Bond', 'Kills_Others']


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, num_features),
        ("cat", categorical_transformer, cat_features)
    ]
)
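
Typing out every column name works, but it can get tedious on wider tables. As an alternative sketch, scikit-learn's `make_column_selector` can pick columns by dtype instead. The `demo_X` frame and the `select_` names below are hypothetical stand-ins; on the real data you would pass `X` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for X, with two numerical columns and one text column
demo_X = pd.DataFrame({
    "Year": [1962, 1973, 2006],
    "Budget": [1.0, 7.0, 150.0],
    "Bond": ["Sean Connery", "Roger Moore", "Daniel Craig"],
})

# Selectors are callables that return the matching column names for a DataFrame
select_numeric = make_column_selector(dtype_include=np.number)
select_categorical = make_column_selector(dtype_exclude=np.number)

selector_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), select_numeric),
        ("cat", OneHotEncoder(), select_categorical),
    ]
)
selector_preprocessor.fit(demo_X)

print(select_numeric(demo_X))
print(select_categorical(demo_X))
```

The trade-off is that selection by dtype is only as good as the dtypes pandas inferred, so it is worth checking the output of `info()` before relying on it.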


In [None]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
pipeline.fit(X_train, y_train)


In [None]:
pipeline.score(X_test, y_test)

Hooray, we've improved the accuracy of our model by adding in the actor who played Bond!


# Further Reading

For a similar example see:
https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf

This example has the added step of varying the model at the end of the pipeline using a loop.
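
A rough sketch of that idea, assuming a preprocessor shaped like the one in this tutorial. The toy data, the `loop_` and `candidate_` names, and the choice of the three classifiers are all assumptions made for illustration, not the linked article's code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the Bond dataset
loop_X = pd.DataFrame({
    "Budget": [1, 2, 3, 4, 10, 11, 12, 13],
    "Bond": ["A", "B", "A", "B", "A", "B", "A", "B"],
})
loop_y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

loop_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Budget"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Bond"]),
    ]
)

# The models to try; each one becomes the final step of its own pipeline
candidate_models = [
    LogisticRegression(),
    KNeighborsClassifier(n_neighbors=3),
    DecisionTreeClassifier(random_state=42),
]

model_scores = {}
for candidate in candidate_models:
    candidate_pipeline = Pipeline(
        steps=[("preprocessor", loop_preprocessor), ("model", candidate)]
    )
    candidate_pipeline.fit(loop_X, loop_y)
    # Scored on the training rows only, to keep the sketch short
    model_scores[type(candidate).__name__] = candidate_pipeline.score(loop_X, loop_y)

print(model_scores)
```

On real data you would fit on `X_train` and score on `X_test`, as we did above, rather than scoring on the rows the model was trained on.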

# Even More Reading

If you'd like to know how to debug a pipeline, there are several approaches you can take: 
https://www.google.com/search?q=debugging+a+scikit+learn+pipeline&oq=debugging+a+scikit+learn+pipeline&aqs=edge..69i57j0i546l4j69i64.11809j0j1&sourceid=chrome&ie=UTF-8


