# Model Training Pipeline
In the model selection notebook (`model-selection.ipynb`), we evaluated several algorithms for the binary classification and regression models. This notebook takes the final candidates from that work and builds a full training pipeline that performs feature engineering as well as model training. At the end of this notebook, we produce two serialized models in `.pkl` form: one for binary classification and one for regression.

## Project Setup

In [1]:
# Importing the necessary Python libraries
import cloudpickle
import warnings
import numpy as np
import pandas as pd
from datetime import datetime
from category_encoders.one_hot import OneHotEncoder
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Hiding any warnings
warnings.filterwarnings('ignore')

# Adjusting Pandas output
pd.set_option("display.max_columns", None)

In [2]:
# Loading in the training data
df_raw = pd.read_csv('../data/raw/all_data.csv')

In [3]:
# Dropping any movies with no "ground truth" review
df_raw.drop(df_raw.index[df_raw['biehn_scale_rating'].isnull()], inplace = True)

In [4]:
# Encoding the "biehn_yes_or_no" target feature as 1 ("Yes") or 0 ("No")
df_raw['biehn_yes_or_no'] = df_raw['biehn_yes_or_no'].map({'Yes': 1, 'No': 0})

# Changing the datatype of the 'biehn_yes_or_no' to int
df_raw['biehn_yes_or_no'] = df_raw['biehn_yes_or_no'].astype(int)

## Feature Engineering Helper Functions
As we build the full training pipeline, we need to engineer features from the raw data. The helper functions in this section contain custom code that the pipeline applies to turn raw, uncleaned data into features that our models can use.

In [5]:
# Creating a helper function to engineer the "movie_age" feature
def generate_movie_age(df):
    """
    Generating a movie age relative to the current year and the year that the movie was released

    Args:
        - df (Pandas DataFrame): A DataFrame containing the raw data for which year the movie was released

    Returns:
        - df (Pandas DataFrame): A DataFrame containing newly engineered feature of relative "year"
    """
    
    # Extracting current year
    current_year = datetime.now().year

    # Working on a copy so that the input DataFrame is not modified in place
    df = df.copy()

    # Engineering a relative "movie_age" column based on number of years since original release
    df['movie_age'] = current_year - df['year']

    return df
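
Because `generate_movie_age` reads the current year at call time, the same movie receives a different `movie_age` depending on when the pipeline runs. The sketch below shows a vectorized variant with an optional fixed reference year, which may help keep training and inference consistent. The function name `movie_age_from_year` and the `reference_year` parameter are hypothetical additions for illustration and are not part of this notebook's pipeline.

```python
from datetime import datetime

import pandas as pd


def movie_age_from_year(df, reference_year=None):
    """Return a copy of df with a 'movie_age' column (hypothetical helper)."""
    # Fall back to the current year when no reference year is given
    if reference_year is None:
        reference_year = datetime.now().year

    # Copy so the caller's DataFrame is left untouched
    df = df.copy()
    df['movie_age'] = reference_year - df['year']
    return df


# Toy data (made up for illustration)
toy = pd.DataFrame({'year': [1984, 1986, 2010]})
aged = movie_age_from_year(toy, reference_year=2024)
print(aged)
```

Pinning `reference_year` at training time is one possible design choice; leaving it as `None` reproduces the "relative to today" behavior described above.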

In [6]:
# Creating a helper function to perform feature engineering on RT critic score
def engineer_rt_critic_score(df):
    """
    Feature engineering the Rotten Tomatoes critic score

    Args:
        - df (Pandas DataFrame): A DataFrame containing the raw data RT critic score

    Returns:
        - df (Pandas DataFrame): A DataFrame containing an updated version of RT critic score
    """
    
    # Working on a copy so that the input DataFrame is not modified in place
    df = df.copy()

    # Removing percentage sign from RT critic score (e.g. "87%" becomes 87, "100%" becomes 100)
    df['rt_critic_score'] = pd.to_numeric(df['rt_critic_score'].astype(str).str.rstrip('%'), errors = 'coerce')

    # Filling rt_critic_score nulls with critic average of 59%
    df['rt_critic_score'] = df['rt_critic_score'].fillna(59)

    # Transforming RT critic score into an integer datatype
    df['rt_critic_score'] = df['rt_critic_score'].astype(int)

    return df
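
To make the percent-string handling concrete, the sketch below parses a few made-up Rotten Tomatoes style values, including a three-digit score, a single-digit score, and a missing value. The sample values and the helper name `parse_percent` are assumptions for illustration; the fill value of 59 mirrors the one used above.

```python
import numpy as np
import pandas as pd


def parse_percent(series, fill_value=59):
    """Convert strings such as '87%' to integers, filling missing values."""
    # Strip a trailing '%' and coerce anything unparseable to NaN
    numeric = pd.to_numeric(series.astype(str).str.rstrip('%'), errors='coerce')
    return numeric.fillna(fill_value).astype(int)


# Toy scores (made up for illustration)
scores = pd.Series(['87%', '100%', '5%', np.nan])
parsed = parse_percent(scores)
print(parsed.tolist())
```

Stripping the `%` character is more robust than slicing the first two characters, which would turn `'100%'` into 10 and fail on `'5%'`.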

In [7]:
# Creating a helper function to handle nulls for the metascore feature
def handle_nulls_for_metascore(df):
    """
    Handling the nulls associated to the metascore feature

    Args:
        - df (Pandas DataFrame): A DataFrame containing the raw data metascore feature

    Returns:
        - df (Pandas DataFrame): A DataFrame containing an updated version of the metascore
    """
    
    # Filling metascore nulls with 50.0 (assigning the result rather than using a chained inplace fill)
    df = df.copy()
    df['metascore'] = df['metascore'].fillna(50.0)
    
    return df

In [8]:
# Creating a helper function to handle nulls for the RT audience feature
def handle_nulls_for_rt_audience_score(df):
    """
    Handling the nulls associated to the RT audience score feature

    Args:
        - df (Pandas DataFrame): A DataFrame containing the raw data RT audience score feature

    Returns:
        - df (Pandas DataFrame): A DataFrame containing an updated version of the RT audience score
    """
    
    # Filling rt_audience_score nulls with audience average of 59% (assigning the result rather than using a chained inplace fill)
    df = df.copy()
    df['rt_audience_score'] = df['rt_audience_score'].fillna(59.0)
    
    return df

## Pipeline Creation

Now that we have created our helper functions to perform the feature engineering, we are ready to begin packaging everything as a single, unified pipeline.

In [9]:
# Creating the data preprocessor that will perform our feature engineering
data_preprocessor = ColumnTransformer(transformers = [
    ('ohe_engineering', OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore'), ['primary_genre', 'secondary_genre']),
    ('movie_age_engineering', FunctionTransformer(generate_movie_age, validate = False), ['year']),
    ('rt_critic_score_engineering', FunctionTransformer(engineer_rt_critic_score, validate = False), ['rt_critic_score']),
    ('rt_audience_score_engineering', FunctionTransformer(handle_nulls_for_rt_audience_score, validate = False), ['rt_audience_score']),    
    ('metascore_engineering', FunctionTransformer(handle_nulls_for_metascore, validate = False), ['metascore']),
    ('columns_to_drop', 'drop', ['movie_name', 'tmdb_id', 'imdb_id', 'tmdb_popularity'])
],
    remainder = 'passthrough'
)
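
To illustrate how a `ColumnTransformer` routes columns to different transformers, here is a minimal sketch on toy data. It is not the notebook's preprocessor: it substitutes scikit-learn's own `OneHotEncoder` for the `category_encoders` one to keep dependencies small, and the toy column `runtime` and the helper `fill_metascore` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def fill_metascore(df):
    """Fill missing metascore values with 50.0 (same default as above)."""
    return df.fillna({'metascore': 50.0})


# Toy data (made up for illustration)
toy = pd.DataFrame({
    'movie_name': ['A', 'B', 'C'],                    # dropped below
    'primary_genre': ['Action', 'Drama', 'Action'],   # one-hot encoded
    'metascore': [70.0, None, 55.0],                  # nulls filled
    'runtime': [100, 120, 95],                        # passed through untouched
})

preprocessor = ColumnTransformer(
    transformers=[
        # scikit-learn's encoder stands in for category_encoders here
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['primary_genre']),
        ('metascore', FunctionTransformer(fill_metascore), ['metascore']),
        ('drop_ids', 'drop', ['movie_name']),
    ],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array for easy inspection
)

features = preprocessor.fit_transform(toy)
print(features.shape)
print(features)
```

The output columns follow the order of the transformers (two genre indicators, then `metascore`), with the passed-through `runtime` column appended last.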

### Training the Binary Classification Model

In [10]:
# Separating the target value from the predictor features
X = df_raw.drop(columns = ['biehn_yes_or_no', 'biehn_scale_rating'])
y = df_raw['biehn_yes_or_no']

In [11]:
# Creating the full inference pipeline for the binary classification model
binary_classification_pipeline = Pipeline(steps = [
    ('feature_engineering', data_preprocessor),
    ('predictive_modeling', RandomForestClassifier(n_estimators = 50,
                                                   max_depth = 20,
                                                   min_samples_split = 5,
                                                   min_samples_leaf = 2))
])

In [12]:
# Formally training the binary classification pipeline
binary_classification_pipeline.fit(X, y)

Pipeline(steps=[('feature_engineering',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe_engineering',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                use_cat_names=True),
                                                  ['primary_genre',
                                                   'secondary_genre']),
                                                 ('movie_age_engineering',
                                                  FunctionTransformer(func=<function generate_movie_age at 0x7f85d9b73700>),
                                                  ['year']),
                                                 ('rt_critic_score_engineering',
                                                  Func...
                                                  FunctionTransformer(func=<function handle_nulls_for_rt_audience_sco
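
The pipeline above is fit on the full dataset, so on its own it gives no estimate of how well the model generalizes. A minimal sketch of a hold-out check with `train_test_split` and the metrics imported earlier follows. It runs on synthetic data from `make_classification` (an assumption, since the real dataset is not reproduced here) and reuses the hyperparameters above; `random_state` and the 80/20 split are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered movie features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 20% of the rows for validation, preserving the class balance
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=50,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=0,
)
clf.fit(X_train, y_train)

# Hard predictions for accuracy and F1; probabilities for ROC AUC
val_preds = clf.predict(X_val)
val_probs = clf.predict_proba(X_val)[:, 1]

metrics = {
    'accuracy': accuracy_score(y_val, val_preds),
    'roc_auc': roc_auc_score(y_val, val_probs),
    'f1': f1_score(y_val, val_preds),
}
print(metrics)
```

The same pattern should apply to the full pipeline: fit on the training split, score on the validation split, and refit on all data before serializing.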

### Training the Regression Model

In [13]:
# Separating the target value from the predictor features
X = df_raw.drop(columns = ['biehn_yes_or_no', 'biehn_scale_rating'])
y = df_raw[['biehn_scale_rating']]

In [14]:
# Instantiating a StandardScaler object for feature scaling
feature_scaler = StandardScaler()

In [15]:
# Creating the full inference pipeline for the regression model
regression_pipeline = Pipeline(steps = [
    ('feature_engineering', data_preprocessor),
    ('feature_scaling', feature_scaler),
    ('predictive_modeling', Lasso(alpha = 0.275))
])
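
Both pipelines reference the same `data_preprocessor` object, so fitting one pipeline refits the transformer held by the other. That is harmless when both are fit on the same `X`, as here, but it could be surprising if the inputs ever diverge. The sketch below gives each pipeline an independent copy via `sklearn.base.clone`; the `StandardScaler` stand-in, the `LogisticRegression` step, and the variable names are assumptions for illustration.

```python
from sklearn.base import clone
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for data_preprocessor; any unfitted transformer can be cloned
shared_preprocessor = StandardScaler()

classification_sketch = Pipeline(steps=[
    ('feature_engineering', clone(shared_preprocessor)),
    ('predictive_modeling', LogisticRegression()),
])

regression_sketch = Pipeline(steps=[
    ('feature_engineering', clone(shared_preprocessor)),
    ('predictive_modeling', Lasso(alpha=0.275)),
])

# Each pipeline now owns a distinct transformer instance
independent = (
    classification_sketch.named_steps['feature_engineering']
    is not regression_sketch.named_steps['feature_engineering']
)
print(independent)
```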

In [16]:
# Formally training the regression pipeline
regression_pipeline.fit(X, y)

Pipeline(steps=[('feature_engineering',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe_engineering',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                use_cat_names=True),
                                                  ['primary_genre',
                                                   'secondary_genre']),
                                                 ('movie_age_engineering',
                                                  FunctionTransformer(func=<function generate_movie_age at 0x7f85d9b73700>),
                                                  ['year']),
                                                 ('rt_critic_score_engineering',
                                                  Func...
                                                  FunctionTransformer(func=<function handle_nulls_for_rt_audience_sco

### Saving the Serialized Models

In [17]:
# Saving the binary classification pipeline to a serialized pickle file
with open('../models/binary_classification_pipeline.pkl', 'wb') as f:
    cloudpickle.dump(binary_classification_pipeline, f)

In [18]:
# Saving the regression pipeline to a serialized pickle file
with open('../models/regression_pipeline.pkl', 'wb') as f:
    cloudpickle.dump(regression_pipeline, f)
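
To use a saved pipeline later, it can be loaded back and called like any fitted estimator. The sketch below round-trips a toy pipeline through a temporary file and checks that predictions match. The toy data and the file name `toy_pipeline.pkl` are assumptions, and the code falls back to the standard `pickle` module if `cloudpickle` is not installed.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    import cloudpickle as pickler
except ImportError:  # fallback for environments without cloudpickle
    import pickle as pickler

# Toy regression data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

toy_pipeline = Pipeline(steps=[
    ('feature_scaling', StandardScaler()),
    ('predictive_modeling', Lasso(alpha=0.275)),
]).fit(X_toy, y_toy)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    with open(model_path, 'wb') as f:
        pickler.dump(toy_pipeline, f)
    with open(model_path, 'rb') as f:
        loaded_pipeline = pickler.load(f)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.allclose(
    toy_pipeline.predict(X_toy), loaded_pipeline.predict(X_toy)
)
print(same_predictions)
```

Because the real pipelines embed the helper functions defined in this notebook, the loading environment will most likely need `cloudpickle` installed along with compatible versions of the libraries used for training.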