# Model Selection
Now that we have completed feature engineering, we are ready to start testing our dataset against various algorithms to see which produces the most accurate inferences. As a reminder, this project technically contains **two different models** as we are looking to predict two different values: a binary "yes/no" approval rating and a single-point decimal float between 0.0 and 10.0, more lovingly referred to as the *Biehn Scale*.

For each model we'll be creating, we will be testing **five different types of algorithms** to assess which performs the best. You might be asking the question, "Which algorithm is the best for my situation?" A super strong mathematician might be able to give you a decent answer, but at the end of the day, the truth is simply this: **The algorithm most right for your project is the one that produces the most consistently accurate results!** To that end, we test multiple algorithms instead of settling on a single one.

The goal of this notebook is to assess the results of each of the algorithms. Once we settle on one that seems to produce the best results, then we will create another notebook to formalize the model training process with a full ML pipeline.

## Modeling Strategy
While we already noted that we will be testing five algorithms for each model, there are some specific activities we will also need to perform during modeling. These include the following:

- **Hyperparameter Tuning**: In order to ensure each algorithm is performing optimally, we will be performing hyperparameter tuning to seek the ideal hyperparameters for each model.
- **K-Fold Validation**: Because the dataset we will be training against is relatively small, holding out a single test set like we would with a larger dataset would leave too little data for training. To make the most efficient use of our dataset, we will be using k-fold validation. This process splits the dataset into k folds and then trains the model k times, each time holding out a different fold for validation and training on the rest. Every row ends up being used for validation exactly once, which allows us to assess the dataset to its fullest extent.
- **Metric Validation**: With the models trained, we will want to ensure they perform effectively by comparing them with proper validation metrics.
- **Feature Scaling (Optional)**: Depending on the algorithm we use, we may or may not need to perform feature scaling on the dataset.
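
To make the k-fold idea concrete, here is a minimal hand-rolled sketch of how the fold indices get carved up. The helper `k_fold_indices` is a hypothetical name written just for illustration; it is meant to mimic contiguous, unshuffled folds like the ones `KFold(n_splits = 5)` produces later in this notebook, not to replace it.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, val_indices) pairs for contiguous, unshuffled folds."""
    # Spreading any remainder across the first folds, one extra row each
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        # The current fold is held out for validation; everything else is training data
        val_indices = list(range(start, stop))
        train_indices = list(range(0, start)) + list(range(stop, n_samples))
        yield train_indices, val_indices
        start = stop


# Walking through 5 folds on a toy dataset of 10 rows
folds = list(k_fold_indices(n_samples = 10, n_splits = 5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start = 1):
    print(f'Fold {fold_number}: train = {train_idx}, validation = {val_idx}')
```

Notice that each row index shows up in exactly one validation fold, which is what lets us evaluate against the whole dataset without ever validating on a row the model was just trained on.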

## Project Setup
Let's go ahead and perform a handful of activities as we prepare to start the model selection.

In [1]:
# Importing the necessary Python libraries
import numpy as np
import pandas as pd
from statistics import mean
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, mean_absolute_error, mean_squared_error, r2_score

# Importing the binary classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
# from catboost import CatBoostClassifier

# Importing the regression algorithms
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
# from catboost import CatBoostRegressor

In [2]:
# Importing warnings to suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Loading in the cleaned dataset
df_clean = pd.read_csv('../data/clean/train.csv')

In [4]:
# Dropping the movie name from df_clean
df_clean.drop(columns = ['movie_name'], inplace = True)

## Binary Classification Models
Now that we have loaded in the feature engineered dataset, `df_clean`, we are ready to begin testing a number of different binary classification algorithms. As mentioned at the top of this notebook, we will be trying out **five different binary classification algorithms**. Note that we will *not* be testing any deep learning algorithms. This is for two reasons: a) I don't want to have to mess with a GPU, and b) on a small tabular dataset like this one, they tend not to perform any better than the algorithms listed below.

These algorithms are the following:

- **Scikit-Learn's Logistic Regression algorithm**: While "regression" in the name can be deceiving, logistic regression is a very simple yet powerful algorithm for binary classification. Because we want to test out various algorithm types, we are selecting Scikit-Learn's logistic regression algorithm as a simpler variant.
- **Scikit-Learn's Gaussian Naive Bayes (GaussianNB) algorithm**: The most popular implementation of a Naive Bayes algorithm, we'll be testing out Scikit-Learn's GaussianNB implementation to see how it fares against our dataset.
- **Scikit-Learn's Support Vector Machine (SVM) algorithm**\*: While not as simple as the Logistic Regression algorithm, the SVM is still a relatively simple algorithm. This algorithm tends to perform better in higher dimensions (aka datasets with more features), and while our dataset has relatively few dimensions, I still think it's worth checking out.
- **Scikit-Learn's Random Forest Classifier algorithm**: This is one of the most popular binary classification algorithms used in the ML industry. This is because it often produces pretty accurate results while also being easier to explain than many other algorithms. The Random Forest Classifier is also a classic example of what is referred to as an *ensemble model*.
- **CatBoost's CatBoostClassifier algorithm**: You may not have heard of this algorithm before, but it is a very popular one amongst my coworkers at my Fortune 50 company. This is because, in their experience, it has often provided the best performance results.

\**Indicates that the algorithm will need feature scaling*

Before we jump into the algorithms, we will need to separate the target value, `biehn_yes_or_no`, from the rest of the dataset.

In [5]:
# Splitting the target value from the remainder of the dataset
X = df_clean.drop(columns = ['biehn_yes_or_no', 'biehn_scale_rating'])
y = df_clean[['biehn_yes_or_no']]

### Binary Classification Reusable Function
Given that we're going to be running similar code on five different models, I thought it would be helpful to create a reusable function that can easily churn through all five of these models.

In [6]:
# Creating a reusable function for churning through all five binary classification algorithms
def generate_binary_classification_model(X, y, model_algorithm, hyperparameters, needs_scaled = False):
    """
    Generating everything required for training and validation of a binary classification model

    Args:
        - X (Pandas DataFrame): A DataFrame containing the cleaned training data
        - y (Pandas DataFrame): A DataFrame containing the target values correlated to the X training data
        - model_algorithm (object): A model algorithm that will be trained against the X and y data
        - hyperparameters (dict): A dictionary containing all the hyperparameters to test the model with
        - needs_scaled (Boolean): A boolean value that indicates whether or not the input dataset needs to be scaled before training
    """
    
    # Performing a scaling on the data if required
    if needs_scaled:
        
        # Instantiating the StandardScaler
        scaler = StandardScaler()
        
        # Performing a fit_transform on the dataset
        scaled_features = scaler.fit_transform(X)
        
        # Transforming the StandardScaler output back into a Pandas DataFrame
        X = pd.DataFrame(scaled_features, index = X.index, columns = X.columns)
        
    # Instantiating a GridSearch object with the inputted model algorithm and hyperparameters
    gridsearchcv = GridSearchCV(estimator = model_algorithm,
                                param_grid = hyperparameters)
    
    # Fitting the training data to the GridSearch object
    gridsearchcv.fit(X, y)
    
    # Printing out the best hyperparameters
    print(f'Best hyperparameters: {gridsearchcv.best_params_}')
    
    # Updating the model object with the ideal hyperparameters from the GridSearch job
    model_algorithm.set_params(**gridsearchcv.best_params_)
    
    # Creating a container to hold each set of validation metrics
    accuracy_scores, roc_auc_scores, f1_scores = [], [], []
    
    # Instantiating the K-Fold cross validation object
    k_fold = KFold(n_splits = 5)
    
    # Iterating through each of the folds in K-Fold
    for train_index, val_index in k_fold.split(X):

        # Splitting the training set from the validation set for this specific fold
        X_train, X_val = X.iloc[train_index, :], X.iloc[val_index, :]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        
        # Fitting the X_train and y_train datasets to the model algorithm
        model_algorithm.fit(X_train, y_train)

        # Getting inferential predictions for the validation dataset
        val_preds = model_algorithm.predict(X_val)

        # Generating validation metrics by comparing the inferential predictions (val_preds) to the actuals (y_val)
        val_accuracy = accuracy_score(y_val, val_preds)
        val_roc_auc_score = roc_auc_score(y_val, val_preds)
        val_f1_score = f1_score(y_val, val_preds)
        
        # Appending the validation scores to the respective validation metric container
        accuracy_scores.append(val_accuracy)
        roc_auc_scores.append(val_roc_auc_score)
        f1_scores.append(val_f1_score)
        
    # Getting the average (mean) of each validation score
    average_accuracy = int(mean(accuracy_scores) * 100)
    average_roc_auc_score = int(mean(roc_auc_scores) * 100)
    average_f1_score = int(mean(f1_scores) * 100)
    
    # Printing out the average validation metrics
    print(f'Average accuracy score: {average_accuracy}%')
    print(f'Average ROC AUC score: {average_roc_auc_score}%')
    print(f'Average F1 score: {average_f1_score}%')
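
One thing worth flagging about the function above: when `needs_scaled` is set, the `StandardScaler` is fit on all of `X` before the folds are split, so each validation fold has a small influence on the scaling statistics. A stricter approach would learn the mean and standard deviation from the training fold only and then apply them to the validation fold. The sketch below shows that idea by hand on a made-up feature. The function names and the numbers are hypothetical, purely for illustration. In practice, Scikit-Learn's `Pipeline` is the usual way to get this behavior inside a grid search.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation from the training values only."""
    return mean(values), pstdev(values)


def apply_standardizer(values, center, scale):
    """Standardize values using previously learned statistics."""
    return [(value - center) / scale for value in values]


# A made-up numeric feature split into one training fold and one validation fold
train_feature = [90.0, 100.0, 110.0, 120.0]
val_feature = [105.0, 150.0]

# Fitting on the training fold only...
center, scale = fit_standardizer(train_feature)

# ...then applying the same statistics to both folds
scaled_train = apply_standardizer(train_feature, center, scale)
scaled_val = apply_standardizer(val_feature, center, scale)

print(f'Scaled training fold: {scaled_train}')
print(f'Scaled validation fold: {scaled_val}')
```

The scaled training fold ends up with a mean of 0 and a standard deviation of 1, while the validation fold is simply expressed in those same units without ever contributing to them.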

### Algorithm #1: Logistic Regression

In [7]:
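# Setting the hyperparameter grid for the Logistic Regression algorithm
# (Split into two grids because the lbfgs solver does not support the l1 penalty)
logistic_reg_params = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': np.logspace(-4, 4, 20)},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': np.logspace(-4, 4, 20)}
]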

In [8]:
# Instantiating the Logistic Regression algorithm object
logistic_reg_algorithm = LogisticRegression()

In [9]:
# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X = X,
                                     y = y,
                                     model_algorithm = logistic_reg_algorithm,
                                     hyperparameters = logistic_reg_params)

Best hyperparameters: {'C': 0.03359818286283781, 'penalty': 'l1', 'solver': 'liblinear'}
Average accuracy score: 79%
Average ROC AUC score: 54%
Average F1 score: 88%


### Algorithm #2: Gaussian Naive Bayes

In [10]:
# Setting the hyperparameter grid for the GaussianNB algorithm
gaussian_nb_params = {
    'var_smoothing': np.logspace(0, -9, num = 100)
}

In [11]:
# Instantiating the GaussianNB algorithm object
gaussian_nb_algorithm = GaussianNB()

In [12]:
# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X = X,
                                     y = y,
                                     model_algorithm = gaussian_nb_algorithm,
                                     hyperparameters = gaussian_nb_params)

Best hyperparameters: {'var_smoothing': 1.0}
Average accuracy score: 78%
Average ROC AUC score: 50%
Average F1 score: 88%
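
A quick aside on these numbers: high accuracy and F1 alongside an ROC AUC of 50% is roughly the pattern we would expect from a model that says "yes" to nearly every movie on a dataset where most of the labels are "yes". I have not verified that this is what is happening here, so treat it as a hypothesis. The sketch below uses synthetic labels (78 "yes" and 22 "no", an assumption and not our real data) and computes the metrics by hand for a model that approves everything. Note that because our reusable function passes hard 0/1 predictions to `roc_auc_score`, the ROC AUC reduces to the average of the true positive rate and the true negative rate.

```python
# Synthetic labels (an assumption, not the real dataset): 78 "yes" and 22 "no"
y_true = [1] * 78 + [0] * 22

# A degenerate "model" that approves every single movie
y_pred = [1] * len(y_true)

# Counting up the confusion matrix cells
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Calculating accuracy and F1 from the confusion matrix
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions, ROC AUC is the mean of the true positive and true negative rates
true_negative_rate = tn / (tn + fp)
roc_auc_from_labels = (recall + true_negative_rate) / 2

print(f'Accuracy: {accuracy:.0%}')
print(f'F1 score: {f1:.0%}')
print(f'ROC AUC (from hard labels): {roc_auc_from_labels:.0%}')
```

If this is indeed what our models are doing, accuracy and F1 are flattering them, and the ROC AUC is the metric telling us they have learned little beyond the majority class.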


### Algorithm #3: Support Vector Machine (Support Vector Classifier)

In [13]:
# Setting the hyperparameter grid for the Support Vector Machine (SVM) algorithm
svm_params = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'linear', 'poly']
}

In [14]:
# Instantiating the Support Vector Classifier (SVC) algorithm object
svc_algorithm = SVC()

In [15]:
# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X = X,
                                     y = y,
                                     model_algorithm = svc_algorithm,
                                     hyperparameters = svm_params,
                                     needs_scaled = True)

Best hyperparameters: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
Average accuracy score: 78%
Average ROC AUC score: 57%
Average F1 score: 87%


### Algorithm #4: Random Forest Classifier

In [16]:
# Setting the hyperparameter grid for the Random Forest Classifier (RFC) algorithm
rfc_params = {
    'n_estimators': [25, 50, 75],
    'max_depth': [10, 15, 20],
    'min_samples_split': [5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4]
}
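
To get a feel for how much work the grid search does with a grid like this one, we can count the combinations ourselves. This is a rough sketch, assuming `GridSearchCV` uses its default of 5-fold cross validation (our reusable function never passes a `cv` argument). The names `rfc_grid` and `combinations` are just illustrative.

```python
from itertools import product

# The same grid as the cell above, copied here so the sketch is self-contained
rfc_grid = {
    'n_estimators': [25, 50, 75],
    'max_depth': [10, 15, 20],
    'min_samples_split': [5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4]
}

# Every combination of hyperparameter values the grid search will try
combinations = list(product(*rfc_grid.values()))

# Each combination is trained once per cross validation fold
n_folds = 5
total_fits = len(combinations) * n_folds

print(f'Hyperparameter combinations: {len(combinations)}')
print(f'Total model fits during the search: {total_fits}')
```

Grids grow multiplicatively, so adding even one more value to a single hyperparameter adds dozens of extra model fits.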

In [17]:
# Instantiating the Random Forest Classifier (RFC) algorithm object
rfc_algorithm = RandomForestClassifier()

In [18]:
# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X = X,
                                     y = y,
                                     model_algorithm = rfc_algorithm,
                                     hyperparameters = rfc_params)

Best hyperparameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 50}
Average accuracy score: 78%
Average ROC AUC score: 50%
Average F1 score: 88%


### Algorithm #5: CatBoost Classifier

In [None]:
# Setting the hyperparameter grid for the CatBoost Classifier algorithm
catboost_params = {
    'depth': [1, 2, 3],
    'learning_rate': [0.001, 0.002, 0.003],
    'iterations': [1, 2, 5]
}

In [None]:
# Instantiating the CatBoost Classifier algorithm object
catboost_algorithm = CatBoostClassifier(silent = True)

In [None]:
# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X = X,
                                     y = y,
                                     model_algorithm = catboost_algorithm,
                                     hyperparameters = catboost_params)

## Regression Models
Now that we have worked our way through the binary classification algorithms, we're ready to start looking at the regression algorithms. As a refresher, recall that we are creating this model to predict the score that Caelan gives to a movie on a 0.0 to 10.0 scale known as the **Biehn Scale**. Again, we will not be using any deep learning algorithms here. Here is the list of regression algorithms we will be testing out:

- **Scikit-Learn's Linear Regression**: Like the logistic regression algorithm we analyzed with the binary classifiers, this is probably the simplest implementation of a regression algorithm we can test with. To be transparent, I am not expecting much given its simplicity, but it's always worth checking out anyway!
- **Scikit-Learn's Lasso Regression**: This algorithm is in the same family as the algorithm above. Lasso is actually an acronym that stands for Least Absolute Shrinkage and Selection Operator. To be completely transparent, I am not well versed in the math underlying this algorithm, so I'm not even going to try explaining it. :)
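- **Scikit-Learn's Support Vector Regressor**: The regression counterpart to the Support Vector Classifier we tested above. Like its classifier sibling, it will need feature scaling.
- **Scikit-Learn's Random Forest Regression**: The regression counterpart to the Random Forest Classifier we tested above, and another example of an ensemble model.
- **CatBoost's CatBoostRegressor**: The regression counterpart to the CatBoostClassifier.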
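
For the curious, here is the one-line version of what Lasso is doing, as I understand it from the Scikit-Learn documentation. The `Lasso` estimator looks for the coefficients $w$ that minimize the usual squared error plus a penalty on the absolute size of the coefficients:

$$
\min_{w} \; \frac{1}{2 n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Here $\alpha$ is the `alpha` hyperparameter, and setting it higher pushes more coefficients toward exactly zero. That is the "shrinkage and selection" part of the name.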

In [19]:
# Splitting the target value from the remainder of the dataset
X = df_clean.drop(columns = ['biehn_scale_rating', 'biehn_yes_or_no'])
y = df_clean[['biehn_scale_rating']]

### Regression Reusable Function
Just like with our binary classification models, we'll make our lives a lot easier if we can create a reusable function to quickly churn through all the regression models. The cell below does just that!

In [64]:
# Creating a reusable function for churning through all five regression algorithms
def generate_regression_model(X, y, model_algorithm, hyperparameters, needs_scaled = False):
    """
    Generating everything required for training and validation of a regression model

    Args:
        - X (Pandas DataFrame): A DataFrame containing the cleaned training data
        - y (Pandas DataFrame): A DataFrame containing the target values correlated to the X training data
        - model_algorithm (object): A model algorithm that will be trained against the X and y data
        - hyperparameters (dict): A dictionary containing all the hyperparameters to test the model with
        - needs_scaled (Boolean): A boolean value that indicates whether or not the input dataset needs to be scaled before training
    """
    
    # Performing a scaling on the data if required
    if needs_scaled:
        
        # Instantiating the StandardScaler
        scaler = StandardScaler()
        
        # Performing a fit_transform on the dataset
        scaled_features = scaler.fit_transform(X)
        
        # Transforming the StandardScaler output back into a Pandas DataFrame
        X = pd.DataFrame(scaled_features, index = X.index, columns = X.columns)
        
    # Instantiating a GridSearch object with the inputted model algorithm and hyperparameters
    gridsearchcv = GridSearchCV(estimator = model_algorithm,
                                param_grid = hyperparameters)
    
    # Fitting the training data to the GridSearch object
    gridsearchcv.fit(X, y)
    
    # Printing out the best hyperparameters
    print(f'Best hyperparameters: {gridsearchcv.best_params_}')
    
    # Updating the model object with the ideal hyperparameters from the GridSearch job
    model_algorithm.set_params(**gridsearchcv.best_params_)
    
    # Creating a container to hold each set of validation metrics
    mae_scores, rmse_scores, r2_scores = [], [], []
    
    # Instantiating the K-Fold cross validation object
    k_fold = KFold(n_splits = 5)
    
    # Iterating through each of the folds in K-Fold
    for train_index, val_index in k_fold.split(X):

        # Splitting the training set from the validation set for this specific fold
        X_train, X_val = X.iloc[train_index, :], X.iloc[val_index, :]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        
        # Fitting the X_train and y_train datasets to the model algorithm
        model_algorithm.fit(X_train, y_train)

        # Getting inferential predictions for the validation dataset
        val_preds = model_algorithm.predict(X_val)

        # Generating validation metrics by comparing the inferential predictions (val_preds) to the actuals (y_val)
        val_mae_score = mean_absolute_error(y_val, val_preds)
        val_mse_score = mean_squared_error(y_val, val_preds)
        val_rmse_score = np.sqrt(val_mse_score)
        val_r2_score = r2_score(y_val, val_preds)
        
        # Appending the validation scores to the respective validation metric container
        mae_scores.append(val_mae_score)
        rmse_scores.append(val_rmse_score)
        r2_scores.append(val_r2_score)
        
    # Getting the average (mean) of each validation score
    average_mae = mean(mae_scores)
    average_rmse = mean(rmse_scores)
    average_r2 = mean(r2_scores)
     
    # Printing out the average validation metrics
    print(f'Average mean absolute error: {average_mae}')
    print(f'Average root mean squared error: {average_rmse}')
    print(f'Average R2 score: {average_r2}')

### Algorithm #1: Linear Regression

In [68]:
# Setting the hyperparameter grid for the Linear Regression algorithm
linear_reg_params = {
    # No hyperparameters to tune!
}

In [69]:
# Instantiating the Linear Regression algorithm object
linear_reg_algorithm = LinearRegression()

In [70]:
# Feeding the algorithm into the reusable regression function
generate_regression_model(X = X,
                          y = y,
                          model_algorithm = linear_reg_algorithm,
                          hyperparameters = linear_reg_params)

Best hyperparameters: {}
Average mean absolute error: 1.8523793215485933
Average root mean squared error: 2.2251023178018743
Average R2 score: -0.25079718140411755
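
That negative R2 score might look like a bug, but it is a valid value. R2 compares the model's squared error against the squared error of simply predicting the mean every time, so anything below zero means the model did worse than that naive baseline. Here is a small hand-rolled sketch on made-up numbers. The function `regression_metrics` and the toy values are hypothetical, just to show how the three metrics we are printing are calculated.

```python
from math import sqrt
from statistics import mean


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equal-length lists of numbers."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = mean(abs(e) for e in errors)
    rmse = sqrt(mean(e ** 2 for e in errors))

    # R2 = 1 - (model's squared error / squared error of always predicting the mean)
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean(y_true)) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up ratings and three sets of made-up predictions
y_true = [2.0, 4.0, 6.0, 8.0]
close_preds = [2.5, 4.5, 5.5, 7.5]
mean_preds = [5.0, 5.0, 5.0, 5.0]
backwards_preds = [8.0, 6.0, 4.0, 2.0]

for name, preds in [('Close', close_preds), ('Always the mean', mean_preds), ('Backwards', backwards_preds)]:
    mae, rmse, r2 = regression_metrics(y_true, preds)
    print(f'{name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}, R2 = {r2:.2f}')
```

Predicting the mean every time lands at an R2 of exactly zero, so our Linear Regression result of roughly -0.25 tells us it is performing worse than that baseline on the validation folds.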


### Algorithm #2: Lasso Regression

In [89]:
# Setting the hyperparameter grid for the Lasso Regression algorithm
lasso_reg_params = {
    'alpha': np.linspace(0.2, 2, 25)
}

In [87]:
# Instantiating the Lasso Regression algorithm object
lasso_reg_algorithm = Lasso()

In [88]:
# Feeding the algorithm into the reusable regression function
generate_regression_model(X = X,
                          y = y,
                          model_algorithm = lasso_reg_algorithm,
                          hyperparameters = lasso_reg_params,
                          needs_scaled = True)

Best hyperparameters: {'alpha': 0.275}
Average mean absolute error: 1.5216118819619133
Average root mean squared error: 1.926506186578574
Average R2 score: 0.08599723674901145


### Algorithm #3: Support Vector Machine (Support Vector Regressor)

In [90]:
# Setting the hyperparameter grid for the Support Vector Regressor algorithm
svr_params = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
}

In [91]:
# Instantiating the Support Vector Regressor algorithm object
svr_algorithm = SVR()

In [92]:
# Feeding the algorithm into the reusable regression function
generate_regression_model(X = X,
                          y = y,
                          model_algorithm = svr_algorithm,
                          hyperparameters = svr_params,
                          needs_scaled = True)

Best hyperparameters: {'C': 10, 'gamma': 0.001}
Average mean absolute error: 1.492840348529517
Average root mean squared error: 1.958912842470548
Average R2 score: 0.05679773575439473


### Algorithm #4: Random Forest Regressor

In [101]:
# Setting the hyperparameter grid for the Random Forest Regressor algorithm
random_forest_regressor_params = {
    'n_estimators': [50, 75, 100],
    'max_depth': [15, 20, 25, 30],
    'min_samples_split': [9, 10, 11, 12],
    'min_samples_leaf': [1, 2, 3]
}

In [102]:
# Instantiating the Random Forest Regressor algorithm object
random_forest_regressor_algorithm = RandomForestRegressor()

In [103]:
# Feeding the algorithm into the reusable regression function
generate_regression_model(X = X,
                          y = y,
                          model_algorithm = random_forest_regressor_algorithm,
                          hyperparameters = random_forest_regressor_params)

Best hyperparameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 12, 'n_estimators': 50}
Average mean absolute error: 1.5642612058748273
Average root mean squared error: 1.9944205167340874
Average R2 score: 0.02216908560864377


### Algorithm #5: CatBoost Regressor

In [40]:
# Setting the hyperparameter grid for the CatBoost Regressor algorithm
catboost_reg_params = {

}

In [41]:
# Instantiating the CatBoost Regressor algorithm object
catboost_reg_algorithm = CatBoostRegressor(silent = True)

NameError: name 'CatBoostRegressor' is not defined

In [None]:
# Feeding the algorithm into the reusable regression function
generate_regression_model(X = X,
                          y = y,
                          model_algorithm = catboost_reg_algorithm,
                          hyperparameters = catboost_reg_params)