# Model Selection
Now that we have completed feature engineering, we are ready to start testing our dataset against various algorithms to see which produces the most accurate inferences. As a reminder, this project technically contains **two different models** as we are looking to predict two different values: a binary "yes/no" approval rating and a single-point decimal float between 0.0 and 10.0, more lovingly referred to as the *Biehn Scale*.

For each model we'll be creating, we will be testing **five different types of algorithms** to assess which performs the best. You might be asking the question, "Which algorithm is the best for my situation?" A super strong mathematician might be able to give you a decent answer, but at the end of the day, the truth is simply this: **The algorithm most right for your project is the one that produces the most consistently accurate results!** To that end, we test multiple algorithms instead of settling on a single one.

The goal of this notebook is to assess the results of each of the algorithms. Once we settle on one that seems to produce the best results, then we will create another notebook to formalize the model training process with a full ML pipeline.

## Modeling Strategy
While we already noted that we will be testing five algorithms for each of the two models, there are some specific activities we will also need to do when performing the modeling. These include the following:

- **Hyperparameter Tuning**: In order to ensure each algorithm is performing optimally, we will be performing hyperparameter tuning to seek the ideal hyperparameters for each model.
- **K-Fold Cross-Validation**: Because the dataset we will be training against is relatively small, a single train-test split like we would use with a larger dataset would leave too little data for either training or validation. To make the most efficient use of our dataset, we will be using k-fold cross-validation. This process shuffles the dataset and splits it into k folds; each fold takes one turn as the validation batch while the remaining folds are used for training. The output of this process allows us to assess each model against the full dataset.
- **Metric Validation**: With the models trained, we will want to ensure they perform effectively by comparing them with proper validation metrics.
- **Feature Scaling (Optional)**: Depending on the algorithm we use, we may or may not need to perform a feature scaling on the dataset.
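To make the k-fold idea above concrete, here is a minimal pure-Python sketch of how a shuffled k-fold split carves up row indices. The helper name `k_fold_indices`, the seed, and the row count of 122 are assumptions for illustration only; in practice a library utility such as Scikit-Learn's `KFold` or `StratifiedKFold` would normally handle this.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread any remainder so that fold sizes differ by at most one row
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Assumed row count, for illustration only
folds = list(k_fold_indices(n_samples=122, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: {len(train_idx)} training rows, {len(val_idx)} validation rows")
```

Every row lands in exactly one validation fold, so each observation is used for validation once and for training in the remaining rounds.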

## Project Setup
Let's go ahead and perform a handful of activities as we prepare to start the model selection.

In [1]:
# Importing the necessary Python libraries
import numpy as np
import pandas as pd

In [2]:
# Loading in the cleaned dataset
df_clean = pd.read_csv('../data/clean/train.csv')

In [3]:
# Dropping the movie name from df_clean
df_clean.drop(columns = ['movie_name'], inplace = True)

## Binary Classification Models
Now that we have loaded in the feature engineered dataset, `df_clean`, we are ready to begin testing out a number of different binary classification algorithms. As mentioned at the top of this notebook, we will be trying out **five different binary classification algorithms**. Note that we will *not* be testing any deep learning algorithms. This is for two reasons: a) I don't want to have to mess with a GPU, and b) on a small tabular dataset like this one, they tend not to perform any better than the algorithms listed below.

These algorithms are the following:

- **Scikit-Learn's Logistic Regression algorithm**: While "regression" in the name can be deceiving, logistic regression is a very simple yet powerful algorithm for binary classification. Because we want to test out various algorithm types, we are selecting Scikit-Learn's logistic regression algorithm as a simpler variant.
- **Scikit-Learn's Gaussian Naive Bayes (GaussianNB) algorithm**: GaussianNB is one of the most popular implementations of a Naive Bayes algorithm, so we'll be testing it out to see how it fares against our dataset.
- **Scikit-Learn's Support Vector Machine (SVM) algorithm**: While not as simple as the Logistic Regression algorithm, the SVM is still a relatively simple algorithm. This algorithm tends to perform well in higher dimensions (aka datasets with more features), and while our dataset has relatively few dimensions, I still think it's worth checking out.
- **Scikit-Learn's Random Forest Classifier algorithm**: This is one of the most popular binary classification algorithms used in the ML industry. This is because it often produces pretty accurate results and is comparatively easy to explain. The Random Forest Classifier is also a classic example of what is referred to as an *ensemble model*.
- **CatBoost's CatBoostClassifier algorithm**: You may not have heard of this algorithm before, but it is a very popular one amongst my coworkers at my Fortune 50 company. This is because it often provides some of the best performance results.

Before we jump into the algorithms, we will need to separate the target value, `biehn_yes_or_no`, from the rest of the dataset.

In [6]:
# Splitting the target value (y) from the remaining feature columns (X)
X = df_clean.drop(columns = ['biehn_yes_or_no'])
y = df_clean[['biehn_yes_or_no']]

In [7]:
y

| | biehn_yes_or_no |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| ... | ... |
| 118 | 1 |
| 119 | 1 |
| 120 | 0 |
| 121 | 1 |
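As a rough sketch of how the candidate classifiers described above could be compared with cross-validation, the example below runs four of them through `cross_val_score`. Everything here is an assumption for illustration rather than the notebook's actual modeling code: the data is a synthetic stand-in generated with `make_classification`, the hyperparameters are mostly defaults rather than tuned values, and `CatBoostClassifier` is left out only to avoid an extra dependency. Because `y` above is a single-column DataFrame, the sketch also flattens the target to one dimension, which is the shape Scikit-Learn estimators expect.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 122 rows and 8 numeric features
X_demo, y_values = make_classification(n_samples=122, n_features=8, random_state=42)
y_demo = pd.DataFrame({'biehn_yes_or_no': y_values})

# Flatten the single-column DataFrame into a 1-D array for the estimators
y_flat = y_demo['biehn_yes_or_no'].to_numpy()

# Scaling is wrapped in a pipeline so it is re-fit on each training fold only
candidates = {
    'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'gaussian_nb': GaussianNB(),
    'svm': make_pipeline(StandardScaler(), SVC()),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the yes/no ratio roughly equal across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_flat, cv=cv, scoring='accuracy')
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Putting the scaler inside the pipeline is a deliberate choice in this sketch: it keeps validation rows from influencing the scaling statistics. Accuracy is used only as a placeholder metric; with a target as imbalanced as the one shown above, another metric may be a better fit.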


## Regression Models
Now that we have worked our way through the binary classification algorithms, we're ready to start looking at the regression algorithms. As a refresher, recall that we are creating this model to predict the score that Caelan gives to a movie on a 0.0 to 10.0 scale known as the **Biehn Scale**. Again, we will not be using any deep learning algorithms here. Here is the list of regression algorithms we will be testing out:

- **Scikit-Learn's Linear Regression**: Like the logistic regression algorithm we analyzed with the binary classifiers, this is probably the simplest implementation of a regression algorithm we can test with. To be transparent, I am not expecting much given its simplicity, but it's always worth checking out anyway!