# Part 3: Algorithm Selection
Now that the dataset has been feature engineered, we are ready to test different algorithms and see which performs best. This notebook includes the following actions:

- Importing the cleansed dataset from the "/data/clean" directory
- Splitting the cleansed dataset into training and validation sets using 5-fold cross-validation
- Performing feature scaling where required
- Running a grid search (GridSearchCV) to tune hyperparameters
- Testing out a number of different algorithms
- Validating the results of each algorithm with model validation metrics / visualizations

In [9]:
# Importing the necessary Python libraries
import numpy as np
import pandas as pd
from statistics import mean
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, mean_absolute_error, mean_squared_error, r2_score

# Importing the binary classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')

In [10]:
# Loading in the cleaned dataset
df_clean = pd.read_csv('../data/clean/new_data.csv')

# Separating the target column ('Survived') from the predictor features
X = df_clean.drop(columns = ['Survived'])
y = df_clean[['Survived']]


In [11]:
# Creating a reusable function for churning through all five binary classification algorithms
def generate_binary_classification_model(X, y, model_algorithm, hyperparameters, needs_scaled = False):
    """
    Runs a grid search for the given algorithm, then trains and validates it with 5-fold cross validation

    Args:
        - X (Pandas DataFrame): A DataFrame containing the cleaned training data
        - y (Pandas DataFrame): A DataFrame containing the target values corresponding to the X training data
        - model_algorithm (object): A model algorithm that will be trained against the X and y data
        - hyperparameters (dict or list of dicts): The hyperparameter grid to test the model with
        - needs_scaled (Boolean): A boolean value that indicates whether or not the input dataset needs to be scaled before training
    """

    # Performing a scaling on the data if required
    if needs_scaled:

        # Instantiating the StandardScaler
        scaler = StandardScaler()

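        # Fitting the scaler and transforming the dataset in one step
        # (note: the scaler is fit on the full dataset, including rows later used for validation)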
        scaled_features = scaler.fit_transform(X)

        # Transforming the StandardScaler output back into a Pandas DataFrame
        X = pd.DataFrame(scaled_features, index = X.index, columns = X.columns)

    # Instantiating a GridSearch object with the inputted model algorithm and hyperparameters
    gridsearchcv = GridSearchCV(estimator = model_algorithm,
                                param_grid = hyperparameters)

    # Fitting the training data to the GridSearch object
    gridsearchcv.fit(X, y)

    # Printing out the best hyperparameters
    print(f'Best hyperparameters: {gridsearchcv.best_params_}')

    # Applying the best hyperparameters from the GridSearch job to the existing model object
    model_algorithm.set_params(**gridsearchcv.best_params_)

    # Creating a container (list) for each validation metric
    accuracy_scores, roc_auc_scores, f1_scores = [], [], []

    # Instantiating the K-Fold cross validation object
    k_fold = KFold(n_splits = 5)

    # Iterating through each of the folds in K-Fold
    for train_index, val_index in k_fold.split(X):

        # Splitting the training set from the validation set for this specific fold
        X_train, X_val = X.iloc[train_index, :], X.iloc[val_index, :]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]

        # Fitting the X_train and y_train datasets to the model algorithm
        model_algorithm.fit(X_train, y_train)

        # Getting inferential predictions for the validation dataset
        val_preds = model_algorithm.predict(X_val)

        # Generating validation metrics by comparing the predictions (val_preds) to the actuals (y_val)
        # Note: ROC AUC is computed here from hard class labels rather than probabilities or decision scores
        val_accuracy = accuracy_score(y_val, val_preds)
        val_roc_auc_score = roc_auc_score(y_val, val_preds)
        val_f1_score = f1_score(y_val, val_preds)

        # Appending the validation scores to the respective validation metric container
        accuracy_scores.append(val_accuracy)
        roc_auc_scores.append(val_roc_auc_score)
        f1_scores.append(val_f1_score)

    # Getting the average (mean) of each validation score as a whole-number percentage (int() truncates rather than rounds)
    average_accuracy = int(mean(accuracy_scores) * 100)
    average_roc_auc_score = int(mean(roc_auc_scores) * 100)
    average_f1_score = int(mean(f1_scores) * 100)

    # Printing out the average validation metrics
    print(f'Average accuracy score: {average_accuracy}%')
    print(f'Average ROC AUC score: {average_roc_auc_score}%')
    print(f'Average F1 score: {average_f1_score}%')

Some things to note:
1. Feature scaling: puts all features on a similar scale. Algorithms that rely on distances or margins, such as SVM, are sensitive to feature scale and usually need it.
2. Hyperparameter: a setting chosen before training (for example, C for logistic regression or max_depth for a random forest) rather than a value learned from the data.
   - It controls the behaviour of the training algorithm and therefore affects the performance of the model being trained.
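
To make the scaling point concrete, here is a minimal sketch that standardizes one feature inside each cross-validation fold, learning the mean and standard deviation from the training rows only. The function defined earlier fits StandardScaler on the full dataset before splitting, so validation rows influence the scaling; fitting per fold avoids that. The helper names (kfold_indices, fit_standardizer, standardize) and the toy fare values are assumptions for illustration and are not part of the notebook.

```python
from statistics import mean, pstdev

def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread any remainder over the first folds
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def fit_standardizer(values):
    """Return the mean and population standard deviation of the given values."""
    return mean(values), pstdev(values)

def standardize(values, center, scale):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - center) / scale for v in values]

# Hypothetical toy feature (for example, ticket fares); not taken from the dataset
fares = [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07]

scaled_folds = []
for train_idx, val_idx in kfold_indices(len(fares), n_splits=5):
    train_values = [fares[i] for i in train_idx]
    val_values = [fares[i] for i in val_idx]

    # The scaling statistics come from the training rows only
    center, scale = fit_standardizer(train_values)
    train_scaled = standardize(train_values, center, scale)
    val_scaled = standardize(val_values, center, scale)
    scaled_folds.append((train_scaled, val_scaled))

first_train_scaled, first_val_scaled = scaled_folds[0]
print(f'Mean of scaled training fold: {mean(first_train_scaled):.6f}')
print(f'Scaled validation values: {[round(v, 3) for v in first_val_scaled]}')
```

With scikit-learn, the same effect is usually achieved by wrapping the scaler and the model in a Pipeline so that the scaler is refit on every training fold.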

Score outputs:
1. Accuracy: __
2. ROC AUC: __
3. F1: __
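
As a rough guide to how these three scores are calculated, the sketch below computes each one by hand on a small, made-up set of labels. Accuracy is the share of correct predictions, F1 is the harmonic mean of precision and recall, and ROC AUC is the probability that a randomly chosen positive is ranked above a randomly chosen negative. The function names and toy values are assumptions for illustration. The last two lines show that ROC AUC can differ depending on whether it receives hard class labels (as the function above does, via predict) or continuous scores.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Share of predictions that match the actual label."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels, hard predictions and probability-like scores
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.45, 0.8, 0.1, 0.6, 0.7, 0.3]

print(f'Accuracy: {accuracy(y_true, y_pred):.4f}')
print(f'F1: {f1(y_true, y_pred):.4f}')
print(f'ROC AUC from hard labels: {roc_auc(y_true, y_pred):.4f}')
print(f'ROC AUC from scores: {roc_auc(y_true, y_score):.4f}')
```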

In [12]:
"""
ALGORITHM 1: LOGISTIC REGRESSION
Logistic regression estimates the probability of a binary (“yes” or “no”) outcome.
"""

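# Setting the hyperparameter grid for the Logistic Regression algorithm
# (two sub-grids, because the lbfgs solver does not support the l1 penalty)
logistic_reg_params = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': np.logspace(-4, 4, 20)},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': np.logspace(-4, 4, 20)}
]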

# Instantiating the Logistic Regression algorithm object
logistic_reg_algorithm = LogisticRegression()

# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X = X, y = y, model_algorithm = logistic_reg_algorithm, hyperparameters = logistic_reg_params)

Best hyperparameters: {'C': 10000.0, 'penalty': 'l2', 'solver': 'lbfgs'}
Average accuracy score: 79%
Average ROC AUC score: 78%
Average F1 score: 72%
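
To illustrate how logistic regression turns features into a "yes" or "no" prediction, here is a minimal sketch of the scoring step: a weighted sum of the features is passed through the sigmoid function to get a probability, which is then thresholded at 0.5. The weights, bias and feature values are hypothetical and were not learned from the dataset. In scikit-learn, C is the inverse of the regularization strength, so the large C selected above corresponds to very weak regularization.

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one row of features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical coefficients and a single hypothetical passenger row
weights = [1.2, -0.8]
bias = -0.1
passenger = [1.0, 0.5]

probability = predict_probability(passenger, weights, bias)
label = int(probability >= 0.5)  # 1 = "yes", 0 = "no"

print(f'Probability of the positive class: {probability:.3f}')
print(f'Predicted label: {label}')
```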


In [13]:
"""
ALGORITHM 2: GAUSSIAN NAIVE BAYES

The Naïve Bayes classifier applies Bayes' theorem under the assumption that, within a class, the value of one feature is independent of the value of any other feature.
The Gaussian variant treats each feature as normally distributed within a class and uses the training data to estimate the per-class means and variances required for classification.
"""

# Setting the hyperparameter grid for the GaussianNB algorithm
gaussian_nb_params = {
    'var_smoothing': np.logspace(0, -9, num=100)
}

# Instantiating the GaussianNB algorithm object
gaussian_nb_algorithm = GaussianNB()

# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X=X, y=y, model_algorithm=gaussian_nb_algorithm,
                                     hyperparameters=gaussian_nb_params)

Best hyperparameters: {'var_smoothing': 1e-06}
Average accuracy score: 79%
Average ROC AUC score: 78%
Average F1 score: 73%
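
The independence assumption can be sketched in a few lines: for each class, estimate a prior plus a mean and variance per feature, then score a new row by adding the log prior to the per-feature Gaussian log likelihoods. The sketch below is a simplified illustration, not scikit-learn's implementation. In particular, scikit-learn's var_smoothing adds a fraction of the largest feature variance, whereas the smoothing argument here is a plain constant. The toy rows (age, fare) and labels are made up.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels, smoothing=1e-9):
    """Estimate the prior and per-feature mean/variance for each class."""
    model = {}
    for cls in sorted(set(labels)):
        cls_rows = [row for row, label in zip(rows, labels) if label == cls]
        columns = list(zip(*cls_rows))
        model[cls] = {
            'prior': len(cls_rows) / len(rows),
            'means': [mean(col) for col in columns],
            # A small constant keeps variances strictly positive
            'variances': [pvariance(col) + smoothing for col in columns],
        }
    return model

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {}
    for cls, params in model.items():
        score = math.log(params['prior'])
        # Independence assumption: per-feature log likelihoods simply add up
        for x, mu, var in zip(row, params['means'], params['variances']):
            score += log_gaussian(x, mu, var)
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical (age, fare) rows with made-up labels
rows = [[22, 7.3], [38, 71.3], [26, 7.9], [35, 53.1], [35, 8.1], [54, 51.9]]
labels = [0, 1, 0, 1, 0, 1]

model = fit_gaussian_nb(rows, labels)
print(predict(model, [30, 8.0]))
print(predict(model, [40, 60.0]))
```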


In [14]:
"""
ALGORITHM 3: Support Vector Machine (Support Vector Classifier)

Let’s imagine we have two tags: red and blue, and our data has two features: x and y.

A support vector machine takes these data points and outputs the hyperplane
(which in two dimensions is simply a line) that best separates the tags.
This line is the decision boundary: anything that falls to one side of it we will classify as blue,
and anything that falls to the other as red.

This algorithm is sensitive to feature scale, so the data is scaled first (needs_scaled=True).

"""

# Setting the hyperparameter grid for the Support Vector Machine (SVM) algorithm
svm_params = {
    'C': [0.1, 1000],
    'gamma': [1, 0.0001],
}

# Instantiating the Support Vector Classifier (SVC) algorithm object
svc_algorithm = SVC()

# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X=X, y=y, model_algorithm=svc_algorithm, hyperparameters=svm_params,
                                     needs_scaled=True)

Best hyperparameters: {'C': 1000, 'gamma': 1}
Average accuracy score: 79%
Average ROC AUC score: 76%
Average F1 score: 70%
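
The red/blue description can be sketched as a sign test against a line, and the gamma hyperparameter can be illustrated with the RBF kernel, which is the default kernel of SVC. In the sketch below the line x + y - 3 = 0 and the sample points are hypothetical. The kernel function shows how gamma controls how quickly the similarity between two points decays with distance: a large gamma makes each point's influence very local, while a small gamma makes distant points look almost identical.

```python
import math

def linear_decision(point, weights, bias):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(point, weights, bias):
    """Assign a tag based on which side of the decision boundary the point falls."""
    return 'blue' if linear_decision(point, weights, bias) >= 0 else 'red'

def rbf_kernel(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Hypothetical decision boundary: x + y - 3 = 0
weights, bias = [1.0, 1.0], -3.0
print(classify([3.0, 2.0], weights, bias))
print(classify([0.5, 1.0], weights, bias))

# The two gamma values from the grid above, applied to the same pair of points
point_a, point_b = [0.0, 0.0], [1.0, 1.0]
similarity_large_gamma = rbf_kernel(point_a, point_b, gamma=1)
similarity_small_gamma = rbf_kernel(point_a, point_b, gamma=0.0001)
print(f'gamma=1: {similarity_large_gamma:.4f}')
print(f'gamma=0.0001: {similarity_small_gamma:.4f}')
```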


In [15]:

"""
ALGORITHM 4: RANDOM FOREST CLASSIFIER

- An ensemble of multiple decision trees, each trained on a random sub-sample of the dataset
- Assuming your dataset has “m” features, each split considers a random subset of “k” features, where k < m
- Among those k features, the algorithm picks the split with the highest information gain, starting with the root node
- It then splits the node into child nodes and repeats the process until a stopping condition (such as max_depth or min_samples_split) is reached

- GOAL: fit a number of decision tree classifiers on various sub-samples of the dataset and average their predictions to improve predictive accuracy and control over-fitting.

"""

# Setting the hyperparameter grid for the Random Forest Classifier (RFC) algorithm
rfc_params = {
    'n_estimators': [25, 50, 75],
    'max_depth': [10, 15, 20],
    'min_samples_split': [5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4]
}

# Instantiating the Random Forest Classifier (RFC) algorithm object
rfc_algorithm = RandomForestClassifier()

# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X=X, y=y, model_algorithm=rfc_algorithm, hyperparameters=rfc_params)

Best hyperparameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 25}
Average accuracy score: 83%
Average ROC AUC score: 80%
Average F1 score: 76%
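
The "highest information gain" criterion mentioned in the description can be sketched as follows: entropy measures how mixed the labels in a node are, and information gain is the drop in entropy after a split, weighted by the size of each child. The label lists are made up for illustration. Note that RandomForestClassifier uses Gini impurity by default rather than entropy, so this shows the idea and not necessarily the exact criterion used in the run above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    children_entropy = weight_left * entropy(left) + weight_right * entropy(right)
    return entropy(parent) - children_entropy

# Hypothetical node with four positives and four negatives
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# A split that separates the classes perfectly
gain_perfect = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])

# A split that only partially separates the classes
gain_partial = information_gain(parent, [1, 1, 1, 0], [1, 0, 0, 0])

print(f'Perfect split gain: {gain_perfect:.4f}')
print(f'Partial split gain: {gain_partial:.4f}')
```

A tree would prefer the first split because it yields the larger gain.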


In [16]:
"""
ALGORITHM 5: CATBOOST CLASSIFIER

"""

# Setting the hyperparameter grid for the CatBoost Classifier algorithm
catboost_params = {
    'depth': [1, 2, 3],
    'learning_rate': [0.001, 0.002, 0.003],
    'iterations': [1, 2, 5]
}

# Instantiating the CatBoost Classifier algorithm object
catboost_algorithm = CatBoostClassifier(silent=True)

# Feeding the algorithm into the reusable binary classification function
generate_binary_classification_model(X=X,
                                     y=y,
                                     model_algorithm=catboost_algorithm,
                                     hyperparameters=catboost_params)

Best hyperparameters: {'depth': 2, 'iterations': 1, 'learning_rate': 0.001}
Average accuracy score: 79%
Average ROC AUC score: 77%
Average F1 score: 71%
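
To show what a grid search does with a grid like the one above, the sketch below expands the CatBoost grid into every hyperparameter combination, which is the set of candidates an exhaustive search evaluates. The expand_grid helper is a hypothetical illustration rather than a scikit-learn function, and the fold count assumes the 5-fold default that recent versions of GridSearchCV use when cv is not specified.

```python
from itertools import product

# Same grid as in the CatBoost cell above
catboost_params = {
    'depth': [1, 2, 3],
    'learning_rate': [0.001, 0.002, 0.003],
    'iterations': [1, 2, 5]
}

def expand_grid(param_grid):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(catboost_params)
n_folds = 5  # assumption: default number of folds when cv is not passed

print(f'Candidate combinations: {len(candidates)}')
print(f'Model fits during the search: {len(candidates) * n_folds}')
print(f'First candidate: {candidates[0]}')
```

The number of fits grows multiplicatively with each added hyperparameter, which is why the larger Random Forest grid takes noticeably longer to search.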
