# Final Project Submission

Please fill out:
* Student name: Doug Steen
* Student pace: Full time
* Scheduled project review date/time: 2/8/2020, 11:30 AM CT
* Instructor name: James Irving, PhD
* Blog post URL: 


## Introduction

### Machine Learning Algorithms Used
#### k-Nearest Neighbors (kNN)
kNN is a supervised algorithm that can be used for classification or regression problems. In classification, the algorithm predicts each test example's class label by majority vote among the k nearest training examples, where 'nearest' is measured by distance in n-dimensional feature space.
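
As a rough illustration of the idea above, here is a minimal pure-Python sketch of kNN classification. The function name `knn_predict` and the toy data are hypothetical and are not part of this project; the models later in this notebook use scikit-learn's `KNeighborsClassifier` instead.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest examples and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-d data: two well-separated clusters (illustrative values only)
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ['A', 'A', 'A', 'B', 'B', 'B']

prediction = knn_predict(train_X, train_y, query=(0.15, 0.15), k=3)
print(prediction)
```

Because the vote depends on raw distances, features on larger scales dominate unless the data are standardized first.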

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

#### Decision Trees
A Decision Tree is a supervised algorithm that can be used for classification or regression problems. In classification, a Decision Tree is constructed using the training data to incrementally partition examples using features that maximize information gain (with respect to training labels) at each step. Labels for test data are then predicted using the Decision Tree constructed from the training data.
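
To make 'information gain' concrete, the sketch below computes the entropy-based gain of a candidate split on toy labels. The helper names are hypothetical. Note also that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; entropy is selected with `criterion='entropy'`.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child


# Toy labels: a balanced two-class node (illustrative values only)
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly recovers all of the parent's entropy
perfect_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split whose children mirror the parent's class mix gains nothing
useless_gain = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(perfect_gain, useless_gain)
```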

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

#### Random Forest
A Random Forest is a Decision Tree-based supervised learning ensemble method. Random Forests can be used for classification or regression problems. A Random Forest includes many Decision Trees that each use (1) a bootstrap-sampled version of the original dataset and (2) random subsets of the dataset features. In classification problems, each Decision Tree in the Random Forest gets a 'vote' toward the classification of each example in the test dataset. This method helps counteract the 'overfitting' that can take place when using a single Decision Tree.
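
The two sources of randomness and the voting step can be sketched in a few lines of pure Python. All names here are hypothetical, and this is a simplification: scikit-learn's `RandomForestClassifier` draws a fresh feature subset at each split (controlled by `max_features`) rather than once per tree.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and others are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, n_selected, rng):
    """Pick n_selected distinct feature indices without replacement."""
    return sorted(rng.sample(range(n_features), n_selected))


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training examples

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=6, n_selected=2, rng=rng)

# Suppose five trees predicted these labels for a single test example
forest_label = majority_vote([0, 2, 0, 1, 0])
print(sample, features, forest_label)
```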

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

#### AdaBoost
AdaBoost (short for 'Adaptive Boosting') is a Decision Tree-based supervised learning ensemble method. AdaBoost can be used for classification or regression problems. An AdaBoost model includes many Decision Trees that are 'weak learners' (typically each tree has a depth of 1). Unlike a Random Forest, the trees in AdaBoost are trained sequentially, so that examples that were misclassified by previous trees are more heavily weighted when training subsequent trees. This method also helps counteract the 'overfitting' that can take place when using a single Decision Tree.
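
The re-weighting step can be illustrated with the classic binary AdaBoost update. This is a sketch under that assumption only: the helper `adaboost_round` is hypothetical, and the multi-class variants used by scikit-learn (`SAMME`, `SAMME.R`) differ in detail.

```python
import math


def adaboost_round(weights, correct):
    """One round of the classic binary AdaBoost weight update.

    weights : current example weights (summing to 1)
    correct : booleans, True where the weak learner classified the example correctly
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the weak learner
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # The learner's say in the final vote grows as its error shrinks
    alpha = 0.5 * math.log((1 - error) / error)
    # Up-weight misclassified examples, down-weight correct ones, then renormalize
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five equally weighted examples; the weak learner misclassifies only the last one
weights = [0.2] * 5
correct = [True, True, True, True, False]

alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights)
```

After this round the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.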

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

#### XGBoost
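XGBoost (short for 'eXtreme Gradient Boosting') is a Decision Tree-based supervised learning ensemble method that implements gradient boosting. XGBoost can be used for classification or regression problems. As in AdaBoost, the trees are trained sequentially; each new tree is fit to reduce the errors (the gradient of the loss function) of the ensemble built so far, and regularization terms in the training objective help limit overfitting.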

https://xgboost.readthedocs.io/en/latest/

## Import

In [None]:
#!pip install -U fsds_100719
#!pip install imblearn
import warnings
from fsds_100719.imports import *
from tqdm import tqdm_notebook
import pandas_profiling as pp
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, plot_confusion_matrix, roc_auc_score
from sklearn.tree import export_graphviz
from IPython.display import Image
from pydotplus import graph_from_dot_data

# Display all columns of large dataframes
pd.set_option('display.max_columns', 0)

# Ignore warnings
warnings.filterwarnings('ignore')

# Set default plot style & inline plotting
plt.style.use('seaborn-dark')
%matplotlib inline

## Functions

In [None]:
def multi_class_SMOTE(X, y, n, random_state, verbose=1):
    """Using imblearn.over_sampling.SMOTE, performs (n-1) iterations of SMOTE to facilitate creating balanced target classes when multiple classes are present.

    Parameters
    ----------
    X : array-like
        Matrix containing the feature data to be sampled
    y : array-like (1-d)
        Corresponding target labels for each sample in X
    n : int
        Number of unique classes/labels in y
    random_state : int
        Value to set as the random_state for SMOTE function reproducibility
    verbose : int (1 or 2)
        If 1, prints label counts only after final SMOTE iteration
        If 2, prints label counts at each SMOTE iteration (including initial)

    Returns
    ----------
    X_resampled : array-like
        Matrix containing the resampled feature data
    y_resampled : array-like (1-d)
        Corresponding target labels for X_resampled
    """

    from imblearn.over_sampling import SMOTE
    import pandas as pd

    # Initialize a SMOTE object
    smote = SMOTE(random_state=random_state)

    # Output if verbose = 2
    if verbose == 2:
        print(f'Label counts for Original y:\n{pd.Series(y).value_counts()}')

    # Perform SMOTE n-1 times to achieve balanced target classes
    for i in range(n - 1):
        # fit_resample replaces the deprecated fit_sample alias
        X, y = smote.fit_resample(X, y)

        # Print value counts after each step if verbose == 2
        if verbose == 2:
            print(
                f'Label counts after SMOTE # {i+1}:\n{pd.Series(y).value_counts()}')

    # Print final value counts if verbose == 1
    if verbose == 1:
        print(
            f'Label counts after SMOTE # {n-1}:\n{pd.Series(y).value_counts()}')

    X_resampled = X
    y_resampled = y

    return X_resampled, y_resampled


def train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf, multi_class=False):
    """Returns classification accuracy score and ROC AUC score for both train and test data after train-test-split.

    Parameters
    ----------
    X_train : array-like
        Matrix containing the training feature data    
    X_test : array-like
        Matrix containing the testing feature data
    y_train : array-like (1-d)
        Corresponding target labels for each sample in X_train
    y_test : array-like (1-d)
        Corresponding target labels for each sample in X_test
    y_hat_train : array-like (1-d)
        Model predictions for each sample in X_train
    y_hat_test : array-like (1-d)
        Model predictions for each sample in X_test
    clf : Sklearn-type classifier object
        Classifier used to generate model predictions
    multi_class : Bool
        If True, computes AUC for multi-class classification problem

    Returns
    ---------
    None. Prints the accuracy score and ROC AUC score for both training and test data.
    """
    from sklearn.metrics import accuracy_score, roc_auc_score

    test_acc = accuracy_score(y_test, y_hat_test)
    train_acc = accuracy_score(y_train, y_hat_train)

    if multi_class:
        y_score_train = clf.predict_proba(X_train)
        auc_train = roc_auc_score(
            y_train, y_score=y_score_train, multi_class='ovr')

        y_score_test = clf.predict_proba(X_test)
        auc_test = roc_auc_score(
            y_test, y_score=y_score_test, multi_class='ovr')

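    else:
        # Binary case: roc_auc_score expects 1-d scores, so use the
        # predicted probability of the positive class (second column)
        y_score_train = clf.predict_proba(X_train)[:, 1]
        auc_train = roc_auc_score(y_train, y_score=y_score_train)

        y_score_test = clf.predict_proba(X_test)[:, 1]
        auc_test = roc_auc_score(y_test, y_score=y_score_test)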

    print(f'Training Accuracy Score: {round(train_acc,2)}')
    print(f'Training AUC: {round(auc_train,2)}\n')
    print(f'Testing Accuracy Score: {round(test_acc,2)}')
    print(f'Testing AUC: {round(auc_test,2)}')

## Obtain

In [None]:
# load dataset from directory (obtained in separate notebook using api-football calls)

df = pd.read_csv('premier_league.csv')
df.head()

## Scrub/Explore

In [None]:
# Get an idea of datatypes in the dataframe
df.info()

In [None]:
# Designate columns that will not be important for the classification model
to_drop = ['league_id', 'league', 'event_date', 'event_timestamp', 'firstHalfStart', 'secondHalfStart',
           'round', 'status', 'statusShort', 'venue', 'referee', 'homeTeam', 'awayTeam', 'elapsed', 'score']

df = df.drop(to_drop, axis=1)
df.tail()

In [None]:
# Run pandas profiling for initial EDA

pp.ProfileReport(df)

In [None]:
# Missing values for Blocked Shots, Goalkeeper Saves, Offsides, Passes %, Red Cards, Yellow Cards

# Going to fill each (except Passes %) with median value for that column

fill_cols = ['Blocked_Shots', 'Goalkeeper_Saves',
             'Offsides', 'Red_Cards', 'Yellow_Cards']

for col in fill_cols:
    # Assign back to the column instead of calling fillna in place on a selection
    df[col] = df[col].fillna(df[col].median())

df.info()

### Re-cast object features

In [None]:
# Convert team column to binary (0 = Home, 1 = Away)

# Vectorized mapping avoids chained assignment (df['team'][i] = ...)
df['team'] = df['team'].map({'home': 0, 'away': 1}).astype('int64')

In [None]:
# Strip % from Ball Possession and re-cast as a numerical variable

df['Ball_Possession'] = df['Ball_Possession'].str.rstrip('%').astype('int')
df['Ball_Possession']

In [None]:
# Re-calculate Passes (%) as Passes accurate / Total passes to handle missing values

df['Passes_%'] = df['Passes_accurate'] / df['Total_passes']
df['Passes_%']

### Collapse df to one row per match (instead of one row per team per match)

In [None]:
df.info()

In [None]:
# Dataframe for only home team stats

df_home = df.loc[df.team == 0]
df_home.head()

In [None]:
# Rename df_home columns before concatenation

# add_suffix returns a new dataframe, so the original df is not modified
df_home = df_home.add_suffix('_H')

df_home.head()

In [None]:
# Set fixture_id_H as the index of df_home
df_home.set_index('fixture_id_H', inplace=True)

In [None]:
# Dataframe for only away team stats

df_away = df[df.team == 1]
df_away.head()

In [None]:
# Rename df_away columns before concatenation

# add_suffix returns a new dataframe, so the original df is not modified
df_away = df_away.add_suffix('_A')

df_away.head()

In [None]:
# Set fixture_id_A as the index of df_away
df_away.set_index('fixture_id_A', inplace=True)

In [None]:
# Concatenate df_home and df_away dataframes

df_final = pd.concat([df_home, df_away], axis=1)

### Create target variable (class labels)

0 = Home Team Win, 1 = Away Team Win, 2 = Draw
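
The encoding above amounts to a three-way comparison of the final score. A minimal sketch, using a hypothetical helper `match_outcome` and made-up scores:

```python
def match_outcome(home_goals, away_goals):
    """Encode a final score as 0 (home win), 1 (away win), or 2 (draw)."""
    if home_goals > away_goals:
        return 0
    if home_goals < away_goals:
        return 1
    return 2


# Illustrative (home, away) scores only
scores = [(2, 1), (0, 3), (1, 1)]
targets = [match_outcome(home, away) for home, away in scores]
print(targets)  # [0, 1, 2]
```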

In [None]:
# Create target variable column: 0 = Home Team Win, 1 = Away Team Win, 2 = Draw

target = []
for i in range(len(df_final)):
    if df_final['goalsHomeTeam_H'].iloc[i] > df_final['goalsAwayTeam_H'].iloc[i]:
        target.append(0)  # Home team win
    elif df_final['goalsHomeTeam_H'].iloc[i] < df_final['goalsAwayTeam_H'].iloc[i]:
        target.append(1)  # Away team win
    elif df_final['goalsHomeTeam_H'].iloc[i] == df_final['goalsAwayTeam_H'].iloc[i]:
        target.append(2)  # Draw

In [None]:
df_final['target'] = target

In [None]:
df_final.head()

In [None]:
# Feature engineer new column: Ball_Pos_Diff as Ball_Possession_H - Ball_Possession_A
# These two columns are going to be perfectly negatively correlated, so makes sense to collapse them

df_final['Ball_Pos_Diff'] = df_final['Ball_Possession_H'] - \
    df_final['Ball_Possession_A']
df_final.drop(['Ball_Possession_H', 'Ball_Possession_A'], axis=1, inplace=True)

In [None]:
# Final drop of unnecessary columns from df_final (fixture_id remaining as index)

df_final.drop(['team_H', 'goalsHomeTeam_H', 'goalsAwayTeam_H',
               'team_A', 'goalsHomeTeam_A', 'goalsAwayTeam_A'], axis=1, inplace=True)

In [None]:
df_final.head()

In [None]:
# Check class imbalance of dataset

df_final.target.value_counts(normalize=True)

In [None]:
# Create labels dictionary for future use

labels = {'Home Win': 0, 'Away Win': 1, 'Draw': 2}

In [None]:
# Visualize pandas profiling again

pp.ProfileReport(df_final)

In [None]:
df_final.info()

## Models & Interpretations

### Model 1a: Vanilla K Nearest Neighbors (KNN) Classifier

In [None]:
# Separate features and target labels

X = df_final.drop('target', axis=1)
y = df_final['target']

In [None]:
# Perform a train_test split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Scale X data before passing to KNN algorithm

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train.shape, y_train.shape

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

In [None]:
# Fit a vanilla KNN classifier

clf1a = KNeighborsClassifier()
clf1a.fit(X_train, y_train)

y_hat_test = clf1a.predict(X_test)
y_hat_train = clf1a.predict(X_train)

In [None]:
# Check accuracy score and AUC score of vanilla KNN model

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf1a,
                   multi_class=True)

In [None]:
# Check classification report of vanilla KNN model

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for vanilla KNN model

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf1a, X_test, y_test, cmap='Greens', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 1a Interpretation
The vanilla KNN classifier performs poorly for this task, with a test AUC of 0.63 and a test accuracy score of 0.43. This classifier is therefore only slightly better than guessing (which would be 33% accuracy) for this 3-class classification problem. 

The classifier correctly labeled 45% of True Home Wins, 47% of True Away Wins, and only 31% of True Draws.
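
The per-class percentages quoted in these interpretations are per-class recall, which is what the diagonal of a confusion matrix shows when it is normalized over the true labels (`normalize='true'`). A minimal sketch with made-up labels and a hypothetical helper name:

```python
def per_class_recall(y_true, y_pred, classes):
    """Fraction of each true class that the classifier labeled correctly."""
    recall = {}
    for c in classes:
        # Predictions made for the examples whose true label is c
        preds_for_class = [p for t, p in zip(y_true, y_pred) if t == c]
        recall[c] = sum(p == c for p in preds_for_class) / len(preds_for_class)
    return recall


# Illustrative labels only: 0 = Home Win, 1 = Away Win, 2 = Draw
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 2, 1, 2, 2, 0]

recall = per_class_recall(y_true, y_pred, classes=[0, 1, 2])
print(recall)  # {0: 0.75, 1: 0.5, 2: 0.5}
```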

### Model 1b: KNN Classifier with Hyperparameter tuning of k

In [None]:
# Try many values of the n_neighbors parameter to see how train and test AUC change

k_neighbors = range(1, 200)

train_auc_list = []
test_auc_list = []

for i in k_neighbors:
    clf1b = KNeighborsClassifier(n_neighbors=i)
    clf1b.fit(X_train, y_train)
    y_hat_test = clf1b.predict(X_test)
    y_hat_train = clf1b.predict(X_train)

    y_score_train = clf1b.predict_proba(X_train)
    auc_train = roc_auc_score(
        y_train, y_score=y_score_train, multi_class='ovr')
    y_score_test = clf1b.predict_proba(X_test)
    auc_test = roc_auc_score(y_test, y_score=y_score_test, multi_class='ovr')

    train_auc_list.append(auc_train)
    test_auc_list.append(auc_test)

print(train_auc_list)
print(test_auc_list)

In [None]:
# Figure to visualize how train-test AUC change with # of Neighbors in KNN

plt.figure(figsize=(15, 6))
plt.plot(k_neighbors, train_auc_list, label='Train AUC')
plt.plot(k_neighbors, test_auc_list, label='Test AUC')
plt.legend()
plt.xlabel('# of Neighbors Considered in KNN Classifier')
plt.ylabel('AUC Score')
plt.show()

The test AUC appears to stop improving at approximately k = 150.
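
Reading the plateau off a plot can also be expressed as a simple rule: take the smallest k whose AUC is within a small tolerance of the best AUC observed. The sketch below assumes that rule; the helper name, the tolerance, and the AUC values are all hypothetical and are not results from this notebook.

```python
def smallest_k_near_best(k_values, auc_values, tolerance=0.005):
    """Return the smallest k whose AUC is within `tolerance` of the maximum AUC."""
    best = max(auc_values)
    for k, auc in zip(k_values, auc_values):
        if auc >= best - tolerance:
            return k


# Hypothetical AUC scores for a handful of k values (k_values must be ascending)
k_values = [1, 5, 10, 20, 40]
auc_values = [0.60, 0.66, 0.70, 0.71, 0.71]

chosen_k = smallest_k_near_best(k_values, auc_values)
print(chosen_k)  # 20
```

In practice this rule would be applied to cross-validated scores on the training data, so that the test set is kept for the final evaluation only.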

In [None]:
# Fit a TUNED KNN classifier (n_neighbors = 150)

clf1b = KNeighborsClassifier(n_neighbors=150)
clf1b.fit(X_train, y_train)

y_hat_test = clf1b.predict(X_test)
y_hat_train = clf1b.predict(X_train)

In [None]:
# Check accuracy score and AUC score of TUNED KNN model

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf1b,
                   multi_class=True)

In [None]:
# Check classification report for TUNED KNN model

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for TUNED KNN model

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf1b, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 1b Interpretation
After tuning the number of nearest neighbors (n_neighbors) to 150, the test accuracy and test AUC are slightly improved to 0.48 and 0.72, respectively.  

The classifier correctly labeled 36% of True Home Wins, 62% of True Away Wins, and 57% of True Draws. 

Interestingly, the tuned KNN classifier is significantly worse at correctly labeling True Home Wins than the vanilla KNN classifier.
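
The AUC scores reported for these three-class models come from `roc_auc_score` with `multi_class='ovr'`, which treats each class in turn as 'positive' against the rest and, by default, averages the per-class AUCs without weighting. The pure-Python sketch below illustrates that calculation on made-up probabilities; the helper names are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos for sn in neg
    )
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba, classes):
    """Unweighted mean of one-vs-rest AUCs, one per class."""
    aucs = []
    for idx, c in enumerate(classes):
        class_scores = [row[idx] for row in proba]
        aucs.append(binary_auc(class_scores, [t == c for t in y_true]))
    return sum(aucs) / len(aucs)


# Illustrative labels and predicted probabilities (columns = classes 0, 1, 2)
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]

auc = ovr_macro_auc(y_true, proba, classes=[0, 1, 2])
print(round(auc, 3))
```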

### Model 2a: Vanilla Decision Tree Classifier

In [None]:
# Perform a train_test split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

In [None]:
# Fit a vanilla Decision Tree classifier

clf2a = DecisionTreeClassifier(random_state=42)
clf2a.fit(X_train, y_train)

y_hat_test = clf2a.predict(X_test)
y_hat_train = clf2a.predict(X_train)

In [None]:
# Check accuracy and AUC of vanilla Decision Tree classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf2a,
                   multi_class=True)

In [None]:
# Check classification report of vanilla Decision Tree classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for vanilla Decision Tree Classifier

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf2a, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 2a Interpretation
The vanilla Decision Tree classifier performs much better at this task than KNN, with a test prediction accuracy of 0.59 and a test AUC of 0.67.

The classifier correctly labeled 67% of True Home Wins, 66% of True Away Wins, and 32% of True Draws.

The classifier clearly has difficulty correctly predicting Draws, and often incorrectly predicts other results (Home Wins and Away Wins) as Draws.

### Model 2b: Decision Tree with Hyperparameter Tuning

In [None]:
# Perform a train_test split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

In [None]:
# Instantiate initial DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)

In [None]:
# # Parameter grid for search (uncomment to run)

# dt_param_grid = {
#     'criterion': ['gini', 'entropy'],
#     'max_depth': [None, 2, 3, 4, 5, 6, 8, 10, 12, 15],
#     'min_samples_split': [2, 5, 10, 20, 30],
#     'min_samples_leaf': [1, 2, 3, 4, 5, 6, 8, 10, 12],
#     'max_features': [None, 1, 2, 5, 10, 15, 20, 30],
#     'max_leaf_nodes': [None, 5, 10, 20]
# }

# # Instantiate GridSearchCV
# dt_grid_search = GridSearchCV(dt_clf, dt_param_grid, cv=3, return_train_score=True, verbose=1)

# # Fit to the data
# dt_grid_search.fit(X_train, y_train)

In [None]:
# Visualize best parameters
dt_grid_search.best_params_

In [None]:
# Fit a Decision Tree classifier using best_params from grid search

clf2b = DecisionTreeClassifier(**dt_grid_search.best_params_,
                               random_state=42)

clf2b.fit(X_train, y_train)

y_hat_test = clf2b.predict(X_test)
y_hat_train = clf2b.predict(X_train)

In [None]:
# Check accuracy and AUC of TUNED Decision Tree classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf2b,
                   multi_class=True)

In [None]:
# Check classification report of TUNED Decision Tree classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for tuned Decision Tree classifier

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf2b, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 2b Interpretation
Hyperparameter tuning of the Decision Tree using GridSearchCV did not improve the classifier's test prediction accuracy (0.59), but did slightly improve the AUC score, from 0.67 (vanilla) to 0.73 (tuned).
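
For a sense of scale, an exhaustive grid search fits one model per parameter combination per cross-validation fold. The sketch below counts the candidates in the `dt_param_grid` defined above using only the standard library; the variable names other than `dt_param_grid` are hypothetical.

```python
from itertools import product

# Same grid as dt_param_grid in the Decision Tree search above
dt_param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 3, 4, 5, 6, 8, 10, 12, 15],
    'min_samples_split': [2, 5, 10, 20, 30],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 8, 10, 12],
    'max_features': [None, 1, 2, 5, 10, 15, 20, 30],
    'max_leaf_nodes': [None, 5, 10, 20],
}

# Number of candidate settings = product of the list lengths
n_combinations = 1
for values in dt_param_grid.values():
    n_combinations *= len(values)

cv_folds = 3
total_fits = n_combinations * cv_folds

# The first candidate, built the same way an exhaustive search enumerates the grid
first_candidate = dict(zip(dt_param_grid, next(product(*dt_param_grid.values()))))

print(n_combinations, total_fits)  # 28800 86400
print(first_candidate)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.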

The classifier correctly labeled 65% of True Home Wins, 69% of True Away Wins, and 34% of True Draws.

### Model 3a: Vanilla Random Forest Classifier

In [None]:
# Perform a train_test_split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

In [None]:
# Fit a vanilla Random Forest classifier

clf3a = RandomForestClassifier(random_state=42)
clf3a.fit(X_train, y_train)

y_hat_test = clf3a.predict(X_test)
y_hat_train = clf3a.predict(X_train)

In [None]:
# Check accuracy and AUC of vanilla Random Forest classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf3a,
                   multi_class=True)

In [None]:
# Check classification report of vanilla Random Forest classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for vanilla Random Forest classifier

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf3a, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 3a Interpretation
The vanilla Random Forest Classifier increased overall test prediction accuracy to 0.63, and improved the test AUC score to 0.77. This is a significant performance increase over both the KNN and Decision Tree classifiers.

The classifier correctly labeled 76% of True Home Wins, 70% of True Away Wins, and 25% of True Draws.

### Model 3b: Random Forest with Hyperparameter Tuning

In [None]:
# Perform a train_test_split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

In [None]:
# Instantiate Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

In [None]:
# # Parameter grid for Random Forest search (uncomment to run)

# rf_param_grid = {'criterion': ['gini', 'entropy'],
#  'max_depth': [5, 10],
#  'max_features': [None, 15, 20],
#  'max_leaf_nodes': [None, 5],
#  'min_samples_leaf': [1, 2, 5],
#  'min_samples_split': [2, 5, 10]}

# # Instantiate GridSearchCV
# rf_grid_search = GridSearchCV(rf, rf_param_grid, cv=3, return_train_score=True, verbose=1)

# # Fit to the data
# rf_grid_search.fit(X_train, y_train)

In [None]:
# Visualize best parameters

rf_grid_search.best_params_

In [None]:
# Fit a Random Forest classifier using best_params

clf3b = RandomForestClassifier(**rf_grid_search.best_params_,
                               random_state=42)
clf3b.fit(X_train, y_train)

y_hat_test = clf3b.predict(X_test)
y_hat_train = clf3b.predict(X_train)

In [None]:
# Check accuracy and AUC of TUNED Random Forest classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf3b,
                   multi_class=True)

In [None]:
# Check classification report of TUNED Random Forest classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for TUNED Random Forest classifier

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf3b, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 3b Interpretation
The tuned Random Forest Classifier increased overall test prediction accuracy to 0.71, and improved the test AUC score to 0.83. This is a significant performance increase over the vanilla Random Forest Classifier.

The classifier correctly labeled 83% of True Home Wins, 77% of True Away Wins, and 36% of True Draws.

### Model 4a: Vanilla AdaBoost Classifier

In [None]:
# Perform a train_test split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

In [None]:
# Fit a vanilla AdaBoost classifier

clf4a = AdaBoostClassifier(random_state=42)
clf4a.fit(X_train, y_train)

y_hat_test = clf4a.predict(X_test)
y_hat_train = clf4a.predict(X_train)

In [None]:
# Check accuracy and AUC of vanilla AdaBoost classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf4a,
                   multi_class=True)

In [None]:
# Check classification report of vanilla AdaBoost classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for vanilla AdaBoost classifier

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf4a, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 4a Interpretation
The vanilla AdaBoost Classifier slightly increased overall test prediction accuracy to 0.74, and improved the test AUC score to 0.84. This is a slight performance increase over the tuned Random Forest Classifier.

The classifier correctly labeled 79% of True Home Wins, 71% of True Away Wins, and 66% of True Draws.

The vanilla AdaBoost Classifier is significantly better than previous models at correctly predicting True Draws.

### Model 4b: AdaBoost with Hyperparameter Tuning

In [None]:
# Perform a train_test split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

In [None]:
# Instantiate Adaboost Classifier

ada = AdaBoostClassifier(random_state=42)

In [None]:
# # Parameter grid for AdaBoost search (uncomment to run)

# ada_param_grid = {'n_estimators': [50, 75, 100, 125],
#                  'learning_rate': [0.1, 0.5, 1.0, 1.5, 3],
#                  'algorithm': ['SAMME', 'SAMME.R']}

# # Instantiate GridSearchCV
# ada_grid_search = GridSearchCV(ada, ada_param_grid, cv=3, scoring='accuracy', verbose=2)

# # Fit to the data
# ada_grid_search.fit(X_train, y_train)

In [None]:
# Visualize best parameters

ada_grid_search.best_params_

In [None]:
# Fit an AdaBoost Classifier using best_params

clf4b = AdaBoostClassifier(**ada_grid_search.best_params_,
                           random_state=42)

clf4b.fit(X_train, y_train)

y_hat_test = clf4b.predict(X_test)
y_hat_train = clf4b.predict(X_train)

In [None]:
# Check accuracy and AUC of TUNED AdaBoost classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf4b,
                   multi_class=True)

In [None]:
# Check classification report of TUNED AdaBoost classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for TUNED AdaBoost classifier

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf4b, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 4b Interpretation
The tuned AdaBoost Classifier performance is almost identical to, but slightly lower than, the vanilla AdaBoost Classifier, with a test accuracy of 0.73 and a test AUC of 0.82.

The classifier correctly labeled 77% of True Home Wins, 70% of True Away Wins, and 69% of True Draws.

The AdaBoost classifier clearly does not benefit very much (if at all) from hyperparameter tuning in this case.

### Model 5a: Vanilla XGBoost Classifier

In [None]:
# Perform a train_test split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

# Must convert X_train back to df to use in XGBoost

X_train = pd.DataFrame(X_train, columns=X.columns)

In [None]:
# Fit a vanilla XGBoost classifier

clf5a = XGBClassifier(random_state=42)
clf5a.fit(X_train, y_train)

y_hat_test = clf5a.predict(X_test)
y_hat_train = clf5a.predict(X_train)

In [None]:
# Check accuracy and AUC of vanilla XGBoost classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf5a,
                   multi_class=True)

In [None]:
# Check classification report of vanilla XGBoost classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for vanilla XGBoost classifier

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf5a, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 5a Interpretation
The vanilla XGBoost Classifier performance is generally a significant improvement over previous models, with a test accuracy of 0.79 and a test AUC of 0.90.

The classifier correctly labeled 90% of True Home Wins, 88% of True Away Wins, and 44% of True Draws.

### Model 5b: XGBoost with Hyperparameter Tuning

In [None]:
# Perform a train_test split on the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

In [None]:
# Perform Synthetic Minority Over-Sampling Technique (SMOTE) to achieve class balance

X_train, y_train = multi_class_SMOTE(
    X_train, y_train, n=3, random_state=42, verbose=2)

# Must convert X_train back to df to use in XGBoost

X_train = pd.DataFrame(X_train, columns=X.columns)

In [None]:
# Instantiate XGBoost Classifier

xg = XGBClassifier(random_state=42)

In [None]:
# # Parameter grid for XGBoost search (uncomment to run)

# xg_param_grid = {
#     'learning_rate': [0.01, 0.1, 1],
#     'max_depth': [5, 7, 9, 12],
#     'min_child_weight': [1, 2, 3],
#     'n_estimators': [50, 100, 125],
#     'subsample': [0.5, 0.75, 1.0]
# }

# # Instantiate GridSearchCV
# xg_grid_search = GridSearchCV(
#     xg, xg_param_grid, scoring='accuracy', verbose=2)

# # Fit to the data
# xg_grid_search.fit(X_train, y_train)

In [None]:
# Visualize best parameters

xg_grid_search.best_params_

In [None]:
# Fit an XGBoost Classifier using best_params

clf5b = XGBClassifier(**xg_grid_search.best_params_,
                      random_state=42)

clf5b.fit(X_train, y_train)

y_hat_test = clf5b.predict(X_test)
y_hat_train = clf5b.predict(X_train)

In [None]:
# Check accuracy and AUC of TUNED XGBoost classifier

train_test_acc_auc(X_train, X_test, y_train, y_test, y_hat_train, y_hat_test, clf=clf5b,
                   multi_class=True)

In [None]:
# Check classification report of TUNED XGBoost classifier

print(classification_report(y_test, y_hat_test))

In [None]:
# Plot confusion matrix for tuned XGBoost model

fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(clf5b, X_test, y_test, cmap='Blues', display_labels=labels.keys(),
                      normalize='true', ax=ax)

#### Model 5b Interpretation
The tuned XGBoost Classifier performance is almost identical to, but slightly lower than, the vanilla XGBoost Classifier, with a test accuracy of 0.78 and a test AUC of 0.88.

The classifier correctly labeled 90% of True Home Wins, 92% of True Away Wins, and 32% of True Draws.

The XGBoost classifier clearly does not benefit very much (if at all) from hyperparameter tuning in this case.

## Overall Interpretations

### Comparison of Classifier Performance

|Classifier              |Train Accuracy|Train AUC|Test Accuracy|Test AUC|
|------------------------|--------------|---------|-------------|--------|
|KNN (vanilla)           |0.79          |0.94     |0.43         |0.63    |
|KNN (tuned)             |0.53          |0.75     |0.48         |0.72    |
|Decision Tree (vanilla) |1.00          |1.00     |0.59         |0.67    |
|Decision Tree (tuned)   |0.89          |0.98     |0.59         |0.73    |
|Random Forest (vanilla) |1.00          |1.00     |0.63         |0.77    |
|Random Forest (tuned)   |1.00          |1.00     |0.71         |0.83    |
|AdaBoost (vanilla)      |0.76          |0.88     |0.74         |0.84    |
|AdaBoost (tuned)        |0.81          |0.89     |0.73         |0.82    |
|XGBoost (vanilla)       |0.93          |0.99     |0.79         |0.90    |
|XGBoost (tuned)         |1.00          |1.00     |0.78         |0.88    |
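
One way to read the table is to look at the gap between train and test accuracy, since a large gap suggests overfitting. The sketch below computes that gap from the accuracy columns of the table; the variable names are hypothetical.

```python
# (train accuracy, test accuracy) copied from the comparison table
results = {
    'KNN (vanilla)': (0.79, 0.43),
    'KNN (tuned)': (0.53, 0.48),
    'Decision Tree (vanilla)': (1.0, 0.59),
    'Decision Tree (tuned)': (0.89, 0.59),
    'Random Forest (vanilla)': (1.0, 0.63),
    'Random Forest (tuned)': (1.0, 0.71),
    'AdaBoost (vanilla)': (0.76, 0.74),
    'AdaBoost (tuned)': (0.81, 0.73),
    'XGBoost (vanilla)': (0.93, 0.79),
    'XGBoost (tuned)': (1.0, 0.78),
}

# Train-minus-test accuracy for each classifier, rounded to two decimals
gaps = {name: round(train - test, 2) for name, (train, test) in results.items()}

smallest_gap = min(gaps, key=gaps.get)
largest_gap = max(gaps, key=gaps.get)

print(smallest_gap, gaps[smallest_gap])  # AdaBoost (vanilla) 0.02
print(largest_gap, gaps[largest_gap])    # Decision Tree (vanilla) 0.41
```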