# March Machine Learning Madness 2024: Modeling
### Predicting NCAA Basketball Tournament Results
##### From the [Kaggle Competition: "March Machine Learning Mania 2024"](https://www.kaggle.com/competitions/march-machine-learning-mania-2024/overview)
##### By David Hartsman

### Overview:
In the previous notebook, I prepared data by selecting elements from over 30 different source files. In this notebook, I will use that data to train models that predict the outcomes of NCAA Tournament games. I have created four target variables:
- Binary W/L Target
- Continuous Team_A Pts Scored
- Continuous Team_B Pts Scored
- Continuous Team_A - Team_B Pts Differential
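
As a rough illustration of how these four targets relate to one another, the sketch below derives the differential and the binary target from a pair of score columns. `Team_A_Win` and `Game_Point_Differential` are column names from the modeling data; `Team_A_Score` and `Team_B_Score` are hypothetical names, and the scores themselves are invented purely for illustration.

```python
import pandas as pd

# Invented scores for five games (illustration only; not taken from the data set).
# "Team_A_Score" and "Team_B_Score" are hypothetical column names.
games = pd.DataFrame(
    {
        "Team_A_Score": [84, 80, 86, 71, 74],
        "Team_B_Score": [92, 51, 73, 77, 76],
    }
)

# Continuous target: Team_A points minus Team_B points
games["Game_Point_Differential"] = games["Team_A_Score"] - games["Team_B_Score"]

# Binary target: 1 if Team_A won, 0 otherwise
games["Team_A_Win"] = (games["Game_Point_Differential"] > 0).astype(int)

print(games)
```

Under this convention, the binary target is fully determined by the sign of the differential, so a regression model for the differential also implies a win/loss prediction.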

I will begin this notebook by attempting to build a successful classification model using algorithms from the scikit-learn library. Depending on those results, I may move on and attempt to develop accurate regression models for the continuous targets as well.

<hr style="border: 4px solid blue">

In [4]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import os
import warnings
warnings.filterwarnings("ignore")

# Sklearn Accessories
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, KFold, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss, auc, roc_curve, \
roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector as selector

# Model Types
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier, StackingClassifier,\
GradientBoostingClassifier, AdaBoostClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier

import xgboost
from xgboost import XGBClassifier


pd.set_option("display.max_rows", 25)
pd.set_option('display.max_columns', 50)
sns.set_style("darkgrid")

<div class="alert alert-block alert-info" style="font-size: 2em;">
<b>Model Creation:</b> Binary Classification
</div>
After loading in the data, I will take a few final steps to prepare the data for modeling. I will select the features that I would like to use. As a starting point, I will opt to use as many features as possible, and subsequently weed out any features that may not be contributing much signal. 

I will also be creating "dummy" predictions. These predictions will be created by:

- A.) Choosing the team with the lower (that is, better) "Chalk_Seed" as the winner, and
- B.) If the seeds are the same, choosing the team with the better winning percentage (Team_B if the winning percentages are also equal)

These predictions are equivalent to picking *"chalk"*, that is, the on-paper favorite in every match-up, without the benefit of any additional analysis. After that, I will perform a train-test split and create an sklearn pipeline to handle pre-modeling transformations such as *scaling* and *one-hot encoding*. Finally, I will use a class that I have created to store the results and metrics of different model iterations.
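
To make the planned workflow concrete, here is a minimal sketch of a train-test split followed by a preprocessing pipeline that scales numeric columns and one-hot encodes categorical ones. It is not the pipeline used later in this notebook: the data is synthetic, only three feature columns are used, and the split size, random seeds, median imputation, and the choice of `LogisticRegression` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the modeling data (values are random, not real games)
rng = np.random.default_rng(0)
n_games = 200
toy = pd.DataFrame(
    {
        "Team_A_win_perc": rng.uniform(0.3, 0.95, n_games),
        "Team_B_win_perc": rng.uniform(0.3, 0.95, n_games),
        "Team_B_WinTrend": rng.choice(["Uptrend", "Downtrend"], n_games),
    }
)
# Toy target: Team_A "wins" whenever it has the better winning percentage
toy["Team_A_Win"] = (toy["Team_A_win_perc"] > toy["Team_B_win_perc"]).astype(int)

X = toy.drop(columns="Team_A_Win")
y = toy["Team_A_Win"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Numeric columns: fill missing values with the median, then standardize
numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)

# Route columns by dtype: numbers are scaled, everything else is one-hot encoded
preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, selector(dtype_include="number")),
        ("cat", OneHotEncoder(handle_unknown="ignore"), selector(dtype_exclude="number")),
    ]
)

model = Pipeline(
    [("preprocess", preprocessor), ("clf", LogisticRegression(max_iter=1000))]
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
n_features = model.named_steps["preprocess"].transform(X_test).shape[1]
print(f"Features after preprocessing: {n_features}")
print(f"Toy test accuracy: {test_accuracy:.3f}")
```

Wrapping the transformers and the estimator in a single `Pipeline` means the scaler and encoder are fit on the training split only, which helps avoid leaking information from the test set.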

In [6]:
# Load in data

# Create a path reference for easier readability 
path = '/Users/samalainabayeva/Desktop/FLAT_IRON!!!/NCAA_KAGGLE/march-machine-learning-mania-2024/'

# Load and inspect the data
df = pd.read_csv(os.path.join(path, 'Data_For_Modeling.csv'), index_col=0)
df.head()

Unnamed: 0,DayNum,League,Season,Team_A_Avg3ptAtt,Team_A_Avg3ptMade,Team_A_AvgFGAtt,Team_A_AvgFGMade,Team_A_AvgOppScore,Team_A_AvgPtDiff,Team_A_AvgTeamScore,Team_A_Avg_Assts,Team_A_Avg_Blocks,Team_A_Avg_Def_Rebs,Team_A_Avg_FT_Att,Team_A_Avg_FT_Made,Team_A_Avg_Fouls,Team_A_Avg_Off_Rebs,Team_A_Avg_Steals,Team_A_Avg_TO,Team_A_BestRanking,Team_A_Chalk_Seed,Team_A_CloseGames,Team_A_CloseWins,Team_A_Coach,Team_A_Conference,...,Team_B_StdDevOppScore,Team_B_StdDevTeamScore,Team_B_StdPtDiff,Team_B_TeamID,Team_B_Total3ptAtt,Team_B_Total3ptMade,Team_B_TotalFGAtt,Team_B_TotalFGMade,Team_B_Total_Assts,Team_B_Total_Def_Rebs,Team_B_Total_FT_Att,Team_B_Total_FT_Made,Team_B_Total_Off_Rebs,Team_B_Total_TO,Team_B_WinTrend,Team_B_Win_Total,Team_B_WorstRanking,Team_B_close_game_win_perc,Team_B_home_win_perc,Team_B_neutral_win_perc,Team_B_ot_win_perc,Team_B_road_win_perc,Team_B_win_perc,Team_A_Win,Game_Point_Differential
1136,134,Men,2003,18.5,5.933333,55.266667,24.733333,70.833333,1.966667,72.8,14.2,2.233333,24.8,28.066667,17.4,18.3,13.166667,6.433333,15.233333,233.0,16.0,14.0,9.0,,swac,...,13.064637,11.638542,17.508478,1421,522.0,188.0,1647.0,707.0,378.0,672.0,607.0,463.0,356.0,470.0,Downtrend,13.0,309.0,0.7,0.833333,1.0,1.0,0.125,0.448276,0,-8
1137,136,Men,2003,20.071429,7.035714,65.714286,30.321429,70.25,14.964286,85.214286,17.642857,4.214286,27.642857,25.0,17.535714,17.75,15.178571,8.464286,14.785714,1.0,1.0,5.0,3.0,,pac_ten,...,10.384481,11.995176,12.601802,1436,449.0,153.0,1620.0,720.0,412.0,746.0,567.0,373.0,376.0,408.0,Uptrend,19.0,191.0,0.428571,0.9,0.6,0.0,0.5,0.655172,1,29
1138,136,Men,2003,12.586207,4.0,56.896552,27.206897,69.172414,6.793103,75.965517,15.551724,4.241379,23.310345,26.206897,17.551724,19.413793,13.689655,5.206897,14.0,53.0,10.0,8.0,4.0,,pac_ten,...,11.354889,8.724435,10.964565,1272,582.0,203.0,1740.0,762.0,482.0,753.0,664.0,434.0,408.0,400.0,Uptrend,23.0,29.0,0.5,0.875,1.0,0.0,0.636364,0.793103,1,13
1139,136,Men,2003,20.484848,7.969697,57.454545,28.69697,64.333333,14.909091,79.242424,16.818182,4.454545,23.181818,20.030303,13.878788,17.272727,10.878788,8.393939,13.363636,18.0,6.0,7.0,6.0,,mvc,...,10.168773,13.102006,13.167451,1141,520.0,198.0,1528.0,772.0,453.0,675.0,730.0,559.0,307.0,529.0,Uptrend,23.0,102.0,0.8,0.916667,1.0,,0.642857,0.793103,0,-6
1140,136,Men,2003,22.5,7.966667,53.333333,24.333333,68.0,4.4,72.4,14.666667,3.066667,22.033333,20.466667,15.766667,18.666667,9.733333,7.766667,14.2,51.0,9.0,4.0,2.0,,acc,...,10.411721,7.75804,10.666692,1143,494.0,186.0,1703.0,793.0,464.0,707.0,566.0,388.0,326.0,411.0,Downtrend,21.0,33.0,0.571429,0.928571,0.25,0.333333,0.636364,0.724138,0,-2


In [13]:
df[["Team_A_Chalk_Seed", 'Team_B_Chalk_Seed', "Team_A_win_perc", 'Team_B_win_perc']].head()

Unnamed: 0,Team_A_Chalk_Seed,Team_B_Chalk_Seed,Team_A_win_perc,Team_B_win_perc
1136,16.0,16.0,0.6,0.448276
1137,1.0,16.0,0.892857,0.655172
1138,10.0,7.0,0.62069,0.793103
1139,6.0,11.0,0.878788,0.793103
1140,9.0,8.0,0.6,0.724138


In [14]:
# Creating dummy ("chalk") predictions: 1 = Team_A predicted to win, 0 = Team_B
seed_a, seed_b = df["Team_A_Chalk_Seed"], df["Team_B_Chalk_Seed"]

# Tie-breaker for equal seeds: better winning percentage (exact ties go to Team_B)
tie_break = (df["Team_A_win_perc"] > df["Team_B_win_perc"]).astype(int)

# Lower seed number wins; otherwise fall back to the tie-breaker
dummy_preds = np.where(seed_a < seed_b, 1, np.where(seed_a > seed_b, 0, tie_break))

In [15]:
len(dummy_preds)

2583

In [16]:
df["Dummy_Predictions"] = dummy_preds

In [20]:
df[["Team_A_Chalk_Seed", 'Team_B_Chalk_Seed', "Team_A_win_perc", 'Team_B_win_perc', "Dummy_Predictions"]]\
.query("Team_A_Chalk_Seed == Team_B_Chalk_Seed")["Dummy_Predictions"].value_counts()

Dummy_Predictions
1    53
0    41
Name: count, dtype: int64
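
With the dummy predictions stored, the chalk baseline can be scored against `Team_A_Win`. The sketch below applies the same rules to only the five rows shown by `df.head()` above, so the resulting figure describes that tiny sample and says nothing about the baseline's accuracy on the full data set. The helper name `chalk_predictions` is hypothetical.

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head() above (seeds, win percentages, outcome)
sample = pd.DataFrame(
    {
        "Team_A_Chalk_Seed": [16.0, 1.0, 10.0, 6.0, 9.0],
        "Team_B_Chalk_Seed": [16.0, 16.0, 7.0, 11.0, 8.0],
        "Team_A_win_perc": [0.6, 0.892857, 0.62069, 0.878788, 0.6],
        "Team_B_win_perc": [0.448276, 0.655172, 0.793103, 0.793103, 0.724138],
        "Team_A_Win": [0, 1, 1, 0, 0],
    },
    index=[1136, 1137, 1138, 1139, 1140],
)


def chalk_predictions(frame):
    """Pick the lower seed; on equal seeds, pick the better winning percentage."""
    seed_a, seed_b = frame["Team_A_Chalk_Seed"], frame["Team_B_Chalk_Seed"]
    tie_break = (frame["Team_A_win_perc"] > frame["Team_B_win_perc"]).astype(int)
    preds = np.where(seed_a < seed_b, 1, np.where(seed_a > seed_b, 0, tie_break))
    return pd.Series(preds, index=frame.index, name="Dummy_Predictions")


sample["Dummy_Predictions"] = chalk_predictions(sample)

# Share of games where the chalk pick matches the actual result
baseline_accuracy = (sample["Dummy_Predictions"] == sample["Team_A_Win"]).mean()
print(f"Chalk accuracy on the 5-row sample: {baseline_accuracy:.2f}")
```

Running the same comparison on the full `df` (for example with `accuracy_score(df["Team_A_Win"], df["Dummy_Predictions"])`) gives the number that any trained model should aim to beat.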