# March Machine Learning Madness 2024
### Predicting NCAA Basketball Tournament Results
##### From the [Kaggle Competition: "March Machine Learning Mania 2024"](https://www.kaggle.com/competitions/march-machine-learning-mania-2024/overview)
##### By David Hartsman

<div class="alert alert-block alert-info" style="font-size: 1em; background-color:blue; color:white;">
<b>Generating the 2024 Men's Bracket:</b>
</div>

### Overview

In this notebook, I will load in the model that I trained to predict Men's NCAA Tournament games and select this year's bracket. In order to accomplish that goal, I will need to import some of the previously used files, as well as the previously unused season data for Men's 2024 teams. The data preparation requires aggregation. After aggregating the season data, several Net/Off/Def-Rating features will need to be created. Creating appropriately composed rows of data for 1st round match-ups will require some additional work as well. The rows will initially contain aggregate data, but that data will need to be used to generate the differentials between Team_A and Team_B. Some of these differential terms will also be used in creating 7 additional interaction terms. 

With the data formatted correctly for the model to generate predictions, I will predict a winner of each first round match-up. Then, this whole process of data preparation will be repeated for each subsequent round of **predicted match-ups** until I crown a new (predicted) NCAA Men's Basketball Champion. 
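
As a rough illustration of the row layout described above, the sketch below builds one match-up row from two toy teams, then derives the differentials and one interaction term. The statistics and values are invented for the example; only the column-naming pattern (`Team_A_`, `Team_B_`, `_diff`, `Int_..._x_...`) follows the notebook.

```python
import pandas as pd

# Toy season aggregates for two hypothetical teams (values are illustrative only)
team_a = pd.DataFrame({"AvgPtDiff": [12.0], "Avg_Assts": [16.0], "Avg_Steals": [7.5]})
team_b = pd.DataFrame({"AvgPtDiff": [4.0], "Avg_Assts": [13.0], "Avg_Steals": [6.0]})

# One row per match-up: prefix each team's columns, then place them side by side
matchup = pd.concat([team_a.add_prefix("Team_A_"), team_b.add_prefix("Team_B_")], axis=1)

# Differentials: Team_A statistic minus the same Team_B statistic
for stat in team_a.columns:
    matchup[f"{stat}_diff"] = matchup[f"Team_A_{stat}"] - matchup[f"Team_B_{stat}"]

# Interaction term: the product of two differentials
matchup["Int_Avg_Assts_diff_x_Avg_Steals_diff"] = (
    matchup["Avg_Assts_diff"] * matchup["Avg_Steals_diff"]
)

print(matchup.filter(like="_diff").T)
```

The same pattern is applied to every numeric column inside the bracket-generation function later in the notebook.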

In [11]:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from gc import collect
import os
import joblib
import sys
from tqdm import tqdm

# display 100 rows and 100 columns
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

# global random seed
SEED = 0

# set numpy seed
np.random.seed(SEED)

### Loading in the Data

In [13]:
# For concise directory info
path = '/Users/samalainabayeva/Desktop/FLAT_IRON!!!/NCAA_KAGGLE/march-machine-learning-mania-2024/'

In [338]:
# 2024 Tournament Seed Info

seeds = pd.read_csv(os.path.join(path, "2024_tourney_seeds.csv"))
seeds.head()

Unnamed: 0,Tournament,Seed,TeamID
0,M,W01,1163
1,M,W02,1235
2,M,W03,1228
3,M,W04,1120
4,M,W05,1361


In [306]:
# This df contains aggregated data through the 2024 season

df = pd.read_csv(os.path.join(path, "Aggregated_Season_Data.csv"), index_col=0)
df.head()

Unnamed: 0,Season,TeamID,AvgTeamScore,StdDevTeamScore,AvgOppScore,StdDevOppScore,AvgFGMade,AvgFGAtt,TotalFGMade,TotalFGAtt,Avg3ptMade,Avg3ptAtt,Total3ptMade,Total3ptAtt,Avg_FT_Made,Avg_FT_Att,Total_FT_Made,Total_FT_Att,Avg_Off_Rebs,Total_Off_Rebs,Avg_Def_Rebs,Total_Def_Rebs,Avg_Assts,Total_Assts,Avg_TO,Total_TO,Avg_Steals,Avg_Blocks,Avg_Fouls,OppAvgFGMade,OppAvgFGAtt,OppTotalFGMade,OppTotalFGAtt,OppAvg3ptMade,OppAvg3ptAtt,OppTotal3ptMade,OppTotal3ptAtt,OppAvg_FT_Made,OppAvg_FT_Att,OppTotal_FT_Made,OppTotal_FT_Att,OppAvg_Off_Rebs,OppTotal_Off_Rebs,OppAvg_Def_Rebs,OppTotal_Def_Rebs,OppAvg_Assts,OppTotal_Assts,OppAvg_TO,OppTotal_TO,OppAvg_Steals,OppAvg_Blocks,OppAvg_Fouls,AvgPtDiff,MedPtDiff,StdPtDiff,Win_Total,Loss_Total,HomeWins,HomeLoss,RoadWins,RoadLoss,NeutralWins,NeutralLoss,OTWins,OTLoss,CloseGames,CloseWins,MaxWStreak,MaxLStreak,LastTenWinPerc,LastFiveWinPerc,WinTrend,LastTenPtDiff,LastFivePtDiff,DiffTrend,Conference,Coach,MedianRanking,BestRanking,WorstRanking,Chalk_Seed,Seed,win_perc,home_win_perc,road_win_perc,neutral_win_perc,ot_win_perc,close_game_win_perc,Relative_Diff_Trend,Relative_Win_Trend
0,2003,1102,57.25,13.892777,57.0,12.232319,19.142857,39.785714,536,1114,7.821429,20.821429,219,583,11.142857,17.107143,312,479,4.178571,117,16.821429,471,13.0,364,11.428571,320,5.964286,1.785714,18.75,19.285714,42.428571,540,1188,4.75,12.428571,133,348,13.678571,19.25,383,539,9.607143,269,20.142857,564,9.142857,256,12.964286,363,5.428571,1.571429,18.357143,0.25,-3.0,16.16953,12,16,9,4,3,10,0,2,0,0,8,3,4,5,0.472281,0.438885,Downtrend,-8.0,-3.6,Uptrend,mwc,,158.0,144.0,169.0,,,0.428571,0.692308,0.230769,0.0,,0.375,Trending Up,Trending Down
1,2003,1103,78.777778,15.272734,78.148148,12.790332,27.148148,55.851852,733,1508,5.444444,16.074074,147,434,19.037037,25.851852,514,698,9.777778,264,19.925926,538,15.222222,411,12.62963,341,7.259259,2.333333,19.851852,27.777778,57.0,750,1539,6.666667,18.37037,180,496,15.925926,22.148148,430,598,12.037037,325,22.037037,595,15.481481,418,15.333333,414,6.407407,2.851852,22.444444,0.62963,-2.0,11.337983,13,14,9,5,4,9,0,0,3,1,12,6,4,4,0.478654,0.495948,Uptrend,0.6,3.6,Uptrend,mac,,166.5,152.0,171.0,,,0.481481,0.642857,0.307692,,0.75,0.5,On Trend,On Trend
2,2003,1104,69.285714,11.375273,65.0,8.645273,24.035714,57.178571,673,1601,6.357143,19.857143,178,556,14.857143,20.928571,416,586,13.571429,380,23.928571,670,12.107143,339,13.285714,372,6.607143,3.785714,18.035714,23.25,55.5,651,1554,6.357143,19.142857,178,536,12.142857,17.142857,340,480,10.892857,305,22.642857,634,11.678571,327,13.857143,388,5.535714,3.178571,19.25,4.285714,6.0,13.391145,17,11,13,2,1,8,3,1,1,0,6,1,9,3,0.644503,0.631124,Downtrend,1.1,5.0,Uptrend,sec,,35.0,35.0,35.0,10.0,Y10,0.607143,0.866667,0.111111,0.75,1.0,0.166667,On Trend,On Trend
3,2003,1105,71.769231,13.051614,76.653846,14.005548,24.384615,61.615385,634,1602,7.576923,20.769231,197,540,15.423077,21.846154,401,568,13.5,351,23.115385,601,14.538462,378,18.653846,485,9.307692,2.076923,20.230769,27.0,58.961538,702,1533,6.269231,17.538462,163,456,16.384615,24.5,426,637,13.192308,343,26.384615,686,15.807692,411,18.807692,489,9.384615,4.192308,19.076923,-4.884615,-3.5,15.860207,7,19,5,7,2,12,0,0,0,3,9,2,2,6,0.277992,0.283595,Uptrend,-0.2,-0.8,Downtrend,swac,,309.0,304.0,313.0,,,0.269231,0.416667,0.142857,,0.0,0.222222,On Trend,On Trend
4,2003,1106,63.607143,11.402856,63.75,12.636294,23.428571,55.285714,656,1548,6.107143,17.642857,171,494,10.642857,16.464286,298,461,12.285714,344,23.857143,668,11.678571,327,17.035714,477,8.357143,3.142857,18.178571,21.714286,53.392857,608,1495,4.785714,15.214286,134,426,15.535714,21.964286,435,615,11.321429,317,22.357143,626,11.785714,330,15.071429,422,8.785714,3.178571,16.142857,-0.142857,-1.0,12.601335,13,15,8,4,5,9,0,2,1,0,10,5,7,5,0.487403,0.469128,Downtrend,0.1,-2.2,Downtrend,swac,,262.0,212.0,293.0,,,0.464286,0.666667,0.357143,0.0,1.0,0.5,On Trend,On Trend


In [307]:
# Men's Data for the 2024 Season

men_2024 = df.query("Season == 2024 & TeamID < 3000").copy()

In [308]:
# Inspection

print(men_2024.shape)
men_2024.head()

(362, 90)


Unnamed: 0,Season,TeamID,AvgTeamScore,StdDevTeamScore,AvgOppScore,StdDevOppScore,AvgFGMade,AvgFGAtt,TotalFGMade,TotalFGAtt,Avg3ptMade,Avg3ptAtt,Total3ptMade,Total3ptAtt,Avg_FT_Made,Avg_FT_Att,Total_FT_Made,Total_FT_Att,Avg_Off_Rebs,Total_Off_Rebs,Avg_Def_Rebs,Total_Def_Rebs,Avg_Assts,Total_Assts,Avg_TO,Total_TO,Avg_Steals,Avg_Blocks,Avg_Fouls,OppAvgFGMade,OppAvgFGAtt,OppTotalFGMade,OppTotalFGAtt,OppAvg3ptMade,OppAvg3ptAtt,OppTotal3ptMade,OppTotal3ptAtt,OppAvg_FT_Made,OppAvg_FT_Att,OppTotal_FT_Made,OppTotal_FT_Att,OppAvg_Off_Rebs,OppTotal_Off_Rebs,OppAvg_Def_Rebs,OppTotal_Def_Rebs,OppAvg_Assts,OppTotal_Assts,OppAvg_TO,OppTotal_TO,OppAvg_Steals,OppAvg_Blocks,OppAvg_Fouls,AvgPtDiff,MedPtDiff,StdPtDiff,Win_Total,Loss_Total,HomeWins,HomeLoss,RoadWins,RoadLoss,NeutralWins,NeutralLoss,OTWins,OTLoss,CloseGames,CloseWins,MaxWStreak,MaxLStreak,LastTenWinPerc,LastFiveWinPerc,WinTrend,LastTenPtDiff,LastFivePtDiff,DiffTrend,Conference,Coach,MedianRanking,BestRanking,WorstRanking,Chalk_Seed,Seed,win_perc,home_win_perc,road_win_perc,neutral_win_perc,ot_win_perc,close_game_win_perc,Relative_Diff_Trend,Relative_Win_Trend
12135,2024,1101,71.115385,11.261712,74.692308,10.356715,24.461538,58.653846,636,1525,5.115385,15.307692,133,398,17.076923,23.230769,444,604,7.769231,202,21.230769,552,11.653846,303,11.923077,310,7.884615,2.038462,20.115385,26.192308,55.769231,681,1450,5.769231,17.384615,150,452,16.538462,23.5,430,611,9.076923,236,26.846154,698,11.961538,311,14.384615,374,6.269231,3.576923,19.307692,-3.576923,-2.5,12.339119,11,15,6,5,3,9,2,1,1,2,8,3,4,4,0.365767,0.372817,Uptrend,-1.6,-1.6,Downtrend,wac,brette_tanner,192.0,143.0,236.0,,,0.423077,0.545455,0.25,0.666667,0.333333,0.375,On Trend,On Trend
12136,2024,1102,66.62963,11.115298,71.592593,10.842435,23.592593,51.814815,637,1399,8.777778,24.074074,237,650,10.666667,15.851852,288,428,6.555556,177,19.777778,534,14.703704,397,10.888889,294,6.740741,4.148148,17.37037,24.666667,52.666667,666,1422,7.259259,19.481481,196,526,15.0,20.222222,405,546,8.185185,221,21.481481,580,12.518519,338,11.333333,306,5.666667,3.037037,16.222222,-4.962963,-5.0,16.32714,9,18,4,11,5,6,0,1,1,1,8,3,6,8,0.369073,0.336129,Downtrend,-11.2,-13.6,Downtrend,mwc,joe_scott,259.0,178.0,324.0,,,0.333333,0.266667,0.454545,0.0,0.5,0.375,On Trend,Trending Down
12137,2024,1103,72.653846,8.394962,66.461538,10.538428,25.615385,56.269231,666,1463,7.653846,23.230769,199,604,13.769231,19.115385,358,497,8.230769,214,23.5,611,12.230769,318,10.576923,275,5.538462,2.730769,16.769231,24.346154,56.346154,633,1465,6.384615,20.730769,166,539,11.384615,16.384615,296,426,7.384615,192,21.615385,562,11.884615,309,10.692308,278,6.230769,2.653846,17.307692,6.192308,6.0,11.648242,18,8,10,0,7,5,1,3,1,0,7,3,7,3,0.732272,0.717409,Downtrend,6.1,-1.0,Downtrend,mac,john_groce,,,,,,0.692308,1.0,0.583333,0.25,1.0,0.428571,Trending Down,On Trend
12138,2024,1104,91.535714,12.115133,79.214286,15.55958,31.035714,64.142857,869,1796,11.607143,30.5,325,854,17.857143,22.571429,500,632,11.107143,311,24.928571,698,16.178571,453,11.928571,334,7.535714,4.107143,19.214286,27.464286,62.428571,769,1748,7.321429,22.785714,205,638,16.964286,23.464286,475,657,9.75,273,21.071429,590,12.571429,352,11.607143,325,7.464286,4.178571,19.785714,12.321429,12.0,20.560757,20,8,13,1,5,4,2,3,1,0,4,3,6,3,0.709851,0.715418,Uptrend,8.8,8.0,Downtrend,sec,nate_oats,8.0,3.0,17.0,,,0.714286,0.928571,0.555556,0.4,1.0,0.75,On Trend,On Trend
12139,2024,1105,68.928571,10.150455,77.964286,14.586859,23.0,55.035714,644,1541,4.178571,14.607143,117,409,18.75,26.285714,525,736,9.428571,264,22.25,623,10.285714,288,15.714286,440,7.642857,3.821429,21.285714,25.071429,57.785714,702,1618,7.75,21.392857,217,599,20.071429,26.964286,562,755,10.428571,292,22.964286,643,13.714286,384,13.357143,374,9.714286,3.428571,20.107143,-9.035714,-11.0,13.535911,8,20,5,4,3,14,0,2,1,0,5,3,3,7,0.236274,0.283941,Uptrend,0.3,4.2,Uptrend,swac,,347.0,308.0,362.0,,,0.285714,0.555556,0.176471,0.0,1.0,0.6,On Trend,Trending Up


### Several features need to be engineered

In [309]:
# List of the features used by my trained model to generate predictions

feature_columns = ['Chalk_Seed_diff', 'Team_A_MedianRanking', 'Team_B_MedianRanking',
       'Team_A_BestRanking', 'AvgPtDiff_diff', 'Team_B_BestRanking',
       'Team_B_WorstRanking', 'Team_A_WorstRanking', 'LastTenWinPerc_diff',
       'Team_B_Chalk_Seed', 'NetRating_diff', 'Team_A_Chalk_Seed',
       'Avg_Blocks_diff', 'Team_B_NetRating', 'AvgFGMade_diff',
       'MedPtDiff_diff', 'LastFiveWinPerc_diff', 'Avg_Steals_diff',
       'Team_A_Def_diff_Team_B_Off', 'Team_B_Off_Eff', 'Team_A_win_perc',
       'Team_B_win_perc', 'Team_A_Off_diff_Team_B_Def', 'Avg_Assts_diff',
       'Avg_Def_Rebs_diff', 'Avg_Off_Rebs_diff', 'Avg_TO_diff', 'Rebound_diff',
       'Int_Avg_Assts_diff_x_Avg_Steals_diff',
       'Int_MedPtDiff_diff_x_Avg_TO_diff',
       'Int_MedPt_Diff_diff_x_Team_A_Def_diff_Team_B_Off',
       'Int_Team_A_Off_diff_Team_B_Def_x_Avg_Off_Rebs_diff',
       'Int_Avg3ptAtt_diff_x_Avg3ptMade_diff',
       'Int_Avg_Blocks_diff_x_StdPtDiff_diff',
       'Int_CloseWins_diff_x_close_game_win_perc_diff']

len(feature_columns)

35
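
Because the model expects exactly these 35 columns, it can help to confirm that an engineered frame contains all of them before calling the model. The helper below is a hypothetical convenience, not part of the original notebook, shown on a toy frame.

```python
import pandas as pd

def missing_features(frame, required):
    """Return the required column names that are absent from `frame`, in order."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the engineered match-up data
toy = pd.DataFrame({"Chalk_Seed_diff": [-15], "AvgPtDiff_diff": [9.3]})
toy_required = ["Chalk_Seed_diff", "AvgPtDiff_diff", "NetRating_diff"]

print(missing_features(toy, toy_required))  # ['NetRating_diff']
```

An empty list means every required column is present and selecting them by name will not raise a `KeyError`.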

In [310]:
# Slightly different form of the same function from the previous notebook 

def calculate_efficiency(data):
    """
    Function to calculate each team's offensive efficiency, defensive efficiency, and net rating
    
    Parameters:
    ----------------
    data: pandas.DataFrame | a dataframe with aggregated statistics from a team's season
    
    Returns:
    ----------------
    Offensive efficiency, defensive efficiency, and net rating (offensive minus defensive efficiency)
    """
    
        
    # Team A Offense
    team_a_off_poss = data["AvgFGAtt"] + data["Avg_TO"] + data['Avg_FT_Att']\
                    - data["Avg_Off_Rebs"]

    team_a_off_eff =  data["AvgTeamScore"] / team_a_off_poss
    
    # Team A Defense
    team_a_def_poss = data["OppAvgFGAtt"] + data["OppAvg_TO"] + data['OppAvg_FT_Att']\
                    - data["OppAvg_Off_Rebs"]

    team_a_def_eff =  data["AvgOppScore"] / team_a_def_poss
    
    team_net_rating = team_a_off_eff - team_a_def_eff
    
    return team_a_off_eff, team_a_def_eff, team_net_rating
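
To make the arithmetic concrete, here is the offensive half of the calculation worked by hand for one team. The inputs are copied from the 2024 row for TeamID 1101 displayed earlier; the variable names are only for this illustration.

```python
# Values copied from the 2024 row for TeamID 1101 shown earlier in the notebook
avg_fg_att = 58.653846
avg_to = 11.923077
avg_ft_att = 23.230769
avg_off_rebs = 7.769231
avg_team_score = 71.115385

# Possession estimate used by calculate_efficiency:
# field-goal attempts + turnovers + free-throw attempts - offensive rebounds
possessions = avg_fg_att + avg_to + avg_ft_att - avg_off_rebs

# Offensive efficiency: points scored per estimated possession
off_eff = avg_team_score / possessions

print(round(possessions, 2), round(off_eff, 3))
```

This comes to roughly 86 estimated possessions per game and about 0.83 points per possession. The defensive figure is computed the same way from the `Opp...` columns, and the net rating is the difference between the two.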

In [337]:
# Add Efficiency Statistics
men_2024["Off_Eff"], men_2024["Def_Eff"], men_2024["NetRating"] = calculate_efficiency(men_2024)

men_2024[['Off_Eff', 'Def_Eff', 'NetRating']].head()

Unnamed: 0,Off_Eff,Def_Eff,NetRating
0,0.934686,0.874052,0.060634
1,1.045696,0.902727,0.142969
2,1.005995,0.85261,0.153386
3,0.980118,0.814846,0.165272
4,0.992294,0.903119,0.089175


In [314]:
# Updating the df with the Tournament Seed Information

men_2024 = men_2024.drop(columns=["Chalk_Seed", "Seed"]).merge(seeds, on="TeamID", how="inner")

In [316]:
# Creating Chalk_Seed feature

men_2024['Chalk_Seed'] = men_2024["Seed"].apply(lambda x: int(x[1:]))
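
The lambda above assumes every label is one region letter followed only by digits. If a seeds file also carried play-in labels with a trailing letter (an assumption here, modelled on labels such as `W16a` in Kaggle's historical seed files), `int(x[1:])` would raise a `ValueError`. A more tolerant parser might look like this sketch; `chalk_seed` is a hypothetical helper name.

```python
import re

def chalk_seed(seed_label):
    """Extract the numeric seed from a label such as 'W01' or 'X16a'."""
    match = re.search(r"\d+", seed_label)
    if match is None:
        raise ValueError(f"No seed number found in {seed_label!r}")
    return int(match.group())

print(chalk_seed("W01"), chalk_seed("X16a"))  # 1 16
```

It could be applied with `men_2024["Seed"].apply(chalk_seed)` in place of the lambda.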

In [317]:
# Filling the two nulls in each of these columns with seed-based estimates
# (assigning the result back rather than calling fillna with inplace=True on a column selection)

# Median
men_2024["MedianRanking"] = men_2024["MedianRanking"].fillna(men_2024["Chalk_Seed"] * 4)


# Best
men_2024["BestRanking"] = men_2024["BestRanking"].fillna(men_2024["Chalk_Seed"] * 4 - (.25 * men_2024["MedianRanking"]))


# Worst
men_2024["WorstRanking"] = men_2024["WorstRanking"].fillna(men_2024["Chalk_Seed"] * 4 + (.25 * men_2024["MedianRanking"]))

# Inspection
men_2024[['MedianRanking', 'BestRanking', 'WorstRanking']].head()

Unnamed: 0,MedianRanking,BestRanking,WorstRanking
0,56.0,42.0,70.0
1,8.0,3.0,17.0
2,4.0,1.0,12.0
3,7.0,4.0,14.0
4,21.0,7.0,54.0
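
The fill rule can be checked on a toy frame. For a 14 seed with no rankings, the median becomes 14 × 4 = 56, and the best and worst rankings sit a quarter of that median (14) on either side, which is consistent with the 56 / 42 / 70 values in the first row above. The frame below is illustrative and not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: one team with missing rankings, one team with rankings already present
toy = pd.DataFrame({
    "Chalk_Seed": [14, 2],
    "MedianRanking": [np.nan, 8.0],
    "BestRanking": [np.nan, 3.0],
    "WorstRanking": [np.nan, 17.0],
})

# Median first, because the best/worst estimates depend on the filled median
toy["MedianRanking"] = toy["MedianRanking"].fillna(toy["Chalk_Seed"] * 4)
spread = 0.25 * toy["MedianRanking"]
toy["BestRanking"] = toy["BestRanking"].fillna(toy["Chalk_Seed"] * 4 - spread)
toy["WorstRanking"] = toy["WorstRanking"].fillna(toy["Chalk_Seed"] * 4 + spread)

print(toy)
```

Rows that already have rankings are left untouched; only the missing values are estimated from the seed.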


In [318]:
# Adding average total rebounds

men_2024["Rebound"] = men_2024["Avg_Def_Rebs"] + men_2024["Avg_Off_Rebs"]

### Loading the Pre-Trained Model

In [320]:
# Model Load

my_model = joblib.load(os.path.join(path, 'Trained_NCAA_Men_Model.joblib'))

In [321]:
# Slots Data Loading

slots = pd.read_csv(os.path.join(path, 'MNCAATourneySlots.csv'))
slots.head()

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed
0,1985,R1W1,W01,W16
1,1985,R1W2,W02,W15
2,1985,R1W3,W03,W14
3,1985,R1W4,W04,W13
4,1985,R1W5,W05,W12


In [322]:
# Filtering Slots 
slots = slots[slots['Season'] == 2024]
slots = slots[slots['Slot'].str.contains('R')].reset_index(drop=True)

slots.head()

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed
0,2024,R1W1,W01,W16
1,2024,R1W2,W02,W15
2,2024,R1W3,W03,W14
3,2024,R1W4,W04,W13
4,2024,R1W5,W05,W12
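
The bracket function below relies on later-round slots naming earlier slots in their `StrongSeed` and `WeakSeed` columns, so that each round's winners can be substituted in. A minimal sketch of that substitution follows; the second-round row and the winners are assumptions made up for the example.

```python
import pandas as pd

# Toy slice of a slots table; the R2 row is an assumption modelled on the slot layout
slots_toy = pd.DataFrame({
    "Slot": ["R1W1", "R1W8", "R2W1"],
    "StrongSeed": ["W01", "W08", "R1W1"],
    "WeakSeed": ["W16", "W09", "R1W8"],
})

# Hypothetical first-round winners, keyed by the slot they won
winners = {"R1W1": "W01", "R1W8": "W09"}

# Replace slot references with the seeds of the teams that advanced
resolved = slots_toy.replace({"StrongSeed": winners, "WeakSeed": winners})

print(resolved)
```

After the substitution, the second-round slot reads as an ordinary match-up between two seeds, so the same feature-building code can be reused for every round.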


In [330]:
# Adapted, with slight modifications, from Heath Jones's notebook:
# https://github.com/heefjones/march_madness/blob/main/preds.ipynb


def generate_bracket(data, estimator, tournament, num_brackets, slots_df=slots):
    """
    Generate one or more brackets for the 2024 NCAA tournament.

    Parameters
    ----------
    data : pd.DataFrame
        Regular season data for the 2024 teams competing in the tournament.
    estimator : sklearn estimator
        Pre-trained estimator used to predict each match-up.
    tournament : str
        'M' or 'W'.
    num_brackets : int
        Number of brackets to generate.
    slots_df : pd.DataFrame
        Slots for the 2024 tournament.

    Returns
    -------
    all_brackets : pd.DataFrame
        DataFrame with the predicted winner of every slot in each bracket.
    
    """

    # Get a copy of the data to avoid modifying the original
    features = data.copy()

    # Create empty df for all brackets
    all_brackets = pd.DataFrame()

    # Loop once per bracket; each pass builds one bracket in its entirety
    for n in range(1, num_brackets+1):
        # Create bracket-specific slots table
        slots = slots_df.copy()

        # Create empty results for round
        result_df = pd.DataFrame(columns=["Slot", "Team"])

        # 6 rounds in a single bracket
        for i in range(1, 7):
            # get slots for round
            slots_round = slots[slots['Slot'].str.contains(f'R{i}')].reset_index(drop=True)

            # Container holds data for each matchup
            round_matchups = []

            # Loop through the slots
            for idx, row in slots_round.iterrows():
                # Get team A and team B
                A = features[features['Seed'] == row['StrongSeed']].reset_index(drop=True)
                B = features[features['Seed'] == row['WeakSeed']].reset_index(drop=True)

                # Rename cols
                A = A.add_prefix('Team_A_')
                B = B.add_prefix('Team_B_')

                # Create matchup dataframe
                combined = pd.concat([A, B], axis=1)

                # Append combined row to the list
                round_matchups.append(combined)

            # Concatenate all matchup rows for a GIVEN ROUND into a single DataFrame
            round_df = pd.concat(round_matchups, axis=0).reset_index(drop=True)
            
            #### Feature Engineering Here ##########
            
            # Create a new dataframe for the differenced data
            diff_df = pd.DataFrame()

            # Iterate through the numeric columns of round_df and pick those whose names start with "Team_A_"
            for j in round_df.select_dtypes(include=np.number).columns:
                if j[:7] == "Team_A_":
                    # Capture the general statistic
                    feature = j[7:]

                    # Create new column in new 'X-like' dataframe
                    diff_df[f"{feature}_diff"] = round_df[f"Team_A_{feature}"] - round_df[f"Team_B_{feature}"]
                    
            # Offense - Defense
            diff_df["Team_A_Off_diff_Team_B_Def"] = round_df["Team_A_Off_Eff"] - round_df["Team_B_Def_Eff"]

            # Defense - Offense
            diff_df["Team_A_Def_diff_Team_B_Off"] = round_df["Team_A_Def_Eff"] - round_df["Team_B_Off_Eff"]


            # Add the categorical features back to the data
            diff_df = pd.concat([diff_df, round_df], axis=1)
            
            
            # Interaction Term Feature Creation

            # Assists_diff x Steals_diff
            diff_df["Int_Avg_Assts_diff_x_Avg_Steals_diff"] = diff_df["Avg_Assts_diff"] * diff_df["Avg_Steals_diff"]

            # MedPt_Diff_diff x Avg_TO_diff
            diff_df["Int_MedPtDiff_diff_x_Avg_TO_diff"] = diff_df["MedPtDiff_diff"] * diff_df["Avg_TO_diff"]

            # MedPt_Diff_diff x Team_A_Def_diff_Team_B_Off
            diff_df["Int_MedPt_Diff_diff_x_Team_A_Def_diff_Team_B_Off"] = diff_df["MedPtDiff_diff"] * diff_df['Team_A_Def_diff_Team_B_Off']

            # Team_A_Off_diff_Team_B_Def_x_Avg_Off_Rebs_diff
            diff_df["Int_Team_A_Off_diff_Team_B_Def_x_Avg_Off_Rebs_diff"] = diff_df["Team_A_Off_diff_Team_B_Def"] * diff_df['Avg_Off_Rebs_diff']

            # 3pt att vs makes
            diff_df["Int_Avg3ptAtt_diff_x_Avg3ptMade_diff"] = diff_df["Avg3ptAtt_diff"] * diff_df["Avg3ptMade_diff"]

            # Blocks v std_dev of point differential
            diff_df["Int_Avg_Blocks_diff_x_StdPtDiff_diff"] = diff_df["Avg_Blocks_diff"] * diff_df["StdPtDiff_diff"]

            # Close wins diff vs close game win % diff
            diff_df["Int_CloseWins_diff_x_close_game_win_perc_diff"] = diff_df["CloseWins_diff"] * diff_df["close_game_win_perc_diff"]
            

            # Flag every round after the first
            if i > 1:
                diff_df[f'round_{i}'] = 1


            # Define X by selecting the model's features (feature_columns is defined at module level)
            X = diff_df[feature_columns]

            
            # Predict the outcomes of the round; for simulations, swap in .predict_proba() and un-comment the lines below
            preds = estimator.predict(X)

#             # Generate random values for each observation/prediction-probability
#             random_values = np.random.rand(len(preds))
            
#             # Boolean values returned by comparing the probability of Class-0 to a random value between 0 and 1
#             # As long as the random value is greater than the probability of Class-0, Team_A Wins
#             preds = (random_values > preds[:, 0]).astype(int)

            # replace preds with full seed of winning team
            preds = np.where(preds > 0, round_df['Team_A_Seed'], round_df['Team_B_Seed'])

            # Update the Result df, which contains all results for this individual run of the bracket
            for slot, winner_seed in zip(slots_round['Slot'], preds):
                # save results to result_df
                result_df.loc[len(result_df.index)] = [slot, winner_seed]

            # Edit slots df for next round
            if i != 6:
                next_round_slots = slots[slots['Slot'].str.contains(f'R{i+1}')]

                for idx, row in next_round_slots.iterrows():
                    # Get the teams playing in that slot for the next round
                    team1 = result_df[result_df['Slot'] == row['StrongSeed']]['Team'].values[0]
                    team2 = result_df[result_df['Slot'] == row['WeakSeed']]['Team'].values[0]

                    # Update the slots df
                    slots.loc[slots['Slot'] == row['Slot'], 'StrongSeed'] = team1
                    slots.loc[slots['Slot'] == row['Slot'], 'WeakSeed'] = team2

            # drop teams that have been eliminated
            # features = features[features['FullSeed'].isin(result_df['Team'])].reset_index(drop=True)

        # add bracket col
        result_df['Bracket'] = n

        # append to all_brackets
        all_brackets = pd.concat([all_brackets, result_df], axis=0)

    # add tournament col
    all_brackets['Tournament'] = tournament

    return all_brackets

In [331]:
# Creating a first bracket
bracket = generate_bracket(data=men_2024, estimator=my_model, tournament="M", num_brackets=1, slots_df=slots)

In [332]:
# Inspecting the first bracket
bracket.head()

Unnamed: 0,Slot,Team,Bracket,Tournament
0,R1W1,W01,1,M
1,R1W2,W02,1,M
2,R1W3,W03,1,M
3,R1W4,W04,1,M
4,R1W5,W05,1,M
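
When `num_brackets` is greater than 1, the returned frame stacks every bracket, so outcomes can be tallied across runs. The sketch below counts how often each seed wins the final slot in a toy result set. The data is invented, and `R6CH` as the name of the championship slot is an assumption about the slots file.

```python
import pandas as pd

# Toy stand-in for the stacked output of four simulated brackets
toy_brackets = pd.DataFrame({
    "Slot": ["R6CH", "R6CH", "R6CH", "R6CH"],
    "Team": ["W01", "Y01", "W01", "W01"],
    "Bracket": [1, 2, 3, 4],
})

# Share of brackets in which each seed wins the championship slot
title_share = (
    toy_brackets.loc[toy_brackets["Slot"] == "R6CH", "Team"]
    .value_counts(normalize=True)
)

print(title_share)
```

A tally like this is only informative when the winners are sampled from probabilities; with plain `.predict()`, every bracket comes out identical.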


In [188]:
# Reading in the Team Name Data
team_names = pd.read_csv(os.path.join(path, 'MTeams.csv'))
team_names

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2024
1,1102,Air Force,1985,2024
2,1103,Akron,1985,2024
3,1104,Alabama,1985,2024
4,1105,Alabama A&M,2000,2024
...,...,...,...,...
373,1474,Queens NC,2023,2024
374,1475,Southern Indiana,2023,2024
375,1476,Stonehill,2023,2024
376,1477,TX A&M Commerce,2023,2024


In [333]:
# Intermediate join
names_seeds = men_2024.merge(team_names[["TeamID", "TeamName"]], on="TeamID")[["TeamID", "TeamName", "Seed"]]

In [334]:
# Rename for merging
names_seeds.rename(columns={"Seed":"Team"}, inplace=True)

Unnamed: 0,TeamID,TeamName,Team
11,1163,Connecticut,W01


In [335]:
# With School Names
bracket.merge(names_seeds, on="Team", how="left")

Unnamed: 0,Slot,Team,Bracket,Tournament,TeamID,TeamName
0,R1W1,W01,1,M,1163,Connecticut
1,R1W2,W02,1,M,1235,Iowa St
2,R1W3,W03,1,M,1228,Illinois
3,R1W4,W04,1,M,1120,Auburn
4,R1W5,W05,1,M,1361,San Diego St
5,R1W6,W06,1,M,1140,BYU
6,R1W7,W07,1,M,1450,Washington St
7,R1W8,W08,1,M,1194,FL Atlantic
8,R1X1,X01,1,M,1314,North Carolina
9,R1X2,X02,1,M,1112,Arizona


<div class="alert alert-block alert-info" style="font-size: 2em; background-color:Blue; color:White">
    <b>Conclusion:</b>
    </div>
    
There are two methods that I can use to select results for the bracket. In the first method, I simply use the .predict() method of my model to predict whether Team_A wins. Using this method, I would have predicted a Houston victory over UConn in the final. Obviously this would have been wrong, and my bracket would have been busted long before the final. The largest upset it predicted was an 11 seed over a 6 seed in the first round.

The second method uses a combination of the .predict_proba() method to generate probabilities of each class and a random number between 0 and 1 generated by numpy. This second method is better for producing simulations of bracket runs. 
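
A minimal sketch of that second method is shown below, assuming class 1 means Team_A wins (as in the bracket function, where a positive prediction selects Team_A's seed). `sample_team_a_wins` is a hypothetical helper; it draws `u < P(class 1)`, which has the same distribution as the commented-out comparison `u > P(class 0)` in the function above.

```python
import numpy as np

def sample_team_a_wins(proba, rng):
    """Return 1 where Team_A is sampled as the winner, else 0.

    proba: array of shape (n_games, 2) with columns P(class 0), P(class 1).
    rng:   numpy.random.Generator used for the uniform draws.
    """
    proba = np.asarray(proba, dtype=float)
    draws = rng.random(len(proba))          # one uniform draw in [0, 1) per game
    return (draws < proba[:, 1]).astype(int)

rng = np.random.default_rng(0)

# Toy probabilities: a certain Team_A win, a certain Team_B win, and a 70/30 game
toy_proba = np.array([[0.0, 1.0], [1.0, 0.0], [0.3, 0.7]])
outcomes = sample_team_a_wins(toy_proba, rng)

# Repeating the 70/30 game many times should give Team_A roughly 70% of the wins
many = sample_team_a_wins(np.tile([0.3, 0.7], (10_000, 1)), rng)
print(outcomes[:2], many.mean())
```

Inside the bracket function, `proba` would come from `estimator.predict_proba(X)`, and the sampled 0/1 values would replace the output of `.predict()`.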

Obviously, one-game samples can produce volatile outcomes. That is both what makes people love March Madness, and also what makes the outcomes so difficult to predict. This experience certainly made me want to dive deeper into even more granular data to see if more accurate models could still be trained. 

As next steps, I would like to obtain player-level data for each team and use those statistics as features. It would also be useful to analyze teams' strategic strengths and weaknesses further, for example by creating categories that describe them.

### Good luck to all of you sports prognosticators out there. 