<h1>March Madness Tournament Games Predictor</h1>

The main objective of this project is to train a predictive model on NCAAM regular season statistics and tournament game results from 1985 through 2018, and then to predict the results of the 2019 tournament games from that year's regular season statistics.

<h3>Data Manipulation</h3>

This notebook covers the data manipulation needed to bring the data into the desired form. I define a feature vector for each tournament game played from 2003 through 2019 inclusive.

For a given tournament game in a particular year, I randomly set either the winning or the losing team as Team 1 and the other team as Team 2. I want the feature vector of the game to include the following statistics for Team 1 and Team 2 separately, taken from the regular season preceding the tournament game:

- the number of games the team won,
- the number of games the team lost,
- the ratio of wins in home games,
- the ratio of wins in away games,
- average score per game,
- average field goals made per game,
- the number of field goal attempts per game,
- average 3-pointers made per game,
- the number of 3-point attempts per game,
- the number of offensive rebounds per game,
- the number of defensive rebounds per game,
- the number of assists per game,
- the number of steals per game,
- the number of blocks per game,
- the number of personal fouls per game,
- the average ranking of the team on the last day of the season,
- the team's seed, if it has one.

Apart from these features, which are recorded for the two teams separately, I add some features that are specific to the game. These are:

- results (win or loss) of previous games between the two teams, if any,
- which team is the home team.

Lastly, I record all the data to a csv file.

Let's start by importing the necessary packages and the data. For this code to work, the data files should be placed in a folder named mm_data located next to the notebook.
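As a minimal sketch, the file paths can be built with `pathlib` so that the same code resolves on Windows, macOS and Linux. The `DATA_DIR` constant and the `data_path` helper are hypothetical names introduced only for this illustration; the folder layout is the one used in the loading cells of this notebook.

```python
from pathlib import Path

# Hypothetical constant: the data folder sits next to the notebook.
DATA_DIR = Path("mm_data") / "MDataFiles_Stage1"

def data_path(file_name):
    """Return the path of a data file inside DATA_DIR."""
    return DATA_DIR / file_name

# as_posix() shows the path with forward slashes on every platform.
print(data_path("MRegularSeasonDetailedResults.csv").as_posix())
```

A path built this way can be passed directly to `pd.read_csv`.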

In [1]:
import numpy as np
import pandas as pd
import sklearn as sk
import random

In [2]:
# Forward slashes resolve on Windows, macOS and Linux; backslash separators only work on Windows.
reg_se_dt = pd.read_csv("mm_data/MDataFiles_Stage1/MRegularSeasonDetailedResults.csv")
tour_com = pd.read_csv("mm_data/MDataFiles_Stage1/MNCAATourneyCompactResults.csv")
rankings = pd.read_csv("mm_data/MDataFiles_Stage1/MMasseyOrdinals.csv")

In [None]:
reg_se_cm = pd.read_csv("mm_data/MDataFiles_Stage1/MRegularSeasonCompactResults.csv")
seed_data = pd.read_csv("mm_data/MDataFiles_Stage1/MNCAATourneySeeds.csv")

I first create a data frame that contains the information on each game I want to use as a data point. For each game, the code randomly decides whether the winning team is called 'team1' or 'team2'. If team 1 is the winner, the game is labelled 0; if team 1 is the loser, the game is labelled 1. This initial data frame also includes team1_id, team2_id, the year of the game, and the day of the game.

I define the following function to create the described data frame.

In [None]:
from IPython.display import clear_output

def games_data_col(years, games_data):
    games_df = pd.DataFrame(columns=['game_year','game_day','team1_id','team2_id','result']) 
    active_games = games_data[games_data['Season'].isin(years)]
    for i in range(len(active_games)):
        #record the ids of winning and losing team.
        #record the year of the game
        year = active_games.Season.iloc[i]
        game_day = active_games.DayNum.iloc[i]
        win_id = active_games.WTeamID.iloc[i]
        los_id = active_games.LTeamID.iloc[i]
        #decide which team will be called 1. If dice gives 0 then Team 1 is the winner and Team 2 is the loser.
        #if dice gives 1 then Team 1 is the loser and Team 2 is the winner.
        dice = random.randint(0,1)
        #create data array for the game.
        if dice == 0:
            data_array = [year, game_day, win_id, los_id] + [0] 
        elif dice == 1:
            data_array = [year,game_day,los_id,win_id] + [1]
        games_df.loc[len(games_df)] = data_array
        clear_output(wait= True)
        print('%' + str(round(i/len(active_games)*100,0))+' Completed')
    games_df['result'] = games_df['result'].astype(int)
    return games_df
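The same random labelling can also be sketched without a Python loop. The version below is an illustration under assumptions, not the function used in this notebook: `label_games` and its `seed` parameter are hypothetical names, and the toy team ids are made up.

```python
import numpy as np
import pandas as pd

def label_games(games, seed=None):
    """Randomly assign winner/loser to team1/team2 and record the result label."""
    rng = np.random.default_rng(seed)
    # flip == 0: the winner is team 1 (label 0); flip == 1: the loser is team 1 (label 1).
    flip = rng.integers(0, 2, size=len(games))
    winner = games["WTeamID"].to_numpy()
    loser = games["LTeamID"].to_numpy()
    return pd.DataFrame({
        "game_year": games["Season"].to_numpy(),
        "game_day": games["DayNum"].to_numpy(),
        "team1_id": np.where(flip == 0, winner, loser),
        "team2_id": np.where(flip == 0, loser, winner),
        "result": flip,
    })

# Toy data with made-up team ids, only to show the call.
toy = pd.DataFrame({
    "Season": [2003, 2003, 2004],
    "DayNum": [134, 136, 134],
    "WTeamID": [1101, 1102, 1103],
    "LTeamID": [1201, 1202, 1203],
})
labelled = label_games(toy, seed=0)
print(labelled)
```

Passing a fixed `seed` makes the team 1 / team 2 assignment reproducible between runs, which the loop based on `random.randint` does not guarantee unless the global seed is set.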

Now I call the previous function for the tournament games from 2003 through 2019 (both included). I call the resulting data frame the game information and features data frame, or GIF.

In [None]:
gif = games_data_col(range(2003,2020),tour_com)

The following function adds two new feature columns to the GIF. The two columns hold the same statistic for team 1 and team 2 of each game. It accepts four required arguments and one optional argument:

- stat_col: a function that extracts the statistic for one team from the source data
- df: the data frame to update
- feat_name: the name of the new columns (suffixed with 1 and 2)
- data: the data frame from which the team statistics are collected
- count (optional, default 0): selects which portion of the data is passed to stat_col, as described below
    
It returns the updated data frame.

<h3>Remark:</h3>This is definitely not the fastest way to transform the data; it is actually considerably slower than my previous attempts. The main reason I follow this strategy is that, by separating the statistic collector functions from the row-by-row modification of the data frame, I can re-use these functions for other sets of data points or similar data collection procedures with small adjustments.

The following function has three separate modes. If count is not given, the function assumes it is called for a tournament game and collects the statistics from the whole regular season of the tournament year. In this case the function passes only that year's data to the data collector.

If count=n is a positive integer, the function finds the n previous games of the given team and collects the same statistics from those games. I modified the function so that it can be used for regular season games in a reasonable way. In this case the function passes only the data of the last n games to the statistics collector.

If count is -1, the function passes all the data to the data collector, and the data collector decides which portions of the data will be used.

In [None]:
def add_a_feat(stat_col,df,feat_name,data,count=0):
    if count == 0:
        feat1 = feat_name + str(1)
        feat2 = feat_name + str(2)
        df[feat1] = np.nan
        df[feat2] = np.nan
        for i in range(len(df)):
            year = df.game_year.loc[i]
            data_adj = data.loc[data['Season']==year]
            df.at[i,feat1] = stat_col(df.team1_id.loc[i],data_adj)
            df.at[i,feat2] = stat_col(df.team2_id.loc[i],data_adj)
            clear_output(wait= True)
            print('%' + str(round(i/len(df)*100,0))+' Completed')
        return df
    elif count == -1:
        feat1 = feat_name + str(1)
        feat2 = feat_name + str(2)
        df[feat1] = np.nan
        df[feat2] = np.nan
        for i in range(len(df)):
            df.at[i,feat1] = stat_col(df.game_year.loc[i],df.team1_id.loc[i],data)
            df.at[i,feat2] = stat_col(df.game_year.loc[i],df.team2_id.loc[i],data)
            clear_output(wait= True)
            print('%' + str(round(i/len(df)*100,0))+' Completed')
        return df
    else:
        feat1 = feat_name + str(1)
        feat2 = feat_name + str(2)
        df[feat1] = np.nan
        df[feat2] = np.nan
        for i in range(len(df)):
            team1 = df.team1_id.loc[i]
            team2 = df.team2_id.loc[i]
            year = df.game_year.loc[i] 
            day = df.game_day.loc[i]
            #games played strictly before this one: earlier seasons, or earlier days of the same season
            earlier = (data['Season'] < year) | ((data['Season'] == year) & (data['DayNum'] < day))
            data_adj1 = data.loc[((data['WTeamID'] == team1) | (data['LTeamID'] == team1)) & earlier]
            data_adj1 = data_adj1.tail(count)
            #record the statistic only if the team has at least count earlier games
            if len(data_adj1)>=count:
                df.at[i,feat1] = stat_col(team1,data_adj1)
            data_adj2 = data.loc[((data['WTeamID'] == team2) | (data['LTeamID'] == team2)) & earlier]
            data_adj2 = data_adj2.tail(count)
            if len(data_adj2)>=count:
                df.at[i,feat2] = stat_col(team2,data_adj2)
            clear_output(wait= True)
            print('%' + str(round(i/len(df)*100,0))+' Completed')
        return df
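For season-level averages, the row-by-row loop could be replaced by a reshape, a `groupby` and a merge. The sketch below is only an illustration of that idea and not the approach this notebook follows: `season_team_means` is a hypothetical helper, and the toy results use made-up team ids and scores.

```python
import pandas as pd

def season_team_means(results, stat):
    """Mean of one box-score column per (Season, TeamID), e.g. stat='Score'."""
    # Winner rows and loser rows are renamed to a common schema and stacked,
    # so every game contributes one row for each of its two teams.
    cols = ["Season", "TeamID", stat]
    won = results.rename(columns={"WTeamID": "TeamID", "W" + stat: stat})[cols]
    lost = results.rename(columns={"LTeamID": "TeamID", "L" + stat: stat})[cols]
    long = pd.concat([won, lost], ignore_index=True)
    return long.groupby(["Season", "TeamID"], as_index=False)[stat].mean()

# Toy regular season: three games between made-up teams 1, 2 and 3.
toy_results = pd.DataFrame({
    "Season": [2003, 2003, 2003],
    "WTeamID": [1, 2, 1],
    "LTeamID": [2, 3, 3],
    "WScore": [70, 80, 90],
    "LScore": [60, 50, 40],
})
means = season_team_means(toy_results, "Score")

# Attach the averages to a toy game table, once per side.
toy_games = pd.DataFrame({"game_year": [2003], "team1_id": [1], "team2_id": [3]})
for side in ("1", "2"):
    renamed = means.rename(columns={"Season": "game_year",
                                    "TeamID": "team" + side + "_id",
                                    "Score": "score_av" + side})
    toy_games = toy_games.merge(renamed, on=["game_year", "team" + side + "_id"], how="left")
print(toy_games)
```

The trade-off is the one described in the remark: this form is faster, but each statistic is tied to the reshape instead of being a small reusable function of `(teamid, data)`.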
    

Now we use the previous function to add new feature columns to the GIF. The following four functions extract the following statistics of a given team from the given data.

<ul>
<li>win_game_stat: number of games won (in the regular season of the given year).</li>
<li>loss_game_stat: number of games lost (in the regular season of the given year).</li>
<li>home_stat: games won at home/games played at home (in the regular season of the given year). Not every game has a home team on record, so this statistic covers the known cases only.</li>
<li>away_stat: games won away/games played away (in the regular season of the given year). Not every game has a home team on record, so this statistic covers the known cases only.</li>

</ul>

In [None]:
def win_game_stat(teamid,data):
    return len(data[(data.WTeamID == teamid)])
def loss_game_stat(teamid,data):
    return len(data[(data.LTeamID == teamid)])
def home_stat(teamid,data):
    #WLoc is the location of the winner: 'H' means the winner was at home, 'A' means the loser was at home
    team_wins_home = len(data[(data['WTeamID'] == teamid) & (data['WLoc'] =='H')])
    team_los_home = len(data[(data['LTeamID'] == teamid) & (data['WLoc'] =='A')])
    if team_wins_home+team_los_home == 0:
        #no known home game: leave the feature missing
        return np.nan
    return team_wins_home/(team_wins_home+team_los_home)
def away_stat(teamid,data):
    team_wins_away = len(data[(data['WTeamID'] == teamid) & (data['WLoc'] =='A')])
    team_los_away = len(data[(data['LTeamID'] == teamid) & (data['WLoc'] =='H')])
    if team_los_away+team_wins_away == 0:
        #no known away game: leave the feature missing
        return np.nan
    return team_wins_away/(team_los_away+team_wins_away)
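A small worked example may help to see how the home ratio is counted. The sketch below mirrors the logic of home_stat on made-up data; `home_win_ratio` is a hypothetical stand-alone name so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def home_win_ratio(team_id, games):
    """Share of the team's home games that it won; NaN if no home game is known."""
    # WLoc is recorded from the winner's point of view:
    # 'H' = the winner was at home, 'A' = the winner was away (so the loser was at home).
    home_wins = ((games["WTeamID"] == team_id) & (games["WLoc"] == "H")).sum()
    home_losses = ((games["LTeamID"] == team_id) & (games["WLoc"] == "A")).sum()
    played = home_wins + home_losses
    return home_wins / played if played else np.nan

# Made-up games: team 1 wins at home, wins on a neutral court,
# loses at home to team 2, and loses away at team 3.
toy = pd.DataFrame({
    "WTeamID": [1, 1, 2, 3],
    "LTeamID": [2, 3, 1, 1],
    "WLoc":    ["H", "N", "A", "H"],
})
print(home_win_ratio(1, toy))  # one home win out of two home games
```

The neutral-court game and the away loss do not enter the ratio, which is why only two of team 1's four games are counted.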

In [None]:
gif = add_a_feat(win_game_stat,gif,'win',reg_se_dt)
gif = add_a_feat(loss_game_stat,gif,'loss',reg_se_dt)
gif = add_a_feat(home_stat,gif,'home_win',reg_se_cm)
gif = add_a_feat(away_stat,gif,'away_win',reg_se_cm)

The following function records the ranking of a given team in a given year on the last ranking day of the regular season, averaged over all ranking systems available for that day.

In [None]:
def rank_year0(teamid,data):
    rank_day = data[(data['TeamID']==teamid)].RankingDayNum.max()
    return data[(data['TeamID']==teamid) & (data['RankingDayNum']==rank_day)].OrdinalRank.mean()
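To illustrate the averaging step, here is a toy example with two ranking systems. The system names and ranks are made up; only the column names follow the rankings file.

```python
import pandas as pd

# Made-up rankings of one team from two hypothetical systems on two days.
toy_rankings = pd.DataFrame({
    "TeamID":        [1, 1, 1, 1],
    "SystemName":    ["AAA", "BBB", "AAA", "BBB"],
    "RankingDayNum": [120, 120, 133, 133],
    "OrdinalRank":   [10, 14, 6, 8],
})

team = toy_rankings[toy_rankings["TeamID"] == 1]
# Keep only the last day on which the team was ranked ...
last_day = team["RankingDayNum"].max()
# ... and average the ranks of all systems published on that day.
final_rank = team.loc[team["RankingDayNum"] == last_day, "OrdinalRank"].mean()
print(last_day, final_rank)
```

Here the ranks from day 120 are ignored and the feature is the mean of 6 and 8.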

In [None]:
gif = add_a_feat(rank_year0,gif,'rank_year0',rankings)

The following two functions extract a team's average score per game and the average score the team allowed per game from the given data.

In [None]:
def score_stat(teamid,data):
    win_game_score = data[(data.WTeamID == teamid)].WScore.sum()
    lose_game_score = data[(data.LTeamID == teamid)].LScore.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_game_score+lose_game_score)/total_game
def score_against_stat(teamid,data):
    win_game_score = data[(data.WTeamID == teamid)].LScore.sum()
    lose_game_score = data[(data.LTeamID == teamid)].WScore.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_game_score+lose_game_score)/total_game

In [None]:
gif = add_a_feat(score_stat,gif,'score_av',reg_se_dt)
gif = add_a_feat(score_against_stat,gif,'score_ag_av',reg_se_dt)

The following four functions extract the following statistics of a given team from the given data:

- fscore_stat: average field goals made per game by the team in the given data.
- pscore_stat: average 3-pointers made per game by the team in the given data.
- fscore_against_stat: average field goals per game the team allowed in the given data.
- pscore_against_stat: average 3-pointers per game the team allowed in the given data.

In [None]:
def fscore_stat(teamid,data):
    win_fscore = data[(data.WTeamID == teamid)].WFGM.sum()
    lose_fscore = data[(data.LTeamID == teamid)].LFGM.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_fscore+lose_fscore)/total_game
def pscore_stat(teamid,data):
    win_pscore = data[(data.WTeamID == teamid)].WFGM3.sum()
    lose_pscore = data[(data.LTeamID == teamid)].LFGM3.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_pscore+lose_pscore)/total_game
def fscore_against_stat(teamid,data):
    win_fscore = data[(data.WTeamID == teamid)].LFGM.sum()
    lose_fscore = data[(data.LTeamID == teamid)].WFGM.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_fscore+lose_fscore)/total_game
def pscore_against_stat(teamid,data):
    win_pscore = data[(data.WTeamID == teamid)].LFGM3.sum()
    lose_pscore = data[(data.LTeamID == teamid)].WFGM3.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_pscore+lose_pscore)/total_game
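The per-game average functions in this notebook differ only in the column they read and in whether they take the team's own numbers or its opponents'. As a sketch of how that repetition could be factored out, the hypothetical `make_avg_stat` factory below builds a function with the same `(teamid, data)` signature; it is an illustration, not the code used to produce the features.

```python
import pandas as pd

def make_avg_stat(column, against=False):
    """Build a stat_col-style function averaging one box-score column per game."""
    def stat(teamid, data):
        won = data[data.WTeamID == teamid]
        lost = data[data.LTeamID == teamid]
        # In games the team won, its own numbers are in the W* columns and its
        # opponent's numbers are in the L* columns; in lost games it is the reverse.
        prefix_in_wins, prefix_in_losses = ("L", "W") if against else ("W", "L")
        total = won[prefix_in_wins + column].sum() + lost[prefix_in_losses + column].sum()
        return total / (len(won) + len(lost))
    return stat

# Made-up box scores: team 1 beats team 2, then loses to team 2.
toy = pd.DataFrame({
    "WTeamID": [1, 2], "LTeamID": [2, 1],
    "WFGM": [30, 28], "LFGM": [25, 20],
})
fscore = make_avg_stat("FGM")
fscore_against = make_avg_stat("FGM", against=True)
print(fscore(1, toy), fscore_against(1, toy))
```

A function built this way could be passed to add_a_feat in place of the hand-written one, for example `make_avg_stat("Ast")` for assists.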

In [None]:
gif = add_a_feat(fscore_stat,gif,'field_score', reg_se_dt)
gif = add_a_feat(pscore_stat,gif,'3pointer_score', reg_se_dt)
gif = add_a_feat(fscore_against_stat, gif, 'field_score_ag', reg_se_dt)
gif = add_a_feat(pscore_against_stat, gif, '3pointer_score_ag', reg_se_dt)

The following three functions extract the following statistics of a given team from the given data:

- asist_stat: average assists per game by the team in the given data.
- pfoul_stat: average personal fouls per game by the team in the given data.
- turn_over_stat: average turnovers per game by the team in the given data.

In [None]:
def asist_stat(teamid,data):
    win_ast = data[(data.WTeamID == teamid)].WAst.sum()
    lose_ast = data[(data.LTeamID == teamid)].LAst.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_ast+lose_ast)/total_game
def pfoul_stat(teamid,data):
    win_pfoul = data[(data.WTeamID == teamid)].WPF.sum()
    lose_pfoul = data[(data.LTeamID == teamid)].LPF.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_pfoul+lose_pfoul)/total_game
def turn_over_stat(teamid,data):
    win_turn_over = data[(data.WTeamID == teamid)].WTO.sum()
    loss_turn_over = data[(data.LTeamID == teamid)].LTO.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_turn_over+loss_turn_over)/total_game

In [None]:
gif = add_a_feat(asist_stat,gif,'asists',reg_se_dt)
gif = add_a_feat(pfoul_stat,gif,'personal_fouls',reg_se_dt)
gif = add_a_feat(turn_over_stat,gif,'turn_overs',reg_se_dt)

The following four functions extract the following statistics of a given team from the given data:

- block_stat: average blocks per game by the team in the given data.
- def_rb_stat: average defensive rebounds per game by the team in the given data.
- off_rb_stat: average offensive rebounds per game by the team in the given data.
- steal_stat: average steals per game by the team in the given data.

In [None]:
def block_stat(teamid,data):
    win_block = data[(data.WTeamID == teamid)].WBlk.sum()
    lose_block = data[(data.LTeamID == teamid)].LBlk.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_block+lose_block)/total_game
def def_rb_stat(teamid,data):
    win_def_rb = data[(data.WTeamID == teamid)].WDR.sum()
    lose_def_rb = data[(data.LTeamID == teamid)].LDR.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_def_rb+lose_def_rb)/total_game
def off_rb_stat(teamid,data):
    win_off_rb = data[(data.WTeamID == teamid)].WOR.sum()
    lose_off_rb = data[(data.LTeamID == teamid)].LOR.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_off_rb+lose_off_rb)/total_game
def steal_stat(teamid,data):
    win_stl = data[(data.WTeamID == teamid)].WStl.sum()
    lose_stl = data[(data.LTeamID == teamid)].LStl.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_stl+lose_stl)/total_game

In [None]:
gif = add_a_feat(block_stat,gif,'blocks',reg_se_dt)
gif = add_a_feat(steal_stat,gif,'steals',reg_se_dt)
gif = add_a_feat(def_rb_stat,gif,'defansive_rebound',reg_se_dt)
gif = add_a_feat(off_rb_stat,gif,'offensive_rebound',reg_se_dt)


The following four functions extract the following statistics of a given team from the given data:

- block_ag_stat: average blocks per game the team allowed in the given data.
- def_rb_ag_stat: average defensive rebounds per game the team allowed in the given data.
- off_rb_ag_stat: average offensive rebounds per game the team allowed in the given data.
- steal_ag_stat: average steals per game the team allowed in the given data.

In [None]:
def block_ag_stat(teamid,data):
    win_block = data[(data.WTeamID == teamid)].LBlk.sum()
    lose_block = data[(data.LTeamID == teamid)].WBlk.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_block+lose_block)/total_game
def def_rb_ag_stat(teamid,data):
    win_def_rb = data[(data.WTeamID == teamid)].LDR.sum()
    lose_def_rb = data[(data.LTeamID == teamid)].WDR.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_def_rb+lose_def_rb)/total_game
def off_rb_ag_stat(teamid,data):
    win_off_rb = data[(data.WTeamID == teamid)].LOR.sum()
    lose_off_rb = data[(data.LTeamID == teamid)].WOR.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_off_rb+lose_off_rb)/total_game
def steal_ag_stat(teamid,data):
    win_stl = data[(data.WTeamID == teamid)].LStl.sum()
    lose_stl = data[(data.LTeamID == teamid)].WStl.sum()
    total_game = len(data[(data.WTeamID == teamid) | (data.LTeamID == teamid)])
    return (win_stl+lose_stl)/total_game

In [None]:
gif = add_a_feat(block_ag_stat,gif,'blocks_ag',reg_se_dt)
gif = add_a_feat(steal_ag_stat,gif,'steals_ag',reg_se_dt)
gif = add_a_feat(def_rb_ag_stat,gif,'defansive_rebound_ag',reg_se_dt)
gif = add_a_feat(off_rb_ag_stat,gif,'offensive_rebound_ag',reg_se_dt)

In [None]:
def seed_stat(teamid,data):
    #data holds the seeds of a single season; a team without a seed gets 17, one worse than the lowest seed
    team_data = data[(data.TeamID == teamid)]
    if len(team_data)==0:
        return 17
    elif len(team_data)>=2:
        print('there are two separate seeds for a team')
        return np.nan
    #keep only the digits of the seed string
    rank_str = team_data['Seed'].iloc[0]
    return int(''.join(ch for ch in rank_str if ch.isdigit()))
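The digit extraction can also be sketched with a regular expression. The example assumes seed strings of the form region letter, two digits and an optional play-in letter, such as 'W01' or 'X16a'; `parse_seed` is a hypothetical helper name.

```python
import re

def parse_seed(seed):
    """Return the numeric part of a seed string such as 'W01' or 'X16a'."""
    match = re.search(r"\d+", seed)
    if match is None:
        # Fail loudly instead of returning a made-up seed.
        raise ValueError("no digits in seed: " + repr(seed))
    return int(match.group())

print(parse_seed("W01"), parse_seed("X16a"))
```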

In [None]:
gif = add_a_feat(seed_stat,gif,'seeds',seed_data)

The following three functions extract the following statistics of a given team from the given data for a given year:

- last_tour_wins: how many games the team won in the previous year's tournament.
- last_year_played: how many games the team played in the previous year's tournament.
- last_year_level: how far the team advanced in the previous year's tournament, measured as the number of distinct opponents it faced.

These functions need the tournament compact results in order to work properly.

In [None]:
def last_tour_wins(year,teamid, data):
    year-=1
    win_game_ct = len(data[(data.Season == year) & (data.WTeamID == teamid)])
    return win_game_ct
def last_year_played(year,teamid,data):
    year-=1
    return len(data[(data.Season == year) & ((data.WTeamID == teamid) | (data.LTeamID == teamid))])
def last_year_level(year,teamid,data):
    winning = data[(data.Season == year-1) & (data.WTeamID == teamid)].LTeamID
    losing = data[(data.Season == year-1) & (data.LTeamID == teamid)].WTeamID
    return len(pd.concat([winning,losing]).unique())
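A toy example of the previous-year lookup, with made-up team ids. The hypothetical `tourney_record` helper combines the two counts that last_tour_wins and last_year_played compute separately.

```python
import pandas as pd

def tourney_record(year, team_id, games):
    """Wins and games played by a team in the tournament of the year before `year`."""
    prev = games[games["Season"] == year - 1]
    wins = int((prev["WTeamID"] == team_id).sum())
    played = wins + int((prev["LTeamID"] == team_id).sum())
    return wins, played

# Made-up 2018 tournament: team 1 wins twice, then loses to team 4.
toy_tourney = pd.DataFrame({
    "Season":  [2018, 2018, 2018, 2018],
    "WTeamID": [1, 1, 4, 4],
    "LTeamID": [2, 3, 1, 5],
})
print(tourney_record(2019, 1, toy_tourney))
```

A team that did not take part in the previous tournament gets zero wins and zero games, which is also what the three functions above return in that case.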

In [None]:
gif = add_a_feat(last_tour_wins,gif,"wins_last_tour",tour_com,-1)
gif = add_a_feat(last_year_played,gif,"played_last_tour",tour_com,-1)
gif = add_a_feat(last_year_level,gif,"highest_level_last_tour",tour_com,-1)

The following function collects the previous games between the two teams from the last two regular seasons, the current regular season, and the last three tournaments, and then counts the wins and losses: each Team 2 win counts +1 and each Team 1 win counts -1.

In [None]:
def add_prev(df):
    df['previous_results']=np.nan
    for i in range(len(df)):
        year = df.at[i,'game_year']
        team1 = df.at[i,'team1_id']
        team2 = df.at[i,'team2_id']
        com_count = 0
        com_count += len(reg_se_cm[(reg_se_cm.Season.isin(range(year-2,year+1))) & (reg_se_cm.WTeamID == team2) & (reg_se_cm.LTeamID == team1)])
        com_count -= len(reg_se_cm[(reg_se_cm.Season.isin(range(year-2,year+1))) & (reg_se_cm.WTeamID == team1) & (reg_se_cm.LTeamID == team2)])
        com_count += len(tour_com[(tour_com.Season.isin(range(year-3,year))) & (tour_com.WTeamID == team2) & (tour_com.LTeamID == team1)])
        com_count -= len(tour_com[(tour_com.Season.isin(range(year-3,year))) & (tour_com.WTeamID == team1) & (tour_com.LTeamID == team2)])
        df.at[i,'previous_results'] = com_count
    return df
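The sign convention can be checked on a toy example. The sketch below applies the same counting to a single made-up table; `head_to_head` is a hypothetical helper and does not distinguish regular season from tournament games.

```python
import pandas as pd

def head_to_head(games, team1, team2, seasons):
    """Team 2 wins minus Team 1 wins over the given seasons."""
    recent = games[games["Season"].isin(seasons)]
    team2_wins = ((recent["WTeamID"] == team2) & (recent["LTeamID"] == team1)).sum()
    team1_wins = ((recent["WTeamID"] == team1) & (recent["LTeamID"] == team2)).sum()
    return int(team2_wins - team1_wins)

# Made-up meetings between teams 1 and 2; the 2015 game is outside the window.
toy = pd.DataFrame({
    "Season":  [2017, 2018, 2019, 2015],
    "WTeamID": [1, 2, 2, 1],
    "LTeamID": [2, 1, 1, 2],
})
score = head_to_head(toy, team1=1, team2=2, seasons=range(2017, 2020))
print(score)
```

A positive value therefore points toward Team 2, which matches the result label where 1 means that Team 2 won.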


In [None]:
gif = add_prev(gif)

In [None]:
gif.to_csv('game_data.csv',index = False)