In [1]:
# -.-|m { input: false, output_error: false, input_fold: hide }
# load pretty jupyter
%load_ext pretty_jupyter

<style>
    body {
        background-color: #20B5E959
    }

    .table-striped tbody tr:nth-of-type(even) {
        background-color: #f0ccfc;
    }
    
    .table-striped tbody tr:nth-of-type(odd) {
        background-color: #eff1ff;
    }
    
    .table-striped thead th {
        background-color: #f0ccfc;
    }
    
</style>

## Introduction

This project is a college basketball simulator. By taking statistics from every possession played in the 2024 NCAA Men's college basketball season, probabilities for teams were caluclated and can be used in a possession by possession simulation. The work for making the simulation is split up into a few parts. These parts can be broken down into the following:

1. Data Collection
2. Data Processing 
3. Statistical Analysis
4. Simulating Games

The data collection step consists of reading in information from play-by-play data. This is necessary because the simulation attempts to simulate a game by estimating the result of each possession between two teams. In order to get accurate data for each team's possession results, play-by-play data is parsed. 

The data processing step takes the collected data and prepares it to be used in the statistical analysis step. The raw data originally collected is interesting to look at, but not very useful in any kind of data science sense. It consists of season level data to look at how teams compare to each other over the course of the whole season, and game level data used to see how individual teams compared in a head to head matchup. The data processing step mainly looks at the game level data. It uses an arbitrary data that is around halfway into the season, and compiles the sum of the data before that point. For every game after that date, it updates the sum of the previous game stats, and also notes the stats from the current game. The data is then altered so that instead of listing how many possession results occured in the game, a row for each possession is included in the dataset. This makes the result categorical, so that it can be more easily analyzed in the next step. 

The statistical analysis step is relatively simlpe. A multinomial logistic regression is performed on the data. The regression takes the offense's and defense's previous probabilities as inputs, and tries to come up with the probability for every possession result. This is done with SciKitLearn's LogisticRegression class. 

After the statistical analysis is concluded, there is now a model that can take two teams' possession probabilities, and come up with expected results for their possessions. These probabilities are fed into a simulation which simulates a game possession by possession. Many of these games are simulated, their results are all listed, as well as their average score.

## Data collection
For the data collection step, a number of functions are used to handle the different possibilities of a possession. The possibilities that pertain to the possession count consist of shots, rebounds, turnovers, and fouls. Free throws are also looked at to help simulate scores more accurately.

pandas will be used for lots of the work

In [2]:
import pandas as pd

# get rid of the max display columns so it is always possible to see all team statistics
pd.set_option('display.max_columns', None)

### Handle Game
The handle game function is used to find all of the useful statistics necessary for simulating a game. It does this by reading through the play-by-play data for an individual game, and determining what kind of play each row is describing. If the kind of play is relevant to the simulation statistics, helper functions are used to break down the contents of that play. 

In [3]:
def handle_game(group_data, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):    
    
    # array to hold the dictionaries containing the team stats
    stats = [home_team_poss_res, away_team_poss_res]
    

    for play in group_data.itertuples():

        play_type = play.type_text

        if "Shot" in play_type and play_type != "Block Shot":
            stats = handle_shot(play,home_team_id,away_team_id,home_team_poss_res,away_team_poss_res)

        elif "Rebound" in play_type:
            stats = handle_rebound(play,home_team_id,away_team_id,home_team_poss_res,away_team_poss_res)

        elif "Turnover" in play_type:
            stats = handle_turnover(play,home_team_id,away_team_id,home_team_poss_res,away_team_poss_res)

        elif "Foul" in play_type:
            stats = handle_foul(play,home_team_id,away_team_id,home_team_poss_res,away_team_poss_res)

        elif "FreeThrow" in play_type:
            stats = handle_free_throw(play,home_team_id,away_team_id,home_team_poss_res,away_team_poss_res)
    
    return [stats[0], stats[1]]

### Handle shot
The handle shot function is used to determine what happens after a shot in the game. It works as following:

Determine if the shot was made. If the shot was made, determine whether it was a two-point shot or a three-point shot. Adjust the team statistics accodingly. (Indicate what kind of shot was made, and increment the team's possession count)

If the shot was missed, increment the team's missed field goal count

In [4]:
def handle_shot(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    
    if play.scoring_play:
        if play.score_value == 3:
            if play.team_id == home_team_id:
                home_team_poss_res['thr_fgm'] += 1
                home_team_poss_res['poss'] += 1
            elif play.team_id == away_team_id:
                away_team_poss_res['thr_fgm'] += 1
                away_team_poss_res['poss'] += 1
    
        elif play.score_value == 2:
            if play.team_id == home_team_id:
                home_team_poss_res['two_fgm'] += 1
                home_team_poss_res['poss'] += 1
            elif play.team_id == away_team_id:
                away_team_poss_res['two_fgm'] += 1
                away_team_poss_res['poss'] += 1
                
    # for plays that are not scoring plays                
    else:
        if play.team_id == home_team_id:
            home_team_poss_res['fg_miss'] += 1
        if play.team_id == away_team_id:
            away_team_poss_res['fg_miss'] += 1
        
                    
    return [home_team_poss_res, away_team_poss_res]

### Handle Rebound
The handle shot function is used to determine the results of a rebound. 

The first thing that is checked is if the rebound was an offensive rebound or a defensive rebound. If the rebound was an offensive rebound, the team that got the rebound is determined, and that team's offensive rebound count and possession count are incremented. While this is not the standard method for counting possessions, their possession count is incremented in this project to more accurately reflect how the probabilities for each result of their possession. (*Elaborate here*)

The procedure is slightly different for defensive rebounds. The function determines which team got the rebound, and the other team's possession count is incremented. This is because this project counts possessions at the end of the possession (rather than the start). A defensive rebound is due to the other team missing a shot, and ending their possession. Thus, defensive rebounds result in the opposing team's possession count being incremented.



In [5]:
def handle_rebound(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):  

    if "Defensive" in play.text:
        if play.team_id == home_team_id:
            away_team_poss_res['poss'] += 1
        elif play.team_id == away_team_id:
            home_team_poss_res['poss'] += 1
            
    elif "Offensive" in play.text:
        if play.team_id == home_team_id:
            home_team_poss_res['oreb'] += 1
            home_team_poss_res['poss'] += 1
                
        elif play.team_id == away_team_id:
            away_team_poss_res['oreb'] += 1
            away_team_poss_res['poss'] += 1

    return [home_team_poss_res, away_team_poss_res]

### Handle turnover
The handle turnover function is simple to follow. It is determined which team committed the turnover. As a turnover means that team loses their possession, that team's turnover count and possession count are then incremented.

In [6]:
def handle_turnover(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    
    if play.team_id == home_team_id:
        home_team_poss_res['tov'] += 1
        home_team_poss_res['poss'] += 1
    elif play.team_id == away_team_id:
        away_team_poss_res['tov'] += 1
        away_team_poss_res['poss'] += 1
    
    return [home_team_poss_res, away_team_poss_res]        

### Handle foul
The handle foul functions is important for tracking possessions that did not end in a shot. Check which team committed the foul, and update the opposing team's stats (the team that was fouled). The team that was fouled gets their possession count incremented, as well as the count for how many times they were fouled

In [7]:
def handle_foul(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    
    if play.team_id == home_team_id:
        away_team_poss_res['got_fouled'] += 1
        away_team_poss_res['poss'] += 1
    elif play.team_id == away_team_id:
        home_team_poss_res['got_fouled'] += 1
        home_team_poss_res['poss'] += 1 
     
    return [home_team_poss_res, away_team_poss_res]    

### Handle free throw
The handle free throw function does nothing to affect possession counts; it is simply used to keep track of a team's free throw percentage for sake of the simulation. The team is determined, and either their free throws made or free throws missed are incremented accordingly.

In [8]:
def handle_free_throw(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
        
    if play.scoring_play:
        if play.team_id == home_team_id: 
            home_team_poss_res['ft'] += 1
        elif play.team_id == away_team_id:
            away_team_poss_res['ft'] += 1
    elif not play.scoring_play:
        if play.team_id == home_team_id: 
            home_team_poss_res['ft_miss'] += 1
        elif play.team_id == away_team_id:
            away_team_poss_res['ft_miss'] += 1
        
    return [home_team_poss_res, away_team_poss_res]

### Update stat_list
The update stat list function is used to keep track of team's stats across many games. The stat_list structure is a dictionary of dictionarys. The outer dictionary consists of team names as keys, and their statistics as values. The inner dictionaries consist of team statistics as keys, and the actual counts as values. 

This function works by first determining whether or not a team is already in the dictionary. If it is not, it enters their stats from their first game. Otherwise, it uses more current game data to add to the team's total stat count across the whole season.

In [9]:
def update_stat_list(team_stat_list, home_team_name, away_team_name, home_team_poss_res, away_team_poss_res):
    
    # update stats for home team
    if home_team_name in team_stat_list:
        for key, value in home_team_poss_res.items():
            if type(value) != str:
                team_stat_list[home_team_name][key] += value
    else:
        team_stat_list[home_team_name] = home_team_poss_res
        
    # update stats for away team
    if away_team_name in team_stat_list:
        for key, value in away_team_poss_res.items():
            if type(value) != str:
                team_stat_list[away_team_name][key] += value
    else:
        team_stat_list[away_team_name] = away_team_poss_res
    
    return team_stat_list

### Update opp list
The update opp list function effectively tracks a team's defense. The first thing that is done is a renaming. By taking one team's name and adding "_defense" to it, and then attaching that to the other team's stats, the result is the first team's defensive statistics. 

Besides the switching of the team names, the rest of this function operates exactly the same as the update stat list function. Check if a team exists in the list, if so update their stats accordingly. If they don't already exist in the list, initialize their stat values.

In [10]:
def update_opp_list(team_opp_list, home_team_name, away_team_name, home_team_poss_res, away_team_poss_res):
    # Make copies of the input dictionaries
    home_stats = home_team_poss_res.copy()
    away_stats = away_team_poss_res.copy()
    
    # Modify the copies
    away_stats['team_name'] = home_team_name + "_defense"
    home_stats['team_name'] = away_team_name + "_defense"
    
    # Update stats for home team
    if home_team_name in team_opp_list:
        for key, value in away_stats.items():
            if isinstance(value, (int, float)):
                team_opp_list[home_team_name][key] += value
    else:
        team_opp_list[home_team_name] = away_stats
        
    # Update stats for away team
    if away_team_name in team_opp_list:
        for key, value in home_stats.items():
            if isinstance(value, (int, float)):
                team_opp_list[away_team_name][key] += value
    else:
        team_opp_list[away_team_name] = home_stats
    
    return team_opp_list

### Core data collection loop
The loop to collect all of the data is rather simple due to having so many helper functions. The play-by-play data is read into a pandas dataframe from a csv file. As the file consists of many games listed one after another, the pandas groupby() function is used to look at each individual game by game_id. When looking at an individual game, the home team and away team is determined. The teams, as well as dictionaries ready to hold their stats are passed to the handle_game function. The handle_game function returns those updated dictionaries, and they are passed to the update_stat_list and update_opp_list functions. After reading through the whole game, the game date and each team's possession statistics are recorded in the result_list array. After doing this for the whole dataset, the team_stat_list and team_opp_list dictionaries contains every team's season statistics.

In [11]:
# read in the play by play data
games = pd.read_csv("2024_play_by_play.csv")

# group the play by play data to be able to look at individual games
grouped = games.groupby('game_id')

# two dictionaries used to keep track of offensive and defensive possession statistics
team_stat_list = {}
team_opp_list = {}

# used to keep track of when games occured and the team statistics from that game
result_list = []

for game_id, group_data in grouped:
    
    home_team_id = group_data.iloc[0]["home_team_id"]
    away_team_id = group_data.iloc[0]["away_team_id"]

    home_team_name = group_data.iloc[0]["home_team_name"]
    away_team_name = group_data.iloc[0]["away_team_name"]
    
    game_date = group_data.iloc[0]["game_date"]
    
    # dictionaries to keep track of team possession statistics
    home_team_poss_res = {"team_name": home_team_name, "two_fgm": 0, "thr_fgm": 0, "fg_miss": 0, "ft": 0, "ft_miss": 0, "tov": 0, "oreb": 0, "got_fouled": 0, "poss": 0}
    away_team_poss_res = {"team_name": away_team_name, "two_fgm": 0, "thr_fgm": 0, "fg_miss": 0, "ft": 0, "ft_miss": 0, "tov": 0, "oreb": 0, "got_fouled": 0, "poss": 0}
    
    # handle game returns a list containing the possession statistics for each team
    game_stats = handle_game(group_data, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
    
    # take note of the game date, and add a copy of each teams statistics from the game_stats array
    result_list.append([game_date, game_stats[0].copy(), game_stats[1].copy()])
    
    
    team_stat_list = update_stat_list(team_stat_list, home_team_name, away_team_name, game_stats[0], game_stats[1] )
    team_opp_list  = update_opp_list( team_opp_list,  home_team_name, away_team_name, game_stats[0], game_stats[1] )
    

### Examples of season level stats

In [12]:
# turn offensive and defensive stats into pandas DataFrames
team_off_stats = pd.DataFrame.from_dict(team_stat_list, orient='index')
team_def_stats = pd.DataFrame.from_dict(team_opp_list, orient='index')

In [13]:
team_off_stats.head()

Unnamed: 0,team_name,two_fgm,thr_fgm,fg_miss,ft,ft_miss,tov,oreb,got_fouled,poss
Eastern Washington,Eastern Washington,620,274,889,525,156,422,327,628,2982
Montana,Montana,725,278,1068,477,125,371,330,582,3135
Idaho,Idaho,535,243,991,357,130,366,273,487,2705
Montana State,Montana State,608,300,1041,414,159,391,290,589,3068
Idaho State,Idaho State,623,218,1027,429,196,372,376,607,2970


In [14]:
team_def_stats.head()

Unnamed: 0,team_name,two_fgm,thr_fgm,fg_miss,ft,ft_miss,tov,oreb,got_fouled,poss
Eastern Washington,Eastern Washington_defense,549,282,1076,454,158,385,396,587,3029
Montana,Montana_defense,697,225,1159,509,197,349,397,606,3210
Idaho,Idaho_defense,546,236,945,464,172,356,316,571,2780
Montana State,Montana State_defense,678,222,1035,495,199,457,392,637,3201
Idaho State,Idaho State_defense,665,200,954,422,144,415,294,549,2882


## Data Processing

The goal of the data processing step is to take all of the data collected above and format it so that a multinomial logistic regression can be performed. This means taking aggregate possession data from previous games, turning it into percentages, and coming up with a categorical results column.

### Game by game data

This code makes a dataframe to hold data (from each team's perspective) from every individual game. The data is read from the result_list array. The game date is noted, as well as each team's possession statistics from said game. One row is made in the dataframe from the perspective of the home team, and another row is made from the perspective of the away team. The end result is a dataframe with two entries for every game played.

In [15]:
# sort the list of results by game_date
result_list= sorted(result_list, key=lambda x: x[0])

stats_on_date = pd.DataFrame({})
for row in result_list:
    date = row[0]
    team1 = row[1]
    team2 = row[2]

    game = {"date": date, 
            
            "team_name": team1['team_name'], 
            "team_twos": team1['two_fgm'], 
            "team_threes": team1['thr_fgm'], 
            "team_miss": team1['fg_miss'],
            "team_ft": team1['ft'],
            "team_ft_miss": team1['ft_miss'],
            "team_tov": team1['tov'],
            "team_oreb": team1['oreb'],
            "team_fouled": team1['got_fouled'],
            "team_poss": team1['poss'],
            
            "opp_name": team2['team_name'], 
            "opp_twos": team2['two_fgm'], 
            "opp_threes": team2['thr_fgm'], 
            "opp_miss": team2['fg_miss'],
            "opp_ft": team2['ft'],
            "opp_ft_miss": team2['ft_miss'],
            "opp_tov": team2['tov'],
            "opp_oreb": team2['oreb'],
            "opp_fouled": team2['got_fouled'],
            "opp_poss": team2['poss']
           }
    
    opp_game = {"date": date, 
            
            "team_name": team2['team_name'], 
            "team_twos": team2['two_fgm'], 
            "team_threes": team2['thr_fgm'], 
            "team_miss": team2['fg_miss'],
            "team_ft": team2['ft'],
            "team_ft_miss": team2['ft_miss'],
            "team_tov": team2['tov'],
            "team_oreb": team2['oreb'],
            "team_fouled": team2['got_fouled'],
            "team_poss": team2['poss'],
            
            "opp_name": team1['team_name'], 
            "opp_twos": team1['two_fgm'], 
            "opp_threes": team1['thr_fgm'], 
            "opp_miss": team1['fg_miss'],
            "opp_ft": team1['ft'],
            "opp_ft_miss": team1['ft_miss'],
            "opp_tov": team1['tov'],
            "opp_oreb": team1['oreb'],
            "opp_fouled": team1['got_fouled'],
            "opp_poss": team1['poss']
           }
    
    # add values from home team's perspective
    new_row = pd.DataFrame.from_dict([game])
    stats_on_date = pd.concat([stats_on_date, new_row])
    
    # add values from away team's perspective
    new_row = pd.DataFrame.from_dict([opp_game])
    stats_on_date = pd.concat([stats_on_date, new_row])
    
stats_on_date = stats_on_date.reset_index(drop=True)
stats_on_date.head(10)
    
 

Unnamed: 0,date,team_name,team_twos,team_threes,team_miss,team_ft,team_ft_miss,team_tov,team_oreb,team_fouled,team_poss,opp_name,opp_twos,opp_threes,opp_miss,opp_ft,opp_ft_miss,opp_tov,opp_oreb,opp_fouled,opp_poss
0,2023-11-06,Tulsa,16,8,36,14,6,18,20,18,100,Central Arkansas,14,6,40,7,5,12,11,15,89
1,2023-11-06,Central Arkansas,14,6,40,7,5,12,11,15,89,Tulsa,16,8,36,14,6,18,20,18,100
2,2023-11-06,Northwestern,20,5,34,17,4,12,16,19,94,Binghamton,15,7,31,10,2,19,10,16,90
3,2023-11-06,Binghamton,15,7,31,10,2,19,10,16,90,Northwestern,20,5,34,17,4,12,16,19,94
4,2023-11-06,Syracuse,23,5,39,22,5,11,17,22,105,New Hampshire,17,8,43,14,5,16,15,17,106
5,2023-11-06,New Hampshire,17,8,43,14,5,16,15,17,106,Syracuse,23,5,39,22,5,11,17,22,105
6,2023-11-06,Minnesota,19,5,22,27,8,17,13,29,100,Bethune-Cookman,19,4,47,10,4,14,23,16,104
7,2023-11-06,Bethune-Cookman,19,4,47,10,4,14,23,16,104,Minnesota,19,5,22,27,8,17,13,29,100
8,2023-11-06,Nebraska,16,11,29,19,10,9,8,21,92,Lindenwood,19,3,46,5,4,12,13,12,95
9,2023-11-06,Lindenwood,19,3,46,5,4,12,13,12,95,Nebraska,16,11,29,19,10,9,8,21,92


### Mid season aggregate data

The following code comes up with the aggregate possession data. Feburary 1st is chosen as the point to start calculating game by game data. Every game *before* Feb 1 has its data summed up. Next, every game after Feb 1 is iterated through, the team's aggregate data is found and added to, and the game data for the current game is kept the same. 

In [16]:
# Convert date column to datetime type
stats_on_date['date'] = pd.to_datetime(stats_on_date['date'])

start_date = '2024-02-01'

# Filter dataframe to include only games after or on the start date
filtered_df = stats_on_date[stats_on_date['date'] >= start_date]

# Initialize list to store rows for the new dataframe
new_rows = []

team_stats_columns = ["team_twos", "team_threes", "team_miss", "team_tov", "team_oreb", "team_fouled", "team_poss"]
opp_stats_columns = ["opp_twos", "opp_threes", "opp_miss", "opp_tov", "opp_oreb", "opp_fouled", "opp_poss"]

# Iterate through each game in the original dataframe
for index, game in filtered_df.iterrows():
    # Extract team name and opponent name for the current game
    team_name = game['team_name']
    opp_name = game['opp_name']

    # Calculate sum of stats for team and opponent based on games before the current game
    team_prev_sum = stats_on_date[stats_on_date['team_name'] == team_name].loc[:index-1][team_stats_columns].sum()
    opp_prev_sum = stats_on_date[stats_on_date['opp_name'] == opp_name].loc[:index-1][team_stats_columns].sum()

    # Extract team stats for the current game
    team_game_stat = game[team_stats_columns].tolist()

    # Combine all data into a single row
    new_row = [game['date'], team_name] + team_prev_sum.tolist() + [opp_name] + opp_prev_sum.tolist() + team_game_stat

    # Append row to the list
    new_rows.append(new_row)

new_columns = ["date", 
               "team_name", "prev_team_twos", "prev_team_threes", "prev_team_miss", "prev_team_tov", "prev_team_oreb", "prev_team_fouled", "prev_team_poss",
               "opp_name", "prev_opp_twos", "prev_opp_threes", "prev_opp_miss", "prev_opp_tov", "prev_opp_oreb", "prev_opp_fouled", "prev_opp_poss"] + \
               team_stats_columns

mid_season_data = pd.DataFrame(new_rows, columns=new_columns)
mid_season_data.head(6)

Unnamed: 0,date,team_name,prev_team_twos,prev_team_threes,prev_team_miss,prev_team_tov,prev_team_oreb,prev_team_fouled,prev_team_poss,opp_name,prev_opp_twos,prev_opp_threes,prev_opp_miss,prev_opp_tov,prev_opp_oreb,prev_opp_fouled,prev_opp_poss,team_twos,team_threes,team_miss,team_tov,team_oreb,team_fouled,team_poss
0,2024-02-01,Montana State,361.0,163.0,603.0,234.0,164.0,355.0,1803.0,Eastern Washington,345.0,156.0,675.0,253.0,252.0,374.0,1900.0,15,9,41,12,19,20,105
1,2024-02-01,Eastern Washington,370.0,188.0,570.0,265.0,212.0,372.0,1858.0,Montana State,394.0,128.0,594.0,266.0,231.0,392.0,1889.0,20,2,27,18,9,18,92
2,2024-02-01,Montana,439.0,157.0,643.0,215.0,203.0,327.0,1850.0,Idaho,335.0,158.0,609.0,227.0,196.0,345.0,1751.0,16,8,25,11,10,21,87
3,2024-02-01,Idaho,333.0,158.0,647.0,218.0,182.0,313.0,1729.0,Montana,404.0,132.0,678.0,209.0,230.0,351.0,1885.0,19,9,27,10,6,8,74
4,2024-02-01,Northern Colorado,421.0,169.0,634.0,220.0,205.0,336.0,1868.0,Idaho State,417.0,114.0,588.0,255.0,198.0,341.0,1793.0,19,9,30,10,5,22,94
5,2024-02-01,Idaho State,381.0,134.0,630.0,244.0,233.0,353.0,1828.0,Northern Colorado,370.0,184.0,658.0,244.0,192.0,323.0,1851.0,17,10,44,11,19,27,115


### Possession results as percentages

This code takes the aggregate possession data and turns each possession result into the percentages of all the team's possessions.

In [17]:
poss_percent = mid_season_data.copy()

for index, row in poss_percent.iterrows():
    if row['prev_team_poss'] > 0:
        prev_team_poss = row['prev_team_poss']
        
        
        poss_percent.at[index, 'prev_team_twos'] =   row['prev_team_twos'] / prev_team_poss
        poss_percent.at[index, 'prev_team_threes'] = row['prev_team_threes'] / prev_team_poss
        poss_percent.at[index, 'prev_team_miss'] =   row['prev_team_miss'] / prev_team_poss
        poss_percent.at[index, 'prev_team_tov'] =    row['prev_team_tov'] / prev_team_poss
        poss_percent.at[index, 'prev_team_oreb'] =   row['prev_team_oreb'] / prev_team_poss
        poss_percent.at[index, 'prev_team_fouled'] = row['prev_team_fouled'] / prev_team_poss
        
    if row['prev_opp_poss'] > 0:
        prev_opp_poss = row['prev_opp_poss']
        
        poss_percent.at[index, 'prev_opp_twos'] =   row['prev_opp_twos'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_threes'] = row['prev_opp_threes'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_miss'] =   row['prev_opp_miss'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_tov'] =    row['prev_opp_tov'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_oreb'] =   row['prev_opp_oreb'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_fouled'] = row['prev_opp_fouled'] / prev_opp_poss
        
        
poss_percent.head(6)

Unnamed: 0,date,team_name,prev_team_twos,prev_team_threes,prev_team_miss,prev_team_tov,prev_team_oreb,prev_team_fouled,prev_team_poss,opp_name,prev_opp_twos,prev_opp_threes,prev_opp_miss,prev_opp_tov,prev_opp_oreb,prev_opp_fouled,prev_opp_poss,team_twos,team_threes,team_miss,team_tov,team_oreb,team_fouled,team_poss
0,2024-02-01,Montana State,0.200222,0.090405,0.334443,0.129784,0.09096,0.196894,1803.0,Eastern Washington,0.181579,0.082105,0.355263,0.133158,0.132632,0.196842,1900.0,15,9,41,12,19,20,105
1,2024-02-01,Eastern Washington,0.199139,0.101184,0.306781,0.142626,0.114101,0.200215,1858.0,Montana State,0.208576,0.067761,0.314452,0.140815,0.122287,0.207517,1889.0,20,2,27,18,9,18,92
2,2024-02-01,Montana,0.237297,0.084865,0.347568,0.116216,0.10973,0.176757,1850.0,Idaho,0.191319,0.090234,0.347801,0.12964,0.111936,0.19703,1751.0,16,8,25,11,10,21,87
3,2024-02-01,Idaho,0.192597,0.091382,0.374205,0.126084,0.105263,0.181029,1729.0,Montana,0.214324,0.070027,0.359682,0.110875,0.122016,0.186207,1885.0,19,9,27,10,6,8,74
4,2024-02-01,Northern Colorado,0.225375,0.090471,0.3394,0.117773,0.109743,0.179872,1868.0,Idaho State,0.232571,0.063581,0.327942,0.14222,0.110429,0.190184,1793.0,19,9,30,10,5,22,94
5,2024-02-01,Idaho State,0.208425,0.073304,0.344639,0.133479,0.127462,0.193107,1828.0,Northern Colorado,0.199892,0.099406,0.355484,0.131821,0.103728,0.1745,1851.0,17,10,44,11,19,27,115


### Categorical Results

This cell makes the categorical result column. Every game is read through, and for each possession result from that games, a loop is created and every result gets its own line.

In [18]:

row_data = []
for index, row in poss_percent.iterrows():
    twos = row['team_twos']
    threes = row['team_threes']
    misses = row['team_miss']
    tov = row['team_tov']
    oreb = row['team_oreb']
    fouls = row['team_fouled']
    poss = row['team_poss']
    
    data = row.values.tolist()
    
    prev_data = data[0:17]
    
    for i in range(misses):
        res = [0]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(oreb):
        res = [1]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(twos):
        res = [2]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(threes):
        res = [3]
        new_row = prev_data + res
        row_data.append(new_row)
        
    for i in range(tov):
        res = [4]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(fouls):
        res = [5]
        new_row = prev_data + res
        row_data.append(new_row)
        
    
        
columns = ['date', 'team_name', 'prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'prev_poss',
                    'opp_name', 'opp_twos',  'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled','opp_poss',
            'result']

game_results = pd.DataFrame(row_data, columns=columns)
game_results.tail(6)

Unnamed: 0,date,team_name,prev_twos,prev_threes,prev_miss,prev_tov,prev_oreb,prev_fouled,prev_poss,opp_name,opp_twos,opp_threes,opp_miss,opp_tov,opp_oreb,opp_fouled,opp_poss,result
425629,2024-04-08,Purdue,0.210003,0.087593,0.31307,0.119923,0.149489,0.216358,3619.0,UConn,0.185498,0.067851,0.394292,0.120268,0.128422,0.186372,3434.0,5
425630,2024-04-08,Purdue,0.210003,0.087593,0.31307,0.119923,0.149489,0.216358,3619.0,UConn,0.185498,0.067851,0.394292,0.120268,0.128422,0.186372,3434.0,5
425631,2024-04-08,Purdue,0.210003,0.087593,0.31307,0.119923,0.149489,0.216358,3619.0,UConn,0.185498,0.067851,0.394292,0.120268,0.128422,0.186372,3434.0,5
425632,2024-04-08,Purdue,0.210003,0.087593,0.31307,0.119923,0.149489,0.216358,3619.0,UConn,0.185498,0.067851,0.394292,0.120268,0.128422,0.186372,3434.0,5
425633,2024-04-08,Purdue,0.210003,0.087593,0.31307,0.119923,0.149489,0.216358,3619.0,UConn,0.185498,0.067851,0.394292,0.120268,0.128422,0.186372,3434.0,5
425634,2024-04-08,Purdue,0.210003,0.087593,0.31307,0.119923,0.149489,0.216358,3619.0,UConn,0.185498,0.067851,0.394292,0.120268,0.128422,0.186372,3434.0,5


## Statistical Analysis

### Multinomial Logistic Regression

A multinomial regression is a statistical method used to predict the outcome of a categorical dependent variable with more than two categories. This project uses input data of a team's offensive capabilities (measured by their percentage of possessions that end in missed shots, two-point shots, three-point shots, offensive rebounds, turnovers, and fouls), and another team's defensive capabilities (measured by the same results). By running a multinomial regression on a team's offense and team's defense, probabilities for the team's offensive possession ending in one of those results are calculated. 

In [19]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)

# fit the model on the whole dataset
test_columns = ['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']
X = game_results[test_columns]
y = game_results['result']
model.fit(X, y)

# take a random row of data from the game_results dataframe to test
row = [0.210003, 0.087593, 0.31307, 0.119923, 0.149489, 0.216358, 0.185498, 0.067851, 0.394292, 0.120268, 0.128422, 0.186372]
row_df = pd.DataFrame([row], columns=test_columns)

# predict a multinomial probability distribution
results = model.predict_proba(row_df)

# summarize the predicted probabilities
print(f"Predicted Probabilities of Purdue when facing UConn:\n{results[0]}")

Predicted Probabilities of Purdue when facing UConn:
[0.32276465 0.12013916 0.18664887 0.07507893 0.10356496 0.19180343]


## Simulating Games

The game simulation is done by simulating alternating possessions between two teams. The team's possession result probabilities are obtained from the model built with multinomial regression. Probabilities are used from both teams' offense and defense in order to come up with the liklihood for each team scoring on a given possession. After simulating a predefined number of possessions, the scores are reported. 

### Get team possession probabilities

This function is used to determine each team's possession result probabilities. The first thing that is done is obtaining the most recent aggregate probabilities for each team's offense and defense. These are combined so that one team's offensive and the other team's defensive probabilities are grouped together. These are then passed to the model built on the multinomial regression, and the calculated values are returned. 

In [33]:
# get team probabilities
def get_probs(team1, team2):
    
    team1_off = game_results[game_results['team_name'] == team1][['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled']].iloc[-1].tolist() 
    team2_off = game_results[game_results['team_name'] == team2][['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled']].iloc[-1].tolist()
    
    team1_def = game_results[game_results['opp_name'] == team1][['opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']].iloc[-1].tolist()
    team2_def = game_results[game_results['opp_name'] == team1][['opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']].iloc[-1].tolist()
        
    most_recent_team1 = team1_off + team2_def
    most_recent_team2 = team2_off + team1_def
    
    test_columns = ['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']
    
    
    team1_probs = model.predict_proba(pd.DataFrame([most_recent_team1], columns=test_columns)).tolist()[0]
    team2_probs = model.predict_proba(pd.DataFrame([most_recent_team2], columns=test_columns)).tolist()[0]
    
    print([team1_probs]) 
    print([team2_probs])
    
    return [team1_probs, team2_probs]

probs = get_probs("North Carolina", "Duke")
print(f"Example probabilities between North Carolina and Duke:\n{probs[0]}\n{probs[1]}")


[[0.36136018257551633, 0.10115549297568453, 0.19492609017437149, 0.07580382814234912, 0.09460779242619718, 0.17214661370588133]]
[[0.3535532476244952, 0.09711739050523685, 0.2028816180158325, 0.08113898256442995, 0.0960341810888952, 0.1692745802011103]]
Example probabilities between North Carolina and Duke:
[0.36136018257551633, 0.10115549297568453, 0.19492609017437149, 0.07580382814234912, 0.09460779242619718, 0.17214661370588133]
[0.3535532476244952, 0.09711739050523685, 0.2028816180158325, 0.08113898256442995, 0.0960341810888952, 0.1692745802011103]


### Get team misc stats

Since the dataframe containing the aggregate possession result probabilities does not contain helpful information about free throw percentage or possession count, these numbers are obtained in the get_misc_stats function. The teams' free throw counts are obtained from the *team_off_stats* dataframe which contains season-level statistics, and the free throw percentage is calculated. 

Next the average number of possessions per game for each team is determined. This is done with the *stats_on_date* dataframe. This dataframe contains game-level stats for every team, and the average possession count is found by simply calling the mean function on the possession column for each team.

In [21]:
def get_misc_stats(team1, team2):
    
    # grab free throw stats for teams
    team1_fts = team_off_stats.loc[team1][['ft', 'ft_miss']].tolist()
    team2_fts = team_off_stats.loc[team2][['ft', 'ft_miss']].tolist()
        
    # calculate free throw percentage
    team1_ft_p = team1_fts[0] / (team1_fts[0] + team1_fts[1])
    team2_ft_p = team2_fts[0] / (team2_fts[0] + team2_fts[1])
    
    # grab average number of possesions in a game
    team1_poss = round(stats_on_date.loc[stats_on_date['team_name'] == team1, 'team_poss'].mean())    
    team2_poss = round(stats_on_date.loc[stats_on_date['team_name'] == team2, 'team_poss'].mean())
    
    team1_stats = [team1_poss, team1_ft_p]
    team2_stats = [team2_poss, team2_ft_p]
    
    return [team1_stats, team2_stats]

stats = get_misc_stats("North Carolina", "Duke")
print(f"Example of the misc stats collected for North Carolina and Duke:\n{stats[0]}\n{stats[1]}")


Example of the misc stats collected for North Carolina and Duke:
[94, 0.7589073634204275]
[87, 0.7227586206896551]


### Simulate a possession

The sim_poss function is called to simulate a possession and record its result during a game simulation. It works by simulating a team's offensive possession. That team's possession result probabilities are given as an argument to the function. Using those probabilities, a result is chosen with the random.choices method. Depending on the result, the score that the result of the play would finish in is returned. 

In [22]:
import random
def adj(mean, mu):
    return random.normalvariate(mean, mu)

def sim_poss( probs ):
    while ( True ):
        # list of every option for a given possession
        options = ['fg_miss', 'two_pointer', 'three_pointer', 'turnover', 'foul']
        # list of probability for every possession option
        probabilities = [adj(probs[0], .0),      # miss
                         adj(probs[2], .00),     # two
                         adj(probs[3], .0),      # three
                         adj(probs[4], .00),     # tov
                         adj(probs[5], .00) ]    # foul
        

        # randomly choose possesion option
        result = random.choices(options, weights=probabilities, k=1)[0]

        # return how each possession option will affect the score
        if result == 'fg_miss':
            # if offensive rebound, keep current possession
            x = random.random()
            if (x < adj(probs[1], .008 ) ) : 
                pass
            else:
                return 0
            
        elif result == 'two_pointer':
            return 2
        
        elif result == 'three_pointer':
            return 3
        
        elif result == 'foul':
            ft_made = 0
            # simulate two free throw shots
            for i in range(2):
                x = random.random()
                if (x < adj(probs[6], .015) ):
                    ft_made += 1
            
            return ft_made
            
        else:
            return 0

### Average function

Very simple helper function used to calculate the average score after lots of games are simulated

In [23]:
# used to compute the average score of lots of simulated games
def Average(x): 
    return sum(x) / len(x) 

### Core game simulation loop

The sim_games function is the culmination of this project. It repeatedly simulates games between two user-given teams. 

It first asks for the two teams that should be playing each other in the simulation. It uses the *get_probs* and *get_misc_stats* functions to obtain the teams' necessary statistics, and combines these into one array. 

Next a number of games (given by the optional argument to the function) are simulated. A game is simulated by calling the *sim_poss* function until the sum of each team's average possession count has been reached. The scores of the game are noted, as well as a win count for each team. After the requested number of games has been reached (or ten games if no argument was given), the scores of each game are reported, as well as each team's average score and win count.

In [24]:
def sim_games(num_games = 10, team1 = "North Carolina", team2 = "Duke"):

    # grab team stats from helper functions
    probs = get_probs(team1, team2)
    misc  = get_misc_stats(team1, team2)

    # these contain the possession result probabilities
    team1_probs = probs[0]
    team2_probs = probs[1]

    # these contain average possession count and free throw percentage
    team1_misc = misc[0]
    team2_misc = misc[1]

    # add the two teams average possession count to get the possession for simulated game
    max_poss = team1_misc[0] + team2_misc[0]

    # make one array with the all of the teams necessary stats
    both_team_probs = [team1_probs + [team1_misc[1]], team2_probs + [team2_misc[1]] ]


    # used to keep track of the scores across multiple sims
    team1_scores = []
    team2_scores = []

    team1_wins = 0
    team2_wins = 0


    
    for i in range(num_games):
        scores = [0,0]
        curr_poss = 0
        while curr_poss < max_poss:
            team = curr_poss%2
            scores[team] += sim_poss(both_team_probs[team] )
            curr_poss+=1
            
        team1_scores.append(scores[0])
        team2_scores.append(scores[1])
        if (scores[0] > scores[1]):
            team1_wins += 1
        else:
            team2_wins += 1
        print(f"Game {i+1:2.0f}    {team1:>20} {scores[0]:3.0f} \t {scores[1]:3.0f}   {team2:<20} ")

    print(f"\nAverage {team1:^30} score: {Average(team1_scores):.2f}" + 
          f"\nAverage {team2:^30} score: {Average(team2_scores):.2f}")
    
    print(f"\n{team1:<30} wins: {team1_wins:2.0f}" + 
          f"\n{team2:<30} wins: {team2_wins:2.0f}")

### Simulated results

Simulating ten game between North Carolina and Butler

In [34]:
sim_games(team1="North Carolina", team2="Butler")

[[0.36136018257551633, 0.10115549297568453, 0.19492609017437149, 0.07580382814234912, 0.09460779242619718, 0.17214661370588133]]
[[0.37258948029239697, 0.09587425846405664, 0.1944741830437964, 0.07918155897290767, 0.09878329872756955, 0.1590972204992728]]
Game  1          North Carolina  97 	  88   Butler               
Game  2          North Carolina  84 	  95   Butler               
Game  3          North Carolina  78 	  96   Butler               
Game  4          North Carolina 107 	  83   Butler               
Game  5          North Carolina  86 	  82   Butler               
Game  6          North Carolina  93 	 101   Butler               
Game  7          North Carolina 107 	 107   Butler               
Game  8          North Carolina  82 	  78   Butler               
Game  9          North Carolina  77 	  86   Butler               
Game 10          North Carolina 130 	  93   Butler               

Average         North Carolina         score: 94.10
Average             Butler     

Simulating ten games between Duke and North Carolina

In [36]:
sim_games(team1="North Carolina", team2="Duke")

[[0.36136018257551633, 0.10115549297568453, 0.19492609017437149, 0.07580382814234912, 0.09460779242619718, 0.17214661370588133]]
[[0.3535532476244952, 0.09711739050523685, 0.2028816180158325, 0.08113898256442995, 0.0960341810888952, 0.1692745802011103]]
Game  1          North Carolina  99 	  82   Duke                 
Game  2          North Carolina 101 	  98   Duke                 
Game  3          North Carolina  61 	  98   Duke                 
Game  4          North Carolina 100 	  97   Duke                 
Game  5          North Carolina 103 	  97   Duke                 
Game  6          North Carolina  87 	  82   Duke                 
Game  7          North Carolina  90 	  99   Duke                 
Game  8          North Carolina  87 	  86   Duke                 
Game  9          North Carolina  87 	  90   Duke                 
Game 10          North Carolina  95 	  87   Duke                 

Average         North Carolina         score: 91.00
Average              Duke        

Simulating twenty games between Houston and Purdue

In [27]:
sim_games(20, "Houston", "Purdue")

Game  1                 Houston  91 	  83   Purdue               
Game  2                 Houston  81 	  81   Purdue               
Game  3                 Houston  90 	  78   Purdue               
Game  4                 Houston  92 	  92   Purdue               
Game  5                 Houston  97 	  88   Purdue               
Game  6                 Houston  90 	  83   Purdue               
Game  7                 Houston  79 	  83   Purdue               
Game  8                 Houston  84 	 109   Purdue               
Game  9                 Houston  92 	  75   Purdue               
Game 10                 Houston  76 	 108   Purdue               
Game 11                 Houston  76 	  79   Purdue               
Game 12                 Houston  89 	  94   Purdue               
Game 13                 Houston  90 	 108   Purdue               
Game 14                 Houston  83 	  92   Purdue               
Game 15                 Houston  68 	  83   Purdue               
Game 16   

Simulating twenty games between Purdue and UConn

In [28]:
sim_games(20, "Purdue", "UConn")

Game  1                  Purdue 116 	  95   UConn                
Game  2                  Purdue  81 	 107   UConn                
Game  3                  Purdue  91 	 119   UConn                
Game  4                  Purdue  71 	 101   UConn                
Game  5                  Purdue  90 	 119   UConn                
Game  6                  Purdue 105 	  97   UConn                
Game  7                  Purdue  92 	 103   UConn                
Game  8                  Purdue  98 	 110   UConn                
Game  9                  Purdue  86 	  97   UConn                
Game 10                  Purdue  98 	  91   UConn                
Game 11                  Purdue 105 	  92   UConn                
Game 12                  Purdue  85 	  91   UConn                
Game 13                  Purdue 103 	  94   UConn                
Game 14                  Purdue  89 	  95   UConn                
Game 15                  Purdue  91 	  81   UConn                
Game 16   