# Training an NBA Game Prediction Model

In this project, I will train an ML model to predict the results of future NBA games.

Using data from every game in the 2015/2016 through 2023/2024 seasons (**time-series data**), I will use feature selection to identify the best set of predictors. From there, I will train the ML model to predict which teams will win their future games. For now, this will be a supervised learning model where we compare the model's predictions to the real results of games that we *already* have the results for.

I will be using the scikit-learn library (`sklearn`) for both the feature selection and the model.

The first thing we will do is create a pandas DataFrame to process the data. Each row represents one team's game and contains many data points about that team's performance during the game.

After initializing the DataFrame, we will re-order the rows chronologically by date and remove some unneeded columns (such as duplicates or things we don't want the model to use as predictors).

In [996]:
import pandas as pd
df = pd.read_csv("data/nba_games.csv", index_col=0)
df = df.sort_values("date")
df = df.reset_index(drop=True)

del df["mp.1"]
del df["mp_opp.1"]
del df["index_opp"]

df

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,tov%_max_opp,usg%_max_opp,ortg_max_opp,drtg_max_opp,team_opp,total_opp,home_opp,season,date,won
0,240.0,35.0,83.0,0.422,6.0,18.0,0.333,19.0,27.0,0.704,...,69.4,43.7,206.0,104.0,GSW,111,1,2016,2015-10-27,False
1,240.0,37.0,87.0,0.425,7.0,19.0,0.368,16.0,23.0,0.696,...,30.4,29.0,138.0,105.0,CLE,95,0,2016,2015-10-27,True
2,240.0,37.0,96.0,0.385,12.0,29.0,0.414,20.0,26.0,0.769,...,57.1,33.8,258.0,121.0,ATL,94,1,2016,2015-10-27,True
3,240.0,37.0,82.0,0.451,8.0,27.0,0.296,12.0,15.0,0.800,...,33.3,23.6,132.0,104.0,DET,106,0,2016,2015-10-27,False
4,240.0,41.0,96.0,0.427,9.0,30.0,0.300,20.0,22.0,0.909,...,37.5,38.9,201.0,120.0,NOP,95,0,2016,2015-10-27,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23039,240.0,38.0,82.0,0.463,17.0,46.0,0.370,13.0,14.0,0.929,...,50.0,40.0,205.0,124.0,DAL,99,1,2024,2024-06-12,True
23040,240.0,29.0,80.0,0.363,14.0,41.0,0.341,12.0,13.0,0.923,...,100.0,40.8,212.0,99.0,DAL,122,1,2024,2024-06-14,False
23041,240.0,46.0,91.0,0.505,15.0,37.0,0.405,15.0,22.0,0.682,...,50.0,35.9,200.0,138.0,BOS,84,0,2024,2024-06-14,True
23042,240.0,38.0,89.0,0.427,13.0,39.0,0.333,17.0,20.0,0.850,...,36.4,54.6,194.0,133.0,DAL,88,0,2024,2024-06-17,True


## Preparing the data for ML

I will now add a column to each row that holds the target value we'd like to predict. In this case, that is whether or not a team won its **next** game. Where there is no next game (a team's final game in the date range we are looking at), we fill in a placeholder value of 2 to avoid having a null target.

I also needed to do some cleaning here. Nulls in the `ft%` and `+/-_max` columns are filled in with fixed values, and any column that still contains nulls after that is removed, since null values could cause the model to fail.

In [998]:
def add_target(group):
    group["target"] = group["won"].shift(-1)
    return group

df = df.groupby("team", group_keys=False).apply(add_target)

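# A team's final game has no next game; mark it with the placeholder class 2.
# Using .loc avoids chained assignment, which may silently fail to update df.
df.loc[df["target"].isna(), "target"] = 2
df["target"] = df["target"].astype(int, errors="ignore")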
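
To make the target construction concrete, here is a minimal sketch on made-up data. The team codes `AAA` and `BBB` and the `toy` frame are hypothetical, and the rows are assumed to be in date order already. It shows how shifting within each team pulls the next game's result onto the current row, with 2 as the placeholder for a team's last game.

```python
import pandas as pd

# Toy data: two hypothetical teams, three games each, already in date order.
toy = pd.DataFrame({
    "team": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "won":  [True, False, False, True, True, False],
})

# Within each team, pull the next game's result up onto the current row.
# Converting to int first keeps the shifted column numeric (NaN where missing).
next_result = toy["won"].astype(int).groupby(toy["team"]).shift(-1)

# A team's last game has no next game, so it gets the placeholder class 2.
toy["target"] = next_result.fillna(2).astype(int)

print(toy)
```

Each team's final row ends up with `target == 2`, while every other row carries the result of that team's following game.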


In [999]:
# Fill nulls in the free-throw percentage and max plus/minus columns
# so that those columns are not dropped below
for col in df.columns:
    if col.startswith("ft%"):
        df[col] = df[col].fillna(0)
    elif col.startswith("+/-_max"):
        df[col] = df[col].fillna(2)

nulls = pd.isnull(df).sum()
nulls = nulls[nulls > 0]

#valid cols are all the cols that are NOT IN the null columns
valid_columns = df.columns[~df.columns.isin(nulls.index)]
df = df[valid_columns].copy()

## Creating the Model

I am using the RidgeClassifier model (a classifier built on ridge regression) together with a time-series split, so that cross-validation never uses results from later games to predict earlier ones.

For now, I will limit the number of selected features to 30 to keep the search manageable. The feature selector works forward: it starts with no features and repeatedly adds the column that most improves the cross-validated score, until it has picked the 30 columns it considers best for predicting future games.

I removed columns that shouldn't be used as features (such as the team name, or whether the team won or lost, since that is tied to the target) and also re-scaled all the quantitative data to the 0-1 range so that every column is on a comparable scale.

In [1001]:
from sklearn.linear_model import RidgeClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler

rr = RidgeClassifier(alpha=1)

split = TimeSeriesSplit(n_splits=3)

sfs = SequentialFeatureSelector(rr, 
                                n_features_to_select=30, 
                                direction="forward",
                                cv=split,
                                n_jobs=1
                               )

removed_columns = ["season", "date", "won", "target", "team", "team_opp"]
selected_columns = df.columns[~df.columns.isin(removed_columns)]

scaler = MinMaxScaler()
df[selected_columns] = scaler.fit_transform(df[selected_columns])
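
As a small illustration of what the time-series split does, the sketch below runs `TimeSeriesSplit` on twelve made-up game indices (the `games` array is hypothetical and assumed to be sorted oldest to newest) and prints which indices land in each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical games, assumed to be sorted from oldest to newest.
games = np.arange(12)

splits = list(TimeSeriesSplit(n_splits=3).split(games))
for fold, (train_idx, test_idx) in enumerate(splits):
    # Every training index is older than every test index in the same fold.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows from fold to fold, and the test indices always come after it, which is why the rows have to be sorted by date before the split is used.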

In [1002]:
df

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,usg%_max_opp,ortg_max_opp,drtg_max_opp,team_opp,total_opp,home_opp,season,date,won,target
0,0.0,0.347826,0.338235,0.366029,0.206897,0.212121,0.395487,0.431818,0.421875,0.704,...,0.278205,0.554502,0.306818,GSW,0.419643,1.0,2016,2015-10-27,False,0
1,0.0,0.391304,0.397059,0.373206,0.241379,0.227273,0.437055,0.363636,0.359375,0.696,...,0.089744,0.232227,0.318182,CLE,0.276786,0.0,2016,2015-10-27,True,1
2,0.0,0.391304,0.529412,0.277512,0.413793,0.378788,0.491686,0.454545,0.406250,0.769,...,0.151282,0.800948,0.500000,ATL,0.267857,1.0,2016,2015-10-27,True,1
3,0.0,0.391304,0.323529,0.435407,0.275862,0.348485,0.351544,0.272727,0.234375,0.800,...,0.020513,0.203791,0.306818,DET,0.375000,0.0,2016,2015-10-27,False,1
4,0.0,0.478261,0.529412,0.377990,0.310345,0.393939,0.356295,0.454545,0.343750,0.909,...,0.216667,0.530806,0.488636,NOP,0.276786,0.0,2016,2015-10-27,True,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23039,0.0,0.413043,0.323529,0.464115,0.586207,0.636364,0.439430,0.295455,0.218750,0.929,...,0.230769,0.549763,0.534091,DAL,0.312500,1.0,2024,2024-06-12,True,0
23040,0.0,0.217391,0.294118,0.224880,0.482759,0.560606,0.404988,0.272727,0.203125,0.923,...,0.241026,0.582938,0.250000,DAL,0.517857,1.0,2024,2024-06-14,False,1
23041,0.0,0.586957,0.455882,0.564593,0.517241,0.500000,0.480998,0.340909,0.343750,0.682,...,0.178205,0.526066,0.693182,BOS,0.178571,0.0,2024,2024-06-14,True,0
23042,0.0,0.413043,0.426471,0.377990,0.448276,0.530303,0.395487,0.386364,0.312500,0.850,...,0.417949,0.497630,0.636364,DAL,0.214286,0.0,2024,2024-06-17,True,2


I will only run the feature selector and classifier on the selected columns.

In [1004]:
sfs.fit(df[selected_columns], df["target"])

Here are the final predictors that the feature selector picked:

In [1006]:
predictors = list(selected_columns[sfs.get_support()])
predictors

['3p%',
 'ft%',
 'blk',
 'pf',
 'orb%',
 'usg%',
 'ft_max',
 'tov_max',
 '+/-_max',
 'orb%_max',
 'drb%_max',
 'stl%_max',
 'tov%_max',
 'usg%_max',
 '3p_opp',
 'ft_opp',
 'fta_opp',
 'orb_opp',
 'stl_opp',
 'drb%_opp',
 'stl%_opp',
 'blk%_opp',
 'usg%_opp',
 'fga_max_opp',
 'ft%_max_opp',
 'stl_max_opp',
 'pts_max_opp',
 'ast%_max_opp',
 'stl%_max_opp',
 'usg%_max_opp']
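
For readers who want to see the selection step in isolation, here is a minimal sketch that runs the same kind of forward selection on synthetic data. The dataset comes from `make_classification` and is only a stand-in for the game data; the choice of 6 candidate features and 2 selected features is an assumption made to keep the run fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the game data: 300 rows, 6 candidate features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

selector = SequentialFeatureSelector(
    RidgeClassifier(alpha=1),
    n_features_to_select=2,
    direction="forward",
    cv=TimeSeriesSplit(n_splits=3),
)
selector.fit(X, y)

# Boolean mask over the 6 columns with exactly two True entries:
# the features that forward selection decided to keep.
mask = selector.get_support()
print(mask)
```

Indexing the column names with this mask, as done with `selected_columns[sfs.get_support()]`, turns it into a readable list of predictors.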

## Making the Predictions

This `backtest` function does the work of actually making the predictions. It first trains on the first 2 seasons and predicts the third. For each later season, it retrains on all earlier seasons and predicts that season, so with the default `start=2` the model ends up being tested on the remaining 7 seasons.

In [1008]:
def backtest(data, model, predictors, start=2, step=1):
    all_predictions = []
    
    seasons = sorted(data["season"].unique())
    
    for i in range(start, len(seasons), step):
        season = seasons[i]
        train = data[data["season"] < season]
        test = data[data["season"] == season]
        
        model.fit(train[predictors], train["target"])
        
        preds = model.predict(test[predictors])
        preds = pd.Series(preds, index=test.index)
        combined = pd.concat([test["target"], preds], axis=1)
        combined.columns = ["actual", "prediction"]
        
        all_predictions.append(combined)
    return pd.concat(all_predictions)
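
The train/test pattern inside `backtest` can be sketched without any model. The loop below uses a hypothetical five-season list and only prints which seasons would be used for training and which for testing at each step.

```python
seasons = [2016, 2017, 2018, 2019, 2020]  # hypothetical subset of seasons
start = 2

folds = []
for i in range(start, len(seasons)):
    test_season = seasons[i]
    # Same rule as backtest(): train on every season strictly before the test season.
    train_seasons = [s for s in seasons if s < test_season]
    folds.append((train_seasons, test_season))
    print(f"train on {train_seasons} -> test on {test_season}")
```

Raising `start` removes the earliest test seasons, so every remaining prediction comes from a model trained on more history.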

Let's check the accuracy of our predictions:

In [1010]:
from sklearn.metrics import accuracy_score

predictions = backtest(df, rr, predictors)
acc = accuracy_score(predictions["actual"], predictions["prediction"])
acc

0.545352365966056

So, starting with the first 2 seasons as training data, the accuracy of the **model's predictions came out to 54.5%**. In the NBA, a team playing at home is more likely to win than a team playing away. In fact, in the years that I've collected data on, the **home-court win rate is about 57%**, so by simply predicting that the home team would win every game, one could do better than the current model.

In [1012]:
df.groupby(["home"]).apply(lambda x: x[x["won"] == 1].shape[0] / x.shape[0], include_groups = False)

home
0.0    0.430047
1.0    0.569953
dtype: float64
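
To show how such a baseline could be scored in the same way as the model, here is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical). It treats "predict a win whenever the team is at home" as a model and measures its accuracy with `accuracy_score`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical rows: one per team per game, with a home flag and the result.
toy = pd.DataFrame({
    "home": [1, 0, 1, 0, 1, 0, 1, 0],
    "won":  [True, False, True, False, False, True, True, False],
})

# Baseline "model": predict a win exactly when the team is at home.
baseline_pred = toy["home"] == 1
baseline_acc = accuracy_score(toy["won"], baseline_pred)
print(baseline_acc)
```

Because every game appears twice (once per team), this accuracy equals the home-team win rate, which is why the 57% figure works as a baseline.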

The initial pass (which yielded the 54.5% accuracy) started with only 2 seasons of training data, so I was curious to see if and how the accuracy would improve as I increased the number of seasons the model is trained on (and, at the same time, decreased the number of seasons it is tested on).

Below, I have a table that shows the accuracy of the model's predictions as I increase the number of seasons I train the model on:

In [1014]:
seasons = 2024-2016
prediction_list = [backtest(df, rr, predictors, start=i) for i in range(2,seasons-1)]
accuracy_list = [accuracy_score(prediction["actual"], prediction["prediction"]) for prediction in prediction_list]
data = {"Training Seasons":list(range(2, seasons-1)), "Accuracy":accuracy_list}
pred_table = pd.DataFrame(data)
pred_table

Unnamed: 0,Training Seasons,Accuracy
0,2,0.545352
1,3,0.54733
2,4,0.54902
3,5,0.551267
4,6,0.551641


It looks like as I increase the number of seasons the model is trained on, the accuracy improves by fractions of a percentage. While improvement is always good, it still doesn't beat the home-court advantage predictions, so I will see what else I can do to improve the model.

## Improving the Model Accuracy
I would like the model to at least be able to beat the 57% home-court advantage baseline (accuracy if I just predicted the home team to win).

A team plays 82 games in the NBA regular season, which spans about 6 months, so a lot can change for a team during that time, and those changes will likely affect whether they win or lose a game. To account for this, I will look at rolling averages to get a sense of how well a team *has been* performing in its last few games, and use that to inform the predictions.

I will make a new dataset that holds each team's averages over its last {roll} games and combine it with the original dataset, giving the classifier new columns to evaluate as predictors.

In [1016]:
def find_team_averages(team, roll):
    # Mean of every column over the group's last `roll` games;
    # the first roll-1 rows of each group come out as NaN
    rolling = team.rolling(roll).mean()
    return rolling

In [1017]:
df10 = df[list(selected_columns) + ["won", "team", "season"]]
df10 = df10.groupby(["team", "season"], group_keys=False).apply(find_team_averages, roll = 10, include_groups=False)

In [1018]:
rolling_cols = [f"{col}_roll" for col in df10.columns]
df10.columns = rolling_cols

df10 = pd.concat([df, df10], axis=1)
df10 = df10.dropna()
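
As a minimal sketch of what the rolling step produces, the example below computes a 3-game rolling mean for one hypothetical team. The `toy` frame, the `pts` column, and the window of 3 are assumptions chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical points scored by one team across five games in one season.
toy = pd.DataFrame({
    "team":   ["AAA"] * 5,
    "season": [2016] * 5,
    "pts":    [100.0, 110.0, 90.0, 120.0, 80.0],
})

# Rolling mean over the last 3 games, computed separately per team and season.
toy["pts_roll"] = (
    toy.groupby(["team", "season"])["pts"]
       .transform(lambda s: s.rolling(3).mean())
)
print(toy)
```

The first two rows have no full 3-game window and come out as NaN, which is the reason `dropna()` discards the first few games of every team's season.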

The next thing I will do to further contextualize the data and help the model improve its accuracy is add information about the upcoming game to each team's most recent game. This covers whether the team will be home or away in its next game (the game the model is trying to predict), who its next opponent is, and that opponent's rolling averages.

In [1114]:
def shift_col(team, col_name):
    next_col = team[col_name].shift(-1)
    return next_col

def add_col(dframe, col_name):
    return dframe.groupby("team", group_keys=False).apply(lambda x: shift_col(x, col_name))

df10["home_next"] = add_col(df10, "home")
df10["team_opp_next"] = add_col(df10, "team_opp")
df10["date_next"] = add_col(df10, "date")
df10 = df10.copy()



In [1021]:
# pull in the rolling columns of each team's next opponent

full = df10.merge(
    df10[rolling_cols + ["team_opp_next", "date_next", "team"]], 
    left_on=["team", "date_next"], 
    right_on=["team_opp_next", "date_next"])
full

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,blk%_max_opp_roll_y,tov%_max_opp_roll_y,usg%_max_opp_roll_y,ortg_max_opp_roll_y,drtg_max_opp_roll_y,total_opp_roll_y,home_opp_roll_y,won_roll_y,team_opp_next_y,team_y
0,0.00,0.326087,0.250000,0.413876,0.310345,0.257576,0.509501,0.522727,0.421875,0.852,...,0.1032,0.437212,0.126026,0.404739,0.394318,0.398214,0.2,0.3,TOR,SAC
1,0.00,0.456522,0.500000,0.375598,0.379310,0.348485,0.483373,0.454545,0.406250,0.769,...,0.1072,0.380294,0.274359,0.270616,0.462500,0.286607,0.6,0.7,SAC,TOR
2,0.25,0.521739,0.544118,0.416268,0.413793,0.454545,0.419240,0.204545,0.156250,0.900,...,0.1113,0.467505,0.277436,0.352607,0.465909,0.293750,0.7,0.6,GSW,TOR
3,0.00,0.391304,0.279412,0.476077,0.241379,0.227273,0.437055,0.454545,0.406250,0.769,...,0.0751,0.484906,0.202308,0.375355,0.512500,0.320536,0.4,1.0,LAC,GSW
4,0.00,0.521739,0.426471,0.511962,0.275862,0.363636,0.339667,0.363636,0.328125,0.762,...,0.1826,0.427568,0.259231,0.514218,0.323864,0.362500,0.5,0.0,DAL,PHI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20481,0.00,0.413043,0.352941,0.437799,0.344828,0.530303,0.304038,0.431818,0.312500,0.950,...,0.0801,0.223375,0.241410,0.413744,0.534091,0.364286,0.6,0.6,BOS,DAL
20482,0.00,0.413043,0.382353,0.413876,0.310345,0.318182,0.427553,0.318182,0.250000,0.875,...,0.1007,0.506289,0.152821,0.531754,0.580682,0.348214,0.5,1.0,DAL,BOS
20483,0.00,0.413043,0.323529,0.464115,0.586207,0.636364,0.439430,0.295455,0.218750,0.929,...,0.0850,0.240881,0.241538,0.434123,0.535227,0.369643,0.6,0.6,BOS,DAL
20484,0.00,0.217391,0.294118,0.224880,0.482759,0.560606,0.404988,0.272727,0.203125,0.923,...,0.0984,0.258386,0.159359,0.461611,0.545455,0.362500,0.5,0.6,BOS,DAL


The above dataset is the final dataset that we will be running the model on.
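
The self-merge that attaches the next opponent's rolling averages can be sketched on two made-up rows. The team codes, the `pts_roll` column, and the date are hypothetical; the key columns follow the same pattern as the merge used to build `full`.

```python
import pandas as pd

# Hypothetical rows: each team's latest game, plus who and when it plays next.
toy = pd.DataFrame({
    "team":          ["AAA", "BBB"],
    "pts_roll":      [105.0, 98.0],
    "team_opp_next": ["BBB", "AAA"],
    "date_next":     ["2016-01-02", "2016-01-02"],
})

# Match each row with the row whose next opponent is this team on the same date.
merged = toy.merge(
    toy[["pts_roll", "team_opp_next", "date_next", "team"]],
    left_on=["team", "date_next"],
    right_on=["team_opp_next", "date_next"],
)

# Overlapping names get _x (this team) and _y (next opponent) suffixes.
print(merged[["team_x", "pts_roll_x", "team_y", "pts_roll_y"]])
```

This is where the `_x` and `_y` suffixes in the column names come from: `_x` columns describe the team itself and `_y` columns describe the opponent it faces next.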

Next, we will just make sure to remove any non-quantitative data columns from the data set so that we can re-run the feature selector

In [1023]:
# add non-numerical columns to removed_columns[]
removed_columns = list(full.columns[full.dtypes == "object"]) + removed_columns
removed_columns

['team_x',
 'team_opp',
 'date',
 'team_opp_next_x',
 'date_next',
 'team_opp_next_y',
 'team_y',
 'season',
 'date',
 'won',
 'target',
 'team',
 'team_opp']

In [1024]:
# update selection of columns to check for predictors
selected_columns = full.columns[~full.columns.isin(removed_columns)]
sfs.fit(full[selected_columns], full["target"])

In [1025]:
predictors = list(selected_columns[sfs.get_support()])
predictors

['blk',
 'tov',
 'efg%',
 'usg%',
 'fg%_max',
 'stl_opp',
 'usg%_opp',
 'blk_max_opp',
 'ftr_max_opp',
 'usg%_roll_x',
 'gmsc_max_roll_x',
 '+/-_max_roll_x',
 'blk_opp_roll_x',
 'tov_opp_roll_x',
 'usg%_opp_roll_x',
 'ft%_max_opp_roll_x',
 'ftr_max_opp_roll_x',
 'ast%_max_opp_roll_x',
 'ortg_max_opp_roll_x',
 'home_next',
 'stl_roll_y',
 'usg%_roll_y',
 'fg_max_roll_y',
 'tov_max_roll_y',
 'pts_max_roll_y',
 '+/-_max_roll_y',
 'drb%_max_roll_y',
 'trb%_max_roll_y',
 'usg%_opp_roll_y',
 'ast%_max_opp_roll_y']

After running the feature selector on the new dataset, we can see that many of the strong predictors are in fact rolling averages whether they are those of the team or those of their opponent.

Next I'll re-run the predictor function to see if using this new set of predictors will improve the accuracy of the model:

In [1048]:
predictions = backtest(full, rr, predictors)
roll10_acc = accuracy_score(predictions["actual"], predictions["prediction"])
roll10_acc

0.6348915174900374

It has! There is a significant improvement (about 9 percentage points, from 54.5% to 63.5%) between the first version of the model and this latest one, which takes into account how well a team and its opponent have been performing. The model has now beaten the 57% baseline accuracy of picking the home team to win every time.

## Further Improvements
**Model Selection:**
The lowest-hanging fruit for improving the model's accuracy would be selecting a different and more powerful model such as Random Forest or XGBoost. Either of those options is able to identify non-linear relationships, which could ultimately change which predictors are selected and how those predictors are weighted.

**Feature Selection Count:**
For this model, I had the feature selector pick only 30 features out of the ~400 that I ended up having in the final full dataset. It is possible that changing that number, whether increasing or decreasing it, would result in better accuracy, but it would take some iterations to find the sweet spot.

**Rolling Averages:**
Many of the final predictors that the model used were rolling averages. I arbitrarily used the last 10 games to calculate those rolling averages, but I could play around with that number as well to see if calculating the rolling averages over more or fewer games could help improve accuracy.

I will actually change the rolling average game count to 5 and 15 to see if either one makes a significant difference in the model's accuracy.

In [1176]:
removed_columns = ["season", "date", "won", "target", "team", "team_opp"]
selected_columns = df.columns[~df.columns.isin(removed_columns)]

df5 = df[list(selected_columns) + ["won", "team", "season"]]
df5 = df5.groupby(["team", "season"], group_keys=False).apply(lambda x: find_team_averages(x, 5), include_groups=False)

In [1177]:
rolling_cols = [f"{col}_roll" for col in df5.columns]
df5.columns = rolling_cols

df5 = pd.concat([df, df5], axis=1)
df5 = df5.dropna()
df5["home_next"] = add_col(df5, "home")
df5["team_opp_next"] = add_col(df5, "team_opp")
df5["date_next"] = add_col(df5, "date")

full5 = df5.merge(
    df5[rolling_cols + ["team_opp_next", "date_next", "team"]], 
    left_on=["team", "date_next"], 
    right_on=["team_opp_next", "date_next"])



In [1170]:
removed_columns = list(full5.columns[full5.dtypes == "object"]) + removed_columns
selected_columns = full5.columns[~full5.columns.isin(removed_columns)]

sfs.fit(full5[selected_columns], full5["target"])

predictors = list(selected_columns[sfs.get_support()])
predictions = backtest(full5, rr, predictors)
roll5_acc = accuracy_score(predictions["actual"], predictions["prediction"])

In [1171]:
removed_columns = ["season", "date", "won", "target", "team", "team_opp"]
selected_columns = df.columns[~df.columns.isin(removed_columns)]

df15 = df[list(selected_columns) + ["won", "team", "season"]]
# df15 = df15.groupby(["team", "season"], group_keys=False).apply(lambda x: find_team_averages(x, 15), include_groups=False)
df15 = df15.groupby(["team", "season"], group_keys=False).apply(find_team_averages, roll = 15, include_groups=False)
rolling_cols = [f"{col}_roll" for col in df15.columns]
df15.columns = rolling_cols

df15 = pd.concat([df, df15], axis=1)
df15 = df15.dropna()
df15["home_next"] = add_col(df15, "home")
df15["team_opp_next"] = add_col(df15, "team_opp")
df15["date_next"] = add_col(df15, "date")

full15 = df15.merge(
    df15[rolling_cols + ["team_opp_next", "date_next", "team"]], 
    left_on=["team", "date_next"], 
    right_on=["team_opp_next", "date_next"])

removed_columns = list(full15.columns[full15.dtypes == "object"]) + removed_columns
selected_columns = full15.columns[~full15.columns.isin(removed_columns)]

sfs.fit(full15[selected_columns], full15["target"])
predictors = list(selected_columns[sfs.get_support()])
predictions = backtest(full15, rr, predictors)
roll15_acc = accuracy_score(predictions["actual"], predictions["prediction"])



In [1182]:
accuracies = {"Rolling Games":[5,10,15], "Accuracy":[roll5_acc, roll10_acc, roll15_acc]}
acc_df = pd.DataFrame(accuracies)
acc_df

Unnamed: 0,Rolling Games,Accuracy
0,5,0.617171
1,10,0.634892
2,15,0.634581


Above, you can see the accuracy for each number of games used to calculate the rolling averages. With fewer games (5), we see a noteworthy decrease of just under 2 percentage points in accuracy; however, there is almost no difference in accuracy between 10 and 15 games.

This is clearly still a lever one could pull to try to further improve the accuracy of the model. It is possible that some number of games within this range of 5-15 games could yield better accuracy than the ones I've tested.
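
One way to try more window sizes without copying the cells above is to wrap the rolling step in a helper and loop over candidate values. The sketch below is only an outline under stated assumptions: `add_rolling` is a hypothetical helper, the data is synthetic, and it reports how many rows survive `dropna()` for each window rather than model accuracy.

```python
import numpy as np
import pandas as pd

def add_rolling(frame, cols, roll):
    """Append <col>_roll columns holding per-team, per-season rolling means."""
    rolled = (
        frame.groupby(["team", "season"])[cols]
             .transform(lambda s: s.rolling(roll).mean())
    )
    rolled.columns = [f"{c}_roll" for c in cols]
    # Rows without a full window are NaN and get dropped, as in the notebook.
    return pd.concat([frame, rolled], axis=1).dropna()

# Synthetic stand-in: 2 hypothetical teams x 1 season x 20 games each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "team": ["AAA"] * 20 + ["BBB"] * 20,
    "season": 2016,
    "pts": rng.normal(105, 10, size=40),
})

rows_kept = {roll: len(add_rolling(toy, ["pts"], roll)) for roll in (5, 8, 10, 12, 15)}
print(rows_kept)
```

The row counts highlight a trade-off to keep in mind when comparing windows: a longer window smooths the averages more, but it also throws away more early-season games for every team.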