## Predicting Regular-Season NBA Game Outcomes + Scores with Machine Learning

#### Goal
My final objective is to predict the outcome and score of any regular-season NBA game for any given team by training a model that takes parameters such as 
 - team performance
 - player stats
 - related stats/variations of the above

Ideally, the model will predict game outcomes with 65%+ accuracy after being trained on a small sample of significant game data. 

In [1]:
# Import all necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score
import joblib
import summarytools

First, I will load all the NBA data I gathered into dataframes, clean it, and do a preliminary EDA, including summaries/descriptive analysis. Later on, I will look for more specific/relevant features that can be used for the baseline model.

In [2]:
atdf = pd.read_csv('data/all_nba_team_stats.csv')
atdf.head()

Unnamed: 0,No,Team,G,Min,Pts,Reb,Ast,Stl,Blk,To,...,Oreb,Fgm-a,Pct,3gm-a,Pct.1,Ftm-a,Pct.2,Eff,Deff,Year
0,1,Chicago,103,48.4,96.0,44.1,23.1,8.6,4.3,13.0,...,14.9,36.7-81.7,0.449,3.9-12.0,0.323,18.7-25.2,0.741,111.6,17.5,1997-1998
1,2,Utah,102,48.3,98.6,40.8,24.7,7.6,4.8,14.7,...,11.3,35.9-74.3,0.483,3.1-8.4,0.368,23.8-30.9,0.768,116.3,17.5,1997-1998
2,3,Phoenix,86,48.6,99.3,41.9,25.6,9.2,5.3,14.4,...,12.1,38.2-82.0,0.466,5.2-14.7,0.355,17.7-23.6,0.747,117.1,13.6,1997-1998
3,4,L.A.Lakers,95,48.3,104.8,42.9,24.3,8.7,6.8,14.7,...,13.2,38.0-79.1,0.48,6.1-17.3,0.35,22.8-33.7,0.675,120.8,13.2,1997-1998
4,5,San Antonio,91,48.4,92.5,44.1,21.9,6.2,6.9,15.3,...,11.9,35.1-75.1,0.468,3.7-10.8,0.344,18.5-26.8,0.688,108.0,13.1,1997-1998


In [3]:
atdf.tail()

Unnamed: 0,No,Team,G,Min,Pts,Reb,Ast,Stl,Blk,To,...,Oreb,Fgm-a,Pct,3gm-a,Pct.1,Ftm-a,Pct.2,Eff,Deff,Year
720,25,Sacramento,82,48.3,110.3,42.9,23.7,7.2,4.5,13.5,...,9.6,40.5-88.1,0.46,11.4-33.2,0.344,17.9-23.3,0.768,122.1,-13.4,2021-2022
721,26,Orlando,82,48.2,104.2,44.3,23.7,6.8,4.5,13.8,...,9.1,38.3-88.3,0.434,12.2-36.8,0.331,15.5-19.7,0.787,115.5,-16.0,2021-2022
722,27,Detroit,82,48.2,104.8,43.0,23.5,7.7,4.8,13.4,...,11.0,38.2-88.6,0.43,11.3-34.6,0.326,17.2-22.0,0.782,115.1,-16.7,2021-2022
723,28,Portland,82,48.1,106.2,42.9,22.9,8.0,4.5,13.7,...,10.4,38.5-87.1,0.443,12.7-36.8,0.346,16.4-21.6,0.76,117.1,-19.0,2021-2022
724,29,Oklahoma City,82,48.3,103.7,45.6,22.2,7.6,4.6,13.3,...,10.4,38.3-89.1,0.43,12.1-37.4,0.323,15.0-19.9,0.756,114.8,-19.2,2021-2022


The data appears to be the team stats of all NBA teams from the 1997-1998 season through the 2021-2022 season.
Now I will check:
1. Shape of the data
2. Data type or formatting issues
3. Duplicate or redundant data
4. Null values
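
The four checks above can be sketched on a small stand-in dataframe. This is only an illustration: `sample` is a hypothetical toy frame (its second Utah row is duplicated on purpose), not the real `atdf`, and the variable names are my own.

```python
import pandas as pd

# Toy stand-in for the team-stats frame; the last row is a deliberate duplicate.
sample = pd.DataFrame({
    "Team": ["Chicago", "Utah", "Utah"],
    "Pts": [96.0, 98.6, 98.6],
    "Year": ["1997-1998", "1997-1998", "1997-1998"],
})

n_rows, n_cols = sample.shape                   # 1. shape of the data
dtypes = sample.dtypes                          # 2. one dtype per column
n_duplicates = int(sample.duplicated().sum())   # 3. fully duplicated rows
nulls_per_column = sample.isna().sum()          # 4. null count per column

print(f"{n_rows} rows, {n_cols} columns")
print(dtypes)
print(f"{n_duplicates} duplicated row(s)")
print(nulls_per_column)
```

On the real data, the same four calls would be run against `atdf` instead of `sample`.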

In [5]:
atdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   No      725 non-null    int64  
 1   Team    725 non-null    object 
 2   G       725 non-null    int64  
 3   Min     725 non-null    float64
 4   Pts     725 non-null    float64
 5   Reb     725 non-null    float64
 6   Ast     725 non-null    float64
 7   Stl     725 non-null    float64
 8   Blk     725 non-null    float64
 9   To      725 non-null    float64
 10  Pf      725 non-null    float64
 11  Dreb    725 non-null    float64
 12  Oreb    725 non-null    float64
 13  Fgm-a   725 non-null    object 
 14  Pct     725 non-null    float64
 15  3gm-a   725 non-null    object 
 16  Pct.1   725 non-null    float64
 17  Ftm-a   725 non-null    object 
 18  Pct.2   725 non-null    float64
 19  Eff     725 non-null    float64
 20  Deff    725 non-null    float64
 21  Year    725 non-null    object 
dtypes: float64(15), int64(2), object(5)

In [6]:
print(f"There are {atdf.shape[0]} rows and {atdf.shape[1]} columns. ")

There are 725 rows and 22 columns. 


There is no missing data: every column has 725 non-null values, and the dataframe has 725 rows.

Most columns are float types, but there are also several object types. Drawing on my experience as a basketball player, I will work out what each column means and which data type suits it best.

Here are the full names for each of the abbreviations commonly used in basketball statistics:

- G: Games Played
- Min: Minutes Played
- Pts: Points
- Reb: Rebounds
- Ast: Assists
- Stl: Steals
- Blk: Blocks
- To: Turnovers
- Pf: Personal Fouls
- Dreb: Defensive Rebounds
- Oreb: Offensive Rebounds
- FG-Pct: Field Goal Percentage
- 3P-PCT: Three-Point Percentage
- FT-Pct: Free Throw Percentage
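- Eff: Efficiency = Pts + Reb + Ast + Stl + Blk - missed field goals - missed free throws - To (this reproduces the table values, e.g. 111.6 for Chicago in 1997-1998)
- Deff: Efficiency Differential, which appears to be the team's Eff minus its opponents' Eff (positive for strong teams, negative for weak ones)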
- Year: Year (Season Year)
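
One way to check what the `Eff` column actually measures is to recompute it from the other columns. The sketch below assumes the common box-score efficiency formula (points + rebounds + assists + steals + blocks, minus missed field goals, missed free throws and turnovers) and compares the result with the stored `Eff` for the first two rows of the table shown earlier. The dictionary keys and the function name `box_score_efficiency` are hypothetical names chosen for this example.

```python
# Per-game values copied from the Chicago and Utah rows (1997-1998) shown earlier.
rows = {
    "Chicago": dict(pts=96.0, reb=44.1, ast=23.1, stl=8.6, blk=4.3, to=13.0,
                    fgm=36.7, fga=81.7, ftm=18.7, fta=25.2, eff=111.6),
    "Utah": dict(pts=98.6, reb=40.8, ast=24.7, stl=7.6, blk=4.8, to=14.7,
                 fgm=35.9, fga=74.3, ftm=23.8, fta=30.9, eff=116.3),
}


def box_score_efficiency(r):
    """Assumed formula: positive box-score stats minus misses and turnovers."""
    missed_fg = r["fga"] - r["fgm"]
    missed_ft = r["fta"] - r["ftm"]
    return (r["pts"] + r["reb"] + r["ast"] + r["stl"] + r["blk"]
            - missed_fg - missed_ft - r["to"])


for team, r in rows.items():
    # Print the recomputed value next to the stored Eff for comparison.
    print(team, round(box_score_efficiency(r), 1), r["eff"])
```

Because the inputs are rounded per-game averages, other rows may differ from the stored value by about 0.1.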

`Fgm-a`, `3gm-a` and `Ftm-a` are object columns (made-attempted pairs) whose information is already summarized by `Pct`, `Pct.1` and `Pct.2`. So I will drop the object columns and rename the latter to `FG-Pct`, `3P-PCT` and `FT-Pct`.
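
To illustrate the redundancy, here is a minimal sketch that splits the `Fgm-a` strings into made and attempted field goals and recomputes the percentage. The frame `fg` holds only the first three rows shown earlier; the column names `FGM` and `FGA` and the tolerance of 0.005 are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Field-goal columns from the first three rows of the table shown earlier.
fg = pd.DataFrame({
    "Fgm-a": ["36.7-81.7", "35.9-74.3", "38.2-82.0"],
    "Pct": [0.449, 0.483, 0.466],
})

# Split each "made-attempted" string into two float columns.
made_attempted = fg["Fgm-a"].str.split("-", expand=True).astype(float)
made_attempted.columns = ["FGM", "FGA"]

# Recompute the percentage and compare it with the stored one.
recomputed = made_attempted["FGM"] / made_attempted["FGA"]
matches = np.isclose(recomputed, fg["Pct"], atol=0.005)

print(recomputed.round(3).tolist())
print(matches.all())
```

The comparison uses a tolerance because the made/attempted values are rounded per-game averages. Note that the split columns also carry shot volume, which the percentage alone does not, so they could be kept instead of dropped if volume turns out to matter for the model.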

In [7]:
atdf.drop(['Fgm-a', '3gm-a', 'Ftm-a'], axis=1, inplace=True)

In [8]:
atdf.rename(columns={'Pct': 'FG-Pct', 'Pct.1': '3P-PCT', 'Pct.2': 'FT-Pct'}, inplace=True)

In [9]:
atdf.head()

Unnamed: 0,No,Team,G,Min,Pts,Reb,Ast,Stl,Blk,To,Pf,Dreb,Oreb,FG-Pct,3P-PCT,FT-Pct,Eff,Deff,Year
0,1,Chicago,103,48.4,96.0,44.1,23.1,8.6,4.3,13.0,21.1,29.2,14.9,0.449,0.323,0.741,111.6,17.5,1997-1998
1,2,Utah,102,48.3,98.6,40.8,24.7,7.6,4.8,14.7,24.3,29.5,11.3,0.483,0.368,0.768,116.3,17.5,1997-1998
2,3,Phoenix,86,48.6,99.3,41.9,25.6,9.2,5.3,14.4,21.7,29.8,12.1,0.466,0.355,0.747,117.1,13.6,1997-1998
3,4,L.A.Lakers,95,48.3,104.8,42.9,24.3,8.7,6.8,14.7,22.9,29.7,13.2,0.48,0.35,0.675,120.8,13.2,1997-1998
4,5,San Antonio,91,48.4,92.5,44.1,21.9,6.2,6.9,15.3,21.2,32.2,11.9,0.468,0.344,0.688,108.0,13.1,1997-1998


Each NBA season spans two calendar years, so the `Year` column can be changed from object to int data type by keeping only the starting year.
That way, a year of 1997 means the 1997 - 1998 NBA season.

First, I will extract the first number from the `Year` column and use it to replace the original value.

In [15]:
Years = atdf['Year'].str.split('-', expand=True)
atdf['Year'] = Years[0]

Now I convert the `Year` column from object type to int data type.

In [16]:
atdf['Year'] = pd.to_datetime(atdf['Year'], format='%Y').dt.year
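
As an alternative sketch, the split and the type conversion can be done in one step without going through datetimes. The series `seasons` is a small stand-in for the `Year` column, and `start_year` is a hypothetical name for the result.

```python
import pandas as pd

# Stand-in for the original 'Year' column (season strings).
seasons = pd.Series(["1997-1998", "2021-2022"], name="Year")

# Keep the part before the dash and cast it straight to a 64-bit integer.
start_year = seasons.str.split("-").str[0].astype("int64")

print(start_year.tolist())
print(start_year.dtype)
```

Casting explicitly to `int64` keeps the dtype predictable; depending on the pandas version, `.dt.year` may return a 32-bit integer instead.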

In [17]:
atdf.head()

Unnamed: 0,No,Team,G,Min,Pts,Reb,Ast,Stl,Blk,To,Pf,Dreb,Oreb,FG-Pct,3P-PCT,FT-Pct,Eff,Deff,Year
0,1,Chicago,103,48.4,96.0,44.1,23.1,8.6,4.3,13.0,21.1,29.2,14.9,0.449,0.323,0.741,111.6,17.5,1997
1,2,Utah,102,48.3,98.6,40.8,24.7,7.6,4.8,14.7,24.3,29.5,11.3,0.483,0.368,0.768,116.3,17.5,1997
2,3,Phoenix,86,48.6,99.3,41.9,25.6,9.2,5.3,14.4,21.7,29.8,12.1,0.466,0.355,0.747,117.1,13.6,1997
3,4,L.A.Lakers,95,48.3,104.8,42.9,24.3,8.7,6.8,14.7,22.9,29.7,13.2,0.48,0.35,0.675,120.8,13.2,1997
4,5,San Antonio,91,48.4,92.5,44.1,21.9,6.2,6.9,15.3,21.2,32.2,11.9,0.468,0.344,0.688,108.0,13.1,1997


In [18]:
atdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   No      725 non-null    int64  
 1   Team    725 non-null    object 
 2   G       725 non-null    int64  
 3   Min     725 non-null    float64
 4   Pts     725 non-null    float64
 5   Reb     725 non-null    float64
 6   Ast     725 non-null    float64
 7   Stl     725 non-null    float64
 8   Blk     725 non-null    float64
 9   To      725 non-null    float64
 10  Pf      725 non-null    float64
 11  Dreb    725 non-null    float64
 12  Oreb    725 non-null    float64
 13  FG-Pct  725 non-null    float64
 14  3P-PCT  725 non-null    float64
 15  FT-Pct  725 non-null    float64
 16  Eff     725 non-null    float64
 17  Deff    725 non-null    float64
 18  Year    725 non-null    int64  
dtypes: float64(15), int64(3), object(1)
memory usage: 107.7+ KB


The data is now clean, so I export it.

In [21]:
atdf.to_csv('all_nba_team_stats_clean.csv', index=False)

I then repeated the same EDA process for the other NBA data .CSV files:

- 18_19_nba_player_stats.csv
- 18_19_nba_team_stats.csv
- all_nba_game_score.csv
- all_nba_game_score_quarter.csv
- all_nba_game_stats.csv
- all_nba_player_stats.csv
- all_nba_team_id.csv
- all_nba_team_stats.csv
- play_by_play.csv

For the time being (as of sprint 1), I will omit the EDA process for the other .CSV files.