## NBA Game Results Prediction
Note: all data for this project were generated and/or modified from https://basketball-reference.com <br>
The source data were uploaded to my personal website and are also available on this site. <br>
This prediction was produced by Jiabao Zheng, with thanks to https://lanqiao.cn for the inspiration. <br>
The data cleaning notebook can be found <a href="data_cleaning.html" target="_blank">here</a>.

In this project, we use Python and logistic regression to predict NBA game results in the 2019-2020 season based on team statistics from the previous season. 
![title](nba_profile.jpg)

We selected the following three tables from the 2018-2019 season summary tables:

*   **Team Per Game Stats**:

| Name | Description |
| :--- | :--- |
| Rk | Rank | 
| G | Games |
| MP | Minutes Played | 
| FG | Field Goals | 
| FGA | Field Goal Attempts | 
| FG% | Field Goal Percentage | 
| 3P | 3-Point Field Goals | 
| 3PA | 3-Point Field Goal Attempts | 
| 3P% | 3-Point Field Goal Percentage | 
| 2P | 2-Point Field Goals | 
| 2PA | 2-Point Field Goal Attempts | 
| 2P% | 2-Point Field Goal Percentage | 
| FT | Free Throws | 
| FTA | Free Throw Attempts |
| FT% | Free Throw Percentage | 
| ORB | Offensive Rebounds | 
| DRB | Defensive Rebounds | 
| TRB | Total Rebounds | 
| AST | Assists | 
| STL | Steals | 
| BLK  |  Blocks |
| TOV  |  Turnovers | 
| PF  |  Personal Fouls | 
| PTS  |  Points |

*   **Opponent Per Game Stats**: This table has the same columns as the one above, recorded for each team's opponents.

*   **Miscellaneous Stats**:

| Name | Description |
| :--- | :--- |
| Rk (Rank) | Rank |
| Age | Average age of players |
| W (Wins) | Number of wins |
| L (Losses) | Number of losses |
| PW (Pythagorean wins) | Expected number of wins based on points scored and points allowed |
| PL (Pythagorean losses) | Expected number of losses based on points scored and points allowed |
| MOV (Margin of Victory) | Average point differential per game |
| SOS (Strength of Schedule) | Rating of the difficulty of a team's schedule, in points above or below average; 0 is average, and the value can be positive or negative |
| SRS (Simple Rating System) | Team rating that combines average point differential and strength of schedule; 0 is average |
| ORtg (Offensive Rating) | Points scored per 100 possessions |
| DRtg (Defensive Rating) | Points allowed per 100 possessions |
| Pace (Pace Factor) | Estimated number of possessions per 48 minutes |
| FTr (Free Throw Attempt Rate) | Free throw attempts per field goal attempt |
| 3PAr (3-Point Attempt Rate) | Share of field goal attempts taken from 3-point range |
| TS% (True Shooting Percentage) | Shooting efficiency that accounts for 2-point field goals, 3-point field goals, and free throws |
| eFG% (Effective Field Goal Percentage) | Field goal percentage adjusted for a 3-pointer being worth one more point than a 2-pointer |
| TOV% (Turnover Percentage) | Estimated turnovers per 100 plays |
| ORB% (Offensive Rebound Percentage) | Estimated percentage of available offensive rebounds the team grabbed |
| FT/FGA | Free throws made per field goal attempt |
| eFG% (Opponent Effective Field Goal Percentage) | Opponent effective field goal percentage |
| TOV% (Opponent Turnover Percentage) | Opponent turnovers per 100 plays |
| DRB% (Defensive Rebound Percentage) | Estimated percentage of available defensive rebounds the team grabbed |
| FT/FGA (Opponent Free Throws Per Field Goal Attempt) | Opponent free throws made per field goal attempt |
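
Several of the derived columns follow directly from the per-game columns. As a rough sketch, the snippet below computes eFG%, TS%, and Pythagorean wins with the formulas given in the Basketball-Reference glossary. The function names and the sample numbers are made up for illustration, and the exponent of 14 is the value that site uses for the NBA.

```python
def effective_fg_pct(fg, three_p, fga):
    # a made 3-pointer counts as 1.5 made field goals
    return (fg + 0.5 * three_p) / fga


def true_shooting_pct(pts, fga, fta):
    # 0.44 * FTA approximates the possessions that end in free throws
    return pts / (2 * (fga + 0.44 * fta))


def pythagorean_wins(games, pts_for, pts_against, exponent=14):
    # expected wins from points scored and points allowed
    share = pts_for ** exponent / (pts_for ** exponent + pts_against ** exponent)
    return games * share


# hypothetical per-game numbers, not taken from the data set
efg = effective_fg_pct(fg=40.0, three_p=10.0, fga=90.0)
ts = true_shooting_pct(pts=110.0, fga=90.0, fta=25.0)
pw_even = pythagorean_wins(games=82, pts_for=110.0, pts_against=110.0)
pw_better = pythagorean_wins(games=82, pts_for=112.0, pts_against=108.0)

print(efg, ts, pw_even, pw_better)
```

A team that scores exactly as much as it allows is expected to win half of its games, and even a small scoring edit moves the expectation noticeably because of the large exponent.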

In [1]:
'''
# get data
!wget https://student.cs.uwaterloo.ca/~j243zhen/project/data.zip

# install unzip
!apt-get install unzip

# unzip data.zip and remove it
!unzip data.zip 
!rm -r data.zip
'''


In [2]:
# Import libraries
import pandas as pd
import math
import csv
import random
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

The logic for calculating NBA Elo ratings is described at https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/

Set up the variables needed for training the regression model:

In [3]:
# if a team has no Elo rating yet, assign base_elo to it
base_elo = 1600
team_elos = {} 
team_stats = {}
X = []
y = []
# the directory to store data
folder = 'data'

In [4]:
# initialize the data from each team's Miscellaneous, Opponent, and Team stats files
def initialize_data(Mstat, Ostat, Tstat):
    new_Mstat = Mstat.drop(['Rk', 'Arena'], axis=1)
    new_Ostat = Ostat.drop(['Rk', 'G', 'MP'], axis=1)
    new_Tstat = Tstat.drop(['Rk', 'G', 'MP'], axis=1)

    team_stats1 = pd.merge(new_Mstat, new_Ostat, how='left', on='Team')
    team_stats1 = pd.merge(team_stats1, new_Tstat, how='left', on='Team')
    return team_stats1.set_index('Team', inplace=False, drop=True)
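
To make the merge step concrete, here is a small sketch on toy frames. The team names, columns, and values are invented for illustration. It shows one detail worth knowing: when the opponent and team tables share a column name, `pd.merge` keeps both and appends the default suffixes `_x` (left frame) and `_y` (right frame).

```python
import pandas as pd

# toy stand-ins for the three tables (hypothetical values)
misc = pd.DataFrame({'Team': ['A', 'B'], 'Rk': [1, 2], 'SRS': [2.5, -1.0]})
opp = pd.DataFrame({'Team': ['A', 'B'], 'Rk': [2, 1], 'PTS': [105.0, 110.0]})
team = pd.DataFrame({'Team': ['B', 'A'], 'Rk': [1, 2], 'PTS': [108.0, 112.0]})

# same pattern as initialize_data: drop rank columns, then left-join on 'Team'
merged = pd.merge(misc.drop(['Rk'], axis=1), opp.drop(['Rk'], axis=1),
                  how='left', on='Team')
merged = pd.merge(merged, team.drop(['Rk'], axis=1), how='left', on='Team')
merged = merged.set_index('Team')

print(merged.columns.tolist())
print(merged.loc['A'].tolist())
```

Here `PTS_x` holds the opponent value and `PTS_y` the team value, and rows are matched by team name regardless of their order in each file.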

In [5]:
def get_elo(team):
    '''
    Return the Elo rating of a team. If the team has no rating yet,
    assign it the initial base_elo value.
    '''
    try:
        return team_elos[team]
    except KeyError:
        # the team has no Elo rating yet, so assign base_elo
        team_elos[team] = base_elo
        return team_elos[team]

In [6]:
# calculate the elo value of each team
def calc_elo(win_team, lose_team):
    winner_rank = get_elo(win_team)
    loser_rank = get_elo(lose_team)

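    rank_diff = winner_rank - loser_rank
    exp = (rank_diff  * -1) / 400
    # expected score (win probability) of the eventual winner
    odds = 1 / (1 + math.pow(10, exp))
    # choose the K-factor from the winner's rating
    if winner_rank < 2100:
        k = 32
    elif winner_rank < 2400:
        k = 24
    else:
        k = 16
    
    # update ratings; note that the loser is charged k * odds rather than the
    # textbook k * (1 - odds), so the two changes only cancel out when odds is 0.5
    new_winner_rank = round(winner_rank + (k * (1 - odds)))      
    new_loser_rank = round(loser_rank + (k * (0 - odds)))
    return new_winner_rank, new_loser_rank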
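
For comparison, here is a minimal sketch of the textbook Elo update, in which the loser's expected score is one minus the winner's, so the points gained by one side equal the points lost by the other. The function name `standard_elo_update` and the fixed `k=32` are assumptions for illustration. This is not the function that produced the results in this notebook.

```python
import math


def standard_elo_update(winner_rank, loser_rank, k=32):
    # expected scores of both sides; they always sum to 1
    expected_winner = 1 / (1 + math.pow(10, (loser_rank - winner_rank) / 400))
    expected_loser = 1 - expected_winner

    # actual score is 1 for the winner and 0 for the loser
    new_winner_rank = winner_rank + k * (1 - expected_winner)
    new_loser_rank = loser_rank + k * (0 - expected_loser)
    return new_winner_rank, new_loser_rank


even = standard_elo_update(1600, 1600)    # equally rated teams
upset = standard_elo_update(1500, 1700)   # the lower-rated team wins

print(even)
print(upset)
```

With equally rated teams the winner gains 16 points and the loser drops 16. An upset moves more points, but the total across both teams stays the same.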

In [7]:
'''
Using the initial team statistics and each team's running Elo rating, 
build a data set with one sample per game of the 2018-2019 regular season and playoffs.
'''
def build_dataSet(all_data):
    print("Building data set..")
    X = []
    # keep the labels local so that repeated calls do not accumulate in the global y
    y = []
    skip = 0
    for index, row in all_data.iterrows():

        Wteam = row['WTeam']
        Lteam = row['LTeam']

        # initial elo value of each team
        team1_elo = get_elo(Wteam)
        team2_elo = get_elo(Lteam)
        
        # add 100 to the home team's Elo, since the home team usually has an advantage over the visitor
        if row['WLoc'] == 'H':
            team1_elo += 100
        else:
            team2_elo += 100

        # consider elo as the primary feature of a team
        team1_features = [team1_elo]
        team2_features = [team2_elo]

        # add the remaining team statistics
        for key, value in team_stats.loc[Wteam].items():
            team1_features.append(value)
        for key, value in team_stats.loc[Lteam].items():
            team2_features.append(value)
        # randomly place the winner's features in the first or second slot
        # and label the sample 0 (winner first) or 1 (winner second)
        if random.random() > 0.5:
            X.append(team1_features + team2_features)
            y.append(0)
        else:
            X.append(team2_features + team1_features)
            y.append(1)

        if skip == 0:
            print('X',X)
            skip = 1

        # update elo based on this match
        new_winner_rank, new_loser_rank = calc_elo(Wteam, Lteam)
        team_elos[Wteam] = new_winner_rank
        team_elos[Lteam] = new_loser_rank

    return np.nan_to_num(X), y
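
The random slot assignment matters because the game log always lists the winner first. Without the swap every sample would carry the same label, and the classifier would have nothing to learn. The sketch below isolates that step. `make_sample`, the seed, and the two-element feature lists are hypothetical stand-ins for the real feature vectors.

```python
import random


def make_sample(winner_features, loser_features, rng):
    # label 0: the winner occupies the first slot; label 1: the second slot
    if rng.random() > 0.5:
        return winner_features + loser_features, 0
    return loser_features + winner_features, 1


rng = random.Random(42)  # seeded only to make the sketch repeatable
winner = [1700, 0.55]    # hypothetical [elo, some percentage]
loser = [1600, 0.48]

rows = [make_sample(winner, loser, rng) for _ in range(1000)]
labels = [label for _, label in rows]
share_label_1 = sum(labels) / len(labels)

print(rows[0])
print(share_label_1)
```

Roughly half of the samples end up with each label, and the label always records which slot holds the winner.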

In [8]:
if __name__ == '__main__':

    Mstat = pd.read_csv(folder + '/18-19Miscellaneous_Stat.csv')
    Ostat = pd.read_csv(folder + '/18-19Opponent_Per_Game_Stat.csv')
    Tstat = pd.read_csv(folder + '/18-19Team_Per_Game_Stat.csv')
    
    team_stats = initialize_data(Mstat, Ostat, Tstat)
    print("Data initialization is done.")
    result_data = pd.read_csv(folder + '/18-19_result.csv')
    X, y = build_dataSet(result_data)

    # train the model
    print("Fitting on %d game samples.." % len(X))

    
    # Logistic Regression from the sklearn library is used
    model = linear_model.LogisticRegression()
    model.fit(X, y)

    # estimate accuracy by 10-fold cross-validation
    print("Doing cross-validation..")
    print(cross_val_score(model, X, y, cv = 10, scoring='accuracy', n_jobs=-1).mean())
    
    print("The model has been built from 2018-2019 season data.")

Data initialization is done.
Building data set..
X [[1600, 26.4, 51.0, 31.0, 48.0, 34.0, 2.7, -0.44, 2.25, 112.6, 110.0, 2.6, 101.6, 0.312, 0.342, 0.574, 0.532, 12.9, 24.5, 0.24100000000000002, 0.512, 11.1, 78.6, 0.20600000000000002, 838342.0, 20447.0, 41.7, 91.5, 0.455, 10.3, 30.0, 0.342, 31.4, 61.5, 0.511, 18.8, 24.5, 0.768, 10.0, 33.5, 43.5, 23.4, 7.7, 4.1, 12.7, 22.1, 112.5, 41.5, 88.2, 0.47100000000000003, 10.8, 30.2, 0.359, 30.7, 58.0, 0.529, 21.2, 27.5, 0.7709999999999999, 10.9, 36.9, 47.8, 26.9, 7.4, 5.3, 14.9, 21.3, 115.2, 1700, 25.7, 49.0, 33.0, 52.0, 30.0, 4.44, -0.54, 3.9, 112.2, 107.8, 4.4, 99.6, 0.215, 0.381, 0.5670000000000001, 0.534, 11.5, 21.6, 0.17300000000000001, 0.514, 13.4, 77.0, 0.198, 763584.0, 18624.0, 39.5, 88.1, 0.44799999999999995, 11.5, 33.5, 0.344, 28.0, 54.6, 0.513, 17.4, 22.8, 0.764, 10.4, 35.5, 45.9, 23.7, 6.8, 3.9, 15.1, 19.5, 108.0, 42.1, 90.5, 0.465, 12.6, 34.5, 0.365, 29.5, 56.0, 0.527, 15.6, 19.5, 0.802, 9.8, 34.7, 44.5, 26.3, 8.6, 5.3, 12.8, 20.4, 
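
As the printed sample shows, the features mix very different scales: Elo ratings around 1600, percentages below 1, and attendance figures in the hundreds of thousands. Unscaled inputs like these can slow down or prevent convergence of the logistic regression solver. One common remedy, sketched here on synthetic data, is to standardize the features inside a pipeline so that the scaling is refit on each cross-validation fold. The synthetic features and the name `scaled_model` are assumptions. This is not the model that produced the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_games = 200

# synthetic features on very different scales (hypothetical)
elo_diff = rng.normal(0, 100, n_games)
attendance = rng.normal(800000, 50000, n_games)
X_demo = np.column_stack([elo_diff, attendance])

# synthetic labels drawn from the Elo win probability
p_win = 1 / (1 + 10 ** (-elo_diff / 400))
y_demo = (rng.random(n_games) < p_win).astype(int)

# scaling happens inside the pipeline, so each fold is scaled on its own training part
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(scaled_model, X_demo, y_demo, cv=10, scoring='accuracy')
mean_accuracy = scores.mean()

print(mean_accuracy)
```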

In [9]:
def predict_winner(team_1, team_2, model):
    features = []

    # team 1, VISITOR
    features.append(get_elo(team_1))
    for key, value in team_stats.loc[team_1].items():
        features.append(value)

    # team 2, HOME (gets the same 100-point Elo bonus used in training)
    features.append(get_elo(team_2) + 100)
    for key, value in team_stats.loc[team_2].items():
        features.append(value)

    features = np.nan_to_num(features)
    # column 0 is the probability of label 0, i.e. that the first team (visitor) wins
    return model.predict_proba([features])
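
As a sanity check on the model's probabilities, the Elo ratings alone already imply a win probability. The sketch below applies the same 100-point home bonus to the standard Elo expectation. The function name `elo_home_win_probability` and the ratings are hypothetical.

```python
import math


def elo_home_win_probability(visitor_elo, home_elo, home_bonus=100):
    # rating gap from the home team's point of view, including the home bonus
    diff = (home_elo + home_bonus) - visitor_elo
    return 1 / (1 + math.pow(10, -diff / 400))


p_even = elo_home_win_probability(1600, 1600)            # equal ratings
p_strong_visitor = elo_home_win_probability(1700, 1600)  # bonus cancels the gap

print(p_even, p_strong_visitor)
```

With equal ratings the bonus alone gives the home team about a 64% chance, and a visitor rated 100 points higher makes the game a coin flip.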

In [10]:
# predict match results in 19-20 by the trained model

print('Predicting on new schedule..')
schedule1920 = pd.read_csv(folder + '/19-20Schedule.csv')
result = []
for index, row in schedule1920.iterrows():
    team1 = row['Vteam']
    team2 = row['Hteam']
    pred = predict_winner(team1, team2, model)
    prob = pred[0][0]
    if prob > 0.5:
        winner = team1
        loser = team2
        result.append([winner, loser, prob])
    else:
        winner = team2
        loser = team1
        result.append([winner, loser, 1 - prob])

with open('19-20Result_pred.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['win', 'lose', 'probability'])
    writer.writerows(result)
    print('The results have been written to 19-20Result_pred.csv')

Predicting on new schedule..
The results have been written to 19-20Result_pred.csv


In [11]:
# load predicted results
predresults1920 = pd.read_csv('19-20Result_pred.csv', header=0)
predresults1920.head()

|  | win | lose | probability |
| :--- | :--- | :--- | :--- |
| 0 | Toronto Raptors | New Orleans Pelicans | 0.876681 |
| 1 | Los Angeles Clippers | Los Angeles Lakers | 0.606575 |
| 2 | Charlotte Hornets | Chicago Bulls | 0.806458 |
| 3 | Indiana Pacers | Detroit Pistons | 0.648526 |
| 4 | Orlando Magic | Cleveland Cavaliers | 0.90151 |


In [12]:
# load real results
realresults1920 = pd.read_csv(folder + '/19-20_result.csv',header=0)
realresults1920.head()

|  | WTeam |
| :--- | :--- |
| 0 | Toronto Raptors |
| 1 | Los Angeles Clippers |
| 2 | Charlotte Hornets |
| 3 | Detroit Pistons |
| 4 | Orlando Magic |


In [13]:
# compare predicted and real results
win_match = 0
for i in range(len(realresults1920)):
    if predresults1920.loc[i, 'win'] == realresults1920.loc[i, 'WTeam']:
        win_match += 1
p_success = win_match / len(realresults1920)
print("{}% of the predicted results match the real results in the 2019-2020 season.".format(p_success * 100))

59.23009623797025% of the predicted results match the real results in the 2019-2020 season.


In [14]:
# compare predicted and real results when probability is larger than 85%
win_match = 0
total_match = 0
for i in range(len(realresults1920)):
    if predresults1920.loc[i, 'probability'] > 0.85:
        total_match += 1
        if predresults1920.loc[i, 'win'] == realresults1920.loc[i, 'WTeam']:
            win_match += 1
p_success = win_match / total_match
print("{}% of the predicted results match the real results in the 2019-2020 season when the predicted probability is above 85%.".format(p_success * 100))

72.41379310344827% of the predicted results match the real results in the 2019-2020 season when the predicted probability is above 85%.
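
The two comparisons above differ only in the probability cutoff, so they can be folded into one helper. The sketch below uses a hypothetical function `accuracy_above` on toy lists. It also returns the number of games that passed the cutoff, since a high threshold can leave only a small sample, and it avoids a division by zero when no game qualifies.

```python
def accuracy_above(predicted, actual, probabilities, threshold):
    # keep only the games whose predicted probability exceeds the threshold
    picked = [(p, a) for p, a, prob in zip(predicted, actual, probabilities)
              if prob > threshold]
    if not picked:
        return None, 0
    hits = sum(1 for p, a in picked if p == a)
    return hits / len(picked), len(picked)


# toy data (hypothetical teams and probabilities)
predicted = ['A', 'B', 'C', 'D', 'E', 'F']
actual = ['A', 'B', 'X', 'D', 'Y', 'F']
probs = [0.90, 0.88, 0.86, 0.70, 0.60, 0.55]

overall = accuracy_above(predicted, actual, probs, 0.5)
confident = accuracy_above(predicted, actual, probs, 0.85)
none_left = accuracy_above(predicted, actual, probs, 0.95)

print(overall, confident, none_left)
```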


## Summary
We used statistics from basketball-reference.com to compute an Elo rating for each NBA team, updating it after every game of the previous season, and combined that rating with each team's season statistics to describe both sides of a game. A logistic regression trained on these features then determines which team has the advantage in a game. Unlike some other predictions, ours do not only name the winner of each game; they also give the probability of that outcome. One limitation of this project is that the amount of data used to evaluate each team is small (only the 2018-2019 season). For a more accurate and systematic judgment, statistics for each NBA team from more previous seasons should be included.

The prediction results are available <a href="19-20Result_pred.csv" target="_blank">here</a>.