# Predicting NBA games outcomes
## Using XGBoost and SageMaker

_Machine Learning Nanodegree Program | Capstone Project_

---

In this project we will train and evaluate an XGBoost model, with the goal to predict the probability of a home team winning an NBA regular season game. We will download datasets from Kaggle, and then select desired features, clean and preprocess data. After that, we will normalize the data and reduce dimensionality using PCA. Finally, we can create train, validation and test datasets, upload them to S3, train the model, and then test it. For more details about the data and methodology, please see Project Report file.

> **Note**: Please note that in order to run this notebook, you have to create a Kaggle account and upload your credentials kaggle.json file in Sagemaker, in the same path that this notebook is located. For more information about this, please see README file.

## General Outline

The general outline for this notebook is the following:

1. Importing libs
2. Downloading the data
3. Preparing and Processing the data
4. Splitting dataset, data normalization and dimensionality reduction
5. Uploading files to S3
6. Create and train the XGBoost model
7. Deploy and test the trained model

> **Note**: If you have already run steps 1-4, you can exit the notebook and later run only steps 5-7, because necessary files will be ready (also don't forget to run step 1 again to import the libs).

## Step 1: Importing libs

First, we import the necessary libs.

In [1]:
import os
import datetime

import pandas as pd
import numpy as np

import sagemaker
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.predictor import csv_serializer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.externals.joblib import dump, load
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score



## Step 2: Downloading the data

As Kaggle datasets will be used, we will install Kaggle API to download them

Then, using Kaggle API, we download the [Team Box Score dataset](https://www.kaggle.com/pablote/nba-enhanced-stats?select=2012-18_teamBoxScore.csv) and [Standings dataset](https://www.kaggle.com/pablote/nba-enhanced-stats?select=2012-18_standings.csv)

To check what each stat abbreviation means and what is its description, please check the datasets' docs: [Team Box Score doc](https://www.kaggle.com/pablote/nba-enhanced-stats?select=metadata_teamBoxScore.pdf); [Standings doc](https://www.kaggle.com/pablote/nba-enhanced-stats?select=metadata_standing.pdf)

> Rossotti, P. (2017, November). _NBA Enhanced Box Score and Standings (2012 - 2018)_, Version 27. Retrieved June 29, 2020 from https://www.kaggle.com/pablote/nba-enhanced-stats/.

In [2]:
# install kaggle api to download datasets
# IMPORTANT: the following code assumes that the file kaggle.json is in the same path than this notebook
!mkdir /home/$(whoami)/.kaggle
!cp -f kaggle.json /home/$(whoami)/.kaggle
!pip install kaggle
import kaggle

%mkdir -p raw_data

!kaggle datasets download -o -d pablote/nba-enhanced-stats -f 2012-18_teamBoxScore.csv -p raw_data
!kaggle datasets download -o -d pablote/nba-enhanced-stats -f 2012-18_standings.csv -p raw_data
!unzip -o raw_data/4389%2F168008%2Fcompressed%2F2012-18_teamBoxScore.csv.zip -d raw_data
!unzip -o raw_data/4389%2F168008%2Fcompressed%2F2012-18_standings.csv.zip -d raw_data

mkdir: cannot create directory ‘/home/ec2-user/.kaggle’: File exists
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p36/bin/python -m pip install --upgrade pip' command.[0m
Downloading 4389%2F168008%2Fcompressed%2F2012-18_teamBoxScore.csv.zip to raw_data
 53%|████████████████████▎                 | 1.00M/1.87M [00:00<00:00, 10.1MB/s]
100%|██████████████████████████████████████| 1.87M/1.87M [00:00<00:00, 16.3MB/s]
Downloading 4389%2F168008%2Fcompressed%2F2012-18_standings.csv.zip to raw_data
 70%|██████████████████████████▌           | 1.00M/1.43M [00:00<00:00, 9.97MB/s]
100%|██████████████████████████████████████| 1.43M/1.43M [00:00<00:00, 13.1MB/s]
Archive:  raw_data/4389%2F168008%2Fcompressed%2F2012-18_teamBoxScore.csv.zip
  inflating: raw_data/2012-18_teamBoxScore.csv  
Archive:  raw_data/4389%2F168008%2Fcompressed%2F2012-18_standings.csv.zip
  inflating: raw_data/2012-18_standings.csv  


## Step 3: Preparing and Processing the data

Before exploring the data, we will first do some data processing. First, we open both datasets to check shape and how the first rows look like.

In [3]:
team_boxscore_df = pd.read_csv("raw_data/2012-18_teamBoxScore.csv")

print(team_boxscore_df.shape)
team_boxscore_df.head()

(14758, 123)


Unnamed: 0,gmDate,gmTime,seasTyp,offLNm1,offFNm1,offLNm2,offFNm2,offLNm3,offFNm3,teamAbbr,...,opptFIC40,opptOrtg,opptDrtg,opptEDiff,opptPlay%,opptAR,opptAST/TO,opptSTL/TO,poss,pace
0,2012-10-30,19:00,Regular,Brothers,Tony,Smith,Michael,Workman,Haywoode,WAS,...,61.6667,105.6882,94.4447,11.2435,0.439,16.7072,1.0476,33.3333,88.9409,88.9409
1,2012-10-30,19:00,Regular,Brothers,Tony,Smith,Michael,Workman,Haywoode,CLE,...,56.0417,94.4447,105.6882,-11.2435,0.3765,18.8679,2.0,84.6154,88.9409,88.9409
2,2012-10-30,20:00,Regular,McCutchen,Monty,Wright,Sean,Fitzgerald,Kane,BOS,...,80.8333,126.3381,112.6515,13.6866,0.5244,19.8287,3.125,100.0,94.9832,94.9832
3,2012-10-30,20:00,Regular,McCutchen,Monty,Wright,Sean,Fitzgerald,Kane,MIA,...,62.7083,112.6515,126.3381,-13.6866,0.4643,18.8501,1.5,25.0,94.9832,94.9832
4,2012-10-30,22:30,Regular,Foster,Scott,Zielinski,Gary,Dalen,Eric,DAL,...,58.6458,99.3678,108.1034,-8.7356,0.5,18.6567,1.7143,42.8571,91.579,91.579


In [4]:
standings_df = pd.read_csv("raw_data/2012-18_standings.csv")

print(standings_df.shape)
standings_df.head()

(29520, 39)


Unnamed: 0,stDate,teamAbbr,rank,rankOrd,gameWon,gameLost,stk,stkType,stkTot,gameBack,...,rel%Indx,mov,srs,pw%,pyth%13.91,wpyth13.91,lpyth13.91,pyth%16.5,wpyth16.5,lpyth16.5
0,2012-10-30,ATL,3,3rd,0,0,-,-,0,0.5,...,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0
1,2012-10-30,BKN,3,3rd,0,0,-,-,0,0.5,...,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0
2,2012-10-30,BOS,14,14th,0,1,L1,loss,1,1.0,...,0.0,-13.0,-13.0,0.072,0.1687,13.8334,68.1666,0.131,10.742,71.258
3,2012-10-30,CHA,3,3rd,0,0,-,-,0,0.5,...,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0
4,2012-10-30,CHI,3,3rd,0,0,-,-,0,0.5,...,0.0,0.0,0.0,0.5,0.0,0.0,82.0,0.0,0.0,82.0


We can see that each game is duplicated in the team boxscore dataframe. For each game, the first row considers the "team" prefix stats as belonging to the away team, and the "oppt" ones belonging to the home team. In the second row, the opposite occurs. In this project, we will discard the first row for each game, as we want to achieve the probability of the home team winning the game.

In [5]:
# to remove duplicated rows, we simply filter them using "teamLoc" equals to "Home"
team_boxscore_df = team_boxscore_df[team_boxscore_df.teamLoc == "Home"]

In [6]:
# season indeces:
# 1: 2012-2013
# 2: 2013-2014
# 3: 2014-2015
# 4: 2015-2016
# 5: 2016-2017
# 6: 2017-2018

def get_season_index(date):
    date_array = date.split("-")
    year = int(date_array[0])
    month = int(date_array[1])
    
    if year == 2012 or (year == 2013 and month <= 6):
        return 1
    if (year == 2013 and month >= 10) or (year == 2014 and month <= 6):
        return 2
    if (year == 2014 and month >= 10) or (year == 2015 and month <= 6):
        return 3
    if (year == 2015 and month >= 10) or (year == 2016 and month <= 6):
        return 4
    if (year == 2016 and month >= 10) or (year == 2017 and month <= 6):
        return 5
    if (year == 2017 and month >= 10) or year == 2018:
        return 6

In [7]:
team_boxscore_df["season"] = team_boxscore_df.apply (lambda row: get_season_index(row["gmDate"]), axis=1)
standings_df["season"] = standings_df.apply (lambda row: get_season_index(row["stDate"]), axis=1)

In [8]:
# Test cell

# we will only test team boxscore dataframe, because we know that each value should be 1230,
# the number of NBA games per season (exception: 2012-2013 season got 1 game canceled due to Boston Marathon explosions)

def test_season_indeces(df):
    counts = df["season"].value_counts()
    
    assert counts[1] == 1229, "Incorrect number of games for 2012-2013 season, expected 1229, got "+str(counts[1])
    assert counts[2] == 1230, "Incorrect number of games for 2013-2014 season, expected 1230, got "+str(counts[2])
    assert counts[3] == 1230, "Incorrect number of games for 2014-2015 season, expected 1230, got "+str(counts[3])
    assert counts[4] == 1230, "Incorrect number of games for 2015-2016 season, expected 1230, got "+str(counts[4])
    assert counts[5] == 1230, "Incorrect number of games for 2016-2017 season, expected 1230, got "+str(counts[5])
    assert counts[6] == 1230, "Incorrect number of games for 2017-2018 season, expected 1230, got "+str(counts[6])
    print("Test passed")
    
test_season_indeces(team_boxscore_df)

Test passed


Next, we will add 4 columns in team boxscore dataframe: teamWins, teamLosses, opptWins and opptLosses by looking at the standings dataframe, comparing the standing date and the game date, using teams abbreviations.

Obs: we need to look at the standing date one day before the game date, because the standings are calculated at the end of the day, after all games have finished. 

In [9]:
def get_day_before_date(date):
    date_object = datetime.datetime.strptime(date, "%Y-%m-%d")
    day_before_date_object = date_object - datetime.timedelta(days=1)
    return datetime.datetime.strftime(day_before_date_object, "%Y-%m-%d")

def get_team_wins_and_losses(row, home_team=True):
    if home_team:
        prefix = 'team'
    else:
        prefix = 'oppt'
    
    day_before = get_day_before_date(row["gmDate"])
    standing = standings_df.loc[(standings_df['stDate'] == day_before) \
                            & (standings_df['teamAbbr'] == row[prefix + "Abbr"])]

    if(len(standing) == 0):
        row[prefix + "Wins"], row[prefix + "Losses"] = (0, 0)
    else:
        row[prefix + "Wins"], row[prefix + "Losses"] = (standing['gameWon'].values[0], standing['gameLost'].values[0])
    
    return row

In [10]:
team_boxscore_df = team_boxscore_df.apply (lambda row: get_team_wins_and_losses(row), axis=1)
team_boxscore_df = team_boxscore_df.apply (lambda row: get_team_wins_and_losses(row, False), axis=1)

In [11]:
# Test cell

# testing some games that we know both home and away teams wins and losses
def test_wins_and_losses(game, real_team_wins, real_team_losses, real_oppt_wins, real_oppt_losses):
    assert int(game['teamWins']) == real_team_wins, "Incorrect number of wins for home team, expected " \
        +str(real_team_wins)+", got "+str(game['teamWins'])
    assert int(game['teamLosses']) == real_team_losses, "Incorrect number of losses for home team, expected " \
        +str(real_team_losses)+", got "+str(game['teamLosses'])
    assert int(game['opptWins']) == real_oppt_wins, "Incorrect number of wins for away team, expected " \
        +str(real_oppt_wins)+", got "+str(game['opptWins'])
    assert int(game['opptLosses']) == real_oppt_losses, "Incorrect number of losses for away team, expected " \
        +str(real_oppt_losses)+", got "+str(game['opptLosses'])
    print("Test passed")
    
test_wins_and_losses(team_boxscore_df.loc[2727], 2, 5, 2, 5)
test_wins_and_losses(team_boxscore_df.loc[6365], 10, 39, 19, 29)
test_wins_and_losses(team_boxscore_df.loc[10683], 18, 11, 15, 12)

Test passed
Test passed
Test passed


At this point, we don't need the standings dataframe anymore, as we have stored the necessary data from it in the team boxscore dataframe. So we can free some memory up.

In [12]:
standings_df = None

Now that we have all the raw numbers we need in one dataframe, we will start creating the features we will use.

We will calculate the weighted average of the past 10 games of home and away team for the following advanced stats:

    TREB%, ASST%, TS%, EFG%, OREB%, DREB%, TO%, STL%, BLK%, BLKR, PPS, FIC, FIC40, Ortg, Drtg, EDiff, Play%, AR, AST/TO, STL/TO

We will assign weights following the logic that most recent games have more impact in describing how well a team has been playing than farther games.

Obs: As we need 10 *played* games before the actual game in order to calculate the average, we will drop the ones where home or away teams have not played 10 games before yet.

Obs2: Unfortunately, some dates are missing in the standings_csv file, which causes some games to show the Wins and Losses stats as 0, even though they occurred after both teams have already played at least 10 games. So, we will also drop these games.

We will also calculate the winning percentage at the game date for both teams, and also will use the DayOff stat for home and away teams to take into account the rest days before games.

__a) Calculating 10-game averages__

In [13]:
features = ['TREB%', 'ASST%', 'TS%', 'EFG%', 'OREB%', 'DREB%', 'TO%', 'STL%', 'BLK%', 'BLKR', 'PPS', 'FIC', \
            'FIC40', 'Ortg', 'Drtg', 'EDiff', 'Play%', 'AR', 'AST/TO', 'STL/TO']

# weights will be assigned according to their array position: the 10th element corresponds to the most recent game,
# the 9th element corresponds to the second most recent game, and so on and so forth
weights = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]

In [14]:
%%time

def get_10_last_games_team_avg_stats(row, home_team=True):
    if home_team:
        prefix = 'team'
    else:
        prefix = 'oppt'
    
    # if any of the teams haven't played 10 games yet, we don't do anything
    if (row["teamWins"] + row["teamLosses"] < 10) or (row["opptWins"] + row["opptLosses"] < 10):
        return row
    
    # we get the last 10 games where the target team played as home or away team
    last_10_games_team_df = team_boxscore_df.loc[(team_boxscore_df['teamAbbr'] == row[prefix + 'Abbr']) \
                                            | (team_boxscore_df['opptAbbr'] == row[prefix + 'Abbr'])].loc[:row.name].tail(11).iloc[:-1]
    
    # games where the target team played as home team inside the 10-game span
    home_games_played_team_df = last_10_games_team_df.loc[(team_boxscore_df['teamAbbr'] == row[prefix + 'Abbr'])]
    # games where the target team played as away team inside the 10-game span
    away_games_played_team_df = last_10_games_team_df.loc[(team_boxscore_df['opptAbbr'] == row[prefix + 'Abbr'])]
    
    # now we filter only the columns we want, and rename them to match in both dfs
    home_features = ["team" + feature for feature in features]
    away_features = ["oppt" + feature for feature in features]
    
    home_games_played_team_df = home_games_played_team_df[home_features].rename(columns=lambda x: x[4:])
    away_games_played_team_df = away_games_played_team_df[away_features].rename(columns=lambda x: x[4:])

    # we then join the stats from both dfs, multiply all columns by the weights, and calculate the mean
    averages_df = pd.concat([home_games_played_team_df, away_games_played_team_df], sort=False).sort_index()\
        .apply(lambda x: x * weights).mean()
    
    # finally, we assign each value from the averages_df to the row, with the stat name plus the prefix 'AVG'
    for feature in features:
        row['AVG' + prefix + feature] = averages_df[feature]
        
    return row

team_boxscore_df = team_boxscore_df.apply (lambda row: get_10_last_games_team_avg_stats(row), axis=1)
team_boxscore_df = team_boxscore_df.apply (lambda row: get_10_last_games_team_avg_stats(row, False), axis=1)
print("Calculations of averages done!")
print(team_boxscore_df.shape)

Calculations of averages done!
(7379, 168)
CPU times: user 8min 41s, sys: 1.55 s, total: 8min 43s
Wall time: 8min 40s


In [15]:
# Test cell

# testing some games that we know the last 10 occurrences for the target stat we want
def test_averages(game, last_10_game_stats, stat_name):
    if len(last_10_game_stats) != len(weights):
        raise Exception("Wrong amount of game stats")
        
    real_average = sum([x * weights[i] for i,x in enumerate(last_10_game_stats)]) / len(last_10_game_stats)
    
    assert round(float(game['AVG' + stat_name]), 4) == round(real_average, 4), "Incorrect average calculation, expected " \
        +str(round(real_average, 4))+", got "+str(round(float(game['AVG' + stat_name]), 4))
    print("Test passed")

test_averages(team_boxscore_df.loc[4115], [17.082, 14.982, 9.6154, 21.5034, 9.612, 9.2971, 16.0256, 13.6559, 14.6306, 18.6966], "teamTO%")
test_averages(team_boxscore_df.loc[7793], [103.2956, 98.3942, 108.6772, 95.4746, 120.5297, 96.2682, 94.392, 107.4762, 102.3214, 106.5345], "teamOrtg")
test_averages(team_boxscore_df.loc[10339], [76.0417, 61.6667, 60.5649, 78.4375, 70.5394, 88.4855, 75.3112, 59.8958, 61.6667, 75.625], "opptFIC40")

Test passed
Test passed
Test passed


In [16]:
# Finally, we drop the rows where we didn't calculate the averages
team_boxscore_df = team_boxscore_df.dropna()
print(team_boxscore_df.shape)
team_boxscore_df.head()

(6260, 168)


Unnamed: 0,AVGopptAR,AVGopptASST%,AVGopptAST/TO,AVGopptBLK%,AVGopptBLKR,AVGopptDREB%,AVGopptDrtg,AVGopptEDiff,AVGopptEFG%,AVGopptFIC,...,teamRslt,teamSTL,teamSTL%,teamSTL/TO,teamTO,teamTO%,teamTRB,teamTREB%,teamTS%,teamWins
273,37.70853,123.93995,3.84134,11.95669,19.36824,148.01665,216.02935,9.12686,1.08796,159.4125,...,Loss,13,13.8925,76.4706,17,16.1536,38,51.3514,0.4986,4
285,40.47572,138.08199,4.25931,7.72652,11.31651,151.10513,211.19551,6.01867,1.03781,151.7375,...,Win,8,8.8755,47.0588,17,16.9255,39,56.5217,0.6172,1
301,32.77222,115.516,2.70514,7.06051,10.40672,148.80702,209.96861,-4.01046,0.97763,134.5,...,Loss,12,11.467,70.5882,17,14.9701,43,40.9524,0.523,6
303,33.83075,122.81498,2.9365,8.08264,14.38907,144.44564,214.50429,-2.64267,0.98413,146.6125,...,Win,4,4.1156,28.5714,14,12.0565,55,52.8846,0.4994,5
305,31.95958,118.36269,3.1469,7.79878,11.97997,153.28462,211.4613,-9.13392,0.92896,126.725,...,Win,6,6.5775,75.0,8,7.8927,41,46.0674,0.5677,6


__b) Calculating winning percentages__

In [17]:
team_boxscore_df["teamWin%"] = team_boxscore_df.apply(lambda x: x["teamWins"] / (x["teamWins"] + x["teamLosses"]), axis=1)
team_boxscore_df["opptWin%"] = team_boxscore_df.apply(lambda x: x["opptWins"] / (x["opptWins"] + x["opptLosses"]), axis=1)
team_boxscore_df[["teamWins", "teamLosses", "teamWin%", "opptWins", "opptLosses", "opptWin%"]].head(10)

Unnamed: 0,teamWins,teamLosses,teamWin%,opptWins,opptLosses,opptWin%
273,4,6,0.4,7,3,0.7
285,1,9,0.1,6,4,0.6
301,6,5,0.545455,5,5,0.5
303,5,6,0.454545,4,6,0.4
305,6,4,0.6,3,7,0.3
311,2,8,0.2,7,4,0.636364
315,3,7,0.3,2,9,0.181818
321,8,3,0.727273,8,2,0.8
323,6,5,0.545455,8,3,0.727273
327,4,7,0.363636,5,5,0.5


__c) Creating target variable__

This is a simple action: we will look at the stat 'teamRslt' (which represents if the home team won or not), and we will assign a target variable as 1 if home team won, or 0 if not.

In [18]:
team_boxscore_df["homeTeamWon"] = team_boxscore_df.apply(lambda x: 1 if x["teamRslt"] == "Win" else 0, axis=1)
team_boxscore_df[["gmDate", "teamAbbr", "opptAbbr", "homeTeamWon", "teamRslt"]].head(10)

Unnamed: 0,gmDate,teamAbbr,opptAbbr,homeTeamWon,teamRslt
273,2012-11-17,PHO,MIA,0,Loss
285,2012-11-18,DET,BOS,1,Win
301,2012-11-19,DAL,GS,0,Loss
303,2012-11-19,UTA,HOU,1,Win
305,2012-11-20,PHI,TOR,1,Win
311,2012-11-21,CLE,PHI,1,Win
315,2012-11-21,ORL,DET,1,Win
321,2012-11-21,OKC,LAC,1,Win
323,2012-11-21,BOS,SA,0,Loss
327,2012-11-21,HOU,CHI,1,Win


As a final step for the preparation and processing of data, we will drop the unneccessary columns.

In [19]:
home_avg_features = ["AVGteam" + feature for feature in features]
away_avg_features = ["AVGoppt" + feature for feature in features]
other_features = ["teamWin%", "opptWin%", "teamDayOff", "opptDayOff"]
target_variable = ["homeTeamWon"]

columns = target_variable + home_avg_features + away_avg_features + other_features
team_boxscore_df = team_boxscore_df[columns]
team_boxscore_df.reset_index(drop=True, inplace=True)
print(team_boxscore_df.shape)
team_boxscore_df.head(10)

(6260, 45)


Unnamed: 0,homeTeamWon,AVGteamTREB%,AVGteamASST%,AVGteamTS%,AVGteamEFG%,AVGteamOREB%,AVGteamDREB%,AVGteamTO%,AVGteamSTL%,AVGteamBLK%,...,AVGopptDrtg,AVGopptEDiff,AVGopptPlay%,AVGopptAR,AVGopptAST/TO,AVGopptSTL/TO,teamWin%,opptWin%,teamDayOff,opptDayOff
0,0,97.53393,111.91323,1.00521,0.93683,65.34989,139.41316,22.56112,17.29492,14.39722,...,216.02935,9.12686,0.9161,37.70853,3.84134,109.83052,0.4,0.7,1,2
1,1,94.73646,127.24545,1.04206,0.95959,51.59563,141.93014,28.84313,13.84251,13.48204,...,211.19551,6.01867,0.90142,40.47572,4.25931,124.04025,0.1,0.6,2,1
2,0,99.35806,115.76091,1.0931,0.99186,44.71023,155.95239,29.02624,12.60678,12.82992,...,209.96861,-4.01046,0.85568,32.77222,2.70514,87.45949,0.545455,0.5,2,1
3,1,103.06153,115.7007,1.01439,0.9311,61.51384,137.35438,28.02581,15.02809,15.8587,...,214.50429,-2.64267,0.83194,33.83075,2.9365,97.40042,0.454545,0.4,2,1
4,1,96.42723,122.09489,0.98945,0.91803,43.56332,149.13065,24.2596,18.45238,12.80557,...,211.4613,-9.13392,0.80955,31.95958,3.1469,123.08392,0.6,0.3,2,2
5,1,99.19648,106.85987,0.99712,0.92031,66.5449,136.25089,28.15689,20.07318,4.75583,...,203.36132,2.28723,0.83488,35.08524,4.1694,159.22182,0.2,0.636364,3,1
6,1,101.8747,124.22797,0.96565,0.90074,49.48308,148.71104,31.10318,11.66177,9.26621,...,208.43352,0.2035,0.86603,35.1976,2.93708,82.79646,0.3,0.181818,2,3
7,1,101.533,125.9993,1.21515,1.10327,43.88295,153.00695,31.63873,15.78707,15.58708,...,195.68553,21.59652,0.91125,33.62097,2.62139,122.14497,0.727273,0.8,3,2
8,0,92.52349,128.86441,1.1232,1.02826,39.38927,151.72558,27.15537,16.32071,6.5836,...,206.81228,7.64634,0.86244,36.68806,3.4764,128.99112,0.545455,0.727273,3,2
9,1,101.64165,122.55126,1.04879,0.95236,54.03522,143.79244,28.23067,14.8495,9.56941,...,209.19336,-4.58015,0.85198,35.662,3.1677,97.47697,0.363636,0.5,2,3


## Step 4: Splitting dataset, data normalization and dimensionality reduction

Now that preprocessing is done, we can split the data into two datasets: train and test.

In [20]:
Y_pd = team_boxscore_df[['homeTeamWon']]
X_pd = team_boxscore_df.drop(columns=['homeTeamWon'])

# Split the dataset into 60% training and 40% testing sets.
X_train, X_test, Y_train, Y_test = train_test_split(X_pd, Y_pd, test_size=0.3, shuffle=True, stratify=Y_pd)

With datasets ready, we can normalize the features in the train dataset, using sklearn's MinMaxScaler. We will save the scaler after fitting the data, so we can load it and use it later to transform test data.

In [21]:
scaler = MinMaxScaler()
scaled_X_train = pd.DataFrame(scaler.fit_transform(X_train))
scaled_X_train.columns = X_train.columns
scaled_X_train.index = X_train.index

# save scaler into file
dump(scaler, 'min_max_scaler.bin', compress=True)

scaled_X_train.head()

Unnamed: 0,AVGteamTREB%,AVGteamASST%,AVGteamTS%,AVGteamEFG%,AVGteamOREB%,AVGteamDREB%,AVGteamTO%,AVGteamSTL%,AVGteamBLK%,AVGteamBLKR,...,AVGopptDrtg,AVGopptEDiff,AVGopptPlay%,AVGopptAR,AVGopptAST/TO,AVGopptSTL/TO,teamWin%,opptWin%,teamDayOff,opptDayOff
843,0.682412,0.579009,0.540794,0.514424,0.54254,0.68862,0.743642,0.522839,0.203818,0.274893,...,0.681539,0.305317,0.469064,0.392384,0.282446,0.358219,0.544118,0.323529,0.181818,0.2
357,0.724749,0.392517,0.494549,0.464548,0.790998,0.233282,0.62288,0.495442,0.46007,0.4143,...,0.549818,0.274873,0.401264,0.427743,0.367798,0.299373,0.571429,0.324324,0.272727,0.2
66,0.618512,0.482649,0.524509,0.519535,0.721847,0.326647,0.569593,0.477581,0.265835,0.257083,...,0.562676,0.525685,0.481281,0.505687,0.270206,0.276192,0.466667,0.529412,0.272727,0.1
3124,0.521939,0.436831,0.334038,0.369553,0.605933,0.352773,0.305676,0.40554,0.20152,0.142451,...,0.420795,0.429022,0.44856,0.341305,0.234823,0.417204,0.45679,0.308642,0.181818,0.2
5372,0.329548,0.433811,0.520417,0.60413,0.255343,0.497152,0.492931,0.310898,0.441364,0.538701,...,0.473519,0.695858,0.655334,0.76949,0.45216,0.392804,0.409091,0.727273,0.181818,0.1


Now, we will look the features and try to perform a __dimensionality reduction__, in order to select the lowest number of features and get the most data variance (in this case, at least __95%__). For this, we will use Sklearn's PCA.

In [22]:
# create PCA instance to achieve target 95% explained variance
pca = PCA(0.95)
reduced_X_train = pd.DataFrame(pca.fit_transform(scaled_X_train))
print("Original data shape: " + str(scaled_X_train.shape))
print("Reduced data shape: " + str(reduced_X_train.shape))
reduced_X_train.index = X_train.index
reduced_X_train.head()

Original data shape: (4382, 44)
Reduced data shape: (4382, 18)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
843,-0.359647,0.346179,-0.037493,-0.13598,-0.223906,-0.061597,0.129169,0.208244,0.134575,0.105519,-0.030253,0.203972,0.116808,-0.278372,-0.040701,0.056493,-0.068011,-0.077677
357,-0.517258,0.491011,0.305529,-0.151448,-0.170962,0.450575,0.12856,0.056942,0.154189,-0.0997,-0.076413,-0.092385,0.08641,-0.005608,0.031675,0.019828,0.000121,0.018536
66,-0.043989,0.08204,0.458649,-0.237278,0.205218,0.243044,0.094669,0.082309,0.536925,0.044201,-0.175636,0.226247,0.018659,0.014269,-0.020946,-0.057651,-0.121484,-0.054387
3124,-0.505664,0.204865,0.090599,-0.275946,0.386161,0.24174,-0.353886,0.02481,-0.060311,0.064774,-0.001297,0.120581,-0.169483,0.028412,-0.030226,0.074684,0.065519,0.085435
5372,0.759129,-0.561677,-0.097509,0.446023,0.186,0.077025,0.073061,0.132539,0.279223,0.019344,-0.012565,-0.062794,0.092378,0.423181,-0.079923,0.04023,0.074266,-0.143146


We can observe in the previous cell that we were able to reduce the dataset dimensionality, by creating 18 weighted-linear-combination components from the 44 original features. We will save this fitted PCA to use it later, when we will reduce the test dataset dimensionality.

In [23]:
# save pca into file
dump(pca, 'pca.bin', compress=True)

['pca.bin']

Now that train dataset was normalized and reduced, we can split it into train and validation datasets.

In [24]:
# Split the training set into 60% training and 40% validation sets.
X_train, X_val, Y_train, Y_val = train_test_split(reduced_X_train, Y_train, test_size=0.3, shuffle=True, stratify=Y_train)

In [25]:
# create local folder to store csv files
data_dir = 'data/nba_files'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

# create csv files
pd.concat([Y_test, X_test], axis=1).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)
pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

## Step 5: Uploading files to S3

Now we can upload train and validation datasets to S3 (test.csv file will be used later to test the model).

In [26]:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# upload csv files to s3
prefix = 'nba-capstone-project'

val_location = sagemaker_session.upload_data(os.path.join(data_dir, 'validation.csv'), bucket=bucket, key_prefix=prefix)
train_location = sagemaker_session.upload_data(os.path.join(data_dir, 'train.csv'), bucket=bucket, key_prefix=prefix)

## Step 6: Create and train the XGBoost model

As mentioned before, we chose to use XGBoost due to its good performance and results when the input is a tabular dataset. So, with files uploaded in S3, we can create and train our XGBoost model.

In [27]:
role = sagemaker.get_execution_role()

# construct the image name for the training container
container = sagemaker.amazon.amazon_estimator.get_image_uri(sagemaker_session.boto_region_name, 'xgboost')

# create xgboost model
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sagemaker_session)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
	get_image_uri(region, 'xgboost', '1.0-1').


We will use Sagemaker feature HyperparameterTuner, which will allow us to select the best model from a range of hyperparameters.

In [28]:
# set default hyperparameter values)
xgb.set_hyperparameters(max_depth=10,
                        eta=0.5,
                        gamma=3,
                        min_child_weight=2,
                        subsample=0.5,
                        objective='binary:logistic',
                        early_stopping_rounds=20,
                        num_round=500)


# set hyperparametertuner
xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb,
                                               objective_metric_name = 'validation:rmse',
                                               objective_type = 'Minimize',
                                               max_jobs = 50,
                                               max_parallel_jobs = 3,
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(2, 20),
                                                    'eta'      : ContinuousParameter(0.05, 0.95),
                                                    'min_child_weight': IntegerParameter(1, 10),
                                                    'subsample': ContinuousParameter(0.05, 0.95),
                                                    'gamma': ContinuousParameter(0, 20),
                                               })

We can now train the XGBoost model using S3 csv files for train and validation.

In [29]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})



In [30]:
xgb_hyperparameter_tuner.wait()

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................!


## Step 7: Deploy and test the trained model

Now that we have trained multiple XGBoost models, we can select the best one and deploy it in order to test our model using the test dataset, after we have applied the previously fitted MinMaxScaler and PCA.

In [31]:
# deploy the best training model (automatically gets the best training job)
xgb_predictor = xgb_hyperparameter_tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')



2020-07-15 07:35:58 Starting - Preparing the instances for training
2020-07-15 07:35:58 Downloading - Downloading input data
2020-07-15 07:35:58 Training - Training image download completed. Training in progress.
2020-07-15 07:35:58 Uploading - Uploading generated training model
2020-07-15 07:35:58 Completed - Training job completed[34mArguments: train[0m
[34m[2020-07-15:07:35:46:INFO] Running standalone xgboost training.[0m
[34m[2020-07-15:07:35:46:INFO] Setting up HPO optimized metric to be : rmse[0m
[34m[2020-07-15:07:35:46:INFO] File size need to be processed in the node: 1.54mb. Available memory size in the node: 8497.71mb[0m
[34m[2020-07-15:07:35:46:INFO] Determined delimiter of CSV input is ','[0m
[34m[07:35:46] S3DistributionType set as FullyReplicated[0m
[34m[07:35:46] 3067x18 matrix with 55206 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,[0m
[34m[2020-07-15:07:35:46:INFO] Determined delimiter of CSV input is ','[0m
[34m[07

[34m[07:35:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3[0m
[34m[42]#011train-rmse:0.443741#011validation-rmse:0.46215[0m
[34m[07:35:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3[0m
[34m[43]#011train-rmse:0.443254#011validation-rmse:0.462042[0m
[34m[07:35:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 2 pruned nodes, max_depth=3[0m
[34m[44]#011train-rmse:0.442954#011validation-rmse:0.4619[0m
[34m[07:35:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3[0m
[34m[45]#011train-rmse:0.442627#011validation-rmse:0.461851[0m
[34m[07:35:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3[0m
[34m[46]#011train-rmse:0.442346#011validation-rmse:0.461732[0m
[34m[07:35:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 



Training seconds: 53
Billable seconds: 53
-------------!

In [32]:
# load previouly saved MinMaxScaler
scaler = load('min_max_scaler.bin')

# load previouly saved PCA
pca = load('pca.bin')

In [33]:
# load test features and labels from file
test_df = pd.read_csv("data/nba_files/test.csv")

Y_test = test_df.iloc[:,0]
X_test = test_df.iloc[:,1:]

# apply MinMaxScaler to test features
scaled_X_test = pd.DataFrame(scaler.fit_transform(X_test))
# scaled_X_test.columns = X_test.columns

# apply PCA to test features
reduced_X_test = pd.DataFrame(pca.transform(scaled_X_test))

We can now send the scaled and reduced test dataset to our model, get the predictions and calculate the accuracy comparing with the true labels.

In [35]:
# get predictions
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
Y_pred = xgb_predictor.predict(reduced_X_test.values).decode('utf-8')
Y_pred = np.fromstring(Y_pred, sep=',')

# calculate accuracy
accuracy = accuracy_score(Y_test.values, np.round(Y_pred))
print("Model accuracy: " + str(accuracy))

Model accuracy: 0.6574320724560468


Our model correctly predicted if the home team would win the game with around 66% accuracy, which is a good value compared with the standards achieved by other models (around 60-70%). The encouraging part of this project is that it is still has a huge potential for improvement: we can select other algorithms, incorporate other types of data (such as player injuries), include more seasons in the datasets, among others. For more information, please see Project Report file. 

### Delete the endpoint

As a final step, we will delete the endpoint we have deployed.

In [36]:
xgb_predictor.delete_endpoint()