# Prediction of NBA fantasy scores

This notebook is an experiment to see whether machine learning models can predict the fantasy production of NBA players. We plan on comparing two techniques: an MLP regressor and a decision tree regressor.  

Credit to https://www.kaggle.com/jwals96/nba-201718-stats-and-fantasy-scores for the dataset we plan on using.

## Preparing the data

### Ingesting and interpreting the dataset

We'll start by using pandas to read the `.csv` file containing the dataset.  
Our dataset is made up of many features, such as:
- Player season averages (points, rebounds, assists, shooting percentage, etc.)
- Information about the specific game (date, home team, etc.)
- Information about the opponent.
- The player's fantasy score.

Every example in our dataset corresponds to a player's fantasy production for a specific game.  

In [35]:
import pandas as pd

In [36]:
dataset_raw = pd.read_csv("NBA Stats Database - Player Game Stats 2017 (1).csv")
dataset_raw.head(5)

Unnamed: 0,Name,Position,(NBAS Team),(BBref Team),DK Team,(Team),(vs.),(NBAS Opponent),(BBref Opponent),DK Opponent,...,Player FT%,Player ORB,Player DRB,Player TRB,Player AST,Player STL,Player BLK,Player TOV,Player PF,Player PS/G
0,Aaron Brooks,PG,MIN,MIN,Min,MIN,vs.,IND,IND,Ind,...,0.727,0.2,0.3,0.5,0.6,0.2,0.0,0.3,0.9,2.3
1,Aaron Brooks,PG,MIN,MIN,Min,MIN,@,DET,DET,Det,...,0.727,0.2,0.3,0.5,0.6,0.2,0.0,0.3,0.9,2.3
2,Aaron Brooks,PG,MIN,MIN,Min,MIN,@,GSW,GSW,GS,...,0.727,0.2,0.3,0.5,0.6,0.2,0.0,0.3,0.9,2.3
3,Aaron Brooks,PG,MIN,MIN,Min,MIN,vs.,CHA,CHO,Cha,...,0.727,0.2,0.3,0.5,0.6,0.2,0.0,0.3,0.9,2.3
4,Aaron Brooks,PG,MIN,MIN,Min,MIN,@,DAL,DAL,Dal,...,0.727,0.2,0.3,0.5,0.6,0.2,0.0,0.3,0.9,2.3


Let's explore our data just to have a better idea of what is available to us.

In [37]:
#columns
print("All columns in the dataset:")
print(list(dataset_raw))
print()

# Shape of the dataset
print("Number of examples in the dataset: ", dataset_raw.shape[0])
print("Number of columns in the dataset: ", dataset_raw.shape[1])
print()

# Information about each column
for column in list(dataset_raw):
    print(dataset_raw[column].value_counts())
    print()

All columns in the dataset:
['Name', 'Position', '(NBAS Team)', '(BBref Team)', 'DK Team', '(Team)', '(vs.)', '(NBAS Opponent)', '(BBref Opponent)', 'DK Opponent', 'Home', 'Match Up', 'Date', 'Min', '(Double digit stats)', '(Unadj Pts)', 'DK Points Scored', 'Team Points ', 'Player Avg Pts/min', 'Team Avg Pace', 'Opponent Avg Pace', 'Team FG', 'Team FGA', 'Team FG%', 'Team 3P', 'Team 3PA', 'Team 3P%', 'Team 2P', 'Team 2PA', 'Team 2P%', 'Team FT', 'Team FTA', 'Team FT%', 'Team ORB', 'Team DRB', 'Team TRB', 'Team AST', 'Team STL', 'Team BLK', 'Team TOV', 'Team PF', 'Team Avg Points', 'Opp FG', 'Opp FGA', 'Opp FG%', 'Opp 3P', 'Opp 3PA', 'Opp 3P%', 'Opp 2P', 'Opp 2PA', 'Opp 2P%', 'Opp FT', 'Opp FTA', 'Opp FT%', 'Opp ORB', 'Opp DRB', 'Opp TRB', 'Opp AST', 'Opp STL', 'Opp BLK', 'Opp TOV', 'Opp PF', 'Opp Avg PTS', 'Player Age', 'Player FG', 'Player FGA', 'Player FG%', 'Player 3P', 'Player 3PA', 'Player 3P%', 'Player 2P', 'Player 2PA', 'Player 2P%', 'Player eFG%', 'Player FT', 'Player FTA', 'Pl


7.6    4460
7.7    3714
7.8    1878
8.0    1851
7.0    1850
8.8    1780
7.4     941
6.7     940
9.1     931
6.9     926
8.6     926
6.8     922
7.1     920
7.9     905
6.2     896
8.5     879
7.5     869
8.3     843
8.4     838
Name: Team STL, dtype: int64

4.5    3698
5.1    2709
4.8    2644
3.8    1858
4.1    1823
4.2    1781
4.9    1781
5.6    1019
7.5    1013
6.1     984
5.0     931
4.7     918
5.2     912
4.3     905
5.4     892
3.9     868
3.5     867
5.9     838
5.3     828
Name: Team BLK, dtype: int64

14.7    2775
14.0    2739
13.7    1855
13.5    1780
13.8    1771
15.0    1731
13.1    1019
15.5    1013
13.4     984
15.6     943
12.3     938
15.7     926
12.7     922
14.5     919
15.8     918
14.6     905
15.2     896
13.3     888
16.5     843
14.9     838
12.5     838
14.4     828
Name: Team TOV, dtype: int64

19.6    2882
17.2    1941
20.0    1844
19.2    1805
19.5    1791
21.7     984
19.7     941
20.5     940
20.2     931
22.0     926
18.6     920
19.3     919
21.2     91

Name: Opp PF, dtype: int64

103.4    2736
111.7    1817
106.6    1801
105.6    1767
112.4     955
98.8      946
102.7     933
104.1     932
103.9     932
102.3     931
108.2     926
104.0     922
107.9     918
108.1     916
104.5     914
113.5     910
109.0     908
110.0     903
109.8     900
102.9     899
109.5     894
103.8     882
110.9     880
106.5     877
99.3      870
Name: Opp Avg PTS, dtype: int64

23.0    2639
25.0    2573
24.0    2554
27.0    2220
26.0    2072
29.0    1923
22.0    1900
28.0    1858
31.0    1551
21.0    1504
30.0    1429
20.0    1366
32.0    1081
33.0     818
19.0     385
36.0     340
37.0     280
34.0     263
35.0     236
40.0     117
39.0      86
41.0      54
38.0      19
Name: Player Age, dtype: int64

3.1    958
3.0    957
2.0    871
2.8    844
2.6    839
      ... 
8.9     68
6.6     67
6.8     51
0.0     28
0.1     21
Name: Player FG, Length: 90, dtype: int64

5.0     566
3.5     521
4.3     489
5.5     475
10.7    452
       ... 
1.5      12
0.4       
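The per-column value counts above are long to read through. As a complement, here is a minimal sketch of a more compact overview with one row per column. The helper name `summarize_columns` is ours (hypothetical), and the toy frame only stands in for `dataset_raw`.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, distinct non-null values, missing values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': df.nunique(dropna=True),
        'n_missing': df.isna().sum(),
    })

# Toy frame standing in for dataset_raw (values are illustrative only)
toy = pd.DataFrame({'Home': [1, 0, 1], 'Position': ['PG', 'PG', None]})
print(summarize_columns(toy))
```

On the real data this would be called as `summarize_columns(dataset_raw)`. Columns with a single distinct value or many missing values are natural candidates for a closer look.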

We can see that, out of our 86 columns, some could be removed to make computation lighter. We see four reasons to remove a column:
- Other column(s) offer similar information.
- The information is useless (e.g. '(vs.)').
- The column offers no predictive value.
- We believe the column offers little value. This is very subjective, which leaves us room for experimentation.  
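The first two reasons can be partly checked automatically. The sketch below is one possible way to do it, not the method used in this notebook: it flags constant columns and pairs of numeric columns whose absolute correlation exceeds a threshold. The helper names and the 0.95 threshold are assumptions chosen for illustration.

```python
import pandas as pd

def constant_columns(df):
    """Columns holding a single distinct value (NaN counted as a value)."""
    return [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

def redundant_pairs(df, threshold=0.95):
    """Pairs of numeric columns whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes(include='number').corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs

# Toy data: 'b' is exactly 2 * 'a', and 'd' never changes
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8],
                    'c': [4, 1, 3, 2], 'd': ['x', 'x', 'x', 'x']})
print(constant_columns(toy))
print(redundant_pairs(toy))
```

A high correlation only suggests redundancy. The final decision still depends on the domain reasoning discussed in this section.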

Let's discuss which columns we'll keep and which we'll drop:
- 'Name': **DROP** - We think a player's stats should define him well enough. Also, one-hot encoding this column would add 500+ features.
- 'Position': **DROP** - This is a bit 'inside baseball', but the modern move towards positionless basketball makes us think this feature wouldn't be useful enough.
- '(NBAS Team)': **DROP** - We don't believe knowing the team adds something if we know the team stats.
- '(BBref Team)': **DROP** - This is also the team, see '(NBAS Team)'
- 'DK Team': **DROP** - This is also the team, see '(NBAS Team)'
- '(Team)': **DROP** - This is also the team, see '(NBAS Team)'
- '(vs.)': **DROP** - This is useless
- '(NBAS Opponent)': **DROP** - We don't believe knowing the opponent adds something if we know the opponent stats.
- '(BBref Opponent)': **DROP** - This is also the opponent, see '(NBAS Opponent)'
- 'DK Opponent': **WILL DROP** - This is also the opponent. We'll drop it later, but we're keeping it for further modifications to the dataset that we are planning.
- 'Home': **KEEP** - Knowing whether the game is at home seems very interesting, as home-court advantage is big in basketball.
- 'Match Up': **DROP** - This is just a combination of team and opponent, so the same reasoning applies.
- 'Date': **WILL DROP** - We don't see much value in the date. We'll drop it later, but we're keeping it for further modifications to the dataset that we are planning.
- 'Min': **DROP** - This is the number of minutes played in a game. We don't know this beforehand, so it has no predictive value.
- '(Double digit stats)': **DROP** - We actually don't know what this column is, so we'll drop it. It seems to indicate how many of the player's stats were in double digits, which once again is not known beforehand, so it has no predictive value.
- '(Unadj Pts)': **DROP** - We made the design decision that our label column would be 'DK Points Scored'. This column offers roughly the same information.
- 'DK Points Scored': **KEEP** - This is our label column. Essential.
- 'Team Points ': **DROP** - This is the number of points the team scored. We don't know this beforehand, no predictive value.
- 'Player Avg Pts/min': **KEEP** - This seems interesting since points are a big part of fantasy production.
- 'Team Avg Pace': **KEEP** - Interesting since the higher the pace, the more chances there are to score for the player.
- 'Opponent Avg Pace': **KEEP** - Same reasoning as 'Team Avg Pace'
- 'Team FG': **KEEP** - Team average field goals made per game. We find it interesting, though it's similar to the pace.
- 'Team FGA': **KEEP** - Team average field goal attempts per game. We find it interesting, though it's similar to the pace.
- 'Team FG%': **DROP** - While useful, it's fairly similar to having both 'Team FG' and 'Team FGA'.
- 'Team 3P': **KEEP** - Same reasoning as 'Team FG'
- 'Team 3PA': **KEEP** - Same reasoning as 'Team FGA' 
- 'Team 3P%': **DROP** - Same reasoning as 'Team FG%' 
- 'Team 2P': **KEEP** - Same reasoning as 'Team FG' 
- 'Team 2PA': **KEEP** - Same reasoning as 'Team FGA'
- 'Team 2P%': **DROP** - Same reasoning as 'Team FG%' 
- 'Team FT': **KEEP** - Same reasoning as 'Team FG' 
- 'Team FTA': **KEEP** - Same reasoning as 'Team FGA' 
- 'Team FT%': **DROP** - Same reasoning as 'Team FG%' 
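- 'Team ORB': **KEEP**
- 'Team DRB': **KEEP**
- 'Team TRB': **DROP**
- 'Team AST': **KEEP**
- 'Team STL': **KEEP**
- 'Team BLK': **KEEP**
- 'Team TOV': **KEEP**
- 'Team PF': **DROP**
- 'Team Avg Points': **KEEP**
- 'Opp FG': **KEEP**
- 'Opp FGA': **KEEP**
- 'Opp FG%': **DROP**
- 'Opp 3P': **DROP**
- 'Opp 3PA': **DROP**
- 'Opp 3P%': **DROP**
- 'Opp 2P': **DROP**
- 'Opp 2PA': **DROP**
- 'Opp 2P%': **DROP**
- 'Opp FT': **DROP**
- 'Opp FTA': **DROP**
- 'Opp FT%': **DROP**
- 'Opp ORB': **KEEP**
- 'Opp DRB': **KEEP**
- 'Opp TRB': **DROP**
- 'Opp AST': **DROP**
- 'Opp STL': **KEEP**
- 'Opp BLK': **KEEP**
- 'Opp TOV': **KEEP**
- 'Opp PF': **DROP**
- 'Opp Avg PTS': **KEEP**
- 'Player Age': **KEEP**
- 'Player FG': **KEEP**
- 'Player FGA': **KEEP**
- 'Player FG%': **DROP**
- 'Player 3P': **KEEP**
- 'Player 3PA': **KEEP**
- 'Player 3P%': **DROP**
- 'Player 2P': **KEEP**
- 'Player 2PA': **KEEP**
- 'Player 2P%': **DROP**
- 'Player eFG%': **DROP**
- 'Player FT': **KEEP**
- 'Player FTA': **KEEP**
- 'Player FT%': **DROP**
- 'Player ORB': **KEEP**
- 'Player DRB': **KEEP**
- 'Player TRB': **DROP**
- 'Player AST': **KEEP**
- 'Player STL': **KEEP**
- 'Player BLK': **KEEP**
- 'Player TOV': **KEEP**
- 'Player PF': **KEEP**
- 'Player PS/G': **KEEP**

The decisions from 'Team ORB' onwards match the feature set selected in the next section.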
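To illustrate why a percentage column adds little once the makes and attempts are both kept, here is a small worked example. It uses the 'Team FG' and 'Team FGA' values shown for the first rows of the dataset (41.0 and 86.1). The helper name is hypothetical.

```python
def pct_from_averages(made_per_game, attempts_per_game):
    """Shooting percentage implied by per-game makes and attempts."""
    if attempts_per_game == 0:
        return float('nan')  # no attempts: the percentage is undefined
    return made_per_game / attempts_per_game

team_fg, team_fga = 41.0, 86.1  # values from the first rows of the dataset
print(round(pct_from_averages(team_fg, team_fga), 3))  # prints 0.476
```

Because the per-game averages are rounded to one decimal, the derived value may differ slightly from the published percentage, which is why we describe the columns as similar rather than identical.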

At this point, we need to discuss what we believe is a major weak point of our dataset. Most stats that show "averages" seem to be the final averages for the season. This is not ideal: some examples are from the start of the season, so using the final season average, rather than the season average at the time of the game, doesn't perfectly reflect reality.  

We will use the NBA stats API to correct some of them, but we'll leave most of them as they are until we see how our models perform.
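To make the idea of a "season average at the time of the game" concrete, here is a minimal sketch of how it could be computed if per-game values were available. Our dataset only contains season-long averages, so the column names below (`player`, `date`, `pts`) and the helper name are hypothetical, and this is not something the notebook does.

```python
import pandas as pd

def average_before_game(df, player_col, date_col, value_col):
    """For each row, the player's mean of value_col over strictly earlier games."""
    ordered = df.sort_values([player_col, date_col])
    prior = ordered.groupby(player_col)[value_col].transform(
        # shift(1) excludes the current game; expanding().mean() averages the rest
        lambda s: s.shift(1).expanding().mean()
    )
    # Return the values aligned with the original row order
    return prior.reindex(df.index)

games = pd.DataFrame({
    'player': ['A', 'B', 'A', 'A', 'B'],
    'date': pd.to_datetime(['2017-10-20', '2017-10-21', '2017-10-22',
                            '2017-10-25', '2017-10-24']),
    'pts': [10, 5, 20, 30, 15],
})
games['pts_avg_before'] = average_before_game(games, 'player', 'date', 'pts')
print(games)
```

A player's first game has no earlier games, so its value is NaN. A real pipeline would have to decide how to fill it, for example with the previous season's average.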

### Remodeling the dataset

Let's now select the columns we care about.

In [38]:
# Isolated in its own cell: pop() removes the column from dataset_raw, so this can only be run once
y = dataset_raw.pop('DK Points Scored').values
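Because `pop` mutates `dataset_raw`, re-running the cell above raises a `KeyError`. One possible alternative, sketched here with a hypothetical helper name, leaves the original frame untouched so the cell can be re-run safely.

```python
import pandas as pd

def split_features_and_label(df, label_col):
    """Return (features, label array) without modifying df."""
    y = df[label_col].to_numpy()
    features = df.drop(columns=[label_col])
    return features, y

# Toy frame standing in for dataset_raw (values are illustrative only)
toy = pd.DataFrame({'Home': [1, 0], 'DK Points Scored': [12.5, 30.0]})
features, y = split_features_and_label(toy, 'DK Points Scored')
features, y = split_features_and_label(toy, 'DK Points Scored')  # safe to repeat
print(list(features), y)
```

The trade-off is a second copy of the feature columns in memory, which is negligible for a dataset of this size.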

In [39]:
feature_set = ['DK Opponent', 'Home', 'Date', 'Player Avg Pts/min', 'Team Avg Pace', 'Opponent Avg Pace', 
               'Team FG', 'Team FGA', 'Team 3P', 'Team 3PA', 'Team 2P', 'Team 2PA', 'Team FT', 'Team FTA', 
               'Team ORB', 'Team DRB', 'Team AST', 'Team STL', 'Team BLK', 'Team TOV', 'Team Avg Points', 
               'Opp FG', 'Opp FGA', 'Opp ORB', 'Opp DRB', 'Opp STL', 'Opp BLK', 'Opp TOV', 'Opp Avg PTS', 
               'Player Age', 'Player FG', 'Player FGA', 'Player 3P', 'Player 3PA', 'Player 2P', 'Player 2PA', 
               'Player FT', 'Player FTA', 'Player ORB', 'Player DRB', 'Player AST', 'Player STL', 'Player BLK', 
               'Player TOV', 'Player PF', 'Player PS/G']

X = dataset_raw[feature_set].copy()
X.head(5)

Unnamed: 0,DK Opponent,Home,Date,Player Avg Pts/min,Team Avg Pace,Opponent Avg Pace,Team FG,Team FGA,Team 3P,Team 3PA,...,Player FT,Player FTA,Player ORB,Player DRB,Player AST,Player STL,Player BLK,Player TOV,Player PF,Player PS/G
0,Ind,1,10/24/2017,0.716,96.84,96.9,41.0,86.1,8.0,22.5,...,0.3,0.3,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3
1,Det,0,10/25/2017,0.716,96.84,96.81,41.0,86.1,8.0,22.5,...,0.3,0.3,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3
2,GS,0,11/8/2017,0.716,96.84,100.43,41.0,86.1,8.0,22.5,...,0.3,0.3,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3
3,Cha,1,11/5/2017,0.716,96.84,98.74,41.0,86.1,8.0,22.5,...,0.3,0.3,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3
4,Dal,0,11/17/2017,0.716,96.84,96.61,41.0,86.1,8.0,22.5,...,0.3,0.3,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3


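Now we'll add a column for the month. We only need the month for the changes we'll make later on, so the date column can be removed afterwards (the `pop` is left commented out in the cell below, which is why 'Date' still appears in the printed column list).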

In [41]:
X['Date'] = pd.to_datetime(X['Date'], errors='coerce')
X['Month'] = X['Date'].dt.month
#X.pop('Date')
print(list(X))
X['Month'].value_counts()

['DK Opponent', 'Home', 'Date', 'Player Avg Pts/min', 'Team Avg Pace', 'Opponent Avg Pace', 'Team FG', 'Team FGA', 'Team 3P', 'Team 3PA', 'Team 2P', 'Team 2PA', 'Team FT', 'Team FTA', 'Team ORB', 'Team DRB', 'Team AST', 'Team STL', 'Team BLK', 'Team TOV', 'Team Avg Points', 'Opp FG', 'Opp FGA', 'Opp ORB', 'Opp DRB', 'Opp STL', 'Opp BLK', 'Opp TOV', 'Opp Avg PTS', 'Player Age', 'Player FG', 'Player FGA', 'Player 3P', 'Player 3PA', 'Player 2P', 'Player 2PA', 'Player FT', 'Player FTA', 'Player ORB', 'Player DRB', 'Player AST', 'Player STL', 'Player BLK', 'Player TOV', 'Player PF', 'Player PS/G', 'Month']


Hou    955
Sac    946
Atl    935
SA     933
Pho    932
Uta    932
Dal    931
Cha    926
Bos    922
OKC    918
LAL    916
Orl    915
NY     914
GS     910
Tor    910
Was    909
LAC    908
NO     907
Den    903
Phi    900
Chi    899
Min    894
Bkn    892
Ind    886
Mia    886
Det    882
Por    881
Cle    880
Mil    877
Mem    870
Name: DK Opponent, dtype: int64
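A note on the month extraction above: `errors='coerce'` silently turns unparseable dates into `NaT`, which then yields a missing month. The sketch below shows one way to make that visible by counting the failures. The helper name and the explicit `%m/%d/%Y` format are assumptions based on the dates displayed earlier (e.g. `10/24/2017`).

```python
import pandas as pd

def month_from_dates(date_strings, fmt='%m/%d/%Y'):
    """Parse date strings and return (month series, number of unparseable dates)."""
    parsed = pd.to_datetime(date_strings, format=fmt, errors='coerce')
    n_bad = int(parsed.isna().sum())
    return parsed.dt.month, n_bad

dates = pd.Series(['10/24/2017', '11/8/2017', 'not a date'])
months, n_bad = month_from_dates(dates)
print(months.tolist(), n_bad)
```

If the count is not zero on the real data, the affected rows would need to be fixed or dropped before the month is used as a lookup key.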

Now we want to replace 'DK Opponent' with the defensive rating of that opponent during that month. We believe it's more informative and will make the model more applicable to different seasons, even if our dataset only covers 2017-18.  

The first step is collecting the data from NBA stats. The following code is a bit dirty because it was written quickly to emulate the request made by the NBA.com stats page. This is a good area for improvement.

In [27]:
import requests

In [43]:
URL = "https://stats.nba.com/stats/leaguedashteamstats/"

PARAMS = {
    'Conference':'',
    'DateFrom':'',
    'DateTo':'',
    'Division':'',
    'GameScope':'',
    'GameSegment':'',
    'LastNGames':0,
    'LeagueID':'00',
    'Location':'',
    'MeasureType':'Defense',
    'Month':1,
    'OpponentTeamID':0,
    'Outcome':'',
    'PORound':0,
    'PaceAdjust':'N',
    'PerMode':'PerGame',
    'Period':0,
    'PlayerExperience':'',
    'PlayerPosition':'',
    'PlusMinus':'N',
    'Rank':'N',
    'Season':'2017-18',
    'SeasonSegment':'',
    'SeasonType':'Regular Season',
    'ShotClockRange':'',
    'StarterBench':'',
    'TeamID':0,
    'TwoWay':0,
    'VsConference':'',
    'VsDivision':''
}

#need to trick API into thinking we're a browser
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}

ratings = {
    '10': {},
    '11': {},
    '12': {},
    '1': {},
    '2': {},
    '3': {},
    '4': {}
}

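# The loop relies on the dict's insertion order: the n-th key of `ratings`
# (starting with calendar month '10') is requested with Month=n.
for api_month, calendar_month in enumerate(ratings, start=1):
    PARAMS['Month'] = api_month
    r = requests.get(url=URL, params=PARAMS, headers=HEADERS)
    data = r.json()
    for team_info in data['resultSets'][0]['rowSet']:
        # team_info[1] is used as the team name and team_info[7] as its rating
        ratings[calendar_month][team_info[1]] = team_info[7]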

['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets', 'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers', 'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons', 'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers', 'LA Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies', 'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves', 'New Orleans Pelicans', 'New York Knicks', 'Oklahoma City Thunder', 'Orlando Magic', 'Philadelphia 76ers', 'Phoenix Suns', 'Portland Trail Blazers', 'Sacramento Kings', 'San Antonio Spurs', 'Toronto Raptors', 'Utah Jazz', 'Washington Wizards']
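The request loop above depends on two implicit conventions: the position of each key in `ratings` determines the `Month` value sent to the API, and the team name and rating are read by position (`team_info[1]`, `team_info[7]`). The sketch below shows one possible way to make both explicit. It assumes the result set exposes a `headers` list next to `rowSet` and that the relevant headers are named `TEAM_NAME` and `DEF_RATING`. Both header names are assumptions to verify against a real response, and the toy payload values are made up.

```python
def api_month_index(calendar_month, season_start_month=10):
    """Map a calendar month to the Month value used by the loop above.

    Mirrors the original loop, which requests Month=1 for the '10' key,
    Month=2 for '11', and so on up to Month=7 for '4'.
    """
    return (calendar_month - season_start_month) % 12 + 1

def rating_map_from_result_set(result_set, name_header='TEAM_NAME',
                               rating_header='DEF_RATING'):
    """Build {team name: rating} by looking columns up by header name."""
    headers = result_set['headers']
    name_idx = headers.index(name_header)      # raises ValueError if absent
    rating_idx = headers.index(rating_header)
    return {row[name_idx]: row[rating_idx] for row in result_set['rowSet']}

# Toy payload shaped like one entry of data['resultSets'] (made-up values)
toy_result_set = {
    'headers': ['TEAM_ID', 'TEAM_NAME', 'DEF_RATING'],
    'rowSet': [[1, 'Team A', 101.5], [2, 'Team B', 108.0]],
}
print(api_month_index(10), api_month_index(4))
print(rating_map_from_result_set(toy_result_set))
```

Looking columns up by name fails loudly if the response layout changes, whereas a positional index would silently read the wrong statistic.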


In [66]:
# now adding feature for defensive rating
#first need to define a match between team names from NBA stats API and our dataset
team_name_acronyme_match = {
    'Hou':'Houston Rockets',
    'Sac':'Sacramento Kings',
    'Atl':'Atlanta Hawks',
    'SA':'San Antonio Spurs',
    'Pho':'Phoenix Suns',
    'Uta':'Utah Jazz',
    'Dal':'Dallas Mavericks',
    'Cha':'Charlotte Hornets',
    'Bos':'Boston Celtics',
    'OKC':'Oklahoma City Thunder',
    'LAL':'Los Angeles Lakers',
    'Orl':'Orlando Magic',
    'NY':'New York Knicks',
    'GS':'Golden State Warriors',
    'Tor':'Toronto Raptors',
    'Was':'Washington Wizards',
    'LAC':'LA Clippers',
    'NO':'New Orleans Pelicans',
    'Den':'Denver Nuggets',
    'Phi':'Philadelphia 76ers',
    'Chi':'Chicago Bulls',
    'Min':'Minnesota Timberwolves',
    'Bkn':'Brooklyn Nets',
    'Ind':'Indiana Pacers',
    'Mia':'Miami Heat',
    'Det':'Detroit Pistons',
    'Por':'Portland Trail Blazers',
    'Cle':'Cleveland Cavaliers',
    'Mil':'Milwaukee Bucks',
    'Mem':'Memphis Grizzlies'
}

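# Return the defensive rating of the row's opponent for the month of the game.
# The DK abbreviation is first translated to the full team name used by the NBA
# stats API, and the integer month is converted to the string key of `ratings`.
def get_defensive_rating(row, acronyme_dict, ratings_dict):
    team = acronyme_dict[row['DK Opponent']]
    month = row['Month']
    return ratings_dict[str(month)][team]

# Apply the lookup row by row to build the new feature.
# TODO: pop the features that are no longer needed
X['Opp def rating'] = X.apply(lambda row: get_defensive_rating(row, team_name_acronyme_match, ratings), axis=1)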

Unnamed: 0,DK Opponent,Home,Date,Player Avg Pts/min,Team Avg Pace,Opponent Avg Pace,Team FG,Team FGA,Team 3P,Team 3PA,...,Player ORB,Player DRB,Player AST,Player STL,Player BLK,Player TOV,Player PF,Player PS/G,Month,Opp def rating
0,Ind,1,2017-10-24,0.716,96.84,96.9,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,10,107.4
1,Det,0,2017-10-25,0.716,96.84,96.81,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,10,104.3
2,GS,0,2017-11-08,0.716,96.84,100.43,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,99.6
3,Cha,1,2017-11-05,0.716,96.84,98.74,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,111.9
4,Dal,0,2017-11-17,0.716,96.84,96.61,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,104.3
5,Dal,1,2017-11-04,0.716,96.84,96.61,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,104.3
6,GS,0,2017-11-08,0.716,96.84,100.43,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,99.6
7,Cha,1,2017-11-05,0.716,96.84,98.74,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,111.9
8,Dal,0,2017-11-17,0.716,96.84,96.61,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,104.3
9,Dal,1,2017-11-04,0.716,96.84,96.61,41.0,86.1,8.0,22.5,...,0.2,0.3,0.6,0.2,0.0,0.3,0.9,2.3,11,104.3
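The row-by-row `apply` above works but is slow on large frames. As an alternative, here is a minimal sketch that flattens the nested `ratings` dictionary into a lookup table and joins it on month and team name. The helper name and the temporary `_team_name` column are hypothetical, and the toy ratings reuse two values from the output above.

```python
import pandas as pd

def add_opponent_rating(X, acronym_to_name, ratings):
    """Return a copy of X with an 'Opp def rating' column added via a left merge."""
    lookup = pd.DataFrame(
        [(int(month), team, rating)
         for month, teams in ratings.items()
         for team, rating in teams.items()],
        columns=['Month', '_team_name', 'Opp def rating'],
    )
    out = X.assign(_team_name=X['DK Opponent'].map(acronym_to_name))
    # how='left' keeps every row of X, in its original order
    out = out.merge(lookup, on=['Month', '_team_name'], how='left')
    return out.drop(columns='_team_name')

toy_X = pd.DataFrame({'DK Opponent': ['Ind', 'GS', 'Ind'], 'Month': [10, 11, 11]})
toy_names = {'Ind': 'Indiana Pacers', 'GS': 'Golden State Warriors'}
toy_ratings = {'10': {'Indiana Pacers': 107.4},
               '11': {'Golden State Warriors': 99.6}}
print(add_opponent_rating(toy_X, toy_names, toy_ratings))
```

Two differences from the `apply` version are worth noting: a missing month and team combination produces NaN instead of raising a `KeyError`, and the merge resets the index to a default range.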
