# Prediction of NBA fantasy scores

This notebook is an experiment to see whether we can use machine learning models to predict the fantasy production of NBA players. We plan on comparing two techniques: an MLP regressor and a decision tree regressor.  
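
As a rough sketch of the comparison we have in mind, the snippet below fits both model families on synthetic data with scikit-learn. The data, the hyperparameters and the `compare_regressors` helper are assumptions made for illustration; none of them come from the actual experiments in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(X, y, seed=0):
    """Fit both regressors on the same split and return their test MAE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        # MLPs are sensitive to feature scale, so standardize first.
        "mlp": make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000, random_state=seed),
        ),
        "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = mean_absolute_error(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the real features and fantasy scores.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)

demo_scores = compare_regressors(X_demo, y_demo)
print(demo_scores)
```

Using the same train/test split for both models keeps the comparison fair; the scaling step is only applied to the MLP because tree splits do not depend on feature scale.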

Credit to https://www.kaggle.com/jwals96/nba-201718-stats-and-fantasy-scores for the dataset we plan on using.

## Preparing the data

### Ingesting and interpreting the dataset

We'll start by simply using pandas to read our `.csv` file containing the dataset.  
Our dataset is made up of many features, such as:
- Player season averages (points, rebounds, assists, shooting percentage, etc.)
- Information about the specific game (date, home team, etc.)
- Information about the opponent.
- The player's fantasy score.

Every example in our dataset corresponds to a player's fantasy production for a specific game.  

In [None]:
import pandas as pd

In [None]:
dataset_raw = pd.read_csv("NBA Stats Database - Player Game Stats 2017 (1).csv")
dataset_raw.head(5)

Let's explore our data just to have a better idea of what is available to us.

In [None]:
# Columns
print("All columns in the dataset:")
print(list(dataset_raw))
print()

# Shape of the dataset
print("Number of examples in the dataset: ", dataset_raw.shape[0])
print("Number of columns in the dataset: ", dataset_raw.shape[1])
print()

# Information about each column (verbose, so disabled by default)
# for column in list(dataset_raw):
#     print(dataset_raw[column].value_counts())
#     print()
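
A more compact alternative to printing every `value_counts()` is a summary with one row per column. This is a minimal sketch: `summarize_columns` and the toy frame are hypothetical and only stand in for `dataset_raw`.

```python
import pandas as pd


def summarize_columns(df):
    """Return dtype, missing-value count and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# Toy frame standing in for dataset_raw.
toy = pd.DataFrame({
    "Name": ["A", "B", "A"],
    "Home": [1, 0, 1],
    "Min": [30.5, None, 12.0],
})
summary = summarize_columns(toy)
print(summary)
```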

We plan to use different versions of the dataset to see which one performs best. To do so, we think we should classify each feature of the dataset and then base our versions of the dataset on those categories. We decided to define the following categories:
- **Main**: Features that we think belong in every dataset because we expect them to carry real value.
- **Derived**: Other feature(s) offer similar information, or the feature can literally be derived from them.
- **Useless**: Information is useless (e.g. '(vs.)').
- **Not predictive**: Column offers no predictive value, i.e. it is not known before the game, so it cannot be a feature in the test set.
- **Optional**: We believe the column offers little value. This is very subjective, which leaves us room for experimentation.  
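
To illustrate what we mean by **Derived**: a shooting percentage can be recomputed (up to rounding) from makes and attempts, so it adds little new information. The numbers below are made up, and the sketch assumes the columns hold per-game averages.

```python
import pandas as pd

# Made-up team averages, only to illustrate the relationship.
teams = pd.DataFrame({
    "Team FG": [40.0, 38.5],
    "Team FGA": [85.0, 88.0],
})

# 'Team FG%' is recoverable from the two columns we already keep.
teams["Team FG% (recomputed)"] = teams["Team FG"] / teams["Team FGA"]
print(teams.round(3))
```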

Let's classify our features and briefly justify each choice. The table is quite long, so feel free to skip it:


| Feature | Category | Justification |
|:--------|:---------|:--------------|
| 'Name' | **Optional** | We think a player's stats should define him well enough. Also, one-hot encoding this column would add 500+ features. |
| 'Position' | **Optional** | This is a bit 'inside baseball', but the modern move towards positionless basketball makes us think adding this feature wouldn't be useful enough. |
| '(NBAS Team)' | **Optional** | We don't believe knowing the team adds anything if we know the team stats. Also, playing against Miami in 2012 isn't the same as playing against Miami in 2019, so this feature is time-dependent. |
| '(BBref Team)' | **Useless** | This is also the team, see '(NBAS Team)' |
| 'DK Team' | **Useless** | This is also the team, see '(NBAS Team)' |
| '(Team)' | **Useless** | This is also the team, see '(NBAS Team)' |
| '(vs.)' | **Useless** | This is useless. |
| 'DK Opponent' | **Optional** | We believe knowing the opponent has limited value if we know the opponent stats. |
| '(NBAS Opponent)' | **Useless** |  This is also the opponent, see 'DK Opponent' |
| '(BBref Opponent)' | **Useless** | This is also the opponent, see 'DK Opponent' |
| 'Home' | **Main** | Knowing if the game is at home seems very interesting as home court advantage is big in basketball. |
| 'Match Up' | **Useless** | This is just a combination of team and opponent, so the same reasoning applies. |
| 'Date' | **Optional** | We think the date should not have too much impact, especially the exact date. Maybe the month can catch some trends. However, we will use this feature to make some changes later on. |
| 'Min' | **Not predictive** | This is the number of minutes played in a game. We don't know this beforehand, so it has no predictive value. |
| '(Double digit stats)' | **Not predictive** | Unfortunately, we're not sure what this is, so we'll ignore it for now. It seems to count how many of the player's stats were in double digits, which once again is not known beforehand, so it has no predictive value. |
| '(Unadj Pts)' | **Useless** | We made the design decision that our label column would be 'DK Points Scored'. This column offers roughly the same information. |
| 'DK Points Scored' | **Main** | This is our label column. Essential. |
| 'Team Points' | **Not predictive** | This is the points the team scored. We don't know this beforehand, no predictive value. |
| 'Player Avg Pts/min' | **Main** | This seems interesting since points are a big part of fantasy production. |
| 'Team Avg Pace' | **Main** | Interesting since the higher the pace, the more chances there are to score for the player. |
| 'Opponent Avg Pace' | **Main** | Same reasoning as 'Team Avg Pace'. |
| 'Team FG' | **Main** | Team average field goals made per game. We find it interesting, though it's similar to the pace. |
| 'Team FGA' | **Main** | Team average field goal attempts per game. We find it interesting, though it's similar to the pace. |
| 'Team FG%' | **Derived** | While useful, it's fairly similar to having both 'Team FG' and 'Team FGA'. |
| 'Team 3P' | **Main** | Same reasoning as 'Team FG' |
| 'Team 3PA' | **Main** | Same reasoning as 'Team FGA' | 
| 'Team 3P%' | **Derived** | Same reasoning as 'Team FG%' | 
| 'Team 2P' | **Main** | Same reasoning as 'Team FG' | 
| 'Team 2PA' | **Main** | Same reasoning as 'Team FGA' |
| 'Team 2P%' | **Derived** | Same reasoning as 'Team FG%' | 
| 'Team FT' | **Main** | Same reasoning as 'Team FG' | 
| 'Team FTA' | **Main** | Same reasoning as 'Team FGA' | 
| 'Team FT%' | **Derived** | Same reasoning as 'Team FG%' | 
| 'Team ORB' | **Main** |  | 
| 'Team DRB' | **Main** |  |
| 'Team TRB' | **Derived** |  |
| 'Team AST' | **Main** |  |
| 'Team STL' | **Main** |  |
| 'Team BLK' | **Main** |  |
| 'Team TOV' | **Main** |  |
| 'Team PF' | **Optional** |  |
| 'Team Avg Points' | **Main** |  |
| 'Opp FG' | **Main** |  |
| 'Opp FGA' | **Main** |  |
| 'Opp FG%' | **Derived** |  |
| 'Opp 3P' | **Main** |  |
| 'Opp 3PA' | **Main** |  |
| 'Opp 3P%' | **Derived** |  |
| 'Opp 2P' | **Main** |  |
| 'Opp 2PA' | **Main** |  |
| 'Opp 2P%' | **Derived** |  |
| 'Opp FT' | **Main** |  |
| 'Opp FTA' | **Main** |  |
| 'Opp FT%' | **Derived** |  |
| 'Opp ORB' | **Main** |  |
| 'Opp DRB' | **Main** |  |
| 'Opp TRB' | **Derived** |  |
| 'Opp AST' | **Main** |  |
| 'Opp STL' | **Main** |  |
| 'Opp BLK' | **Main** |  |
| 'Opp TOV' | **Main** |  |
| 'Opp PF' | **Main** |  |
| 'Opp Avg PTS' | **Main** |  |
| 'Player Age' | **Main** |  |
| 'Player FG' | **Main** |  |
| 'Player FGA' | **Main** |  |
| 'Player FG%' | **Derived** |  |
| 'Player 3P' | **Main** |  |
| 'Player 3PA' | **Main** |  |
| 'Player 3P%' | **Derived** |  |
| 'Player 2P' | **Main** |  |
| 'Player 2PA' | **Main** |  |
| 'Player 2P%' | **Derived** |  |
| 'Player eFG%' | **Derived** |  |
| 'Player FT' | **Main** |  |
| 'Player FTA' | **Main** |  |
| 'Player FT%' | **Derived** |  |
| 'Player ORB' | **Main** |  |
| 'Player DRB' | **Main** |  |
| 'Player TRB' | **Derived** |  |
| 'Player AST' | **Main** |  |
| 'Player STL' | **Main** |  |
| 'Player BLK' | **Main** |  |
| 'Player TOV' | **Main** |  |
| 'Player PF' | **Main** |  |
| 'Player PS/G' | **Main** |  |


### Remodeling the dataset

As we mentioned earlier, we want to use different versions of the dataset to see how different mixes of features might change performance. We decided to have three versions:

**Version 1: Full dataset**  
This dataset will pretty much be the original dataset, but we'll remove features that we classified as **Useless** or **Not predictive**. For the other features, we'll make few or no changes.  

**Version 2: Lightweight dataset**  
In this dataset, we'll only keep the features we classified as **Main**. The idea is that the features classified as **Derived** or **Optional** offer information that we can already get from the **Main** features. We might make small changes to the features in this dataset.  

**Version 3: Fully custom**  
This is pretty much the version 2 dataset, but we'll make quite a few changes. We'll detail them in the section reserved for the creation of the version 3 dataset.
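
One way to keep the versions in sync with the classification table above is to store the classification in a dictionary and derive each feature list from it. The sketch below uses only a handful of columns; `CATEGORIES` and `features_in` are hypothetical names, not something this notebook defines.

```python
# A few entries from the classification table, for illustration only.
CATEGORIES = {
    "Name": "Optional",
    "(vs.)": "Useless",
    "Home": "Main",
    "Min": "Not predictive",
    "Team FG": "Main",
    "Team FGA": "Main",
    "Team FG%": "Derived",
}


def features_in(categories, keep):
    """Return the columns whose category is in `keep`, in table order."""
    return [col for col, cat in categories.items() if cat in keep]


# Version 1 drops only the Useless / Not predictive columns.
full_features = features_in(CATEGORIES, {"Main", "Derived", "Optional"})
# Version 2 keeps the Main columns only.
light_features = features_in(CATEGORIES, {"Main"})

print(full_features)
print(light_features)
```

With this layout, reclassifying a feature means changing one dictionary entry rather than editing several hand-written lists.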

First, let's pop our target column as it's the same for all our datasets.

In [None]:
# Isolated in its own cell because pop() can only be run once
y = dataset_raw.pop('DK Points Scored').values
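
The `pop` above raises a `KeyError` if the cell is run a second time. A guarded version makes the cell safe to re-run. This is a sketch on a toy frame, and `extract_target` is a hypothetical helper.

```python
import pandas as pd


def extract_target(df, column, current=None):
    """Pop `column` from `df` if it is still there, otherwise keep `current`."""
    if column in df.columns:
        return df.pop(column).values
    if current is None:
        raise KeyError(f"{column!r} not found and no target extracted yet")
    return current


toy = pd.DataFrame({"DK Points Scored": [31.5, 12.0], "Home": [1, 0]})
y_toy = extract_target(toy, "DK Points Scored")
# Running the same line again is now harmless.
y_toy = extract_target(toy, "DK Points Scored", current=y_toy)
print(y_toy, list(toy.columns))
```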

Now we'll apply the changes that are common to all our datasets.

In [None]:
# replace the date by a 'month' feature
dataset_raw['Date'] = pd.to_datetime(dataset_raw['Date'], errors='coerce')
dataset_raw['Month'] = dataset_raw['Date'].dt.month
dataset_raw.pop('Date')
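
A quick illustration of what this cell does, on made-up dates: `errors='coerce'` turns unparseable values into `NaT`, and their month then comes out as `NaN`, so it's worth checking for missing months afterwards.

```python
import pandas as pd

# Made-up dates; the last one is deliberately invalid.
toy = pd.DataFrame({"Date": ["2017-10-17", "2018-01-05", "not a date"]})

toy["Date"] = pd.to_datetime(toy["Date"], errors="coerce")
toy["Month"] = toy["Date"].dt.month

print(toy)
print("Rows with an unparseable date:", toy["Month"].isna().sum())
```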

#### Creating the Version 1 dataset: Full dataset

As we mentioned earlier, the version 1 dataset is the original dataset without the features we classified as **Useless** or **Not predictive**. The only change we will make is the one-hot encoding of discrete features (and 'Month', which is technically numeric).

In [None]:
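# 'DK Points Scored' is left out: it was popped into `y` above, so selecting it here would raise a KeyError.
# '(Double digit stats)' is left out too: it is only known after the game, so it would leak the outcome.
full_feature_set = ['Name', 'Position', '(NBAS Team)', 'DK Opponent', 'Home', 
                    'Player Avg Pts/min', 'Team Avg Pace', 'Opponent Avg Pace', 'Team FG', 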
                    'Team FGA', 'Team FG%', 'Team 3P', 'Team 3PA', 'Team 3P%', 'Team 2P', 'Team 2PA', 'Team 2P%', 
                    'Team FT', 'Team FTA', 'Team FT%', 'Team ORB', 'Team DRB', 'Team TRB', 'Team AST', 'Team STL', 
                    'Team BLK', 'Team TOV', 'Team PF', 'Team Avg Points', 'Opp FG', 'Opp FGA', 'Opp FG%', 'Opp 3P', 
                    'Opp 3PA', 'Opp 3P%', 'Opp 2P', 'Opp 2PA', 'Opp 2P%', 'Opp FT', 'Opp FTA', 'Opp FT%', 'Opp ORB', 
                    'Opp DRB', 'Opp TRB', 'Opp AST', 'Opp STL', 'Opp BLK', 'Opp TOV', 'Opp PF', 'Opp Avg PTS', 
                    'Player Age', 'Player FG', 'Player FGA', 'Player FG%', 'Player 3P', 'Player 3PA', 'Player 3P%', 
                    'Player 2P', 'Player 2PA', 'Player 2P%', 'Player eFG%', 'Player FT', 'Player FTA', 'Player FT%', 
                    'Player ORB', 'Player DRB', 'Player TRB', 'Player AST', 'Player STL', 'Player BLK', 'Player TOV', 
                    'Player PF', 'Player PS/G', 'Month']

X_full_ft = dataset_raw[full_feature_set].copy()
X_full_ft.head(5)

# One-hot encode 'Name', 'Position', '(NBAS Team)', 'DK Opponent' and 'Month'.
# get_dummies drops the original columns and appends the indicator columns at the end.
X_full_ft = pd.get_dummies(
    X_full_ft,
    columns=['Name', 'Position', '(NBAS Team)', 'DK Opponent', 'Month'],
    prefix=['player', 'position', 'team', 'opponent', 'month'],
)
print(list(X_full_ft))
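
For readers unfamiliar with one-hot encoding, here is the same kind of operation on a three-row toy frame. The values are invented; only the column names mirror the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Position": ["PG", "C", "PG"],
    "Month": [10, 11, 10],
    "Home": [1, 0, 1],
})

# Each distinct value becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["Position", "Month"],
                         prefix=["position", "month"])
print(list(encoded.columns))
```

Note how the number of columns grows with the number of distinct values, which is why encoding 'Name' adds so many features.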

#### Creating the Version 2 dataset: Lightweight dataset

#### Creating the Version 3 dataset: Fully custom

## References

- One hot encoding using pandas: http://www.insightsbot.com/blog/zuyVu/python-one-hot-encoding-with-pandas-made-simple
- Change the date column: https://stackoverflow.com/questions/21954197/which-is-the-fastest-way-to-extract-day-month-and-year-from-a-given-date

# Below this is trash that I don't want to delete yet. Going to move it.

In [None]:
feature_set = ['DK Opponent', 'Home', 'Month', 'Player Avg Pts/min', 'Team Avg Pace', 'Opponent Avg Pace', 
               'Team FG', 'Team FGA', 'Team 3P', 'Team 3PA', 'Team 2P', 'Team 2PA', 'Team FT', 'Team FTA', 
               'Team ORB', 'Team DRB', 'Team AST', 'Team STL', 'Team BLK', 'Team TOV', 'Team Avg Points', 
               'Opp FG', 'Opp FGA', 'Opp ORB', 'Opp DRB', 'Opp STL', 'Opp BLK', 'Opp TOV', 'Opp Avg PTS', 
               'Player Age', 'Player FG', 'Player FGA', 'Player 3P', 'Player 3PA', 'Player 2P', 'Player 2PA', 
               'Player FT', 'Player FTA', 'Player ORB', 'Player DRB', 'Player AST', 'Player STL', 'Player BLK', 
               'Player TOV', 'Player PF', 'Player PS/G']

X = dataset_raw[feature_set].copy()
X.head(5)

The 'Date' column of `dataset_raw` was already replaced by 'Month' earlier on, so we select 'Month' directly. We only need the month for the changes we'll make later on; let's check how the games are spread across months.

In [None]:
print(list(X))
X['Month'].value_counts()

Now, we want to replace 'DK Opponent' with the defensive rating of that opponent during that month. We believe it's more informative, and will make the model more applicable to different seasons, even if our dataset only covers 2017-18.  

The first step is collecting the data from NBA stats. The following code is a bit dirty because it was written quickly to emulate the request made by the NBA.com stats page. This is a good area for improvement.

In [None]:
import requests

In [None]:
URL = "https://stats.nba.com/stats/leaguedashteamstats/"

PARAMS = {
    'Conference':'',
    'DateFrom':'',
    'DateTo':'',
    'Division':'',
    'GameScope':'',
    'GameSegment':'',
    'LastNGames':0,
    'LeagueID':'00',
    'Location':'',
    'MeasureType':'Defense',
    'Month':1,
    'OpponentTeamID':0,
    'Outcome':'',
    'PORound':0,
    'PaceAdjust':'N',
    'PerMode':'PerGame',
    'Period':0,
    'PlayerExperience':'',
    'PlayerPosition':'',
    'PlusMinus':'N',
    'Rank':'N',
    'Season':'2017-18',
    'SeasonSegment':'',
    'SeasonType':'Regular Season',
    'ShotClockRange':'',
    'StarterBench':'',
    'TeamID':0,
    'TwoWay':0,
    'VsConference':'',
    'VsDivision':''
}

#need to trick API into thinking we're a browser
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}

ratings = {
    '10': {},
    '11': {},
    '12': {},
    '1': {},
    '2': {},
    '3': {},
    '4': {}
}

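# The dict keys are calendar months, in season order (October to April).
# This loop assumes the API numbers months from the start of the season (1 = October),
# so enumerate() pairs each calendar month with its API month index.
for api_month, month_rating in enumerate(ratings, start=1):
    PARAMS['Month'] = api_month
    r = requests.get(url=URL, params=PARAMS, headers=HEADERS, timeout=30)
    r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    data = r.json()
    for team_info in data['resultSets'][0]['rowSet']:
        # team_info[1] is the team name, team_info[7] the defensive rating
        ratings[month_rating][team_info[1]] = team_info[7]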
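
The loop above relies on fixed positions (`team_info[1]` and `team_info[7]`). Assuming each result set carries a `headers` list alongside `rowSet`, the columns could be looked up by name instead. The sketch below runs on a mock payload; the header names `TEAM_NAME` and `DEF_RATING` are assumptions about what the endpoint returns and should be checked against a real response, and `rows_to_mapping` is a hypothetical helper.

```python
def rows_to_mapping(result_set, key_col, value_col):
    """Build {key: value} from a result set, looking columns up by header name."""
    headers = result_set["headers"]
    key_idx = headers.index(key_col)
    value_idx = headers.index(value_col)
    return {row[key_idx]: row[value_idx] for row in result_set["rowSet"]}


# Mock payload with invented numbers, shaped like the response used above.
mock_data = {
    "resultSets": [{
        "headers": ["TEAM_ID", "TEAM_NAME", "DEF_RATING"],
        "rowSet": [
            [1, "Boston Celtics", 101.2],
            [2, "Utah Jazz", 102.9],
        ],
    }]
}

mock_ratings = rows_to_mapping(mock_data["resultSets"][0], "TEAM_NAME", "DEF_RATING")
print(mock_ratings)
```

Looking columns up by name means the code fails loudly (`ValueError` from `index`) if the response layout changes, instead of silently reading the wrong column.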

In [None]:
# now adding feature for defensive rating
#first need to define a match between team names from NBA stats API and our dataset
team_name_acronyme_match = {
    'Hou':'Houston Rockets',
    'Sac':'Sacramento Kings',
    'Atl':'Atlanta Hawks',
    'SA':'San Antonio Spurs',
    'Pho':'Phoenix Suns',
    'Uta':'Utah Jazz',
    'Dal':'Dallas Mavericks',
    'Cha':'Charlotte Hornets',
    'Bos':'Boston Celtics',
    'OKC':'Oklahoma City Thunder',
    'LAL':'Los Angeles Lakers',
    'Orl':'Orlando Magic',
    'NY':'New York Knicks',
    'GS':'Golden State Warriors',
    'Tor':'Toronto Raptors',
    'Was':'Washington Wizards',
    'LAC':'LA Clippers',
    'NO':'New Orleans Pelicans',
    'Den':'Denver Nuggets',
    'Phi':'Philadelphia 76ers',
    'Chi':'Chicago Bulls',
    'Min':'Minnesota Timberwolves',
    'Bkn':'Brooklyn Nets',
    'Ind':'Indiana Pacers',
    'Mia':'Miami Heat',
    'Det':'Detroit Pistons',
    'Por':'Portland Trail Blazers',
    'Cle':'Cleveland Cavaliers',
    'Mil':'Milwaukee Bucks',
    'Mem':'Memphis Grizzlies'
}

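def get_defensive_rating(row, acronyme_dict, ratings_dict):
    """Return the opponent's defensive rating for the month the game was played in."""
    # Translate the dataset's abbreviation (e.g. 'Hou') into the full name used by the NBA stats API
    team = acronyme_dict[row['DK Opponent']]
    # ratings_dict is keyed by the month as a string; int() guards against a float month such as 10.0
    month = str(int(row['Month']))
    return ratings_dict[month][team]

# Look up the rating row by row. TODO: pop the features that are now useless
X['Opp def rating'] = X.apply(lambda row: get_defensive_rating(row, team_name_acronyme_match, ratings), axis=1)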
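
Row-wise `apply` can get slow on large frames. A possible alternative is to flatten the nested ratings into a small table and merge it in. This is a sketch with made-up ratings and a two-team mapping; `ratings_to_frame` and the `'Opp full name'` column are hypothetical.

```python
import pandas as pd


def ratings_to_frame(ratings_dict):
    """Flatten {month: {team: rating}} into one row per (month, team)."""
    rows = [
        {"Month": int(month), "Opp full name": team, "Opp def rating": rating}
        for month, by_team in ratings_dict.items()
        for team, rating in by_team.items()
    ]
    return pd.DataFrame(rows)


# Made-up values, shaped like the structures built above.
toy_ratings = {"10": {"Utah Jazz": 99.0, "Miami Heat": 104.0},
               "11": {"Utah Jazz": 101.5, "Miami Heat": 103.0}}
toy_names = {"Uta": "Utah Jazz", "Mia": "Miami Heat"}
toy_X = pd.DataFrame({"DK Opponent": ["Uta", "Mia", "Uta"],
                      "Month": [10, 10, 11]})

toy_X["Opp full name"] = toy_X["DK Opponent"].map(toy_names)
# A left merge keeps every game row, even if a rating is missing.
toy_X = toy_X.merge(ratings_to_frame(toy_ratings),
                    on=["Month", "Opp full name"], how="left")
print(toy_X)
```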