# 1. <a id='toc1_'></a>[NBA Season 2022-2023 Analysis](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- 1. [NBA Season 2022-2023 Analysis](#toc1_)    
- 2. [Importings](#toc2_)    
  - 2.1. [Libraries](#toc2_1_)    
  - 2.2. [Helper Function](#toc2_2_)    
  - 2.3. [Data loading](#toc2_3_)    
- 3. [Data exploration and problem comprehension](#toc3_)    
  - 3.1. [Examining the **Advanced Dataset**](#toc3_1_)    
    - 3.1.1. [Features from Advanced Dataset](#toc3_1_1_)    
    - 3.1.2. [What are we dealing with?](#toc3_1_2_)    
    - 3.1.3. [Renaming and dropping empty columns](#toc3_1_3_)    
    - 3.1.4. [Checking for NAs](#toc3_1_4_)    
    - 3.1.5. [Do these players have multiple lines due to team exchanges?](#toc3_1_5_)    
    - 3.1.6. [Let's combine the rows with same players](#toc3_1_6_)    
      - 3.1.6.1. [Checking if the concatenation went right](#toc3_1_6_1_)    
    - 3.1.7. [First glance at the Advanced Dataset](#toc3_1_7_)    
    - 3.1.8. [Imputing values to the missing data](#toc3_1_8_)    
  - 3.2. [Examining **Per Game Dataset**](#toc3_2_)    
    - 3.2.1. [Features from Per Game Dataset](#toc3_2_1_)    
    - 3.2.2. [What are we dealing with?](#toc3_2_2_)    
    - 3.2.3. [Renaming the columns](#toc3_2_3_)    
    - 3.2.4. [Checking for NAs](#toc3_2_4_)    
    - 3.2.5. [Let's combine multiple player rows in one](#toc3_2_5_)    
      - 3.2.5.1. [Checking if the concatenation went as expected](#toc3_2_5_1_)    
      - 3.2.5.2. [Checking again for NAs](#toc3_2_5_2_)    
    - 3.2.6. [Filling out NAs](#toc3_2_6_)    
    - 3.2.7. [First glance at the Per Game Dataset](#toc3_2_7_)    
  - 3.3. [Merging the two datasets](#toc3_3_)    
    - 3.3.1. [Creating some new features](#toc3_3_1_)    
      - 3.3.1.1. [GM = Games Missed](#toc3_3_1_1_)    
    - 3.3.2. [Reordering the columns](#toc3_3_2_)    
  - 3.4. [Exporting the merged dataset as a pickle file](#toc3_4_)    
- 4. [Feature Engineering and Hypothesis Creation](#toc4_)    
- 5. [Data selection and filtering](#toc5_)    
- 6. [Exploratory Data Analysis](#toc6_)    
- 7. [Data Preparation](#toc7_)    
- 8. [Feature Selection through Boruta algorithm](#toc8_)    
- 9. [Model implementation](#toc9_)    
- 10. [Hyperparameter Fine-Tuning](#toc10_)    
- 11. [Model Error Estimation and Interpretation](#toc11_)    
- 12. [Model Deployment](#toc12_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# 2. <a id='toc2_'></a>[Importings](#toc0_)

## 2.1. <a id='toc2_1_'></a>[Libraries](#toc0_)

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import pickle

from ydata_profiling        import ProfileReport
from sklearn.impute         import SimpleImputer
from IPython.display        import Image, HTML, display

## 2.2. <a id='toc2_2_'></a>[Helper Function](#toc0_)

In [3]:
def jupyter_configs():
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [15, 8]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()
    
    warnings.filterwarnings( 'ignore' )
    
jupyter_configs()

## 2.3. <a id='toc2_3_'></a>[Data loading](#toc0_)

In [None]:
advanced_df_raw = pd.read_csv('./data/data_advanced.csv')
pergame_df_raw = pd.read_csv('./data/data_pergame.csv')

# 3. <a id='toc3_'></a>[Data exploration and problem comprehension](#toc0_)
- Main goal/problem
- Sub-goals
- What will the finished product be?

## 3.1. <a id='toc3_1_'></a>[Examining the **Advanced Dataset**](#toc0_)

### 3.1.1. <a id='toc3_1_1_'></a>[Features from Advanced Dataset](#toc0_)


- Rk -- Rank

- Pos -- Position

- Age -- Player's age on February 1 of the season

- Tm -- Team

- G -- Games

- MP -- Minutes Played

- PER -- Player Efficiency Rating. A measure of per-minute production standardized such that the league average is 15.

- TS% -- True Shooting Percentage. A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.

- 3PAr -- 3-Point Attempt Rate. Percentage of FG Attempts from 3-Point Range

- FTr -- Free Throw Attempt Rate. Number of FT Attempts Per FG Attempt

- ORB% -- Offensive Rebound Percentage. An estimate of the percentage of available offensive rebounds a player grabbed while they were on the floor.

- DRB% -- Defensive Rebound Percentage. An estimate of the percentage of available defensive rebounds a player grabbed while they were on the floor.

- TRB% -- Total Rebound Percentage. An estimate of the percentage of available rebounds a player grabbed while they were on the floor.

- AST% -- Assist Percentage. An estimate of the percentage of teammate field goals a player assisted while they were on the floor.

- STL% -- Steal Percentage. An estimate of the percentage of opponent possessions that end with a steal by the player while they were on the floor.

- BLK% -- Block Percentage. An estimate of the percentage of opponent two-point field goal attempts blocked by the player while they were on the floor.

- TOV% -- Turnover Percentage. An estimate of turnovers committed per 100 plays.

- USG% -- Usage Percentage. An estimate of the percentage of team plays used by a player while they were on the floor.

- OWS -- Offensive Win Shares. An estimate of the number of wins contributed by a player due to offense.

- DWS -- Defensive Win Shares. An estimate of the number of wins contributed by a player due to defense.

- WS -- Win Shares. An estimate of the number of wins contributed by a player.

- WS/48 -- Win Shares Per 48 Minutes. An estimate of the number of wins contributed by a player per 48 minutes (league average is approximately .100)

- OBPM -- Offensive Box Plus/Minus. A box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team.

- DBPM -- Defensive Box Plus/Minus. A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team.

- BPM -- Box Plus/Minus. A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team.

- VORP -- Value over Replacement Player. A box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season. Multiply by 2.70 to convert to wins over replacement.

### 3.1.2. <a id='toc3_1_2_'></a>[What are we dealing with?](#toc0_)

In [None]:
advanced_df_raw.head()

In [None]:
advanced_df_raw.shape

### 3.1.3. <a id='toc3_1_3_'></a>[Renaming and dropping empty columns](#toc0_)

In [None]:
# These two columns are empty separators in the source CSV
dropped_columns = ['Unnamed: 19', 'Unnamed: 24']
advanced_df_raw = advanced_df_raw.drop(dropped_columns, axis = 1)

In [None]:
advanced_df_raw.columns

In [None]:
advanced_cols = ['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP_Total', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%',
       'OWS', 'DWS', 'WS', 'WS_48', 'OBPM', 'DBPM', 'BPM', 'VORP',
       'Player_additional']

advanced_df_raw.columns = advanced_cols

In [None]:
advanced_df_raw.shape

In [None]:
# There are 679 rows in the dataset, but only 539 unique players. Some players changed teams during the season and therefore appear in multiple rows.
# A reasonable solution is to combine these rows and keep only the latest team the player played for.

print( advanced_df_raw['Player_additional'].nunique(), 'out of', advanced_df_raw.shape[0])
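
Before combining anything, it can help to see which players account for the extra rows. The sketch below uses a small made-up frame (`toy`, with invented ids and teams) in place of `advanced_df_raw`; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for advanced_df_raw (ids and teams are made up for illustration)
toy = pd.DataFrame({
    'Player_additional': ['aaa01', 'aaa01', 'aaa01', 'bbb01'],
    'Tm': ['TOT', 'NYK', 'LAC', 'BOS'],
})

# Number of rows per player id, keeping only ids that appear more than once
rows_per_player = toy['Player_additional'].value_counts()
multi_row = rows_per_player[rows_per_player > 1]

# All rows that belong to those players, for manual inspection
dupes = toy[toy['Player_additional'].duplicated(keep=False)]

print(multi_row)
print(dupes)
```

Running the same two expressions on `advanced_df_raw` should list every player that contributes to the gap between the row count and the number of unique ids.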

### 3.1.4. <a id='toc3_1_4_'></a>[Checking for NAs](#toc0_)
- 'TS%', '3PAr' and 'FTr' have three NAs each, all in the same three rows. 'TOV%' has a single NA, in Michael Foster Jr.'s row.
- Let's inspect these rows to figure out why the values are missing and what to do about them.
- Columns 'Unnamed: 19' and 'Unnamed: 24' were completely empty, which is why they were dropped in the previous step.

In [None]:
advanced_df_raw.isna().sum()

In [None]:
advanced_df_raw[advanced_df_raw['TOV%'].isna()]

In [None]:
advanced_df_raw[advanced_df_raw['TS%'].isna()]

In [None]:
advanced_df_raw[advanced_df_raw['3PAr'].isna()]

In [None]:
advanced_df_raw[advanced_df_raw['FTr'].isna()]

### 3.1.5. <a id='toc3_1_5_'></a>[Do these players have multiple lines due to team exchanges?](#toc0_)
- Moses Brown appears in three different rows because he changed teams during the season, so combining his rows is a good alternative.
- Michael Foster Jr. and Alondes Williams each appear in a single row, so their missing values are probably statistics that cannot be computed. We could fill them with 0.0 or try to estimate them from the Per Game Dataset.
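
To see why a rate can be impossible to compute, recall that these three features are ratios: TS% = PTS / (2 * (FGA + 0.44 * FTA)), 3PAr = 3PA / FGA and FTr = FTA / FGA. A player with no shot attempts has a zero denominator in all of them. The sketch below uses hypothetical season totals (the frame `totals` and its values are invented for illustration, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical season totals; the second player never attempted a shot
totals = pd.DataFrame({
    'PTS': [500, 0],
    'FGA': [400, 0],
    '3PA': [150, 0],
    'FTA': [100, 0],
}, index=['shooter', 'no_attempts'])

# Turn zero denominators into NaN so the ratios are explicitly undefined
fga = totals['FGA'].replace(0, np.nan)
ts_denominator = (2 * (totals['FGA'] + 0.44 * totals['FTA'])).replace(0, np.nan)

rates = pd.DataFrame({
    'TS%': totals['PTS'] / ts_denominator,
    '3PAr': totals['3PA'] / fga,
    'FTr': totals['FTA'] / fga,
})

print(rates)
```

The `no_attempts` row comes out as NaN in all three columns, which matches the pattern of the missing values seen above.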

In [None]:
advanced_df_raw[advanced_df_raw['Player_additional'] == 'brownmo01']

In [None]:
advanced_df_raw[advanced_df_raw['Player_additional'] == 'fostemi02']

In [None]:
advanced_df_raw[advanced_df_raw['Player_additional'] == 'willial06']

### 3.1.6. <a id='toc3_1_6_'></a>[Let's combine the rows with same players](#toc0_)

In [None]:
advanced_df = advanced_df_raw.groupby("Player_additional", as_index=False).agg(
                      {
                          'Rk':'first', 'Player':'first', 
                          'Pos':'first', 'Age':'first', 
                          'Tm':'first', 'G':'first', 
                          'MP_Total':'mean', 'PER':'mean', 
                          'TS%':'mean', '3PAr':'mean',
                          'FTr':'mean', 'ORB%':'mean', 
                          'DRB%':'mean', 'TRB%':'mean', 
                          'AST%':'mean', 'STL%':'mean', 
                          'BLK%':'mean', 'TOV%':'mean', 
                          'USG%':'mean', 'OWS':'mean', 
                          'DWS':'mean', 'WS':'mean', 
                          'WS_48':'mean', 'OBPM':'mean', 
                          'DBPM':'mean', 'BPM':'mean', 
                          'VORP':'mean', 'Player_additional':'first'
                      }
                      )
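
One caveat about averaging: an unweighted mean gives every row of a traded player the same weight, regardless of how many games each row covers. As a possible alternative, the sketch below assumes that the source file follows the Basketball-Reference convention of a season-total row labelled `TOT` for traded players, and that the per-team rows are listed in chronological order. Both points are assumptions to check against the raw data, and the frame `toy` is made up.

```python
import pandas as pd

# Toy stand-in for advanced_df_raw; 'TOT' marks an assumed season-total row
toy = pd.DataFrame({
    'Player_additional': ['aaa01', 'aaa01', 'aaa01', 'bbb01'],
    'Tm': ['TOT', 'NYK', 'LAC', 'BOS'],
    'G': [36, 20, 16, 70],
    'PER': [14.0, 12.0, 16.5, 18.0],
})

# Ids of players that have a season-total row
traded_ids = set(toy.loc[toy['Tm'] == 'TOT', 'Player_additional'])
has_tot = toy['Player_additional'].isin(traded_ids)

# Keep the TOT row for traded players and the only row for everyone else
season = toy[(toy['Tm'] == 'TOT') | ~has_tot].reset_index(drop=True)

# Last listed team per player (assumes per-team rows are in chronological order)
last_team = toy[toy['Tm'] != 'TOT'].groupby('Player_additional')['Tm'].last()
season['Last_Tm'] = season['Player_additional'].map(last_team)

print(season)
```

With this approach the season totals come straight from the source instead of being averaged, and `Last_Tm` is a hypothetical extra column holding the latest team.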

#### 3.1.6.1. <a id='toc3_1_6_1_'></a>[Checking if the concatenation went right](#toc0_)

In [None]:
advanced_df.shape[0]

In [None]:
advanced_df['Player_additional'].nunique()

In [None]:
# Before:

advanced_df_raw[advanced_df_raw['Player_additional'] == 'brownmo01']

In [None]:
# After:

advanced_df[advanced_df['Player_additional'] == 'brownmo01']

### 3.1.7. <a id='toc3_1_7_'></a>[First glance at the Advanced Dataset](#toc0_)

In [None]:
# The data types are all set correctly

advanced_df.dtypes

In [None]:
advanced_df.describe().T

In [None]:
advanced_df.info()

In [None]:
# Generate a dataset profile report

# advanced_profile = ProfileReport(advanced_df, title = 'Advanced NBA Dataset Profile')
# advanced_profile.to_file('advanced_profile.html')
# advanced_profile

### 3.1.8. <a id='toc3_1_8_'></a>[Imputing values to the missing data](#toc0_)
- We still have two players with missing values:
  - Michael Foster Jr.: 'TS%', '3PAr', 'FTr' and 'TOV%'
  - Alondes Williams: 'TS%', '3PAr' and 'FTr'
- Neither of them is currently playing in the NBA
- For that reason we will impute zeros for the NAs

In [None]:
advanced_df[(advanced_df['Player_additional']=='fostemi02') | (advanced_df['Player_additional']=='willial06')]

In [None]:
advanced_df = advanced_df.fillna(0)

In [None]:
# Checking if the imputation went well

advanced_df[(advanced_df['Player_additional']=='fostemi02') | (advanced_df['Player_additional']=='willial06')][['Player', 'TS%', '3PAr', 'FTr', 'TOV%']]
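
Note that `fillna(0)` on the whole frame fills every NA in every column. A more targeted variant restricts the imputation to the columns we actually inspected. The sketch below uses a made-up frame (`toy`) where 'PER' is also missing, only to show that other columns stay untouched.

```python
import numpy as np
import pandas as pd

# Toy frame: two rate columns with NAs plus one column we do not want to touch
toy = pd.DataFrame({
    'Player': ['A', 'B'],
    'TS%': [0.58, np.nan],
    'TOV%': [np.nan, 12.0],
    'PER': [15.0, np.nan],
})

# Impute zeros only in the columns that were inspected
rate_cols = ['TS%', 'TOV%']
imputed = toy.copy()
imputed[rate_cols] = imputed[rate_cols].fillna(0)

print(imputed)
```

This way an unexpected NA in another column would still show up in a later `isna().sum()` check instead of being silently turned into a zero.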

## 3.2. <a id='toc3_2_'></a>[Examining **Per Game Dataset**](#toc0_)

### 3.2.1. <a id='toc3_2_1_'></a>[Features from Per Game Dataset](#toc0_)


- Rk -- Rank

- Pos -- Position

- Age -- Player's age on February 1 of the season

- Tm -- Team

- G -- Games

- GS -- Games Started

- MP -- Minutes Played Per Game

- FG -- Field Goals Per Game

- FGA -- Field Goal Attempts Per Game

- FG% -- Field Goal Percentage

- 3P -- 3-Point Field Goals Per Game

- 3PA -- 3-Point Field Goal Attempts Per Game

- 3P% -- 3-Point Field Goal Percentage

- 2P -- 2-Point Field Goals Per Game

- 2PA -- 2-Point Field Goal Attempts Per Game

- 2P% -- 2-Point Field Goal Percentage

- eFG% -- Effective Field Goal Percentage. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.

- FT -- Free Throws Per Game

- FTA -- Free Throw Attempts Per Game

- FT% -- Free Throw Percentage

- ORB -- Offensive Rebounds Per Game

- DRB -- Defensive Rebounds Per Game

- TRB -- Total Rebounds Per Game

- AST -- Assists Per Game

- STL -- Steals Per Game

- BLK -- Blocks Per Game

- TOV -- Turnovers Per Game

- PF -- Personal Fouls Per Game

- PTS -- Points Per Game

### 3.2.2. <a id='toc3_2_2_'></a>[What are we dealing with?](#toc0_)

In [None]:
pergame_df_raw.head()

In [None]:
pergame_df_raw.shape

### 3.2.3. <a id='toc3_2_3_'></a>[Renaming the columns](#toc0_)

In [None]:
pergame_df_raw.columns

In [None]:
pergame_df_raw.columns = ['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Player_additional']

### 3.2.4. <a id='toc3_2_4_'></a>[Checking for NAs](#toc0_)
- This dataset has a few more NAs than the previous one
- There are NAs in five columns in total:
  - FG%
  - 3P%
  - 2P% 
  - eFG%
  - FT%
- For the features 'FG%' and 'eFG%', the same three players from the previous dataset have missing values, and we can proceed as we did then

In [None]:
pergame_df_raw.isna().sum()

In [None]:
pergame_df_raw[pergame_df_raw['FG%'].isna()]

In [None]:
pergame_df_raw[pergame_df_raw['3P%'].isna()]

In [None]:
pergame_df_raw[pergame_df_raw['2P%'].isna()]

In [None]:
pergame_df_raw[pergame_df_raw['eFG%'].isna()]

In [None]:
pergame_df_raw[pergame_df_raw['FT%'].isna()]

### 3.2.5. <a id='toc3_2_5_'></a>[Let's combine multiple player rows in one](#toc0_)

In [None]:
pergame_df = pergame_df_raw.groupby("Player_additional", as_index=False).agg(
                      {
                          'Rk':'first', 'Player':'first', 
                          'Pos':'first', 'Age':'first', 
                          'Tm':'first', 'G':'first', 
                          'GS':'first', 'MP':'mean', 
                          'FG':'mean', 'FGA':'mean', 
                          'FG%':'mean', '3P':'mean', 
                          '3PA':'mean', '3P%':'mean', 
                          '2P':'mean', '2PA':'mean', 
                          '2P%':'mean', 'eFG%':'mean', 
                          'FT':'mean', 'FTA':'mean', 
                          'FT%':'mean', 'ORB':'mean', 
                          'DRB':'mean', 'TRB':'mean', 
                          'AST':'mean', 'STL':'mean', 
                          'BLK':'mean', 'TOV':'mean', 
                          'PF':'mean', 'PTS':'mean', 
                          'Player_additional':'first'
                      }
                      )

#### 3.2.5.1. <a id='toc3_2_5_1_'></a>[Checking if the concatenation went as expected](#toc0_)

In [None]:
print(pergame_df.shape[0], 'out of', pergame_df_raw.shape[0])

In [None]:
pergame_df['Player_additional'].nunique()

#### 3.2.5.2. <a id='toc3_2_5_2_'></a>[Checking again for NAs](#toc0_)
- We still have some NAs. Let's examine them further and decide how to deal with them

In [None]:
pergame_df.isna().sum()

### 3.2.6. <a id='toc3_2_6_'></a>[Filling out NAs](#toc0_)
- The NAs still present in the dataset are percentages whose underlying basic statistic (the number of attempts) is itself zero, so the percentage cannot be computed
- Because of that we can impute zeros for the NAs
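
The claim above can be checked directly: for each percentage column, every row with an NA should have zero attempts in the matching attempts column. The sketch below runs this check on a made-up frame (`toy`) with two of the pairs; the mapping `pairs` is an assumption about which attempts column feeds which percentage.

```python
import numpy as np
import pandas as pd

# Toy per-game frame: percentages are NaN exactly where attempts are zero
toy = pd.DataFrame({
    'FGA': [5.0, 0.0, 2.0],
    'FG%': [0.4, np.nan, 0.5],
    '3PA': [0.0, 0.0, 1.0],
    '3P%': [np.nan, np.nan, 0.0],
})

# Percentage column -> attempts column it is derived from (assumed pairing)
pairs = {'FG%': 'FGA', '3P%': '3PA'}

# For each pair, check that all NA percentages come with zero attempts
na_explained = {
    pct: bool((toy.loc[toy[pct].isna(), att] == 0).all())
    for pct, att in pairs.items()
}

print(na_explained)
```

On `pergame_df` the mapping would presumably also include '2P%' with '2PA', 'eFG%' with 'FGA' and 'FT%' with 'FTA'. Any `False` in the result would point to an NA that zero attempts do not explain.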

In [None]:
pergame_df = pergame_df.fillna(0)

In [None]:
pergame_df.isna().sum()

### 3.2.7. <a id='toc3_2_7_'></a>[First glance at the Per Game Dataset](#toc0_)

In [None]:
pergame_df.describe().T

In [None]:
pergame_df.info()

In [None]:
# Generate a dataset profile report

# pergame_profile = ProfileReport(pergame_df, title = 'Per Game NBA Dataset Profile')
# pergame_profile.to_file('pergame_profile.html')
# pergame_profile

## 3.3. <a id='toc3_3_'></a>[Merging the two datasets](#toc0_)

In [None]:
df = pd.merge(advanced_df, pergame_df, how = 'left', on=['Player_additional', 'Player', 'Pos', 'Age', 'Tm', 'G', 'Rk'])
print(df)
print(df.shape)
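
Since both frames should now have exactly one row per player, `pd.merge` can verify that for us. The sketch below uses two made-up frames (`left` and `right`) and a single key for brevity; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for advanced_df and pergame_df (values are made up)
left = pd.DataFrame({'Player_additional': ['aaa01', 'bbb01'], 'PER': [14.0, 18.0]})
right = pd.DataFrame({'Player_additional': ['aaa01', 'bbb01'], 'PTS': [9.5, 21.3]})

# validate raises MergeError if a key is duplicated on either side;
# indicator adds a '_merge' column telling where each row was found
merged = pd.merge(left, right, how='left', on='Player_additional',
                  validate='one_to_one', indicator=True)

# Rows from the left frame with no match on the right
unmatched = merged[merged['_merge'] != 'both']

print(merged)
print(len(unmatched), 'unmatched rows')
```

With a left join, an unmatched player would otherwise only be visible as a row of NAs in the per-game columns, so an explicit check is cheap insurance.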

### 3.3.1. <a id='toc3_3_1_'></a>[Creating some new features](#toc0_)

#### 3.3.1.1. <a id='toc3_3_1_1_'></a>[GM = Games Missed](#toc0_)

In [None]:
df['GM'] = 82 - df['G']
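
One edge case to keep in mind: a player who changes teams mid-season can end up with more than 82 games played, which would make `GM` negative. The sketch below is a possible guard, using a made-up frame (`toy`) and a hypothetical constant `SEASON_GAMES`.

```python
import pandas as pd

SEASON_GAMES = 82  # regular-season games per team (hypothetical constant name)

# Toy frame; player C appears in 83 games after changing teams
toy = pd.DataFrame({'Player': ['A', 'B', 'C'], 'G': [82, 60, 83]})

# Games missed, floored at zero so it never goes negative
toy['GM'] = (SEASON_GAMES - toy['G']).clip(lower=0)

print(toy)
```

Whether clipping is the right choice depends on how `GM` is used later; leaving the negative value in place is also defensible as long as it is a known quirk.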

### 3.3.2. <a id='toc3_3_2_'></a>[Reordering the columns](#toc0_)

In [None]:
df = df[['Rk', 'Player', 'Pos', 'Age', 'Tm', 
         'G', 'GS', 'GM',
         'MP_Total', 'MP', 'PER', 
         'USG%', 'OWS', 'DWS', 'WS', 'WS_48', 
         'OBPM', 'DBPM', 'BPM', 'VORP',
         'TS%', 'PTS', 
         'FG', 'FGA', 'FG%', 
         '3P', '3PA', '3P%', '3PAr',
         '2P', '2PA', '2P%', 'eFG%', 
         'FT', 'FTA', 'FT%', 'FTr',
         'ORB', 'ORB%', 
         'DRB', 'DRB%', 
         'TRB', 'TRB%',
         'AST', 'AST%',
         'STL', 'STL%',
         'BLK','BLK%',
         'TOV', 'TOV%',
         'PF', 'Player_additional']]
df.head()

## 3.4. <a id='toc3_4_'></a>[Exporting the merged dataset as a pickle file](#toc0_)

In [None]:
pd.to_pickle(df, 'df.pkl')
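
The file is written here with a relative path and read back later from a path under the home directory, so the two only match if the notebook runs from that folder. A simple way to avoid a mismatch is to define the path once and reuse it. The sketch below demonstrates the round trip in a temporary directory; `pickle_path` and the frame `toy` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the merged df
toy = pd.DataFrame({'Player': ['A', 'B'], 'G': [82, 60]})

with tempfile.TemporaryDirectory() as tmp:
    # Single source of truth for the file location (hypothetical path)
    pickle_path = Path(tmp) / 'df.pkl'

    toy.to_pickle(pickle_path)
    restored = pd.read_pickle(pickle_path)

# The restored frame should be identical to the original
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

In the notebook itself, the same variable could be passed to both `pd.to_pickle` and `pd.read_pickle`.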

# 4. <a id='toc4_'></a>[Feature Engineering and Hypothesis Creation](#toc0_)
- Mental hypothesis map
- Hypothesis list
- Fill out NAs
- Derive new variables as needed

## Importing dataset from pickle file

In [5]:
df04 = pd.read_pickle('~/repos/NBA_2022-2023/data/df.pkl')

# 5. <a id='toc5_'></a>[Data selection and filtering](#toc0_)
- Filter data rows
- Filter data columns

# 6. <a id='toc6_'></a>[Exploratory Data Analysis](#toc0_)
- Answer the hypothesis list
- Build data visualization solutions and plots

# 7. <a id='toc7_'></a>[Data Preparation](#toc0_)
- Normalize, rescale and transform (encode) variables to suit model requirements

# 8. <a id='toc8_'></a>[Feature Selection through Boruta algorithm](#toc0_)
- Use the Boruta algorithm to select the best features for the machine learning models

# 9. <a id='toc9_'></a>[Model implementation](#toc0_)
- Implement different machine learning models and algorithms
- Run cross-validation
- Compute single performance metrics

# 10. <a id='toc10_'></a>[Hyperparameter Fine-Tuning](#toc0_)
- Implement hyperparameter search (Bayes Search) to find best model hyperparameter values
- Re-train model using best values

# 11. <a id='toc11_'></a>[Model Error Estimation and Interpretation](#toc0_)
- Use model errors to interpret the goals 

# 12. <a id='toc12_'></a>[Model Deployment](#toc0_)
- Deploy the model to a cloud service so it can be used by its consumers