# 1. <a id='toc1_'></a>[NBA Season 2022-2023 Analysis](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- 1. [NBA Season 2022-2023 Analysis](#toc1_)    
- 2. [Imports](#toc2_)    
  - 2.1. [Libraries](#toc2_1_)    
  - 2.2. [Helper Function](#toc2_2_)    
  - 2.3. [Data loading](#toc2_3_)    
- 3. [Data exploration and problem comprehension](#toc3_)    
  - 3.1. [Examining the **Advanced Dataset**](#toc3_1_)    
    - 3.1.1. [Features from Advanced Dataset](#toc3_1_1_)    
    - 3.1.2. [What are we dealing with?](#toc3_1_2_)    
    - 3.1.3. [Renaming and dropping empty columns](#toc3_1_3_)    
    - 3.1.4. [Checking for NAs](#toc3_1_4_)    
    - 3.1.5. [Do these players have multiple lines due to team exchanges?](#toc3_1_5_)    
    - 3.1.6. [Let's combine the rows with same players](#toc3_1_6_)    
      - 3.1.6.1. [Checking if the concatenation went right](#toc3_1_6_1_)    
    - 3.1.7. [First glance at the Advanced Dataset](#toc3_1_7_)    
    - 3.1.8. [Imputing values to the missing data](#toc3_1_8_)    
  - 3.2. [Examining **Per Game Dataset**](#toc3_2_)    
    - 3.2.1. [Features from Per Game Dataset](#toc3_2_1_)    
    - 3.2.2. [What are we dealing with?](#toc3_2_2_)    
    - 3.2.3. [Renaming the columns](#toc3_2_3_)    
    - 3.2.4. [Checking for NAs](#toc3_2_4_)    
    - 3.2.5. [Let's combine multiple player rows in one](#toc3_2_5_)    
      - 3.2.5.1. [Checking if the concatenation went as expected](#toc3_2_5_1_)    
      - 3.2.5.2. [Checking again for NAs](#toc3_2_5_2_)    
    - 3.2.6. [Filling out NAs](#toc3_2_6_)    
    - 3.2.7. [First glance at the Per Game Dataset](#toc3_2_7_)    
- 4. [Feature Engineering and Hypothesis Creation](#toc4_)    
  - 4.1. [Merging the two datasets and getting new columns](#toc4_1_)    
    - 4.1.1. [Creating some new features](#toc4_1_1_)    
      - 4.1.1.1. [GM = Games Missed](#toc4_1_1_1_)    
    - 4.1.2. [Reordering the columns](#toc4_1_2_)    
    - 4.1.3. [Changing rows with odd player's positions](#toc4_1_3_)    
  - 4.2. [Exporting the merged dataset as a csv file](#toc4_2_)    
- 5. [Data selection and filtering](#toc5_)    
  - 5.1. [Importing merged dataset from csv file](#toc5_1_)    
- 6. [Exploratory Data Analysis](#toc6_)    
  - 6.1. [Importing merged dataset from csv file](#toc6_1_)    
  - 6.2. [First graphs](#toc6_2_)    
    - 6.2.1. [How are distributed the Points Per Game according to the Positions assigned to each Player?](#toc6_2_1_)    
- 7. [Data Preparation](#toc7_)    
- 8. [Feature Selection through Boruta algorithm](#toc8_)    
- 9. [Model implementation](#toc9_)    
- 10. [Hyperparameter Fine-Tuning](#toc10_)    
- 11. [Model Error Estimation and Interpretation](#toc11_)    
- 12. [Model Deployment](#toc12_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# 2. <a id='toc2_'></a>[Imports](#toc0_)

## 2.1. <a id='toc2_1_'></a>[Libraries](#toc0_)

In [420]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import pickle
import plotly.express as px
import plotly.graph_objects as go

from ydata_profiling        import ProfileReport
from sklearn.impute         import SimpleImputer
from IPython.display        import Image, HTML, display

## 2.2. <a id='toc2_2_'></a>[Helper Function](#toc0_)

In [421]:
def jupyter_configs():
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [15, 8]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()
    
    warnings.filterwarnings( 'ignore' )
    
jupyter_configs()

## 2.3. <a id='toc2_3_'></a>[Data loading](#toc0_)

In [422]:
advanced_df_raw = pd.read_csv('~/repos/NBA_2022-2023/data/data_advanced.csv')
pergame_df_raw = pd.read_csv('~/repos/NBA_2022-2023/data/data_pergame.csv')

# 3. <a id='toc3_'></a>[Data exploration and problem comprehension](#toc0_)
- Main goal/problem
- Sub-goals
- What will the finished product be?

## 3.1. <a id='toc3_1_'></a>[Examining the **Advanced Dataset**](#toc0_)

### 3.1.1. <a id='toc3_1_1_'></a>[Features from Advanced Dataset](#toc0_)


- Rk -- Rank

- Pos -- Position

- Age -- Player's age on February 1 of the season

- Tm -- Team

- G -- Games

- MP -- Minutes Played

- PER -- Player Efficiency Rating. A measure of per-minute production standardized such that the league average is 15.

- TS% -- True Shooting Percentage. A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.

- 3PAr -- 3-Point Attempt Rate. Percentage of FG Attempts from 3-Point Range

- FTr -- Free Throw Attempt Rate. Number of FT Attempts Per FG Attempt

- ORB% -- Offensive Rebound Percentage. An estimate of the percentage of available offensive rebounds a player grabbed while they were on the floor.

- DRB% -- Defensive Rebound Percentage. An estimate of the percentage of available defensive rebounds a player grabbed while they were on the floor.

- TRB% -- Total Rebound Percentage. An estimate of the percentage of available rebounds a player grabbed while they were on the floor.

- AST% -- Assist Percentage. An estimate of the percentage of teammate field goals a player assisted while they were on the floor.

- STL% -- Steal Percentage. An estimate of the percentage of opponent possessions that end with a steal by the player while they were on the floor.

- BLK% -- Block Percentage. An estimate of the percentage of opponent two-point field goal attempts blocked by the player while they were on the floor.

- TOV% -- Turnover Percentage. An estimate of turnovers committed per 100 plays.

- USG% -- Usage Percentage. An estimate of the percentage of team plays used by a player while they were on the floor.

- OWS -- Offensive Win Shares. An estimate of the number of wins contributed by a player due to offense.

- DWS -- Defensive Win Shares. An estimate of the number of wins contributed by a player due to defense.

- WS -- Win Shares. An estimate of the number of wins contributed by a player.

- WS/48 -- Win Shares Per 48 Minutes. An estimate of the number of wins contributed by a player per 48 minutes (league average is approximately .100)

- OBPM -- Offensive Box Plus/Minus. A box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team.

- DBPM -- Defensive Box Plus/Minus. A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team.

- BPM -- Box Plus/Minus. A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team.

- VORP -- Value over Replacement Player. A box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season. Multiply by 2.70 to convert to wins over replacement.

### 3.1.2. <a id='toc3_1_2_'></a>[What are we dealing with?](#toc0_)

In [423]:
advanced_df_raw.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,Unnamed: 19,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
0,1,Precious Achiuwa,C,23,TOR,55,1140,15.2,0.554,0.267,0.307,9.3,24.4,16.3,6.3,1.3,2.6,11.4,19.4,,0.8,1.4,2.2,0.093,,-1.4,-0.8,-2.3,-0.1,achiupr01
1,2,Steven Adams,C,29,MEM,42,1133,17.5,0.564,0.004,0.49,20.1,25.3,22.7,11.2,1.5,3.7,19.8,14.6,,1.3,2.1,3.4,0.144,,-0.3,0.9,0.6,0.7,adamsst01
2,3,Bam Adebayo,C,25,MIA,75,2598,20.1,0.592,0.011,0.361,8.0,23.6,15.5,15.9,1.7,2.4,12.7,25.2,,3.6,3.8,7.4,0.137,,0.8,0.8,1.5,2.3,adebaba01
3,4,Ochai Agbaji,SG,22,UTA,59,1209,9.5,0.561,0.591,0.179,3.9,6.9,5.4,7.5,0.6,1.0,9.0,15.8,,0.9,0.4,1.3,0.053,,-1.7,-1.4,-3.0,-0.3,agbajoc01
4,5,Santi Aldama,PF,22,MEM,77,1682,13.9,0.591,0.507,0.274,5.4,18.0,11.7,7.6,1.3,2.6,9.3,16.0,,2.1,2.4,4.6,0.13,,-0.3,0.8,0.5,1.1,aldamsa01


In [424]:
advanced_df_raw.shape

(679, 30)

### 3.1.3. <a id='toc3_1_3_'></a>[Renaming and dropping empty columns](#toc0_)

In [425]:
dropped_columns = ['Unnamed: 19', 'Unnamed: 24']
advanced_df_raw = advanced_df_raw.drop(dropped_columns, axis = 1)

In [426]:
advanced_df_raw.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%',
       'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP',
       'Player-additional'],
      dtype='object')

In [427]:
advanced_cols = ['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP_Total', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%',
       'OWS', 'DWS', 'WS', 'WS_48', 'OBPM', 'DBPM', 'BPM', 'VORP',
       'Player_additional']

advanced_df_raw.columns = advanced_cols

In [428]:
advanced_df_raw.shape

(679, 28)

In [429]:
# There are 679 rows in the dataset but only 539 unique players, because players who changed teams during the season appear on multiple lines.
# A reasonable approach is to merge those lines and keep only the latest team the player played for.

print( advanced_df_raw['Player_additional'].nunique(), 'out of', advanced_df_raw.shape[0])

539 out of 679
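
To see which players account for the extra rows, the ids that occur more than once can be listed directly. The snippet below is a minimal sketch on a hypothetical toy frame (`toy`) that copies a few values from the rows shown in this notebook; with the real data, `advanced_df_raw` would take its place.

```python
import pandas as pd

# Toy stand-in for advanced_df_raw (values copied from rows shown in this notebook).
toy = pd.DataFrame({
    "Player_additional": ["brownmo01", "brownmo01", "brownmo01", "achiupr01"],
    "Tm": ["TOT", "LAC", "BRK", "TOR"],
    "G": [36, 34, 2, 55],
})

# Count rows per player id and keep only the ids that occur more than once.
rows_per_player = toy["Player_additional"].value_counts()
multi_team_ids = rows_per_player[rows_per_player > 1].index.tolist()

# duplicated(keep=False) flags every row of a repeated id, not only the repeats.
multi_team_rows = toy[toy["Player_additional"].duplicated(keep=False)]

print(multi_team_ids)
print(multi_team_rows)
```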


### 3.1.4. <a id='toc3_1_4_'></a>[Checking for NAs](#toc0_)
- 'TS%', '3PAr' and 'FTr' each have three NAs, all in the same three rows, and 'TOV%' has one NA (Michael Foster Jr., who is also one of those three rows).
- Let's inspect these rows to figure out why the values are missing and what to do about them.
- The completely empty columns 'Unnamed: 19' and 'Unnamed: 24' were already dropped in the previous step.

In [430]:
advanced_df_raw.isna().sum()

Rk                   0
Player               0
Pos                  0
Age                  0
Tm                   0
G                    0
MP_Total             0
PER                  0
TS%                  3
3PAr                 3
FTr                  3
ORB%                 0
DRB%                 0
TRB%                 0
AST%                 0
STL%                 0
BLK%                 0
TOV%                 1
USG%                 0
OWS                  0
DWS                  0
WS                   0
WS_48                0
OBPM                 0
DBPM                 0
BPM                  0
VORP                 0
Player_additional    0
dtype: int64

In [431]:
advanced_df_raw[advanced_df_raw['TOV%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.01,-7.2,-1.9,-9.2,0.0,fostemi02


In [432]:
advanced_df_raw[advanced_df_raw['TS%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,,0.0,0.0,0.0,0.0,8.1,0.0,100.0,7.4,0.0,0.0,0.0,-0.129,-12.7,2.8,-9.9,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.01,-7.2,-1.9,-9.2,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,,0.0,22.0,11.2,0.0,0.0,0.0,100.0,17.7,-0.1,0.0,-0.1,-0.517,-21.3,-5.2,-26.5,0.0,willial06


In [433]:
advanced_df_raw[advanced_df_raw['3PAr'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,,0.0,0.0,0.0,0.0,8.1,0.0,100.0,7.4,0.0,0.0,0.0,-0.129,-12.7,2.8,-9.9,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.01,-7.2,-1.9,-9.2,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,,0.0,22.0,11.2,0.0,0.0,0.0,100.0,17.7,-0.1,0.0,-0.1,-0.517,-21.3,-5.2,-26.5,0.0,willial06


In [434]:
advanced_df_raw[advanced_df_raw['FTr'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,,0.0,0.0,0.0,0.0,8.1,0.0,100.0,7.4,0.0,0.0,0.0,-0.129,-12.7,2.8,-9.9,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.01,-7.2,-1.9,-9.2,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,,0.0,22.0,11.2,0.0,0.0,0.0,100.0,17.7,-0.1,0.0,-0.1,-0.517,-21.3,-5.2,-26.5,0.0,willial06
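
The per-column filters above all return the same three rows, so they could be collapsed into a single check. This is only a sketch on a hypothetical toy frame (`toy`) built from values shown above; `missing_by_player` is an illustrative name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for advanced_df_raw, restricted to the columns that contain NAs.
toy = pd.DataFrame({
    "Player": ["Moses Brown", "Michael Foster Jr.", "Alondes Williams", "Bam Adebayo"],
    "TS%": [np.nan, np.nan, np.nan, 0.592],
    "3PAr": [np.nan, np.nan, np.nan, 0.011],
    "FTr": [np.nan, np.nan, np.nan, 0.361],
    "TOV%": [100.0, np.nan, 100.0, 12.7],
})

na_mask = toy.isna()

# All rows that have at least one missing value, in one filter.
rows_with_na = toy[na_mask.any(axis=1)]

# For each affected player, the list of columns that are missing.
missing_by_player = {
    player: na_mask.columns[flags].tolist()
    for player, flags in zip(toy["Player"], na_mask.to_numpy())
    if flags.any()
}

print(rows_with_na)
print(missing_by_player)
```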


### 3.1.5. <a id='toc3_1_5_'></a>[Do these players have multiple lines due to team exchanges?](#toc0_)
- Moses Brown appears in three rows: one for each team he played for this season (LAC and BRK) plus a TOT row for the season as a whole, so joining the rows is a good option.
- Michael Foster Jr. and Alondes Williams each appear only once, so their missing values are probably there because the statistics could not be computed. We could fill them with 0.0 or try to estimate them from the Per Game Dataset.

In [435]:
advanced_df_raw[advanced_df_raw['Player_additional'] == 'brownmo01']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
87,66,Moses Brown,C,23,TOT,36,294,22.2,0.607,0.0,0.75,22.0,30.9,26.5,2.1,0.7,4.2,10.5,21.2,0.7,0.4,1.1,0.179,0.6,-1.2,-0.6,0.1,brownmo01
88,66,Moses Brown,C,23,LAC,34,288,22.7,0.607,0.0,0.75,22.4,31.5,27.0,2.2,0.5,4.3,9.9,21.5,0.7,0.4,1.1,0.185,0.9,-1.3,-0.4,0.1,brownmo01
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,,0.0,0.0,0.0,0.0,8.1,0.0,100.0,7.4,0.0,0.0,0.0,-0.129,-12.7,2.8,-9.9,0.0,brownmo01


In [436]:
advanced_df_raw[advanced_df_raw['Player_additional'] == 'fostemi02']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.01,-7.2,-1.9,-9.2,0.0,fostemi02


In [437]:
advanced_df_raw[advanced_df_raw['Player_additional'] == 'willial06']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,,0.0,22.0,11.2,0.0,0.0,0.0,100.0,17.7,-0.1,0.0,-0.1,-0.517,-21.3,-5.2,-26.5,0.0,willial06
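
A likely reason these values cannot be computed is that each of the three rates has shot attempts in its denominator. The sketch below assumes the commonly published definitions, which are not stated in this notebook: TS% = PTS / (2 * (FGA + 0.44 * FTA)), 3PAr = 3PA / FGA and FTr = FTA / FGA. The box-score numbers are copied from the Per Game Dataset rows of players who appeared in a single game, so per-game values equal totals.

```python
import pandas as pd

# Single-game players, so the per-game figures are also the season totals.
box = pd.DataFrame({
    "Player": ["Michael Foster Jr.", "Alondes Williams", "Chance Comanche"],
    "PTS": [0.0, 0.0, 7.0],
    "FGA": [0.0, 0.0, 5.0],
    "3PA": [0.0, 0.0, 0.0],
    "FTA": [0.0, 0.0, 4.0],
})

rates = pd.DataFrame({"Player": box["Player"]})
# Assumed formulas (see the lead-in); 0 / 0 evaluates to NaN in pandas.
rates["TS%"] = box["PTS"] / (2 * (box["FGA"] + 0.44 * box["FTA"]))
rates["3PAr"] = box["3PA"] / box["FGA"]
rates["FTr"] = box["FTA"] / box["FGA"]

print(rates)
```

With no field goal or free throw attempts, every denominator is zero, which matches the NaN pattern seen for these players.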


### 3.1.6. <a id='toc3_1_6_'></a>[Let's combine the rows with same players](#toc0_)

In [438]:
advanced_df = advanced_df_raw.groupby("Player_additional", as_index=False).agg(
                      {
                          'Rk':'first', 'Player':'first', 
                          'Pos':'first', 'Age':'first', 
                          'Tm':'first', 'G':'first', 
                          'MP_Total':'mean', 'PER':'mean', 
                          'TS%':'mean', '3PAr':'mean',
                          'FTr':'mean', 'ORB%':'mean', 
                          'DRB%':'mean', 'TRB%':'mean', 
                          'AST%':'mean', 'STL%':'mean', 
                          'BLK%':'mean', 'TOV%':'mean', 
                          'USG%':'mean', 'OWS':'mean', 
                          'DWS':'mean', 'WS':'mean', 
                          'WS_48':'mean', 'OBPM':'mean', 
                          'DBPM':'mean', 'BPM':'mean', 
                          'VORP':'mean', 'Player_additional':'first'
                      }
                      )

#### 3.1.6.1. <a id='toc3_1_6_1_'></a>[Checking if the concatenation went right](#toc0_)

In [439]:
advanced_df.shape[0]

539

In [440]:
advanced_df['Player_additional'].nunique()

539

In [441]:
# Before:

advanced_df_raw[advanced_df_raw['Player_additional'] == 'brownmo01']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
87,66,Moses Brown,C,23,TOT,36,294,22.2,0.607,0.0,0.75,22.0,30.9,26.5,2.1,0.7,4.2,10.5,21.2,0.7,0.4,1.1,0.179,0.6,-1.2,-0.6,0.1,brownmo01
88,66,Moses Brown,C,23,LAC,34,288,22.7,0.607,0.0,0.75,22.4,31.5,27.0,2.2,0.5,4.3,9.9,21.5,0.7,0.4,1.1,0.185,0.9,-1.3,-0.4,0.1,brownmo01
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,,0.0,0.0,0.0,0.0,8.1,0.0,100.0,7.4,0.0,0.0,0.0,-0.129,-12.7,2.8,-9.9,0.0,brownmo01


In [442]:
# After:

advanced_df[advanced_df['Player_additional'] == 'brownmo01']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
65,66,Moses Brown,C,23,TOT,36,196.0,14.1,0.607,0.0,0.75,14.8,20.8,17.833333,1.433333,3.1,2.833333,40.133333,16.7,0.466667,0.266667,0.733333,0.078333,-3.733333,0.1,-3.633333,0.066667,brownmo01
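
Note that averaging mixes the TOT row with the per-team rows: Moses Brown's MP_Total becomes 196.0, which is (294 + 288 + 6) / 3, while his TOT row reports 294, and his team stays 'TOT'. One possible alternative, sketched here and not what this notebook does, is to keep the statistics from the first row of each player (the TOT row for players with several teams) and take the team from the last row. It assumes the TOT row always comes first and the per-team rows follow in chronological order, which holds for the rows displayed above but is not verified for the whole file.

```python
import pandas as pd

# Toy stand-in for advanced_df_raw with a few columns (values from this notebook).
raw = pd.DataFrame({
    "Player_additional": ["brownmo01", "brownmo01", "brownmo01", "achiupr01"],
    "Tm": ["TOT", "LAC", "BRK", "TOR"],
    "G": [36, 34, 2, 55],
    "MP_Total": [294, 288, 6, 1140],
    "PER": [22.2, 22.7, -2.6, 15.2],
})

# Last listed team per player (assumed to be the most recent one).
last_team = raw.groupby("Player_additional", sort=False)["Tm"].last()

# First row per player: the TOT row for traded players, the only row otherwise.
season = raw.drop_duplicates("Player_additional", keep="first").copy()
season["Tm"] = season["Player_additional"].map(last_team)
season = season.reset_index(drop=True)

print(season)
```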


### 3.1.7. <a id='toc3_1_7_'></a>[First glance at the Advanced Dataset](#toc0_)

In [443]:
# The data types are all set correctly

advanced_df.dtypes

Rk                     int64
Player                object
Pos                   object
Age                    int64
Tm                    object
G                      int64
MP_Total             float64
PER                  float64
TS%                  float64
3PAr                 float64
FTr                  float64
ORB%                 float64
DRB%                 float64
TRB%                 float64
AST%                 float64
STL%                 float64
BLK%                 float64
TOV%                 float64
USG%                 float64
OWS                  float64
DWS                  float64
WS                   float64
WS_48                float64
OBPM                 float64
DBPM                 float64
BPM                  float64
VORP                 float64
Player_additional     object
dtype: object

In [444]:
advanced_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rk,539.0,270.0,155.740168,1.0,135.5,270.0,404.5,539.0
Age,539.0,25.727273,4.290326,19.0,23.0,25.0,28.5,42.0
G,539.0,48.042672,24.648006,1.0,30.5,54.0,68.0,83.0
MP_Total,539.0,1058.08658,815.685148,1.0,273.5,914.0,1772.5,2842.0
PER,539.0,13.318429,6.117492,-20.9,10.1,13.0,16.4,65.6
TS%,537.0,0.562687,0.098285,0.0,0.524,0.567,0.61,1.064
3PAr,537.0,0.401669,0.219115,0.0,0.261,0.408,0.547,1.0
FTr,537.0,0.25084,0.182275,0.0,0.143,0.23,0.321667,2.0
ORB%,539.0,5.213358,4.241112,0.0,2.166667,3.9,7.05,28.8
DRB%,539.0,14.951515,6.539374,0.0,10.75,13.4,18.7,55.4


In [445]:
advanced_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539 entries, 0 to 538
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 539 non-null    int64  
 1   Player             539 non-null    object 
 2   Pos                539 non-null    object 
 3   Age                539 non-null    int64  
 4   Tm                 539 non-null    object 
 5   G                  539 non-null    int64  
 6   MP_Total           539 non-null    float64
 7   PER                539 non-null    float64
 8   TS%                537 non-null    float64
 9   3PAr               537 non-null    float64
 10  FTr                537 non-null    float64
 11  ORB%               539 non-null    float64
 12  DRB%               539 non-null    float64
 13  TRB%               539 non-null    float64
 14  AST%               539 non-null    float64
 15  STL%               539 non-null    float64
 16  BLK%               539 non

In [446]:
# Generate a dataset profile report

# advanced_profile = ProfileReport(advanced_df, title = 'Advanced NBA Dataset Profile')
# advanced_profile.to_file('advanced_profile.html')
# advanced_profile

### 3.1.8. <a id='toc3_1_8_'></a>[Imputing values to the missing data](#toc0_)
- We still have two players with missing values:
  - Michael Foster Jr.: 'TS%', '3PAr', 'FTr' and 'TOV%'
  - Alondes Williams: 'TS%', '3PAr' and 'FTr'
- Neither of them is currently playing in the NBA
- For that reason we will impute zeros for the NAs

In [447]:
advanced_df[(advanced_df['Player_additional']=='fostemi02') | (advanced_df['Player_additional']=='willial06')]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP_Total,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,Player_additional
150,151,Michael Foster Jr.,PF,20,PHI,1,1.0,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.01,-7.2,-1.9,-9.2,0.0,fostemi02
514,515,Alondes Williams,SG,23,BRK,1,5.0,-20.9,,,,0.0,22.0,11.2,0.0,0.0,0.0,100.0,17.7,-0.1,0.0,-0.1,-0.517,-21.3,-5.2,-26.5,0.0,willial06


In [448]:
advanced_df = advanced_df.fillna(0)

In [449]:
# Checking if the imputation went well

advanced_df[(advanced_df['Player_additional']=='fostemi02') | (advanced_df['Player_additional']=='willial06')][['Player', 'TS%', '3PAr', 'FTr', 'TOV%']]

Unnamed: 0,Player,TS%,3PAr,FTr,TOV%
150,Michael Foster Jr.,0.0,0.0,0.0,0.0
514,Alondes Williams,0.0,0.0,0.0,100.0


### Fixing the % features (they are on a 0-100 scale instead of being proportions of 1)

In [450]:
advanced_df[['USG%', 'TOV%', 'BLK%','STL%', 'AST%', 'TRB%', 'DRB%', 'ORB%']].head()

Unnamed: 0,USG%,TOV%,BLK%,STL%,AST%,TRB%,DRB%,ORB%
0,19.4,11.4,2.6,1.3,6.3,16.3,24.4,9.3
1,14.6,19.8,3.7,1.5,11.2,22.7,25.3,20.1
2,25.2,12.7,2.4,1.7,15.9,15.5,23.6,8.0
3,15.8,9.0,1.0,0.6,7.5,5.4,6.9,3.9
4,16.0,9.3,2.6,1.3,7.6,11.7,18.0,5.4


In [451]:
pct_cols = ['USG%', 'TOV%', 'BLK%','STL%', 'AST%', 'TRB%', 'DRB%', 'ORB%']
advanced_df[pct_cols] = advanced_df[pct_cols]/100

## 3.2. <a id='toc3_2_'></a>[Examining **Per Game Dataset**](#toc0_)

### 3.2.1. <a id='toc3_2_1_'></a>[Features from Per Game Dataset](#toc0_)


- Rk -- Rank

- Pos -- Position

- Age -- Player's age on February 1 of the season

- Tm -- Team

- G -- Games

- GS -- Games Started

- MP -- Minutes Played Per Game

- FG -- Field Goals Per Game

- FGA -- Field Goal Attempts Per Game

- FG% -- Field Goal Percentage

- 3P -- 3-Point Field Goals Per Game

- 3PA -- 3-Point Field Goal Attempts Per Game

- 3P% -- 3-Point Field Goal Percentage

- 2P -- 2-Point Field Goals Per Game

- 2PA -- 2-Point Field Goal Attempts Per Game

- 2P% -- 2-Point Field Goal Percentage

- eFG% -- Effective Field Goal Percentage. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.

- FT -- Free Throws Per Game

- FTA -- Free Throw Attempts Per Game

- FT% -- Free Throw Percentage

- ORB -- Offensive Rebounds Per Game

- DRB -- Defensive Rebounds Per Game

- TRB -- Total Rebounds Per Game

- AST -- Assists Per Game

- STL -- Steals Per Game

- BLK -- Blocks Per Game

- TOV -- Turnovers Per Game

- PF -- Personal Fouls Per Game

- PTS -- Points Per Game
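
As a small illustration of the eFG% adjustment described in the list above, the sketch below uses the standard formula eFG% = (FG + 0.5 * 3P) / FGA, which is an assumption in the sense that this notebook does not spell it out. The helper name `effective_fg_pct` is hypothetical. Because the per-game inputs in the dataset are rounded to one decimal, the result only approximates the eFG% column (Bam Adebayo is listed at 0.541).

```python
def effective_fg_pct(fg, three_p, fga):
    """eFG% = (FG + 0.5 * 3P) / FGA; returns None when there are no attempts."""
    if fga == 0:
        return None  # undefined: the dataset stores these cases as NaN
    return (fg + 0.5 * three_p) / fga


# Bam Adebayo's per-game line from this notebook: FG 8.0, 3P 0.0, FGA 14.9.
adebayo = effective_fg_pct(fg=8.0, three_p=0.0, fga=14.9)
print(round(adebayo, 3))

# A player without field goal attempts has no defined eFG%.
print(effective_fg_pct(fg=0.0, three_p=0.0, fga=0.0))
```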

### 3.2.2. <a id='toc3_2_2_'></a>[What are we dealing with?](#toc0_)

In [452]:
pergame_df_raw.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
0,1,Precious Achiuwa,C,23,TOR,55,12,20.7,3.6,7.3,0.485,0.5,2.0,0.269,3.0,5.4,0.564,0.521,1.6,2.3,0.702,1.8,4.1,6.0,0.9,0.6,0.5,1.1,1.9,9.2,achiupr01
1,2,Steven Adams,C,29,MEM,42,42,27.0,3.7,6.3,0.597,0.0,0.0,0.0,3.7,6.2,0.599,0.597,1.1,3.1,0.364,5.1,6.5,11.5,2.3,0.9,1.1,1.9,2.3,8.6,adamsst01
2,3,Bam Adebayo,C,25,MIA,75,75,34.6,8.0,14.9,0.54,0.0,0.2,0.083,8.0,14.7,0.545,0.541,4.3,5.4,0.806,2.5,6.7,9.2,3.2,1.2,0.8,2.5,2.8,20.4,adebaba01
3,4,Ochai Agbaji,SG,22,UTA,59,22,20.5,2.8,6.5,0.427,1.4,3.9,0.355,1.4,2.7,0.532,0.532,0.9,1.2,0.812,0.7,1.3,2.1,1.1,0.3,0.3,0.7,1.7,7.9,agbajoc01
4,5,Santi Aldama,PF,22,MEM,77,20,21.8,3.2,6.8,0.47,1.2,3.5,0.353,2.0,3.4,0.591,0.56,1.4,1.9,0.75,1.1,3.7,4.8,1.3,0.6,0.6,0.8,1.9,9.0,aldamsa01


In [453]:
pergame_df_raw.shape

(679, 31)

### 3.2.3. <a id='toc3_2_3_'></a>[Renaming the columns](#toc0_)

In [454]:
pergame_df_raw.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Player-additional'],
      dtype='object')

In [455]:
pergame_df_raw.columns = ['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Player_additional']

### 3.2.4. <a id='toc3_2_4_'></a>[Checking for NAs](#toc0_)
- This dataset has a few more NAs than the previous one
- There are NAs in five columns in total:
  - FG%
  - 3P%
  - 2P%
  - eFG%
  - FT%
- For 'FG%' and 'eFG%', the same three players from the previous dataset have missing values, so we can proceed as we did there

In [456]:
pergame_df_raw.isna().sum()

Rk                    0
Player                0
Pos                   0
Age                   0
Tm                    0
G                     0
GS                    0
MP                    0
FG                    0
FGA                   0
FG%                   3
3P                    0
3PA                   0
3P%                  24
2P                    0
2PA                   0
2P%                   7
eFG%                  3
FT                    0
FTA                   0
FT%                  37
ORB                   0
DRB                   0
TRB                   0
AST                   0
STL                   0
BLK                   0
TOV                   0
PF                    0
PTS                   0
Player_additional     0
dtype: int64

In [457]:
pergame_df_raw[pergame_df_raw['FG%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player_additional
89,66,Moses Brown,C,23,BRK,2,0,3.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.5,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,0,1.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,0,5.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,1.0,1.0,0.0,0.0,0.0,2.0,1.0,0.0,willial06


In [458]:
pergame_df_raw[pergame_df_raw['3P%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player_additional
21,18,Udoka Azubuike,C,23,UTA,36,4,10.0,1.6,2.0,0.819,0.0,0.0,,1.6,2.0,0.819,0.819,0.2,0.6,0.35,0.9,2.4,3.3,0.3,0.2,0.4,0.5,0.9,3.5,azubuud01
60,43,Bismack Biyombo,C,30,PHO,61,14,14.3,2.0,3.4,0.578,0.0,0.0,,2.0,3.4,0.578,0.578,0.4,1.1,0.357,1.5,2.8,4.3,0.9,0.3,1.4,0.8,1.9,4.3,biyombi01
87,66,Moses Brown,C,23,TOT,36,1,8.2,1.7,2.7,0.635,0.0,0.0,,1.7,2.7,0.635,0.635,0.9,2.0,0.458,1.6,2.3,3.9,0.1,0.1,0.4,0.4,1.1,4.3,brownmo01
88,66,Moses Brown,C,23,LAC,34,1,8.5,1.8,2.8,0.635,0.0,0.0,,1.8,2.8,0.635,0.635,1.0,2.1,0.458,1.7,2.4,4.1,0.1,0.1,0.4,0.4,1.1,4.6,brownmo01
89,66,Moses Brown,C,23,BRK,2,0,3.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.5,0.0,brownmo01
107,82,Vernon Carey Jr.,C,21,WAS,11,0,2.5,0.2,0.7,0.25,0.0,0.0,,0.2,0.7,0.25,0.25,0.2,0.2,1.0,0.3,0.7,1.0,0.3,0.2,0.2,0.2,0.5,0.5,careyve01
116,88,Justin Champagnie,SF,21,TOR,3,0,3.7,1.0,1.0,1.0,0.0,0.0,,1.0,1.0,1.0,1.0,0.0,0.0,,0.3,1.0,1.3,0.3,0.0,0.0,0.0,0.3,2.0,champju01
127,98,Chance Comanche,C,26,POR,1,0,21.0,3.0,5.0,0.6,0.0,0.0,,3.0,5.0,0.6,0.6,1.0,4.0,0.25,2.0,1.0,3.0,0.0,0.0,1.0,0.0,0.0,7.0,comanch01
196,151,Michael Foster Jr.,PF,20,PHI,1,0,1.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fostemi02
201,156,Daniel Gafford,C,24,WAS,78,47,20.6,3.7,5.1,0.732,0.0,0.0,,3.7,5.1,0.732,0.732,1.6,2.4,0.679,2.1,3.5,5.6,1.1,0.4,1.3,1.1,2.4,9.0,gaffoda01


In [459]:
pergame_df_raw[pergame_df_raw['2P%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player_additional
72,53,Jamaree Bouyea,PG,23,WAS,1,0,6.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,bouyeja01
89,66,Moses Brown,C,23,BRK,2,0,3.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.5,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,0,1.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fostemi02
211,166,Jacob Gilyard,PG,24,MEM,1,0,41.0,1.0,3.0,0.333,1.0,3.0,0.333,0.0,0.0,,0.5,0.0,0.0,,0.0,4.0,4.0,7.0,3.0,0.0,2.0,3.0,3.0,gilyaja01
336,263,Trevor Keels,SG,19,NYK,3,0,2.7,0.3,1.3,0.25,0.3,1.3,0.25,0.0,0.0,,0.375,0.0,0.0,,0.0,0.7,0.7,0.0,0.0,0.0,0.0,0.0,1.0,keelstr01
613,482,Stanley Umude,SG,23,DET,1,0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,,0.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2.0,umudest01
652,515,Alondes Williams,SG,23,BRK,1,0,5.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,1.0,1.0,0.0,0.0,0.0,2.0,1.0,0.0,willial06


In [460]:
pergame_df_raw[pergame_df_raw['eFG%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player_additional
89,66,Moses Brown,C,23,BRK,2,0,3.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.5,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,0,1.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,0,5.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,1.0,1.0,0.0,0.0,0.0,2.0,1.0,0.0,willial06


In [461]:
pergame_df_raw[pergame_df_raw['FT%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player_additional
16,15,Ryan Arcidiacono,PG,28,TOT,20,4,8.6,0.5,1.9,0.243,0.4,1.2,0.348,0.1,0.7,0.071,0.351,0.0,0.0,,0.0,0.8,0.8,1.2,0.3,0.0,0.4,0.9,1.3,arcidry01
17,15,Ryan Arcidiacono,PG,28,NYK,11,0,2.4,0.1,0.5,0.2,0.1,0.3,0.333,0.0,0.2,0.0,0.3,0.0,0.0,,0.0,0.4,0.4,0.2,0.2,0.0,0.1,0.3,0.3,arcidry01
18,15,Ryan Arcidiacono,PG,28,POR,9,4,16.2,0.9,3.6,0.25,0.8,2.2,0.35,0.1,1.3,0.083,0.359,0.0,0.0,,0.0,1.2,1.2,2.3,0.3,0.0,0.7,1.6,2.6,arcidry01
65,48,Leandro Bolmaro,SG,22,UTA,14,0,4.9,0.2,1.4,0.15,0.0,0.3,0.0,0.2,1.1,0.188,0.15,0.0,0.0,,0.3,0.2,0.5,0.5,0.2,0.1,0.5,0.7,0.4,bolmale01
72,53,Jamaree Bouyea,PG,23,WAS,1,0,6.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,,0.0,0.0,0.0,,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,bouyeja01
89,66,Moses Brown,C,23,BRK,2,0,3.0,0.0,0.0,,0.0,0.0,,0.0,0.0,,,0.0,0.0,,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.5,0.0,brownmo01
90,67,Sterling Brown,SG,27,LAL,4,0,6.0,0.0,1.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,,0.8,1.3,2.0,0.5,0.8,0.0,0.0,1.0,0.0,brownst02
98,73,Deonte Burton,SG,29,SAC,2,0,3.0,0.0,1.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,burtode02
99,74,Jared Butler,SG,22,OKC,6,1,12.8,2.5,5.3,0.469,1.2,2.3,0.5,1.3,3.0,0.444,0.578,0.0,0.0,,0.2,0.5,0.7,1.3,0.8,0.0,0.8,0.8,6.2,butleja02
113,87,Julian Champagnie,SF,21,PHI,2,0,3.5,0.0,1.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,champju02
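
In the rows displayed above, every missing percentage sits next to a zero in the matching attempts column. That can be checked for all columns at once instead of by eye. The snippet is a sketch on a hypothetical toy frame (`toy`) with values copied from the outputs above; the `pairs` mapping of percentage column to attempts column is an assumption based on the feature descriptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pergame_df_raw (values copied from rows shown above).
toy = pd.DataFrame({
    "Player": ["Udoka Azubuike", "Jamaree Bouyea", "Jared Butler", "Bam Adebayo"],
    "FGA": [2.0, 1.0, 5.3, 14.9], "FG%": [0.819, 0.0, 0.469, 0.54],
    "3PA": [0.0, 1.0, 2.3, 0.2], "3P%": [np.nan, 0.0, 0.5, 0.083],
    "2PA": [2.0, 0.0, 3.0, 14.7], "2P%": [0.819, np.nan, 0.444, 0.545],
    "FTA": [0.6, 0.0, 0.0, 5.4], "FT%": [0.35, np.nan, np.nan, 0.806],
})

# Percentage column -> attempts column it is derived from.
pairs = {"FG%": "FGA", "3P%": "3PA", "2P%": "2PA", "FT%": "FTA"}

# Count NaN percentages that are NOT explained by zero attempts.
unexplained = {
    pct: int((toy[pct].isna() & (toy[att] != 0)).sum())
    for pct, att in pairs.items()
}
print(unexplained)
```

If every count is zero on the real data as well, the NAs are all cases of a percentage with no attempts behind it.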


### 3.2.5. <a id='toc3_2_5_'></a>[Let's combine multiple player rows in one](#toc0_)

In [462]:
pergame_df = pergame_df_raw.groupby("Player_additional", as_index=False).agg(
                      {
                          'Rk':'first', 'Player':'first', 
                          'Pos':'first', 'Age':'first', 
                          'Tm':'first', 'G':'first', 
                          'GS':'first', 'MP':'mean', 
                          'FG':'mean', 'FGA':'mean', 
                          'FG%':'mean', '3P':'mean', 
                          '3PA':'mean', '3P%':'mean', 
                          '2P':'mean', '2PA':'mean', 
                          '2P%':'mean', 'eFG%':'mean', 
                          'FT':'mean', 'FTA':'mean', 
                          'FT%':'mean', 'ORB':'mean', 
                          'DRB':'mean', 'TRB':'mean', 
                          'AST':'mean', 'STL':'mean', 
                          'BLK':'mean', 'TOV':'mean', 
                          'PF':'mean', 'PTS':'mean', 
                          'Player_additional':'first'
                      }
                      )
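Averaging with `'mean'` gives every row of a traded player the same weight, and that appears to include the `TOT` (season total) row, so a short stint counts as much as a long one. The following is a minimal sketch of one alternative, not the method used in this notebook: keep only the `TOT` row when it exists. It assumes the `TOT` row already holds the season-long per-game figures. The toy frame, the player IDs, and the helper name `keep_season_row` are hypothetical.

```python
import pandas as pd

def keep_season_row(raw: pd.DataFrame, key: str = "Player_additional") -> pd.DataFrame:
    """Keep one row per player: the 'TOT' row if present, otherwise the only row."""
    # Flag season-total rows, sort them to the top of each player's group,
    # then keep the first row of every group.
    ordered = raw.assign(_is_total=raw["Tm"].eq("TOT")).sort_values(
        [key, "_is_total"], ascending=[True, False]
    )
    return (
        ordered.drop_duplicates(subset=key, keep="first")
        .drop(columns="_is_total")
        .reset_index(drop=True)
    )

# Hypothetical data: player A was traded (three rows), player B was not.
toy = pd.DataFrame({
    "Player_additional": ["playera01", "playera01", "playera01", "playerb01"],
    "Tm": ["AAA", "TOT", "BBB", "CCC"],
    "G": [50, 75, 25, 42],
    "PTS": [9.0, 10.0, 12.0, 8.6],
})
season = keep_season_row(toy)
print(season)
```

With this approach `G`, `GS` and the per-game columns all come from the same row, so no assumption about the row order inside each group is needed.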

#### 3.2.5.1. <a id='toc3_2_5_1_'></a>[Checking if the concatenation went as expected](#toc0_)

In [463]:
print(pergame_df.shape[0], 'out of', pergame_df_raw.shape[0])

539 out of 679


In [464]:
pergame_df['Player_additional'].nunique()

539

#### 3.2.5.2. <a id='toc3_2_5_2_'></a>[Checking again for NAs](#toc0_)
- We still have some NAs. Let's examine them further and decide how to deal with them

In [465]:
pergame_df.isna().sum()

Rk                    0
Player                0
Pos                   0
Age                   0
Tm                    0
G                     0
GS                    0
MP                    0
FG                    0
FGA                   0
FG%                   2
3P                    0
3PA                   0
3P%                  16
2P                    0
2PA                   0
2P%                   5
eFG%                  2
FT                    0
FTA                   0
FT%                  24
ORB                   0
DRB                   0
TRB                   0
AST                   0
STL                   0
BLK                   0
TOV                   0
PF                    0
PTS                   0
Player_additional     0
dtype: int64

### 3.2.6. <a id='toc3_2_6_'></a>[Filling out NAs](#toc0_)
- The NAs still present in the dataset are all in percentage columns (`FG%`, `3P%`, `2P%`, `eFG%`, `FT%`); they occur when the underlying attempts statistic is zero, so the percentage is undefined
- Because of that, we can impute zeros for the NAs

In [466]:
pergame_df = pergame_df.fillna(0)
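Before filling with zeros, it may be worth confirming that every missing percentage really does line up with zero attempts. The sketch below is one way to check that, run here on a hypothetical toy frame; the mapping `PCT_TO_ATTEMPTS` and the helper `unexplained_nas` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Each percentage column paired with the attempts column that is its denominator.
PCT_TO_ATTEMPTS = {"FG%": "FGA", "3P%": "3PA", "2P%": "2PA", "FT%": "FTA"}

def unexplained_nas(df: pd.DataFrame) -> dict:
    """Count NaN percentages that do NOT coincide with zero attempts."""
    return {
        pct: int((df[pct].isna() & df[att].ne(0)).sum())
        for pct, att in PCT_TO_ATTEMPTS.items()
    }

# Hypothetical rows: one player with no attempts at all, one with no free throws.
toy = pd.DataFrame({
    "FGA": [0.0, 5.3], "FG%": [np.nan, 0.469],
    "3PA": [0.0, 2.3], "3P%": [np.nan, 0.5],
    "2PA": [0.0, 3.0], "2P%": [np.nan, 0.444],
    "FTA": [0.0, 0.0], "FT%": [np.nan, np.nan],
})
report = unexplained_nas(toy)
print(report)
```

A count of zero for every column supports the zero imputation; any non-zero count would point to rows that deserve a closer look before calling `fillna(0)`.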

In [467]:
pergame_df.isna().sum()

Rk                   0
Player               0
Pos                  0
Age                  0
Tm                   0
G                    0
GS                   0
MP                   0
FG                   0
FGA                  0
FG%                  0
3P                   0
3PA                  0
3P%                  0
2P                   0
2PA                  0
2P%                  0
eFG%                 0
FT                   0
FTA                  0
FT%                  0
ORB                  0
DRB                  0
TRB                  0
AST                  0
STL                  0
BLK                  0
TOV                  0
PF                   0
PTS                  0
Player_additional    0
dtype: int64

### 3.2.7. <a id='toc3_2_7_'></a>[First glance at the Per Game Dataset](#toc0_)

In [468]:
pergame_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rk,539.0,270.0,155.740168,1.0,135.5,270.0,404.5,539.0
Age,539.0,25.727273,4.290326,19.0,23.0,25.0,28.5,42.0
G,539.0,48.042672,24.648006,1.0,30.5,54.0,68.0,83.0
GS,539.0,22.820037,27.295285,0.0,1.0,8.0,46.5,83.0
MP,539.0,19.752072,9.563387,1.0,12.266667,19.266667,28.3,41.0
FG,539.0,3.341868,2.437476,0.0,1.6,2.6,4.483333,11.2
FGA,539.0,7.089858,4.95845,0.0,3.4,5.8,9.4,22.2
FG%,539.0,0.463769,0.110612,0.0,0.4155,0.456,0.508,1.0
3P,539.0,0.988745,0.873156,0.0,0.3,0.8,1.5,4.9
3PA,539.0,2.778108,2.243782,0.0,1.0,2.4,4.1,11.4


In [469]:
pergame_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539 entries, 0 to 538
Data columns (total 31 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 539 non-null    int64  
 1   Player             539 non-null    object 
 2   Pos                539 non-null    object 
 3   Age                539 non-null    int64  
 4   Tm                 539 non-null    object 
 5   G                  539 non-null    int64  
 6   GS                 539 non-null    int64  
 7   MP                 539 non-null    float64
 8   FG                 539 non-null    float64
 9   FGA                539 non-null    float64
 10  FG%                539 non-null    float64
 11  3P                 539 non-null    float64
 12  3PA                539 non-null    float64
 13  3P%                539 non-null    float64
 14  2P                 539 non-null    float64
 15  2PA                539 non-null    float64
 16  2P%                539 non

In [470]:
# Generate a dataset profile report

# pergame_profile = ProfileReport(pergame_df, title = 'Per Game NBA Dataset Profile')
# pergame_profile.to_file('pergame_profile.html')
# pergame_profile

# 4. <a id='toc4_'></a>[Feature Engineering and Hypothesis Creation](#toc0_)
- Mental map for hypothesis and questions
- Hypothesis and questions list
- Fill out remaining NAs
- Derive new variables as needed

## 4.1. <a id='toc4_1_'></a>[Merging the two datasets and getting new columns](#toc0_)

In [471]:
df = pd.merge(advanced_df, pergame_df, how = 'left', on=['Player_additional', 'Player', 'Pos', 'Age', 'Tm', 'G', 'Rk'])
print(df)
print(df.shape)

      Rk                    Player    Pos  Age   Tm   G     MP_Total        PER       TS%      3PAr       FTr      ORB%      DRB%      TRB%      AST%      STL%      BLK%      TOV%      USG%        OWS       DWS         WS     WS_48       OBPM          DBPM        BPM      VORP Player_additional  GS         MP         FG        FGA       FG%        3P        3PA       3P%         2P        2PA       2P%      eFG%         FT        FTA       FT%       ORB       DRB        TRB        AST       STL       BLK       TOV        PF        PTS
0      1          Precious Achiuwa      C   23  TOR  55  1140.000000  15.200000  0.554000  0.267000  0.307000  0.093000  0.244000  0.163000  0.063000  0.013000  0.026000  0.114000  0.194000   0.800000  1.400000   2.200000  0.093000  -1.400000 -8.000000e-01  -2.300000 -0.100000         achiupr01  12  20.700000   3.600000   7.300000  0.485000  0.500000   2.000000  0.269000   3.000000   5.400000  0.564000  0.521000   1.600000   2.300000  0.702000  1.800000  
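The merge keys include aggregated columns such as `G` and `Rk`, so a player whose values differ between the two datasets would silently end up with NaNs on the per-game side of a left join. The sketch below shows one way to surface such rows, using the `validate` and `indicator` parameters of `pd.merge`. The toy frames and the helper name `checked_left_merge` are hypothetical.

```python
import pandas as pd

def checked_left_merge(left: pd.DataFrame, right: pd.DataFrame, keys: list):
    """Left-merge and also return the key rows that found no match on the right."""
    merged = pd.merge(
        left, right, how="left", on=keys,
        validate="one_to_one",  # raise if a key combination is duplicated on either side
        indicator=True,         # add a '_merge' column describing where each row came from
    )
    unmatched = merged.loc[merged["_merge"] == "left_only", keys]
    return merged.drop(columns="_merge"), unmatched

# Hypothetical data: player b01 has a different games count in the two frames.
left = pd.DataFrame({"Player_additional": ["a01", "b01"], "G": [55, 42], "PER": [15.2, 17.5]})
right = pd.DataFrame({"Player_additional": ["a01", "b01"], "G": [55, 41], "PTS": [9.2, 8.6]})

merged, unmatched = checked_left_merge(left, right, ["Player_additional", "G"])
print(unmatched)
```

An empty `unmatched` frame would indicate that every advanced-stats row found its per-game counterpart.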

### 4.1.1. <a id='toc4_1_1_'></a>[Creating some new features](#toc0_)

#### 4.1.1.1. <a id='toc4_1_1_1_'></a>[GM = Games Missed](#toc0_)

In [472]:
df['GM'] = 82 - df['G']
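The summary of `G` in the Per Game dataset reports a maximum of 83, which can happen when a player changes teams mid-season, so `82 - G` can come out negative. A minimal sketch of one way to guard against that, assuming a negative value should simply count as zero games missed; the constant `SEASON_GAMES` and the helper `games_missed` are illustrative names.

```python
import pandas as pd

SEASON_GAMES = 82  # regular-season games per team

def games_missed(games_played: pd.Series) -> pd.Series:
    """Games missed, floored at zero for players who appeared in more than 82 games."""
    return (SEASON_GAMES - games_played).clip(lower=0)

# Hypothetical games-played values, including one above the season length.
toy_games = pd.Series([55, 82, 83], name="G")
gm = games_missed(toy_games)
print(gm.tolist())
```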

### 4.1.2. <a id='toc4_1_2_'></a>[Reordering the columns](#toc0_)

In [473]:
df = df[['Rk', 'Player', 'Pos', 'Age', 'Tm', 
         'G', 'GS', 'GM',
         'MP_Total', 'MP', 'PER', 
         'USG%', 'OWS', 'DWS', 'WS', 'WS_48', 
         'OBPM', 'DBPM', 'BPM', 'VORP',
         'TS%', 'PTS', 
         'FG', 'FGA', 'FG%', 
         '3P', '3PA', '3P%', '3PAr',
         '2P', '2PA', '2P%', 'eFG%', 
         'FT', 'FTA', 'FT%', 'FTr',
         'ORB', 'ORB%', 
         'DRB', 'DRB%', 
         'TRB', 'TRB%',
         'AST', 'AST%',
         'STL', 'STL%',
         'BLK','BLK%',
         'TOV', 'TOV%',
         'PF', 'Player_additional']]
df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,GM,MP_Total,MP,PER,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,TS%,PTS,FG,FGA,FG%,3P,3PA,3P%,3PAr,2P,2PA,2P%,eFG%,FT,FTA,FT%,FTr,ORB,ORB%,DRB,DRB%,TRB,TRB%,AST,AST%,STL,STL%,BLK,BLK%,TOV,TOV%,PF,Player_additional
0,1,Precious Achiuwa,C,23,TOR,55,12,27,1140.0,20.7,15.2,0.194,0.8,1.4,2.2,0.093,-1.4,-0.8,-2.3,-0.1,0.554,9.2,3.6,7.3,0.485,0.5,2.0,0.269,0.267,3.0,5.4,0.564,0.521,1.6,2.3,0.702,0.307,1.8,0.093,4.1,0.244,6.0,0.163,0.9,0.063,0.6,0.013,0.5,0.026,1.1,0.114,1.9,achiupr01
1,2,Steven Adams,C,29,MEM,42,42,40,1133.0,27.0,17.5,0.146,1.3,2.1,3.4,0.144,-0.3,0.9,0.6,0.7,0.564,8.6,3.7,6.3,0.597,0.0,0.0,0.0,0.004,3.7,6.2,0.599,0.597,1.1,3.1,0.364,0.49,5.1,0.201,6.5,0.253,11.5,0.227,2.3,0.112,0.9,0.015,1.1,0.037,1.9,0.198,2.3,adamsst01
2,3,Bam Adebayo,C,25,MIA,75,75,7,2598.0,34.6,20.1,0.252,3.6,3.8,7.4,0.137,0.8,0.8,1.5,2.3,0.592,20.4,8.0,14.9,0.54,0.0,0.2,0.083,0.011,8.0,14.7,0.545,0.541,4.3,5.4,0.806,0.361,2.5,0.08,6.7,0.236,9.2,0.155,3.2,0.159,1.2,0.017,0.8,0.024,2.5,0.127,2.8,adebaba01
3,4,Ochai Agbaji,SG,22,UTA,59,22,23,1209.0,20.5,9.5,0.158,0.9,0.4,1.3,0.053,-1.7,-1.4,-3.0,-0.3,0.561,7.9,2.8,6.5,0.427,1.4,3.9,0.355,0.591,1.4,2.7,0.532,0.532,0.9,1.2,0.812,0.179,0.7,0.039,1.3,0.069,2.1,0.054,1.1,0.075,0.3,0.006,0.3,0.01,0.7,0.09,1.7,agbajoc01
4,5,Santi Aldama,PF,22,MEM,77,20,5,1682.0,21.8,13.9,0.16,2.1,2.4,4.6,0.13,-0.3,0.8,0.5,1.1,0.591,9.0,3.2,6.8,0.47,1.2,3.5,0.353,0.507,2.0,3.4,0.591,0.56,1.4,1.9,0.75,0.274,1.1,0.054,3.7,0.18,4.8,0.117,1.3,0.076,0.6,0.013,0.6,0.026,0.8,0.093,1.9,aldamsa01


### 4.1.3. <a id='toc4_1_3_'></a>[Changing rows with odd player positions](#toc0_)

In [474]:
df[(df['Pos'] == 'SF-SG') | (df['Pos'] == 'SG-PG')]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,GM,MP_Total,MP,PER,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,TS%,PTS,FG,FGA,FG%,3P,3PA,3P%,3PAr,2P,2PA,2P%,eFG%,FT,FTA,FT%,FTr,ORB,ORB%,DRB,DRB%,TRB,TRB%,AST,AST%,STL,STL%,BLK,BLK%,TOV,TOV%,PF,Player_additional
199,199,Josh Hart,SF-SG,27,TOT,76,52,6,1636.0,31.9,14.633333,0.126,2.533333,1.466667,4.0,0.129333,0.1,1.266667,1.366667,1.2,0.637,9.833333,3.633333,6.733333,0.539667,0.866667,2.166667,0.398333,0.318333,2.766667,4.6,0.604667,0.603667,1.733333,2.3,0.756667,0.343,1.9,0.066,5.8,0.206333,7.666667,0.135667,3.766667,0.161,1.233333,0.018333,0.333333,0.009333,1.5,0.165667,2.566667,hartjo01
365,366,Kendrick Nunn,SG-PG,27,TOT,70,2,12,642.666667,13.8,11.3,0.237333,-0.4,0.533333,0.133333,0.01,-2.0,-1.466667,-3.4,-0.266667,0.532,7.1,2.733333,6.4,0.425667,1.133333,3.166667,0.357,0.495,1.6,3.233333,0.493667,0.514333,0.5,0.566667,0.854667,0.092,0.2,0.018333,1.333333,0.103,1.566667,0.062333,1.333333,0.141,0.4,0.014333,0.1,0.006333,0.933333,0.124667,0.933333,nunnke01


In [475]:
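# Keep only the first listed position, selecting rows by value rather than by hard-coded position
df.loc[df['Pos'] == 'SF-SG', 'Pos'] = 'SF'
df.loc[df['Pos'] == 'SG-PG', 'Pos'] = 'SG'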

In [476]:
df[(df['Pos'] == 'SF-SG') | (df['Pos'] == 'SG-PG')]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,GM,MP_Total,MP,PER,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,TS%,PTS,FG,FGA,FG%,3P,3PA,3P%,3PAr,2P,2PA,2P%,eFG%,FT,FTA,FT%,FTr,ORB,ORB%,DRB,DRB%,TRB,TRB%,AST,AST%,STL,STL%,BLK,BLK%,TOV,TOV%,PF,Player_additional
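The filter above only looks for `SF-SG` and `SG-PG`, while the `value_counts()` output later in the notebook still lists three `PF-SF` rows. A minimal sketch of a more general rule, assuming the first listed position is always the one to keep (which matches the two manual fixes); the helper name `primary_position` is hypothetical.

```python
import pandas as pd

def primary_position(pos: pd.Series) -> pd.Series:
    """Reduce hyphenated positions such as 'SF-SG' to the first listed position."""
    return pos.str.split("-").str[0]

# Hypothetical position labels covering the three hyphenated values seen in the data.
toy_pos = pd.Series(["SF-SG", "SG-PG", "PF-SF", "C"])
clean = primary_position(toy_pos)
print(clean.tolist())
```

Applied to the notebook's frame, this would read `df['Pos'] = primary_position(df['Pos'])`.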


## 4.2. <a id='toc4_2_'></a>[Exporting the merged dataset as a csv file](#toc0_)

In [477]:
df.to_csv('~/repos/NBA_2022-2023/data/df.csv')
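By default `to_csv` also writes the DataFrame index as an unnamed first column, which is why an extra `Unnamed: 0` column shows up when the file is read back. A small sketch of the round trip with `index=False`, using an in-memory buffer and a hypothetical toy frame instead of the real path:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Player": ["A", "B"], "PTS": [9.2, 8.6]})

# Write without the index so no extra unnamed column is created on reload.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))
```

An alternative that leaves the export untouched is to pass `index_col=0` to `pd.read_csv`.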

# 5. <a id='toc5_'></a>[Data selection and filtering](#toc0_)
- Filter data rows
- Filter data columns

## 5.1. <a id='toc5_1_'></a>[Importing merged dataset from csv file](#toc0_)

In [478]:
df05 = pd.read_csv('~/repos/NBA_2022-2023/data/df.csv', low_memory=False)

# 6. <a id='toc6_'></a>[Exploratory Data Analysis](#toc0_)
- Answer the hypothesis list
- Build data visualization solutions and plots

## 6.1. <a id='toc6_1_'></a>[Importing merged dataset from csv file](#toc0_)

In [479]:
df06 = pd.read_csv('~/repos/NBA_2022-2023/data/df.csv', low_memory=False)

In [480]:
df06.head()

Unnamed: 0.1,Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,GM,MP_Total,MP,PER,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,TS%,PTS,FG,FGA,FG%,3P,3PA,3P%,3PAr,2P,2PA,2P%,eFG%,FT,FTA,FT%,FTr,ORB,ORB%,DRB,DRB%,TRB,TRB%,AST,AST%,STL,STL%,BLK,BLK%,TOV,TOV%,PF,Player_additional
0,0,1,Precious Achiuwa,C,23,TOR,55,12,27,1140.0,20.7,15.2,0.194,0.8,1.4,2.2,0.093,-1.4,-0.8,-2.3,-0.1,0.554,9.2,3.6,7.3,0.485,0.5,2.0,0.269,0.267,3.0,5.4,0.564,0.521,1.6,2.3,0.702,0.307,1.8,0.093,4.1,0.244,6.0,0.163,0.9,0.063,0.6,0.013,0.5,0.026,1.1,0.114,1.9,achiupr01
1,1,2,Steven Adams,C,29,MEM,42,42,40,1133.0,27.0,17.5,0.146,1.3,2.1,3.4,0.144,-0.3,0.9,0.6,0.7,0.564,8.6,3.7,6.3,0.597,0.0,0.0,0.0,0.004,3.7,6.2,0.599,0.597,1.1,3.1,0.364,0.49,5.1,0.201,6.5,0.253,11.5,0.227,2.3,0.112,0.9,0.015,1.1,0.037,1.9,0.198,2.3,adamsst01
2,2,3,Bam Adebayo,C,25,MIA,75,75,7,2598.0,34.6,20.1,0.252,3.6,3.8,7.4,0.137,0.8,0.8,1.5,2.3,0.592,20.4,8.0,14.9,0.54,0.0,0.2,0.083,0.011,8.0,14.7,0.545,0.541,4.3,5.4,0.806,0.361,2.5,0.08,6.7,0.236,9.2,0.155,3.2,0.159,1.2,0.017,0.8,0.024,2.5,0.127,2.8,adebaba01
3,3,4,Ochai Agbaji,SG,22,UTA,59,22,23,1209.0,20.5,9.5,0.158,0.9,0.4,1.3,0.053,-1.7,-1.4,-3.0,-0.3,0.561,7.9,2.8,6.5,0.427,1.4,3.9,0.355,0.591,1.4,2.7,0.532,0.532,0.9,1.2,0.812,0.179,0.7,0.039,1.3,0.069,2.1,0.054,1.1,0.075,0.3,0.006,0.3,0.01,0.7,0.09,1.7,agbajoc01
4,4,5,Santi Aldama,PF,22,MEM,77,20,5,1682.0,21.8,13.9,0.16,2.1,2.4,4.6,0.13,-0.3,0.8,0.5,1.1,0.591,9.0,3.2,6.8,0.47,1.2,3.5,0.353,0.507,2.0,3.4,0.591,0.56,1.4,1.9,0.75,0.274,1.1,0.054,3.7,0.18,4.8,0.117,1.3,0.076,0.6,0.013,0.6,0.026,0.8,0.093,1.9,aldamsa01


## 6.2. <a id='toc6_2_'></a>[First graphs](#toc0_)

### 6.2.1. <a id='toc6_2_1_'></a>[How are Points Per Game distributed across player positions?](#toc0_)

In [481]:
fig = px.box(data_frame = df06,
       x = 'Pos',
       y = 'PTS',
       color = 'Pos',
       hover_name = 'Player',
       title = 'Points per Game by Position',
       labels = {'PTS':'Points per Game',
                 'Pos':'Position'},
       category_orders = {'Pos':('PG', 'SG', 'SF', 'PF', 'C', 'PF-SF', 'SF-SG', 'SG-PG')})

fig.show()

### How are 3-Pointers Made Per Game distributed across player positions?

In [482]:
px.box(data_frame = df06,
        x = 'Pos',
        y = '3P',
        color = 'Pos',
        hover_name = 'Player',
        title = '3 Points per Game by Position',
        labels = {'3P':'3 Points per Game',
                        'Pos':'Position'},
        category_orders = {'Pos':('PG', 'SG', 'SF', 'PF', 'C', 'PF-SF', 'SF-SG', 'SG-PG')})

### How are Field Goals Per Game distributed across player positions?

In [483]:
px.box(data_frame = df06,
       x = 'Pos',
       y = 'FG',
       color = 'Pos',
       hover_name = 'Player',
       title = 'Field Goals per Game by Position',
       labels = {'FG':'Field Goals', 'Pos': 'Position'},
       category_orders = {'Pos':('PG', 'SG', 'SF', 'PF', 'C', 'PF-SF', 'SF-SG', 'SG-PG')})

### How are Personal Fouls Per Game distributed across player positions?

In [484]:
px.box(data_frame = df06,
       x = 'Pos',
       y = 'PF',
       color = 'Pos',
       hover_name = 'Player',
       title = 'Personal Fouls per Game by Position',
       labels = {'PF':'Personal Fouls', 'Pos': 'Position'},
       category_orders = {'Pos':('PG', 'SG', 'SF', 'PF', 'C', 'PF-SF', 'SF-SG', 'SG-PG')})

### How are Turnovers Per Game distributed across player positions?

In [485]:
px.box(data_frame = df06,
       x = 'Pos',
       y = 'TOV',
       color = 'Pos',
       hover_name = 'Player',
       title = 'Turn-Overs per Game by Position',
       labels = {'TOV':'Turn-Overs', 'Pos': 'Position'},
       category_orders = {'Pos':('PG', 'SG', 'SF', 'PF', 'C', 'PF-SF', 'SF-SG', 'SG-PG')})

### How are Blocks Per Game distributed across player positions?

In [486]:
px.box(data_frame = df06,
       x = 'Pos',
       y = 'BLK',
       color = 'Pos',
       hover_name = 'Player',
       title = 'Blocks per Game by Position',
       labels = {'BLK':'Blocks', 'Pos': 'Position'},
       category_orders = {'Pos':('PG', 'SG', 'SF', 'PF', 'C', 'PF-SF', 'SF-SG', 'SG-PG')})
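The box plots above differ only in the plotted column and its label. As a sketch of how the shared arguments could be built once, the helper below returns a keyword dictionary; the names `POSITION_ORDER` and `box_kwargs` are hypothetical, and the title pattern is an assumption modelled on the titles used above.

```python
# Position order copied from the plots above.
POSITION_ORDER = ["PG", "SG", "SF", "PF", "C", "PF-SF", "SF-SG", "SG-PG"]

def box_kwargs(column: str, label: str) -> dict:
    """Build the keyword arguments shared by the per-position box plots."""
    return {
        "x": "Pos",
        "y": column,
        "color": "Pos",
        "hover_name": "Player",
        "title": f"{label} per Game by Position",
        "labels": {column: label, "Pos": "Position"},
        "category_orders": {"Pos": POSITION_ORDER},
    }

blocks_args = box_kwargs("BLK", "Blocks")
print(blocks_args["title"])
```

Each plot then reduces to a call such as `px.box(data_frame=df06, **box_kwargs("BLK", "Blocks"))`.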

In [487]:
df06['Pos'].value_counts()

SG       139
PG       102
C        101
SF        98
PF        96
PF-SF      3
Name: Pos, dtype: int64

## Testing some radar charts

### Pre-processing Data to Chart

In [586]:
player = 'Jayson Tatum'


In [587]:
df06[df06['Player'] == player]

Unnamed: 0.1,Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,GM,MP_Total,MP,PER,USG%,OWS,DWS,WS,WS_48,OBPM,DBPM,BPM,VORP,TS%,PTS,FG,FGA,FG%,3P,3PA,3P%,3PAr,2P,2PA,2P%,eFG%,FT,FTA,FT%,FTr,ORB,ORB%,DRB,DRB%,TRB,TRB%,AST,AST%,STL,STL%,BLK,BLK%,TOV,TOV%,PF,Player_additional
464,464,465,Jayson Tatum,PF,24,BOS,74,74,8,2732.0,36.9,23.7,0.327,6.2,4.3,10.5,0.185,4.8,0.7,5.5,5.1,0.607,30.1,9.8,21.1,0.466,3.2,9.3,0.35,0.44,6.6,11.8,0.558,0.543,7.2,8.4,0.854,0.399,1.1,0.032,7.7,0.225,8.8,0.13,4.6,0.209,1.1,0.014,0.7,0.016,2.9,0.104,2.2,tatumja01


### Full Chart

In [588]:
aux = df06[df06['Player'] == player][['FG', '3P', 'TS%', 'FT', 'AST', 'ORB', 'DRB', 'STL', 'BLK', 'DWS']].T
aux.columns = [player]
aux.iloc[0] = aux.iloc[0]/df06['FG'].max()
aux.iloc[1] = aux.iloc[1]/df06['3P'].max()
aux.iloc[2] = aux.iloc[2]/df06['TS%'].max()
aux.iloc[3] = aux.iloc[3]/df06['FT'].max()
aux.iloc[4] = aux.iloc[4]/df06['AST'].max()
aux.iloc[5] = aux.iloc[5]/df06['ORB'].max()
aux.iloc[6] = aux.iloc[6]/df06['DRB'].max()
aux.iloc[7] = aux.iloc[7]/df06['STL'].max()
aux.iloc[8] = aux.iloc[8]/df06['BLK'].max()
aux.iloc[9] = aux.iloc[9]/df06['DWS'].max()
aux

Unnamed: 0,Jayson Tatum
FG,0.875
3P,0.653061
TS%,0.570489
FT,0.72
AST,0.429907
ORB,0.215686
DRB,0.802083
STL,0.366667
BLK,0.233333
DWS,0.895833
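The row-by-row division above scales each of the player's statistics by the league maximum of the same column. A minimal vectorised sketch of the same idea, shown on a hypothetical toy frame; the helper name `scale_to_league_max` is illustrative.

```python
import pandas as pd

def scale_to_league_max(df: pd.DataFrame, player: str, cols: list) -> pd.DataFrame:
    """Return the player's stats divided by the league maximum of each column."""
    player_row = df.loc[df["Player"] == player, cols].iloc[0]
    # Both Series are indexed by column name, so the division aligns stat by stat.
    return (player_row / df[cols].max()).to_frame(name=player)

# Hypothetical two-player league.
toy = pd.DataFrame({"Player": ["A", "B"], "FG": [8.0, 4.0], "AST": [2.0, 10.0]})
aux_toy = scale_to_league_max(toy, "B", ["FG", "AST"])
print(aux_toy)
```

Because the result is indexed by statistic name with a single column named after the player, it has the same shape as `aux` and can be passed to `px.line_polar` in the same way.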


In [592]:
fig_full = px.line_polar(data_frame=aux,
             r=player,
             theta=aux.index,
             color_discrete_sequence=px.colors.sequential.Plasma_r, 
             template="plotly_dark",
             title= f"Full Chart - {player}",
             line_close=True,
             markers=False,
             range_r=[0, 1])
fig_full.update_layout(title_text=f"Full Chart - {player}", 
                       title_x=0.5)
fig_full.update_traces(fill = 'toself')

### Offensive Chart

In [None]:
aux = df06[df06['Player'] == player][['FG%', '3P%', 'TS%', 'FT%', 'AST%', 'ORB%', 'DRB%', 'STL%', 'BLK%', 'DWS']].T
aux.columns = [player]
# Scale Defensive Win Shares by the league maximum so it shares the 0-1 range of the percentage columns
aux.iloc[9] = aux.iloc[9]/df06['DWS'].max()
aux

Unnamed: 0,Marcus Smart
FG%,0.415
3P%,0.336
TS%,0.538
FT%,0.746
AST%,0.264
ORB%,0.026
DRB%,0.08
STL%,0.023
BLK%,0.01
DWS,0.5625


In [543]:
fig_off = px.line_polar(data_frame=aux,
             r=player,
             theta=aux.index,
             color_discrete_sequence=px.colors.sequential.Plasma_r, 
             template="plotly_dark",
             title= f"Offensive - {player}",
             line_close=True,
             markers=False,
             range_r=[0, 1])
fig_off.update_layout(title_text=f"Offensive - {player}", title_x=0.5)

### Defensive Chart

In [538]:
aux2 = df06[df06['Player'] == player][['DRB%', 'STL%', 'BLK%', 'DWS']].T
aux2.columns = [player]
# Scale Defensive Win Shares by the league maximum instead of a hard-coded 4.8
aux2.iloc[3] = aux2.iloc[3]/df06['DWS'].max()
aux2

Unnamed: 0,Marcus Smart
DRB%,0.08
STL%,0.023
BLK%,0.01
DWS,0.5625


In [539]:
fig_def = px.line_polar(data_frame=aux2,
             r=player,
             theta=aux2.index,
             color_discrete_sequence=px.colors.sequential.Plasma_r, 
             template="plotly_dark",
             title=f"Defensive - {player}",
             line_close=True,
             markers=False,
             range_r=[0, 1])
fig_def.update_layout(title_text=f"Defensive - {player}", title_x=0.5)

---

### Another approach

In [581]:
# import plotly.graph_objects as go

# categories = ['processing cost','mechanical properties','chemical stability',
#               'thermal stability', 'device integration']

# fig = go.Figure()

# fig.add_trace(go.Scatterpolar(
#       r=[1, 5, 2, 2, 3],
#       theta=categories,
#       fill='toself',
#       name='Product A'
# ))
# fig.add_trace(go.Scatterpolar(
#       r=[4, 3, 2.5, 1, 2],
#       theta=categories,
#       fill='toself',
#       name='Product B'
# ))

# fig.update_layout(
#   polar=dict(
#     radialaxis=dict(
#       visible=True,
#       range=[0, 5]
#     )),
#   showlegend=False
# )

# fig.show()

# 7. <a id='toc7_'></a>[Data Preparation](#toc0_)
- Normalize, re-scale and transform (encoding) variables to suit model requirements
- It may be a good idea to normalize all of the features so they are comparable in magnitude
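As a rough illustration of the normalization idea, the sketch below applies min-max scaling with plain pandas so that every selected column ends up in the 0-1 range. It is only one of several possible scalers, and the helper name `min_max_scale` and the toy frame are hypothetical.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return a copy of df with the given columns rescaled to the 0-1 range."""
    scaled = df.copy()
    span = df[cols].max() - df[cols].min()
    # A constant column has zero span; dividing by 1 instead maps it to all zeros.
    scaled[cols] = (df[cols] - df[cols].min()) / span.replace(0, 1)
    return scaled

# Hypothetical features with very different magnitudes.
toy = pd.DataFrame({"Player": ["A", "B", "C"], "PTS": [0.0, 15.0, 30.0], "Age": [20, 25, 30]})
scaled_toy = min_max_scale(toy, ["PTS", "Age"])
print(scaled_toy)
```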

# 8. <a id='toc8_'></a>[Feature Selection through Boruta algorithm](#toc0_)
- Use Boruta algorithm to select best features to machine learning models

# 9. <a id='toc9_'></a>[Model implementation](#toc0_)
- Implement different machine learning models and algorithms
- Conduct cross-validation computing
- Conduct single performance metrics computing

# 10. <a id='toc10_'></a>[Hyperparameter Fine-Tuning](#toc0_)
- Implement hyperparameter search (Bayes Search) to find best model hyperparameter values
- Re-train model using best values

# 11. <a id='toc11_'></a>[Model Error Estimation and Interpretation](#toc0_)
- Use model errors to interpret the goals 

# 12. <a id='toc12_'></a>[Model Deployment](#toc0_)
- Deploy the model to a cloud service so it can be used by its consumers