# 1. <a id='toc1_'></a>[NBA Season 2022-2023 Analysis](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- 1. [NBA Season 2022-2023 Analysis](#toc1_)    
- 2. [Imports](#toc2_)    
  - 2.1. [Libraries](#toc2_1_)    
  - 2.2. [Data loading](#toc2_2_)    
- 3. [Data exploration and problem comprehension](#toc3_)    
  - 3.1. [Examining the Advanced Dataset](#toc3_1_)    
    - 3.1.1. [Features from Advanced Dataset](#toc3_1_1_)    
    - 3.1.2. [What are we dealing with?](#toc3_1_2_)    
    - 3.1.3. [Checking for NAs](#toc3_1_3_)    
    - 3.1.4. [Do these players have multiple lines due to team exchanges?](#toc3_1_4_)    
    - 3.1.5. [Renaming and dropping empty columns](#toc3_1_5_)    
    - 3.1.6. [First glance at the Advanced Dataset](#toc3_1_6_)    
    - 3.1.7. [Imputing values to the missing data](#toc3_1_7_)    
  - 3.2. [Examining Per Game Dataset](#toc3_2_)    
    - 3.2.1. [Features from Per Game Dataset](#toc3_2_1_)    
- 4. [Feature Engineering and Hypothesis Creation](#toc4_)    
- 5. [Data selection and filtering](#toc5_)    
- 6. [Exploratory Data Analysis](#toc6_)    
- 7. [Data Preparation](#toc7_)    
- 8. [Feature Selection through Boruta algorithm](#toc8_)    
- 9. [Model implementation](#toc9_)    
- 10. [Hyperparameter Fine-Tuning](#toc10_)    
- 11. [Model Error Estimation and Interpretation](#toc11_)    
- 12. [Model Deployment](#toc12_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# 2. <a id='toc2_'></a>[Imports](#toc0_)

## 2.1. <a id='toc2_1_'></a>[Libraries](#toc0_)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from ydata_profiling import ProfileReport

## 2.2. <a id='toc2_2_'></a>[Data loading](#toc0_)

In [2]:
advanced_df = pd.read_csv('./data/data_advanced.csv')
pergame_df = pd.read_csv('./data/data_pergame.csv')

# 3. <a id='toc3_'></a>[Data exploration and problem comprehension](#toc0_)
- Main goal/problem
- Sub-goals
- What will the finished product be?

## 3.1. <a id='toc3_1_'></a>[Examining the Advanced Dataset](#toc0_)

### 3.1.1. <a id='toc3_1_1_'></a>[Features from Advanced Dataset](#toc0_)


- Rk -- Rank

- Pos -- Position

- Age -- Player's age on February 1 of the season

- Tm -- Team

- G -- Games

- MP -- Minutes Played

- PER -- Player Efficiency Rating. A measure of per-minute production standardized such that the league average is 15.

- TS% -- True Shooting Percentage. A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.

- 3PAr -- 3-Point Attempt Rate. Percentage of FG Attempts from 3-Point Range

- FTr -- Free Throw Attempt Rate. Number of FT Attempts Per FG Attempt

- ORB% -- Offensive Rebound Percentage. An estimate of the percentage of available offensive rebounds a player grabbed while they were on the floor.

- DRB% -- Defensive Rebound Percentage. An estimate of the percentage of available defensive rebounds a player grabbed while they were on the floor.

- TRB% -- Total Rebound Percentage. An estimate of the percentage of available rebounds a player grabbed while they were on the floor.

- AST% -- Assist Percentage. An estimate of the percentage of teammate field goals a player assisted while they were on the floor.

- STL% -- Steal Percentage. An estimate of the percentage of opponent possessions that end with a steal by the player while they were on the floor.

- BLK% -- Block Percentage. An estimate of the percentage of opponent two-point field goal attempts blocked by the player while they were on the floor.

- TOV% -- Turnover Percentage. An estimate of turnovers committed per 100 plays.

- USG% -- Usage Percentage. An estimate of the percentage of team plays used by a player while they were on the floor.

- OWS -- Offensive Win Shares. An estimate of the number of wins contributed by a player due to offense.

- DWS -- Defensive Win Shares. An estimate of the number of wins contributed by a player due to defense.

- WS -- Win Shares. An estimate of the number of wins contributed by a player.

- WS/48 -- Win Shares Per 48 Minutes. An estimate of the number of wins contributed by a player per 48 minutes (league average is approximately .100)

- OBPM -- Offensive Box Plus/Minus. A box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team.

- DBPM -- Defensive Box Plus/Minus. A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team.

- BPM -- Box Plus/Minus. A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team.

- VORP -- Value over Replacement Player. A box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season. Multiply by 2.70 to convert to wins over replacement.

### 3.1.2. <a id='toc3_1_2_'></a>[What are we dealing with?](#toc0_)

In [3]:
advanced_df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
0,1,Precious Achiuwa,C,23,TOR,55,1140,15.2,0.554,0.267,...,0.8,1.4,2.2,0.093,,-1.4,-0.8,-2.3,-0.1,achiupr01
1,2,Steven Adams,C,29,MEM,42,1133,17.5,0.564,0.004,...,1.3,2.1,3.4,0.144,,-0.3,0.9,0.6,0.7,adamsst01
2,3,Bam Adebayo,C,25,MIA,75,2598,20.1,0.592,0.011,...,3.6,3.8,7.4,0.137,,0.8,0.8,1.5,2.3,adebaba01
3,4,Ochai Agbaji,SG,22,UTA,59,1209,9.5,0.561,0.591,...,0.9,0.4,1.3,0.053,,-1.7,-1.4,-3.0,-0.3,agbajoc01
4,5,Santi Aldama,PF,22,MEM,77,1682,13.9,0.591,0.507,...,2.1,2.4,4.6,0.13,,-0.3,0.8,0.5,1.1,aldamsa01


In [4]:
advanced_df.shape

(679, 30)

### 3.1.3. <a id='toc3_1_3_'></a>[Checking for NAs](#toc0_)
- The columns 'TS%', '3PAr' and 'FTr' each have three NAs, and the same three rows are missing all three features. 'TOV%' also has a single NA. Let's inspect these rows to figure out why the values are missing and what to do about them.
- Columns 'Unnamed: 19' and 'Unnamed: 24' are completely empty and should be dropped.

In [5]:
advanced_df.isna().sum()

Rk                     0
Player                 0
Pos                    0
Age                    0
Tm                     0
G                      0
MP                     0
PER                    0
TS%                    3
3PAr                   3
FTr                    3
ORB%                   0
DRB%                   0
TRB%                   0
AST%                   0
STL%                   0
BLK%                   0
TOV%                   1
USG%                   0
Unnamed: 19          679
OWS                    0
DWS                    0
WS                     0
WS/48                  0
Unnamed: 24          679
OBPM                   0
DBPM                   0
BPM                    0
VORP                   0
Player-additional      0
dtype: int64

In [6]:
advanced_df[advanced_df['TS%'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,...,0.0,0.0,0.0,-0.129,,-12.7,2.8,-9.9,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,...,0.0,0.0,0.0,0.01,,-7.2,-1.9,-9.2,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,...,-0.1,0.0,-0.1,-0.517,,-21.3,-5.2,-26.5,0.0,willial06


In [7]:
advanced_df[advanced_df['3PAr'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,...,0.0,0.0,0.0,-0.129,,-12.7,2.8,-9.9,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,...,0.0,0.0,0.0,0.01,,-7.2,-1.9,-9.2,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,...,-0.1,0.0,-0.1,-0.517,,-21.3,-5.2,-26.5,0.0,willial06


In [8]:
advanced_df[advanced_df['FTr'].isna()]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,...,0.0,0.0,0.0,-0.129,,-12.7,2.8,-9.9,0.0,brownmo01
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,...,0.0,0.0,0.0,0.01,,-7.2,-1.9,-9.2,0.0,fostemi02
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,...,-0.1,0.0,-0.1,-0.517,,-21.3,-5.2,-26.5,0.0,willial06
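
The three filters above return the same rows, which can also be confirmed in a single step. The sketch below is only an illustration on a small made-up frame (`demo_df` and its values are hypothetical, not the real dataset): if the "any column missing" mask equals the "all columns missing" mask, every affected row is missing all three features at once.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame standing in for advanced_df (values are made up).
demo_df = pd.DataFrame({
    "TS%":  [0.554, np.nan, 0.592, np.nan],
    "3PAr": [0.267, np.nan, 0.011, np.nan],
    "FTr":  [0.300, np.nan, 0.400, np.nan],
})

shooting_cols = ["TS%", "3PAr", "FTr"]
na_mask = demo_df[shooting_cols].isna()

rows_any = na_mask.any(axis=1)  # rows missing at least one of the columns
rows_all = na_mask.all(axis=1)  # rows missing all of the columns

# True when the NAs always occur together in the same rows.
same_rows = rows_any.equals(rows_all)
missing_rows = demo_df[rows_all]

print(same_rows)
print(missing_rows)
```

On the real data, replacing `demo_df` with `advanced_df` should be enough, since the column names are the same.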


### 3.1.4. <a id='toc3_1_4_'></a>[Do these players have multiple lines due to team exchanges?](#toc0_)
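- Moses Brown appears in three rows because he played for two teams this season: one row per team (LAC and BRK) plus a season-total row (Tm = 'TOT', whose G and MP are the sums of the two team rows). Only the BRK row has NAs, so consolidating his rows into one may be a good option.
- Michael Foster Jr. and Alondes Williams each appear in a single row, so their missing values are probably due to the stats being impossible to calculate from so few minutes. A possible solution is to use 0.0 as the value or to attempt to estimate it from the Per Game Dataset.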

In [9]:
advanced_df[advanced_df['Player-additional'] == 'brownmo01']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
87,66,Moses Brown,C,23,TOT,36,294,22.2,0.607,0.0,...,0.7,0.4,1.1,0.179,,0.6,-1.2,-0.6,0.1,brownmo01
88,66,Moses Brown,C,23,LAC,34,288,22.7,0.607,0.0,...,0.7,0.4,1.1,0.185,,0.9,-1.3,-0.4,0.1,brownmo01
89,66,Moses Brown,C,23,BRK,2,6,-2.6,,,...,0.0,0.0,0.0,-0.129,,-12.7,2.8,-9.9,0.0,brownmo01


In [10]:
advanced_df[advanced_df['Player-additional'] == 'fostemi02']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
196,151,Michael Foster Jr.,PF,20,PHI,1,1,0.0,,,...,0.0,0.0,0.0,0.01,,-7.2,-1.9,-9.2,0.0,fostemi02


In [11]:
advanced_df[advanced_df['Player-additional'] == 'willial06']

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP,Player-additional
652,515,Alondes Williams,SG,23,BRK,1,5,-20.9,,,...,-0.1,0.0,-0.1,-0.517,,-21.3,-5.2,-26.5,0.0,willial06
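
A likely reason why these three features are missing together is that all of them divide by shot attempts. The sketch below uses the standard Basketball Reference definitions (TS% = PTS / (2 * (FGA + 0.44 * FTA)), 3PAr = 3PA / FGA, FTr = FTA / FGA). The function name `shooting_rates` and the input numbers are hypothetical, and the assumption that the three players above attempted no shots has not been verified against the data.

```python
import math

def shooting_rates(pts, fga, fta, three_pa):
    """Return (TS%, 3PAr, FTr) from totals; NaN when a denominator is zero."""
    def ratio(num, den):
        # Mimic what a stats table shows for an undefined 0/0 ratio.
        return num / den if den else math.nan

    ts = ratio(pts, 2 * (fga + 0.44 * fta))
    three_par = ratio(three_pa, fga)
    ftr = ratio(fta, fga)
    return ts, three_par, ftr

# Hypothetical player with attempts: every rate is defined.
with_attempts = shooting_rates(pts=20, fga=10, fta=5, three_pa=4)

# Hypothetical player with no shot attempts at all: every rate is NaN.
no_attempts = shooting_rates(pts=0, fga=0, fta=0, three_pa=0)

print(with_attempts)
print(no_attempts)
```

If that assumption holds, the Per Game Dataset would not help recover these values either, because the same attempt counts would be zero there.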


### 3.1.5. <a id='toc3_1_5_'></a>[Renaming and dropping empty columns](#toc0_)

In [12]:
dropped_columns = ['Unnamed: 19', 'Unnamed: 24']
advanced_df = advanced_df.drop(columns=dropped_columns)

In [13]:
advanced_df.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%',
       'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP',
       'Player-additional'],
      dtype='object')

In [14]:
advanced_cols = ['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%',
       'OWS', 'DWS', 'WS', 'WS_48', 'OBPM', 'DBPM', 'BPM', 'VORP',
       'Player_additional']

advanced_df.columns = advanced_cols

In [15]:
advanced_df.shape

(679, 28)
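
As an alternative to listing the empty columns and the full set of new names by hand, the same cleanup can be sketched with `dropna` and `rename`. This is only an illustration on a tiny made-up frame (`demo` and its values are hypothetical); it assumes that every fully empty column should go and that only the two names containing `/` or `-` need to change.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with one fully empty column (values are made up).
demo = pd.DataFrame({
    "Player": ["A", "B"],
    "WS/48": [0.093, 0.144],
    "Unnamed: 24": [np.nan, np.nan],
    "Player-additional": ["a01", "b01"],
})

cleaned = (
    demo.dropna(axis=1, how="all")  # drop columns where every value is NA
        .rename(columns={"WS/48": "WS_48",
                         "Player-additional": "Player_additional"})
)

print(list(cleaned.columns))
```

Renaming by mapping only touches the listed columns, so it keeps working if the column order changes in a later data export.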

### 3.1.6. <a id='toc3_1_6_'></a>[First glance at the Advanced Dataset](#toc0_)

In [16]:
advanced_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rk,679.0,265.976436,154.956296,1.0,132.5,264.0,399.5,539.0
Age,679.0,26.025037,4.325709,19.0,23.0,25.0,29.0,42.0
G,679.0,43.338733,24.727306,1.0,22.0,45.0,65.5,83.0
MP,679.0,984.421208,800.236331,1.0,266.5,797.0,1663.5,2963.0
PER,679.0,13.226068,6.044551,-20.9,10.1,13.0,16.1,65.6
TS%,676.0,0.562746,0.10437,0.0,0.523,0.566,0.60925,1.064
3PAr,676.0,0.41112,0.219802,0.0,0.27175,0.419,0.55525,1.0
FTr,676.0,0.245464,0.17678,0.0,0.1395,0.227,0.32125,2.0
ORB%,679.0,5.151694,4.259419,0.0,2.0,3.8,7.1,28.8
DRB%,679.0,14.998822,6.767066,0.0,10.5,13.5,18.7,55.4


In [17]:
advanced_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 679 entries, 0 to 678
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 679 non-null    int64  
 1   Player             679 non-null    object 
 2   Pos                679 non-null    object 
 3   Age                679 non-null    int64  
 4   Tm                 679 non-null    object 
 5   G                  679 non-null    int64  
 6   MP                 679 non-null    int64  
 7   PER                679 non-null    float64
 8   TS%                676 non-null    float64
 9   3PAr               676 non-null    float64
 10  FTr                676 non-null    float64
 11  ORB%               679 non-null    float64
 12  DRB%               679 non-null    float64
 13  TRB%               679 non-null    float64
 14  AST%               679 non-null    float64
 15  STL%               679 non-null    float64
 16  BLK%               679 non

In [18]:
# There are 679 rows in the dataset but only 539 unique players. This happens because some players changed teams during the season and appear in multiple rows.
# It may be a good solution to join these rows and keep only the latest team the player played for.

advanced_df['Player_additional'].nunique() 

539
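
One way to join the rows of players who changed teams is sketched below. It is not necessarily the approach this notebook will end up using, and it relies on two assumptions taken from how Moses Brown's rows are laid out: the season-total row (`Tm == 'TOT'`) comes first for each multi-team player, and the team rows that follow are in chronological order. The function name `consolidate_players` and the demo values are hypothetical.

```python
import pandas as pd

def consolidate_players(df, id_col="Player_additional", team_col="Tm"):
    """Keep one row per player and label it with the last listed team."""
    # Last non-TOT team per player (assumed to be the most recent one).
    last_team = df[df[team_col] != "TOT"].groupby(id_col)[team_col].last()

    # First row per player: the TOT row for multi-team players,
    # or the only row for everyone else.
    out = df.drop_duplicates(subset=id_col, keep="first").copy()
    out[team_col] = out[id_col].map(last_team)
    return out.reset_index(drop=True)

# Hypothetical demo frame mirroring the layout seen above.
demo = pd.DataFrame({
    "Player_additional": ["brownmo01", "brownmo01", "brownmo01", "fostemi02"],
    "Tm": ["TOT", "LAC", "BRK", "PHI"],
    "G": [36, 34, 2, 1],
})

result = consolidate_players(demo)
print(result)
```

Keeping the season-total row preserves the full-season statistics, while the team label reflects where the player finished the season.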

In [19]:
# The data types are all set correctly

advanced_df.dtypes

Rk                     int64
Player                object
Pos                   object
Age                    int64
Tm                    object
G                      int64
MP                     int64
PER                  float64
TS%                  float64
3PAr                 float64
FTr                  float64
ORB%                 float64
DRB%                 float64
TRB%                 float64
AST%                 float64
STL%                 float64
BLK%                 float64
TOV%                 float64
USG%                 float64
OWS                  float64
DWS                  float64
WS                   float64
WS_48                float64
OBPM                 float64
DBPM                 float64
BPM                  float64
VORP                 float64
Player_additional     object
dtype: object

In [None]:
# advanced_profile = ProfileReport(advanced_df, title = 'Advanced NBA Dataset Profile')
# advanced_profile.to_file('advanced_profile.html')
# advanced_profile

### 3.1.7. <a id='toc3_1_7_'></a>[Imputing values to the missing data](#toc0_)

## 3.2. <a id='toc3_2_'></a>[Examining Per Game Dataset](#toc0_)

### 3.2.1. <a id='toc3_2_1_'></a>[Features from Per Game Dataset](#toc0_)


- Rk -- Rank

- Pos -- Position

- Age -- Player's age on February 1 of the season

- Tm -- Team

- G -- Games

- GS -- Games Started

- MP -- Minutes Played Per Game

- FG -- Field Goals Per Game

- FGA -- Field Goal Attempts Per Game

- FG% -- Field Goal Percentage

- 3P -- 3-Point Field Goals Per Game

- 3PA -- 3-Point Field Goal Attempts Per Game

- 3P% -- 3-Point Field Goal Percentage

- 2P -- 2-Point Field Goals Per Game

- 2PA -- 2-Point Field Goal Attempts Per Game

- 2P% -- 2-Point Field Goal Percentage

- eFG% -- Effective Field Goal Percentage. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.

- FT -- Free Throws Per Game

- FTA -- Free Throw Attempts Per Game

- FT% -- Free Throw Percentage

- ORB -- Offensive Rebounds Per Game

- DRB -- Defensive Rebounds Per Game

- TRB -- Total Rebounds Per Game

- AST -- Assists Per Game

- STL -- Steals Per Game

- BLK -- Blocks Per Game

- TOV -- Turnovers Per Game

- PF -- Personal Fouls Per Game

- PTS -- Points Per Game
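
Among the features above, eFG% is the only one with an adjustment built in, so a short worked example may help. The sketch uses the standard definition eFG% = (FG + 0.5 * 3P) / FGA; the function name `effective_fg_pct` and the input numbers are hypothetical and are not taken from the dataset.

```python
def effective_fg_pct(fg, three_p, fga):
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    if fga == 0:
        return float("nan")  # undefined without any attempts
    return (fg + 0.5 * three_p) / fga

# Hypothetical player: 4 made field goals (2 of them threes) on 10 attempts.
plain_fg_pct = 4 / 10
efg = effective_fg_pct(fg=4, three_p=2, fga=10)

print(plain_fg_pct, efg)
```

The gap between the plain FG% and the eFG% grows with the share of made shots that are 3-pointers.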

In [21]:
pergame_df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
0,1,Precious Achiuwa,C,23,TOR,55,12,20.7,3.6,7.3,...,1.8,4.1,6.0,0.9,0.6,0.5,1.1,1.9,9.2,achiupr01
1,2,Steven Adams,C,29,MEM,42,42,27.0,3.7,6.3,...,5.1,6.5,11.5,2.3,0.9,1.1,1.9,2.3,8.6,adamsst01
2,3,Bam Adebayo,C,25,MIA,75,75,34.6,8.0,14.9,...,2.5,6.7,9.2,3.2,1.2,0.8,2.5,2.8,20.4,adebaba01
3,4,Ochai Agbaji,SG,22,UTA,59,22,20.5,2.8,6.5,...,0.7,1.3,2.1,1.1,0.3,0.3,0.7,1.7,7.9,agbajoc01
4,5,Santi Aldama,PF,22,MEM,77,20,21.8,3.2,6.8,...,1.1,3.7,4.8,1.3,0.6,0.6,0.8,1.9,9.0,aldamsa01


In [22]:
pergame_df.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Player-additional'],
      dtype='object')

In [23]:
pergame_df.shape

(679, 31)

In [24]:
pergame_df['3PA']

0      2.0
1      0.0
2      0.2
3      3.9
4      3.5
      ... 
674    0.6
675    6.3
676    0.8
677    0.1
678    0.0
Name: 3PA, Length: 679, dtype: float64

# 4. <a id='toc4_'></a>[Feature Engineering and Hypothesis Creation](#toc0_)
- Mental hypothesis map
- Hypothesis list
- Fill in NAs
- Derive new variables as needed

# 5. <a id='toc5_'></a>[Data selection and filtering](#toc0_)
- Filter data rows
- Filter data columns

# 6. <a id='toc6_'></a>[Exploratory Data Analysis](#toc0_)
- Answer the hypothesis list
- Build data visualization solutions and plots

# 7. <a id='toc7_'></a>[Data Preparation](#toc0_)
- Normalize, re-scale and transform (encoding) variables to suit model requirements

# 8. <a id='toc8_'></a>[Feature Selection through Boruta algorithm](#toc0_)
- Use the Boruta algorithm to select the best features for the machine learning models

# 9. <a id='toc9_'></a>[Model implementation](#toc0_)
- Implement different machine learning models and algorithms
- Conduct cross-validation computing
- Conduct single performance metrics computing

# 10. <a id='toc10_'></a>[Hyperparameter Fine-Tuning](#toc0_)
- Implement hyperparameter search (Bayes Search) to find best model hyperparameter values
- Re-train model using best values

# 11. <a id='toc11_'></a>[Model Error Estimation and Interpretation](#toc0_)
- Use model errors to interpret the goals 

# 12. <a id='toc12_'></a>[Model Deployment](#toc0_)
- Deploy the model to a cloud service so it can be used by its consumers