# Predicting All-NBA Team and Player Salaries - Data Cleaning and EDA
---

In this notebook, we clean our web-scraped NBA data, engineer features from it, and conduct exploratory data analysis.

We will use a combination of Python and SQL to preprocess the data for accuracy and consistency, including handling missing values, addressing data inconsistencies, and converting data into appropriate formats. We will engineer columns such as Voter Share for the All-NBA Team award winners and total NBA payroll per season, as well as narrow down the dataframes to only variables of interest, before conducting a series of merges to create cleaned and final data for modeling. Our exploratory data analysis will give us an overall sense of the makeup of our data and explore various relationships between variables.

Detailed notebooks for the other segments of this project can be found here:
- [01_Data_Acquisition](./01_Data_Acquisition.ipynb)
- [03_Data_Modeling_I](./03_Data_Modeling_I.ipynb)
- [04_Data_Modeling_II](./04_Data_Modeling_II.ipynb)

For more information on the background, a summary of methods, and findings, please see the associated [README](../README.md) for this analysis. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

import warnings
warnings.filterwarnings('ignore') 

# this setting widens how many characters pandas will display in a column:
#pd.options.display.max_colwidth = 400
pd.options.display.max_rows = 400
pd.options.display.max_columns = 400

In [2]:
adv = pd.read_csv('../data/advanced_data.csv')
pg = pd.read_csv('../data/per_game_data.csv')
tot = pd.read_csv('../data/totals_data.csv')

allteam = pd.read_csv('../data/all_nba_teams.csv')
allstar = pd.read_csv('../data/all_star_appearances.csv')
rank = pd.read_csv('../data/team_rank.csv')

sal = pd.read_csv('../data/salaries.csv')
salcap = pd.read_csv('../data/salarycap.csv')
payroll = pd.read_csv('../data/team_payroll.csv')

## I. Combine Player Statistics

In [3]:
print(adv.Pos.value_counts())
print(adv.dtypes)

PF          3903
SG          3827
PG          3745
C           3723
SF          3442
Pos          712
SF-SG         35
SG-SF         31
PG-SG         31
C-PF          28
SG-PG         28
PF-SF         27
PF-C          26
SF-PF         24
SG-PF          4
PG-SF          1
SF-C           1
SG-PG-SF       1
Name: Pos, dtype: int64
Rk              object
Player          object
Pos             object
Age             object
Tm              object
G               object
MP              object
PER             object
TS%             object
3PAr            object
FTr             object
ORB%            object
DRB%            object
TRB%            object
AST%            object
STL%            object
BLK%            object
TOV%            object
USG%            object
Unnamed: 19    float64
OWS             object
DWS             object
WS              object
WS/48           object
Unnamed: 24    float64
OBPM            object
DBPM            object
BPM             object
VORP            object
Yea

In [4]:
print(tot.Pos.value_counts())
print(tot.dtypes)

PF          3903
SG          3827
PG          3745
C           3723
SF          3442
Pos          712
SF-SG         35
SG-SF         31
PG-SG         31
C-PF          28
SG-PG         28
PF-SF         27
PF-C          26
SF-PF         24
SG-PF          4
PG-SF          1
SF-C           1
SG-PG-SF       1
Name: Pos, dtype: int64
Rk        object
Player    object
Pos       object
Age       object
Tm        object
G         object
GS        object
MP        object
FG        object
FGA       object
FG%       object
3P        object
3PA       object
3P%       object
2P        object
2PA       object
2P%       object
eFG%      object
FT        object
FTA       object
FT%       object
ORB       object
DRB       object
TRB       object
AST       object
STL       object
BLK       object
TOV       object
PF        object
PTS       object
Year       int64
Stat      object
dtype: object


In [5]:
print(pg.Pos.value_counts())
print(pg.dtypes)

PF          3903
SG          3827
PG          3745
C           3723
SF          3442
Pos          712
SF-SG         35
SG-SF         31
PG-SG         31
C-PF          28
SG-PG         28
PF-SF         27
PF-C          26
SF-PF         24
SG-PF          4
PG-SF          1
SF-C           1
SG-PG-SF       1
Name: Pos, dtype: int64
Rk        object
Player    object
Pos       object
Age       object
Tm        object
G         object
GS        object
MP        object
FG        object
FGA       object
FG%       object
3P        object
3PA       object
3P%       object
2P        object
2PA       object
2P%       object
eFG%      object
FT        object
FTA       object
FT%       object
ORB       object
DRB       object
TRB       object
AST       object
STL       object
BLK       object
TOV       object
PF        object
PTS       object
Year       int64
Stat      object
dtype: object


### Initial Cleaning
##### Some initial data cleaning tasks have become apparent:
- Remove any Unnamed columns
- Remove rows which are repeat headers
- Convert Objects to Floats
- Dummify ```Pos``` (Position)
- Prefix variables with statistic type
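The first, second, third, and last tasks above can be sketched on a toy table. This is only an illustration under stated assumptions: the frame `raw`, its values, and the `tot` tag are made up for the example, and the real tables have many more columns.

```python
import pandas as pd

# Toy stand-in for a scraped table (hypothetical values): one repeated
# header row and one empty spacer column, so every column loads as object.
raw = pd.DataFrame({
    'Rk': ['1', '2', 'Rk', '3'],
    'Player': ['A', 'B', 'Player', 'C'],
    'Pos': ['PF', 'PG', 'Pos', 'C'],
    'PTS': ['135', '942', 'PTS', '285'],
    'Unnamed: 4': [None, None, None, None],
})

# Drop spacer columns whose names start with 'Unnamed'
clean = raw.loc[:, ~raw.columns.str.startswith('Unnamed')]

# Drop rows that merely repeat the header
clean = clean.loc[clean['Rk'] != 'Rk']

# With only numeric strings left, the casts succeed
clean = clean.astype({'Rk': 'int64', 'PTS': 'float64'})

# Prefix the statistic column with the table tag
clean = clean.rename(columns={'PTS': 'tot_PTS'})

print(clean.dtypes)
```

The order matters: casting before the repeated header rows are removed would fail, because strings such as `'PTS'` cannot be parsed as numbers.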

In [6]:
def init_clean(df, pre):
    
    # Remove Unnamed columns
    df = df.loc[:, ~df.columns.str.startswith('Unnamed')]

    # Remove repeat header rows
    df = df.loc[df.Rk != 'Rk']

    # Convert to float
    keep_obj = ['Pos', 'Player', 'Tm', 'Stat']
    data_types = {col: 'float64' for col in df.columns if col not in keep_obj}
    df = df.astype(data_types).drop(columns=['Stat'])
    
    # Cast count-like columns back to integers.
    # Note: 'MP' is cast in every table, so per-game minutes (pg_MP)
    # are truncated to whole minutes.
    keep_obj2 = ['Rk', 'Year', 'Age', 'G', 'MP'] 
    data_types2 = {col: 'int64' for col in df.columns if col in keep_obj2}
    df = df.astype(data_types2)

    # Dummify Position - will address later
    # df_dum = pd.get_dummies(df['Pos'], drop_first=True)
    # df = pd.concat([df, df_dum], axis=1)

    # Prefix variables
    no_pre = ['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'Year']
    prefix = pre
    rename_col = {col: f'{prefix}_{col}' for col in df.columns if col not in no_pre}
    df = df.rename(columns=rename_col)
    
    return df

In [7]:
adv = init_clean(adv, 'adv')
adv

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,adv_MP,adv_PER,adv_TS%,adv_3PAr,adv_FTr,adv_ORB%,adv_DRB%,adv_TRB%,adv_AST%,adv_STL%,adv_BLK%,adv_TOV%,adv_USG%,adv_OWS,adv_DWS,adv_WS,adv_WS/48,adv_OBPM,adv_DBPM,adv_BPM,adv_VORP,Year
0,1,Alaa Abdelnaby,PF,22,POR,43,290,13.1,0.499,0.000,0.379,10.4,23.4,17.0,5.8,0.7,2.5,14.0,22.1,0.0,0.5,0.5,0.079,-3.4,-1.2,-4.6,-0.2,1990
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,1505,12.2,0.448,0.099,0.097,1.9,6.0,3.8,19.2,1.5,0.1,9.5,27.2,-0.7,-0.3,-1.0,-0.031,-2.0,-3.0,-5.0,-1.1,1990
2,3,Mark Acres,C,28,ORL,68,1313,9.2,0.551,0.014,0.472,11.3,18.7,14.9,2.5,0.9,1.1,14.0,9.3,1.4,1.1,2.5,0.090,-2.8,-0.2,-3.0,-0.3,1990
3,4,Michael Adams,PG,28,DEN,66,2346,22.3,0.530,0.397,0.372,2.1,8.8,5.2,39.4,2.6,0.1,12.7,28.5,5.8,0.4,6.3,0.128,6.0,-0.7,5.3,4.3,1990
4,5,Mark Aguirre,SF,31,DET,78,2006,16.7,0.526,0.086,0.349,7.6,13.7,10.7,11.6,1.2,0.6,10.9,25.7,2.8,2.7,5.5,0.132,1.2,0.2,1.4,1.7,1990
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19584,535,Thaddeus Young,PF,34,TOR,54,795,14.1,0.573,0.172,0.131,9.4,14.6,11.8,12.9,3.4,0.6,16.7,13.5,0.7,1.1,1.8,0.109,-1.8,1.9,0.1,0.4,2022
19585,536,Trae Young,PG,24,ATL,73,2541,22.0,0.573,0.331,0.460,2.4,7.0,4.7,42.5,1.5,0.3,15.2,32.6,5.3,1.4,6.7,0.126,5.3,-2.0,3.3,3.4,2022
19586,537,Omer Yurtseven,C,24,MIA,9,83,16.7,0.675,0.259,0.222,10.9,21.9,16.2,3.9,1.2,2.5,11.9,18.0,0.2,0.1,0.3,0.159,-2.5,-1.5,-3.9,0.0,2022
19587,538,Cody Zeller,C,30,MIA,15,217,16.4,0.659,0.034,0.593,13.0,21.8,17.3,7.2,0.7,1.9,15.8,18.1,0.4,0.3,0.7,0.147,-2.0,-0.7,-2.8,0.0,2022


In [8]:
tot = init_clean(tot, 'tot')
tot

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,tot_GS,tot_MP,tot_FG,tot_FGA,tot_FG%,tot_3P,tot_3PA,tot_3P%,tot_2P,tot_2PA,tot_2P%,tot_eFG%,tot_FT,tot_FTA,tot_FT%,tot_ORB,tot_DRB,tot_TRB,tot_AST,tot_STL,tot_BLK,tot_TOV,tot_PF,tot_PTS,Year
0,1,Alaa Abdelnaby,PF,22,POR,43,0.0,290,55.0,116.0,0.474,0.0,0.0,,55.0,116.0,0.474,0.474,25.0,44.0,0.568,27.0,62.0,89.0,12.0,4.0,12.0,22.0,39.0,135.0,1990
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19.0,1505,417.0,1009.0,0.413,24.0,100.0,0.240,393.0,909.0,0.432,0.425,84.0,98.0,0.857,34.0,87.0,121.0,206.0,55.0,4.0,110.0,149.0,942.0,1990
2,3,Mark Acres,C,28,ORL,68,0.0,1313,109.0,214.0,0.509,1.0,3.0,0.333,108.0,211.0,0.512,0.512,66.0,101.0,0.653,140.0,219.0,359.0,25.0,25.0,25.0,42.0,218.0,285.0,1990
3,4,Michael Adams,PG,28,DEN,66,66.0,2346,560.0,1421.0,0.394,167.0,564.0,0.296,393.0,857.0,0.459,0.453,465.0,529.0,0.879,58.0,198.0,256.0,693.0,147.0,6.0,240.0,162.0,1752.0,1990
4,5,Mark Aguirre,SF,31,DET,78,13.0,2006,420.0,909.0,0.462,24.0,78.0,0.308,396.0,831.0,0.477,0.475,240.0,317.0,0.757,134.0,240.0,374.0,139.0,47.0,20.0,128.0,209.0,1104.0,1990
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19584,535,Thaddeus Young,PF,34,TOR,54,9.0,795,108.0,198.0,0.545,6.0,34.0,0.176,102.0,164.0,0.622,0.561,18.0,26.0,0.692,71.0,95.0,166.0,75.0,54.0,5.0,42.0,88.0,240.0,2022
19585,536,Trae Young,PG,24,ATL,73,73.0,2541,597.0,1390.0,0.429,154.0,460.0,0.335,443.0,930.0,0.476,0.485,566.0,639.0,0.886,56.0,161.0,217.0,741.0,80.0,9.0,300.0,104.0,1914.0,2022
19586,537,Omer Yurtseven,C,24,MIA,9,0.0,83,16.0,27.0,0.593,3.0,7.0,0.429,13.0,20.0,0.650,0.648,5.0,6.0,0.833,8.0,15.0,23.0,2.0,2.0,2.0,4.0,16.0,40.0,2022
19587,538,Cody Zeller,C,30,MIA,15,2.0,217,37.0,59.0,0.627,0.0,2.0,0.000,37.0,57.0,0.649,0.627,24.0,35.0,0.686,25.0,39.0,64.0,10.0,3.0,4.0,14.0,33.0,98.0,2022


In [9]:
pg = init_clean(pg, 'pg')
pg

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,pg_GS,pg_MP,pg_FG,pg_FGA,pg_FG%,pg_3P,pg_3PA,pg_3P%,pg_2P,pg_2PA,pg_2P%,pg_eFG%,pg_FT,pg_FTA,pg_FT%,pg_ORB,pg_DRB,pg_TRB,pg_AST,pg_STL,pg_BLK,pg_TOV,pg_PF,pg_PTS,Year
0,1,Alaa Abdelnaby,PF,22,POR,43,0.0,6,1.3,2.7,0.474,0.0,0.0,,1.3,2.7,0.474,0.474,0.6,1.0,0.568,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1990
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19.0,22,6.2,15.1,0.413,0.4,1.5,0.240,5.9,13.6,0.432,0.425,1.3,1.5,0.857,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1990
2,3,Mark Acres,C,28,ORL,68,0.0,19,1.6,3.1,0.509,0.0,0.0,0.333,1.6,3.1,0.512,0.512,1.0,1.5,0.653,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1990
3,4,Michael Adams,PG,28,DEN,66,66.0,35,8.5,21.5,0.394,2.5,8.5,0.296,6.0,13.0,0.459,0.453,7.0,8.0,0.879,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1990
4,5,Mark Aguirre,SF,31,DET,78,13.0,25,5.4,11.7,0.462,0.3,1.0,0.308,5.1,10.7,0.477,0.475,3.1,4.1,0.757,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1990
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19584,535,Thaddeus Young,PF,34,TOR,54,9.0,14,2.0,3.7,0.545,0.1,0.6,0.176,1.9,3.0,0.622,0.561,0.3,0.5,0.692,1.3,1.8,3.1,1.4,1.0,0.1,0.8,1.6,4.4,2022
19585,536,Trae Young,PG,24,ATL,73,73.0,34,8.2,19.0,0.429,2.1,6.3,0.335,6.1,12.7,0.476,0.485,7.8,8.8,0.886,0.8,2.2,3.0,10.2,1.1,0.1,4.1,1.4,26.2,2022
19586,537,Omer Yurtseven,C,24,MIA,9,0.0,9,1.8,3.0,0.593,0.3,0.8,0.429,1.4,2.2,0.650,0.648,0.6,0.7,0.833,0.9,1.7,2.6,0.2,0.2,0.2,0.4,1.8,4.4,2022
19587,538,Cody Zeller,C,30,MIA,15,2.0,14,2.5,3.9,0.627,0.0,0.1,0.000,2.5,3.8,0.649,0.627,1.6,2.3,0.686,1.7,2.6,4.3,0.7,0.2,0.3,0.9,2.2,6.5,2022


In [10]:
adv.dtypes

Rk             int64
Player        object
Pos           object
Age            int64
Tm            object
G              int64
adv_MP         int64
adv_PER      float64
adv_TS%      float64
adv_3PAr     float64
adv_FTr      float64
adv_ORB%     float64
adv_DRB%     float64
adv_TRB%     float64
adv_AST%     float64
adv_STL%     float64
adv_BLK%     float64
adv_TOV%     float64
adv_USG%     float64
adv_OWS      float64
adv_DWS      float64
adv_WS       float64
adv_WS/48    float64
adv_OBPM     float64
adv_DBPM     float64
adv_BPM      float64
adv_VORP     float64
Year           int64
dtype: object

In [11]:
tot.dtypes

Rk            int64
Player       object
Pos          object
Age           int64
Tm           object
G             int64
tot_GS      float64
tot_MP        int64
tot_FG      float64
tot_FGA     float64
tot_FG%     float64
tot_3P      float64
tot_3PA     float64
tot_3P%     float64
tot_2P      float64
tot_2PA     float64
tot_2P%     float64
tot_eFG%    float64
tot_FT      float64
tot_FTA     float64
tot_FT%     float64
tot_ORB     float64
tot_DRB     float64
tot_TRB     float64
tot_AST     float64
tot_STL     float64
tot_BLK     float64
tot_TOV     float64
tot_PF      float64
tot_PTS     float64
Year          int64
dtype: object

In [12]:
pg.dtypes

Rk           int64
Player      object
Pos         object
Age          int64
Tm          object
G            int64
pg_GS      float64
pg_MP        int64
pg_FG      float64
pg_FGA     float64
pg_FG%     float64
pg_3P      float64
pg_3PA     float64
pg_3P%     float64
pg_2P      float64
pg_2PA     float64
pg_2P%     float64
pg_eFG%    float64
pg_FT      float64
pg_FTA     float64
pg_FT%     float64
pg_ORB     float64
pg_DRB     float64
pg_TRB     float64
pg_AST     float64
pg_STL     float64
pg_BLK     float64
pg_TOV     float64
pg_PF      float64
pg_PTS     float64
Year         int64
dtype: object

##### <span style = 'color:mediumvioletred'> _We will merge the three datasets on Rk, Player, Pos, Age, Tm, G, and Year, as these should be identical across all three datasets._ </span>
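One way to test the assumption that the keys line up is to let pandas check it during the merge. The sketch below is a hedged illustration on two toy frames (`left` and `right` are hypothetical names; the values are the first two rows shown in the outputs above). It uses the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

keys = ['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'Year']

# Toy per-game table with the shared key columns
left = pd.DataFrame({
    'Rk': [1, 2],
    'Player': ['Alaa Abdelnaby', 'Mahmoud Abdul-Rauf'],
    'Pos': ['PF', 'PG'],
    'Age': [22, 21],
    'Tm': ['POR', 'DEN'],
    'G': [43, 67],
    'Year': [1990, 1990],
    'pg_PTS': [3.1, 14.1],
})

# Toy totals table carrying the same keys
right = left[keys].assign(tot_PTS=[135.0, 942.0])

# validate='one_to_one' raises a MergeError if a key combination is
# duplicated on either side; indicator=True adds a '_merge' column.
merged = left.merge(right, how='left', on=keys,
                    validate='one_to_one', indicator=True)

# Rows present only in the left table would be labelled 'left_only'
unmatched = int((merged['_merge'] != 'both').sum())
merged = merged.drop(columns='_merge')

print(unmatched, merged.shape)
```

If `unmatched` is zero and no error is raised, the left merge neither dropped nor duplicated rows.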

In [13]:
stats1 = pg.merge(tot, how='left', on=['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'Year'])
stats1

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,pg_GS,pg_MP,pg_FG,pg_FGA,pg_FG%,pg_3P,pg_3PA,pg_3P%,pg_2P,pg_2PA,pg_2P%,pg_eFG%,pg_FT,pg_FTA,pg_FT%,pg_ORB,pg_DRB,pg_TRB,pg_AST,pg_STL,pg_BLK,pg_TOV,pg_PF,pg_PTS,Year,tot_GS,tot_MP,tot_FG,tot_FGA,tot_FG%,tot_3P,tot_3PA,tot_3P%,tot_2P,tot_2PA,tot_2P%,tot_eFG%,tot_FT,tot_FTA,tot_FT%,tot_ORB,tot_DRB,tot_TRB,tot_AST,tot_STL,tot_BLK,tot_TOV,tot_PF,tot_PTS
0,1,Alaa Abdelnaby,PF,22,POR,43,0.0,6,1.3,2.7,0.474,0.0,0.0,,1.3,2.7,0.474,0.474,0.6,1.0,0.568,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1990,0.0,290,55.0,116.0,0.474,0.0,0.0,,55.0,116.0,0.474,0.474,25.0,44.0,0.568,27.0,62.0,89.0,12.0,4.0,12.0,22.0,39.0,135.0
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19.0,22,6.2,15.1,0.413,0.4,1.5,0.240,5.9,13.6,0.432,0.425,1.3,1.5,0.857,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1990,19.0,1505,417.0,1009.0,0.413,24.0,100.0,0.240,393.0,909.0,0.432,0.425,84.0,98.0,0.857,34.0,87.0,121.0,206.0,55.0,4.0,110.0,149.0,942.0
2,3,Mark Acres,C,28,ORL,68,0.0,19,1.6,3.1,0.509,0.0,0.0,0.333,1.6,3.1,0.512,0.512,1.0,1.5,0.653,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1990,0.0,1313,109.0,214.0,0.509,1.0,3.0,0.333,108.0,211.0,0.512,0.512,66.0,101.0,0.653,140.0,219.0,359.0,25.0,25.0,25.0,42.0,218.0,285.0
3,4,Michael Adams,PG,28,DEN,66,66.0,35,8.5,21.5,0.394,2.5,8.5,0.296,6.0,13.0,0.459,0.453,7.0,8.0,0.879,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1990,66.0,2346,560.0,1421.0,0.394,167.0,564.0,0.296,393.0,857.0,0.459,0.453,465.0,529.0,0.879,58.0,198.0,256.0,693.0,147.0,6.0,240.0,162.0,1752.0
4,5,Mark Aguirre,SF,31,DET,78,13.0,25,5.4,11.7,0.462,0.3,1.0,0.308,5.1,10.7,0.477,0.475,3.1,4.1,0.757,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1990,13.0,2006,420.0,909.0,0.462,24.0,78.0,0.308,396.0,831.0,0.477,0.475,240.0,317.0,0.757,134.0,240.0,374.0,139.0,47.0,20.0,128.0,209.0,1104.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18872,535,Thaddeus Young,PF,34,TOR,54,9.0,14,2.0,3.7,0.545,0.1,0.6,0.176,1.9,3.0,0.622,0.561,0.3,0.5,0.692,1.3,1.8,3.1,1.4,1.0,0.1,0.8,1.6,4.4,2022,9.0,795,108.0,198.0,0.545,6.0,34.0,0.176,102.0,164.0,0.622,0.561,18.0,26.0,0.692,71.0,95.0,166.0,75.0,54.0,5.0,42.0,88.0,240.0
18873,536,Trae Young,PG,24,ATL,73,73.0,34,8.2,19.0,0.429,2.1,6.3,0.335,6.1,12.7,0.476,0.485,7.8,8.8,0.886,0.8,2.2,3.0,10.2,1.1,0.1,4.1,1.4,26.2,2022,73.0,2541,597.0,1390.0,0.429,154.0,460.0,0.335,443.0,930.0,0.476,0.485,566.0,639.0,0.886,56.0,161.0,217.0,741.0,80.0,9.0,300.0,104.0,1914.0
18874,537,Omer Yurtseven,C,24,MIA,9,0.0,9,1.8,3.0,0.593,0.3,0.8,0.429,1.4,2.2,0.650,0.648,0.6,0.7,0.833,0.9,1.7,2.6,0.2,0.2,0.2,0.4,1.8,4.4,2022,0.0,83,16.0,27.0,0.593,3.0,7.0,0.429,13.0,20.0,0.650,0.648,5.0,6.0,0.833,8.0,15.0,23.0,2.0,2.0,2.0,4.0,16.0,40.0
18875,538,Cody Zeller,C,30,MIA,15,2.0,14,2.5,3.9,0.627,0.0,0.1,0.000,2.5,3.8,0.649,0.627,1.6,2.3,0.686,1.7,2.6,4.3,0.7,0.2,0.3,0.9,2.2,6.5,2022,2.0,217,37.0,59.0,0.627,0.0,2.0,0.000,37.0,57.0,0.649,0.627,24.0,35.0,0.686,25.0,39.0,64.0,10.0,3.0,4.0,14.0,33.0,98.0


In [14]:
stats2 = stats1.merge(adv, how='left', on=['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'Year'])
stats2

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,pg_GS,pg_MP,pg_FG,pg_FGA,pg_FG%,pg_3P,pg_3PA,pg_3P%,pg_2P,pg_2PA,pg_2P%,pg_eFG%,pg_FT,pg_FTA,pg_FT%,pg_ORB,pg_DRB,pg_TRB,pg_AST,pg_STL,pg_BLK,pg_TOV,pg_PF,pg_PTS,Year,tot_GS,tot_MP,tot_FG,tot_FGA,tot_FG%,tot_3P,tot_3PA,tot_3P%,tot_2P,tot_2PA,tot_2P%,tot_eFG%,tot_FT,tot_FTA,tot_FT%,tot_ORB,tot_DRB,tot_TRB,tot_AST,tot_STL,tot_BLK,tot_TOV,tot_PF,tot_PTS,adv_MP,adv_PER,adv_TS%,adv_3PAr,adv_FTr,adv_ORB%,adv_DRB%,adv_TRB%,adv_AST%,adv_STL%,adv_BLK%,adv_TOV%,adv_USG%,adv_OWS,adv_DWS,adv_WS,adv_WS/48,adv_OBPM,adv_DBPM,adv_BPM,adv_VORP
0,1,Alaa Abdelnaby,PF,22,POR,43,0.0,6,1.3,2.7,0.474,0.0,0.0,,1.3,2.7,0.474,0.474,0.6,1.0,0.568,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1990,0.0,290,55.0,116.0,0.474,0.0,0.0,,55.0,116.0,0.474,0.474,25.0,44.0,0.568,27.0,62.0,89.0,12.0,4.0,12.0,22.0,39.0,135.0,290,13.1,0.499,0.000,0.379,10.4,23.4,17.0,5.8,0.7,2.5,14.0,22.1,0.0,0.5,0.5,0.079,-3.4,-1.2,-4.6,-0.2
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19.0,22,6.2,15.1,0.413,0.4,1.5,0.240,5.9,13.6,0.432,0.425,1.3,1.5,0.857,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1990,19.0,1505,417.0,1009.0,0.413,24.0,100.0,0.240,393.0,909.0,0.432,0.425,84.0,98.0,0.857,34.0,87.0,121.0,206.0,55.0,4.0,110.0,149.0,942.0,1505,12.2,0.448,0.099,0.097,1.9,6.0,3.8,19.2,1.5,0.1,9.5,27.2,-0.7,-0.3,-1.0,-0.031,-2.0,-3.0,-5.0,-1.1
2,3,Mark Acres,C,28,ORL,68,0.0,19,1.6,3.1,0.509,0.0,0.0,0.333,1.6,3.1,0.512,0.512,1.0,1.5,0.653,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1990,0.0,1313,109.0,214.0,0.509,1.0,3.0,0.333,108.0,211.0,0.512,0.512,66.0,101.0,0.653,140.0,219.0,359.0,25.0,25.0,25.0,42.0,218.0,285.0,1313,9.2,0.551,0.014,0.472,11.3,18.7,14.9,2.5,0.9,1.1,14.0,9.3,1.4,1.1,2.5,0.090,-2.8,-0.2,-3.0,-0.3
3,4,Michael Adams,PG,28,DEN,66,66.0,35,8.5,21.5,0.394,2.5,8.5,0.296,6.0,13.0,0.459,0.453,7.0,8.0,0.879,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1990,66.0,2346,560.0,1421.0,0.394,167.0,564.0,0.296,393.0,857.0,0.459,0.453,465.0,529.0,0.879,58.0,198.0,256.0,693.0,147.0,6.0,240.0,162.0,1752.0,2346,22.3,0.530,0.397,0.372,2.1,8.8,5.2,39.4,2.6,0.1,12.7,28.5,5.8,0.4,6.3,0.128,6.0,-0.7,5.3,4.3
4,5,Mark Aguirre,SF,31,DET,78,13.0,25,5.4,11.7,0.462,0.3,1.0,0.308,5.1,10.7,0.477,0.475,3.1,4.1,0.757,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1990,13.0,2006,420.0,909.0,0.462,24.0,78.0,0.308,396.0,831.0,0.477,0.475,240.0,317.0,0.757,134.0,240.0,374.0,139.0,47.0,20.0,128.0,209.0,1104.0,2006,16.7,0.526,0.086,0.349,7.6,13.7,10.7,11.6,1.2,0.6,10.9,25.7,2.8,2.7,5.5,0.132,1.2,0.2,1.4,1.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18872,535,Thaddeus Young,PF,34,TOR,54,9.0,14,2.0,3.7,0.545,0.1,0.6,0.176,1.9,3.0,0.622,0.561,0.3,0.5,0.692,1.3,1.8,3.1,1.4,1.0,0.1,0.8,1.6,4.4,2022,9.0,795,108.0,198.0,0.545,6.0,34.0,0.176,102.0,164.0,0.622,0.561,18.0,26.0,0.692,71.0,95.0,166.0,75.0,54.0,5.0,42.0,88.0,240.0,795,14.1,0.573,0.172,0.131,9.4,14.6,11.8,12.9,3.4,0.6,16.7,13.5,0.7,1.1,1.8,0.109,-1.8,1.9,0.1,0.4
18873,536,Trae Young,PG,24,ATL,73,73.0,34,8.2,19.0,0.429,2.1,6.3,0.335,6.1,12.7,0.476,0.485,7.8,8.8,0.886,0.8,2.2,3.0,10.2,1.1,0.1,4.1,1.4,26.2,2022,73.0,2541,597.0,1390.0,0.429,154.0,460.0,0.335,443.0,930.0,0.476,0.485,566.0,639.0,0.886,56.0,161.0,217.0,741.0,80.0,9.0,300.0,104.0,1914.0,2541,22.0,0.573,0.331,0.460,2.4,7.0,4.7,42.5,1.5,0.3,15.2,32.6,5.3,1.4,6.7,0.126,5.3,-2.0,3.3,3.4
18874,537,Omer Yurtseven,C,24,MIA,9,0.0,9,1.8,3.0,0.593,0.3,0.8,0.429,1.4,2.2,0.650,0.648,0.6,0.7,0.833,0.9,1.7,2.6,0.2,0.2,0.2,0.4,1.8,4.4,2022,0.0,83,16.0,27.0,0.593,3.0,7.0,0.429,13.0,20.0,0.650,0.648,5.0,6.0,0.833,8.0,15.0,23.0,2.0,2.0,2.0,4.0,16.0,40.0,83,16.7,0.675,0.259,0.222,10.9,21.9,16.2,3.9,1.2,2.5,11.9,18.0,0.2,0.1,0.3,0.159,-2.5,-1.5,-3.9,0.0
18875,538,Cody Zeller,C,30,MIA,15,2.0,14,2.5,3.9,0.627,0.0,0.1,0.000,2.5,3.8,0.649,0.627,1.6,2.3,0.686,1.7,2.6,4.3,0.7,0.2,0.3,0.9,2.2,6.5,2022,2.0,217,37.0,59.0,0.627,0.0,2.0,0.000,37.0,57.0,0.649,0.627,24.0,35.0,0.686,25.0,39.0,64.0,10.0,3.0,4.0,14.0,33.0,98.0,217,16.4,0.659,0.034,0.593,13.0,21.8,17.3,7.2,0.7,1.9,15.8,18.1,0.4,0.3,0.7,0.147,-2.0,-0.7,-2.8,0.0


In [15]:
print(f'PerGame Stats Shape: {pg.shape}')
print(f'Total Stats Shape: {tot.shape}')
print(f'Advanced Stats Shape: {adv.shape}')
print(f'First Mrg Shape: {stats1.shape}')
print(f'Final Mrg Shape: {stats2.shape}')
      
# We expect the row count to be identical in all five frames, and the column count of stats2
# to equal [columns-in-pg] + ([columns-in-tot] - 7) + ([columns-in-adv] - 7), since the 7 merge keys are shared

PerGame Stats Shape: (18877, 31)
Total Stats Shape: (18877, 31)
Advanced Stats Shape: (18877, 28)
First Mrg Shape: (18877, 55)
Final Mrg Shape: (18877, 76)
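The expected column counts can be worked out by hand from the shapes printed above. The helper `merged_width` is a hypothetical name used only for this check.

```python
def merged_width(n_left, n_right, n_keys):
    """Columns after merging on n_keys shared key columns."""
    return n_left + (n_right - n_keys)

N_KEYS = 7  # Rk, Player, Pos, Age, Tm, G, Year

# pg (31 columns) merged with tot (31 columns)
first = merged_width(31, 31, N_KEYS)

# result merged with adv (28 columns)
final = merged_width(first, 28, N_KEYS)

print(first, final)
```

This gives 31 + 24 = 55 columns for the first merge and 55 + 21 = 76 for the second, matching the printed shapes.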


### Dummify Position

In [16]:
adv.Pos.value_counts()

PF          3903
SG          3827
PG          3745
C           3723
SF          3442
SF-SG         35
SG-SF         31
PG-SG         31
C-PF          28
SG-PG         28
PF-SF         27
PF-C          26
SF-PF         24
SG-PF          4
PG-SF          1
SF-C           1
SG-PG-SF       1
Name: Pos, dtype: int64

In [17]:
tot.Pos.value_counts()

PF          3903
SG          3827
PG          3745
C           3723
SF          3442
SF-SG         35
SG-SF         31
PG-SG         31
C-PF          28
SG-PG         28
PF-SF         27
PF-C          26
SF-PF         24
SG-PF          4
PG-SF          1
SF-C           1
SG-PG-SF       1
Name: Pos, dtype: int64

In [18]:
pg.Pos.value_counts()

PF          3903
SG          3827
PG          3745
C           3723
SF          3442
SF-SG         35
SG-SF         31
PG-SG         31
C-PF          28
SG-PG         28
PF-SF         27
PF-C          26
SF-PF         24
SG-PF          4
PG-SF          1
SF-C           1
SG-PG-SF       1
Name: Pos, dtype: int64

In [19]:
adv.Pos.value_counts(normalize=True)

PF          0.206760
SG          0.202733
PG          0.198390
C           0.197224
SF          0.182338
SF-SG       0.001854
SG-SF       0.001642
PG-SG       0.001642
C-PF        0.001483
SG-PG       0.001483
PF-SF       0.001430
PF-C        0.001377
SF-PF       0.001271
SG-PF       0.000212
PG-SF       0.000053
SF-C        0.000053
SG-PG-SF    0.000053
Name: Pos, dtype: float64

##### <span style = 'color:mediumvioletred'> _Only 1.26% of player-season rows list more than one position. We will keep the primary (first-listed) position and create another variable, "GT1_Pos", that counts the additional positions listed (0 when only one position is given). We may use this later as an indicator of versatility._ </span>
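As a small illustration on made-up position strings, both derived values can be computed with pandas' vectorized string methods. Counting hyphens yields the number of extra positions (0, 1, or 2), so a strict yes/no flag needs one more comparison; the name `multi` is hypothetical and does not appear elsewhere in this notebook.

```python
import pandas as pd

# Example position strings in the format seen in the value counts above
pos = pd.Series(['PF', 'SG-SF', 'C-PF', 'SG-PG-SF'])

# Number of additional positions = number of hyphens
extra = pos.str.count('-')

# Primary position = text before the first hyphen
primary = pos.str.split('-').str[0]

# Optional binary flag: 1 if more than one position is listed
multi = (extra > 0).astype(int)

print(pd.DataFrame({'Pos': pos, 'extra': extra,
                    'primary': primary, 'multi': multi}))
```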

In [20]:
def cnt_pos(col):
    return col.count('-')

stats2['GT1_Pos'] = stats2.Pos.apply(cnt_pos)
stats2

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,pg_GS,pg_MP,pg_FG,pg_FGA,pg_FG%,pg_3P,pg_3PA,pg_3P%,pg_2P,pg_2PA,pg_2P%,pg_eFG%,pg_FT,pg_FTA,pg_FT%,pg_ORB,pg_DRB,pg_TRB,pg_AST,pg_STL,pg_BLK,pg_TOV,pg_PF,pg_PTS,Year,tot_GS,tot_MP,tot_FG,tot_FGA,tot_FG%,tot_3P,tot_3PA,tot_3P%,tot_2P,tot_2PA,tot_2P%,tot_eFG%,tot_FT,tot_FTA,tot_FT%,tot_ORB,tot_DRB,tot_TRB,tot_AST,tot_STL,tot_BLK,tot_TOV,tot_PF,tot_PTS,adv_MP,adv_PER,adv_TS%,adv_3PAr,adv_FTr,adv_ORB%,adv_DRB%,adv_TRB%,adv_AST%,adv_STL%,adv_BLK%,adv_TOV%,adv_USG%,adv_OWS,adv_DWS,adv_WS,adv_WS/48,adv_OBPM,adv_DBPM,adv_BPM,adv_VORP,GT1_Pos
0,1,Alaa Abdelnaby,PF,22,POR,43,0.0,6,1.3,2.7,0.474,0.0,0.0,,1.3,2.7,0.474,0.474,0.6,1.0,0.568,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1990,0.0,290,55.0,116.0,0.474,0.0,0.0,,55.0,116.0,0.474,0.474,25.0,44.0,0.568,27.0,62.0,89.0,12.0,4.0,12.0,22.0,39.0,135.0,290,13.1,0.499,0.000,0.379,10.4,23.4,17.0,5.8,0.7,2.5,14.0,22.1,0.0,0.5,0.5,0.079,-3.4,-1.2,-4.6,-0.2,0
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19.0,22,6.2,15.1,0.413,0.4,1.5,0.240,5.9,13.6,0.432,0.425,1.3,1.5,0.857,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1990,19.0,1505,417.0,1009.0,0.413,24.0,100.0,0.240,393.0,909.0,0.432,0.425,84.0,98.0,0.857,34.0,87.0,121.0,206.0,55.0,4.0,110.0,149.0,942.0,1505,12.2,0.448,0.099,0.097,1.9,6.0,3.8,19.2,1.5,0.1,9.5,27.2,-0.7,-0.3,-1.0,-0.031,-2.0,-3.0,-5.0,-1.1,0
2,3,Mark Acres,C,28,ORL,68,0.0,19,1.6,3.1,0.509,0.0,0.0,0.333,1.6,3.1,0.512,0.512,1.0,1.5,0.653,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1990,0.0,1313,109.0,214.0,0.509,1.0,3.0,0.333,108.0,211.0,0.512,0.512,66.0,101.0,0.653,140.0,219.0,359.0,25.0,25.0,25.0,42.0,218.0,285.0,1313,9.2,0.551,0.014,0.472,11.3,18.7,14.9,2.5,0.9,1.1,14.0,9.3,1.4,1.1,2.5,0.090,-2.8,-0.2,-3.0,-0.3,0
3,4,Michael Adams,PG,28,DEN,66,66.0,35,8.5,21.5,0.394,2.5,8.5,0.296,6.0,13.0,0.459,0.453,7.0,8.0,0.879,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1990,66.0,2346,560.0,1421.0,0.394,167.0,564.0,0.296,393.0,857.0,0.459,0.453,465.0,529.0,0.879,58.0,198.0,256.0,693.0,147.0,6.0,240.0,162.0,1752.0,2346,22.3,0.530,0.397,0.372,2.1,8.8,5.2,39.4,2.6,0.1,12.7,28.5,5.8,0.4,6.3,0.128,6.0,-0.7,5.3,4.3,0
4,5,Mark Aguirre,SF,31,DET,78,13.0,25,5.4,11.7,0.462,0.3,1.0,0.308,5.1,10.7,0.477,0.475,3.1,4.1,0.757,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1990,13.0,2006,420.0,909.0,0.462,24.0,78.0,0.308,396.0,831.0,0.477,0.475,240.0,317.0,0.757,134.0,240.0,374.0,139.0,47.0,20.0,128.0,209.0,1104.0,2006,16.7,0.526,0.086,0.349,7.6,13.7,10.7,11.6,1.2,0.6,10.9,25.7,2.8,2.7,5.5,0.132,1.2,0.2,1.4,1.7,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18872,535,Thaddeus Young,PF,34,TOR,54,9.0,14,2.0,3.7,0.545,0.1,0.6,0.176,1.9,3.0,0.622,0.561,0.3,0.5,0.692,1.3,1.8,3.1,1.4,1.0,0.1,0.8,1.6,4.4,2022,9.0,795,108.0,198.0,0.545,6.0,34.0,0.176,102.0,164.0,0.622,0.561,18.0,26.0,0.692,71.0,95.0,166.0,75.0,54.0,5.0,42.0,88.0,240.0,795,14.1,0.573,0.172,0.131,9.4,14.6,11.8,12.9,3.4,0.6,16.7,13.5,0.7,1.1,1.8,0.109,-1.8,1.9,0.1,0.4,0
18873,536,Trae Young,PG,24,ATL,73,73.0,34,8.2,19.0,0.429,2.1,6.3,0.335,6.1,12.7,0.476,0.485,7.8,8.8,0.886,0.8,2.2,3.0,10.2,1.1,0.1,4.1,1.4,26.2,2022,73.0,2541,597.0,1390.0,0.429,154.0,460.0,0.335,443.0,930.0,0.476,0.485,566.0,639.0,0.886,56.0,161.0,217.0,741.0,80.0,9.0,300.0,104.0,1914.0,2541,22.0,0.573,0.331,0.460,2.4,7.0,4.7,42.5,1.5,0.3,15.2,32.6,5.3,1.4,6.7,0.126,5.3,-2.0,3.3,3.4,0
18874,537,Omer Yurtseven,C,24,MIA,9,0.0,9,1.8,3.0,0.593,0.3,0.8,0.429,1.4,2.2,0.650,0.648,0.6,0.7,0.833,0.9,1.7,2.6,0.2,0.2,0.2,0.4,1.8,4.4,2022,0.0,83,16.0,27.0,0.593,3.0,7.0,0.429,13.0,20.0,0.650,0.648,5.0,6.0,0.833,8.0,15.0,23.0,2.0,2.0,2.0,4.0,16.0,40.0,83,16.7,0.675,0.259,0.222,10.9,21.9,16.2,3.9,1.2,2.5,11.9,18.0,0.2,0.1,0.3,0.159,-2.5,-1.5,-3.9,0.0,0
18875,538,Cody Zeller,C,30,MIA,15,2.0,14,2.5,3.9,0.627,0.0,0.1,0.000,2.5,3.8,0.649,0.627,1.6,2.3,0.686,1.7,2.6,4.3,0.7,0.2,0.3,0.9,2.2,6.5,2022,2.0,217,37.0,59.0,0.627,0.0,2.0,0.000,37.0,57.0,0.649,0.627,24.0,35.0,0.686,25.0,39.0,64.0,10.0,3.0,4.0,14.0,33.0,98.0,217,16.4,0.659,0.034,0.593,13.0,21.8,17.3,7.2,0.7,1.9,15.8,18.1,0.4,0.3,0.7,0.147,-2.0,-0.7,-2.8,0.0,0


In [21]:
stats2.GT1_Pos.value_counts()

0    18640
1      236
2        1
Name: GT1_Pos, dtype: int64

In [22]:
stats2.GT1_Pos.value_counts(normalize=True)

0    0.987445
1    0.012502
2    0.000053
Name: GT1_Pos, dtype: float64

In [23]:
# Condense Pos variable to only the 5 standard NBA positions: PF, SG, PG, C, SF
stats2['Pos_5'] = stats2['Pos'].apply(lambda x: x.split('-')[0])

In [24]:
stats2.Pos_5.value_counts()

PF    3956
SG    3891
PG    3777
C     3751
SF    3502
Name: Pos_5, dtype: int64

In [25]:
stats2.Pos_5.value_counts(normalize=True)

PF    0.209567
SG    0.206124
PG    0.200085
C     0.198707
SF    0.185517
Name: Pos_5, dtype: float64

In [26]:
# Use pd.get_dummies rather than a fitted one-hot encoder: the five positions are fixed
# (none will be added or removed) and our data should always contain a mix of each.
# drop_first=True drops the first category ('C'), so center becomes the baseline.
pos_dummy = pd.get_dummies(stats2['Pos_5'], drop_first=True)
pos_dummy

Unnamed: 0,PF,PG,SF,SG
0,1,0,0,0
1,0,1,0,0
2,0,0,0,0
3,0,1,0,0
4,0,0,1,0
...,...,...,...,...
18872,1,0,0,0
18873,0,1,0,0
18874,0,0,0,0
18875,0,0,0,0
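With `drop_first=True`, the alphabetically first category has no column of its own, which is why centers appear as all-zero rows above. A minimal sketch on the first five positions from the table; `is_center` is a hypothetical name, and `dtype=int` is passed so the result is 0/1 regardless of pandas version.

```python
import pandas as pd

# First five primary positions from the merged table
pos5 = pd.Series(['PF', 'PG', 'C', 'PG', 'SF'], name='Pos_5')

# Categories are sorted, so 'C' comes first and is dropped
dummies = pd.get_dummies(pos5, drop_first=True, dtype=int)

# The baseline category is implied by a row of all zeros
is_center = dummies.sum(axis=1).eq(0)

print(dummies.assign(is_center=is_center))
```

Because the baseline is implicit, any model coefficients on the remaining position columns are read relative to centers.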


In [27]:
stats2 = pd.concat([stats2, pos_dummy], axis=1)
stats2

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,pg_GS,pg_MP,pg_FG,pg_FGA,pg_FG%,pg_3P,pg_3PA,pg_3P%,pg_2P,pg_2PA,pg_2P%,pg_eFG%,pg_FT,pg_FTA,pg_FT%,pg_ORB,pg_DRB,pg_TRB,pg_AST,pg_STL,pg_BLK,pg_TOV,pg_PF,pg_PTS,Year,tot_GS,tot_MP,tot_FG,tot_FGA,tot_FG%,tot_3P,tot_3PA,tot_3P%,tot_2P,tot_2PA,tot_2P%,tot_eFG%,tot_FT,tot_FTA,tot_FT%,tot_ORB,tot_DRB,tot_TRB,tot_AST,tot_STL,tot_BLK,tot_TOV,tot_PF,tot_PTS,adv_MP,adv_PER,adv_TS%,adv_3PAr,adv_FTr,adv_ORB%,adv_DRB%,adv_TRB%,adv_AST%,adv_STL%,adv_BLK%,adv_TOV%,adv_USG%,adv_OWS,adv_DWS,adv_WS,adv_WS/48,adv_OBPM,adv_DBPM,adv_BPM,adv_VORP,GT1_Pos,Pos_5,PF,PG,SF,SG
0,1,Alaa Abdelnaby,PF,22,POR,43,0.0,6,1.3,2.7,0.474,0.0,0.0,,1.3,2.7,0.474,0.474,0.6,1.0,0.568,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1990,0.0,290,55.0,116.0,0.474,0.0,0.0,,55.0,116.0,0.474,0.474,25.0,44.0,0.568,27.0,62.0,89.0,12.0,4.0,12.0,22.0,39.0,135.0,290,13.1,0.499,0.000,0.379,10.4,23.4,17.0,5.8,0.7,2.5,14.0,22.1,0.0,0.5,0.5,0.079,-3.4,-1.2,-4.6,-0.2,0,PF,1,0,0,0
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19.0,22,6.2,15.1,0.413,0.4,1.5,0.240,5.9,13.6,0.432,0.425,1.3,1.5,0.857,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1990,19.0,1505,417.0,1009.0,0.413,24.0,100.0,0.240,393.0,909.0,0.432,0.425,84.0,98.0,0.857,34.0,87.0,121.0,206.0,55.0,4.0,110.0,149.0,942.0,1505,12.2,0.448,0.099,0.097,1.9,6.0,3.8,19.2,1.5,0.1,9.5,27.2,-0.7,-0.3,-1.0,-0.031,-2.0,-3.0,-5.0,-1.1,0,PG,0,1,0,0
2,3,Mark Acres,C,28,ORL,68,0.0,19,1.6,3.1,0.509,0.0,0.0,0.333,1.6,3.1,0.512,0.512,1.0,1.5,0.653,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1990,0.0,1313,109.0,214.0,0.509,1.0,3.0,0.333,108.0,211.0,0.512,0.512,66.0,101.0,0.653,140.0,219.0,359.0,25.0,25.0,25.0,42.0,218.0,285.0,1313,9.2,0.551,0.014,0.472,11.3,18.7,14.9,2.5,0.9,1.1,14.0,9.3,1.4,1.1,2.5,0.090,-2.8,-0.2,-3.0,-0.3,0,C,0,0,0,0
3,4,Michael Adams,PG,28,DEN,66,66.0,35,8.5,21.5,0.394,2.5,8.5,0.296,6.0,13.0,0.459,0.453,7.0,8.0,0.879,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1990,66.0,2346,560.0,1421.0,0.394,167.0,564.0,0.296,393.0,857.0,0.459,0.453,465.0,529.0,0.879,58.0,198.0,256.0,693.0,147.0,6.0,240.0,162.0,1752.0,2346,22.3,0.530,0.397,0.372,2.1,8.8,5.2,39.4,2.6,0.1,12.7,28.5,5.8,0.4,6.3,0.128,6.0,-0.7,5.3,4.3,0,PG,0,1,0,0
4,5,Mark Aguirre,SF,31,DET,78,13.0,25,5.4,11.7,0.462,0.3,1.0,0.308,5.1,10.7,0.477,0.475,3.1,4.1,0.757,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1990,13.0,2006,420.0,909.0,0.462,24.0,78.0,0.308,396.0,831.0,0.477,0.475,240.0,317.0,0.757,134.0,240.0,374.0,139.0,47.0,20.0,128.0,209.0,1104.0,2006,16.7,0.526,0.086,0.349,7.6,13.7,10.7,11.6,1.2,0.6,10.9,25.7,2.8,2.7,5.5,0.132,1.2,0.2,1.4,1.7,0,SF,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18872,535,Thaddeus Young,PF,34,TOR,54,9.0,14,2.0,3.7,0.545,0.1,0.6,0.176,1.9,3.0,0.622,0.561,0.3,0.5,0.692,1.3,1.8,3.1,1.4,1.0,0.1,0.8,1.6,4.4,2022,9.0,795,108.0,198.0,0.545,6.0,34.0,0.176,102.0,164.0,0.622,0.561,18.0,26.0,0.692,71.0,95.0,166.0,75.0,54.0,5.0,42.0,88.0,240.0,795,14.1,0.573,0.172,0.131,9.4,14.6,11.8,12.9,3.4,0.6,16.7,13.5,0.7,1.1,1.8,0.109,-1.8,1.9,0.1,0.4,0,PF,1,0,0,0
18873,536,Trae Young,PG,24,ATL,73,73.0,34,8.2,19.0,0.429,2.1,6.3,0.335,6.1,12.7,0.476,0.485,7.8,8.8,0.886,0.8,2.2,3.0,10.2,1.1,0.1,4.1,1.4,26.2,2022,73.0,2541,597.0,1390.0,0.429,154.0,460.0,0.335,443.0,930.0,0.476,0.485,566.0,639.0,0.886,56.0,161.0,217.0,741.0,80.0,9.0,300.0,104.0,1914.0,2541,22.0,0.573,0.331,0.460,2.4,7.0,4.7,42.5,1.5,0.3,15.2,32.6,5.3,1.4,6.7,0.126,5.3,-2.0,3.3,3.4,0,PG,0,1,0,0
18874,537,Omer Yurtseven,C,24,MIA,9,0.0,9,1.8,3.0,0.593,0.3,0.8,0.429,1.4,2.2,0.650,0.648,0.6,0.7,0.833,0.9,1.7,2.6,0.2,0.2,0.2,0.4,1.8,4.4,2022,0.0,83,16.0,27.0,0.593,3.0,7.0,0.429,13.0,20.0,0.650,0.648,5.0,6.0,0.833,8.0,15.0,23.0,2.0,2.0,2.0,4.0,16.0,40.0,83,16.7,0.675,0.259,0.222,10.9,21.9,16.2,3.9,1.2,2.5,11.9,18.0,0.2,0.1,0.3,0.159,-2.5,-1.5,-3.9,0.0,0,C,0,0,0,0
18875,538,Cody Zeller,C,30,MIA,15,2.0,14,2.5,3.9,0.627,0.0,0.1,0.000,2.5,3.8,0.649,0.627,1.6,2.3,0.686,1.7,2.6,4.3,0.7,0.2,0.3,0.9,2.2,6.5,2022,2.0,217,37.0,59.0,0.627,0.0,2.0,0.000,37.0,57.0,0.649,0.627,24.0,35.0,0.686,25.0,39.0,64.0,10.0,3.0,4.0,14.0,33.0,98.0,217,16.4,0.659,0.034,0.593,13.0,21.8,17.3,7.2,0.7,1.9,15.8,18.1,0.4,0.3,0.7,0.147,-2.0,-0.7,-2.8,0.0,0,C,0,0,0,0


### Assess Missing Values, Null Values, and Outliers

In [28]:
stats2.describe()

Unnamed: 0,Rk,Age,G,pg_GS,pg_MP,pg_FG,pg_FGA,pg_FG%,pg_3P,pg_3PA,pg_3P%,pg_2P,pg_2PA,pg_2P%,pg_eFG%,pg_FT,pg_FTA,pg_FT%,pg_ORB,pg_DRB,pg_TRB,pg_AST,pg_STL,pg_BLK,pg_TOV,pg_PF,pg_PTS,Year,tot_GS,tot_MP,tot_FG,tot_FGA,tot_FG%,tot_3P,tot_3PA,tot_3P%,tot_2P,tot_2PA,tot_2P%,tot_eFG%,tot_FT,tot_FTA,tot_FT%,tot_ORB,tot_DRB,tot_TRB,tot_AST,tot_STL,tot_BLK,tot_TOV,tot_PF,tot_PTS,adv_MP,adv_PER,adv_TS%,adv_3PAr,adv_FTr,adv_ORB%,adv_DRB%,adv_TRB%,adv_AST%,adv_STL%,adv_BLK%,adv_TOV%,adv_USG%,adv_OWS,adv_DWS,adv_WS,adv_WS/48,adv_OBPM,adv_DBPM,adv_BPM,adv_VORP,GT1_Pos,PF,PG,SF,SG
count,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18768.0,18877.0,18877.0,15997.0,18877.0,18877.0,18694.0,18768.0,18877.0,18877.0,17949.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18768.0,18877.0,18877.0,15997.0,18877.0,18877.0,18694.0,18768.0,18877.0,18877.0,17949.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18872.0,18777.0,18768.0,18768.0,18872.0,18872.0,18872.0,18872.0,18872.0,18872.0,18793.0,18872.0,18877.0,18877.0,18877.0,18872.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0,18877.0
mean,235.252265,26.768978,46.416009,22.009006,19.221381,2.971113,6.636939,0.435493,0.559072,1.61185,0.279784,2.412089,5.025195,0.466446,0.474334,1.435186,1.920136,0.725768,0.942501,2.493998,3.434661,1.793272,0.632759,0.393527,1.157265,1.813334,7.933586,2007.206918,22.009006,1078.461567,168.490438,369.905917,0.435493,30.91185,87.09742,0.279784,137.578588,282.808497,0.466446,0.474334,82.070297,108.666525,0.725768,51.539175,137.743974,189.283149,100.656619,34.790327,21.869683,63.169624,95.612491,449.963024,1078.461567,12.485407,0.511005,0.235406,0.299767,5.777946,14.249719,10.015377,12.986822,1.617068,1.579509,14.182387,18.704711,1.148403,1.08771,2.237389,0.067762,-1.743402,-0.240435,-1.984002,0.519066,0.012608,0.209567,0.200085,0.185517,0.206124
std,138.743359,4.178723,26.357103,27.488658,10.06404,2.171432,4.577206,0.103308,0.689423,1.819032,0.163529,1.919198,3.795355,0.112575,0.107626,1.363718,1.720929,0.15057,0.828466,1.777366,2.471674,1.800168,0.453899,0.472998,0.79769,0.842555,5.90705,9.524702,27.488658,889.052376,168.366918,359.130487,0.103308,45.531405,120.753024,0.163529,146.761668,293.8488,0.112575,0.107626,99.561678,127.086974,0.15057,59.86339,137.327163,191.088636,127.638373,34.367558,32.470757,61.299505,74.344684,455.489396,889.052376,6.505349,0.102782,0.219032,0.232849,5.087284,6.810401,5.151828,9.56051,1.082606,1.920539,6.917436,5.592218,1.915674,1.153395,2.806729,0.106821,8.376874,1.932455,8.93885,1.246157,0.112052,0.407011,0.400074,0.388727,0.404531
min,1.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1990.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-90.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.3,-1.0,-2.1,-2.519,-1000.0,-31.1,-1000.0,-2.6,0.0,0.0,0.0,0.0,0.0
25%,116.0,24.0,23.0,0.0,11.0,1.3,3.1,0.397,0.0,0.1,0.2,1.0,2.2,0.426,0.439,0.5,0.7,0.661,0.3,1.2,1.7,0.6,0.3,0.1,0.6,1.2,3.4,1999.0,0.0,270.0,31.0,74.0,0.397,0.0,2.0,0.2,23.0,52.0,0.426,0.439,12.0,18.0,0.661,9.0,30.0,42.0,14.0,7.0,3.0,14.0,29.0,82.0,270.0,9.6,0.477,0.014,0.177,2.2,9.5,6.1,6.4,1.1,0.4,10.6,15.1,0.0,0.2,0.2,0.032,-3.3,-1.1,-3.7,-0.1,0.0,0.0,0.0,0.0,0.0
50%,232.0,26.0,50.0,7.0,18.0,2.4,5.5,0.438,0.3,1.0,0.323,1.9,3.9,0.472,0.484,1.0,1.4,0.75,0.7,2.1,2.9,1.2,0.5,0.2,1.0,1.8,6.4,2008.0,7.0,884.0,114.0,259.0,0.438,8.0,27.0,0.323,87.0,184.0,0.472,0.484,45.0,63.0,0.75,30.0,99.0,133.0,55.0,25.0,10.0,45.0,85.0,301.0,884.0,12.6,0.523,0.209,0.265,4.4,13.2,9.0,10.2,1.5,1.0,13.4,18.3,0.4,0.7,1.2,0.077,-1.4,-0.2,-1.6,0.0,0.0,0.0,0.0,0.0,0.0
75%,349.0,30.0,71.0,40.0,27.0,4.2,9.3,0.483,0.9,2.6,0.374,3.4,7.0,0.515,0.524,1.9,2.6,0.82,1.3,3.3,4.6,2.4,0.9,0.5,1.6,2.4,11.2,2016.0,40.0,1758.0,259.0,571.0,0.483,47.0,134.0,0.374,205.0,420.0,0.515,0.524,115.0,155.0,0.82,71.0,202.0,275.0,136.0,52.0,27.0,95.0,150.0,689.0,1758.0,15.7,0.561,0.398,0.373,8.6,18.3,13.3,17.6,2.0,2.1,16.7,22.0,1.7,1.6,3.4,0.118,0.3,0.7,0.4,0.7,0.0,0.0,0.0,0.0,0.0
max,605.0,44.0,85.0,83.0,44.0,12.7,27.8,1.0,5.3,13.2,1.0,12.1,23.4,1.0,1.5,10.3,13.1,1.0,6.8,13.0,18.7,14.2,3.5,6.0,5.7,6.0,36.1,2022.0,83.0,3533.0,992.0,2173.0,1.0,402.0,1028.0,1.0,961.0,1773.0,1.0,1.5,756.0,972.0,1.0,523.0,1007.0,1530.0,1164.0,246.0,342.0,464.0,371.0,2832.0,3533.0,133.8,1.5,1.0,6.0,100.0,100.0,100.0,100.0,25.0,77.8,100.0,100.0,14.9,9.1,20.4,2.712,199.4,60.7,242.2,11.8,2.0,1.0,1.0,1.0,1.0
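The summary above contains some extreme values, for example a minimum `adv_BPM` of -1000.0 and a maximum of 242.2, alongside a minimum `tot_MP` of 0. A minimal sketch of one way to flag such rows and inspect their minutes played is shown below. It uses a small toy frame with hypothetical values rather than `stats2`, and the 1.5 × IQR fence is an assumed rule of thumb, not a threshold taken from this analysis.

```python
import pandas as pd

# Toy stand-in for stats2 (hypothetical rows, not the scraped data)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E'],
    'tot_MP': [0, 3, 1505, 2346, 2541],
    'adv_BPM': [-1000.0, 242.2, -5.0, 5.3, 3.3],
})

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = toy['adv_BPM'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Pull the flagged rows so their minutes played can be inspected
outliers = toy[(toy['adv_BPM'] < lower) | (toy['adv_BPM'] > upper)]
print(outliers[['Player', 'tot_MP', 'adv_BPM']])
```

In the toy frame, the flagged rows are the ones with almost no minutes played. Whether the same holds for `stats2` would need to be checked against the real data.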


In [35]:
stats2.isnull().sum()[stats2.isnull().sum() > 0]

pg_FG%        109
pg_3P%       2880
pg_2P%        183
pg_eFG%       109
pg_FT%        928
tot_FG%       109
tot_3P%      2880
tot_2P%       183
tot_eFG%      109
tot_FT%       928
adv_PER         5
adv_TS%       100
adv_3PAr      109
adv_FTr       109
adv_ORB%        5
adv_DRB%        5
adv_TRB%        5
adv_AST%        5
adv_STL%        5
adv_BLK%        5
adv_TOV%       84
adv_USG%        5
adv_WS/48       5
dtype: int64

In [34]:
stats2[stats2['pg_FG%'].isnull()].G.value_counts()

1    70
2    26
3     9
4     2
5     2
Name: G, dtype: int64

##### <span style = 'color:mediumvioletred'> _All players with a missing FG% played only 1-5 games. We can safely remove these players, as they will not be chosen for the All-NBA team and would not help train our model with so few game statistics._ </span>
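The removal described above could be sketched as follows. This is a minimal illustration on a toy frame with hypothetical values, not the notebook's actual cleaning step. It assumes that dropping on `pg_FG%` alone is enough, which the matching null counts for `pg_FG%` and `tot_FG%` (109 each) suggest but do not prove.

```python
import numpy as np
import pandas as pd

# Toy stand-in for stats2 (hypothetical rows, not the scraped data)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'G': [1, 2, 70, 82],
    'pg_FGA': [0.0, 0.0, 10.5, 15.1],
    'pg_FG%': [np.nan, np.nan, 0.452, 0.413],
})

# Confirm that the rows with a missing FG% are all low-game rows before dropping
missing_fg = toy['pg_FG%'].isnull()
max_games_missing = toy.loc[missing_fg, 'G'].max()

# Keep only the rows with a recorded FG%
toy_clean = toy.dropna(subset=['pg_FG%']).reset_index(drop=True)
print(max_games_missing, len(toy_clean))
```

Checking the maximum games played among the dropped rows first guards against accidentally removing a full-season player.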

In [43]:
stats2[stats2['pg_3P%'].isnull()].Pos_5.value_counts()

C     1579
PF     856
SF     201
PG     124
SG     120
Name: Pos_5, dtype: int64

In [48]:
stats2[stats2['pg_3P%'].isnull()].Pos_5.value_counts(normalize=True)

C     0.548264
PF    0.297222
SF    0.069792
PG    0.043056
SG    0.041667
Name: Pos_5, dtype: float64

In [47]:
miss_3p = stats2[stats2['pg_3P%'].isnull()]

In [52]:
miss_3p.groupby('Pos_5').agg({
    'G' : ['mean', 'min', 'max']})

Pos_5,G mean,G min,G max
C,34.053832,1,85
PF,26.693925,1,82
PG,6.137097,1,40
SF,15.676617,1,82
SG,6.066667,1,60


##### <span style = 'color:mediumvioletred'> _~85% of players with a missing 3P% were Centers and Power Forwards, positions that are not expected to attempt many three-pointers. Unlike the players with a missing FG%, these players appeared in anywhere from 1 to 85 games, so the missing values here cannot be attributed to low game counts alone._ </span>
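Since 3P% is three-pointers made divided by three-pointers attempted, a missing value most plausibly means the player attempted none. The sketch below shows how that assumption could be verified and how the position shares (such as the 0.548 for C and 0.297 for PF above) are derived. The frame and its values are hypothetical, and computing `tot_3P%` from `tot_3P` and `tot_3PA` is an assumption about how the source calculated the column.

```python
import pandas as pd

# Toy stand-in for stats2 (hypothetical rows, not the scraped data)
toy = pd.DataFrame({
    'Pos_5': ['C', 'C', 'PF', 'PG', 'SG'],
    'G': [82, 60, 45, 3, 70],
    'tot_3P': [0.0, 0.0, 0.0, 0.0, 120.0],
    'tot_3PA': [0.0, 0.0, 0.0, 0.0, 300.0],
})

# Makes / attempts: pandas returns NaN for 0.0 / 0.0
toy['tot_3P%'] = toy['tot_3P'] / toy['tot_3PA']

# Check that every missing 3P% coincides with zero attempts
missing = toy['tot_3P%'].isnull()
all_zero_attempts = bool((toy.loc[missing, 'tot_3PA'] == 0).all())

# Share of the missing rows by position
share_by_pos = toy.loc[missing, 'Pos_5'].value_counts(normalize=True)
print(all_zero_attempts)
print(share_by_pos)
```

If the check holds on the real data, the missing values carry information (no attempts) rather than indicating a data error, which is worth keeping in mind when deciding how to handle them.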

### 