# Cleaning

We start by loading NBA player data from the past 10 NBA seasons (2012-13 to 2022-23).

The first dataframe contains stats per 36 minutes, which were chosen over per-game and per-48-minute stats because the best players usually play about three quarters of a game (36 of the 48 minutes in an NBA game).

The second dataframe has the same stats on a per-game basis, plus a minutes played per game column ('MP') that is used later to filter the dataset.
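
As a rough illustration of how the two dataframes relate, a per-game stat can be rescaled to a 36-minute basis by multiplying by 36 and dividing by minutes per game. The helper name `to_per_36` and the toy dataframe are assumptions made for this sketch. The source data was most likely computed from season totals, so values rebuilt from rounded per-game numbers may differ slightly.

```python
import pandas as pd

# Toy per-game row; the MP and PTS values for Chris Wright (2013-14)
# are copied from the tables shown in this notebook.
per_game = pd.DataFrame({
    "Player": ["Chris Wright"],
    "MP": [15.8],   # minutes played per game
    "PTS": [6.0],   # points per game
})

def to_per_36(per_game_stat: pd.Series, minutes: pd.Series) -> pd.Series:
    """Scale a per-game stat to a 36-minute basis (hypothetical helper)."""
    return per_game_stat * 36 / minutes

per_game["PTS_per36"] = to_per_36(per_game["PTS"], per_game["MP"]).round(1)
print(per_game)
```

For this row the rescaled value agrees with the 13.7 points per 36 minutes listed for the same player in the per-36 table.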

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/Users/giovanni-lunetta/stat_4185/final/past_ten_seasons/data/ten_years.csv")

df_mp = pd.read_csv("/Users/giovanni-lunetta/stat_4185/final/past_ten_seasons/data/MP_data.csv")

Drop columns that do not provide any information for this analysis.

In [5]:
df = df.drop(['Rk', 'Team', '-9999', 'From', 'To'], axis=1)

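Check for duplicated player names in either dataset and remove them. The rows flagged below are different players who share a name (their ages, teams, and IDs differ), so dropping duplicates on 'Player' keeps only the first row for each shared name.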

In [6]:
duplicates_df = df[df.duplicated(subset='Player', keep=False)]
duplicates_df

Unnamed: 0,Player,ORB%,DRB%,AST%,STL%,BLK%,TOV%,USG%,Age,G,...,TOV,PF,PTS,FG%,2P%,3P%,FT%,TS%,eFG%,Pos
22,Chris Wright,8.8,9.4,6.9,2.9,3.2,10.7,16.7,25,8,...,1.4,4.9,13.7,0.6,0.618,0.0,0.4,0.577,0.6,F
23,Chris Wright,0.0,0.0,0.0,0.0,0.0,33.3,34.2,23,3,...,9.0,0.0,18.0,0.5,0.5,,,0.5,0.5,G
549,Tony Mitchell,19.7,16.0,3.4,3.8,3.0,16.4,13.2,21,21,...,1.8,4.1,10.0,0.417,0.364,1.0,0.579,0.54,0.458,F
550,Tony Mitchell,11.1,0.0,22.6,5.2,0.0,0.0,22.6,24,3,...,0.0,0.0,21.6,0.6,0.75,0.0,,0.6,0.6,F
786,Chris Johnson,7.8,16.6,4.9,1.3,7.6,12.6,16.2,27,30,...,1.6,6.2,14.8,0.64,0.64,,0.618,0.65,0.64,C
787,Chris Johnson,3.5,11.4,6.3,2.0,1.1,11.2,14.1,22-25,147,...,1.3,3.4,10.3,0.392,0.508,0.307,0.835,0.514,0.48,F-G
808,Mike James,1.1,8.2,24.6,1.5,0.3,16.4,17.8,37-38,56,...,2.3,4.1,10.9,0.363,0.352,0.377,0.742,0.465,0.447,G
809,Mike James,1.9,12.4,29.7,1.8,0.7,13.0,24.6,27-30,49,...,2.7,2.3,17.0,0.38,0.423,0.287,0.766,0.472,0.425,G


In [7]:
duplicates_mp_df = df_mp[df_mp.duplicated(subset='Player', keep=False)]
duplicates_mp_df

Unnamed: 0,Rk,Player,From,To,Age,G,GS,MP,FG,FGA,...,PTS,FG%,2P%,3P%,FT%,TS%,eFG%,Pos,Team,-9999
22,23,Chris Wright,2013-14,2013-14,25,8,0,15.8,2.6,4.4,...,6.0,0.6,0.618,0.0,0.4,0.577,0.6,F,MIL,wrighch01
23,24,Chris Wright,2012-13,2012-13,23,3,0,1.3,0.3,0.7,...,0.7,0.5,0.5,,,0.5,0.5,G,DAL,wrighch02
552,553,Tony Mitchell,2013-14,2013-14,21,21,0,3.8,0.2,0.6,...,1.0,0.417,0.364,1.0,0.579,0.54,0.458,F,DET,mitchto02
553,554,Tony Mitchell,2013-14,2013-14,24,3,0,3.3,1.0,1.7,...,2.0,0.6,0.75,0.0,,0.6,0.6,F,MIL,mitchto03
790,791,Chris Johnson,2012-13,2012-13,27,30,0,9.5,1.6,2.5,...,3.9,0.64,0.64,,0.618,0.65,0.64,C,MIN,johnsch03
791,792,Chris Johnson,2012-13,2015-16,22-25,147,4,15.5,1.6,4.0,...,4.4,0.392,0.508,0.307,0.835,0.514,0.48,F-G,BOSMEMMILPHIUTA,johnsch04
813,814,Mike James,2012-13,2013-14,37-38,56,23,16.8,1.9,5.2,...,5.1,0.363,0.352,0.377,0.742,0.465,0.447,G,CHIDAL,jamesmi01
814,815,Mike James,2017-18,2020-21,27-30,49,11,18.8,3.2,8.4,...,8.9,0.38,0.423,0.287,0.766,0.472,0.425,G,BRKNOPPHO,jamesmi02


In [8]:
df = df.drop_duplicates(subset='Player')
df_mp = df_mp.drop_duplicates(subset='Player')
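
One possible alternative, sketched here on toy data, is to treat the ID column as the unique key so that players who merely share a name are not dropped. The column name `player_id` is a hypothetical rename of the raw '-9999' column, and the "Example Player" row is made up for illustration. This is not what the notebook does.

```python
import pandas as pd

# Toy data: the two Chris Wright rows mirror the output above;
# "Example Player" and its ID are invented for this sketch.
toy = pd.DataFrame({
    "Player": ["Chris Wright", "Chris Wright", "Example Player"],
    "player_id": ["wrighch01", "wrighch02", "exampl01"],
    "MP": [15.8, 1.3, 20.0],
})

# Rows that share a name (what the notebook flags).
same_name = toy[toy.duplicated(subset="Player", keep=False)]

# Rows that share an ID (true duplicates of the same player).
same_id = toy[toy.duplicated(subset="player_id", keep=False)]

by_name = toy.drop_duplicates(subset="Player")    # keeps first row per name
by_id = toy.drop_duplicates(subset="player_id")   # keeps both Chris Wrights

print(len(same_name), len(same_id), len(by_name), len(by_id))
```

Keying on the ID would also require merging on the ID rather than on 'Player', which means keeping that column in both dataframes until after the merge.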

In [9]:
print(df.shape)
print(df_mp.shape)

(1454, 35)
(1464, 34)


Merge the minutes played column onto the original dataframe for filtering.

In [10]:
df = df.merge(df_mp[['Player', 'MP']], on='Player', how='left')

In [11]:
print(df.shape)

(1454, 36)
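
The shape check above confirms that the left merge did not add rows. A slightly stricter sketch, using toy data, asks pandas to verify key uniqueness with `validate` and reports unmatched rows with `indicator`. The names `stats`, `minutes`, and "Unmatched Player" are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for df and df_mp; the first two rows reuse values from
# the head() output in this notebook, the third is made up.
stats = pd.DataFrame({
    "Player": ["Ivica Zubac", "Ante Žižić", "Unmatched Player"],
    "PTS": [15.2, 16.0, 9.9],
})
minutes = pd.DataFrame({
    "Player": ["Ivica Zubac", "Ante Žižić"],
    "MP": [20.7, 13.4],
})

merged = stats.merge(
    minutes[["Player", "MP"]],
    on="Player",
    how="left",
    validate="one_to_one",  # raises MergeError if 'Player' repeats on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Players with no minutes row end up with NaN in 'MP'.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Player", "MP"]])
```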


In [12]:
df.columns

Index(['Player', 'ORB%', 'DRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'Age',
       'G', 'GS', 'FG', 'FGA', '2P', '2PA', '3P', '3PA', 'FT', 'FTA', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'FG%', '2P%',
       '3P%', 'FT%', 'TS%', 'eFG%', 'Pos', 'MP'],
      dtype='object')

In [13]:
df.head()

Unnamed: 0,Player,ORB%,DRB%,AST%,STL%,BLK%,TOV%,USG%,Age,G,...,PF,PTS,FG%,2P%,3P%,FT%,TS%,eFG%,Pos,MP
0,Ivica Zubac,12.7,24.3,8.0,0.8,3.8,14.3,16.6,19-25,436,...,4.1,15.2,0.605,0.607,0.083,0.741,0.641,0.605,C,20.7
1,Ante Žižić,10.7,22.4,6.6,0.7,2.3,13.0,18.2,21-23,113,...,4.0,16.0,0.581,0.581,,0.711,0.614,0.581,F-C,13.4
2,Paul Zipser,1.8,15.1,7.0,1.0,1.6,14.6,14.8,22-23,98,...,3.5,9.9,0.371,0.402,0.335,0.769,0.474,0.448,G-F,17.0
3,Stephen Zimmerman,10.8,24.9,5.3,0.9,3.7,8.3,14.8,20,19,...,5.7,7.7,0.323,0.323,,0.6,0.346,0.323,C,5.7
4,Tyler Zeller,9.6,18.5,8.1,0.7,2.7,11.8,17.8,23-30,414,...,4.5,14.3,0.508,0.512,0.286,0.764,0.553,0.511,F-C,17.5


Next we filter the data to keep only players who average at least one quarter of play (12 minutes) per game and who have played more than 82 games over the past 10 seasons. Each season has 82 games, so this keeps players with more than one full season's worth of games.

In [154]:
df['MP'].describe()

count    1454.000000
mean       16.655021
std         8.351827
min         0.700000
25%        10.200000
50%        16.000000
75%        22.500000
max        43.500000
Name: MP, dtype: float64

In [155]:
len(df[df["MP"] <= 12])

472

In [156]:
df = df[df['MP'] >= 12]

In [157]:
df['G'].describe()

count    987.000000
mean     264.988855
std      206.481032
min        1.000000
25%       95.000000
50%      213.000000
75%      399.000000
max      835.000000
Name: G, dtype: float64

In [158]:
len(df[df["G"] < 82])

223

In [159]:
df = df[df["G"] > 82]
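
The two filters can also be written as a single boolean mask, which makes the boundary behavior explicit. This is a minimal sketch on toy data. The constant names are assumptions, and the comparisons mirror the ones used above (`>=` for minutes, strict `>` for games).

```python
import pandas as pd

toy = pd.DataFrame({
    "Player": ["a", "b", "c", "d"],
    "MP": [12.0, 11.9, 25.0, 30.0],
    "G": [83, 400, 82, 500],
})

MIN_MINUTES = 12  # one quarter of play per game
MIN_GAMES = 82    # one full season of games

# Keep players at or above the minutes threshold and strictly above the games threshold.
keep = (toy["MP"] >= MIN_MINUTES) & (toy["G"] > MIN_GAMES)

# .copy() gives an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
filtered = toy[keep].copy()
print(filtered)
```

Player "c" is excluded because exactly 82 games does not pass a strict `>` comparison. Switching to `>=` would include such players.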

Check for null values and fill the missing values with 0. A missing 3P% means the player did not attempt any three-point shots, which is treated here as a 3P% of 0.

In [170]:
print((df.isnull().sum() * 100)/ len(df))

Player    0.0
ORB%      0.0
DRB%      0.0
AST%      0.0
STL%      0.0
BLK%      0.0
TOV%      0.0
USG%      0.0
Age       0.0
G         0.0
GS        0.0
FG        0.0
FGA       0.0
2P        0.0
2PA       0.0
3P        0.0
3PA       0.0
FT        0.0
FTA       0.0
ORB       0.0
DRB       0.0
TRB       0.0
AST       0.0
STL       0.0
BLK       0.0
TOV       0.0
PF        0.0
PTS       0.0
FG%       0.0
2P%       0.0
3P%       0.0
FT%       0.0
TS%       0.0
eFG%      0.0
Pos       0.0
MP        0.0
dtype: float64


In [171]:
# replace null values in the "3P%" column with zeros
df['3P%'] = df['3P%'].fillna(0)
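
The assumption that a missing 3P% always means zero attempts can be checked directly. The sketch below, on made-up rows, fills only where '3PA' is 0 and leaves any other gap visible for inspection. Whether such gaps exist in the real data is not known from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: one player with no three-point attempts, one regular
# shooter, and one hypothetical gap that zero attempts cannot explain.
toy = pd.DataFrame({
    "Player": ["no_attempts", "shooter", "unexpected_gap"],
    "3PA": [0.0, 4.2, 1.5],
    "3P%": [np.nan, 0.377, np.nan],
})

# Fill only the rows where the player took no three-point shots.
no_attempts = toy["3PA"] == 0
toy.loc[no_attempts, "3P%"] = toy.loc[no_attempts, "3P%"].fillna(0)

# Anything still missing deserves a closer look before filling.
remaining = int(toy["3P%"].isna().sum())
print(toy)
print("unexplained missing 3P% values:", remaining)
```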

In [172]:
df.describe()

Unnamed: 0,ORB%,DRB%,AST%,STL%,BLK%,TOV%,USG%,G,GS,FG,...,TOV,PF,PTS,FG%,2P%,3P%,FT%,TS%,eFG%,MP
count,761.0,761.0,761.0,761.0,761.0,761.0,761.0,761.0,761.0,761.0,...,761.0,761.0,761.0,761.0,761.0,761.0,761.0,761.0,761.0,761.0
mean,4.902891,14.970302,13.458739,1.533114,1.719711,12.513403,18.685414,332.156373,169.575558,5.509987,...,1.913535,3.158607,14.753219,0.457041,0.504869,0.312593,0.752067,0.547845,0.515838,22.118922
std,3.520473,5.460365,8.059821,0.505868,1.40808,3.195269,4.444051,188.101905,178.399109,1.409994,...,0.644195,0.868548,3.857275,0.059743,0.058424,0.099706,0.087476,0.043789,0.046774,6.050586
min,0.7,5.1,1.5,0.5,0.0,3.3,7.2,83.0,0.0,2.5,...,0.3,1.3,6.2,0.337,0.365,0.0,0.326,0.403,0.388,12.0
25%,2.1,10.5,7.5,1.2,0.7,10.3,15.4,174.0,38.0,4.5,...,1.4,2.5,12.1,0.417,0.466,0.3,0.707,0.521,0.486,17.3
50%,3.6,14.0,10.7,1.5,1.3,12.1,18.1,289.0,97.0,5.4,...,1.8,3.1,14.3,0.443,0.498,0.342,0.766,0.546,0.512,21.3
75%,6.9,18.6,17.7,1.8,2.3,14.3,21.6,458.0,246.0,6.4,...,2.3,3.7,17.0,0.481,0.536,0.366,0.814,0.572,0.541,26.8
max,16.9,34.0,44.4,3.6,7.9,26.8,35.6,835.0,809.0,11.1,...,4.5,6.3,30.8,0.758,0.763,0.667,0.92,0.741,0.758,37.0


In [173]:
df['Pos'].value_counts()

G      290
F      205
G-F     71
F-C     70
C       54
C-F     43
F-G     28
Name: Pos, dtype: int64

In [164]:
# # Define the mapping dictionary
# pos_mapping = {
#     'G': 1,
#     'G-F': 2,
#     'F-G': 3,
#     'F': 4,
#     'F-C': 5,
#     'C-F': 6,
#     'C': 7
# }

# # Replace the values in the 'Pos' column using the mapping dictionary
# df['Pos'] = df['Pos'].replace(pos_mapping)

In [165]:
df['Player'].value_counts()

Ivica Zubac        1
Harry Giles        1
Eric Gordon        1
Ben Gordon         1
Aaron Gordon       1
                  ..
Mike Muscala       1
Jamal Murray       1
Dejounte Murray    1
Trey Murphy III    1
Álex Abrines       1
Name: Player, Length: 761, dtype: int64

Create a new cleaned CSV file to use for visualization and analysis.

In [166]:
df.to_csv("/Users/giovanni-lunetta/stat_4185/final_project/past_ten_seasons/data/cleaned.csv", index=False)