<h2 style='text-align: center'> Statistics Practice Case </h2>
<h1 style='text-align: center'> NBA Statistics 2017 </h1>
<h3 style='text-align: center'> IYKRA Data Fellowship Batch 5 2021 </h3>
<h3 style='text-align: center'> Abednego Kristanto </h3>

In [1]:
import pandas as pd

## Data Pre-Processing
### 1. Loading data

In [2]:
df = pd.read_csv('Seasons_Stats.csv', index_col='Unnamed: 0')
df.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,0.705,,,,176.0,,,,217.0,458.0
1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,...,0.708,,,,109.0,,,,99.0,279.0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,0.698,,,,140.0,,,,192.0,438.0
3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,...,0.559,,,,20.0,,,,29.0,63.0
4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,...,0.548,,,,20.0,,,,27.0,59.0


### 2. Check dataset shape and columns data types

In [3]:
df.shape

(24691, 52)

In [4]:
df.dtypes

Year      float64
Player     object
Pos        object
Age       float64
Tm         object
G         float64
GS        float64
MP        float64
PER       float64
TS%       float64
3PAr      float64
FTr       float64
ORB%      float64
DRB%      float64
TRB%      float64
AST%      float64
STL%      float64
BLK%      float64
TOV%      float64
USG%      float64
blanl     float64
OWS       float64
DWS       float64
WS        float64
WS/48     float64
blank2    float64
OBPM      float64
DBPM      float64
BPM       float64
VORP      float64
FG        float64
FGA       float64
FG%       float64
3P        float64
3PA       float64
3P%       float64
2P        float64
2PA       float64
2P%       float64
eFG%      float64
FT        float64
FTA       float64
FT%       float64
ORB       float64
DRB       float64
TRB       float64
AST       float64
STL       float64
BLK       float64
TOV       float64
PF        float64
PTS       float64
dtype: object

The data types need no changes: every statistic is numeric (float64), and 'Player', 'Pos', and 'Tm' are strings (object).

### 3. Take data only from 2017

In [5]:
df = df.loc[df['Year'] == 2017]
df = df.reset_index()
df.shape

(595, 53)

### 4. Check missing or null values:

In [6]:
df.isnull().sum()

index       0
Year        0
Player      0
Pos         0
Age         0
Tm          0
G           0
GS          0
MP          0
PER         0
TS%         2
3PAr        2
FTr         2
ORB%        0
DRB%        0
TRB%        0
AST%        0
STL%        0
BLK%        0
TOV%        2
USG%        0
blanl     595
OWS         0
DWS         0
WS          0
WS/48       0
blank2    595
OBPM        0
DBPM        0
BPM         0
VORP        0
FG          0
FGA         0
FG%         2
3P          0
3PA         0
3P%        46
2P          0
2PA         0
2P%         5
eFG%        2
FT          0
FTA         0
FT%        24
ORB         0
DRB         0
TRB         0
AST         0
STL         0
BLK         0
TOV         0
PF          0
PTS         0
dtype: int64

There are 11 columns with missing values. The two columns that are null in every row ('blanl' and 'blank2') will be dropped. The remaining missing values are in columns that hold either a percentage or a normalized version of other columns, so they are redundant. Because the normalization method behind '3PAr' and 'FTr' is unknown, and the original values are already available in '3PA' and 'FT', these two columns will also be dropped. In the percentage columns ('TS%', 'TOV%', 'FG%', '3P%', '2P%', 'eFG%', and 'FT%'), missing values will be replaced with 0.

In [7]:
df_clean = df.drop(['blanl','blank2','3PAr','FTr'], axis=1)
df_clean = df_clean.fillna(0)
# Check whether the missing values problem is solved or not.
df_clean.isnull().sum()

index     0
Year      0
Player    0
Pos       0
Age       0
Tm        0
G         0
GS        0
MP        0
PER       0
TS%       0
ORB%      0
DRB%      0
TRB%      0
AST%      0
STL%      0
BLK%      0
TOV%      0
USG%      0
OWS       0
DWS       0
WS        0
WS/48     0
OBPM      0
DBPM      0
BPM       0
VORP      0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
eFG%      0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PF        0
PTS       0
dtype: int64
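
The missing percentages are most likely the result of a zero denominator: a player with no attempts has an undefined shooting percentage. The sketch below uses made-up rows (not taken from the dataset) to show how that assumption could be checked, and how the fill could be limited to percentage columns instead of the whole dataframe. The names `toy` and `pct_cols` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration, not taken from Seasons_Stats.csv)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'FT':  [10.0, 0.0, 3.0],
    'FTA': [12.0, 0.0, 4.0],
})

# A percentage is makes / attempts; with zero attempts it is undefined (NaN)
toy['FT%'] = toy['FT'] / toy['FTA'].replace(0, np.nan)

# Check the assumption: every missing FT% belongs to a player with zero attempts
missing_matches_zero_attempts = (toy['FT%'].isnull() == (toy['FTA'] == 0)).all()

# Fill only the percentage columns, leaving every other column untouched
pct_cols = [c for c in toy.columns if c.endswith('%')]
toy[pct_cols] = toy[pct_cols].fillna(0)
```

Filling with 0 treats "no attempts" the same as "missed every attempt", which matters later when these columns feed into averages or ratings.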

### 5. Check and remove duplicates

In [8]:
df_clean['Player'].value_counts()

Lance Stephenson    4
Ersan Ilyasova      4
Omri Casspi         4
Johnny O'Bryant     3
Chris McCullough    3
                   ..
Thon Maker          1
Thomas Robinson     1
Fred VanVleet       1
Tyler Ulis          1
Solomon Hill        1
Name: Player, Length: 486, dtype: int64

Several players played for more than one team. Their records are already aggregated in the row whose team name is 'TOT' (total), which comes first among each player's rows. The per-team rows for these players are therefore duplicates that are no longer needed and will be dropped.

In [9]:
df_clean.drop_duplicates(subset='Player',inplace=True,ignore_index=True)
df_clean.drop(['index'],axis=1,inplace=True)
# Check again if there are still duplicates
df_clean['Player'].value_counts()

Chasson Randle            1
Michael Kidd-Gilchrist    1
Greg Monroe               1
Dahntay Jones             1
Ish Smith                 1
                         ..
Jared Sullinger           1
Draymond Green            1
Serge Ibaka               1
K.J. McDaniels            1
Solomon Hill              1
Name: Player, Length: 486, dtype: int64
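
The de-duplication above keeps the first row of each player, so it depends on the 'TOT' row appearing before the per-team rows. A minimal sketch of a variant that does not depend on row order is shown below, using made-up rows; the helper column `is_tot` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Toy rows (made up): player X changed teams, player Y did not
toy = pd.DataFrame({
    'Player': ['X', 'X', 'X', 'Y'],
    'Tm':     ['AAA', 'TOT', 'BBB', 'CCC'],
    'PTS':    [100.0, 250.0, 150.0, 300.0],
})

# Flag the combined rows and sort them to the front.
# mergesort is stable, so rows with the same flag keep their original order.
toy['is_tot'] = toy['Tm'] == 'TOT'
ordered = toy.sort_values('is_tot', ascending=False, kind='mergesort')

# keep='first' now always keeps the 'TOT' row when a player has one
dedup = (ordered.drop_duplicates(subset='Player', keep='first')
                .sort_index()
                .drop(columns='is_tot')
                .reset_index(drop=True))
```

Here `dedup` holds the 'TOT' row for player X and the single row for player Y, regardless of where the 'TOT' row sat in the input.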

In [10]:
# Check the final data shape and its values
print(df_clean.shape)
df_clean.head(10)

(486, 48)


Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,2017.0,Alex Abrines,SG,23.0,OKC,68.0,6.0,1055.0,10.1,0.56,...,0.898,18.0,68.0,86.0,40.0,37.0,8.0,33.0,114.0,406.0
1,2017.0,Quincy Acy,PF,26.0,TOT,38.0,1.0,558.0,11.8,0.565,...,0.75,20.0,95.0,115.0,18.0,14.0,15.0,21.0,67.0,222.0
2,2017.0,Steven Adams,C,23.0,OKC,80.0,80.0,2389.0,16.5,0.589,...,0.611,282.0,333.0,615.0,86.0,88.0,78.0,146.0,195.0,905.0
3,2017.0,Arron Afflalo,SG,31.0,SAC,61.0,45.0,1580.0,9.0,0.559,...,0.892,9.0,116.0,125.0,78.0,21.0,7.0,42.0,104.0,515.0
4,2017.0,Alexis Ajinca,C,28.0,NOP,39.0,15.0,584.0,12.9,0.529,...,0.725,46.0,131.0,177.0,12.0,20.0,22.0,31.0,77.0,207.0
5,2017.0,Cole Aldrich,C,28.0,MIN,62.0,0.0,531.0,12.7,0.549,...,0.682,51.0,107.0,158.0,25.0,25.0,23.0,17.0,85.0,105.0
6,2017.0,LaMarcus Aldridge,PF,31.0,SAS,72.0,72.0,2335.0,18.6,0.532,...,0.812,174.0,350.0,524.0,139.0,46.0,89.0,98.0,158.0,1243.0
7,2017.0,Lavoy Allen,PF,27.0,IND,61.0,5.0,871.0,11.6,0.485,...,0.697,105.0,115.0,220.0,57.0,18.0,24.0,29.0,78.0,177.0
8,2017.0,Tony Allen,SG,35.0,MEM,71.0,66.0,1914.0,13.3,0.493,...,0.615,166.0,225.0,391.0,98.0,115.0,29.0,100.0,178.0,643.0
9,2017.0,Al-Farouq Aminu,SF,26.0,POR,61.0,25.0,1773.0,11.3,0.506,...,0.706,77.0,374.0,451.0,99.0,60.0,44.0,94.0,102.0,532.0


### 6. NBA Teams and Their Abbreviation

In [11]:
df['Tm'].unique()

array(['OKC', 'TOT', 'DAL', 'BRK', 'SAC', 'NOP', 'MIN', 'SAS', 'IND',
       'MEM', 'POR', 'CLE', 'LAC', 'PHI', 'HOU', 'MIL', 'NYK', 'DEN',
       'ORL', 'MIA', 'PHO', 'GSW', 'CHO', 'DET', 'ATL', 'WAS', 'LAL',
       'UTA', 'BOS', 'CHI', 'TOR'], dtype=object)

|Western Conference||Eastern Conference||
|--|:--|--|:--|
|Abbr.|Team Name|Abbr.|Team Name|
|UTA|Utah Jazz|PHI|Philadelphia 76ers|
|LAC|Los Angeles Clippers|BRK|Brooklyn Nets|
|LAL|Los Angeles Lakers|MIL|Milwaukee Bucks|
|PHO|Phoenix Suns|TOR|Toronto Raptors|
|SAS|San Antonio Spurs|MIA|Miami Heat|
|POR|Portland Trail Blazers|BOS|Boston Celtics|
|GSW|Golden State Warriors|NYK|New York Knicks|
|DEN|Denver Nuggets|IND|Indiana Pacers|
|DAL|Dallas Mavericks|CHI|Chicago Bulls|
|MEM|Memphis Grizzlies|CHO|Charlotte Hornets|
|NOP|New Orleans Pelicans|ATL|Atlanta Hawks|
|OKC|Oklahoma City Thunder|WAS|Washington Wizards|
|SAC|Sacramento Kings|ORL|Orlando Magic|
|HOU|Houston Rockets|CLE|Cleveland Cavaliers|
|MIN|Minnesota Timberwolves|DET|Detroit Pistons|
|||||
|TOT|Total|Player played for 2 or more teams||

## Questions
### 1. The youngest and the oldest players in each team

In [12]:
# collect the youngest and oldest player(s) of each team
cols = ['Player','Tm','Pos','Age']
parts = []

for t in df_clean['Tm'].unique():
    # temporary dataframe t_df holds the players of one team
    t_df = df_clean.loc[df_clean['Tm']==t, cols]

    # keep the player(s) with the minimum age
    parts.append(t_df.loc[t_df['Age']==t_df['Age'].min()])

    # keep the player(s) with the maximum age
    parts.append(t_df.loc[t_df['Age']==t_df['Age'].max()])

# DataFrame.append was removed in pandas 2.0; pd.concat combines the pieces instead
yo_df = pd.concat(parts)

# display all rows
pd.set_option('display.max_rows', None)
yo_df

Unnamed: 0,Player,Tm,Pos,Age
389,Domantas Sabonis,OKC,PF,20.0
85,Nick Collison,OKC,PF,36.0
289,Chris McCullough,TOT,PF,21.0
28,Matt Barnes,TOT,SF,36.0
116,Mike Dunleavy,TOT,SF,36.0
344,Georgios Papagiannis,SAC,C,19.0
3,Arron Afflalo,SAC,SG,31.0
430,Anthony Tolliver,SAC,PF,31.0
108,Cheick Diallo,NOP,PF,20.0
215,Jarrett Jack,NOP,PG,33.0
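
The per-team loop can also be written without an explicit loop. The sketch below, on made-up rows, uses `groupby(...).transform` to broadcast each team's minimum and maximum age back onto every row and then filters on them. It is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

# Toy rows (made up for illustration)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'Tm':     ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'Age':    [20.0, 25.0, 31.0, 31.0, 25.0, 26.0, 28.0],
})

# Per-team minimum and maximum age, aligned with the original rows
team_age = toy.groupby('Tm')['Age']
is_youngest = toy['Age'] == team_age.transform('min')
is_oldest = toy['Age'] == team_age.transform('max')

# Keep rows that match either extreme; tied players are all retained
extremes = toy[is_youngest | is_oldest].sort_values(['Tm', 'Age'])
```

One difference from the loop: a team with a single player appears once here, whereas the loop would add that player twice (once as youngest, once as oldest).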


### 2. Which player has the most minutes played in each position?

In [13]:
# collect the player(s) with the most minutes played in each position
cols = ['Player','Tm','Pos','MP']
parts = []

for p in df_clean['Pos'].unique():
    # exclude players with two positions
    if p != 'PF-C':
        # temporary dataframe t_df holds the players of one position
        t_df = df_clean.loc[df_clean['Pos']==p, cols]

        # keep the player(s) with the maximum minutes played 'MP'
        parts.append(t_df.loc[t_df['MP']==t_df['MP'].max()])

# DataFrame.append was removed in pandas 2.0; pd.concat combines the pieces instead
mp_df = pd.concat(parts)
mp_df

Unnamed: 0,Player,Tm,Pos,MP
287,C.J. McCollum,POR,SG,2796.0
27,Harrison Barnes,DAL,PF,2803.0
432,Karl-Anthony Towns,MIN,C,3030.0
461,Andrew Wiggins,MIN,SF,3048.0
171,James Harden,HOU,PG,2947.0


### 3. Which team has the highest average total rebound percentage, assist percentage, steal percentage, and block percentage?
5 teams with the highest average total rebound percentage:

In [14]:
df_clean[['Tm','TRB%']].groupby('Tm').mean().sort_values(by='TRB%',ascending=False).head()

Unnamed: 0_level_0,TRB%
Tm,Unnamed: 1_level_1
WAS,13.45
GSW,11.426667
NYK,11.192857
LAL,10.921429
PHI,10.86


5 teams with the highest average assist percentage:

In [15]:
df_clean[['Tm','AST%']].groupby('Tm').mean().sort_values(by='AST%',ascending=False).head()

Unnamed: 0_level_0,AST%
Tm,Unnamed: 1_level_1
DEN,15.86
PHI,15.04
BRK,15.0
SAS,14.9
BOS,14.573333


5 teams with the highest average steal percentage:

In [16]:
df_clean[['Tm','STL%']].groupby('Tm').mean().sort_values(by='STL%',ascending=False).head()

Unnamed: 0_level_0,STL%
Tm,Unnamed: 1_level_1
MIN,2.371429
LAC,1.993333
TOR,1.928571
DEN,1.88
GSW,1.733333


5 teams with the highest average block percentage:

In [17]:
df_clean[['Tm','BLK%']].groupby('Tm').mean().sort_values(by='BLK%',ascending=False).head()

Unnamed: 0_level_0,BLK%
Tm,Unnamed: 1_level_1
GSW,2.74
LAC,2.053333
MIL,1.986667
TOR,1.978571
NYK,1.964286
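
The four rankings above can be condensed into a single pass. The sketch below, on made-up rows, averages all four percentage columns per team at once and reports the leading team for each. It also filters out 'TOT' first, since that label marks combined rows rather than an actual team; the variable names are illustrative only.

```python
import pandas as pd

# Toy rows (made up); the 'TOT' row stands for a player who changed teams
toy = pd.DataFrame({
    'Tm':   ['AAA', 'AAA', 'BBB', 'BBB', 'TOT'],
    'TRB%': [10.0, 12.0, 9.0, 9.0, 30.0],
    'AST%': [5.0, 7.0, 20.0, 10.0, 30.0],
    'STL%': [1.0, 3.0, 1.5, 1.5, 5.0],
    'BLK%': [0.5, 0.5, 2.0, 1.0, 5.0],
})

stat_cols = ['TRB%', 'AST%', 'STL%', 'BLK%']

# 'TOT' is not a real team, so leave it out of the team comparison
teams_only = toy[toy['Tm'] != 'TOT']

# Mean of every percentage column per team, computed in one groupby
team_means = teams_only.groupby('Tm')[stat_cols].mean()

# For each statistic: the team with the highest mean, and that mean
summary = pd.DataFrame({
    'best_team': team_means.idxmax(),
    'mean': team_means.max(),
})
```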


### 4. Best player based on the record stats

In [18]:
# Select relevant features; .copy() makes df_rating an independent dataframe,
# so the assignments below do not raise SettingWithCopyWarning
df_rating = df_clean[['Player','Pos','Tm','PTS','AST','FG%','FT%','ORB','TOV','STL','BLK','DRB','PF']].copy()

In [19]:
from sklearn.preprocessing import MinMaxScaler
# create normalization method using sklearn MinMaxScaler
scaler = MinMaxScaler()

# normalize numeric features to the range [0, 1]
num_cols = ['PTS','AST','FG%','FT%','ORB','TOV','STL','BLK','DRB','PF']
df_rating[num_cols] = scaler.fit_transform(df_rating[num_cols])


Weighting multiplier of features to obtain player rating:

| Feature | Multiplier |
| :-- | --- |
| PTS | 30 |
| AST | 25 |
| FG% | 15 |
| FT% | 10 |
| ORB | 20 |
| TOV | -10 |
| Max Offensive Rating | 100 |
| | |
| STL | 35 |
| BLK | 35 |
| DRB | 30 |
| PF | -10 |
| Max Defensive Rating | 100 |

Average Offensive and Defensive Rating = (6 Offensive features / 10 rating features)\*Offensive Rating + (4 Defensive features / 10 rating features)\*Defensive Rating
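
To make the scale of these ratings concrete, the sketch below applies min-max scaling and the offensive weights to made-up totals. The scaling is written directly in pandas as (x - min) / (max - min), which is the formula `MinMaxScaler` applies with its default `feature_range` of (0, 1). The toy values are assumptions chosen so that one player leads every category.

```python
import pandas as pd

# Toy raw totals (made up); the third player leads every category
# and has the fewest turnovers
toy = pd.DataFrame({
    'PTS': [100.0, 500.0, 900.0],
    'AST': [50.0, 300.0, 550.0],
    'FG%': [0.40, 0.45, 0.50],
    'FT%': [0.60, 0.70, 0.90],
    'ORB': [10.0, 80.0, 150.0],
    'TOV': [120.0, 60.0, 0.0],
})

# Min-max scaling, column by column: (x - min) / (max - min)
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Offensive weights from the table above
o_weights = pd.Series({'PTS': 30, 'AST': 25, 'FG%': 15,
                       'FT%': 10, 'ORB': 20, 'TOV': -10})

# Weighted sum per row; columns are matched to the weights by name
o_rating = scaled.mul(o_weights, axis=1).sum(axis=1)
```

The third player reaches the maximum of 100, while the first player, who is last in every category and has the most turnovers, ends at -10. The offensive rating therefore ranges from -10 to 100, and the same holds for the defensive rating.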

In [20]:
# create offensive rating for each player
df_rating['O_Rating'] = (df_rating['PTS']*30 + df_rating['AST']*25 + df_rating['FG%']*15
                         + df_rating['FT%']*10 + df_rating['ORB']*20) - df_rating['TOV']*10

# create defensive rating for each player
df_rating['D_Rating'] = (df_rating['STL']*35 + df_rating['BLK']*35 + df_rating['DRB']*30
                         - df_rating['PF']*10)

# create combined offensive-defensive rating for each player
df_rating['OD_Average'] = df_rating['O_Rating']*0.6 + df_rating['D_Rating']*0.4

df_rating.head(10)

Unnamed: 0,Player,Pos,Tm,PTS,AST,FG%,FT%,ORB,TOV,STL,BLK,DRB,PF,O_Rating,D_Rating,OD_Average
0,Alex Abrines,SG,OKC,0.158718,0.04415,0.393,0.898,0.052174,0.071121,0.235669,0.037383,0.083231,0.410072,21.072557,7.953039,15.82475
1,Quincy Acy,PF,TOT,0.086787,0.019868,0.412,0.75,0.057971,0.045259,0.089172,0.070093,0.116279,0.241007,17.487119,6.65259,13.153308
2,Steven Adams,C,OKC,0.353792,0.094923,0.571,0.611,0.817391,0.314655,0.56051,0.364486,0.407589,0.701439,40.863104,37.588117,39.553109
3,Arron Afflalo,SG,SAC,0.201329,0.086093,0.44,0.892,0.026087,0.090517,0.133758,0.03271,0.141983,0.374101,23.328759,6.344867,16.535203
4,Alexis Ajinca,C,NOP,0.080923,0.013245,0.5,0.725,0.133333,0.06681,0.127389,0.102804,0.160343,0.276978,19.507367,10.097227,15.743311
5,Cole Aldrich,C,MIN,0.041048,0.027594,0.523,0.682,0.147826,0.036638,0.159236,0.107477,0.130967,0.305755,19.176419,10.206385,15.588405
6,LaMarcus Aldridge,PF,SAS,0.485927,0.153422,0.477,0.812,0.504348,0.211207,0.292994,0.415888,0.428397,0.568345,41.663224,31.979296,37.789652
7,Lavoy Allen,PF,IND,0.069195,0.062914,0.458,0.697,0.304348,0.0625,0.11465,0.11215,0.140759,0.280576,22.950645,9.354983,17.51238
8,Tony Allen,SG,MEM,0.251368,0.108168,0.461,0.615,0.481159,0.215517,0.732484,0.135514,0.275398,0.640288,30.778258,32.23899,31.362551
9,Al-Farouq Aminu,SF,POR,0.207975,0.109272,0.393,0.706,0.223188,0.202586,0.382166,0.205607,0.457772,0.366906,24.363944,30.636163,26.872831
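
As a sanity check, the ratings in one row of the output above can be recomputed by hand. The sketch below copies the scaled values from the Steven Adams row and applies the weights from the multiplier table; because the displayed values are rounded, the results agree with the table only to a few decimal places.

```python
# Scaled values copied from the Steven Adams row in the output above
adams = {
    'PTS': 0.353792, 'AST': 0.094923, 'FG%': 0.571, 'FT%': 0.611,
    'ORB': 0.817391, 'TOV': 0.314655,
    'STL': 0.56051, 'BLK': 0.364486, 'DRB': 0.407589, 'PF': 0.701439,
}

# Weights from the multiplier table
o_weights = {'PTS': 30, 'AST': 25, 'FG%': 15, 'FT%': 10, 'ORB': 20, 'TOV': -10}
d_weights = {'STL': 35, 'BLK': 35, 'DRB': 30, 'PF': -10}

# Weighted sums, then the 60/40 combination
o_rating = sum(adams[k] * w for k, w in o_weights.items())
d_rating = sum(adams[k] * w for k, w in d_weights.items())
od_average = 0.6 * o_rating + 0.4 * d_rating

print(round(o_rating, 3), round(d_rating, 3), round(od_average, 3))
```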


In [21]:
# sort dataframe by combined rating, and show 5 best players
df_rating = df_rating.sort_values(by='OD_Average', ascending=False)
df_rating.reset_index(drop=True, inplace=True)
df_rating[['Player','Tm','O_Rating','D_Rating','OD_Average']].head()

Unnamed: 0,Player,Tm,O_Rating,D_Rating,OD_Average
0,Russell Westbrook,OKC,66.506182,54.580469,61.735897
1,Anthony Davis,NOP,50.715212,68.369764,57.777033
2,Giannis Antetokounmpo,MIL,52.165034,65.540759,57.515324
3,James Harden,HOU,63.208208,45.779096,56.236563
4,Karl-Anthony Towns,MIN,59.282322,46.991447,54.365972


Based on the dataset above, this rating puts Russell Westbrook as the best NBA player in 2017. Interestingly, the official NBA Most Valuable Player for the 2016-17 season was also Russell Westbrook.

### 5. Best team based on average stats record of their players

In [22]:
# group dataset by teams, and sort its average players rating
df_rating[['Tm','O_Rating','D_Rating','OD_Average']].groupby('Tm').mean().sort_values(by='OD_Average',ascending=False).head()

Unnamed: 0_level_0,O_Rating,D_Rating,OD_Average
Tm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
GSW,27.587027,20.544657,24.770079
HOU,27.974642,17.586626,23.819436
MIN,26.878598,16.942743,22.904256
OKC,26.172678,17.429717,22.675494
MIA,25.99282,17.324594,22.52553


The best team based on the average ratings of its players is the Golden State Warriors (GSW). They were also the 2017 NBA champions, beating the Cleveland Cavaliers 4-1 in the Finals.