## Data preprocessing and cleansing

This is likely to be the hard part: making sure that all the data that has been collected can be used, and checking whether any of it is corrupted.

First things first: we should familiarize ourselves with basketball statistics.

Basketball stats abbreviations:

- MIN: Minutes played
- 2M-2A: Two-point field goals made-attempted
- 3M-3A: Three-point field goals made-attempted
- FG%: Field goal percentage
- 1M-1A: Free throws made-attempted
- 1%: Free throw percentage
- Or: Offensive rebounds
- Dr: Defensive rebounds
- Reb: Total rebounds
- Ast: Assists
- To: Turnovers
- Stl: Steals
- Blk: Blocks
- Fo: Personal fouls
- +/-: Plus/minus
- Pts: Points scored
- Eff: Efficiency



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read in the data
df = pd.read_csv('../1_Data_Collection/datasets/atlanta-hawks-minnesota-timberwolves-2023-10-31.csv')
df

Unnamed: 0,PLAYER,Pts,Reb,Ast,MIN,2M-2A,3M-3A,FG%,1M-1A,1%,Or,Dr,To,Stl,Blk,Fo,+/-,Eff,Team
0,DejounteMurray,41,7,5,39,14-19,3-5,70.8%,4-4,100.0%,0,7,3,2,0,1,25,45,AtlantaHawks
1,TraeYoung,24,1,8,35,8-15,1-7,40.9%,5-5,100.0%,0,1,1,1,0,1,4,20,AtlantaHawks
2,De&#039;AndreHunter,16,4,2,30,3-5,3-5,60.0%,1-2,50.0%,0,4,0,1,0,0,4,18,AtlantaHawks
3,JalenT.Johnson,12,5,3,33,4-5,1-2,71.4%,1-1,100.0%,1,4,2,1,1,3,19,18,AtlantaHawks
4,SaddiqBey,11,6,1,25,1-2,3-3,80.0%,0-0,-,0,6,1,0,1,2,17,17,AtlantaHawks
5,OnyekaOkongwu,10,7,3,24,3-3,0-0,100.0%,4-4,100.0%,2,5,0,0,0,1,22,20,AtlantaHawks
6,BogdanBogdanovic,8,1,4,20,1-3,2-7,30.0%,0-0,-,0,1,1,1,0,1,3,6,AtlantaHawks
7,AJGriffin,3,0,1,6,0-0,1-1,100.0%,0-0,-,0,0,0,0,0,0,-8,4,AtlantaHawks
8,ClintCapela,2,5,1,23,0-3,0-0,0.0%,2-2,100.0%,1,4,1,0,5,3,-4,9,AtlantaHawks
9,MouhamedGueye,0,0,0,1,0-0,0-0,-,0-0,-,0,0,0,0,0,0,-4,0,AtlantaHawks
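
The table already shows a few things that need cleaning before the numbers can be used: an HTML entity in a player name (`De&#039;AndreHunter`), shooting columns stored as `made-attempted` strings (`14-19`), and percentages stored as text with a `-` placeholder when there were no attempts. Below is a minimal sketch of how these could be handled. The helper names (`clean_name`, `split_made_attempted`, `parse_pct`) are my own and not part of the notebook, and treating `-` as NaN is an assumption about what the placeholder means.

```python
import html
import math

def clean_name(raw):
    """Decode HTML entities such as &#039; left over in the scraped names."""
    return html.unescape(raw)

def split_made_attempted(raw):
    """Turn a 'made-attempted' string like '14-19' into two integers."""
    made, attempted = raw.split("-")
    return int(made), int(attempted)

def parse_pct(raw):
    """Turn '70.8%' into 70.8; the '-' placeholder becomes NaN (assumption)."""
    if raw.strip() == "-":
        return math.nan
    return float(raw.rstrip("%"))

# Values copied from the rows shown above
print(clean_name("De&#039;AndreHunter"))
print(split_made_attempted("14-19"))
print(parse_pct("70.8%"), parse_pct("-"))
```

On the DataFrame these could be applied column by column, for example `df['PLAYER'].map(clean_name)` or `df['FG%'].map(parse_pct)`.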


## Patterns

Let's try to find something we can work with. The first thing we can identify is one certainty: the team with the most points always wins.
So I think it's fair not to try to correlate anything with the number of points.

First I tried looking at the number of rebounds. In this specific game the number of rebounds did not predict the outcome: the Timberwolves had two more rebounds than the Hawks (38 to 36) and still lost.

However, I will say it's interesting that ATL still has a slightly higher mean number of rebounds.

In [2]:
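# Team-total rows: the PLAYER column holds the team name
Atl = df[df['PLAYER'] == 'AtlantaHawks']
Min = df[df['PLAYER'] == 'MinnesotaTimberwolves']

# All rows belonging to each team. Judging by the printed range below, which equals
# the team total, this selection also picks up the team-total row.
Atl_Players = df[df['Team'] == 'AtlantaHawks']
Min_Players = df[df['Team'] == 'MinnesotaTimberwolves']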
# Print the total rebounds for each team
print("ATL Reb total: ", Atl['Reb'].sum(), "\n" "MIN Reb total: ", Min['Reb'].sum())
# Print the mean rebounds for each team
print("ATL Reb mean: ", Atl_Players['Reb'].mean(), "\n" "MIN Reb mean: ", Min_Players['Reb'].mean())
# Print the median rebounds for each team
print("ATL Reb median: ", Atl_Players['Reb'].median(), "\n" "MIN Reb median: ", Min_Players['Reb'].median())
# Print the standard deviation of rebounds for each team
print("ATL Reb std: ", Atl_Players['Reb'].std(), "\n" "MIN Reb std: ", Min_Players['Reb'].std())
# Print the variance of rebounds for each team
print("ATL Reb variance: ", Atl_Players['Reb'].var(), "\n" "MIN Reb variance: ", Min_Players['Reb'].var())
# Print the range of rebounds for each team
print("ATL Reb range: ", Atl_Players['Reb'].max() - Atl_Players['Reb'].min(), "\n" "MIN Reb range: ", Min_Players['Reb'].max() - Min_Players['Reb'].min())
# Print IQR of rebounds for each team
print("ATL Reb IQR: ", Atl_Players['Reb'].quantile(0.75) - Atl_Players['Reb'].quantile(0.25), "\n" "MIN Reb IQR: ", Min_Players['Reb'].quantile(0.75) - Min_Players['Reb'].quantile(0.25))
# Print the skewness of rebounds for each team
print("ATL Reb skewness: ", Atl_Players['Reb'].skew(), "\n" "MIN Reb skewness: ", Min_Players['Reb'].skew())
# Print the kurtosis of rebounds for each team
print("ATL Reb kurtosis: ", Atl_Players['Reb'].kurtosis(), "\n" "MIN Reb kurtosis: ", Min_Players['Reb'].kurtosis())
# Print the mode of rebounds for each team
print("ATL Reb mode: ", Atl_Players['Reb'].mode(), "\n" "MIN Reb mode: ", Min_Players['Reb'].mode())

ATL Reb total:  36 
MIN Reb total:  38
ATL Reb mean:  5.538461538461538 
MIN Reb mean:  5.428571428571429
ATL Reb median:  4.0 
MIN Reb median:  1.5
ATL Reb std:  9.570922844875728 
MIN Reb std:  10.17322493089116
ATL Reb variance:  91.60256410256409 
MIN Reb variance:  103.49450549450547
ATL Reb range:  36 
MIN Reb range:  38
ATL Reb IQR:  6.0 
MIN Reb IQR:  4.25
ATL Reb skewness:  3.0780734830403103 
MIN Reb skewness:  2.905644900916013
ATL Reb kurtosis:  10.296579803869134 
MIN Reb kurtosis:  9.117276874139808
ATL Reb mode:  0    0
Name: Reb, dtype: int64 
MIN Reb mode:  0    0
Name: Reb, dtype: int64


So finding the location and dispersion of each variable wasn't as bad an idea as I thought. I ended up realizing that more players on the ATL team contributed to rebounds, hence the higher mean and the smaller variance, standard deviation and range.

The most frequent value will typically be the lowest one, because most of the team is usually not gathered near the basket trying to get the ball.
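
One caveat about the numbers above: the printed range for each team (36 and 38) is the same as the team total, which suggests the team-total row is being counted as if it were a player. The quick check below works through what that would do to the mean. The row counts (13 and 14) are inferred from the printed means, not read from the data, so treat this as a sanity check under that assumption.

```python
import math

# Values copied from the printed output above
atl_total, min_total = 36, 38
atl_mean_printed = 5.538461538461538
min_mean_printed = 5.428571428571429

# Inferred row counts (assumption): players plus one team-total row
atl_rows, min_rows = 13, 14

# If the total row is included, the column sum is the team total counted twice
assert math.isclose((atl_total + atl_total) / atl_rows, atl_mean_printed)
assert math.isclose((min_total + min_total) / min_rows, min_mean_printed)

# Player-only means under the same assumption
atl_player_mean = atl_total / (atl_rows - 1)  # 36 / 12
min_player_mean = min_total / (min_rows - 1)  # 38 / 13
print(atl_player_mean, round(min_player_mean, 3))  # 3.0 2.923
```

Under this assumption ATL still comes out slightly ahead on mean rebounds, but the standard deviation, variance, range, skewness and kurtosis would all need to be recomputed on player rows only before reading much into them.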


## Let's do it again

This part will be a bit boring: just going over every measure and looking at its location and dispersion.

In [4]:
# Print the total assists for each team
print("ATL Ast total: ", Atl['Ast'].sum(), "\n" "MIN Ast total: ", Min['Ast'].sum())
# Print the mean assists for each team
print("ATL Ast mean: ", Atl_Players['Ast'].mean(), "\n" "MIN Ast mean: ", Min_Players['Ast'].mean())
# Print the median assists for each team
print("ATL Ast median: ", Atl_Players['Ast'].median(), "\n" "MIN Ast median: ", Min_Players['Ast'].median())
# Print the standard deviation of assists for each team
print("ATL Ast std: ", Atl_Players['Ast'].std(), "\n" "MIN Ast std: ", Min_Players['Ast'].std())
# Print the variance of assists for each team
print("ATL Ast variance: ", Atl_Players['Ast'].var(), "\n" "MIN Ast variance: ", Min_Players['Ast'].var())
# Print the range of assists for each team
print("ATL Ast range: ", Atl_Players['Ast'].max() - Atl_Players['Ast'].min(), "\n" "MIN Ast range: ", Min_Players['Ast'].max() - Min_Players['Ast'].min())
# Print IQR of assists for each team
print("ATL Ast IQR: ", Atl_Players['Ast'].quantile(0.75) - Atl_Players['Ast'].quantile(0.25), "\n" "MIN Ast IQR: ", Min_Players['Ast'].quantile(0.75) - Min_Players['Ast'].quantile(0.25))
# Print the skewness of assists for each team
print("ATL Ast skewness: ", Atl_Players['Ast'].skew(), "\n" "MIN Ast skewness: ", Min_Players['Ast'].skew())
# Print the kurtosis of assists for each team
print("ATL Ast kurtosis: ", Atl_Players['Ast'].kurtosis(), "\n" "MIN Ast kurtosis: ", Min_Players['Ast'].kurtosis())
# Print the mode of assists for each team
print("ATL Ast mode: ", Atl_Players['Ast'].mode(), "\n" "MIN Ast mode: ", Min_Players['Ast'].mode())

ATL Ast total:  28 
MIN Ast total:  24
ATL Ast mean:  4.3076923076923075 
MIN Ast mean:  3.4285714285714284
ATL Ast median:  2.0 
MIN Ast median:  1.5
ATL Ast std:  7.487596581287121 
MIN Ast std:  6.284465410399321
ATL Ast variance:  56.06410256410257 
MIN Ast variance:  39.4945054945055
ATL Ast range:  28 
MIN Ast range:  24
ATL Ast IQR:  3.0 
MIN Ast IQR:  3.0
ATL Ast skewness:  3.046286902813049 
MIN Ast skewness:  3.080176409511953
ATL Ast kurtosis:  9.965020596976382 
MIN Ast kurtosis:  10.28298517461312
ATL Ast mode:  0    0
1    1
Name: Ast, dtype: int64 
MIN Ast mode:  0    0
Name: Ast, dtype: int64
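
Since the same block of print calls gets repeated for every measure, it could be folded into one helper. Here is a rough sketch using `groupby` and `agg`. The function name `summarize` and the toy data are made up for illustration, and the filter assumes that a team-total row is one where `PLAYER` equals `Team`, as in the selections above. Skewness, kurtosis and mode are left out to keep it short.

```python
import pandas as pd

def summarize(df, column, team_col="Team", player_col="PLAYER"):
    """Location and dispersion of `column` per team, using player rows only."""
    # Assumption: the team-total row repeats the team name in the PLAYER column
    players = df[df[player_col] != df[team_col]]
    grouped = players.groupby(team_col)[column]
    summary = grouped.agg(["sum", "mean", "median", "std", "var", "min", "max"])
    summary["range"] = summary["max"] - summary["min"]
    summary["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return summary.drop(columns=["min", "max"])

# Hypothetical toy data, not taken from the real game
toy = pd.DataFrame({
    "PLAYER": ["A", "B", "C", "TeamX", "D", "E", "F", "TeamY"],
    "Team": ["TeamX"] * 4 + ["TeamY"] * 4,
    "Reb": [7, 1, 4, 12, 0, 6, 3, 9],
})
result = summarize(toy, "Reb")
print(result)
```

With the real data this would be called as `summarize(df, 'Reb')`, `summarize(df, 'Ast')` and so on, giving one table per measure instead of a wall of prints.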
