---------------------
# **P2 - Player Classification** 🥏
---------------------

In [110]:
# IMPORTS
import pandas as pd
import seaborn as sns

In [111]:
# Color Palette
palette = {
    'ltBlu': '#8ecae6',
    'mdBlu': '#219ebc',
    'dkBlu': '#023047',
    'gold': '#ffb703',
    'orange': '#fb8500'
}

In [112]:
# Get raw data
df = pd.read_csv('DATA/raw/20220227_player_all_time.csv')  # forward slashes work on Windows, macOS, and Linux

In [113]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2558 entries, 0 to 2557
Data columns (total 26 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Player  2558 non-null   object
 1   G       2558 non-null   int64 
 2   PP      2558 non-null   int64 
 3   POS     2558 non-null   int64 
 4   SCR     2558 non-null   int64 
 5   AST     2558 non-null   int64 
 6   GLS     2558 non-null   int64 
 7   BLK     2558 non-null   int64 
 8   +/- ▼   2558 non-null   int64 
 9   Cmp     2558 non-null   int64 
 10  Cmp%    2558 non-null   object
 11  Y       2558 non-null   int64 
 12  TY      2558 non-null   int64 
 13  RY      2558 non-null   int64 
 14  OEFF    2558 non-null   object
 15  HA      2558 non-null   int64 
 16  T       2558 non-null   int64 
 17  S       2558 non-null   int64 
 18  D       2558 non-null   int64 
 19  C       2558 non-null   int64 
 20  Hck     2558 non-null   int64 
 21  Hck%    2558 non-null   object
 22  Pul     2558 non-null   

# GAME PLAN
**DATA CLEANING**
1. Clean up stat column names
1. Verify dtypes are correct (e.g. OEFF = object)
1. Assess null information
1. Assign positions to players
1. Any Data Engineering?

**EDA**
1. Single variable EDA, focus on ___?
1. Multi-variable EDA, focus on differences in position
1. Correlation heatmap
1. Create df where relevant values are expressed per point played
    a. If any variables are highly collinear afterward, remove them
1. Remove one variable from each highly collinear pair
1. Create DF for just players w/ position
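The per-point and collinearity steps in the plan above could be sketched roughly as follows. This is a minimal illustration and not the notebook's eventual implementation: the helper names `per_point_rates` and `collinear_pairs`, the `0.9` threshold, and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def per_point_rates(frame, count_cols, denom="points_played"):
    """Return a copy with each count column also expressed per point played."""
    out = frame.copy()
    # Treat zero points played as missing so the division yields NaN, not inf.
    denom_vals = out[denom].replace(0, np.nan)
    for col in count_cols:
        out[f"{col}_per_point"] = out[col] / denom_vals
    return out


def collinear_pairs(frame, threshold=0.9):
    """List numeric column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs


# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "points_played": [100, 200, 0, 50],
    "goals": [10, 20, 0, 5],
    "assists": [5, 40, 0, 1],
})

rates = per_point_rates(toy, ["goals", "assists"])
pairs = collinear_pairs(toy[["points_played", "goals"]])
```

Raw counts such as `goals` tend to scale with `points_played`, which is why the plan checks collinearity again after converting to per-point rates.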

In [114]:
# Clean up column names
col_names = {'Player': 'player',
             'G': 'games',
             'PP': 'points_played',
             'POS': 'off_possessions',
             'SCR': 'scores', # assists + goals
             'AST': 'assists',
             'GLS': 'goals',
             'BLK': 'blocks',
             '+/- ▼': 'plus_minus', # +1 per goal/assist/block, -1 per throwaway and drop
             'Cmp': 'completions',
             'Cmp%': 'completion_pct', # completion pct, must have >= 100 throw attempts
             'Y': 'total_yards',
             'TY': 'throwing_yards',
             'RY': 'receiving_yards',
             'OEFF': 'off_efficiency', # team scores while on field / off. possessions
             'HA': 'hockey_assists',
             'T': 'throwaways',
             'S': 'stalls',
             'D': 'drops',
             'C': 'callahans',
             'Hck': 'hucks', # > 40yd throws downfield
             'Hck%': 'huck_pct', # completion pct, must have 10+ huck attempts
             'Pul': 'pulls',
             'OPP': 'off_points_played',
             'DPP': 'def_points_played',
             'MP': 'minutes_played'
}

df.columns = [col_names[i] for i in df.columns]
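The list comprehension above raises a bare `KeyError` on the first column that has no entry in the mapping. A slightly more descriptive variant, sketched here under the assumption that every source column should be renamed, reports all unmapped columns at once and then uses `DataFrame.rename`. The helper name `rename_known_columns` and the demo frame are hypothetical.

```python
import pandas as pd


def rename_known_columns(frame, mapping):
    """Rename columns via mapping, failing loudly if any column is unmapped."""
    missing = [c for c in frame.columns if c not in mapping]
    if missing:
        raise KeyError(f"No new name defined for columns: {missing}")
    return frame.rename(columns=mapping)


# Small demo frame (hypothetical), using abbreviations from the raw file
demo = pd.DataFrame({"G": [1], "PP": [20]})
demo_map = {"G": "games", "PP": "points_played", "BLK": "blocks"}

renamed = rename_known_columns(demo, demo_map)
```

Extra keys in the mapping (such as `BLK` here) are ignored, so one dictionary can serve several exports with slightly different column sets.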

In [115]:
# Verify renamed columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2558 entries, 0 to 2557
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   player             2558 non-null   object
 1   games              2558 non-null   int64 
 2   points_played      2558 non-null   int64 
 3   off_possessions    2558 non-null   int64 
 4   scores             2558 non-null   int64 
 5   assists            2558 non-null   int64 
 6   goals              2558 non-null   int64 
 7   blocks             2558 non-null   int64 
 8   plus_minus         2558 non-null   int64 
 9   completions        2558 non-null   int64 
 10  completion_pct     2558 non-null   object
 11  total_yards        2558 non-null   int64 
 12  throwing_yards     2558 non-null   int64 
 13  receiving_yards    2558 non-null   int64 
 14  off_efficiency     2558 non-null   object
 15  hockey_assists     2558 non-null   int64 
 16  throwaways         2558 non-null   int64 


In [116]:
len(df[df['hucks'] == 0])

2144

In [117]:
# Transform Object Columns into int/float
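# '--' marks a missing value in completion_pct, off_efficiency, and huck_pct.
# Replace it with NaN explicitly: passing value=None can be interpreted
# differently depending on the pandas version.
df.replace(to_replace='--', value=float('nan'), inplace=True)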

In [118]:
num_cols = ['completion_pct', 'off_efficiency', 'huck_pct']
for i in num_cols:
    df[i] = df[i].astype('float')
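As an alternative sketch of the two cells above, the placeholder replacement and the cast can be combined in one helper built on `pd.to_numeric`. It assumes `'--'` is the only non-numeric token in these columns; any other unexpected string (for example a value with a trailing `%`) raises an error instead of silently becoming NaN. The helper name `to_float_columns` and the sample values are hypothetical.

```python
import pandas as pd


def to_float_columns(frame, cols, missing_token="--"):
    """Return a copy where cols are floats and missing_token becomes NaN."""
    out = frame.copy()
    for col in cols:
        # Keep values that differ from the token; everything else becomes NaN.
        masked = out[col].where(out[col] != missing_token)
        # errors="raise" surfaces any other non-numeric string.
        out[col] = pd.to_numeric(masked, errors="raise")
    return out


# Hypothetical sample resembling the raw object columns
raw = pd.DataFrame({
    "completion_pct": ["91.2", "--", "88"],
    "huck_pct": ["--", "--", "50.0"],
})

cleaned = to_float_columns(raw, ["completion_pct", "huck_pct"])
```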

`df.info()` gives the following information:
- completion_pct     1317 non-null   object
- off_efficiency     1574 non-null   object
- huck_pct           107 non-null    object

`completion_pct` is **48.5%** null <br>
`off_efficiency` is **38.5%** null <br>
`huck_pct` is **95.8%** null <br>
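The percentages above follow from the non-null counts and the 2558 total rows. A short sketch of both the arithmetic and the equivalent pandas computation is shown here; the helper names `pct_null` and `null_share` and the demo frame are assumptions for illustration.

```python
import pandas as pd

TOTAL_ROWS = 2558  # from df.info()


def pct_null(non_null, total=TOTAL_ROWS):
    """Percentage of missing values given a non-null count."""
    return round((total - non_null) / total * 100, 1)


completion_null = pct_null(1317)  # 48.5
efficiency_null = pct_null(1574)  # 38.5
huck_null = pct_null(107)         # 95.8


def null_share(frame, cols):
    """Percentage of nulls per column, computed directly from a DataFrame."""
    return (frame[cols].isna().mean() * 100).round(1)


# Hypothetical demo frame; on the real data this would be called on df
demo = pd.DataFrame({
    "completion_pct": [90.0, None, None, 85.0],
    "off_efficiency": [None, 40.0, 50.0, 60.0],
})
shares = null_share(demo, ["completion_pct", "off_efficiency"])
```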

Because `huck_pct` is almost entirely missing, this column is dropped from the final processed dataset. Similarly, `hucks` is dropped because the large majority of its values (2,144 of 2,558 rows) are 0. `completion_pct` and `off_efficiency` are left in, null values included. Many clustering algorithms 

In [119]:
df.drop(labels=['huck_pct', 'hucks'], axis=1, inplace=True)

In [120]:
# Assign Positions to Players
player_pos = pd.read_csv("DATA/raw/20220304_player_positions.csv").drop(labels='Unnamed: 0.1', axis=1)
player_pos.sample(5)

Unnamed: 0.1,Unnamed: 0,player_link,position
160,Andrew Misthos,https://theaudl.com/league/players/amisthos,
46,AJ Jacoski,https://theaudl.com/league/players/ajacoski,Defender
851,Elliot Warner,https://theaudl.com/league/players/ewarner,
582,Conner Henderson,https://theaudl.com/league/players/chenderso,
2062,Paul Karavayev,https://theaudl.com/league/players/pkaravaye,


In [121]:
player_pos = player_pos.rename({'Unnamed: 0': 'player_name'}, axis=1)

In [122]:
player_pos.set_index('player_name', inplace=True)
merged = df.join(other=player_pos, on='player', how='left')
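A left join on player name silently duplicates stat rows if a name appears more than once in the positions table. The sketch below shows one way to guard against that with `DataFrame.merge` and its `validate` and `indicator` parameters. It keeps `player_name` as a regular column rather than the index, and the helper name `attach_positions` and the demo frames are hypothetical.

```python
import pandas as pd


def attach_positions(stats, positions, stats_key="player", pos_key="player_name"):
    """Left-join positions onto stats; fail if position names are not unique."""
    merged = stats.merge(
        positions,
        left_on=stats_key,
        right_on=pos_key,
        how="left",
        validate="many_to_one",  # raises MergeError on duplicate names in positions
        indicator=True,          # adds a _merge column: 'both' or 'left_only'
    )
    unmatched = int((merged["_merge"] == "left_only").sum())
    merged = merged.drop(columns=[pos_key, "_merge"])
    return merged, unmatched


# Hypothetical demo frames
stats = pd.DataFrame({"player": ["A", "B", "C"], "games": [1, 2, 3]})
positions = pd.DataFrame({
    "player_name": ["A", "C"],
    "position": ["Defender", "Cutter"],
})

joined, n_unmatched = attach_positions(stats, positions)
```

The unmatched count gives a quick sense of how many players in the stats table have no row at all in the positions table.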

In [126]:
# Drop link col and preview
merged.drop(labels='player_link', axis=1, inplace=True)
merged.sample(5)

Unnamed: 0,player,games,points_played,off_possessions,scores,assists,goals,blocks,plus_minus,completions,...,hockey_assists,throwaways,stalls,drops,callahans,pulls,off_points_played,def_points_played,minutes_played,position
1442,Felix Leonard,7,79,49,8,5,3,2,9,30,...,3,0,0,1,0,0,0,79,90,
1988,Matt Gubernick,3,26,18,1,0,1,0,1,3,...,0,0,0,0,0,0,0,26,29,
879,Taylor Nadon,13,262,329,37,16,21,2,24,119,...,9,9,0,6,0,2,203,59,213,
882,Jeremy Hess,13,245,167,17,11,6,10,23,55,...,4,4,0,0,0,139,39,206,269,
2309,Alex Hutton,2,21,15,0,0,0,0,-1,4,...,0,0,0,1,0,0,6,15,15,


In [127]:
# SAVE FILE
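# index=False avoids writing the RangeIndex, which reappears as an 'Unnamed: 0' column on reload
merged.to_csv(".\\DATA\\postproc\\players.csv", index=False)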