## Setting up Dataframes

- I need to combine these dataframes carefully so that I can build effective features.
- I have three different tables for every year between 2008 and 2018.
- One of those tables holds team statistics, which is the format in which I will feed variables into my model. I need to convert the individual player statistics into values that can be represented in the team statistics table.
- This will require concatenating each 'type' of dataframe so that I have one dataframe covering all years.
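
As a rough sketch of the plan above, player rows can be aggregated per team and season and then merged onto the team table. The toy values and the chosen aggregates (total points, mean age) are assumptions for illustration only; the column names `team`, `year`, `points`, `age` and `team_name` follow the tables shown later in this notebook.

```python
import pandas as pd

# Toy player-level rows (values are made up for illustration)
players = pd.DataFrame({
    'team': ['Anaheim Ducks', 'Anaheim Ducks', 'Boston Bruins'],
    'year': [2018, 2018, 2018],
    'points': [69, 61, 80],
    'age': [24.0, 32.0, 29.0],
})

# Toy team-level rows (values are made up for illustration)
teams = pd.DataFrame({
    'team_name': ['Anaheim Ducks', 'Boston Bruins'],
    'year': [2018, 2018],
    'wins': [44, 50],
})

# Collapse players to one row per team and season
team_features = (players
                 .groupby(['team', 'year'], as_index=False)
                 .agg(player_points_total=('points', 'sum'),
                      player_age_mean=('age', 'mean'))
                 .rename(columns={'team': 'team_name'}))

# Attach the aggregated features to the team table
merged = teams.merge(team_features, how='left', on=['team_name', 'year'])
```

A left merge keeps every team row even if no player rows match, which makes name mismatches show up as NaN rather than as silently dropped teams.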

#### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Importing the scraped data from CSVs

In [2]:
# I will probably run out of time to use this salary data, 
# but I'll import it here just in case

salaries_2018 = pd.read_csv('./2018 player salaries.csv', index_col=0)
salaries_2017 = pd.read_csv('./2017 player salaries.csv', index_col=0)
salaries_2016 = pd.read_csv('./2016 player salaries.csv', index_col=0)
salaries_2015 = pd.read_csv('./2015 player salaries.csv', index_col=0)
salaries_2014 = pd.read_csv('./2014 player salaries.csv', index_col=0)
salaries_2013 = pd.read_csv('./2013 player salaries.csv', index_col=0)
salaries_2012 = pd.read_csv('./2012 player salaries.csv', index_col=0)
salaries_2011 = pd.read_csv('./2011 player salaries.csv', index_col=0)

player_stats_2018 = pd.read_csv('./2018 player stats.csv', index_col=0)
player_stats_2017 = pd.read_csv('./2017 player stats.csv', index_col=0)
player_stats_2016 = pd.read_csv('./2016 player stats.csv', index_col=0)
player_stats_2015 = pd.read_csv('./2015 player stats.csv', index_col=0)
player_stats_2014 = pd.read_csv('./2014 player stats.csv', index_col=0)
player_stats_2013 = pd.read_csv('./2013 player stats.csv', index_col=0)
player_stats_2012 = pd.read_csv('./2012 player stats.csv', index_col=0)
player_stats_2011 = pd.read_csv('./2011 player stats.csv', index_col=0)
player_stats_2010 = pd.read_csv('./2010 player stats.csv', index_col=0)
player_stats_2009 = pd.read_csv('./2009 player stats.csv', index_col=0)
player_stats_2008 = pd.read_csv('./2008 player stats.csv', index_col=0)

league_season_2018 = pd.read_csv('./2018 league season.csv', index_col=0)
league_season_2017 = pd.read_csv('./2017 league season.csv', index_col=0)
league_season_2016 = pd.read_csv('./2016 league season.csv', index_col=0)
league_season_2015 = pd.read_csv('./2015 league season.csv', index_col=0)
league_season_2014 = pd.read_csv('./2014 league season.csv', index_col=0)
league_season_2013 = pd.read_csv('./2013 league season.csv', index_col=0)
league_season_2012 = pd.read_csv('./2012 league season.csv', index_col=0)
league_season_2011 = pd.read_csv('./2011 league season.csv', index_col=0)
league_season_2010 = pd.read_csv('./2010 league season.csv', index_col=0)
league_season_2009 = pd.read_csv('./2009 league season.csv', index_col=0)
league_season_2008 = pd.read_csv('./2008 league season.csv', index_col=0)

advanced_stats_2018 = pd.read_csv('./2018 advanced stats.csv', index_col=0)
advanced_stats_2017 = pd.read_csv('./2017 advanced stats.csv', index_col=0)
advanced_stats_2016 = pd.read_csv('./2016 advanced stats.csv', index_col=0)
advanced_stats_2015 = pd.read_csv('./2015 advanced stats.csv', index_col=0)
advanced_stats_2014 = pd.read_csv('./2014 advanced stats.csv', index_col=0)
advanced_stats_2013 = pd.read_csv('./2013 advanced stats.csv', index_col=0)
advanced_stats_2012 = pd.read_csv('./2012 advanced stats.csv', index_col=0)
advanced_stats_2011 = pd.read_csv('./2011 advanced stats.csv', index_col=0)
advanced_stats_2010 = pd.read_csv('./2010 advanced stats.csv', index_col=0)
advanced_stats_2009 = pd.read_csv('./2009 advanced stats.csv', index_col=0)
advanced_stats_2008 = pd.read_csv('./2008 advanced stats.csv', index_col=0)

playoff_results = pd.read_csv('./NHL Playoffs 2008-2018 vertical.csv', index_col=0)
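
The repeated `read_csv` calls above could be collapsed into a loop. This is only a sketch, assuming every file follows the `'<year> <table>.csv'` naming pattern used in the cell above; the helper name `load_yearly` and the dict-of-dataframes layout are hypothetical and are not used elsewhere in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_yearly(table, years, folder='.'):
    """Read '<year> <table>.csv' for each year into a dict keyed by year."""
    frames = {}
    for year in years:
        path = Path(folder) / f'{year} {table}.csv'
        frames[year] = pd.read_csv(path, index_col=0)
    return frames

# Example call (same files as the cell above):
# league_seasons = load_yearly('league season', range(2008, 2019))
```

With this layout, `league_seasons[2018]` would take the place of `league_season_2018`. The later cells in this notebook still rely on the individual variable names.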

In [3]:
pd.set_option('display.max_columns', 500)

### Adding a year column to team stats tables

- Because I will be concatenating these dataframes, each row needs a column that identifies its season.

In [4]:
league_season_2018['year'] = 2018
league_season_2017['year'] = 2017
league_season_2016['year'] = 2016
league_season_2015['year'] = 2015
league_season_2014['year'] = 2014
league_season_2013['year'] = 2013
league_season_2012['year'] = 2012
league_season_2011['year'] = 2011
league_season_2010['year'] = 2010
league_season_2009['year'] = 2009
league_season_2008['year'] = 2008

#### Concatenating each season into one dataframe

In [5]:
full_stats_df = pd.concat([league_season_2018, league_season_2017, league_season_2016,
                           league_season_2015, league_season_2014, league_season_2013,
                           league_season_2012, league_season_2011, league_season_2010,
                           league_season_2009, league_season_2008], axis=0)
full_stats_df.reset_index(drop=True, inplace=True)
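
The year tagging and the concatenation can also be done in one step. A minimal sketch, assuming the per-season tables are held in a dict keyed by year (the `seasons` dict and its toy values are hypothetical):

```python
import pandas as pd

# Toy per-season tables keyed by year (values are made up for illustration)
seasons = {
    2018: pd.DataFrame({'team_name': ['Team A'], 'wins': [53]}),
    2017: pd.DataFrame({'team_name': ['Team B'], 'wins': [48]}),
}

# Tag each table with its season, then stack them into one frame
combined = pd.concat(
    [df.assign(year=year) for year, df in seasons.items()],
    axis=0, ignore_index=True,
)
```

`ignore_index=True` gives the same fresh 0..n-1 index as the separate `reset_index(drop=True)` call.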

In [6]:
full_stats_df.shape

(331, 33)

#### Cleaning full_stats_df dataframe

In [7]:
full_stats_df['team_name'] = full_stats_df['team_name'].replace('[^A-Za-z ]', '', regex=True)
full_stats_df

Unnamed: 0,average_age,chances_pp,games,goals,goals_against_ev,goals_ev,goals_pp,goals_sh,losses,losses_ot,losses_shootout,opp_chances_pp,opp_goals,opp_goals_pp,opp_goals_sh,pdo,pen_kill_pct,pen_min_per_game,pen_min_per_game_opp,points,points_pct,power_play_pct,save_pct,shot_pct,shots,shots_against,sos,srs,team_name,total_goals_per_game,wins,wins_shootout,year
0,28.4,274,82,267,145,193,58,10,18,11,7,299,211,54,5,101.6,81.94,11.3,9.6,117,0.713,21.17,0.923,9.9,2641,2659,0.03,0.71,Nashville Predators,5.83,53,6,2018
1,26.8,274,82,277,159,200,64,9,20,10,2,274,218,50,7,101.0,81.75,8.5,8.6,114,0.695,23.36,0.917,10.3,2643,2613,0.02,0.74,Winnipeg Jets,6.04,52,4,2018
2,27.5,276,82,296,172,216,66,9,23,5,2,267,236,64,3,102.0,76.03,10.1,10.4,113,0.689,23.91,0.912,10.7,2737,2756,-0.07,0.66,Tampa Bay Lightning,6.49,54,6,2018
3,28.6,258,82,270,161,197,61,9,20,12,3,245,214,40,10,100.2,83.67,9.5,9.6,112,0.683,23.64,0.912,9.9,2703,2399,-0.07,0.62,Boston Bruins,5.90,50,3,2018
4,28.0,248,82,272,182,218,53,8,24,7,3,237,228,44,5,100.5,81.43,7.1,7.8,109,0.665,21.37,0.911,10.1,2774,2619,-0.01,0.52,Vegas Golden Knights,6.10,51,4,2018
5,28.4,244,82,259,178,197,55,4,26,7,1,269,239,53,8,101.4,80.30,9.9,9.3,105,0.640,22.54,0.909,10.7,2400,2637,-0.04,0.21,Washington Capitals,6.07,49,3,2018
6,28.3,224,82,277,189,213,56,4,26,7,2,231,232,43,5,101.6,81.39,7.4,7.1,105,0.640,25.00,0.915,10.1,2700,2844,-0.06,0.49,Toronto Maple Leafs,6.21,49,7,2018
7,28.7,214,82,235,159,183,38,10,25,13,7,274,216,46,4,101.2,83.21,10.0,8.6,101,0.616,17.76,0.923,9.3,2475,2716,0.01,0.24,Anaheim Ducks,5.50,44,4,2018
8,29.5,240,82,253,176,194,49,7,26,11,3,272,232,51,6,100.7,81.25,8.4,7.7,101,0.616,20.42,0.910,10.0,2506,2595,0.04,0.29,Minnesota Wild,5.91,45,3,2018
9,27.7,260,82,272,195,198,68,6,29,6,2,265,250,53,3,98.5,80.00,9.6,9.2,100,0.610,26.15,0.902,9.6,2845,2575,-0.04,0.23,Pittsburgh Penguins,6.37,47,2,2018
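
To make the effect of the cleaning pattern concrete, here is a small sketch on two sample names. The trailing `*` on the first name is an assumption about what the scraped data contains. The second name shows a side effect: the pattern also removes the period in "St. Louis Blues", which is how that team is written in the playoff table further down.

```python
import pandas as pd

names = pd.Series(['Nashville Predators*', 'St. Louis Blues'])

# The pattern used above drops every character that is not a letter or a space
regex_cleaned = names.replace('[^A-Za-z ]', '', regex=True)

# Alternative: strip only a trailing asterisk and leave other punctuation alone
asterisk_stripped = names.str.rstrip('*')
```

If the names are later compared against another table, whichever cleaning is chosen needs to be applied to both sides so that the strings still match exactly.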


#### Cleaning playoff results dataframe
- This will eventually become my target

In [8]:
playoff_results.fillna(0, inplace=True)
playoff_results.rename(index=str, columns={'2018_1': 'cup_champs', '2018_2': 'cup_finals', '2018_4': 'conf_fin', 
                                           '2018_8': 'rd_2', '2018_16': 'rd_1'}, inplace=True)

playoff_results.replace(to_replace={'cup_champs': True, 'cup_finals': True, 'conf_fin': True,
                                    'rd_2': True, 'rd_1': True}, value={'cup_champs': 10,
                                                                        'cup_finals': 8, 'conf_fin': 4,
                                                                        'rd_2': 2, 'rd_1': 1}, inplace=True)

In [9]:
playoff_results.head()

Team,cup_champs,cup_finals,conf_fin,rd_2,rd_1,Year
Washington Capitals,10,8,4,2,1,2018
Vegas Golden Knights,0,8,4,2,1,2018
Winnipeg Jets Atlanta Thrashers,0,0,4,2,1,2018
Tampa Bay Lightning,0,0,4,2,1,2018
Nashville Predators,0,0,0,2,1,2018


In [10]:
cup_champs = playoff_results.drop(columns=['cup_finals', 'conf_fin', 'rd_2', 'rd_1'])
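# Assign the result back; an in-place replace on a selected column may not modify the dataframe in newer pandas
cup_champs['cup_champs'] = cup_champs['cup_champs'].replace(to_replace=10, value=1)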
cup_champs.reset_index(inplace=True)
cup_champs.head()

Unnamed: 0,Team,cup_champs,Year
0,Washington Capitals,1,2018
1,Vegas Golden Knights,0,2018
2,Winnipeg Jets Atlanta Thrashers,0,2018
3,Tampa Bay Lightning,0,2018
4,Nashville Predators,0,2018


#### Saving clean cup_champs dataframe to csv

In [11]:
cup_champs.to_csv('cup champs.csv')

  ---

### Joining the champion for every year as a column in the full stats df

In [12]:
champ_dict = cup_champs[cup_champs['cup_champs']==1].set_index('Year').drop(columns='cup_champs').to_dict()['Team']

In [13]:
cup_champs['Team'].map(lambda x: 'Winnipeg Jets' if 'Jets' in x else x)

0        Washington Capitals
1       Vegas Golden Knights
2              Winnipeg Jets
3        Tampa Bay Lightning
4        Nashville Predators
5            San Jose Sharks
6              Boston Bruins
7        Pittsburgh Penguins
8        Philadelphia Flyers
9        Toronto Maple Leafs
10     Columbus Blue Jackets
11          New Jersey Devis
12             Anaheim Ducks
13            Minnesota Wild
14         Los Angeles Kings
15        Colorado Avalanche
16           Ottawa Senators
17           St. Louis Blues
18           Edmonton Oilers
19          New York Rangers
20        Montreal Canadians
21            Calgary Flames
22        Chicago Blackhawks
23              Dallas Stars
24        New York Islanders
25         Detriot Red Wings
26          Florida Panthers
27         Vancouver Canucks
28           Arizona Coyotes
29            Buffalo Sabres
               ...          
301      Washington Capitals
302            Winnipeg Jets
303      Tampa Bay Lightning
304      Nashv

In [14]:
champ_dict.values()

dict_values(['Washington Capitals', 'Pittsburgh Penguins', 'Pittsburgh Penguins', 'Chicago Blackhawks', 'Los Angeles Kings', 'Chicago Blackhawks', 'Los Angeles Kings', 'Boston Bruins', 'Chicago Blackhawks', 'Pittsburgh Penguins', 'Detriot Red Wings'])
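
The list above contains the spelling "Detriot Red Wings". If the team table spells the name differently (an assumption, since the 2008 rows are not shown here), an exact string comparison would leave that season without a champion. A minimal sketch of a check for this, using toy stand-ins for `full_stats_df` and `champ_dict`:

```python
import pandas as pd

# Toy stand-ins for full_stats_df and champ_dict
stats = pd.DataFrame({
    'team_name': ['Washington Capitals', 'Detroit Red Wings'],
    'year': [2018, 2008],
})
champions = {2018: 'Washington Capitals', 2008: 'Detriot Red Wings'}

# Set of (year, team) pairs present in the team table
known = set(zip(stats['year'].tolist(), stats['team_name'].tolist()))

# Champions whose name has no exact match in that season
unmatched = {year: team for year, team in champions.items()
             if (year, team) not in known}
```

An empty `unmatched` dict means every champion name lines up with a row in the team table.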

In [15]:
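def is_champion(series, champion_dict):
    """Return 1 if the row's team won the Cup in the row's year, else 0."""
    team = series['team_name']
    year = series['year']
    # champion_dict maps year -> champion team name
    champ = champion_dict[year]
    return 1 if team == champ else 0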

In [16]:
full_stats_df['is_champ'] = full_stats_df.apply(is_champion, champion_dict=champ_dict, axis=1)
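
The same flag can be computed without a row-wise `apply`. This is a sketch of one vectorized alternative on toy data (the `stats` and `champions` names are stand-ins for `full_stats_df` and `champ_dict`):

```python
import pandas as pd

stats = pd.DataFrame({
    'team_name': ['Nashville Predators', 'Washington Capitals', 'Pittsburgh Penguins'],
    'year': [2018, 2018, 2017],
})
champions = {2018: 'Washington Capitals', 2017: 'Pittsburgh Penguins'}

# Look up each row's champion by year, compare with the row's team, cast to 0/1
stats['is_champ'] = (stats['year'].map(champions) == stats['team_name']).astype(int)
```

One behavioral difference: `map` yields NaN for a year that is missing from the dict, so such rows get 0, whereas `champion_dict[year]` raises a `KeyError`.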

In [17]:
full_stats_df.head(7)

Unnamed: 0,average_age,chances_pp,games,goals,goals_against_ev,goals_ev,goals_pp,goals_sh,losses,losses_ot,losses_shootout,opp_chances_pp,opp_goals,opp_goals_pp,opp_goals_sh,pdo,pen_kill_pct,pen_min_per_game,pen_min_per_game_opp,points,points_pct,power_play_pct,save_pct,shot_pct,shots,shots_against,sos,srs,team_name,total_goals_per_game,wins,wins_shootout,year,is_champ
0,28.4,274,82,267,145,193,58,10,18,11,7,299,211,54,5,101.6,81.94,11.3,9.6,117,0.713,21.17,0.923,9.9,2641,2659,0.03,0.71,Nashville Predators,5.83,53,6,2018,0
1,26.8,274,82,277,159,200,64,9,20,10,2,274,218,50,7,101.0,81.75,8.5,8.6,114,0.695,23.36,0.917,10.3,2643,2613,0.02,0.74,Winnipeg Jets,6.04,52,4,2018,0
2,27.5,276,82,296,172,216,66,9,23,5,2,267,236,64,3,102.0,76.03,10.1,10.4,113,0.689,23.91,0.912,10.7,2737,2756,-0.07,0.66,Tampa Bay Lightning,6.49,54,6,2018,0
3,28.6,258,82,270,161,197,61,9,20,12,3,245,214,40,10,100.2,83.67,9.5,9.6,112,0.683,23.64,0.912,9.9,2703,2399,-0.07,0.62,Boston Bruins,5.9,50,3,2018,0
4,28.0,248,82,272,182,218,53,8,24,7,3,237,228,44,5,100.5,81.43,7.1,7.8,109,0.665,21.37,0.911,10.1,2774,2619,-0.01,0.52,Vegas Golden Knights,6.1,51,4,2018,0
5,28.4,244,82,259,178,197,55,4,26,7,1,269,239,53,8,101.4,80.3,9.9,9.3,105,0.64,22.54,0.909,10.7,2400,2637,-0.04,0.21,Washington Capitals,6.07,49,3,2018,1
6,28.3,224,82,277,189,213,56,4,26,7,2,231,232,43,5,101.6,81.39,7.4,7.1,105,0.64,25.0,0.915,10.1,2700,2844,-0.06,0.49,Toronto Maple Leafs,6.21,49,7,2018,0


### Joining all years of the individual player dataframes

In [18]:
full_player_stats = pd.concat([player_stats_2018, player_stats_2017, player_stats_2016,
                           player_stats_2015, player_stats_2014, player_stats_2013,
                           player_stats_2012, player_stats_2011, player_stats_2010,
                           player_stats_2009, player_stats_2008], axis=0)
full_player_stats.reset_index(drop=True, inplace=True)

In [19]:
full_player_stats.head()

Unnamed: 0,age,assists,dps,es_assists,es_blocks,es_faceoff_losses,es_faceoff_pct,es_faceoff_wins,es_goals,es_hits,games_played,goals,gw_goals,ops,penalty_minutes,player,plus_minus,point_shares,points,position,pp_assists,pp_goals,sh_assists,sh_goals,shot_pct,shots,team,toi,toi_avg,year
0,24.0,35,1.9,25.0,30.0,121.0,46.5,105.0,26,115,77,34,3,6.4,14,Rickard Rakell,6.0,8.3,69,RW,10.0,8,0.0,0,14.8,230,Anaheim Ducks,1495,19:25,2018
1,32.0,50,2.1,36.0,57.0,443.0,47.8,406.0,10,96,56,11,0,4.0,42,Ryan Getzlaf,20.0,6.2,61,C,13.0,0,1.0,1,9.4,117,Anaheim Ducks,1200,21:26,2018
2,32.0,32,1.2,25.0,36.0,14.0,12.5,2.0,13,59,71,17,1,3.3,71,Corey Perry,-4.0,4.5,49,RW,7.0,4,0.0,0,10.1,168,Anaheim Ducks,1262,17:47,2018
3,27.0,23,1.8,16.0,61.0,35.0,34.0,18.0,14,52,77,17,2,2.0,18,Jakob Silfverberg,6.0,3.8,40,LW,6.0,2,1.0,1,9.1,187,Anaheim Ducks,1383,17:58,2018
4,22.0,18,1.7,16.0,26.0,21.0,19.2,5.0,19,28,66,20,5,3.4,14,Ondrej Kase,18.0,5.1,38,RW,2.0,1,0.0,0,13.7,146,Anaheim Ducks,919,13:55,2018


- This dataset has a lot of NaN values for goalies. These NaN values relate to stats that goalies do not record, like shots taken or shooting percentage, so I am filling them with 0.

In [None]:
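# Write the zero-filled goalie rows back into the full table (update aligns on the index)
full_player_stats.update(full_player_stats.loc[full_player_stats['position'] == 'G'].fillna(value=0))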

In [43]:
full_player_stats.loc[full_player_stats['position'] == 'G'].fillna(value=0)

Unnamed: 0,age,assists,dps,es_assists,es_blocks,es_faceoff_losses,es_faceoff_pct,es_faceoff_wins,es_goals,es_hits,games_played,goals,gw_goals,ops,penalty_minutes,player,plus_minus,point_shares,points,position,pp_assists,pp_goals,sh_assists,sh_goals,shot_pct,shots,team,toi,toi_avg,year
30,24.0,1,0.0,1.0,0.0,0.0,0.0,0.0,0,0,60,0,0,0.0,16,John Gibson,0.0,13.2,1,G,0.0,0,0.0,0,0.0,0,Anaheim Ducks,3428,57:08,2018
33,31.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,5,0,0,0.0,0,Reto Berra,0.0,0.7,0,G,0.0,0,0.0,0,0.0,0,Anaheim Ducks,182,36:23,2018
38,37.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,28,0,0,0.0,0,Ryan Miller,0.0,5.4,0,G,0.0,0,0.0,0,0.0,0,Anaheim Ducks,1353,48:20,2018
68,25.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,7,0,0,0.0,0,Louis Domingue,0.0,-0.1,0,G,0.0,0,0.0,0,0.0,0,Arizona Coyotes,388,55:24,2018
71,21.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,4,0,0,0.0,0,Adin Hill,0.0,0.4,0,G,0.0,0,0.0,0,0.0,0,Arizona Coyotes,241,60:11,2018
72,27.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,10,0,0,0.0,0,Darcy Kuemper,0.0,1.3,0,G,0.0,0,0.0,0,0.0,0,Arizona Coyotes,597,59:40,2018
73,23.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,0,0.0,0,Marek Langhamer,0.0,0.2,0,G,0.0,0,0.0,0,0.0,0,Arizona Coyotes,29,28:53,2018
76,28.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,47,0,0,0.0,0,Antti Raanta,0.0,10.4,0,G,0.0,0,0.0,0,0.0,0,Arizona Coyotes,2599,55:18,2018
77,25.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,20,0,0,0.0,2,Scott Wedgewood,0.0,2.0,0,G,0.0,0,0.0,0,0.0,0,Arizona Coyotes,1097,54:50,2018
107,31.0,2,0.0,1.0,0.0,0.0,0.0,0.0,0,0,31,0,0,0.0,2,Anton Khudobin,0.0,5.0,2,G,1.0,0,0.0,0,0.0,0,Boston Bruins,1781,57:27,2018
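
Since `fillna` returns a new dataframe rather than modifying the original, the filled values have to be written back to persist. A minimal sketch of one way to do that with a boolean mask, on toy data (player names and values are made up):

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    'player': ['Goalie A', 'Skater B'],
    'position': ['G', 'C'],
    'shot_pct': [np.nan, np.nan],
    'shots': [0, 12],
})

is_goalie = players['position'] == 'G'
numeric_cols = players.select_dtypes('number').columns

# Assign the filled values back so the change persists
players.loc[is_goalie, numeric_cols] = players.loc[is_goalie, numeric_cols].fillna(0)
```

Restricting the fill to goalie rows leaves a skater's missing values as NaN, so they can be handled separately.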


In [21]:
full_player_stats.shape

(11568, 30)

In [22]:
full_player_stats.isnull().sum()

age                     1
assists                 0
dps                     0
es_assists            543
es_blocks             146
es_faceoff_losses     146
es_faceoff_pct       4959
es_faceoff_wins       146
es_goals                0
es_hits                 0
games_played            0
goals                   0
gw_goals                0
ops                     0
penalty_minutes         0
player                  0
plus_minus            112
point_shares            0
points                  0
position                0
pp_assists            543
pp_goals                0
sh_assists            543
sh_goals                0
shot_pct             1343
shots                   0
team                    0
toi                     0
toi_avg                 0
year                    0
dtype: int64

In [None]:
full_player_stats = full_player_stats[full_player_stats['games_played'] > 10]

In [None]:
full_player_stats.shape
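
A small sketch of where the threshold falls: `> 10` also drops players with exactly 10 games, while `>= 10` keeps them. The toy rows are made up for illustration.

```python
import pandas as pd

players = pd.DataFrame({
    'player': ['A', 'B', 'C'],
    'games_played': [9, 10, 11],
})

# Strictly more than 10 games: only player C remains
more_than_10 = players[players['games_played'] > 10]

# At least 10 games: players B and C remain
at_least_10 = players[players['games_played'] >= 10]
```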

### Joining all years of the advanced stats dataframes

- Adding a year column to each dataframe

In [None]:
advanced_stats_2018['year'] = 2018
advanced_stats_2017['year'] = 2017
advanced_stats_2016['year'] = 2016
advanced_stats_2015['year'] = 2015
advanced_stats_2014['year'] = 2014
advanced_stats_2013['year'] = 2013
advanced_stats_2012['year'] = 2012
advanced_stats_2011['year'] = 2011
advanced_stats_2010['year'] = 2010
advanced_stats_2009['year'] = 2009
advanced_stats_2008['year'] = 2008

In [None]:
full_advanced_stats = pd.concat([advanced_stats_2018, advanced_stats_2017, advanced_stats_2016,
                           advanced_stats_2015, advanced_stats_2014, advanced_stats_2013,
                           advanced_stats_2012, advanced_stats_2011, advanced_stats_2010,
                           advanced_stats_2009, advanced_stats_2008], axis=0)
full_advanced_stats.reset_index(drop=True, inplace=True)

- Looking at the null counts below, some advanced stats appear not to have been recorded before a certain season. The expected plus/minus column seems to be one of them, so I'm going to drop that column and look for ways to impute the other missing values.

In [None]:
full_advanced_stats.isnull().sum()
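
To check whether a stat is missing only for earlier seasons, the null counts can be broken down by year. A minimal sketch on toy data: `expected_plsmns` is the column named in this notebook, while `corsi_pct` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

advanced = pd.DataFrame({
    'year': [2008, 2008, 2018, 2018],
    'expected_plsmns': [np.nan, np.nan, 1.5, -0.3],
    'corsi_pct': [48.0, np.nan, 51.2, 50.1],  # hypothetical column
})

# Count missing values per column within each season
nulls_by_year = (advanced.drop(columns='year')
                 .isnull()
                 .groupby(advanced['year'])
                 .sum())
```

A column whose nulls are concentrated in the early seasons was probably not tracked at the time, which is a different situation from values that are missing at random.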

In [None]:
full_advanced_stats.drop(columns='expected_plsmns', inplace=True)

#### Filtering out players who have played 10 or fewer games in a season

- This will get rid of many, if not all, of the NaN values and will make my variables more reliable, as these players are most likely minor leaguers who will not be playing in the playoffs.

In [None]:
full_advanced_stats = full_advanced_stats[full_advanced_stats['games_played'] > 10]

In [None]:
full_advanced_stats

In [None]:
# full_stats_df['team_name'] = full_stats_df['team_name'].map(lambda x: str(x)[:-1])

In [None]:
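# NOTE: skater_stats_2018 is never defined in this notebook. The cells below assume it
# refers to the 2018 player table loaded above (player_stats_2018, lowercase columns).
player_stats_2018[player_stats_2018['team'] == 'Washington Capitals']

In [None]:
points_mean = player_stats_2018['points'].mean()

In [None]:
plt.hist(player_stats_2018['points'])

In [None]:
points_std = np.std(player_stats_2018['points'])

In [None]:
# Threshold: one standard deviation above the mean
points_limit = points_mean + points_std

In [None]:
player_stats_2018[player_stats_2018['points'] > points_limit]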