## Setting up Dataframes

- I need to merge these dataframes carefully in order to create effective features.
- I have three different tables for every year between 2008 and 2018.
- One of those tables holds team statistics, which is the format I will feed into my model. I need to convert the individual player statistics into values that can be represented in the team statistics table; that will be done in the feature engineering section.
- Here I will merge all years of the team statistics dataframe and add columns for season rank and season champion.

#### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Importing the scraped data from CSVs

In [2]:
# I will probably run out of time to use this salary data, 
# but I'll import it here just in case

salaries_2018 = pd.read_csv('./2018 player salaries.csv', index_col=0)
salaries_2017 = pd.read_csv('./2017 player salaries.csv', index_col=0)
salaries_2016 = pd.read_csv('./2016 player salaries.csv', index_col=0)
salaries_2015 = pd.read_csv('./2015 player salaries.csv', index_col=0)
salaries_2014 = pd.read_csv('./2014 player salaries.csv', index_col=0)
salaries_2013 = pd.read_csv('./2013 player salaries.csv', index_col=0)
salaries_2012 = pd.read_csv('./2012 player salaries.csv', index_col=0)
salaries_2011 = pd.read_csv('./2011 player salaries.csv', index_col=0)

player_stats_2018 = pd.read_csv('./2018 player stats.csv', index_col=0)
player_stats_2017 = pd.read_csv('./2017 player stats.csv', index_col=0)
player_stats_2016 = pd.read_csv('./2016 player stats.csv', index_col=0)
player_stats_2015 = pd.read_csv('./2015 player stats.csv', index_col=0)
player_stats_2014 = pd.read_csv('./2014 player stats.csv', index_col=0)
player_stats_2013 = pd.read_csv('./2013 player stats.csv', index_col=0)
player_stats_2012 = pd.read_csv('./2012 player stats.csv', index_col=0)
player_stats_2011 = pd.read_csv('./2011 player stats.csv', index_col=0)
player_stats_2010 = pd.read_csv('./2010 player stats.csv', index_col=0)
player_stats_2009 = pd.read_csv('./2009 player stats.csv', index_col=0)
player_stats_2008 = pd.read_csv('./2008 player stats.csv', index_col=0)

league_season_2018 = pd.read_csv('./2018 league season.csv', index_col=0)
league_season_2017 = pd.read_csv('./2017 league season.csv', index_col=0)
league_season_2016 = pd.read_csv('./2016 league season.csv', index_col=0)
league_season_2015 = pd.read_csv('./2015 league season.csv', index_col=0)
league_season_2014 = pd.read_csv('./2014 league season.csv', index_col=0)
league_season_2013 = pd.read_csv('./2013 league season.csv', index_col=0)
league_season_2012 = pd.read_csv('./2012 league season.csv', index_col=0)
league_season_2011 = pd.read_csv('./2011 league season.csv', index_col=0)
league_season_2010 = pd.read_csv('./2010 league season.csv', index_col=0)
league_season_2009 = pd.read_csv('./2009 league season.csv', index_col=0)
league_season_2008 = pd.read_csv('./2008 league season.csv', index_col=0)

advanced_stats_2018 = pd.read_csv('./2018 advanced stats.csv', index_col=0)
advanced_stats_2017 = pd.read_csv('./2017 advanced stats.csv', index_col=0)
advanced_stats_2016 = pd.read_csv('./2016 advanced stats.csv', index_col=0)
advanced_stats_2015 = pd.read_csv('./2015 advanced stats.csv', index_col=0)
advanced_stats_2014 = pd.read_csv('./2014 advanced stats.csv', index_col=0)
advanced_stats_2013 = pd.read_csv('./2013 advanced stats.csv', index_col=0)
advanced_stats_2012 = pd.read_csv('./2012 advanced stats.csv', index_col=0)
advanced_stats_2011 = pd.read_csv('./2011 advanced stats.csv', index_col=0)
advanced_stats_2010 = pd.read_csv('./2010 advanced stats.csv', index_col=0)
advanced_stats_2009 = pd.read_csv('./2009 advanced stats.csv', index_col=0)
advanced_stats_2008 = pd.read_csv('./2008 advanced stats.csv', index_col=0)

playoff_results = pd.read_csv('./NHL Playoffs 2008-2018 vertical(2).csv', index_col=0)
cup_champs = pd.read_csv('./cup champs.csv', index_col=0)
team_ranks = pd.read_csv('./NHL Rankings 2008-2018 vertical.csv', index_col=0)

In [3]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

### Adding a year column to team stats tables

- Because I will be concatenating these dataframes, each row needs a column identifying its season.

In [4]:
league_season_2018['year'] = 2018
league_season_2017['year'] = 2017
league_season_2016['year'] = 2016
league_season_2015['year'] = 2015
league_season_2014['year'] = 2014
league_season_2013['year'] = 2013
league_season_2012['year'] = 2012
league_season_2011['year'] = 2011
league_season_2010['year'] = 2010
league_season_2009['year'] = 2009
league_season_2008['year'] = 2008

#### Concatenating each season into one dataframe

In [5]:
full_stats_df = pd.concat([league_season_2018, league_season_2017, league_season_2016,
                           league_season_2015, league_season_2014, league_season_2013,
                           league_season_2012, league_season_2011, league_season_2010,
                           league_season_2009, league_season_2008], axis=0)
full_stats_df.reset_index(drop=True, inplace=True)
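
The tag-then-concatenate steps above can also be written as a loop. This is a minimal sketch, assuming the seasons are held in a dict keyed by year; the name `seasons`, the team names, and the values are made up for illustration and stand in for the scraped tables.

```python
import pandas as pd

# Hypothetical stand-ins for the per-season tables loaded from CSV.
seasons = {
    2018: pd.DataFrame({"team_name": ["Team A*", "Team B"], "wins": [50, 30]}),
    2017: pd.DataFrame({"team_name": ["Team A", "Team B*"], "wins": [35, 48]}),
}

# Tag each season with its year, then stack everything into one frame.
tagged = [df.assign(year=year) for year, df in seasons.items()]
combined = pd.concat(tagged, axis=0, ignore_index=True)
print(combined)
```

`ignore_index=True` does the same job as the separate `reset_index(drop=True)` call, and `assign` returns a new frame instead of modifying the per-season tables.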

#### Cleaning full_stats_df dataframe

- There was an asterisk next to the names of teams that made the playoffs in a given year. For consistency's sake, I am removing special characters.

In [6]:
full_stats_df['team_name'] = full_stats_df['team_name'].replace('[^A-Za-z ]', '', regex=True)
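
To see what the pattern `[^A-Za-z ]` does, here is a small sketch on a hand-written Series (the three names are only examples). Every character that is not an ASCII letter or a space is deleted.

```python
import pandas as pd

names = pd.Series(["Nashville Predators*", "St. Louis Blues", "Boston Bruins"])

# Keep only ASCII letters and spaces; everything else is removed.
cleaned = names.str.replace(r"[^A-Za-z ]", "", regex=True)
print(cleaned.tolist())
# ['Nashville Predators', 'St Louis Blues', 'Boston Bruins']
```

The same pattern would also strip accented letters and digits, so it is only safe because the team names here use plain ASCII spelling.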

year,average_age,chances_pp,games,goals,goals_against_ev,goals_ev,goals_pp,goals_sh,losses,losses_ot,losses_shootout,opp_chances_pp,opp_goals,opp_goals_pp,opp_goals_sh,pdo,pen_kill_pct,pen_min_per_game,pen_min_per_game_opp,points,points_pct,power_play_pct,save_pct,shot_pct,shots,shots_against,sos,srs,team_name,total_goals_per_game,wins,wins_shootout
2008,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2009,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2010,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2011,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2012,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2013,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2014,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2015,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2016,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2017,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30


  ---

### Adding the cup_champs column

- I manually created an Excel spreadsheet marking the cup winner for each year with a '1' and every other team with a '0'. I imported it as a CSV above and will use it to add a cup_champs column.

In [7]:
cup_champs.head()

Unnamed: 0,Team,cup_champs,Year
0,Washington Capitals,1,2018
1,Vegas Golden Knights,0,2018
2,Winnipeg Jets,0,2018
3,Tampa Bay Lightning,0,2018
4,Nashville Predators,0,2018


Because I removed all special characters from the team stats table, I also need to remove them here, since 'St. Louis Blues' contains a period. I am also renaming the Team and Year columns to match the full_stats_df column names.


In [8]:
cup_champs['Team'] = cup_champs['Team'].replace('[^A-Za-z ]', '', regex=True)

cup_champs.rename(columns={'Team': 'team_name', 'Year': 'year'}, inplace=True)

In [13]:
full_stats_champ = pd.merge(full_stats_df, cup_champs, how='outer', on=['team_name', 'year'])

full_stats_champ.reset_index(drop=True, inplace=True)

In [15]:
full_stats_champ

Unnamed: 0,average_age,chances_pp,games,goals,goals_against_ev,goals_ev,goals_pp,goals_sh,losses,losses_ot,losses_shootout,opp_chances_pp,opp_goals,opp_goals_pp,opp_goals_sh,pdo,pen_kill_pct,pen_min_per_game,pen_min_per_game_opp,points,points_pct,power_play_pct,save_pct,shot_pct,shots,shots_against,sos,srs,team_name,total_goals_per_game,wins,wins_shootout,year,cup_champs
0,28.4,274.0,82.0,267.0,145.0,193.0,58.0,10.0,18.0,11.0,7.0,299.0,211.0,54.0,5.0,101.6,81.94,11.3,9.6,117.0,0.713,21.17,0.923,9.9,2641.0,2659.0,0.03,0.71,Nashville Predators,5.83,53.0,6.0,2018,0.0
1,26.8,274.0,82.0,277.0,159.0,200.0,64.0,9.0,20.0,10.0,2.0,274.0,218.0,50.0,7.0,101.0,81.75,8.5,8.6,114.0,0.695,23.36,0.917,10.3,2643.0,2613.0,0.02,0.74,Winnipeg Jets,6.04,52.0,4.0,2018,0.0
2,27.5,276.0,82.0,296.0,172.0,216.0,66.0,9.0,23.0,5.0,2.0,267.0,236.0,64.0,3.0,102.0,76.03,10.1,10.4,113.0,0.689,23.91,0.912,10.7,2737.0,2756.0,-0.07,0.66,Tampa Bay Lightning,6.49,54.0,6.0,2018,0.0
3,28.6,258.0,82.0,270.0,161.0,197.0,61.0,9.0,20.0,12.0,3.0,245.0,214.0,40.0,10.0,100.2,83.67,9.5,9.6,112.0,0.683,23.64,0.912,9.9,2703.0,2399.0,-0.07,0.62,Boston Bruins,5.9,50.0,3.0,2018,0.0
4,28.0,248.0,82.0,272.0,182.0,218.0,53.0,8.0,24.0,7.0,3.0,237.0,228.0,44.0,5.0,100.5,81.43,7.1,7.8,109.0,0.665,21.37,0.911,10.1,2774.0,2619.0,-0.01,0.52,Vegas Golden Knights,6.1,51.0,4.0,2018,0.0
5,28.4,244.0,82.0,259.0,178.0,197.0,55.0,4.0,26.0,7.0,1.0,269.0,239.0,53.0,8.0,101.4,80.3,9.9,9.3,105.0,0.64,22.54,0.909,10.7,2400.0,2637.0,-0.04,0.21,Washington Capitals,6.07,49.0,3.0,2018,1.0
6,28.3,224.0,82.0,277.0,189.0,213.0,56.0,4.0,26.0,7.0,2.0,231.0,232.0,43.0,5.0,101.6,81.39,7.4,7.1,105.0,0.64,25.0,0.915,10.1,2700.0,2844.0,-0.06,0.49,Toronto Maple Leafs,6.21,49.0,7.0,2018,0.0
7,28.7,214.0,82.0,235.0,159.0,183.0,38.0,10.0,25.0,13.0,7.0,274.0,216.0,46.0,4.0,101.2,83.21,10.0,8.6,101.0,0.616,17.76,0.923,9.3,2475.0,2716.0,0.01,0.24,Anaheim Ducks,5.5,44.0,4.0,2018,0.0
8,29.5,240.0,82.0,253.0,176.0,194.0,49.0,7.0,26.0,11.0,3.0,272.0,232.0,51.0,6.0,100.7,81.25,8.4,7.7,101.0,0.616,20.42,0.91,10.0,2506.0,2595.0,0.04,0.29,Minnesota Wild,5.91,45.0,3.0,2018,0.0
9,27.7,260.0,82.0,272.0,195.0,198.0,68.0,6.0,29.0,6.0,2.0,265.0,250.0,53.0,3.0,98.5,80.0,9.6,9.2,100.0,0.61,26.15,0.902,9.6,2845.0,2575.0,-0.04,0.23,Pittsburgh Penguins,6.37,47.0,2.0,2018,0.0


In [16]:
full_stats_champ = full_stats_champ.drop(full_stats_champ[full_stats_champ['average_age'].isnull()].index)

In [17]:
full_stats_champ['cup_champs'] = full_stats_champ['cup_champs'].fillna(value=0)
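
An outer merge followed by dropping rows without team statistics and filling missing `cup_champs` with 0 behaves like a left merge. The sketch below shows that variant on toy frames; `stats`, `champs`, and their values are hypothetical. It also uses the `validate` and `indicator` arguments of `pd.merge` to surface duplicate or unmatched keys.

```python
import pandas as pd

stats = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team C"],
    "year": [2018, 2018, 2018],
    "points": [105, 117, 62],
})
champs = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team X"],  # "Team X" has no stats row
    "year": [2018, 2018, 2018],
    "cup_champs": [1, 0, 0],
})

merged = pd.merge(
    stats, champs,
    how="left",              # keep every stats row, ignore champs-only rows
    on=["team_name", "year"],
    validate="one_to_one",   # raise if a (team_name, year) key is duplicated
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Teams that found no match in the champions table (often a spelling mismatch).
unmatched = merged.loc[merged["_merge"] == "left_only", "team_name"].tolist()

merged["cup_champs"] = merged["cup_champs"].fillna(0).astype(int)
merged = merged.drop(columns="_merge")
print(unmatched)
print(merged)
```

Listing the unmatched teams before filling with 0 helps separate "did not win the cup" from "name did not match".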

In [18]:
full_stats_champ.groupby('year').count()

year,average_age,chances_pp,games,goals,goals_against_ev,goals_ev,goals_pp,goals_sh,losses,losses_ot,losses_shootout,opp_chances_pp,opp_goals,opp_goals_pp,opp_goals_sh,pdo,pen_kill_pct,pen_min_per_game,pen_min_per_game_opp,points,points_pct,power_play_pct,save_pct,shot_pct,shots,shots_against,sos,srs,team_name,total_goals_per_game,wins,wins_shootout,cup_champs
2008,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2009,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2010,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2011,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2012,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2013,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2014,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2015,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2016,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2017,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30


### Joining Team Ranks to the dataframe

In [19]:
team_ranks.head()

Team,Year,Rank
Washington Capitals,2018,1
Vegas Golden Knights,2018,2
Winnipeg Jets,2018,3
Tampa Bay Lightning,2018,4
Nashville Predators,2018,5


In [20]:
team_ranks.reset_index(inplace=True)

In [21]:
team_ranks.rename(columns={'Team': 'team_name', 'Year': 'year', 'Rank': 'rank'}, inplace=True)

In [23]:
team_stats = pd.merge(full_stats_champ, team_ranks, how='outer', on=['team_name', 'year'])

In [24]:
team_stats.groupby('year').count()

year,average_age,chances_pp,games,goals,goals_against_ev,goals_ev,goals_pp,goals_sh,losses,losses_ot,losses_shootout,opp_chances_pp,opp_goals,opp_goals_pp,opp_goals_sh,pdo,pen_kill_pct,pen_min_per_game,pen_min_per_game_opp,points,points_pct,power_play_pct,save_pct,shot_pct,shots,shots_against,sos,srs,team_name,total_goals_per_game,wins,wins_shootout,cup_champs,rank
2008,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2009,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2010,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2011,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2012,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2013,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2014,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2015,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2016,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2017,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
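
The per-year count tables repeat the same number in every column. A more compact check is sketched below on a toy frame (`demo` and its values are hypothetical). It reports rows per season, duplicated team/year keys, and missing values in the merged columns.

```python
import pandas as pd

demo = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team A", "Team B"],
    "year": [2017, 2017, 2018, 2018],
    "cup_champs": [0.0, 1.0, 1.0, 0.0],
    "rank": [2.0, 1.0, 1.0, None],  # one rank failed to match in the merge
})

rows_per_year = demo.groupby("year").size()
duplicate_keys = int(demo.duplicated(subset=["team_name", "year"]).sum())
missing = demo[["cup_champs", "rank"]].isna().sum()

print(rows_per_year)
print(duplicate_keys)
print(missing)
```

A nonzero `duplicate_keys` or `missing` value after an outer merge usually points to a team name that is spelled differently in the two tables.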


In [25]:
team_stats.to_csv('clean NHL data.csv')

### Joining all years of the individual player dataframes

In [26]:
full_player_stats = pd.concat([player_stats_2018, player_stats_2017, player_stats_2016,
                           player_stats_2015, player_stats_2014, player_stats_2013,
                           player_stats_2012, player_stats_2011, player_stats_2010,
                           player_stats_2009, player_stats_2008], axis=0)
full_player_stats.reset_index(drop=True, inplace=True)

In [27]:
full_player_stats.shape

(11568, 30)

- This dataset has a lot of NaN values for goalies. These NaN values relate to stats that goalies do not record, like shots taken or shooting percentage, so I am filling them with 0.
- I'm also removing any players who played 10 or fewer games, as they are likely minor league or injured players who will not play in the playoffs.

In [28]:
is_goalie = full_player_stats['position'] == 'G'  # goalies do not record skater stats
full_player_stats.loc[is_goalie] = full_player_stats.loc[is_goalie].fillna(value=0)

In [29]:
full_player_stats = full_player_stats[full_player_stats['games_played'] > 10]
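
The two cleaning steps above can be tried on a toy frame. In this sketch the player names and values are made up, and limiting the fill to an explicit list of stat columns (`stat_cols`) is my own assumption rather than what the cell above does. Goalie NaNs become 0, while a skater's NaN stays visible for later inspection.

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "player": ["Skater 1", "Goalie 1", "Skater 2", "Goalie 2"],
    "position": ["C", "G", "D", "G"],
    "games_played": [82, 60, 40, 3],
    "shot_pct": [12.5, np.nan, np.nan, np.nan],
})

# Fill NaNs on goalie rows only, and only in the chosen stat columns.
stat_cols = ["shot_pct"]  # hypothetical list of skater-only stats
is_goalie = players["position"] == "G"
players.loc[is_goalie, stat_cols] = players.loc[is_goalie, stat_cols].fillna(0)

# Keep players with more than 10 games.
regulars = players[players["games_played"] > 10].reset_index(drop=True)
print(regulars)
```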

- There are also many NaN values for even-strength faceoff percentage. All of them belong to non-center skaters who never took a faceoff all year. I am filling these with 0, and if I do use faceoff percentage in my model I will be sure to filter to centers only.

In [30]:
full_player_stats['es_faceoff_pct'] = full_player_stats['es_faceoff_pct'].fillna(value=0)
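
Here is a small sketch of why the filter to centers matters once the NaNs are 0. The data is made up, and the position code `'C'` for centers is an assumption; only `'G'` for goalies appears in this notebook.

```python
import numpy as np
import pandas as pd

skaters = pd.DataFrame({
    "player": ["Center 1", "Center 2", "Wing 1", "Defender 1"],
    "position": ["C", "C", "LW", "D"],
    "team": ["Team A"] * 4,
    "es_faceoff_pct": [54.0, 48.0, np.nan, np.nan],
})
skaters["es_faceoff_pct"] = skaters["es_faceoff_pct"].fillna(0)

# Averaging over every skater drags the team value toward 0.
naive = skaters.groupby("team")["es_faceoff_pct"].mean()

# Restricting to centers gives a meaningful team-level figure.
centers = skaters[skaters["position"] == "C"]
centers_only = centers.groupby("team")["es_faceoff_pct"].mean()

print(naive["Team A"], centers_only["Team A"])  # 25.5 51.0
```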

- I will also drop the remaining row that still contains a NaN value, since that player played very little that season.

In [31]:
full_player_stats.dropna(inplace=True)
full_player_stats.reset_index(drop=True, inplace=True)

In [32]:
full_player_stats['team'].unique()

array(['Anaheim Ducks', 'Arizona Coyotes', 'Boston Bruins',
       'Buffalo Sabres', 'Calgary Flames', 'Carolina Hurricanes',
       'Chicago Blackhawks', 'Colorado Avalanche',
       'Columbus Blue Jackets', 'Dallas Stars', 'Detroit Red Wings',
       'Edmonton Oilers', 'Florida Panthers', 'Los Angeles Kings',
       'Minnesota Wild', 'Montreal Canadiens', 'Nashville Predators',
       'New Jersey Devils', 'New York Islanders', 'New York Rangers',
       'Ottawa Senators', 'Philadelphia Flyers', 'Pittsburgh Penguins',
       'San Jose Sharks', 'St. Louis Blues', 'Tampa Bay Lightning',
       'Toronto Maple Leafs', 'Vancouver Canucks', 'Vegas Golden Knights',
       'Washington Capitals', 'Winnipeg Jets', 'Phoenix Coyotes',
       'Atlanta Thrashers'], dtype=object)

In [33]:
full_player_stats['team'] = full_player_stats['team'].replace('[^A-Za-z ]', '', regex=True)

In [34]:
full_player_stats['team'].unique()

array(['Anaheim Ducks', 'Arizona Coyotes', 'Boston Bruins',
       'Buffalo Sabres', 'Calgary Flames', 'Carolina Hurricanes',
       'Chicago Blackhawks', 'Colorado Avalanche',
       'Columbus Blue Jackets', 'Dallas Stars', 'Detroit Red Wings',
       'Edmonton Oilers', 'Florida Panthers', 'Los Angeles Kings',
       'Minnesota Wild', 'Montreal Canadiens', 'Nashville Predators',
       'New Jersey Devils', 'New York Islanders', 'New York Rangers',
       'Ottawa Senators', 'Philadelphia Flyers', 'Pittsburgh Penguins',
       'San Jose Sharks', 'St Louis Blues', 'Tampa Bay Lightning',
       'Toronto Maple Leafs', 'Vancouver Canucks', 'Vegas Golden Knights',
       'Washington Capitals', 'Winnipeg Jets', 'Phoenix Coyotes',
       'Atlanta Thrashers'], dtype=object)

### Joining all years of the advanced stats dataframes

- Adding a year column to each dataframe

In [39]:
advanced_stats_2018['year'] = 2018
advanced_stats_2017['year'] = 2017
advanced_stats_2016['year'] = 2016
advanced_stats_2015['year'] = 2015
advanced_stats_2014['year'] = 2014
advanced_stats_2013['year'] = 2013
advanced_stats_2012['year'] = 2012
advanced_stats_2011['year'] = 2011
advanced_stats_2010['year'] = 2010
advanced_stats_2009['year'] = 2009
advanced_stats_2008['year'] = 2008

In [40]:
# sort=True keeps the alphabetical column order when yearly columns differ
full_advanced_stats = pd.concat([advanced_stats_2018, advanced_stats_2017, advanced_stats_2016,
                                 advanced_stats_2015, advanced_stats_2014, advanced_stats_2013,
                                 advanced_stats_2012, advanced_stats_2011, advanced_stats_2010,
                                 advanced_stats_2009, advanced_stats_2008], axis=0, sort=True)
full_advanced_stats.reset_index(drop=True, inplace=True)
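
If the yearly frames do not share identical columns, `pd.concat` takes the union of the columns and the `sort` argument controls their order. A minimal sketch on two made-up frames:

```python
import pandas as pd

a = pd.DataFrame({"pdo": [100.1], "age": [27]})
b = pd.DataFrame({"age": [31], "corsi_pct": [51.0]})

# sort=False keeps columns in order of first appearance.
unsorted_cols = pd.concat([a, b], sort=False).columns.tolist()

# sort=True orders the column labels alphabetically.
sorted_cols = pd.concat([a, b], sort=True).columns.tolist()

print(unsorted_cols)  # ['pdo', 'age', 'corsi_pct']
print(sorted_cols)    # ['age', 'corsi_pct', 'pdo']
```

Columns that are missing from one of the inputs are filled with NaN in the result, which is one way nulls can appear in a stacked table.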

- Looking at the null counts below, some advanced stats appear not to have been recorded before a certain season; the expected plus/minus column seems to be one such case. I'm going to drop that column and look for ways to impute the other missing values.

In [41]:
full_advanced_stats.shape

(10522, 26)

In [42]:
full_advanced_stats.isnull().sum()

age                             1
corsi_against                   0
corsi_for                       0
corsi_pct                       2
corsi_rel_pct                 211
expected_plsmns              6668
fenwick_against                 0
fenwick_for                     0
fenwick_pct                     2
fenwick_rel_pct               211
games_played                    0
giveaways                       0
on_ice_shot_pct                29
on_ice_sv_pct                  12
pdo                            36
player                          0
pos                             0
shot_thru_percentage          165
takeaways                       0
team                            0
toi_pbp_per_60_all              1
toi_pbp_per_60_ev               1
total_shots_attempted_all       9
year                            0
zs_defense_pct                 14
zs_offense_pct                 14
dtype: int64
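
One way to check whether a stat is missing only for early seasons is to compute the share of NaNs per year. The sketch below uses a made-up frame, so the values and the year in which the stat first appears are illustrative only.

```python
import numpy as np
import pandas as pd

adv = pd.DataFrame({
    "year": [2008, 2008, 2009, 2009, 2010, 2010],
    "corsi_pct": [51.2, 48.9, 50.1, np.nan, 49.5, 52.3],
    "expected_plsmns": [np.nan, np.nan, np.nan, np.nan, 1.4, -0.6],
})

# Share of missing values per season for every stat column.
null_share = adv.drop(columns="year").isna().groupby(adv["year"]).mean()
print(null_share)
```

A column whose share is 1.0 for a run of early seasons and 0.0 afterwards was most likely not recorded yet, while scattered small shares point to individual players.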

- Again, let's drop players who played 10 or fewer games in a season.

In [43]:
full_advanced_stats = full_advanced_stats[full_advanced_stats['games_played'] > 10]
full_advanced_stats.shape

(8554, 26)

- Finally, I will drop the expected plus/minus column, as more than 50% of its values are NaN. I don't believe I will use this statistic anyway.

In [44]:
full_advanced_stats.drop(columns='expected_plsmns', inplace=True)
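
Before the games filter, 6668 of 10522 rows (about 63%) were missing expected plus/minus. The same 50% rule can be applied to every column at once. This is a sketch on a toy frame with made-up values, not the cell used above.

```python
import numpy as np
import pandas as pd

adv = pd.DataFrame({
    "player": ["P1", "P2", "P3", "P4"],
    "pdo": [100.2, np.nan, 99.1, 101.0],               # 25% missing
    "expected_plsmns": [np.nan, np.nan, np.nan, 0.8],  # 75% missing
})

null_share = adv.isna().mean()  # fraction of NaNs per column
to_drop = null_share[null_share > 0.5].index.tolist()
trimmed = adv.drop(columns=to_drop)

print(to_drop)  # ['expected_plsmns']
```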

In [45]:
full_advanced_stats.isnull().sum()

age                          0
corsi_against                0
corsi_for                    0
corsi_pct                    0
corsi_rel_pct                0
fenwick_against              0
fenwick_for                  0
fenwick_pct                  0
fenwick_rel_pct              0
games_played                 0
giveaways                    0
on_ice_shot_pct              0
on_ice_sv_pct                0
pdo                          0
player                       0
pos                          0
shot_thru_percentage         0
takeaways                    0
team                         0
toi_pbp_per_60_all           0
toi_pbp_per_60_ev            0
total_shots_attempted_all    0
year                         0
zs_defense_pct               0
zs_offense_pct               0
dtype: int64

- Okay, it looks like all of our dataframes are clean. Let's do some feature engineering.

- Saving dataframes to CSV for use in new notebook

In [46]:
full_player_stats.to_csv('full player stats.csv')
full_advanced_stats.to_csv('full advanced stats.csv')
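
`to_csv` writes the index as an unnamed first column by default, which is why the files are read with `index_col=0`. A small round-trip sketch, using an in-memory buffer instead of a file and made-up values:

```python
import io

import pandas as pd

frame = pd.DataFrame({"team": ["Team A", "Team B"], "games_played": [82, 80]})

buffer = io.StringIO()
frame.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_index_col = pd.read_csv(buffer, index_col=0)

buffer.seek(0)
without_index_col = pd.read_csv(buffer)

print(with_index_col.columns.tolist())     # ['team', 'games_played']
print(without_index_col.columns.tolist())  # ['Unnamed: 0', 'team', 'games_played']
```

Passing `index=False` to `to_csv` is an alternative that avoids the extra column entirely.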

  ---