# Introduction

This project analyzes player data from the National Hockey League (NHL). It was prepared for Foundations of Data Science in Fall 2019 by Group 5:
- Andokie Ibeshi
- Daniel Basilio
- Prabuddh Dixit
- Vineeta Kuckreja


## Data Notes
All of the data in this analysis was fetched from Natural Stat Trick. We used individual player stats for the five most recent seasons (2014-2015 through 2018-2019), regular season only. Each CSV contains every player who registered any game time during that season. We downloaded the data as one CSV per season for two reasons:
- The data was a lot easier to download in chunks
- The data does not have a year label, so by importing them separately, we can attach year labels from the filenames

### A note on years
Hockey seasons span two calendar years: the regular season starts in October and runs until the playoffs begin, typically in April. Some events have shifted the start or end dates (the Olympics in 2014, the World Cup of Hockey in 2016), but October to April applies in most years. Strictly speaking, each row of data would need two years (e.g. 2014-2015) to represent the calendar years the season was played over. To keep the year data simpler to understand, we label each season with the year in which its Stanley Cup was awarded. For example, 2014-2015 is labeled simply 2015, because that season's Stanley Cup was awarded at the end of the season in 2015.
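The labeling rule can be sketched as a small helper that derives the Stanley Cup year from a filename such as `14-15.csv`. The name `season_year` is hypothetical and is not used elsewhere in this notebook; the sketch assumes two-digit years in the 2000s and returns a string, matching the string labels passed to `clean_data` later on.

```python
from pathlib import Path


def season_year(filename):
    """Return the Stanley Cup year for a file named like '14-15.csv'."""
    # The stem is the filename without its extension, e.g. '14-15'.
    _start, end = Path(filename).stem.split("-")
    # Assumption: two-digit years always fall in the 2000s.
    return str(2000 + int(end))


print(season_year("./yearly-data/14-15.csv"))
```

With a helper like this, the year labels would not need to be typed by hand for each file.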

In [96]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [97]:
def read_data(filename):
    data = pd.read_csv(
        filename,
        usecols=[
            "Player",
            "GP",
            "Goals",
            "Total Assists",
            "First Assists",
            "Second Assists",
            "Total Points",
            "Shots",
            "SH%",
        ]
    )
    return data

In [98]:
def clean_data(data, year):
    # Remove all players who failed to register at least 20 games played.
    data = data.drop(data[data["GP"] < 20].index)
    
    # Cast the SH% column to float (it imports as object for some reason)
    data = data.astype({"SH%": "float64"})
    
    # Attach the year as a column
    data["Year"] = year
    
    # Set the index to be a combo of player and year
    data = data.set_index(["Year", "Player"])
    
    return data
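The cleaning steps can be tried on a tiny, made-up frame. The rows below are hypothetical and do not come from the real CSVs. The notebook does not say why `SH%` imports as `object`; one possible cause, assumed here purely for illustration, is a non-numeric placeholder such as `"-"` in some rows. In that case `pd.to_numeric(..., errors="coerce")` is a more forgiving alternative to `astype`, because it turns unparseable values into `NaN` instead of raising.

```python
import pandas as pd

# Hypothetical rows; these are not taken from the real data.
toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "GP": [82, 5, 40],
    "SH%": ["12.50", "10.00", "-"],  # "-" is an assumed placeholder
})

# Keep only players with at least 20 games played.
kept = toy[toy["GP"] >= 20].copy()

# Convert SH% to float; values that cannot be parsed become NaN.
kept["SH%"] = pd.to_numeric(kept["SH%"], errors="coerce")

# Attach the season label and index by (Year, Player).
kept["Year"] = "2015"
kept = kept.set_index(["Year", "Player"])
print(kept)
```

Whether coercing to `NaN` is appropriate depends on the analysis; failing loudly with `astype`, as `clean_data` does, is the safer choice when bad values are unexpected.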

In [99]:
# Read all of our data
fourteen_fifteen = read_data('./yearly-data/14-15.csv')
fifteen_sixteen = read_data('./yearly-data/15-16.csv')
sixteen_seventeen = read_data('./yearly-data/16-17.csv')
seventeen_eighteen = read_data('./yearly-data/17-18.csv')
eighteen_nineteen = read_data('./yearly-data/18-19.csv')

In [100]:
# Clean all of our data
fourteen_fifteen = clean_data(fourteen_fifteen, "2015")
fifteen_sixteen = clean_data(fifteen_sixteen, "2016")
sixteen_seventeen = clean_data(sixteen_seventeen, "2017")
seventeen_eighteen = clean_data(seventeen_eighteen, "2018")
eighteen_nineteen = clean_data(eighteen_nineteen, "2019")

In [101]:
# Combine all of the DataFrames into one.
# (DataFrame.append was removed in pandas 2.0; pd.concat works in all versions.)
hockey_data = pd.concat([
    fourteen_fifteen,
    fifteen_sixteen,
    sixteen_seventeen,
    seventeen_eighteen,
    eighteen_nineteen,
])
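Once the per-season frames are combined, the `(Year, Player)` index makes it easy to pull out a single season or a single player. The sketch below uses two small, hypothetical frames shaped like the output of `clean_data`; the player names and goal counts are invented for illustration.

```python
import pandas as pd

# Hypothetical frames shaped like the output of clean_data.
season_a = pd.DataFrame(
    {"Year": ["2015", "2015"], "Player": ["Player A", "Player B"], "Goals": [30, 12]}
).set_index(["Year", "Player"])
season_b = pd.DataFrame(
    {"Year": ["2016", "2016"], "Player": ["Player A", "Player C"], "Goals": [25, 18]}
).set_index(["Year", "Player"])

combined = pd.concat([season_a, season_b])

# All rows for one season (outer index level).
goals_2016 = combined.loc["2016"]

# One player's rows across every season (inner index level).
player_a = combined.xs("Player A", level="Player")

print(goals_2016)
print(player_a)
```

Note that the year labels are strings here, as in the `clean_data` calls, so lookups must use `"2016"` rather than `2016`.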

In [102]:
hockey_data

Year,Player,GP,Goals,Total Assists,First Assists,Second Assists,Total Points,Shots,SH%
2015,Jamie Benn,82,35,52,32,20,87,253,13.83
2015,John Tavares,82,38,48,30,18,86,278,13.67
2015,Sidney Crosby,77,28,56,31,25,84,237,11.81
2015,Alex Ovechkin,81,53,28,21,7,81,395,13.42
2015,Jakub Voracek,82,22,59,31,28,81,221,9.95
2015,Nicklas Backstrom,82,18,60,31,29,78,153,11.76
2015,Tyler Seguin,71,37,40,33,7,77,280,13.21
2015,Daniel Sedin,82,20,56,35,21,76,226,8.85
2015,Jiri Hudler,78,31,45,29,16,76,158,19.62
2015,Henrik Sedin,82,18,55,28,27,73,101,17.82
