## Scraping Hockey Reference

#### Workflow

- I need to scrape several years of data for each team and for all individual players.
- Team and player statistics will be kept in separate dataframes.
- Additionally, each year of data will be kept in its own dataframe.
- The scraper will grab team statistics for each team in a given year, collect them in a temporary dataframe, and save that dataframe as its own csv.
- The individual player statistics will also be separated by year and saved to individual csvs.
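
As a rough sketch of the workflow above, the helper below builds one URL per team and year and derives a per-year csv name. The names `season_urls` and `csv_name` are hypothetical and only illustrate the idea; the URL pattern follows the team pages used later in this notebook.

```python
from urllib.parse import urljoin

SITE = "https://www.hockey-reference.com"

def season_urls(team_links, years):
    """Yield (team_path, year, url) for every team/year combination."""
    for team_path in team_links:
        for year in years:
            # e.g. '/teams/ANA/' + '2018.html' joined onto the site root
            yield team_path, year, urljoin(SITE, f"{team_path}{year}.html")

def csv_name(year, kind):
    """Build a per-year file name such as '2018 team stats.csv'."""
    return f"{year} {kind} stats.csv"

urls = list(season_urls(["/teams/ANA/", "/teams/BOS/"], [2017, 2018]))
for team_path, year, url in urls:
    print(team_path, year, url, csv_name(year, "team"))
```

Keeping the year as a plain integer until the URL is built makes it easier to reuse the same value in the output file name.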

### Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import urllib3
import requests
import time

#### Creating base URL

In [2]:
from urllib.request import urlopen
base_url = 'https://www.hockey-reference.com/teams/'

#### Function to grab a txt file of team links

In [3]:
def get_page(url):
    # Download the page at `url` and save its HTML to a local text file
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    with open('hockey-reference_urls.txt', 'w') as file:
        file.write(str(soup))

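def get_team_links(url):
    # Return the href of every link in the team table on the page at `url`
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    div = soup.find('div', {'class': 'overthrow table_container'})
    return [link.get('href') for link in div.find_all('a')]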

In [4]:
get_page(base_url)

In [5]:
with open('hockey-reference_urls.txt', 'r') as file:
    for line in file:
        line = line.strip()

In [6]:
# Parse the saved HTML and locate the container that holds the team table
with open("hockey-reference_urls.txt", 'r') as page:
    soup = BeautifulSoup(page, "lxml")
div = soup.find('div', {'class': 'overthrow table_container'})


#### Saving the team links in a variable

In [7]:
team_links = []
links = div.find_all('a')
for link in links:
    team_links.append(link.get('href'))
# As teams have moved and changed over the years
# I had to manually add Arizona and Atlanta to this list
team_links.insert(1, '/teams/ARI/')
team_links.insert(2, '/teams/ATL/')

In [8]:
team_links

['/teams/ANA/',
 '/teams/ARI/',
 '/teams/ATL/',
 '/teams/PHX/',
 '/teams/BOS/',
 '/teams/BUF/',
 '/teams/CGY/',
 '/teams/CAR/',
 '/teams/CHI/',
 '/teams/COL/',
 '/teams/CBJ/',
 '/teams/DAL/',
 '/teams/DET/',
 '/teams/EDM/',
 '/teams/FLA/',
 '/teams/LAK/',
 '/teams/MIN/',
 '/teams/MTL/',
 '/teams/NSH/',
 '/teams/NJD/',
 '/teams/NYI/',
 '/teams/NYR/',
 '/teams/OTT/',
 '/teams/PHI/',
 '/teams/PIT/',
 '/teams/SJS/',
 '/teams/STL/',
 '/teams/TBL/',
 '/teams/TOR/',
 '/teams/VAN/',
 '/teams/VEG/',
 '/teams/WSH/',
 '/teams/WPG/']

### Function for scraping individual player data

- This function finds the table on each team's page that contains individual player stats for a given year.
- A for loop later in the notebook calls this function for each team and each year.

In [9]:
def get_player_table(url):
    res = requests.get(url)
    skater_soup = BeautifulSoup(res.content, 'lxml')
    team_name = skater_soup.find('h1', {'itemprop': 'name'}).find_all('span')[1].text
    table = skater_soup.find('div', {'id': 'all_skaters'}).find('table', {'id': 'skaters'}).find('tbody')
    player_stats = []
    for row in table.find_all('tr'):
        # Each row holds one player; look up every stat by its data-stat attribute
        players = {}
        players['Player'] = row.find('a').text
        players['Age'] = row.find('td', {'data-stat': 'age'}).text
        players['Position'] = row.find('td', {'data-stat': 'pos'}).text
        players['Games Played'] = row.find('td', {'data-stat': 'games_played'}).text
        players['Goals'] = row.find('td', {'data-stat': 'goals'}).text
        players['Assists'] = row.find('td', {'data-stat': 'assists'}).text
        players['Points'] = row.find('td', {'data-stat': 'points'}).text
        players['Plus Minus'] = row.find('td', {'data-stat': 'plus_minus'}).text
        players['Penalty Minutes'] = row.find('td', {'data-stat': 'pen_min'}).text
        players['ES Goals'] = row.find('td', {'data-stat': 'goals_ev'}).text
        players['PP Goals'] = row.find('td', {'data-stat': 'goals_pp'}).text
        players['SH Goals'] = row.find('td', {'data-stat': 'goals_sh'}).text
        players['GW Goals'] = row.find('td', {'data-stat': 'goals_gw'}).text
        players['ES Assists'] = row.find('td', {'data-stat': 'assists_ev'}).text
        players['PP Assists'] = row.find('td', {'data-stat': 'assists_pp'}).text
        players['SH Assists'] = row.find('td', {'data-stat': 'assists_sh'}).text
        players['Shots'] = row.find('td', {'data-stat': 'shots'}).text
        players['Shooting Percentage'] = row.find('td', {'data-stat': 'shot_pct'}).text
        players['Time on Ice'] = row.find('td', {'data-stat': 'time_on_ice'}).text
        players['Time on Ice Avg'] = row.find('td', {'data-stat': 'time_on_ice_avg'}).text
        players['Offensive Point Shares'] = row.find('td', {'data-stat': 'ops'}).text
        players['Defensive Point Shares'] = row.find('td', {'data-stat': 'dps'}).text
        players['Point Shares'] = row.find('td', {'data-stat': 'ps'}).text
        players['ES Blocks'] = row.find('td', {'data-stat': 'blocks'}).text
        players['ES Hits'] = row.find('td', {'data-stat': 'hits'}).text
        players['ES Face-Off Wins'] = row.find('td', {'data-stat': 'faceoff_wins'}).text
        players['ES Face-Off Losses'] = row.find('td', {'data-stat': 'faceoff_losses'}).text
        players['ES Face-Off Pct'] = row.find('td', {'data-stat': 'faceoff_percentage'}).text
        players['Team'] = team_name
        player_stats.append(players)
    return player_stats
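
The per-column lookups in this function could also be driven by a mapping from column name to `data-stat` value. The sketch below is one possible variant, not the version used for the scrape: `COLUMNS` lists only a subset of the stats, `parse_row` is a hypothetical helper, and the HTML is made-up markup that imitates the structure the scraper expects. It uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Display name -> data-stat attribute (a subset of the columns scraped above)
COLUMNS = {
    "Age": "age",
    "Position": "pos",
    "Games Played": "games_played",
    "Goals": "goals",
}

def parse_row(row, columns=COLUMNS):
    """Return one player's stats, using None for any cell that is missing."""
    link = row.find("a")
    record = {"Player": link.text if link else None}
    for name, stat in columns.items():
        cell = row.find("td", {"data-stat": stat})
        record[name] = cell.text if cell else None
    return record

# Made-up rows; the second one is missing two cells on purpose
html = """
<table><tbody>
<tr><td data-stat="player"><a href="#">Example Player</a></td>
    <td data-stat="age">27</td><td data-stat="pos">C</td>
    <td data-stat="games_played">82</td><td data-stat="goals">30</td></tr>
<tr><td data-stat="player"><a href="#">Other Player</a></td>
    <td data-stat="age">31</td><td data-stat="pos">D</td></tr>
</tbody></table>
"""
rows = BeautifulSoup(html, "html.parser").find("tbody").find_all("tr")
records = [parse_row(r) for r in rows]
print(records)
```

Returning `None` for a missing cell keeps one incomplete row from raising an `AttributeError` and stopping the whole team's scrape.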

### Function for scraping team statistics

- This function finds a specific table on each team's page.
- I had to get creative in scraping this table, as its data is formatted differently from the individual player data table.
- Once the function finds the table within the "team_soup" variable, it uses each cell's "data-stat" attribute as the column name.
- Again, this function will be called later in a for loop to iterate through each team and year. Each year will get its own csv file.

In [10]:
def get_team_table(url):
    res = requests.get(url)
    team_soup = BeautifulSoup(res.content, 'lxml')
    team_name = team_soup.find('h1', {'itemprop': 'name'}).find_all('span')[1].text
    table = team_soup.find('div', {'id': 'all_team_stats'}).find('table', {'id': 'team_stats'})
    team_list = []
    team = {'team': team_name}
    for row in table.find('tbody').find('tr').find_all('td'):
        stat = row.text
        temp = row.attrs
        column = temp['data-stat']
        team.update({column: stat})
    team_list.append(team)
    return team_list
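
To illustrate how the cell attributes become column names, here is a small sketch run on made-up markup. The `data-stat` values `games` and `wins` are placeholders rather than confirmed names from the site, and `row_to_dict` is a hypothetical helper that additionally skips cells without a `data-stat` attribute.

```python
from bs4 import BeautifulSoup

def row_to_dict(row, team_name):
    """Map each cell's data-stat attribute to its text for one table row."""
    team = {"team": team_name}
    for cell in row.find_all("td"):
        column = cell.get("data-stat")  # None when the attribute is absent
        if column is not None:
            team[column] = cell.text
    return team

# Made-up markup; the last cell has no data-stat and is ignored
html = """<table id="team_stats"><tbody>
<tr><th data-stat="team_name">Team</th><td data-stat="games">82</td>
    <td data-stat="wins">44</td><td>n/a</td></tr>
</tbody></table>"""
table = BeautifulSoup(html, "html.parser").find("table", {"id": "team_stats"})
row = table.find("tbody").find("tr")
team = row_to_dict(row, "Example Team")
print(team)
```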
        

### Function for scraping year results data

- This scrape will get the final results for each team in each year.
- Eventually this will be combined with the team statistics data for modeling purposes.
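
Combining the two datasets could look something like the sketch below. The column names (`goals`, `points`) and all values are invented for illustration, and joining on a `team` column is an assumption about how the final dataframes will be keyed.

```python
import pandas as pd

# Invented stand-ins for one year of team statistics and season results
team_stats = pd.DataFrame({"team": ["Team A", "Team B"], "goals": [250, 230]})
season_results = pd.DataFrame({"team": ["Team B", "Team A"], "points": [98, 105]})

# validate raises if a team appears more than once on either side
combined = team_stats.merge(
    season_results, on="team", how="inner", validate="one_to_one"
)
print(combined)
```

The `validate="one_to_one"` argument is a cheap guard against duplicate team rows slipping into the modeling data.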

In [146]:
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
def get_team_season(url):
    res = requests.get(url)
    season_soup = BeautifulSoup(res.content, 'lxml')
    table = season_soup.find_all('div', {'id': 'all_stats'})[2]#.find('div', {'class': 'table_outer_container'})
    print(table)
    #//*[@id="stats"]/tbody
    #stats > tbody

In [147]:
get_team_season(url)

IndexError: list index out of range
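
One likely reading of this error is that `find_all` returned fewer than three elements, which is expected when searching by an `id` that appears once on the page. The sketch below reproduces that on a made-up snippet and shows a lookup that returns `None` instead of raising. `find_table` is a hypothetical helper, and the `stats` table id is taken from the commented-out selectors in the cell above; whether the live page exposes the table this way is not verified here.

```python
from bs4 import BeautifulSoup

# Made-up page with a single wrapper div, as an id normally appears once
html = (
    '<div id="all_stats"><table id="stats">'
    "<tbody><tr><td>1</td></tr></tbody></table></div>"
)
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("div", {"id": "all_stats"})
print(len(matches))  # indexing matches[2] would raise IndexError here

def find_table(soup, wrapper_id, table_id):
    """Return the table inside the wrapper div, or None if either is missing."""
    wrapper = soup.find("div", {"id": wrapper_id})
    if wrapper is None:
        return None
    return wrapper.find("table", {"id": table_id})

table = find_table(soup, "all_stats", "stats")
missing = find_table(soup, "no_such_id", "stats")
```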

- These cells were used to test the functions on individual web pages.
- The functions accept any team page URL, so the same calls can be reused for every team and year.

In [11]:
team_year = get_team_table('https://www.hockey-reference.com/teams/ANA/2018.html')

In [12]:
player_year = get_player_table('https://www.hockey-reference.com/teams/ANA/2018.html')

### For loop for scraping team statistic data

- Because each year needs to be kept separate, I set the year manually in the loop. The results are collected in a temporary dataframe, which is then saved to csv. Each csv will be loaded into its own dataframe in a separate EDA notebook.

In [None]:
base_url = 'https://www.hockey-reference.com'
teams = team_links
years = ['2007.html']
year_df = pd.DataFrame()
for team in teams:
    try:
        for year in years:
            url = base_url + team + year
            team_year = get_team_table(url)
            team_df = pd.DataFrame(team_year)
            year_df = pd.concat([year_df, team_df], axis=0)
            year_df.reset_index(drop=True, inplace=True)
            cols=[i for i in year_df.columns if i not in ['team']]
            for col in cols:
                year_df[col]=pd.to_numeric(year_df[col])
            time.sleep(3)               
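    except Exception as e:
        # Skip teams with no usable page for this year, but report the failure
        print(f'Skipping {team}: {e}')
        continue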
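
A possible variant of the loop above collects the per-team frames in a list, concatenates once, and converts the numeric columns a single time at the end. This is only a sketch: `build_year_frame` and `fake_fetch` are hypothetical, the stand-in fetcher replaces `get_team_table` so the example runs offline, and the delay between requests is left out for the same reason.

```python
import pandas as pd

def build_year_frame(urls, fetch):
    """Fetch every URL, returning the combined frame and a list of failures."""
    frames, failed = [], []
    for url in urls:
        try:
            frames.append(pd.DataFrame(fetch(url)))
        except Exception as exc:
            failed.append((url, repr(exc)))  # keep a record instead of hiding it
    year_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    stat_cols = [c for c in year_df.columns if c != "team"]
    if stat_cols:
        year_df[stat_cols] = year_df[stat_cols].apply(pd.to_numeric, errors="coerce")
    return year_df, failed

def fake_fetch(url):
    """Offline stand-in for get_team_table; values are invented."""
    if "ATL" in url:
        raise ValueError("no page for this season")
    return [{"team": url.split("/")[-2], "games": "82", "wins": "44"}]

base = "https://www.hockey-reference.com"
urls = [f"{base}/teams/{code}/2007.html" for code in ("ANA", "ATL", "BOS")]
year_df, failed = build_year_frame(urls, fake_fetch)
print(year_df)
print(failed)
```

Returning the failures alongside the dataframe makes it clear which teams were skipped for a given year.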

In [None]:
# The Phoenix Coyotes were renamed the Arizona Coyotes in 2014.
# My scraper was giving me duplicate entries for this team in a few years.
# This cell was used to remove the duplicate row before saving to csv.

year_df.drop([2], axis=0, inplace=True)
year_df.reset_index(drop=True, inplace=True)
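
Dropping by a fixed row position depends on the order in which teams were scraped. A position-independent alternative, sketched here on invented data with placeholder team names, is to drop repeated team names instead:

```python
import pandas as pd

# Invented frame in which "Team B" was scraped twice
year_df_example = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C", "Team B"],
    "wins": [48, 31, 43, 31],
})

# Keep the first row for each team name and renumber the index
deduped = (
    year_df_example
    .drop_duplicates(subset="team", keep="first")
    .reset_index(drop=True)
)
print(deduped)
```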

In [None]:
year_df.head()

### Saving to CSV

- Again, each year was saved independently.

In [None]:
year_df.to_csv('2007 team stats.csv')

### For loop for scraping individual player stats

- Similar to the previous for loop, this will output one year of data which will be saved to csv.

In [None]:
base_url = 'https://www.hockey-reference.com'
teams = team_links
years = ['2018.html']
year_player_stats = pd.DataFrame()
for team in teams:
    try:
        for year in years:
            url = base_url + team + year
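            # Scrape the skater table (not the team table) and append it to the yearly frame
            player_year = get_player_table(url)
            player_df = pd.DataFrame(player_year)
            year_player_stats = pd.concat([year_player_stats, player_df], axis=0)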
            year_player_stats.reset_index(drop=True, inplace=True)
            time.sleep(3)
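    except Exception as e:
        # Skip teams with no usable page for this year, but report the failure
        print(f'Skipping {team}: {e}')
        continue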

In [None]:
year_player_stats

### Saving to CSV

- Again each year will be saved to csv. This csv will contain stats for every player in the league for that year.

In [None]:
year_player_stats.to_csv('2018 skater stats.csv')