<h1>NHL Data Web Scraper</h1>

<p>This web scraper collects team data from hockey-reference.com for every team in a given season. For now it covers only the 2023-24 NHL season; planned changes will extend it to multiple seasons and to player data.</p>

<h2>Step 1: Import Dependencies</h2>

In [3]:
from bs4 import BeautifulSoup
import requests
import csv
import time

<h2>Step 2: Define Fields Used to Create team_data.csv</h2>

In [2]:
# Column names for team_data.csv. The names after 'SO' come from the advanced
# table; its S%, SV%, and PDO carry an _adv suffix so every name is unique.
FIELDS = ['Name', 'Division', 'Season', 'AvAge', 'GP', 'W', 'L', 'OL', 'PTS', 'PTS%', 'GF',
          'GA', 'SRS', 'SOS', 'GF/G', 'GA/G', 'PP', 'PPO', 'PP%', 'PPA', 'PPOA',
          'PK%', 'SH', 'SHA', 'S', 'S%', 'SA', 'SV%', 'PDO', 'SO', 'S%_adv',
          'SV%_adv', 'PDO_adv', 'CF', 'CA', 'CF%', 'xGF', 'xGA', 'aGF', 'aGA', 'axDiff',
          'SCF', 'SCA', 'SCF%', 'HDF', 'HDA', 'HDF%', 'HDGF', 'HDC%', 'HDGA', 'HDCO%']
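
<p>csv.DictWriter identifies each column by its field name, so a label that appears twice would receive the same value in both columns. A minimal sketch of a uniqueness check follows; <code>find_duplicates</code> and <code>sample_fields</code> are hypothetical names introduced here for illustration and are not part of the scraper.</p>

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# Illustrative labels only, not the notebook's full field list
sample_fields = ['Name', 'S%', 'SV%', 'PDO', 'S%', 'PDO']
duplicates = find_duplicates(sample_fields)
print(duplicates)
```

<p>Running the same check on the real field list before opening the CSV file would surface any clash early.</p>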

<h2>Step 3: Create Starting Soup</h2>

In [4]:
# The base URL to scrape from
START_SITE = 'https://www.hockey-reference.com/leagues/NHL_2024.html'
# Getting the page and creating soup
start_response = requests.get(START_SITE)
start_soup = BeautifulSoup(start_response.text, 'html.parser')
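
<p>The start URL above hard-codes the page for the season ending in 2024. As a sketch of how other seasons might be reached, the helpers below build the URL and the season label from the ending year. This assumes that other seasons follow the same <code>NHL_&lt;year&gt;.html</code> pattern, which has not been checked here; <code>BASE_URL</code>, <code>season_url</code>, and <code>season_label</code> are hypothetical names.</p>

```python
BASE_URL = 'https://www.hockey-reference.com'


def season_url(end_year):
    """Build the league summary URL for the season ending in end_year."""
    return f'{BASE_URL}/leagues/NHL_{end_year}.html'


def season_label(end_year):
    """Format a season like the page heading does, e.g. 2024 -> '2023-24'."""
    return f'{end_year - 1}-{end_year % 100:02d}'


# Listing a few seasons without making any requests
for year in range(2022, 2025):
    print(season_label(year), season_url(year))
```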

<h2>Step 4: Get the Links to Each Team's Page for the Season</h2>

In [5]:
# Getting the links to the Eastern Conference teams
eastern_conference_teams = start_soup.select(selector='table#standings_EAS tr th a')
eastern_conference_team_links = [team.get('href') for team in eastern_conference_teams]
# Getting the links to the Western Conference teams
western_conference_teams = start_soup.select(selector='table#standings_WES tr th a')
western_conference_team_links = [team.get('href') for team in western_conference_teams]
# Combining the two lists
team_links = eastern_conference_team_links + western_conference_team_links
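
<p>The collected hrefs are relative paths that are later glued to the site root by string concatenation. A small sketch of an alternative uses <code>urllib.parse.urljoin</code> and drops repeated links while keeping their order. The sample hrefs are hypothetical values in the shape the links appear to take, and <code>absolute_team_urls</code> is a name introduced here.</p>

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.hockey-reference.com'


def absolute_team_urls(hrefs):
    """Resolve relative hrefs against BASE_URL, dropping repeats but keeping order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(BASE_URL, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical hrefs, including one repeat
sample_links = ['/teams/FLA/2024.html', '/teams/BOS/2024.html', '/teams/FLA/2024.html']
team_urls = absolute_team_urls(sample_links)
print(team_urls)
```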

<h2>Future Tasks</h2>
<ul>
<li>Properly write to csv</li>
<li>Add capacity for past seasons</li>
<li>Avoid getting timed out by the website</li>
<li>Collect player data</li>
<li>Conduct analysis</li>
</ul>
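
<p>For the task of not getting timed out by the website, one option is a throttle that sleeps only for whatever is left of a minimum gap between requests, instead of a fixed pause after every request. The sketch below is one possible shape, not the scraper's current behavior. The <code>Throttle</code> class is hypothetical, and the five-second interval is an assumption carried over from the pause used in Step 5, not a limit published by the site.</p>

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep only as long as needed to honour min_interval."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(min_interval=5)
for link in ['/teams/FLA/2024.html']:  # hypothetical single link
    throttle.wait()  # the first call returns immediately
    # the request for this link would go here
```

<p>Because the clock and sleep functions are injectable, the timing logic can be exercised without real delays.</p>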

<h3>Testing on a Single Team</h3>

In [6]:
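# Trying the selectors on the first team before looping over all of them
current_link = team_links[0]

new_site = 'https://www.hockey-reference.com' + current_link
response = requests.get(new_site)
soup = BeautifulSoup(response.text, 'html.parser')
# The <h1> holds the season in its first <span> and the team name in its second
team_name = soup.select(selector='h1 span')[1].text
# The division is sliced out of the third <p> in the #meta block; these
# hard-coded indices depend on the page's current wording and layout
division = soup.select(selector='div#meta div p')[2].text.split(' ')[7][4:]
print(team_name)
print(division)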

Florida Panthers
Atlantic


In [7]:
# Printing every statistic from the basic and the advanced team tables
for table_id in ('team_stats', 'team_stats_adv'):
    # The second row of each table holds the team's own numbers
    team_row = soup.select(selector=f'table#{table_id} tr')[1]
    for cell in team_row.find_all(['th', 'td']):
        statistic = cell.get('data-stat')
        # Skipping the cell that only repeats the team name
        if statistic == 'team_name':
            continue
        print(statistic, cell.text)

average_age 29.5
games 82
wins 52
losses 24
losses_ot 6
points 110
points_pct .671
goals 265
goals_against 198
srs 0.81
sos -0.02
goals_for_per_game 3.23
goals_against_per_game 2.41
goals_pp 63
chances_pp 268
power_play_pct 23.51
opp_goals_pp 51
opp_chances_pp 291
pen_kill_pct 82.47
goals_sh 8
opp_goals_sh 9
shots 2764
shot_pct 9.6
shots_against 2279
save_pct .913
pdo 100.6
shutouts 8
shot_pct_5on5 7.2
sv_pct_5on5 .935
pdo 100.6
corsi_for_5on5 4212
corsi_against_5on5 3288
corsi_pct_5on5 56.2
exp_on_goals_for 166.2
exp_on_goals_against 144.8
actual_goals 156
actual_goals_against 119
actual_expected_diff 16
sc_for 1889
sc_against 1550
sc_for_pct 54.9
hdsc_for 673
hdsc_against 545
hdsc_for_pct 55.3
hdscgoal_for 51
hdsc_shot_pct 7.0
hdscgoal_against 57
hdsc_opp_shot_pct 9.5
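
<p>The statistics printed above arrive in a fixed order, so one way to get usable CSV keys is to pair the cell values with column labels by position. The sketch below does this for the first six statistics only, using values from the output above and writing to an in-memory buffer instead of team_data.csv. <code>ID_FIELDS</code>, <code>STAT_FIELDS</code>, and <code>build_row</code> are hypothetical names, and the approach assumes the table's column order stays stable.</p>

```python
import csv
import io

# A shortened, illustrative stand-in for the notebook's full field list
ID_FIELDS = ['Name', 'Division', 'Season']
STAT_FIELDS = ['AvAge', 'GP', 'W', 'L', 'OL', 'PTS']


def build_row(identity, stat_values):
    """Pair scraped cell values with column labels by position."""
    if len(stat_values) != len(STAT_FIELDS):
        raise ValueError('expected %d stats, got %d' % (len(STAT_FIELDS), len(stat_values)))
    row = dict(zip(ID_FIELDS, identity))
    row.update(zip(STAT_FIELDS, stat_values))
    return row


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=ID_FIELDS + STAT_FIELDS)
writer.writeheader()
writer.writerow(build_row(['Florida Panthers', 'Atlantic', '2023-24'],
                          ['29.5', '82', '52', '24', '6', '110']))
csv_text = buffer.getvalue()
print(csv_text)
```

<p>The length check makes a layout change on the page fail loudly instead of shifting values into the wrong columns.</p>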


In [8]:
print(soup.select_one(selector='h1 span').text)

2023-24


<h2>Step 5: Scrape Data from Each Team and Write It to team_data.csv</h2>

In [10]:
# Opening team_data.csv for writing
with open('team_data.csv', 'w', newline='') as csvfile:
    # Writing the header row, in this case the data field names
    fieldnames = FIELDS
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Going through each team link
    for link in team_links:
        # Creating a dictionary to store the team's data
        data_dict = {}
        # Getting the site for the team
        new_site = 'https://www.hockey-reference.com' + link
        response = requests.get(new_site)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Getting the team's name, division, and season and adding them to the dictionary
        team_name = soup.select(selector='h1 span')[1].text
        division = soup.select(selector='div#meta div p')[2].text.split(' ')[7][4:]
        season = soup.select(selector='h1 span')[0].text
        data_dict['Name'] = team_name
        data_dict['Division'] = division
        data_dict['Season'] = season
        # Loading the team's stats and analytics tables
        team_stats = soup.select(selector='table#team_stats tr')[1]
        team_analytics = soup.select(selector='table#team_stats_adv tr')[1]
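        # Going through each statistic and adding it to the dictionary.
        # Known issue: these cells have no 'aria-label' attribute, so .get()
        # returns None and every value overwrites the same None key (see the
        # output below). The test cell above reads 'data-stat' instead.
        for stat in team_stats:
            statistic = stat.get('aria-label')
            if statistic == 'team_name':
                continue
            value = stat.text
            data_dict[statistic] = value
        # Going through each analytic and adding it to the dictionary
        # (same 'aria-label' issue as above)
        for analytic in team_analytics:
            statistic = analytic.get('aria-label')
            if statistic == 'team_name':
                continue
            value = analytic.text
            data_dict[statistic] = value
        # Printing the row for inspection; writerow stays disabled because
        # the keys collected above do not match FIELDS yet
        print(data_dict)
        # writer.writerow(data_dict)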
        # Pausing for 5 seconds to avoid overloading the server
        time.sleep(5)

{'Name': 'Florida Panthers', 'Division': 'Atlantic', 'Season': '2023-24', None: '9.5'}
{'Name': 'Boston Bruins', 'Division': 'Atlantic', 'Season': '2023-24', None: '10.7'}
{'Name': 'Toronto Maple Leafs', 'Division': 'Atlantic', 'Season': '2023-24', None: '9.2'}
{'Name': 'Tampa Bay Lightning', 'Division': 'Atlantic', 'Season': '2023-24', None: '10.6'}
{'Name': 'Detroit Red Wings', 'Division': 'Atlantic', 'Season': '2023-24', None: '10.0'}
{'Name': 'Buffalo Sabres', 'Division': 'Atlantic', 'Season': '2023-24', None: '8.3'}
{'Name': 'Ottawa Senators', 'Division': 'Atlantic', 'Season': '2023-24', None: '9.4'}
{'Name': 'Montreal Canadiens', 'Division': 'Atlantic', 'Season': '2023-24', None: '7.6'}
{'Name': 'New York Rangers', 'Division': 'Metropolitan', 'Season': '2023-24', None: '10.0'}
{'Name': 'Carolina Hurricanes', 'Division': 'Metropolitan', 'Season': '2023-24', None: '9.9'}
{'Name': 'New York Islanders', 'Division': 'Metropolitan', 'Season': '2023-24', None: '8.0'}
{'Name': 'Washingto