# Exploring NHL Data Scraping
**By: Victor Ardulov**

In this notebook I will attempt to explore scraping data from the NHL® website.

Going into this, I know that the NHL publishes summary and play-by-play data as official game sheets, which are updated live during games and remain available for multiple years.

There are a couple of things that I don't know:
1. There seem to be a few different summary/stat sheets available for each game. Which stats are consistent between them?
2. Which summary is best for what I'm trying to track? The event summary seems like a good start, but the play-by-play could also be of interest.
3. What are all of the stats, and how consistent are they across years?
4. There's a game numbering system, but it's not clear how it works, so I'll have to investigate the best way to track it.

## Getting Started

For starters, let's look at the event summary of one game and try to extract what's in it. To do that, let's grab the URL and break it down.

The URL: http://www.nhl.com/scores/htmlreports/20192020/ES020627.HTM

1. The first component is what I'd call the base URL: `http://www.nhl.com/scores/htmlreports/`. This seems consistent across all the different URLs.
2. The next part is clearly a season reference: `/20192020/`
3. From perusing the rest of the summary stat sheets, it looks like `ES` stands for "Event Summary", while the next number, `02`, seems to indicate which part of the season the game is from:
    1. `01` is the preseason
    2. `02` is the regular season
    3. `03` is the postseason (playoffs)
4. The last 4 digits seem to represent the game number. Here it is `0627`, which can be seen in the image below:

![top-banner](top-banner.png)
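
The URL anatomy above can be captured in a small helper. This is only a sketch: `build_report_url`, `GAME_TYPES` and the parameter names are my own, and the layout they encode is what I've observed so far rather than anything documented.

```python
BASE_URL = "http://www.nhl.com/scores/htmlreports/"

# Game-type codes as observed in the report URLs (an assumption, not official docs)
GAME_TYPES = {"preseason": 1, "regular": 2, "postseason": 3}


def build_report_url(season_start, game_number, game_type="regular", report="ES"):
    """Assemble a report URL such as .../20192020/ES020627.HTM."""
    season = f"{season_start}{season_start + 1}"  # e.g. 2019 -> "20192020"
    filename = f"{report}{GAME_TYPES[game_type]:02d}{game_number:04d}.HTM"
    return f"{BASE_URL}{season}/{filename}"


print(build_report_url(2019, 627))
```

Keeping the report prefix as a parameter should make it easy to point the same helper at the other sheets once I know their prefixes.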

The thing I'm not 100% certain about, however, is whether the numbering is sequential. If it is, I can probably just compute the maximum number of games possible in a given season and try them in ascending order until a request comes back with a 404 or 500 HTTP status.
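
If the numbering does turn out to be sequential, the probing idea could look roughly like this. It is a sketch under that assumption: `probe_games`, `get_status` and `fake_status` are hypothetical names, the status lookup is passed in as a function so the loop can be tried without hitting the network, and the upper bound assumes 31 teams each playing 82 regular-season games, which is my understanding of the 2019-20 schedule.

```python
def max_regular_season_games(teams=31, games_per_team=82):
    # Every game involves two teams, so halve the total number of team-games.
    return teams * games_per_team // 2


def probe_games(url_for, get_status, max_games):
    """Return game numbers 1..n whose URL answers 200, stopping at the first miss."""
    found = []
    for game_number in range(1, max_games + 1):
        if get_status(url_for(game_number)) != 200:
            break  # a 404/500 suggests we ran past the last published game
        found.append(game_number)
    return found


# Stand-in for the real site: pretend only games 1-3 exist.
def fake_status(url):
    return 200 if url in {"ES020001.HTM", "ES020002.HTM", "ES020003.HTM"} else 404


print(max_regular_season_games())
print(probe_games(lambda n: f"ES02{n:04d}.HTM", fake_status, max_games=10))
```

With `requests`, `get_status` could be `lambda url: requests.get(url).status_code`. Stopping at the first miss is only safe if there are no gaps in the numbering, which is exactly the thing I still need to verify.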

This notebook explores the best strategies for extracting the data and considers how to store it most effectively in a database.

### Let's Write Some Code

To start let's just grab some webpages and rip them apart.

The `.HTM` ending on the web page URLs and the `htmlreports` part suggest to me that we don't really need anything other than `BeautifulSoup` (my favorite Python library for scraping HTML sites) and `requests` to start:

In [1]:
import requests
from bs4 import BeautifulSoup
# URLs always use "/", so use posixpath; os.path.join would insert "\" on Windows
import posixpath as path # I'm going to use this to construct urls dynamically

In [2]:
base_url = "http://www.nhl.com/scores/htmlreports/"
season_subfolder = "%d%d"
event_summary_uri = "ES02%04d.HTM"

In [3]:
# Let's start with the ES from the picture above
url = path.join(base_url, season_subfolder % (2019, 2020), event_summary_uri % 627)
print(f"requesting {url}")
r = requests.get(url)
print("Success!" if r.status_code == 200 else "Failed!!!")

requesting http://www.nhl.com/scores/htmlreports/20192020/ES020627.HTM
Success!


Since we got `Success!` above, we found a valid URL. The response object `r` should hold all of the static HTML (which should be everything on the page), and the raw bytes are available in `r.content`, as seen below:

In [4]:
print(r.content)

b'<html>\r\n<head>\r\n<META http-equiv="Content-Type" content="text/html; charset=UTF-8">\r\n<title>Event Summary</title>\r\n</head>\r\n<style type="text/css">\r\n\t\t\t\t@media screen\r\n\t\t\t\t{\r\n\t\t\t\t     .print-class { display: block;}\r\n\t\t\t\t}\r\n\t\t\t\t@media print\r\n\t\t\t\t{\r\n\t\t\t\t     .print-class { display: none;}\r\n\t\t\t\t}\r\n\r\n\t\t\t\tbody {margin: 0;border:solid; border-width: 0;}\r\n\t\t\t\tp, td {font-family: arial,verdana; font-size: 10px;}\r\n\t\t\t\ttable {empty-cells:show;}\r\n\t\t\t\t.title {fon-weight:bold;font-size:14px;}\r\n\t\t\t\t.tablewidth{width: 650px;}\r\n\t\t\t\t.sectionheading{font-weight:bold;background-color: #FFFFFF;}\r\n\t\t\t\t.visitorsectionheading{font-weight:bold;background-color: #E7E7E7;color:#000000;}\r\n\t\t\t\t.homesectionheading{font-weight:bold;background-color: #E7E7E7;color:#000000;}\r\n\t\t\t\t.heading {font-weight:bold;}\r\n\t\t\t\t.border {border:1px solid black;border-collapse: collapse;}\r\n\t\t\t\t.noborder {bo

There's a lot of HTML there, and it's deeply nested, so this is going to require some reverse engineering to scrape successfully. This is where "developer tools" in a browser and `BeautifulSoup` come in handy *(tee-hee, handy)*

In [5]:
# Name the parser explicitly; without it bs4 warns and picks whichever parser is installed
html_soup = BeautifulSoup(r.content, "lxml")

Exploring the HTML at a high level reveals there's actually not a whole lot going on: almost all of the content is stored in the (only) top-level table on the page

![html-table](html-table.png)

In [6]:
html_table = html_soup.table

We can see from the following screenshots that the table is in fact composed of rows. Each row has a single column, which is subdivided and organized into sub-tables, and luckily those seem to have consistent ids (unique identifiers, which means I won't have to do any awkward parsing at that level)

![](table-expanded.png)
![](sub-table.png)
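
To get a feel for which sub-tables carry ids, one option is to list every `<table id=...>` on the page. The sketch below uses only the standard library's `html.parser` rather than `BeautifulSoup`, and runs on a tiny hand-written snippet that mimics the nesting. The ids `GameInfo`, `Visitor` and `Home` are ones this notebook looks up on the real page, but the snippet itself and the `TableIdCollector` class are made up for illustration.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Collect the id attribute of every <table> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table_id = dict(attrs).get("id")
            if table_id is not None:
                self.table_ids.append(table_id)


# Hand-written stand-in for the page: one outer table, sub-tables inside its cells.
sample_html = """
<table>
  <tr><td><table id="Visitor"><tr><td>away</td></tr></table></td></tr>
  <tr><td><table id="GameInfo"><tr><td>Event Summary</td></tr></table></td></tr>
  <tr><td><table id="Home"><tr><td>home</td></tr></table></td></tr>
</table>
"""

collector = TableIdCollector()
collector.feed(sample_html)
print(collector.table_ids)
```

On the real page, the `BeautifulSoup` equivalent would be something like `[t["id"] for t in html_soup.find_all("table", id=True)]`.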

In [7]:
event_summary = html_soup.find(id="GameInfo")

In [8]:
print(event_summary.prettify())

<table align="center" border="0" cellpadding="0" cellspacing="0" id="GameInfo">
 <tr>
  <td align="center" style="font-size: 14px;font-weight:bold">
   Event Summary
  </td>
 </tr>
 <tr>
  <td align="center" style="font-size: 14px;font-weight:bold">
  </td>
 </tr>
 <tr>
  <td align="center" style="font-size: 10px;font-weight:bold">
  </td>
 </tr>
 <tr>
  <td align="center" style="font-size: 10px;font-weight:bold">
   Thursday, January 2, 2020
  </td>
 </tr>
 <tr>
  <td align="center" style="font-size: 10px;font-weight:bold">
   Attendance 17,850 at TD Garden
  </td>
 </tr>
 <tr>
  <td align="center" style="font-size: 10px;font-weight:bold">
   Start 7:08 EST; End 9:30 EST
  </td>
 </tr>
 <tr>
  <td align="center" style="font-size: 10px;font-weight:bold">
   Game 0627
  </td>
 </tr>
 <tr>
  <td align="center" style="font-size: 10px;font-weight:bold">
   Final
  </td>
 </tr>
</table>



In [9]:
import pandas as pd

In [21]:
# read_html returns a list of tables; [0] pulls out the only element, which is a DataFrame as seen below
event_summary_df = pd.read_html(event_summary.prettify(), flavor="bs4")[0]

In [22]:
event_summary_df

Unnamed: 0,0
0,Event Summary
1,"Thursday, January 2, 2020"
2,"Attendance 17,850 at TD Garden"
3,Start 7:08 EST; End 9:30 EST
4,Game 0627
5,Final


Now that's what I call data: we've got a location, a game number and a date.

To match the format that I created for simulated data, I'm going to start by looking in these tables for the team names, the date, and the game number. I'm going to use these to save the content to a DB eventually.

One thing of interest to notice is that there's also a game status, which will be worth considering in the future when tracking "live" data and the like.

In [28]:
date_str = pd.to_datetime(event_summary_df.iloc[1, 0]).date().isoformat()
print(f"converting {event_summary_df.iloc[1, 0]} to {date_str}")

converting Thursday, January 2, 2020 to 2020-01-02


In [29]:
game_str = event_summary_df.iloc[4, 0].split(" ")[1]
print(game_str)
status_str = event_summary_df.iloc[5, 0].lower()
print(status_str)

0627
final
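
The cells above pick fields out by row position. As an alternative sketch, the rows could be matched by their text pattern, which would also pull out the attendance, venue and start/end times. `parse_game_info` and the dict keys are hypothetical names, the patterns are inferred from this one sheet only, and the status is still assumed to be the last row.

```python
import re
from datetime import datetime


def parse_game_info(rows):
    """Turn the text rows of the GameInfo table into a dict of typed fields."""
    info = {}
    for row in rows:
        text = " ".join(row.split())  # collapse stray whitespace
        attendance = re.fullmatch(r"Attendance ([\d,]+) at (.+)", text)
        times = re.fullmatch(r"Start (\S+) (\S+); End (\S+) (\S+)", text)
        game = re.fullmatch(r"Game (\d{4})", text)
        if attendance:
            info["attendance"] = int(attendance.group(1).replace(",", ""))
            info["venue"] = attendance.group(2)
        elif times:
            info["start"], info["end"] = times.group(1), times.group(3)
            info["timezone"] = times.group(2)
        elif game:
            info["game_number"] = game.group(1)  # keep as text to preserve the zero padding
        else:
            try:
                info["date"] = datetime.strptime(text, "%A, %B %d, %Y").date()
            except ValueError:
                pass  # not a date either; ignore rows we don't recognize
    # Assumption: the status is the last row, as on this sheet.
    info["status"] = rows[-1].strip().lower()
    return info


rows = [
    "Event Summary",
    "Thursday, January 2, 2020",
    "Attendance 17,850 at TD Garden",
    "Start 7:08 EST; End 9:30 EST",
    "Game 0627",
    "Final",
]
print(parse_game_info(rows))
```

Whether these patterns hold for live games or older seasons is something I'd still have to check against more sheets.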


In [41]:
away_team_name_table = html_soup.find(id="Visitor")
visitor_teamname_df = pd.read_html(away_team_name_table.prettify(), flavor="bs4")[0]
# The cell holds the team name followed by "Game ..." text; keep only the name
visitor_teamname = visitor_teamname_df.loc[3, 0].split("Game")[0].strip()
print(visitor_teamname)

COLUMBUS BLUE JACKETS


In [44]:
home_team_name_table = html_soup.find(id="Home")
home_teamname_df = pd.read_html(home_team_name_table.prettify(), flavor="bs4")[0]
home_teamname = home_teamname_df.loc[3, 0].split("Game")[0].strip()
print(home_teamname)

BOSTON BRUINS
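
With the date, game number, status and both team names in hand, here is one way the game-level record could be stored. This is a minimal sketch using the standard library's `sqlite3` with an in-memory database. The `games` table, its column names and the composite key are assumptions of mine, not a settled schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway DB; a file path would persist it
conn.execute(
    """
    CREATE TABLE games (
        season       TEXT NOT NULL,   -- e.g. "20192020"
        game_type    TEXT NOT NULL,   -- "02" = regular season
        game_number  TEXT NOT NULL,   -- zero-padded, e.g. "0627"
        game_date    TEXT NOT NULL,   -- ISO date
        status       TEXT NOT NULL,
        home_team    TEXT NOT NULL,
        visitor_team TEXT NOT NULL,
        PRIMARY KEY (season, game_type, game_number)
    )
    """
)

# Values extracted earlier in this notebook
game = ("20192020", "02", "0627", "2020-01-02", "final",
        "BOSTON BRUINS", "COLUMBUS BLUE JACKETS")

# INSERT OR REPLACE lets a re-scrape of the same game overwrite its earlier row.
conn.execute("INSERT OR REPLACE INTO games VALUES (?, ?, ?, ?, ?, ?, ?)", game)
conn.commit()

row = conn.execute(
    "SELECT home_team, visitor_team, status FROM games WHERE game_number = ?",
    ("0627",),
).fetchone()
print(row)
```

Keying on season, game type and game number mirrors the URL structure, so the same triple that locates a report would also locate its row.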


In [45]:
player_columns = [
    "number", 
    "position", 
    "name", 
    "team",
    "G",
    "A",
    "P",
    "+/-",
    "PN",
    "PIM",
    "TOT",	
    "SHF", 
    "AVG TOI/SH",
    "PP", 
    "SH",
    "EV",
    "S",
    "A/B",	
    "MS", 
    "HT", 
    "GV", 
    "TK", 
    "BS", 
    "FW", 
    "FL", 
    "F%"
]

In [46]:
from lxml.html import document_fromstring  # lives in lxml.html, not in the top-level lxml package

In [83]:
from lxml.html.soupparser import fromstring

In [86]:
type(html_soup)

bs4.BeautifulSoup

In [132]:
print(html_soup.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Event Summary
  </title>
 </head>
 <style type="text/css">
  @media screen
				{
				     .print-class { display: block;}
				}
				@media print
				{
				     .print-class { display: none;}
				}

				body {margin: 0;border:solid; border-width: 0;}
				p, td {font-family: arial,verdana; font-size: 10px;}
				table {empty-cells:show;}
				.title {fon-weight:bold;font-size:14px;}
				.tablewidth{width: 650px;}
				.sectionheading{font-weight:bold;background-color: #FFFFFF;}
				.visitorsectionheading{font-weight:bold;background-color: #E7E7E7;color:#000000;}
				.homesectionheading{font-weight:bold;background-color: #E7E7E7;color:#000000;}
				.heading {font-weight:bold;}
				.border {border:1px solid black;border-collapse: collapse;}
				.noborder {border:0px solid black;border-collapse: collapse;}
				.tborder{border-top:1px solid black;}
				.bborder{border-bottom:1px solid black;}
				

In [89]:
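# NOTE: the object before .prettify() was missing here; html_soup (the whole page) is assumed
# read_html returns a list of DataFrames, one per <table> it finds
df = pd.read_html(html_soup.prettify(), flavor="bs4")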

In [106]:
len(df)

20

AttributeError: 'lxml.etree._Element' object has no attribute 'children'

In [None]:
# Note to self (an XPath, not Python): /html/body/table/tbody/tr[8]