# A new (better) scraper

A new, more versatile scraper, tailored to pro-football-reference's formatting. It uses a Selenium webdriver to render a JavaScript page into static HTML, then reads *every* table on the page. The advantage of a webdriver is that the page can be manipulated programmatically. In this case I only needed it to render the tables, but it was a good exercise for learning the basics.

A note: this scraper is much slower and more prone to failure than other scrapers I've written. For that reason, the main function gives a clear indication when it fails for whatever reason, so that a loop over many urls can retry the same url as many times as needed before moving on.

# The scraper itself

In [1]:
# load packages
import pandas as pd
import numpy as np
from lxml import etree
from selenium import webdriver
import time
import json
import re

In [2]:
def read_table( html_tree, 
                tablename ):
    
    # Make list to house dictionaries for each row
    rows = []
    tablepath = '//table[@id="{0}"]/tbody/tr'.format(tablename)
    
    # Split table into rows
    for row in html_tree.xpath(tablepath):
        
        # Make a dictionary to store each cell in the row
        rd = {}
        # Record the row's class attribute (e.g. "bold") when it has one
        rowclass = row.xpath('./@class')
        if rowclass:
            rd["rowclass"] = rowclass[0]
        try:
            cells = row.xpath('./td|./th')
            for cell in cells:
                
                # Depending on cell contents, add cell to row dict
                try:
                    txt = cell.xpath('./text()')
                    a_text = cell.xpath('./a/text()')
                    a_href = cell.xpath('./a/@href')
                    stat = cell.xpath('./@data-stat')
                
                    # Logic map for cell contents
                    if (len(txt) >= 1) and (len(a_text) >= 1):
                        # Have both links and standard text. Save both
                        rd[stat[0]+"_text"] = "brk, ".join(txt)
                        rd[stat[0]+"_a"] = ", ".join(a_text)
                        rd[stat[0]+"_href"] = ", ".join(a_href)
                    elif len(a_text) >= 1:
                        # Have just text from a link
                        rd[stat[0]+"_a"] = a_text[0]
                        rd[stat[0]+"_href"] = a_href[0]
                    else:
                        try:
                            # Maybe we just have text
                            rd[stat[0]] = txt[0]
                        except IndexError:
                            # If all fails, then we probably have no text
                            rd[stat[0]] = ""
                                        
                except Exception:
                    # Catch Exception rather than everything, so Ctrl-C still stops a long scrape
                    print("Couldn't parse cell")
            
            
            # Add row dictionary to list of rows
            rows.append(rd)

        except Exception:
            # Skip rows that cannot be read at all
            pass
        
    return rows
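
To make the three branches of the logic map concrete, the sketch below applies the same rules to hand-built lists instead of a parsed page. `cell_to_entries` is a hypothetical helper written for illustration, not part of the scraper; its arguments stand in for the results of the four `xpath` calls on a single cell.

```python
def cell_to_entries(stat, txt, a_text, a_href):
    """Mirror read_table's logic map for one cell (illustrative helper).

    stat   -- value of the cell's data-stat attribute
    txt    -- list of bare text nodes in the cell
    a_text -- list of link texts in the cell
    a_href -- list of link targets in the cell
    """
    if txt and a_text:
        # Both plain text and links: keep all three pieces
        return {
            stat + "_text": "brk, ".join(txt),
            stat + "_a": ", ".join(a_text),
            stat + "_href": ", ".join(a_href),
        }
    if a_text:
        # Link only: keep the first link's text and target
        return {stat + "_a": a_text[0], stat + "_href": a_href[0]}
    # Plain text, or an empty cell stored as ""
    return {stat: txt[0] if txt else ""}


# Hand-written sample inputs, one per branch that commonly occurs
print(cell_to_entries("player", [], ["Tyrod Taylor"], ["/players/T/TaylTy00.htm"]))
print(cell_to_entries("net_yds", ["57"], [], []))
print(cell_to_entries("play_count_tip", [], [], []))
```

This is why a linked cell shows up as two columns (`player_a` and `player_href`), while a plain cell keeps its `data-stat` name unchanged.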

In [3]:
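# One function to render a url in a headless browser and parse all of the tables on the page
def get_tables(url):
    
    options = webdriver.ChromeOptions()
    options.add_argument('headless')

    # Newer Selenium releases take `options`; the `chrome_options` keyword is deprecated
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        page_html = driver.page_source
        tree = etree.HTML(page_html)
        tablenames = tree.xpath('//table/@id')
        
    except Exception:
        print("webdriver failed to get url",url)
        # Failure marker: callers compare the result against {"": ""} to decide whether to retry
        return {"": ""}
    finally:
        # Always close the browser, whether or not the page loaded
        driver.quit()
    
    # Read each table into a list of row dictionaries, keyed by table id
    tables = {}
    for tab in tablenames:
        try:
            tables[tab] = read_table(tree, tab)
        except Exception:
            print("Failed to read table",tab)
            tables[tab] = ""
            
    return tables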

# Demonstrate the scraper

Scrape one page from pro-football-reference.

In [4]:
# Scrape one page
url = "https://www.pro-football-reference.com/boxscores/201511150gnb.htm"
page_dict = get_tables(url)

Let's see which tables that page contains.

In [5]:
page_dict.keys()

dict_keys(['scoring', 'game_info', 'officials', 'expected_points', 'team_stats', 'player_offense', 'player_defense', 'returns', 'kicking', 'home_starters', 'vis_starters', 'home_snap_counts', 'vis_snap_counts', 'targets_directions', 'rush_directions', 'pass_tackles', 'rush_tackles', 'home_drives', 'vis_drives', 'pbp_clone', 'pbp'])
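
Beyond the table names, a quick row count per table gives a feel for what was scraped. The sketch below uses a small hand-made dictionary with the same shape as `page_dict` (table id mapped to a list of row dictionaries); `table_sizes` and the sample values are placeholders for illustration, not scraped data.

```python
def table_sizes(tables):
    """Return {table id: number of rows} for a dict shaped like page_dict."""
    return {name: len(rows) for name, rows in tables.items()}


# Placeholder data with the same structure as the get_tables() result
sample = {
    "home_drives": [{"drive_num": "1"}, {"drive_num": "2"}],
    "vis_drives": [{"drive_num": "1"}],
}
print(table_sizes(sample))
```

A table that failed to parse is stored as an empty string, so it shows up here with a count of 0.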

How can I look at one of these tables? Turn it into a pandas DataFrame!

In [6]:
pd.DataFrame(page_dict['home_drives'])

Unnamed: 0,drive_num,end_event,net_yds,play_count_tip,quarter,rowclass,start_at,time_start,time_total
0,1,Field Goal,57,,1,bold,GNB 17,15:00,4:54
1,2,Punt,2,,1,,GNB 39,8:20,1:05
2,3,Punt,25,,1,,GNB 16,6:24,2:29
3,4,Punt,2,,1,,GNB 21,1:45,1:57
4,5,Punt,46,,2,,GNB 7,13:27,5:11
5,6,Punt,7,,2,,GNB 5,1:30,0:45
6,7,End of Half,-1,,2,,GNB 20,0:12,0:12
7,8,Punt,2,,3,,GNB 20,13:33,0:45
8,9,Punt,41,,3,,GNB 16,11:20,3:10
9,10,Punt,1,,3,,GNB 8,1:23,1:34
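
One thing to keep in mind: every cell is scraped as a string, so columns such as `net_yds` and `time_total` need converting before any arithmetic. The sketch below shows one way to do that with plain Python on two rows copied from the output above; `clock_to_seconds` is a hypothetical helper, and it assumes the clock values always look like `M:SS`.

```python
def clock_to_seconds(clock):
    """Convert an 'M:SS' string such as '4:54' to a number of seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


# Two rows from the home_drives table, as scraped (all values are strings)
drives = [
    {"drive_num": "1", "net_yds": "57", "time_total": "4:54"},
    {"drive_num": "2", "net_yds": "2", "time_total": "1:05"},
]

# Convert on the fly, then aggregate
total_yards = sum(int(d["net_yds"]) for d in drives)
total_seconds = sum(clock_to_seconds(d["time_total"]) for d in drives)
print(total_yards, total_seconds)
```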


Excellent! How can I save this information to work with later?

In [7]:
with open("test.json","w") as f:
    json.dump(page_dict, f)

How do I read this JSON file back? Will the contents look the same after I read them from disk?

In [8]:
with open("test.json",'r') as f:
    read_dict = json.load(f)

In [9]:
pd.DataFrame(read_dict['vis_drives'])

Unnamed: 0,drive_num,end_event,net_yds,play_count_tip,quarter,rowclass,start_at,time_start,time_total
0,1,Punt,-1,,1,,DET 24,10:06,1:46
1,2,Punt,2,,1,,DET 26,7:15,0:51
2,3,Punt,16,,1,,DET 15,3:55,2:10
3,4,Punt,8,,2,,DET 46,14:48,1:21
4,5,Punt,55,,2,,DET 3,8:16,6:46
5,6,Field Goal,22,,2,bold,DET 47,0:45,0:33
6,7,Touchdown,1,,3,bold,GNB 1,15:00,1:27
7,8,Punt,3,,3,,GNB 47,12:48,1:28
8,9,Interception,70,,3,,DET 7,8:10,6:47
9,10,Field Goal,22,,4,bold,DET 45,14:49,1:39
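
The contents should survive unchanged, because the scraper only stores strings inside lists and dictionaries, all of which JSON represents directly. A minimal check of that claim, using a small hand-copied sample and a temporary file rather than the real `test.json`:

```python
import json
import os
import tempfile

# Same shape as page_dict: table id -> list of row dicts with string values
tables = {
    "vis_drives": [
        {"drive_num": "1", "end_event": "Punt", "net_yds": "-1"},
        {"drive_num": "2", "end_event": "Punt", "net_yds": "2"},
    ]
}

# Write to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), "roundtrip.json")
with open(path, "w") as f:
    json.dump(tables, f)
with open(path, "r") as f:
    restored = json.load(f)

# True when nothing was altered by the round trip
round_trip_ok = restored == tables
print(round_trip_ok)
```

The round trip would stop being exact if the dictionaries ever held non-string keys (JSON turns them into strings) or tuples (which come back as lists).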


You mentioned looping over many urls with some fault-tolerance. How can I do that?

Here's an example of looping over a set of urls to grab the 2017 profile page for each team. The loop tries to scrape each page up to 3 times, stopping as soon as an attempt succeeds. The tables from each page are stored in a nested dictionary, keyed by team and season, that can still be saved in the same JSON format.

In [10]:
team_pages = [
    '/teams/nwe/2017.htm', '/teams/buf/2017.htm',
    '/teams/mia/2017.htm', '/teams/nyj/2017.htm', '/teams/pit/2017.htm',
    '/teams/rav/2017.htm', '/teams/cin/2017.htm', '/teams/cle/2017.htm',
    '/teams/jax/2017.htm', '/teams/oti/2017.htm', '/teams/htx/2017.htm',
    '/teams/clt/2017.htm', '/teams/kan/2017.htm', '/teams/sdg/2017.htm',
    '/teams/rai/2017.htm', '/teams/den/2017.htm', '/teams/phi/2017.htm',
    '/teams/dal/2017.htm', '/teams/was/2017.htm', '/teams/nyg/2017.htm',
    '/teams/min/2017.htm', '/teams/det/2017.htm', '/teams/gnb/2017.htm',
    '/teams/chi/2017.htm', '/teams/nor/2017.htm', '/teams/car/2017.htm',
    '/teams/atl/2017.htm', '/teams/tam/2017.htm', '/teams/ram/2017.htm',
    '/teams/sea/2017.htm', '/teams/crd/2017.htm', '/teams/sfo/2017.htm'
]
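
Each entry above has the form `/teams/<team>/<year>.htm`, so a compact dictionary key such as `buf2017` can be derived from the path itself. The sketch below does this with the standard library's URL and path helpers rather than by counting `/`-separated pieces; `team_season_key` is a hypothetical name, and the function assumes the url ends in that `/teams/<team>/<year>.htm` pattern.

```python
import posixpath
from urllib.parse import urlparse


def team_season_key(url):
    """Build a key like 'buf2017' from a url ending in /teams/<team>/<year>.htm."""
    path = urlparse(url).path                    # '/teams/buf/2017.htm'
    directory, filename = posixpath.split(path)  # '/teams/buf', '2017.htm'
    team = posixpath.basename(directory)         # 'buf'
    season = posixpath.splitext(filename)[0]     # '2017'
    return team + season


print(team_season_key("http://www.pro-football-reference.com/teams/buf/2017.htm"))
```

Because only the path is inspected, the same function works on the relative entries in `team_pages` and on full urls.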

In [11]:
teamyear = {}
# Demonstrate by scraping first 5 pages
for x in team_pages[:5]:
    
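    url = "http://www.pro-football-reference.com"+str(x)
    # Build a key such as 'nwe2017' from the team and season parts of the url
    team_and_season = "".join( url.replace('.','/').split('/')[6:8] )
    # {"":""} marks a page that has not been scraped successfully yet
    teamyear[team_and_season] = {"":""}
    tries = 0
    # Retry up to 3 times while the result is still the failure marker
    while (tries < 3) and ( teamyear[team_and_season] == {"":""} ):
        print("reading",url)
        teamyear_page = get_tables(url)
        teamyear[team_and_season] = teamyear_page
        tries += 1
        # Pause briefly between requests
        time.sleep(0.5)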

reading http://www.pro-football-reference.com/teams/nwe/2017.htm
reading http://www.pro-football-reference.com/teams/buf/2017.htm
reading http://www.pro-football-reference.com/teams/mia/2017.htm
reading http://www.pro-football-reference.com/teams/nyj/2017.htm
reading http://www.pro-football-reference.com/teams/pit/2017.htm
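
The retry logic in the loop above can also be pulled out into a reusable function. This is a minimal sketch, assuming (as the loop does) that a failed scrape comes back as `{"": ""}`; `scrape_with_retries` and `flaky_fetch` are hypothetical names, and `flaky_fetch` is a stand-in for `get_tables` so the example runs without a browser.

```python
import time

# Value that signals a failed scrape, matching the check used in the loop
FAILED = {"": ""}


def scrape_with_retries(fetch, url, max_tries=3, delay=0.5):
    """Call fetch(url) until it returns something other than FAILED.

    fetch     -- any callable with the same contract as get_tables
    max_tries -- maximum number of attempts
    delay     -- seconds to wait between attempts
    Returns (result, attempts_used).
    """
    result = FAILED
    for attempt in range(1, max_tries + 1):
        result = fetch(url)
        if result != FAILED:
            return result, attempt
        if attempt < max_tries:
            time.sleep(delay)
    return result, max_tries


# Stand-in for get_tables that fails twice, then succeeds (illustration only)
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"": ""}
    return {"passing": [{"player_a": "Tyrod Taylor"}]}


tables, attempts = scrape_with_retries(flaky_fetch, "/teams/buf/2017.htm", delay=0)
print(attempts, list(tables))
```

Returning the attempt count alongside the result makes it easy to log which pages needed retries, and which still hold the failure marker after the last attempt.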


Individual tables are still accessible, just nested.

In [12]:
pd.DataFrame(teamyear['buf2017']['passing'])

Unnamed: 0,age,comebacks,g,gs,gwd,pass_adj_net_yds_per_att,pass_adj_yds_per_att,pass_att,pass_cmp,pass_cmp_perc,...,pass_yds,pass_yds_per_att,pass_yds_per_cmp,pass_yds_per_g,player_a,player_href,pos,qb_rec,qbr,uniform_number
0,28,1.0,15,14,2.0,5.67,6.9,420,263,62.6,...,2799,6.7,10.6,186.6,Tyrod Taylor,/players/T/TaylTy00.htm,QB,8-6-0,56.4,5
1,23,,4,2,,1.24,1.4,49,24,49.0,...,252,5.1,10.5,63.0,Nathan Peterman,/players/P/PeteNa00.htm,qb,1-1-0,12.1,2
2,31,0.0,16,0,1.0,-1.43,-1.4,7,2,28.6,...,35,5.0,17.5,2.2,Joe Webb,/players/W/WebbJo00.htm,,,64.2,14
