## Scraping NFL Data

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os
import re

We will be scraping data from [Pro Football Reference](https://www.pro-football-reference.com/). They're the absolute best at NFL statistics, in my opinion. We will want to get:

- All MVP voting data (this will tell us the share of the vote each MVP candidate received, as well as their statistics)
- All player data
- All team data

We'll only collect data from the 1990 through 2021 seasons. 32 years of MVP candidates is probably good enough.

In [4]:
years = list(range(1990, 2022))

mvp_voting_url = 'https://www.pro-football-reference.com/awards/awards_{}.htm'


If you're not familiar with web scraping, what we'll be doing in a nutshell is downloading all of the HTML files that contain the data we need, then using `BeautifulSoup` to pull out the tables that we want.

First, we'll create a folder called 'mvp' where we'll store our HTML:

In [5]:
try:
    os.mkdir('mvp')
except FileExistsError:
    pass
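
The same "create it unless it already exists" step can be written with `os.makedirs` and `exist_ok=True`. This is a minimal sketch rather than a required change; the `ensure_dir` helper name is ours, and the demo runs in a temporary directory so it leaves nothing behind:

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'mvp')
    ensure_dir(target)
    ensure_dir(target)  # second call is a no-op and raises nothing
    print(os.path.isdir(target))
```

One difference worth knowing: `os.makedirs` also creates any missing parent folders, which `os.mkdir` does not.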

Then, we will download the MVP voting information for each year:

In [6]:
def download_mvp_html():
    for year in years:
        url = mvp_voting_url.format(year)
        r = requests.get(url)
        
        with open('mvp/mvp_voting_{}.html'.format(year), 'w') as f:
            f.write(r.text)

# download_mvp_html()
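
A download loop like this fetches every page again each time it runs and sends its requests back to back. Below is a hedged sketch of a variant that skips files already on disk and pauses after each real download. The `download_if_missing` name, the `fetch` parameter, and the 3-second default are all assumptions for illustration; check the site's current usage policy before picking a delay.

```python
import os
import time


def download_if_missing(url, path, fetch, delay=3.0):
    """Save the page at `url` to `path` unless the file already exists.

    `fetch` is any callable that takes a URL and returns the page text,
    for example `lambda u: requests.get(u).text` in this notebook.
    Returns True if a download happened, False if the file was reused.
    """
    if os.path.exists(path):
        return False

    html = fetch(url)
    with open(path, 'w') as f:
        f.write(html)

    time.sleep(delay)  # pause so requests are not sent back to back
    return True
```

Inside the loop, this would stand in for the `requests.get` and `open` lines, e.g. `download_if_missing(url, 'mvp/mvp_voting_{}.html'.format(year), lambda u: requests.get(u).text)`. Passing the fetcher in as an argument also makes the helper easy to try out without touching the network.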

We need to parse the HTML. The MVP voting table has a unique ID called "voting_apmvp", so we'll use that. The tables also have an over-header that we don't need, so we'll remove that. We also add a 'Year' column so each row records which season its voting data comes from.

In [7]:
dfs = []

for year in years:
    with open('mvp/mvp_voting_{}.html'.format(year)) as f:
        soup = BeautifulSoup(f, 'html.parser')

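    table = soup.find('table', {'id': 'voting_apmvp'})

    # Remove the over-header row from the voting table, if it has one
    over_header = table.find('tr', {'class': 'over_header'})
    if over_header:
        over_header.decompose()
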
    df = pd.read_html(str(table))[0]
    df['Year'] = year
    
    dfs.append(df)
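
`soup.find(...)` returns `None` when nothing matches, so calling `.decompose()` directly on its result fails on a page that has no over-header row. Here is a small sketch of a helper that searches inside the target table only and tolerates a missing row. The `strip_over_header` name and the tiny inline HTML are made up for illustration and are not taken from the real pages:

```python
from bs4 import BeautifulSoup


def strip_over_header(soup, table_id):
    """Return the table with id `table_id` minus its over-header rows.

    Returns None if the table is not present in the page.
    """
    table = soup.find('table', {'id': table_id})
    if table is None:
        return None

    # find_all returns an empty list when there is nothing to remove,
    # so this loop is safe on tables without an over-header.
    for row in table.find_all('tr', {'class': 'over_header'}):
        row.decompose()

    return table


# Made-up sample in the same general shape as a voting table
sample_html = """
<table id="voting_apmvp">
  <tr class="over_header"><th colspan="2">Voting</th></tr>
  <tr><th>Player</th><th>Share</th></tr>
  <tr><td>Example Player</td><td>50%</td></tr>
</table>
"""

table = strip_over_header(BeautifulSoup(sample_html, 'html.parser'), 'voting_apmvp')
header_cells = [th.text for th in table.find('tr').find_all('th')]
print(header_cells)
```

After the over-header is removed, the first remaining row is the real column header, which is what `pd.read_html` needs to name the columns sensibly.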

Finally, we combine the data frames we generated above and save them as a CSV so we don't have to keep calling the website for the data.

In [8]:
# pd.concat(dfs).to_csv('mvp_voting.csv')

Player stats are going to be a bit more complicated, mainly because A LOT of people have played football in the NFL. So, we'll need to download a list of all players:

In [9]:
player_names_url = 'https://www.pro-football-reference.com/players/{}/'

try:
    os.mkdir('players')
except FileExistsError:
    pass

alphabet = list(map(chr, range(65, 91)))

def download_player_list_html():
    for letter in alphabet:
        url = player_names_url.format(letter)
        r = requests.get(url)

        with open('players/{}.html'.format(letter), 'w') as f:
            f.write(r.text)

# download_player_list_html()

Now, we need to save a link for each player whose career began in 1990 or later. Each player in the list has a link to their stats page, along with the years they played, so with a little regex this is pretty straightforward. Note that filtering on the first year leaves out players who debuted before 1990 but were still active afterward.

In [10]:
links = []

for letter in alphabet:
    with open('players/{}.html'.format(letter)) as f:
        soup = BeautifulSoup(f, 'html.parser')
    
    # find all <p> in <div> with id 'div_players'
    for player in soup.find('div', {'id': 'div_players'}).find_all('p'):
        re_result = re.search(r'(\d{4})-\d{4}', player.text)
        if re_result is None:
            # Skip entries without a 'YYYY-YYYY' career range
            continue
        start_year = int(re_result.group(1))

        if start_year >= years[0]:
            link = player.find('a')['href'].replace('.htm', '/gamelog/')
            links.append(link)

print("There have been", len(links), "players in the NFL since", years[0])

There have been 11805 players in the NFL since 1990
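
The filter in this cell keeps a player only if their first season is 1990 or later, so a veteran whose career started in the 1980s and ran into the 1990s is not included. If those players are wanted, one option is to test the last season of the range instead. This is a sketch under the assumption that each entry's text contains a `YYYY-YYYY` range; `career_span` and `played_since` are hypothetical helper names, and the sample strings are invented:

```python
import re

CAREER_RE = re.compile(r'(\d{4})-(\d{4})')


def career_span(text):
    """Return (first_year, last_year) parsed from `text`, or None if absent."""
    match = CAREER_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def played_since(text, first_season):
    """True if the career range in `text` ends in `first_season` or later."""
    span = career_span(text)
    return span is not None and span[1] >= first_season


# Made-up entries that each contain a YYYY-YYYY career range
print(played_since('Veteran Player (QB) 1985-1993', 1990))
print(played_since('Retired Player (RB) 1980-1988', 1990))
```

Switching to this check would change the number of links collected, so any later step that relies on the player count would need to follow suit.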


That's a lot of players! Now we can download the gamelog for each one. Note that this takes a VERY long time, since it makes one request per player:

In [11]:
def download_player_html():
    try:
        os.mkdir('gamelogs')
    except FileExistsError:
        pass
    
    # Number the files by each link's position in the list
    for index, link in enumerate(links):
        r = requests.get('https://www.pro-football-reference.com' + link)

        with open('gamelogs/{}.html'.format(index), 'w') as f:
            f.write(r.text)

# download_player_html()
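
Naming files by position in `links` ties every file to the order of that list, which shifts whenever the filter changes. A hedged alternative is to name each file after the player ID embedded in the link. The sketch assumes links look like `/players/A/ExamPl00/gamelog/`, where `ExamPl00` is a made-up ID; `player_id_from_link` is a hypothetical helper:

```python
def player_id_from_link(link):
    """Return the path segment just before 'gamelog' in a player link."""
    parts = [part for part in link.split('/') if part]
    if len(parts) < 2 or parts[-1] != 'gamelog':
        raise ValueError('unexpected link format: {!r}'.format(link))
    return parts[-2]


example_link = '/players/A/ExamPl00/gamelog/'  # made-up ID for illustration
print('gamelogs/{}.html'.format(player_id_from_link(example_link)))
```

With ID-based names, a file on disk can be traced back to its player without keeping the `links` list around.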

Next, we need to parse the stats table from each HTML page, along with each player's name and birthday. The birthday matters because two players can share a name, and we'd hope they at least have different birthdays.

In [12]:
def create_df_of_all_players():
    player_dfs = []

    # One gamelog file was saved per entry in `links`
    for index in range(len(links)):
        with open('gamelogs/{}.html'.format(index)) as f:
            soup = BeautifulSoup(f, 'html.parser')

            if soup.find('tr', {'class': 'over_header'}):
                soup.find('tr', {'class': 'over_header'}).decompose()

            if soup.find('tr', {'class': 'thead'}):
                for thead in soup.find_all('tr', {'class': 'thead'}):
                    thead.decompose()

            if soup.find('table', {'class': 'stats_table'}):
                table = soup.find('table', {'class': 'stats_table'})
                df = pd.read_html(str(table))[0]

                # get value of h1 with itemprop 'name'
                if soup.find('h1', {'itemprop': 'name'}):
                    df['Name'] = soup.find('h1', {'itemprop': 'name'}).text.strip()

                if soup.find('span', {'data-birth': True}):
                    df['Birthday'] = soup.find('span', {'data-birth': True})['data-birth']

                player_dfs.append(df)

    df = pd.concat(player_dfs)
    df.to_csv('player_stats.csv.gz', compression='gzip')

# create_df_of_all_players()
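
The parsing loop needs to know which gamelog files exist. One way to make that independent of how many players were downloaded is to list the folder's contents. This is a sketch only; `numbered_html_files` is a hypothetical helper, and it assumes the files are named `<number>.html` as in the download step:

```python
import tempfile
from pathlib import Path


def numbered_html_files(folder):
    """Return the `<number>.html` files in `folder`, sorted numerically."""
    paths = [p for p in Path(folder).glob('*.html') if p.stem.isdigit()]
    return sorted(paths, key=lambda p: int(p.stem))


# Demonstrated on a temporary folder holding a few empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ('10.html', '2.html', '0.html', 'notes.txt'):
        (Path(tmp) / name).touch()
    print([p.name for p in numbered_html_files(tmp)])
```

Sorting by the integer value of the file name keeps `10.html` after `2.html`, which a plain string sort would not.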

Now, we do the same thing with team stats.

In [13]:
team_stats_url = 'https://www.pro-football-reference.com/boxscores/standings.cgi?week=18&year={}&wk_league=NFL'

def download_team_stats_html():
    try:
        os.mkdir('teams')
    except FileExistsError:
        pass

    for year in years:
        url = team_stats_url.format(year)
        r = requests.get(url)

        with open('teams/{}.html'.format(year), 'w') as f:
            f.write(r.text)

# download_team_stats_html()

Parse the tables:

In [14]:
def create_df_of_all_teams():
    team_dfs = []

    for year in years:
        with open('teams/{}.html'.format(year)) as f:
            soup = BeautifulSoup(f, 'html.parser')

            if soup.find('tr', {'class': 'over_header'}):
                soup.find('tr', {'class': 'over_header'}).decompose()

            if soup.find('tr', {'class': 'thead'}):
                for thead in soup.find_all('tr', {'class': 'thead'}):
                    thead.decompose()

            # The AFC and NFC standings are separate tables, parsed the same way
            for conference in ('AFC', 'NFC'):
                table = soup.find('table', {'id': conference})
                if table:
                    df = pd.read_html(str(table))[0]
                    df['Year'] = year

                    team_dfs.append(df)

    df = pd.concat(team_dfs)
    df.to_csv('team_stats.csv')

# create_df_of_all_teams()

Now we're done! We've collected all the data we need.