
# 1) Building the DataFrame

This notebook scrapes per-game team statistics from Pro Football Reference. To avoid a Too Many Requests error from the site, the scrape is throttled and takes about two minutes per season. After some additional data cleanup and feature engineering, the DataFrame is saved locally as a CSV; this way, the other notebooks that use the DataFrame for analysis, visualization, and modelling can still run from an older copy of the CSV without rescraping Pro Football Reference each session.
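
The save-once, reuse-later pattern described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `load_or_build`, its `force_rebuild` flag, and the file name in the usage note are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_or_build(csv_path, build_fn, force_rebuild=False):
    """Return the cached DataFrame if the CSV exists, otherwise build and save it.

    `build_fn` is any zero-argument callable that returns a DataFrame
    (hypothetical helper, used here only for illustration).
    """
    path = Path(csv_path)
    if path.exists() and not force_rebuild:
        # Reuse the older local copy instead of scraping again
        return pd.read_csv(path, index_col=0)
    df = build_fn()
    df.to_csv(path)
    return df
```

In this notebook the call might look like `load_or_build('nfl_game_logs.csv', lambda: build_game_logs_df(first_season, last_season))`, where the file name is an assumption.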

As a precaution against missing values, Matthew Kim's code fills in missing statistics with 0, but in testing I did not find any missing data for games that had been played. I am assuming that this data is available on Pro Football Reference the day after a game at the latest.

Team names and stadiums have been set up to be accurate for regular season games from the 2010 season onwards.

Credit for sections of code and guidelines in this project goes to:

 - Matthew Kim, whose Pro Football Reference Web Scraper I was able to use as a guideline to build this DataFrame.
 - Josh Weiner, Jack Rosener, and Jackson Joffe of the University of Pennsylvania, whose project on modelling the outcomes of NBA games was a massive aid to this project, from creating Elo scores to running different models.
 - FiveThirtyEight, for their guidelines on creating an Elo score specific to NFL teams.
 - Wikipedia, for a list of current, former, and temporary NFL stadiums.
 - GeoHack and LatLong.net, for stadium coordinates.

[Matthew Kim's Pro Football Reference Web Scraper](https://pypi.org/project/pro-football-reference-web-scraper/)\
[Matthew Kim's Profile](https://pypi.org/user/mjk9/)\
[Matthew Kim's Pro Football Reference Web Scraper Github](https://github.com/mjk2244/pro-football-reference-web-scraper)\
[UPenn Project Write Up](https://towardsdatascience.com/predicting-the-outcome-of-nba-games-with-machine-learning-a810bb768f20)\
[UPenn Project Code](https://github.com/JoshWeiner/NBA_Game_Prediction/blob/main/CIS_545_Final_Project.ipynb)\
[FiveThirtyEight](https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/)\
[Home and Temporary NFL Stadiums](https://en.wikipedia.org/wiki/Chronology_of_home_stadiums_for_current_National_Football_League_teams#Temporary_home_stadiums)\
[GeoHack](https://geohack.toolforge.org/)\
[LatLong.net](https://www.latlong.net/)\
[Brendan Schneider's Github](https://git.generalassemb.ly/brendan-schneider/unit-4_project)

## 1.1 Imports

These are the packages used in this notebook.

In [1]:
#Imports used to scrape data from Pro Football Reference
from bs4 import BeautifulSoup
import requests
import time

#Imports used to clean and create new features for the DataFrame
import datetime
import pandas as pd
import numpy as np
from haversine import haversine, Unit

#Removing the limit to displayed columns on a Pandas DataFrame, for easier use
pd.set_option('display.max_columns', None)
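
The `haversine` package imported above computes great-circle distance between two latitude/longitude pairs. As a rough cross-check of what that calculation does, here is a pure-Python sketch of the haversine formula. The function name `great_circle_miles` and the Earth-radius constant are my own choices for illustration; the package may use a slightly different radius, so results can differ in the decimals.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation chosen for this sketch


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Coordinates taken from the stadiums dictionary in this notebook
metlife = (40.813528, -74.074361)
gillette = (42.091, -71.264)
print(round(great_circle_miles(*metlife, *gillette), 1))
```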

## 1.2 Useful Variables, Dictionaries and Lists

These variables, dictionaries, and lists simplify the code below, both for scraping data from the internet and for editing the pandas DataFrame.

In [2]:
#Setting up variables to select the first and last season to get data
first_season = 2010
last_season = 2023

#Setting a start season later than the first scraped season gives the Elo scores time to establish
#This also bypasses the issue of having null values for averages over x games before x games have been played
start_season = 2012
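
To illustrate why a later start season sidesteps null rolling averages, here is a small sketch on toy data. The column name `avg_points_prev` and the window size are hypothetical; the actual feature engineering may differ.

```python
import pandas as pd

# Toy game log: one team, two games per season (made-up numbers)
toy = pd.DataFrame({
    'team_name': ['A'] * 6,
    'season_year': [2010, 2010, 2011, 2011, 2012, 2012],
    'points_for': [10, 20, 30, 40, 50, 60],
})

window = 3  # hypothetical number of prior games to average over

# Average of the previous `window` games, excluding the current game
toy['avg_points_prev'] = (
    toy.groupby('team_name')['points_for']
       .transform(lambda s: s.shift(1).rolling(window).mean())
)

# The earliest rows have no full window yet, so they are NaN;
# keeping only rows from the later start season drops them
model_df = toy[toy['season_year'] >= 2012]
print(toy)
print(model_df)
```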

In [3]:
#Month dictionary used in Matthew Kim's rest day calculations
months = {"September": 9, "October": 10, "November": 11, "December": 12, "January": 1}

#I was getting Too Many Requests errors when I hit 30 consecutive calls to Pro Football Reference
#Splitting the 32 teams by conference gives two batches of 16, which simplifies the timing of the calls to the website
afc = ['New York Jets', 'New England Patriots', 'Buffalo Bills', 'Miami Dolphins',
       'Pittsburgh Steelers', 'Baltimore Ravens', 'Cleveland Browns', 'Cincinnati Bengals',
       'Indianapolis Colts', 'Jacksonville Jaguars', 'Houston Texans', 'Tennessee Titans', 
       'Los Angeles Chargers', 'Las Vegas Raiders', 'Kansas City Chiefs', 'Denver Broncos']

nfc = ['New York Giants', 'Philadelphia Eagles', 'Dallas Cowboys', 'Washington Commanders', 
       'Chicago Bears', 'Green Bay Packers', 'Detroit Lions', 'Minnesota Vikings',
       'New Orleans Saints', 'Tampa Bay Buccaneers', 'Carolina Panthers', 'Atlanta Falcons',
       'Arizona Cardinals', 'San Francisco 49ers', 'Los Angeles Rams', 'Seattle Seahawks']

#Dictionary from team name to franchise abbreviation used by pro-football-reference
#Additional names for the Chargers, Raiders, Commanders, and Rams aren't necessary after I added code to replace
#the name of the opposing team, but I kept them here in case I need them
team_hrefs = {
    'New York Jets': 'nyj',
    'New England Patriots': 'nwe',
    'Buffalo Bills': 'buf',
    'Miami Dolphins': 'mia',
    'Pittsburgh Steelers': 'pit',
    'Baltimore Ravens': 'rav',
    'Cleveland Browns': 'cle',
    'Cincinnati Bengals': 'cin',
    'Indianapolis Colts': 'clt',
    'Jacksonville Jaguars': 'jax',
    'Houston Texans': 'htx',
    'Tennessee Titans': 'oti',
    'Los Angeles Chargers': 'sdg', 'San Diego Chargers': 'sdg',
    'Las Vegas Raiders': 'rai', 'Oakland Raiders': 'rai',
    'Kansas City Chiefs': 'kan',
    'Denver Broncos': 'den',
    'New York Giants': 'nyg',
    'Philadelphia Eagles': 'phi',
    'Dallas Cowboys': 'dal',
    'Washington Commanders': 'was', 'Washington Football Team': 'was', 'Washington Redskins': 'was',
    'Chicago Bears': 'chi',
    'Green Bay Packers': 'gnb',
    'Detroit Lions': 'det',
    'Minnesota Vikings': 'min',
    'New Orleans Saints': 'nor',
    'Tampa Bay Buccaneers': 'tam',
    'Carolina Panthers': 'car',
    'Atlanta Falcons': 'atl',
    'Arizona Cardinals': 'crd',
    'San Francisco 49ers': 'sfo',
    'Los Angeles Rams': 'ram', 'St. Louis Rams': 'ram',
    'Seattle Seahawks': 'sea',
}

#Dictionary of team names to the abbreviations to give game log records clearer indexes
team_abbr = {'New York Jets': 'NYJ',
             'New England Patriots': 'NE',
             'Buffalo Bills': 'BUF',
             'Miami Dolphins': 'MIA',
             'Pittsburgh Steelers': 'PIT',
             'Baltimore Ravens': 'BAL',
             'Cleveland Browns': 'CLE',
             'Cincinnati Bengals': 'CIN',
             'Indianapolis Colts': 'IND',
             'Jacksonville Jaguars': 'JAC',
             'Houston Texans': 'HOU',
             'Tennessee Titans': 'TEN',
             'Los Angeles Chargers': 'LAC',
             'Las Vegas Raiders': 'LV',
             'Kansas City Chiefs': 'KC',
             'Denver Broncos': 'DEN',
             'New York Giants': 'NYG',
             'Philadelphia Eagles': 'PHI',
             'Dallas Cowboys': 'DAL',
             'Washington Commanders': 'WAS',
             'Chicago Bears': 'CHI',
             'Green Bay Packers': 'GB',
             'Detroit Lions': 'DET',
             'Minnesota Vikings': 'MIN',
             'New Orleans Saints': 'NO',
             'Tampa Bay Buccaneers': 'TB',
             'Carolina Panthers': 'CAR',
             'Atlanta Falcons': 'ATL',
             'Arizona Cardinals': 'ARI',
             'San Francisco 49ers': 'SF',
             'Los Angeles Rams': 'LAR',
             'Seattle Seahawks': 'SEA'
}

#Stadiums from 2010 regular season onwards
#List of NFL stadiums and locations used from Wikipedia
#Latitude and Longitude From GeoHack when available on Wikipedia, LatLong.net otherwise
stadiums = {
    'MetLife Stadium':               {'name': 'MetLife Stadium',                   'location': 'East Rutherford, NJ', 'latitude': 40.813528, 'longitude': -74.074361},
    'Gillette Stadium':              {'name': 'Gillette Stadium',                  'location': 'Foxborough, MA',      'latitude': 42.091,    'longitude': -71.264},
    'Highmark Stadium':              {'name': 'Highmark Stadium',                  'location': 'Orchard Park, NY',    'latitude': 42.774,    'longitude': -78.787},
    'Hard Rock Stadium':             {'name': 'Hard Rock Stadium',                 'location': 'Miami Gardens, FL',   'latitude': 25.958056, 'longitude': -80.238889},
    'Acrisure Stadium':              {'name': 'Acrisure Stadium',                  'location': 'Pittsburgh, PA',      'latitude': 40.446667, 'longitude': -80.015833},
    'M&T Bank Stadium':              {'name': 'M&T Bank Stadium',                  'location': 'Baltimore, MD',       'latitude': 39.278056, 'longitude': -76.622778},
    'Cleveland Browns Stadium':      {'name': 'Cleveland Browns Stadium',          'location': 'Cleveland, OH',       'latitude': 41.506111, 'longitude': -81.699444},
    'Paycor Stadium':                {'name': 'Paycor Stadium',                    'location': 'Cincinnati, OH',      'latitude': 39.095,    'longitude': -84.516},
    'Lucas Oil Stadium':             {'name': 'Lucas Oil Stadium',                 'location': 'Indianapolis, IN',    'latitude': 39.760056, 'longitude': -86.163806},
    'EverBank Stadium':              {'name': 'EverBank Stadium',                  'location': 'Jacksonville, FL',    'latitude': 30.323889, 'longitude': -81.6375},
    'NRG Stadium':                   {'name': 'NRG Stadium',                       'location': 'Houston, TX',         'latitude': 29.684722, 'longitude': -95.410833},
    'Nissan Stadium':                {'name': 'Nissan Stadium',                    'location': 'Nashville, TN',       'latitude': 36.166389, 'longitude': -86.771389},
    'SoFi Stadium':                  {'name': 'SoFi Stadium',                      'location': 'Inglewood, CA',       'latitude': 33.953,    'longitude': -118.339},
    'Allegiant Stadium':             {'name': 'Allegiant Stadium',                 'location': 'Paradise, NV',        'latitude': 36.090556, 'longitude': -115.183889},
    'Arrowhead Stadium':             {'name': 'Arrowhead Stadium',                 'location': 'Kansas City, MO',     'latitude': 39.048889, 'longitude': -94.483889},
    'Empower Field at Mile High':    {'name': 'Empower Field at Mile High',        'location': 'Denver, CO',          'latitude': 39.743889, 'longitude': -105.02},
    'Lincoln Financial Field':       {'name': 'Lincoln Financial Field',           'location': 'Philadelphia, PA',    'latitude': 39.900833, 'longitude': -75.1675},
    'AT&T Stadium':                  {'name': 'AT&T Stadium',                      'location': 'Arlington, TX',       'latitude': 32.747778, 'longitude': -97.092778},
    'FedExField':                    {'name': 'FedExField',                        'location': 'Landover, MD',        'latitude': 38.907778, 'longitude': -76.864444},
    'Soldier Field':                 {'name': 'Soldier Field',                     'location': 'Chicago, IL',         'latitude': 41.8623,   'longitude': -87.6167},
    'Lambeau Field':                 {'name': 'Lambeau Field',                     'location': 'Green Bay, WI',       'latitude': 44.501389, 'longitude': -88.062222},
    'Ford Field':                    {'name': 'Ford Field',                        'location': 'Detroit, MI',         'latitude': 42.34,     'longitude': -83.045556},
    'U.S. Bank Stadium':             {'name': 'U.S. Bank Stadium',                 'location': 'Minneapolis, MN',     'latitude': 44.974,    'longitude': -93.258},
    'Caesars Superdome':             {'name': 'Caesars Superdome',                 'location': 'New Orleans, LA',     'latitude': 29.950833, 'longitude': -90.081111},
    'Raymond James Stadium':         {'name': 'Raymond James Stadium',             'location': 'Tampa, FL',           'latitude': 27.975833, 'longitude': -82.503333},
    'Bank of America Stadium':       {'name': 'Bank of America Stadium',           'location': 'Charlotte, NC',       'latitude': 35.225833, 'longitude': -80.852778},
    'Mercedes-Benz Stadium':         {'name': 'Mercedes-Benz Stadium',             'location': 'Atlanta, GA',         'latitude': 33.755556, 'longitude': -84.4},
    'State Farm Stadium':            {'name': 'State Farm Stadium',                'location': 'Glendale, AZ',        'latitude': 33.528,    'longitude': -112.263},
    "Levi's Stadium":                {'name': "Levi's Stadium",                    'location': 'Santa Clara, CA',     'latitude': 37.403,    'longitude': -121.97},
    'Lumen Field':                   {'name': 'Lumen Field',                       'location': 'Seattle, WA',         'latitude': 47.5952,   'longitude': -122.3316},
    'Wembley Stadium':               {'name': 'Wembley Stadium',                   'location': 'London, UK',          'latitude': 51.555556, 'longitude': -0.279444}, #International Series in UK
    'Twickenham Stadium':            {'name': 'Twickenham Stadium',                'location': 'London, UK',          'latitude': 51.455990, 'longitude': -0.342329}, #International Series in UK
    'Tottenham Hotspur Stadium':     {'name': 'Tottenham Hotspur Stadium',         'location': 'London, UK',          'latitude': 51.604444, 'longitude': -0.066389}, #International Series in UK
    'Estadio Azteca':                {'name': 'Estadio Azteca',                    'location': 'Mexico City, MX',     'latitude': 19.303056, 'longitude': -99.150556}, #International Series in Mexico
    'Allianz Arena':                 {'name': 'Allianz Arena',                     'location': 'Munich, DE',          'latitude': 48.218889, 'longitude': 11.624722}, #International Series in Germany
    'Deutsche Bank Park':            {'name': 'Deutsche Bank Park',                'location': 'Frankfurt, DE',       'latitude': 50.068056, 'longitude': 8.645806}, #International Series in Germany
    'Rogers Centre':                 {'name': 'Rogers Centre',                     'location': 'Toronto, ON',         'latitude': 43.641389, 'longitude': -79.389167}, #Bills Toronto Series
    'Oakland Coliseum':              {'name': 'Oakland Coliseum',                  'location': 'Oakland, CA',         'latitude': 37.751667, 'longitude': -122.200556}, #Raiders played here until 2019
    'Dignity Health Sports Park':    {'name': 'Dignity Health Sports Park',        'location': 'Carson, CA',          'latitude': 33.864,    'longitude': -118.261}, #Chargers played here from 2017-2019
    'San Diego Stadium':             {'name': 'San Diego Stadium',                 'location': 'San Diego, CA',       'latitude': 32.783056, 'longitude': -117.119444}, #Chargers played here through 2016
    'TCF Bank Stadium':              {'name': 'TCF Bank Stadium',                  'location': 'Minneapolis, MN',     'latitude': 44.976,    'longitude': -93.225}, #Vikings played here in 2014 and 2015
    'Hubert H. Humphrey Metrodome':  {'name': 'Hubert H. Humphrey Metrodome',      'location': 'Minneapolis, MN',     'latitude': 44.973889, 'longitude': -93.258056}, #Vikings played here through 2013
    'Georgia Dome':                  {'name': 'Georgia Dome',                      'location': 'Atlanta, GA',         'latitude': 33.758,    'longitude': -84.401}, #Falcons played here through 2016
    'Los Angeles Memorial Coliseum': {'name': 'Los Angeles Memorial Coliseum',     'location': 'Los Angeles, CA',     'latitude': 34.014167, 'longitude': -118.287778}, #Rams played here 2016-2019
    "The Dome at America's Center":  {'name': "The Dome at America's Center",      'location': 'St. Louis, MO',       'latitude': 38.632778, 'longitude': -90.188611}, #Rams played here through 2015
    'Candlestick Park':              {'name': 'Candlestick Park',                  'location': 'San Francisco, CA',   'latitude': 37.713611, 'longitude': -122.386111}, #49ers played here through 2013
}

#A dictionary for the current NFL franchises to their current home stadium
home_stadiums = {'New York Jets': 'MetLife Stadium',
                 'New England Patriots': 'Gillette Stadium',
                 'Buffalo Bills': 'Highmark Stadium',
                 'Miami Dolphins': 'Hard Rock Stadium',
                 'Pittsburgh Steelers': 'Acrisure Stadium',
                 'Baltimore Ravens': 'M&T Bank Stadium',
                 'Cleveland Browns': 'Cleveland Browns Stadium',
                 'Cincinnati Bengals': 'Paycor Stadium',
                 'Indianapolis Colts': 'Lucas Oil Stadium',
                 'Jacksonville Jaguars': 'EverBank Stadium',
                 'Houston Texans': 'NRG Stadium',
                 'Tennessee Titans': 'Nissan Stadium',
                 'Los Angeles Chargers': 'SoFi Stadium',
                 'Las Vegas Raiders': 'Allegiant Stadium',
                 'Kansas City Chiefs': 'Arrowhead Stadium',
                 'Denver Broncos': 'Empower Field at Mile High',
                 'New York Giants': 'MetLife Stadium',
                 'Philadelphia Eagles': 'Lincoln Financial Field',
                 'Dallas Cowboys': 'AT&T Stadium',
                 'Washington Commanders': 'FedExField',
                 'Chicago Bears': 'Soldier Field',
                 'Green Bay Packers': 'Lambeau Field',
                 'Detroit Lions': 'Ford Field',
                 'Minnesota Vikings': 'U.S. Bank Stadium',
                 'New Orleans Saints': 'Caesars Superdome',
                 'Tampa Bay Buccaneers': 'Raymond James Stadium',
                 'Carolina Panthers': 'Bank of America Stadium',
                 'Atlanta Falcons': 'Mercedes-Benz Stadium',
                 'Arizona Cardinals': 'State Farm Stadium',
                 'San Francisco 49ers': "Levi's Stadium",
                 'Los Angeles Rams': 'SoFi Stadium',
                 'Seattle Seahawks': 'Lumen Field'
}
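
Because these lookups are maintained by hand, a quick consistency check can catch a missing or misspelled key before the scrape starts. The following is a minimal sketch with a hypothetical helper, `find_lookup_gaps`, run on toy excerpts rather than the full dictionaries.

```python
def find_lookup_gaps(teams, hrefs, abbrs, home, stadium_table):
    """Return a list of (team, problem) tuples; an empty list means the lookups agree."""
    gaps = []
    for team in teams:
        if team not in hrefs:
            gaps.append((team, 'missing from team_hrefs'))
        if team not in abbrs:
            gaps.append((team, 'missing from team_abbr'))
        stadium = home.get(team)
        if stadium is None:
            gaps.append((team, 'missing from home_stadiums'))
        elif stadium not in stadium_table:
            gaps.append((team, 'home stadium not in stadiums: ' + stadium))
    return gaps


# Toy excerpts with one deliberate gap (illustrative data only)
toy_teams = ['Buffalo Bills', 'Miami Dolphins']
toy_hrefs = {'Buffalo Bills': 'buf', 'Miami Dolphins': 'mia'}
toy_abbrs = {'Buffalo Bills': 'BUF', 'Miami Dolphins': 'MIA'}
toy_home = {'Buffalo Bills': 'Highmark Stadium', 'Miami Dolphins': 'Hard Rock Stadium'}
toy_stadiums = {'Highmark Stadium': {'location': 'Orchard Park, NY'}}

print(find_lookup_gaps(toy_teams, toy_hrefs, toy_abbrs, toy_home, toy_stadiums))
```

With the real dictionaries, the equivalent call would be `find_lookup_gaps(afc + nfc, team_hrefs, team_abbr, home_stadiums, stadiums)`.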

## 1.3 Helper Functions

These functions simplified how I pulled data from Pro Football Reference, cleaned it, and created new features.

In [4]:
#This function returns the home stadium for a team in a given year from 2010 onwards
def get_home_stadium(team, year):
    
    if team in ['Las Vegas Raiders', 'Oakland Raiders']:
        if year >= 2020:
            return stadiums['Allegiant Stadium']
        else:
            return stadiums['Oakland Coliseum']
        
    elif team in ['Los Angeles Chargers', 'San Diego Chargers']:
        if year >= 2020:
            return stadiums['SoFi Stadium']
        elif year >= 2017:
            return stadiums['Dignity Health Sports Park']
        else:
            return stadiums['San Diego Stadium']
        
    elif team == 'Minnesota Vikings':
        if year >= 2016:
            return stadiums['U.S. Bank Stadium']
        elif year >= 2014:
            return stadiums['TCF Bank Stadium']
        else:
            return stadiums['Hubert H. Humphrey Metrodome']
    
    elif team == 'Atlanta Falcons':
        if year >= 2017:
            return stadiums['Mercedes-Benz Stadium']
        else:
            return stadiums['Georgia Dome']
    
    elif team in ['Los Angeles Rams', 'St. Louis Rams']:
        if year >= 2020:
            return stadiums['SoFi Stadium']
        elif year >= 2016:
            return stadiums['Los Angeles Memorial Coliseum']
        else:
            return stadiums["The Dome at America's Center"]
    
    elif team == 'San Francisco 49ers':
        if year >= 2014:
            return stadiums["Levi's Stadium"]
        else:
            return stadiums['Candlestick Park']
    
    elif team in ['Washington Commanders', 'Washington Football Team', 'Washington Redskins']:
        return stadiums['FedExField']
        
    else:
        return stadiums[home_stadiums[team]]

#This function will go to the Pro Football Reference page for a given team and season
#Credit to Matthew Kim's project
def make_request(team: str, season: int):
    url = 'https://www.pro-football-reference.com/teams/%s/%s.htm' % (team_hrefs[team], str(season))
    return requests.get(url)

#This function uses BeautifulSoup to make reading the data pulled from the Pro Football Reference URL easier
#Credit to Matthew Kim's project
def get_soup(request) -> BeautifulSoup:
    return BeautifulSoup(request.text, 'html.parser')

#This function reads the team game data in the BeautifulSoup format and returns it as a DataFrame
#Credit to Matthew Kim's project for much of this function
def collect_data(soup: BeautifulSoup, season: int, team: str) -> pd.DataFrame:
    # set up data frame
    data = {
        'team_name': [],
        'season_year': [],
        'week': [],
        'day': [],
        'date': [],
        'game_time': [],
        'rest_days': [],
        'home_team': [],
        'stadium_name': [],
        'location': [],
        'latitude': [],
        'longitude': [],
        'opp': [],
        'result': [],
        'points_for': [],
        'points_allowed': [],
        'tot_yds': [],
        'pass_yds': [],
        'rush_yds': [],
        'first_downs': [],
        'turnovers': [],
        'opp_tot_yds': [],
        'opp_pass_yds': [],
        'opp_rush_yds': [],
        'opp_first_downs': [],
        'opp_turnovers': [],
        'exp_pts_off': [],
        'exp_pts_def': [],
        'exp_pts_st': [],
    }
    df = pd.DataFrame(data)

    # loading game data
    games = soup.find_all('tbody')[1].find_all('tr')

    # remove playoff games
    j = 0
    while j < len(games) and games[j].find('td', {'data-stat': 'game_date'}).text != 'Playoffs':
        j += 1
    for k in range(j, len(games)):
        games.pop()

    # remove bye weeks
    bye_weeks = []
    for j in range(len(games)):
        if games[j].find('td', {'data-stat': 'opp'}).text == 'Bye Week':
            bye_weeks.append(j)

    if len(bye_weeks) > 1:
        games.pop(bye_weeks[0])
        games.pop(bye_weeks[1] - 1)

    elif len(bye_weeks) == 1:
        games.pop(bye_weeks[0])

    # remove canceled games (pop from the end so the earlier indexes stay valid)
    to_delete = []
    for j in range(len(games)):
        if games[j].find('td', {'data-stat': 'boxscore_word'}).text == 'canceled':
            to_delete.append(j)
    for k in reversed(to_delete):
        games.pop(k)

    # gathering data
    for i in range(len(games)):
        team_name = team
        season_year = season
        week = int(games[i].find('th', {'data-stat': 'week_num'}).text)
        day = games[i].find('td', {'data-stat': 'game_day_of_week'}).text
        
        year = 0
        if games[i].find('td', {'data-stat': 'game_date'}).text.split(' ')[0] in ('January', 'February'):
            year = str(season + 1)
        else:
            year = str(season)
        
        date = datetime.datetime.strptime(games[i].find('td', {'data-stat': 'game_date'}).text + ', ' + year, '%B %d, %Y').date()
        game_time = games[i].find('td', {'data-stat': 'game_time'}).text
        
        
        if i > 0:
            if games[i - 1].find('td', {'data-stat': 'opp'}).text == 'Bye Week':
                date1 = games[i - 2].find('td', {'data-stat': 'game_date'}).text.split(' ')
                date2 = games[i].find('td', {'data-stat': 'game_date'}).text.split(' ')
            else:
                date1 = games[i - 1].find('td', {'data-stat': 'game_date'}).text.split(' ')
                date2 = games[i].find('td', {'data-stat': 'game_date'}).text.split(' ')
            if date1[0] == 'January':
                rest_days = datetime.date(season + 1, months[date2[0]], int(date2[1])) - datetime.date(
                    season + 1, months[date1[0]], int(date1[1])
                )
            elif date2[0] == 'January':
                rest_days = datetime.date(season + 1, months[date2[0]], int(date2[1])) - datetime.date(
                    season, months[date1[0]], int(date1[1])
                )
            else:
                rest_days = datetime.date(season + 1, months[date2[0]], int(date2[1])) - datetime.date(
                    season + 1, months[date1[0]], int(date1[1])
                )
        else:
            rest_days = datetime.timedelta(days=10)  # setting first game as 10 rest days

        opp = games[i].find('td', {'data-stat': 'opp'}).text

        if games[i].find('td', {'data-stat': 'game_location'}).text == '@':
            home_team = False
            stadium = get_home_stadium(opp, season_year)
        else:
            home_team = True
            stadium = get_home_stadium(team_name, season_year)
        
        stadium_name = stadium['name']
        location = stadium['location']
        latitude = stadium['latitude']
        longitude = stadium['longitude']
        
        result = games[i].find('td', {'data-stat': 'game_outcome'}).text
        points_for = (
            int(games[i].find('td', {'data-stat': 'pts_off'}).text)
            if games[i].find('td', {'data-stat': 'pts_off'}).text != ''
            else 0
        )
        points_allowed = (
            int(games[i].find('td', {'data-stat': 'pts_def'}).text)
            if games[i].find('td', {'data-stat': 'pts_def'}).text != ''
            else 0
        )
        tot_yds = (
            int(games[i].find('td', {'data-stat': 'yards_off'}).text)
            if games[i].find('td', {'data-stat': 'yards_off'}).text != ''
            else 0
        )
        pass_yds = (
            int(games[i].find('td', {'data-stat': 'pass_yds_off'}).text)
            if games[i].find('td', {'data-stat': 'pass_yds_off'}).text != ''
            else 0
        )
        rush_yds = (
            int(games[i].find('td', {'data-stat': 'rush_yds_off'}).text)
            if games[i].find('td', {'data-stat': 'rush_yds_off'}).text != ''
            else 0
        )
        first_downs = (
            int(games[i].find('td', {'data-stat': 'first_down_off'}).text)
            if games[i].find('td', {'data-stat': 'first_down_off'}).text != ''
            else 0
        )
        turnovers = (
            int(games[i].find('td', {'data-stat': 'to_off'}).text)
            if games[i].find('td', {'data-stat': 'to_off'}).text != ''
            else 0
        )
        opp_tot_yds = (
            int(games[i].find('td', {'data-stat': 'yards_def'}).text)
            if games[i].find('td', {'data-stat': 'yards_def'}).text != ''
            else 0
        )
        opp_pass_yds = (
            int(games[i].find('td', {'data-stat': 'pass_yds_def'}).text)
            if games[i].find('td', {'data-stat': 'pass_yds_def'}).text != ''
            else 0
        )
        opp_rush_yds = (
            int(games[i].find('td', {'data-stat': 'rush_yds_def'}).text)
            if games[i].find('td', {'data-stat': 'rush_yds_def'}).text != ''
            else 0
        )
        opp_first_downs = (
            int(games[i].find('td', {'data-stat': 'first_down_def'}).text)
            if games[i].find('td', {'data-stat': 'first_down_def'}).text != ''
            else 0
        )
        opp_turnovers = (
            int(games[i].find('td', {'data-stat': 'to_def'}).text)
            if games[i].find('td', {'data-stat': 'to_def'}).text != ''
            else 0
        )
        exp_pts_off = (
            float(games[i].find('td', {'data-stat': 'exp_pts_off'}).text)
            if games[i].find('td', {'data-stat': 'exp_pts_off'}).text != ''
            else 0
        )
        exp_pts_def = (
            float(games[i].find('td', {'data-stat': 'exp_pts_def'}).text)
            if games[i].find('td', {'data-stat': 'exp_pts_def'}).text != ''
            else 0
        )
        exp_pts_st = (
            float(games[i].find('td', {'data-stat': 'exp_pts_st'}).text)
            if games[i].find('td', {'data-stat': 'exp_pts_st'}).text != ''
            else 0
        )
        # add row to data frame
        df.loc[len(df.index)] = [
            team_name,
            season_year,
            week,
            day,
            date,
            game_time,
            rest_days,
            home_team,
            stadium_name,
            location,
            latitude,
            longitude,
            opp,
            result,
            points_for,
            points_allowed,
            tot_yds,
            pass_yds,
            rush_yds,
            first_downs,
            turnovers,
            opp_tot_yds,
            opp_pass_yds,
            opp_rush_yds,
            opp_first_downs,
            opp_turnovers,
            exp_pts_off,
            exp_pts_def,
            exp_pts_st
        ]

    return remove_upcoming_games(df)

#Function that returns a team's game log in a given season
#Credit to Matthew Kim's project
def get_team_game_log(team: str, season: int) -> pd.DataFrame:
    """A function to retrieve a team's game log in a given season.

    Returns a pandas DataFrame of a NFL team's game log in a given season, including relevant team-level statistics.

    Args:
        team (str): A NFL team's name, as it appears on Pro Football Reference
        season (int): The season of the game log you are trying to retrieve

    Returns:
        pandas.DataFrame: Each game is a row of the DataFrame

    """

    # raise exception if team name is misspelled
    if team not in team_hrefs.keys():
        raise Exception('Invalid team name. Note: spelling is case sensitive')

    # make HTTP request and extract HTML
    r = make_request(team, season)

    if r.status_code == 404:
        raise Exception('404 error. The ' + team + ' may not have existed in ' + str(season))

    # parse HTML using BeautifulSoup
    soup = get_soup(r)

    # collect data and return data frame
    return collect_data(soup, season, team)

#This function will calculate the distance from a team's home stadium to the stadium where they played the game
#Credit to Matthew Kim
def calculate_distance(city1: dict, city2: dict) -> float:
    coordinates1 = (city1['latitude'], city1['longitude'])
    coordinates2 = (city2['latitude'], city2['longitude'])
    return haversine(coordinates1, coordinates2, unit=Unit.MILES)

#This function will ensure that no future games are added to the DataFrame, as they would not have data
def remove_upcoming_games(game_log_df):
    return game_log_df[game_log_df['date'] < datetime.datetime.now().date()]

#This function will make a request to Pro Football Reference for all 32 teams in the NFL for a set time frame
#The DataFrames returned are then concatenated into one single DataFrame
#Sleeping for one minute before each batch of 16 requests avoids making too many requests at once
#This function also prints to tell users how far along in the process it is
def build_game_logs_df(year_start, year_end):

    nfl_game_logs_list = []

    for y in range(year_start, year_end + 1):
        print(y)
        
        #if y > year_start:
        print('Waiting to Load AFC')
        time.sleep(60)

        for t in afc:
            print(t)
            nfl_game_logs_list.append(get_team_game_log(t, y))

        print('Waiting to Load NFC')
        time.sleep(60)

        for t in nfc:
            print(t)
            nfl_game_logs_list.append(get_team_game_log(t, y))

    return pd.concat(nfl_game_logs_list)
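
The fixed one-minute sleeps above work, but a retry with increasing delays is another common way to handle a 429 (Too Many Requests) response. The sketch below is an alternative I am suggesting, not what the notebook does. The helper `fetch_with_backoff` and its parameters are hypothetical, and it takes any zero-argument callable that returns an object with a `status_code` attribute so it can be tried without touching the network.

```python
import time


def fetch_with_backoff(fetch, max_tries=4, base_delay=60, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or the tries run out.

    The delay doubles after each 429: base_delay, 2 * base_delay, and so on.
    The last response is returned either way, so the caller can inspect it.
    """
    response = None
    for attempt in range(max_tries):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

In this notebook it could wrap the existing request helper, for example `fetch_with_backoff(lambda: make_request('Buffalo Bills', 2023))`.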

## 1.4 Constructing the DataFrame

We can now call the 'build_game_logs_df' function to put together a DataFrame of team stats per game for a range of seasons. We will then make some additional changes to clean the DataFrame.

In [5]:
#Make the request to Pro Football Reference and create the initial DataFrame
nfl_game_logs_df = build_game_logs_df(first_season, last_season)

2010
Waiting to Load AFC
New York Jets
New England Patriots
Buffalo Bills
Miami Dolphins
Pittsburgh Steelers
Baltimore Ravens
Cleveland Browns
Cincinnati Bengals
Indianapolis Colts
Jacksonville Jaguars
Houston Texans
Tennessee Titans
Los Angeles Chargers
Las Vegas Raiders
Kansas City Chiefs
Denver Broncos
Waiting to Load NFC
New York Giants
Philadelphia Eagles
Dallas Cowboys
Washington Commanders
Chicago Bears
Green Bay Packers
Detroit Lions
Minnesota Vikings
New Orleans Saints
Tampa Bay Buccaneers
Carolina Panthers
Atlanta Falcons
Arizona Cardinals
San Francisco 49ers
Los Angeles Rams
Seattle Seahawks
2011
Waiting to Load AFC
New York Jets
New England Patriots
Buffalo Bills
Miami Dolphins
Pittsburgh Steelers
Baltimore Ravens
Cleveland Browns
Cincinnati Bengals
Indianapolis Colts
Jacksonville Jaguars
Houston Texans
Tennessee Titans
Los Angeles Chargers
Las Vegas Raiders
Kansas City Chiefs
Denver Broncos
Waiting to Load NFC
New York Giants
Philadelphia Eagles
Dallas Cowboys
Washington C

Kansas City Chiefs
Denver Broncos
Waiting to Load NFC
New York Giants
Philadelphia Eagles
Dallas Cowboys
Washington Commanders
Chicago Bears
Green Bay Packers
Detroit Lions
Minnesota Vikings
New Orleans Saints
Tampa Bay Buccaneers
Carolina Panthers
Atlanta Falcons
Arizona Cardinals
San Francisco 49ers
Los Angeles Rams
Seattle Seahawks


In [6]:
#Set the index to team-season-week granularity
nfl_game_logs_df.set_index(nfl_game_logs_df['team_name'].map(team_abbr) + '-' + nfl_game_logs_df['season_year'].map(str) + '-' + nfl_game_logs_df['week'].map(str), inplace=True)

#Rename Opponents to current franchise names
franchise_name_changes = {'Oakland Raiders': 'Las Vegas Raiders',
                          'St. Louis Rams': 'Los Angeles Rams',
                          'San Diego Chargers': 'Los Angeles Chargers',
                          'Washington Redskins': 'Washington Commanders',
                          'Washington Football Team': 'Washington Commanders'}
nfl_game_logs_df['opp'] = nfl_game_logs_df['opp'].replace(franchise_name_changes)

#Create an index for the opponent for easier joining
nfl_game_logs_df['opp_index'] = nfl_game_logs_df['opp'].map(team_abbr) + '-' + nfl_game_logs_df['season_year'].map(str) + '-' + nfl_game_logs_df['week'].map(str)

#Sets day to categorical so we can order with week starting on Thursdays
nfl_game_logs_df['day'] = pd.Categorical(nfl_game_logs_df['day'], ['Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed'])

#Turn rest days to integers
nfl_game_logs_df['rest_days'] = nfl_game_logs_df['rest_days'].apply(lambda x: int(str(x).replace(' days 00:00:00', '')));

Some NFL games took place somewhere other than the home team's stadium at the time, either as part of a scheduled international game or due to an emergency. The following block lists the teams that played in such games and updates each affected row with the actual location of the game.

In [7]:
#Adjusts games that were played at a neutral site
moved_games = []

#Deutsche Bank Park
moved_games.append(('Indianapolis Colts',   datetime.date(2023, 11, 12), 'Deutsche Bank Park'))
moved_games.append(('New England Patriots', datetime.date(2023, 11, 12), 'Deutsche Bank Park'))

moved_games.append(('Miami Dolphins',     datetime.date(2023, 11, 5), 'Deutsche Bank Park'))
moved_games.append(('Kansas City Chiefs', datetime.date(2023, 11, 5), 'Deutsche Bank Park'))

#Allianz Arena
moved_games.append(('Tampa Bay Buccaneers', datetime.date(2022, 11, 13), 'Allianz Arena'))
moved_games.append(('Seattle Seahawks',     datetime.date(2022, 11, 13), 'Allianz Arena'))

#Estadio Azteca

moved_games.append(('San Francisco 49ers',  datetime.date(2022, 11, 21), 'Estadio Azteca'))
moved_games.append(('Arizona Cardinals',    datetime.date(2022, 11, 21), 'Estadio Azteca'))

moved_games.append(('Kansas City Chiefs',   datetime.date(2019, 11, 18), 'Estadio Azteca'))
moved_games.append(('Los Angeles Chargers', datetime.date(2019, 11, 18), 'Estadio Azteca'))

moved_games.append(('New England Patriots', datetime.date(2017, 11, 19), 'Estadio Azteca'))
moved_games.append(('Las Vegas Raiders',    datetime.date(2017, 11, 19), 'Estadio Azteca'))

moved_games.append(('Houston Texans',    datetime.date(2016, 11, 21), 'Estadio Azteca'))
moved_games.append(('Las Vegas Raiders', datetime.date(2016, 11, 21), 'Estadio Azteca'))

#Tottenham Hotspur Stadium

moved_games.append(('Baltimore Ravens', datetime.date(2023, 10, 15), 'Tottenham Hotspur Stadium'))
moved_games.append(('Tennessee Titans', datetime.date(2023, 10, 15), 'Tottenham Hotspur Stadium'))

moved_games.append(('Jacksonville Jaguars', datetime.date(2023, 10, 8), 'Tottenham Hotspur Stadium'))
moved_games.append(('Buffalo Bills',        datetime.date(2023, 10, 8), 'Tottenham Hotspur Stadium'))

moved_games.append(('New York Giants', datetime.date(2022, 10, 9), 'Tottenham Hotspur Stadium'))
moved_games.append(('Green Bay Packers', datetime.date(2022, 10, 9), 'Tottenham Hotspur Stadium'))

moved_games.append(('Minnesota Vikings',  datetime.date(2022, 10, 2), 'Tottenham Hotspur Stadium'))
moved_games.append(('New Orleans Saints', datetime.date(2022, 10, 2), 'Tottenham Hotspur Stadium'))

moved_games.append(('Miami Dolphins', datetime.date(2021, 10, 17), 'Tottenham Hotspur Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2021, 10, 17), 'Tottenham Hotspur Stadium'))

moved_games.append(('New York Jets',   datetime.date(2021, 10, 10), 'Tottenham Hotspur Stadium'))
moved_games.append(('Atlanta Falcons', datetime.date(2021, 10, 10), 'Tottenham Hotspur Stadium'))

moved_games.append(('Carolina Panthers',    datetime.date(2019, 10, 13), 'Tottenham Hotspur Stadium'))
moved_games.append(('Tampa Bay Buccaneers', datetime.date(2019, 10, 13), 'Tottenham Hotspur Stadium'))

moved_games.append(('Chicago Bears',     datetime.date(2019, 10, 6), 'Tottenham Hotspur Stadium'))
moved_games.append(('Las Vegas Raiders', datetime.date(2019, 10, 6), 'Tottenham Hotspur Stadium'))

#Twickenham Stadium
moved_games.append(('Minnesota Vikings', datetime.date(2017, 10, 29), 'Twickenham Stadium'))
moved_games.append(('Cleveland Browns',  datetime.date(2017, 10, 29), 'Twickenham Stadium'))

moved_games.append(('Arizona Cardinals', datetime.date(2017, 10, 22), 'Twickenham Stadium'))
moved_games.append(('Los Angeles Rams',  datetime.date(2017, 10, 22), 'Twickenham Stadium'))

moved_games.append(('New York Giants',   datetime.date(2016, 10, 23), 'Twickenham Stadium'))
moved_games.append(('Los Angeles Rams',  datetime.date(2016, 10, 23), 'Twickenham Stadium'))

#Wembley Stadium

moved_games.append(('Atlanta Falcons',      datetime.date(2023, 10, 1), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2023, 10, 1), 'Wembley Stadium'))

moved_games.append(('Denver Broncos',       datetime.date(2022, 10, 30), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2022, 10, 30), 'Wembley Stadium'))

moved_games.append(('Houston Texans',       datetime.date(2019, 11, 3), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2019, 11, 3), 'Wembley Stadium'))

moved_games.append(('Cincinnati Bengals', datetime.date(2019, 10, 27), 'Wembley Stadium'))
moved_games.append(('Los Angeles Rams',   datetime.date(2019, 10, 27), 'Wembley Stadium'))

moved_games.append(('Philadelphia Eagles',  datetime.date(2018, 10, 28), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2018, 10, 28), 'Wembley Stadium'))

moved_games.append(('Tennessee Titans',     datetime.date(2018, 10, 21), 'Wembley Stadium'))
moved_games.append(('Los Angeles Chargers', datetime.date(2018, 10, 21), 'Wembley Stadium'))

moved_games.append(('Seattle Seahawks',  datetime.date(2018, 10, 14), 'Wembley Stadium'))
moved_games.append(('Las Vegas Raiders', datetime.date(2018, 10, 14), 'Wembley Stadium'))

moved_games.append(('New Orleans Saints', datetime.date(2017, 10, 1), 'Wembley Stadium'))
moved_games.append(('Miami Dolphins',     datetime.date(2017, 10, 1), 'Wembley Stadium'))

moved_games.append(('Baltimore Ravens',     datetime.date(2017, 9, 24), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2017, 9, 24), 'Wembley Stadium'))

moved_games.append(('Washington Commanders', datetime.date(2016, 10, 30), 'Wembley Stadium'))
moved_games.append(('Cincinnati Bengals',    datetime.date(2016, 10, 30), 'Wembley Stadium'))

moved_games.append(('Indianapolis Colts',   datetime.date(2016, 10, 2), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2016, 10, 2), 'Wembley Stadium'))

moved_games.append(('Detroit Lions',      datetime.date(2015, 11, 1), 'Wembley Stadium'))
moved_games.append(('Kansas City Chiefs', datetime.date(2015, 11, 1), 'Wembley Stadium'))

moved_games.append(('Buffalo Bills',        datetime.date(2015, 10, 25), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2015, 10, 25), 'Wembley Stadium'))

moved_games.append(('New York Jets',  datetime.date(2015, 10, 4), 'Wembley Stadium'))
moved_games.append(('Miami Dolphins', datetime.date(2015, 10, 4), 'Wembley Stadium'))

moved_games.append(('Dallas Cowboys',       datetime.date(2014, 11, 9), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2014, 11, 9), 'Wembley Stadium'))

moved_games.append(('Detroit Lions',   datetime.date(2014, 10, 26), 'Wembley Stadium'))
moved_games.append(('Atlanta Falcons', datetime.date(2014, 10, 26), 'Wembley Stadium'))

moved_games.append(('Miami Dolphins',    datetime.date(2014, 9, 28), 'Wembley Stadium'))
moved_games.append(('Las Vegas Raiders', datetime.date(2014, 9, 28), 'Wembley Stadium'))

moved_games.append(('San Francisco 49ers',  datetime.date(2013, 10, 27), 'Wembley Stadium'))
moved_games.append(('Jacksonville Jaguars', datetime.date(2013, 10, 27), 'Wembley Stadium'))

moved_games.append(('Pittsburgh Steelers', datetime.date(2013, 9, 29), 'Wembley Stadium'))
moved_games.append(('Minnesota Vikings',   datetime.date(2013, 9, 29), 'Wembley Stadium'))

moved_games.append(('New England Patriots', datetime.date(2012, 10, 28), 'Wembley Stadium'))
moved_games.append(('Los Angeles Rams',     datetime.date(2012, 10, 28), 'Wembley Stadium'))

moved_games.append(('Chicago Bears',        datetime.date(2011, 10, 23), 'Wembley Stadium'))
moved_games.append(('Tampa Bay Buccaneers', datetime.date(2011, 10, 23), 'Wembley Stadium'))

moved_games.append(('Denver Broncos',      datetime.date(2010, 10, 31), 'Wembley Stadium'))
moved_games.append(('San Francisco 49ers', datetime.date(2010, 10, 31), 'Wembley Stadium'))

#Bills Toronto Series
moved_games.append(('Atlanta Falcons', datetime.date(2013, 12, 1), 'Rogers Centre'))
moved_games.append(('Buffalo Bills',   datetime.date(2013, 12, 1), 'Rogers Centre'))

moved_games.append(('Seattle Seahawks', datetime.date(2012, 12, 16), 'Rogers Centre'))
moved_games.append(('Buffalo Bills',    datetime.date(2012, 12, 16), 'Rogers Centre'))

moved_games.append(('Washington Commanders', datetime.date(2011, 10, 30), 'Rogers Centre'))
moved_games.append(('Buffalo Bills',         datetime.date(2011, 10, 30), 'Rogers Centre'))

moved_games.append(('Chicago Bears', datetime.date(2010, 11, 7), 'Rogers Centre'))
moved_games.append(('Buffalo Bills', datetime.date(2010, 11, 7), 'Rogers Centre'))

#Blizzard to Ford Field
moved_games.append(('Cleveland Browns', datetime.date(2022, 11, 20), 'Ford Field'))
moved_games.append(('Buffalo Bills',    datetime.date(2022, 11, 20), 'Ford Field'))

#Hurricane Ida to Everbank in Jacksonville
moved_games.append(('Green Bay Packers',  datetime.date(2021, 9, 12), 'EverBank Stadium'))
moved_games.append(('New Orleans Saints', datetime.date(2021, 9, 12), 'EverBank Stadium'))

#Covid Ban in Santa Clara County, to State Farm Stadium
moved_games.append(('Seattle Seahawks',    datetime.date(2021, 1, 3), 'State Farm Stadium'))
moved_games.append(('San Francisco 49ers', datetime.date(2021, 1, 3), 'State Farm Stadium'))

moved_games.append(('Washington Commanders',    datetime.date(2020, 12, 13), 'State Farm Stadium'))
moved_games.append(('San Francisco 49ers',      datetime.date(2020, 12, 13), 'State Farm Stadium'))

moved_games.append(('Buffalo Bills',       datetime.date(2020, 12, 7), 'State Farm Stadium'))
moved_games.append(('San Francisco 49ers', datetime.date(2020, 12, 7), 'State Farm Stadium'))

#Blizzard to Ford Field
moved_games.append(('New York Jets', datetime.date(2014, 11, 24), 'Ford Field'))
moved_games.append(('Buffalo Bills', datetime.date(2014, 11, 24), 'Ford Field'))

#Metrodome collapse, to TCF Bank Stadium
moved_games.append(('Chicago Bears',     datetime.date(2010, 12, 20), 'TCF Bank Stadium'))
moved_games.append(('Minnesota Vikings', datetime.date(2010, 12, 20), 'TCF Bank Stadium'))

#Metrodome collapse, to Ford Field
moved_games.append(('New York Giants',   datetime.date(2010, 12, 13), 'Ford Field'))
moved_games.append(('Minnesota Vikings', datetime.date(2010, 12, 13), 'Ford Field'))

#Overwrite the stadium details on the row for each team that played in a relocated game
for t, d, s in moved_games:
    moved_game_mask = (nfl_game_logs_df['team_name'] == t) & (nfl_game_logs_df['date'] == d)
    nfl_game_logs_df.loc[moved_game_mask, 'stadium_name'] = stadiums[s]['name']
    nfl_game_logs_df.loc[moved_game_mask, 'location'] = stadiums[s]['location']
    nfl_game_logs_df.loc[moved_game_mask, 'latitude'] = stadiums[s]['latitude']
    nfl_game_logs_df.loc[moved_game_mask, 'longitude'] = stadiums[s]['longitude']

In [8]:
#Now that all of the game locations are accurate, we calculate how far each team travelled to get to the game
nfl_game_logs_df['distance_travelled']     = nfl_game_logs_df.apply(lambda x: calculate_distance(get_home_stadium(x['team_name'], x['season_year']), stadiums[x['stadium_name']]), axis=1)
nfl_game_logs_df['distance_travelled_opp'] = nfl_game_logs_df.apply(lambda x: calculate_distance(get_home_stadium(x['opp']      , x['season_year']), stadiums[x['stadium_name']]), axis=1)
nfl_game_logs_df['distance_travelled_opp_diff'] = nfl_game_logs_df['distance_travelled'] - nfl_game_logs_df['distance_travelled_opp']
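
`calculate_distance` is defined earlier in the notebook. As a rough illustration of how a distance between two stadiums can be derived from latitude and longitude, here is a haversine (great-circle) sketch. It is an assumption that the notebook's function works this way; `haversine_miles`, the choice of miles, and the coordinates below are all made up for the example.

```python
import math

# Mean Earth radius in miles; both the value and the unit are assumptions for this sketch
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


# Made-up coordinates, not real stadium locations
home_stadium = {'latitude': 40.0, 'longitude': -75.0}
opp_stadium = {'latitude': 30.0, 'longitude': -90.0}
game_site = {'latitude': 51.5, 'longitude': 0.0}

distance_travelled = haversine_miles(home_stadium['latitude'], home_stadium['longitude'],
                                     game_site['latitude'], game_site['longitude'])
distance_travelled_opp = haversine_miles(opp_stadium['latitude'], opp_stadium['longitude'],
                                         game_site['latitude'], game_site['longitude'])

# Negative when the team travelled less far than its opponent
print(distance_travelled - distance_travelled_opp)
```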

In [9]:
#Sorting the DataFrame by date, game time, stadium, and home or away for simplified analysis
nfl_game_logs_df.sort_values(['date', 'game_time', 'stadium_name', 'home_team'], ascending=[True, True, True, False], inplace=True)

## 1.5 Additional Feature Engineering

Predicting the outcome of a game before it is played means we need features that reflect a team's performance in previous games, not just the game in question. I've done this in two primary ways: rolling averages and an Elo rating.

In the block below, I've selected a number of game-level statistics available from Pro Football Reference. For each one, I calculate the team's average over its most recent games, excluding the game in question. Since I don't know in advance which window is most useful, I've prepared features for every window from the previous 1 game to the previous 8 games.
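To make the "previous games only" window concrete, here is a plain-Python sketch of the same idea for one team's values in chronological order. `prior_game_averages` is a hypothetical helper and the numbers are made up; the notebook itself does this with a grouped pandas rolling mean followed by a shift.

```python
from typing import List, Optional, Sequence


def prior_game_averages(values: Sequence[float], game_span: int) -> List[Optional[float]]:
    """Average of the game_span games played before each game, excluding that game itself."""
    averages: List[Optional[float]] = []
    for i in range(len(values)):
        if i < game_span:
            # Not enough earlier games yet; pandas would hold NaN here
            averages.append(None)
        else:
            window = values[i - game_span:i]  # stops before index i, so the current game is left out
            averages.append(sum(window) / game_span)
    return averages


# Made-up points scored by one team over four consecutive games
points_for = [10, 20, 30, 40]
two_game_avg_points_for = prior_game_averages(points_for, 2)
print(two_game_avg_points_for)
```

Leaving the current game out of its own window is what keeps the result of the game from leaking into the features used to predict it. Because the notebook groups by team only, a window early in a season reaches back into the previous season's games.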

In [10]:
#Calculates a team's average value of a stat prior to a game, over the course of a range of 1-8 games
def per_game_avg(stat, game_span):
    
    s = nfl_game_logs_df.groupby(['team_name'])[stat].rolling(window=game_span, center=False).mean().groupby('team_name').shift().reset_index(level=0)[stat]
    s.rename(f"{game_span}_game_avg_{stat}", inplace=True)
    col_series.append(s)
    return s

col_series = []
stats = ('points_for',
         'points_allowed',
         'tot_yds',
         'pass_yds',
         'rush_yds',
         'first_downs',
         'turnovers',
         'opp_tot_yds',
         'opp_pass_yds',
         'opp_rush_yds',
         'opp_first_downs',
         'opp_turnovers',
         'exp_pts_off',
         'exp_pts_def',
         'exp_pts_st')

for s in stats:
    for i in range(1, 9):
        per_game_avg(s, i)
        
w = pd.concat(col_series, axis=1)
nfl_game_logs_df = pd.concat([nfl_game_logs_df, w], axis= 1)

The second way I built a feature to predict a team's success in future games is by incorporating an **Elo rating system**. Originally developed by Arpad Elo for ranking chess players, Elo ratings can be created for any zero-sum game. The higher a team's Elo rating, the more successful they are.

In this model, teams begin with an Elo rating of 1500. After each game, a new Elo rating is calculated for each team. The winner's rating increases and the loser's rating decreases by the same amount. The change is larger when the winner beats a tougher opponent or wins by a larger margin, and smaller for a narrow win over a weaker opponent.

The University of Pennsylvania project was used to understand and develop the outline of this Elo rating system. FiveThirtyEight was also used to find some of the values that work best in an Elo system for NFL games.
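
As a compact sketch of the update described above, the functions below use the same formulas and constants as the notebook (K-factor of 20, the 2.2 margin-of-victory multiplier, and a 75%/25% preseason blend toward 1505). The function names are hypothetical, and the result is inferred here from the score rather than read from a `result` column.

```python
import math


def expected_win_prob(elo_team: float, elo_opp: float) -> float:
    """Expected win probability for the team, given both ratings before the game."""
    return 1.0 / (10 ** ((elo_opp - elo_team) / 400) + 1.0)


def elo_change(elo_team: float, elo_opp: float,
               points_for: int, points_allowed: int, k: float = 20.0) -> float:
    """Change in the team's Elo rating after one game."""
    if points_for > points_allowed:
        result_value, elo_winner_diff = 1.0, elo_team - elo_opp
    elif points_for < points_allowed:
        result_value, elo_winner_diff = 0.0, elo_opp - elo_team
    else:
        # Tie: the margin is 0, so the multiplier below is 0 whatever this value is
        result_value, elo_winner_diff = 0.5, 1.0

    mov = abs(points_for - points_allowed)  # margin of victory
    mov_mult = math.log(mov + 1) * 2.2 / (elo_winner_diff * 0.001 + 2.2)
    return (result_value - expected_win_prob(elo_team, elo_opp)) * k * mov_mult


def preseason_elo(elo_end: float, mean: float = 1505.0, carryover: float = 0.75) -> float:
    """Rating at the start of a new season: a blend of last season's final rating and the mean."""
    return elo_end * carryover + mean * (1 - carryover)


# Two evenly rated teams, home side wins 27-20
winner_change = elo_change(1500, 1500, 27, 20)
loser_change = elo_change(1500, 1500, 20, 27)
print(winner_change, loser_change, preseason_elo(1600))
```

The winner's gain and the loser's loss cancel out, so the total rating across the league is unchanged by any single game. With the formula as written, a tie leaves both ratings unchanged because the margin-of-victory multiplier is zero.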

In [11]:
#Sets a value for a win, tie, and loss to help calculate Elo change
result_values = {'W': "1", 'T': "0.5", 'L': "0"}

#Iterate through each team game log to set the Elo score features
for index, tg in nfl_game_logs_df.iterrows():
    
    #Create a DataFrame of previous games for both the team and their opponent
    previous_games =     nfl_game_logs_df[(nfl_game_logs_df['team_name'] == tg['team_name'])     & (nfl_game_logs_df['date'] < tg['date'])]
    opp_previous_games = nfl_game_logs_df[(nfl_game_logs_df['team_name'] == tg['opp']) & (nfl_game_logs_df['date'] < tg['date'])]
    
    #If the team has no previous games in the data set, set their initial Elo score to 1500
    #Otherwise, set their Elo score at the start of the game to their Elo score at the end of their last game
    #Unless this is the first game of a new season, in which case use an average of 75% Elo at the end of the last game
    ## and 25% 1505
    
    ##**The Tampa Bay Buccaneers and Miami Dolphins had a Week 1 bye in 2017 due to a hurricane
    
    if previous_games.shape[0] == 0:
        nfl_game_logs_df.loc[index, 'elo_start'] = 1500
    elif nfl_game_logs_df.loc[index, 'week'] == 1 or (nfl_game_logs_df.loc[index, 'team_name'] in ('Tampa Bay Buccaneers', 'Miami Dolphins') and nfl_game_logs_df.loc[index, 'season_year'] == 2017 and nfl_game_logs_df.loc[index, 'week'] == 2):
        #Use .iloc[-1] for the most recent game; [0] on a string-labelled index relies on a deprecated positional fallback
        nfl_game_logs_df.loc[index, 'elo_start'] = (previous_games['elo_end'].iloc[-1] * 0.75) + (1505 * 0.25)
    else:
        nfl_game_logs_df.loc[index, 'elo_start'] = previous_games['elo_end'].iloc[-1]
    
    #Repeat the calculations above to get the Elo score for the opponent at the start of the game
    if opp_previous_games.shape[0] == 0:
        nfl_game_logs_df.loc[index, 'elo_start_opp'] = 1500
    elif nfl_game_logs_df.loc[index, 'week'] == 1 or (nfl_game_logs_df.loc[index, 'opp'] in ('Tampa Bay Buccaneers', 'Miami Dolphins') and nfl_game_logs_df.loc[index, 'season_year'] == 2017 and nfl_game_logs_df.loc[index, 'week'] == 2):
        nfl_game_logs_df.loc[index, 'elo_start_opp'] = (opp_previous_games['elo_end'].iloc[-1] * 0.75) + (1505 * 0.25)
    else:
        nfl_game_logs_df.loc[index, 'elo_start_opp'] = opp_previous_games['elo_end'].iloc[-1]

    #According to FiveThirtyEight, we can use the Elo score of both teams to calculate their expected win probabilities
    nfl_game_logs_df.loc[index, 'elo_opp_diff_team'] = nfl_game_logs_df.loc[index, 'elo_start_opp'] - nfl_game_logs_df.loc[index, 'elo_start']
    nfl_game_logs_df.loc[index, 'exp_win_prob'] = 1 /(10**(nfl_game_logs_df.loc[index, 'elo_opp_diff_team'] / 400) + 1)
        
    #Save the value of the game result in the DataFrame
    nfl_game_logs_df.loc[index, 'result_value'] = result_values[tg['result']]
    
    #K-factor determines sensitivity, how much the result of a single game should affect the new Elo score
    #FiveThirtyEight found that 20 was a good K-factor for their NFL Elo calculations
    k = 20
    
    #Margin of Victory for the winner
    mov = abs(tg['points_for'] - tg['points_allowed'])
    
    #Get the Elo score difference from the point of view of the winner
    elo_winner_diff = 0
    if tg['result'] == 'W':
        elo_winner_diff = nfl_game_logs_df.loc[index, 'elo_start'] - nfl_game_logs_df.loc[index, 'elo_start_opp']
    elif tg['result'] == 'L':
        elo_winner_diff = nfl_game_logs_df.loc[index, 'elo_start_opp'] - nfl_game_logs_df.loc[index, 'elo_start']
    ## The Elo Difference for the winner does not matter for ties
    else:
        elo_winner_diff = 1
    
    #Calculate another multiplier based on the margin of victory and the difference in Elo scores
    #This formula again comes from FiveThirtyEight
    mov_mult = np.log(mov + 1) * 2.2 / (elo_winner_diff * 0.001 + 2.2)
    
    #Save the values in the DataFrame itself
    #Other values are used to calculate the change in Elo score and the Elo score for the team after the game
    nfl_game_logs_df.loc[index, 'mov_winner'] = mov
    nfl_game_logs_df.loc[index, 'elo_winner_diff'] = elo_winner_diff
    nfl_game_logs_df.loc[index, 'mov_mult'] = mov_mult
    nfl_game_logs_df.loc[index, 'elo_change'] = (float(nfl_game_logs_df.loc[index, 'result_value']) - nfl_game_logs_df.loc[index, 'exp_win_prob']) * k * mov_mult
    nfl_game_logs_df.loc[index, 'elo_end'] = nfl_game_logs_df.loc[index, 'elo_start'] + nfl_game_logs_df.loc[index, 'elo_change']

In [12]:
#I'm joining each team game record to their opponent's team game record. This doubles the number of columns, but makes finding statistics about the opponent easier
nfl_game_logs_df = nfl_game_logs_df.merge(nfl_game_logs_df, left_index=True, right_on='opp_index', suffixes=('', '_opp')).set_axis(nfl_game_logs_df.index)

In [13]:
#Finish this Notebook by saving the DataFrame as a csv locally.
#This way, other notebooks can use it without waiting to load the data again
#Skipping the first few seasons so Elo can settle and we don't have null averages over the past number of games
nfl_game_logs_df[nfl_game_logs_df['season_year'] >= start_season].to_csv('nfl_game_logs_df.csv')