In [3]:
import sys
import os
# Correct the path so that imports from src go through
sys.path.append(os.path.abspath(".."))
from src.data_retrieval import scrape_pfr_rosters, load_seasonal_stats, load_rosters
from src.data_cleaning import numeric_col, basic_roster_info, merge_roster_data, sorted_playoffs_next

# **Predicting NFL Playoff Teams Based on Previous Season Stats**

# 1. Introduction

This mini project began as a self-driven exercise while I was reading Chapter 4 of "An Introduction to Statistical Learning" by James, Witten, Hastie, Tibshirani, and Taylor. The goal is to determine which basic team-level statistics (if any) from one NFL season can be used to predict whether a team will make the playoffs the following season. More broadly, this measures how much success or failure in one season carries over to the next.

# 2. Data Sources and Arrangement

All data were obtained from Pro Football Reference (https://www.pro-football-reference.com/) for the regular seasons from 1990 through 2024. The data were collected in two stages, using two different methods.

First, in my stubbornness and refusal to use web scraping, I manually copied and pasted the yearly team standings (see the AFC and NFC tables at https://www.pro-football-reference.com/years/2024/, for example) into a spreadsheet. I then manually changed the team names to reflect the current franchise abbreviations (e.g. 'Houston Oilers' and 'Tennessee Oilers' $\mapsto$ 'TEN' for the Tennessee Titans). To cap off this unnecessarily tedious endeavor, I added a column 'Playoffs' in which a 1 indicates a team made the playoffs that season and a 0 indicates they did not.
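
The relabeling described above could also be done programmatically. The following is a minimal sketch, not the procedure used for this project (which was manual). The dictionary `FRANCHISE_ABBR`, the helper `parse_standing`, and the sample names are hypothetical; only the Oilers to 'TEN' mapping comes from the text. It also assumes that the pasted team names carry a trailing '*' or '+' for playoff teams, which would need to be checked against the copied standings.

```python
# Hypothetical mapping from historical team names to current franchise
# abbreviations. Only the two Oilers entries come from the text above;
# a full mapping would need one entry per historical team name.
FRANCHISE_ABBR = {
    "Houston Oilers": "TEN",
    "Tennessee Oilers": "TEN",
    "Tennessee Titans": "TEN",
}


def parse_standing(raw_name):
    """Return (abbreviation, playoffs_flag) for one pasted team name.

    Assumption: playoff teams are marked with a trailing '*' or '+'.
    """
    name = raw_name.strip()
    made_playoffs = int(name.endswith(("*", "+")))
    cleaned = name.rstrip("*+").strip()
    return FRANCHISE_ABBR[cleaned], made_playoffs


# Illustrative inputs, not rows copied from the spreadsheet.
print(parse_standing("Houston Oilers+"))   # ('TEN', 1)
print(parse_standing("Tennessee Oilers"))  # ('TEN', 0)
```

A lookup that raises `KeyError` on an unknown name is deliberate here: a missing franchise should surface as an error rather than pass through unmapped.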

In [4]:
seasonal_stats = load_seasonal_stats("/Users/coledurham/Documents/nfl_playoff_predictor/data/Seasonal Stats - Season Stats.csv")
seasonal_stats.head()

Unnamed: 0,Tm,Season,W,L,T,win_percent,PF,PA,PD,MoV,SoS,SRS,OSRS,DSRS,Playoffs
0,BUF,2024,13,4,0,0.765,525,368,157,9.2,-1.1,8.1,7.8,0.3,1
1,MIA,2024,8,9,0,0.471,345,364,-19,-1.1,-1.9,-3.0,-3.5,0.4,0
2,NYJ,2024,5,12,0,0.294,338,404,-66,-3.9,-0.5,-4.3,-3.0,-1.4,0
3,NE,2024,4,13,0,0.235,289,417,-128,-7.5,-0.6,-8.1,-6.2,-1.9,0
4,BAL,2024,12,5,0,0.706,518,361,157,9.2,0.6,9.9,8.0,1.9,1


In the dataframe seasonal_stats, each row contains:


*   Tm: Team abbreviation
*   Season: Year of season
*   W, L, T: Wins, losses, and ties
*   win_percent: Win percentage
*   $PF$: Points for
*   $PA$: Points against
*   $PD$: Point differential, given by $PF-PA$
*   $MoV$: Average margin of victory, given by $\frac{PF-PA}{W+L+T}$
*   $SoS$: Strength of schedule
*   SRS: Simple rating system, given by $MoV+SoS$
*   OSRS/DSRS: Offensive and defensive ratings from the simple rating system
*   Playoffs: Binary variable indicating whether a team made the playoffs (1) or missed the playoffs (0)
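
As a quick check of the formulas in the list above, the following sketch recomputes $PD$, $MoV$, and SRS for the BUF and MIA rows of the 2024 output. The function name `derived_stats` is hypothetical. The values are rounded to one decimal for comparison, since the displayed table is itself rounded and other rows can differ in the last digit.

```python
def derived_stats(w, l, t, pf, pa, sos):
    """Recompute PD, MoV, and SRS from the raw columns."""
    point_diff = pf - pa              # PD = PF - PA
    mov = point_diff / (w + l + t)    # MoV = PD / games played
    srs = mov + sos                   # SRS = MoV + SoS
    return point_diff, mov, srs


# Inputs copied from the BUF and MIA 2024 rows shown above.
buf = derived_stats(13, 4, 0, 525, 368, -1.1)
mia = derived_stats(8, 9, 0, 345, 364, -1.9)

print(buf[0], round(buf[1], 1), round(buf[2], 1))  # 157 9.2 8.1
print(mia[0], round(mia[1], 1), round(mia[2], 1))  # -19 -1.1 -3.0
```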


Next, to gain more information about each team and the makeup of its roster, I used requests.get() to scrape the roster page for each team in each season (see https://www.pro-football-reference.com/teams/buf/2024_roster.htm, for example). Two details of the scraping logic are worth noting: 'BeautifulSoup' is used in conjunction with 'pandas.read_html' to extract the appropriate table from the page HTML, and the team abbreviations used by PFR are converted to my preferred abbreviations using a dictionary.
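
The scraping steps described above can be sketched roughly as follows. This is not the code in `scrape_pfr_rosters`, whose source is not shown here. The names `roster_url`, `to_preferred`, `fetch_roster`, and `PFR_TO_PREFERRED` are hypothetical, the mapping is only partial, and the table id `"roster"` is an assumption that should be verified against the page HTML. `pandas.read_html` additionally needs an HTML parser backend such as lxml to be installed.

```python
# URL pattern taken from the example roster link above.
ROSTER_URL = "https://www.pro-football-reference.com/teams/{abbr}/{year}_roster.htm"

# Partial, illustrative mapping from PFR team codes to the abbreviations
# used in this notebook; the real dictionary would cover every franchise.
PFR_TO_PREFERRED = {"buf": "BUF", "crd": "ARI", "oti": "TEN", "rav": "BAL", "nwe": "NE"}


def roster_url(pfr_abbr, year):
    """Build the roster page URL for one team and season."""
    return ROSTER_URL.format(abbr=pfr_abbr.lower(), year=year)


def to_preferred(pfr_abbr):
    """Convert a PFR team code, falling back to its upper-case form."""
    return PFR_TO_PREFERRED.get(pfr_abbr.lower(), pfr_abbr.upper())


def fetch_roster(pfr_abbr, year, table_id="roster", timeout=30):
    """Download one roster page and return its table as a DataFrame."""
    # Third-party imports are kept local so the helpers above remain
    # usable without the scraping dependencies installed.
    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(roster_url(pfr_abbr, year), timeout=timeout)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", id=table_id)  # table_id is an assumption
    if table is None:
        raise ValueError(f"no table with id={table_id!r} found")

    roster = pd.read_html(StringIO(str(table)))[0]
    roster["Year"] = year
    roster["Tm"] = to_preferred(pfr_abbr)
    return roster
```

Calling `fetch_roster("buf", 2024)` would request the Bills page linked above. When looping over many teams and seasons, it is sensible to pause between requests and save the result to disk so that the scrape only has to run once.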

In [6]:
#nfl_rosters = scrape_pfr_rosters(1990, 2025)
NFL_rosters = load_rosters("/Users/coledurham/Documents/nfl_playoff_predictor/data/NFL_rosters.csv")
NFL_rosters.head()



Unnamed: 0,Player,Year,Tm,Age,Pos,G,GS,Ht,Wt,College/Univ,BirthDate,Yrs,AV,Drafted (tm/rnd/yr)
0,Jim Bakken,1970,ARI,30.0,K,14,0.0,5-11,200.0,Wisconsin,11/2/1940,8.0,5.0,Los Angeles Rams / 7th / 88th pick / 1962
1,Pete Beathard,1970,ARI,28.0,QB,4,0.0,6-1,200.0,USC,3/7/1942,6.0,1.0,"Kansas City Chiefs / 1st / 2nd pick / 1964, De..."
2,Robert Brown,1970,ARI,27.0,TE,14,1.0,6-2,225.0,Alcorn St.,1/1/1943,1.0,0.0,
3,Terry Brown,1970,ARI,23.0,DB,10,1.0,6-0,205.0,Oklahoma St.,1/9/1947,1.0,1.0,St. Louis Cardinals / 3rd / 73rd pick / 1969
4,Jerry Daanen,1970,ARI,26.0,WR,14,0.0,6-0,190.0,Miami (FL),12/15/1944,2.0,0.0,St. Louis Cardinals / 8th / 205th pick / 1968


In each row we see:
* Player: Name of player
* Year: Calendar year corresponding to the season
* Tm: Team, as previously
* Age: How old the player is in years
* Pos: Player position
* G: Games played
* GS: Games started
* Ht: Height, formatted as feet-inches
* Wt: Weight in pounds
* College/Univ: College(s) attended by the player
* BirthDate: Date on which player was born
* Yrs: Years the player has been in professional football
* AV: Approximate Value, PFR's single-number measure of a player's seasonal value, available for seasons since 1960
* Drafted: Information regarding when the player was selected within the NFL draft
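
Some of these fields are stored as strings and need converting before they can be averaged or modeled. The sketch below shows two such conversions. The helper names `height_to_inches` and `years_experience` are hypothetical, and the second helper assumes that rookies appear as 'Rook' in the raw Yrs field and should count as 0 years of experience.

```python
def height_to_inches(ht):
    """Convert a 'feet-inches' string such as '5-11' to total inches."""
    feet, inches = ht.split("-")
    return 12 * int(feet) + int(inches)


def years_experience(yrs):
    """Return Yrs as a float, treating the label 'Rook' as 0 (assumption)."""
    return 0.0 if yrs == "Rook" else float(yrs)


# '5-11' and '6-1' are the Ht values of the first two rows shown above.
print(height_to_inches("5-11"))  # 71
print(height_to_inches("6-1"))   # 73
print(years_experience("Rook"))  # 0.0
print(years_experience("8.0"))   # 8.0
```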

With regard to making or missing the playoffs next year, it is reasonable to expect the average age of the roster and the number of rookies this year to have an impact. For an overview of team composition among regular contributors, we compute the average age and experience of roster members who played in at least 8 games (approximately half of a 16- or 17-game season), as well as the total number of rookies on the roster.

In [8]:
roster_makeup = basic_roster_info(NFL_rosters, 1990, 2025, 8)
roster_makeup.head()

Unnamed: 0,Year,Tm,Avg Age,Avg Experience,Num Rookies
0,1990,ARI,26.652174,3.717391,10
1,1990,ATL,26.227273,3.386364,8
2,1990,BUF,26.777778,3.511111,10
3,1990,CHI,26.782609,3.826087,12
4,1990,CIN,26.456522,3.565217,8
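
For readers without access to the `src` package, a summary of this kind can be sketched with a pandas groupby. This is not the implementation of `basic_roster_info`. The function `roster_summary` and the toy data are hypothetical, and the sketch assumes that rookies are counted over the whole roster while the averages use only players who meet the games threshold.

```python
import pandas as pd


def roster_summary(rosters, min_games=8):
    """Per (Year, Tm): mean age and experience of players with at least
    `min_games` games, plus the rookie count over the full roster."""
    df = rosters.copy()
    # Treat the 'Rook' label as 0 years of experience (assumption).
    df["Yrs"] = pd.to_numeric(df["Yrs"].astype(str).str.replace("Rook", "0"))

    regulars = df[df["G"] >= min_games]
    averages = regulars.groupby(["Year", "Tm"]).agg(
        **{"Avg Age": ("Age", "mean"), "Avg Experience": ("Yrs", "mean")}
    )
    rookies = (
        df.assign(is_rookie=df["Yrs"] == 0)
        .groupby(["Year", "Tm"])["is_rookie"]
        .sum()
        .rename("Num Rookies")
    )
    return averages.join(rookies).reset_index()


# Made-up players for illustration only, not rows from NFL_rosters.
toy = pd.DataFrame(
    {
        "Year": [1990, 1990, 1990, 1990],
        "Tm": ["BUF", "BUF", "BUF", "BUF"],
        "Age": [22, 30, 26, 23],
        "G": [16, 16, 8, 2],
        "Yrs": ["Rook", 8, 4, "Rook"],
    }
)
summary = roster_summary(toy)
print(summary)
```

In the toy data, three players reach the 8-game threshold, so the averages are taken over ages 22, 30, and 26 and experience 0, 8, and 4, while both rookies are counted regardless of games played.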
