NFL

Predicting overall fantasy points throughout the season based on pre-season performance

Predicting fantasy points at the end of a game based on performance at any given amount of minutes into the game

(Both of these predictions should be done separately for each position.)

First, I will treat these as simple multiple regression problems and see how accurate I can get with that.

Importing useful libraries

In [10]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from io import StringIO

Scrape individual player data

In [11]:
def scrape_player_data(year):
    url = f'https://www.pro-football-reference.com/years/{year}/fantasy.htm'
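    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the table and convert it to a DataFrame
    table = soup.find('table', {'id': 'fantasy'})
    if table is None:
        raise ValueError(f'No fantasy table found for {year}')
    df = pd.read_html(StringIO(str(table)))[0]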

    # Clean the DataFrame (remove multi-level headers, etc.)
    df.columns = df.columns.droplevel(0)  # Drop the first header level
    df = df.rename(columns={'Unnamed: 0_level_1': 'Player'})  # Rename player column
    df = df[df['Player'] != 'Player']  # Remove extra header rows
    return df


Get player data for the 2023 season

In [12]:
player_df_2023 = scrape_player_data(2023)

In [13]:
# Display the first few rows of the DataFrame
print(player_df_2023.head())

# Display the variable names of the DataFrame
print(player_df_2023.columns)

  Rk                 Player   Tm FantPos Age   G  GS  Cmp  Att   Yds  ...  TD  \
0  1  Christian McCaffrey*+  SFO      RB  27  16  16    0    0     0  ...  21   
1  2          CeeDee Lamb*+  DAL      WR  24  17  17    0    0     0  ...  14   
2  3             Josh Allen  BUF      QB  27  17  17  385  579  4306  ...  15   
3  4          Tyreek Hill*+  MIA      WR  29  16  16    0    0     0  ...  13   
4  5           Jalen Hurts*  PHI      QB  25  17  17  352  538  3858  ...  15   

   2PM  2PP FantPt    PPR   DKPt   FDPt  VBD PosRank OvRank  
0  NaN  NaN    324  391.3  399.3  357.8  157       1      1  
1    1  NaN    268  403.2  411.2  335.7  131       1      2  
2  NaN    3    393  392.6  420.6  410.6  122       1      3  
3  NaN  NaN    257  376.4  380.4  316.9  120       2      4  
4  NaN  NaN    357  356.8  382.8  371.8   89       2      5  

[5 rows x 33 columns]
Index(['Rk', 'Player', 'Tm', 'FantPos', 'Age', 'G', 'GS', 'Cmp', 'Att', 'Yds',
       'TD', 'Int', 'Att', 'Yds', 'Y/A'

Preprocess the data to ensure it's suitable for regression analysis

In [14]:
# Clean the DataFrame (remove unnecessary columns, handle missing values, etc.)
# player_df = player_df[['Player', 'Tm', 'FantPos', 'G', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'Att', 'Yds', 'TD', 'Tgt', 'Rec', 'Yds', 'TD', 'FantPt']]

# Rename columns for clarity.
# 'Att', 'Yds' and 'TD' are left out on purpose: each name appears several times in the
# table (passing, rushing, receiving), and a dict can hold only one entry per key, so a
# name-based rename cannot tell those columns apart.
new_names = {'Tm': 'Team', 'FantPos': 'Position', 'G': 'Games',
             'Cmp': 'PassingCompletions', 'Int': 'PassingInt',
             'Tgt': 'ReceivingTargets', 'Rec': 'Receptions',
             'FantPt': 'FantasyPoints'}
# player_df = player_df.rename(columns = new_names)
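
The repeated stat names are the awkward part of this rename step. One way around it is to build unique names from both header levels before the top level is dropped. The sketch below is pure Python and only illustrative: `flatten_header` is a hypothetical helper, and the sample header pairs (including the group names `Passing`, `Rushing`, `Receiving`) are assumptions about how the table's two-level header looks, not output copied from the scrape.

```python
def flatten_header(columns):
    """Turn (group, stat) header pairs into unique single-level names."""
    flat = []
    for group, stat in columns:
        # Blank top-level header cells show up as 'Unnamed: ...'; keep only the stat.
        if not group or group.startswith("Unnamed"):
            flat.append(stat)
        else:
            flat.append(f"{group}_{stat}")
    return flat


# Illustrative header pairs (assumed, not taken from the real table).
header = [
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("Passing", "Yds"),
    ("Rushing", "Yds"),
    ("Receiving", "Yds"),
]
print(flatten_header(header))

# Why a name-based mapping cannot do this: a dict literal keeps only the
# last value given for a repeated key.
names = {'Yds': 'PassingYards', 'Yds': 'RushingYards', 'Yds': 'ReceivingYards'}
print(names)
```

Since iterating over a two-level pandas column index yields tuples, something like `df.columns = flatten_header(df.columns)` could replace the `droplevel(0)` call, assuming the header really has two levels.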

In [15]:
# Give each player a unique identifier (adds a 'PlayerID' column to df in place)
def create_player_id(df):
    # Remove the * and + markers from player names (literal match, not regex)
    df['PlayerID'] = df['Player'].str.replace('*', '', regex=False)
    df['PlayerID'] = df['PlayerID'].str.replace('+', '', regex=False)
    df['PlayerID'] = df['PlayerID'].str.split('\\').str[0]  # Drop anything after a backslash
    df['PlayerID'] = df['PlayerID'].str.lower()  # Convert to lowercase
    df['PlayerID'] = df['PlayerID'] + df.groupby('PlayerID').cumcount().astype(str)  # Append a counter so duplicate names stay unique
    return df
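
To make the ID scheme concrete, here is a pure-Python sketch of the same steps applied to a plain list of names. `make_player_ids` is a hypothetical stand-in for the pandas version, and the duplicated "Josh Allen" entry is made up to show the counter.

```python
from collections import defaultdict


def make_player_ids(names):
    """Strip markers, lowercase, and append a per-name counter."""
    seen = defaultdict(int)  # how many times each cleaned name has appeared so far
    ids = []
    for name in names:
        # Same cleaning steps as create_player_id: drop * and +, cut at a backslash, lowercase.
        base = name.replace("*", "").replace("+", "").split("\\")[0].lower()
        ids.append(f"{base}{seen[base]}")
        seen[base] += 1
    return ids


print(make_player_ids(["Christian McCaffrey*+", "Josh Allen", "Josh Allen"]))
# ['christian mccaffrey0', 'josh allen0', 'josh allen1']
```

The counter depends on row order, so two players who share a name may swap suffixes between seasons. IDs built this way also will not match a table that keeps the raw player name as its identifier.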

In [16]:
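# Remove the * and + markers from the 'Player' column (literal match, not regex)
player_df_2023['Player'] = player_df_2023['Player'].str.replace('*', '', regex=False)
player_df_2023['Player'] = player_df_2023['Player'].str.replace('+', '', regex=False)

# Convert each column to numeric where possible; leave text columns unchanged.
# (errors='ignore' is deprecated in recent pandas, so catch the failure explicitly.)
def to_numeric_if_possible(column):
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

player_df_2023 = player_df_2023.apply(to_numeric_if_possible)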

# Display the cleaned DataFrame
print(player_df_2023.head())

   Rk               Player   Tm FantPos  Age   G  GS  Cmp  Att   Yds  ...  TD  \
0   1  Christian McCaffrey  SFO      RB   27  16  16    0    0     0  ...  21   
1   2          CeeDee Lamb  DAL      WR   24  17  17    0    0     0  ...  14   
2   3           Josh Allen  BUF      QB   27  17  17  385  579  4306  ...  15   
3   4          Tyreek Hill  MIA      WR   29  16  16    0    0     0  ...  13   
4   5          Jalen Hurts  PHI      QB   25  17  17  352  538  3858  ...  15   

   2PM  2PP  FantPt    PPR   DKPt   FDPt    VBD  PosRank  OvRank  
0  NaN  NaN   324.0  391.3  399.3  357.8  157.0        1     1.0  
1  1.0  NaN   268.0  403.2  411.2  335.7  131.0        1     2.0  
2  NaN  3.0   393.0  392.6  420.6  410.6  122.0        1     3.0  
3  NaN  NaN   257.0  376.4  380.4  316.9  120.0        2     4.0  
4  NaN  NaN   357.0  356.8  382.8  371.8   89.0        2     5.0  

[5 rows x 33 columns]




In [17]:
# Store the current DataFrame as a CSV file
player_df_2023.to_csv('player_data_2023.csv', index=False)

I tried many times to find a good way to scrape large amounts of preseason data on individual player performance.

Eventually I did find some good sources. By then, however, I had realized that even with good data, a model predicting fantasy points throughout the season from preseason performance just wouldn't be great. Preseason games are very different from regular-season games, and many players who play in the preseason get far less playing time in the actual season. Those players would drag the fit down, so the predictions would likely underestimate the players who play a lot in both.

So instead, I am going to shift my focus to creating visualizations of a player's fantasy points given their performance at a given time in the game, together with their average fantasy points for this season and previous seasons.

My idea is to do this very simply: take the player's current fantasy points x minutes into the game and multiply that by (total minutes in the game)/x. Then adjust that expectation slightly using the player's average fantasy points for this or last season, which pulls the prediction up or down. This effectively creates a very simple time series forecast for the player's fantasy points.
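
A minimal sketch of that idea follows. The text does not say how strongly the season average should pull on the prediction, so the weighting here is an assumption: the pace-based estimate gets weight x/(total minutes) and the season average gets the rest. The function name and the 60-minute default (regulation time, ignoring overtime) are also assumptions.

```python
def project_final_points(current_points, minutes_elapsed, season_avg, total_minutes=60.0):
    """Blend a pace-based extrapolation with the player's season average."""
    if not 0 < minutes_elapsed <= total_minutes:
        raise ValueError("minutes_elapsed must be in (0, total_minutes]")
    # Naive pace projection: scale the current points up to a full game.
    pace_projection = current_points * total_minutes / minutes_elapsed
    # Assumed weighting: trust the pace more as more of the game has been played.
    weight = minutes_elapsed / total_minutes
    return weight * pace_projection + (1 - weight) * season_avg


# 12 points at halftime for a player averaging 20.25 points per game
print(project_final_points(12.0, 30.0, 20.25))  # 22.125
```

With this particular weighting the formula reduces to the points scored so far plus the season average scaled by the fraction of the game remaining, so early in the game the forecast stays close to the average and at the final whistle it equals the actual score.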

Time series forecasting for fantasy football points

In [19]:
# Read the Player, FantPt, G, and FantPos columns from the CSV file
player_df_2023 = pd.read_csv('player_data_2023.csv', usecols=['Player', 'FantPt', 'G', 'FantPos'])

# Rename the columns to match the expected column names
player_df_2023 = player_df_2023.rename(columns={'Player': 'PlayerID', 'FantPt': 'FantasyPoints', 'G': 'Games', 'FantPos': 'Position'})

# Create a new column for average fantasy points per game
player_df_2023['AvgFPPG'] = player_df_2023['FantasyPoints'] / player_df_2023['Games']

# Sort the DataFrame by average fantasy points per game in descending order
player_df_2023 = player_df_2023.sort_values(by='AvgFPPG', ascending=False)

# Display the first few rows of the DataFrame
print(player_df_2023.head())

# Store the updated DataFrame as a CSV file
player_df_2023.to_csv('AvgFPPG_2023.csv', index=False)

                PlayerID Position  Games  FantasyPoints    AvgFPPG
2             Josh Allen       QB     17          393.0  23.117647
4            Jalen Hurts       QB     17          357.0  21.000000
9          Lamar Jackson       QB     16          331.0  20.687500
0    Christian McCaffrey       RB     16          324.0  20.250000
126           Joe Flacco       QB      5          101.0  20.200000
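
Averages over very few games can rank near the top of this list (Joe Flacco's 20.2 comes from only 5 games). One possible refinement, not part of the workflow above, is to require a minimum number of games before ranking. The sketch below is pure Python and uses the five rows printed above; `rank_by_avg` and the threshold of 8 games are arbitrary assumptions.

```python
def rank_by_avg(players, min_games=8):
    """Rank (name, games, points) tuples by points per game, skipping small samples."""
    eligible = [p for p in players if p[1] >= min_games]
    return sorted(eligible, key=lambda p: p[2] / p[1], reverse=True)


# (name, games, fantasy points) taken from the output above
rows = [
    ("Josh Allen", 17, 393.0),
    ("Jalen Hurts", 17, 357.0),
    ("Lamar Jackson", 16, 331.0),
    ("Christian McCaffrey", 16, 324.0),
    ("Joe Flacco", 5, 101.0),
]

for name, games, points in rank_by_avg(rows):
    print(f"{name:<22}{games:>3}{points / games:>8.2f}")
```

On the DataFrame itself, the same filter would be a boolean mask on the `Games` column applied before sorting.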


Now repeating the same steps for 2022

In [20]:
# Scraping data for 2022
player_df_2022 = scrape_player_data(2022)

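# Keep only the columns of interest. 'Att', 'Yds' and 'TD' are repeated column names,
# so listing each once already selects every column with that name.
# .copy() avoids modifying a view of the scraped DataFrame in the steps below.
player_df_2022 = player_df_2022[['Player', 'Tm', 'FantPos', 'G', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'Tgt', 'Rec', 'FantPt']].copy()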

# Rename columns for clarity
player_df_2022 = player_df_2022.rename(columns = new_names)

# Apply the function to the player DataFrame
player_df_2022 = create_player_id(player_df_2022)

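# Convert columns to numeric where possible; leave text columns unchanged.
# (errors='ignore' is deprecated in recent pandas, so catch the failure explicitly.)
def to_numeric_if_possible(column):
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

player_df_2022 = player_df_2022.apply(to_numeric_if_possible)
player_df_2022['FantasyPoints'] = player_df_2022['FantasyPoints'].astype(float)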

# Handle missing values (e.g., fill with 0 or use appropriate imputation method)
player_df_2022 = player_df_2022.fillna(0)

# Store the current DataFrame as a CSV file
player_df_2022.to_csv('player_data_2022.csv', index=False)



In [21]:
# Read the PlayerID, FantasyPoints, Games, and Position columns from the CSV file
player_df_2022 = pd.read_csv('player_data_2022.csv', usecols=['PlayerID', 'FantasyPoints', 'Games', 'Position'])

# Create a new column for average fantasy points per game
player_df_2022['AvgFPPG'] = player_df_2022['FantasyPoints'] / player_df_2022['Games']

# Sort the DataFrame by average fantasy points per game in descending order
player_df_2022 = player_df_2022.sort_values(by='AvgFPPG', ascending=False)

# Display the first few rows of the DataFrame
print(player_df_2022.head())

# Store the updated DataFrame as a CSV file
player_df_2022.to_csv('AvgFPPG_2022.csv', index=False)

    Position  Games  FantasyPoints          PlayerID    AvgFPPG
11        QB     15          378.0      jalen hurts0  25.200000
6         QB     16          396.0       josh allen0  24.750000
0         QB     17          416.0  patrick mahomes0  24.470588
16        QB     16          351.0       joe burrow0  21.937500
334       QB      1           21.0       davis webb0  21.000000


In [22]:
import pandas as pd

# Read the CSV file into a new DataFrame
player_df_2022 = pd.read_csv('AvgFPPG_2022.csv')

# Create a DataFrame for each position
qb_df = player_df_2022[player_df_2022['Position'] == 'QB']
rb_df = player_df_2022[player_df_2022['Position'] == 'RB']
wr_df = player_df_2022[player_df_2022['Position'] == 'WR']
te_df = player_df_2022[player_df_2022['Position'] == 'TE']

# Display the first 50 rows of each position DataFrame
print(qb_df.head(50))
print(rb_df.head(50))
print(wr_df.head(50))
print(te_df.head(50))

    Position  Games  FantasyPoints            PlayerID    AvgFPPG
0         QB     15          378.0        jalen hurts0  25.200000
1         QB     16          396.0         josh allen0  24.750000
2         QB     17          416.0    patrick mahomes0  24.470588
3         QB     16          351.0         joe burrow0  21.937500
4         QB      1           21.0         davis webb0  21.000000
5         QB     15          296.0      justin fields0  19.733333
6         QB     12          236.0      lamar jackson0  19.666667
7         QB     11          201.0       kyler murray0  18.272727
8         QB     16          289.0       daniel jones0  18.062500
9         QB      1           18.0         sam howell0  18.000000
10        QB     17          304.0         geno smith0  17.882353
11        QB     13          231.0     tua tagovailoa0  17.769231
12        QB     17          296.0    trevor lawrence0  17.411765
13        QB     17          292.0       kirk cousins0  17.176471
15        

In [23]:
# Read the CSV file into a new DataFrame
player_df_2023 = pd.read_csv('AvgFPPG_2023.csv')

# Create a DataFrame for each position
qb_df = player_df_2023[player_df_2023['Position'] == 'QB']
rb_df = player_df_2023[player_df_2023['Position'] == 'RB']
wr_df = player_df_2023[player_df_2023['Position'] == 'WR']
te_df = player_df_2023[player_df_2023['Position'] == 'TE']

# Display the first 50 rows of each position DataFrame
print(qb_df.head(50))
print(rb_df.head(50))
print(wr_df.head(50))
print(te_df.head(50))

               PlayerID Position  Games  FantasyPoints    AvgFPPG
0            Josh Allen       QB     17          393.0  23.117647
1           Jalen Hurts       QB     17          357.0  21.000000
2         Lamar Jackson       QB     16          331.0  20.687500
4            Joe Flacco       QB      5          101.0  20.200000
5          Dak Prescott       QB     17          343.0  20.176471
6           Jordan Love       QB     17          319.0  18.764706
7          Kirk Cousins       QB      8          150.0  18.750000
9           Brock Purdy       QB     16          296.0  18.500000
10          C.J. Stroud       QB     15          275.0  18.333333
11   Anthony Richardson       QB      4           73.0  18.250000
12         Kyler Murray       QB      8          146.0  18.250000
13       Justin Herbert       QB     13          233.0  17.923077
14        Justin Fields       QB     13          230.0  17.692308
15      Patrick Mahomes       QB     16          280.0  17.500000
16       R

Based on this data, I will try to plan the best possible draft for my league.

The best order for drafting is:<br>
Round 1: RB<br>RB<br>
Round 2: WR<br>WR<br>
Round 3: RB or WR<br>WR<br>
Round 4: RB or WR<br>RB<br>
Round 5: RB or WR (or really good TE)<br>QB<br>
Round 6: RB or WR (or really good TE)<br>WR(FLEX)<br>
Round 7: RB, WR, or TE<br>TE<br>
Round 8: QB, RB, or WR<br>WR<br>
Round 9: QB, RB, or WR (or really good TE)<br>RB<br>
Round 10: QB, RB, WR or TE<br>
Round 11: QB, RB, WR or TE<br>
Round 12: QB, RB, WR or TE<br>
Round 13: K<br>
Round 14: K or D/ST<br>
Round 15: K or D/ST<br>

