# Data Collection
The goal of this notebook is to collect the data needed to run a basic NFL win-prediction model. The data is pulled with the Sportsipy API, which collects data from sports-reference.com.

Relevant Sportsipy links:
- https://sportsreference.readthedocs.io/en/stable/
- https://github.com/roclark/sportsipy/tree/ea0043747015209550abeee15df75914a58fe40b

Note: Some cells have been converted to Raw NBConvert because they were only needed once to extract the data. There was no need to run them again afterwards, but I kept them rather than deleting them to preserve a trace of my work.
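
The run-once approach described in the note can also be kept in a live cell by caching the pulled data to disk. The following is a minimal sketch under assumptions: `load_or_fetch` and `fake_fetch` are hypothetical names used for illustration, not part of Sportsipy, and `fake_fetch` stands in for whatever API calls originally produced the data.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(csv_path, fetch_fn):
    """Return the cached CSV if it exists; otherwise call fetch_fn once and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = fetch_fn()                    # hypothetical: the slow API pull would go here
    df.to_csv(csv_path, index=False)
    return df


# Stand-in for the real API pull so the sketch runs offline.
calls = []

def fake_fetch():
    calls.append(1)
    return pd.DataFrame({'team_abbr': ['buf', 'ram'], 'points_scored': [31, 10]})


cache_path = os.path.join(tempfile.mkdtemp(), 'demo_game_data.csv')
first = load_or_fetch(cache_path, fake_fetch)    # fetches and writes the cache
second = load_or_fetch(cache_path, fake_fetch)   # reads the cache, no second fetch
```

With this pattern the extraction cell can stay as a normal code cell, since re-running it only reads the cached file.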

In [22]:
#Import necessary libraries
from sportsipy.nfl.boxscore import Boxscores
from sportsipy.nfl.schedule import Schedule, Game
from sportsipy.nfl.teams import Teams, Team

import pandas as pd
import numpy as np


In [23]:
#Print the Boxscores from week 1 of the 2022 season
boxscores_2022 = Boxscores(1,2022).games
boxscores_2022

{'1-2022': [{'boxscore': '202209080ram',
   'away_name': 'Buffalo Bills',
   'away_abbr': 'buf',
   'away_score': 31,
   'home_name': 'Los Angeles Rams',
   'home_abbr': 'ram',
   'home_score': 10,
   'winning_name': 'Buffalo Bills',
   'winning_abbr': 'buf',
   'losing_name': 'Los Angeles Rams',
   'losing_abbr': 'ram'},
  {'boxscore': '202209110atl',
   'away_name': 'New Orleans Saints',
   'away_abbr': 'nor',
   'away_score': 27,
   'home_name': 'Atlanta Falcons',
   'home_abbr': 'atl',
   'home_score': 26,
   'winning_name': 'New Orleans Saints',
   'winning_abbr': 'nor',
   'losing_name': 'Atlanta Falcons',
   'losing_abbr': 'atl'},
  {'boxscore': '202209110car',
   'away_name': 'Cleveland Browns',
   'away_abbr': 'cle',
   'away_score': 26,
   'home_name': 'Carolina Panthers',
   'home_abbr': 'car',
   'home_score': 24,
   'winning_name': 'Cleveland Browns',
   'winning_abbr': 'cle',
   'losing_name': 'Carolina Panthers',
   'losing_abbr': 'car'},
  {'boxscore': '202209110chi',
 

In [24]:
#Importing the data
combined_schedule = pd.read_csv('game_data.csv')


In [25]:
combined_schedule.columns

Index(['boxscore_index', 'date', 'datetime', 'day', 'extra_points_attempted',
       'extra_points_made', 'field_goals_attempted', 'field_goals_made',
       'fourth_down_attempts', 'fourth_down_conversions', 'interceptions',
       'location', 'opponent_abbr', 'opponent_name', 'overtime',
       'pass_attempts', 'pass_completion_rate', 'pass_completions',
       'pass_touchdowns', 'pass_yards', 'pass_yards_per_attempt',
       'points_allowed', 'points_scored', 'punt_yards', 'punts',
       'quarterback_rating', 'result', 'rush_attempts', 'rush_touchdowns',
       'rush_yards', 'rush_yards_per_attempt', 'third_down_attempts',
       'third_down_conversions', 'time_of_possession', 'times_sacked', 'type',
       'week', 'yards_lost_from_sacks', 'team_abbr'],
      dtype='object')

In [26]:
combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   boxscore_index           568 non-null    object 
 1   date                     568 non-null    object 
 2   datetime                 568 non-null    object 
 3   day                      568 non-null    object 
 4   extra_points_attempted   568 non-null    int64  
 5   extra_points_made        568 non-null    int64  
 6   field_goals_attempted    568 non-null    int64  
 7   field_goals_made         568 non-null    int64  
 8   fourth_down_attempts     568 non-null    int64  
 9   fourth_down_conversions  568 non-null    int64  
 10  interceptions            568 non-null    int64  
 11  location                 568 non-null    object 
 12  opponent_abbr            568 non-null    object 
 13  opponent_name            568 non-null    object 
 14  overtime                 5

In [27]:
#Dropping irrelevant columns
combined_schedule = combined_schedule.drop(['date','overtime','type','day','opponent_name','boxscore_index', 'location', 'opponent_abbr', 'datetime'], axis =1)

In [28]:
combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   extra_points_attempted   568 non-null    int64  
 1   extra_points_made        568 non-null    int64  
 2   field_goals_attempted    568 non-null    int64  
 3   field_goals_made         568 non-null    int64  
 4   fourth_down_attempts     568 non-null    int64  
 5   fourth_down_conversions  568 non-null    int64  
 6   interceptions            568 non-null    int64  
 7   pass_attempts            568 non-null    int64  
 8   pass_completion_rate     568 non-null    float64
 9   pass_completions         568 non-null    int64  
 10  pass_touchdowns          568 non-null    int64  
 11  pass_yards               568 non-null    int64  
 12  pass_yards_per_attempt   568 non-null    float64
 13  points_allowed           568 non-null    int64  
 14  points_scored            5

In [29]:
combined_schedule['time_of_possession']

0      28:46
1      30:24
2      26:04
3      34:34
4      30:30
       ...  
563    29:47
564    32:58
565    23:25
566    33:06
567    31:03
Name: time_of_possession, Length: 568, dtype: object

In [30]:
#Converting time_of_possession to an integer (seconds) 

from datetime import datetime
def convert_to_seconds(time_str):
    time_obj = datetime.strptime(time_str, '%M:%S')
    total_seconds = time_obj.minute*60 + time_obj.second
    return total_seconds

combined_schedule['time_of_possession'] = combined_schedule['time_of_possession'].apply(convert_to_seconds)

print(combined_schedule['time_of_possession'])

0      1726
1      1824
2      1564
3      2074
4      1830
       ... 
563    1787
564    1978
565    1405
566    1986
567    1863
Name: time_of_possession, Length: 568, dtype: int64
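
The same `MM:SS` to seconds conversion can be done without a row-by-row `apply`. This is a sketch of one alternative, using three of the values printed above; the variable names `top`, `parts` and `seconds` are illustrative only.

```python
import pandas as pd

top = pd.Series(['28:46', '30:24', '26:04'])

# Split 'MM:SS' into two integer columns, then combine them into seconds.
parts = top.str.split(':', expand=True).astype(int)
seconds = parts[0] * 60 + parts[1]
```

One difference worth noting: the `'%M'` directive in `strptime` only accepts minutes from 0 to 59, whereas the split-based version has no such limit.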


In [31]:
combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   extra_points_attempted   568 non-null    int64  
 1   extra_points_made        568 non-null    int64  
 2   field_goals_attempted    568 non-null    int64  
 3   field_goals_made         568 non-null    int64  
 4   fourth_down_attempts     568 non-null    int64  
 5   fourth_down_conversions  568 non-null    int64  
 6   interceptions            568 non-null    int64  
 7   pass_attempts            568 non-null    int64  
 8   pass_completion_rate     568 non-null    float64
 9   pass_completions         568 non-null    int64  
 10  pass_touchdowns          568 non-null    int64  
 11  pass_yards               568 non-null    int64  
 12  pass_yards_per_attempt   568 non-null    float64
 13  points_allowed           568 non-null    int64  
 14  points_scored            5

In [32]:
# Feature reduction by combining features into ratios
combined_schedule['extra_points_ratio'] = combined_schedule['extra_points_made'] / combined_schedule['extra_points_attempted']

combined_schedule['field_goal_ratio'] = combined_schedule['field_goals_made'] / combined_schedule['field_goals_attempted']

combined_schedule['fourth_down_ratio'] = combined_schedule['fourth_down_conversions'] / combined_schedule['fourth_down_attempts']

combined_schedule = combined_schedule.drop(['extra_points_attempted', 'extra_points_made','field_goals_made', 'field_goals_attempted', 'fourth_down_conversions', 'fourth_down_attempts'], axis=1)

combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   interceptions           568 non-null    int64  
 1   pass_attempts           568 non-null    int64  
 2   pass_completion_rate    568 non-null    float64
 3   pass_completions        568 non-null    int64  
 4   pass_touchdowns         568 non-null    int64  
 5   pass_yards              568 non-null    int64  
 6   pass_yards_per_attempt  568 non-null    float64
 7   points_allowed          568 non-null    int64  
 8   points_scored           568 non-null    int64  
 9   punt_yards              568 non-null    int64  
 10  punts                   568 non-null    int64  
 11  quarterback_rating      568 non-null    float64
 12  result                  568 non-null    object 
 13  rush_attempts           568 non-null    int64  
 14  rush_touchdowns         568 non-null    in
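
The ratio columns can contain missing values, because a game with zero attempts divides 0 by 0. The sketch below uses made-up numbers (not taken from `game_data.csv`) to show what pandas does in that case and how a later average treats it.

```python
import pandas as pd

# Made-up numbers for illustration only.
kicks = pd.DataFrame({
    'field_goals_made':      [2, 0, 1],
    'field_goals_attempted': [2, 0, 2],
})

# 0 / 0 yields NaN rather than raising, so the game with no attempts has no ratio.
kicks['field_goal_ratio'] = kicks['field_goals_made'] / kicks['field_goals_attempted']

# mean() skips NaN by default, so only the two games with attempts are averaged.
avg_ratio = kicks['field_goal_ratio'].mean()
```

Here `avg_ratio` is the mean of 1.0 and 0.5; the game without attempts does not pull the average toward zero.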

In [33]:
#Dropping more columns that are already summarized by other features and are very likely to be highly correlated with them

combined_schedule = combined_schedule.drop(['pass_attempts','pass_completions','pass_yards','rush_attempts', 'rush_yards'], axis=1)
combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   interceptions           568 non-null    int64  
 1   pass_completion_rate    568 non-null    float64
 2   pass_touchdowns         568 non-null    int64  
 3   pass_yards_per_attempt  568 non-null    float64
 4   points_allowed          568 non-null    int64  
 5   points_scored           568 non-null    int64  
 6   punt_yards              568 non-null    int64  
 7   punts                   568 non-null    int64  
 8   quarterback_rating      568 non-null    float64
 9   result                  568 non-null    object 
 10  rush_touchdowns         568 non-null    int64  
 11  rush_yards_per_attempt  568 non-null    float64
 12  third_down_attempts     568 non-null    int64  
 13  third_down_conversions  568 non-null    int64  
 14  time_of_possession      568 non-null    in

In [34]:
#Creating a third down ratio feature

combined_schedule['third_down_ratio'] = combined_schedule['third_down_conversions'] / combined_schedule['third_down_attempts']

combined_schedule = combined_schedule.drop(['third_down_conversions', 'third_down_attempts'], axis=1)

combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   interceptions           568 non-null    int64  
 1   pass_completion_rate    568 non-null    float64
 2   pass_touchdowns         568 non-null    int64  
 3   pass_yards_per_attempt  568 non-null    float64
 4   points_allowed          568 non-null    int64  
 5   points_scored           568 non-null    int64  
 6   punt_yards              568 non-null    int64  
 7   punts                   568 non-null    int64  
 8   quarterback_rating      568 non-null    float64
 9   result                  568 non-null    object 
 10  rush_touchdowns         568 non-null    int64  
 11  rush_yards_per_attempt  568 non-null    float64
 12  time_of_possession      568 non-null    int64  
 13  times_sacked            568 non-null    int64  
 14  week                    568 non-null    in

In [35]:
combined_schedule[['punts','punt_yards']]
combined_schedule = combined_schedule.drop(['punt_yards'], axis=1)


Once again, these two features are highly correlated. I decided to drop punt yards because it is less representative of a team's success; the number of punts alone is a much stronger variable in terms of its impact on a game.
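
The correlation claim can be checked directly before dropping a column. The sketch below uses made-up numbers purely to show the call; it says nothing about the real values in `game_data.csv`.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from game_data.csv.
toy = pd.DataFrame({
    'punts':      [4, 0, 3, 6, 5],
    'punt_yards': [180, 0, 140, 270, 230],
})

# Pearson correlation between the two columns (1.0 would be perfectly linear).
r = toy['punts'].corr(toy['punt_yards'])
```

On the real data, running `combined_schedule[['punts', 'punt_yards']].corr()` before the drop would give the same kind of check.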


In [36]:
#Mapping wins to 1 and losses to 0

combined_schedule['result'] = combined_schedule['result'].map({'Loss':0,'Win':1})
print(combined_schedule['result'])

0      0.0
1      1.0
2      1.0
3      0.0
4      0.0
      ... 
563    0.0
564    1.0
565    0.0
566    0.0
567    1.0
Name: result, Length: 568, dtype: float64


In [37]:
combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   interceptions           568 non-null    int64  
 1   pass_completion_rate    568 non-null    float64
 2   pass_touchdowns         568 non-null    int64  
 3   pass_yards_per_attempt  568 non-null    float64
 4   points_allowed          568 non-null    int64  
 5   points_scored           568 non-null    int64  
 6   punts                   568 non-null    int64  
 7   quarterback_rating      568 non-null    float64
 8   result                  564 non-null    float64
 9   rush_touchdowns         568 non-null    int64  
 10  rush_yards_per_attempt  568 non-null    float64
 11  time_of_possession      568 non-null    int64  
 12  times_sacked            568 non-null    int64  
 13  week                    568 non-null    int64  
 14  yards_lost_from_sacks   568 non-null    in

In [38]:
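# Attributing 0.5 to games that ended in a tie
# (assigning the result back avoids fillna(inplace=True) on a column selection, which newer pandas versions warn about)
combined_schedule['result'] = combined_schedule['result'].fillna(0.5)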
combined_schedule.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   interceptions           568 non-null    int64  
 1   pass_completion_rate    568 non-null    float64
 2   pass_touchdowns         568 non-null    int64  
 3   pass_yards_per_attempt  568 non-null    float64
 4   points_allowed          568 non-null    int64  
 5   points_scored           568 non-null    int64  
 6   punts                   568 non-null    int64  
 7   quarterback_rating      568 non-null    float64
 8   result                  568 non-null    float64
 9   rush_touchdowns         568 non-null    int64  
 10  rush_yards_per_attempt  568 non-null    float64
 11  time_of_possession      568 non-null    int64  
 12  times_sacked            568 non-null    int64  
 13  week                    568 non-null    int64  
 14  yards_lost_from_sacks   568 non-null    in
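
The two encoding steps above (map wins and losses, then fill what is left) can be read as one chained expression. A small sketch, assuming a placeholder label `'Tie'`; the actual label used for ties in `game_data.csv` is not shown in this notebook.

```python
import pandas as pd

# 'Tie' is a placeholder; any label missing from the dict behaves the same way.
results = pd.Series(['Loss', 'Win', 'Tie', 'Win'])

# Labels missing from the dict map to NaN, and fillna then assigns them 0.5.
encoded = results.map({'Loss': 0, 'Win': 1}).fillna(0.5)
```

This is also why `result` became `float64` with 564 non-null values after the mapping step: the unmapped rows were NaN until they were filled.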

In [39]:
#Creating a function that will aggregate team data up to any given week of the season

def agg_weekly_data(df, week):
    
    filtered_df = df[df['week'] < week]
    agg_df = filtered_df.groupby('team_abbr').agg({
        'interceptions': 'mean',
        'pass_completion_rate': 'mean',
        'pass_touchdowns': 'mean',
        'pass_yards_per_attempt': 'mean',
        'points_allowed': 'mean',
        'points_scored': 'mean',
        'punts': 'mean',
        'quarterback_rating': 'mean',
        'result': 'mean', #essentially creating a win ratio
        'rush_touchdowns': 'mean',
        'rush_yards_per_attempt': 'mean',
        'time_of_possession': 'mean',
        'times_sacked': 'mean',
        'yards_lost_from_sacks': 'mean',
        'extra_points_ratio': 'mean', #pandas skips NaN values when computing the mean
        'field_goal_ratio': 'mean',
        'fourth_down_ratio': 'mean',
        'third_down_ratio': 'mean'
    }).reset_index()
    
    #Renaming columns
    agg_df = agg_df.rename(columns={
        'interceptions': 'avg_interceptions_thrown',
        'pass_completion_rate':'avg_pass_completion_rate',
        'pass_touchdowns': 'avg_pass_touchdowns',
        'pass_yards_per_attempt': 'avg_pass_yards_per_attempt',
        'points_allowed': 'avg_points_allowed',
        'points_scored': 'avg_points_scored',
        'punts': 'avg_punts',
        'quarterback_rating': 'avg_quarterback_rating',
        'result': 'win_ratio', 
        'rush_touchdowns': 'avg_rush_touchdowns',
        'rush_yards_per_attempt': 'avg_rush_yards_per_attempt',
        'time_of_possession': 'avg_time_of_possession',
        'times_sacked': 'avg_times_sacked',
        'yards_lost_from_sacks': 'avg_yards_lost_from_sacks',
        'extra_points_ratio': 'avg_extra_points_ratio', 
        'field_goal_ratio': 'avg_field_goal_ratio',
        'fourth_down_ratio': 'avg_fourth_down_ratio',
        'third_down_ratio': 'avg_third_down_ratio'
     })
    #Return the current week that is being aggregated to
    agg_df['week'] = week

    
    return agg_df
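
The strict `<` in the week filter means the features for week N are built only from weeks before N, so the outcome of the game being predicted does not leak into its own features. The sketch below shows that behaviour on made-up games with a cut-down, hypothetical helper `prior_week_means`; it is not a replacement for `agg_weekly_data`.

```python
import pandas as pd

# Made-up games for one team; values are illustrative only.
games = pd.DataFrame({
    'team_abbr':     ['buf', 'buf', 'buf'],
    'week':          [1, 2, 3],
    'points_scored': [31, 41, 19],
    'result':        [1.0, 1.0, 0.0],
})


def prior_week_means(df, week):
    """Average each team's stats over games played strictly before `week`."""
    prior = df[df['week'] < week]
    out = prior.groupby('team_abbr')[['points_scored', 'result']].mean().reset_index()
    out['week'] = week
    return out


# Uses weeks 1 and 2 only; the week 3 game (19 points, a loss) is excluded.
entering_week_3 = prior_week_means(games, 3)
```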

In [40]:
combined_schedule.head(10)

Unnamed: 0,interceptions,pass_completion_rate,pass_touchdowns,pass_yards_per_attempt,points_allowed,points_scored,punts,quarterback_rating,result,rush_touchdowns,rush_yards_per_attempt,time_of_possession,times_sacked,week,yards_lost_from_sacks,team_abbr,extra_points_ratio,field_goal_ratio,fourth_down_ratio,third_down_ratio
0,3,70.7,1,5.9,31,10,4,58.1,0.0,0,2.9,1726,7,1,49,ram,1.0,1.0,0.666667,0.461538
1,2,75.0,3,7.6,27,31,0,100.7,1.0,1,2.5,1824,1,2,0,ram,1.0,1.0,0.0,0.6
2,0,72.0,0,10.0,12,20,3,101.9,1.0,2,5.0,1564,1,3,10,ram,1.0,1.0,,0.375
3,1,66.7,0,5.3,24,9,4,66.3,0.0,0,3.2,2074,7,4,54,ram,,1.0,1.0,0.333333
4,1,65.9,1,7.3,22,10,6,82.1,0.0,0,2.5,1830,5,5,35,ram,1.0,0.5,1.0,0.352941
5,1,78.8,1,7.7,10,24,5,95.6,1.0,2,3.8,2225,1,6,4,ram,1.0,1.0,,0.5
6,0,66.7,1,5.7,31,14,5,88.8,0.0,1,2.7,1804,2,8,20,ram,1.0,,,0.615385
7,0,48.1,1,6.1,16,13,9,75.8,0.0,0,2.8,1724,4,9,27,ram,1.0,1.0,,0.266667
8,1,67.6,1,5.6,27,17,5,77.5,0.0,1,3.3,1679,3,10,19,ram,1.0,1.0,1.0,0.272727
9,0,57.1,2,8.0,27,20,6,101.5,0.0,0,4.9,1733,4,11,35,ram,1.0,1.0,0.5,0.357143


In [41]:
#Displays each team's aggregated stats entering week 7 (games played before week 7)
agg_weekly_data(combined_schedule,7)

Unnamed: 0,team_abbr,avg_interceptions_thrown,avg_pass_completion_rate,avg_pass_touchdowns,avg_pass_yards_per_attempt,avg_points_allowed,avg_points_scored,avg_punts,avg_quarterback_rating,win_ratio,avg_rush_touchdowns,avg_rush_yards_per_attempt,avg_time_of_possession,avg_times_sacked,avg_yards_lost_from_sacks,avg_extra_points_ratio,avg_field_goal_ratio,avg_fourth_down_ratio,avg_third_down_ratio,week
0,atl,0.666667,62.783333,1.0,7.983333,22.666667,24.333333,3.333333,85.8,0.5,1.333333,4.883333,1798.166667,2.333333,16.666667,1.0,0.66,0.75,0.443468,7
1,buf,0.666667,66.166667,2.833333,8.316667,13.5,29.333333,1.833333,106.5,0.833333,0.5,4.95,1806.333333,1.5,8.333333,1.0,0.833333,0.666667,0.529759,7
2,car,0.833333,56.35,0.666667,6.366667,24.333333,17.166667,5.0,69.066667,0.166667,0.5,4.183333,1496.833333,3.166667,20.5,1.0,0.958333,0.483333,0.236003,7
3,chi,0.833333,55.183333,0.666667,7.416667,19.666667,15.5,4.166667,65.633333,0.333333,0.833333,5.133333,1648.333333,3.833333,22.0,0.866667,1.0,0.416667,0.341087,7
4,cin,0.833333,67.133333,2.0,7.25,19.166667,23.0,3.666667,96.25,0.5,0.5,3.733333,1992.666667,3.666667,24.5,0.8,0.888889,0.266667,0.481046,7
5,cle,0.833333,61.766667,1.0,6.55,27.166667,24.666667,3.333333,79.733333,0.333333,1.666667,5.166667,1975.0,1.5,8.0,0.883333,0.833333,0.503333,0.41832,7
6,clt,1.166667,66.516667,1.333333,6.983333,20.166667,17.166667,4.0,80.266667,0.583333,0.333333,3.416667,1924.666667,3.5,28.0,1.0,0.833333,0.111111,0.409188,7
7,crd,0.666667,65.183333,1.0,5.833333,23.666667,19.0,3.833333,80.383333,0.333333,0.833333,4.433333,1902.666667,2.333333,19.166667,0.8,0.875,0.558333,0.339255,7
8,dal,0.666667,57.416667,0.833333,6.45,16.333333,18.333333,4.666667,76.983333,0.666667,0.666667,4.316667,1681.0,1.5,10.333333,0.9,0.875,0.6,0.325214,7
9,den,0.5,58.366667,0.833333,7.333333,16.5,15.166667,5.5,82.15,0.333333,0.333333,4.366667,1870.166667,3.333333,21.166667,0.916667,0.833333,0.5,0.298483,7


In [42]:
#Creating a dataframe combining the agg_weekly_data function for each week of the season

#DataFrame.append was removed in pandas 2.0, so collect the weekly frames and concatenate them once
weekly_frames = [agg_weekly_data(combined_schedule, week) for week in range(2, 24)]

complete_agg_weekly_data = pd.concat(weekly_frames, ignore_index=True)
    
#Checking for duplicates just in case
print(complete_agg_weekly_data.duplicated().sum())


0
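
`duplicated()` with no arguments only flags rows that match on every column. A somewhat stricter check, sketched here on made-up rows standing in for `complete_agg_weekly_data`, is to look for repeated `(team_abbr, week)` pairs, since those are the keys used later when joining to the schedule.

```python
import pandas as pd

# Made-up rows standing in for complete_agg_weekly_data.
agg = pd.DataFrame({
    'team_abbr': ['atl', 'buf', 'atl', 'buf'],
    'week':      [2, 2, 3, 3],
    'win_ratio': [0.0, 1.0, 0.5, 1.0],
})

# Count rows that repeat an earlier (team_abbr, week) pair, whatever their stats.
n_key_dupes = agg.duplicated(subset=['team_abbr', 'week']).sum()
```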


In [43]:
#Converting the week column to a string and appending -2022

complete_agg_weekly_data['week'] = complete_agg_weekly_data['week'].astype(str) + '-2022'
complete_agg_weekly_data['week'].unique()

array(['2-2022', '3-2022', '4-2022', '5-2022', '6-2022', '7-2022',
       '8-2022', '9-2022', '10-2022', '11-2022', '12-2022', '13-2022',
       '14-2022', '15-2022', '16-2022', '17-2022', '18-2022', '19-2022',
       '20-2022', '21-2022', '22-2022', '23-2022'], dtype=object)

In [44]:
complete_agg_weekly_data

Unnamed: 0,team_abbr,avg_interceptions_thrown,avg_pass_completion_rate,avg_pass_touchdowns,avg_pass_yards_per_attempt,avg_points_allowed,avg_points_scored,avg_punts,avg_quarterback_rating,win_ratio,avg_rush_touchdowns,avg_rush_yards_per_attempt,avg_time_of_possession,avg_times_sacked,avg_yards_lost_from_sacks,avg_extra_points_ratio,avg_field_goal_ratio,avg_fourth_down_ratio,avg_third_down_ratio,week
0,atl,0.000000,60.600000,0.000000,6.500000,27.000000,26.000000,4.000000,79.700000,0.000000,2.000000,5.300000,2024.000000,0.000000,0.000000,1.000000,0.800000,,0.384615,2-2022
1,buf,2.000000,83.900000,3.000000,9.600000,10.000000,31.000000,0.000000,111.300000,1.000000,1.000000,4.800000,1874.000000,2.000000,5.000000,1.000000,1.000000,,0.900000,2-2022
2,car,1.000000,59.300000,1.000000,8.700000,26.000000,24.000000,5.000000,80.300000,0.000000,2.000000,2.800000,1294.000000,4.000000,28.000000,1.000000,1.000000,,0.363636,2-2022
3,chi,1.000000,47.100000,2.000000,7.100000,10.000000,19.000000,6.000000,81.700000,1.000000,1.000000,2.700000,1592.000000,2.000000,16.000000,0.333333,,,0.357143,2-2022
4,cin,4.000000,62.300000,2.000000,6.400000,23.000000,20.000000,3.000000,58.600000,0.000000,0.000000,3.900000,2622.000000,7.000000,39.000000,0.000000,0.666667,0.333333,0.500000,2-2022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
699,sdg,0.555556,68.116667,1.500000,6.855556,23.055556,23.388889,4.333333,92.911111,0.555556,0.944444,3.616667,1848.833333,2.333333,12.611111,1.000000,0.936275,0.482051,0.434311,23-2022
700,sea,0.722222,69.788889,1.777778,7.544444,24.555556,23.888889,3.888889,98.072222,0.500000,0.722222,4.550000,1735.500000,2.722222,20.722222,0.981481,0.953922,0.555556,0.372373,23-2022
701,sfo,0.450000,65.685000,1.650000,8.050000,17.150000,25.850000,3.500000,101.415000,0.750000,1.150000,4.600000,1891.550000,1.850000,11.500000,0.965000,0.848039,0.333333,0.443986,23-2022
702,tam,0.611111,65.672222,1.555556,6.355556,21.611111,18.166667,4.666667,88.227778,0.444444,0.277778,3.272222,1765.166667,1.333333,9.833333,0.964286,0.787255,0.528912,0.381718,23-2022


In [45]:
#Ensuring the data is the same as four code blocks above
complete_agg_weekly_data[complete_agg_weekly_data['week'] == '7-2022']

Unnamed: 0,team_abbr,avg_interceptions_thrown,avg_pass_completion_rate,avg_pass_touchdowns,avg_pass_yards_per_attempt,avg_points_allowed,avg_points_scored,avg_punts,avg_quarterback_rating,win_ratio,avg_rush_touchdowns,avg_rush_yards_per_attempt,avg_time_of_possession,avg_times_sacked,avg_yards_lost_from_sacks,avg_extra_points_ratio,avg_field_goal_ratio,avg_fourth_down_ratio,avg_third_down_ratio,week
160,atl,0.666667,62.783333,1.0,7.983333,22.666667,24.333333,3.333333,85.8,0.5,1.333333,4.883333,1798.166667,2.333333,16.666667,1.0,0.66,0.75,0.443468,7-2022
161,buf,0.666667,66.166667,2.833333,8.316667,13.5,29.333333,1.833333,106.5,0.833333,0.5,4.95,1806.333333,1.5,8.333333,1.0,0.833333,0.666667,0.529759,7-2022
162,car,0.833333,56.35,0.666667,6.366667,24.333333,17.166667,5.0,69.066667,0.166667,0.5,4.183333,1496.833333,3.166667,20.5,1.0,0.958333,0.483333,0.236003,7-2022
163,chi,0.833333,55.183333,0.666667,7.416667,19.666667,15.5,4.166667,65.633333,0.333333,0.833333,5.133333,1648.333333,3.833333,22.0,0.866667,1.0,0.416667,0.341087,7-2022
164,cin,0.833333,67.133333,2.0,7.25,19.166667,23.0,3.666667,96.25,0.5,0.5,3.733333,1992.666667,3.666667,24.5,0.8,0.888889,0.266667,0.481046,7-2022
165,cle,0.833333,61.766667,1.0,6.55,27.166667,24.666667,3.333333,79.733333,0.333333,1.666667,5.166667,1975.0,1.5,8.0,0.883333,0.833333,0.503333,0.41832,7-2022
166,clt,1.166667,66.516667,1.333333,6.983333,20.166667,17.166667,4.0,80.266667,0.583333,0.333333,3.416667,1924.666667,3.5,28.0,1.0,0.833333,0.111111,0.409188,7-2022
167,crd,0.666667,65.183333,1.0,5.833333,23.666667,19.0,3.833333,80.383333,0.333333,0.833333,4.433333,1902.666667,2.333333,19.166667,0.8,0.875,0.558333,0.339255,7-2022
168,dal,0.666667,57.416667,0.833333,6.45,16.333333,18.333333,4.666667,76.983333,0.666667,0.666667,4.316667,1681.0,1.5,10.333333,0.9,0.875,0.6,0.325214,7-2022
169,den,0.5,58.366667,0.833333,7.333333,16.5,15.166667,5.5,82.15,0.333333,0.333333,4.366667,1870.166667,3.333333,21.166667,0.916667,0.833333,0.5,0.298483,7-2022


In [46]:
schedule_df = pd.read_csv('2022_schedule.csv')

In [47]:
schedule_df

Unnamed: 0,week,boxscore,away_abbr,away_score,home_abbr,home_score,winning_abbr,losing_abbr
0,1-2022,202209080ram,buf,31,ram,10,buf,ram
1,1-2022,202209110atl,nor,27,atl,26,nor,atl
2,1-2022,202209110car,cle,26,car,24,cle,car
3,1-2022,202209110chi,sfo,10,chi,19,chi,sfo
4,1-2022,202209110cin,pit,23,cin,20,pit,cin
...,...,...,...,...,...,...,...,...
279,20-2022,202301220buf,cin,27,buf,10,cin,buf
280,20-2022,202301220sfo,dal,12,sfo,19,sfo,dal
281,21-2022,202301290phi,sfo,7,phi,31,phi,sfo
282,21-2022,202301290kan,cin,20,kan,23,kan,cin


In [48]:
schedule_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284 entries, 0 to 283
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   week          284 non-null    object
 1   boxscore      284 non-null    object
 2   away_abbr     284 non-null    object
 3   away_score    284 non-null    int64 
 4   home_abbr     284 non-null    object
 5   home_score    284 non-null    int64 
 6   winning_abbr  282 non-null    object
 7   losing_abbr   282 non-null    object
dtypes: int64(2), object(6)
memory usage: 17.9+ KB


In [49]:
#Creating a df for the home and away teams by merging the game data with the schedule

home_df = pd.merge(schedule_df,complete_agg_weekly_data, left_on = ['week','home_abbr'], right_on=['week','team_abbr'])

#Sanity check to ensure that home and away dataframes are the same size
print(home_df.duplicated().sum())
print(home_df.shape)
print(schedule_df.shape)

away_df = pd.merge(schedule_df,complete_agg_weekly_data, left_on = ['week','away_abbr'], right_on=['week','team_abbr'])
print(away_df.duplicated().sum())
print(away_df.shape)
print(schedule_df.shape)


0
(268, 27)
(284, 8)
0
(268, 27)
(284, 8)
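
The merged frames have 268 rows against 284 in the schedule. The aggregated weeks start at `2-2022`, so week 1 games have nothing to join against, which would account for the gap. One way to confirm which games went unmatched is a left merge with `indicator=True`; the sketch below uses made-up miniature frames, and the names `schedule`, `agg`, `checked` and `unmatched` are illustrative.

```python
import pandas as pd

# Made-up miniature versions of schedule_df and complete_agg_weekly_data.
schedule = pd.DataFrame({
    'week':      ['1-2022', '2-2022'],
    'home_abbr': ['ram', 'kan'],
})
agg = pd.DataFrame({
    'week':      ['2-2022'],
    'team_abbr': ['kan'],
    'win_ratio': [1.0],
})

# A left merge keeps every scheduled game; `_merge` records whether stats were found.
checked = pd.merge(schedule, agg, how='left',
                   left_on=['week', 'home_abbr'],
                   right_on=['week', 'team_abbr'],
                   indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
```

On the real frames, `unmatched['week'].unique()` would show whether anything other than week 1 failed to match.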


In [50]:
#Adjusting settings so that all columns can be seen
pd.set_option('display.max_columns', None)

home_df

Unnamed: 0,week,boxscore,away_abbr,away_score,home_abbr,home_score,winning_abbr,losing_abbr,team_abbr,avg_interceptions_thrown,avg_pass_completion_rate,avg_pass_touchdowns,avg_pass_yards_per_attempt,avg_points_allowed,avg_points_scored,avg_punts,avg_quarterback_rating,win_ratio,avg_rush_touchdowns,avg_rush_yards_per_attempt,avg_time_of_possession,avg_times_sacked,avg_yards_lost_from_sacks,avg_extra_points_ratio,avg_field_goal_ratio,avg_fourth_down_ratio,avg_third_down_ratio
0,2-2022,202209150kan,sdg,24,kan,27,kan,sdg,kan,0.000000,76.900000,5.000000,9.200000,21.000000,44.000000,2.000000,144.200000,1.000000,1.000000,4.700000,2082.000000,0.000000,0.000000,0.833333,1.000000,1.000000,0.625000
1,2-2022,202209180cle,nyj,31,cle,30,nyj,cle,cle,0.000000,52.900000,1.000000,4.300000,24.000000,26.000000,4.000000,72.900000,1.000000,1.000000,5.600000,2306.000000,1.000000,9.000000,1.000000,1.000000,0.500000,0.444444
2,2-2022,202209180det,was,27,det,36,det,was,det,1.000000,56.800000,2.000000,5.800000,38.000000,35.000000,4.000000,79.200000,0.000000,3.000000,6.500000,1706.000000,1.000000,10.000000,1.000000,,1.000000,0.642857
3,2-2022,202209180jax,clt,0,jax,24,jax,clt,jax,1.000000,57.100000,1.000000,6.500000,28.000000,22.000000,3.000000,73.500000,0.000000,1.000000,6.800000,1587.000000,2.000000,15.000000,1.000000,0.750000,0.000000,0.250000
4,2-2022,202209180nor,tam,20,nor,10,tam,nor,nor,0.000000,67.600000,2.000000,7.900000,26.000000,27.000000,5.000000,106.700000,1.000000,1.000000,7.900000,1576.000000,4.000000,35.000000,1.000000,0.666667,,0.307692
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,20-2022,202301220buf,cin,27,buf,10,cin,buf,buf,0.941176,62.329412,2.235294,7.600000,18.647059,28.764706,3.000000,93.241176,0.823529,0.941176,5.170588,1799.764706,2.352941,11.647059,0.965686,0.843750,0.458333,0.499455
264,20-2022,202301220sfo,dal,12,sfo,19,sfo,dal,sfo,0.500000,65.950000,1.833333,8.233333,16.666667,27.277778,3.444444,103.933333,0.777778,1.166667,4.727778,1916.888889,1.777778,11.166667,0.961111,0.838542,0.366667,0.451651
265,21-2022,202301290phi,sfo,7,phi,31,phi,sfo,phi,0.500000,65.488889,1.500000,7.972222,19.500000,28.611111,3.277778,97.466667,0.833333,1.944444,4.622222,1850.222222,2.500000,14.722222,0.970370,0.866667,0.733333,0.474927
266,21-2022,202301290kan,cin,20,kan,23,kan,cin,kan,0.666667,66.961111,2.444444,8.044444,21.611111,29.055556,3.166667,104.588889,0.833333,1.000000,4.616667,1833.888889,1.444444,10.444444,0.837963,0.725490,0.722222,0.486078


In [51]:
away_df

Unnamed: 0,week,boxscore,away_abbr,away_score,home_abbr,home_score,winning_abbr,losing_abbr,team_abbr,avg_interceptions_thrown,avg_pass_completion_rate,avg_pass_touchdowns,avg_pass_yards_per_attempt,avg_points_allowed,avg_points_scored,avg_punts,avg_quarterback_rating,win_ratio,avg_rush_touchdowns,avg_rush_yards_per_attempt,avg_time_of_possession,avg_times_sacked,avg_yards_lost_from_sacks,avg_extra_points_ratio,avg_field_goal_ratio,avg_fourth_down_ratio,avg_third_down_ratio
0,2-2022,202209150kan,sdg,24,kan,27,kan,sdg,sdg,0.000000,76.500000,3.000000,8.200000,19.000000,24.000000,4.000000,129.400000,1.000000,0.000000,2.500000,1952.000000,0.000000,0.000000,1.000000,0.500000,0.000000,0.428571
1,2-2022,202209180cle,nyj,31,cle,30,nyj,cle,nyj,1.000000,62.700000,1.000000,5.200000,24.000000,9.000000,6.000000,73.900000,0.000000,0.000000,4.900000,1950.000000,3.000000,12.000000,0.000000,0.500000,0.750000,0.142857
2,2-2022,202209180det,was,27,det,36,det,was,was,2.000000,65.900000,4.000000,7.600000,22.000000,28.000000,3.000000,100.200000,1.000000,0.000000,3.000000,2013.000000,1.000000,8.000000,1.000000,,,0.700000
3,2-2022,202209180jax,clt,0,jax,24,jax,clt,clt,1.000000,64.000000,1.000000,7.000000,20.000000,20.000000,4.000000,82.100000,0.500000,1.000000,4.700000,2377.000000,2.000000,12.000000,1.000000,0.666667,0.000000,0.400000
4,2-2022,202209180nor,tam,20,nor,10,tam,nor,tam,1.000000,66.700000,1.000000,7.900000,3.000000,19.000000,3.000000,84.600000,1.000000,0.000000,4.600000,1962.000000,2.000000,17.000000,1.000000,0.800000,,0.357143
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,20-2022,202301220buf,cin,27,buf,10,cin,buf,cin,0.705882,68.788235,2.117647,7.394118,19.941176,26.000000,3.647059,99.629412,0.764706,0.882353,3.717647,1918.764706,2.823529,18.000000,0.822917,0.802083,0.233333,0.470935
264,20-2022,202301220sfo,dal,12,sfo,19,sfo,dal,dal,1.000000,65.500000,1.777778,7.327778,19.777778,27.666667,4.000000,91.777778,0.722222,1.388889,4.288889,1794.388889,1.555556,10.166667,0.856303,0.865385,0.627273,0.462337
265,21-2022,202301290phi,sfo,7,phi,31,phi,sfo,sfo,0.473684,65.926316,1.736842,8.189474,16.421053,26.842105,3.473684,102.952632,0.789474,1.157895,4.663158,1919.842105,1.789474,11.368421,0.963158,0.848039,0.366667,0.454196
266,21-2022,202301290kan,cin,20,kan,23,kan,cin,cin,0.666667,68.516667,2.111111,7.355556,19.388889,26.055556,3.555556,99.738889,0.777778,0.888889,3.794444,1925.166667,2.722222,17.111111,0.833333,0.813725,0.233333,0.478105


In [52]:
print(home_df.info())
print(away_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 0 to 267
Data columns (total 27 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   week                        268 non-null    object 
 1   boxscore                    268 non-null    object 
 2   away_abbr                   268 non-null    object 
 3   away_score                  268 non-null    int64  
 4   home_abbr                   268 non-null    object 
 5   home_score                  268 non-null    int64  
 6   winning_abbr                267 non-null    object 
 7   losing_abbr                 267 non-null    object 
 8   team_abbr                   268 non-null    object 
 9   avg_interceptions_thrown    268 non-null    float64
 10  avg_pass_completion_rate    268 non-null    float64
 11  avg_pass_touchdowns         268 non-null    float64
 12  avg_pass_yards_per_attempt  268 non-null    float64
 13  avg_points_allowed          268 non

In [53]:
home_away_df = pd.DataFrame()

In [54]:
home_away_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrame

In [55]:
#Creating a final df that combines the difference between home and away team stats for every matchup of the year
home_away_df['week'] = home_df['week']
home_away_df['home_abbr']= home_df['home_abbr']
home_away_df['away_abbr']= home_df['away_abbr']
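#home_win is 1 when the home team won; a game with no recorded winner (null winning_abbr) is labelled 0
home_away_df['home_win']= np.where(home_df['winning_abbr'] == home_df['home_abbr'],1,0)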
home_away_df['interceptions_thrown'] = home_df['avg_interceptions_thrown'] - away_df['avg_interceptions_thrown']
home_away_df['avg_pass_completion_rate'] = home_df['avg_pass_completion_rate'] - away_df['avg_pass_completion_rate']
home_away_df['avg_pass_touchdowns'] = home_df['avg_pass_touchdowns'] - away_df['avg_pass_touchdowns']
home_away_df['avg_pass_yards_per_attempt'] = home_df['avg_pass_yards_per_attempt'] - away_df['avg_pass_yards_per_attempt']
home_away_df['avg_points_allowed'] = home_df['avg_points_allowed'] - away_df['avg_points_allowed']
home_away_df['avg_points_scored'] = home_df['avg_points_scored'] - away_df['avg_points_scored']
home_away_df['avg_punts'] = home_df['avg_punts'] - away_df['avg_punts']
home_away_df['avg_quarterback_rating'] = home_df['avg_quarterback_rating'] - away_df['avg_quarterback_rating']
home_away_df['win_ratio'] = home_df['win_ratio'] - away_df['win_ratio']
home_away_df['avg_rush_touchdowns'] = home_df['avg_rush_touchdowns'] - away_df['avg_rush_touchdowns']
home_away_df['avg_rush_yards_per_attempt'] = home_df['avg_rush_yards_per_attempt'] - away_df['avg_rush_yards_per_attempt']
home_away_df['avg_time_of_possession'] = home_df['avg_time_of_possession'] - away_df['avg_time_of_possession']
home_away_df['avg_times_sacked'] = home_df['avg_times_sacked'] - away_df['avg_times_sacked']
home_away_df['avg_yards_lost_from_sacks'] = home_df['avg_yards_lost_from_sacks'] - away_df['avg_yards_lost_from_sacks']
home_away_df['avg_extra_points_ratio'] = home_df['avg_extra_points_ratio'] - away_df['avg_extra_points_ratio']
home_away_df['avg_field_goal_ratio'] = home_df['avg_field_goal_ratio'] - away_df['avg_field_goal_ratio']
home_away_df['avg_fourth_down_ratio'] = home_df['avg_fourth_down_ratio'] - away_df['avg_fourth_down_ratio']
home_away_df['avg_third_down_ratio'] = home_df['avg_third_down_ratio'] - away_df['avg_third_down_ratio']

home_away_df = home_away_df.reset_index(drop=True)

home_away_df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268 entries, 0 to 267
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   week                        268 non-null    object 
 1   home_abbr                   268 non-null    object 
 2   away_abbr                   268 non-null    object 
 3   home_win                    268 non-null    int64  
 4   interceptions_thrown        268 non-null    float64
 5   avg_pass_completion_rate    268 non-null    float64
 6   avg_pass_touchdowns         268 non-null    float64
 7   avg_pass_yards_per_attempt  268 non-null    float64
 8   avg_points_allowed          268 non-null    float64
 9   avg_points_scored           268 non-null    float64
 10  avg_punts                   268 non-null    float64
 11  avg_quarterback_rating      268 non-null    float64
 12  win_ratio                   268 non-null    float64
 13  avg_rush_touchdowns         268 non
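
The column-by-column subtraction in the cell above can also be written as a loop over the stat columns. The sketch below uses two tiny made-up frames (`home_toy` and `away_toy` are hypothetical stand-ins, and their values are illustrative rather than real data). It assumes, as the cell above does, that both frames share the same index so that each row refers to the same game.

```python
import pandas as pd

# Toy stand-ins for home_df and away_df (made-up values, same index = same game)
home_toy = pd.DataFrame({
    'week': ['2-2022', '2-2022'],
    'home_abbr': ['kan', 'cle'],
    'avg_points_scored': [44.0, 26.0],
    'win_ratio': [1.0, 1.0],
})
away_toy = pd.DataFrame({
    'week': ['2-2022', '2-2022'],
    'home_abbr': ['kan', 'cle'],
    'avg_points_scored': [24.0, 9.0],
    'win_ratio': [1.0, 0.0],
})

# Columns holding numeric stats that should be differenced
stat_cols = ['avg_points_scored', 'win_ratio']

# Start from the identifying columns, then add one difference column per stat
diff_toy = home_toy[['week', 'home_abbr']].copy()
for col in stat_cols:
    # pandas aligns on the index, so row i is home minus away for game i
    diff_toy[col] = home_toy[col] - away_toy[col]

print(diff_toy)
```

If the two frames did not share an index, the subtraction would silently produce NaN for unmatched rows, so checking the index first is worthwhile.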

In [56]:
home_away_df

Unnamed: 0,week,home_abbr,away_abbr,home_win,interceptions_thrown,avg_pass_completion_rate,avg_pass_touchdowns,avg_pass_yards_per_attempt,avg_points_allowed,avg_points_scored,avg_punts,avg_quarterback_rating,win_ratio,avg_rush_touchdowns,avg_rush_yards_per_attempt,avg_time_of_possession,avg_times_sacked,avg_yards_lost_from_sacks,avg_extra_points_ratio,avg_field_goal_ratio,avg_fourth_down_ratio,avg_third_down_ratio
0,2-2022,kan,sdg,1,0.000000,0.400000,2.000000,1.000000,2.000000,20.000000,-2.000000,14.800000,0.000000,1.000000,2.200000,130.000000,0.000000,0.000000,-0.166667,0.500000,1.000000,0.196429
1,2-2022,cle,nyj,0,-1.000000,-9.800000,0.000000,-0.900000,0.000000,17.000000,-2.000000,-1.000000,1.000000,1.000000,0.700000,356.000000,-2.000000,-3.000000,1.000000,0.500000,-0.250000,0.301587
2,2-2022,det,was,1,-1.000000,-9.100000,-2.000000,-1.800000,16.000000,7.000000,1.000000,-21.000000,-1.000000,3.000000,3.500000,-307.000000,0.000000,2.000000,0.000000,,,-0.057143
3,2-2022,jax,clt,1,0.000000,-6.900000,0.000000,-0.500000,8.000000,2.000000,-1.000000,-8.600000,-0.500000,0.000000,2.100000,-790.000000,0.000000,3.000000,0.000000,0.083333,0.000000,-0.150000
4,2-2022,nor,tam,0,-1.000000,0.900000,1.000000,0.000000,23.000000,8.000000,2.000000,22.100000,0.000000,1.000000,3.300000,-386.000000,2.000000,18.000000,0.000000,-0.133333,,-0.049451
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,20-2022,buf,cin,0,0.235294,-6.458824,0.117647,0.205882,-1.294118,2.764706,-0.647059,-6.388235,0.058824,0.058824,1.452941,-119.000000,-0.470588,-6.352941,0.142770,0.041667,0.225000,0.028521
264,20-2022,sfo,dal,1,-0.500000,0.450000,0.055556,0.905556,-3.111111,-0.388889,-0.555556,12.155556,0.055556,-0.222222,0.438889,122.500000,0.222222,1.000000,0.104809,-0.026843,-0.260606,-0.010686
265,21-2022,phi,sfo,1,0.026316,-0.437427,-0.236842,-0.217251,3.078947,1.769006,-0.195906,-5.485965,0.043860,0.786550,-0.040936,-69.619883,0.710526,3.353801,0.007212,0.018627,0.366667,0.020731
266,21-2022,kan,cin,1,0.000000,-1.555556,0.333333,0.688889,2.222222,3.000000,-0.388889,4.850000,0.055556,0.111111,0.822222,-91.277778,-1.277778,-6.666667,0.004630,-0.088235,0.488889,0.007973


In [57]:
#Removing the -2022 and converting the week column back to an int
home_away_df['week'] = [int(entry.split('-')[0]) for entry in home_away_df['week']]
home_away_df

Unnamed: 0,week,home_abbr,away_abbr,home_win,interceptions_thrown,avg_pass_completion_rate,avg_pass_touchdowns,avg_pass_yards_per_attempt,avg_points_allowed,avg_points_scored,avg_punts,avg_quarterback_rating,win_ratio,avg_rush_touchdowns,avg_rush_yards_per_attempt,avg_time_of_possession,avg_times_sacked,avg_yards_lost_from_sacks,avg_extra_points_ratio,avg_field_goal_ratio,avg_fourth_down_ratio,avg_third_down_ratio
0,2,kan,sdg,1,0.000000,0.400000,2.000000,1.000000,2.000000,20.000000,-2.000000,14.800000,0.000000,1.000000,2.200000,130.000000,0.000000,0.000000,-0.166667,0.500000,1.000000,0.196429
1,2,cle,nyj,0,-1.000000,-9.800000,0.000000,-0.900000,0.000000,17.000000,-2.000000,-1.000000,1.000000,1.000000,0.700000,356.000000,-2.000000,-3.000000,1.000000,0.500000,-0.250000,0.301587
2,2,det,was,1,-1.000000,-9.100000,-2.000000,-1.800000,16.000000,7.000000,1.000000,-21.000000,-1.000000,3.000000,3.500000,-307.000000,0.000000,2.000000,0.000000,,,-0.057143
3,2,jax,clt,1,0.000000,-6.900000,0.000000,-0.500000,8.000000,2.000000,-1.000000,-8.600000,-0.500000,0.000000,2.100000,-790.000000,0.000000,3.000000,0.000000,0.083333,0.000000,-0.150000
4,2,nor,tam,0,-1.000000,0.900000,1.000000,0.000000,23.000000,8.000000,2.000000,22.100000,0.000000,1.000000,3.300000,-386.000000,2.000000,18.000000,0.000000,-0.133333,,-0.049451
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,20,buf,cin,0,0.235294,-6.458824,0.117647,0.205882,-1.294118,2.764706,-0.647059,-6.388235,0.058824,0.058824,1.452941,-119.000000,-0.470588,-6.352941,0.142770,0.041667,0.225000,0.028521
264,20,sfo,dal,1,-0.500000,0.450000,0.055556,0.905556,-3.111111,-0.388889,-0.555556,12.155556,0.055556,-0.222222,0.438889,122.500000,0.222222,1.000000,0.104809,-0.026843,-0.260606,-0.010686
265,21,phi,sfo,1,0.026316,-0.437427,-0.236842,-0.217251,3.078947,1.769006,-0.195906,-5.485965,0.043860,0.786550,-0.040936,-69.619883,0.710526,3.353801,0.007212,0.018627,0.366667,0.020731
266,21,kan,cin,1,0.000000,-1.555556,0.333333,0.688889,2.222222,3.000000,-0.388889,4.850000,0.055556,0.111111,0.822222,-91.277778,-1.277778,-6.666667,0.004630,-0.088235,0.488889,0.007973
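
The list comprehension above can also be expressed with pandas string methods. A small sketch on a made-up Series (`weeks` is a hypothetical example, not the real column):

```python
import pandas as pd

# Week labels in the same 'week-season' format used in the dataframe
weeks = pd.Series(['2-2022', '20-2022', '21-2022'])

# Split on the hyphen, keep the first piece, and convert it to an integer
week_numbers = weeks.str.split('-').str[0].astype(int)

print(week_numbers.tolist())
```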


In [58]:
home_away_df.drop(['home_abbr', 'away_abbr'], axis=1, inplace=True)
home_away_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268 entries, 0 to 267
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   week                        268 non-null    int64  
 1   home_win                    268 non-null    int64  
 2   interceptions_thrown        268 non-null    float64
 3   avg_pass_completion_rate    268 non-null    float64
 4   avg_pass_touchdowns         268 non-null    float64
 5   avg_pass_yards_per_attempt  268 non-null    float64
 6   avg_points_allowed          268 non-null    float64
 7   avg_points_scored           268 non-null    float64
 8   avg_punts                   268 non-null    float64
 9   avg_quarterback_rating      268 non-null    float64
 10  win_ratio                   268 non-null    float64
 11  avg_rush_touchdowns         268 non-null    float64
 12  avg_rush_yards_per_attempt  268 non-null    float64
 13  avg_time_of_possession      268 non

A few null values appear where, for a given week, a team did not attempt a 4th down conversion, extra point or field goal, which makes the corresponding ratio 0/0.

For now we will drop the rows containing these nulls so that we can quickly test some basic models and confirm the dataframe is working as intended. The nulls will be addressed in later notebooks.

In [59]:
home_away_df.dropna(inplace=True)
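
To illustrate where such nulls come from and what `dropna` does with them, here is a minimal sketch on made-up numbers (the `made` and `attempted` counts and the `toy` frame are hypothetical). By default `dropna` removes rows that contain a null; passing `axis=1` would remove columns instead.

```python
import pandas as pd

# Made-up counts: the second team has not attempted a field goal yet
made = pd.Series([3.0, 0.0, 1.0])
attempted = pd.Series([4.0, 0.0, 1.0])

toy = pd.DataFrame({
    'win_ratio': [0.5, 1.0, 0.0],
    'field_goal_ratio': made / attempted,  # 0/0 becomes NaN
})

null_count = toy['field_goal_ratio'].isna().sum()      # null ratios in the column
rows_kept = len(toy.dropna())                          # default: drop rows with any null
cols_kept = toy.dropna(axis=1).columns.tolist()        # alternative: drop columns with any null

print(null_count, rows_kept, cols_kept)
```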


In [60]:
#Running a log reg model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#let's check a baseline model that only considers win ratio
x_base = np.array(home_away_df['win_ratio']).reshape(-1, 1)

y_base = home_away_df['home_win']

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(x_base,y_base, test_size=0.2, random_state=1)

my_logreg_model = LogisticRegression().fit(X_train_b, y_train_b)
print(f'Train score: {my_logreg_model.score(X_train_b,y_train_b)}')
print(f'Test score: {my_logreg_model.score(X_test_b,y_test_b)}')


Train score: 0.5621890547263682
Test score: 0.6470588235294118
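
To put accuracy scores like these in context, it can help to compare them with the accuracy of always predicting the more common outcome. A minimal sketch on a made-up label vector (`y_toy` is hypothetical, not the real `home_win` column):

```python
import numpy as np

# Made-up labels: 1 = home win, 0 = home loss
y_toy = np.array([1, 1, 0, 1, 0, 1, 1, 0])

# Share of home wins, and the accuracy of always guessing the majority class
home_rate = y_toy.mean()
majority_accuracy = max(home_rate, 1 - home_rate)

print(majority_accuracy)
```

A model is only adding information if its test score is clearly above this figure.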


In [61]:
#Now let's check how a basic log reg performs with all the variables

x = home_away_df.drop('home_win', axis=1)
y = home_away_df['home_win']

X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=1)

my_logreg_model = LogisticRegression(max_iter=500).fit(X_train, y_train)
print(f'Train score: {my_logreg_model.score(X_train,y_train)}')
print(f'Test score: {my_logreg_model.score(X_test,y_test)}')

Train score: 0.6965174129353234
Test score: 0.6274509803921569


In [62]:
#let's try another model that only looks at week 6 onwards in the season
home_away_df_5 = home_away_df[home_away_df['week'] >= 6]

x_5 = np.array(home_away_df_5['win_ratio']).reshape(-1, 1)

y_5 = home_away_df_5['home_win']

X_train_5, X_test_5, y_train_5, y_test_5 = train_test_split(x_5,y_5, test_size=0.2, random_state=1)

my_logreg_model = LogisticRegression().fit(X_train_5, y_train_5)
print(f'Train score: {my_logreg_model.score(X_train_5,y_train_5)}')
print(f'Test score: {my_logreg_model.score(X_test_5,y_test_5)}')

Train score: 0.6748466257668712
Test score: 0.43902439024390244


Based on these results, the win ratio alone scores slightly better on the test set than all the variables combined (0.647 vs 0.627), and restricting the data to week 6 onwards does not help (test score 0.439). With only about 40 to 50 games in each test set, these differences are not conclusive. This is okay for now because we were only seeking to make sure that these functions were working properly. In further notebooks we will include data from seasons dating back to 1970, explore other models and which features may need to be added or removed, and hopefully achieve better results!
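
Because a single 80/20 split leaves only a few dozen games in the test set, scores like the ones above can shift noticeably with the choice of `random_state`. One possible way to get a steadier estimate is k-fold cross-validation. The sketch below runs on synthetic data (the `win_ratio_diff` feature and `home_win` labels are generated here, not taken from the real dataframe), so the printed numbers say nothing about actual NFL games.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the win ratio difference feature (assumption, not real data)
rng = np.random.default_rng(0)
n_games = 250
win_ratio_diff = rng.uniform(-1, 1, size=n_games)

# Synthetic labels: a home win is more likely when the difference is positive
p_home = 1 / (1 + np.exp(-2 * win_ratio_diff))
home_win = (rng.uniform(size=n_games) < p_home).astype(int)

# 5-fold cross-validation: every game is used for testing exactly once
X = win_ratio_diff.reshape(-1, 1)
scores = cross_val_score(LogisticRegression(), X, home_win, cv=5)

print(f'Mean accuracy: {scores.mean():.3f}, std across folds: {scores.std():.3f}')
```

The spread across folds gives a rough sense of how much a single train/test split could differ from the average.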