# Getting, Cleaning, Exporting Data
Source: [kaggle](https://www.kaggle.com/datasets/smadler92/nfl-pfr)
The dataset comes in multiple tables that will need to be merged. It covers 1960 through 2021, but only the first two games of 2021 are included, so it should be cut to 2020. The tables are organized as follows:
* By Team Folder
    * 4 Files - Roster, Stats by Year, Weekly Odds, Weekly Scores
    * Weekly Injuries Folder
        * Each Year (2009-2021)
* Weather File

The goal is to take the weekly scores for each team, add a new column for the team name, and append them to a single DataFrame that will contain every team's data.

In [10]:
# Libraries
import pandas as pd # working with DataFrames
import numpy as np # linear algebra
import os #navigating the folders and files
import glob
from pathlib import Path

Next, we need to navigate through the team folders (e.g. ATL) and collect each team's weekly scores file.

In [11]:
# Extract teams and filenames from data folder
csv_files = [] # Empty list to store the file names
teams = [] # empty list to hold all team abbreviations
for filename in Path('data/NFLML').glob('**/* WeeklyScores All Years.csv'):
    csv_files.append(filename)
    team_abr = filename.name.split(' ')[0] # Team abbreviation is the first token of the file name (no dependence on the OS path separator)
    teams.append(team_abr)
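
Reading the abbreviation from the file name, rather than from a fixed position in the full path string, keeps this step independent of the path separator. Below is a minimal sketch that checks the idea on both Windows-style and POSIX-style paths. The helper name `team_from_path` and the example folder layout are assumptions made for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

def team_from_path(path):
    """Return the first space-separated token of the file name, e.g. 'ATL'."""
    return path.name.split(' ')[0]

# Hypothetical example paths that follow the '<TEAM> WeeklyScores All Years.csv' pattern
win_path = PureWindowsPath(r'data\NFLML\ATL\ATL WeeklyScores All Years.csv')
posix_path = PurePosixPath('data/NFLML/ATL/ATL WeeklyScores All Years.csv')

print(team_from_path(win_path))
print(team_from_path(posix_path))
```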

In [46]:
# Function for pulling csv files and adding to single dataframe
def join_frames(files, teams, save=False, verbose=True):
    
    data = [] #empty list to store the dataframes
    
    # loop over every team by index (32 teams in this dataset)
    for i in range(len(teams)):
        if verbose:
            print(f"Getting Weekly Scores for {teams[i]}") # verbose
        team_df = pd.read_csv(files[i], index_col=0)
        team_df['Team'] = teams[i] # new column for each DataFrames associated team
        data.append(team_df)
    
    df = pd.concat(data, axis=0)
    if save:
        df.to_csv('data/nfl_scores_merged.csv', index=False)
    return df
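
The read, tag, and concatenate pattern used in `join_frames` can be tried on toy data before running it on the real files. The sketch below uses two in-memory CSVs with made-up team codes and scores; none of these values come from the dataset.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for per-team CSVs (made-up values)
toy_files = {
    'AAA': io.StringIO(",Week,Score_Tm\n0,1,14\n1,2,10\n"),
    'BBB': io.StringIO(",Week,Score_Tm\n0,1,21\n"),
}

frames = []
for team, handle in toy_files.items():
    team_df = pd.read_csv(handle, index_col=0)  # first column becomes the index
    team_df['Team'] = team                      # tag every row with its team
    frames.append(team_df)

merged = pd.concat(frames, axis=0)
print(merged)
```

Note that each team's original index is kept, so index labels repeat across teams in the merged frame. Passing `ignore_index=True` to `pd.concat` is one way to get a fresh, unique index if that is needed later.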

In [57]:
df = join_frames(csv_files,teams, save=False)

Getting Weekly Scores for ATL
Getting Weekly Scores for BUF
Getting Weekly Scores for CAR
Getting Weekly Scores for CHI
Getting Weekly Scores for CIN
Getting Weekly Scores for CLE
Getting Weekly Scores for CLT
Getting Weekly Scores for CRD
Getting Weekly Scores for DAL
Getting Weekly Scores for DEN
Getting Weekly Scores for DET
Getting Weekly Scores for GNB
Getting Weekly Scores for HTX
Getting Weekly Scores for JAX
Getting Weekly Scores for KAN
Getting Weekly Scores for MIA
Getting Weekly Scores for MIN
Getting Weekly Scores for NOR
Getting Weekly Scores for NWE
Getting Weekly Scores for NYG
Getting Weekly Scores for NYJ
Getting Weekly Scores for OTI
Getting Weekly Scores for PHI
Getting Weekly Scores for PIT
Getting Weekly Scores for RAI
Getting Weekly Scores for RAM
Getting Weekly Scores for RAV
Getting Weekly Scores for SDG
Getting Weekly Scores for SEA
Getting Weekly Scores for SFO
Getting Weekly Scores for TAM
Getting Weekly Scores for WAS


In [58]:
df.shape

(33890, 26)

In [59]:
df.tail()

Unnamed: 0,Week,Day,Date,Time,Outcome,OT,Rec,Home,Opp,Score_Tm,...,Defense_1stD,Defense_TotYd,Defense_PassY,Defense_RushY,Defense_TO,Year,Expected Points_Offense,Expected Points_Defense,Expected Points_Sp. Tms,Team
1363,14,Sun,December 12,1:00PM ET,,False,,True,Dallas Cowboys,,...,,,,,,2021,,,,WAS
1364,15,Sun,December 19,1:00PM ET,,False,,False,Philadelphia Eagles,,...,,,,,,2021,,,,WAS
1365,16,Sun,December 26,8:20PM ET,,False,,False,Dallas Cowboys,,...,,,,,,2021,,,,WAS
1366,17,Sun,January 2,1:00PM ET,,False,,True,Philadelphia Eagles,,...,,,,,,2021,,,,WAS
1367,18,Sun,January 9,1:00PM ET,,False,,False,New York Giants,,...,,,,,,2021,,,,WAS


In [60]:
df.loc[df['Year']==2021]

Unnamed: 0,Week,Day,Date,Time,Outcome,OT,Rec,Home,Opp,Score_Tm,...,Defense_1stD,Defense_TotYd,Defense_PassY,Defense_RushY,Defense_TO,Year,Expected Points_Offense,Expected Points_Defense,Expected Points_Sp. Tms,Team
918,1,Sun,September 12,1:00PM ET,L,False,0-1,True,Philadelphia Eagles,6.0,...,24.0,434.0,261.0,173.0,,2021,-12.48,-13.14,2.36,ATL
919,2,Sun,September 19,4:05PM ET,L,False,0-2,False,Tampa Bay Buccaneers,25.0,...,21.0,341.0,259.0,82.0,1.0,2021,-15.96,-8.63,0.72,ATL
920,3,Sun,September 26,1:00PM ET,,False,,False,New York Giants,,...,,,,,,2021,,,,ATL
921,4,Sun,October 3,1:00PM ET,,False,,True,Washington Football Team,,...,,,,,,2021,,,,ATL
922,5,Sun,October 10,9:30AM ET,,False,,True,New York Jets,,...,,,,,,2021,,,,ATL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1363,14,Sun,December 12,1:00PM ET,,False,,True,Dallas Cowboys,,...,,,,,,2021,,,,WAS
1364,15,Sun,December 19,1:00PM ET,,False,,False,Philadelphia Eagles,,...,,,,,,2021,,,,WAS
1365,16,Sun,December 26,8:20PM ET,,False,,False,Dallas Cowboys,,...,,,,,,2021,,,,WAS
1366,17,Sun,January 2,1:00PM ET,,False,,True,Philadelphia Eagles,,...,,,,,,2021,,,,WAS


We can see above that the tail end of our DataFrame contains many NaN values. The dataset runs through 2021 but only has results for the first two games of that season. I will therefore drop all rows from the 2021 season, which keeps things simple without losing much data.
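
One way to check how incomplete a season is would be to compare the share of rows without an `Outcome` across years. Here is a minimal sketch on made-up rows (illustrative values only, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up rows: a complete season and a season with unplayed games
toy = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021, 2021, 2021],
    'Outcome': ['W', 'L', 'L', 'L', np.nan, np.nan],
})

# Share of rows per year where Outcome is missing
missing_share = toy['Outcome'].isna().groupby(toy['Year']).mean()
print(missing_share)
```

On the real data, other seasons may also show a non-zero share, since any row without a played game counts as missing here.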

In [61]:
df = df.loc[df['Year']!=2021].copy() # copy so the later in-place drops act on an independent DataFrame
df.shape

(33314, 26)

## Missing Values

In [62]:
df.isna().sum()

Week                         682
Day                         1658
Date                         976
Time                        8792
Outcome                     1658
OT                             0
Rec                         1658
Home                           0
Opp                          682
Score_Tm                    1658
Score_Opp                   1658
Offense_1stD                2741
Offense_TotYd               2731
Offense_PassY               2759
Offense_RushY               2736
Offense_TO                  7346
Defense_1stD                2741
Defense_TotYd               2731
Defense_PassY               2760
Defense_RushY               2737
Defense_TO                  7349
Year                           0
Expected Points_Offense    19132
Expected Points_Defense    19132
Expected Points_Sp. Tms    19132
Team                           0
dtype: int64

Things to notice:
- We are missing values from Week (this could be because at the beginning of playoffs each year, each team has a separator row in their data).

In [64]:
# Looking at missing week data
df[df['Week'].isna()].Date.unique()

array(['Playoffs'], dtype=object)

We can see here that all of our missing `Week` values come from these separator rows, so they can be safely dropped.

In [65]:
# Drop rows where week is NaN
df.dropna(subset=['Week'], inplace=True)
df.shape # check that we have 682 fewer rows

(32632, 26)

Next, I will look at the rows that are missing `Day` values.

In [67]:
df[df['Day'].isna()].sample(5)

Unnamed: 0,Week,Day,Date,Time,Outcome,OT,Rec,Home,Opp,Score_Tm,...,Defense_1stD,Defense_TotYd,Defense_PassY,Defense_RushY,Defense_TO,Year,Expected Points_Offense,Expected Points_Defense,Expected Points_Sp. Tms,Team
465,14,,,,,False,,True,Bye Week,,...,,,,,,1990,,,,NYJ
922,14,,,,,False,,True,Bye Week,,...,,,,,,1991,,,,CRD
898,7,,,,,False,,True,Bye Week,,...,,,,,,2013,,,,RAI
200,8,,,,,False,,True,Bye Week,,...,,,,,,2013,,,,HTX
1248,6,,,,,False,,True,Bye Week,,...,,,,,,2010,,,,CRD


These rows all appear to be bye weeks. Let's first confirm this, and if it turns out to be the case, the rows missing `Day` can be safely dropped.

In [72]:
df[df['Day'].isna()]['Opp'].unique() # Check that the only value for 'Opp' is 'Bye Week'

# Can safely drop the rows containing Day=NaN
df.dropna(subset=['Day'], inplace=True)
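
Because a notebook cell only displays its last expression, the result of the `unique()` check above is not shown when the drop follows it in the same cell. A hedged alternative is to turn the check into a guard that fails loudly before anything is dropped. The sketch below uses made-up rows, and the names `toy` and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows: one bye week between two played games
toy = pd.DataFrame({
    'Week': [1, 2, 3],
    'Day': ['Sun', np.nan, 'Mon'],
    'Opp': ['Dallas Cowboys', 'Bye Week', 'New York Giants'],
})

missing_day = toy['Day'].isna()

# Guard: every row without a Day must be a bye week, otherwise stop here
assert (toy.loc[missing_day, 'Opp'] == 'Bye Week').all()

# Keep only rows that have a Day value
cleaned = toy.loc[~missing_day].copy()
print(cleaned)
```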

I have a feeling that these 'Bye Week' rows were causing a lot of the missing values in our dataset. Let's take another look at the missing values:

In [73]:
# Check missing values
df.isna().sum()

Week                           0
Day                            0
Date                           0
Time                        7134
Outcome                        0
OT                             0
Rec                            0
Home                           0
Opp                            0
Score_Tm                       0
Score_Opp                      0
Offense_1stD                1083
Offense_TotYd               1073
Offense_PassY               1101
Offense_RushY               1078
Offense_TO                  5688
Defense_1stD                1083
Defense_TotYd               1073
Defense_PassY               1102
Defense_RushY               1079
Defense_TO                  5691
Year                           0
Expected Points_Offense    17474
Expected Points_Defense    17474
Expected Points_Sp. Tms    17474
Team                           0
dtype: int64

My assumption was correct: many of our missing values are gone. Some important ones remain, though, such as `Time` and the offensive and defensive stats. We also have a lot of missing values in the `Expected Points` columns, but because these are irrelevant to our analysis, it is safe to simply drop them.
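
To decide how to handle the gaps that remain, it may help to see which seasons they come from. The sketch below counts missing values per column and per year on made-up rows. It assumes, without verifying it here, that gaps such as `Time` cluster in older seasons. A NaN in the turnover columns might also stand for zero turnovers rather than a truly missing value, which would need to be checked against the source.

```python
import numpy as np
import pandas as pd

# Made-up rows from an early and a recent season
toy = pd.DataFrame({
    'Year': [1966, 1966, 2020, 2020],
    'Time': [np.nan, np.nan, '1:00PM ET', '4:05PM ET'],
    'Offense_TO': [2.0, np.nan, 1.0, np.nan],
})

cols = ['Time', 'Offense_TO']

# Number of missing values per column, broken down by season
missing_by_year = toy[cols].isna().groupby(toy['Year']).sum()
print(missing_by_year)
```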

In [75]:
# Drop the 3 'Expected' Columns
cols_to_drop = ['Expected Points_Offense', 'Expected Points_Defense', 'Expected Points_Sp. Tms']
df.drop(columns=cols_to_drop, inplace=True)
df.head()

Unnamed: 0,Week,Day,Date,Time,Outcome,OT,Rec,Home,Opp,Score_Tm,...,Offense_PassY,Offense_RushY,Offense_TO,Defense_1stD,Defense_TotYd,Defense_PassY,Defense_RushY,Defense_TO,Year,Team
0,1,Sun,September 11,,L,False,0-1,True,Los Angeles Rams,14.0,...,116.0,121.0,2.0,23.0,421.0,275.0,146.0,2.0,1966,ATL
1,2,Sun,September 18,,L,False,0-2,False,Philadelphia Eagles,10.0,...,202.0,96.0,3.0,20.0,340.0,135.0,205.0,2.0,1966,ATL
2,3,Sun,September 25,,L,False,0-3,False,Detroit Lions,10.0,...,104.0,155.0,3.0,18.0,344.0,208.0,136.0,3.0,1966,ATL
3,4,Sun,October 2,,L,False,0-4,True,Dallas Cowboys,14.0,...,170.0,106.0,5.0,22.0,363.0,220.0,143.0,2.0,1966,ATL
4,5,Sun,October 9,,L,False,0-5,False,Washington Redskins,20.0,...,146.0,122.0,2.0,26.0,432.0,286.0,146.0,,1966,ATL
