# Data Collection & Aggregation

This notebook summarizes my data collection and aggregation process. The data comes from three sources:
* [Kaggle Play-by-Play Data](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016)
* [NFL Teams](https://gist.githubusercontent.com/cnizzardini/13d0a072adb35a0d5817/raw/4b555e084e8cec673dc587555008607fb06c6a60/nfl_teams.csv)
* [NFLWeather.com](http://www.nflweather.com/en/archive)

**Please Note: Restarting this notebook will cause errors because not all of the data is stored locally.**

## Contents
[Play-by-Play Data](#Play-by-Play-Data)<br>
[Weather Data](#Weather-Data)<br>
[Joining Dataframes](#Joining-Dataframes)

In [1]:
import pandas as pd
import s3fs
import numpy as np
import time
import requests
from bs4 import BeautifulSoup

pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999999)

import warnings
warnings.filterwarnings('ignore')

### Play-by-Play Data

As mentioned above, the play-by-play dataset can be accessed on [Kaggle](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016). It downloads as a zip file, and the unzipped CSV is about 700 MB: not enormous, but large enough that I don't want to store it locally. I uploaded the file to an S3 bucket on AWS. My AWS key and secret key are not listed below for security purposes.
<br>
<br>
To run this notebook without uploading the file onto AWS, the easiest thing is to take these steps:
* Download dataset from link above
* Unzip file
* Run `df = pd.read_csv('file_name.csv')`, replacing `'file_name.csv'` with the path to your unzipped CSV

In [2]:
fs = s3fs.S3FileSystem(anon=False,key='AWS KEY',secret='AWS SECRET KEY')

key = 'NFL Play by Play 2009-2018 (v5).csv'
bucket = 'nfl-play-by-play-capstone'

df = pd.read_csv(fs.open('{}/{}'.format(bucket, key),
                         mode='rb'))
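
One way to keep the credentials out of the notebook entirely is to read them from environment variables. The sketch below is only an illustration: it assumes the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variables are set, and `s3_credentials` is a hypothetical helper name, not part of `s3fs`.

```python
import os

def s3_credentials(env=os.environ):
    """Build keyword arguments for s3fs.S3FileSystem from environment variables.

    Hypothetical helper: the variable names are an assumption about your setup.
    """
    try:
        return {'anon': False,
                'key': env['AWS_ACCESS_KEY_ID'],
                'secret': env['AWS_SECRET_ACCESS_KEY']}
    except KeyError as missing:
        raise RuntimeError(f'Missing environment variable: {missing}') from None

# Usage (requires s3fs and valid credentials):
# fs = s3fs.S3FileSystem(**s3_credentials())
```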

### Weather Data

The code below works in a Jupyter notebook, but this directory also contains `get_nfl_weather.py`, which can be run from the command line. The first function is just a progress bar, and the second function is the actual scrape. It takes two inputs: a range of seasons (min 2009, max 2018) and a range of weeks (min 1, max 17). The main difference between the code below and the script is the location of the CSV: the code below saves it in my datasets folder, while the script saves it in your current directory.

In [3]:
# source: https://stackoverflow.com/questions/3173320/text-progress-bar-in-the-console

def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█'):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r')
    # Print New Line on Complete
    if iteration == total: 
        print()

In [4]:
def get_nfl_weather(year_range:range,week_range:range):
    
    years = list(year_range)
    weeks = list(week_range)
    
    l = len(years)
    
    games = []
    
    printProgressBar(0, l, prefix = 'Progress:', suffix = 'Complete', length = 50)
    
    for i, year in enumerate(years):
        for week in weeks:
        
            url = f'http://www.nflweather.com/en/week/{year}/week-{week}/'

            response = requests.get(url)

            # Fall back to the alternate "-2" URL when the default page does not load
            if response.status_code != 200:
                url = f'http://www.nflweather.com/en/week/{year}/week-{week}-2/'
                response = requests.get(url)

            soup = BeautifulSoup(response.text, 'lxml')

            # Skip the header row; each remaining row is parsed as one game
            for data in soup.find('table').find_all('tr')[1:]:

                game = {}

                try:
                    links = data.find_all('a')
                    cells = data.find_all('td', class_='text-center')
                    away = links[0].text
                    home = links[3].text
                    forecast = cells[5].text
                    wind = cells[6].text
                    game['away'] = away
                    game['home'] = home
                    game['forecast'] = forecast.strip('\n').strip('\n ')
                    game['wind'] = wind
                    game['year'] = year
                    game['week'] = week

                except IndexError:
                    # Rows without the expected cells are kept as empty records
                    pass

                games.append(game)
                
            time.sleep(1)
            
        printProgressBar(i + 1, l, prefix = 'Progress:', suffix = 'Complete', length = 50)
                
    df = pd.DataFrame(games)
    df.to_csv(f"../datasets/nfl_weather_{min(years)}_to_{max(years)}_weeks_{min(weeks)}_to_{max(weeks)}.csv",
              index=False)
                        
    return df

To execute the function in the notebook, run the code below.

```python 
weather = get_nfl_weather(range(2009,2019),range(1,18))
```
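
The function itself does not check its inputs. A minimal guard for the bounds described earlier (seasons 2009 to 2018, weeks 1 to 17) might look like the sketch below; `check_ranges` is a hypothetical helper and not part of the script.

```python
def check_ranges(year_range, week_range):
    """Raise ValueError if the ranges fall outside the supported bounds."""
    years, weeks = list(year_range), list(week_range)
    if not years or not weeks:
        raise ValueError('year_range and week_range must not be empty')
    if min(years) < 2009 or max(years) > 2018:
        raise ValueError('seasons must fall between 2009 and 2018')
    if min(weeks) < 1 or max(weeks) > 17:
        raise ValueError('weeks must fall between 1 and 17')
    return years, weeks

years, weeks = check_ranges(range(2009, 2019), range(1, 18))

# One page is requested per (season, week) pair
pages = len(years) * len(weeks)
print(pages)  # 170
```

Since the scrape sleeps for one second after each page, the full run takes at least 170 seconds.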

### Joining Dataframes

In this section, I am going to join the weather forecast for each game to the play-by-play dataframe. In order to do this, I first need to find an intermediary dataframe that has both team names and abbreviations since weather uses names and play-by-play uses abbreviations. 

In [5]:
weather = pd.read_csv('../datasets/nfl_weather_2009_to_2018_weeks_1_to_17.csv')

In [6]:
teams = pd.read_csv('https://gist.githubusercontent.com/cnizzardini/'+
                    '13d0a072adb35a0d5817/raw/4b555e084e8cec673dc587555008607fb06c6a60/nfl_teams.csv')

I want to check if we have the same number of games in weather and play-by-play. I also want to check that we have 2560 games in total (256 regular season games a year, 10 years).

In [7]:
df['game_id'].nunique(),weather.shape[0],256*10

(2526, 2559, 2560)

Already we have more games in weather than in the play-by-play dataframe. Weather seems to be missing one game, and play-by-play is missing 34 games. I will come back to this when I join the two dataframes.
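
The gaps follow directly from the three numbers printed above. As a quick worked check (the variable names are only for illustration):

```python
expected_games = 256 * 10      # 256 regular season games a year, 10 years
play_by_play_games = 2526      # df['game_id'].nunique() from the output above
weather_games = 2559           # weather.shape[0] from the output above

missing_from_play_by_play = expected_games - play_by_play_games
missing_from_weather = expected_games - weather_games

print(missing_from_play_by_play, missing_from_weather)  # 34 1
```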

In order to join weather and play-by-play data, I will need to manipulate the teams dataframe first.

In [8]:
teams_abbreviations = list(teams['Abbreviation'].unique())
df_abbreviations = list(df[['posteam']].sort_values('posteam')['posteam'].unique())

I need to check if the teams dataframe is missing any abbreviations used in the play-by-play dataframe, and vice versa.

In [9]:
for i in df_abbreviations:
    if i not in teams_abbreviations:
        print(i)

JAC
LA
LAC
nan


In [10]:
for i in teams_abbreviations:
    if i not in df_abbreviations:
        print(i)

The inconsistency arises for three teams. The LA Rams were formerly the St. Louis Rams, and the LA Chargers were formerly the San Diego Chargers. I was not sure why both JAC and JAX are used in the df, so I looked into it: there was actually a dispute over changing the abbreviation, which you can read about [here](https://www.espn.com/blog/jacksonville-jaguars/post/_/id/111/jags-fans-win-abbreviation-battle). The `nan` entry is a missing value rather than a team abbreviation, so nothing needs to be added for it. I need to add the other three to the teams dataframe.
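
The two membership loops can also be written as set differences. The lists below are shortened, illustrative stand-ins for the real abbreviation lists, so treat this as a sketch of the idea rather than the notebook's actual data.

```python
# Shortened, illustrative lists; the real ones come from the teams and play-by-play dataframes
teams_abbreviations = ['JAX', 'STL', 'SD', 'NE']
df_abbreviations = ['JAX', 'JAC', 'STL', 'LA', 'SD', 'LAC', 'NE']

# Abbreviations used in one source but absent from the other
missing_from_teams = sorted(set(df_abbreviations) - set(teams_abbreviations))
missing_from_df = sorted(set(teams_abbreviations) - set(df_abbreviations))

print(missing_from_teams)  # ['JAC', 'LA', 'LAC']
print(missing_from_df)     # []
```

With the real data, missing values in `posteam` would need to be dropped first (for example with `dropna()`), because `sorted` cannot compare `nan` with strings.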

In [11]:
# DataFrame.append was removed in pandas 2.0, so build the new rows and concatenate them instead
def alias_row(source_abbreviation, name, abbreviation):
    # Copy ID, Conference, and Division from the team's existing row
    old = teams[teams['Abbreviation'] == source_abbreviation].iloc[0]
    return {'ID': old['ID'], 'Name': name, 'Abbreviation': abbreviation,
            'Conference': old['Conference'], 'Division': old['Division']}

new_rows = pd.DataFrame([alias_row('STL', 'Los Angeles Rams', 'LA'),
                         alias_row('SD', 'Los Angeles Chargers', 'LAC'),
                         alias_row('JAX', 'Jacksonville Jaguars', 'JAC')])
teams = pd.concat([teams, new_rows], ignore_index=True)

Now I need to isolate the name for each team from the location since the weather dataframe uses the name. I then make a dictionary with the abbreviations as the keys and save the updated teams table.

In [12]:
teams['team'] = teams['Name'].str.split().str[-1]
teams_dict = dict(zip(teams['Abbreviation'],teams['team']))
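
To make the string handling concrete, here is the same idea in plain Python on a few illustrative rows (the names are stand-ins, not copied from the teams file). It relies on every nickname being a single word, which is an assumption about the data.

```python
# Illustrative rows; the real names come from the teams dataframe
names = {'STL': 'St. Louis Rams', 'LA': 'Los Angeles Rams', 'JAC': 'Jacksonville Jaguars'}

# The nickname is the last whitespace-separated word of the full name
nickname_by_abbreviation = {abbreviation: name.split()[-1]
                            for abbreviation, name in names.items()}

print(nickname_by_abbreviation)  # {'STL': 'Rams', 'LA': 'Rams', 'JAC': 'Jaguars'}
```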

In [13]:
teams.to_csv('../datasets/nfl_teams_updated.csv',index=False)

Next, I need to map the dictionary to the play-by-play dataframe.

In [14]:
df['away_team_name'] = df['away_team'].map(teams_dict)
df['home_team_name'] = df['home_team'].map(teams_dict)

The last step before the join is isolating the seasons for each game. I will be joining on home, away, and season. I could not just pull out the year for each game because every year some games in a season are played in January of the next year. For example, a game played in January of 2017 is actually a 2016 season game.

In [15]:
seasons = {range(200909,201002):2009, range(201009,201102):2010, range(201109,201202):2011, range(201209,201302):2012,
          range(201309,201402):2013, range(201409,201502):2014, range(201509,201602):2015, range(201609,201702): 2016,
          range(201709,201802):2017, range(201809,201902):2018}

In [16]:
df['season'] = df['game_id'].astype(str).str[0:6].astype(int)
df['season'] = df['season'].apply(lambda x: next((v for k, v in seasons.items() if x in k), 0))
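
The same rule can be expressed arithmetically instead of with a lookup of ranges. This is a sketch under the assumption, implied by the slicing above, that `game_id` starts with the game date as `YYYYMM`; `season_from_game_id` is a hypothetical helper and the two ids are made up for illustration.

```python
def season_from_game_id(game_id):
    """Return the season for a game_id that starts with the game date as YYYYMM."""
    text = str(game_id)
    year, month = int(text[:4]), int(text[4:6])
    # Games played before September belong to the season that began the previous year
    return year if month >= 9 else year - 1

print(season_from_game_id(2016091100))  # 2016
print(season_from_game_id(2017010100))  # 2016
```

One difference from the dictionary approach: any month from February through August is assigned to the previous season here, whereas the dictionary maps those months to 0.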

Finally I merge the dataframes.

In [17]:
play_by_play = pd.merge(weather,df,left_on = ['away','home','year'], right_on = ['away_team_name', 
        'home_team_name','season'],how='left').drop(columns=['away','home','away_team_name','home_team_name'])
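
A convenient way to see which weather rows fail to match is the `indicator` argument of `pd.merge`. The frames below are tiny, made-up examples that only mirror the column names used above; they are not the real data.

```python
import pandas as pd

# Tiny illustrative frames; column names mirror the ones used in the merge above
weather_demo = pd.DataFrame({'away': ['Titans', 'Bears'],
                             'home': ['Steelers', 'Packers'],
                             'year': [2009, 2018],
                             'forecast': ['72f Clear', 'DOME']})
plays_demo = pd.DataFrame({'away_team_name': ['Titans', 'Titans'],
                           'home_team_name': ['Steelers', 'Steelers'],
                           'season': [2009, 2009],
                           'game_id': [1, 1]})

merged = pd.merge(weather_demo, plays_demo, how='left', indicator=True,
                  left_on=['away', 'home', 'year'],
                  right_on=['away_team_name', 'home_team_name', 'season'])

# 'left_only' marks weather rows that found no play-by-play match
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['away', 'home', 'year']])
```

Because this is a left join from weather, each matched game repeats its weather row once per play, while an unmatched game contributes a single row with a null `game_id`.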

Now let's investigate the discrepancies between the two dataframes that I noted earlier. 

In [18]:
play_by_play['game_id'].isnull().sum()

34

In [19]:
play_by_play[play_by_play['game_id'].isnull()][['week','year','forecast']]

| | week | year | forecast |
|---|---|---|---|
| 151036 | 7 | 2012 | 67f A Few Clouds |
| 346567 | 12 | 2016 | 46f Clear |
| 449212 | 16 | 2018 | 50f Clear |
| 449213 | 16 | 2018 | 64f Clear |
| 449214 | 16 | 2018 | DOME |
| 449215 | 16 | 2018 | 43f Clear |
| 449216 | 16 | 2018 | 71f Clear |
| 449217 | 16 | 2018 | 41f Clear |
| 449218 | 16 | 2018 | 37f Overcast |
| 449219 | 16 | 2018 | 48f Clear |


This confirms that the play-by-play data is missing 34 games; the missing games are listed above. Most of them fall in weeks 16-17 of 2018, and when I look back at the Kaggle source, it does say the data only runs through week 15, so that accounts for 32 of the missing games. The other two games just seem to be missing at random.
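
Counting the unmatched rows by season and week makes the pattern easier to see. The sketch below uses only the ten rows displayed above, so it does not reproduce the full count of 34.

```python
from collections import Counter

# (year, week) pairs copied from the ten rows displayed above
displayed = [(2012, 7), (2016, 12)] + [(2018, 16)] * 8

counts = Counter(displayed)
for (year, week), games in sorted(counts.items()):
    print(year, week, games)
```

On the full dataframe, the equivalent would be something like `play_by_play[play_by_play['game_id'].isnull()].groupby(['year', 'week']).size()`.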

In [20]:
play_by_play['game_id'].nunique()

2525

The join also got rid of the play-by-play data for the game that weather was missing since the number of unique games decreased by 1. At this point, I am okay with missing 3 games (two randomly from play-by-play & one randomly from nflweather.com) and the final two weeks of 2018. I will drop the rows for these games.

In [21]:
play_by_play = play_by_play.drop(list(play_by_play[play_by_play['game_id'].isnull()].index))
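
An equivalent and slightly shorter way to drop these rows is `dropna` with a `subset`. The frame below is a made-up two-row example used only to show that both approaches give the same result.

```python
import pandas as pd

# Illustrative frame: the second row stands in for a weather row with no play-by-play match
frame = pd.DataFrame({'week': [1, 16], 'game_id': [2009091000.0, None]})

by_index = frame.drop(list(frame[frame['game_id'].isnull()].index))
by_dropna = frame.dropna(subset=['game_id'])

# Both approaches keep only the row with a game_id
print(by_index.equals(by_dropna))
```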

Finally, I uploaded the new file back to S3.

In [22]:
fs = s3fs.S3FileSystem(anon=False,key='AWS KEY',secret='AWS SECRET KEY')

bucket = 'nfl-play-by-play-capstone'

with fs.open(f'{bucket}/nfl_play_by_play_with_weather_2009_2018.csv','w') as f:
    play_by_play.to_csv(f)