---------- data_exploration.ipynb ---------- \
Time    :  2022/11/04 23:07:38 \
Version :  1.0 \
Author  :  Austin Villegas \
Github  :  https://github.com/anacrusis24 \
Contact :  ajv131@gmail.com \
Desc    :  Explores the NFL Big Data Bowl data, then cleans it and engineers features

# Contents
- [Plan](#plan) 
- [Libraries](#libraries) 
- [Load Data](#loaddata) 
- [Data Exploration](#dataexploration)
    - [Data Head](#datahead) 
    - [Shape and NaN](#shapeandnan)

<a id='plan'></a>
# Plan 

**11/04/2022**
1. Remove NaN 
2. Give justification for removal of NaN
3. Add feature engineering to Contents
4. Feature engineer to remove columns we don't need (university, jersey number, etc.)
5. Give justification for removal of columns
6. Add data visualization section to Contents
7. Make data visualizations to best understand data (ideas could be table of sacks, pressures, etc.)
8. Work on the general idea: "The QB is the most important position, so the next most important positions are those that can go after the QB." Which factors determine sacks and pressures, and what new stat would measure how well a player can get to the QB?
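
Steps 4 and 7 of the plan could start from something like the sketch below. It runs on a tiny hand-made frame and is only an illustration, not this notebook's eventual implementation. The column names follow `players.csv` and `pffScoutingData.csv`; the helper names `drop_unused_columns` and `pressure_table`, and the choice to count a pressure as hit + hurry + sack, are assumptions.

```python
import pandas as pd

def drop_unused_columns(df, columns):
    """Return a copy of df without the listed columns (missing names are ignored)."""
    return df.drop(columns=columns, errors="ignore")

def pressure_table(scout):
    """Sum hits, hurries and sacks per player and add a combined 'pressures' column."""
    stats = ["pff_hit", "pff_hurry", "pff_sack"]
    table = scout.groupby("nflId")[stats].sum()  # NaN entries are skipped by sum()
    table["pressures"] = table.sum(axis=1)       # assumed definition of a pressure
    return table.sort_values("pressures", ascending=False)

# Toy data for illustration only; the values are made up.
toy_players = pd.DataFrame({
    "nflId": [1, 2],
    "collegeName": ["A", "B"],
    "officialPosition": ["DE", "QB"],
})
toy_scout = pd.DataFrame({
    "nflId": [1, 1, 2],
    "pff_hit": [1.0, 0.0, None],
    "pff_hurry": [0.0, 1.0, None],
    "pff_sack": [0.0, 1.0, None],
})

slim_players = drop_unused_columns(toy_players, ["collegeName", "jerseyNumber"])
print(slim_players.columns.tolist())
print(pressure_table(toy_scout))
```

Because `sum()` skips NaN, a player with no recorded pass-rush stats ends up with 0 pressures here. Whether that is the right treatment depends on the NaN justification in steps 1 and 2.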

<a id='libraries'></a>
# Libraries

In [1]:
import os
import pandas as pd
import numpy as np 
import matplotlib.pyplot as pyplot
import seaborn as sns
from tqdm.auto import tqdm

<a id='loaddata'></a>
# Load Data

In [2]:
# Find the current working directory and switch to the data folder
print('current directory', os.getcwd())
new_path = os.path.join(os.getcwd(), 'nfl-big-data-bowl-2023')
os.chdir(new_path)
print('new directory', os.getcwd())

current directory c:\Users\Austin\src\nfl_big_data_bowl_2023
new directory c:\Users\Austin\src\nfl_big_data_bowl_2023\nfl-big-data-bowl-2023


In [3]:
# Load the data
games = pd.read_csv('games.csv')
scout = pd.read_csv('pffScoutingData.csv')
plays = pd.read_csv('plays.csv')
players = pd.read_csv('players.csv')
week_1 = pd.read_csv('week1.csv')
week_2 = pd.read_csv('week2.csv')
week_3 = pd.read_csv('week3.csv')
week_4 = pd.read_csv('week4.csv')
week_5 = pd.read_csv('week5.csv')
week_6 = pd.read_csv('week6.csv')
week_7 = pd.read_csv('week7.csv')
week_8 = pd.read_csv('week8.csv')
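
The eight `week*.csv` reads above follow one pattern, so they could also be written as a loop. The sketch below assumes the files are named `week{n}.csv` inside a single folder. `load_weeks` is a hypothetical helper, and the demonstration writes two tiny stand-in files to a temporary directory so that it runs without the real data.

```python
import os
import tempfile
import pandas as pd

def load_weeks(data_dir, weeks):
    """Read week{n}.csv for each n in weeks and return {n: DataFrame}."""
    frames = {}
    for n in weeks:
        path = os.path.join(data_dir, f"week{n}.csv")
        frames[n] = pd.read_csv(path)
    return frames

# Demonstration on two stand-in files; the contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2):
        pd.DataFrame({"gameId": [2021090900], "frameId": [n]}).to_csv(
            os.path.join(tmp, f"week{n}.csv"), index=False
        )
    demo = load_weeks(tmp, range(1, 3))

print({n: df.shape for n, df in demo.items()})
```

With the real files this would be called as `load_weeks('.', range(1, 9))`. A dict keyed by week number makes it easy to loop over all weeks or to combine them with `pd.concat`.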

<a id='dataexploration'></a>
# Data Exploration

<a id='datahead'></a>
## Data Head

In [4]:
games.head()

Unnamed: 0,gameId,season,week,gameDate,gameTimeEastern,homeTeamAbbr,visitorTeamAbbr
0,2021090900,2021,1,09/09/2021,20:20:00,TB,DAL
1,2021091200,2021,1,09/12/2021,13:00:00,ATL,PHI
2,2021091201,2021,1,09/12/2021,13:00:00,BUF,PIT
3,2021091202,2021,1,09/12/2021,13:00:00,CAR,NYJ
4,2021091203,2021,1,09/12/2021,13:00:00,CIN,MIN


In [5]:
scout.head()

Unnamed: 0,gameId,playId,nflId,pff_role,pff_positionLinedUp,pff_hit,pff_hurry,pff_sack,pff_beatenByDefender,pff_hitAllowed,pff_hurryAllowed,pff_sackAllowed,pff_nflIdBlockedPlayer,pff_blockType,pff_backFieldBlock
0,2021090900,97,25511,Pass,QB,,,,,,,,,,
1,2021090900,97,35481,Pass Route,TE-L,,,,,,,,,,
2,2021090900,97,35634,Pass Route,LWR,,,,,,,,,,
3,2021090900,97,39985,Pass Route,HB-R,,,,,,,,,,
4,2021090900,97,40151,Pass Block,C,,,,0.0,0.0,0.0,0.0,44955.0,SW,0.0


In [6]:
plays.head()

Unnamed: 0,gameId,playId,playDescription,quarter,down,yardsToGo,possessionTeam,defensiveTeam,yardlineSide,yardlineNumber,...,foulNFLId3,absoluteYardlineNumber,offenseFormation,personnelO,defendersInBox,personnelD,dropBackType,pff_playAction,pff_passCoverage,pff_passCoverageType
0,2021090900,97,(13:33) (Shotgun) T.Brady pass incomplete deep...,1,3,2,TB,DAL,TB,33,...,,43.0,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0,Cover-1,Man
1,2021090900,137,(13:18) (Shotgun) D.Prescott pass deep left to...,1,1,10,DAL,TB,DAL,2,...,,108.0,EMPTY,"1 RB, 2 TE, 2 WR",6.0,"4 DL, 4 LB, 3 DB",TRADITIONAL,0,Cover-3,Zone
2,2021090900,187,(12:23) (Shotgun) D.Prescott pass short middle...,1,2,6,DAL,TB,DAL,34,...,,76.0,SHOTGUN,"0 RB, 2 TE, 3 WR",6.0,"3 DL, 3 LB, 5 DB",TRADITIONAL,0,Cover-3,Zone
3,2021090900,282,(9:56) D.Prescott pass incomplete deep left to...,1,1,10,DAL,TB,TB,39,...,,49.0,SINGLEBACK,"1 RB, 2 TE, 2 WR",6.0,"4 DL, 3 LB, 4 DB",TRADITIONAL,1,Cover-3,Zone
4,2021090900,349,(9:46) (Shotgun) D.Prescott pass incomplete sh...,1,3,15,DAL,TB,TB,44,...,,54.0,SHOTGUN,"1 RB, 1 TE, 3 WR",7.0,"3 DL, 4 LB, 4 DB",TRADITIONAL,0,Cover-3,Zone


In [7]:
players.head()

Unnamed: 0,nflId,height,weight,birthDate,collegeName,officialPosition,displayName
0,25511,6-4,225,1977-08-03,Michigan,QB,Tom Brady
1,28963,6-5,240,1982-03-02,"Miami, O.",QB,Ben Roethlisberger
2,29550,6-4,328,1982-01-22,Arkansas,T,Jason Peters
3,29851,6-2,225,1983-12-02,California,QB,Aaron Rodgers
4,30078,6-2,228,1982-11-24,Harvard,QB,Ryan Fitzpatrick


In [8]:
week_1.head()

Unnamed: 0,gameId,playId,nflId,frameId,time,jerseyNumber,team,playDirection,x,y,s,a,dis,o,dir,event
0,2021090900,97,25511.0,1,2021-09-10T00:26:31.100,12.0,TB,right,37.77,24.22,0.29,0.3,0.03,165.16,84.99,
1,2021090900,97,25511.0,2,2021-09-10T00:26:31.200,12.0,TB,right,37.78,24.22,0.23,0.11,0.02,164.33,92.87,
2,2021090900,97,25511.0,3,2021-09-10T00:26:31.300,12.0,TB,right,37.78,24.24,0.16,0.1,0.01,160.24,68.55,
3,2021090900,97,25511.0,4,2021-09-10T00:26:31.400,12.0,TB,right,37.73,24.25,0.15,0.24,0.06,152.13,296.85,
4,2021090900,97,25511.0,5,2021-09-10T00:26:31.500,12.0,TB,right,37.69,24.26,0.25,0.18,0.04,148.33,287.55,


<a id='shapeandnan'></a>
## Shape and NaN

In [9]:
print('games:', games.shape)
print('scout:', scout.shape)
print('plays:', plays.shape)
print('players:', players.shape)
print('week_1:', week_1.shape)
print('week_2:', week_2.shape)
print('week_3:', week_3.shape)
print('week_4:', week_4.shape)
print('week_5:', week_5.shape)
print('week_6:', week_6.shape)
print('week_7:', week_7.shape)
print('week_8:', week_8.shape)

games: (122, 7)
scout: (188254, 15)
plays: (8557, 32)
players: (1679, 7)
week_1: (1118122, 16)
week_2: (1042774, 16)
week_3: (1121825, 16)
week_4: (1074606, 16)
week_5: (1097813, 16)
week_6: (973797, 16)
week_7: (906292, 16)
week_8: (978949, 16)


In [10]:
games.isna().sum(axis=0)[games.isna().sum(axis=0) > 0]

Series([], dtype: int64)

In [11]:
scout.isna().sum(axis=0)[scout.isna().sum(axis=0) > 0]

pff_hit                    94127
pff_hurry                  94127
pff_sack                   94127
pff_beatenByDefender      140167
pff_hitAllowed            140167
pff_hurryAllowed          140167
pff_sackAllowed           140167
pff_nflIdBlockedPlayer    141728
pff_blockType             140350
pff_backFieldBlock        140351
dtype: int64
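
`pff_hit`, `pff_hurry` and `pff_sack` are missing on 94127 rows, exactly half of the 188254 rows in `scout`. In the head of `scout`, the QB, route and pass-block rows have no values in those columns. That pattern is consistent with the NaNs being structural (the stat does not apply to the player's role) rather than lost data, though this notebook has not confirmed it. One way to check is to compute the share of missing values per `pff_role`, sketched below on a toy frame. `nan_share_by_group` is a hypothetical helper, and the `"Pass Rush"` role label is an assumption that does not appear in the output above.

```python
import pandas as pd

def nan_share_by_group(df, group_col, value_cols):
    """Fraction of missing values in each of value_cols, per value of group_col."""
    return df[value_cols].isna().groupby(df[group_col]).mean()

# Toy rows modelled on scout.head(); "Pass Rush" is an assumed role label.
toy_scout = pd.DataFrame({
    "pff_role": ["Pass", "Pass Route", "Pass Block", "Pass Rush"],
    "pff_sack": [None, None, None, 0.0],
    "pff_sackAllowed": [None, None, 0.0, None],
})

shares = nan_share_by_group(toy_scout, "pff_role", ["pff_sack", "pff_sackAllowed"])
print(shares)
```

If the real data shows shares of exactly 0 or 1 per role, the NaNs could be justified as "not applicable" and handled per role instead of being dropped wholesale.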

In [12]:
plays.isna().sum(axis=0)[plays.isna().sum(axis=0) > 0]

yardlineSide               125
penaltyYards              7801
foulName1                 7821
foulNFLId1                7821
foulName2                 8527
foulNFLId2                8527
foulName3                 8556
foulNFLId3                8556
absoluteYardlineNumber       1
offenseFormation             7
personnelO                   1
defendersInBox               7
personnelD                   1
dropBackType               528
dtype: int64

In [13]:
players.isna().sum(axis=0)[players.isna().sum(axis=0) > 0]

birthDate      232
collegeName    224
dtype: int64

In [14]:
week_1.isna().sum(axis=0)[week_1.isna().sum(axis=0) > 0]

nflId           48614
jerseyNumber    48614
o               48614
dir             48614
dtype: int64
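
The four counts are identical (48614), which suggests, but does not prove, that the same rows are missing all four fields. The sketch below checks that on a toy frame and lists the `team` values of the affected rows. `same_rows_missing` is a hypothetical helper, and the `"football"` team label is an assumption about how ball-tracking rows might be marked; it is not shown in the output above.

```python
import pandas as pd

def same_rows_missing(df, cols):
    """True if every row is missing either all of cols or none of them."""
    missing_per_row = df[cols].isna().sum(axis=1)
    return bool(missing_per_row.isin([0, len(cols)]).all())

cols = ["nflId", "jerseyNumber", "o", "dir"]

# Toy rows; the first mirrors week_1.head(), the second is made up.
toy_week = pd.DataFrame({
    "team": ["TB", "football"],  # "football" is an assumed label
    "nflId": [25511.0, None],
    "jerseyNumber": [12.0, None],
    "o": [165.16, None],
    "dir": [84.99, None],
})

print(same_rows_missing(toy_week, cols))
print(toy_week.loc[toy_week["nflId"].isna(), "team"].value_counts())
```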

In [15]:
week_2.isna().sum(axis=0)[week_2.isna().sum(axis=0) > 0]

nflId           45338
jerseyNumber    45338
o               45338
dir             45338
dtype: int64

In [16]:
week_3.isna().sum(axis=0)[week_3.isna().sum(axis=0) > 0]

nflId           48775
jerseyNumber    48775
o               48775
dir             48775
dtype: int64

In [18]:
week_4.isna().sum(axis=0)[week_4.isna().sum(axis=0) > 0]

nflId           46722
jerseyNumber    46722
o               46722
dir             46722
dtype: int64

In [19]:
week_5.isna().sum(axis=0)[week_5.isna().sum(axis=0) > 0]

nflId           47731
jerseyNumber    47731
o               47731
dir             47731
dtype: int64

In [20]:
week_6.isna().sum(axis=0)[week_6.isna().sum(axis=0) > 0]

nflId           42339
jerseyNumber    42339
o               42339
dir             42339
dtype: int64

In [21]:
week_7.isna().sum(axis=0)[week_7.isna().sum(axis=0) > 0]

nflId           39404
jerseyNumber    39404
o               39404
dir             39404
dtype: int64

In [22]:
week_8.isna().sum(axis=0)[week_8.isna().sum(axis=0) > 0]

nflId           42563
jerseyNumber    42563
o               42563
dir             42563
dtype: int64
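
The NaN checks above repeat one expression per table and compute `isna().sum()` twice each time. A small helper, sketched here on toy frames, computes the counts once and can be looped over any collection of tables. `nan_counts` and the frame names are hypothetical.

```python
import pandas as pd

def nan_counts(df):
    """Return the per-column NaN count, keeping only columns with at least one NaN."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Toy frames for illustration only.
toy_frames = {
    "complete": pd.DataFrame({"a": [1, 2]}),
    "partial": pd.DataFrame({"a": [1.0, None], "b": ["x", None], "c": [1, 2]}),
}

for name, frame in toy_frames.items():
    print(name, nan_counts(frame).to_dict())
```

In this notebook the dict could hold `games`, `scout`, `plays`, `players` and the eight weekly frames, giving the same information as the cells above in a single loop.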