This is the initial notebook for extracting data for the Big Data Bowl.

The purpose of this notebook is to load in the initial data, and extract the most relevant features first.

To do this, the data must be cleaned by removing irrelevant plays, events, and positions.

Output: two CSV files.

One (coverage_id) will be used for coverage identification, which gets its own notebook.

The other (df_merged) will be used to model completion probability and feeds the notebooks calculate_nearest_stats -> summarise_plays -> merge_clean_data.

In [1]:
library(tidyverse)
library(repr)
library(tm)
library(ggrepel)

options(warn=-1)

options(repr.plot.width=15, repr.plot.height = 10)

library(dplyr, warn.conflicts = FALSE)
# Suppress summarise info
options(dplyr.summarise.inform = FALSE)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Loading required package: NLP


Attaching package: ‘NLP’


The following object is masked from ‘package:ggplot2’:

    annotate


The following object is masked from ‘package:httr’:

    content




In [2]:
#includes play-by-play info on specific plays
df_plays <- read_csv("../input/nfl-big-data-bowl-2021/plays.csv",
                    col_types = cols())

#includes background info for players
df_players <- read_csv("../input/nfl-big-data-bowl-2021/players.csv",
                      col_types = cols())

#includes targeted receiver by play
df_targetedReceiver <- read_csv("../input/nfl-big-data-bowl-2021-bonus/targetedReceiver.csv",
                      col_types = cols())

#includes schedule info for games
df_games <- read_csv("../input/nfl-big-data-bowl-2021/games.csv",
                    col_types = cols())

df_plays <- inner_join(df_plays,
                      df_targetedReceiver,
                      by = c('playId', 'gameId'))
head(df_plays, 3)

gameId,playId,playDescription,quarter,down,yardsToGo,possessionTeam,playType,yardlineSide,yardlineNumber,⋯,gameClock,absoluteYardlineNumber,penaltyCodes,penaltyJerseyNumbers,passResult,offensePlayResult,playResult,epa,isDefensivePI,targetNflId
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,⋯,<time>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<lgl>,<dbl>
2018090600,75,(15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins).,1,1,15,ATL,play_type_pass,ATL,20,⋯,15:00:00,90,,,C,10,10,0.2618273,False,2495454
2018090600,146,"(13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks).",1,1,10,ATL,play_type_pass,PHI,39,⋯,13:10:00,49,,,I,0,0,-0.3723598,False,2560854
2018090600,168,(13:05) (Shotgun) M.Ryan pass incomplete short left to D.Freeman.,1,2,10,ATL,play_type_pass,PHI,39,⋯,13:05:00,49,,,I,0,0,-0.7027787,False,2543583


In [3]:
#weeks of NFL season
weeks <- seq(1, 17)

#blank dataframe to store tracking data
df_tracking <- data.frame()

#iterating through all weeks
for(w in weeks){
    
    #temporary dataframe used for reading week for given iteration
    df_tracking_temp <- read_csv(paste0("../input/nfl-big-data-bowl-2021/week",w,".csv"),
                                col_types = cols())
    
    #storing temporary dataframe in full season dataframe
    df_tracking <- bind_rows(df_tracking_temp, df_tracking)                            
    
}
nrow(df_tracking)
rm(df_tracking_temp)

In [4]:
# here I want to make the df much smaller to save ram space and only get the position of DBs, potential targets for the offense,
# and the ball (implying the thrower (QB) is holding it on these plays)
keeper_positions <- c('WR', 'RB', 'TE', 'FB', 'SS', 'FS', 'CB', 
                     'DB', 'HB', 'S', 'QB')

keeper_positions <- which(df_tracking$position %in% keeper_positions | df_tracking$displayName == 'Football')
df_tracking <- df_tracking[keeper_positions,]

In [5]:
banned_plays <- c('qb_sack', 'handoff', 'touchback',
                  'qb_strip_sack', 'pass_shovel', 'qb_spike',
                  'run', 'field_goal_blocked',
                  'punt_fake', 'pass_lateral', 'lateral',
                  'field_goal_fake', 'safety', 'field_goal_play')

necessary_events = c('pass_outcome_touchdown', 'pass_outcome_caught',
                    'pass_outcome_incomplete', 'pass_outcome_interception',
                    'pass_arrived')
# create function to remove undesired plays
remove_badPlays <- function(play_df){
    event_match <- sum('pass_forward' %in% unique(play_df$event))
    # if pass_forward is not in the play_df return NA
    if(event_match == 0){
        return(NA)
    }
    event_match <- sum(necessary_events %in% unique(play_df$event))
    # if one of the necessary events is not in the play_df return NA
    if(event_match == 0){
        return(NA)
    }
    # if any of the banned events are in the play return NA
    if(sum(banned_plays %in% unique(play_df$event)) > 0){
        return(NA)
    }
    return(play_df)
}
# create function to remove NA values from the created list from remove_badPlays
na.omit.list <- function(y) { return(y[!sapply(y, function(x) all(is.na(x)))]) }

In [6]:
temp_df <- df_tracking %>%

group_split(gameId, playId)
# apply remove_badPlays to each play
temp_df <- lapply(temp_df, remove_badPlays)
# get rid of banned plays
temp_df <- na.omit.list(temp_df)

In [7]:
# cut down dfs by player for their movements from ball_snap to the arrival/outcome
slim_ind_player <- function(player_df){
    # first frame tagged as the snap (NA if the play has none)
    ball_snap_index <- which(player_df$event == 'ball_snap')[1]

    # first frame tagged with a pass arrival/outcome event; reusing necessary_events
    # keeps the event names identical to those checked in remove_badPlays
    # (completed passes are tagged 'pass_outcome_caught')
    outcome_index <- which(player_df$event %in% necessary_events)[1]

    if(is.na(ball_snap_index) || is.na(outcome_index)){
        return(NA)
    }
    return(player_df[ball_snap_index:outcome_index,])
}
# apply the slimming function to each play in the temp_df created above
# so, we're taking a list of individual plays and grouping by each player
# then we're applying the slim to the player df
# finally we bind all the rows back together
get_snap_to_outcome <- function(play_df){
    temp_lst <- play_df %>%
    
    group_split(nflId)
    
    temp_lst <- lapply(temp_lst, slim_ind_player)
    
    temp_lst <- na.omit.list(temp_lst)
    
    play_df <- bind_rows(temp_lst)
    
    return(play_df)
}

In [8]:
temp_df <- lapply(temp_df, get_snap_to_outcome)
length(temp_df)
temp_df <- na.omit.list(temp_df)
length(temp_df)

In [9]:
df_tracking <- bind_rows(temp_df)
rm(temp_df)
nrow(df_tracking)

In [10]:
#Standardizing tracking data so it's always in the direction of the offense rather than raw on-field coordinates.
df_tracking <- df_tracking %>%
                mutate(x = ifelse(playDirection == "left", 120-x, x),
                       y = ifelse(playDirection == "left", 160/3 - y, y),
                       o_std = ((-o + 90) %% 360) * 3.1415 / 180.0) # o_std is in radians

In [11]:
#merging games and plays data
df_merged <- inner_join(df_games,
                        df_plays,
                        by = c("gameId" = "gameId"))

#merging tracking data to previously merged frame
df_merged <- inner_join(df_merged,
                        df_tracking,
                        by = c("gameId", "playId"))

In [12]:
# clear space on RAM
rm(df_tracking)
rm(df_plays)
rm(df_targetedReceiver)
rm(df_games)

In [13]:
df_merged <- df_merged %>%

select( gameId, playId, passResult, targetNflId, time, x, y, o_std, o, dir,
       s, a, dis, event, epa, frameId, nflId, displayName, position, route,
       team, homeTeamAbbr, visitorTeamAbbr, possessionTeam, isDefensivePI) %>%

mutate(targetedReceiver = ifelse(nflId == targetNflId, 1, 0),
       team_name = case_when(team == 'away' ~ visitorTeamAbbr,
                            team == 'home' ~ homeTeamAbbr,
                            team == 'football' ~ 'ball'))

head(df_merged)

gameId,playId,passResult,targetNflId,time,x,y,o_std,o,dir,⋯,displayName,position,route,team,homeTeamAbbr,visitorTeamAbbr,possessionTeam,isDefensivePI,targetedReceiver,team_name
<dbl>,<dbl>,<chr>,<dbl>,<dttm>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>
2018090600,75,C,2495454,2018-09-07 01:07:15,28.26,26.66333,3.012175,277.41,235.01,⋯,Matt Ryan,QB,,away,PHI,ATL,ATL,False,0,ATL
2018090600,75,C,2495454,2018-09-07 01:07:15,28.24,26.65333,2.92142,282.61,147.8,⋯,Matt Ryan,QB,,away,PHI,ATL,ATL,False,0,ATL
2018090600,75,C,2495454,2018-09-07 01:07:15,28.22,26.65333,2.855798,286.37,83.42,⋯,Matt Ryan,QB,,away,PHI,ATL,ATL,False,0,ATL
2018090600,75,C,2495454,2018-09-07 01:07:15,28.21,26.65333,2.804487,289.31,87.85,⋯,Matt Ryan,QB,,away,PHI,ATL,ATL,False,0,ATL
2018090600,75,C,2495454,2018-09-07 01:07:16,28.16,26.65333,2.755968,292.09,91.72,⋯,Matt Ryan,QB,,away,PHI,ATL,ATL,False,0,ATL
2018090600,75,C,2495454,2018-09-07 01:07:16,28.06,26.66333,2.669926,297.02,95.13,⋯,Matt Ryan,QB,,away,PHI,ATL,ATL,False,0,ATL


In [14]:
# convert positioning of ball in every frame to columns
ballPositioning <- df_merged %>%

select(gameId, playId, frameId, x, y, team) %>%

filter(team == 'football') %>%

select(gameId, playId, frameId, x, y)

colnames(ballPositioning) <- list('gameId', 'playId', 'frameId', 'ball_x', 'ball_y')

df_merged <- left_join(df_merged,
                       ballPositioning,
                       by=c('gameId', 'playId', 'frameId'))
rm(ballPositioning)

In [15]:
line_of_scrimmage <- df_merged %>%

select(gameId, playId, x, team, event) %>%

filter(event == 'ball_snap' & team == 'football') %>%

select(gameId, playId, x)

colnames(line_of_scrimmage) <- list('gameId', 'playId', 'ball_snap_x')

line_of_scrimmage <- line_of_scrimmage %>%

mutate(max_space_available = 120 - ball_snap_x)

df_merged <- left_join(df_merged,
                       line_of_scrimmage,
                       by=c('gameId', 'playId'))

rm(line_of_scrimmage)

In [16]:
# convert positioning of target in every frame to columns
targetPositioning <- df_merged %>%

select(gameId, playId, frameId, x, y, targetedReceiver, dir) %>%

filter(targetedReceiver == 1) %>%

select(gameId, playId, frameId, x, y, dir)

colnames(targetPositioning) <- list('gameId', 'playId', 'frameId', 'target_x', 'target_y', 'target_dir')

df_merged <- left_join(df_merged,
                       targetPositioning,
                       by=c('gameId', 'playId', 'frameId'))

rm(targetPositioning)

In [17]:
# convert positioning, speed, and acceleration of QB in every frame to columns
qb_positioning <- df_merged %>%

select(gameId, playId, frameId, x, y, position, s, a) %>%

filter(position == 'QB') %>%

select(gameId, playId, frameId, x, y, s, a)

colnames(qb_positioning) <- list('gameId', 'playId', 'frameId', 'qb_x', 'qb_y', 'qb_s', 'qb_a')

df_merged <- left_join(df_merged,
                       qb_positioning,
                       by=c('gameId', 'playId', 'frameId'))

rm(qb_positioning)

In [18]:
keeper_positions <- c('WR', 'RB', 'TE', 'FB', 'SS', 'FS', 'CB', 
                     'DB', 'HB', 'S')

keeper_positions <- which(df_merged$position %in% keeper_positions)
df_merged <- df_merged[keeper_positions,]
head(df_merged)
colnames(df_merged)

gameId,playId,passResult,targetNflId,time,x,y,o_std,o,dir,⋯,ball_y,ball_snap_x,max_space_available,target_x,target_y,target_dir,qb_x,qb_y,qb_s,qb_a
<dbl>,<dbl>,<chr>,<dbl>,<dttm>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2018090600,75,C,2495454,2018-09-07 01:07:15,31.11,16.83333,6.030109,104.49,36.45,⋯,26.48333,29.89,90.11,28.64,9.193333,49.86,28.26,26.66333,0.0,0.0
2018090600,75,C,2495454,2018-09-07 01:07:15,31.11,16.82333,6.030109,104.49,37.08,⋯,26.48333,29.89,90.11,28.65,9.193333,39.2,28.24,26.65333,0.0,0.0
2018090600,75,C,2495454,2018-09-07 01:07:15,31.11,16.82333,6.065887,102.44,34.0,⋯,26.51333,29.89,90.11,28.68,9.193333,321.51,28.22,26.65333,0.03,0.82
2018090600,75,C,2495454,2018-09-07 01:07:15,31.11,16.83333,6.111963,99.8,278.87,⋯,26.51333,29.89,90.11,28.72,9.193333,270.73,28.21,26.65333,0.22,2.24
2018090600,75,C,2495454,2018-09-07 01:07:16,31.12,16.82333,6.124703,99.07,288.04,⋯,26.50333,29.89,90.11,28.76,9.203333,264.53,28.16,26.65333,0.61,3.46
2018090600,75,C,2495454,2018-09-07 01:07:16,31.13,16.81333,6.136746,98.38,317.54,⋯,26.47333,29.89,90.11,28.84,9.213333,263.56,28.06,26.66333,1.18,4.58


In [19]:
convert_orientation <- function(arctan){
    arctan <- arctan %% 360
    return(arctan)
}

In [20]:
df_merged <- df_merged %>%

mutate(
distanceFromBall = sqrt((x-ball_x)^2 + (y-ball_y)^2), # euclidean distance from the ball
angleToBall =  if_else( team == 'football', 0, atan((y - ball_y) / (x - ball_x)) * 57.295779 ),# must convert from radians to degrees
angleToBall2 = if_else( team == 'football', 0, atan2((y - ball_y), (x - ball_x)) * 57.295779 ),
angleToBall360 = if_else(angleToBall2 > 90, 450-angleToBall2, 90 - angleToBall2),
distanceFromTarget = if_else( targetedReceiver == 1, 0, sqrt((x-target_x)^2 + (y-target_y)^2) ),
angleToTarget = if_else( team == 'football', 0, atan2((y - target_y), (x - target_x)) * 57.295779 ),
qb_slope = convert_orientation(angleToBall2), # slope of line between QB and DB
wr_slope = convert_orientation(angleToTarget),
defender_o = convert_orientation(o_std * 57.295779),
diff_qb = qb_slope - defender_o,
diff_wr = wr_slope - defender_o,
diff_qb = if_else(diff_qb < -180, diff_qb + 360, diff_qb), # these prevent the angles from failing when defender is behind the ball
diff_qb = if_else(diff_qb > 180, diff_qb - 360, diff_qb), # ie angle = -180 needs to be = 180 bc thats the orientation direction
diff_wr = if_else(diff_wr < -180, diff_wr + 360, diff_wr),
diff_wr = if_else(diff_wr > 180, diff_wr - 360, diff_wr),
look_at_qb = if_else(diff_qb < diff_wr, 1, 0),
distance_from_qb = sqrt((x-qb_x)^2 + (y-qb_y)^2)
)

In [21]:
colnames(df_merged)

In [22]:
coverage_id <- df_merged %>%

select(gameId, playId, frameId, x, y, o_std, o, dir, s, a, event, nflId, displayName, target_dir, distanceFromBall, angleToBall,
      angleToBall2, angleToBall360, distanceFromTarget, qb_slope, wr_slope, defender_o, diff_qb, look_at_qb, distance_from_qb,
      possessionTeam, team_name) %>%

filter(possessionTeam != team_name)

write.csv(coverage_id, 'coverage_id.csv')

In [23]:
df_merged <- subset(df_merged, select = -c(time, o_std, dis, homeTeamAbbr, visitorTeamAbbr,
                                          target_dir, angleToBall, angleToBall2, angleToBall360,
                                           qb_slope, wr_slope, defender_o, diff_qb, look_at_qb))
head(df_merged)
colnames(df_merged)
nrow(df_merged)

gameId,playId,passResult,targetNflId,x,y,o,dir,s,a,⋯,target_y,qb_x,qb_y,qb_s,qb_a,distanceFromBall,distanceFromTarget,angleToTarget,diff_wr,distance_from_qb
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2018090600,75,C,2495454,31.11,16.83333,104.49,36.45,0.01,0.01,⋯,9.193333,28.26,26.66333,0.0,0.0,9.726813,8.029352,72.08406,86.58425,10.23481
2018090600,75,C,2495454,31.11,16.82333,104.49,37.08,0.01,0.01,⋯,9.193333,28.24,26.65333,0.0,0.0,9.736735,8.016764,72.13008,86.63027,10.2404
2018090600,75,C,2495454,31.11,16.82333,102.44,34.0,0.0,0.01,⋯,9.193333,28.22,26.65333,0.03,0.82,9.762797,8.007609,72.33438,84.78463,10.24602
2018090600,75,C,2495454,31.11,16.83333,99.8,278.87,0.01,0.18,⋯,9.193333,28.21,26.65333,0.22,2.24,9.814622,8.005105,72.6289,82.43923,10.23926
2018090600,75,C,2495454,31.12,16.82333,99.07,288.04,0.07,0.41,⋯,9.203333,28.16,26.65333,0.61,3.46,9.884452,7.977092,72.79166,81.87201,10.26599
2018090600,75,C,2495454,31.13,16.81333,98.38,317.54,0.14,0.61,⋯,9.213333,28.06,26.66333,1.18,4.58,9.927724,7.937512,73.23162,81.62199,10.31733


In [24]:
write.csv(df_merged, 'df_merged.csv')

Output derived from this can be joined back to df_merged by gameId, playId, frameId, and nflId (frameId restarts on every play, so playId is needed in the key). Use one game as a sample to make sure the merge is right. The xComp (expected completion probability) model can be fit on all instances of the pass arriving.
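
A minimal sketch of that single-game merge check, using toy data rather than the real files. `df_merged_sample`, `nearest_stats`, and the `nearest_defender_dist` column are hypothetical stand-ins for whatever the downstream notebooks produce, and `playId` is included in the key on the assumption that `frameId` restarts on every play.

```r
library(dplyr)

# Toy stand-in for one game of df_merged: two plays, two frames, two players
df_merged_sample <- expand.grid(gameId = 2018090600,
                                playId = c(75, 146),
                                frameId = 1:2,
                                nflId = c(2495454, 2560854))

# Hypothetical per-player, per-frame output of a later notebook
nearest_stats <- df_merged_sample %>%
    mutate(nearest_defender_dist = seq_len(n()))

joined <- left_join(df_merged_sample, nearest_stats,
                    by = c('gameId', 'playId', 'frameId', 'nflId'))

# A correct merge neither adds nor drops rows, and every row finds a match
stopifnot(nrow(joined) == nrow(df_merged_sample))
stopifnot(!any(is.na(joined$nearest_defender_dist)))
```

If the row count grows after the join, the key is probably missing a column and rows are being duplicated.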

Training on all receivers would skew the data heavily toward 0s, so train only on targeted receivers. The logic is that the model should answer: if the ball were thrown to any given receiver at a given instant, what is the xComp?

So, create the whole df then filter by event to fit the model.

Get features at all instances for consistency and ease of calculation later.
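
The filtering step described above might look roughly like the sketch below. It runs on a toy frame, not the real data: `features`, `train_df`, and the `complete` label are illustrative names, the `'None'` event is only a placeholder for non-event frames, and the assumption is that training rows are the targeted receiver at the `pass_arrived` frame with `passResult == 'C'` marking a completion.

```r
library(dplyr)

# Toy stand-in for the full feature frame (all receivers, all frames)
features <- tibble(
    gameId = 2018090600,
    playId = c(75, 75, 75, 75),
    frameId = c(10, 10, 11, 11),
    nflId = c(1, 2, 1, 2),
    event = c('None', 'None', 'pass_arrived', 'pass_arrived'),
    targetedReceiver = c(0, 1, 0, 1),
    passResult = 'C'
)

# Training rows: only the targeted receiver, only at the pass-arrival frame
train_df <- features %>%
    filter(event == 'pass_arrived', targetedReceiver == 1) %>%
    mutate(complete = if_else(passResult == 'C', 1, 0))

train_df
```

Because the features exist for every receiver in every frame, a model fit on `train_df` could later be scored on all rows of `features` without recomputing anything.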

In [25]:
# write.csv(df_merged, 'full_tracking_df.csv')