# Alternate Analysis

In [1]:
import pandas as pd
import numpy as np
pd.set_option('mode.chained_assignment', None)

In [2]:
playlist = pd.read_csv("NFL_Turf/PlayList.csv")
injuries = pd.read_csv("NFL_Turf/InjuryRecord.csv")

## Cleaning Playlist Data

***Analyzing the Plays from the PlayList file***

- The first thing to note is that this list contains all of the plays, including the exact play that will match with the injury list, therefore anything that is on both with the exception of the PlayerKey should be maintained on THIS DF so that we don't lose data on the non-injury columns
- In order to separate the files to do predictive analysis on ONLY the injuries, there will be two output files, one with an outer merge that maintains the non-injuty data and one with an inner merge that only keeps data associated with injury
- PlayKey will be used as the Key to merge the datasets, so PlayerKey and GameID can be removed. While FieldType information is also in the surface column of the injuries table, we need to maintain it here, so we don't lose the data from the columns not containing injuries. 

### The Dataset

- PlayKeys represent all plays, not only those where injuries occurred - these will function to merge the tables
- FieldType only has 2 values, Natural or Synthetic and can be easily changed to binary values 
- Stadium Type is also strange with 29 unique types of stadiums. These will be grouped as either Outdoor, or Indoor
- Games played in retractable roof stadiums with Open Domes are Outdoor, Closed Domes are Indoor
- Weather - there are 63 unique types of weather.... this is odd
- RosterPosition, Position, and Position Group are all similar and need to be investigated
- PlayTypes should be encoded, as they are categorical such as pass, rush, kick, ... 

### Encoding the Data

- Binary Encoding can happen for FieldType and StadiumType
- For positions, plays, and weather, need to consider whether it is better to use dummies/OneHotEncoder or use numerical values in a single column

In [3]:
playlist.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624,26624-1,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624,26624-1,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB


## Transformation Functions

### Playlist Functions

1. Surface_Coder - takes the Dataframe input and creates a new column in the df called SyntheticField with a 1 for True and 0 for false, or Natural Field
2. Stadium_Coder - takes the DF and replaces the StadiumTypes from having 29 stadium types to 2, either Indoor or Outdoor. Also creates a new numerical column called 'Outdoor' with binary values where 1 is True (Outdoor) and 0 is false (Indoor)
3. Temperature_Adjuster - takes rows from df where the temperature was recorded as -999 degrees. For all stadiums that are indoor, this temperature is set to 70 degrees. The others are removed from the dataframe
4. Weather_Coder - creates categories and groups the 63 weather types into 7 (Indoor, Clear, Cloudy, Windy, Hazy/Fog, Rain, Snow). Also creates a new column called 'precipitation' where 1 is True and denotes that there was rain or snow, and 0 that there was not. 
5. Position_Coder - This changes the positions from string to numeric using a full list of NFL positions to accommodate for any future injuries or players in the data. This applies to both the RosterPosition column and the Positions column
6. PlayerDay_Adjuster - The minimum playerday was -62, which causes problems in some of the analysis expecting all positive numbers. This function simply adds 63 days to all players
7. Play_Coder - Changes all plays involving the special player for kicking plays to 'Kick', so now there are only 3 categories for the plays. Also creates a new column called PlayCode with numerical encoding for the 3 play types
8. Process_Playlist_Data - Applies all of the Playlist function to clean and code the Playlist data, outputs cleaned df

### InjuryRecord Functions

1. Injury_Coder - assigns a numerical value to each of the injuries from BodyPart
2. Injury_Duration_Classifier - uses the 4 duration columns to return a list of minimum days injured
3. Injury_Duration_Coder - applies the duration code to create the column as well as to create an additional Severe column, where all injuries over 28 days are considered severe
4. Process_Injury_Data - Applies all of the Injury functions to the clean and code the InjuryRecord data, outputs cleaned df



In [4]:
def Surface_Coder(df):
    # Surface_Coder: Function that encodes the Field Surface to identify natural or synthetic
    surface_map = {
        'Natural': 0,
        'Synthetic': 1
    }

    df['SyntheticField'] = df.FieldType.map(surface_map)

    return df    

In [5]:
def Stadium_Coder(df):
    # Stadium_Coder: This function changes the stadium type to either Outdoor or Indoor, maintaining the categorical label
    df.StadiumType.fillna('Outdoor', inplace=True)
    
    dict = {'Outdoor': 'Outdoor',
        'Indoors': 'Indoor',
        'Oudoor': 'Outdoor',
        'Outdoors': 'Outdoor',
        'Open': 'Outdoor',
        'Closed Dome': 'Indoor',
        'Domed, closed': 'Indoor',
        'Dome': 'Indoor',
        'Indoor': 'Indoor',
        'Domed': 'Indoor',
        'Retr. Roof-Closed': 'Indoor',
        'Outdoor Retr Roof-Open': 'Outdoor',
        'Retractable Roof': 'Indoor',
        'Ourdoor': 'Outdoor',
        'Indoor, Roof Closed': 'Indoor',
        'Retr. Roof - Closed': 'Indoor',
        'Bowl': 'Outdoor',
        'Outddors': 'Outdoor',
        'Retr. Roof-Open': 'Outdoor',
        'Dome, closed': 'Indoor',
        'Indoor, Open Roof': 'Outdoor',
        'Domed, Open': 'Outdoor',
        'Domed, open': 'Outdoor',
        'Heinz Field': 'Outdoor',
        'Cloudy': 'Outdoor',
        'Retr. Roof - Open': 'Outdoor',
        'Retr. Roof Closed': 'Indoor',
        'Outdor': 'Outdoor',
        'Outside': 'Outdoor'}

    df.StadiumType.replace(dict, inplace=True)


    # Create a new column with stadiums coded numerically
    stadium = {
        'Outdoor': 1, 
        'Indoor': 0
    }
    
    # Map the stadiumtype for outdoor as 1 = True and 0 = false
    df['Outdoor'] = df.StadiumType.map(stadium)

    return df

In [6]:
def Temperature_Adjuster(df):
    # Temperature_Adjuster: This function also fixes the -999 temperature issue for all indoor stadiums
    
    # Fix the temperature from -999 at any indoor stadium to 70
    df['Temperature'] = np.where(
        (df['Temperature'] == -999) & (df['StadiumType'] == 'Indoor'), 70, df.Temperature)

    # Extract all values that are not -999 degrees
    df_filtered = df[df['Temperature'] != -999]

    return df_filtered


In [7]:
def Weather_Coder(df):
    # Weather_Coder: This function changes the weather into a smaller subset of categorical groups
    weather_dict = {'Clear and warm': 'Clear',
                    'Mostly Cloudy': 'Cloudy',
                    'Sunny': 'Clear',
                    'Clear': 'Clear',
                    'Cloudy': 'Cloudy',
                    'Cloudy, fog started developing in 2nd quarter': 'Hazy/Fog',
                    'Rain': 'Rain',
                    'Partly Cloudy': 'Cloudy',
                    'Mostly cloudy': 'Cloudy',
                    'Cloudy and cold': 'Cloudy',
                    'Cloudy and Cool': 'Cloudy',
                    'Rain Chance 40%': 'Rain',
                    'Controlled Climate': 'Indoor',
                    'Sunny and warm': 'Clear',
                    'Partly cloudy': 'Cloudy',
                    'Clear and Cool': 'Cloudy',
                    'Clear and cold': 'Cloudy',
                    'Sunny and cold': 'Clear',
                    'Indoor': 'Indoor',
                    'Partly Sunny': 'Clear',
                    'N/A (Indoors)': 'Indoor',
                    'Mostly Sunny': 'Clear',
                    'Indoors': 'Indoor',
                    'Clear Skies': 'Clear',
                    'Partly sunny': 'Clear',
                    'Showers': 'Rain',
                    'N/A Indoor': 'Indoor',
                    'Sunny and clear': 'Clear',
                    'Snow': 'Snow',
                    'Scattered Showers': 'Rain',
                    'Party Cloudy': 'Cloudy',
                    'Clear skies': 'Clear',
                    'Rain likely, temps in low 40s.': 'Rain',
                    'Hazy': 'Hazy/Fog',
                    'Partly Clouidy': 'Cloudy',
                    'Sunny Skies': 'Clear',
                    'Overcast': 'Cloudy',
                    'Cloudy, 50% change of rain': 'Cloudy',
                    'Fair': 'Clear',
                    'Light Rain': 'Rain',
                    'Partly clear': 'Clear',
                    'Mostly Coudy': 'Cloudy',
                    '10% Chance of Rain': 'Cloudy',
                    'Cloudy, chance of rain': 'Cloudy',
                    'Heat Index 95': 'Clear',
                    'Sunny, highs to upper 80s': 'Clear',
                    'Sun & clouds': 'Cloudy',
                    'Heavy lake effect snow': 'Snow',
                    'Mostly sunny': 'Clear',
                    'Cloudy, Rain': 'Rain',
                    'Sunny, Windy': 'Windy',
                    'Mostly Sunny Skies': 'Clear',
                    'Rainy': 'Rain',
                    '30% Chance of Rain': 'Rain',
                    'Cloudy, light snow accumulating 1-3"': 'Snow',
                    'cloudy': 'Cloudy',
                    'Clear and Sunny': 'Clear',
                    'Coudy': 'Cloudy',
                    'Clear and sunny': 'Clear',
                    'Clear to Partly Cloudy': 'Clear',
                    'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.': 'Windy',
                    'Rain shower': 'Rain',
                    'Cold': 'Clear'}


    df.Weather.replace(weather_dict, inplace=True)

    # There are still na values within the weather group that need to be addressed
    df.loc[df.StadiumType == 'Indoor', 'Weather'] = df.loc[df.StadiumType == 'Indoor', 'Weather'].fillna('Indoor')

    # Because we can't make a determination on the type of weather for outdoor, drop the remaining na values
    df = df.loc[df.Weather.isna() == False]


       
    return df

In [8]:
def Precipitation_Coder(df):
    # Add a column for the presence of precipitation, that will ultimately be used for numerical analysis of the weather.
    precipitation = {
        'Indoor': 0,
        'Clear': 0,
        'Cloudy': 0,
        'Windy': 0,
        'Hazy/Fog': 0,
        'Rain': 1,
        'Snow': 1
    }

    df['Precipitation'] = df.Weather.map(precipitation)
    return df

In [149]:
def ML_Position_Coder(df): 
    # ML_Position_Coder: This function encodes the players by position and rosterposition
    df['Position'] = np.where(df['Position'] == 'Missing Data', df['RosterPosition'], df['Position'])

    position = {
        'Quarterback': 0,
        'QB': 0,
        'Running Back': 1,
        'RB': 1,
        'FB': 2, 
        'Wide Receiver': 3,
        'WR': 3,
        'Tight End': 4,
        'TE': 4,
        'Offensive Lineman': 5,
        'OL': 5,
        'C': 6,
        'G': 7,
        'LG': 8,
        'RG': 9, 
        'T': 10, 
        'LT': 11, 
        'RT': 12, 
        'Kicker': 13,
        'K': 13,
        'KR': 14, 
        'Defensive Lineman': 15,
        'DL': 15,
        'DE': 16,
        'DT': 17, 
        'NT': 18, 
        'Linebacker': 19,
        'LB': 19,
        'OLB': 20,
        'ILB': 21,
        'MLB': 22,
        'DB': 23,
        'Cornerback': 24,
        'CB': 24,
        'Safety': 25,
        'S': 25,
        'SS': 26,
        'FS': 27,
        'P': 28,
        'PR': 29, 
        'HB': 30
    }

    df.RosterPosition.replace(position, inplace=True)
    df.Position.replace(position, inplace=True)
    df.Position.astype(int)
    df.drop(columns='PositionGroup', inplace=True)

    return df

In [10]:
def Vis_Position_Coder(df):
    # Vis_Position_Coder: This function replaces nan position with rosterposition and drops the position group column
    df['Position'] = np.where(df['Position'] == 'Missing Data', df['RosterPosition'], df['Position'])
    
    position = {
        'Quarterback': 'QB',
        'Running Back': 'RB',
        'Wide Receiver': 'WR',
        'Tight End': 'TE',
        'Offensive Lineman': 'OL',
        'Kicker': 'K',
        'Defensive Lineman': 'DL',
        'Linebacker': 'LB',
        'Cornerback': 'CB',
        'Safety': 'S'
         }
    
    df.Position.replace(position, inplace=True)
    df.drop(columns='PositionGroup', inplace=True)
    return df

In [11]:
def PlayerDay_Adjuster(df):
    # PlayerDay_Adjuster: This function adjusts the player day to remove the negative values
    df.assign(DaysPlayed = lambda x: x['PlayerDay'] + 63)

    return df

In [155]:
def Play_Coder(df):
    # Play_Coder: This function creates a categorical grouping for the different types of plays, grouping into passing, rushing, or kicking plays

    play_type = {
        'Pass': 'Pass',
        'Rush': 'Rush',
        'Extra Point': 'Kick',
        'Kickoff': 'Kick',
        'Punt': 'Kick',
        'Field Goal': 'Kick',
        'Kickoff Not Returned': 'Kick',
        'Punt Not Returned': 'Kick',
        'Kickoff Returned': 'Kick',
        'Punt Returned': 'Kick',
        '0': 'Kick'
    }

    play_map = {
        'Pass': 0, 
        'Rush': 1, 
        'Kick': 2
    }

    df.PlayType.replace(play_type, inplace=True)
    df['PlayCode'] = df.PlayType.map(play_map)

    df = df.loc[df.PlayType.isna() == False]
    df.PlayCode.astype(int)

    return df

In [121]:
def ML_Process_Playlist_Data(df):
    # ML_Process_Playlist_Data: Create cleaning function to apply all of the other functions to the single df input for Machine Learning
    df = Surface_Coder(df)
    df = Stadium_Coder(df)
    df = Temperature_Adjuster(df)
    df = Weather_Coder(df)
    df = Precipitation_Coder(df)
    df = ML_Position_Coder(df)
    df = PlayerDay_Adjuster(df)
    df = Play_Coder(df)

    df.drop(columns=['PlayerKey', 'GameID', 'StadiumType',
            'FieldType', 'Weather', 'PlayType'], inplace=True)
    return df

In [14]:
def Vis_Process_Playlist_Data(df):
    # Vis_Process_Playlist_Data: Create cleaning function to apply all of the other functions to the single df input for visualizations

    df = Stadium_Coder(df)
    df = Temperature_Adjuster(df)
    df = Weather_Coder(df)
    df = Vis_Position_Coder(df)
    df = PlayerDay_Adjuster(df)
    df = Play_Coder(df)

    df.drop(columns=['PlayerKey', 'GameID', 'Outdoor', 'PlayCode'], inplace=True)
    return df

## Cleaning Injury List Data

- Depending on the Join, there will be groupings with NoInjury, which needs to be coded as well as the others
- The DM_M# columns represent the minimum number of days that the player was out - this can be turned into a continuous data type
- Can also create a binary output for Severe, over 28 days, and Less Severe for under
- There are some playKeys that are NAN and need to be removed, since there is no indication when or how the injury occurred
- Surface can be removed, since this will be merged with the other table that already contains this information

In [163]:
def Injury_Coder(df):
    # Injury_Coder: This function codifies the injury types based on their frequency of occurrence, adding this as a new column called "InjuryType"

    knee_freq = df.BodyPart.value_counts()['Knee']
    ankle_freq = df.BodyPart.value_counts()['Ankle']
    foot_freq = df.BodyPart.value_counts()['Foot']
        
    injury_map = {
        'NoInjury': 0, 
        'Foot': foot_freq, 
        'Ankle': ankle_freq, 
        'Knee': knee_freq
    }

    df['InjuryType'] = df.BodyPart.map(injury_map)

    # Remove any injuries not associated with a play
    df = df.loc[df.PlayKey.isna() == False]
    
    return df

In [16]:
def Injury_Duration_Classifier(row):
    # Injury_Duration_Classifier: This creates a new list of numerical values as the shortest number of days of injury

    injury_duration = 0
    if row["DM_M42"] == 1:
        injury_duration = 42
    else:
        if row["DM_M28"] == 1:
            injury_duration = 28
        else:
            if row["DM_M7"] == 1:
                injury_duration = 7
            else: 
                injury_duration = 1
    

    return injury_duration

In [156]:
def Injury_Duration_Coder(df): 
    # Injury_Duration_Coder: Apply the Injury Duration Classifier to the dataframe

    df['InjuryDuration'] = df.apply(Injury_Duration_Classifier, axis=1)

    severity_map = {
        42: 1, 
        28: 1,
        7: 0, 
        1: 0 
    }
    df['SevereInjury'] = df.InjuryDuration.map(severity_map)
    df.InjuryDuration.astype(int)
    df.SevereInjury.astype(int)

    return df

In [167]:
def ML_Process_Injury_Data(df): 
    # ML_Process_Injury_Data: This function applies all of the InjuryRecord table processing Functions

    df = Injury_Coder(df)
    df = Injury_Duration_Coder(df)

    # Drop any PlayKey NaN values
    df = df.loc[df.PlayKey.isna() == False]    


    # Drop the columns that are redundant with those from the Playlist Dataframe inpreparation for the Merge.
    df.drop(columns=['GameID',
                       'PlayerKey',
                       'Surface',
                       'BodyPart',
                       'DM_M1',
                       'DM_M7',
                       'DM_M28',
                       'DM_M42'], inplace=True)


    return df

In [19]:
def Vis_Process_Injury_Data(df):
    # Process_Injury_Data: This function applies all of the InjuryRecord table processing Functions

    df = Injury_Coder(df)
    df = Injury_Duration_Coder(df)

    # Drop any PlayKey NaN values
    df.PlayKey.dropna(inplace=True)

    # Drop the columns that are redundant with those from the Playlist Dataframe inpreparation for the Merge.
    df.drop(columns=['GameID',
                     'PlayerKey',
                     'InjuryType',
                     'Surface',
                     'DM_M1',
                     'DM_M7',
                     'DM_M28',
                     'DM_M42'], inplace=True)

    return df

## Merge the PlayList and Injury Dataframes

The next line creates two merges:
- The outer merge that will include all data, not only the data associated with injuries
- The outer merge data may be better suited to only looking at the final play position from the tracking data
- The inner merge will only include data associated with the injuries
- The inner merge won't produce many unique plays, but when merged with the tracking data, this df will get exponentially larger, giving better predictability with injury type classification


In [179]:
def ML_Data_Cleaner(playlist, injuries):
    # This is the overall function that cleans and merges the data for Machine Learning data processing

    playlist = ML_Process_Playlist_Data(playlist)
    injuries = ML_Process_Injury_Data(injuries)
    
    ML_outer = pd.merge(playlist, injuries, on='PlayKey', how='outer')
    
    ML_outer.InjuryType.fillna(0, inplace=True)
    ML_outer.InjuryDuration.fillna(0, inplace=True)
    ML_outer.SevereInjury.fillna(0, inplace=True)

    return ML_outer

In [174]:
playlist = pd.read_csv("NFL_Turf/PlayList.csv")
injuries = pd.read_csv("NFL_Turf/InjuryRecord.csv")

ML_outer = ML_Data_Cleaner(playlist, injuries)

Verify that all output values are numeric before exporting the table

In [175]:
ML_outer.dtypes

PlayKey            object
RosterPosition      int64
PlayerDay           int64
PlayerGame          int64
Temperature         int64
PlayerGamePlay      int64
Position            int64
SyntheticField      int64
Outdoor             int64
Precipitation       int64
PlayCode          float64
InjuryType        float64
InjuryDuration    float64
SevereInjury      float64
dtype: object

In [23]:
def Vis_Data_Cleaner(playlist, injuries):
    # This is the overall function for cleaning and merging the data for Visualizations 

    playlist = Vis_Process_Playlist_Data(playlist)
    injuries = Vis_Process_Injury_Data(injuries)

    Vis_outer = pd.merge(playlist, injuries, on='PlayKey', how='outer')

    Vis_outer.BodyPart.fillna("NoInjury", inplace=True)
    Vis_outer.InjuryDuration.fillna(0, inplace=True)
    Vis_outer.SevereInjury.fillna(0, inplace=True)

    return Vis_outer

In [24]:
playlist = pd.read_csv("NFL_Turf/PlayList.csv")
injuries = pd.read_csv("NFL_Turf/InjuryRecord.csv")

In [177]:
playlist = pd.read_csv("NFL_Turf/PlayList.csv")
injuries = pd.read_csv("NFL_Turf/InjuryRecord.csv")

Vis_outer = Vis_Data_Cleaner(playlist, injuries)

In [178]:
Vis_outer.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,BodyPart,InjuryDuration,SevereInjury
0,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear,Pass,1,QB,NoInjury,0.0,0.0
1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear,Pass,2,QB,NoInjury,0.0,0.0
2,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear,Rush,3,QB,NoInjury,0.0,0.0
3,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear,Rush,4,QB,NoInjury,0.0,0.0
4,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear,Pass,5,QB,NoInjury,0.0,0.0


## Tracking Data Cleaner

In [27]:
tracking = pd.read_csv('NFL_Turf/PlayerTrackData.csv')
tracking.head()

Unnamed: 0,PlayKey,time,event,x,y,dir,dis,o,s
0,26624-1-1,0.0,huddle_start_offense,87.46,28.93,288.24,0.01,262.33,0.13
1,26624-1-1,0.1,,87.45,28.92,283.91,0.01,261.69,0.12
2,26624-1-1,0.2,,87.44,28.92,280.4,0.01,261.17,0.12
3,26624-1-1,0.3,,87.44,28.92,278.79,0.01,260.66,0.1
4,26624-1-1,0.4,,87.44,28.92,275.44,0.01,260.27,0.09


The tracking data measures the movement every 1/10 second, giving us an exorbitant amount of data that can be applied differently, depending on the purpose of the analysis. 

- For the Outer Merged Data including all non-injury data, it may be beneficial to only look at the last time and position of each play. 
- For the Inner Merged Data only including the injuries, it may be more beneficial to look at each full play.  
- The first play in the table alone has 298 rows in the single 29.8 second play
- There isn't really anything to clean in this table, but it will be problematic to add all of the additional columns to 76 million rows

## Adding Injuries to Tracking Data


In [28]:
print(tracking.PlayKey.nunique() - playlist.PlayKey.nunique())

-45


There seem to be about 45 plays in the tracking that do not exist in the Playlist, so these will be dropped.

In [29]:
# Create a new df that is a subset of PlayKeys and Injuries both Categorical and Numerical
injury_columns = Vis_outer.loc[:,['PlayKey', 'BodyPart']]
injury_columns = Injury_Coder(injury_columns)
injury_columns.head()

Unnamed: 0,PlayKey,BodyPart,InjuryType
0,26624-1-1,NoInjury,0.0
1,26624-1-2,NoInjury,0.0
2,26624-1-3,NoInjury,0.0
3,26624-1-4,NoInjury,0.0
4,26624-1-5,NoInjury,0.0


In [33]:
track_outer = pd.merge(tracking, injury_columns, on='PlayKey', how='outer')

In [34]:
track_outer.head()

Unnamed: 0,PlayKey,time,event,x,y,dir,dis,o,s,BodyPart,InjuryType
0,26624-1-1,0.0,huddle_start_offense,87.46,28.93,288.24,0.01,262.33,0.13,NoInjury,0.0
1,26624-1-1,0.1,,87.45,28.92,283.91,0.01,261.69,0.12,NoInjury,0.0
2,26624-1-1,0.2,,87.44,28.92,280.4,0.01,261.17,0.12,NoInjury,0.0
3,26624-1-1,0.3,,87.44,28.92,278.79,0.01,260.66,0.1,NoInjury,0.0
4,26624-1-1,0.4,,87.44,28.92,275.44,0.01,260.27,0.09,NoInjury,0.0


In [35]:
track_outer.shape

(76367111, 11)

We needed to remove all of the plays from the play key that aren't associated to the other lists, since we can't even say that they were 'noinjury' plays

So this took 5.25 minutes to save as a csv, which I don't intend to upload or deal with until it's been cut down a bit

In [36]:
track_outer = track_outer.loc[track_outer.BodyPart.isna() == False, :]
track_outer.shape

(74498983, 11)

If we're going to do an unsupervised analysis on the big dataset, we'll need to use AWS and Colab, since a local machine doesn't have enough memory to open the csv, let alone run machine learning models on this

In [37]:
# track_outer.to_csv('Merged_Tables/tracking_all.csv', index=False)

In [38]:
injury_plays = track_outer.loc[track_outer.BodyPart != "NoInjury", :]

In [39]:
injury_plays.head()

Unnamed: 0,PlayKey,time,event,x,y,dir,dis,o,s,BodyPart,InjuryType
2085848,31070-3-7,0.0,line_set,44.07,32.14,23.31,0.0,174.83,0.03,Knee,48.0
2085849,31070-3-7,0.1,,44.08,32.14,20.18,0.0,175.09,0.03,Knee,48.0
2085850,31070-3-7,0.2,,44.08,32.14,16.53,0.0,175.35,0.03,Knee,48.0
2085851,31070-3-7,0.3,,44.08,32.14,13.23,0.0,175.6,0.02,Knee,48.0
2085852,31070-3-7,0.4,,44.08,32.14,9.78,0.0,175.82,0.02,Knee,48.0


## Merge the Injury Plays with the Tracking Data

### Merge the Machine Learning Data in preparation for export

In [184]:
Pre_injury_plays = injury_plays.drop(columns=["BodyPart", "InjuryType", 'event'])
ML_injury_tracking = pd.merge(Pre_injury_plays, ML_outer, on='PlayKey', how='outer')
ML_injury_tracking.head()

Unnamed: 0,PlayKey,time,x,y,dir,dis,o,s,RosterPosition,PlayerDay,...,Temperature,PlayerGamePlay,Position,SyntheticField,Outdoor,Precipitation,PlayCode,InjuryType,InjuryDuration,SevereInjury
0,31070-3-7,0.0,44.07,32.14,23.31,0.0,174.83,0.03,1.0,15.0,...,89.0,7.0,1.0,0.0,1.0,0.0,1.0,48.0,42.0,1.0
1,31070-3-7,0.1,44.08,32.14,20.18,0.0,175.09,0.03,1.0,15.0,...,89.0,7.0,1.0,0.0,1.0,0.0,1.0,48.0,42.0,1.0
2,31070-3-7,0.2,44.08,32.14,16.53,0.0,175.35,0.03,1.0,15.0,...,89.0,7.0,1.0,0.0,1.0,0.0,1.0,48.0,42.0,1.0
3,31070-3-7,0.3,44.08,32.14,13.23,0.0,175.6,0.02,1.0,15.0,...,89.0,7.0,1.0,0.0,1.0,0.0,1.0,48.0,42.0,1.0
4,31070-3-7,0.4,44.08,32.14,9.78,0.0,175.82,0.02,1.0,15.0,...,89.0,7.0,1.0,0.0,1.0,0.0,1.0,48.0,42.0,1.0


In [185]:
ML_injury_tracking.dtypes

PlayKey            object
time              float64
x                 float64
y                 float64
dir               float64
dis               float64
o                 float64
s                 float64
RosterPosition    float64
PlayerDay         float64
PlayerGame        float64
Temperature       float64
PlayerGamePlay    float64
Position          float64
SyntheticField    float64
Outdoor           float64
Precipitation     float64
PlayCode          float64
InjuryType        float64
InjuryDuration    float64
SevereInjury      float64
dtype: object

In [187]:
ML_injury_tracking.shape

(283456, 21)

In [194]:
ML_injury_tracking = ML_injury_tracking.loc[ML_injury_tracking.time.isna() == False]
ML_injury_tracking.shape

(22775, 21)

Export the machine learning merged data to csv

In [195]:
ML_injury_tracking.to_csv("Shared_Tables/ml_injury_tracking.csv")

### Merge the Visual Injury and Tracking in preparation for Export

In [196]:
Vis_injury_tracking = pd.merge(injury_plays, Vis_outer, on='PlayKey', how='outer')
Vis_injury_tracking.head()

Unnamed: 0,PlayKey,time,event,x,y,dir,dis,o,s,BodyPart_x,...,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,BodyPart_y,InjuryDuration,SevereInjury
0,31070-3-7,0.0,line_set,44.07,32.14,23.31,0.0,174.83,0.03,Knee,...,Outdoor,Natural,89.0,Clear,Rush,7.0,RB,Knee,42.0,1.0
1,31070-3-7,0.1,,44.08,32.14,20.18,0.0,175.09,0.03,Knee,...,Outdoor,Natural,89.0,Clear,Rush,7.0,RB,Knee,42.0,1.0
2,31070-3-7,0.2,,44.08,32.14,16.53,0.0,175.35,0.03,Knee,...,Outdoor,Natural,89.0,Clear,Rush,7.0,RB,Knee,42.0,1.0
3,31070-3-7,0.3,,44.08,32.14,13.23,0.0,175.6,0.02,Knee,...,Outdoor,Natural,89.0,Clear,Rush,7.0,RB,Knee,42.0,1.0
4,31070-3-7,0.4,,44.08,32.14,9.78,0.0,175.82,0.02,Knee,...,Outdoor,Natural,89.0,Clear,Rush,7.0,RB,Knee,42.0,1.0


In [197]:
Vis_injury_tracking.shape

(283456, 24)

In [198]:
Vis_injury_tracking = Vis_injury_tracking.loc[Vis_injury_tracking.time.isna() == False]
Vis_injury_tracking.shape


(22775, 24)

In [200]:
Vis_injury_tracking.to_csv("Shared_Tables/vis_injury_tracking.csv")