# Alternate Analysis

In [404]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [405]:
playlist = pd.read_csv("NFL_Turf/PlayList.csv")

## Cleaning Playlist Data

***Analyzing the Plays from the PlayList file***

- The first thing to note is that this list contains all of the plays, including the exact play that will match with the injury list, therefore anything that is on both with the exception of the PlayerKey should be maintained on THIS DF so that we don't lose data on the non-injury columns
- In order to separate the files to do predictive analysis on ONLY the injuries, there will be two output files, one with an outer merge that maintains the non-injuty data and one with an inner merge that only keeps data associated with injury
- PlayKey will be used as the Key to merge the datasets, so PlayerKey and GameID can be removed. While FieldType information is also in the surface column of the injuries table, we need to maintain it here, so we don't lose the data from the columns not containing injuries. 

### The Dataset

- PlayKeys represent all plays, not only those where injuries occurred - these will function to merge the tables
- FieldType only has 2 values, Natural or Synthetic and can be easily changed to binary values 
- Stadium Type is also strange with 29 unique types of stadiums. These will be grouped as either Outdoor, or Indoor
- Games played in retractable roof stadiums with Open Domes are Outdoor, Closed Domes are Indoor
- Weather - there are 63 unique types of weather.... this is odd
- RosterPosition, Position, and Position Group are all similar and need to be investigated
- PlayTypes should be encoded, as they are categorical such as pass, rush, kick, ... 

### Encoding the Data

- Binary Encoding can happen for FieldType and StadiumType
- For positions, plays, and weather, need to consider whether it is better to use dummies/OneHotEncoder or use numerical values in a single column

In [406]:
playlist.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624,26624-1,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624,26624-1,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB


## Transformation Functions

### Playlist Functions

1. Surface_Coder - takes the Dataframe input and creates a new column in the df called SyntheticField with a 1 for True and 0 for false, or Natural Field
2. Stadium_Coder - takes the DF and replaces the StadiumTypes from having 29 stadium types to 2, either Indoor or Outdoor. Also creates a new numerical column called 'Outdoor' with binary values where 1 is True (Outdoor) and 0 is false (Indoor)
3. Temperature_Adjuster - takes rows from df where the temperature was recorded as -999 degrees. For all stadiums that are indoor, this temperature is set to 70 degrees. The others are removed from the dataframe
4. Weather_Coder - creates categories and groups the 63 weather types into 7 (Indoor, Clear, Cloudy, Windy, Hazy/Fog, Rain, Snow). Also creates a new column called 'precipitation' where 1 is True and denotes that there was rain or snow, and 0 that there was not. 
5. Position_Coder - This changes the positions from string to numeric using a full list of NFL positions to accommodate for any future injuries or players in the data. This applies to both the RosterPosition column and the Positions column
6. PlayerDay_Adjuster - The minimum playerday was -62, which causes problems in some of the analysis expecting all positive numbers. This function simply adds 63 days to all players
7. Play_Coder - Changes all plays involving the special player for kicking plays to 'Kick', so now there are only 3 categories for the plays. Also creates a new column called PlayCode with numerical encoding for the 3 play types
8. Process_Playlist_Data - Applies all of the Playlist function to clean and code the Playlist data, outputs cleaned df

### InjuryRecord Functions

1. Injury_Coder - assigns a numerical value to each of the injuries from BodyPart
2. Injury_Duration_Classifier - uses the 4 duration columns to return a list of minimum days injured
3. Injury_Duration_Coder - applies the duration code to create the column as well as to create an additional Severe column, where all injuries over 28 days are considered severe
4. Process_Injury_Data - Applies all of the Injury functions to the clean and code the InjuryRecord data, outputs cleaned df



In [407]:
# Function that encodes the Field Surface to identify natural or synthetic
def Surface_Coder(df):
    surface_map = {
        'Natural': 0,
        'Synthetic': 1
    }

    df['SyntheticField'] = df.FieldType.map(surface_map)

    return df    

In [408]:
# This function changes the stadium type to either Outdoor or Indoor, maintaining the categorical label
def Stadium_Coder(df):
    df.StadiumType.fillna('Outdoor', inplace=True)
    
    dict = {'Outdoor': 'Outdoor',
        'Indoors': 'Indoor',
        'Oudoor': 'Outdoor',
        'Outdoors': 'Outdoor',
        'Open': 'Outdoor',
        'Closed Dome': 'Indoor',
        'Domed, closed': 'Indoor',
        'Dome': 'Indoor',
        'Indoor': 'Indoor',
        'Domed': 'Indoor',
        'Retr. Roof-Closed': 'Indoor',
        'Outdoor Retr Roof-Open': 'Outdoor',
        'Retractable Roof': 'Indoor',
        'Ourdoor': 'Outdoor',
        'Indoor, Roof Closed': 'Indoor',
        'Retr. Roof - Closed': 'Indoor',
        'Bowl': 'Outdoor',
        'Outddors': 'Outdoor',
        'Retr. Roof-Open': 'Outdoor',
        'Dome, closed': 'Indoor',
        'Indoor, Open Roof': 'Outdoor',
        'Domed, Open': 'Outdoor',
        'Domed, open': 'Outdoor',
        'Heinz Field': 'Outdoor',
        'Cloudy': 'Outdoor',
        'Retr. Roof - Open': 'Outdoor',
        'Retr. Roof Closed': 'Indoor',
        'Outdor': 'Outdoor',
        'Outside': 'Outdoor'}

    df.StadiumType.replace(dict, inplace=True)


    # Create a new column with stadiums coded numerically
    stadium = {
        'Outdoor': 1, 
        'Indoor': 0
    }
    
    # Map the stadiumtype for outdoor as 1 = True and 0 = false
    df['Outdoor'] = df.StadiumType.map(stadium)

    return df

In [409]:
# This function also fixes the -999 temperature issue for all indoor stadiums
def Temperature_Adjuster(df):
    # Fix the temperature from -999 at any indoor stadium to 70
    df['Temperature'] = np.where(
        (df['Temperature'] == -999) & (df['StadiumType'] == 'Indoor'), 70, df.Temperature)

    # Extract all values that are not -999 degrees
    df = df[df['Temperature'] != -999]

    return df


In [410]:
# This function changes the weather into a smaller subset of categorical groups
def Weather_Coder(df):
    weather_dict = {'Clear and warm': 'Clear',
                    'Mostly Cloudy': 'Cloudy',
                    'Sunny': 'Clear',
                    'Clear': 'Clear',
                    'Cloudy': 'Cloudy',
                    'Cloudy, fog started developing in 2nd quarter': 'Hazy/Fog',
                    'Rain': 'Rain',
                    'Partly Cloudy': 'Cloudy',
                    'Mostly cloudy': 'Cloudy',
                    'Cloudy and cold': 'Cloudy',
                    'Cloudy and Cool': 'Cloudy',
                    'Rain Chance 40%': 'Rain',
                    'Controlled Climate': 'Indoor',
                    'Sunny and warm': 'Clear',
                    'Partly cloudy': 'Cloudy',
                    'Clear and Cool': 'Cloudy',
                    'Clear and cold': 'Cloudy',
                    'Sunny and cold': 'Clear',
                    'Indoor': 'Indoor',
                    'Partly Sunny': 'Clear',
                    'N/A (Indoors)': 'Indoor',
                    'Mostly Sunny': 'Clear',
                    'Indoors': 'Indoor',
                    'Clear Skies': 'Clear',
                    'Partly sunny': 'Clear',
                    'Showers': 'Rain',
                    'N/A Indoor': 'Indoor',
                    'Sunny and clear': 'Clear',
                    'Snow': 'Snow',
                    'Scattered Showers': 'Rain',
                    'Party Cloudy': 'Cloudy',
                    'Clear skies': 'Clear',
                    'Rain likely, temps in low 40s.': 'Rain',
                    'Hazy': 'Hazy/Fog',
                    'Partly Clouidy': 'Cloudy',
                    'Sunny Skies': 'Clear',
                    'Overcast': 'Cloudy',
                    'Cloudy, 50% change of rain': 'Cloudy',
                    'Fair': 'Clear',
                    'Light Rain': 'Rain',
                    'Partly clear': 'Clear',
                    'Mostly Coudy': 'Cloudy',
                    '10% Chance of Rain': 'Cloudy',
                    'Cloudy, chance of rain': 'Cloudy',
                    'Heat Index 95': 'Clear',
                    'Sunny, highs to upper 80s': 'Clear',
                    'Sun & clouds': 'Cloudy',
                    'Heavy lake effect snow': 'Snow',
                    'Mostly sunny': 'Clear',
                    'Cloudy, Rain': 'Rain',
                    'Sunny, Windy': 'Windy',
                    'Mostly Sunny Skies': 'Clear',
                    'Rainy': 'Rain',
                    '30% Chance of Rain': 'Rain',
                    'Cloudy, light snow accumulating 1-3"': 'Snow',
                    'cloudy': 'Cloudy',
                    'Clear and Sunny': 'Clear',
                    'Coudy': 'Cloudy',
                    'Clear and sunny': 'Clear',
                    'Clear to Partly Cloudy': 'Clear',
                    'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.': 'Windy',
                    'Rain shower': 'Rain',
                    'Cold': 'Clear'}


    df.Weather.replace(weather_dict, inplace=True)

    # There are still na values within the weather group that need to be addressed
    df.loc[df.StadiumType == 'Indoor', 'Weather'].fillna('Indoor', inplace=True)

    # Because we can't make a determination on the type of weather for outdoor, drop the remaining na values
    df.Weather.dropna(inplace=True)


    # Add a column for the presence of precipitation, that will ultimately be used for numerical analysis of the weather. 
    precipitation = {
        'Indoor': 0,
        'Clear': 0,
        'Cloudy': 0,
        'Windy': 0,
        'Hazy/Fog': 0,
        'Rain': 1,
        'Snow': 1
    }
    
    df['Precipitation'] = df.Weather.map(precipitation)
    
    return df


In [411]:
# This function encodes the players by position and rosterposition
def Position_Coder(df): 
    df['Position'] = np.where(df['Position'] == 'Missing Data', df['RosterPosition'], df['Position'])

    position = {
        'Quarterback': 0,
        'QB': 0,
        'Running Back': 1,
        'RB': 1,
        'FB': 2, 
        'Wide Receiver': 3,
        'WR': 3,
        'Tight End': 4,
        'TE': 4,
        'Offensive Lineman': 5,
        'OL': 5,
        'C': 6,
        'G': 7,
        'LG': 8,
        'RG': 9, 
        'T': 10, 
        'LT': 11, 
        'RT': 12, 
        'Kicker': 13,
        'K': 13,
        'KR': 14, 
        'Defensive Lineman': 15,
        'DL': 15,
        'DE': 16,
        'DT': 17, 
        'NT': 18, 
        'Linebacker': 19,
        'LB': 19,
        'OLB': 20,
        'ILB': 21,
        'MLB': 22,
        'DB': 23,
        'Cornerback': 24,
        'CB': 24,
        'Safety': 25,
        'S': 25,
        'SS': 26,
        'FS': 27,
        'P': 28,
        'PR': 29
    }

    df.RosterPosition.replace(position, inplace=True)
    df.Position.replace(position, inplace=True)
    df.drop(columns='PositionGroup', inplace=True)

    return df


In [412]:
# This function adjusts the player day to remove the negative values
def PlayerDay_Adjuster(df):
    df.assign(DaysPlayed = lambda x: x['PlayerDay'] + 63)

    return df


In [413]:
# This function creates a categorical grouping for the different types of plays, grouping into passing, rushing, or kicking plays
def Play_Coder(df):
    play_type = {
        'Pass': 'Pass',
        'Rush': 'Rush',
        'Extra Point': 'Kick',
        'Kickoff': 'Kick',
        'Punt': 'Kick',
        'Field Goal': 'Kick',
        'Kickoff Not Returned': 'Kick',
        'Punt Not Returned': 'Kick',
        'Kickoff Returned': 'Kick',
        'Punt Returned': 'Kick',
        '0': 'Kick'
    }

    play_map = {
        'Pass': 0, 
        'Rush': 1, 
        'Kick': 2
    }

    df.PlayType.replace(play_type, inplace=True)
    df['PlayCode'] = df.PlayType.map(play_map)

    return df

In [414]:
# Create cleaning function to apply all of the other functions to the single df input
def Process_Playlist_Data(df):
    df = Surface_Coder(df)
    df = Stadium_Coder(df)
    df = Temperature_Adjuster(df)
    df = Weather_Coder(df)
    df = Position_Coder(df)
    df = PlayerDay_Adjuster(df)
    df = Play_Coder(df)
    
    

    return df

In [415]:
# Remove columns based on analysis being performed
playlist.drop(columns=['GameID', 'PlayerKey', 'StadiumType', 'FieldType', 'Weather', 'PlayType'], inplace=True)


In [416]:
playlist = pd.read_csv("NFL_Turf/PlayList.csv")


In [417]:
playlist = Process_Playlist_Data(playlist)
playlist.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(result)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,SyntheticField,Outdoor,Precipitation,PlayCode
0,26624,26624-1,26624-1-1,0,1,1,Outdoor,Synthetic,63,Clear,Pass,1,0,1,1,0.0,0.0
1,26624,26624-1,26624-1-2,0,1,1,Outdoor,Synthetic,63,Clear,Pass,2,0,1,1,0.0,0.0
2,26624,26624-1,26624-1-3,0,1,1,Outdoor,Synthetic,63,Clear,Rush,3,0,1,1,0.0,1.0
3,26624,26624-1,26624-1-4,0,1,1,Outdoor,Synthetic,63,Clear,Rush,4,0,1,1,0.0,1.0
4,26624,26624-1,26624-1-5,0,1,1,Outdoor,Synthetic,63,Clear,Pass,5,0,1,1,0.0,0.0


## Cleaning Injury List Data

- Depending on the Join, there will be groupings with NoInjury, which needs to be coded as well as the others
- The DM_M# columns represent the minimum number of days that the player was out - this can be turned into a continuous data type
- Can also create a binary output for Severe, over 28 days, and Less Severe for under
- There are some playKeys that are NAN and need to be removed, since there is no indication when or how the injury occurred
- Surface can be removed, since this will be merged with the other table that already contains this information

In [418]:
injuries = pd.read_csv("NFL_Turf/InjuryRecord.csv")
injuries.head()


Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


In [419]:
def Injury_Coder(df):
    knee_freq = df.BodyPart.value_counts()['Knee']
    ankle_freq = df.BodyPart.value_counts()['Ankle']
    foot_freq = df.BodyPart.value_counts()['Foot']
        
    injury_map = {
        'NoInjury': 0, 
        'Foot': foot_freq, 
        'Ankle': ankle_freq, 
        'Knee': knee_freq
    }

    df['InjuryType'] = df.BodyPart.map(injury_map)

    # Remove any injuries not associated with a play
    df.PlayKey.dropna(inplace=True)

    return df


In [420]:
# This creates a new list of numerical values as the shortest number of days of injury
def Injury_Duration_Classifier(row):
    injury_duration = 0
    if row["DM_M42"] == 1:
        injury_duration = 42
    else:
        if row["DM_M28"] == 1:
            injury_duration = 28
        else:
            if row["DM_M7"] == 1:
                injury_duration = 7
            else: 
                injury_duration = 1
    
    return injury_duration

In [421]:
# Apply the Injury Duration Classifier to the dataframe
def Injury_Duration_Coder(df): 
    df['InjuryDuration'] = df.apply(Injury_Duration_Classifier, axis=1)

    severity_map = {
        42: 1, 
        28: 1,
        7: 0, 
        1: 0 
    }
    df['SevereInjury'] = df.InjuryDuration.map(severity_map)
    
    return df

In [422]:
# This function applies all of the InjuryRecord table processing Functions

def Process_Injury_Data(df): 
    df = Injury_Coder(df)
    df = Injury_Duration_Coder(df)

    # Drop any PlayKey NaN values
    df.PlayKey.dropna(inplace=True)
    
    # Drop all columns that are not necessary or redundant
    

    return df

In [423]:
# Drop the columns based on the analysis that is being performed. 
injuries.drop(columns=['GameID',
        'PlayerKey',
                 'Surface',
                 'DM_M1',
                 'DM_M7',
                 'DM_M28',
                 'DM_M42'], inplace=True)


In [424]:
injuries = pd.read_csv("NFL_Turf/InjuryRecord.csv")
injuries.head()


Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


In [425]:
injuries = Process_Injury_Data(injuries)
injuries.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42,InjuryType,InjuryDuration,SevereInjury
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1,48.0,42,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0,48.0,7,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1,42.0,42,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0,42.0,1,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1,42.0,42,1


## Tracking Data Cleaner