# Evaluating Injury vs. Non-Injury Plays

The tracking data already contain 76 million rows, so merging additional columns onto them is impractical in local analysis because of memory constraints. The plan is to undersample the outer merge of the Playlist and Injury datasets, randomly reducing the number of non-injury plays. Doing this first avoids further aggregation steps on the 76-million-row table. When the Playlist-Injury dataset is then merged with the tracking data, only rows with a matching PlayKey are kept, which significantly reduces the number of rows even as the number of columns grows.

In [None]:
import numpy as np
import pandas as pd
from NFL_Injury_Cleaning_Functions import *
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
import matplotlib.pyplot as plt



pd.set_option('mode.chained_assignment', None)
seed = 42


## Read in the datasets and Import Functions

In [None]:
playlist = pd.read_csv("NFL_Turf/PlayList.csv")
injuries = pd.read_csv("NFL_Turf/InjuryRecord.csv")

In [None]:
ml = ML_Data_Cleaner(playlist, injuries)
ml.head()

In [None]:
# ml.set_index('PlayKey', inplace=True)
ml.head()

We add one column, 'IsInjured', which is 1 wherever the injury type is not 0 and 0 otherwise.

In [None]:
# np.where sets ml.IsInjured to 0 where ml.InjuryType == 0 and to 1 elsewhere.
# InjuryType 0 means no injury; every other value is an injury.
 
ml['IsInjured'] = np.where(ml['InjuryType'] == 0, 0, 1)
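
To make the labeling rule concrete, here is a minimal sketch on a made-up table. The PlayKey values and the nonzero InjuryType codes are placeholders for illustration, not values taken from the real data; the only thing assumed from the notebook is that 0 means "no injury".

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned table; all values are invented for illustration.
toy = pd.DataFrame({
    "PlayKey": ["a-1-1", "a-1-2", "b-1-1", "c-2-3"],
    "InjuryType": [0, 2, 0, 1],
})

# 0 where InjuryType == 0 (no injury), 1 for every other code.
toy["IsInjured"] = np.where(toy["InjuryType"] == 0, 0, 1)

# An equivalent boolean formulation, cast to integers.
alt = (toy["InjuryType"] != 0).astype(int)

print(toy)
```

Both forms give the same labels; `np.where` is convenient when the two output values are something other than 0 and 1.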

# Undersampling

We will undersample the data using random undersampling (imblearn's RandomUnderSampler) as follows:

1. View the count of the target class (IsInjured) using value_counts
2. Randomly resample so that the non-injury plays are reduced to the number of injury plays
3. Merge the resampled data with the tracking data, keeping only the rows whose PlayKey matches the sampled set
4. Use the new dataset to perform machine learning analysis
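
The idea behind random undersampling can be sketched with pandas alone. This is a hand-rolled illustration on made-up data, not a description of how imblearn's RandomUnderSampler is implemented internally; the toy PlayKey values and class counts are assumptions.

```python
import pandas as pd

seed = 42

# Made-up, imbalanced toy data: 8 non-injury plays and 2 injury plays.
toy = pd.DataFrame({
    "PlayKey": [f"play-{i}" for i in range(10)],
    "IsInjured": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Size of the smallest class.
n_minority = toy["IsInjured"].value_counts().min()

# Draw n_minority rows from every class, without replacement.
balanced = toy.groupby("IsInjured").sample(n=n_minority, random_state=seed)

print(balanced["IsInjured"].value_counts())
```

Every injury play is kept, and an equal number of non-injury plays is drawn at random, so both classes end up the same size.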

In [None]:
X = ml.drop(columns=['InjuryType', 'InjuryDuration', 'SevereInjury'])
y = ml.IsInjured

In [None]:
y.value_counts()

In [None]:
rus = RandomUnderSampler(random_state=seed)

# Fit the resample
X_resampled, y_resampled = rus.fit_resample(X, y)


In [None]:
y_resampled.value_counts()

## Merging the Undersampled Data with the Tracking Data

At this point, X_resampled and y_resampled contain as many non-injury data points as injury data points; the row count will expand once we add the tracking data. Note that these data have not yet been split with train_test_split, because we still need to merge them with the tracking data. The merge is an inner merge on PlayKey, so it includes all of the position data for each play, but only for plays in the sampled data.

Note: The X_resampled dataframe still contains the IsInjured column (y). We keep it so that, after the merge with the tracking data, the y-values can be separated from the full table, which avoids any indexing mismatch. y_resampled is therefore unnecessary from here on.

First, load the tracking data:
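
The effect of the inner merge, and the reason for carrying IsInjured through it, can be sketched on two small made-up tables. The names `tracking_toy` and `sampled_toy` and all of their values are hypothetical.

```python
import pandas as pd

# Hypothetical tracking table: several position rows per play.
tracking_toy = pd.DataFrame({
    "PlayKey": ["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3"],
    "x": [10.0, 10.5, 11.0, 40.0, 40.2, 55.0, 55.5, 56.0],
})

# Hypothetical sampled plays: only p1 and p3 survived the undersampling.
sampled_toy = pd.DataFrame({
    "PlayKey": ["p1", "p3"],
    "IsInjured": [1, 0],
})

# The inner merge keeps only tracking rows whose PlayKey is in the sample.
merged = pd.merge(tracking_toy, sampled_toy, on="PlayKey", how="inner")

# Each play should carry exactly one label across all of its tracking rows.
labels_per_play = merged.groupby("PlayKey")["IsInjured"].nunique()

print(merged.shape)
print(labels_per_play.eq(1).all())
```

Rows for `p2` are dropped, and the label travels with every remaining tracking row, so no separate index alignment is needed. Passing `validate="many_to_one"` to `pd.merge` is one way to have pandas raise an error if a PlayKey appears more than once in the sampled table.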

In [None]:
tracking = pd.read_csv('NFL_Turf/PlayerTrackData.csv')
tracking.drop(columns=['event', 'dis', 'time'], inplace=True)
tracking.head()
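
Because memory is the main constraint here, one possible alternative is to skip the unwanted columns while reading, rather than dropping them afterwards. The sketch below uses a tiny in-memory CSV as a stand-in for the tracking file; the `x` and `y` columns and all of the values are placeholders.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the tracking file; columns other than
# PlayKey, time, event and dis are placeholders.
csv_text = io.StringIO(
    "PlayKey,time,event,x,y,dis\n"
    "p1,0.0,snap,10.0,20.0,0.01\n"
    "p1,0.1,,10.5,20.1,0.51\n"
)

unwanted = {"event", "dis", "time"}

# A callable usecols keeps only the columns for which it returns True.
toy_tracking = pd.read_csv(csv_text, usecols=lambda col: col not in unwanted)

print(toy_tracking.columns.tolist())
```

With this approach the skipped columns are never materialized in the dataframe, which may lower peak memory use on a large file.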

In [None]:
tracking.shape

In [None]:
ml_merged = pd.merge(tracking, X_resampled, on='PlayKey', how='inner')

In [None]:
ml_merged.head()

In [None]:
ml_merged.shape

In [None]:
ml_merged.PlayKey.nunique()

This has reduced the number of rows from 76 million to 44 thousand, sampling from 153 different plays, 77 of which involve injuries. 

## Machine Learning Model

The data will be split using train_test_split, and then similar to the previous models, a RandomForest classifier will be used for the learning process.

In [None]:
# Split into training and testing 

X_merged = ml_merged.drop(columns=['PlayKey', 'IsInjured'])
y_merged = ml_merged.IsInjured

# Pass the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_merged, y_merged, random_state=seed)

In [None]:
# Create the Classifier
barf = BalancedRandomForestClassifier(n_estimators=10, random_state=seed)

# Fit the model
barf.fit(X_train, y_train)

# Predict on the test set and compute the balanced accuracy score
y_pred = barf.predict(X_test)
balanced_accuracy_score(y_test, y_pred)
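
Balanced accuracy is the mean of the per-class recall, which is why it is more informative than plain accuracy when the classes are unequal in size. A minimal by-hand sketch on made-up labels (the helper name `balanced_accuracy` and the toy labels are illustrative only):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed by hand."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Predictions for the samples whose true label is cls.
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in preds_for_cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)


# Made-up labels: 6 non-injury samples and 2 injury samples.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 0]

score = balanced_accuracy(y_true_toy, y_pred_toy)
plain = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(score, plain)
```

In this toy case the recall is 5/6 for the non-injury class and 1/2 for the injury class, so the balanced accuracy is 2/3 even though plain accuracy is 0.75.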

In [None]:
# Display confusion matrix
fig, ax = plt.subplots(figsize=(10, 8))
plot_confusion_matrix(barf, X_test, y_test, display_labels=[
                      "Not Injured", "Injured"], cmap='Blues', values_format='d', ax=ax)
plt.title('Random Forest Confusion Matrix')
plt.show()

In [None]:
X_train.head()


This performed very well with the undersampling algorithm, but to better analyze overfitting issues, we also tested SMOTEENN, a combination of over- and undersampling, to get a larger dataset to pull from.
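
For intuition about the oversampling half of SMOTEENN, the sketch below shows the core SMOTE step: a synthetic minority sample is placed on the line segment between a minority point and one of its minority-class neighbours. This is a simplified illustration on made-up 2D points using only the single nearest neighbour; it is not imblearn's implementation, and it omits the Edited Nearest Neighbours cleaning step entirely.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in a 2D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

# Pick one point and find its nearest minority neighbour (excluding itself).
i = 0
dists = np.linalg.norm(minority - minority[i], axis=1)
dists[i] = np.inf
j = int(np.argmin(dists))

# The synthetic sample lies on the segment between the two points.
lam = rng.uniform(0.0, 1.0)
synthetic = minority[i] + lam * (minority[j] - minority[i])

print(j, synthetic)
```

Because every synthetic point is an interpolation of existing injury rows, the enlarged dataset contains no new real plays, which is worth keeping in mind when comparing its results with the undersampled model.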

Note: A simple logistic regression was also run and yielded a balanced accuracy of only 68%, which confirms that a more complex model should be used.


In [None]:
ml_smoteenn = ml.copy()

ml_smoteenn.set_index('PlayKey', drop=True, inplace=True)
ml_smoteenn.head()

In [None]:
X = ml_smoteenn.drop(columns=['InjuryType', 'InjuryDuration', 'SevereInjury'])
y = ml_smoteenn.IsInjured

In [None]:
y.head()

In [None]:
smoteenn = SMOTEENN(random_state=seed)
X_resampled, y_resampled = smoteenn.fit_resample(X, y)

In [None]:
y_resampled.value_counts()

In [None]:
X_resampled.head()

Already, this has far more data than the undersampling model: roughly 260,000 values per category compared to 77.

## Merging the SMOTEENN resampled data with the Tracking Data

Again, X_resampled and y_resampled have not been split into testing/training sets, as they will need to be merged.