In [12]:
import pandas as pd
import numpy as np

# First dataset: Injury Records

In [2]:
df = pd.read_csv('InjuryRecord.csv')

In [3]:
df.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


## Dataset Info
DM_M1 = binary flag (0/1): 1 if the injury caused 1 or more days missed

DM_M7 = binary flag (0/1): 1 if the injury caused 7 or more days missed

DM_M28 = binary flag (0/1): 1 if the injury caused 28 or more days missed

DM_M42 = binary flag (0/1): 1 if the injury caused 42 or more days missed
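
Because each flag means "N or more days missed", the four columns are cumulative: a 42+ day injury also sets the three lower flags. A minimal sketch of collapsing them into one severity label follows, using three rows copied from the `head()` output above. The `severity` column and the label strings are illustrative names, not part of the dataset.

```python
import pandas as pd

# Three rows copied from the head() output of InjuryRecord.csv
injuries = pd.DataFrame({
    "PlayerKey": [39873, 46074, 46646],
    "DM_M1": [1, 1, 1],
    "DM_M7": [1, 1, 0],
    "DM_M28": [1, 0, 0],
    "DM_M42": [1, 0, 0],
})

flag_cols = ["DM_M1", "DM_M7", "DM_M28", "DM_M42"]

# The flags are cumulative, so the row sum counts how many thresholds were reached
# (label text is hypothetical)
labels = {0: "<1 day", 1: "1-6 days", 2: "7-27 days", 3: "28-41 days", 4: "42+ days"}
injuries["severity"] = injuries[flag_cols].sum(axis=1).map(labels)

print(injuries[["PlayerKey", "severity"]])
```

This relies on the flags being consistent (no row with DM_M28 = 1 and DM_M7 = 0), which is worth checking on the full file before using the label.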

In [4]:
df.shape

(105, 9)

In [5]:
df.isnull().sum()  # TODO: fill the missing PlayKey values with 'unknown'

PlayerKey     0
GameID        0
PlayKey      28
BodyPart      0
Surface       0
DM_M1         0
DM_M7         0
DM_M28        0
DM_M42        0
dtype: int64
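
The 28 missing PlayKey values can be filled with a placeholder, as the note in the cell above suggests. The sketch below is one way to do it on a toy frame; the third row and the `PlayKey_known` helper column are made up for illustration. Keeping a boolean flag alongside the placeholder makes it easy to exclude these rows when joining on PlayKey later.

```python
import pandas as pd

# Toy frame: the first two rows come from head(); the third row is invented
# to represent an injury with no recorded play
injuries = pd.DataFrame({
    "GameID": ["39873-4", "46074-7", "00000-1"],
    "PlayKey": ["39873-4-32", "46074-7-26", None],
})

# Remember which rows had a real PlayKey before overwriting the gaps
injuries["PlayKey_known"] = injuries["PlayKey"].notna()
injuries["PlayKey"] = injuries["PlayKey"].fillna("unknown")

print(injuries)
```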

# Second dataset: Play list

In [6]:
df2 = pd.read_csv('PlayList.csv')

In [7]:
df2.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624,26624-1,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624,26624-1,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB


## Dataset Info
PlayerDay = an integer sequence that reflects the timing of a player's participation in games; use this field to order a player's participation chronologically

PlayerGame = uniquely identifies a player's games; matches the last integer of the GameID (not strictly in temporal order of game occurrence)

Temperature = on-field temperature at the start of the game (for closed-dome/indoor stadiums this field may not be relevant, as the temperature and weather are controlled)

PlayerGamePlay = an ordered index denoting the running count of plays the player has participated in during the game

Position = categorical variable denoting the player's position for the play - may not be the same as the roster position
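
The stated relationship between PlayerGame and GameID can be checked directly. The following is a small sketch on two rows taken from the `head()` output; `game_suffix` and `mismatches` are illustrative names. On the full `df2`, an empty `mismatches` frame would confirm the description.

```python
import pandas as pd

# Two rows copied from the head() output of PlayList.csv
plays = pd.DataFrame({
    "GameID": ["26624-1", "26624-1"],
    "PlayKey": ["26624-1-1", "26624-1-2"],
    "PlayerGame": [1, 1],
})

# Take the text after the last hyphen of GameID and compare it with PlayerGame
game_suffix = plays["GameID"].str.rsplit("-", n=1).str[-1].astype(int)
mismatches = plays[game_suffix != plays["PlayerGame"]]

print(len(mismatches))
```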

In [8]:
df2.shape

(267005, 14)

In [9]:
df2.isnull().sum()

PlayerKey             0
GameID                0
PlayKey               0
RosterPosition        0
PlayerDay             0
PlayerGame            0
StadiumType       16910
FieldType             0
Temperature           0
Weather           18691
PlayType            367
PlayerGamePlay        0
Position              0
PositionGroup         0
dtype: int64

# Third dataset: Player track data

In [10]:
# Reading all ~76 million rows with pandas' default dtypes may exhaust memory,
# so compact dtypes are specified up front
df3 = pd.read_csv('PlayerTrackData.csv',
                  dtype={'time': 'float64',
                         'x': 'float16',
                         'y': 'float16',
                         'dir': 'float16',
                         'dis': 'float16',
                         'o': 'float16',
                         's': 'float16'})
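
One caveat with `float16` is precision: it keeps only 11 significant bits, so values between 64 and 128 are stored in steps of 1/16. The sketch below illustrates the effect on a hypothetical raw coordinate (`x_raw` is an invented value, not taken from the file) and compares it with `float32`, which may be a reasonable middle ground if memory allows.

```python
import numpy as np

x_raw = 87.44  # hypothetical raw x coordinate in yards

# Round-trip the value through each dtype and measure the rounding error
x_f16 = float(np.float16(x_raw))
x_f32 = float(np.float32(x_raw))
err16 = abs(x_f16 - x_raw)
err32 = abs(x_f32 - x_raw)

# Spacing between adjacent float16 values in [64, 128): machine epsilon times 2**6
step16 = float(np.finfo(np.float16).eps) * 64

print(x_f16, err16, err32, step16)
```

The same rounding is visible in the loaded data, where values such as 0.01 appear as 0.010002.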

In [11]:
df3.head()

Unnamed: 0,PlayKey,time,event,x,y,dir,dis,o,s
0,26624-1-1,0.0,huddle_start_offense,87.4375,28.9375,288.25,0.010002,262.25,0.130005
1,26624-1-1,0.1,,87.4375,28.921875,284.0,0.010002,261.75,0.119995
2,26624-1-1,0.2,,87.4375,28.921875,280.5,0.010002,261.25,0.119995
3,26624-1-1,0.3,,87.4375,28.921875,278.75,0.010002,260.75,0.099976
4,26624-1-1,0.4,,87.4375,28.921875,275.5,0.010002,260.25,0.090027


## Dataset Info
The PlayKey feature can be decomposed as PlayerKey-game-play: the first five digits are a unique identifier for the player, the second number identifies the player's game, and the last number identifies the play within that game. The first two parts together form the GameID, so a PlayKey uniquely identifies a player's play within a game.
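
As a sketch of this decomposition, the key can be split on hyphens. The two sample keys appear in the outputs above; the column names `PlayerGame` and `PlayNumber` are chosen here for illustration.

```python
import pandas as pd

# Sample keys taken from the head() outputs above
keys = pd.Series(["26624-1-1", "39873-4-32"], name="PlayKey")

# Split into the three hyphen-separated parts and convert them to integers
parts = keys.str.split("-", expand=True)
parts.columns = ["PlayerKey", "PlayerGame", "PlayNumber"]
parts = parts.astype(int)

# GameID is everything before the last hyphen
parts["GameID"] = keys.str.rsplit("-", n=1).str[0]

print(parts)
```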

x = player position along the long axis of the field (yards) over the time index

y = player position along the short axis of the field (yards) over the time index

dir = direction - angle of player motion (deg)

dis = distance traveled from prior time point over the time index

o = orientation - angle that the player is facing (deg)

s = estimated speed at that particular point in time over the time index

In [13]:
df3.shape  # ~76 million rows: one position sample per player every 0.1 seconds

(76366748, 9)

In [14]:
df3.isnull().sum()

PlayKey           0
time              0
event      74526875
x                 0
y                 0
dir               2
dis               0
o                 2
s                 0
dtype: int64
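
About 97.6% of the `event` values are null (74,526,875 of 76,366,748 rows). Judging from the `head()` output, this looks like the column is populated only on the frame where an event occurs, although that is an assumption rather than something the data description states. Under that assumption, one option is to forward-fill the most recent event within each play. In the sketch below, the second play's rows and the `ball_snap` label are illustrative, and `last_event` is a made-up column name.

```python
import pandas as pd

# Toy tracking frame: the first play mirrors head(); the second play is invented
track = pd.DataFrame({
    "PlayKey": ["26624-1-1"] * 3 + ["26624-1-2"] * 2,
    "time": [0.0, 0.1, 0.2, 0.0, 0.1],
    "event": ["huddle_start_offense", None, None, None, "ball_snap"],
})

# Carry the latest event forward, but never across play boundaries
track["last_event"] = track.groupby("PlayKey")["event"].ffill()

print(track)
```

Frames that precede the first event of a play stay null, which keeps them distinguishable from frames that follow a known event.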