# Case Study 4: Data Science in Any Data You Like

**Required Readings:** 
* In this case study, you may use any data you like.
* [TED Talks](https://www.ted.com/talks) for examples of 7-minute talks.


**NOTE**
* Please save the notebook frequently when working in Jupyter Notebook; otherwise your changes may be lost.

*----------------------

# Problem: pick a data science problem that you plan to solve using your data
* The problem should be important and interesting, with potential impact in some area.
* The problem should be solvable using the data and data science solutions.

Please briefly describe in the following cell: What problem are you trying to solve? Why is this problem important and interesting?

In [None]:
We are trying to predict how many yards a runner will run during a play in a game of American football.
This is an interesting problem because if we can correlate successful plays with different features of teams,
plays, players, and the environment, coaches and analysts can use that information to improve their game plans.

# Data Collection/Processing: 

In [None]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

np.set_printoptions(threshold=np.inf)

In [None]:
raw_data = pd.read_csv('data/train.csv')

In [None]:
print(np.unique(raw_data['PlayId'].values).size)
print(raw_data.columns)

In [None]:
raw_data['Temperature'] = raw_data['Temperature'].fillna(raw_data['Temperature'].mean())
raw_data['Humidity'] = raw_data['Humidity'].fillna(raw_data['Humidity'].mean())
raw_data = raw_data.dropna()

In [None]:
plays = np.unique(raw_data['PlayId'])
plays_trn, plays_val = train_test_split(plays, train_size=0.75)
data_trn = raw_data[raw_data['PlayId'].isin(plays_trn)]
data_val = raw_data[raw_data['PlayId'].isin(plays_val)]
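
Splitting on `PlayId` rather than on individual rows keeps all player rows of a play in the same partition, so no play is shared between training and validation. A minimal sketch on a made-up toy frame (the column values are hypothetical, not taken from `train.csv`), checking that the two partitions share no play:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for raw_data: 8 plays with 3 player rows each (values are made up)
toy = pd.DataFrame({
    'PlayId': np.repeat(np.arange(8), 3),
    'X': np.arange(24, dtype=float),
})

# Split the unique play ids, then select rows by membership
plays = np.unique(toy['PlayId'])
plays_trn, plays_val = train_test_split(plays, train_size=0.75, random_state=0)
toy_trn = toy[toy['PlayId'].isin(plays_trn)]
toy_val = toy[toy['PlayId'].isin(plays_val)]

# No play id appears on both sides, and every row lands in exactly one partition
shared = set(toy_trn['PlayId']) & set(toy_val['PlayId'])
print(len(shared), len(toy_trn) + len(toy_val))
```

A row-level split would instead scatter players from one play across both partitions, which leaks the play's `Yards` value into validation.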

In [None]:
def get_time(quarter, clock):
    split_time = clock.split(':')
    return (quarter-1)*15 + int(split_time[0]) + int(split_time[1])/60

def get_distance_to_touchdown(yard_line, possession_team, field_position):
    if possession_team != field_position:
        return yard_line
    else:
        return 100 - yard_line

def get_time_since_snap(time_handoff, time_snap):
    split_handoff = time_handoff.split(':')
    handoff_sec = int(split_handoff[1])*60 + int(split_handoff[2].split('.')[0])
    split_snap = time_snap.split(':')
    snap_sec = int(split_snap[1])*60 + int(split_snap[2].split('.')[0])
    return float(handoff_sec) - float(snap_sec)

def get_height(player_height):
    split_height = player_height.split('-')
    return int(split_height[0])*12 + int(split_height[1])

def get_age(player_birth_date):
    return 2019 - int(player_birth_date.split('/')[2])

def encode_personnel(personnel):
    PERSONNELS = ['DB', 'DL', 'LB', 'OL', 'QB', 'RB', 'TE', 'WR']
    encoded_personnel = [0]*len(PERSONNELS)
    personnel = personnel.replace(' ','')
    for i in range(0,len(personnel),4):
        encoded_personnel[PERSONNELS.index(personnel[i+1:i+3])] += int(personnel[i])
    return encoded_personnel

def get_offense_features(formation, personnel):
    FORMATIONS = ['SHOTGUN','SINGLEBACK','JUMBO','PISTOL','I_FORM','ACE','WILDCAT','EMPTY']
    one_hot_formation = [int(f == formation) for f in FORMATIONS]
    return one_hot_formation + encode_personnel(personnel)

def get_defense_features(in_the_box, personnel):
    return [in_the_box] + encode_personnel(personnel)

def make_matrix(data_trn):
    BUFFER = 60
    offense_player_count = np.zeros((BUFFER*2+1,BUFFER*2+1))
    defense_player_count = np.zeros((BUFFER*2+1,BUFFER*2+1))
    offense_mean_yards = np.zeros((BUFFER*2+1,BUFFER*2+1))
    defense_mean_yards = np.zeros((BUFFER*2+1,BUFFER*2+1))
    for _,play in data_trn.groupby(['PlayId']):
        offense_team = play.loc[play['NflId'] == play['NflIdRusher'],'Team'].iloc[0]
        direction = play['PlayDirection'].iloc[0]
        yards = play['Yards'].iloc[0]
        ox = play.loc[play['NflId'] == play['NflIdRusher'],'X'].iloc[0]
        oy = play.loc[play['NflId'] == play['NflIdRusher'],'Y'].iloc[0]
        for _,player in play.iterrows():
            x = int(round(player['X']-ox+BUFFER)) if direction == 'right' else int(round(ox-player['X']+BUFFER))
            y = int(round(player['Y']-oy+BUFFER))
            if player['Team'] == offense_team:
                offense_player_count[y,x] += 1
                offense_mean_yards[y,x] = offense_mean_yards[y,x] + (yards - offense_mean_yards[y,x]) / offense_player_count[y,x]
            else:
                defense_player_count[y,x] += 1
                defense_mean_yards[y,x] = defense_mean_yards[y,x] + (yards - defense_mean_yards[y,x]) / defense_player_count[y,x]
    return offense_player_count,defense_player_count,offense_mean_yards,defense_mean_yards

def get_matrix_prediction(offense_player_count,defense_player_count,offense_mean_yards,defense_mean_yards,play):
    BUFFER = 60
    offense_team = play.loc[play['NflId'] == play['NflIdRusher'],'Team'].iloc[0]
    direction = play['PlayDirection'].iloc[0]
    yards = play['Yards'].iloc[0]
    ox = play.loc[play['NflId'] == play['NflIdRusher'],'X'].iloc[0]
    oy = play.loc[play['NflId'] == play['NflIdRusher'],'Y'].iloc[0]
    predictions = []
    weights = []
    for _,player in play.iterrows():
        x = int(round(player['X']-ox+BUFFER)) if direction == 'right' else int(round(ox-player['X']+BUFFER))
        y = int(round(player['Y']-oy+BUFFER))
        if player['Team'] == offense_team:
            predictions.append(offense_mean_yards[y,x])
            weights.append(offense_player_count[y,x])
        else:
            predictions.append(defense_mean_yards[y,x])
            weights.append(defense_player_count[y,x])
    return np.average(predictions, weights=weights)
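
To make the string formats these helpers expect concrete, the sketch below re-implements three of them as compact variants and runs them on sample values. The sample strings are assumptions about the column formats (for example `'1 RB, 1 TE, 3 WR'` for personnel, `'6-2'` for height, and ISO-like timestamps), not rows taken from the data.

```python
def get_height(player_height):
    # 'feet-inches' -> total inches
    feet, inches = player_height.split('-')
    return int(feet) * 12 + int(inches)

def get_time_since_snap(time_handoff, time_snap):
    # Timestamps are assumed to look like '2017-09-08T00:44:06.000Z'
    def to_seconds(stamp):
        parts = stamp.split(':')
        return int(parts[1]) * 60 + int(parts[2].split('.')[0])
    return float(to_seconds(time_handoff) - to_seconds(time_snap))

def encode_personnel(personnel):
    # Count players per position group, in a fixed position order
    positions = ['DB', 'DL', 'LB', 'OL', 'QB', 'RB', 'TE', 'WR']
    counts = [0] * len(positions)
    for item in personnel.split(','):
        count, position = item.split()
        counts[positions.index(position)] += int(count)
    return counts

print(get_height('6-2'))                     # 74
print(get_time_since_snap('2017-09-08T00:44:06.000Z',
                          '2017-09-08T00:44:05.000Z'))  # 1.0
print(encode_personnel('1 RB, 1 TE, 3 WR'))  # [0, 0, 0, 0, 0, 1, 1, 3]
```

Like the original helpers, this ignores the hour field of the timestamps, so a snap and handoff that straddle an hour boundary would give a wrong difference.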

In [None]:
# One feature row and one target value per training play
input_trn = []
target_trn = []
for _,play in data_trn.groupby(['PlayId']):
    state_features = []
    state_features.append(get_distance_to_touchdown(play['YardLine'].iloc[0], play['PossessionTeam'].iloc[0], play['FieldPosition'].iloc[0]))
    state_features.append(get_time(play['Quarter'].iloc[0],play['GameClock'].iloc[0]))
    state_features.append(play['Down'].iloc[0])
    state_features.append(play['Distance'].iloc[0])
    state_features.append(get_time_since_snap(play['TimeHandoff'].iloc[0], play['TimeSnap'].iloc[0]))
    state_features.append(play['Temperature'].iloc[0])
    state_features.append(play['Humidity'].iloc[0])
    #state_features.append(get_matrix_prediction(opc,dpc,omy,dmy,play))
    offense_features = get_offense_features(play['OffenseFormation'].iloc[0], play['OffensePersonnel'].iloc[0])
    defense_features = get_defense_features(play['DefendersInTheBox'].iloc[0], play['DefensePersonnel'].iloc[0])
    # Group by a scalar key so that t is the string 'home' or 'away' rather than a one-element tuple
    for t,team in play.groupby('Team'):
        team_features = []
        team_features.append(np.mean(team['X']))
        team_features.append(np.mean(team['Y']))
        team_features.append(np.mean(team['S']))
        team_features.append(np.mean(team['A']))
        team_features.append(np.mean(team['Dis']))
        team_features.append(np.mean(team['Orientation']))
        team_features.append(np.mean(team['Dir']))
        team_features.append(np.mean(team['PlayerHeight'].apply(lambda x: get_height(x))))
        team_features.append(np.mean(team['PlayerWeight']))
        team_features.append(np.mean(team['PlayerBirthDate'].apply(lambda x: get_age(x))))
        if t == 'home':
            team_features.append(team['HomeScoreBeforePlay'].iloc[0])
            if team['PossessionTeam'].iloc[0] == team['HomeTeamAbbr'].iloc[0]:
                offense_features = offense_features + team_features
            else:
                defense_features = defense_features + team_features
        elif t == 'away':
            team_features.append(team['VisitorScoreBeforePlay'].iloc[0])
            if team['PossessionTeam'].iloc[0] == team['VisitorTeamAbbr'].iloc[0]:
                offense_features = offense_features + team_features
            else:
                defense_features = defense_features + team_features
    if np.amax(np.isnan(state_features + offense_features + defense_features)) == 0:
        input_trn.append(state_features + offense_features + defense_features)
        target_trn.append(play['Yards'].iloc[0])

In [None]:
# One feature row and one target value per validation play
input_val = []
target_val = []
for _,play in data_val.groupby(['PlayId']):
    state_features = []
    state_features.append(get_distance_to_touchdown(play['YardLine'].iloc[0], play['PossessionTeam'].iloc[0], play['FieldPosition'].iloc[0]))
    state_features.append(get_time(play['Quarter'].iloc[0],play['GameClock'].iloc[0]))
    state_features.append(play['Down'].iloc[0])
    state_features.append(play['Distance'].iloc[0])
    state_features.append(get_time_since_snap(play['TimeHandoff'].iloc[0], play['TimeSnap'].iloc[0]))
    state_features.append(play['Temperature'].iloc[0])
    state_features.append(play['Humidity'].iloc[0])
    #state_features.append(get_matrix_prediction(opc,dpc,omy,dmy,play))
    offense_features = get_offense_features(play['OffenseFormation'].iloc[0], play['OffensePersonnel'].iloc[0])
    defense_features = get_defense_features(play['DefendersInTheBox'].iloc[0], play['DefensePersonnel'].iloc[0])
    # Group by a scalar key so that t is the string 'home' or 'away' rather than a one-element tuple
    for t,team in play.groupby('Team'):
        team_features = []
        team_features.append(np.mean(team['X']))
        team_features.append(np.mean(team['Y']))
        team_features.append(np.mean(team['S']))
        team_features.append(np.mean(team['A']))
        team_features.append(np.mean(team['Dis']))
        team_features.append(np.mean(team['Orientation']))
        team_features.append(np.mean(team['Dir']))
        team_features.append(np.mean(team['PlayerHeight'].apply(lambda x: get_height(x))))
        team_features.append(np.mean(team['PlayerWeight']))
        team_features.append(np.mean(team['PlayerBirthDate'].apply(lambda x: get_age(x))))
        if t == 'home':
            team_features.append(team['HomeScoreBeforePlay'].iloc[0])
            if team['PossessionTeam'].iloc[0] == team['HomeTeamAbbr'].iloc[0]:
                offense_features = offense_features + team_features
            else:
                defense_features = defense_features + team_features
        elif t == 'away':
            team_features.append(team['VisitorScoreBeforePlay'].iloc[0])
            if team['PossessionTeam'].iloc[0] == team['VisitorTeamAbbr'].iloc[0]:
                offense_features = offense_features + team_features
            else:
                defense_features = defense_features + team_features
    if np.amax(np.isnan(state_features + offense_features + defense_features)) == 0:
        input_val.append(state_features + offense_features + defense_features)
        target_val.append(play['Yards'].iloc[0])

# Data Exploration: Exploring your data

**Plot some properties/statistics/distribution of your data**

In [30]:
P_GAIN = 5
P_BUFFER = int(60 * P_GAIN)
V_GAIN = 10
V_BUFFER = int(20 * V_GAIN)
off_p_count = np.zeros((P_BUFFER*2+1,P_BUFFER*2+1))
off_v_count = np.zeros((V_BUFFER*2+1,V_BUFFER*2+1))
def_p_count = np.zeros((P_BUFFER*2+1,P_BUFFER*2+1))
def_v_count = np.zeros((V_BUFFER*2+1,V_BUFFER*2+1))
off_p_yards = np.zeros((P_BUFFER*2+1,P_BUFFER*2+1))
off_v_yards = np.zeros((V_BUFFER*2+1,V_BUFFER*2+1))
def_p_yards = np.zeros((P_BUFFER*2+1,P_BUFFER*2+1))
def_v_yards = np.zeros((V_BUFFER*2+1,V_BUFFER*2+1))
for _,play in data_trn.groupby(['PlayId']):
    offense_team = play.loc[play['NflId'] == play['NflIdRusher'],'Team'].iloc[0]
    direction = play['PlayDirection'].iloc[0]
    yards = play['Yards'].iloc[0]
    # Take the rusher's row as a Series so the values below are scalars
    runner = play.loc[play['NflId'] == play['NflIdRusher']].iloc[0]
    opx = runner['X'] * P_GAIN
    opy = runner['Y'] * P_GAIN
    ovx = runner['S'] * np.sin(np.deg2rad(runner['Dir'])) * V_GAIN
    ovy = runner['S'] * np.cos(np.deg2rad(runner['Dir'])) * V_GAIN
    for _,player in play.iterrows():
        # Position and velocity relative to the rusher, shifted into array index range
        px = int(round(player['X'] * P_GAIN - opx + P_BUFFER))
        py = int(round(player['Y'] * P_GAIN - opy + P_BUFFER))
        vx = int(round(player['S'] * np.sin(np.deg2rad(player['Dir'])) * V_GAIN - ovx + V_BUFFER))
        vy = int(round(player['S'] * np.cos(np.deg2rad(player['Dir'])) * V_GAIN - ovy + V_BUFFER))
        if direction != 'right':
            # Mirror the x axis so every play runs in the same direction
            px = int(round(opx - player['X'] * P_GAIN + P_BUFFER))
            vx = int(round(ovx - player['S'] * np.sin(np.deg2rad(player['Dir'])) * V_GAIN + V_BUFFER))
        if player['Team'] == offense_team:
            off_p_count[py,px] += 1
            off_v_count[vy,vx] += 1
            off_p_yards[py,px] = off_p_yards[py,px] + (yards - off_p_yards[py,px]) / off_p_count[py,px]
            off_v_yards[vy,vx] = off_v_yards[vy,vx] + (yards - off_v_yards[vy,vx]) / off_v_count[vy,vx]
        else:
            def_p_count[py,px] += 1
            def_v_count[vy,vx] += 1
            def_p_yards[py,px] = def_p_yards[py,px] + (yards - def_p_yards[py,px]) / def_p_count[py,px]
            def_v_yards[vy,vx] = def_v_yards[vy,vx] + (yards - def_v_yards[vy,vx]) / def_v_count[vy,vx]
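
The per-cell update `mean = mean + (yards - mean) / count` used above is the incremental form of the arithmetic mean, so each grid cell ends up holding the average yards over all plays that placed a player in that cell. A small check on made-up numbers:

```python
import numpy as np

yards_seen = [3.0, -1.0, 7.0, 12.0, 0.0]  # made-up yard values for one grid cell

count = 0
running_mean = 0.0
for yards in yards_seen:
    count += 1
    # Same update rule as the heatmap accumulation
    running_mean = running_mean + (yards - running_mean) / count

# The running value matches the batch mean up to floating-point rounding
print(running_mean, np.mean(yards_seen))
```

The incremental form avoids storing every yard value per cell; only the count and the current mean are kept.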

In [None]:
# Mask cells that never contained a player so they are left blank in the heatmap
off_p_yards[off_p_count==0] = np.nan
off_p_fig = plt.figure(figsize=(25,20))
ax = sns.heatmap(off_p_yards,vmin=-15,vmax=40,cmap='BuPu')

off_v_yards[off_v_count==0] = np.nan
off_v_fig = plt.figure(figsize=(25,20))
ax = sns.heatmap(off_v_yards,vmin=-15,vmax=40,cmap='BuPu')

def_p_yards[def_p_count==0] = np.nan
def_p_fig = plt.figure(figsize=(25,20))
ax = sns.heatmap(def_p_yards,vmin=-15,vmax=40,cmap='BuPu')

def_v_yards[def_v_count==0] = np.nan
def_v_fig = plt.figure(figsize=(25,20))
ax = sns.heatmap(def_v_yards,vmin=-15,vmax=40,cmap='BuPu')

# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

In [None]:
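We use linear regression on the extracted features to model the number of yards run in a play, and compare it against a decision tree regressor, a small neural network, and a baseline that always predicts the mean of the training targets.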

Write code to implement the solution in Python:

In [2]:
input_trn = np.stack(input_trn)
input_val = np.stack(input_val)
target_trn = np.stack(target_trn)
target_val = np.stack(target_val)

In [None]:
dtr = DecisionTreeRegressor(min_samples_leaf=128)
lr = LinearRegression()

# Decision tree: validation mean squared error
dtr.fit(input_trn,target_trn)
error = dtr.predict(input_val) - target_val
squared_error = error * error
print(np.mean(squared_error))

# Linear regression: validation mean squared error
lr.fit(input_trn,target_trn)
error = lr.predict(input_val) - target_val
squared_error = error * error
print(np.mean(squared_error))

# Baseline: always predict the mean of the training targets
print(np.mean((np.mean(target_trn)-target_val)*(np.mean(target_trn)-target_val)))
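
The cell above prints validation mean squared errors for the decision tree, the linear regression, and a baseline that always predicts the training mean. As a rough illustration of that comparison, the sketch below runs the same evaluation on synthetic data; the features and targets are randomly generated, not the football data, so the numbers say nothing about the real problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 3 features, known linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=400)

X_trn, X_val, y_trn, y_val = train_test_split(X, y, train_size=0.75, random_state=0)

# Validation MSE of the fitted linear model
lr = LinearRegression().fit(X_trn, y_trn)
mse_lr = np.mean((lr.predict(X_val) - y_val) ** 2)

# Baseline: predict the mean of the training targets for every validation row
mse_baseline = np.mean((np.mean(y_trn) - y_val) ** 2)

print(mse_lr, mse_baseline)
```

A model is only useful to the extent that its error falls below the baseline figure, which is why the baseline is worth printing alongside the model errors.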

In [None]:
lr.coef_

In [None]:
input_ss = StandardScaler()
# Keep the scaled copies separate so the unscaled arrays stay valid for dtr and lr
input_trn_scaled = input_ss.fit_transform(input_trn)
input_val_scaled = input_ss.transform(input_val)
model = Sequential()
model.add(Dense(units=16, activation='relu', input_shape=(input_trn_scaled.shape[1:])))
# Regression output: a single linear unit (softmax over one unit always outputs 1)
model.add(Dense(units=1,activation='linear'))
model.compile(optimizer='adam',loss='mse',metrics=['mae'])
callbacks = [EarlyStopping(monitor='val_loss',patience=5,restore_best_weights=False)]

In [None]:
model.fit(x=input_trn_scaled, y=target_trn, epochs=1000, verbose=1, callbacks=callbacks, validation_data=(input_val_scaled,target_val))

In [None]:
# model.predict returns shape (n, 1); flatten it so the subtraction does not broadcast to (n, n)
prediction = model.predict(input_val_scaled).ravel()
print(np.mean((prediction-target_val)*(prediction-target_val)))
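
One shape detail matters when computing this error by hand: a column of predictions with shape `(n, 1)` subtracted from targets with shape `(n,)` broadcasts to an `(n, n)` matrix, so the predictions need to be flattened first. A NumPy-only illustration with made-up values:

```python
import numpy as np

target = np.array([1.0, 2.0, 3.0])            # shape (3,)
prediction = np.array([[1.5], [2.0], [2.0]])  # shape (3, 1), like a single-output Keras model

wrong = prediction - target          # broadcasts to shape (3, 3)
right = prediction.ravel() - target  # shape (3,)

print(wrong.shape, right.shape)
print(np.mean(right * right))        # mean squared error over the 3 samples
```

The broadcast version runs without error, which makes the resulting number easy to mistake for a valid MSE.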

# Results: summarize and visualize the results discovered from the analysis

Please use figures, tables, or videos to communicate the results with the audience.


In [None]:
plt.scatter(target_val,lr.predict(input_val))

*-----------------
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: Please prepare PPT slides (for a 7-minute talk) to present the case study. Each team presents its case study in class for 7 minutes.

Please compress all the files in a zipped file.


**How to submit:**

        Please submit through Canvas, in the Assignment "Case Study 4".
        
**Note: Each team only needs to submit one submission in Canvas.**