# Data Engineering Notebook

The report for this final project can be found at this [link](https://cybertraining-dsc.github.io/report/fa20-523-301/project/project/).

## Part 1 Importing the functions

This notebook requires NumPy, Matplotlib, Pylab, Keras, and Pandas, along with a few supporting libraries.

In [None]:
! pip install utils
import numpy as np
import matplotlib.pyplot as plt
import pylab
import os, sys
import pandas as pd
import io
import requests
import warnings
import sklearn
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import utils

Collecting utils
  Downloading https://files.pythonhosted.org/packages/55/e6/c2d2b2703e7debc8b501caae0e6f7ead148fd0faa3c8131292a599930029/utils-1.0.1-py2.py3-none-any.whl
Installing collected packages: utils
Successfully installed utils-1.0.1


Now that the libraries have been imported, the team can focus on the download code. The following cells install the Kaggle client and prompt for an upload of the kaggle.json credentials file.

The mkdir command creates one directory for the Kaggle credentials and one for the Kaggle data. The last command in that cell lists Kaggle datasets, which allows the team to verify that the kaggle.json file was uploaded and copied to the right directory.
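
As a complement to that check, the credentials file itself could be inspected from Python. The sketch below is only an illustration: the helper name check_kaggle_credentials is hypothetical, the credentials in the demonstration are placeholders, and the permission test simply mirrors the chmod 600 step used in this notebook.

```python
import json
import os
import stat
import tempfile

def check_kaggle_credentials(path):
    """Return a list of problems with a kaggle.json file (an empty list means none were found)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    try:
        with open(path) as fh:
            creds = json.load(fh)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing field: {field}" for field in ("username", "key") if not creds.get(field)]
    # Mirror the chmod 600 step: flag files that group or other users can access
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems

# Demonstration on a throwaway file with placeholder (not real) credentials
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "kaggle.json")
with open(demo_path, "w") as fh:
    json.dump({"username": "example_user", "key": "example_key"}, fh)
os.chmod(demo_path, 0o600)
print(check_kaggle_credentials(demo_path))
```

In the notebook, the same helper could be pointed at the copied file with os.path.expanduser("~/.kaggle/kaggle.json").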

In [None]:
##import the kaggle.json from local to drive
!pip install -q kaggle
from google.colab import files
##when it asks you to choose a file select the kaggle.json located within the 'project' folder from the github repo
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"chelseagorius","key":"<redacted>"}'}

In [None]:
##make a .kaggle folder and a data folder (-p avoids an error if they already exist)
!mkdir -p ~/.kaggle
!mkdir -p data
##copy the kaggle.json to the .kaggle folder, then restrict its permissions to the owner
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
##test to see if kaggle is working, should print a list of datasets
!kaggle datasets list

ref                                                       title                                               size  lastUpdated          downloadCount  
--------------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  
terenceshin/covid19s-impact-on-airport-traffic            COVID-19's Impact on Airport Traffic               106KB  2020-10-19 12:40:17           1290  
sootersaalu/amazon-top-50-bestselling-books-2009-2019     Amazon Top 50 Bestselling Books 2009 - 2019         15KB  2020-10-13 09:39:21           1258  
thomaskonstantin/highly-rated-children-books-and-stories  Highly Rated Children Books And Stories            106KB  2020-10-24 12:09:59            288  
tunguz/euro-parliament-proceedings-1996-2011              Euro Parliament Proceedings 1996 - 2011              1GB  2020-10-26 17:48:29             18  
rishidamarla/judicial-expenditures-across-all-50-states   Judicial Expenditures ac

Now the team must download all of the datasets for the class. The three datasets are focused on the NBA.

The first dataset covers injuries. Each injury record will be used to establish the player, the timeframe, and the severity of the injury.

The other two datasets cover player performance. By cross-referencing this data with the injury list, the team will be able to see which players are limited by injury and how performance is hampered by time in rehab.

In [None]:
##downloading all the datasets
!kaggle datasets download -d ghopkins/nba-injuries-2010-2018
!kaggle datasets download -d nathanlauga/nba-games
!kaggle datasets download -d pablote/nba-enhanced-stats

Downloading nba-injuries-2010-2018.zip to /content
  0% 0.00/226k [00:00<?, ?B/s]
100% 226k/226k [00:00<00:00, 33.5MB/s]
Downloading nba-games.zip to /content
 50% 9.00M/18.1M [00:01<00:01, 5.38MB/s]
100% 18.1M/18.1M [00:01<00:00, 10.4MB/s]
Downloading nba-enhanced-stats.zip to /content
 54% 9.00M/16.7M [00:00<00:00, 16.8MB/s]
100% 16.7M/16.7M [00:00<00:00, 28.4MB/s]


In [None]:
##unzipping to the data folder
!unzip nba-injuries-2010-2018.zip -d data
!unzip nba-games.zip -d data
!unzip nba-enhanced-stats.zip -d data

Archive:  nba-injuries-2010-2018.zip
  inflating: data/injuries_2010-2020.csv  
Archive:  nba-games.zip
  inflating: data/games.csv          
  inflating: data/games_details.csv  
  inflating: data/players.csv        
  inflating: data/ranking.csv        
  inflating: data/teams.csv          
Archive:  nba-enhanced-stats.zip
  inflating: data/2012-18_officialBoxScore.csv  
  inflating: data/2012-18_playerBoxScore.csv  
  inflating: data/2012-18_standings.csv  
  inflating: data/2012-18_teamBoxScore.csv  
  inflating: data/2016-17_officialBoxScore.csv  
  inflating: data/2016-17_playerBoxScore.csv  
  inflating: data/2016-17_standings.csv  
  inflating: data/2016-17_teamBoxScore.csv  
  inflating: data/2017-18_officialBoxScore.csv  
  inflating: data/2017-18_playerBoxScore.csv  
  inflating: data/2017-18_standings.csv  
  inflating: data/2017-18_teamBoxScore.csv  
  inflating: data/metadata_officialBoxScore.pdf  
  inflating: data/metadata_playerBoxScore.pdf  
  inflating: data/metadata

The team must now load these downloads into dataframes. Pandas dataframes make the data easier to manage: the team can use Pandas to process the data and find the correlations needed for feature engineering before creating the models.
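
As an illustration of the loading step, the files in a folder could also be read in one pass rather than one call per file. This is a minimal sketch, not the approach used in this notebook: the helper name load_csv_folder is hypothetical, and the two small files in the demonstration are made up so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_csv_folder(folder):
    """Read every .csv file in a folder into a dict keyed by file name without the extension."""
    return {p.stem: pd.read_csv(p) for p in sorted(Path(folder).glob("*.csv"))}

# Demonstration with two tiny made-up files in a temporary folder
demo_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"TEAM_ID": [1, 2], "NICKNAME": ["A", "B"]}).to_csv(demo_dir / "teams.csv", index=False)
pd.DataFrame({"GAME_ID": [10], "GAME_DATE_EST": ["2015-01-01"]}).to_csv(demo_dir / "games.csv", index=False)

frames = load_csv_folder(demo_dir)
for name, frame in frames.items():
    print(name, frame.shape)
```

A dict keyed by file name keeps each table addressable by name, at the cost of losing the explicit variable names used in the cell that follows.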

In [None]:
#create a list for each data set
ds_NBA_Injuries, ds_NBA_Games, ds_NBA_Enhanced = [], [], []

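#import csv files as dataframes and save to respective list, injury set first
df_Injuries = pd.read_csv('data/injuries_2010-2020.csv')
#rows with no 'Acquired' value record a player being relinquished (injury start)
df_Injury_Start = df_Injuries[df_Injuries.Acquired.isnull()]
#rows with no 'Relinquished' value record a player being acquired again (injury end)
df_Injury_End = df_Injuries[df_Injuries.Relinquished.isnull()]
ds_NBA_Injuries = [df_Injury_Start, df_Injury_End]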
#nba games dataset
df_Games_games = pd.read_csv('data/games.csv')
df_Games_gamesDetails = pd.read_csv('data/games_details.csv')
df_Games_players = pd.read_csv('data/players.csv')
df_Games_ranking = pd.read_csv('data/ranking.csv')
df_Games_teams = pd.read_csv('data/teams.csv')
ds_NBA_Games = [df_Games_games, df_Games_gamesDetails, df_Games_players, df_Games_ranking, df_Games_teams]
#nba enhanced stats dataset
df_En_officialBS_1218 = pd.read_csv('data/2012-18_officialBoxScore.csv')
df_En_playerBS_1218 = pd.read_csv('data/2012-18_playerBoxScore.csv')
df_En_standings_1218 = pd.read_csv('data/2012-18_standings.csv')
df_En_teamBS_1218 = pd.read_csv('data/2012-18_teamBoxScore.csv')  
df_En_officialBS_1617 = pd.read_csv('data/2016-17_officialBoxScore.csv')  
df_En_playerBS_1617 = pd.read_csv('data/2016-17_playerBoxScore.csv')
df_En_standings_1617 = pd.read_csv('data/2016-17_standings.csv')
df_En_teamBS_1617 = pd.read_csv('data/2016-17_teamBoxScore.csv')  
df_En_officialBS_1718 = pd.read_csv('data/2017-18_officialBoxScore.csv')  
df_En_playerBS_1718 = pd.read_csv('data/2017-18_playerBoxScore.csv')
df_En_standings_1718 = pd.read_csv('data/2017-18_standings.csv')
df_En_teamBS_1718 = pd.read_csv('data/2017-18_teamBoxScore.csv')  
##data/metadata_officialBoxScore.pdf, data/metadata_playerBoxScore.pdf, data/metadata_standing.pdf, data/metadata_teamBoxScore.pdf  
df_En_teamBS = pd.read_csv('data/teamBoxScore.csv')
ds_NBA_Enhanced = [df_En_officialBS_1218, df_En_officialBS_1617, df_En_officialBS_1718, df_En_playerBS_1218, df_En_playerBS_1617, df_En_playerBS_1718, df_En_standings_1218, df_En_standings_1617, df_En_standings_1718, \
                       df_En_teamBS_1218, df_En_teamBS_1617, df_En_teamBS_1718, df_En_teamBS]


#probably need some more data exploration and some feature engineering

The next cell prepares the data tables so that they have the columns needed to calculate time- and player-specific metrics for each injury.

In [None]:
#distinct player and player ID list
df_distinct_playerID = df_Games_players[["PLAYER_NAME", "PLAYER_ID"]].drop_duplicates()
#distinct gameID and game date list
df_Games_games['GAME_DATE_EST'] = pd.to_datetime(df_Games_games['GAME_DATE_EST'])
df_distinct_gameId_date = df_Games_games[["GAME_ID", "GAME_DATE_EST"]].drop_duplicates()
#join player ID and team ID for the injury start dataframe
df_Injury_Start = df_Injury_Start.join(df_distinct_playerID.astype('object').set_index('PLAYER_NAME'), on='Relinquished')
df_Injury_Start = df_Injury_Start.merge(df_Games_teams[["TEAM_ID", "NICKNAME"]], left_on="Team", right_on="NICKNAME")
#drop returns a new dataframe, so the result is assigned back
df_Injury_Start = df_Injury_Start.drop(['NICKNAME'], axis=1)
df_Injury_Start['Date'] = pd.to_datetime(df_Injury_Start['Date'])
#again for the injury end dataframe
df_Injury_End = df_Injury_End.join(df_distinct_playerID.astype('object').set_index('PLAYER_NAME'), on='Acquired')
df_Injury_End = df_Injury_End.merge(df_Games_teams[["TEAM_ID", "NICKNAME"]], left_on="Team", right_on="NICKNAME")
df_Injury_End = df_Injury_End.drop(['NICKNAME'], axis=1)
#convert the end dataframe's own Date column
df_Injury_End['Date'] = pd.to_datetime(df_Injury_End['Date'])
# df_distinct_playerID=df_distinct_playerID.sort_values('PLAYER_NAME')
#attach the game date to every row of game details
df_Games_gamesDetails = df_Games_gamesDetails.merge(df_distinct_gameId_date, on="GAME_ID")
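
Joining on player names can silently leave some injuries without a PLAYER_ID when the spelling differs between datasets. One way to audit this is a left merge with an indicator column. The sketch below uses small made-up tables (the player names, IDs, and teams are placeholders, not rows from the Kaggle files).

```python
import pandas as pd

# Made-up stand-ins for the injury table and the player lookup table
injuries = pd.DataFrame({
    "Date": ["2015-01-02", "2015-01-05", "2015-01-09"],
    "Team": ["Lakers", "Bulls", "Heat"],
    "Relinquished": ["Player A", "Player B", "Player X"],
})
players = pd.DataFrame({
    "PLAYER_NAME": ["Player A", "Player B", "Player C"],
    "PLAYER_ID": [101, 102, 103],
})

# indicator=True adds a '_merge' column recording where each row found a match
matched = injuries.merge(
    players, how="left", left_on="Relinquished", right_on="PLAYER_NAME", indicator=True
)

# 'left_only' rows are injuries whose player name has no counterpart in the lookup
unmatched = matched.loc[matched["_merge"] == "left_only", "Relinquished"].tolist()
print(unmatched)
```

Rows reported this way would carry a missing PLAYER_ID and therefore match no games in the later metric loops.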

The next cell transforms the minutes column from a minutes:seconds string into a numeric value (decimal minutes) that can be used to create calculated metrics.

In [None]:
for index, row in df_Games_gamesDetails.iterrows():
  try:
    #values with a colon hold minutes and seconds
    m, s = str(row.MIN).split(':')
  except ValueError:
    #values without a colon (plain minutes or missing values) have no seconds part
    m = row.MIN
    s = 0
  df_Games_gamesDetails.loc[index,'MIN'] = pd.to_numeric(m) + pd.to_numeric(s)/60
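
Because a row-by-row loop can be slow on a large table, the same conversion could be expressed with vectorized string operations. This is a sketch under assumptions rather than a drop-in replacement: the helper name minutes_to_decimal is hypothetical, the sample values are made up, and unlike the loop above it turns unparseable values into NaN instead of raising an error.

```python
import pandas as pd

def minutes_to_decimal(min_col):
    """Convert 'MM:SS' strings (or plain minute counts) to decimal minutes."""
    # Split once on the colon; rows without a colon get a missing seconds part
    parts = min_col.astype(str).str.split(":", n=1, expand=True)
    minutes = pd.to_numeric(parts[0], errors="coerce")
    if parts.shape[1] > 1:
        seconds = pd.to_numeric(parts[1], errors="coerce").fillna(0)
    else:
        # No value in the column contained a colon
        seconds = 0
    return minutes + seconds / 60

# Made-up sample covering the cases handled by the loop
sample = pd.Series(["36:30", "12", None, "5:15"])
converted = minutes_to_decimal(sample)
print(converted.tolist())
```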

The next cell creates the player performance metrics for the injury game, along with their averages over the 5 games prior to the injury.

In [None]:
for index, row in df_Injury_Start.iterrows():
        #games of just that player
        temp = df_Games_gamesDetails.loc[df_Games_gamesDetails['PLAYER_ID'] == row.PLAYER_ID]
        #games before and including the injury date
        temp2 = temp.loc[(temp['GAME_DATE_EST'] <= row.Date)]
        #the game of injury and the 5 games prior, most recent first; separate variables are used because resetting 'game_set' to itself did not work
        game_set = temp2.nlargest(6, 'GAME_DATE_EST')
        if len(game_set) > 0:
          #injury game (the most recent game on or before the injury date)
          inj_game = game_set.iloc[0]
          #5 games prior to injury
          prior5 = game_set.iloc[1:]
          #storing game data from injury game
          df_Injury_Start.at[index, 'inj_MIN'] = inj_game[['MIN']].MIN
          df_Injury_Start.at[index,'inj_FGA'] = inj_game[['FGA']].FGA
          df_Injury_Start.at[index,'inj_FG_PCT'] = inj_game[['FG_PCT']].FG_PCT
          df_Injury_Start.at[index,'inj_FG3A'] = inj_game[['FG3A']].FG3A
          df_Injury_Start.at[index,'inj_FG3_PCT'] = inj_game[['FG3_PCT']].FG3_PCT
          df_Injury_Start.loc[index,'inj_FTA'] = inj_game[['FTA']].FTA
          df_Injury_Start.loc[index,'inj_FT_PCT'] = inj_game[['FT_PCT']].FT_PCT
          df_Injury_Start.loc[index,'inj_REB'] = inj_game[['REB']].REB
          df_Injury_Start.loc[index,'inj_AST'] = inj_game[['AST']].AST
          df_Injury_Start.loc[index,'inj_STL'] = inj_game[['STL']].STL
          df_Injury_Start.loc[index,'inj_BLK'] = inj_game[['BLK']].BLK
          df_Injury_Start.loc[index,'inj_TO'] = inj_game[['TO']].TO
          df_Injury_Start.loc[index,'inj_PF'] = inj_game[['PF']].PF
          df_Injury_Start.loc[index,'inj_PTS'] = inj_game[['PTS']].PTS
          df_Injury_Start.loc[index,'inj_PLUS_MINUS'] = inj_game[['PLUS_MINUS']].PLUS_MINUS
          #storing averaged game data from the prior 5 games
          df_Injury_Start.at[index,'p5_MIN'] = prior5[['MIN']].MIN.mean()
          df_Injury_Start.at[index,'p5_FGA'] = prior5[['FGA']].FGA.mean()
          df_Injury_Start.at[index,'p5_FG_PCT'] = prior5[['FG_PCT']].FG_PCT.mean()
          df_Injury_Start.at[index,'p5_FG3A'] = prior5[['FG3A']].FG3A.mean()
          df_Injury_Start.at[index,'p5_FG3_PCT'] = prior5[['FG3_PCT']].FG3_PCT.mean()
          df_Injury_Start.at[index,'p5_FTA'] = prior5[['FTA']].FTA.mean()
          df_Injury_Start.at[index,'p5_FT_PCT'] = prior5[['FT_PCT']].FT_PCT.mean()
          df_Injury_Start.at[index,'p5_REB'] = prior5[['REB']].REB.mean()
          df_Injury_Start.at[index,'p5_AST'] = prior5[['AST']].AST.mean()
          df_Injury_Start.at[index,'p5_STL'] = prior5[['STL']].STL.mean()
          df_Injury_Start.at[index,'p5_BLK'] = prior5[['BLK']].BLK.mean()
          df_Injury_Start.at[index,'p5_TO'] = prior5[['TO']].TO.mean()
          df_Injury_Start.at[index,'p5_PF'] = prior5[['PF']].PF.mean()
          df_Injury_Start.at[index,'p5_PTS'] = prior5[['PTS']].PTS.mean()
          df_Injury_Start.at[index,'p5_PLUS_MINUS'] = prior5[['PLUS_MINUS']].PLUS_MINUS.mean()
          
df_Injury_Start.to_csv('df_Injury_Start.csv')
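
The per-statistic assignments above follow one pattern, so they could be driven by a list of column names. The sketch below shows that idea on made-up data. The helper names injury_window and add_injury_metrics, the shortened STATS list, and the sample games are all assumptions for illustration; the window logic mirrors the cell above (the latest game on or before the injury date, plus the mean of up to 5 games before it).

```python
import pandas as pd

STATS = ["MIN", "PTS"]  # shortened stat list, for illustration only

def injury_window(player_games, injury_date, n_prior=5):
    """Return (injury-game row, mean of the prior games), or (None, None) if no game qualifies."""
    on_or_before = player_games.loc[player_games["GAME_DATE_EST"] <= injury_date]
    game_set = on_or_before.nlargest(n_prior + 1, "GAME_DATE_EST")
    if game_set.empty:
        return None, None
    return game_set.iloc[0], game_set.iloc[1:][STATS].mean()

def add_injury_metrics(injuries, games):
    """Join inj_* and p5_* columns onto the injury table, one pair per entry in STATS."""
    columns = ["row"] + [f"{prefix}_{stat}" for stat in STATS for prefix in ("inj", "p5")]
    records = []
    for idx, inj in injuries.iterrows():
        player_games = games.loc[games["PLAYER_ID"] == inj["PLAYER_ID"]]
        inj_game, prior_mean = injury_window(player_games, inj["Date"])
        if inj_game is None:
            continue  # no games found; the joined columns stay missing for this row
        record = {"row": idx}
        for stat in STATS:
            record[f"inj_{stat}"] = inj_game[stat]
            record[f"p5_{stat}"] = prior_mean[stat]
        records.append(record)
    metrics = pd.DataFrame(records, columns=columns).set_index("row")
    return injuries.join(metrics)

# Made-up data: seven consecutive games for player 1, none for player 2
games = pd.DataFrame({
    "PLAYER_ID": [1] * 7,
    "GAME_DATE_EST": pd.date_range("2015-01-01", periods=7, freq="D"),
    "MIN": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0],
    "PTS": [10, 20, 30, 40, 50, 60, 70],
})
injuries = pd.DataFrame({
    "PLAYER_ID": [1, 2],
    "Date": pd.to_datetime(["2015-01-06", "2015-01-06"]),
})
result = add_injury_metrics(injuries, games)
print(result)
```

Collecting the values in a list of records and joining once avoids growing the dataframe cell by cell inside the loop.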

The next cell creates the player performance metrics for the first game back from injury, along with their averages over the 5 games after the return.

In [None]:
#df_Injury_End
for index, row in df_Injury_End.iterrows():
        #games of just that player
        temp = df_Games_gamesDetails.loc[df_Games_gamesDetails['PLAYER_ID'] == row.PLAYER_ID]
        #games on and after the return date
        temp2 = temp.loc[(temp['GAME_DATE_EST'] >= row.Date)]
        #the first game back and the 5 games after it, earliest first (taken from the filtered games in temp2)
        game_set = temp2.nsmallest(6, 'GAME_DATE_EST')
        if len(game_set) > 0:
          #first game back from injury
          inj_game = game_set.iloc[0]
          #5 games post injury
          post5 = game_set.iloc[1:]
          #storing game data from the first game back
          df_Injury_End.at[index, 'inj_MIN'] = inj_game[['MIN']].MIN
          df_Injury_End.at[index,'inj_FGA'] = inj_game[['FGA']].FGA
          df_Injury_End.at[index,'inj_FG_PCT'] = inj_game[['FG_PCT']].FG_PCT
          df_Injury_End.at[index,'inj_FG3A'] = inj_game[['FG3A']].FG3A
          df_Injury_End.at[index,'inj_FG3_PCT'] = inj_game[['FG3_PCT']].FG3_PCT
          df_Injury_End.loc[index,'inj_FTA'] = inj_game[['FTA']].FTA
          df_Injury_End.loc[index,'inj_FT_PCT'] = inj_game[['FT_PCT']].FT_PCT
          df_Injury_End.loc[index,'inj_REB'] = inj_game[['REB']].REB
          df_Injury_End.loc[index,'inj_AST'] = inj_game[['AST']].AST
          df_Injury_End.loc[index,'inj_STL'] = inj_game[['STL']].STL
          df_Injury_End.loc[index,'inj_BLK'] = inj_game[['BLK']].BLK
          df_Injury_End.loc[index,'inj_TO'] = inj_game[['TO']].TO
          df_Injury_End.loc[index,'inj_PF'] = inj_game[['PF']].PF
          df_Injury_End.loc[index,'inj_PTS'] = inj_game[['PTS']].PTS
          df_Injury_End.loc[index,'inj_PLUS_MINUS'] = inj_game[['PLUS_MINUS']].PLUS_MINUS
          #storing averaged game data from the 5 games after the return (the p5_ prefix is kept for these columns)
          df_Injury_End.at[index,'p5_MIN'] = post5[['MIN']].MIN.mean()
          df_Injury_End.at[index,'p5_FGA'] = post5[['FGA']].FGA.mean()
          df_Injury_End.at[index,'p5_FG_PCT'] = post5[['FG_PCT']].FG_PCT.mean()
          df_Injury_End.at[index,'p5_FG3A'] = post5[['FG3A']].FG3A.mean()
          df_Injury_End.at[index,'p5_FG3_PCT'] = post5[['FG3_PCT']].FG3_PCT.mean()
          df_Injury_End.at[index,'p5_FTA'] = post5[['FTA']].FTA.mean()
          df_Injury_End.at[index,'p5_FT_PCT'] = post5[['FT_PCT']].FT_PCT.mean()
          df_Injury_End.at[index,'p5_REB'] = post5[['REB']].REB.mean()
          df_Injury_End.at[index,'p5_AST'] = post5[['AST']].AST.mean()
          df_Injury_End.at[index,'p5_STL'] = post5[['STL']].STL.mean()
          df_Injury_End.at[index,'p5_BLK'] = post5[['BLK']].BLK.mean()
          df_Injury_End.at[index,'p5_TO'] = post5[['TO']].TO.mean()
          df_Injury_End.at[index,'p5_PF'] = post5[['PF']].PF.mean()
          df_Injury_End.at[index,'p5_PTS'] = post5[['PTS']].PTS.mean()
          df_Injury_End.at[index,'p5_PLUS_MINUS'] = post5[['PLUS_MINUS']].PLUS_MINUS.mean()
        #print(inj_game)

        #print(inj_game)
        #print(prior5)
df_Injury_End.to_csv('df_Injury_End.csv')