# Final Project Notebook

The report for this final project can be found at this [link](https://cybertraining-dsc.github.io/report/fa20-523-301/project/project/).

## Part 1 Importing the functions

This file requires that we import Numpy, Matplotlib, Pylab, Keras, and Pandas

In [54]:
import numpy as np
import matplotlib.pyplot as plt
import pylab
import os, sys
import pandas as pd
import io
import requests
import warnings
import sklearn
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from datetime import datetime

In [None]:
! pip install cloudmesh-common -U

from cloudmesh.common.Benchmark import Benchmark

Requirement already up-to-date: cloudmesh-common in /usr/local/lib/python3.6/dist-packages (4.3.26)


In [3]:
! pip install utils
import utils

Collecting utils
  Downloading https://files.pythonhosted.org/packages/55/e6/c2d2b2703e7debc8b501caae0e6f7ead148fd0faa3c8131292a599930029/utils-1.0.1-py2.py3-none-any.whl
Installing collected packages: utils
Successfully installed utils-1.0.1


Now that the funtions have been imported the team can focus on the download coding. The following cells will set up an install for Kaggle files and prompt for an upload of the kaggle.json file for credentials. 

The mkdir function creates a directory for the Kaggle data. This cell will allow the team to verify that the kaggle.json file appropriately uploaded to the directory.

In [4]:
##import the kaggle.json from local to drive
!pip install -q kaggle
from google.colab import files
##when it asks you to choose a file select the kaggle.json located within the 'project' folder from the github repo
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"chelseagorius","key":"0a34819ed937ff55d31f4288ab40cf19"}'}

In [5]:
##make a kaggle and a data folder
!mkdir ~/.kaggle
!mkdir data
##copy the kaggle.json to the .kaggle folder then grant permissions
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
#test to see if kaggle is working, should print list of datasets
!kaggle datasets list

ref                                                       title                                               size  lastUpdated          downloadCount  
--------------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  
terenceshin/covid19s-impact-on-airport-traffic            COVID-19's Impact on Airport Traffic               106KB  2020-10-19 12:40:17           1292  
sootersaalu/amazon-top-50-bestselling-books-2009-2019     Amazon Top 50 Bestselling Books 2009 - 2019         15KB  2020-10-13 09:39:21           1261  
thomaskonstantin/highly-rated-children-books-and-stories  Highly Rated Children Books And Stories            106KB  2020-10-24 12:09:59            289  
tunguz/euro-parliament-proceedings-1996-2011              Euro Parliament Proceedings 1996 - 2011              1GB  2020-10-26 17:48:29             18  
rishidamarla/judicial-expenditures-across-all-50-states   Judicial Expenditures ac

Now, the team must download all of the datasets for the class. The three datasets are focused on the NBA. 

The first dataset is for injuries. Each injury will be used to set up players, timeframes, and severity of injuries. 

The other two datasets are for the player performance. By cross referencing this data to the previous list, the team will be able to see which players are limited from the injury and how performance is hampered by time in rehab.

In [6]:
##downloading all the datasets
!kaggle datasets download -d ghopkins/nba-injuries-2010-2018
!kaggle datasets download -d nathanlauga/nba-games
!kaggle datasets download -d pablote/nba-enhanced-stats
##unzipping to the data folder
!unzip nba-injuries-2010-2018.zip -d data
!unzip nba-games.zip -d data
!unzip nba-enhanced-stats.zip -d data

Downloading nba-injuries-2010-2018.zip to /content
  0% 0.00/226k [00:00<?, ?B/s]
100% 226k/226k [00:00<00:00, 72.9MB/s]
Downloading nba-games.zip to /content
 72% 13.0M/18.1M [00:00<00:00, 18.8MB/s]
100% 18.1M/18.1M [00:00<00:00, 21.9MB/s]
Downloading nba-enhanced-stats.zip to /content
 54% 9.00M/16.7M [00:00<00:00, 23.3MB/s]
100% 16.7M/16.7M [00:00<00:00, 37.5MB/s]


The team must now use these downloads to create dataframes. Pandas dataframes will be easier to manage the data. The team will be able to use Pandas to process the data and allow the team to make correlations for feature engineering to create the models.

In [90]:
#create a list for each data set
ds_NBA_Injuries, ds_NBA_Games, ds_NBA_Enhanced = [], [], []

#import csv files as dataframes and save to respective list, injury set first
df_Injuries = pd.read_csv('data/injuries_2010-2020.csv')
df_Injury_Start = df_Injuries[df_Injuries.Acquired.isnull()]
df_Injury_End = df_Injuries[df_Injuries.Relinquished.isnull()]
ds_NBA_Injuries = [df_Injury_Start, df_Injury_End]
#nba games dataset
df_Games_games = pd.read_csv('data/games.csv')
df_Games_gamesDetails = pd.read_csv('data/games_details.csv')
df_Games_players = pd.read_csv('data/players.csv')
df_Games_ranking = pd.read_csv('data/ranking.csv')
df_Games_teams = pd.read_csv('data/teams.csv')
ds_NBA_Games = [df_Games_games, df_Games_gamesDetails, df_Games_players, df_Games_ranking, df_Games_teams]
#nba enhanced stats dataset
#df_En_officialBS_1218 = pd.read_csv('data/2012-18_officialBoxScore.csv')
#df_En_playerBS_1218 = pd.read_csv('data/2012-18_playerBoxScore.csv')
#df_En_standings_1218 = pd.read_csv('data/2012-18_standings.csv')
#df_En_teamBS_1218 = pd.read_csv('data/2012-18_teamBoxScore.csv')  
#df_En_officialBS_1617 = pd.read_csv('data/2016-17_officialBoxScore.csv')  
#df_En_playerBS_1617 = pd.read_csv('data/2016-17_playerBoxScore.csv')
#df_En_standings_1617 = pd.read_csv('data/2016-17_standings.csv')
#df_En_teamBS_1617 = pd.read_csv('data/2016-17_teamBoxScore.csv')  
#df_En_officialBS_1718 = pd.read_csv('data/2017-18_officialBoxScore.csv')  
#df_En_playerBS_1718 = pd.read_csv('data/2017-18_playerBoxScore.csv')
#df_En_standings_1718 = pd.read_csv('data/2017-18_standings.csv')
#df_En_teamBS_1718 = pd.read_csv('data/2017-18_teamBoxScore.csv')  
##data/metadata_officialBoxScore.pdf, data/metadata_playerBoxScore.pdf, data/metadata_standing.pdf, data/metadata_teamBoxScore.pdf  
#df_En_teamBS = pd.read_csv('data/teamBoxScore.csv')
#ds_NBA_Enhanced = [df_En_officialBS_1218, df_En_officialBS_1617, df_En_officialBS_1718, df_En_playerBS_1218, df_En_playerBS_1617, df_En_playerBS_1718, df_En_standings_1218, df_En_standings_1617, df_En_standings_1718, \
#                       df_En_teamBS_1218, df_En_teamBS_1617, df_En_teamBS_1718, df_En_teamBS]


#probably need some more data exploration and some feature engineering

###Feature Engineering for Injury sets
#####Goal is to have stats for injury game, average stats of last/first 5 games and maybe join season avg?

In [91]:
#distinct player and player ID list
df_distinct_playerID = df_Games_players[["PLAYER_NAME", "PLAYER_ID"]].drop_duplicates()
df_distinct_playerID.astype({'PLAYER_ID':'object'}).dtypes
#distinct gameID and game date list
df_Games_games['GAME_DATE_EST'] = pd.to_datetime(df_Games_games['GAME_DATE_EST'])
df_distinct_gameId_date = df_Games_games[["GAME_ID", "GAME_DATE_EST"]].drop_duplicates()
#join player ID, for j=injury start db
df_Injury_Start = df_Injury_Start.join(df_distinct_playerID.astype('object').set_index('PLAYER_NAME'), on='Relinquished')
df_Injury_Start = df_Injury_Start.merge(df_Games_teams[["TEAM_ID", "NICKNAME"]], left_on="Team", right_on="NICKNAME")
df_Injury_Start = df_Injury_Start.drop(['NICKNAME'], axis=1)
df_Injury_Start['Date']= pd.to_datetime(df_Injury_Start['Date'])#.apply(lambda x: x.date())
#again for injury end db
df_Injury_End = df_Injury_End.join(df_distinct_playerID.astype('object').set_index('PLAYER_NAME'), on='Acquired')
df_Injury_End = df_Injury_End.merge(df_Games_teams[["TEAM_ID", "NICKNAME"]], left_on="Team", right_on="NICKNAME")
df_Injury_End = df_Injury_End.drop(['NICKNAME'], axis=1)
df_Injury_End['Date']= pd.to_datetime(df_Injury_End['Date'])#.apply(lambda x: x.date())
# df_distinct_playerID=df_distinct_playerID.sort_values('PLAYER_NAME')
df_Games_gamesDetails = df_Games_gamesDetails.merge(df_distinct_gameId_date, on="GAME_ID")

# df_distinct_playerID = df_distinct_playerID.sort_values(by=['PLAYER_NAME']).reset_index(drop=True, inplace=True)
#df_Injury_End

## Exploratory Data Analysis

At this point, it is time to build into new useful sets of data. The team will explore different sets to combine in to the models to be trained.

### **To be deleted later**
This code is put in to make the dataset much smaller. The datasets will normally be done on the larger set, but for buildup we want to use a smaller subset so the training does not take hours.

In [None]:
### GH_Add ## Slicing Rows to make it easier to build up program
df_Games_gamesDetails_orig = df_Games_gamesDetails.copy()
df_Games_gamesDetails = df_Games_gamesDetails[0:201].copy()

df_Injury_Start_orig = df_Injury_Start.copy()
df_Injury_Start = df_Injury_Start[0:201].copy()

df_Injury_End_orig = df_Injury_End.copy()
df_Injury_End = df_Injury_End[0:201].copy()

In [None]:
for index, row in df_Games_gamesDetails.iterrows():
  try:
    m, s = str(row.MIN).split(':')
  except (SyntaxError, ValueError) as e:
    m = (row.MIN)
    s = 0
  df_Games_gamesDetails.loc[index,'MIN'] = pd.to_numeric(m) + pd.to_numeric(s)/60




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)


In [None]:
for index, row in df_Injury_Start.iterrows():
        #games of just that player
        temp = df_Games_gamesDetails.loc[df_Games_gamesDetails['PLAYER_ID'] == row.PLAYER_ID]
        #games before and inlucding injury date
        inj_game = temp.loc[(temp['GAME_DATE_EST'] == row.Date)]
        #5 games prior and the game of injury, for some reason we need to have 4 different variabels, did not work with resetting the variable 'game_set' to itself
        temp2 = temp.loc[(temp['GAME_DATE_EST'] <= row.Date)]
        game_set = temp2.nlargest(6, 'GAME_DATE_EST')
        if len(game_set) > 0:
          #injury game
          inj_game = game_set.iloc[0]
          #5 games prior to injury
          prior5 = game_set.iloc[1:]
          #storing game data from injury game
          df_Injury_Start.at[index, 'inj_MIN'] = inj_game[['MIN']].MIN
          df_Injury_Start.at[index,'inj_FGA'] = inj_game[['FGA']].FGA
          df_Injury_Start.at[index,'inj_FG_PCT'] = inj_game[['FG_PCT']].FG_PCT
          df_Injury_Start.at[index,'inj_FG3A'] = inj_game[['FG3A']].FG3A
          df_Injury_Start.at[index,'inj_FG3_PCT'] = inj_game[['FG3_PCT']].FG3_PCT
          df_Injury_Start.loc[index,'inj_FTA'] = inj_game[['FTA']].FTA
          df_Injury_Start.loc[index,'inj_FT_PCT'] = inj_game[['FT_PCT']].FT_PCT
          df_Injury_Start.loc[index,'inj_REB'] = inj_game[['REB']].REB
          df_Injury_Start.loc[index,'inj_AST'] = inj_game[['AST']].AST
          df_Injury_Start.loc[index,'inj_STL'] = inj_game[['STL']].STL
          df_Injury_Start.loc[index,'inj_BLK'] = inj_game[['BLK']].BLK
          df_Injury_Start.loc[index,'inj_TO'] = inj_game[['TO']].TO
          df_Injury_Start.loc[index,'inj_PF'] = inj_game[['PF']].PF
          df_Injury_Start.loc[index,'inj_PTS'] = inj_game[['PTS']].PTS
          df_Injury_Start.loc[index,'inj_PLUS_MINUS'] = inj_game[['PLUS_MINUS']].PLUS_MINUS
#storing game data from prior 5 games
          df_Injury_Start.at[index,'p5_MIN'] = prior5[['MIN']].MIN.mean()
          df_Injury_Start.at[index,'p5_FGA'] = prior5[['FGA']].FGA.mean()
          df_Injury_Start.at[index,'p5_FG_PCT'] = prior5[['FG_PCT']].FG_PCT.mean()
          df_Injury_Start.at[index,'p5_FG3A'] = prior5[['FG3A']].FG3A.mean()
          df_Injury_Start.at[index,'p5_FG3_PCT'] = prior5[['FG3_PCT']].FG3_PCT.mean()
          df_Injury_Start.at[index,'p5_FTA'] = prior5[['FTA']].FTA.mean()
          df_Injury_Start.at[index,'p5_FT_PCT'] = prior5[['FT_PCT']].FT_PCT.mean()
          df_Injury_Start.at[index,'p5_REB'] = prior5[['REB']].REB.mean()
          df_Injury_Start.at[index,'p5_AST'] = prior5[['AST']].AST.mean()
          df_Injury_Start.at[index,'p5_STL'] = prior5[['STL']].STL.mean()
          df_Injury_Start.at[index,'p5_BLK'] = prior5[['BLK']].BLK.mean()
          df_Injury_Start.at[index,'p5_TO'] = prior5[['TO']].TO.mean()
          df_Injury_Start.at[index,'p5_PF'] = prior5[['PF']].PF.mean()
          df_Injury_Start.at[index,'p5_PTS'] = prior5[['PTS']].PTS.mean()
          df_Injury_Start.at[index,'p5_PLUS_MINUS'] = prior5[['PLUS_MINUS']].PLUS_MINUS.mean()
          




In [None]:
#df_Injury_End
for index, row in df_Injury_End.iterrows():
        #games of just that player
        temp = df_Games_gamesDetails.loc[df_Games_gamesDetails['PLAYER_ID'] == row.PLAYER_ID]
        #games before and inlucding injury date
        temp2 = temp.loc[(temp['GAME_DATE_EST'] >= row.Date)]
        #5 games prior and the game of injury, for some reason we need to have 4 different variabels, did not work with resetting the variable 'game_set' to itself
        game_set = temp.nsmallest(6, 'GAME_DATE_EST')
        if len(game_set) > 0:
          #injury game
          inj_game = game_set.iloc[0]
          #5 games post injury
          post5 = game_set.iloc[1:]
          #storing game data from injury game
          df_Injury_End.at[index, 'inj_MIN'] = inj_game[['MIN']].MIN
          df_Injury_End.at[index,'inj_FGA'] = inj_game[['FGA']].FGA
          df_Injury_End.at[index,'inj_FG_PCT'] = inj_game[['FG_PCT']].FG_PCT
          df_Injury_End.at[index,'inj_FG3A'] = inj_game[['FG3A']].FG3A
          df_Injury_End.at[index,'inj_FG3_PCT'] = inj_game[['FG3_PCT']].FG3_PCT
          df_Injury_End.loc[index,'inj_FTA'] = inj_game[['FTA']].FTA
          df_Injury_End.loc[index,'inj_FT_PCT'] = inj_game[['FT_PCT']].FT_PCT
          df_Injury_End.loc[index,'inj_REB'] = inj_game[['REB']].REB
          df_Injury_End.loc[index,'inj_AST'] = inj_game[['AST']].AST
          df_Injury_End.loc[index,'inj_STL'] = inj_game[['STL']].STL
          df_Injury_End.loc[index,'inj_BLK'] = inj_game[['BLK']].BLK
          df_Injury_End.loc[index,'inj_TO'] = inj_game[['TO']].TO
          df_Injury_End.loc[index,'inj_PF'] = inj_game[['PF']].PF
          df_Injury_End.loc[index,'inj_PTS'] = inj_game[['PTS']].PTS
          df_Injury_End.loc[index,'inj_PLUS_MINUS'] = inj_game[['PLUS_MINUS']].PLUS_MINUS
          #storing game data from prior 5 games
          df_Injury_End.at[index,'p5_MIN'] = post5[['MIN']].MIN.mean()
          df_Injury_End.at[index,'p5_FGA'] = post5[['FGA']].FGA.mean()
          df_Injury_End.at[index,'p5_FG_PCT'] = post5[['FG_PCT']].FG_PCT.mean()
          df_Injury_End.at[index,'p5_FG3A'] = post5[['FG3A']].FG3A.mean()
          df_Injury_End.at[index,'p5_FG3_PCT'] = post5[['FG3_PCT']].FG3_PCT.mean()
          df_Injury_End.at[index,'p5_FTA'] = post5[['FTA']].FTA.mean()
          df_Injury_End.at[index,'p5_FT_PCT'] = post5[['FT_PCT']].FT_PCT.mean()
          df_Injury_End.at[index,'p5_REB'] = post5[['REB']].REB.mean()
          df_Injury_End.at[index,'p5_AST'] = post5[['AST']].AST.mean()
          df_Injury_End.at[index,'p5_STL'] = post5[['STL']].STL.mean()
          df_Injury_End.at[index,'p5_BLK'] = post5[['BLK']].BLK.mean()
          df_Injury_End.at[index,'p5_TO'] = post5[['TO']].TO.mean()
          df_Injury_End.at[index,'p5_PF'] = post5[['PF']].PF.mean()
          df_Injury_End.at[index,'p5_PTS'] = post5[['PTS']].PTS.mean()
          df_Injury_End.at[index,'p5_PLUS_MINUS'] = post5[['PLUS_MINUS']].PLUS_MINUS.mean()
        #print(inj_game)

        #print(inj_game)
        #print(prior5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead



## Adding
Starting to build up injury performance for model.

In [None]:
prior5

NameError: ignored

# Part 2 Building the Keras Model

A link for Keras for us to use can be found [here](https://keras.io/guides/sequential_model/). We are first going to set up our Benchmark Test to be used when we are Benchmarking our model.

In [None]:
def b():
  Benchmark.Start()
  print ("b")
  import time
  time.sleep(3)
  Benchmark.Stop()

def c():
  Benchmark.Start()
  print ("c")
  import time
  time.sleep(1)
  Benchmark.Stop()

In [None]:
 b()
 c()

 Benchmark.print()

b
c

+---------------------+------------------------------------------------------------------+
| Attribute           | Value                                                            |
|---------------------+------------------------------------------------------------------|
| BUG_REPORT_URL      | "https://bugs.launchpad.net/ubuntu/"                             |
| DISTRIB_CODENAME    | bionic                                                           |
| DISTRIB_DESCRIPTION | "Ubuntu 18.04.5 LTS"                                             |
| DISTRIB_ID          | Ubuntu                                                           |
| DISTRIB_RELEASE     | 18.04                                                            |
| HOME_URL            | "https://www.ubuntu.com/"                                        |
| ID                  | ubuntu                                                           |
| ID_LIKE             | debian                                                       

Now that we know which GPU we are using, we can get into the actual work. The following is building our Keras model.

In [None]:
np.random.seed(23)
warnings.filterwarnings("ignore")

In [None]:
df_baseline = df_Injury_End
sort_by = 'Acquired'

#df_baseline.sort_values(by=['Date','Name']).reset_index(drop=True, inplace=True)
df_baseline.sort_values(by=[sort_by]).reset_index(drop=True, inplace=True)
# df_baseline['FPTS_pred'] = utils.calculate_FPTS(df_baseline)

# # Season average
# print(' MAE | ', utils.calculate_MAE(df_baseline['FPTS_pred'], df_baseline['FPTS']))
# print('RMSE | ', utils.calculate_RMSE(df_baseline['FPTS_pred'], df_baseline['FPTS']))

In [None]:
df_baseline

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes,PLAYER_ID,TEAM_ID,NICKNAME,inj_MIN,inj_FGA,inj_FG_PCT,inj_FG3A,inj_FG3_PCT,inj_FTA,inj_FT_PCT,inj_REB,inj_AST,inj_STL,inj_BLK,inj_TO,inj_PF,inj_PTS,inj_PLUS_MINUS,p5_MIN,p5_FGA,p5_FG_PCT,p5_FG3A,p5_FG3_PCT,p5_FTA,p5_FT_PCT,p5_REB,p5_AST,p5_STL,p5_BLK,p5_TO,p5_PF,p5_PTS,p5_PLUS_MINUS
0,2010-10-03,Heat,Jerry Stackhouse,,activated from IL,711,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2010-10-27,Heat,Jamaal Magloire,,activated from IL,2048,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2010-11-24,Heat,Mario Chalmers,,returned to lineup,201596,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2010-11-26,Heat,Juwan Howard,,activated from IL,436,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2010-11-27,Heat,Jamaal Magloire,,activated from IL,2048,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
196,2013-12-16,Heat,Tyler Johnson,,returned to lineup,204020,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
197,2013-12-18,Heat,Dwyane Wade,,activated from IL,2548,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
198,2013-12-19,Heat,Chris Andersen,,returned to lineup,2365,1610612748,Heat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
199,2013-12-19,Heat,Hassan Whiteside,,activated from IL,202355,1610612748,Heat,31.1,14.0,0.643,0.0,0.0,6.0,0.5,13.0,1.0,1.0,3.0,1.0,3.0,21.0,-3.0,,,,,,,,,,,,,,,


In [None]:
#df_baseline.sort_values(by=['Date','Name']).reset_index(drop=True, inplace=True)
df_baseline.sort_values(by=[sort_by]).reset_index(drop=True, inplace=True)


# df_baseline['	PLAYER_ID'] = utils.calculate_FPTS(df_baseline)

# # Season average
# print(' MAE | ', utils.calculate_MAE(df_baseline['FPTS_pred'], df_baseline['FPTS']))
# print('RMSE | ', utils.calculate_RMSE(df_baseline['FPTS_pred'], df_baseline['FPTS']))

# Part No X. Building the Model

The team is now moving on to building the model for the baseline. Linear Regression can be used to model the values. Additionally, a Random Forest modeling function was used to verify model performance. 

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

The pipeline will be built off of using df_baseline.PLAYER_ID. To change the data going into the model, the team had to modify the dataframe input to get to the results. Comments were put around to make it easy to find the code in the Notebook.

In [None]:
#df_baseline = df_baseline
# basic =  ['PTS','3P','AST','TRB','STL','BLK','TOV', 'DD', 'TD']

###################################
##                               ##
##     INSERT CODE               ##
##                               ##
##     Change DF before here     ##
##                               ##
##                               ##
##                               ##
###################################

X = df_baseline.PLAYER_ID

In [None]:
X = X.reshape(-1, 1)
# X_reshape=X.reshape(-1, 1)
# X = df_baseline.loc[:, basic]
X = MinMaxScaler().fit_transform(X)
print(X.shape)
# Y = df_baseline['FPTS'].values.reshape(-1,1).flatten()
Y = df_baseline.values.reshape(-1,1).flatten() # Y is 38 times larger. Not sure what I did here. 
Y = Y.reshape(-1, 1) 

size_x = X.shape[0]
size_y = Y.shape[0]
size_y = (size_y/size_x)
print(size_y)

Y = Y.reshape((size_x, size_y)) # Y is 29 times larger. Not sure what I did here. 



print(X.shape)
print(Y.shape)


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=101)







lin_reg = LinearRegression()
# rf=RandomForestClassifier(max_depth=8,n_estimators=5)

# scores = cross_validate(lasso, X, Y, cv=3, scoring=('r2', 'neg_mean_squared_error'), return_train_score=True)


reg_cv_score=cross_val_score(estimator=lin_reg,X=X_train,y=Y_train,cv=5)
print(reg_cv_score)

# errors = utils.cross_val(reg, X, y, n_folds=5, verbose=0)
# utils.summarize_errors(errors)

NameError: ignored

The model has been built and trained on 2/3 of the data with a test on 1/3 of the data. 

In [None]:
# When the dataframes are combined, use this code to select features.

features = ['SG', 'F', 'C', 'PTS', '3P', 'AST', 'TRB', 'STL', 'BLK', 'TOV', 'DD', 'TD', 'MP', 'FT',
            'FTA', 'FGA', '3PA', 'DRB', 'ORB', 'USG_perc', 'DRtg', 'ORtg', 'AST_perc', 'DRB_perc',
            'ORB_perc', 'BLK_perc', 'TOV_perc', 'STL_perc', 'eFG_perc', 'FG_perc', '3P_perc', 'FT_perc']

In [None]:
_all = ['Salary', 'Rest', 'Rota_All', 'Rota_Pos', 'Home', 'SG', 'F', 'C', 'Value', 'FPTS_std',
        'PTS', '3P', 'AST', 'TRB', 'STL', 'BLK', 'TOV', 'DD', 'TD', 'MP', 'FT', 'FTA', 'FGA', '3PA', 'DRB',
        'ORB', 'USG_perc', 'DRtg', 'ORtg', 'AST_perc', 'DRB_perc', 'ORB_perc', 'BLK_perc', 'TOV_perc', 
        'STL_perc', 'eFG_perc', 'FG_perc', '3P_perc', 'FT_perc']

In [None]:
### This should work when we start passing numbers. It will help pick best features.


# X was called above

X = MinMaxScaler().fit_transform(X)
# y = df_features['FPTS'].values.reshape(-1,1).flatten()

Y = df_baseline.values.reshape(-1,1).flatten() # Y is 38 times larger. Not sure what I did here. 
Y = Y.reshape(-1, 1) 
Y = Y.reshape((16894, 38)) # Y is 38 times larger. Not sure what I did here. 

# Takes 2 minutes
# clf.set_params(n_estimators=2000)
# clf.fit(X, y, sample_weight=train_weight)

model = GradientBoostingRegressor()
model.fit(X, Y)

top_features = pd.Series(model.feature_importances_, index = _all).sort_values()
top_features.plot(kind = "barh", figsize=(15,10) ,title='Top Features')
plt.show()

ValueError: ignored

In [None]:
omit_lowest = 20
_selected = list(top_features[omit_lowest:].index)

# Building the Keras Model

In [None]:
tf.keras.backend.set_floatx('float64')

# Define Sequential model with 3 layers
model = keras.Sequential(
    [
        layers.Dense(2, activation="relu", name="layer1"),
        layers.Dense(3, activation="relu", name="layer2"),
        layers.Dense(4, name="layer3"),
    ]
)


# x = df_baseline
# x = tf.ones((3, 3))
Y = model(X)

print(Y)

tf.Tensor(
[[ 2.91940524e-07  3.08297296e-07 -3.65564912e-07 -7.21618691e-07]
 [ 2.91940524e-07  3.08297296e-07 -3.65564912e-07 -7.21618691e-07]
 [ 2.70738259e-05  2.85907116e-05 -3.39015655e-05 -6.69210926e-05]
 ...
 [ 2.18450302e-04  2.30689582e-04 -2.73541215e-04 -5.39965535e-04]
 [ 2.70726179e-05  2.85894359e-05 -3.39000528e-05 -6.69181066e-05]
 [ 2.18703317e-04  2.30956773e-04 -2.73858038e-04 -5.40590938e-04]], shape=(16894, 4), dtype=float64)


Insert Layers

In [None]:
# Create 3 layers
layer1 = layers.Dense(2, activation="relu", name="layer1")
layer2 = layers.Dense(3, activation="relu", name="layer2")
layer3 = layers.Dense(4, name="layer3")

# Call layers on a test input
# x = df_baseline
Y = layer3(layer2(layer1(X)))
print(Y)

tf.Tensor(
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 ...
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]], shape=(16894, 4), dtype=float64)


# Part 3 Conclusions

This is where the conclusions section will be typed