# DAT 205 Project - Prediction
## By Dennis Hung
## Version 1
## Code DRAFT 2021-04-03

## Code Structure
### Section 0: Function definitions
### Section 1: Import libraries
### Section 2: Configuration of variables

### Section 3: Load the dataset from file and initial analysis
- #### Section 3.1: Load the dataset from file
- #### Section 3.2: Initial Analysis

### Section 4: Transforming/cleansing the data 
- #### Section 4.1: Enhance the data
- #### Section 4.4 Filter data by Team (if specified)
- #### Section 4.5: Remove (Stage 1) from dataframe the unwanted numerical/categorical features
- #### Section 4.6: Transform categorical feature (WL) using value replace
- #### Section 4.7: Transform categorical features using LabelEncoder

- #### Section 4.8: Define TARGET variable and separate into dataframes by season type

## Section 5: Analysis - Heat Maps / Correlation Matrices

## Section 6: Modeling and Analysis
- ### Section 6.1: Prepare train and test data
- ### Section 6.4: Apply Random Forest Classifier on the split train/test dataset

## Section 8: Summary Report

## End of Code




# Updates

### 2021-03-30

- Added models xgboost and SVM for testing
- Tuned all models (if possible)
- Added code to save models (see https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/)
- Inserted flags to enable or disable individual models, so a run can focus on the models being analyzed

### 2021-03-27

The code now fully handles:
- Loading different input files, either raw from "HistoricalGameLogs_*.csv" or after data enhancement from "DAT205_Output_Enhanced_df_TF *.csv"
- Enabling/disabling the data enhancement process
- Filtering by a specific team, or using all the data, before performing the correlation matrix or model analysis





# Reference

#### How to Get NBA Data Using the nba_api Python Module (Beginner). Retrieved from Playing Numbers: 

https://www.playingnumbers.com/2019/12/how-to-get-nba-data-using-the-nba_api-python-module-beginner/

#### Patel, S. (2020, August 19). swar / nba_api. Retrieved from GitHub: 

https://github.com/swar/nba_api/blob/master/docs/table_of_contents.md

#### Issues

https://github.com/swar/nba_api/issues/124



# Note: 
#### For this analysis, the code uses as its dataset either the CSV output from "DAT 205-Group01-NBA-HistPlayGameLogs.ipynb" or the enhanced data produced by this code

# Section 0: Function definitions

hms_string(sec_elapsed)


In [1]:
# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60))/60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h,m,s)

# Null field analysis
def nullFieldAnalysis(df):
    df_missingDataInfo = pd.DataFrame({'Count': df.isnull().sum(), 'Percent': 100*df.isnull().sum()/len(df)})
    # Report only the columns whose percentage of missing values exceeds the threshold (e.g. 60 = 60%); 0 reports every column with any missing value
    null_threshold = 0 
    print("")
    print("")
    print("==== Null value analysis ====")
    return df_missingDataInfo[df_missingDataInfo['Percent'] > null_threshold].sort_values(by=['Percent'])

# CalcThreshold_List
# Returns record-count thresholds: 1, then 5% to 100% of totalRecords in 5% steps
def CalcThreshold_List(totalRecords):
    fractions = (0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1)
    Threshold_List = [1] + [int(round(totalRecords * f, 0)) for f in fractions]
    return Threshold_List
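
A quick way to see what `CalcThreshold_List` returns is to run the same arithmetic on the 72-record dataset loaded later in this notebook. The sketch below uses a compact stand-in, `threshold_list` (a hypothetical name, not part of the notebook), which should produce the same values as the function above. Reading the results as cumulative row counts is an assumption about how the list is meant to be used.

```python
def threshold_list(total_records):
    """Return 1 followed by 5%, 10%, ..., 100% of total_records, rounded."""
    # k / 20 steps through 0.05, 0.10, ..., 1.00
    return [1] + [int(round(total_records * (k / 20), 0)) for k in range(1, 21)]


thresholds = threshold_list(72)
print(thresholds)
# 21 entries: the single-record case plus twenty 5% steps
print(len(thresholds))
```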


# Section 1: Import libraries

In [2]:
# Install any missing libraries
# pip install xgboost
# pip install tpot

In [3]:
# Initialize required packages
# Standard packages
import numpy as np
import pandas as pd
import scipy as sp
import csv
import time
import joblib

# Graphing packages
# import seaborn as sns

# import matplotlib
# import matplotlib.pyplot as plt
# import matplotlib.patches as mpatches
# import matplotlib.lines as mlines

# Data preparation
from sklearn.preprocessing import LabelEncoder

# Modeling packages
# import tensorflow as tf
# import sklearn as skl
from sklearn.model_selection import train_test_split

# Regression modeling
# from sklearn.linear_model import LinearRegression
# from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# from sklearn.gaussian_process import GaussianProcessRegressor
# from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
# from sklearn.naive_bayes import GaussianNB
# from sklearn.svm import SVC
# from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
# import xgboost as xgb

# to fix xgboost warnings error
# https://github.com/EpistasisLab/tpot/issues/1139
# "Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior."
# from tpot import TPOTClassifier
# from tpot.config import classifier_config_dict

# from sklearn.ensemble import RandomForestRegressor


# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
from sklearn.model_selection import RandomizedSearchCV

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV
# from sklearn.model_selection import GridSearchCV

# from sklearn.model_selection import cross_val_score

# Confusion matrix, Accuracy, sensitivity and specificity
# from sklearn.model_selection import cross_val_score
# from sklearn.metrics import mean_squared_error, r2_score
# from sklearn.metrics import precision_score, \
    # recall_score, confusion_matrix, classification_report, \
    # accuracy_score, f1_score

# from sklearn.feature_selection import VarianceThreshold 
# from sklearn.feature_selection import RFE 
# from sklearn.feature_selection import RFECV

# Clustering
# from sklearn.datasets import make_blobs
# from sklearn.cluster import KMeans
# from sklearn.metrics import silhouette_samples, silhouette_score

# Following code is being deprecated
# from sklearn.datasets.samples_generator import make_blobs

# Initialize variables if there is any debugging required
# Insert following line and activate the debugging.
# # VALIDATION CODE 
# if debug_active == 'yes':
# 
# Use "display(df)" if the result command is "df" to retain the same format



start_time = time.time()
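
`start_time` is recorded here so that the total run time can be reported with `hms_string` from Section 0. A minimal sketch of that pattern follows. The summation is only a placeholder workload, and where the notebook actually reports the elapsed time is not shown in this section.

```python
import time


def hms_string(sec_elapsed):
    """Format a duration in seconds as H:MM:SS.ss (same logic as Section 0)."""
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


start_time = time.time()
total = sum(range(1000))  # placeholder for the real processing
elapsed = time.time() - start_time
print("Elapsed time:", hms_string(elapsed))

# A fixed duration makes the format easy to check: 1 h, 2 min, 5.5 s
formatted = hms_string(3725.5)
print(formatted)
```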

# Section 2: Configuration of variables

- The following variables must be set manually.

- gameTypeToProcess selects the game type by its code: 0 = 'Pre Season', 1 = 'Regular Season', 2 = 'Playoffs' (the entries of gameTypeListed)



In [4]:

# General configuration
debug_active = 'yes'
loop_max = 100
# showNumRecs = 15
numFormat = '{:.4f}'
numFormat_Pct = "{:.0%}"

# Data Transformation 'yes' or other
# dataEnhancement_active = 'yes'
dataEnhancement_active = 'no'
aggregatedTORGames = 'yes'

# Section 3.1: Load the dataset from file
# pick who is running the code and comment out the others
# coder = 'bhavika'
# coder = 'cindy'
coder = 'dennis'


# Folder locations
if aggregatedTORGames == 'yes':
    folder_Input = 'D:/_Data-Den/GitHub/Capstone/Data/AggrgatedGameLogs/'
else:
    folder_Input = './'
folder_Output = 'D:/_Data-Den/GitHub/Capstone/Code/_Dennis_Sandbox/_Output/'

# Setup file name for csv or Excel (.xlsx)
if coder == 'bhavika':
    filename = 'D:/McMaster/DAT205/Capstone/Data/HistoricalGameLogs_2004-05_to_2019-20_ALL.csv'
elif coder == 'dennis':
    # filename = './HistoricalGameLogs_2004-05_to_2019-20_ALL.csv'
    # filename = './DAT205_Output_Enhanced_df_TF 2004-2020.csv'
    filename = folder_Input + 'AggrgatedGameLogs_2019-2020_Regular_NewPlayers_V2.csv'
    # Test Data files
    # filename = './HistoricalGameLogs_2007-08_to_2008-09_ALL.csv'
    # filename = './DAT205_Output_Enhanced_df_TF 2007-09.csv'
  

# filename = filename + seasonStart + '_to_' + seasonEnd + '_' + gameType + '.csv'
# filename = filename + seasonStart + '_to_' + seasonEnd + '_ALL' + '.csv'


# Section 4.4 Filter data by Team (if specified)
# Filter the dataset by team or None
allTeamsList = ['CLE', 'LAC', 'NOH', 'WAS', 'ORL', 'NJN', 'PHX', 'DET', 'IND', \
       'CHA', 'DAL', 'ATL', 'NYK', 'CHI', 'BOS', 'MIN', 'PHI', 'HOU', \
       'POR', 'TOR', 'SAC', 'UTA', 'GSW', 'MIA', 'SEA', 'MEM', 'LAL', \
       'SAS', 'DEN', 'MIL', 'NOK', 'ZAK', 'CHN', 'PAN', 'RMA', 'MMT', \
       'MTA', 'MAL', 'LRO', 'EPT', 'OKC', 'LRY', 'BAR', 'MOS', 'OLP', \
       'PAR', 'LAB', 'MAC', 'MLN', 'BKN', 'FCB', 'RMD', 'MPS', 'EAM', \
       'ALB', 'FBU', 'NOP', 'UBB', 'FLA', 'BAU', 'FEN', 'SLA', 'SDS', \
       'BNE', 'MEL', 'SYD', 'GUA', 'PER', 'ADL', 'NZB', 'BJD', 'FRA']
# teamSelected = 'None'
teamSelected = 'TOR'

# Section 6: Modeling and Analysis
random_state_val = 42
model_list = ['LogRegM', 'DTM', 'RFM', 'RFM_RSCV', 'XGBM']

# Which models are active. At least 1 must be yes
useModel_LogRegM = 'no'
useModel_LogRegM_RSCV = 'no'
useModel_DTM = 'no'
useModel_DTM_RSCV = 'no'
useModel_RFM = 'yes'
useModel_RFM_RSCV = 'no'
useModel_XGBM = 'no'
useModel_XGBM_RSCV = 'no'
useModel_SVCM = 'no'


# Section 6.1: Prepare train and test data
# Select a season 
gameTypeListed = ['Pre Season', 'Regular Season', 'Playoffs']
gameTypeListed_code = [0, 1, 2]
gameTypeToProcess = 1
test_size_val = 1.0

# Section 5: Analysis - Heat Maps / Correlation Matrices
plotSize = (20,15)
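
The configuration comment states that at least one model flag must be 'yes'. One way to enforce this is sketched below. The `model_flags` dictionary and the `active_models` name are hypothetical stand-ins for the individual `useModel_*` variables above, not part of the notebook.

```python
# Hypothetical dictionary mirroring the useModel_* flags in the configuration cell
model_flags = {
    'LogRegM': 'no',
    'LogRegM_RSCV': 'no',
    'DTM': 'no',
    'DTM_RSCV': 'no',
    'RFM': 'yes',
    'RFM_RSCV': 'no',
    'XGBM': 'no',
    'XGBM_RSCV': 'no',
    'SVCM': 'no',
}

# Collect the models that are switched on
active_models = [name for name, flag in model_flags.items() if flag == 'yes']

# Stop early if nothing is selected, instead of failing later in the modeling section
if not active_models:
    raise ValueError("At least one model flag must be set to 'yes'")

print(active_models)
```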



# Section 3: Load the dataset from file and initial analysis

## Section 3.1: Load the dataset from file

In [5]:
# Load the CSV or Excel file
# Note: in Jupyter Notebook, another option is to upload the CSV files before running the code

# List of column names that must be read as strings
lst_str_cols = ['GAME_ID']
# Use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x : 'str'  for x in lst_str_cols}
# Pass the dict as the dtype argument
df = pd.read_csv(filename, dtype=dict_dtypes)
# Excel file import
# df = pd.read_excel(filename)

# Remove duplicate index from import (skipped if the column is absent)
unwanted_list = ['Unnamed: 0']


X_headers_list = df.columns.tolist()
for x in unwanted_list:
    if x in X_headers_list:
        X_headers_list.remove(x)

# Display current dataframe
df_Initial = df[X_headers_list]

# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Initial)
    # Examine shape of dataframe
    display(df_Initial.shape)
    # Examine the type of attributes in the dataframe
    print("Shape of the dataset")
    df_Initial.info()
    # Describe the numerical data
    df_Initial.describe()
    


Unnamed: 0,SEASON_YEAR,Game_Type,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,MIN,FGM,...,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER
0,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900001,2019-10-22T00:00:00,TOR vs. NOP,265.000000,42.0,...,7.0,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301
1,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900017,2019-10-25T00:00:00,TOR @ BOS,248.394444,39.5,...,2.5,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341
2,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900031,2019-10-26T00:00:00,TOR @ CHI,248.661111,40.5,...,8.5,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292
3,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900044,2019-10-28T00:00:00,TOR vs. ORL,247.670000,35.0,...,9.0,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927
4,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900060,2019-10-30T00:00:00,TOR vs. DET,241.616667,52.0,...,13.0,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901279,2020-08-07T00:00:00,TOR vs. BOS,248.094444,38.5,...,8.5,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912
68,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901286,2020-08-09T00:00:00,TOR vs. MEM,240.000000,34.0,...,17.0,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098
69,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901294,2020-08-10T00:00:00,TOR @ MIL,243.264444,45.5,...,10.5,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673
70,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901305,2020-08-12T00:00:00,TOR @ PHI,244.627778,41.5,...,8.5,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161


(72, 33)

Shape of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SEASON_YEAR        72 non-null     object 
 1   Game_Type          72 non-null     object 
 2   TEAM_ID            72 non-null     int64  
 3   TEAM_ABBREVIATION  72 non-null     object 
 4   TEAM_NAME          72 non-null     object 
 5   GAME_ID            72 non-null     object 
 6   GAME_DATE          72 non-null     object 
 7   MATCHUP            72 non-null     object 
 8   MIN                72 non-null     float64
 9   FGM                72 non-null     float64
 10  FGA                72 non-null     float64
 11  FG_PCT             72 non-null     float64
 12  FG3M               72 non-null     float64
 13  FG3A               72 non-null     float64
 14  FG3_PCT            72 non-null     float64
 15  FTM                72 non-null     float64
 16  FTA    
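
The `dtype` argument matters for `GAME_ID` because identifiers such as 0010700072 (noted in Section 4) start with zeros, and reading the column as a number drops them. The sketch below shows the difference on a small in-memory CSV built for illustration. It is not one of the project files.

```python
import io

import pandas as pd

# Two illustrative rows; only the GAME_ID format matters here
csv_text = "GAME_ID,PTS\n0010700072,36\n0010800035,77\n"

# Default parsing infers an integer column, so the leading zeros are lost
df_default = pd.read_csv(io.StringIO(csv_text))

# Forcing GAME_ID to str keeps the identifier exactly as written in the file
df_str = pd.read_csv(io.StringIO(csv_text), dtype={'GAME_ID': 'str'})

print(df_default['GAME_ID'].tolist())
print(df_str['GAME_ID'].tolist())
```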

## Section 3.2: Initial Analysis

In [6]:
# Display the headers of columns that use descriptive or non-numerical values
categorical_Features = df_Initial.dtypes[df_Initial.dtypes == "object"].index.tolist()

# VALIDATION CODE 
if debug_active == 'yes':
    print("VALIDATION CODE")
    print(categorical_Features)

# Describe the categorical data
print("")
print("")
print("==== Description of the categorical features ====")
display(df_Initial[categorical_Features].describe())

# # Null field analysis
nullFieldAnalysis(df_Initial)
# # Null field analysis
# df_missingDataInfo = pd.DataFrame({'Count': df_Initial.isnull().sum(), 'Percent': 100*df_Initial.isnull().sum()/len(df)})

# #Printing the columns with over XX% of missing values (ie 60 = 60%) This is set to 0 for 0%
# null_threshold = 0 
# print("")
# print("")
# print("==== Null value analysis ====")
# df_missingDataInfo[df_missingDataInfo['Percent'] > null_threshold].sort_values(by=['Percent'])

VALIDATION CODE
['SEASON_YEAR', 'Game_Type', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP']


==== Description of the categorical features ====


Unnamed: 0,SEASON_YEAR,Game_Type,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP
count,72,72,72,72,72,72,72
unique,1,1,1,1,72,72,54
top,2019-20,Regular Season,TOR,Toronto Raptors,21900809,2019-12-03T00:00:00,TOR vs. IND
freq,72,72,72,72,1,1,2




==== Null value analysis ====


Unnamed: 0,Count,Percent
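
The empty table above means that no column in this dataset has missing values. To see what the report looks like when values are missing, the same calculation can be run on a small frame with a gap. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample: one missing PTS value out of four rows
df_demo = pd.DataFrame({
    'PTS': [130.0, np.nan, 114.5, 109.0],
    'AST': [23.0, 22.0, 33.0, 28.0],
})

# Same calculation as nullFieldAnalysis: count and percentage of nulls per column
null_counts = df_demo.isnull().sum()
df_missing = pd.DataFrame({'Count': null_counts,
                           'Percent': 100 * null_counts / len(df_demo)})

# Keep only the columns above the threshold (0 = any missing value)
report = df_missing[df_missing['Percent'] > 0].sort_values(by=['Percent'])
print(report)
```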



# Section 4: Transforming/cleansing the data 

## Data cleansing of nulls (Not working)

## Correction to missing PreSeason games WL values only 

49 PreSeason records 

2007-08 
GAME_ID 0010700072 / 2007-10-19
BOS vs NJN   W 36 to L 33

2008-09 
GAME_ID 0010800035 / 2008-10-11
DEN vs PHX   W 77 to L 72
Note: some player game data seems to be missing

## Corrected missing player name data

740 records (727 preseason and 13 regular season)

This is not important as the player names are excluded from the analysis



## Section 4.1: Enhance the data

In [7]:
# Setup variables for data transformation
df_TF = df_Initial
totalNumRec = df_TF.shape[0]

# Check df_TeamGameStats
# VALIDATION CODE 
if debug_active == 'yes':
    print(totalNumRec)
    display(df_TF)
    print(df_TF.columns)

72


Unnamed: 0,SEASON_YEAR,Game_Type,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,MIN,FGM,...,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER
0,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900001,2019-10-22T00:00:00,TOR vs. NOP,265.000000,42.0,...,7.0,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301
1,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900017,2019-10-25T00:00:00,TOR @ BOS,248.394444,39.5,...,2.5,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341
2,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900031,2019-10-26T00:00:00,TOR @ CHI,248.661111,40.5,...,8.5,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292
3,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900044,2019-10-28T00:00:00,TOR vs. ORL,247.670000,35.0,...,9.0,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927
4,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900060,2019-10-30T00:00:00,TOR vs. DET,241.616667,52.0,...,13.0,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901279,2020-08-07T00:00:00,TOR vs. BOS,248.094444,38.5,...,8.5,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912
68,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901286,2020-08-09T00:00:00,TOR vs. MEM,240.000000,34.0,...,17.0,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098
69,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901294,2020-08-10T00:00:00,TOR @ MIL,243.264444,45.5,...,10.5,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673
70,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901305,2020-08-12T00:00:00,TOR @ PHI,244.627778,41.5,...,8.5,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161


Index(['SEASON_YEAR', 'Game_Type', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME',
       'GAME_ID', 'GAME_DATE', 'MATCHUP', 'MIN', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', 'PTS',
       'PLUS_MINUS', 'DD2', 'TD3', 'PER'],
      dtype='object')


## Section 4.4 Filter data by Team (if specified)

In [8]:
# If a specific team is selected by 'TEAM_ABBREVIATION', recreate the dataframe with this filter; otherwise use the entire dataset as is.
if teamSelected in allTeamsList:
    df_TF = df_TF[df_TF['TEAM_ABBREVIATION']==teamSelected]
display(df_TF)

Unnamed: 0,SEASON_YEAR,Game_Type,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,MIN,FGM,...,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER
0,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900001,2019-10-22T00:00:00,TOR vs. NOP,265.000000,42.0,...,7.0,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301
1,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900017,2019-10-25T00:00:00,TOR @ BOS,248.394444,39.5,...,2.5,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341
2,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900031,2019-10-26T00:00:00,TOR @ CHI,248.661111,40.5,...,8.5,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292
3,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900044,2019-10-28T00:00:00,TOR vs. ORL,247.670000,35.0,...,9.0,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927
4,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21900060,2019-10-30T00:00:00,TOR vs. DET,241.616667,52.0,...,13.0,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901279,2020-08-07T00:00:00,TOR vs. BOS,248.094444,38.5,...,8.5,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912
68,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901286,2020-08-09T00:00:00,TOR vs. MEM,240.000000,34.0,...,17.0,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098
69,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901294,2020-08-10T00:00:00,TOR @ MIL,243.264444,45.5,...,10.5,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673
70,2019-20,Regular Season,1610612761,TOR,Toronto Raptors,21901305,2020-08-12T00:00:00,TOR @ PHI,244.627778,41.5,...,8.5,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161
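
In this run the filter leaves the data unchanged, because the input file contains only TOR rows. The sketch below shows the same boolean-mask filter on an invented frame with two teams, wrapped in a helper named `filter_by_team` (a hypothetical name, not part of the notebook) so the fallback to the full dataset is explicit.

```python
import pandas as pd


def filter_by_team(df, team, known_teams):
    """Return the rows for `team` if it is a known abbreviation, else the full frame."""
    if team in known_teams:
        return df[df['TEAM_ABBREVIATION'] == team]
    return df


# Invented sample rows
df_demo = pd.DataFrame({
    'TEAM_ABBREVIATION': ['TOR', 'BOS', 'TOR'],
    'PTS': [130.0, 112.0, 111.5],
})
known = ['TOR', 'BOS']

tor_only = filter_by_team(df_demo, 'TOR', known)
everything = filter_by_team(df_demo, 'None', known)  # 'None' is not a team, so no filtering
print(len(tor_only), len(everything))
```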


## Section 4.5: Remove (Stage 1) from dataframe the unwanted numerical/categorical features

#### Note: if data enhancement is done, adjust unwanted_categorical_Features_TF

In [9]:
# Gather current list of features
numerical_Features = df_TF.columns.tolist()

# All possible features
# ['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS', 'DD2', 'TD3', 'Game_Type']

for i in categorical_Features: 
    numerical_Features.remove(i)

# Lists unwanted features
# unwanted_numerical_Features = ['PLAYER_ID', 'TEAM_ID']
# unwanted_categorical_Features = ['PLAYER_NAME', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP']
unwanted_numerical_Features = ['TEAM_ID']
unwanted_categorical_Features = ['TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# If enhancement was done, use this to drop the extra features you don't want.
if dataEnhancement_active == 'yes':
    unwanted_categorical_Features_TF = ['UID_STG']

# unwanted_list_01 = unwanted_numerical_Features + unwanted_categorical_Features + unwanted_categorical_Features_TF
if dataEnhancement_active == 'yes':
    unwanted_list_01 = unwanted_numerical_Features + unwanted_categorical_Features + unwanted_categorical_Features_TF
else:
    unwanted_list_01 = unwanted_numerical_Features + unwanted_categorical_Features
X_headers_list = df_TF.columns.tolist()

for i in unwanted_list_01:
    X_headers_list.remove(i)

# Reset new dataframe with desired features
df_Reduced = df_TF[X_headers_list]

# Remaining attributes
# VALIDATION CODE 
if debug_active == 'yes':
    display(X_headers_list)

['SEASON_YEAR',
 'Game_Type',
 'MIN',
 'FGM',
 'FGA',
 'FG_PCT',
 'FG3M',
 'FG3A',
 'FG3_PCT',
 'FTM',
 'FTA',
 'FT_PCT',
 'OREB',
 'DREB',
 'REB',
 'AST',
 'TOV',
 'STL',
 'BLK',
 'BLKA',
 'PF',
 'PFD',
 'PTS',
 'PLUS_MINUS',
 'DD2',
 'TD3',
 'PER']
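
The list-based removal above can also be expressed with `DataFrame.drop`. The sketch below is an alternative, not the notebook's method. With `errors='ignore'`, pandas skips labels that are not present, which may be convenient when the optional 'UID_STG' column exists only after data enhancement. The one-row frame is a small sample for illustration.

```python
import pandas as pd

# Small sample with two unwanted columns and two features to keep
df_demo = pd.DataFrame({
    'TEAM_ID': [1610612761],
    'TEAM_NAME': ['Toronto Raptors'],
    'PTS': [130.0],
    'PER': [129.756301],
})

# 'UID_STG' is absent here on purpose to show that it is skipped silently
unwanted = ['TEAM_ID', 'TEAM_NAME', 'UID_STG']
df_reduced_demo = df_demo.drop(columns=unwanted, errors='ignore')

print(df_reduced_demo.columns.tolist())
```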

## Section 4.6: Transform categorical feature (WL) using value replace

In [10]:
# cleaned_categorical_Features = ['WL', 'Game_Type']
cleaned_categorical_Features = ['Game_Type']
cleanupValue = {'WL': {'W': 1, 'L': 0}, 'Game_Type': {'Pre Season': 0, 'Regular Season': 1, 'Playoffs': 2}}
df_Reduced = df_Reduced.replace(cleanupValue)

# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Reduced)

Unnamed: 0,SEASON_YEAR,Game_Type,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER
0,2019-20,1,265.000000,42.0,103.0,0.407767,14.000000,40.000000,0.350000,32.000000,...,7.0,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301
1,2019-20,1,248.394444,39.5,82.0,0.481707,17.166667,34.666667,0.495192,15.333333,...,2.5,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341
2,2019-20,1,248.661111,40.5,86.0,0.470930,15.166667,41.666667,0.364000,18.333333,...,8.5,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292
3,2019-20,1,247.670000,35.0,89.0,0.393258,10.000000,39.000000,0.256410,29.000000,...,9.0,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927
4,2019-20,1,241.616667,52.0,89.0,0.584270,12.000000,26.000000,0.461538,12.000000,...,13.0,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,2019-20,1,248.094444,38.5,91.0,0.423077,9.166667,36.666667,0.250000,15.333333,...,8.5,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912
68,2019-20,1,240.000000,34.0,71.0,0.478873,15.000000,39.000000,0.384615,25.000000,...,17.0,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098
69,2019-20,1,243.264444,45.5,99.0,0.459596,12.166667,40.666667,0.299180,19.333333,...,10.5,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673
70,2019-20,1,244.627778,41.5,92.0,0.451087,18.166667,42.666667,0.425781,27.333333,...,8.5,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161
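
The nested dictionary passed to `replace` maps each listed column's labels to integers. For a single column, `Series.map` is an alternative way to do the same thing, sketched below on an invented frame. Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which can make unexpected values easier to spot.

```python
import pandas as pd

# Same label-to-code mapping as cleanupValue['Game_Type']
game_type_codes = {'Pre Season': 0, 'Regular Season': 1, 'Playoffs': 2}

# Invented sample rows
df_demo = pd.DataFrame({
    'Game_Type': ['Regular Season', 'Playoffs', 'Pre Season', 'Regular Season'],
})

df_demo['Game_Type'] = df_demo['Game_Type'].map(game_type_codes)
print(df_demo['Game_Type'].tolist())
```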


## Section 4.7: Transform categorical features using LabelEncoder

Label encoding is applied to the remaining categorical features, since their values have a natural order:

'SEASON_YEAR' - the more recent the season, the more relevant it is, whereas older data is less valuable

'Game_Type' - still to be decided, but the regular season is assumed to be more important
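
`LabelEncoder` assigns codes in sorted order of the labels, so season strings in the 'YYYY-YY' format receive increasing codes from oldest to newest. The sketch below illustrates this on three seasons chosen for illustration. A file that contains a single season, as in this run, encodes every row as 0.

```python
from sklearn.preprocessing import LabelEncoder

# Seasons deliberately listed out of order
seasons = ['2019-20', '2004-05', '2010-11', '2019-20']

encoder = LabelEncoder()
codes = encoder.fit_transform(seasons)

# classes_ holds the sorted unique labels; a label's position is its code
print(list(encoder.classes_))
print(codes.tolist())
```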

In [11]:
# Select features to encode (copy the list so the removals below do not alter categorical_Features)
e_categorical = list(categorical_Features)

print(e_categorical)

for i in unwanted_categorical_Features:
    e_categorical.remove(i)

print(unwanted_categorical_Features)

for j in cleaned_categorical_Features:
    e_categorical.remove(j)

print(cleaned_categorical_Features)

print(e_categorical)

# Reset variable
categorical_Features = df_Reduced.dtypes[df_Reduced.dtypes == "object"].index.tolist()

lb_make = LabelEncoder()
# cat_list = ['Gender','Education_Level','Marital_Status','Income_Category','Card_Category']
# cat_list_code = ['Gender_code','Education_Level_code','Marital_Status_code','Income_Category_code','Card_Category_code']

# Work on a copy so the encoded columns are not also added to df_Reduced
df_Encoded = df_Reduced.copy()
# df_Encoded = df_Reduced[e_categorical]




['SEASON_YEAR', 'Game_Type', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP']
['TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP']
['Game_Type']
['SEASON_YEAR']


In [12]:
# Apply LabelEncoding on e_categorical features

for k in e_categorical:
    val_A = k
    val_B = k + '_code'
    df_Encoded[(val_B)] = lb_make.fit_transform(df_Encoded[val_A])

# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Encoded) #Results in appending a new column to df

Unnamed: 0,SEASON_YEAR,Game_Type,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER,SEASON_YEAR_code
0,2019-20,1,265.000000,42.0,103.0,0.407767,14.000000,40.000000,0.350000,32.000000,...,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301,0
1,2019-20,1,248.394444,39.5,82.0,0.481707,17.166667,34.666667,0.495192,15.333333,...,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341,0
2,2019-20,1,248.661111,40.5,86.0,0.470930,15.166667,41.666667,0.364000,18.333333,...,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292,0
3,2019-20,1,247.670000,35.0,89.0,0.393258,10.000000,39.000000,0.256410,29.000000,...,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927,0
4,2019-20,1,241.616667,52.0,89.0,0.584270,12.000000,26.000000,0.461538,12.000000,...,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,2019-20,1,248.094444,38.5,91.0,0.423077,9.166667,36.666667,0.250000,15.333333,...,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912,0
68,2019-20,1,240.000000,34.0,71.0,0.478873,15.000000,39.000000,0.384615,25.000000,...,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098,0
69,2019-20,1,243.264444,45.5,99.0,0.459596,12.166667,40.666667,0.299180,19.333333,...,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673,0
70,2019-20,1,244.627778,41.5,92.0,0.451087,18.166667,42.666667,0.425781,27.333333,...,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161,0


## Section 4.8: Define TARGET variable and separate into dataframes by season type

Remove (Stage 2) from the dataframe the features (categorical, target, and other unwanted)

Separating the dataframe by gameTypeListed ('Pre Season', 'Regular Season', 'Playoffs')

In [13]:
# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Encoded)
    print(e_categorical)

Unnamed: 0,SEASON_YEAR,Game_Type,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER,SEASON_YEAR_code
0,2019-20,1,265.000000,42.0,103.0,0.407767,14.000000,40.000000,0.350000,32.000000,...,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301,0
1,2019-20,1,248.394444,39.5,82.0,0.481707,17.166667,34.666667,0.495192,15.333333,...,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341,0
2,2019-20,1,248.661111,40.5,86.0,0.470930,15.166667,41.666667,0.364000,18.333333,...,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292,0
3,2019-20,1,247.670000,35.0,89.0,0.393258,10.000000,39.000000,0.256410,29.000000,...,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927,0
4,2019-20,1,241.616667,52.0,89.0,0.584270,12.000000,26.000000,0.461538,12.000000,...,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,2019-20,1,248.094444,38.5,91.0,0.423077,9.166667,36.666667,0.250000,15.333333,...,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912,0
68,2019-20,1,240.000000,34.0,71.0,0.478873,15.000000,39.000000,0.384615,25.000000,...,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098,0
69,2019-20,1,243.264444,45.5,99.0,0.459596,12.166667,40.666667,0.299180,19.333333,...,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673,0
70,2019-20,1,244.627778,41.5,92.0,0.451087,18.166667,42.666667,0.425781,27.333333,...,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161,0


['SEASON_YEAR']


In [14]:
# Configure variables
# gameTypeListed = ['Pre Season', 'Regular Season', 'Playoffs']
# gameTypeListed_code = [0, 1, 2]
# Y_headers_list1 = ['WL']
# Y_headers_list2 = ['WL', 'Game_Type']
Y_headers_list1 = []
Y_headers_list2 = ['Game_Type']
e_categorical = e_categorical + Y_headers_list2

# Define the current list of features
X_headers_list = df_Encoded.columns.tolist()

# Remove LabelEncoded categorical features
for k in e_categorical:
    X_headers_list.remove(k)

# VALIDATION CODE 
if debug_active == 'yes':
    print(e_categorical)

['SEASON_YEAR', 'Game_Type']


In [21]:
df_X_Reduced2 = df_Encoded[X_headers_list]
df_Y_Reduced2 = df_Encoded[Y_headers_list1]
# cleanDFColumns = ['Game_Type', 'SEASON_YEAR_code']
cleanDFColumns = ['SEASON_YEAR_code']

if aggregatedTORGames == 'yes':
    gameType = 1
    df_X_RegularSeason = df_X_Reduced2
    df_Y_RegularSeason = df_Y_Reduced2


# for gameType in gameTypeListed_code:
# is_gameType_X = df_X_Reduced2['Game_Type']==gameType
# is_gameType_Y = df_Y_Reduced2['Game_Type']==gameType
if gameType == 0:
    df_X_PreSeason = df_X_Reduced2[is_gameType_X]
    df_X_PreSeason = df_X_PreSeason.drop(cleanDFColumns, axis=1)
    df_Y_PreSeason = df_Y_Reduced2[is_gameType_Y]
    df_Y_PreSeason = df_Y_PreSeason[Y_headers_list2]
elif gameType == 1:
    # df_X_RegularSeason = df_X_Reduced2[is_gameType_X]
    df_X_RegularSeason = df_X_Reduced2
    df_X_RegularSeason = df_X_RegularSeason.drop(cleanDFColumns, axis=1)
    # df_Y_RegularSeason = df_Y_Reduced2[is_gameType_Y]
    df_Y_RegularSeason = df_Y_Reduced2
    # df_Y_RegularSeason = df_Y_RegularSeason[Y_headers_list2]
    df_Y_RegularSeason = df_Y_RegularSeason
elif gameType == 2:
    df_X_Playoffs = df_X_Reduced2[is_gameType_X]
    df_X_Playoffs = df_X_Playoffs.drop(cleanDFColumns, axis=1)
    df_Y_Playoffs = df_Y_Reduced2[is_gameType_Y]
    df_Y_Playoffs = df_Y_Playoffs[Y_headers_list2]


# VALIDATION CODE 
if debug_active == 'yes':
    # print("")
    # print("Pre Season")
    # display(df_X_PreSeason)
    # display(df_Y_PreSeason)
    print("")
    print("Regular Season")
    display(df_X_RegularSeason)
    display(df_Y_RegularSeason)
    # print("")
    # print("Playoffs")
    # display(df_X_Playoffs)
    # display(df_Y_Playoffs)


Regular Season


Unnamed: 0,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,...,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER
0,265.000000,42.0,103.0,0.407767,14.000000,40.000000,0.350000,32.000000,38.0,0.842105,...,7.0,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301
1,248.394444,39.5,82.0,0.481707,17.166667,34.666667,0.495192,15.333333,18.5,0.828829,...,2.5,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341
2,248.661111,40.5,86.0,0.470930,15.166667,41.666667,0.364000,18.333333,19.5,0.940171,...,8.5,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292
3,247.670000,35.0,89.0,0.393258,10.000000,39.000000,0.256410,29.000000,31.0,0.935484,...,9.0,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927
4,241.616667,52.0,89.0,0.584270,12.000000,26.000000,0.461538,12.000000,15.0,0.800000,...,13.0,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,248.094444,38.5,91.0,0.423077,9.166667,36.666667,0.250000,15.333333,17.5,0.876190,...,8.5,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912
68,240.000000,34.0,71.0,0.478873,15.000000,39.000000,0.384615,25.000000,33.0,0.757576,...,17.0,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098
69,243.264444,45.5,99.0,0.459596,12.166667,40.666667,0.299180,19.333333,25.5,0.758170,...,10.5,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673
70,244.627778,41.5,92.0,0.451087,18.166667,42.666667,0.425781,27.333333,39.5,0.691983,...,8.5,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161


0
1
2
3
4
...
67
68
69
70
71


# Section 5: Analysis - Heat Maps / Correlation Matrices

# Section 6: Modeling and Analysis

## Section 6.1: Prepare train and test data

In [24]:
# Select a season 
# gameTypeListed = ['Pre Season', 'Regular Season', 'Playoffs']
# gameTypeListed_code = [0, 1, 2]

if gameTypeToProcess == 0:
    X = df_X_PreSeason
    Y = df_Y_PreSeason
elif gameTypeToProcess == 1:
    X = df_X_RegularSeason
    Y = df_Y_RegularSeason
elif gameTypeToProcess == 2:
    X = df_X_Playoffs
    Y = df_Y_Playoffs

# # Split the data into training and test datasets (0.7/0.3)
# X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size_val, random_state=random_state_val)

# Prediction run: no split, the whole selected dataset is scored
X_test = X
Y_test = Y

# selectedSeasonRecordCount = X_train.shape[0] + X_test.shape[0]

# VALIDATION CODE 
if debug_active == 'yes':
    # Validate the split at a high level
    # print(X_train.shape,Y_train.shape)
    print(X_test.shape,Y_test.shape)
    print('Season Type: ', gameTypeToProcess)
    df_Encoded.to_csv('DAT205_Output_All.csv') 
    # X_train.to_csv('DAT205_Output_Split_X_train.csv') 
    X_test.to_csv('DAT205_Output_Split_X_test.csv') 
    # Y_train.to_csv('DAT205_Output_Split_Y_train.csv') 
    Y_test.to_csv('DAT205_Output_Split_Y_test.csv') 
    # display(X_train)
    # display(Y_train)


(72, 25) (72, 0)
Season Type:  1


In [25]:
# VALIDATION CODE 
if debug_active == 'yes':
    display(X)
    display(Y)

Unnamed: 0,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,...,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PER
0,265.000000,42.0,103.0,0.407767,14.000000,40.000000,0.350000,32.000000,38.0,0.842105,...,7.0,3.000000,9.000000,24.0,34.000000,130.0,8.0,1,0,129.756301
1,248.394444,39.5,82.0,0.481707,17.166667,34.666667,0.495192,15.333333,18.5,0.828829,...,2.5,6.166667,8.833333,27.0,21.833333,111.5,-4.5,0,0,122.495341
2,248.661111,40.5,86.0,0.470930,15.166667,41.666667,0.364000,18.333333,19.5,0.940171,...,8.5,12.166667,2.833333,24.0,23.833333,114.5,26.9,0,0,233.266292
3,247.670000,35.0,89.0,0.393258,10.000000,39.000000,0.256410,29.000000,31.0,0.935484,...,9.0,2.000000,7.000000,22.0,22.000000,109.0,10.6,1,0,127.036927
4,241.616667,52.0,89.0,0.584270,12.000000,26.000000,0.461538,12.000000,15.0,0.800000,...,13.0,2.000000,2.000000,23.0,18.000000,128.0,15.4,1,0,190.942918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,248.094444,38.5,91.0,0.423077,9.166667,36.666667,0.250000,15.333333,17.5,0.876190,...,8.5,7.166667,2.833333,19.0,22.833333,101.5,-23.1,0,0,192.229912
68,240.000000,34.0,71.0,0.478873,15.000000,39.000000,0.384615,25.000000,33.0,0.757576,...,17.0,7.000000,3.000000,23.0,27.000000,108.0,9.0,1,0,158.325098
69,243.264444,45.5,99.0,0.459596,12.166667,40.666667,0.299180,19.333333,25.5,0.758170,...,10.5,6.166667,6.833333,27.0,21.833333,122.5,9.7,1,0,193.822673
70,244.627778,41.5,92.0,0.451087,18.166667,42.666667,0.425781,27.333333,39.5,0.691983,...,8.5,6.166667,10.833333,24.0,28.833333,128.5,3.5,0,0,291.068161


0
1
2
3
4
...
67
68
69
70
71


## Section 6.4: Apply Random Forest Classifier on the split train/test dataset

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html


Hint from TPOT processing under XGBoost:

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.4, min_samples_leaf=13, min_samples_split=13, n_estimators=100)
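
The TPOT hint above maps directly onto scikit-learn constructor arguments. The sketch below shows one way to build a classifier with those values. It is an illustration only: the data is synthetic (generated with `make_classification`), `random_state=0` is added for reproducibility, and names such as `rfm_demo` are hypothetical and not taken from the training notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 72 rows and 25 features, matching
# the shape printed earlier in this notebook, with a binary target.
X_demo, y_demo = make_classification(n_samples=72, n_features=25, random_state=0)

# Hyperparameters copied from the TPOT "Best pipeline" line.
# random_state is added here only to make the sketch reproducible.
rfm_demo = RandomForestClassifier(
    bootstrap=True,
    criterion='gini',
    max_features=0.4,
    min_samples_leaf=13,
    min_samples_split=13,
    n_estimators=100,
    random_state=0,
)
rfm_demo.fit(X_demo, y_demo)

pred = rfm_demo.predict(X_demo)
print(pred.shape)
```

With `max_features=0.4`, each split considers 40% of the features, and `min_samples_leaf=13` keeps the trees shallow on a dataset this small.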

### Random Forest Classifier - Tuned Model

In [26]:
if useModel_RFM == 'yes':

    # # Create the model
    # RFM = RandomForestClassifier(max_depth=2, random_state=random_state_val)

    # # Train the model
    # RFM.fit(X_train, Y_train.values.ravel())

    # load the model from disk
    # modelSelect = 'RFM'
    # filename = modelSelect + '_finalized_model.sav'
    # loaded_model = pickle.load(open(filename, 'rb'))
    
    # Load the saved model with joblib; the loaded object is already fitted
    # and does not need to be initialized first
    loaded_model = joblib.load("./RFM_TeamModel.joblib")

    # Predict using test data
    Y_pred = loaded_model.predict(X_test)
    df_Y_pred = pd.DataFrame(Y_pred, columns = ['Y_pred'])
    df_Y_pred.to_csv(folder_Input + 'DAT205_Output_Y_pred_RFM_Prediction.csv') 
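
The cell above assumes that a fitted model was previously saved to `./RFM_TeamModel.joblib`. The save step is not shown in this notebook, so the following is only a sketch of a joblib save/load round trip. It uses synthetic data, a small stand-in model, and a hypothetical file name in a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), not the NBA game logs.
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the notebook loads "./RFM_TeamModel.joblib".
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, model_path)          # save the fitted model
    loaded_model = joblib.load(model_path)  # restore it without refitting

# The restored model reproduces the original model's predictions.
same_predictions = (loaded_model.predict(X_demo) == model.predict(X_demo)).all()
print(same_predictions)
```

A loaded model expects the same feature columns, in the same order, as the data it was trained on, so `X_test` has to be prepared the same way as the training features.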

In [27]:
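if useModel_RFM == 'yes':
    # Join predictions to their feature rows; df_Y_pred only exists when the model ran
    df_Pred = pd.concat([df_Y_pred, X_test], axis=1)
    df_Pred.to_csv(folder_Input + 'DAT205_Output_Pred_RFM_Prediction.csv')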
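
`pd.concat(..., axis=1)` aligns rows by index label, not by position. If the feature dataframe has been filtered (for example by team) and its index no longer runs from 0 to n-1, a prediction frame built with a default index will not line up with it. A minimal sketch with made-up values, assuming predictions are returned in the same row order as the features:

```python
import pandas as pd

# Hypothetical feature rows whose index is not 0..n-1, as can happen
# after filtering a dataframe by team or season.
X_demo = pd.DataFrame({'PTS': [130.0, 111.5, 114.5]}, index=[10, 11, 12])
Y_pred_demo = [1, 0, 1]

# Building the prediction frame on the same index keeps the rows aligned.
df_Y_pred_demo = pd.DataFrame(Y_pred_demo, columns=['Y_pred'], index=X_demo.index)
df_Pred_demo = pd.concat([df_Y_pred_demo, X_demo], axis=1)

print(df_Pred_demo.shape)  # (3, 2)
```

Without `index=X_demo.index`, the result would have six rows with `NaN` gaps, because labels 0-2 and 10-12 do not overlap.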

# Section 8: Summary Report

In [28]:
time_took = time.time() - start_time
print("")
print("")
print("PROCESSING COMPLETE")
print(f"Total Runtime: {hms_string(time_took)}")
if dataEnhancement_active == 'yes':
    print(f"Add Enhancement Columns Runtime: {hms_string(time_took01)}")
    print(f"Create temp TeamGameStats dataframe Runtime: {hms_string(time_took02)}")
    print(f"Calculate PIE / PER Runtime: {hms_string(time_took03)}")
    # print(f"Calculate PER Runtime: {hms_string(time_took04)}")



PROCESSING COMPLETE
Total Runtime: 0:05:21.43
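
The `hms_string` helper used above is defined in Section 0, which is not shown here. As an assumption based only on the output format `0:05:21.43`, a minimal version could look like this:

```python
def hms_string(sec_elapsed):
    """Format a duration in seconds as H:MM:SS.ss (sketch, not the original helper)."""
    h = int(sec_elapsed // 3600)         # whole hours
    m = int((sec_elapsed % 3600) // 60)  # whole minutes left after the hours
    s = sec_elapsed % 60                 # remaining seconds, with fraction
    return f"{h}:{m:02d}:{s:05.2f}"


print(hms_string(321.43))  # 0:05:21.43
```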


# End of Code