# DAT 205 Project - Prediction
## By Dennis Hung
## Version 1
## Code DRAFT 2021-03-27

## Code Structure
### Section 0: Function definitions
### Section 1: Import libraries
### Section 2: Configuration of variables

### Section 3: Load the dataset from file and initial analysis
- #### Section 3.1: Load the dataset from file
- #### Section 3.2: Initial Analysis

### Section 4: Transforming/cleansing the data 
- #### Section 4.1: Enhance the data
- #### Section 4.4: Filter data by Team (if specified)
- #### Section 4.5: Remove (Stage 1) from dataframe the unwanted numerical/categorical features
- #### Section 4.6: Transform categorical feature (WL) using value replace
- #### Section 4.7: Transform categorical features using LabelEncoder

- #### Section 4.8: Define TARGET variable and separate into dataframes by season type

## Section 5: Analysis - Heat Maps / Correlation Matrices

## Section 6: Modeling and Analysis
- ### Section 6.1: Prepare train and test data
- ### Section 6.4: Apply Random Forest Classifier on the split train/test dataset

## Section 8: Summary Report

## End of Code




# Updates

### 2021-03-30

- Added XGBoost and SVM models for testing
- Tuned all models (where possible)
- Added code to save models; see https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
- Added flags to enable or disable individual models so that processing can focus on the models being analyzed
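
As a minimal sketch of the model-saving step mentioned above, the following uses `joblib` (already imported in Section 1) to write a fitted model to disk and read it back. The toy data, the temporary path, and the file name `rf_model.joblib` are assumptions made for illustration and are not taken from the project code.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy training data (hypothetical, only to have a fitted model to save)
X_toy = [[0, 1], [1, 0], [1, 1], [0, 0]]
y_toy = [1, 0, 1, 0]

model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_toy, y_toy)

# Save the fitted model to disk, then load it back
model_path = os.path.join(tempfile.mkdtemp(), 'rf_model.joblib')
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model should give the same predictions as the original
same_predictions = list(restored.predict(X_toy)) == list(model.predict(X_toy))
```

Saving a model this way means a long tuning run does not have to be repeated just to re-score the data.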

### 2021-03-27

Code fully working to handle:
- Loading different files, either raw from "HistoricalGameLogs_*.csv" or after data enhancement from "DAT205_Output_Enhanced_df_TF *.csv"
- Enabling/disabling the data enhancement process
- Filtering by a specific team, or using all the data, before performing the correlation matrix or model analysis





# Reference

#### How to Get NBA Data Using the nba_api Python Module (Beginner). Retrieved from Playing Numbers: 

https://www.playingnumbers.com/2019/12/how-to-get-nba-data-using-the-nba_api-python-module-beginner/

#### Patel, S. (2020, August 19). swar / nba_api. Retrieved from GitHub: 

https://github.com/swar/nba_api/blob/master/docs/table_of_contents.md

#### Issues

https://github.com/swar/nba_api/issues/124



# Note: 
#### For this analysis, this code relies on the CSV output from "DAT 205-Group01-NBA-HistPlayGameLogs.ipynb" or the enhanced data from this code as the dataset 

# Section 0: Function definitions

hms_string(sec_elapsed)


In [14]:
# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60))/60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h,m,s)

# Null field analysis
def nullFieldAnalysis(df):
    df_missingDataInfo = pd.DataFrame({'Count': df.isnull().sum(), 'Percent': 100*df.isnull().sum()/len(df)})
    #Printing the columns with over XX% of missing values (ie 60 = 60%) This is set to 0 for 0%
    null_threshold = 0 
    print("")
    print("")
    print("==== Null value analysis ====")
    return df_missingDataInfo[df_missingDataInfo['Percent'] > null_threshold].sort_values(by=['Percent'])

# CalcThreshold_List: record-count thresholds at 1 record and at every 5% step up to 100% of totalRecords
def CalcThreshold_List(totalRecords):
    CTL_10 = int(round(totalRecords*0.1,0))
    CTL_20 = int(round(totalRecords*0.2,0))
    CTL_30 = int(round(totalRecords*0.3,0))
    CTL_40 = int(round(totalRecords*0.4,0))
    CTL_50 = int(round(totalRecords*0.5,0))
    CTL_60 = int(round(totalRecords*0.6,0))
    CTL_70 = int(round(totalRecords*0.7,0))
    CTL_80 = int(round(totalRecords*0.8,0))
    CTL_90 = int(round(totalRecords*0.9,0))
    CTL_100 = int(round(totalRecords*1,0))

    CTL_05 = int(round(totalRecords*0.05,0))
    CTL_15 = int(round(totalRecords*0.15,0))
    CTL_25 = int(round(totalRecords*0.25,0))
    CTL_35 = int(round(totalRecords*0.35,0))
    CTL_45 = int(round(totalRecords*0.45,0))
    CTL_55 = int(round(totalRecords*0.55,0))
    CTL_65 = int(round(totalRecords*0.65,0))
    CTL_75 = int(round(totalRecords*0.75,0))
    CTL_85 = int(round(totalRecords*0.85,0))
    CTL_95 = int(round(totalRecords*0.95,0))

    Threshold_List = [1, CTL_05, CTL_10, CTL_15, CTL_20, CTL_25, CTL_30, CTL_35, CTL_40, CTL_45, 
                    CTL_50, CTL_55, CTL_60, CTL_65, CTL_70, CTL_75, CTL_80, CTL_85, 
                    CTL_90, CTL_95, CTL_100]
    return Threshold_List
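
The threshold list above can also be built with a loop. The sketch below is one possible compact form; the name `calc_threshold_list` and the `step_pct` parameter are hypothetical. It should give the same values as `CalcThreshold_List` in ordinary cases, but floating-point rounding at exact .5 boundaries has not been checked for every input.

```python
def calc_threshold_list(total_records, step_pct=5):
    """Return [1, 5%, 10%, ..., 100%] of total_records as rounded integers."""
    thresholds = [1]
    for pct in range(step_pct, 101, step_pct):
        thresholds.append(int(round(total_records * pct / 100)))
    return thresholds


# Example with 200 records: 1, then 10, 20, ... up to 200
example = calc_threshold_list(200)
```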


# Section 1: Import libraries

In [15]:
# Install any missing libraries
# pip install xgboost
# pip install tpot

In [16]:
# Import required packages
# Standard packages
import numpy as np
import pandas as pd
import scipy as sp
import csv
import time
import pickle
import joblib

# Graphing packages
# import seaborn as sns

# import matplotlib
# import matplotlib.pyplot as plt
# import matplotlib.patches as mpatches
# import matplotlib.lines as mlines

# Data preparation
from sklearn.preprocessing import LabelEncoder

# Modeling packages
# import tensorflow as tf
# import sklearn as skl
from sklearn.model_selection import train_test_split

# Regression modeling
# from sklearn.linear_model import LinearRegression
# from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# from sklearn.gaussian_process import GaussianProcessRegressor
# from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
# from sklearn.naive_bayes import GaussianNB
# from sklearn.svm import SVC
# from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
# import xgboost as xgb

# to fix xgboost warnings error
# https://github.com/EpistasisLab/tpot/issues/1139
# "Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior."
# from tpot import TPOTClassifier
# from tpot.config import classifier_config_dict

# from sklearn.ensemble import RandomForestRegressor


# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
from sklearn.model_selection import RandomizedSearchCV

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV
# from sklearn.model_selection import GridSearchCV

# from sklearn.model_selection import cross_val_score

# Confusion matrix, Accuracy, sensitivity and specificity
# from sklearn.model_selection import cross_val_score
# from sklearn.metrics import mean_squared_error, r2_score
# from sklearn.metrics import precision_score, \
    # recall_score, confusion_matrix, classification_report, \
    # accuracy_score, f1_score

# from sklearn.feature_selection import VarianceThreshold 
# from sklearn.feature_selection import RFE 
# from sklearn.feature_selection import RFECV

# Clustering
# from sklearn.datasets import make_blobs
# from sklearn.cluster import KMeans
# from sklearn.metrics import silhouette_samples, silhouette_score

# Following code is being deprecated
# from sklearn.datasets.samples_generator import make_blobs

# Initialize variables if there is any debugging required
# Insert following line and activate the debugging.
# # VALIDATION CODE 
# if debug_active == 'yes':
# 
# Use "display(df)" if the result command is "df" to retain the same format



start_time = time.time()

# Section 2: Configuration of variables

- Must manually set the following variables

- gameTypeListed as one of the following: 'Pre Season', 'Regular Season', 'Playoffs'



In [9]:

# General configuration
debug_active = 'yes'
loop_max = 100
# showNumRecs = 15
numFormat = '{:.4f}'
numFormat_Pct = "{:.0%}"

# Data Transformation 'yes' or other
# dataEnhancement_active = 'yes'
dataEnhancement_active = 'no'
    
# Section 3.1: Load the dataset from file
# pick who is running the code and comment out the others
# coder = 'bhavika'
# coder = 'cindy'
coder = 'dennis'

# Setup file name for csv or Excel (.xlsx)
if coder == 'bhavika':
    filename = 'D:/McMaster/DAT205/Capstone/Data/HistoricalGameLogs_2004-05_to_2019-20_ALL.csv'
elif coder == 'dennis':
    # filename = './HistoricalGameLogs_2004-05_to_2019-20_ALL.csv'
    # filename = './DAT205_Output_Enhanced_df_TF 2004-2020.csv'
    # filename = './HistoricalGameLogs_2019-20_TESTFile.csv'
    # Test Data files
    # filename = './HistoricalGameLogs_2007-08_to_2008-09_ALL.csv'
    filename = './DAT205_Output_Enhanced_df_TF 2007-09.csv'
    

# filename = filename + seasonStart + '_to_' + seasonEnd + '_' + gameType + '.csv'
# filename = filename + seasonStart + '_to_' + seasonEnd + '_ALL' + '.csv'


# Section 4.4 Filter data by Team (if specified)
# Filter the dataset by team or None
allTeamsList = ['CLE', 'LAC', 'NOH', 'WAS', 'ORL', 'NJN', 'PHX', 'DET', 'IND', \
       'CHA', 'DAL', 'ATL', 'NYK', 'CHI', 'BOS', 'MIN', 'PHI', 'HOU', \
       'POR', 'TOR', 'SAC', 'UTA', 'GSW', 'MIA', 'SEA', 'MEM', 'LAL', \
       'SAS', 'DEN', 'MIL', 'NOK', 'ZAK', 'CHN', 'PAN', 'RMA', 'MMT', \
       'MTA', 'MAL', 'LRO', 'EPT', 'OKC', 'LRY', 'BAR', 'MOS', 'OLP', \
       'PAR', 'LAB', 'MAC', 'MLN', 'BKN', 'FCB', 'RMD', 'MPS', 'EAM', \
       'ALB', 'FBU', 'NOP', 'UBB', 'FLA', 'BAU', 'FEN', 'SLA', 'SDS', \
       'BNE', 'MEL', 'SYD', 'GUA', 'PER', 'ADL', 'NZB', 'BJD', 'FRA']
# teamSelected = 'None'
teamSelected = 'TOR'

# Section 6: Modeling and Analysis
random_state_val = 42
model_list = ['LogRegM', 'DTM', 'RFM', 'RFM_RSCV', 'XGBM']

# Which models are active. At least 1 must be yes
useModel_LogRegM = 'yes'
useModel_LogRegM_RSCV = 'yes'
useModel_DTM = 'yes'
useModel_DTM_RSCV = 'yes'
useModel_RFM = 'yes'
useModel_RFM_RSCV = 'no'
useModel_XGBM = 'yes'
useModel_XGBM_RSCV = 'yes'
useModel_SVCM = 'no'


# Section 6.1: Prepare train and test data
# Select a season 
gameTypeListed = ['Pre Season', 'Regular Season', 'Playoffs']
gameTypeListed_code = [0, 1, 2]
gameTypeToProcess = 1
test_size_val = 1.0

# Section 5: Analysis - Heat Maps / Correlation Matrices
plotSize = (20,15)
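
The configuration requires at least one model flag to be 'yes'. A small guard along the following lines could check this before any modeling starts. It is only a sketch: the dictionary layout and the function name `active_models` are assumptions, and the flag values shown are placeholders rather than the project's settings.

```python
# Hypothetical flag values, in the same 'yes'/'no' style as the configuration
model_flags = {
    'LogRegM': 'yes',
    'DTM': 'no',
    'RFM': 'yes',
    'XGBM': 'no',
}


def active_models(flags):
    """Return the names of the models whose flag is 'yes'."""
    active = [name for name, flag in flags.items() if flag == 'yes']
    if not active:
        raise ValueError("At least one model must be set to 'yes'")
    return active


models_to_run = active_models(model_flags)
```

Failing early here avoids running the whole data preparation only to find that nothing was selected for modeling.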



# Section 3: Load the dataset from file and initial analysis

## Section 3.1: Load the dataset from file

In [10]:
# Load the CSV or Excel file
# Note: the other option in Jupyter Notebook is to upload the CSV files before running the code

# List of column names that must be read as strings
lst_str_cols = ['GAME_ID']
# Use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x : 'str'  for x in lst_str_cols}
# Pass the dict as dtype when reading the file
df = pd.read_csv(filename, dtype=dict_dtypes)
# Excel file import
# df = pd.read_excel(filename)

# Remove duplicate index from import
if dataEnhancement_active == 'yes':
    unwanted_list = ['Unnamed: 0']
else: 
    unwanted_list = ['Unnamed: 0', 'UID_STG']

# Keep every column except the unwanted ones (columns missing from the file are ignored)
X_headers_list = [col for col in df.columns.tolist() if col not in unwanted_list]

# Display current dataframe
df_Initial = df[X_headers_list]

# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Initial)
    # Examine shape of dataframe
    print("Shape of the dataset")
    display(df_Initial.shape)
    # Examine the type of attributes in the dataframe
    df_Initial.info()
    # Describe the numerical data
    display(df_Initial.describe())
    


FileNotFoundError: [Errno 2] No such file or directory: './DAT205_Output_Enhanced_df_TF 2007-09.csv'
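
The reason for reading GAME_ID as a string is that the IDs start with zeros (for example 0010700104), which are lost if the column is parsed as a number. The sketch below shows the difference on two rows of toy CSV text. It also includes a hypothetical helper, `load_game_logs`, that checks for the file first so that a missing dataset, as in the error above, is reported with a pointer to the configuration.

```python
import io
import os

import pandas as pd

# Two rows of toy CSV text standing in for the real game-log file
csv_text = "GAME_ID,PTS\n0010700104,11\n0020800005,17\n"

# Without dtype, GAME_ID is parsed as an integer and the leading zeros are lost
as_number = pd.read_csv(io.StringIO(csv_text))

# With dtype, GAME_ID stays a string
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'GAME_ID': 'str'})


def load_game_logs(path):
    """Load a game-log CSV, failing early with a clearer message if it is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"Dataset not found: {path}. Check 'filename' in Section 2."
        )
    return pd.read_csv(path, dtype={'GAME_ID': 'str'})
```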

## Section 3.2: Initial Analysis

In [6]:
# Display the headers of columns that use descriptive or non-numerical values
categorical_Features = df_Initial.dtypes[df_Initial.dtypes == "object"].index.tolist()

# VALIDATION CODE 
if debug_active == 'yes':
    print("VALIDATION CODE")
    print(categorical_Features)

# Describe the categorical data
print("")
print("")
print("==== Description of the categorical features ====")
display(df_Initial[categorical_Features].describe())

# # Null field analysis
nullFieldAnalysis(df_Initial)
# # Null field analysis
# df_missingDataInfo = pd.DataFrame({'Count': df_Initial.isnull().sum(), 'Percent': 100*df_Initial.isnull().sum()/len(df)})

# #Printing the columns with over XX% of missing values (ie 60 = 60%) This is set to 0 for 0%
# null_threshold = 0 
# print("")
# print("")
# print("==== Null value analysis ====")
# df_missingDataInfo[df_missingDataInfo['Percent'] > null_threshold].sort_values(by=['Percent'])

VALIDATION CODE
['SEASON_YEAR', 'PLAYER_NAME', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL', 'Game_Type']


==== Description of the categorical features ====


Unnamed: 0,SEASON_YEAR,PLAYER_NAME,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,Game_Type
count,58288,58093,58288,58288,58288,58288,58288,58239,58288
unique,2,638,43,43,2847,457,1890,2,3
top,2007-08,Kobe Bryant,LAL,Los Angeles Lakers,10700007,2009-01-02T00:00:00,UTA @ LAL,W,Regular Season
freq,29259,221,2330,2330,33,304,137,29190,49515




==== Null value analysis ====


Unnamed: 0,Count,Percent
WL,49,0.084065
PLAYER_NAME,195,0.334546



# Section 4: Transforming/cleansing the data 

## Data cleansing of nulls (Not working)

## Correction to missing PreSeason games WL values only 

49 PreSeason records 

2007-08 
GAME_ID 0010700072 / 2007-10-19
BOS vs NJN   W 36 to L 33

2008-09 
GAME_ID 0010800035 / 2008-10-11
DEN vs PHX   W 77 to L 72
Note: some player game data seems to be missing
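
One possible way to fill the missing WL values, sketched below, is to compare the total points of the two teams within each game. This is an assumption and not the method used in this notebook: it relies on the player rows adding up to the team score, which may not hold where player game data is missing, so any inferred value should be checked against the real result. The function name `infer_missing_wl` and the toy data are hypothetical.

```python
import pandas as pd


def infer_missing_wl(df):
    """Fill missing WL values by comparing team point totals within each game."""
    out = df.copy()
    # Total points of each player's team in that game
    team_pts = out.groupby(['GAME_ID', 'TEAM_ABBREVIATION'])['PTS'].transform('sum')
    # Highest and lowest team total within each game
    game_max = team_pts.groupby(out['GAME_ID']).transform('max')
    game_min = team_pts.groupby(out['GAME_ID']).transform('min')
    # 'L' where the team scored less than the best total, otherwise 'W'
    inferred = pd.Series('L', index=out.index).where(team_pts < game_max, 'W')
    # Leave tied totals unresolved and never overwrite an existing value
    tied = game_max == game_min
    missing = out['WL'].isna() & ~tied
    out.loc[missing, 'WL'] = inferred[missing]
    return out


toy = pd.DataFrame({
    'GAME_ID': ['G1', 'G1', 'G1', 'G1', 'G2', 'G2'],
    'TEAM_ABBREVIATION': ['AAA', 'AAA', 'BBB', 'BBB', 'AAA', 'CCC'],
    'PTS': [10, 5, 8, 4, 7, 9],
    'WL': [None, None, None, None, 'L', 'W'],
})
fixed = infer_missing_wl(toy)
```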

## Corrected missing player name data

740 records (727 preseason and 13 regular season)

This is not important as the player names are excluded from the analysis



## Section 4.1: Enhance the data

In [7]:
# Setup variables for data transformation
df_TF = df_Initial
totalNumRec = df_TF.shape[0]

# Check df_TeamGameStats
# VALIDATION CODE 
if debug_active == 'yes':
    print(totalNumRec)
    display(df_TF)
    print(df_TF.columns)

58288


Unnamed: 0,SEASON_YEAR,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,...,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,Game_Type,PIE,PER
0,2007-08,200759,Cedric Simmons,1610612739,CLE,Cleveland Cavaliers,0010700104,2007-10-25T00:00:00,CLE @ TOR,L,...,0,2,0,0,-10,0,0,Pre Season,-4.705882,-7.768099
1,2007-08,1088,Chucky Atkins,1610612743,DEN,Denver Nuggets,0010700106,2007-10-25T00:00:00,DEN @ PHX,L,...,0,0,0,2,0,0,0,Pre Season,-4.347826,-10.820924
2,2007-08,201191,JamesOn Curry,1610612741,CHI,Chicago Bulls,0010700109,2007-10-25T00:00:00,CHI vs. MIL,W,...,0,1,0,0,4,0,0,Pre Season,-3.012048,-8.109928
3,2007-08,1956,Ira Newble,1610612739,CLE,Cleveland Cavaliers,0010700104,2007-10-25T00:00:00,CLE @ TOR,L,...,0,0,0,0,-2,0,0,Pre Season,0.000000,-7.855508
4,2007-08,2743,Kris Humphries,1610612761,TOR,Toronto Raptors,0010700104,2007-10-25T00:00:00,TOR vs. CLE,W,...,0,4,2,11,23,0,0,Pre Season,9.216590,20.691968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58283,2008-09,200796,Leon Powe,1610612738,BOS,Boston Celtics,0040800111,2009-04-18T00:00:00,BOS vs. CHI,L,...,0,2,6,8,-10,0,0,Playoffs,10.344828,15.409897
58284,2008-09,1888,Richard Hamilton,1610612765,DET,Detroit Pistons,0040800101,2009-04-18T00:00:00,DET @ CLE,L,...,0,1,2,15,-19,0,0,Playoffs,11.666667,8.741725
58285,2008-09,703,Kurt Thomas,1610612759,SAS,San Antonio Spurs,0040800161,2009-04-18T00:00:00,SAS vs. DAL,L,...,0,1,0,0,-9,0,0,Playoffs,-0.694444,-0.385635
58286,2008-09,101112,Channing Frye,1610612757,POR,Portland Trail Blazers,0040800171,2009-04-18T00:00:00,POR vs. HOU,L,...,0,4,1,4,-15,0,0,Playoffs,-5.714286,-0.253384


Index(['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID',
       'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP',
       'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM',
       'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK',
       'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS', 'DD2', 'TD3', 'Game_Type',
       'PIE', 'PER'],
      dtype='object')


## Section 4.4 Filter data by Team (if specified)

In [8]:
# If a specific team is selected by 'TEAM_ABBREVIATION', filter the dataframe to that team;
# otherwise keep the entire dataset as is.
if teamSelected in allTeamsList:
    df_TF = df_TF[df_TF['TEAM_ABBREVIATION'] == teamSelected]
display(df_TF)

Unnamed: 0,SEASON_YEAR,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,...,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,Game_Type,PIE,PER
4,2007-08,2743,Kris Humphries,1610612761,TOR,Toronto Raptors,0010700104,2007-10-25T00:00:00,TOR vs. CLE,W,...,0,4,2,11,23,0,0,Pre Season,9.216590,20.691968
10,2007-08,239,Darrick Martin,1610612761,TOR,Toronto Raptors,0010700104,2007-10-25T00:00:00,TOR vs. CLE,W,...,0,1,0,5,10,0,0,Pre Season,3.686636,39.836496
11,2007-08,101121,Joey Graham,1610612761,TOR,Toronto Raptors,0010700104,2007-10-25T00:00:00,TOR vs. CLE,W,...,0,0,2,5,11,0,0,Pre Season,5.529954,44.248073
13,2007-08,2413,Juan Dixon,1610612761,TOR,Toronto Raptors,0010700104,2007-10-25T00:00:00,TOR vs. CLE,W,...,0,0,1,5,19,0,0,Pre Season,8.294931,15.164332
14,2007-08,1725,Rasho Nesterovic,1610612761,TOR,Toronto Raptors,0010700104,2007-10-25T00:00:00,TOR vs. CLE,W,...,1,1,0,2,2,0,0,Pre Season,1.382488,1.738336
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56376,2008-09,979,Jermaine O'Neal,1610612761,TOR,Toronto Raptors,0020800005,2008-10-29T00:00:00,TOR @ PHI,W,...,0,5,4,17,0,0,0,Regular Season,17.647059,16.978241
56380,2008-09,101181,Jose Calderon,1610612761,TOR,Toronto Raptors,0020800005,2008-10-29T00:00:00,TOR @ PHI,W,...,2,5,1,13,4,0,0,Regular Season,16.993464,18.790151
56423,2008-09,101121,Joey Graham,1610612761,TOR,Toronto Raptors,0020800005,2008-10-29T00:00:00,TOR @ PHI,W,...,2,2,0,4,2,0,0,Regular Season,-2.614379,2.274514
56433,2008-09,1515,Anthony Parker,1610612761,TOR,Toronto Raptors,0020800005,2008-10-29T00:00:00,TOR @ PHI,W,...,1,2,0,9,11,0,0,Regular Season,6.535948,7.739894
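
The team filter can be wrapped in a small function so that the "no team selected" case is explicit. The following is a sketch on toy data; the function name `filter_by_team` is hypothetical, and returning a copy is a design choice made here so that later column assignments do not act on a view of the original dataframe.

```python
import pandas as pd


def filter_by_team(df, team, known_teams):
    """Return the rows for `team`, or an unfiltered copy if `team` is not a known abbreviation."""
    if team in known_teams:
        return df[df['TEAM_ABBREVIATION'] == team].copy()
    return df.copy()


toy = pd.DataFrame({
    'TEAM_ABBREVIATION': ['TOR', 'CLE', 'TOR'],
    'PTS': [11, 0, 5],
})
tor_only = filter_by_team(toy, 'TOR', ['TOR', 'CLE'])
everything = filter_by_team(toy, 'None', ['TOR', 'CLE'])
```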


## Section 4.5: Remove (Stage 1) from dataframe the unwanted numerical/categorical features

#### Note: if data enhancement was done, adjust unwanted_categorical_Features_TF

In [9]:
# Gather current list of features
numerical_Features = df_TF.columns.tolist()

# All possible features
# ['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS', 'DD2', 'TD3', 'Game_Type']

for i in categorical_Features: 
    numerical_Features.remove(i)

# Lists unwanted features
unwanted_numerical_Features = ['PLAYER_ID', 'TEAM_ID']
unwanted_categorical_Features = ['PLAYER_NAME', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP']

# If enhancement was done, use this to get rid of the extra columns you don't want.
if dataEnhancement_active == 'yes':
    unwanted_categorical_Features_TF = ['UID_STG']
else:
    unwanted_categorical_Features_TF = []

unwanted_list_01 = unwanted_numerical_Features + unwanted_categorical_Features + unwanted_categorical_Features_TF
X_headers_list = df_TF.columns.tolist()

for i in unwanted_list_01:
    X_headers_list.remove(i)

# Reset new dataframe with desired features
df_Reduced = df_TF[X_headers_list]

# Remaining attributes
# VALIDATION CODE 
if debug_active == 'yes':
    display(X_headers_list)

['SEASON_YEAR',
 'WL',
 'MIN',
 'FGM',
 'FGA',
 'FG_PCT',
 'FG3M',
 'FG3A',
 'FG3_PCT',
 'FTM',
 'FTA',
 'FT_PCT',
 'OREB',
 'DREB',
 'REB',
 'AST',
 'TOV',
 'STL',
 'BLK',
 'BLKA',
 'PF',
 'PFD',
 'PTS',
 'PLUS_MINUS',
 'DD2',
 'TD3',
 'Game_Type',
 'PIE',
 'PER']
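
As an alternative to removing names from a header list one at a time, pandas can drop the columns directly. The sketch below uses toy data; `errors='ignore'` makes the call tolerate a column such as 'UID_STG' that may or may not be present, depending on whether data enhancement was run.

```python
import pandas as pd

toy = pd.DataFrame({
    'PLAYER_ID': [2743],
    'TEAM_ID': [1610612761],
    'WL': ['W'],
    'PTS': [11],
})

# 'UID_STG' is not in the toy dataframe; errors='ignore' skips it silently
unwanted = ['PLAYER_ID', 'TEAM_ID', 'UID_STG']
reduced = toy.drop(columns=unwanted, errors='ignore')
remaining = list(reduced.columns)
```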

## Section 4.6: Transform categorical feature (WL) using value replace

In [10]:
cleaned_categorical_Features = ['WL', 'Game_Type']
cleanupValue = {'WL': {'W': 1, 'L': 0}, 'Game_Type': {'Pre Season': 0, 'Regular Season': 1, 'Playoffs': 2}}
df_Reduced = df_Reduced.replace(cleanupValue)

# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Reduced)

Unnamed: 0,SEASON_YEAR,WL,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,Game_Type,PIE,PER
4,2007-08,1,20.666667,5,9,0.556,0,0,0.00,1,...,0,4,2,11,23,0,0,0,9.216590,20.691968
10,2007-08,1,4.566667,2,3,0.667,1,2,0.50,0,...,0,1,0,5,10,0,0,0,3.686636,39.836496
11,2007-08,1,5.450000,1,1,1.000,0,0,0.00,3,...,0,0,2,5,11,0,0,0,5.529954,44.248073
13,2007-08,1,18.700000,2,6,0.333,1,2,0.50,0,...,0,0,1,5,19,0,0,0,8.294931,15.164332
14,2007-08,1,19.833333,1,4,0.250,0,0,0.00,0,...,1,1,0,2,2,0,0,0,1.382488,1.738336
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56376,2008-09,1,34.116667,7,15,0.467,0,1,0.00,3,...,0,5,4,17,0,0,0,1,17.647059,16.978241
56380,2008-09,1,32.695000,5,9,0.556,3,5,0.60,0,...,2,5,1,13,4,0,0,1,16.993464,18.790151
56423,2008-09,1,8.750000,2,7,0.286,0,0,0.00,0,...,2,2,0,4,2,0,0,1,-2.614379,2.274514
56433,2008-09,1,37.683333,3,11,0.273,3,5,0.60,0,...,1,2,0,9,11,0,0,1,6.535948,7.739894
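
The same value replacement can be written per column with `Series.map`, which also makes unmapped or missing values easy to count, since they come out as NaN. This is a sketch on toy data, not the notebook's code; the missing WL value in the third row is there to show the effect.

```python
import pandas as pd

toy = pd.DataFrame({
    'WL': ['W', 'L', None],
    'Game_Type': ['Pre Season', 'Regular Season', 'Playoffs'],
})

wl_codes = {'W': 1, 'L': 0}
game_type_codes = {'Pre Season': 0, 'Regular Season': 1, 'Playoffs': 2}

encoded = toy.copy()
encoded['WL'] = encoded['WL'].map(wl_codes)
encoded['Game_Type'] = encoded['Game_Type'].map(game_type_codes)

# Values with no entry in the mapping (including missing ones) become NaN
unmapped_wl = int(encoded['WL'].isna().sum())
```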


## Section 4.7: Transform categorical features using LabelEncoder

This works for the remaining categorical features because their values have a natural order:

'SEASON_YEAR' - the more recent the season, the more relevant it is, whereas older data is less valuable

'Game_Type' - need to think about this, but assume the regular season is more important

In [11]:
# Select features to encode (copy the list so that categorical_Features itself is not modified)
e_categorical = categorical_Features.copy()

print(e_categorical)

for i in unwanted_categorical_Features:
    e_categorical.remove(i)

print(unwanted_categorical_Features)

for j in cleaned_categorical_Features:
    e_categorical.remove(j)

print(cleaned_categorical_Features)

print(e_categorical)

# Reset variable
categorical_Features = df_Reduced.dtypes[df_Reduced.dtypes == "object"].index.tolist()

lb_make = LabelEncoder()
# cat_list = ['Gender','Education_Level','Marital_Status','Income_Category','Card_Category']
# cat_list_code = ['Gender_code','Education_Level_code','Marital_Status_code','Income_Category_code','Card_Category_code']

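# Work on a copy so that adding the encoded columns does not modify (or warn about) df_Reduced
df_Encoded = df_Reduced.copy()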
# df_Encoded = df_Reduced[e_categorical]




['SEASON_YEAR', 'PLAYER_NAME', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL', 'Game_Type']
['PLAYER_NAME', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP']
['WL', 'Game_Type']
['SEASON_YEAR']


In [12]:
# Apply LabelEncoding on e_categorical features

for k in e_categorical:
    val_A = k
    val_B = k + '_code'
    df_Encoded[(val_B)] = lb_make.fit_transform(df_Encoded[val_A])

# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Encoded) #Results in appending a new column to df

Unnamed: 0,SEASON_YEAR,WL,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,Game_Type,PIE,PER,SEASON_YEAR_code
4,2007-08,1,20.666667,5,9,0.556,0,0,0.00,1,...,4,2,11,23,0,0,0,9.216590,20.691968,0
10,2007-08,1,4.566667,2,3,0.667,1,2,0.50,0,...,1,0,5,10,0,0,0,3.686636,39.836496,0
11,2007-08,1,5.450000,1,1,1.000,0,0,0.00,3,...,0,2,5,11,0,0,0,5.529954,44.248073,0
13,2007-08,1,18.700000,2,6,0.333,1,2,0.50,0,...,0,1,5,19,0,0,0,8.294931,15.164332,0
14,2007-08,1,19.833333,1,4,0.250,0,0,0.00,0,...,1,0,2,2,0,0,0,1.382488,1.738336,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56376,2008-09,1,34.116667,7,15,0.467,0,1,0.00,3,...,5,4,17,0,0,0,1,17.647059,16.978241,1
56380,2008-09,1,32.695000,5,9,0.556,3,5,0.60,0,...,5,1,13,4,0,0,1,16.993464,18.790151,1
56423,2008-09,1,8.750000,2,7,0.286,0,0,0.00,0,...,2,0,4,2,0,0,1,-2.614379,2.274514,1
56433,2008-09,1,37.683333,3,11,0.273,3,5,0.60,0,...,2,0,9,11,0,0,1,6.535948,7.739894,1
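
LabelEncoder assigns integer codes in sorted order of the class labels, which is why '2007-08' becomes 0 and '2008-09' becomes 1 in the output above. The sketch below shows this on a short toy list of seasons.

```python
from sklearn.preprocessing import LabelEncoder

seasons = ['2008-09', '2007-08', '2008-09', '2007-08']

encoder = LabelEncoder()
codes = encoder.fit_transform(seasons)

# classes_ holds the labels in sorted order; a label's position is its code
classes = list(encoder.classes_)
```

Because season strings such as '2007-08' sort chronologically, a later season receives a larger code, which matches the ordering assumed for 'SEASON_YEAR'.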


## Section 4.8: Define TARGET variable and separate into dataframes by season type

Remove (Stage 2) the features (categorical, target, and other unwanted) from the dataframe

Separating the dataframe by gameTypeListed ('Pre Season', 'Regular Season', 'Playoffs')

In [13]:
# VALIDATION CODE 
if debug_active == 'yes':
    display(df_Encoded)
    print(e_categorical)

Unnamed: 0,SEASON_YEAR,WL,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,Game_Type,PIE,PER,SEASON_YEAR_code
4,2007-08,1,20.666667,5,9,0.556,0,0,0.00,1,...,4,2,11,23,0,0,0,9.216590,20.691968,0
10,2007-08,1,4.566667,2,3,0.667,1,2,0.50,0,...,1,0,5,10,0,0,0,3.686636,39.836496,0
11,2007-08,1,5.450000,1,1,1.000,0,0,0.00,3,...,0,2,5,11,0,0,0,5.529954,44.248073,0
13,2007-08,1,18.700000,2,6,0.333,1,2,0.50,0,...,0,1,5,19,0,0,0,8.294931,15.164332,0
14,2007-08,1,19.833333,1,4,0.250,0,0,0.00,0,...,1,0,2,2,0,0,0,1.382488,1.738336,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56376,2008-09,1,34.116667,7,15,0.467,0,1,0.00,3,...,5,4,17,0,0,0,1,17.647059,16.978241,1
56380,2008-09,1,32.695000,5,9,0.556,3,5,0.60,0,...,5,1,13,4,0,0,1,16.993464,18.790151,1
56423,2008-09,1,8.750000,2,7,0.286,0,0,0.00,0,...,2,0,4,2,0,0,1,-2.614379,2.274514,1
56433,2008-09,1,37.683333,3,11,0.273,3,5,0.60,0,...,2,0,9,11,0,0,1,6.535948,7.739894,1


['SEASON_YEAR']


In [14]:
# Configure variables
# gameTypeListed = ['Pre Season', 'Regular Season', 'Playoffs']
# gameTypeListed_code = [0, 1, 2]
Y_headers_list1 = ['WL', 'Game_Type']
Y_headers_list2 = ['WL']
e_categorical = e_categorical + Y_headers_list2

# Define the current list of features
X_headers_list = df_Encoded.columns.tolist()

# Remove LabelEncoded categorical features
for k in e_categorical:
    X_headers_list.remove(k)

# VALIDATION CODE 
if debug_active == 'yes':
    print(e_categorical)

['SEASON_YEAR', 'WL']


In [15]:
df_X_Reduced2 = df_Encoded[X_headers_list]
df_Y_Reduced2 = df_Encoded[Y_headers_list1]
cleanDFColumns = ['Game_Type', 'SEASON_YEAR_code']
# cleanDFColumns = ['Game_Type']

for gameType in gameTypeListed_code:
    is_gameType_X = df_X_Reduced2['Game_Type']==gameType
    is_gameType_Y = df_Y_Reduced2['Game_Type']==gameType
    if gameType == 0:
        df_X_PreSeason = df_X_Reduced2[is_gameType_X]
        df_X_PreSeason = df_X_PreSeason.drop(cleanDFColumns, axis=1)
        df_Y_PreSeason = df_Y_Reduced2[is_gameType_Y]
        df_Y_PreSeason = df_Y_PreSeason[Y_headers_list2]
    elif gameType == 1:
        df_X_RegularSeason = df_X_Reduced2[is_gameType_X]
        df_X_RegularSeason = df_X_RegularSeason.drop(cleanDFColumns, axis=1)
        df_Y_RegularSeason = df_Y_Reduced2[is_gameType_Y]
        df_Y_RegularSeason = df_Y_RegularSeason[Y_headers_list2]
    elif gameType == 2:
        df_X_Playoffs = df_X_Reduced2[is_gameType_X]
        df_X_Playoffs = df_X_Playoffs.drop(cleanDFColumns, axis=1)
        df_Y_Playoffs = df_Y_Reduced2[is_gameType_Y]
        df_Y_Playoffs = df_Y_Playoffs[Y_headers_list2]


# VALIDATION CODE 
if debug_active == 'yes':
    print("")
    print("Pre Season")
    display(df_X_PreSeason)
    display(df_Y_PreSeason)
    print("")
    print("Regular Season")
    display(df_X_RegularSeason)
    display(df_Y_RegularSeason)
    print("")
    print("Playoffs")
    display(df_X_Playoffs)
    display(df_Y_Playoffs)


Pre Season


Unnamed: 0,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,...,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PIE,PER
4,20.666667,5,9,0.556,0,0,0.0,1,2,0.5,...,0,0,4,2,11,23,0,0,9.216590,20.691968
10,4.566667,2,3,0.667,1,2,0.5,0,0,0.0,...,0,0,1,0,5,10,0,0,3.686636,39.836496
11,5.450000,1,1,1.000,0,0,0.0,3,3,1.0,...,0,0,0,2,5,11,0,0,5.529954,44.248073
13,18.700000,2,6,0.333,1,2,0.5,0,0,0.0,...,0,0,0,1,5,19,0,0,8.294931,15.164332
14,19.833333,1,4,0.250,0,0,0.0,0,0,0.0,...,0,1,1,0,2,2,0,0,1.382488,1.738336
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31623,20.983333,4,6,0.667,3,3,1.0,1,2,0.5,...,0,0,4,1,12,-2,0,0,11.000000,25.138570
31642,26.450000,4,5,0.800,2,2,1.0,0,0,0.0,...,0,0,3,1,10,6,0,0,8.000000,12.379811
31652,22.966667,5,10,0.500,0,0,0.0,3,6,0.5,...,1,0,2,3,13,-3,0,0,9.500000,21.360914
31672,27.533333,4,8,0.500,2,4,0.5,0,0,0.0,...,1,1,3,2,10,8,0,0,16.000000,20.833511


Unnamed: 0,WL
4,1
10,1
11,1
13,1
14,1
...,...
31623,1
31642,1
31652,1
31672,1



Regular Season


Unnamed: 0,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,...,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PIE,PER
2613,18.316667,7,9,0.778,1,1,1.00,3,3,1.0,...,0,0,0,3,18,3,0,0,30.656934,50.615487
2668,15.233333,3,8,0.375,0,1,0.00,2,2,1.0,...,1,0,2,2,8,6,0,0,9.489051,17.007440
2688,20.683333,5,11,0.455,1,4,0.25,2,2,1.0,...,1,1,4,1,13,-10,0,0,6.569343,15.715939
2747,15.133333,1,2,0.500,0,0,0.00,0,0,0.0,...,0,0,3,1,2,5,0,0,1.459854,5.237379
2765,13.603333,3,6,0.500,0,0,0.00,0,0,0.0,...,0,0,0,0,6,-4,0,0,5.839416,11.384489
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56376,34.116667,7,15,0.467,0,1,0.00,3,3,1.0,...,1,0,5,4,17,0,0,0,17.647059,16.978241
56380,32.695000,5,9,0.556,3,5,0.60,0,0,0.0,...,0,2,5,1,13,4,0,0,16.993464,18.790151
56423,8.750000,2,7,0.286,0,0,0.00,0,0,0.0,...,0,2,2,0,4,2,0,0,-2.614379,2.274514
56433,37.683333,3,11,0.273,3,5,0.60,0,0,0.0,...,0,1,2,0,9,11,0,0,6.535948,7.739894


Unnamed: 0,WL
2613,0
2668,0
2688,0
2747,0
2765,0
...,...
56376,1
56380,1
56423,1
56433,1



Playoffs


Unnamed: 0,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,...,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,DD2,TD3,PIE,PER
28590,39.111667,7,19,0.368,0,0,0.0,2,4,0.5,...,0,1,3,3,16,-14,0,0,14.084507,13.328478
28595,15.916667,2,4,0.5,0,2,0.0,0,0,0.0,...,0,0,3,0,4,-5,0,0,4.225352,11.30111
28602,24.283333,6,14,0.429,0,2,0.0,2,2,1.0,...,0,0,2,2,14,-7,0,0,16.197183,17.497227
28603,8.888333,0,0,0.0,0,0,0.0,0,0,0.0,...,0,0,3,0,0,4,0,0,-2.816901,-8.551097
28622,43.983333,3,8,0.375,1,3,0.333,4,4,1.0,...,0,0,2,3,11,-15,0,0,15.492958,10.200409
28623,24.433333,3,6,0.5,2,2,1.0,0,0,0.0,...,1,0,1,0,8,0,0,0,13.380282,19.697599
28625,29.216667,3,11,0.273,2,5,0.4,6,6,1.0,...,0,2,3,3,14,-4,0,0,18.309859,20.753052
28629,30.45,6,13,0.462,1,4,0.25,0,1,0.0,...,0,0,2,1,13,-6,0,0,9.859155,11.074417
28639,23.716667,5,10,0.5,2,4,0.5,0,0,0.0,...,0,0,5,2,12,-3,0,0,11.267606,16.029909
28739,0.55,0,0,0.0,0,0,0.0,0,0,0.0,...,0,0,0,0,0,0,0,0,1.351351,97.994545


Unnamed: 0,WL
28590,0
28595,0
28602,0
28603,0
28622,0
28623,0
28625,0
28629,0
28639,0
28739,0
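
The three-way split by game type can also be held in a dictionary keyed by game type name, which avoids a separate variable for each case. This is a sketch on toy data; the dictionary layout with 'X' and 'Y' entries is an assumption and not how the notebook stores its dataframes.

```python
import pandas as pd

toy = pd.DataFrame({
    'Game_Type': [0, 1, 1, 2],
    'WL': [1, 0, 1, 0],
    'PTS': [11, 18, 8, 16],
})

game_type_names = {0: 'Pre Season', 1: 'Regular Season', 2: 'Playoffs'}

frames = {}
for code, name in game_type_names.items():
    subset = toy[toy['Game_Type'] == code]
    frames[name] = {
        # Features: everything except the game type and the target
        'X': subset.drop(columns=['Game_Type', 'WL']),
        # Target: win/loss only
        'Y': subset[['WL']],
    }
```

With this layout, selecting the data for one game type becomes a lookup such as `frames['Regular Season']['X']`.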


# Section 5: Analysis - Heat Maps / Correlation Matrices

In [16]:
# Remove unwanted/redundant features: anything with a correlation over 80% was removed
unwanted_list_02 = ['PTS', 'FGA', 'FG3M', 'FTM', 'PFD', 'REB']

for gameType in gameTypeListed_code:
    if gameType == 0:
        df_X_PreSeason = df_X_PreSeason.drop(unwanted_list_02, axis=1)
    elif gameType == 1:
        df_X_RegularSeason = df_X_RegularSeason.drop(unwanted_list_02, axis=1)
    elif gameType == 2:
        df_X_Playoffs = df_X_Playoffs.drop(unwanted_list_02, axis=1)


# Remaining features and after removal of unwanted features in the dataframes
# VALIDATION CODE 
if debug_active == 'yes':
    display(df_X_PreSeason)
    display(df_X_RegularSeason)
    display(df_X_Playoffs)    

Unnamed: 0,MIN,FGM,FG_PCT,FG3A,FG3_PCT,FTA,FT_PCT,OREB,DREB,AST,TOV,STL,BLK,BLKA,PF,PLUS_MINUS,DD2,TD3,PIE,PER
4,20.666667,5,0.556,0,0.0,2,0.5,0,5,2,0,1,0,0,4,23,0,0,9.216590,20.691968
10,4.566667,2,0.667,2,0.5,0,0.0,0,1,0,0,0,0,0,1,10,0,0,3.686636,39.836496
11,5.450000,1,1.000,0,0.0,3,1.0,0,1,0,0,0,0,0,0,11,0,0,5.529954,44.248073
13,18.700000,2,0.333,2,0.5,0,0.0,0,4,3,0,1,0,0,0,19,0,0,8.294931,15.164332
14,19.833333,1,0.250,0,0.0,0,0.0,1,3,0,1,1,0,1,1,2,0,0,1.382488,1.738336
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31623,20.983333,4,0.667,3,1.0,2,0.5,0,2,5,3,2,0,0,4,-2,0,0,11.000000,25.138570
31642,26.450000,4,0.800,2,1.0,0,0.0,0,3,1,2,0,0,0,3,6,0,0,8.000000,12.379811
31652,22.966667,5,0.500,0,0.0,6,0.5,2,4,1,1,1,1,0,2,-3,0,0,9.500000,21.360914
31672,27.533333,4,0.500,4,0.5,0,0.0,1,8,4,1,1,1,1,3,8,0,0,16.000000,20.833511


Unnamed: 0,MIN,FGM,FG_PCT,FG3A,FG3_PCT,FTA,FT_PCT,OREB,DREB,AST,TOV,STL,BLK,BLKA,PF,PLUS_MINUS,DD2,TD3,PIE,PER
2613,18.316667,7,0.778,1,1.00,3,1.0,0,0,3,2,4,0,0,0,3,0,0,30.656934,50.615487
2668,15.233333,3,0.375,1,0.00,2,1.0,2,4,2,2,0,1,0,2,6,0,0,9.489051,17.007440
2688,20.683333,5,0.455,4,0.25,2,1.0,0,1,0,1,1,1,1,4,-10,0,0,6.569343,15.715939
2747,15.133333,1,0.500,0,0.00,0,0.0,0,1,2,0,0,0,0,3,5,0,0,1.459854,5.237379
2765,13.603333,3,0.500,0,0.00,0,0.0,0,1,0,0,0,0,0,0,-4,0,0,5.839416,11.384489
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56376,34.116667,7,0.467,1,0.00,3,1.0,2,6,4,2,0,1,0,5,0,0,0,17.647059,16.978241
56380,32.695000,5,0.556,5,0.60,0,0.0,0,2,7,1,1,0,2,5,4,0,0,16.993464,18.790151
56423,8.750000,2,0.286,0,0.00,0,0.0,2,0,0,0,0,0,2,2,2,0,0,-2.614379,2.274514
56433,37.683333,3,0.273,5,0.60,0,0.0,0,1,3,0,2,0,1,2,11,0,0,6.535948,7.739894


Unnamed: 0,MIN,FGM,FG_PCT,FG3A,FG3_PCT,FTA,FT_PCT,OREB,DREB,AST,TOV,STL,BLK,BLKA,PF,PLUS_MINUS,DD2,TD3,PIE,PER
28590,39.111667,7,0.368,0,0.0,4,0.5,4,5,3,0,1,0,1,3,-14,0,0,14.084507,13.328478
28595,15.916667,2,0.5,2,0.0,0,0.0,0,1,2,0,1,0,0,3,-5,0,0,4.225352,11.30111
28602,24.283333,6,0.429,2,0.0,2,1.0,1,4,5,4,2,0,0,2,-7,0,0,16.197183,17.497227
28603,8.888333,0,0.0,0,0.0,0,0.0,0,2,0,1,0,0,0,3,4,0,0,-2.816901,-8.551097
28622,43.983333,3,0.375,3,0.333,4,1.0,0,4,2,0,1,0,0,2,-15,0,0,15.492958,10.200409
28623,24.433333,3,0.5,2,1.0,0,0.0,2,2,0,0,2,1,0,1,0,0,0,13.380282,19.697599
28625,29.216667,3,0.273,5,0.4,6,1.0,2,5,2,0,2,0,2,3,-4,0,0,18.309859,20.753052
28629,30.45,6,0.462,4,0.25,1,0.0,0,2,2,0,0,0,0,2,-6,0,0,9.859155,11.074417
28639,23.716667,5,0.5,4,0.5,0,0.0,0,3,4,1,0,0,0,5,-3,0,0,11.267606,16.029909
28739,0.55,0,0.0,0,0.0,0,0.0,0,0,0,0,1,0,0,0,0,0,0,1.351351,97.994545
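
The 80% cut-off used above can also be applied programmatically. The following is a minimal sketch, assuming the cut-off refers to the absolute pairwise Pearson correlation; the helper name `highly_correlated_features` and the toy values are hypothetical and are not part of the project code or dataset.

```python
import numpy as np
import pandas as pd


def highly_correlated_features(df, threshold=0.8):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (hypothetical values, not taken from the project dataset).
rng = np.random.default_rng(0)
fgm = rng.integers(0, 12, size=200)
toy = pd.DataFrame({
    "FGM": fgm,
    "PTS": 2 * fgm + rng.integers(0, 3, size=200),  # almost fully determined by FGM
    "STL": rng.integers(0, 5, size=200),            # generated independently
})

to_drop = highly_correlated_features(toy, threshold=0.8)
print(to_drop)
```

The returned list could then be passed to `DataFrame.drop(..., axis=1)` in the same way as `unwanted_list_02`.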


# Section 6: Modeling and Analysis

## Section 6.1: Prepare train and test data

In [17]:
# Select a season 
# gameTypeListed = ['Pre Season', 'Regular Season', 'Playoffs']
# gameTypeListed_code = [0, 1, 2]

if gameTypeToProcess == 0:
    X = df_X_PreSeason
    Y = df_Y_PreSeason
elif gameTypeToProcess == 1:
    X = df_X_RegularSeason
    Y = df_Y_RegularSeason
elif gameTypeToProcess == 2:
    X = df_X_Playoffs
    Y = df_Y_Playoffs

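# # Split the data into training and test datasets (0.7/0.3)
# X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = test_size_val, random_state = random_state_val)

# Prediction only: the model is loaded from disk below, so the full selected season is used as the test set
X_test = X
Y_test = Y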

# selectedSeasonRecordCount = X_train.shape[0] + X_test.shape[0]

# VALIDATION CODE 
if debug_active == 'yes':
    # Validate the split at a high level
    # print(X_train.shape,Y_train.shape)
    print(X_test.shape,Y_test.shape)
    print('Season Type: ', gameTypeToProcess)
    df_Encoded.to_csv('DAT205_Output_All.csv') 
    # X_train.to_csv('DAT205_Output_Split_X_train.csv') 
    X_test.to_csv('DAT205_Output_Split_X_test.csv') 
    # Y_train.to_csv('DAT205_Output_Split_Y_train.csv') 
    Y_test.to_csv('DAT205_Output_Split_Y_test.csv') 
    # display(X_train)
    # display(Y_train)


(1694, 20) (1694, 1)
Season Type:  1
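
For reference, the 0.7/0.3 split that is commented out in the cell above can be sketched as follows. This is only an illustration on toy data: the names `X_demo`, `Y_demo`, the column subset, and `random_state=42` are assumptions, not the project's configured values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and Y (hypothetical values, not project data).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["MIN", "FGM", "PIE"])
Y_demo = pd.DataFrame({"WL": rng.integers(0, 2, size=100)})

# test_size=0.3 keeps 70% of the rows for training and 30% for testing.
X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=42
)

print(X_train_demo.shape, Y_train_demo.shape)
print(X_test_demo.shape, Y_test_demo.shape)
```

Features and target are split together, so each row of `X_train_demo` keeps the same index label as its row in `Y_train_demo`.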


In [18]:
# VALIDATION CODE 
if debug_active == 'yes':
    display(X)
    display(Y)

Unnamed: 0,MIN,FGM,FG_PCT,FG3A,FG3_PCT,FTA,FT_PCT,OREB,DREB,AST,TOV,STL,BLK,BLKA,PF,PLUS_MINUS,DD2,TD3,PIE,PER
2613,18.316667,7,0.778,1,1.00,3,1.0,0,0,3,2,4,0,0,0,3,0,0,30.656934,50.615487
2668,15.233333,3,0.375,1,0.00,2,1.0,2,4,2,2,0,1,0,2,6,0,0,9.489051,17.007440
2688,20.683333,5,0.455,4,0.25,2,1.0,0,1,0,1,1,1,1,4,-10,0,0,6.569343,15.715939
2747,15.133333,1,0.500,0,0.00,0,0.0,0,1,2,0,0,0,0,3,5,0,0,1.459854,5.237379
2765,13.603333,3,0.500,0,0.00,0,0.0,0,1,0,0,0,0,0,0,-4,0,0,5.839416,11.384489
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56376,34.116667,7,0.467,1,0.00,3,1.0,2,6,4,2,0,1,0,5,0,0,0,17.647059,16.978241
56380,32.695000,5,0.556,5,0.60,0,0.0,0,2,7,1,1,0,2,5,4,0,0,16.993464,18.790151
56423,8.750000,2,0.286,0,0.00,0,0.0,2,0,0,0,0,0,2,2,2,0,0,-2.614379,2.274514
56433,37.683333,3,0.273,5,0.60,0,0.0,0,1,3,0,2,0,1,2,11,0,0,6.535948,7.739894


Unnamed: 0,WL
2613,0
2668,0
2688,0
2747,0
2765,0
...,...
56376,1
56380,1
56423,1
56433,1


## Section 6.4: Apply Random Forest Classifier on the split train/test dataset

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html


Hint from TPOT processing (under XGBoost):

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.4, min_samples_leaf=13, min_samples_split=13, n_estimators=100)
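
As a rough illustration, the hyperparameters reported by TPOT above could be passed to scikit-learn directly. This sketch fits on synthetic data from `make_classification`, not on the game logs, and `random_state=42` is an assumption added only to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (stand-in for the real features).
X_syn, y_syn = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

# Hyperparameters copied from the TPOT "Best pipeline" line above.
rf_tpot = RandomForestClassifier(
    bootstrap=True,
    criterion="gini",
    max_features=0.4,
    min_samples_leaf=13,
    min_samples_split=13,
    n_estimators=100,
    random_state=42,  # assumption, not part of the TPOT output
)
rf_tpot.fit(X_tr, y_tr)

score = rf_tpot.score(X_te, y_te)  # mean accuracy on the held-out rows
print(score)
```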

### Random Forest Classifier - Tuned Model

In [19]:
if useModel_RFM == 'yes':

    # # Create the model
    # RFM = RandomForestClassifier(max_depth=2, random_state=random_state_val)

    # # Train the model
    # RFM.fit(X_train, Y_train.values.ravel())

    # load the model from disk
    # modelSelect = 'RFM'
    # filename = modelSelect + '_finalized_model.sav'
    # loaded_model = pickle.load(open(filename, 'rb'))
    
    # Load the saved model with joblib; the estimator does not need to be initialized first
    loaded_model = joblib.load("./RFM_RSCV_Model.joblib")

    # Predict using test data
    Y_pred = loaded_model.predict(X_test)
    df_Y_pred = pd.DataFrame(Y_pred, columns = ['Y_pred'])
    df_Y_pred.to_csv('DAT205_Output_Y_pred_RFM_RSCV_Prediction.csv') 

In [20]:
print(loaded_model)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
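
The save/load round trip behind `RFM_RSCV_Model.joblib` can be sketched as below. This is a minimal example on synthetic data that writes to a temporary directory; the file name `demo_model.joblib` is hypothetical, and the estimator settings are taken from the printed model above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the training set.
X_syn, y_syn = make_classification(n_samples=200, n_features=20, random_state=42)
model = RandomForestClassifier(max_depth=2, n_estimators=100, random_state=42)
model.fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model, path)      # persist the fitted estimator
    restored = joblib.load(path)  # returns a ready-to-use estimator

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_syn) == model.predict(X_syn)).all())
print(same_predictions)
```

A model saved this way is generally best loaded with the same scikit-learn version that produced it.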


# Section 8: Summary Report

In [21]:
# # Create summary table of metric analysis
# df_Metrics = []

# df_Metrics_headers = [0,1,2,3,4,5,6]
# df_Metrics = pd.DataFrame (df_Metrics, columns = df_Metrics_headers)

# if useModel_LogRegM == 'yes':
#     df_AddModel = pd.Series(['Logistic Regression', accuracy_score_LogRegM, f1_score_LogRegM, recall_score_LogRegM, precision_score_LogRegM, sensitivity_LogRegM, specificity_LogRegM])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_LogRegM_RSCV == 'yes':
#     df_AddModel = pd.Series(['Logistic Regression (Tuned)',accuracy_score_LogRegM_RSCV, f1_score_LogRegM_RSCV, recall_score_LogRegM_RSCV, precision_score_LogRegM_RSCV, sensitivity_LogRegM_RSCV, specificity_LogRegM_RSCV])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_DTM == 'yes':
#     df_AddModel = pd.Series(['Decision Tree',accuracy_score_DTM, f1_score_DTM, recall_score_DTM, precision_score_DTM, sensitivity_DTM, specificity_DTM])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_DTM_RSCV == 'yes':
#     df_AddModel = pd.Series(['Decision Tree (Tuned)',accuracy_score_DTM_RSCV, f1_score_DTM_RSCV, recall_score_DTM_RSCV, precision_score_DTM_RSCV, sensitivity_DTM_RSCV, specificity_DTM_RSCV])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_RFM == 'yes':
#     df_AddModel = pd.Series(['Random Forest', accuracy_score_RFM, f1_score_RFM, recall_score_RFM, precision_score_RFM, sensitivity_RFM, specificity_RFM])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_RFM_RSCV == 'yes':
#     df_AddModel = pd.Series(['Random Forest (Tuned)',accuracy_score_RFM_RSCV, f1_score_RFM_RSCV, recall_score_RFM_RSCV, precision_score_RFM_RSCV, sensitivity_RFM_RSCV, specificity_RFM_RSCV])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_XGBM == 'yes':
#     df_AddModel = pd.Series(['XGBoost',accuracy_score_XGBM, f1_score_XGBM, recall_score_XGBM, precision_score_XGBM, sensitivity_XGBM, specificity_XGBM])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_XGBM_RSCV == 'yes':
#     df_AddModel = pd.Series(['XGBoost (Tuned)',accuracy_score_XGBM_RSCV, f1_score_XGBM_RSCV, recall_score_XGBM_RSCV, precision_score_XGBM_RSCV, sensitivity_XGBM_RSCV, specificity_XGBM_RSCV])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# if useModel_SVCM == 'yes':
#     df_AddModel = pd.Series(['SVM',accuracy_score_SVCM, f1_score_SVCM, recall_score_SVCM, precision_score_SVCM, sensitivity_SVCM, specificity_SVCM])
#     df_Metrics = df_Metrics.append(df_AddModel, ignore_index=True)

# df_Metrics.columns = ['Model','Accuracy','F1 score','Recall','Precision','Sensitivity','Specificity']

# # VALIDATION CODE 
# if debug_active == 'yes':
#     display(df_Metrics)

# # Join dataframes for Metrics and cross_val_scores
# # df_Summary = pd.concat([df_Metrics,df_cross_val_score], axis=1)
# df_Summary = df_Metrics
# # VALIDATION CODE 
# if debug_active == 'yes':
#     display(df_Summary)
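
A compact way to build one row of the summary table is sketched below. It assumes sensitivity and specificity are derived from the confusion matrix in the usual way (TP / (TP + FN) and TN / (TN + FP)); the helper `metric_row` and the label vectors are hypothetical.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def metric_row(model_name, y_true, y_pred):
    """Return one summary-table row for a binary classifier (1 = win, 0 = loss)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "Model": model_name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1 score": f1_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Sensitivity": tp / (tp + fn),  # true positive rate
        "Specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical labels, used only to exercise the helper.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 1]

df_metrics_demo = pd.DataFrame([metric_row("Demo model", y_true_demo, y_pred_demo)])
print(df_metrics_demo)
```

Collecting the rows in a list and building the dataframe once avoids repeated `DataFrame.append` calls, which were removed in pandas 2.0.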

In [22]:
# print("================= Results Summary ==================\n")

# print("\n==================== Configuration ======================")
# print('Filter by Team Selected = ', teamSelected)

# print('Total Number of Records (Initial Dataset) = ', totalNumRec)
# print('Total Number of Records (Transformed and Filtered Dataset) = ', df_TF.shape[0])
# print('Game Type Processed (0 = PreSeason / 1 = RegularSeason / 2 = Playoffs) = ', gameTypeToProcess)
# print('Selected Season Records = ', selectedSeasonRecordCount)
# print('Train / Test Split = ', test_size_val)
# print('Model random_state_val = ', random_state_val)

# print("\n==================== Features ======================")
# print('----------------- Removed Features -----------------')
# display(unwanted_list_01)
# print('\n------ Removed attributes - Heat Map / Correlation Matrix ---- ')
# display(unwanted_list_02)

# print('\n------------------- Applied Features --------------------')
# display(X_test.columns.tolist())

# print("\n ================= Model Analysis Summary ==================\n")
# display(df_Summary)

# if useModel_LogRegM == 'yes':
#     print('\n\n----------------- Logistic Regression --------------------')
#     print('Base Model')
#     print('Model Parameters =', LogRegM.get_params())
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_LogRegM))
#     print('F1 score:', numFormat.format(f1_score_LogRegM))
#     print('Recall:', numFormat.format(recall_score_LogRegM))
#     print('Precision:', numFormat.format(precision_score_LogRegM))
#     print('Sensitivity : ', numFormat.format(sensitivity_LogRegM))
#     print('Specificity : ', numFormat.format(specificity_LogRegM))
#     print('\n classification report:\n', classification_report_LogRegM)
#     print('\n confusion matrix:\n',confusion_matrix_LogRegM)

# if useModel_LogRegM_RSCV == 'yes':
#     print('Tuned Model')
#     print('Best Model Parameters =', LogRegM_RSCV.best_params_)
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_LogRegM_RSCV))
#     print('F1 score:', numFormat.format(f1_score_LogRegM_RSCV))
#     print('Recall:', numFormat.format(recall_score_LogRegM_RSCV))
#     print('Precision:', numFormat.format(precision_score_LogRegM_RSCV))
#     print('Sensitivity : ', numFormat.format(sensitivity_LogRegM_RSCV))
#     print('Specificity : ', numFormat.format(specificity_LogRegM_RSCV))
#     print('\n classification report:\n', classification_report_LogRegM_RSCV)
#     print('\n confusion matrix:\n',confusion_matrix_LogRegM_RSCV)

#     print('\nImprovement (Accuracy) of {:0.2f}%.'.format(comparison_accuracy_score_LogRegM_RSCV))
#     print('Improvement (F1_score) of {:0.2f}%.'.format(comparison_f1_score_LogRegM_RSCV))
#     print('Improvement (Sensitivity) of {:0.2f}%.'.format(comparison_sensitivity_LogRegM_RSCV))
#     print('Improvement (Specificity) of {:0.2f}%.'.format(comparison_specificity_LogRegM_RSCV))

#     print("\nFeature Importance")
#     display(df_feature_importance_LogRegM)

# if useModel_DTM == 'yes':
#     print('\n\n-------------------- Decision Tree -----------------------\n')
#     print('Base Model')
#     print('Model Parameters =', DTM.get_params())
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_DTM))
#     print('F1 score:', numFormat.format(f1_score_DTM))
#     print('Recall:', numFormat.format(recall_score_DTM))
#     print('Precision:', numFormat.format(precision_score_DTM))
#     print('Sensitivity : ', numFormat.format(sensitivity_DTM))
#     print('Specificity : ', numFormat.format(specificity_DTM))
#     print('\n classification report:\n', classification_report_DTM)
#     print('\n confusion matrix:\n',confusion_matrix_DTM)

# if useModel_DTM_RSCV == 'yes':
#     print('Tuned Model')
#     print('Best Model Parameters =', DTM_RSCV.best_params_)
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_DTM_RSCV))
#     print('F1 score:', numFormat.format(f1_score_DTM_RSCV))
#     print('Recall:', numFormat.format(recall_score_DTM_RSCV))
#     print('Precision:', numFormat.format(precision_score_DTM_RSCV))
#     print('Sensitivity : ', numFormat.format(sensitivity_DTM_RSCV))
#     print('Specificity : ', numFormat.format(specificity_DTM_RSCV))
#     print('\n classification report:\n', classification_report_DTM_RSCV)
#     print('\n confusion matrix:\n',confusion_matrix_DTM_RSCV)

#     print('\nImprovement (Accuracy) of {:0.2f}%.'.format(comparison_accuracy_score_DTM_RSCV))
#     print('Improvement (F1_score) of {:0.2f}%.'.format(comparison_f1_score_DTM_RSCV))
#     print('Improvement (Sensitivity) of {:0.2f}%.'.format(comparison_sensitivity_DTM_RSCV))
#     print('Improvement (Specificity) of {:0.2f}%.'.format(comparison_specificity_DTM_RSCV))

#     print("\nFeature Importance")
#     display(df_feature_importance_DTM)

# if useModel_RFM == 'yes':
#     print('\n\n-------------------- Random Forest -----------------------\n')
#     print('Base Model')
#     print('Model Parameters =', RFM.get_params())
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_RFM))
#     print('F1 score:', numFormat.format(f1_score_RFM))
#     print('Recall:', numFormat.format(recall_score_RFM))
#     print('Precision:', numFormat.format(precision_score_RFM))
#     print('Sensitivity : ', numFormat.format(sensitivity_RFM))
#     print('Specificity : ', numFormat.format(specificity_RFM))
#     print('\n classification report:\n', classification_report_RFM)
#     print('\n confusion matrix:\n',confusion_matrix_RFM)

# if useModel_RFM_RSCV == 'yes':
#     print('Tuned Model')
#     print('Best Model Parameters =', RFM_RSCV.best_params_)
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_RFM_RSCV))
#     print('F1 score:', numFormat.format(f1_score_RFM_RSCV))
#     print('Recall:', numFormat.format(recall_score_RFM_RSCV))
#     print('Precision:', numFormat.format(precision_score_RFM_RSCV))
#     print('Sensitivity : ', numFormat.format(sensitivity_RFM_RSCV))
#     print('Specificity : ', numFormat.format(specificity_RFM_RSCV))
#     print('\n classification report:\n', classification_report_RFM_RSCV)
#     print('\n confusion matrix:\n',confusion_matrix_RFM_RSCV)

#     print('\nImprovement (Accuracy) of {:0.2f}%.'.format(comparison_accuracy_score_RFM_RSCV))
#     print('Improvement (F1_score) of {:0.2f}%.'.format(comparison_f1_score_RFM_RSCV))
#     print('Improvement (Sensitivity) of {:0.2f}%.'.format(comparison_sensitivity_RFM_RSCV))
#     print('Improvement (Specificity) of {:0.2f}%.'.format(comparison_specificity_RFM_RSCV))

#     print("\nFeature Importance")
#     display(df_feature_importance_RFM)

# if useModel_XGBM == 'yes':
#     print('\n\n----------------- XGBoost --------------------')
#     print('Model Parameters =', XGBM.get_params())
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_XGBM))
#     print('F1 score:', numFormat.format(f1_score_XGBM))
#     print('Recall:', numFormat.format(recall_score_XGBM))
#     print('Precision:', numFormat.format(precision_score_XGBM))
#     print('Sensitivity : ', numFormat.format(sensitivity_XGBM))
#     print('Specificity : ', numFormat.format(specificity_XGBM))
#     print('\n classification report:\n', classification_report_XGBM)
#     print('\n confusion matrix:\n',confusion_matrix_XGBM)

# if useModel_XGBM_RSCV == 'yes':
#     print('Tuned Model')
#     print('Best Model Parameters =', XGBM_RSCV.best_params_)
#     print('')
#     print('Accuracy:', numFormat.format(accuracy_score_XGBM_RSCV))
#     print('F1 score:', numFormat.format(f1_score_XGBM_RSCV))
#     print('Recall:', numFormat.format(recall_score_XGBM_RSCV))
#     print('Precision:', numFormat.format(precision_score_XGBM_RSCV))
#     print('Sensitivity : ', numFormat.format(sensitivity_XGBM_RSCV))
#     print('Specificity : ', numFormat.format(specificity_XGBM_RSCV))
#     print('\n classification report:\n', classification_report_XGBM_RSCV)
#     print('\n confusion matrix:\n',confusion_matrix_XGBM_RSCV)

#     print('\nImprovement (Accuracy) of {:0.2f}%.'.format(comparison_accuracy_score_XGBM_RSCV))
#     print('Improvement (F1_score) of {:0.2f}%.'.format(comparison_f1_score_XGBM_RSCV))
#     print('Improvement (Sensitivity) of {:0.2f}%.'.format(comparison_sensitivity_XGBM_RSCV))
#     print('Improvement (Specificity) of {:0.2f}%.'.format(comparison_specificity_XGBM_RSCV))

#     print("\nFeature Importance")
#     display(df_feature_importance_XGBM)

# if useModel_SVCM == 'yes':
#     print('\n\n----------------- SVM --------------------')
#     print('Accuracy:', numFormat.format(accuracy_score_SVCM))
#     print('F1 score:', numFormat.format(f1_score_SVCM))
#     print('Recall:', numFormat.format(recall_score_SVCM))
#     print('Precision:', numFormat.format(precision_score_SVCM))
#     print('Sensitivity : ', numFormat.format(sensitivity_SVCM))
#     print('Specificity : ', numFormat.format(specificity_SVCM))

#     print('\n classification report:\n', classification_report_SVCM)
#     print('\n confusion matrix:\n',confusion_matrix_SVCM)
#     print("\nFeature Importance")
#     # display(df_feature_importance_SVCM)
#     print("Work in progress")

In [23]:
time_took = time.time() - start_time
print("")
print("")
print("PROCESSING COMPLETE")
print(f"Total Runtime: {hms_string(time_took)}")
if dataEnhancement_active == 'yes':
    print(f"Add Enhancement Columns Runtime: {hms_string(time_took01)}")
    print(f"Create temp TeamGameStats dataframe Runtime: {hms_string(time_took02)}")
    print(f"Calculate PIE / PER Runtime: {hms_string(time_took03)}")
    # print(f"Calculate PER Runtime: {hms_string(time_took04)}")



PROCESSING COMPLETE
Total Runtime: 0:00:01.52


# End of Code