# Kobe Bryant Shot Selection


Kobe Bryant marked his retirement from the NBA by scoring 60 points in his final game as a Los Angeles Laker on Wednesday, April 13, 2016. Drafted into the NBA at the age of 17, Kobe earned the sport’s highest accolades throughout his long career.

Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

**Initial exploration shows that this dataset has an index column, ```shot_id```.**

In [2]:
df_raw = pd.read_csv('./data/kobe_data.csv', index_col = 'shot_id')
print(df_raw.shape)
df_raw.head()

(30697, 24)


| shot_id | action_type | combined_shot_type | game_event_id | game_id | lat | loc_x | loc_y | lon | minutes_remaining | period | ... | shot_made_flag | shot_type | shot_zone_area | shot_zone_basic | shot_zone_range | team_id | team_name | game_date | matchup | opponent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Jump Shot | Jump Shot | 10 | 20000012 | 33.9723 | 167 | 72 | -118.1028 | 10 | 1 | ... | NaN | 2PT Field Goal | Right Side(R) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR |
| 2 | Jump Shot | Jump Shot | 12 | 20000012 | 34.0443 | -157 | 0 | -118.4268 | 10 | 1 | ... | 0.0 | 2PT Field Goal | Left Side(L) | Mid-Range | 8-16 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR |
| 3 | Jump Shot | Jump Shot | 35 | 20000012 | 33.9093 | -101 | 135 | -118.3708 | 7 | 1 | ... | 1.0 | 2PT Field Goal | Left Side Center(LC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR |
| 4 | Jump Shot | Jump Shot | 43 | 20000012 | 33.8693 | 138 | 175 | -118.1318 | 6 | 1 | ... | 0.0 | 2PT Field Goal | Right Side Center(RC) | Mid-Range | 16-24 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR |
| 5 | Driving Dunk Shot | Dunk | 155 | 20000012 | 34.0443 | 0 | 0 | -118.2698 | 6 | 2 | ... | 1.0 | 2PT Field Goal | Center(C) | Restricted Area | Less Than 8 ft. | 1610612747 | Los Angeles Lakers | 2000-10-31 | LAL @ POR | POR |


**Importing of the data looks good. Let's check out our columns.**

In [3]:
df_raw.columns

Index(['action_type', 'combined_shot_type', 'game_event_id', 'game_id', 'lat',
       'loc_x', 'loc_y', 'lon', 'minutes_remaining', 'period', 'playoffs',
       'season', 'seconds_remaining', 'shot_distance', 'shot_made_flag',
       'shot_type', 'shot_zone_area', 'shot_zone_basic', 'shot_zone_range',
       'team_id', 'team_name', 'game_date', 'matchup', 'opponent'],
      dtype='object')

**As a Lakers and Kobe fan, I have some domain knowledge that I can apply when removing columns.**
- He only ever played for the Lakers, so ```team_name``` and ```team_id``` can be dropped.
- ```game_date``` would also be difficult to incorporate into a model, since it would have to be treated as a categorical variable. It might account for injuries or something similar, but I don't think it would strengthen our model much.

In [4]:
df_raw['team_name'].unique()

array(['Los Angeles Lakers'], dtype=object)

In [5]:
df_raw['team_id'].unique()

array([1610612747], dtype=int64)

In [6]:
df_raw = df_raw.drop(columns=['team_id', 'team_name', 'game_date'])

**According to the Kaggle description, 5000 values have been removed from our target column. Since we are not submitting this to Kaggle, we will drop these rows and use the rest for our train/test split and analysis.**

In [7]:
df_raw['shot_made_flag'].value_counts()

0.0    14232
1.0    11465
Name: shot_made_flag, dtype: int64

In [8]:
df_raw['shot_made_flag'].isnull().sum()

5000

In [9]:
df_raw.dropna(inplace=True)
df_raw.shape

(25697, 21)
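A narrower way to express the same intent is to drop only the rows whose target is missing. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the Kobe data) to contrast a bare `dropna()` with `dropna(subset=...)`.

```python
import numpy as np
import pandas as pd

# Illustrative frame: one row is missing the target, another is missing a feature.
toy = pd.DataFrame({
    'shot_distance': [15, 16, np.nan, 0],
    'shot_made_flag': [np.nan, 0.0, 1.0, 1.0],
})

# dropna() with no arguments removes every row that has any missing value.
dropped_any = toy.dropna()

# subset= restricts the check to the target column only.
dropped_target = toy.dropna(subset=['shot_made_flag'])

print(dropped_any.shape, dropped_target.shape)
```

On this dataset the two calls should agree: 30697 rows minus the 5000 missing targets is exactly the 25697 rows that remain, so no row was dropped for any other reason.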

**```matchup``` is a column I am unsure of. ```opponent``` covers most of the information it holds, but ```matchup``` also tells us whether the game is away or at home. This seems like a good opportunity for feature engineering.**

Matchup is in the form "LAL @ OPP" for away games or "LAL vs. OPP" for home games.

In [10]:
# Create a new column that is 1 for a home game ("vs.") and 0 for an away game ("@").
df_raw['home_away'] = df_raw['matchup'].str.contains('vs.', regex=False).astype(int)


df_raw['home_away'].value_counts()

0    13212
1    12485
Name: home_away, dtype: int64

**Now we can drop ```matchup```, since its information is contained in ```home_away``` and ```opponent```**

In [11]:
df_raw.drop(columns=['matchup'],inplace=True)

In [12]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25697 entries, 2 to 30697
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   action_type         25697 non-null  object 
 1   combined_shot_type  25697 non-null  object 
 2   game_event_id       25697 non-null  int64  
 3   game_id             25697 non-null  int64  
 4   lat                 25697 non-null  float64
 5   loc_x               25697 non-null  int64  
 6   loc_y               25697 non-null  int64  
 7   lon                 25697 non-null  float64
 8   minutes_remaining   25697 non-null  int64  
 9   period              25697 non-null  int64  
 10  playoffs            25697 non-null  int64  
 11  season              25697 non-null  object 
 12  seconds_remaining   25697 non-null  int64  
 13  shot_distance       25697 non-null  int64  
 14  shot_made_flag      25697 non-null  float64
 15  shot_type           25697 non-null  object 
 16  shot

**These location variables look like they're going to overlap one another.**

In [13]:
df_raw[['lat', 'loc_x', 'loc_y', 'lon']].describe()

| | lat | loc_x | loc_y | lon |
|---|---|---|---|---|
| count | 25697.0 | 25697.0 | 25697.0 | 25697.0 |
| mean | 33.953043 | 7.148422 | 91.257345 | -118.262652 |
| std | 0.088152 | 110.073147 | 88.152106 | 0.110073 |
| min | 33.2533 | -250.0 | -44.0 | -118.5198 |
| 25% | 33.8843 | -67.0 | 4.0 | -118.3368 |
| 50% | 33.9703 | 0.0 | 74.0 | -118.2698 |
| 75% | 34.0403 | 94.0 | 160.0 | -118.1758 |
| max | 34.0883 | 248.0 | 791.0 | -118.0218 |


> ```lat``` and ```lon``` look like they aren't going to help our model much. I honestly am not sure exactly what they represent, but I am confident other columns will cover their information.

In [14]:
df_raw.drop(columns=['lat', 'lon'],inplace=True)
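The `describe()` output above hints at why this drop is safe: the standard deviation of `lat` (0.088152) is exactly 1/1000 of that of `loc_y` (88.152106), and the same holds for `lon` and `loc_x`. A minimal check on the five rows printed by `df_raw.head()`, assuming the hoop sits at `lat0`, `lon0` (the coordinates of the row where `loc_x` and `loc_y` are both 0) and that one loc unit equals 0.001 degrees. Both are assumptions inferred from the printed values, not documented facts about the dataset.

```python
import numpy as np
import pandas as pd

# First five rows of the raw data, copied from the df_raw.head() output.
sample = pd.DataFrame({
    'lat':   [33.9723, 34.0443, 33.9093, 33.8693, 34.0443],
    'lon':   [-118.1028, -118.4268, -118.3708, -118.1318, -118.2698],
    'loc_x': [167, -157, -101, 138, 0],
    'loc_y': [72, 0, 135, 175, 0],
})

# Assumed origin and scale: taken from the row with loc_x == loc_y == 0.
lat0, lon0 = 34.0443, -118.2698
lat_from_y = lat0 - sample['loc_y'] / 1000
lon_from_x = lon0 + sample['loc_x'] / 1000

# Compare with a tight absolute tolerance so a 0.001 degree error would be caught.
lat_matches = np.allclose(lat_from_y, sample['lat'], rtol=0, atol=1e-6)
lon_matches = np.allclose(lon_from_x, sample['lon'], rtol=0, atol=1e-6)
print(lat_matches, lon_matches)
```

If the relationship holds for the full dataset as it does on these rows, `lat` and `lon` carry no information beyond `loc_y` and `loc_x`.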

**Does ```minutes_remaining``` mean for the entire game, or for that quarter (period)? What about ```seconds_remaining```?**
> Also, where the hell is shot clock?

In [15]:
df_raw[['minutes_remaining', 'seconds_remaining']].describe()

| | minutes_remaining | seconds_remaining |
|---|---|---|
| count | 25697.0 | 25697.0 |
| mean | 4.886796 | 28.311554 |
| std | 3.452475 | 17.523392 |
| min | 0.0 | 0.0 |
| 25% | 2.0 | 13.0 |
| 50% | 5.0 | 28.0 |
| 75% | 8.0 | 43.0 |
| max | 11.0 | 59.0 |


Well, I learned something I didn't expect about this data: the two columns work in tandem to represent the time left in the quarter (period).
> Let's combine these into a single column for ```time```.

In [16]:
df_raw[['minutes_remaining', 'seconds_remaining']].head()

| shot_id | minutes_remaining | seconds_remaining |
|---|---|---|
| 2 | 10 | 22 |
| 3 | 7 | 45 |
| 4 | 6 | 52 |
| 5 | 6 | 19 |
| 6 | 9 | 32 |


In [17]:
# Combine minutes and seconds into total seconds remaining in the period.
df_raw['time_sec_remaining'] = df_raw['minutes_remaining'] * 60 + df_raw['seconds_remaining']

In [18]:
df_raw[['minutes_remaining', 'seconds_remaining', 'time_sec_remaining']].head()

| shot_id | minutes_remaining | seconds_remaining | time_sec_remaining |
|---|---|---|---|
| 2 | 10 | 22 | 622 |
| 3 | 7 | 45 | 465 |
| 4 | 6 | 52 | 412 |
| 5 | 6 | 19 | 379 |
| 6 | 9 | 32 | 572 |


**Looks good. We don't need ```minutes_remaining``` or ```seconds_remaining``` anymore.**

In [19]:
df_raw.drop(columns=['seconds_remaining', 'minutes_remaining'], inplace=True)

Looks good, what else?

In [20]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25697 entries, 2 to 30697
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   action_type         25697 non-null  object 
 1   combined_shot_type  25697 non-null  object 
 2   game_event_id       25697 non-null  int64  
 3   game_id             25697 non-null  int64  
 4   loc_x               25697 non-null  int64  
 5   loc_y               25697 non-null  int64  
 6   period              25697 non-null  int64  
 7   playoffs            25697 non-null  int64  
 8   season              25697 non-null  object 
 9   shot_distance       25697 non-null  int64  
 10  shot_made_flag      25697 non-null  float64
 11  shot_type           25697 non-null  object 
 12  shot_zone_area      25697 non-null  object 
 13  shot_zone_basic     25697 non-null  object 
 14  shot_zone_range     25697 non-null  object 
 15  opponent            25697 non-null  object 
 16  home

> ```combined_shot_type```, ```action_type```, and ```shot_type``` all look very closely related. Let's take a closer look.

In [21]:
df_raw[['combined_shot_type', 'action_type', 'shot_type']].head()

| shot_id | combined_shot_type | action_type | shot_type |
|---|---|---|---|
| 2 | Jump Shot | Jump Shot | 2PT Field Goal |
| 3 | Jump Shot | Jump Shot | 2PT Field Goal |
| 4 | Jump Shot | Jump Shot | 2PT Field Goal |
| 5 | Dunk | Driving Dunk Shot | 2PT Field Goal |
| 6 | Jump Shot | Jump Shot | 2PT Field Goal |


In [22]:
df_raw['combined_shot_type'].value_counts()

Jump Shot    19710
Layup         4532
Dunk          1056
Tip Shot       152
Hook Shot      127
Bank Shot      120
Name: combined_shot_type, dtype: int64

In [23]:
df_raw['shot_type'].value_counts()

2PT Field Goal    20285
3PT Field Goal     5412
Name: shot_type, dtype: int64

In [24]:
df_raw['action_type'].value_counts()

Jump Shot                          15836
Layup Shot                          2154
Driving Layup Shot                  1628
Turnaround Jump Shot                 891
Fadeaway Jump Shot                   872
Running Jump Shot                    779
Pullup Jump shot                     402
Turnaround Fadeaway shot             366
Slam Dunk Shot                       334
Reverse Layup Shot                   333
Jump Bank Shot                       289
Driving Dunk Shot                    257
Dunk Shot                            217
Tip Shot                             151
Step Back Jump shot                  106
Alley Oop Dunk Shot                   95
Floating Jump shot                    93
Driving Reverse Layup Shot            83
Hook Shot                             73
Driving Finger Roll Shot              68
Alley Oop Layup shot                  67
Reverse Dunk Shot                     61
Driving Finger Roll Layup Shot        59
Turnaround Bank shot                  58
Running Layup Sh

**```action_type``` is too messy a column. It does provide some extra detail on top of ```combined_shot_type```, but given the other available variables and my time constraint I'm going to drop it.**

In [25]:
df_raw.drop(columns='action_type', inplace=True)

**What else?**

In [26]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25697 entries, 2 to 30697
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   combined_shot_type  25697 non-null  object 
 1   game_event_id       25697 non-null  int64  
 2   game_id             25697 non-null  int64  
 3   loc_x               25697 non-null  int64  
 4   loc_y               25697 non-null  int64  
 5   period              25697 non-null  int64  
 6   playoffs            25697 non-null  int64  
 7   season              25697 non-null  object 
 8   shot_distance       25697 non-null  int64  
 9   shot_made_flag      25697 non-null  float64
 10  shot_type           25697 non-null  object 
 11  shot_zone_area      25697 non-null  object 
 12  shot_zone_basic     25697 non-null  object 
 13  shot_zone_range     25697 non-null  object 
 14  opponent            25697 non-null  object 
 15  home_away           25697 non-null  int64  
 16  time

> ```game_event_id``` and ```game_id``` look kind of useless.

In [27]:
df_raw[['game_event_id', 'game_id']].head()

| shot_id | game_event_id | game_id |
|---|---|---|
| 2 | 12 | 20000012 |
| 3 | 35 | 20000012 |
| 4 | 43 | 20000012 |
| 5 | 155 | 20000012 |
| 6 | 244 | 20000012 |


In [28]:
df_raw[['game_event_id', 'game_id']].describe()

| | game_event_id | game_id |
|---|---|---|
| count | 25697.0 | 25697.0 |
| mean | 249.348679 | 24741090.0 |
| std | 149.77852 | 7738108.0 |
| min | 2.0 | 20000010.0 |
| 25% | 111.0 | 20500060.0 |
| 50% | 253.0 | 20900340.0 |
| 75% | 367.0 | 29600270.0 |
| max | 653.0 | 49900090.0 |


These are just extra identifiers for the shot. They are only useful for identifying at what point in the game a shot takes place. Kobe was clutch, so later events may have higher success rates, but that information is already covered by ```period``` and ```time_sec_remaining```.

In [29]:
df_raw.drop(columns=['game_event_id', 'game_id'], inplace=True)

In [30]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25697 entries, 2 to 30697
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   combined_shot_type  25697 non-null  object 
 1   loc_x               25697 non-null  int64  
 2   loc_y               25697 non-null  int64  
 3   period              25697 non-null  int64  
 4   playoffs            25697 non-null  int64  
 5   season              25697 non-null  object 
 6   shot_distance       25697 non-null  int64  
 7   shot_made_flag      25697 non-null  float64
 8   shot_type           25697 non-null  object 
 9   shot_zone_area      25697 non-null  object 
 10  shot_zone_basic     25697 non-null  object 
 11  shot_zone_range     25697 non-null  object 
 12  opponent            25697 non-null  object 
 13  home_away           25697 non-null  int64  
 14  time_sec_remaining  25697 non-null  int64  
dtypes: float64(1), int64(7), object(7)
memory usage: 3.1+

In [31]:
df_raw['season'].value_counts()

2005-06    1924
2002-03    1852
2008-09    1851
2007-08    1819
2009-10    1772
2001-02    1708
2006-07    1579
2000-01    1575
2010-11    1521
2011-12    1416
2003-04    1371
2012-13    1328
1999-00    1312
2004-05    1127
2015-16     932
1997-98     810
1998-99     765
2014-15     593
1996-97     383
2013-14      59
Name: season, dtype: int64

In [32]:
df_raw['playoffs'].value_counts()

0    21939
1     3758
Name: playoffs, dtype: int64

In [33]:
df_raw[['shot_zone_area', 'shot_zone_basic', 'shot_zone_range', 'shot_distance']].head()

| shot_id | shot_zone_area | shot_zone_basic | shot_zone_range | shot_distance |
|---|---|---|---|---|
| 2 | Left Side(L) | Mid-Range | 8-16 ft. | 15 |
| 3 | Left Side Center(LC) | Mid-Range | 16-24 ft. | 16 |
| 4 | Right Side Center(RC) | Mid-Range | 16-24 ft. | 22 |
| 5 | Center(C) | Restricted Area | Less Than 8 ft. | 0 |
| 6 | Left Side(L) | Mid-Range | 8-16 ft. | 14 |


**I was planning on using K-Means clustering on ```loc_x``` and ```loc_y``` to find my own zones, but I think that is better covered by the ```shot_zone``` categories.**
>**I am going to drop ```loc_x``` and ```loc_y```.**

In [34]:
df_raw.drop(columns=['loc_x', 'loc_y'], inplace=True)
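One more reason this drop probably loses little: on the rows printed earlier, `shot_distance` looks like the straight-line distance from the hoop computed from `loc_x` and `loc_y`. A quick check on shot_id 2 to 5, assuming the loc units are tenths of a foot and the distance is truncated to whole feet (both are assumptions inferred from the printed values, not something the dataset documents here):

```python
import numpy as np

# loc_x, loc_y from the df_raw.head() output and shot_distance from the In [33] output
# (shot_id 2 to 5).
loc_x = np.array([-157, -101, 138, 0])
loc_y = np.array([0, 135, 175, 0])
shot_distance = np.array([15, 16, 22, 0])

# Assumed rule: Euclidean distance in tenths of a foot, truncated to whole feet.
reconstructed = np.floor(np.hypot(loc_x, loc_y) / 10).astype(int)

print(reconstructed, np.array_equal(reconstructed, shot_distance))
```

Four rows are not proof, so it would be worth running the same comparison on the full frame before relying on it. The direction of the shot, which distance alone does not capture, is still represented by ```shot_zone_area```.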

In [35]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25697 entries, 2 to 30697
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   combined_shot_type  25697 non-null  object 
 1   period              25697 non-null  int64  
 2   playoffs            25697 non-null  int64  
 3   season              25697 non-null  object 
 4   shot_distance       25697 non-null  int64  
 5   shot_made_flag      25697 non-null  float64
 6   shot_type           25697 non-null  object 
 7   shot_zone_area      25697 non-null  object 
 8   shot_zone_basic     25697 non-null  object 
 9   shot_zone_range     25697 non-null  object 
 10  opponent            25697 non-null  object 
 11  home_away           25697 non-null  int64  
 12  time_sec_remaining  25697 non-null  int64  
dtypes: float64(1), int64(5), object(7)
memory usage: 2.7+ MB


In [36]:
df = df_raw.copy()

## Finished Cleaning

### Saving cleaned dataset to a new CSV

In [37]:
df.to_csv('./data/kobe_data_cleaned.csv')

---
# Alright, let's do some modeling

Make sure our dataframe saved correctly.

In [38]:
df = pd.read_csv('./data/kobe_data_cleaned.csv', index_col = 'shot_id')
print(df.shape)
df.head()

(25697, 13)


| shot_id | combined_shot_type | period | playoffs | season | shot_distance | shot_made_flag | shot_type | shot_zone_area | shot_zone_basic | shot_zone_range | opponent | home_away | time_sec_remaining |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Jump Shot | 1 | 0 | 2000-01 | 15 | 0.0 | 2PT Field Goal | Left Side(L) | Mid-Range | 8-16 ft. | POR | 0 | 622 |
| 3 | Jump Shot | 1 | 0 | 2000-01 | 16 | 1.0 | 2PT Field Goal | Left Side Center(LC) | Mid-Range | 16-24 ft. | POR | 0 | 465 |
| 4 | Jump Shot | 1 | 0 | 2000-01 | 22 | 0.0 | 2PT Field Goal | Right Side Center(RC) | Mid-Range | 16-24 ft. | POR | 0 | 412 |
| 5 | Dunk | 2 | 0 | 2000-01 | 0 | 1.0 | 2PT Field Goal | Center(C) | Restricted Area | Less Than 8 ft. | POR | 0 | 379 |
| 6 | Jump Shot | 3 | 0 | 2000-01 | 14 | 0.0 | 2PT Field Goal | Left Side(L) | Mid-Range | 8-16 ft. | POR | 0 | 572 |


## Set our X and y for our model

In [39]:
X = df.drop(columns=['shot_made_flag'])
y = df['shot_made_flag'].astype(int)

## Dummified Categoricals

In [40]:
dum_X = pd.get_dummies(X, columns=['combined_shot_type', 'period', 'playoffs', 'season', 'shot_type', 
                                   'shot_zone_area', 'shot_zone_basic', 'shot_zone_range', 'opponent', 
                                   'home_away'], 
                       drop_first=True)

In [41]:
dum_X.shape

(25697, 82)

In [42]:
dum_X.head()

| shot_id | shot_distance | time_sec_remaining | combined_shot_type_Dunk | combined_shot_type_Hook Shot | combined_shot_type_Jump Shot | combined_shot_type_Layup | combined_shot_type_Tip Shot | period_2 | period_3 | period_4 | ... | opponent_PHX | opponent_POR | opponent_SAC | opponent_SAS | opponent_SEA | opponent_TOR | opponent_UTA | opponent_VAN | opponent_WAS | home_away_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 15 | 622 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 16 | 465 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 22 | 412 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 379 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 14 | 572 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


**Looks good; appears to be dummified correctly.**
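To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame (`toy` and its values are illustrative only). The first level of the encoded column is dropped and becomes the implicit baseline, which is why the output above has `period_2` but no `period_1`.

```python
import pandas as pd

# Illustrative frame with one categorical-like column and one numeric column.
toy = pd.DataFrame({
    'period': [1, 2, 3, 1],
    'shot_distance': [15, 0, 14, 22],
})

# drop_first=True drops the first level (period 1), which becomes the baseline.
toy_dum = pd.get_dummies(toy, columns=['period'], drop_first=True)

print(list(toy_dum.columns))
print(toy_dum)
```

A row from period 1 is then represented by zeros (or `False`, depending on the pandas version) in every remaining `period_*` column.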

*How does StandardScaler work after dummifying?*

In [43]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(dum_X, y, stratify=y, random_state=42)

# Scale data
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

n_input = X_train_sc.shape[1]
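On the StandardScaler question above: the scaler treats a dummy column like any other column, subtracting its mean and dividing by its standard deviation. The sketch below shows this on a made-up frame (`toy`, `toy_scaler`, and the values are illustrative, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative frame: one numeric feature and one 0/1 dummy column.
toy = pd.DataFrame({
    'shot_distance': [15, 16, 22, 0, 14],
    'home_away_1': [0, 0, 1, 1, 0],
})

toy_scaler = StandardScaler()
toy_sc = toy_scaler.fit_transform(toy)

# Every column, dummy or not, ends up with mean 0 and unit variance.
print(toy_sc.mean(axis=0).round(6), toy_sc.std(axis=0).round(6))

# The dummy column still takes exactly two distinct values, just no longer 0 and 1.
print(np.unique(toy_sc[:, 1].round(6)))
```

So the dummies stay binary in the sense of having two levels, but rare categories get a large positive value for their "on" state, which is worth keeping in mind for distance-based models such as KNN.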

## Model 0: Null Model

In [44]:
y.value_counts(normalize=True)

0    0.553839
1    0.446161
Name: shot_made_flag, dtype: float64

>**Our null model, which always predicts the majority class (a miss), has an accuracy of about 55%.**
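The same baseline can be reproduced with scikit-learn's `DummyClassifier`. This is a minimal sketch that rebuilds the target from the value counts printed earlier (14232 misses, 11465 makes); `y_counts` and `X_placeholder` are stand-ins introduced for illustration, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild the target from the value counts printed earlier: 14232 misses, 11465 makes.
y_counts = np.array([0] * 14232 + [1] * 11465)
X_placeholder = np.zeros((len(y_counts), 1))  # features are ignored by this baseline

# 'most_frequent' always predicts the majority class.
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_placeholder, y_counts)

null_accuracy = null_model.score(X_placeholder, y_counts)
print(round(null_accuracy, 6))
```

Any model below should beat this number on the test set to be worth keeping.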

## Model 1: NN v1

In [45]:
model = Sequential()

# Layer 1: Input
model.add(BatchNormalization())
model.add(Dense(n_input, # how many neurons/nodes in the layer
                input_shape=(n_input,), # how many features you have
                activation='relu' # relu for hidden layer
               ))
model.add(Dropout(0.05))

# Layer 2: Hidden 1
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu' # relu for hidden layer
               ))
model.add(Dropout(0.1))

# Layer 3: Hidden 2
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.15))
           
# Layer 4: Hidden 3
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu' # relu for hidden layer
               ))
model.add(Dropout(0.2))

# Layer 5: Output
model.add(Dense(1, activation='sigmoid'))



model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])



#early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)


history = model.fit(
    X_train_sc,
    y_train,
    validation_data=(X_test_sc, y_test),
    epochs=200#,
    #callbacks=[early_stop]
)

# Turned off EarlyStopping because it was cutting training short
# despite a slow but steady increase in accuracy.

Epoch 1/200
...
Epoch 200/200


In [46]:
preds = np.round(model.predict(X_test_sc),0)

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.5081967213114754
Precision =  0.48859825620389
Specificity =  0.5713884204609331


#### Analysis: Accuracy reported during training was sitting around 76% after 200 epochs and was still slowly increasing; 
> val_loss has skyrocketed since accuracy passed about 68%, which suggests the network is overfitting.
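The recall and specificity printed above can be combined into an overall test accuracy, since accuracy is the class-weighted average of the two. A minimal sketch, assuming the test split keeps the overall class balance from `y.value_counts(normalize=True)` (the split is stratified, so this should hold closely); `positive_share` is a name introduced here for illustration.

```python
# Metrics printed for NN v1 without EarlyStopping.
recall = 0.5081967213114754
specificity = 0.5713884204609331

# Assumed share of made shots in the test set, taken from the overall class balance.
positive_share = 0.446161

# accuracy = P(made) * recall + P(missed) * specificity
accuracy = positive_share * recall + (1 - positive_share) * specificity
print(round(accuracy, 3))
```

Under that assumption the test accuracy works out to roughly 54%, slightly below the 55% null model, which is what an overfit network would look like.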

### With EarlyStopping

In [47]:
model = Sequential()

# Layer 1: Input
model.add(BatchNormalization())
model.add(Dense(n_input, # how many neurons/nodes in the layer
                input_shape=(n_input,), # how many features you have
                activation='relu' # relu for hidden layer
               ))
model.add(Dropout(0.05))

# Layer 2: Hidden 1
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu' # relu for hidden layer
               ))
model.add(Dropout(0.1))

# Layer 3: Hidden 2
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.15))
           
# Layer 4: Hidden 3
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu' # relu for hidden layer
               ))
model.add(Dropout(0.2))

# Layer 5: Output
model.add(Dense(1, activation='sigmoid'))



model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])



early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)


history = model.fit(
    X_train_sc,
    y_train,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 00014: early stopping


In [48]:
preds = np.round(model.predict(X_test_sc),0)

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.36867806069061737
Precision =  0.5747688961392061
Specificity =  0.7802136031478358


#### Analysis: Accuracy was at 64% when EarlyStopping halted training because val_loss was no longer improving.
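To read the "Epoch 00014: early stopping" line, it helps to spell out what `patience=10` means. The function below is a simplified sketch of patience-based stopping, not Keras' actual implementation; `early_stop_epoch` and the loss values are hypothetical, and it assumes any decrease in val_loss counts as an improvement.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-indexed epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best val_loss, reset the counter
            wait = 0
        else:
            wait += 1     # another epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Hypothetical val_loss curve: improves for 4 epochs, then gets worse every epoch.
toy_val_losses = [0.69, 0.68, 0.67, 0.66] + [0.67 + 0.01 * i for i in range(20)]
stopped_at = early_stop_epoch(toy_val_losses, patience=10)
print(stopped_at)
```

If the callback behaves like this sketch, stopping at epoch 14 with a patience of 10 would mean val_loss last improved around epoch 4, so the network starts to overfit very early.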
---

## Model 2: NN v2

In [49]:
model = Sequential()

# Layer 1: Input
model.add(BatchNormalization())
model.add(Dense(n_input, # how many neurons/nodes in the layer
                input_shape=(n_input,), # how many features you have
                activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001)
               ))
model.add(Dropout(0.05))

# Layer 2: Hidden 1
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001) # L2 penalty on this layer's weights
               ))
model.add(Dropout(0.1))

# Layer 3: Hidden 2
model.add(Dense(128, activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001)))
model.add(Dropout(0.15))
           
# Layer 4: Hidden 3
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001) # L2 penalty on this layer's weights
               ))
model.add(Dropout(0.2))

# Layer 5: Output
model.add(Dense(1, activation='sigmoid'))



model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])



#early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)


history = model.fit(
    X_train_sc,
    y_train,
    validation_data=(X_test_sc, y_test),
    epochs=200#,
    #callbacks=[early_stop]
)

Epoch 1/200
...
Epoch 200/200


In [50]:
preds = np.round(model.predict(X_test_sc),0)

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.3578653644925009
Precision =  0.5747899159663865
Specificity =  0.7866779089376054


#### Analysis: Accuracy is hovering around 64%; val_loss is not skyrocketing though.

### With EarlyStopping

In [51]:
model = Sequential()

# Input normalization, then Layer 1
model.add(BatchNormalization())
model.add(Dense(n_input, # how many neurons/nodes in the layer
                input_shape=(n_input,), # how many features you have
                activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001) # L2 penalty on the weights
               ))
model.add(Dropout(0.05))

# Layer 2: Hidden 1
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001) # L2 penalty on the weights
               ))
model.add(Dropout(0.1))

# Layer 3: Hidden 2
model.add(Dense(128, activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001))) # L2 penalty on the weights
model.add(Dropout(0.15))

# Layer 4: Hidden 3
model.add(Dense(64, # how many neurons/nodes in the layer
                activation='relu', # relu for hidden layer
                kernel_regularizer=l2(.001) # L2 penalty on the weights
               ))
model.add(Dropout(0.2))

# Layer 5: Output
model.add(Dense(1, activation='sigmoid')) # sigmoid gives a probability for the binary target



model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])



early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)


history = model.fit(
    X_train_sc,
    y_train,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 00020: early stopping
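
The log stops at epoch 20 with `patience=10`. The sketch below is a simplified version of the patience rule, not Keras's actual implementation: it ignores options such as `min_delta`, and the loss values are invented. It shows why stopping at epoch 20 implies the best `val_loss` was seen around epoch 10:

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Invented curve: improves for 10 epochs, then stays above its best value.
toy_losses = [1.00 - 0.01 * i for i in range(10)] + [0.95] * 15
stopped_at = stopping_epoch(toy_losses, patience=10)
print(stopped_at)
```

Note that the callback in the cell above is created without `restore_best_weights`, so the model keeps the weights from the final epoch rather than from the best one.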


In [52]:
preds = np.round(model.predict(X_test_sc),0)

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.3543773979769794
Precision =  0.5896691816598956
Specificity =  0.801292861157954


#### Analysis: Accuracy at 63%; EarlyStopping halted training at epoch 20, after val_loss had gone 10 epochs without improving.
---

## Model 3: KNeighbors

In [53]:
knn_params = {
    'n_neighbors': range(1,1001,100),
    'metric': ['euclidean', 'manhattan'],
}

knn_gridsearch = GridSearchCV(
    KNeighborsClassifier(), #estimator; what model to fit
    knn_params, #parameter grid
    cv=5, #number of folds; default is 5
    n_jobs=-1,
    verbose=1
)

In [54]:
knn_gridsearch.fit(X_train_sc, y_train)
pd.DataFrame(knn_gridsearch.cv_results_).sort_values('rank_test_score').head()

Fitting 5 folds for each of 20 candidates, totalling 100 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_metric,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
9,0.044647,0.013693,3.959787,0.383711,euclidean,901,"{'metric': 'euclidean', 'n_neighbors': 901}",0.610895,0.616602,0.610275,0.609237,0.609237,0.611249,0.00275,1
8,0.034181,0.003233,4.279629,0.064618,euclidean,801,"{'metric': 'euclidean', 'n_neighbors': 801}",0.610895,0.61738,0.608459,0.609497,0.609237,0.611093,0.00324,2
14,0.032981,0.004055,14.479678,0.416001,manhattan,401,"{'metric': 'manhattan', 'n_neighbors': 401}",0.607004,0.614267,0.610535,0.610275,0.611313,0.610679,0.002322,3
16,0.033356,0.004268,14.341536,0.4343,manhattan,601,"{'metric': 'manhattan', 'n_neighbors': 601}",0.606744,0.614267,0.610275,0.610016,0.610794,0.610419,0.002393,4
13,0.035272,0.004583,14.400944,0.161567,manhattan,301,"{'metric': 'manhattan', 'n_neighbors': 301}",0.607004,0.615045,0.609497,0.609756,0.610794,0.610419,0.002627,5


In [55]:
knn_gridsearch.score(X_test_sc, y_test)

0.6062256809338521

In [56]:
preds = knn_gridsearch.predict(X_test_sc) # predict() already returns class labels

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.2940355772584583
Precision =  0.6249073387694588
Specificity =  0.8577852726250703


#### Analysis: Accuracy of 60.6%. 
> Not great; I am going to tune these hyperparameters a bit more.
---

### Tuned Hyperparameters based on previous GridSearch

In [57]:
knn_params = {
    'n_neighbors': range(900,920,1),
    'metric': ['euclidean']
}

knn_gridsearch = GridSearchCV(
    KNeighborsClassifier(), #estimator; what model to fit
    knn_params, #parameter grid
    cv=5, #number of folds; default is 5
    n_jobs=-1,
    verbose=1
)

In [58]:
knn_gridsearch.fit(X_train_sc, y_train)
pd.DataFrame(knn_gridsearch.cv_results_).sort_values('rank_test_score').head()

Fitting 5 folds for each of 20 candidates, totalling 100 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_metric,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
9,0.038368,0.006595,4.389306,0.365302,euclidean,909,"{'metric': 'euclidean', 'n_neighbors': 909}",0.610636,0.61738,0.610275,0.608978,0.609497,0.611353,0.003069,1
11,0.037649,0.004887,4.159302,0.186988,euclidean,911,"{'metric': 'euclidean', 'n_neighbors': 911}",0.610636,0.616602,0.610794,0.609497,0.608978,0.611301,0.002737,2
1,0.03032,0.003763,5.184089,0.384087,euclidean,901,"{'metric': 'euclidean', 'n_neighbors': 901}",0.610895,0.616602,0.610275,0.609237,0.609237,0.611249,0.00275,3
13,0.032996,0.001898,4.17075,0.106578,euclidean,913,"{'metric': 'euclidean', 'n_neighbors': 913}",0.610376,0.61738,0.610275,0.609237,0.608978,0.611249,0.003115,4
5,0.034703,0.002914,4.423439,0.101737,euclidean,905,"{'metric': 'euclidean', 'n_neighbors': 905}",0.610376,0.617121,0.610016,0.609237,0.609237,0.611197,0.002995,5


In [59]:
knn_gridsearch.score(X_test_sc, y_test)

0.6063813229571985

In [60]:
preds = knn_gridsearch.predict(X_test_sc) # predict() already returns class labels

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.2943843739100105
Precision =  0.6251851851851852
Specificity =  0.8577852726250703


#### Analysis: Accuracy of 60.6%. 
> I didn't build a new grid search every time I reduced my range, but 901-911 consistently did best for n_neighbors, and once the range was narrowed, Euclidean was the best param_metric.
>> It seems this is the best accuracy we are going to get with KNN.
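
The coarse-to-fine narrowing described above can be sketched as a small helper. This is an illustration under assumptions, not the procedure the notebook followed: `refine_grid` and its step sizes are hypothetical, and the only value taken from the notebook is the coarse-search winner, 901.

```python
def refine_grid(best, coarse_step, fine_step, lower_bound=1):
    """Candidates centred on `best`, staying strictly inside its coarse-grid neighbours."""
    reach = (coarse_step - 1) // fine_step  # fine steps that fit on each side
    candidates = [best + k * fine_step for k in range(-reach, reach + 1)]
    return [c for c in candidates if c >= lower_bound]

# The first search stepped n_neighbors by 100 and preferred 901.
coarse_best = 901
second_pass = refine_grid(coarse_best, coarse_step=100, fine_step=20)
print(second_pass)
```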
---

## Model 4: LogisticRegression

In [61]:
logreg_params = {
    'C': np.logspace(1,5,100)
}

logreg_gridsearch = GridSearchCV(
    LogisticRegression(), #estimator; what model to fit
    logreg_params, #parameter grid
    cv=5, #number of folds; default is 5
    n_jobs=-1,
    verbose=1
)

In [62]:
logreg_gridsearch.fit(X_train_sc, y_train)
pd.DataFrame(logreg_gridsearch.cv_results_).sort_values('rank_test_score').head()

Fitting 5 folds for each of 100 candidates, totalling 500 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
97,1.046151,0.146706,0.003591,0.001198,83021.756813,{'C': 83021.75681319753},0.611154,0.622827,0.614946,0.60768,0.606642,0.61265,0.005863,1
75,0.823028,0.091299,0.002993,2.1e-05,10722.67222,{'C': 10722.672220103253},0.611154,0.622827,0.614946,0.607421,0.606902,0.61265,0.005856,1
94,0.908614,0.102284,0.003495,0.001003,62802.914418,{'C': 62802.914418342596},0.611154,0.622827,0.614946,0.607421,0.606902,0.61265,0.005856,1
90,0.968288,0.082844,0.002924,0.000486,43287.612811,{'C': 43287.612810830615},0.611154,0.622827,0.614946,0.607421,0.606902,0.61265,0.005856,1
88,0.944393,0.091638,0.00359,0.000487,35938.136638,{'C': 35938.13663804626},0.611154,0.623087,0.614946,0.607421,0.606642,0.61265,0.005998,5


In [63]:
logreg_gridsearch.score(X_test_sc, y_test)

0.6090272373540856

In [64]:
preds = logreg_gridsearch.predict(X_test_sc) # predict() already returns class labels

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.3425183118242065
Precision =  0.6103169670602859
Specificity =  0.8237774030354131


#### Analysis: Accuracy of 60.9%. 
> I didn't build a new grid search every time I reduced my range, but the best C varied widely, so I increased both the number of candidates and the range.
>> The accuracy is very close to KNN. This ~60-61% accuracy may be a ceiling for this dataset in general.
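
In scikit-learn, `C` is the inverse of regularization strength, so the grid `np.logspace(1,5,100)` only covers weak regularization (C from 10 to 100,000). That may explain why the top candidates score identically. The sketch below mirrors that grid in plain Python and builds a hypothetical wider grid that also reaches below 1. The wider bounds and the `log_grid` helper are assumptions for illustration, not something the notebook tried.

```python
def log_grid(start_exp, stop_exp, num):
    """Plain-Python stand-in for np.logspace(start_exp, stop_exp, num)."""
    if num == 1:
        return [10.0 ** start_exp]
    step = (stop_exp - start_exp) / (num - 1)
    return [10.0 ** (start_exp + i * step) for i in range(num)]

# Same spacing as the notebook's grid: 100 values from 1e1 to 1e5.
notebook_grid = log_grid(1, 5, 100)

# Hypothetical wider grid that also tries strong regularization (small C).
wider_grid = log_grid(-4, 4, 9)

print(notebook_grid[0], notebook_grid[-1])
print(wider_grid)
```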
---

## Model 5: RandomForest

In [67]:
RandomForest_params = {
    'n_estimators': [100, 125, 150, 175],
    'max_depth': range(7,27,10),
    'min_samples_split': [5,15,20],
    'min_samples_leaf': [2,5,7,10]
}

RandomForest_gridsearch = GridSearchCV(
    RandomForestClassifier(), #estimator; what model to fit
    RandomForest_params, #parameter grid
    cv=3, #number of folds; default is 5
    n_jobs=-1,
    verbose=1
)

In [68]:
RandomForest_gridsearch.fit(X_train_sc, y_train)
pd.DataFrame(RandomForest_gridsearch.cv_results_).sort_values('rank_test_score').head()

Fitting 3 folds for each of 96 candidates, totalling 288 fits
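
The candidate count in the log line can be checked by hand by expanding the grid. The sketch below copies the parameter values from the cell above and uses only the standard library; the variable names are illustrative.

```python
from itertools import product

rf_grid = {
    "n_estimators": [100, 125, 150, 175],
    "max_depth": range(7, 27, 10),
    "min_samples_split": [5, 15, 20],
    "min_samples_leaf": [2, 5, 7, 10],
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*rf_grid.values()))
n_folds = 3
total_fits = len(candidates) * n_folds

print(list(rf_grid["max_depth"]))
print(len(candidates), total_fits)
```

Because `range(7, 27, 10)` yields only 7 and 17, the search compares just those two depths.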


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_min_samples_leaf,param_min_samples_split,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
84,3.67654,0.087991,0.223307,0.054462,17,10,5,100,"{'max_depth': 17, 'min_samples_leaf': 10, 'min...",0.623443,0.61675,0.608655,0.616283,0.006046,1
60,3.146441,0.040927,0.171647,0.017998,17,5,5,100,"{'max_depth': 17, 'min_samples_leaf': 5, 'min_...",0.621731,0.615037,0.610212,0.61566,0.004723,2
94,4.834269,0.035658,0.209749,0.00974,17,10,20,150,"{'max_depth': 17, 'min_samples_leaf': 10, 'min...",0.621264,0.614726,0.61099,0.61566,0.004246,2
90,4.919068,0.061104,0.274799,0.019138,17,10,15,150,"{'max_depth': 17, 'min_samples_leaf': 10, 'min...",0.622198,0.61457,0.610212,0.61566,0.004954,2
79,6.77469,0.097186,0.374354,0.048382,17,7,15,175,"{'max_depth': 17, 'min_samples_leaf': 7, 'min_...",0.621731,0.61566,0.609278,0.615556,0.005085,5


In [69]:
RandomForest_gridsearch.score(X_test_sc, y_test)

0.6130739299610894

In [70]:
preds = RandomForest_gridsearch.predict(X_test_sc) # predict() already returns class labels

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

Recall =  0.33205441227764215
Precision =  0.6250820748522653
Specificity =  0.8395165823496347


#### Analysis: Accuracy of 61.3%. 
> The accuracy is very close to KNN and LogisticRegression. This ~60-61% accuracy may be a ceiling for this dataset in general.
---

## Model 6: GradientBoosting

Using parameter ranges informed by the RandomForest GridSearch above. I expect very similar results.

In [71]:
RandomForest_gridsearch.best_params_

{'max_depth': 17,
 'min_samples_leaf': 10,
 'min_samples_split': 5,
 'n_estimators': 100}

In [72]:
GradientBoosting_params = {
    'max_depth': range(5,95,30),
    'min_samples_leaf': range(2,11,3),
    'min_samples_split': range(2,32,10), # integer values must be at least 2
    'n_estimators': range(50, 200, 50)
}

GradientBoosting_gridsearch = GridSearchCV(
    GradientBoostingClassifier(), #estimator; what model to fit
    GradientBoosting_params, #parameter grid
    cv=3, #number of folds; default is 5
    n_jobs=-1,
    verbose=1
)

In [73]:
GradientBoosting_gridsearch.fit(X_train_sc, y_train)
pd.DataFrame(GradientBoosting_gridsearch.cv_results_).sort_values('rank_test_score').head()

Fitting 3 folds for each of 81 candidates, totalling 243 fits


KeyboardInterrupt: 

In [None]:
GradientBoosting_gridsearch.score(X_test_sc, y_test)

In [None]:
preds = GradientBoosting_gridsearch.predict(X_test_sc) # predict() already returns class labels

# Recall/Sensitivity
print('Recall = ', metrics.recall_score(y_test, preds))

# Precision
print('Precision = ', metrics.precision_score(y_test, preds))

cm = metrics.confusion_matrix(y_test, preds)
tn, fp, fn, tp = cm.ravel()

print("Specificity = ", tn / (tn+fp))

#### Analysis: No accuracy recorded; the grid search was interrupted (KeyboardInterrupt) before it finished.

---

# Analysis and Conclusion

First of all, it should be clear to everyone that Kobe Bryant is the Best Basketball Player of All Time 
>Not to be confused with Michael Jordan, who is the GOAT. 
>>**There will be no questions or discussion on this point at this time.**

Secondly, basketball is about one thing and one thing only: buckets. Our null model (always predicting the majority class) scores about 55%, so the dataset is roughly balanced and accuracy is the metric that matters to us. False positives and false negatives, as long as their rates aren't too skewed, do not matter to our analysis.
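
As a minimal sketch of that baseline, assuming "null model" means always predicting the most common class: the toy labels below are invented to mirror the ~55% figure and are not the real data, and `null_accuracy` is a hypothetical helper.

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Invented split: 55 misses (0) and 45 makes (1).
toy_labels = [0] * 55 + [1] * 45
baseline = null_accuracy(toy_labels)
print(baseline)
```

Any model worth keeping has to beat this number, which is why the 60-61% results are best read as roughly 5 to 6 points above the baseline.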

## Moving on to model analysis

### **Model 1: NN v1** was the best performing model for our most important statistic, **Accuracy**.
> **It is also our most 'expensive' model.**

>Every model ran into a hard ceiling around 61% accuracy. Only by letting Model 1 keep training through all 200 epochs were we able to break through it, at great computational cost. Further epochs might push accuracy higher still, but that would be expensive and there is no guarantee of continued improvement.
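
One way to sanity-check this ranking is to recover accuracy from the recall and specificity each model printed, using the identity accuracy = p * recall + (1 - p) * specificity, where p is the share of made shots in the test set. The sketch below is a cross-check under assumptions, not part of the original notebook: it backs p out of the first KNN cells, assumes every model was scored on the same `y_test`, and uses hypothetical helper names. All numeric inputs are values printed above.

```python
def positive_share(accuracy, recall, specificity):
    """Solve accuracy = p*recall + (1-p)*specificity for p."""
    return (specificity - accuracy) / (specificity - recall)

def implied_accuracy(p, recall, specificity):
    """Accuracy implied by recall, specificity and the positive share p."""
    return p * recall + (1 - p) * specificity

# Accuracy, recall and specificity printed by the first KNN cells.
p = positive_share(0.6062256809338521, 0.2940355772584583, 0.8577852726250703)

# (recall, specificity) pairs as printed for each model.
printed = {
    "NN v1 (200 epochs)": (0.3578653644925009, 0.7866779089376054),
    "NN with EarlyStopping": (0.3543773979769794, 0.801292861157954),
    "LogisticRegression": (0.3425183118242065, 0.8237774030354131),
    "RandomForest": (0.33205441227764215, 0.8395165823496347),
}

implied = {name: implied_accuracy(p, r, s) for name, (r, s) in printed.items()}
for name, acc in implied.items():
    print(f"{name}: {acc:.3f}")
```

Under these assumptions the LogisticRegression and RandomForest values reproduce their `score` outputs (about 0.609 and 0.613), while both neural-network values land near 0.60. If the assumptions hold, the 63% to 64% quoted for the networks may have come from training accuracy in the epoch log, so printing `metrics.accuracy_score(y_test, preds)` in those cells would be a useful check before ranking Model 1 first.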

## Thoughts that came too late

> I should have used a time series analysis that could incorporate the 'trend' of the game and better capture the relationship between time and shot success rate.
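
One simple way to give a model a sense of in-game trend is a rolling make rate computed only from earlier shots, so the feature never sees the shot it is predicting. The sketch below is a hypothetical feature, not something the notebook built; the function name, window size, and outcomes are invented, and the shots are assumed to be sorted chronologically.

```python
def prior_make_rate(outcomes, window):
    """For each shot, the share of the previous `window` shots that went in.

    Returns None for a shot with no history.
    """
    features = []
    for i in range(len(outcomes)):
        history = outcomes[max(0, i - window):i]  # strictly earlier shots
        features.append(sum(history) / len(history) if history else None)
    return features

# Invented sequence of shot outcomes (1 = made, 0 = missed).
toy_outcomes = [1, 0, 1, 1, 0]
trend = prior_make_rate(toy_outcomes, window=2)
print(trend)
```

With a time-ordered feature like this, a chronological train/test split would be a more honest evaluation than a random one.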

> I also needed more visualizations and correlation checks to find the best features for predicting our target.