<a href="https://colab.research.google.com/github/adamlutzz/DS-Unit-2-Kaggle-Challenge/blob/master/DS7_Sprint_Challenge_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all Steph Curry's NBA field goal attempts. (Regular season and playoff games, from October 28, 2009, through June 5, 2019.) 

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

### Setup

In [231]:
import sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install packages in Colab
    !pip install category_encoders==2.0.0
    !pip install pandas-profiling==2.3.0
    !pip install plotly==4.1.1



### Assignment

In [0]:
import pandas as pd

# Read data
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer new feature.** Engineer at least **1** new feature, from this list, or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get your model's **test accuracy.** (One time, at the end.)


**8.** Given a **confusion matrix** for a hypothetical binary classification model, **calculate accuracy, precision, and recall.**

### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.



## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [233]:
# get majority class from target
df['shot_made_flag'].value_counts(normalize=True)

0    0.527081
1    0.472919
Name: shot_made_flag, dtype: float64

In [0]:
import numpy as np

# Added much later in the project: recode the target from int to str to see whether that fixes my overfitting (it did not)
df['category'] = np.where(df['shot_made_flag']==1, 'Make', 'Miss')

I am assuming 0 means a miss and 1 means a make, so the majority class is a miss. <br>
Guessing "miss" for every shot would give a baseline accuracy of 52.7%.
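The same baseline can be computed directly instead of read off the `value_counts` output. This is a minimal sketch on a made-up toy target (not the real data); the helper name `majority_baseline` is hypothetical.

```python
import pandas as pd

def majority_baseline(y: pd.Series):
    """Return the majority class and the accuracy of always predicting it."""
    # value_counts sorts by frequency, so the first entry is the majority class
    freqs = y.value_counts(normalize=True)
    return freqs.index[0], float(freqs.iloc[0])

# Toy target, assumed for illustration only: 5 misses (0) and 3 makes (1)
toy_target = pd.Series([0, 0, 1, 0, 1, 0, 1, 0])
majority_class, baseline_accuracy = majority_baseline(toy_target)
print(majority_class, baseline_accuracy)
```

Calling `majority_baseline(df['shot_made_flag'])` on the real data should reproduce the 52.7% figure above.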

## 2. Hold out your test set.

>Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

In [235]:
# visualize
df.head()

Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot,category
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0,Miss
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0,Make
2,20900015,53,Stephen Curry,1,6,2,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0,Miss
3,20900015,141,Stephen Curry,2,9,49,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0,Miss
4,20900015,249,Stephen Curry,2,2,19,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0,Miss


In [236]:
# set game_date to datetime
df['game_date'] = pd.to_datetime(df['game_date'], infer_datetime_format=True)
df['game_date'].describe()

count                   13958
unique                    801
top       2013-05-06 00:00:00
freq                       35
first     2009-10-28 00:00:00
last      2019-06-05 00:00:00
Name: game_date, dtype: object

In [237]:
# Set cutoff date for Oct 1, 2018
cutoff = pd.to_datetime('2018-10-01')

# Set Train and Test splits
train = df[df.game_date < cutoff]
test  = df[df.game_date >= cutoff]
train.shape, test.shape

((12249, 21), (1709, 21))
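An alternative to hard-coded cutoff dates is to label every shot with the season it belongs to and then filter on that label. The sketch below assumes a season starts in October, as the assignment states; the function name `season_start_year` and the toy dates are illustrative only.

```python
import pandas as pd

def season_start_year(dates: pd.Series) -> pd.Series:
    """Map each game date to the calendar year in which its season began."""
    # October-December games belong to the season starting that year;
    # January-September games belong to the season that started the year before.
    return dates.dt.year.where(dates.dt.month >= 10, dates.dt.year - 1)

# Toy dates, assumed for illustration only
toy = pd.DataFrame({
    'game_date': pd.to_datetime(
        ['2017-10-17', '2018-06-08', '2018-10-16', '2019-06-05']
    )
})
toy['season'] = season_start_year(toy['game_date'])

# The 2018-19 season is the one that started in 2018
toy_test = toy[toy['season'] == 2018]
toy_train = toy[toy['season'] < 2018]
print(toy)
```

With a `season` column in place, the later train/validate split becomes a second filter on the same label rather than a second cutoff date.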

In [238]:
# Check for NANs
df.isna().sum()

game_id                    0
game_event_id              0
player_name                0
period                     0
minutes_remaining          0
seconds_remaining          0
action_type                0
shot_type                  0
shot_zone_basic            0
shot_zone_area             0
shot_zone_range            0
shot_distance              0
loc_x                      0
loc_y                      0
shot_made_flag             0
game_date                  0
htm                        0
vtm                        0
season_type                0
scoremargin_before_shot    0
category                   0
dtype: int64

## 3. Engineer new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?

    

In [239]:
# visualize again
df.head()

Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot,category
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0,Miss
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0,Make
2,20900015,53,Stephen Curry,1,6,2,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0,Miss
3,20900015,141,Stephen Curry,2,9,49,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0,Miss
4,20900015,249,Stephen Curry,2,2,19,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0,Miss


In [0]:
import numpy as np

# Define function to add features to Train and Test
def feat_engineer(X):
    """Engineer features for train, validate, and test sets in the same way"""

    # Prevent SettingWithCopyWarning
    X = X.copy()

    # Homecourt Advantage Feature
    X['homecourt_adv'] = (X['htm']=='GSW')

    # Opponent Feature
    X['opponent'] = np.where(X['htm'] == 'GSW', X['vtm'], X['htm'])

    # Made Previous Shot
    X['made_prev_shot'] = (X['shot_made_flag'][1:] == 1) # First Shot will be NaN

    # Lead
    X['Lead'] = (X['scoremargin_before_shot'] > 0)

    

    return X
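The two time-based features from the list above were not implemented in `feat_engineer`. A minimal sketch of how they might look follows. It assumes the `period`, `minutes_remaining`, and `seconds_remaining` columns shown in `df.head()`, and it treats overtime periods as having no regulation time left, which is a simplifying assumption. The helper name `add_time_features` is hypothetical.

```python
import pandas as pd

def add_time_features(X: pd.DataFrame) -> pd.DataFrame:
    """Add seconds remaining in the period and in the (regulation) game."""
    X = X.copy()

    # Seconds left in the current period
    X['seconds_remaining_in_period'] = (
        X['minutes_remaining'] * 60 + X['seconds_remaining']
    )

    # Full regulation periods still to be played after the current one.
    # Assumption: overtime (period > 4) is clipped to 0 periods left.
    periods_left = (4 - X['period']).clip(lower=0)

    # Each regulation period is 12 minutes long
    X['seconds_remaining_in_game'] = (
        periods_left * 12 * 60 + X['seconds_remaining_in_period']
    )
    return X

# Toy rows, assumed for illustration only: 1st period, 4th period, overtime
toy = pd.DataFrame({
    'period': [1, 4, 5],
    'minutes_remaining': [11, 1, 2],
    'seconds_remaining': [25, 47, 0],
})
toy_out = add_time_features(toy)
print(toy_out)
```

Note that `data_jeans` later drops `period`, so features like these would need to be computed before that step.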

## **4. Decide how to validate** your model. 

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

In [241]:
# set cutoff
cutoff = pd.to_datetime('2017-10-01')

# build val before overwriting train, otherwise the 2017-18 rows are lost
val  = train[train.game_date >= cutoff]
train = train[train.game_date < cutoff]

# double check shape
train.shape, val.shape

((11081, 21), (1168, 21))

In [242]:
train.head(10)

Unnamed: 0,game_id,game_event_id,player_name,period,minutes_remaining,seconds_remaining,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,shot_distance,loc_x,loc_y,shot_made_flag,game_date,htm,vtm,season_type,scoremargin_before_shot,category
0,20900015,4,Stephen Curry,1,11,25,Jump Shot,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,26,99,249,0,2009-10-28,GSW,HOU,Regular Season,2.0,Miss
1,20900015,17,Stephen Curry,1,9,31,Step Back Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-122,145,1,2009-10-28,GSW,HOU,Regular Season,0.0,Make
2,20900015,53,Stephen Curry,1,6,2,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,14,-60,129,0,2009-10-28,GSW,HOU,Regular Season,-4.0,Miss
3,20900015,141,Stephen Curry,2,9,49,Jump Shot,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,19,-172,82,0,2009-10-28,GSW,HOU,Regular Season,-4.0,Miss
4,20900015,249,Stephen Curry,2,2,19,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-68,148,0,2009-10-28,GSW,HOU,Regular Season,0.0,Miss
5,20900015,277,Stephen Curry,2,0,34,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),Less Than 8 ft.,4,39,15,0,2009-10-28,GSW,HOU,Regular Season,4.0,Miss
6,20900015,413,Stephen Curry,4,10,26,Pullup Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,16,-64,149,1,2009-10-28,GSW,HOU,Regular Season,-9.0,Make
7,20900015,453,Stephen Curry,4,6,31,Pullup Jump shot,2PT Field Goal,Mid-Range,Right Side Center(RC),16-24 ft.,17,118,123,1,2009-10-28,GSW,HOU,Regular Season,-6.0,Make
8,20900015,487,Stephen Curry,4,2,25,Pullup Jump shot,2PT Field Goal,Mid-Range,Right Side Center(RC),16-24 ft.,20,121,162,1,2009-10-28,GSW,HOU,Regular Season,-9.0,Make
9,20900015,490,Stephen Curry,4,1,47,Pullup Jump shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,18,-125,134,1,2009-10-28,GSW,HOU,Regular Season,-7.0,Make


In [243]:
# apply feat_engineer
train = feat_engineer(train)
val = feat_engineer(val)
test = feat_engineer(test)

# verify shape (added 4 features, so 21 + 4 = 25 columns)
train.shape, val.shape, test.shape

((11081, 25), (1168, 25), (1709, 25))

### Profile Report

In [244]:
# run pandas profile report to check variance
import pandas_profiling
df.profile_report()



Columns to drop because of high correlation (> 0.4) or zero variance:

*   shot_distance (has high correlation with loc_y)
*   period (has high correlation with game_event_id)
*   player_name (is a constant)



### Wrangling

In [0]:
def data_jeans(X):
    '''Wrangle data so everything looks nice'''

    # Prevent SettingWithCopyWarning
    X = X.copy()

    # Drop the features flagged in the profile report, plus action_type and the original integer target
    unusable_variance = ['shot_distance', 'period', 'player_name', 'action_type', 'shot_made_flag']
    X = X.drop(columns=unusable_variance)

    # Extract components from game_date, then drop the original column
    X['year_recorded'] = X['game_date'].dt.year
    X['month_recorded'] = X['game_date'].dt.month
    X['day_recorded'] = X['game_date'].dt.day
    X = X.drop(columns='game_date')

    # return df
    return X

In [246]:
# run data_jeans
train = data_jeans(train)
val = data_jeans(val)
test = data_jeans(test)

# verify shape (dropped 6 columns, added 3, so 25 - 6 + 3 = 22)
train.shape, val.shape, test.shape

((11081, 22), (1168, 22), (1709, 22))

In [0]:
# create target
target = 'category'

# create X_features matrix and y_target vector for train
X_train = train.drop(columns=target)
y_train = train[target]

# create X_features matrix and y_target vector for val
X_val = val.drop(columns=target)
y_val = val[target]

# create X_features matrix and y_target vector for test
X_test = test.drop(columns=target)
y_test = test[target]

In [0]:
# imports for pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

# Make pipeline!
RF = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(random_state=11, n_jobs=-1)
)

## 6. Get your model's validation accuracy

> (Multiple times if you try multiple iterations.)

In [249]:
# Fit on train, score on val
RF.fit(X_train, y_train)
y_pred = RF.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 1.0


In [250]:
# make y_pred a series for better visualization
y_series = pd.Series(y_pred)
y_series.shape, y_val.shape, X_val.shape

((1168,), (1168,), (1168, 21))

In [251]:
# test with my own eyes
print(y_series.head(10))
print(y_val.head(10))

0    Make
1    Miss
2    Miss
3    Miss
4    Make
5    Make
6    Miss
7    Make
8    Miss
9    Miss
dtype: object
11081    Make
11082    Miss
11083    Miss
11084    Miss
11085    Make
11086    Make
11087    Miss
11088    Make
11089    Miss
11090    Miss
Name: category, dtype: object


In [252]:
# something is obviously wrong
X_train.tail()

# I was half expecting to see the target still in the subset :(

Unnamed: 0,game_id,game_event_id,minutes_remaining,seconds_remaining,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,loc_x,loc_y,htm,vtm,season_type,scoremargin_before_shot,homecourt_adv,opponent,made_prev_shot,Lead,year_recorded,month_recorded,day_recorded
11076,41600405,500,4,32,2PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,0,8,GSW,CLE,Playoffs,10.0,True,CLE,True,True,2017,6,12
11077,41600405,503,4,13,2PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,-7,11,GSW,CLE,Playoffs,12.0,True,CLE,True,True,2017,6,12
11078,41600405,527,1,37,3PT Field Goal,Above the Break 3,Center(C),24+ ft.,1,283,GSW,CLE,Playoffs,11.0,True,CLE,False,True,2017,6,12
11079,41600405,534,0,42,3PT Field Goal,Above the Break 3,Left Side Center(LC),24+ ft.,-166,205,GSW,CLE,Playoffs,11.0,True,CLE,True,True,2017,6,12
11080,41600405,536,0,20,3PT Field Goal,Right Corner 3,Right Side(R),24+ ft.,235,7,GSW,CLE,Playoffs,12.0,True,CLE,False,True,2017,6,12


In [253]:
# check to make sure I split them correctly
X_val.head()

Unnamed: 0,game_id,game_event_id,minutes_remaining,seconds_remaining,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,loc_x,loc_y,htm,vtm,season_type,scoremargin_before_shot,homecourt_adv,opponent,made_prev_shot,Lead,year_recorded,month_recorded,day_recorded
11081,21700002,56,8,9,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,95,242,GSW,HOU,Regular Season,5.0,True,HOU,,True,2017,10,17
11082,21700002,167,0,32,2PT Field Goal,Mid-Range,Right Side(R),8-16 ft.,129,43,GSW,HOU,Regular Season,4.0,True,HOU,False,True,2017,10,17
11083,21700002,207,9,14,2PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,20,10,GSW,HOU,Regular Season,8.0,True,HOU,False,True,2017,10,17
11084,21700002,219,8,15,3PT Field Goal,Above the Break 3,Left Side Center(LC),24+ ft.,-127,239,GSW,HOU,Regular Season,9.0,True,HOU,False,True,2017,10,17
11085,21700002,370,11,13,2PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,-13,14,GSW,HOU,Regular Season,10.0,True,HOU,True,True,2017,10,17


In [254]:
# check end of val split
X_val.tail()

Unnamed: 0,game_id,game_event_id,minutes_remaining,seconds_remaining,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,loc_x,loc_y,htm,vtm,season_type,scoremargin_before_shot,homecourt_adv,opponent,made_prev_shot,Lead,year_recorded,month_recorded,day_recorded
12244,41700404,582,6,19,3PT Field Goal,Left Corner 3,Left Side(L),24+ ft.,-240,0,CLE,GSW,Playoffs,25.0,False,CLE,True,True,2018,6,8
12245,41700404,588,5,48,2PT Field Goal,Mid-Range,Left Side(L),8-16 ft.,-115,89,CLE,GSW,Playoffs,28.0,False,CLE,False,True,2018,6,8
12246,41700404,591,5,13,3PT Field Goal,Above the Break 3,Left Side Center(LC),24+ ft.,-116,330,CLE,GSW,Playoffs,26.0,False,CLE,False,True,2018,6,8
12247,41700404,603,4,27,2PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,14,11,CLE,GSW,Playoffs,25.0,False,CLE,False,True,2018,6,8
12248,41700404,614,3,49,2PT Field Goal,In The Paint (Non-RA),Center(C),Less Than 8 ft.,13,59,CLE,GSW,Playoffs,25.0,False,CLE,False,True,2018,6,8


In [255]:
# Make a new pipeline and try to get the accuracy as low as possible
RF = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(random_state=11, max_features=1, max_depth=1, n_jobs=-1)
)

# Fit on train, score on val
RF.fit(X_train, y_train)
y_pred = RF.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 0.997431506849315


In [256]:
# okay, that is barely below 1.0 and still far too high, hmm
# Make a new pipeline and try for better accuracy without overfitting
RF = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(random_state=11, max_features=4, max_depth=4, n_jobs=-1)
)

# Fit on train, score on val
RF.fit(X_train, y_train)
y_pred = RF.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 1.0


In [257]:
# maybe try hyperparameter tuning?

# RandomizedSearchCV
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# make pipeline
pipeline = make_pipeline(
    ce.OrdinalEncoder(), # This can be what ever kind of encoder you want
    SimpleImputer(), 
    RandomForestClassifier(random_state=11)
)

# set parameter ranges
param_distributions = { 
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__n_estimators': randint(50, 500), 
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None], # if this makes your accuracy worse, try smaller gaps 
    'randomforestclassifier__max_features': uniform(0, 1), 
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=10, 
    cv=3, 
    scoring='accuracy', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    8.0s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   10.8s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   17.5s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   27.3s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   36.8s finished


In [258]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation Accuracy', search.best_score_)

Best hyperparameters {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__max_features': 0.6102352985886536, 'randomforestclassifier__n_estimators': 454, 'simpleimputer__strategy': 'mean'}
Cross-validation Accuracy 1.0


In [0]:
# I may have built the made_prev_shot feature incorrectly, and it is contaminating my data, so I am going to drop it
# I found this out after searching Google for what could cause an RF model to reach 100% accuracy. The answer, I think, is label leakage

def drop_stupid_feat(X):
  '''sometimes we all make mistakes, also maybe I was tricked'''

  # prevent warning
  X = X.copy()

  # drop bad feature
  unusable_feature = ['made_prev_shot']
  X = X.drop(columns=unusable_feature)

  return X

train = drop_stupid_feat(train)
val = drop_stupid_feat(val)
test = drop_stupid_feat(test)
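For reference, the leak most likely comes from index alignment: `X['shot_made_flag'][1:]` drops the first row but keeps the original index, so assigning it back puts each shot's own outcome into its own row. The sketch below shows the difference on a toy frame (made-up values) and a leak-free alternative using `Series.shift`. It assumes rows are already in chronological order, and the helper name `add_made_prev_shot` is hypothetical.

```python
import pandas as pd

def add_made_prev_shot(X: pd.DataFrame) -> pd.DataFrame:
    """Add the outcome of the previous shot (NaN for the very first shot)."""
    X = X.copy()
    # shift(1) moves every outcome down one row, so row i sees shot i-1
    X['made_prev_shot'] = X['shot_made_flag'].shift(1)
    return X

# Toy outcomes, assumed for illustration only
toy = pd.DataFrame({'shot_made_flag': [0, 1, 1, 0]})

# Leaky version: slicing keeps the index, so assignment re-aligns each
# value with its own row and the feature equals the target
toy['leaky'] = toy['shot_made_flag'][1:] == 1

# Leak-free version: each row holds the previous row's outcome
toy = add_made_prev_shot(toy)
print(toy)
```

In this notebook the feature would have to be created before `data_jeans` drops `shot_made_flag`, and the first shot of each split would carry a `NaN` for the imputer to fill.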

In [260]:
train.head()

Unnamed: 0,game_id,game_event_id,minutes_remaining,seconds_remaining,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,loc_x,loc_y,htm,vtm,season_type,scoremargin_before_shot,category,homecourt_adv,opponent,Lead,year_recorded,month_recorded,day_recorded
0,20900015,4,11,25,3PT Field Goal,Above the Break 3,Right Side Center(RC),24+ ft.,99,249,GSW,HOU,Regular Season,2.0,Miss,True,HOU,True,2009,10,28
1,20900015,17,9,31,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,-122,145,GSW,HOU,Regular Season,0.0,Make,True,HOU,False,2009,10,28
2,20900015,53,6,2,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,-60,129,GSW,HOU,Regular Season,-4.0,Miss,True,HOU,False,2009,10,28
3,20900015,141,9,49,2PT Field Goal,Mid-Range,Left Side(L),16-24 ft.,-172,82,GSW,HOU,Regular Season,-4.0,Miss,True,HOU,False,2009,10,28
4,20900015,249,2,19,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,-68,148,GSW,HOU,Regular Season,0.0,Miss,True,HOU,False,2009,10,28


In [0]:
# create target
target = 'category'

# create X_features matrix and y_target vector for train
X_train = train.drop(columns=target)
y_train = train[target]

# create X_features matrix and y_target vector for val
X_val = val.drop(columns=target)
y_val = val[target]

# create X_features matrix and y_target vector for test
X_test = test.drop(columns=target)
y_test = test[target]

In [262]:
# Refit the default pipeline now that made_prev_shot has been dropped
# and check whether the accuracy looks realistic
RF = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(random_state=11, n_jobs=-1)
)

# Fit on train, score on val
RF.fit(X_train, y_train)
y_pred = RF.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 0.5333904109589042


In [263]:
# RandomizedSearchCV
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# build pipeline
pipeline = make_pipeline(
    ce.OrdinalEncoder(), # This can be what ever kind of encoder you want
    SimpleImputer(), 
    RandomForestClassifier(random_state=11)
)

# set parameter ranges
param_distributions = { 
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__n_estimators': randint(50, 500), 
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None], # if this makes your accuracy worse, try smaller gaps 
    'randomforestclassifier__max_features': uniform(0, 1), 
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=10, 
    cv=3, 
    scoring='accuracy', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   16.5s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.3min finished


In [264]:
print('Best hyperparameters', search.best_params_)
print('Cross-validation Accuracy', search.best_score_)

Best hyperparameters {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__max_features': 0.789346681894233, 'randomforestclassifier__n_estimators': 353, 'simpleimputer__strategy': 'mean'}
Cross-validation Accuracy 0.5638480281563036


**Well, 0.56 beats the 0.527 baseline and is no longer a suspicious 1.0, so I am moving forward.**

## 7. Get your model's test accuracy

> (One time, at the end.)

In [270]:
# build pipeline
RF_1 = make_pipeline(
    ce.OrdinalEncoder(), # This can be what ever kind of encoder you want
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(max_depth=5, max_features= 0.789346681894233, n_estimators=353,random_state=11)
)

RF_1.fit(X_train, y_train)
y_pred = RF_1.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 0.5659246575342466


In [271]:
# run on test data
y_pred = RF_1.predict(X_test)
print('Test Accuracy', accuracy_score(y_test, y_pred))

Test Accuracy 0.5734347571679345


## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

### Calculate accuracy 

In [267]:
# correct predictions = the diagonal, all true predictions
Correct_Predictions = 85 + 36

# total predictions = all predictions made
Total_Predictions = 85 + 58 + 8 + 36

# accuracy = Correct_Predictions / Total_Predictions
accuracy = Correct_Predictions / Total_Predictions
print('Accuracy:',accuracy)

Accuracy: 0.6470588235294118


### Calculate precision

In [268]:
# true positives: predicted positive and actually positive
positive_Correct = 36

# all predicted positives (the Positive column: false positives + true positives)
positive_Total_Predictions = 58 + 36

# Precision = true positives / all predicted positives
precision = positive_Correct / positive_Total_Predictions
print('Precision:',precision)

Precision: 0.3829787234042553


### Calculate recall

In [269]:
# true positives: predicted positive and actually positive
positive_Correct = 36

# all actual positives (the Positive row: false negatives + true positives)
positive_Total_Predictions = 8 + 36

# Recall = true positives / all actual positives
recall = positive_Correct / positive_Total_Predictions
print('Recall:', recall)

Recall: 0.8181818181818182
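The three calculations above can be collected into one small function. This is just a sketch that restates the standard definitions. The function name `classification_metrics` and its argument names are my own, and the numbers are the ones from the confusion matrix in this section.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int):
    """Return (accuracy, precision, recall) for a binary confusion matrix."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total   # correct predictions / all predictions
    precision = tp / (tp + fp)     # true positives / predicted positives
    recall = tp / (tp + fn)        # true positives / actual positives
    return accuracy, precision, recall

# Values read from the table: rows are actual, columns are predicted
accuracy, precision, recall = classification_metrics(tn=85, fp=58, fn=8, tp=36)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
```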


If I had not run into that problem with my engineered feature, I would have been able to finish the visualizations! Haha, oh well.