_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all Steph Curry's NBA field goal attempts. (Regular season and playoff games, from October 28, 2009, through June 5, 2019.) 

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [244]:
%%capture
!pip install category_encoders

In [245]:
import datetime
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

In [246]:
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

In [247]:
# Read data
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

In [248]:
df.shape

(13958, 20)

In [249]:
df.describe()

Unnamed: 0,game_id,game_event_id,period,minutes_remaining,seconds_remaining,shot_distance,loc_x,loc_y,shot_made_flag,scoremargin_before_shot
count,13958.0,13958.0,13958.0,13958.0,13958.0,13958.0,13958.0,13958.0,13958.0,13958.0
mean,24428370.0,270.438458,2.41238,4.72754,28.506376,17.600373,-0.554162,131.257988,0.472919,1.615561
std,7226620.0,169.92717,1.125828,3.331646,17.597701,10.295807,124.721869,102.666562,0.499284,10.127139
min,20900020.0,2.0,1.0,0.0,0.0,0.0,-250.0,-41.0,0.0,-39.0
25%,21200910.0,109.0,1.0,2.0,13.0,8.0,-96.0,23.0,0.0,-4.0
50%,21500260.0,275.5,2.0,4.0,29.0,22.0,1.0,141.0,0.0,1.0
75%,21700960.0,398.0,3.0,7.0,44.0,25.0,95.0,219.0,1.0,8.0
max,41800400.0,752.0,6.0,11.0,59.0,83.0,247.0,811.0,1.0,43.0


To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?

**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

**3. Engineer new feature.** Engineer at least **1** new feature, from this list, or your own idea.
- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 

**7.** Get your model's **test accuracy.** (One time, at the end.)


**8.** Given a **confusion matrix** for a hypothetical binary classification model, **calculate accuracy, precision, and recall.**

### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.
- Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.



## 1. Begin with baselines for classification. 

>Your target to predict is `shot_made_flag`. What would your baseline accuracy be, if you guessed the majority class for every prediction?

In [250]:
df['shot_made_flag'].mean()

0.4729187562688064

Shot baseline is 47%. This can be found in the df describe I did above and is reflected as well in this specific call here: 0.4729187562688064

## 2. Hold out your test set.

>Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.

In [251]:
df.columns

Index(['game_id', 'game_event_id', 'player_name', 'period',
       'minutes_remaining', 'seconds_remaining', 'action_type', 'shot_type',
       'shot_zone_basic', 'shot_zone_area', 'shot_zone_range', 'shot_distance',
       'loc_x', 'loc_y', 'shot_made_flag', 'game_date', 'htm', 'vtm',
       'season_type', 'scoremargin_before_shot'],
      dtype='object')

Found the season column in 'game_date'

In [252]:
#df['game_date']= pd.to_datetime(df['game_date']) 
#df['season_year'] = pd.DatetimeIndex(df['game_date']).year
#df['season_month'] = pd.DatetimeIndex(df['game_date']).month

Converted 'game_date' to datetime pandas and then pulled out the year and added the feature of season with 'season_year'

In [253]:
#mask = (df['game_date'] > pd.to_datetime('2018-10-01')) & (df['game_date'] <= pd.to_datetime('2019-06-30'))
#test = df.loc[mask]

In [254]:
df = df.set_index('game_date')
train = df[:'2018-09:30']
test = df['2018-10-01':]

In [255]:
test.shape

(1709, 19)

df = df.set_index('game_date')
train = df[:'2018-09:30']
test = df['2018-10-01':]

Then, later

val = train['2017-10-01':'2018-09-30']
train = train[:'2017-09-30']

Created a mask against the original 'game_date' feature which extended from Oct 2018 to June 2019 and then ran its shape to confirm accuracy.

## 3. Engineer new feature.

>Engineer at least **1** new feature, from this list, or your own idea.
>
>- **Homecourt Advantage**: Is the home team (`htm`) the Golden State Warriors (`GSW`) ?
>- **Opponent**: Who is the other team playing the Golden State Warriors?
>- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
>- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
>- **Made previous shot**: Was Steph Curry's previous shot successful?

    

In [256]:
#df['season_year'].head(),df['season_month'].head()

I already created 2 new features 'season_year' & 'season_month'

In [257]:
teams = np.array(df['vtm'])
teams

array(['HOU', 'HOU', 'HOU', ..., 'TOR', 'TOR', 'TOR'], dtype=object)

I could create a dictionary to engineer a feature using english names for the team abbreviations. But I have already engineered two features and I will move on for now.


## **4. Decide how to validate** your model. 

>Choose one of the following options. Any of these options are good. You are not graded on which you choose.
>
>- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
>- **Train/validate/test split: random 80/20%** train/validate split.
>- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.

In [258]:
#mask_train = (df['game_date'] > pd.to_datetime('2009-10-01')) & (df['game_date'] <= pd.to_datetime('2017-06-30'))
#train = df.loc[mask_train]
#mask_val = (df['game_date'] > pd.to_datetime('2017-10-01')) & (df['game_date'] <= pd.to_datetime('2018-06-30'))
#val = df.loc[mask_val]

In [259]:
val = train['2017-10-01':'2018-09-30']
train = train[:'2017-09-30']

In [260]:
train.shape, val.shape, test.shape

((11081, 19), (1168, 19), (1709, 19))

In [261]:
df.isnull().sum()

game_id                    0
game_event_id              0
player_name                0
period                     0
minutes_remaining          0
seconds_remaining          0
action_type                0
shot_type                  0
shot_zone_basic            0
shot_zone_area             0
shot_zone_range            0
shot_distance              0
loc_x                      0
loc_y                      0
shot_made_flag             0
htm                        0
vtm                        0
season_type                0
scoremargin_before_shot    0
dtype: int64

In [262]:
def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""

    # Drop columns
    # none = ['game_id', 'game_event_id', 'loc_x', 'loc_y', 'player_name' ]
    X = X.drop(columns='player_name')
    # return the wrangled dataframe
    return X

In [263]:
#train['game_date'] = train['game_date'].astype(int)

In [264]:
#val['game_date'] = val['game_date'].astype(int)

In [265]:
#test['game_date'] = test['game_date'].astype(int)

In [266]:
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [267]:
target = 'shot_made_flag'
features = train.columns.drop(target)
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[features]
X_test = test[features]

features = ['period','minutes_remaining', 'seconds_remaining', 
        'action_type', 'shot_type', 'shot_zone_basic', 
        'shot_zone_area', 'shot_zone_range', 'shot_distance', 
        'shot_made_flag', 'htm', 'vtm', 'season_type', 
        'scoremargin_before_shot']

In [268]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape

((11081, 17), (11081,), (1168, 17), (1168, 17), (1709, 17))

## 5. Use a scikit-learn pipeline to encode categoricals and fit a Decision Tree or Random Forest model.

So after changing the dates out of my previous attempt to make them datetime objects I began getting '< str versus float' comparison errors. After looking at the dataframes it is clear they are still not being encoded. So after attempts to change the feature selection and manual attempt to change the 'scoremargin_before_shot' from float to int I realized this is a fallow path. Here I have manually encoded each set of data. As you can see X_train & y_train encode correctly. But using the exact same code in the exact same circumstance X_val & y_val will not encode. None of this makes any sense. The pipeline should work no problem but it won't touch the encoding. Doing this manually should work no problem but it only works on a subset of the data. I am confident I have integrated what I was taught but I lack the tools to figure out what is wrong here.

In [269]:
#encoder = ce.OneHotEncoder(use_cat_names=True)
#X_train = encoder.fit_transform(X_train)


In [270]:
#X_train.head()

In [271]:
#y_train.head()

In [272]:
#X_val = encoder.fit_transform(X_val)
#y_val =  encoder.fit_transform(y_val)

In [273]:
#X_val.head()

In [274]:
#X_test = encoder.transform(X_test)

In [275]:
#X_test.head()

In [276]:
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

In [277]:
pipeline.fit(X_train, y_train)
print('Random Forest')
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

Random Forest
Train Accuracy 1.0


TypeError: '<' not supported between instances of 'float' and 'str'

In [None]:
#X_val.columns

In [None]:
#X_train.columns

In [None]:
#pipeline.fit(X_train, y_train)
#print('Train Accuracy', pipeline.score(X_train, y_train))
#print('Validation Accuracy', pipeline.score(X_val, y_val))

## 6.Get your model's validation accuracy

> (Multiple times if you try multiple iterations.)

In [None]:


pipeline = make_pipeline(
    ce.TargetEncoder(min_samples_leaf=1, smoothing=1), 
    SimpleImputer(strategy='median'), 
    RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
)

k = 3
scores = cross_val_score(pipeline, X_train, y_train, cv=k, 
                         scoring='neg_mean_absolute_error')
print(f'MAE for {k} folds:', -scores)

## 7. Get your model's test accuracy

> (One time, at the end.)

## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

### Calculate accuracy 

In [None]:
(85+36)/(85+36+58+8)

### Calculate precision

In [None]:
36/(58+36)

### Calculate recall

In [None]:
36/(8+36)