# KOBE ML

#### Importing required libraries.

In [1]:
import random
import pandas as pd
import pandasql as ps
import numpy as np
from tqdm import tqdm

In [2]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, f1_score, log_loss

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, cross_validate

from sklearn.feature_selection import RFECV

import warnings
warnings.filterwarnings('ignore')

Adjusting some pandas settings

In [3]:
pd.options.display.max_columns = 150
pd.options.display.max_colwidth = None

In [4]:
%%time
%run kb_load_process.py
print(kb.shape)
kb.head(6)

(30697, 15)
Wall time: 1.8 s


Unnamed: 0,shot_id,season,month,playoffs,home,period,shot_distance,combined_shot_type,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,angle_bin,shot_made_flag
0,22902,0,0,0,1,1,4,Jump Shot,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,Left-Center,0.0
1,22903,0,0,0,0,2,4,Jump Shot,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,Left-Center,0.0
2,22904,0,0,0,0,2,5,Jump Shot,Jump Shot,3PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,Left-Center,1.0
3,22905,0,0,0,0,2,0,Jump Shot,Jump Shot,3PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,Center,0.0
4,22906,0,0,0,0,2,3,Jump Shot,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,Center,1.0
5,22907,0,0,0,0,2,4,Jump Shot,Jump Shot,2PT Field Goal,Mid-Range,Right Side Center(RC),16-24 ft.,Right-Center,


Since we will run some tests before making a submission, we'll create a copy of the DataFrame excluding rows where the target is null.

In [5]:
kb_nn = kb.dropna().copy().reset_index(drop=True)
print(kb_nn.shape)
kb_nn.head(6)

(25697, 15)


Unnamed: 0,shot_id,season,month,playoffs,home,period,shot_distance,combined_shot_type,action_type,shot_type,shot_zone_basic,shot_zone_area,shot_zone_range,angle_bin,shot_made_flag
0,22902,0,0,0,1,1,4,Jump Shot,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,Left-Center,0.0
1,22903,0,0,0,0,2,4,Jump Shot,Jump Shot,2PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,Left-Center,0.0
2,22904,0,0,0,0,2,5,Jump Shot,Jump Shot,3PT Field Goal,Mid-Range,Left Side Center(LC),16-24 ft.,Left-Center,1.0
3,22905,0,0,0,0,2,0,Jump Shot,Jump Shot,3PT Field Goal,Restricted Area,Center(C),Less Than 8 ft.,Center,0.0
4,22906,0,0,0,0,2,3,Jump Shot,Jump Shot,2PT Field Goal,In The Paint (Non-RA),Center(C),8-16 ft.,Center,1.0
5,22908,0,0,0,0,2,5,Jump Shot,Jump Shot,3PT Field Goal,Mid-Range,Center(C),16-24 ft.,Center,1.0


We'll define a function that takes our DataFrame and returns a new version of it, with numeric features scaled between 0 and 1 and categorical features encoded as dummies.

In [6]:
def column_transform(df):
    
    '''Processes categorical features and normalizes numeric features.
    
    Args:
        df (pd.DataFrame): Dataframe to be transformed into a new dataframe that can be passed to estimators.
        
    Returns: pd.DataFrame, dummified and normalized version of the passed DataFrame
    '''
    
    num_cols = ['season','month', 'period', 'shot_distance']
    cat_columns = ['combined_shot_type', 'action_type', 'shot_type', 'shot_zone_basic', 'shot_zone_area', 'shot_zone_range', 'angle_bin']
    drop_features = ['st_2PT_Field_Goal', 'sza_Back_Court(BC)', 'szr_Back_Court_Shot']
    
    for col in cat_columns:
        try:
            prefix = (''.join(i[0] for i in col.split('_')))
            df = pd.concat([df, pd.get_dummies(df[col], prefix)], axis=1)
            df = df.drop(columns=col)
            df.columns = df.columns.str.replace(' ', '_')
        except:
            pass
    
    scaler = MinMaxScaler()
    try:
        df[num_cols] = scaler.fit_transform(df[num_cols])
    except:
        pass

    for i in drop_features:
        try:
            df=df.drop(columns=i)
        except:
            pass
    return df
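
To make the two transformations easier to follow, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration; only the prefix rule, `pd.get_dummies` and `MinMaxScaler` mirror the function above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, not taken from the Kobe dataset
toy = pd.DataFrame({
    'shot_distance': [0.0, 10.0, 20.0],
    'combined_shot_type': ['Jump Shot', 'Dunk', 'Jump Shot'],
})

col = 'combined_shot_type'
# Same prefix rule as column_transform: first letter of each word in the column name
prefix = ''.join(part[0] for part in col.split('_'))

# One dummy column per category, then drop the original categorical column
dummies = pd.get_dummies(toy[col], prefix=prefix)
toy_t = pd.concat([toy.drop(columns=col), dummies], axis=1)
toy_t.columns = toy_t.columns.str.replace(' ', '_')

# Scale the numeric column into the [0, 1] range
toy_t[['shot_distance']] = MinMaxScaler().fit_transform(toy_t[['shot_distance']])
print(prefix)
print(toy_t)
```

The prefix built here is the same `cst` that appears in the transformed columns of the real dataset.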

In [7]:
%%time
kb_t = column_transform(df = kb_nn)

Wall time: 136 ms


This is what our final validation DataFrame looks like: 25697 rows and 52 columns.

In [8]:
print(kb_t.shape)
kb_t.head()

(25697, 52)


Unnamed: 0,shot_id,season,month,playoffs,home,period,shot_distance,shot_made_flag,cst_Bank_Shot,cst_Dunk,cst_Hook_Shot,cst_Jump_Shot,cst_Layup,cst_Tip_Shot,at_Alley_Oop_Dunk_Shot,at_Driving_Dunk_Shot,at_Driving_Layup_Shot,at_Dunk_Shot,at_Fadeaway_Jump_Shot,at_Jump_Bank_Shot,at_Jump_Shot,at_Layup_Shot,at_Other,at_Pullup_Jump_shot,at_Reverse_Layup_Shot,at_Running_Jump_Shot,at_Slam_Dunk_Shot,at_Tip_Shot,at_Turnaround_Fadeaway_shot,at_Turnaround_Jump_Shot,st_3PT_Field_Goal,szb_Above_the_Break_3,szb_Backcourt,szb_In_The_Paint_(Non-RA),szb_Left_Corner_3,szb_Mid-Range,szb_Restricted_Area,szb_Right_Corner_3,sza_Center(C),sza_Left_Side_Center(LC),sza_Left_Side(L),sza_Right_Side_Center(RC),sza_Right_Side(R),szr_16-24_ft.,szr_24+_ft.,szr_8-16_ft.,szr_Less_Than_8_ft.,ab_Left,ab_Left-Center,ab_Center,ab_Right-Center,ab_Right
0,22902,0.0,0.0,0,1,0.0,0.210526,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0
1,22903,0.0,0.0,0,0,0.25,0.210526,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0
2,22904,0.0,0.0,0,0,0.25,0.263158,1.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0
3,22905,0.0,0.0,0,0,0.25,0.0,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0
4,22906,0.0,0.0,0,0,0.25,0.157895,1.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0


### Baseline score
We'll define our baseline score, the point of reference against which we evaluate each model's performance. It's calculated by simulating a naive classifier, one that predicts the same probability for every shot: the proportion of shots that went in.

In [9]:
print(kb_nn.shape)
kb_nn['shot_made_flag'].value_counts(normalize=True)

(25697, 15)


0.0    0.553839
1.0    0.446161
Name: shot_made_flag, dtype: float64

We can see that 44.6161% of the shots with a known outcome went in. For each of those shots, let's calculate the Log Loss by passing the constant probability 0.446161.

In [10]:
baseline_array = np.full((kb_nn.shape[0]), fill_value=0.446161)
print(baseline_array.shape)

(25697,)


In [11]:
baseline = pd.DataFrame()
baseline['shot_made_flag'] = kb_nn['shot_made_flag']
baseline['shot_proba'] = baseline_array
baseline.head()

Unnamed: 0,shot_made_flag,shot_proba
0,0.0,0.446161
1,0.0,0.446161
2,1.0,0.446161
3,0.0,0.446161
4,1.0,0.446161


In [12]:
print('Log Loss Baseline: ', log_loss(baseline['shot_made_flag'], baseline['shot_proba']))

Log Loss Baseline:  0.687338656221433
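
As a sanity check, the Log Loss of a constant prediction can also be computed in closed form. When every shot gets the probability p and a fraction p of the shots actually went in, the Log Loss reduces to -(p ln p + (1 - p) ln(1 - p)). The sketch below assumes only the proportion printed earlier; the name `baseline_log_loss` is introduced here for illustration.

```python
import numpy as np

p = 0.446161  # proportion of made shots reported above

# Log Loss of always predicting p when the positive rate is also p
baseline_log_loss = -(p * np.log(p) + (1 - p) * np.log(1 - p))
print(round(baseline_log_loss, 4))
```

The result should agree with the value returned by `log_loss` above to about four decimals, since p is rounded to six digits.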


Our goal is to create a classifier that beats this baseline. Any Log Loss equal to or higher than 0.6873 is a bad result.

Now let's get to the validation process.

First we'll create a decorator. It wraps any function we decorate with it and caches each set of arguments together with its result, so that if we execute the same call twice the function doesn't run again: the previously computed result is fetched instead.

In [13]:
from functools import wraps
def memoize(func):
    """Store the results of the decorated function for fast lookup in case it is executed twice with the same arguments
    """
    cache = func.cache = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        key = str(args) + str(kwargs)
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper
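
A small usage sketch of the decorator, repeated here so the block runs on its own. The function `slow_square` and the list `calls` are hypothetical helpers used only to show when the wrapped function really executes.

```python
from functools import wraps

def memoize(func):
    """Same decorator as above, copied so this example is self-contained."""
    cache = func.cache = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        key = str(args) + str(kwargs)
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper

calls = []  # records every real execution of the wrapped function

@memoize
def slow_square(x):
    calls.append(x)
    return x * x

first = slow_square(4)   # executed
second = slow_square(4)  # served from the cache
third = slow_square(5)   # new argument, executed
print(first, second, third, calls)
```

One caveat worth keeping in mind: the cache key is built with `str()`, and the string form of a large DataFrame is truncated, so two different DataFrames with the same shape and the same first and last rows may end up sharing a key.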

We'll use Scikit-Learn's TimeSeriesSplit, since it returns an iterable that we can pass to Scikit-Learn's Recursive Feature Elimination and GridSearch.
The function below returns the mean cross-validation scores on both the train and test sets: Accuracy, F1 and Log Loss, the metric we want to optimize. Because Log Loss is a loss function, the lower it is, the better our model is performing.

In [14]:
tscv = TimeSeriesSplit(n_splits=100 , test_size=250)
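
To see what this splitter produces, here is a minimal sketch on 10 ordered samples with 3 splits of 2 test rows each. The names `X_toy` and `toy_cv` are illustrative, and the `test_size` argument assumes scikit-learn 0.24 or newer.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(10).reshape(-1, 1)  # 10 chronologically ordered samples
toy_cv = TimeSeriesSplit(n_splits=3, test_size=2)

folds = list(toy_cv.split(X_toy))
for train_idx, test_idx in folds:
    # every training index precedes every test index
    print('train:', train_idx, 'test:', test_idx)
```

Each fold trains on everything before its test block, so the training window grows from fold to fold. With 25697 shots, 100 splits and 250 test rows per split, the first fold would train on the earliest 25697 - 100 * 250 = 697 shots.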

In [15]:
@memoize
def ts_cross_val(X, y, model=None, cv_ = tscv, full_score = True):        
    '''Time series validation for data sorted in chronological order to prevent leakage.
    
    Args:
        X (pd.DataFrame): Chronologically sorted dataframe containing our features.
        
        y (pd.Series): Chronologically sorted series containing our target.
        
        model (estimator): Estimator to be used in each train/test validation step.
        
        cv_ (int, cross-validation generator or an iterable, default = tscv): Determines the cross-validation splitting strategy.
        
    Returns: pd.Series containing the mean scores the estimator obtained on both the train and test sets
    '''
    score = cross_validate(estimator = model,\
                       X = X,\
                       y = y,\
                       scoring = ('accuracy', 'f1', 'neg_log_loss'),\
                       cv = cv_,\
                       n_jobs = -1,\
                       return_train_score= True)

    score_df = pd.DataFrame(score)
    score_df = score_df.mean()['test_accuracy':]
    score_df['test_neg_log_loss'] = np.absolute(score_df['test_neg_log_loss'])
    score_df['train_neg_log_loss'] = np.absolute(score_df['train_neg_log_loss'])
    return(score_df)
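
The `np.absolute` calls are there because scikit-learn scorers follow a higher-is-better convention, so Log Loss is reported negated. A minimal sketch on synthetic data (the arrays `X_toy`, `y_toy` and the model `clf` are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

plain = log_loss(y_toy, clf.predict_proba(X_toy))        # positive, lower is better
negated = get_scorer('neg_log_loss')(clf, X_toy, y_toy)  # same value with flipped sign
print(plain, negated)
```

Taking the absolute value of the `neg_log_loss` columns therefore recovers the usual Log Loss, even though the row labels in the printed scores keep the `neg_` prefix.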

In [16]:
features = kb_t.drop(columns=['shot_id', 'shot_made_flag'])
target = kb_t['shot_made_flag']

We have decided to test 3 different ML Algorithms to see how they perform and evolve during the optimization process.
* Multinomial Naive Bayes
* Logistic Regression
* Random Forest

In [17]:
mnb = MultinomialNB()
lrc = LogisticRegression(max_iter=200, solver='sag', warm_start=True, n_jobs=-1)
rfc = RandomForestClassifier(n_jobs=-1)

## Time-Series Cross-Validation

#### Multinomial Naive Bayes

In [18]:
%%time
mnb_score = ts_cross_val(X = features, y = target, model=mnb)
print(mnb_score)

test_accuracy         0.642000
train_accuracy        0.641558
test_f1               0.536745
train_f1              0.550328
test_neg_log_loss     0.803915
train_neg_log_loss    0.826800
dtype: float64
Wall time: 4.77 s


#### Logistic Regression

In [19]:
%%time
lrc_score = ts_cross_val(X = features, y = target, model=lrc)
print(lrc_score)

test_accuracy         0.679760
train_accuracy        0.682336
test_f1               0.564458
train_f1              0.557136
test_neg_log_loss     0.613274
train_neg_log_loss    0.604437
dtype: float64
Wall time: 8.09 s


#### Random Forest

In [20]:
%%time
rfc_score = ts_cross_val(X = features, y = target, model=rfc)
print(rfc_score)

test_accuracy         0.625800
train_accuracy        0.900824
test_f1               0.562566
train_f1              0.887927
test_neg_log_loss     0.748393
train_neg_log_loss    0.263449
dtype: float64
Wall time: 41.6 s


#### Summary

* All classifiers reached a test F1 score between 0.53 and 0.57, which tells us that the models are not predicting just one class.

* Random Forest overfits: its train scores (accuracy 0.90) are far better than its test scores (accuracy 0.63), and its test Log Loss (0.748) is above our baseline, which makes it a poor estimator so far.

* Multinomial Naive Bayes obtained more balanced train/test scores on all metrics, but its test Log Loss (0.804) is more than 0.11 above our baseline, which is an awful result.

* Logistic Regression performed better than the rest, achieving the lowest test Log Loss (0.613), which can still be improved.

Let's see if we can improve these scores.

## Feature Selection
### Recursive Feature Elimination Cross-Validation

So far we trained our estimators using every feature in our data. This isn't optimal: it takes longer to execute, and some of the features aren't really that useful.

We'll use a technique called Recursive Feature Elimination, which removes the weakest features, as ranked by the coefficients given by the classifier, until we reach an optimal combination. We'll get the best features for each of the estimators separately.

We'll create a function that takes a DataFrame and a model and uses RFECV to return the combination of features that minimizes Log Loss.
Notice that the function is decorated, so if we run it a second time with the same parameters it will just look up the previously cached results.

In [21]:
@memoize
def rfe_tscv(X, y, estimator, cv_ = tscv, step_=1):
    
    '''Feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
    
    Args:
        
        X (pd.DataFrame): Chronologically sorted dataframe containing our features.
        
        y (pd.Series): Chronologically sorted series containing our target.
        
        estimator (estimator): Estimator to be used in each train/test validation step.
        
        cv_ (int, cross-validation generator or an iterable, default = tscv): Determines the cross-validation splitting strategy.
        
        step_ (int or float): If int, determines the number of features removed at each iteration. 
        If float, determines the proportion of features to be removed at each iteration.
    
    Returns: list of strings containing the names of the best combination of features selected by RFECV.
    '''
    
    fs = RFECV(estimator, step=step_, cv = cv_, n_jobs=-1, scoring = 'neg_log_loss')
    fs.fit(X, y)
    
    best_features = list(X.columns[fs.support_])
    return best_features
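
The same selection can be tried on synthetic data to see what RFECV exposes after fitting. This is only a sketch: the dataset comes from `make_classification`, and the names `X_toy`, `selector` and `selected` are illustrative, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data: 8 features, only 3 of them informative
X_arr, y_arr = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

selector = RFECV(LogisticRegression(max_iter=500), step=1,
                 cv=TimeSeriesSplit(n_splits=3), scoring='neg_log_loss')
selector.fit(X_toy, y_arr)

# support_ is a boolean mask over the columns, ranking_ is 1 for every kept feature
selected = list(X_toy.columns[selector.support_])
print(selected)
print(selector.ranking_)
```

`support_` is the mask that `rfe_tscv` uses to pick column names, and `ranking_` shows the order in which the discarded features were eliminated.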

In [22]:
%%time
mnb_cols = rfe_tscv(features, target, mnb)
print(kb_t[mnb_cols].shape)
kb_t[mnb_cols].head(1)

(25697, 28)
Wall time: 10.9 s


Unnamed: 0,cst_Bank_Shot,cst_Dunk,cst_Hook_Shot,cst_Tip_Shot,at_Alley_Oop_Dunk_Shot,at_Driving_Dunk_Shot,at_Driving_Layup_Shot,at_Dunk_Shot,at_Fadeaway_Jump_Shot,at_Jump_Bank_Shot,at_Layup_Shot,at_Other,at_Pullup_Jump_shot,at_Reverse_Layup_Shot,at_Running_Jump_Shot,at_Slam_Dunk_Shot,at_Tip_Shot,at_Turnaround_Fadeaway_shot,at_Turnaround_Jump_Shot,szb_Above_the_Break_3,szb_Backcourt,szb_Left_Corner_3,szb_Right_Corner_3,sza_Left_Side_Center(LC),sza_Left_Side(L),sza_Right_Side_Center(RC),sza_Right_Side(R),ab_Left
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [23]:
%%time
lrc_cols = rfe_tscv(features, target, lrc, step_=1)
print(kb_t[lrc_cols].shape)
kb_t[lrc_cols].head(1)

(25697, 36)
Wall time: 3min 24s


Unnamed: 0,season,period,shot_distance,cst_Bank_Shot,cst_Dunk,cst_Hook_Shot,cst_Jump_Shot,cst_Tip_Shot,at_Alley_Oop_Dunk_Shot,at_Driving_Dunk_Shot,at_Driving_Layup_Shot,at_Dunk_Shot,at_Fadeaway_Jump_Shot,at_Jump_Bank_Shot,at_Jump_Shot,at_Layup_Shot,at_Other,at_Pullup_Jump_shot,at_Reverse_Layup_Shot,at_Running_Jump_Shot,at_Slam_Dunk_Shot,at_Tip_Shot,szb_Above_the_Break_3,szb_Backcourt,szb_In_The_Paint_(Non-RA),szb_Left_Corner_3,szb_Mid-Range,szb_Restricted_Area,sza_Center(C),sza_Left_Side_Center(LC),sza_Left_Side(L),sza_Right_Side_Center(RC),sza_Right_Side(R),szr_16-24_ft.,szr_24+_ft.,szr_8-16_ft.
0,0.0,0.0,0.210526,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0


In the cells above you can see that each selected set (28 features for Naive Bayes, 36 for Logistic Regression) is smaller than the 50 features we had after scaling and getting dummies.
Now let's repeat the same cross-validation process using only the selected features for each estimator.

### ALL FEATURES VS SELECTED FEATURES

#### Multinomial Naive Bayes

In [24]:
print('- With all features: ')
print(mnb_score)

print(f'\n- With {len(mnb_cols)} best features')
mnb_fs = ts_cross_val(features[mnb_cols], target, model=mnb)
print(mnb_fs)

- With all features: 
test_accuracy         0.642000
train_accuracy        0.641558
test_f1               0.536745
train_f1              0.550328
test_neg_log_loss     0.803915
train_neg_log_loss    0.826800
dtype: float64

- With 28 best features
test_accuracy         0.678960
train_accuracy        0.680120
test_f1               0.547707
train_f1              0.545345
test_neg_log_loss     0.628181
train_neg_log_loss    0.625342
dtype: float64


#### Logistic Regression

In [25]:
print('- With all features: ')
print(lrc_score)

print(f'\n- With {len(lrc_cols)} best features')
lrc_fs = ts_cross_val(features[lrc_cols], target, model=lrc)
print(lrc_fs)

- With all features: 
test_accuracy         0.679760
train_accuracy        0.682336
test_f1               0.564458
train_f1              0.557136
test_neg_log_loss     0.613274
train_neg_log_loss    0.604437
dtype: float64

- With 36 best features
test_accuracy         0.680040
train_accuracy        0.682286
test_f1               0.564858
train_f1              0.557069
test_neg_log_loss     0.612483
train_neg_log_loss    0.605008
dtype: float64


As we can see, Logistic Regression improved only marginally after removing the less relevant features (test Log Loss from 0.6133 to 0.6125), while Naive Bayes reduced its test Log Loss from 0.80 to 0.63, which is a strong improvement.

#### Random Forest 

The feature selection process will be skipped for Random Forest because it takes too long while not improving performance significantly. Instead, this estimator will go directly to the hyperparameter optimization stage.

## Hyperparameter Tuning

We've reached the last optimization step. So far we have used either the default parameters or a set of parameters that lets us run each step faster without sacrificing too much performance.

We will use a new function which returns a dictionary with the best parameters for the passed estimator. This function will be decorated too. 

In [26]:
@memoize
def grid_cv (X, y, model, params, cv_ = tscv):
    
    '''Exhaustively searches every combination in a parameter grid for a given estimator, using time series cross-validation to prevent leakage.
    
    Args:
        X (pd.DataFrame): Chronologically sorted dataframe containing our features.
        
        y (pd.Series): Chronologically sorted series containing our target.
        
        model (estimator): Estimator to be used in each train/test validation step.
        
        params (dict): Dictionary where each key is the name of a parameter of the given estimator 
        and each value is the list of values to try for that parameter.
        
        cv_ (int, cross-validation generator or an iterable, default = tscv): Determines the cross-validation splitting strategy.
        
    Returns: dict with the combination of parameters that achieved the lowest mean test Log Loss
    '''
    
    relevant_cols=['mean_test_neg_log_loss', 'mean_test_accuracy', 'mean_test_recall', 'mean_test_precision', 'mean_test_f1', 'params']
    
    grid = GridSearchCV(estimator=model,
                        param_grid=params,
                        scoring=('accuracy', 'recall', 'precision', 'f1', 'neg_log_loss'),
                        n_jobs = -1,
                        cv = cv_,
                        refit='neg_log_loss',
                        verbose=1)
    grid.fit(X,y)

    results = pd.DataFrame(grid.cv_results_)
    # neg_log_loss is negated, so the highest value corresponds to the lowest Log Loss
    best_result = results[relevant_cols].sort_values(by='mean_test_neg_log_loss', ascending=False).iloc[0]

    # scores of the best candidate, kept for inspection
    scores = {'Accuracy': best_result['mean_test_accuracy'],
              'Recall': best_result['mean_test_recall'],
              'Precision': best_result['mean_test_precision'],
              'F1': best_result['mean_test_f1'],
              'LogLoss': -best_result['mean_test_neg_log_loss']}

    best_params = best_result['params']

    return best_params

Below you can see the parameter grid for each estimator. Some of these were previously explored with more values, but were reduced to the sets below to cut execution time.

In [27]:
mnb_grid = {'alpha': [1.5, 3.5, 5.5, 7.5],\
            'fit_prior': [True, False]}

lrc_grid = {'fit_intercept': [True, False],\
            'penalty': ['l2'],\
            'solver': ['sag', 'saga', 'newton-cg'],\
            'C': [0.4, 0.5, 0.6, 0.7, 0.8],\
            'max_iter': [300],\
            'warm_start': [True],\
            'n_jobs': [-1]}

rfc_grid = {'n_estimators': [180],\
            'criterion': ['gini'],\
            'max_depth': [5, 7],\
            'min_samples_split': [5, 7],\
            'min_samples_leaf': [10, 12],\
            'bootstrap': [True],\
            'max_features': [0.75],\
            'max_samples': [0.5],\
            'n_jobs': [-1]}
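
Before launching a search it helps to know how many candidates each grid expands to, since every candidate is fitted once per fold. A quick sketch using scikit-learn's `ParameterGrid`, with the three grids copied from the cell above (the dictionary `sizes` is introduced here for illustration):

```python
from sklearn.model_selection import ParameterGrid

mnb_grid = {'alpha': [1.5, 3.5, 5.5, 7.5],
            'fit_prior': [True, False]}

lrc_grid = {'fit_intercept': [True, False],
            'penalty': ['l2'],
            'solver': ['sag', 'saga', 'newton-cg'],
            'C': [0.4, 0.5, 0.6, 0.7, 0.8],
            'max_iter': [300],
            'warm_start': [True],
            'n_jobs': [-1]}

rfc_grid = {'n_estimators': [180],
            'criterion': ['gini'],
            'max_depth': [5, 7],
            'min_samples_split': [5, 7],
            'min_samples_leaf': [10, 12],
            'bootstrap': [True],
            'max_features': [0.75],
            'max_samples': [0.5],
            'n_jobs': [-1]}

# number of parameter combinations per grid
sizes = {name: len(ParameterGrid(grid))
         for name, grid in [('mnb', mnb_grid), ('lrc', lrc_grid), ('rfc', rfc_grid)]}
print(sizes)
```

The candidate count is simply the product of the list lengths in each grid. Multiplied by the 100 folds of `tscv`, it gives the total number of fits reported in the GridSearchCV logs.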

#### Multinomial Naive Bayes

In [28]:
%%time
print('- With all features: ')
print(mnb_score)

print(f'\n- With {len(mnb_cols)} best features')
print(mnb_fs)

print('\n- With optimized hyper-parameters')
mnb_params = grid_cv(features[mnb_cols], target, MultinomialNB(), mnb_grid)
mnb_opt = MultinomialNB(**mnb_params)

mnb_final = ts_cross_val(features[mnb_cols], target, model = mnb_opt)
print(mnb_final)

print('\n- Best parameters', mnb_params)

- With all features: 
test_accuracy         0.642000
train_accuracy        0.641558
test_f1               0.536745
train_f1              0.550328
test_neg_log_loss     0.803915
train_neg_log_loss    0.826800
dtype: float64

- With 28 best features
test_accuracy         0.678960
train_accuracy        0.680120
test_f1               0.547707
train_f1              0.545345
test_neg_log_loss     0.628181
train_neg_log_loss    0.625342
dtype: float64

- With optimized hyper-parameters
Fitting 100 folds for each of 8 candidates, totalling 800 fits
test_accuracy         0.676960
train_accuracy        0.679787
test_f1               0.541681
train_f1              0.544218
test_neg_log_loss     0.627411
train_neg_log_loss    0.624230
dtype: float64

- Best parameters {'alpha': 7.5, 'fit_prior': True}
Wall time: 3.93 s


Tuning the hyperparameters barely changes the result (test Log Loss from 0.6282 to 0.6274), but compared to the first test with default parameters and all features (0.8039) the difference is large.

#### Logistic Regression

In [29]:
%%time
print('- With all features: ')
print(lrc_score)

print(f'\n- With {len(lrc_cols)} best features')
print(lrc_fs)

print('\n- With optimized hyper-parameters')
lrc_params = grid_cv(features[lrc_cols], target, LogisticRegression(), lrc_grid)
lrc_opt = LogisticRegression(**lrc_params)

lrc_final = ts_cross_val(features[lrc_cols], target, model = lrc_opt)
print(lrc_final)

print('\n- Best parameters', lrc_params)

- With all features: 
test_accuracy         0.679760
train_accuracy        0.682336
test_f1               0.564458
train_f1              0.557136
test_neg_log_loss     0.613274
train_neg_log_loss    0.604437
dtype: float64

- With 36 best features
test_accuracy         0.680040
train_accuracy        0.682286
test_f1               0.564858
train_f1              0.557069
test_neg_log_loss     0.612483
train_neg_log_loss    0.605008
dtype: float64

- With optimized hyper-parameters
Fitting 100 folds for each of 30 candidates, totalling 3000 fits
test_accuracy         0.680160
train_accuracy        0.682326
test_f1               0.565123
train_f1              0.557215
test_neg_log_loss     0.612406
train_neg_log_loss    0.605236
dtype: float64

- Best parameters {'C': 0.6, 'fit_intercept': False, 'max_iter': 300, 'n_jobs': -1, 'penalty': 'l2', 'solver': 'sag', 'warm_start': True}
Wall time: 2min 53s


Just like in the previous step, Logistic Regression improves only slightly (test Log Loss 0.6124), but the result is still better than the one obtained in the first validation step with all features and the initial combination of parameters (0.6133).

#### Random Forest 

In [30]:
%%time
print('- With all features: ')
print(rfc_score)

print('\n- With optimized hyper-parameters')
rfc_params = grid_cv(features, target, RandomForestClassifier(), rfc_grid)
rfc_opt = RandomForestClassifier(**rfc_params)

rfc_final = ts_cross_val(features, target, model = rfc_opt)
print(rfc_final)

print('\n- Best parameters', rfc_params)

- With all features: 
test_accuracy         0.625800
train_accuracy        0.900824
test_f1               0.562566
train_f1              0.887927
test_neg_log_loss     0.748393
train_neg_log_loss    0.263449
dtype: float64

- With optimized hyper-parameters
Fitting 100 folds for each of 8 candidates, totalling 800 fits
test_accuracy         0.679400
train_accuracy        0.685326
test_f1               0.569833
train_f1              0.566678
test_neg_log_loss     0.608685
train_neg_log_loss    0.586366
dtype: float64

- Best parameters {'bootstrap': True, 'criterion': 'gini', 'max_depth': 7, 'max_features': 0.75, 'max_samples': 0.5, 'min_samples_leaf': 12, 'min_samples_split': 5, 'n_estimators': 180, 'n_jobs': -1}
Wall time: 6min 13s


Just by tuning its hyperparameters, this estimator went from overfitting to being the best-performing estimator, with a test Log Loss of 0.6087.

## Submissions
Now for the last validation step we're going to use a slightly different strategy. For each shot to be predicted we're going to train our model using all the previous shots. We'll do so using a custom function which allows us to pass an iterable of indices and create a DataFrame ready to be written into a csv and then submitted.

Let's say we're predicting the shot at index 35: we'll slice the DataFrame from index 0 to 35, inclusive.

In [31]:
shot = 35 #shot_index
rfc = RandomForestClassifier(n_jobs=-1) #creating an estimator

df = kb_t.iloc[:shot+1, :].copy() #slice dataframe until the shot being predicted
print(df.shape)
df.tail()

(36, 52)


Unnamed: 0,shot_id,season,month,playoffs,home,period,shot_distance,shot_made_flag,cst_Bank_Shot,cst_Dunk,cst_Hook_Shot,cst_Jump_Shot,cst_Layup,cst_Tip_Shot,at_Alley_Oop_Dunk_Shot,at_Driving_Dunk_Shot,at_Driving_Layup_Shot,at_Dunk_Shot,at_Fadeaway_Jump_Shot,at_Jump_Bank_Shot,at_Jump_Shot,at_Layup_Shot,at_Other,at_Pullup_Jump_shot,at_Reverse_Layup_Shot,at_Running_Jump_Shot,at_Slam_Dunk_Shot,at_Tip_Shot,at_Turnaround_Fadeaway_shot,at_Turnaround_Jump_Shot,st_3PT_Field_Goal,szb_Above_the_Break_3,szb_Backcourt,szb_In_The_Paint_(Non-RA),szb_Left_Corner_3,szb_Mid-Range,szb_Restricted_Area,szb_Right_Corner_3,sza_Center(C),sza_Left_Side_Center(LC),sza_Left_Side(L),sza_Right_Side_Center(RC),sza_Right_Side(R),szr_16-24_ft.,szr_24+_ft.,szr_8-16_ft.,szr_Less_Than_8_ft.,ab_Left,ab_Left-Center,ab_Center,ab_Right-Center,ab_Right
31,22941,0.0,0.0,0,0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0
32,22942,0.0,0.0,0,0,0.0,0.368421,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0
33,22943,0.0,0.0,0,0,0.25,0.315789,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0
34,22944,0.0,0.0,0,0,0.5,0.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0
35,22945,0.0,0.0,0,0,0.5,0.263158,1.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0


We'll split the current sliced Dataframe into X (our features), y (our target), and we'll keep last shot's id so that we can link our predictions to the correct shot.

In [32]:
shot_id = df.iloc[-1]['shot_id'] # fetch the shot_id

# split features and target
X = df.drop(columns=['shot_id', 'shot_made_flag']) 
y = df['shot_made_flag']

print('shot_id: ', shot_id)
X.shape, y.shape

shot_id:  22945.0


((36, 50), (36,))

We'll split the sliced features and target into train and test sets, train our model on all of the samples but the last shot, and predict the latter, the shot at index 35.

In [33]:
#split features in train and test sets
X_train = np.array(X.iloc[:-1])
X_test = np.array(X.iloc[-1]).reshape(1, -1) #reshaping is necessary because we're predicting on a single sample

#split our target in train and test sets
y_train = np.array(y.iloc[:-1])
y_test = y.iloc[-1]

print('X_train: ', X_train.shape, '\n',
      'X_test: ', X_test.shape, '\n',
      'y_train: ', y_train.shape, '\n', 
      'y_test: ', y_test)

X_train:  (35, 50) 
 X_test:  (1, 50) 
 y_train:  (35,) 
 y_test:  1.0


Fitting our estimator using X_train and y_train

In [34]:
rfc.fit(X_train, y_train)

RandomForestClassifier(n_jobs=-1)

Predicting X_test

In [35]:
y_p = rfc.predict(X_test)

Predicting the probability of each class for X_test. We'll only use the probability of the shot going in (class 1).

In [36]:
y_prob = rfc.predict_proba(X_test)

In [37]:
print('y_true: ', y_test, ' / y_pred: ', y_p[0], ' / y_probabilities (0, 1): ', y_prob, ' / y_probability of being 1: ', y_prob[0][1])

y_true:  1.0  / y_pred:  1.0  / y_probabilities (0, 1):  [[0.0325 0.9675]]  / y_probability of being 1:  0.9675


##### The function below wraps the steps seen above to split, train and predict multiple times. It takes an iterable of indices and predicts every one of those shots.

In [38]:
@memoize
def predict(df_r, shots, model):
    
    '''Time Series predictor, slices a chronologically ordered dataframe up to the sample being predicted and makes prediction on it using only data that happened before said sample.
    
    Args:
        
        df_r (pd.Dataframe): Chronologically sorted dataframe containing our features.
        
        shots (iterable): Iterable collection of ints corresponding to the index of the shot being predicted.
        
        model (estimator): Estimator to be used in each train/test validation step.
        
    Returns: pd.DataFrame containing the id of the sample being predicted and the probability of corresponding to the class 1, determined by the estimator.
    '''
    
    shot_ids = []
    y_proba =  [] 

#    the passed estimator instance is reused and refit for every shot
    
    for shot in tqdm(shots):

#       creating a copy of the dataframe up to the row to be predicted
        df = df_r.iloc[:shot+1, :].copy()
        shot_id = df.iloc[-1]['shot_id']

        df = df.loc[(df['shot_made_flag'].notnull()) | (df['shot_id'] == shot_id)] #dropping null values except the one being predicted

        X = df.drop(columns=['shot_id', 'shot_made_flag'])
        y = df['shot_made_flag']

        X_train = np.array(X.iloc[:-1])
        X_test = np.array(X.iloc[-1]).reshape(1, -1) #reshaping is necessary because we're predicting on a single sample

        y_train = np.array(y.iloc[:-1])

#        training with training data and predicting on the test row
        model.fit(X_train, y_train)
        y_prob = model.predict_proba(X_test)

        shot_ids.append(shot_id)
        y_proba.append(y_prob[0][1])

    result_df = pd.DataFrame()
    result_df['shot_id'] = shot_ids
    result_df['shot_made_flag'] = y_proba
    
    result_df['shot_id'] = result_df['shot_id'].astype('int32')
    
    return result_df
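
The same expanding-window idea can be tried on a toy frame. Everything in this sketch is made up for illustration: the 12-row frame `toy`, its single feature, and the use of a plain `LogisticRegression` instead of the tuned estimators.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical chronologically ordered shots; NaN marks the outcomes to predict
toy = pd.DataFrame({
    'shot_id': range(100, 112),
    'shot_distance': [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75, 0.05, 0.95],
    'shot_made_flag': [1, 0, 1, 0, 1, np.nan, 1, 0, np.nan, 0, 1, np.nan],
})
to_predict = toy.index[toy['shot_made_flag'].isna()]

rows = []
for idx in to_predict:
    past = toy.iloc[:idx]                        # strictly earlier shots
    past = past[past['shot_made_flag'].notna()]  # keep only the labelled ones
    model = LogisticRegression()
    model.fit(past[['shot_distance']], past['shot_made_flag'])
    # probability of class 1 (shot made) for the single row being predicted
    proba = model.predict_proba(toy.loc[[idx], ['shot_distance']])[0, 1]
    rows.append({'shot_id': int(toy.loc[idx, 'shot_id']), 'shot_made_flag': proba})

toy_submission = pd.DataFrame(rows)
print(toy_submission)
```

Each prediction only sees shots that happened before it, which is the property the function above enforces. The cost is one full fit per predicted shot.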

Now that we have validated and improved the performance of several estimators, we have to process the whole dataset and predict the shots where the shot_made_flag value is null.

In [39]:
kb_p = column_transform(df = kb)

In [40]:
def sub_csv(df, filename):
    '''Writes a dataframe into a csv file inside the submissions/ folder.
    
    Args:
        
        df (pd.DataFrame) Dataframe to be written into a csv file
        
        filename (string) Name of the csv file to be generated
    
    Returns: None
    '''
    df.to_csv('submissions/' + filename, index = False)

Let's create an array containing the index of those shots we don't know the outcome of.
Based on the table above the first two values should be 5 and 7.

In [41]:
kb_p.loc[kb_p['shot_made_flag'].isna(), 'shot_id'].head()

5     22907
7     22909
24    22926
25    22927
28    22930
Name: shot_id, dtype: int64

In [42]:
missing = kb_p.loc[kb_p['shot_made_flag'].isna()].index
print(missing)

Int64Index([    5,     7,    24,    25,    28,    32,    35,    38,    46,
               49,
            ...
            30636, 30642, 30643, 30657, 30670, 30672, 30683, 30687, 30693,
            30695],
           dtype='int64', length=5000)


For our function to catch the shot_id and attach each prediction to the correct shot we'll add both our target and shot_id to the best features.

In [43]:
mnb_cols.insert(0, 'shot_id')
mnb_cols.append('shot_made_flag')

lrc_cols.insert(0, 'shot_id')
lrc_cols.append('shot_made_flag')

In [44]:
mnb_submission = predict(kb_p[mnb_cols], missing, mnb_opt)
mnb_submission.head()

100%|██████████| 5000/5000 [00:41<00:00, 119.77it/s]


Unnamed: 0,shot_id,shot_made_flag
0,22907,0.401135
1,22909,0.501182
2,22926,0.353543
3,22927,0.422495
4,22930,0.375


In [45]:
sub_csv(mnb_submission, 'mnb_submission.csv')

In [46]:
lrc_submission = predict(kb_p[lrc_cols], missing, lrc_opt)
lrc_submission.head()

100%|██████████| 5000/5000 [13:03<00:00,  6.38it/s]


Unnamed: 0,shot_id,shot_made_flag
0,22907,0.411644
1,22909,0.486095
2,22926,0.338657
3,22927,0.416655
4,22930,0.563882


In [47]:
sub_csv(lrc_submission, 'lrc_submission.csv')

In [48]:
rfc_submission = predict(kb_p, missing, rfc_opt)
print(rfc_submission.shape)
rfc_submission.head()

100%|██████████| 5000/5000 [46:38<00:00,  1.79it/s]

(5000, 2)





Unnamed: 0,shot_id,shot_made_flag
0,22907,0.452778
1,22909,0.525926
2,22926,0.357071
3,22927,0.357576
4,22930,0.375463


In [49]:
sub_csv(rfc_submission, 'rfc_submission.csv')

Let's submit our predictions.

<img src="media/kobe_jumpshot.gif" style="width:552px;height:331px"/>

## Submission Results

The leaderboard consists of 1117 submissions.

### Multinomial Naive Bayes
<img src="media/sub_mnb.JPG"/>
Our Multinomial Naive Bayes estimator achieved a Log Loss of 0.62508, which is not bad considering that our validation resulted in a higher loss (0.6274).

This would have placed 697th.

### Logistic Regression
<img src="media/sub_lrc.JPG"/>

Our Logistic Regression estimator achieved a Log Loss of 0.61192, slightly better than our validation (0.6124).

This would have placed 567th.

### Random Forest
<img src="media/sub_rfc.JPG"/>
This is our best submission. It obtained a Log Loss of 0.60751 and would have placed 423rd.

There's still room for improvement, and soon we'll enhance this project by extracting new features and using different estimators.

**Kobe Bryant** was one of the greatest to ever play this game:

 * 5 NBA Championships (2000, 2001, 2002, 2009, 2010)
 * 2 NBA Finals MVP (2009, 2010)
 * 2 Olympic Gold Medals (2008, 2012) with the US National Team
 * 17 Consecutive NBA All-Star selections
 * 1997 Slam Dunk Champion
 * 81 points vs Raptors (2nd most points scored in an NBA game)
 * 20 seasons with Los Angeles Lakers (Most seasons with one NBA team)
 * 12 All-Defensive Team Selections
 * 12 Three-pointers vs Seattle (Tied in 3rd place with Stephen Curry and Donyell Marshall)

  <img src="media/kobe_champ.gif"/>

A whole generation grew up shouting 'KOBE' when throwing something, whether it was an actual ball into a hoop or a piece of paper into a trash bin.

That's legacy.

Let's keep the tradition alive.