# Kobe Bryant Shot Selection

Kobe Bryant marked his retirement from the NBA by scoring 60 points in his final game as a Los Angeles Laker on Wednesday, April 13, 2016. Drafted into the NBA at the age of 17, Kobe earned the sport’s highest accolades throughout his long career.

Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?

---
# Analysis and Conclusion
---

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# 0. Read In Data

In [2]:
m1_df = pd.read_csv('../data/m1_summary.csv', index_col='Unnamed: 0')
m1_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
NN - Dataset 1,60.0,59.92,26.61,61.83,86.76
NN - Dataset 2,59.5,59.44,25.64,60.79,86.68
NN - Dataset 3,61.0,61.28,32.4,62.81,84.54
NN - Dataset 4,60.5,60.65,33.45,60.73,82.57
NN - Dataset 5,60.5,60.65,33.45,60.73,82.57
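
The columns in these summary tables are standard confusion-matrix metrics. The following is a minimal sketch of how such values are typically derived. It assumes a made shot is the positive class, which the notebook does not state, and the function name `summary_metrics` and the counts are made up purely for illustration.

```python
def summary_metrics(tp, fp, tn, fn):
    """Return accuracy, recall, precision and specificity as percentages."""
    total = tp + fp + tn + fn
    return {
        "Accuracy": round(100 * (tp + tn) / total, 2),
        "Recall": round(100 * tp / (tp + fn), 2),        # share of made shots we caught
        "Precision": round(100 * tp / (tp + fp), 2),     # share of predicted makes that went in
        "Specificity": round(100 * tn / (tn + fp), 2),   # share of misses we caught
    }

# Hypothetical counts, not taken from any model in this notebook
example = summary_metrics(tp=30, fp=20, tn=80, fn=70)
print(example)
```

Under that assumption, a pattern of low recall with high specificity would mean a model predicts "miss" far more often than "make".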


In [3]:
m2_df = pd.read_csv('../data/m2_summary.csv', index_col='Unnamed: 0')
m2_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
KNN - Dataset 1,61.0,60.74,30.59,62.24,85.05
KNN - Dataset 2,60.0,60.0,18.7,69.16,93.28
KNN - Dataset 3,60.5,60.5,32.16,60.82,83.31
KNN - Dataset 4,60.6,60.6,33.55,60.62,82.43
KNN - Dataset 5,60.74,60.74,30.8,62.14,84.88


In [4]:
m3_df = pd.read_csv('../data/m3_summary.csv', index_col='Unnamed: 0')
m3_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
LR - Dataset 1,59.66,59.66,37.67,57.29,77.37
LR - Dataset 2,60.0,60.0,35.82,58.35,79.4
LR - Dataset 3,61.31,61.31,32.4,62.9,84.6
LR - Dataset 4,60.45,60.45,40.32,58.21,76.67
LR - Dataset 5,60.5,60.5,41.96,57.92,75.44


In [5]:
m4_df = pd.read_csv('../data/m4_summary.csv', index_col='Unnamed: 0')
m4_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
RF - Dataset 1,61.0,61.0,32.72,61.91,83.78
RF - Dataset 2,61.0,60.98,32.89,61.8,83.61
RF - Dataset 3,61.15,61.15,34.32,61.62,82.77
RF - Dataset 4,60.62,60.62,33.66,60.58,82.35
RF - Dataset 5,60.74,60.74,30.8,62.14,84.88


In [6]:
m5_df = pd.read_csv('../data/m5_summary.csv', index_col='Unnamed: 0')
m5_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
XG - Dataset 1,61.76,61.76,31.85,62.49,84.6
XG - Dataset 2,61.14,61.14,32.44,62.42,84.26
XG - Dataset 3,61.27,61.27,32.4,62.81,84.54
XG - Dataset 4,60.62,60.62,30.8,62.14,84.88
XG - Dataset 5,60.74,60.74,30.8,62.14,84.88


## Combine into a Single DataFrame

**To check our analysis and compare the models side by side, we combine all of our results into a single dataframe.**

In [7]:
df = pd.concat([m1_df, m2_df, m3_df, m4_df, m5_df])
df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
NN - Dataset 1,60.0,59.92,26.61,61.83,86.76
NN - Dataset 2,59.5,59.44,25.64,60.79,86.68
NN - Dataset 3,61.0,61.28,32.4,62.81,84.54
NN - Dataset 4,60.5,60.65,33.45,60.73,82.57
NN - Dataset 5,60.5,60.65,33.45,60.73,82.57
KNN - Dataset 1,61.0,60.74,30.59,62.24,85.05
KNN - Dataset 2,60.0,60.0,18.7,69.16,93.28
KNN - Dataset 3,60.5,60.5,32.16,60.82,83.31
KNN - Dataset 4,60.6,60.6,33.55,60.62,82.43
KNN - Dataset 5,60.74,60.74,30.8,62.14,84.88


**We then sort that dataframe by accuracy.**

In [8]:
df.sort_values(by='Final Acc', ascending=False, inplace=True)
df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
XG - Dataset 1,61.76,61.76,31.85,62.49,84.6
LR - Dataset 3,61.31,61.31,32.4,62.9,84.6
NN - Dataset 3,61.0,61.28,32.4,62.81,84.54
XG - Dataset 3,61.27,61.27,32.4,62.81,84.54
RF - Dataset 3,61.15,61.15,34.32,61.62,82.77
XG - Dataset 2,61.14,61.14,32.44,62.42,84.26
RF - Dataset 1,61.0,61.0,32.72,61.91,83.78
RF - Dataset 2,61.0,60.98,32.89,61.8,83.61
RF - Dataset 5,60.74,60.74,30.8,62.14,84.88
XG - Dataset 5,60.74,60.74,30.8,62.14,84.88


# Which dataset performed the best?

## Dataset 1

In [15]:
# Keep only the rows whose index label mentions Dataset 1
d1_df = df[df.index.str.contains('Dataset 1', regex=False)]
d1_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
XG - Dataset 1,61.76,61.76,31.85,62.49,84.6
RF - Dataset 1,61.0,61.0,32.72,61.91,83.78
KNN - Dataset 1,61.0,60.74,30.59,62.24,85.05
NN - Dataset 1,60.0,59.92,26.61,61.83,86.76
LR - Dataset 1,59.66,59.66,37.67,57.29,77.37


In [21]:
d1_df.mean()

Avg Acc        60.684
Final Acc      60.616
Recall         31.888
Precision      61.152
Specificity    83.512
dtype: float64

## Dataset 2

In [16]:
# Keep only the rows whose index label mentions Dataset 2
d2_df = df[df.index.str.contains('Dataset 2', regex=False)]
d2_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
XG - Dataset 2,61.14,61.14,32.44,62.42,84.26
RF - Dataset 2,61.0,60.98,32.89,61.8,83.61
LR - Dataset 2,60.0,60.0,35.82,58.35,79.4
KNN - Dataset 2,60.0,60.0,18.7,69.16,93.28
NN - Dataset 2,59.5,59.44,25.64,60.79,86.68


In [22]:
d2_df.mean()

Avg Acc        60.328
Final Acc      60.312
Recall         29.098
Precision      62.504
Specificity    85.446
dtype: float64

## Dataset 3

In [17]:
# Keep only the rows whose index label mentions Dataset 3
d3_df = df[df.index.str.contains('Dataset 3', regex=False)]
d3_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
LR - Dataset 3,61.31,61.31,32.4,62.9,84.6
NN - Dataset 3,61.0,61.28,32.4,62.81,84.54
XG - Dataset 3,61.27,61.27,32.4,62.81,84.54
RF - Dataset 3,61.15,61.15,34.32,61.62,82.77
KNN - Dataset 3,60.5,60.5,32.16,60.82,83.31


In [23]:
d3_df.mean()

Avg Acc        61.046
Final Acc      61.102
Recall         32.736
Precision      62.192
Specificity    83.952
dtype: float64

## Dataset 4

In [18]:
# Keep only the rows whose index label mentions Dataset 4
d4_df = df[df.index.str.contains('Dataset 4', regex=False)]
d4_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
NN - Dataset 4,60.5,60.65,33.45,60.73,82.57
RF - Dataset 4,60.62,60.62,33.66,60.58,82.35
XG - Dataset 4,60.62,60.62,30.8,62.14,84.88
KNN - Dataset 4,60.6,60.6,33.55,60.62,82.43
LR - Dataset 4,60.45,60.45,40.32,58.21,76.67


In [24]:
d4_df.mean()

Avg Acc        60.558
Final Acc      60.588
Recall         34.356
Precision      60.456
Specificity    81.780
dtype: float64

## Dataset 5

In [19]:
# Keep only the rows whose index label mentions Dataset 5
d5_df = df[df.index.str.contains('Dataset 5', regex=False)]
d5_df

Unnamed: 0,Avg Acc,Final Acc,Recall,Precision,Specificity
RF - Dataset 5,60.74,60.74,30.8,62.14,84.88
XG - Dataset 5,60.74,60.74,30.8,62.14,84.88
KNN - Dataset 5,60.74,60.74,30.8,62.14,84.88
NN - Dataset 5,60.5,60.65,33.45,60.73,82.57
LR - Dataset 5,60.5,60.5,41.96,57.92,75.44


In [25]:
d5_df.mean()

Avg Acc        60.644
Final Acc      60.674
Recall         33.562
Precision      61.014
Specificity    82.530
dtype: float64
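
The five per-dataset means can also be computed in one pass by splitting each index label into a model part and a dataset part and grouping on the dataset. The sketch below is one possible way to do that, not the notebook's own method. To stay self-contained it rebuilds only the `Final Acc` column from the values printed in the summary tables; the names `summary`, `labels`, and `by_dataset` are illustrative.

```python
import pandas as pd

# Final Acc values copied from the five summary tables (Datasets 1 to 5 per model)
final_acc = {
    "NN":  [59.92, 59.44, 61.28, 60.65, 60.65],
    "KNN": [60.74, 60.00, 60.50, 60.60, 60.74],
    "LR":  [59.66, 60.00, 61.31, 60.45, 60.50],
    "RF":  [61.00, 60.98, 61.15, 60.62, 60.74],
    "XG":  [61.76, 61.14, 61.27, 60.62, 60.74],
}
rows = {
    f"{model} - Dataset {i}": acc
    for model, accs in final_acc.items()
    for i, acc in enumerate(accs, start=1)
}
summary = pd.DataFrame({"Final Acc": pd.Series(rows)})

# Split labels such as "XG - Dataset 1" into a model column and a dataset column
labels = summary.index.to_series().str.split(" - ", expand=True)
summary["Model"] = labels[0]
summary["Dataset"] = labels[1]

# Mean Final Acc per dataset, best first
by_dataset = (
    summary.groupby("Dataset")["Final Acc"]
    .mean()
    .round(3)
    .sort_values(ascending=False)
)
print(by_dataset)
```

Grouping on `Model` instead of `Dataset` would give the matching per-model averages.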

# Overall Analysis:
1. **Because making a shot under any condition is probabilistic, it follows that our accuracy in predicting a made or missed shot cannot get close to 100%. Kobe is considered one of the greatest shooters and scorers of all time, and even he had no shot that he made 100% of the time.**


2. **We appear to be stuck in a range of 59-62% accuracy. Our best-performing model breaks away from the others slightly at 61.76%, with misclassification scores comparable to those of the other top-performing models.**


3. This is a clear improvement over our null model's 55% accuracy, though honestly I was hoping we could get closer to 80%.


4. On average, Dataset 3 was our best-performing dataset, with a mean final accuracy of 61.10%. Dataset 3 used `combined_shot_type`, `shot_distance`, and `shot_zone_area`. This suggests that our most important predictors were where the shot was taken from and what kind of shot it was. Other features, such as time left, opponent, and period, were less important, as we hypothesized.


5. Despite Dataset 3 performing best on average, **XGBoost on Dataset 1** was our best-performing model. This makes intuitive sense when we consider that, although Dataset 3 contains our most important variables, the other variables still add predictive value.
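
The comparison with the null model in point 3 can be sketched as follows. This assumes the 55% figure is a majority-class baseline (always predict the more common outcome), which the notebook does not spell out. The label array `y` is a hypothetical stand-in, not the actual shot data.

```python
import numpy as np

# Hypothetical labels: 1 = made shot, 0 = missed shot (45 makes, 55 misses)
y = np.array([1] * 45 + [0] * 55)

# A majority-class null model always predicts the most common label
majority_class = np.bincount(y).argmax()
null_accuracy = 100 * np.mean(y == majority_class)

# Best Final Acc reported in the summary tables (XGBoost on Dataset 1)
best_model_accuracy = 61.76
lift = best_model_accuracy - null_accuracy

print(f"Null accuracy: {null_accuracy:.2f}%, lift over null: {lift:.2f} points")
```

Reporting the gain in percentage points over the baseline makes it easier to judge how much the features actually contribute.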