## How to Make a Shot Quality Model

### What is a Shot Quality Model

In general, a shot quality model's goal is to predict how many points a player will score on a given shot, i.e. points per shot (PPS). There are three kinds of shots a player can take: free throws (FT), two-point field goals (FG2), and three-point field goals (FG3). You could make three separate models or one model that works for all three kinds of shots. One nice and/or challenging aspect of these models is that your predictions are more or less entirely bounded between 0 and 3. If your PPS model is predicting 7, there might be an issue.
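
To make the 0-to-3 bound concrete, here is a minimal sketch of the arithmetic behind PPS: the chance of making the shot times what the shot is worth. The function name `expected_points` and the make probabilities are made up for illustration and are not taken from the data set.

```python
def expected_points(make_prob: float, shot_value: int) -> float:
    """Expected points for one attempt: P(make) times the shot's value."""
    if not 0.0 <= make_prob <= 1.0:
        raise ValueError("make_prob must be a probability in [0, 1]")
    if shot_value not in (1, 2, 3):
        raise ValueError("shot_value must be 1 (FT), 2 (FG2), or 3 (FG3)")
    return make_prob * shot_value

# Illustrative make probabilities only (hypothetical, not from shots.csv).
examples = {"FT": (0.75, 1), "FG2": (0.50, 2), "FG3": (0.36, 3)}
pps = {name: expected_points(p, v) for name, (p, v) in examples.items()}
print(pps)
```

Since the probability can't exceed 1 and the shot value can't exceed 3, nothing this function returns can leave the 0-to-3 range, which is the same sanity check you'd want to run on a model's predictions.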

One well-documented public example of a good shot quality model is KOBE, developed by [Krishna Narsu](http://twitter.com/knarsu3) and [published at Nylon Calculus](https://fansided.com/2015/09/28/introducing-kobe-a-measure-of-shot-quality/). The model we develop here will not be as good, but will also not be named after a Lakers player, so that's a plus. In fact, we can go ahead and name ours **EMBIID**, or **E**xtremely **M**ediocre to **B**ad **I**ntroductory scor**I**ng mo**D**el.

We're going to be using shot data derived exclusively from play-by-play data - no tracking or demographic info here. The shots are 100,000 random field goal attempts from the 2018 and 2019 regular seasons.

### Get the Data Set Up

We're going to read in the data first and just sanity check that it looks like shot data. Then, as this is spatial data, we're going to visualize some shots. I cannot emphasize enough how important it is to be 10000% confident in your coordinate system when you use spatial data. Spending 8 hours modeling and then realizing you need to change the coordinate reference system (CRS) or do a bunch of rotations is not a good feeling.
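
One cheap way to build that confidence, alongside the plot, is to check that the coordinates agree with the recorded distance. The sketch below assumes the hoop sits at the origin, that `x`, `y`, and `shot_distance` are all in feet, and that distances are rounded to whole feet; those are assumptions about the file, not facts confirmed here. The helper `distance_mismatch` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def distance_mismatch(df: pd.DataFrame, tol_ft: float = 1.0) -> pd.Series:
    """Flag rows where hypot(x, y) and shot_distance differ by more than tol_ft."""
    implied = np.hypot(df["x"], df["y"])  # distance from an assumed hoop at (0, 0)
    return (implied - df["shot_distance"]).abs() > tol_ft

# Toy rows; the last one mimics coordinates stored in tenths of a foot.
toy = pd.DataFrame({
    "x": [0.0, 3.0, -22.0, 240.0],
    "y": [1.0, 4.0, 0.0, 10.0],
    "shot_distance": [1, 5, 22, 24],
})
bad = distance_mismatch(toy)
print(toy[bad])
```

If a large share of real rows get flagged, that usually points to a units or origin problem rather than thousands of bad shots.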

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

all_shots = pd.read_csv("data/shots.csv")
print(all_shots.head())

f, ax = plt.subplots(figsize=(8, 6))
#plt.axis('equal')
plt.xlim(-25, 25)
plt.ylim(-4, 90)
plt.title("Shots!")
sns.scatterplot(
    data=all_shots, 
    x="x", 
    y="y", 
    hue="shot_type",
    alpha = 0.25)

Well, that certainly looks correct or close enough. However, there are quite a few heaves that we probably want to take out of the model since we probably(?) don't care about those. Let's limit it to shots 32 feet and closer (the top line of the little court). We only lose about 600 shots by doing this, so that shouldn't be an issue.

In [None]:
print(all_shots.shape)

shots = all_shots.loc[all_shots['shot_distance'] <= 32]

print(shots.shape)

## Pre-Modeling Interlude

Before we get started, a brief soapbox moment. It can be tempting to fire up TensorFlow and get crazy here, but starting simple and improving your model iteratively is a good idea. Go small to big, as you might find out you don't need a 40-layer NN to get the job done. (Although sometimes you do.)

## Modeling

The good news about a PPS model is that there is an extremely obvious first pass model to either test or at least consider. Very slight math warning approaching.

$xPPS = {\beta_0} + {\beta_1} * FG2 + {\beta_2} * FG3$

This is our starting point. This model takes binary flags for FG2 and FG3 and then returns an expected points per shot value (xPPS). Although, let's consider if we actually need an intercept here - what would the xPPS be if FG2 = 0 and FG3 = 0? It would be zero (no shot, no points), so we can actually get rid of the intercept entirely. You could of course instead drop one of FG2 or FG3 and keep the intercept, but I like it this way for presentation purposes. EMBIID v1.0 is presented below.

$xPPS = {\beta_1} * FG2 + {\beta_2} * FG3$
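
With no intercept and two mutually exclusive flags, least squares has a simple answer: each coefficient is just the average points scored on that shot type. Here is a small sketch on synthetic shots showing that; the make rates and the random data are made up for illustration, and `np.linalg.lstsq` stands in for whatever fitting routine you prefer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic shots: every attempt is either a 2 or a 3 (illustrative make rates).
fg3 = rng.integers(0, 2, size=n)
fg2 = 1 - fg3
made = rng.random(n) < np.where(fg3 == 1, 0.36, 0.50)
points = (made * np.where(fg3 == 1, 3, 2)).astype(float)

# Intercept-free least squares on the two flags.
X = np.column_stack([fg2, fg3]).astype(float)
beta, *_ = np.linalg.lstsq(X, points, rcond=None)

# Plain group averages for comparison.
mean_fg2 = points[fg2 == 1].mean()
mean_fg3 = points[fg3 == 1].mean()
print(beta, mean_fg2, mean_fg3)
```

So this first model is really a lookup table with two entries, which is a fine baseline to beat later.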


Now, let's use this model setup to train on 75% of the data and test on 25% of the data and see how we do. Side note: I'm doing this with statsmodels because sklearn doesn't have linear regression summary tables, which makes me insane.

In [23]:
from sklearn.model_selection import train_test_split
#from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import numpy as np
pd.options.mode.chained_assignment = None

shots['fg2'] = np.where(shots['shot_type'] == '2PT Field Goal', 1, 0)
shots['fg3'] = np.where(shots['shot_type'] == '3PT Field Goal', 1, 0)

train_shots, test_shots = train_test_split(shots, train_size=0.75)

## NOTE THAT Y COMES FIRST UNLIKE SKLEARN

y_train = train_shots['points']
X_train = train_shots[['fg2', 'fg3']].to_numpy()
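# NOTE: add_constant puts an intercept column back in. If every shot is a 2 or a 3,
# fg2 + fg3 == 1 on every row, so that column is perfectly collinear with the flags.
X_train = sm.add_constant(X_train)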
model = sm.OLS(y_train, X_train).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 points   R-squared:                      -0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    -1.097
Date:                Tue, 05 Jan 2021   Prob (F-statistic):               1.00
Time:                        13:10:24   Log-Likelihood:            -1.1833e+05
No. Observations:               74543   AIC:                         2.367e+05
Df Residuals:                   74540   BIC:                         2.367e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -3.111e+11   3.02e+11     -1.031      0.3
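
A `const` coefficient on the order of 1e11 with an R-squared of about zero is the kind of thing a rank-deficient design matrix tends to produce. The sketch below, on a handful of made-up rows, checks the rank with and without the constant column; it assumes, as in this data set, that every shot is either a 2 or a 3.

```python
import numpy as np

# Six toy shots: each is either a 2 (fg2 = 1) or a 3 (fg3 = 1).
fg2 = np.array([1, 0, 1, 1, 0, 0])
fg3 = 1 - fg2

# Design matrix with a constant column, as sm.add_constant would build it.
with_const = np.column_stack([np.ones_like(fg2), fg2, fg3])
# Design matrix for the intercept-free model.
no_const = np.column_stack([fg2, fg3])

rank_with = np.linalg.matrix_rank(with_const)
rank_without = np.linalg.matrix_rank(no_const)
print(with_const.shape[1], rank_with)    # three columns, but only rank 2
print(no_const.shape[1], rank_without)   # two columns, rank 2
```

Three columns with rank 2 means the coefficients are not uniquely determined, so either the constant or one of the flags has to go.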

In [None]:
from sklearn.model_selection import train_test_split
#from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import numpy as np
pd.options.mode.chained_assignment = None

# Binary flags for the two shot types
shots['fg2'] = np.where(shots['shot_type'] == '2PT Field Goal', 1, 0)
shots['fg3'] = np.where(shots['shot_type'] == '3PT Field Goal', 1, 0)

train_shots, test_shots = train_test_split(shots, train_size=0.75)

## NOTE THAT Y COMES FIRST UNLIKE SKLEARN
# No add_constant here, so this is the intercept-free EMBIID v1.0.
# Passing DataFrames keeps the fg2/fg3 names in the summary table.
model = sm.OLS(train_shots['points'], train_shots[['fg2', 'fg3']]).fit()
print(model.summary())


In [None]:
model.summary()
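
To actually "see how we do" on the held-out 25%, one option is to compare the model's test error against a dumb baseline that predicts the overall training average for every shot. This is a sketch on synthetic shots with made-up make rates, using a hand-rolled split and `np.linalg.lstsq` rather than the notebook's own objects; RMSE is an assumed choice of metric, not one the text commits to.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# Synthetic shots: every attempt is a 2 or a 3 (illustrative make rates).
fg3 = rng.integers(0, 2, size=n)
made = rng.random(n) < np.where(fg3 == 1, 0.36, 0.50)
points = (made * np.where(fg3 == 1, 3, 2)).astype(float)

# 75/25 split by shuffled row index.
idx = rng.permutation(n)
cut = int(0.75 * n)
train, test = idx[:cut], idx[cut:]

# Intercept-free fit on the two flags, using training rows only.
X = np.column_stack([1 - fg3, fg3]).astype(float)
beta, *_ = np.linalg.lstsq(X[train], points[train], rcond=None)

# Score on the held-out rows against a predict-the-mean baseline.
pred = X[test] @ beta
rmse_model = np.sqrt(np.mean((points[test] - pred) ** 2))
rmse_baseline = np.sqrt(np.mean((points[test] - points[train].mean()) ** 2))
print(rmse_model, rmse_baseline)
```

A fitted statsmodels result exposes `.predict(exog)` for the same purpose, given test columns that match the ones used in training. Don't expect a big gap between the two numbers: two flags carry very little information about any single shot.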