<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup-and-Load-Data" data-toc-modified-id="Setup-and-Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup and Load Data</a></span></li><li><span><a href="#Defender-Distance" data-toc-modified-id="Defender-Distance-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Defender Distance</a></span></li><li><span><a href="#Shot-Distance" data-toc-modified-id="Shot-Distance-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Shot Distance</a></span></li><li><span><a href="#Shot-Distance-and-Defender-Distance" data-toc-modified-id="Shot-Distance-and-Defender-Distance-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Shot Distance and Defender Distance</a></span><ul class="toc-item"><li><span><a href="#Hexbin-Plot" data-toc-modified-id="Hexbin-Plot-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Hexbin Plot</a></span></li></ul></li><li><span><a href="#Expected-Shot-Value" data-toc-modified-id="Expected-Shot-Value-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Expected Shot Value</a></span></li><li><span><a href="#An-Improved-Shooting-Metric?" 
data-toc-modified-id="An-Improved-Shooting-Metric?-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>An Improved Shooting Metric?</a></span><ul class="toc-item"><li><span><a href="#Player-ESV" data-toc-modified-id="Player-ESV-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Player ESV</a></span></li><li><span><a href="#Team-ESV-per-85-FGA" data-toc-modified-id="Team-ESV-per-85-FGA-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Team ESV per 85 FGA</a></span></li><li><span><a href="#Player-Points-Above-Average-per-FGA" data-toc-modified-id="Player-Points-Above-Average-per-FGA-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Player Points Above Average per FGA</a></span></li><li><span><a href="#Team-Points-Above-Average-per-85-FGA" data-toc-modified-id="Team-Points-Above-Average-per-85-FGA-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Team Points Above Average per 85 FGA</a></span></li></ul></li><li><span><a href="#Does-Defender-Distance-Relate-to-Shooter-Quality?" data-toc-modified-id="Does-Defender-Distance-Relate-to-Shooter-Quality?-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Does Defender Distance Relate to Shooter Quality?</a></span></li></ul></div>

# Expected Value Modeling for NBA Shot Locations

We should expect that two important drivers of the likelihood of making a shot are the shot distance and how near the closest defender is.  Using data obtained from NBA.com for the 2014-2015 season, this demo explores how you can build an expected value model for shooting.

## Setup and Load Data

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
pd.set_option('display.precision', 2)  # full option name; bare 'precision' is ambiguous in newer pandas

from datascience_utils import hexbin_plot, sorted_boxplot
from datascience_topic import build_expected_shot_values_from_hexbin

In [None]:
df = pd.read_csv('shot_logs_2014_15.csv.gz')

In [None]:
df.head()

In [None]:
df.info()

## Defender Distance
There are lots of things you could do with this data but let's just jump right in with defender distance.

We use the 1ft buckets in `CLOSE_DEF_DIST_ROUNDED` to compute FG% as a function of defender distance.

After grouping, we plot the relation and... wait, why does defender distance not seem to matter!?

_Question_
+ What might explain this seeming lack of relationship?  Does shot type matter?  Anything else?

In [None]:
# FG% in each 1ft defender-distance bucket
result = df.groupby('CLOSE_DEF_DIST_ROUNDED')['SHOT_RESULT_BIN'].mean()

fig, ax = plt.subplots()
result.loc[:14].plot(label="All Shots", ax=ax)
ax.set_ylabel('Shooting Pct')
ax.set_xlabel("Closest Defender Distance (ft)")
ax.set_ylim(0, .7)
ax.legend();

Okay, now we separate by shot type.  Clearly defender distance is a big deal for 3pt shots.  But it still doesn't seem that important for 2pt shots.

_Question_
+ Should 2pt shooting % be immune to defender distance?  Why or why not?  Is there something else missing?

In [None]:
# FG% by shot type and 1ft defender-distance bucket
result = df.groupby(['PTS_TYPE', 'CLOSE_DEF_DIST_ROUNDED'])['SHOT_RESULT_BIN'].mean()

fig, ax = plt.subplots()
result.loc[2].loc[:14].plot(label='2pt', ax=ax)
result.loc[3].loc[:14].plot(label='3pt', ax=ax)
ax.set_ylabel('Shooting Pct')
ax.set_xlabel("Closest Defender Distance (ft)")
ax.set_ylim(0, .7)
ax.legend();

## Shot Distance

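We can quickly visualize the variability in shot distance with a pair of histogram plots.  The distance of the NBA 3pt line is not uniform (22 ft in the corners, 23 ft 9 in above the break), so the 2pt and 3pt distributions overlap slightly, but the overall pattern of shots is still clear.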

In [None]:
nbins = 150
alpha = .5

dist_mask = df['SHOT_DIST'] <= 30
twopt = df['PTS_TYPE'] == 2
ax = df.loc[twopt & dist_mask].hist('SHOT_DIST', bins=nbins, alpha=alpha)

threept = df['PTS_TYPE'] == 3
df.loc[threept & dist_mask].hist('SHOT_DIST', bins=nbins, color='C1', alpha=alpha, ax=ax);

We can group by shot type and shot distance (rounded to the nearest half foot) to get a feel for how shooting percentage varies with shot distance.  Unsurprisingly, close-in shots are mostly layups and dunks, which are converted at a much higher rate than anything farther out.

In [None]:
# FG% by shot type and rounded shot distance
result = df.groupby(['PTS_TYPE', 'SHOT_DIST_ROUNDED'])['SHOT_RESULT_BIN'].\
    mean().\
    sort_index(level=0)

fig, ax = plt.subplots()
result.loc[2].loc[:24].plot(label='2pt', ax=ax)
result.loc[3].loc[21:29].plot(label='3pt', ax=ax)
ax.set_ylabel('Shooting Pct')
ax.set_xlabel("Shot Distance (ft)")
ax.set_ylim(0, .7)
ax.legend();

## Shot Distance and Defender Distance

Okay, we saw that defender distance matters for 3s and that shot distance matters in general.  Let's combine the two.  We use the shot distance buckets in `SHOT_DIST_BUCKET` together with the rounded defender distance to show how shooting percentage varies with both shot distance and defender distance.

In [None]:
# FG% by shot-distance bucket and 1ft defender-distance bucket
result = df.groupby(['SHOT_DIST_BUCKET', 'CLOSE_DEF_DIST_ROUNDED'])['SHOT_RESULT_BIN'].mean()

shot_buckets = ['0-3', '3-10', '10-16', '16-3pt', '3pt']
fig, ax = plt.subplots()
for bucket in shot_buckets:
    result.loc[bucket].loc[:14].plot(label=bucket, ax=ax)

ax.set_ylabel('Shooting Pct')
ax.set_xlabel("Closest Defender Distance (ft)")
ax.set_ylim(0, 1.03)
ax.legend(bbox_to_anchor=(1.02, 1.02));

### Hexbin Plot

We can view the relationship between the two in a more continuous space with a hexbin plot.  A hexbin plot tiles the plane with hexagons and groups together all shots that fall within the same hexagon.  Within each hexagon, we then compute the shooting percentage.  We can see that, regardless of shot distance, a defender distance near 0 drives down the shooting percentage at that distance from the basket.

_Question_
+ Why are there some spots that are extra dark or white that don't really fit with the general pattern?
+ What is going on with that spike on the left side?  What kind of shot has a high percentage, is near the basket, and has no defender nearby?

In [None]:
hexbin_pct_plot = hexbin_plot(
    df.loc[(df.SHOT_DIST <= 30)],
    'SHOT_DIST',
    'CLOSE_DEF_DIST',
    C='SHOT_RESULT_BIN',
    collect=np.mean,
    gridsize=20,
    figsize=(8, 6),
    cmap=plt.cm.viridis_r,
    mincnt=5,
    vmin=0.2,
    vmax=0.8
)
hexbin_pct_plot.tick_params(reset=True)
hexbin_pct_plot.set_ylabel("Defender Dist (ft)")
hexbin_pct_plot.set_xlabel("Shot Dist (ft)")
cax = plt.gcf().get_axes()[1]
cax.set_yticklabels(['< 0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '> 0.8']);
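
As a rough illustration of the aggregation a hexbin plot performs, the sketch below bins shots on a rectangular grid (a simpler stand-in for hexagons) and computes the make rate per cell, dropping cells with fewer than `mincnt` shots.  The shot tuples, the cell sizes, and the `binned_fg_pct` helper are all made up for illustration; this is not how `hexbin_plot` is implemented.

```python
from collections import defaultdict
from math import floor


def binned_fg_pct(shots, x_width, y_width, mincnt=1):
    """Group shots into rectangular cells and return the make rate per cell.

    shots: iterable of (shot_dist, def_dist, made) with made in {0, 1}.
    Cells holding fewer than `mincnt` shots are dropped, playing the same
    role as `mincnt` in the hexbin call above.
    """
    cells = defaultdict(list)
    for shot_dist, def_dist, made in shots:
        # Cell index along each axis (hypothetical rectangular grid)
        key = (floor(shot_dist / x_width), floor(def_dist / y_width))
        cells[key].append(made)
    return {
        key: sum(results) / len(results)
        for key, results in cells.items()
        if len(results) >= mincnt
    }


# Hypothetical shots: (shot distance ft, defender distance ft, made?)
shots = [
    (1.0, 0.5, 1), (2.0, 1.0, 1), (2.5, 1.5, 0), (1.5, 0.2, 1),      # near the rim
    (24.0, 5.0, 1), (25.0, 5.5, 0), (24.5, 4.5, 0), (26.0, 5.9, 0),  # 3pt attempts
    (15.0, 9.0, 1),                                                  # lone mid-range shot
]

fg_pct = binned_fg_pct(shots, x_width=3.0, y_width=2.0, mincnt=2)
print(fg_pct)
```

The lone mid-range make would show a 100% cell on its own; the minimum-count filter removes it, which is one reason sparse regions of the plot can look odd or empty.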

## Expected Shot Value

We can do the same hexbin plot but compute expected points instead of shooting percentage.  This makes more sense because of the 3pt line: shots are worth different amounts, so average points per attempt is a better measure of shooting efficiency than raw make rate.

_Question_
+ What does this hexbin plot say about mid/long-distance 2s as well?  How bad is a closely guarded mid/long 2 point shot?

In [None]:
hexbin_esv_plot = hexbin_plot(
    df.loc[(df.SHOT_DIST <= 30)],
    'SHOT_DIST',
    'CLOSE_DEF_DIST',
    C='PTS_MADE',
    collect=np.mean,
    gridsize=20,
    figsize=(8, 6),
    cmap=plt.cm.viridis_r,
    mincnt=5,
    vmin=0.5,
    vmax=1.6
)
hexbin_esv_plot.tick_params(reset=True)
hexbin_esv_plot.set_ylabel("Defender Dist (ft)")
hexbin_esv_plot.set_xlabel("Shot Dist (ft)")
cax = plt.gcf().get_axes()[1]
cax.set_yticklabels(['< 0.6', '0.8', '1.0', '1.2', '1.4', '> 1.6']);

## An Improved Shooting Metric?

Recall the Effective Field Goal Pct was given by
$$
    \text{eFG\%} = \frac{\mathit{FG} + .5 \cdot \mathit{3FG}}{\mathit{FGA}} = \frac{\text{Total Points (excluding FT)}}{2 \cdot \mathit{FGA}}
$$

Ignoring the division by 2, our ESV computation is akin to the eFG% computation: both measure points scored per attempt, with ESV giving the expected points per attempt.
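
A small worked example may help connect the two quantities.  The box-score numbers below are invented for illustration; the only point is that points per field goal attempt (free throws excluded) equals twice eFG%.

```python
def efg_pct(fg, fg3, fga):
    """Effective field goal percentage: (FG + 0.5 * 3FG) / FGA."""
    return (fg + 0.5 * fg3) / fga


def points_per_fga(fg, fg3, fga):
    """Points from field goals (free throws excluded) per attempt."""
    two_pt_makes = fg - fg3
    total_points = 2 * two_pt_makes + 3 * fg3
    return total_points / fga


# Hypothetical line: 8 made field goals, 2 of them 3-pointers, on 16 attempts
fg, fg3, fga = 8, 2, 16
efg = efg_pct(fg, fg3, fga)         # (8 + 1) / 16
ppa = points_per_fga(fg, fg3, fga)  # (2 * 6 + 3 * 2) / 16
print(efg, ppa)
```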

Here's a metric we can build along the lines of what we've already seen in baseball: for each shot, use the hexbin plot to look up its expected shot value (ESV), then average over all attempts.

$$
    \text{eSV} = \frac{\text{Total Expected Points}}{\mathit{FGA}}
$$

We can also compute the points above average for each shot, comparing the points actually scored with that shot's expected value:
$$
    \text{Points Above Average} = \text{Points Made} - \text{eSV}
$$

This is actually akin to what was proposed in this [paper][1] at the Sloan conference.

[1]: http://www.sloansportsconference.com/wp-content/uploads/2014/02/2014-SSAC-Quantifying-Shot-Quality-in-the-NBA.pdf
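
To make the two definitions concrete, here is a minimal sketch with invented numbers.  Each shot carries a bin label standing in for its hexagon, and the expected value of a shot is taken to be the mean points made in its bin.  The bin labels and the `shots` data are hypothetical.

```python
from collections import defaultdict

# Hypothetical shots: (bin label, points made on that shot)
shots = [
    ("mid-tight", 2), ("mid-tight", 0), ("mid-tight", 0), ("mid-tight", 0),
    ("three-open", 3), ("three-open", 0), ("three-open", 3), ("three-open", 0),
]

# Expected value of a bin = mean points made over all shots in that bin
points_by_bin = defaultdict(list)
for bin_label, pts in shots:
    points_by_bin[bin_label].append(pts)
bin_value = {b: sum(p) / len(p) for b, p in points_by_bin.items()}

# Per-shot expected value and points above average
expected = [bin_value[b] for b, _ in shots]
paa = [pts - ev for (_, pts), ev in zip(shots, expected)]

# eSV = total expected points / FGA
esv = sum(expected) / len(shots)
print(bin_value, esv, paa)
```

In this sketch the bin values are averages over the very same shots, so points above average sums to zero over the whole data set; it only becomes informative once it is aggregated by player or team.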

In [None]:
# Use a helper function to get the bin (and therefore ESV) for each shot
esv = build_expected_shot_values_from_hexbin(
    df['SHOT_DIST'], df['CLOSE_DEF_DIST'], hexbin_esv_plot)
df['EXPECTED_SHOT_VALUE'] = esv
df['EXPECTED_SHOT_VALUE'].hist(bins=50);

In [None]:
df['PTS_ABOVE_AVG'] = df['PTS_MADE'] - df['EXPECTED_SHOT_VALUE']
ax = df.hist(column='PTS_ABOVE_AVG', bins=50);

### Player ESV
Which players are taking the highest value shots?  How about the lowest?

In [None]:
# Named aggregation; the dict-renaming form of agg() was removed in pandas 1.0
player_esv = df.groupby('playerName')['EXPECTED_SHOT_VALUE'].\
    agg(ESVperFGA='mean', FGA='size')

In [None]:
player_esv.loc[player_esv.FGA > 100].\
    sort_values('ESVperFGA', ascending=False).\
    head(20)

In [None]:
player_esv.loc[player_esv.FGA > 100].\
    sort_values('ESVperFGA', ascending=True).\
    head(20)

### Team ESV per 85 FGA 

Which teams take the best shots?  85 FGA is about 1 game.

In [None]:
# Named aggregation; the dict-renaming form of agg() was removed in pandas 1.0
team_esv = df.groupby('PLAYER_TEAM')['EXPECTED_SHOT_VALUE'].\
    agg(ESVperFGA='mean', FGA='size')
team_esv['ESVper85FGA'] = 85 * team_esv['ESVperFGA']

team_esv.sort_values('ESVper85FGA', ascending=False)

### Player Points Above Average per FGA
We can compute PAA per FGA for each player.

In [None]:
# Named aggregation; the dict-renaming form of agg() was removed in pandas 1.0
paa = df.groupby('playerName')['PTS_ABOVE_AVG'].\
    agg(PAAperFGA='mean', FGA='size')

In [None]:
paa.loc[paa.FGA > 100].\
    sort_values('PAAperFGA', ascending=False).\
    head(20)

In [None]:
paa.loc[paa.FGA > 100].\
    sort_values('PAAperFGA', ascending=False).\
    tail(20)

### Team Points Above Average per 85 FGA 
We can compute PAA per 85 FGA for each team.  85 FGA is about 1 game.

In [None]:
paa_team = df.groupby('PLAYER_TEAM')['PTS_ABOVE_AVG'].agg(PAAperFGA='mean')
paa_team['PAAper85FGA'] = 85 * paa_team['PAAperFGA']

paa_team.sort_values('PAAper85FGA', ascending=False)

## Does Defender Distance Relate to Shooter Quality?

Presumably better shooters will be guarded more closely, right?  To study this, let's restrict to 3pt shots.

In [None]:
steph = df.playerName == 'Stephen Curry'
fig, ax = plt.subplots()
df.loc[steph & threept].hist('CLOSE_DEF_DIST', bins=30, ax=ax)
ax.set_xlim(0, 25)
ax.set_title("Closest Defender Dist (ft) - Steph Curry 3FGA");

We take all players with more than 50 three-point attempts, sort them by their 3FG% (lowest to highest, from left to right), and construct boxplots showing defender distance.  If 3FG% indicates better shooters, and it should, then those players should be guarded more tightly than poor shooters.  We should see some kind of pattern in the plot that reflects this hypothesis.

_Question_
+ What pattern emerges from this plot?  Can you think of a plausible explanation? 

In [None]:
num_3pt_shots = 50
by = 'playerName'
column = "CLOSE_DEF_DIST"

three_pt_shots = df.loc[threept]

# Named aggregation via ** unpacking, since '3FG%' and '3FGA' are not valid keyword names
by_3pt_pct = three_pt_shots.\
    groupby(by)['SHOT_RESULT_BIN'].\
    agg(**{'3FG%': 'mean', '3FGA': 'count'}).\
    sort_values(by='3FG%', ascending=True)
by_3pt_pct = by_3pt_pct.loc[by_3pt_pct['3FGA'] > num_3pt_shots]

player_order = by_3pt_pct.index

sorted_boxplot(three_pt_shots, by, column, player_order, figsize=(25, 10), fontsize=8)
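
The ordering step can also be sketched without pandas, which may make the logic easier to follow: keep players above an attempt threshold, rank them by 3FG%, and summarize defender distance for each.  The player names, attempts, and threshold below are invented for illustration, and the median stands in for the full boxplot.

```python
from collections import defaultdict
from statistics import median

# Hypothetical 3pt attempts: (player, made?, closest defender distance in ft)
attempts = [
    ("Player A", 1, 4.0), ("Player A", 1, 3.0), ("Player A", 0, 5.0), ("Player A", 1, 3.5),
    ("Player B", 0, 6.0), ("Player B", 1, 7.0), ("Player B", 0, 8.0),
    ("Player C", 1, 5.0),  # too few attempts to rank
]
min_attempts = 2  # the notebook uses 50; lowered for this toy data

made = defaultdict(list)
def_dist = defaultdict(list)
for player, is_made, dist in attempts:
    made[player].append(is_made)
    def_dist[player].append(dist)

# 3FG% for players with more than `min_attempts` attempts
pct = {p: sum(m) / len(m) for p, m in made.items() if len(m) > min_attempts}

# Worst shooter first, matching the left-to-right order of the boxplot
player_order = sorted(pct, key=pct.get)
median_def_dist = {p: median(def_dist[p]) for p in player_order}
print(player_order, median_def_dist)
```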