# Evolution of Basketball and the Correlation of Metrics

This notebook explores some key points about the evolution of the NBA and why metrics like FG%, eFG%, and TS% change in correlation with respect to offensive rating.

## The Question

Are the shooting metrics FG%, eFG%, or TS% correlated with Offensive Rating (team points per 100 possessions)?  In other words, if I measure X%, do I have a good idea of what the Offensive Rating will be?  And if I can reliably increase X%, will I likely increase Offensive Rating?

_Yes_.

Unequivocally, these shooting metrics are correlated with Offensive Rating.  TS%, by factoring in 3-pointers and free throws, does the best job of the three.  See the other demo on shooting metrics to explore this further.


## What else matters?

### Offensive Rebounding

The standard practice in basketball analytics is to treat an offensive rebound as _continuing a possession_ instead of creating a new possession.  Because of this, improved offensive rebounding leads to improved offensive efficiency because you get an extra shot in the same possession.

Consider this team performance:

+ 100 possessions, 0 turnovers or FTs, shoot 50% on 2s (no 3s)
    + 0 Off. Rebs
    + 100 Off. Rating on 50% shooting

Now consider this team performance:

+ 100 possessions, 0 turnovers or FTs
    + On each possession's first shot, the team shoots 50% on 2s (no 3s)
    + On each of the 50 misses, the team gets the offensive rebound
    + On each of the 50 off. rebounds, the team shoots one more time at 50% on 2s (no 3s)
    + The team gets no offensive rebounds on those last 25 misses
    + 150 Off. Rating on 50% shooting

If a team gets more Off. Rebs, then it’ll have a higher Off. Rating for fixed FG%.  Put another way, the FG% matters less and thus is less correlated with Off. Rating when teams are offensive rebounding more.
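
The arithmetic of the two scenarios can be checked with a few lines of Python. This is a minimal sketch of the thought experiment only: the function name `expected_box_score` and its parameters are made up for illustration, and it assumes at most one second-chance shot per possession.

```python
def expected_box_score(possessions, fg_pct, second_chance_rate):
    """Expected FG% and Off. Rating when a fraction `second_chance_rate`
    of first-shot misses is rebounded and followed by one more 2-pt attempt."""
    first_makes = possessions * fg_pct
    first_misses = possessions - first_makes
    second_attempts = first_misses * second_chance_rate
    second_makes = second_attempts * fg_pct
    fga = possessions + second_attempts
    fgm = first_makes + second_makes
    points = 2 * fgm  # only 2-pt shots in this thought experiment
    return {
        'fg_pct': fgm / fga,
        'off_rtg': 100 * points / possessions,
    }

no_rebounds = expected_box_score(100, 0.5, 0.0)
all_rebounds = expected_box_score(100, 0.5, 1.0)
print(no_rebounds)   # {'fg_pct': 0.5, 'off_rtg': 100.0}
print(all_rebounds)  # {'fg_pct': 0.5, 'off_rtg': 150.0}
```

Both scenarios report the same 50% FG%, yet the Off. Rating differs by 50 points, which is why FG% alone says less about Off. Rating when offensive rebounding varies.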


### Turnovers

If there were no turnovers or FTs and only one kind of shot (all 2s or all 3s), then FG%, eFG%, TS%, and Off. Rating would all be equivalent (each a constant multiple of the others).  They would all be perfectly correlated.

If there are FTs but no turnovers, then only TS% would be equivalent to Off. Rating, i.e. perfect correlation.

With turnovers, TS% (or FG% or eFG%) is a "noisy" measure of Off. Rating, i.e. there is no longer a perfect correlation.
+ For fixed TS%, more turnovers means lower Off. Rating
+ For fixed turnovers though, higher TS% means higher Off. Rating
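
Both bullets can be seen in a small expected-value calculation. This is only a sketch under simplifying assumptions: every possession is either a turnover or a single 2-pt attempt (so TS% equals FG%), and `expected_off_rating` is a hypothetical helper, not part of the notebook's utilities.

```python
def expected_off_rating(tov_rate, fg2_pct):
    """Expected points per 100 possessions when each possession is
    either a turnover or one 2-pt attempt (no FTs, 3s, or rebounds)."""
    shot_rate = 1 - tov_rate
    return 100 * shot_rate * 2 * fg2_pct

# Fixed shooting (50%), varying turnovers
for tov_rate in (0.0, 0.10, 0.20):
    print(f"tov_rate={tov_rate:.2f}  Off. Rating={expected_off_rating(tov_rate, 0.5):.1f}")

# Fixed turnovers (10%), varying shooting
for fg2_pct in (0.45, 0.50, 0.55):
    print(f"fg2_pct={fg2_pct:.2f}  Off. Rating={expected_off_rating(0.10, fg2_pct):.1f}")
```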

### 2 vs 3: The effect of riskier shots

Suppose there are two outcomes: a turnover or a guaranteed shot.  Then...
+ FG% is meaningless: it'll always be 100% and it'll have 0 correlation with Off. Rating
+ Off. Rating is entirely driven by how many turnovers there are

Again suppose there are two outcomes: a turnover or a shot. Further assume that turnovers happen on 99 out of 100 possessions and when we do get a shot, it's an even coin flip and it's worth 100 points if we make it.  Then...
+ FG% is everything: 0 Off. Rating if we miss, 100 if we make.  So FG% = Off. Rating, ie. perfect correlation
    
    
All things being equal, if you take lower percentage shots, then your FG% is more indicative of Off. Rating, ie. more correlated.   The value of the shot doesn’t matter.  We could change the 3pt shot to a 4pt shot and nothing would change.
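
A quick way to see this effect is to simulate it directly with NumPy. The sketch below is an illustration under assumptions, not the notebook's own simulation: each game has 100 possessions, each possession is a turnover or one 2-pt attempt, and `fg_pct_correlation` and its default values are made up for this example.

```python
import numpy as np

def fg_pct_correlation(make_prob, tov_rate=0.15, n_games=5000, seed=0):
    """Correlation between FG% and points over simulated 100-possession
    games in which each possession is a turnover or one 2-pt attempt."""
    rng = np.random.default_rng(seed)
    # Number of shot attempts per game (possessions that are not turnovers)
    attempts = rng.binomial(100, 1 - tov_rate, size=n_games)
    # Number of made shots per game, given the attempts
    makes = rng.binomial(attempts, make_prob)
    fg_pct = makes / attempts
    points = 2 * makes
    return np.corrcoef(fg_pct, points)[0, 1]

for make_prob in (0.9, 0.5, 0.1):
    print(f"make prob {make_prob:.1f}: corr = {fg_pct_correlation(make_prob):.3f}")
```

With the turnover rate held fixed, the correlation should rise as the make probability falls, because more of the game-to-game variation in points then comes from shooting rather than from turnovers.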


## Setup

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from datascience_stats import correlation
from datascience_utils import coin_flip

## Load Data

In [None]:
df = pd.read_csv('nba_team_season_data.csv')

In [None]:
# very rough approximation of the total possessions
tot = df['fg2a'] + df['fg3a'].fillna(0) + .4 * df['fta'] + df['tov']

# Compute the rates at which events occur
df['fg2_rate'] = df['fg2a'] / tot
df['fg3_rate'] = df['fg3a'].fillna(0) / tot
df['ft_rate'] = .4 * df['fta'] / tot
df['tov_rate'] = df['tov'] / tot

df['oreb_rate'] = df['orb'] / (df['orb'] + df['opp_drb'])

# Compute shooting metrics
df['fg_pct'] = df['fg'] / df['fga']
df['efg_pct'] = (df['fg'] + .5 * df['fg3'].fillna(0) ) / df['fga']
df['ts_pct'] = df['pts'] / (2 * (df['fga'] + .44 * df['fta']))
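
To make the three formulas concrete, here is a small worked example on a single made-up box-score line. The totals are hypothetical, chosen only for illustration, and `shooting_metrics` is a helper name introduced here rather than something the notebook defines.

```python
def shooting_metrics(fg, fga, fg3, fta, pts):
    """FG%, eFG%, and TS% from box-score totals, using the same formulas
    as the cell above."""
    return {
        'fg_pct': fg / fga,
        'efg_pct': (fg + 0.5 * fg3) / fga,
        'ts_pct': pts / (2 * (fga + 0.44 * fta)),
    }

# Hypothetical totals (not real data): 40 makes on 85 attempts,
# 10 of the makes are 3s, and 15 of 20 free throws are made.
fg, fga, fg3, fta, ftm = 40, 85, 10, 20, 15
pts = 2 * (fg - fg3) + 3 * fg3 + ftm  # 60 + 30 + 15 = 105 points
metrics = shooting_metrics(fg, fga, fg3, fta, pts)
print({name: round(value, 3) for name, value in metrics.items()})
```

For this line, eFG% exceeds FG% because it credits the extra point on made 3s, and TS% additionally folds in the free throws.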

## Evolution of the NBA

### Increase in 3-pt Field Goal Rate

In [None]:
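def plot_dubs(df, metric):
    """Overlay the Warriors' value of `metric` on the current boxplot.

    Assumes GSW has exactly one row per year, in the same order as the boxes.
    """
    ax = plt.gca()
    x = ax.get_xticks()  # x positions of the per-year boxes
    y = df.loc[df.team == 'GSW', metric].values
    ax.plot(x, y, '.', ms=10, color='C1', label='Dubs')
    ax.legend()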

In [None]:
metric = 'fg3_rate'

df.boxplot(column=metric, by='year', rot=90, figsize=(12, 6))
plot_dubs(df, metric)

### Decrease in Turnover Rate

In [None]:
metric = 'tov_rate'

df.boxplot(column=metric, by='year', rot=90, figsize=(12, 6))
plot_dubs(df, metric)

### Varying Free Throw Rate

In [None]:
metric = 'ft_rate'

df.boxplot(column=metric, by='year', rot=90, figsize=(12, 6))
plot_dubs(df, metric)

### Steady 3-Point Percentage

In [None]:
metric = 'fg3_pct'

df.boxplot(column=metric, by='year', rot=90, figsize=(12, 6))
plot_dubs(df, metric)

### Decreasing Offensive Rebound Rate

In [None]:
metric = 'oreb_rate'

df.boxplot(column=metric, by='year', rot=90, figsize=(12, 6))
plot_dubs(df, metric)

### Increasing Shooting Efficiency of Late

In [None]:
metric = 'ts_pct'

df.boxplot(column=metric, by='year', rot=90, figsize=(12, 6))
plot_dubs(df, metric)

### Off. Rating with a Similar Pattern to TS%

This is expected, since the two are highly correlated.

In [None]:
metric = 'off_rtg'

df.boxplot(column=metric, by='year', rot=90, figsize=(12, 6))
plot_dubs(df, metric)

## Summary on Metrics and the NBA Evolution

Among teams within an era, a metric with a higher correlation is better.  That is, we should conclude FG% <= eFG% < TS%.

However, across eras the gameplay changes, so the correlation of each metric can change for the reasons laid out in the thought experiments at the beginning.

The changes across eras:

+ Teams are taking riskier/lower-probability shots (regardless of whether a 3-pointer has higher expected value)
+ Teams are turning the ball over less
+ Teams are grabbing fewer offensive rebounds

These three effects drive up the correlation of the shooting percentages with Off. Rating.  The metrics aren't better or worse because the era changed.  Instead, the change in correlation is more an indication that play has evolved:

+ With more 3s being shot, increasing the make % is more important
+ With fewer turnovers and off. rebounds all around, teams need to score more efficiently with their shots/FTs
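
One way to check how the correlation moves from season to season is to compute it separately within each year. The sketch below is a hedged illustration: `corr_by_year` is a hypothetical helper, and the demo runs on synthetic toy data rather than the notebook's `df`, so the numbers it prints say nothing about the real NBA. On the real data, the same function could be called as `corr_by_year(df, 'ts_pct')`, assuming `df` has the `year`, `ts_pct`, and `off_rtg` columns used above.

```python
import numpy as np
import pandas as pd

def corr_by_year(df, metric, target='off_rtg'):
    """Correlation of `metric` with `target` among the teams of each season."""
    return pd.Series({
        year: season[metric].corr(season[target])
        for year, season in df.groupby('year')
    })

# Synthetic toy data (NOT real NBA numbers): two seasons of 30 teams each,
# with far more non-shooting noise in Off. Rating in the earlier season.
rng = np.random.default_rng(0)
frames = []
for year, noise_sd in [(2000, 5.0), (2017, 0.5)]:
    ts_pct = rng.normal(0.55, 0.02, size=30)
    off_rtg = 200 * ts_pct + rng.normal(0, noise_sd, size=30)
    frames.append(pd.DataFrame({'year': year, 'ts_pct': ts_pct, 'off_rtg': off_rtg}))
toy = pd.concat(frames, ignore_index=True)

by_year = corr_by_year(toy, 'ts_pct')
print(by_year)
```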

At the player level, this doesn't encompass everything on offense (picks, passing, etc), but to a certain degree it does reinforce that a very poor shooter/scorer can become a big liability if relied on too much.  We see this problem in the extreme in the playoffs when teams deploy aggressive defensive tactics to force good players into tougher shots or bad players to take more shots.

## Game Simulation (if you're curious to dive deeper)

We can observe the change in correlation in action through a simple simulation.  The cell below implements it in three steps:
1. Simulate a possession (the possession ignores Off. Reb)
2. Simulate a game of 100 possessions
3. Simulate a bunch of games

In [None]:
def simulate_possession(team_perf):
    """Simulate a simplified basketball possession"""
    # Event probabilities
    p = [
        team_perf['fg2_rate'], 
        team_perf['fg3_rate'], 
        team_perf['ft_rate'], 
        team_perf['tov_rate']
    ]
    # Determine action: fg2, fg3, ft, tov
    action = np.random.choice(['fg2', 'fg3', 'ft', 'tov'], p=p)
    if action == 'fg2':
        # shoot a 2 pt shot, make it with probability fg2_pct
        return action, 2 * coin_flip(team_perf['fg2_pct'])
    elif action == 'fg3':
        # shoot a 3 pt shot, make it with probability fg3_pct
        return action, 3 * coin_flip(team_perf['fg3_pct'])
    elif action == 'ft':
        # get fouled, shoot two free throws, each made with probability ft_pct
        return action, coin_flip(team_perf['ft_pct']) + coin_flip(team_perf['ft_pct'])
    else:
        # turn the ball over, score nothing
        return action, 0

def simulate_game(team_perf, n_games=1):
    """Simulate n_games games of 100 possessions each and pool the results"""
    game = []
    # simulate 100 possessions per game
    for _ in range(n_games * 100):
        action, pts = simulate_possession(team_perf)
        game.append({'action': action, 'pts': pts})
    game = pd.DataFrame(game)
    
    # split up results
    fgs = game.loc[game.action.str.contains('fg')]
    fg3s = game.loc[game.action.str.contains('fg3')]
    fts = game.loc[game.action.str.contains('ft')]
    
    # count quantities
    fgm = np.count_nonzero(fgs['pts'])
    fg3m = np.count_nonzero(fg3s['pts'])
    fga = fgs.shape[0]
    fta = 2 * fts.shape[0]
    total_pts = game['pts'].sum()
    
    # compute statistics
    fg_pct = fgm / fga
    efg_pct = (fgm + .5 * fg3m) / fga
    # each simulated FT trip is one possession with 2 FTs, hence .5 rather than the usual .44
    ts_pct = total_pts / (2 * (fga + .5 * fta))
    off_rtg = total_pts / n_games

    return {'fg_pct': fg_pct, 'efg_pct': efg_pct, 'ts_pct': ts_pct, 'off_rtg': off_rtg}

def simulation(team_perf, n):
    """Simulation of n games"""
    results = []
    # Simulate n games
    for _ in range(n):
        results.append(simulate_game(team_perf))
    results = pd.DataFrame(results)
    return results[['fg_pct', 'efg_pct', 'ts_pct', 'off_rtg']]

In [None]:
def print_sim_results(sim_results):
    print("Correlations in simulation")
    print("==========================")
    c = correlation(sim_results['fg_pct'], sim_results['off_rtg'])
    print(f"Corr. FG% vs Off. Rating:  {c:.3f}")
    c = correlation(sim_results['efg_pct'], sim_results['off_rtg'])
    print(f"Corr. eFG% vs Off. Rating: {c:.3f}")
    c = correlation(sim_results['ts_pct'], sim_results['off_rtg'])
    print(f"Corr. TS% vs Off. Rating:  {c:.3f}")
    print()

## Correlation of Metrics for an Average Team

We get our expected ordering of FG%, eFG%, and TS%.  The correlations are higher than we've seen but that's not a huge issue.

In [None]:
cols = ['year', 'fg2_rate', 'fg3_rate', 'ft_rate', 'tov_rate', 'fg2_pct', 'fg3_pct', 'ft_pct']

perfs = df[cols].groupby('year').mean()
avg_perf_2017 = perfs.loc[2017]
avg_perf_2017

In [None]:
sim_results = simulation(avg_perf_2017, 1000)
print_sim_results(sim_results)

In [None]:
pd.plotting.scatter_matrix(sim_results, figsize=(12, 12));

## Guaranteed 2-pt Shot

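Let's see what drives correlation.  First, let's consider a game where if a shot is taken, it is guaranteed to go in.  FG% is then always 100%, so it tells us nothing about Off. Rating: the correlation is 0 (or, strictly speaking, undefined, since a constant has no variance; depending on how `correlation` handles that case, the cell may print `nan`).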

In [None]:
perf = pd.Series({
    'fg2_rate': 0.85, 
    'fg3_rate': 0.,
    'ft_rate': 0.,
    'tov_rate': 0.15,
    'fg2_pct': 1.,  # guaranteed shot
    'fg3_pct': 0.,
    'ft_pct': 0.76
})

sim_results = simulation(perf, 1000)
print_sim_results(sim_results)
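
A note on the guaranteed-shot case: when FG% never varies, the Pearson correlation is a 0/0 expression, so different tools may report it as 0 or as `nan`. The sketch below uses NumPy's `np.corrcoef` rather than the notebook's `correlation` helper (whose handling of constant input is not shown here), and the Off. Rating values are made up for illustration.

```python
import numpy as np

# Every shot goes in, so FG% is 100% in every game
fg_pct = np.ones(5)
# Hypothetical Off. Ratings: they vary only with the number of turnovers
off_rtg = np.array([170.0, 160.0, 176.0, 168.0, 172.0])

# Silence the 0/0 floating-point warning that the constant input triggers
with np.errstate(invalid='ignore', divide='ignore'):
    r = np.corrcoef(fg_pct, off_rtg)[0, 1]
print(r)  # nan
```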

## 2-pt Shot is Low Probability

Now let's consider a game where if a shot is taken, it has a low probability of going in.  The correlations are incredibly high.

In [None]:
perf = pd.Series({
    'fg2_rate': 0.85, 
    'fg3_rate': 0.,
    'ft_rate': 0.,
    'tov_rate': 0.15,
    'fg2_pct': .1,  # low probability
    'fg3_pct': 0.,
    'ft_pct': 0.76
})

sim_results = simulation(perf, 1000)
print_sim_results(sim_results)

## 2-pt vs 3-pt Shot

If 2s and 3s go in at the exact same rate, would it matter for correlation if more 2-pt or 3-pt shots are taken? 

Nope.  If you run the next two cells for 1000 games, you'll see the correlations are close but not identical.  If you want, change the number of games to 10000 and let it run for a bit (it'll take 10x as long, naturally).  The correlations will get even closer, empirically validating that the value of the shot doesn't matter.
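
The underlying reason is that Pearson correlation does not change when one variable is multiplied by a positive constant, so any gap between the two cells is sampling noise from their different random draws. Here is a minimal NumPy sketch of that point. It is not the notebook's simulation; it reuses one set of simulated makes and only changes the point value of a made shot.

```python
import numpy as np

rng = np.random.default_rng(0)
n_games = 1000

# 100 possessions per game, 15% turnovers, 50% shooting
attempts = rng.binomial(100, 0.85, size=n_games)
makes = rng.binomial(attempts, 0.5)
fg_pct = makes / attempts

# Same games, scored as if every make were worth 2 points or 3 points
corr_2pt = np.corrcoef(fg_pct, 2 * makes)[0, 1]
corr_3pt = np.corrcoef(fg_pct, 3 * makes)[0, 1]
print(corr_2pt, corr_3pt)
```

On identical draws the two correlations agree up to floating-point error, which matches the claim that changing a 3-pt shot into a 4-pt shot would change nothing.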


In [None]:
# Note! this cell will take a while if you increase the number of simulations

perf = pd.Series({
    'fg2_rate': 0.85,  # all 2-pt shots
    'fg3_rate': 0.,
    'ft_rate': 0.,
    'tov_rate': 0.15,
    'fg2_pct': .5,
    'fg3_pct': .5,
    'ft_pct': 0.76
})

sim_results = simulation(perf, 1000)
print_sim_results(sim_results)

In [None]:
# Note! this cell will take a while if you increase the number of simulations

perf = pd.Series({
    'fg2_rate': 0., 
    'fg3_rate': 0.85, # all 3-pt shots
    'ft_rate': 0.,
    'tov_rate': 0.15,
    'fg2_pct': .5,
    'fg3_pct': .5,
    'ft_pct': 0.76
})

sim_results = simulation(perf, 1000)
print_sim_results(sim_results)