## Lab: Pythagorean Expectation for NBA

### On Your Own: Deriving the Exponent for NBA Pythagorean Expectation

In this lab, you will perform the same analysis for the NBA, reusing almost exactly the same code from the demo on Pythagorean Expectation for MLB and tweaking it wherever necessary.  If you are unsure how to do something, look at the corresponding part of the MLB section and emulate the code.  The data is loaded in the first cell.

The columns (excluding some self-explanatory ones):
+ `lg_id`: League ID
+ `mp`: minutes played
+ `pts`: points scored
+ `opp_pts`: opponent points scored

**IMPORTANT TIP**: Reuse the code from the demo on Pythagorean Expectation for MLB as much as possible.  You should be able to reuse basically all of it and rename a few things here and there.  It should all work to produce the results for this lab!  

## Setup (Do Not Change)

In [None]:
%run ../../utils/notebook_setup.py

import numpy as np
import pandas as pd

from datascience_stats import linear_fit

nba = pd.read_csv(
    "nba_team_season_data.csv", 
    usecols=[1, 2, 3, 12, 16, 17, 38, 59, 60, 61]
)
nba.head()

### 1. The first thing we need to do is compute the winning percentage 
$$
    \text{Win Pct} = W / G
$$

In [None]:
nba['Wpct'] = nba['wins'] / nba['g']

### 2. Then we need to compute Points per Game values
\begin{align*}
    \text{Points For per Game} & = \text{Points For}\ /\ \text{Game} \\
    \text{Points Against per Game} & = \text{Points Against}\ /\ \text{Game} \\
    \text{Net Points per Game} & = \text{Points For per Game} - \text{Points Against per Game}
\end{align*}

Call the columns `ppg`, `opp_ppg`, and `net_ppg`.

*Note:* Feel free to perform the analysis using Ratings, which provide points per 100 possessions, provided in the NBA dataset (you will need to change the data loading to include those columns).  The results will be identical.

_Question_

We're computing a per game value.  Should we use 82 or should we use something else?  What happened in the NBA recently (~2011) that might necessitate not using 82?

_Answer_

NBA seasons don't always have 82 games.  The 2011-12 season was shortened to 66 games by a lockout.  When _normalizing_ data by something like games or plate appearances or shots, it's best to use the actual count and not an assumption about what it should be.
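To see why the actual game count matters, here is a minimal sketch with two made-up team seasons.  The team IDs and point totals are hypothetical and are not taken from the dataset; one row has 66 games, as in a lockout-shortened season.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real dataset values.
toy = pd.DataFrame({
    "team_id": ["AAA", "BBB"],
    "g": [82, 66],          # games actually played
    "pts": [8200, 6600],    # total points scored
})

# Divide by the games each team actually played.
toy["ppg"] = toy["pts"] / toy["g"]
# For comparison: wrongly assume every season has 82 games.
toy["ppg_assume_82"] = toy["pts"] / 82

print(toy)
```

Both toy teams scored at the same rate, but the 82-game assumption makes the 66-game team look far worse than it was.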

In [None]:
nba['ppg'] = nba['pts'] / nba['g']
nba['opp_ppg'] = nba['opp_pts'] / nba['g']
nba['net_ppg'] = nba['ppg'] - nba['opp_ppg']

Show the top 10 team seasons by Net Points per Game.  Only show the following columns: `team_id, wins, losses, ppg, opp_ppg, net_ppg`

In [None]:
nba[['team_id', 'wins', 'losses', 'ppg', 'opp_ppg', 'net_ppg']].\
    sort_values('net_ppg', ascending=False).\
    head(10)

### 3. Compute the Linear Model
$$
    \text{Linear Win Pct} = \alpha  + \beta \cdot \text{Net Points per Game}
$$
where $\alpha$ gives $\text{Average Win Pct}$ and $\beta$ gives $\text{Win Pct per Net Points per Game}$.

Plot the linear model results as we did with MLB.

**Remember: Reuse the code from the MLB demo!**


_Question_

For what values of Net Points per Game does $\text{Linear Win Pct} < 0$ and $\text{Linear Win Pct} > 1$?  How much of an issue is that here compared to when we looked at MLB data?

In [None]:
# Compute a linear fit
params, predictions, errors = linear_fit(nba['net_ppg'], nba['Wpct'])
# The predictions are results of the model: alpha + beta * net_ppg
nba['LinWpct'] = predictions

alpha, beta = params
print("Computed Linear Model:")
print("====================")
s = "LinWpct = {alpha:.3f} + {beta:.3f} * net_ppg".format(alpha=alpha, beta=beta)
print(s)
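`linear_fit` comes from the course's `datascience_stats` package and its internals aren't shown here.  If you want a rough cross-check, the sketch below fits an ordinary least-squares line with `np.polyfit` on synthetic data.  It assumes `linear_fit` performs a standard least-squares fit, and the synthetic data (generated from the rounded coefficients 0.500 and 0.032 plus noise) is made up for illustration.

```python
import numpy as np

# Synthetic data for illustration only: a known line plus random noise.
rng = np.random.default_rng(0)
net_ppg = rng.uniform(-10, 10, size=200)
wpct = 0.500 + 0.032 * net_ppg + rng.normal(0, 0.03, size=200)

# np.polyfit with degree 1 returns [slope, intercept].
beta_hat, alpha_hat = np.polyfit(net_ppg, wpct, 1)

print("LinWpct = {:.3f} + {:.3f} * net_ppg".format(alpha_hat, beta_hat))
```

On the real data you could pass `nba['net_ppg']` and `nba['Wpct']` instead of the synthetic arrays and compare against the values printed above.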

_Answer_

To figure out what values of Net Points per Game lead to $\text{Linear Win Pct} < 0$ and $\text{Linear Win Pct} > 1$, we need to solve these equations:
$$
    0 = 0.500 + 0.032 \times \text{Net Points per Game}
$$
$$
    1 = 0.500 + 0.032 \times \text{Net Points per Game}
$$
We can do that below:

In [None]:
print('LinWpct less than 0:')
print('net_ppg = ', -.500 / .032)
print('LinWpct greater than 1:')
print('net_ppg = ', (1 - .500) / .032)

The estimated value of $\beta$ should be about $0.03$.

Compute the "Net PPG per Win" from the linear model.

You should get a Net PPG per win of about 0.38.

_Answer_

The units for $\beta$ can be easily extracted from the linear model:
$$
    \frac{\text{Win Pct}}{\text{Net PPG}}
$$

Multiplying by games gives 
$$
    G \cdot \beta = \frac{\text{Wins}}{\text{Net PPG}}
$$

Inverting this gives us what we want:
$$
    \frac{1}{G \cdot \beta} = \frac{\text{Net PPG}}{\text{Wins}}
$$

In [None]:
1 / (82 * beta)
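As a worked example of the unit conversion, the sketch below plugs in the rounded slope of 0.032 from the equations above and assumes a full 82-game season.

```python
G = 82          # assumes a full-length regular season
beta = 0.032    # rounded slope quoted earlier in this lab

# Wins gained per 1 point of Net PPG: (Win Pct / Net PPG) * Games
wins_per_net_ppg = G * beta
# Inverting gives Net PPG needed per additional win.
net_ppg_per_win = 1 / wins_per_net_ppg

print(round(wins_per_net_ppg, 2), round(net_ppg_per_win, 2))
```

Read the two numbers together: each extra point of Net PPG is worth roughly 2.6 wins, so each win costs roughly 0.38 Net PPG.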

### 4. Compute the following values:
\begin{align*}
    \text{Points Ratio} & = \text{PPG}\ /\ \text{Opp PPG} \\
    \text{Log Points Ratio} & = \log \text{Points Ratio} \\
    \text{Log Odds} & = \log \left( \text{Wins}\ /\ \text{Losses} \right)
\end{align*}
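
Why these quantities?  Here is a short derivation, assuming the Pythagorean form used later in this lab, $\text{Win Pct} = \text{PR}^K / (1 + \text{PR}^K)$, where $\text{PR}$ is the Points Ratio:
\begin{align*}
    \frac{\text{Wins}}{\text{Losses}} & = \frac{\text{Win Pct}}{1 - \text{Win Pct}} = \frac{\text{PR}^K / (1 + \text{PR}^K)}{1 / (1 + \text{PR}^K)} = \text{PR}^K \\
    \log \frac{\text{Wins}}{\text{Losses}} & = K \cdot \log \text{PR}
\end{align*}

So Log Odds is a straight line through the origin in Log Points Ratio with slope $K$, which is why a linear fit with no intercept can recover the exponent.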

In [None]:
nba['PR'] = nba['ppg'] / nba['opp_ppg']
nba['log_PR'] = np.log(nba['PR'])
nba['log_odds'] = np.log(nba['wins'] / nba['losses'])

### 5. Compute a Pythagorean exponent for the NBA

Plot the results of the model for the Pythagorean exponent.  

**Again, reuse the MLB code with appropriate changes!**

You should get a large value (around 14).  We could perform this analysis on all sorts of sports. 

_Question_

What does this large value for the exponent mean?  To answer this question, start by answering this series of questions:
+ Suppose some random sport had an exponent of $K=1\text{mil}$, and a team is able to score just a bit more than its opponents, so $\text{Points Ratio} > 1$ by a small amount.  What is $\text{Points Ratio}^K$ in this case?  What is the team's expected winning percentage?
+ Suppose a sport had an exponent of $K=0.00001$.  What is $\text{Points Ratio}^K$ in this case?  What is a team's expected winning percentage if it is able to score just a bit more than its opponents?  What about if it's outscored by a little bit?
+ Do larger or smaller values of K lead to a sport which features a lot of luck/chance in its outcomes?

In [None]:
# fit the linear model for log-odds and log-points ratio
params, predictions, errors = linear_fit(
    nba['log_PR'], nba['log_odds'], constant=False)
# take the model predictions as expected log-odds
nba['xlog_odds'] = predictions

K = params.item()

print("Computed Linear Fit:")
print("====================")
s = "xlog_odds = {K:.2f} * log_PR".format(K=K)
print(s)

In [None]:
fig, ax = plt.subplots()
# label each series so the legend has entries to show
nba.plot.scatter(ax=ax, x="log_PR", y='log_odds', label='Actual')
nba.plot.scatter(ax=ax, x="log_PR", y='xlog_odds', color='C1', label='Fitted')
plt.legend();

_Answer_

Note: this is an extra-detailed answer to show you just how far one can go with these things.  This level of detail is not expected.

1. If $K=1\text{mil}$, then if $\text{Points Ratio}$ is slightly bigger than 1, $\text{Points Ratio}^K$ will become very large, and the expected winning percentage will be 1.000.  And when $\text{Points Ratio}$ is slightly smaller than 1, $\text{Points Ratio}^K$ will become close to 0, and the expected winning percentage will be 0.000.
2. If $K=0.00001$, then if $\text{Points Ratio}$ is a small number, $\text{Points Ratio}^K$ will be close to 1.  And if $\text{Points Ratio}$ is a large number, $\text{Points Ratio}^K$ will _still_ be close to 1.  In both cases, the expected winning percentage is about .500.
3. When $K$ is large, our team's performance is rewarded (punished) with more assurance that we will win (lose) games.  If we are slightly better than our opponent, we will win, and there will be little luck in the sport to hand us losses.  Conversely, when $K$ is small, no matter how good or how bad we are, we'll win about half of our games.  In this case, the games are just coin flips and purely luck.

The exponents for MLB and NBA are about 1.83 and 14, respectively.  Fans/writers/whoever frequently express an instinctive feel for how NBA games are more pre-determined and that tournaments like the MLB playoffs are more random and luck-based.  One can see aspects of this in how a 73 win team like the Warriors would correspond to about 140 wins in MLB, something no team has ever remotely come close to.  Winning percentages in baseball translated over to the NBA would have NBA teams concentrate in a relatively tight range of about 30-56 wins.  
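The win-total comparisons above come from holding winning percentage fixed and changing the schedule length.  A minimal sketch, assuming 82-game NBA and 162-game MLB seasons (`translate_wins` is a hypothetical helper name, not part of the lab's code):

```python
NBA_GAMES = 82
MLB_GAMES = 162

def translate_wins(wins, from_games, to_games):
    """Convert a win total to another schedule length at the same winning percentage."""
    return wins / from_games * to_games

# A 73-win NBA season expressed on a 162-game schedule.
warriors_in_mlb = translate_wins(73, NBA_GAMES, MLB_GAMES)
print(round(warriors_in_mlb))

# The 30-56 NBA win range expressed on a 162-game schedule.
print(round(translate_wins(30, NBA_GAMES, MLB_GAMES)),
      round(translate_wins(56, NBA_GAMES, MLB_GAMES)))
```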

This notion of baseball being a more luck-based game is further compounded when you look at the level of "quality" through the Run and Point Ratios.  MLB teams exhibit larger ranges of quality.  For example, from the demo on Pythagorean Expectation for MLB, you can see in one of the plots that multiple teams had a Run Ratio above 1.4.  In the next section, you'll see no NBA team had a Point Ratio above 1.15.  The same happens at the bottom end for the bad teams.  In MLB, teams exhibit wider ranges of quality but tighter ranges of winning percentages when compared to the NBA.

See the cell below for examples for 1. and 2.

In [None]:
# 1 plus a small amount raised to a large power
print(1.00001**(1000000))
# 1 minus a small amount raised to a large power
print(.99999**(1000000))
# a large number raised to a small power
print(200**(.00001))
# a small number raised to a small power
print(.000001**(.00001))
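To connect these powers back to winning percentage, the sketch below evaluates the Pythagorean formula for a team that scores 5% more (or 5% less) than its opponents across several exponents.  The values 1.83 and 14 are the MLB and NBA exponents quoted in this lab; 1000 stands in for a "very large" exponent because a much larger power of 1.05 would overflow a float.

```python
def pythag_wpct(points_ratio, K):
    """Pythagorean expected winning percentage for a points ratio and exponent K."""
    return points_ratio**K / (1 + points_ratio**K)

results = {}
for K in [0.00001, 1.83, 14, 1000]:
    better = pythag_wpct(1.05, K)   # scores 5% more than opponents
    worse = pythag_wpct(0.95, K)    # scores 5% less than opponents
    results[K] = (better, worse)
    print(f"K={K:>8}: PR=1.05 -> {better:.3f}, PR=0.95 -> {worse:.3f}")
```

The same small edge in scoring is worth almost nothing when $K$ is tiny and is nearly a guaranteed win when $K$ is huge.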

### 6. Using the computed exponent*, compute the Pythagorean Expectation
*If the previous cell isn't working and you want to move on, use 14.

_Question_

For teams with really poor net scoring performance, how does the Pythagorean formula compare to the linear formula?  Which seems to perform better in this case?

In [None]:
nba['pythag_Wpct'] = nba['PR']**K / (1 + nba['PR']**K)

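fig, ax = plt.subplots()
# label each series so the legend has entries to show
nba.plot.scatter(ax=ax, x="PR", y='Wpct', label='Actual')
nba.plot.scatter(ax=ax, x="PR", y='LinWpct', color='C1', label='Linear')
nba.plot.scatter(ax=ax, x="PR", y='pythag_Wpct', color='C2', label='Pythagorean')
plt.legend();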

_Answer_

The issue with the linear model can be seen at the extreme ends of scoring performance, but mainly for the bad teams with low Point Ratios.  Two teams had noticeably bad Point Ratios (see the next cell).  The linear model using Net PPG predicted these teams to have a winning percentage far too low.  In one case, it was nearly negative.

This is really the only issue with the linear model.  For the most part, it's quite adequate!  For MLB, it was just fine, but the NBA data is an example where its flaws are revealed.

In [None]:
nba[['team_id', 'wins', 'losses', 'ppg', 'opp_ppg', 'net_ppg', 'PR']].\
    sort_values('PR', ascending=True).\
    head(2)
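To make the comparison concrete, here is a minimal sketch using the rounded coefficients quoted in this lab (0.500, 0.032, and an exponent of 14) on a hypothetical team that is outscored by 16 points per game.  The team's scoring numbers are made up for illustration.

```python
alpha, beta, K = 0.500, 0.032, 14   # rounded values quoted in this lab

def linear_wpct(ppg, opp_ppg):
    """Linear model: alpha + beta * Net PPG."""
    return alpha + beta * (ppg - opp_ppg)

def pythag_wpct(ppg, opp_ppg):
    """Pythagorean model: PR^K / (1 + PR^K)."""
    pr = ppg / opp_ppg
    return pr**K / (1 + pr**K)

# Hypothetical team outscored by 16 points per game.
ppg, opp_ppg = 89.0, 105.0
lin = linear_wpct(ppg, opp_ppg)
pyth = pythag_wpct(ppg, opp_ppg)
print(f"linear: {lin:.3f}, pythagorean: {pyth:.3f}")
```

The linear prediction drops below zero, which is impossible for a winning percentage, while the Pythagorean prediction stays between 0 and 1 by construction.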

### 7. Compute Pythagorean Luck

Again, show the columns `team_id, wins, losses, ppg, opp_ppg, net_ppg`, plus the new `pythag_luck` column.

+ Display a table of the top 10 "luckiest" teams.
+ Display a table of the top 10 "unluckiest" teams.

In [None]:
nba['pythag_luck'] = (nba['g'] * (nba['Wpct'] - nba['pythag_Wpct'])).round()

In [None]:
nba[['team_id', 'wins', 'losses', 'ppg', 'opp_ppg', 'net_ppg', 'pythag_luck']].\
    sort_values('pythag_luck', ascending=False).\
    head(10)

In [None]:
nba[['team_id', 'wins', 'losses', 'ppg', 'opp_ppg', 'net_ppg', 'pythag_luck']].\
    sort_values('pythag_luck', ascending=True).\
    head(10)
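To see what a single luck value means, the sketch below works one hypothetical team through the calculation by hand.  The 50-32 record and scoring numbers are made up, and the exponent is fixed at 14 rather than taken from the fit.

```python
K = 14  # approximate NBA exponent quoted in this lab

def pythag_luck(wins, games, ppg, opp_ppg, K=K):
    """Actual wins minus Pythagorean expected wins."""
    pr = ppg / opp_ppg
    expected_wpct = pr**K / (1 + pr**K)
    return wins - games * expected_wpct

# Hypothetical 50-32 team that outscored opponents 105 to 103 per game.
luck = pythag_luck(wins=50, games=82, ppg=105.0, opp_ppg=103.0)
print(round(luck, 1))
```

A positive value means the team won more games than its scoring margin suggests; a negative value means it won fewer.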

### 8. Compute a table of Points-to-Wins values

+ A function with the Points per Win formula has been provided
+ A range of point-per-game values for PPG and Opponent PPG has been provided
+ Compute the Points per Win for various PPG values

You should see values around 30 points per win over a season, which is roughly 0.37 PPG over 82 games.  Interpret this as follows: if you increase your scoring by 1 PPG, you should expect about a 3-win improvement.  Teams like the '96 Bulls or recent Warriors with a Net PPG of 10 see close to 30-game increases above .500, i.e., high-60s win totals compared to 41 wins.

In [None]:
def pts_per_win(ppg, opp_ppg, K):
    PR = ppg / opp_ppg
    pyth = PR**K / (PR**K + 1)
    return opp_ppg * PR**(K + 1) / (K * pyth**2)

ppg_rng = np.arange(85, 130, 5)
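One way to read `pts_per_win` is as the reciprocal of the slope of the Pythagorean winning percentage with respect to PPG: since wins are games times winning percentage and season points are games times PPG, the game counts cancel.  This reading is an interpretation, not something stated in the lab, so the sketch below checks it numerically with a finite difference.  `pythag_wpct` and `pts_per_win_numeric` are hypothetical helper names.

```python
def pts_per_win(ppg, opp_ppg, K):
    # Same formula as the function provided in the lab.
    PR = ppg / opp_ppg
    pyth = PR**K / (PR**K + 1)
    return opp_ppg * PR**(K + 1) / (K * pyth**2)

def pythag_wpct(ppg, opp_ppg, K):
    PR = ppg / opp_ppg
    return PR**K / (PR**K + 1)

def pts_per_win_numeric(ppg, opp_ppg, K, h=1e-4):
    # Central difference approximating d(Win Pct) / d(PPG), then inverted.
    slope = (pythag_wpct(ppg + h, opp_ppg, K)
             - pythag_wpct(ppg - h, opp_ppg, K)) / (2 * h)
    return 1 / slope

K = 14
closed_form = pts_per_win(100, 100, K)
numeric = pts_per_win_numeric(100, 100, K)
print(round(closed_form, 2), round(numeric, 2))
```

For an average team (PPG equal to Opponent PPG), the closed form reduces to $4 \cdot \text{Opp PPG} / K$, which is where the "around 30 points" figure comes from.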

In [None]:
# rows index Opponent PPG, columns index PPG
pts_vals = [
    [pts_per_win(ppg, opp_ppg, K) for ppg in ppg_rng]
    for opp_ppg in ppg_rng
]

pts_vals = np.around(pts_vals, decimals=2)

net_ppg_to_wins = pd.DataFrame(
    pts_vals,
    index=pd.Series(ppg_rng, name='Opp PPG'),
    columns=pd.Series(ppg_rng, name='PPG')
)
net_ppg_to_wins

_Question_


Say a team has a star player who averages 20-30 points per game.  The team loses this player for 10 games in the middle of the season.  Use the Net PPG-to-Wins conversions above and give a "back-of-the-envelope" estimate (with a bit of explanation of your thinking) of how many extra games we should expect a team without the star to lose.  Consider this when answering:  Do you lose all 20-30 points the player provides or is it replaced in some way?  Is it replaced to the full extent?

_Answer_ 

Suppose that when the star player is lost, the team is unable to make up any of the 30 PPG of production.  Say they could only replace the player with someone absolutely worthless on offense but still fine on defense.  Then they'd lose 30 PPG (and thus 30 Net PPG relative to their usual value) and should expect to lose every game in the 10-game stretch.  Why?  30 net points per game is a ludicrous amount, and we intuitively know no team can drop 30 net points per game and still compete in the NBA.  Pythagorean Expectation confirms that: 30 net points over the season corresponds to about 1 win.  Over a 10-game stretch the team loses 300 points, which corresponds to about 10 lost wins, i.e., 0-10 over the stretch.

On average we should expect losing the star player to lead to fewer wins since we should not expect the team to be able to adequately make up all the lost production.  The 17-18 Warriors had 4 all stars and 3 of them couldn't come very close to making up the production of Steph Curry. See here for more details: https://www.12up.com/posts/6012906-warriors-stars-offensive-stats-without-steph-curry-on-the-floor-are-devastating

The possessions the star player typically uses to take shots and score 30 points aren't disappearing.  Someone else will get to use them.  So someone is going to step into the lineup and help make up for parts of the overall lost production.  For example, when Steph Curry has gone out for rest or injury, Shaun Livingston has often gotten the starting point guard spot.

While the star player is producing 20-30 points per game from his possessions, we may only get 15-20 points per game from the rest of the team that tries to take up the slack.  Thus we might only see a drop of 5-10 points per game overall.  Over the course of 10 games, we may only expect to see 1-3 more losses than normal.
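The back-of-the-envelope arithmetic can be sketched as follows, assuming the rough figure of 30 season points per win from the table and a 10-game absence.  `extra_losses` is a hypothetical helper name.

```python
PTS_PER_WIN = 30    # approximate season points per win from the table above
GAMES_MISSED = 10

def extra_losses(net_ppg_drop, games=GAMES_MISSED, pts_per_win=PTS_PER_WIN):
    """Expected additional losses from a sustained drop in Net PPG."""
    return net_ppg_drop * games / pts_per_win

# Partial replacement (5-10 point drop) versus no replacement at all (30).
for drop in (5, 10, 30):
    print(drop, round(extra_losses(drop), 1))
```

A drop of 5-10 Net PPG works out to roughly 2-3 extra losses, while the no-replacement case of 30 Net PPG costs all 10 games.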