
# Demo - Pythagorean Expectation

## MLB: The Relationship between Runs and Wins

Bill James' formula, known as the _Pythagorean Expectation_, is summarized for MLB as
$$
    \text{Pythagorean Win Pct}
        = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}
        = \frac{(\text{Runs Scored}\ /\ \text{Runs Allowed})^2}{1 + (\text{Runs Scored}\ /\ \text{Runs Allowed})^2}
$$
The formula produces an expected winning percentage given run scoring data.  The name comes from its resemblance to the classic Pythagorean Theorem.

The Pythagorean Expectation is an empirically motivated relationship between the runs scored and allowed by a team and the team's winning percentage.  That is, the original insight came from observing an empirical phenomenon.  A theoretical justification exists but is beyond the scope of this lab; it can be read here: https://arxiv.org/abs/math/0509698.


An obvious result of the Pythagorean Expectation formula is that if a team scores more runs, holding runs allowed fixed, its expected winning percentage will go up.  It cannot be stressed enough that this formula is not exact, hence the usage of the term _expected_.
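
As a quick illustration of the formula, here is a minimal sketch.  The function name `pythag_wpct` and the run totals are made up for this example and do not come from the Lahman data.

```python
def pythag_wpct(runs_scored, runs_allowed, exponent=2):
    """Pythagorean expected winning percentage."""
    return runs_scored**exponent / (runs_scored**exponent + runs_allowed**exponent)

# Hypothetical team: hold runs allowed fixed at 700 and vary runs scored
for runs_scored in (650, 700, 750, 800):
    print(runs_scored, round(pythag_wpct(runs_scored, 700), 3))
```

A team that scores exactly as many runs as it allows comes out at 0.500, and the expected winning percentage rises as runs scored rises.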

Let us begin.  This notebook explores the empirical relationship between runs and wins, derives the Pythagorean Expectation formula, explores some of its consequences, and then in the second part does the same for NBA data.

### Setup

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # plt is used in the plotting cells below

from datascience_utils import scatterplot_by_x
from datascience_stats import linear_fit

In [None]:
# Load lahman_teams.csv obtained from the Lahman databank
lahman = pd.read_csv("lahman_teams.csv", usecols=[0, 1, 2, 3, 6, 8, 9, 14, 26, 40])

# Define some extra values: win pct, loss pct, and run differential
lahman['Wpct'] = lahman['W'] / lahman["G"]
lahman['Lpct'] = 1 - lahman['Wpct']
lahman["RD"] = lahman["R"] - lahman["RA"]
lahman["RDperG"] = lahman["RD"] / lahman["G"]

# Restrict to seasons from 2000 onward
lahman = lahman.loc[lahman['yearID'] >= 2000].copy()

lahman.head()

### First Look
Let's create scatter plots showing how winning percentage relates to runs scored, runs allowed, and run differential.  Clearly, as runs scored increases, runs allowed decreases, or run differential increases, we should expect a team to win more games.  While it is not guaranteed that scoring more runs or allowing fewer runs will yield more wins, the tendency is quite strong.  The strongest relationship is clearly with run differential, since winning isn't solely about scoring or preventing runs but about doing both.

Also, the relationship appears to be very nearly linear.  An incremental improvement in run differential yields the same improvement in expected winning percentage regardless of the overall size of the run differential.  A team with a net negative 200 run differential improving by 30 runs will see the same increase in expected winning percentage as a team with a net positive 200 run differential improving by 30 runs.

_Question_

1. What do you make of the phrase "defense wins championships" given the plots below?

In [None]:
fig, axarr = plt.subplots(ncols=3, figsize=(12,4), sharey=True)
lahman.plot.scatter(ax=axarr[0], x='R', y='Wpct')
lahman.plot.scatter(ax=axarr[1], x='RA', y='Wpct')
lahman.plot.scatter(ax=axarr[2], x="RD", y='Wpct')
fig.suptitle("Runs vs Win Pct");

### Linear Fit

Let's compute a simple linear fit for winning percentage against run differential per game.  This is given by the equation
$$
    \text{Linear Win pct} = \alpha  + \beta \cdot \text{Run Differential per Game}
$$
where $\alpha$ gives $\text{Average Win Pct}$ (which we reason should be $0.500$) and $\beta$ gives $\text{Wins per Unit Run Differential}$.

The next cell computes $\alpha=0.500$ and $\beta = 0.101$.  

_Questions_

1. What are the units for $\alpha$ and $\beta$?  
2. What is the meaning of the reciprocal $1 / \beta$?
3. If a team improves its differential per game by 0.1 runs (16.2 total runs / 162 games), how many more wins do we expect it to have?

In [None]:
# Compute a linear fit
params, predictions, errors = linear_fit(lahman['RDperG'], lahman['Wpct'])

alpha, beta = params['const'], params['RDperG']
print("Computed Linear Fit:")
print("====================")
s = "xWpct = {alpha:.3f} + {beta:.3f} * RDperG".format(alpha=alpha, beta=beta)
print(s)

fig, ax = plt.subplots()
lahman.plot.scatter(ax=ax, x="RDperG", y='Wpct')

lahman['xWpct'] = predictions
plt.plot(lahman['RDperG'], predictions, color='C1', label='xWpct')
plt.legend();
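
`linear_fit` is imported from `datascience_stats`, which appears to be a helper module for this lab rather than a published package.  For readers without it, the following is a rough sketch of an ordinary least-squares fit using `np.polyfit`.  Whether `linear_fit` does exactly this internally is an assumption, and the data points are invented for illustration.

```python
import numpy as np

# Made-up run differential per game and win pct for five hypothetical teams
rd_per_g = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
wpct = np.array([0.40, 0.45, 0.50, 0.55, 0.60])

# np.polyfit returns coefficients from highest degree down: slope, then intercept
beta, alpha = np.polyfit(rd_per_g, wpct, deg=1)
print("xWpct = {alpha:.3f} + {beta:.3f} * RDperG".format(alpha=alpha, beta=beta))
```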

### Pythagorean Expectation

Recall from above Bill James' formula for Pythagorean Expectation:
$$
    \text{Pythagorean Win Pct}
        = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}
        = \frac{(\text{Runs Scored}\ /\ \text{Runs Allowed})^2}{1 + (\text{Runs Scored}\ /\ \text{Runs Allowed})^2}
$$

_Questions_

1. What's the difference between Pythagorean Expectation and our linear fit above?

2. What happens if the run differential is ever really high or low?



Performance-wise, both formulas do the trick.  Normally we'd opt for a simpler formula like the linear win percentage formula, but the Pythagorean formula is still fairly simple and elegant.  There are also a few areas where the Pythagorean formula is a bit better.

If the run differential is ever really high or low, the linear win percentage formula could produce a value greater than 1 or less than 0.  The Pythagorean formula also does a bit better in handling teams at the extremes (like the 2001 Seattle Mariners or 2003 Detroit Tigers).  
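
To make the out-of-range issue concrete, the sketch below plugs an extreme, hypothetical scoring line into both formulas, using the rounded coefficients $\alpha = 0.500$ and $\beta = 0.101$ quoted earlier.  The per-game run values are invented and far beyond anything a real team produces.

```python
ALPHA, BETA = 0.500, 0.101  # rounded linear-fit coefficients quoted in this notebook

def linear_wpct(r_per_g, ra_per_g):
    """Linear expected win pct from run differential per game."""
    return ALPHA + BETA * (r_per_g - ra_per_g)

def pythag_wpct(r_per_g, ra_per_g, exponent=2):
    """Pythagorean expected win pct."""
    return r_per_g**exponent / (r_per_g**exponent + ra_per_g**exponent)

# Hypothetical juggernaut: 8.0 runs scored and 2.5 runs allowed per game
print(linear_wpct(8.0, 2.5))   # exceeds 1, which is not a valid win pct
print(pythag_wpct(8.0, 2.5))   # stays strictly between 0 and 1
```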

Finally, the Pythagorean formula better quantifies performance by being dependent on the run ratio,
$$
    \text{Runs Scored}\ /\ \text{Runs Allowed}
$$
instead of the run differential
$$
    \text{Runs Scored} - \text{Runs Allowed}.
$$

To see why this is more desirable, consider an era where defense and pitching are strong and fewer runs are scored.  The linear win percentage formula will always require about a 10-run change in total run differential to increase expected wins by 1.  A run-poor environment like this should require fewer runs in order to increase winning percentage.  Conversely, a run-rich environment should require more runs.  The Pythagorean formula captures this.
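
A small sketch of this effect, using two invented teams with the same +50 run differential in different scoring environments (the totals are hypothetical, not drawn from the data):

```python
def pythag_wpct(runs_scored, runs_allowed, exponent=2):
    """Pythagorean expected win pct."""
    return runs_scored**exponent / (runs_scored**exponent + runs_allowed**exponent)

# Two hypothetical teams, each outscoring opponents by 50 runs over a season
low_scoring = pythag_wpct(600, 550)    # run-poor environment
high_scoring = pythag_wpct(900, 850)   # run-rich environment

print(round(low_scoring, 3), round(high_scoring, 3))
```

The same 50 runs are worth more expected wins in the run-poor environment, whereas a linear formula in run differential would rate the two teams identically.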

The plot in the following cell shows how win percentage, expected win percentage from the linear fit, and the Pythagorean Expectation compare as a function of the run ratio.

In [None]:
K = 2
lahman['RR'] = lahman['R'] / lahman['RA']
lahman['pythag_Wpct'] = 1 / (1 + (1 / lahman['RR'])**K)

In [None]:
fig, ax = plt.subplots()
lahman.plot.scatter(ax=ax, x="RR", y='Wpct', color='C0')
lahman.plot.scatter(ax=ax, x="RR", y='xWpct', color='C1', label='xWpct')
lahman.plot.scatter(ax=ax, x="RR", y='pythag_Wpct',
                    color='C2', label='pythag_Wpct')
plt.legend()
ax.set_ylabel('Wpct');

### Deriving the Pythagorean Expectation Formula

Where does the exponent come from in the Pythagorean Expectation formula?  Consider the ratio of wins to losses:
$$
    \frac{\text{Pythagorean Win Pct}}{\text{Pythagorean Loss Pct}}
        = \frac{\text{Pythagorean Win Pct}}{1 - \text{Pythagorean Win Pct}}
        = \left(\frac{\text{Runs Scored}}{\text{Runs Allowed}}\right)^2
$$
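
The last equality follows because the loss percentage shares its denominator with the win percentage, so the denominators cancel in the ratio:
$$
\begin{aligned}
    1 - \text{Pythagorean Win Pct}
        &= \frac{\text{Runs Allowed}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}, \\
    \frac{\text{Pythagorean Win Pct}}{1 - \text{Pythagorean Win Pct}}
        &= \frac{\text{Runs Scored}^2}{\text{Runs Allowed}^2}
        = \left(\frac{\text{Runs Scored}}{\text{Runs Allowed}}\right)^2.
\end{aligned}
$$
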
If we take the log we get a linear relationship:
$$
    \log\left(\frac{\text{Pythagorean Win Pct}}{\text{Pythagorean Loss Pct}}\right)
        = 2\log\left(\frac{\text{Runs Scored}}{\text{Runs Allowed}}\right)
$$

This suggests that we should explore the log win-loss ratio on the left-hand side and the log run ratio on the right-hand side. 

In the next cell, we show the scatter plot and the result of the linear fit
$$
    \text{Log Win-Loss Ratio} = K \cdot \text{Log Run Ratio}
$$

The linear relationship of the log values is clear.  And our fit produces a value of $K = 1.87$.  This shows where the Pythagorean Expectation formula comes from.  While not exactly the same, taking $K=2$ is "good enough".

In [None]:
lahman['log_RR'] = np.log(lahman['RR'])
lahman['log_Wrat'] = np.log(lahman["W"] / lahman["L"])

params, predictions, errors = linear_fit(
    lahman['log_RR'], lahman['log_Wrat'], constant=False)

K = params['log_RR']

print("Computed Linear Fit:")
print("====================")
s = "xlog_Wrat = {K:.2f} * log_RR".format(K=K)
print(s)

fig, ax = plt.subplots()
lahman.plot.scatter(ax=ax, x='log_RR', y='log_Wrat')

pythag_log_Wrat = lahman['log_RR'] * 2
lahman['xlog_Wrat'] = predictions

lahman.plot.scatter(ax=ax, x='log_RR', y='xlog_Wrat', color='C1')
plt.plot(lahman['log_RR'], pythag_log_Wrat, '.',
         color='C2', label='pythag_log_Wrat')
plt.legend();
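
Passing `constant=False` presumably drops the intercept, so the fit is a line through the origin.  Assuming ordinary least squares, that slope has the closed form $K = \sum x_i y_i / \sum x_i^2$.  The sketch below applies it to synthetic win-loss ratios generated exactly from the $K = 2$ formula, so it should recover 2; real data will not fit exactly.

```python
import numpy as np

# Synthetic run ratios; win-loss ratios are generated from the K = 2 formula
run_ratio = np.array([0.8, 0.9, 1.0, 1.1, 1.25])
win_loss_ratio = run_ratio**2

x = np.log(run_ratio)
y = np.log(win_loss_ratio)

# Least-squares slope for a line through the origin (no intercept term)
K = np.sum(x * y) / np.sum(x**2)
print(round(K, 2))
```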

### Pythagorean Luck

One thing we can use Pythagorean Expectation for is determining the extent of luck or lack of it.  Given that we have a formula for expected winning percentage given a run differential, we consider deviations from the expectation to be attributable to luck.  

_Question_

1. Before reading on, think about that assumption.  What are we assuming about the way a team scores runs both game to game but also in relation to its opponents?

So why are we considering this definition of luck?  Our assumption is that a team cannot perform magic and optimally allocate runs so that it wins many games by only one run.  Instead, the spread of runs is random enough that over 162 games, a short season by statistical standards, pockets of run clustering can form.  For example, a team that managed to win a lot of one-run games despite a relatively poor run differential was lucky in how its runs clustered together to produce wins.  

We compute Pythagorean Luck as the difference in wins and expected wins:
$$
    \text{Pythagorean Luck} = \text{Games} \times (\text{Win Pct} - \text{Pythagorean Win Pct})
$$
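
As a worked example of the luck formula, here is a minimal sketch for a single invented team.  The function name and all of the numbers are hypothetical.

```python
def pythag_luck(games, wins, runs_scored, runs_allowed, exponent=2):
    """Actual wins minus Pythagorean expected wins."""
    pythag_wpct = runs_scored**exponent / (runs_scored**exponent + runs_allowed**exponent)
    return games * (wins / games - pythag_wpct)

# Hypothetical team: 93 wins in 162 games, 750 runs scored, 700 allowed
luck = pythag_luck(162, 93, 750, 700)
print(round(luck, 1))
```

A positive value means the team won more games than its runs scored and allowed would suggest; a negative value means it won fewer.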

_Questions_

1. What is one way we could employ the Pythagorean Win Pct formula and the notion of Pythagorean Luck midseason?
2. Based on the histogram below, how much do you think luck plays a role in determining the final standings?

In [None]:
lahman['pythag_luck'] = lahman['G'] * (lahman['Wpct'] - lahman['pythag_Wpct'])

In [None]:
lahman[['yearID', 'name', 'W', 'L', 'RD', 'pythag_luck']].\
    sort_values(by='pythag_luck', ascending=False).\
    head(10)

In [None]:
lahman[['yearID', 'name', 'W', 'L', 'RD', 'pythag_luck']].\
    sort_values(by='pythag_luck', ascending=True).\
    head(10)

In [None]:
lahman['pythag_luck'].plot.hist(bins=70);

In [None]:
lahman['pythag_W'] = lahman['G'] * lahman['pythag_Wpct']

fig, axarr = plt.subplots(ncols=2, figsize=(12, 5))
lahman.plot.scatter(ax=axarr[0], x='W', y='pythag_luck')
lahman.plot.scatter(ax=axarr[1], x='pythag_W', y='pythag_luck');

### 10 Runs to a Win?

There is a common approximation in baseball analysis that it takes about 10 extra runs to generate a win.  That is, all things being equal, if you add 10 runs to a team's season total (or take 10 away from its runs allowed total, or in general add 10 to its differential), you should expect an increase of about 1 win.  Can we derive this empirically?  Yes!

First off, from the linear formula for expected winning percentage, it is clear that increasing the run differential by 10 runs translates to about 1 win (this is found by $1 / \beta$).  A similar value holds for the Pythagorean Expectation, which we explore here.

To find the number of runs needed to increase wins, we first take a derivative of the Pythagorean Expectation with respect to runs scored.  This will give us the change in wins per run scored.  From a little calculus and algebra:
$$
    \text{Change in Wins per Run Scored}
    = 2\frac{\text{Games}}{\text{Runs Scored}} \cdot
        \left(
            \frac
                {\text{Runs Allowed}\ /\ \text{Runs Scored}}
                {1 + (\text{Runs Allowed}\ /\ \text{Runs Scored})^2}
        \right)^{2}
$$

We invert the formula to get the change in runs scored per win.  This gives us what we're interested in.  We can also convert Runs Scored and Runs Allowed to 'per game' values and drop Games.
$$
    \text{Change in Runs Scored per Win}
    = \frac12 \cdot
        \text{Runs Scored per Game} \cdot
        \left(
            \frac
                {1 + (\text{Runs Allowed per Game}\ /\ \text{Runs Scored per Game})^2}
                {\text{Runs Allowed per Game}\ /\ \text{Runs Scored per Game}}
        \right)^{2}
$$
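
One way to check the calculus is to compare the analytic derivative against a finite-difference approximation.  The sketch below does this for a hypothetical league-average team; the function names are illustrative only.

```python
def pythag_wins(runs_scored, runs_allowed, games=162):
    """Pythagorean expected wins from season run totals."""
    return games * runs_scored**2 / (runs_scored**2 + runs_allowed**2)

def wins_per_run(runs_scored, runs_allowed, games=162):
    """Analytic derivative of expected wins with respect to runs scored."""
    ratio = runs_allowed / runs_scored
    return 2 * games / runs_scored * (ratio / (1 + ratio**2))**2

# Hypothetical average team: 5 runs scored and allowed per game over 162 games
R = RA = 5 * 162
h = 1e-3
numeric = (pythag_wins(R + h, RA) - pythag_wins(R - h, RA)) / (2 * h)
analytic = wins_per_run(R, RA)

# The two derivatives should agree, and the reciprocal is runs per win
print(numeric, analytic, 1 / analytic)
```

For this team the reciprocal works out to 10 runs per win, matching the rule of thumb.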

Okay, great!  But what does this formula tell us?  As is, it's not particularly illuminating but we can plug in some values and get a feel for it.

In the next cell, we produce a runs-to-wins table for various run-scoring environments. 

_Questions_

1. What do you observe about the lower right corner of the chart where more runs are scored by teams?
2. What about the upper left corner when fewer runs are scored?
3. How does the value of an extra run change as more runs are scored?



To put this in historical perspective, in 1968 runs per game was as low as 3.4 and in 2000 runs per game was as high as 5.2.  The varying level of offense in baseball shows that the value of a hitter's performance can vary considerably depending on the run-scoring environment.  E.g., in 2000 a hitter with 30 home runs wouldn't even sniff the top ten, but in 1968 he would be near the top 5.  And things can change pretty quickly, not just over multiple decades!

In [None]:
def runs_per_win(R=5, RA=5):
    return R * (((RA / R)**2 + 1.) / (RA / R))**2 / 2.


runs = np.arange(3, 6.5, .5)
run_vals = np.around(
    [[runs_per_win(R=r, RA=ra) for r in runs] for ra in runs], decimals=2)
runs_to_wins = pd.DataFrame(
    run_vals,
    index=pd.Series(runs, name='RAperG'),
    columns=pd.Series(runs, name='RperG')
)
runs_to_wins

## On Your Own: Pythagorean Expectation for NBA

In this portion of the lab, you will perform the same analysis for the NBA, reusing much of the same code from above but tweaked wherever necessary.  If you are unsure how to do something, just look to the corresponding part of the MLB section and emulate the code.  The data is loaded in the first cell.

The columns (excluding some self-explanatory ones):
+ `lg_id`: League ID
+ `off_rtg`: Offensive rating.  Number of points scored per 100 possessions.
+ `def_rtg`: Defensive rating.  Number of points allowed per 100 possessions.
+ `off_rtg_rel`, `def_rtg_rel`: Off rating and Def rating relative to the league average for the season.
+ `mp`: minutes played
+ `pts`: points scored
+ `opp_pts`: opponent points scored

In [None]:
nba = pd.read_csv(
    "nba_team_season_data.csv", 
    usecols=[1, 2, 3, 8, 9, 10, 11, 12, 16, 17, 38, 59, 60, 61]
)
nba.head()

##### 1. The first thing we need to do is compute the winning percentage $\text{Win Pct} = W / G$.

##### 2. Then we need to compute the Net Rating
\begin{align*}
    \text{Net Rating} & = \text{Off Rating} - \text{Def Rating}
\end{align*}

_Question_

1. Should we use 82 for the number of games, or perhaps something else?  What happened in the NBA that might necessitate not using 82?

Show the top 10 team seasons by Net Rating.  Only show the following columns: `team_id, wins, losses, off_rtg, def_rtg, net_rtg`

##### 3. Reusing the code from the MLB portion (but with changes to fit for the NBA example), compute the model
$$
    \text{Linear Win Pct} = \alpha  + \beta \cdot \text{Net Rating}
$$
where $\alpha$ gives $\text{Average Win Pct}$ and $\beta$ gives $\text{Win Pct per Net Rating}$.  It should be pointed out that Net Rating is in points per 100 possessions.

The model (and the Pythagorean Expectation model) could be done using per game values and the results would be basically identical.

Also, plot the linear model results as we did with MLB.

_Question_

1. For what values of Net Rating does $\text{Linear Win Pct} < 0$ and $\text{Linear Win Pct} > 1$?  How much of an issue is that here?


The estimated value of $\beta$ should be about $0.03$.

Compute the "Net Rating per Win" from the linear model.

##### 4. Now compute the following values:
\begin{align*}
    \text{Points Ratio} & = \text{Off Rating}\ /\ \text{Def Rating} \\
    \text{Log Points Ratio} & = \log(\text{Points Ratio}) \\
    \text{Log Win Ratio} & = \log(\text{Wins}\ /\ \text{Losses})
\end{align*}

##### 5. Again, reusing the MLB code with appropriate changes, compute a Pythagorean exponent for the NBA

Plot the results of the model for the Pythagorean exponent.  

You should get a large value (around 14).  We could perform this analysis on all sorts of sports. 

_Question_

1. What does it mean for a sport to have a large Pythagorean exponent versus a small exponent?  Put another way, if a team scores more points than it allows, in which sport would you expect that team to have a higher winning percentage, one with a large exponent or one with a small exponent?


##### 6. Using the computed exponent, compute the Pythagorean Expectation
If the previous cell isn't working right away, skip it and use 14 for the exponent.

_Question_

1. For teams with really poor net scoring performance, how does the Pythagorean formula compare to the linear formula?  Which seems to perform better in this case?



##### 7. Compute Pythagorean Luck

Again, use the columns `team_id, wins, losses, off_rtg, def_rtg, net_rtg`

+ Who have been the "luckiest" teams?
+ Who have been the "unluckiest"?

##### 8. Compute a table of rating per game to wins

+ A function with the Net Rating per Win formula has been provided
+ A range of rating values for offensive and defensive rating has been provided
+ Compute the net rating per win for various rating values

_Question_

1. Say a team has a star player who averages 20-30 points per game.  Give your own estimate (with a brief explanation of your thinking) of how many extra games we should expect the team to lose without the star.

In [None]:
def rtg_per_win(offrtg, defrtg, K):
    return offrtg * ((defrtg / offrtg)**K + 1.)**2 / (defrtg / offrtg)**K / K

rtg_rng = np.arange(85, 130, 5)
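
One way to sanity-check `rtg_per_win` before building the table: with $K = 2$ it should reduce to the MLB `runs_per_win` formula, and when offensive and defensive ratings are equal it simplifies to $4 \cdot \text{rating} / K$.  The sketch below restates both functions so that it runs on its own; the exponent 14 is the approximate NBA value mentioned above, and the rating of 110 is an arbitrary illustrative value.

```python
def runs_per_win(R=5, RA=5):
    return R * (((RA / R)**2 + 1.) / (RA / R))**2 / 2.

def rtg_per_win(offrtg, defrtg, K):
    return offrtg * ((defrtg / offrtg)**K + 1.)**2 / (defrtg / offrtg)**K / K

# With K = 2 the general formula matches the MLB one
print(rtg_per_win(4, 5, 2), runs_per_win(4, 5))

# Equal ratings: the result is 4 * rating / K
print(rtg_per_win(110, 110, 14), 4 * 110 / 14)
```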

## Wrap Up

And that wraps up our exploration of runs, points, and wins and how we can link them through a beautiful empirical rule like the Pythagorean Expectation formula.