# Demo - Pythagorean Expectation

## MLB: The Relationship between Runs and Wins

Bill James' formula known as _Pythagorean Expectation_ for MLB is summarized as
$$
    \text{Pythagorean Win Pct}
        = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}
        = \frac{(\text{Runs Scored}\ /\ \text{Runs Allowed})^2}{1 + (\text{Runs Scored}\ /\ \text{Runs Allowed})^2}
$$
The formula produces an expected winning percentage given run scoring data.  The name comes from the similar appearance to the classic Pythagorean Theorem.
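As a quick illustration, the formula can be written as a small Python function. This is a minimal sketch: the function name `pythagorean_win_pct` and the season totals are hypothetical, chosen only to show the arithmetic.

```python
def pythagorean_win_pct(runs_scored, runs_allowed):
    """Expected winning percentage from runs scored and runs allowed."""
    rs_sq = runs_scored ** 2
    ra_sq = runs_allowed ** 2
    return rs_sq / (rs_sq + ra_sq)


# Hypothetical season totals (not taken from any dataset)
print(round(pythagorean_win_pct(800, 700), 3))
```

A team that scores exactly as many runs as it allows comes out at .500, as we would hope.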

### Background

The Pythagorean Expectation is an empirically motivated relationship between the runs a team scores and allows and the team's winning percentage: the original insight came from observing the data.  A theoretical justification exists but is beyond the scope of this demo; it can be read about here: https://arxiv.org/abs/math/0509698.


An obvious result of the Pythagorean Expectation formula is that if a team scores more runs, holding runs allowed fixed, its expected winning percentage will go up.  It cannot be stressed enough that this formula is not exact, hence the usage of the term _expected_.

Let us begin.  This notebook explores the empirical relationship between runs and wins, derives the Pythagorean Expectation formula, explores some of its consequences, and then in the second part does the same for NBA data.

### Setup

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
from datascience import Table

# custom functions that will help do some simple tasks
# scatterplot_by_x: scatter plots where the x-axis quantity varies
# linear_fit: compute a simple linear fit and return the params, 
# predictions, and errors
from datascience_utils import scatterplot_by_x
from datascience_stats import linear_fit

import numpy as np

### Load the Data

We'll be using the Lahman databank, an open source data collection with yearly results for teams and players.  The team data from the Lahman databank is in the CSV file `lahman_teams.csv`.

In [None]:
# Load lahman_teams.csv obtained from the Lahman databank
# We will only want a subset of columns
lahman = Table.read_table(
    "lahman_teams.csv", usecols=[0, 1, 2, 3, 6, 8, 9, 14, 26, 40])

# Define some extra values: win pct, loss pct, and run differential
lahman['Wpct'] = lahman['W'] / lahman["G"]
lahman['Lpct'] = 1 - lahman['Wpct']
lahman["RD"] = lahman["R"] - lahman["RA"]
lahman["RDperG"] = lahman["RD"] / lahman["G"]

# Restrict to seasons from 2000 onward
lahman = lahman.where(lahman['yearID'] >= 2000)

lahman.show(5)

### First Look
Let's create scatter plots showing how winning percentage relates to runs scored, runs allowed, and run differential.  Clearly, as runs scored increases, runs allowed decreases, or run differential increases, we should expect to win more games.  While it is not guaranteed that scoring more runs or allowing fewer runs will yield more wins, the tendency is quite strong.  The strongest relationship is clearly with run differential, since winning isn't solely about scoring or preventing runs but about doing both.

Also, the relationship appears to be very linear.  An incremental improvement in run differential will yield the same improvement in expected winning percentage regardless of the overall size of the run differential.  A team with a net negative 200 run differential improving by 30 runs will see the same increase in expected winning percentage as a team with a net positive 200 run differential improving by 30 runs.

_Question_

1. What do you make of the phrase "defense wins championships" given the plots below?  Is preventing more runs noticeably more conducive to a higher winning percentage than scoring more runs?  Or are they pretty balanced?

In [None]:
# scatterplot_by_x is the custom plotting helper imported from datascience_utils
scatterplot_by_x(
    lahman, ['R', 'RA', 'RD'], 'Wpct', sharey=True, title="Runs vs Win Pct")

## 1. Our First Model for Winning Percentage

### Linear Relationship between Winning Pct and Run Differential

Let's compute a simple linear fit for winning percentage against run differential per game.  This is given by the equation
$$
    \text{Linear Win pct} = \alpha  + \beta \cdot \text{Run Differential per Game}
$$
where $\alpha$ gives $\text{Average Win Pct}$ (which we reason should be $0.500$) and $\beta$ gives $\text{Wins per Unit Run Differential}$.

The next cell computes $\alpha=0.500$ and $\beta = 0.101$.  

In [None]:
# Compute a linear fit
params, predictions, errors = linear_fit(lahman['RDperG'], lahman['Wpct'])
lahman['LinWpct'] = predictions

alpha, beta = params
print("Computed Linear Model:")
print("====================")
s = "LinWpct = {alpha:.3f} + {beta:.3f} * RDperG".format(alpha=alpha, beta=beta)
print(s)

In [None]:
lahman.scatter("RDperG", select='Wpct', fit_line=True)

Recall $\alpha$ gives $\text{Average Win Pct}$ (which we reason should be $0.500$) and $\beta$ gives $\text{Wins per Unit Run Differential}$.
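Since `linear_fit` is a custom helper, it may help to see what such a fit could look like with plain NumPy.  The sketch below assumes the helper performs ordinary least squares; the actual implementation in `datascience_stats` may differ.  The data here is synthetic and noise-free, not taken from the Lahman databank.

```python
import numpy as np

# Synthetic, noise-free data (NOT from the Lahman databank)
rd_per_game = np.linspace(-2.0, 2.0, 30)
wpct = 0.5 + 0.1 * rd_per_game

# np.polyfit returns coefficients from highest degree to lowest,
# so the slope comes first and the intercept second
beta, alpha = np.polyfit(rd_per_game, wpct, deg=1)

predictions = alpha + beta * rd_per_game
errors = wpct - predictions

print("alpha = {:.3f}, beta = {:.3f}".format(alpha, beta))
```

Because the synthetic data lies exactly on a line, the fit recovers the intercept and slope used to generate it and the errors are essentially zero.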

_Questions_

1. What are the units for $\alpha$ and $\beta$?  
2. What is the meaning of the reciprocal $1 / \beta$?
3. If a team improves its run differential per game by .1 runs (16.2 total runs / 162 games), how many more wins do we expect it to have?

## 2. Our Second Model for Winning Percentage

### Pythagorean Expectation

Recall from above Bill James' formula for Pythagorean Expectation:
$$
    \text{Pythagorean Win Pct}
        = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}
        = \frac{(\text{Runs Scored}\ /\ \text{Runs Allowed})^2}{1 + (\text{Runs Scored}\ /\ \text{Runs Allowed})^2}
$$

Pythagorean Expectation uses a measure of quality of a team:
$$
    \text{Run Ratio} = \text{Runs Scored}\ /\ \text{Runs Allowed}
$$
An average team will have $\text{Runs Scored} = \text{Runs Allowed}$ and therefore $ \text{Run Ratio} = 1$.  If $ \text{Run Ratio} > 1$, the team is good and we expect the winning percentage to be $> .500$.  Conversely, if $\text{Run Ratio} < 1$, the team is not good and we expect the winning percentage to be $< .500$.

_Questions_

1. What's the major difference between Pythagorean Expectation and our linear model above?  Is Pythagorean Expectation linear?

2. What happens in the linear model if the run differential is ever really high or low?  Should Win Pct ever be less than 0 or larger than 1?



In [None]:
lahman['RR'] = lahman['R'] / lahman['RA']
lahman['pythag_Wpct'] = lahman['RR']**2 / (1 + lahman['RR']**2)

#### Comparing the Linear Model and Pythagorean Expectation

Before, we plotted the linear model against Run Differential.  We'll compare the two models with respect to Run Ratio though.

The plot in the following cell shows how win percentage, expected win percentage from the linear fit, and the Pythagorean Expectation compare as a function of the run ratio.

In [None]:
lahman.scatter("RR", select=['Wpct', 'LinWpct', 'pythag_Wpct'])

## 3. Deriving the Pythagorean Expectation Formula

### Measuring a Team's Quality

Performance-wise, both of the above formulae do the trick.  Normally we'd opt for a simpler formula like the linear win percentage formula, but the Pythagorean formula is still fairly simple and elegant.  There are also a few areas where the Pythagorean formula is better.

If the run differential is ever really high or low, the linear win percentage formula could be greater than 1 or negative.  The Pythagorean formula also does a bit better in handling teams at the extremes (like the 2001 Seattle Mariners or 2003 Detroit Tigers).
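To make the boundary problem concrete, here is a small sketch comparing the two formulas for a deliberately extreme, hypothetical team.  It plugs in the linear fit values quoted earlier in this notebook ($\alpha = 0.500$, $\beta = 0.101$); the function names are illustrative only.

```python
alpha, beta = 0.500, 0.101  # linear fit values quoted earlier in this notebook


def linear_wpct(rd_per_game):
    """Linear model: win pct as a function of run differential per game."""
    return alpha + beta * rd_per_game


def pythag_wpct(r_per_game, ra_per_game):
    """Pythagorean Expectation with exponent 2."""
    rr = r_per_game / ra_per_game
    return rr**2 / (1 + rr**2)


# Hypothetical, unrealistically dominant team: 9 runs scored, 3 allowed per game
r, ra = 9.0, 3.0
print(round(linear_wpct(r - ra), 3))   # exceeds 1, which is impossible
print(round(pythag_wpct(r, ra), 3))    # stays between 0 and 1
```

No real team is this lopsided, but the example shows that only the Pythagorean formula is guaranteed to return a valid percentage.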

The Pythagorean formula better quantifies performance by being dependent on the Run Ratio,
$$
    \text{Runs Scored}\ /\ \text{Runs Allowed}
$$
instead of the Run Differential
$$
    \text{Runs Scored} - \text{Runs Allowed}.
$$

To see why this is more desirable, consider an era where defense and pitching are strong and fewer runs are scored.  The linear win percentage formula will always require a 10 run change in total run differential to increase expected wins by 1.  A "run poor" environment like this should require fewer runs in order to increase winning percentage.  Conversely, a "run rich" environment with lots of scoring should require more runs.  The Pythagorean Expectation model captures this.

#### Derivation 1: Win Percentage Proportional to Run Ratio

A team's quality is given by its Run Ratio 
$$
    \mathit{RR} = \text{Runs Scored}\ /\ \text{Runs Allowed}.
$$
The average quality of all its opponents is given by $\frac{1}{\mathit{RR}} = \text{Runs Allowed}\ /\ \text{Runs Scored}$.  We can pose a simple model that says that winning percentage is (roughly) proportional to the Run Ratio:
$$
    \text{Win Pct} = \frac{\mathit{RR}}{\mathit{RR} + \frac{1}{\mathit{RR}}}
$$

If we clear the fractions, we get the Pythagorean Expectation
$$
    \text{Win Pct} = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}
$$
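Spelled out, clearing the fractions means multiplying the numerator and denominator first by $\mathit{RR}$ and then by $\text{Runs Allowed}^2$:
$$
    \frac{\mathit{RR}}{\mathit{RR} + \frac{1}{\mathit{RR}}}
        = \frac{\mathit{RR}^2}{\mathit{RR}^2 + 1}
        = \frac{(\text{Runs Scored}\ /\ \text{Runs Allowed})^2}{(\text{Runs Scored}\ /\ \text{Runs Allowed})^2 + 1}
        = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}
$$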

#### Derivation 2: Odds Proportional to (Run Ratio)^2

The odds of winning for a team are simply defined as:
$$
    \text{Odds} = \frac{\text{Win Pct}}{1 - \text{Win Pct}} = \frac{\text{Wins}}{\text{Losses}}
$$

If we posit that the odds are proportional to the squared Run Ratio (with a proportionality constant of 1), we get
$$
    \text{Odds} = \mathit{RR}^2.
$$

Doing the algebra will yield Pythagorean Expectation
$$
    \text{Win Pct} = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2}
$$

#### Derivation 3: Log Odds Linear Relationship with Log Run Ratio
 
In statistical modeling, we often favor sums/differences instead of ratios.  If we instead model the log-odds as having a linear relationship with Log Run Ratio, we get
$$
    \log\text{Odds} = \log \frac{W}{L} = \log \frac{\text{Win Pct}}{1 - \text{Win Pct}}
        = K\log\left(\frac{\text{Runs Scored}}{\text{Runs Allowed}}\right)
$$

Setting $K=2$ and working out the algebra gives us the Pythagorean Expectation formula.

## 4. Computing the Exponent in Pythagorean Expectation: Why should we use 2?

In derivation 3 above, we modeled the log-odds as linear with respect to the log Run Ratio.  Let's explore that model to see why we take $K=2$, and whether we can find a better value than 2!
$$
    \log\text{Odds} = \log \frac{W}{L} = \log \frac{\text{Win Pct}}{1 - \text{Win Pct}}
        = K\log\left(\frac{\text{Runs Scored}}{\text{Runs Allowed}}\right)
$$

For some value of $K$, the algebra will lead us to the formula
$$
    \text{Win Pct} = \frac{\text{Runs Scored}^K}{\text{Runs Scored}^K + \text{Runs Allowed}^K}
$$
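For reference, the algebra can be sketched as follows, writing $\mathit{RR}$ for the Run Ratio as in the derivations above.  We exponentiate both sides of the log-odds equation and then solve for the winning percentage:
\begin{align*}
    \frac{\text{Win Pct}}{1 - \text{Win Pct}} & = \mathit{RR}^K \\
    \text{Win Pct} & = \mathit{RR}^K \left(1 - \text{Win Pct}\right) \\
    \text{Win Pct} \left(1 + \mathit{RR}^K\right) & = \mathit{RR}^K \\
    \text{Win Pct} & = \frac{\mathit{RR}^K}{1 + \mathit{RR}^K}
        = \frac{\text{Runs Scored}^K}{\text{Runs Scored}^K + \text{Runs Allowed}^K}
\end{align*}
The last step multiplies the numerator and denominator by $\text{Runs Allowed}^K$.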

We saw that Bill James suggested $K=1.83$.  We also saw that other sports have their own coefficients.

In [None]:
lahman['log_RR'] = np.log(lahman['RR'])
lahman['log_odds'] = np.log(lahman["W"] / lahman["L"])
lahman['pythag_log_odds'] = lahman['log_RR'] * 2

In [None]:
params, predictions, errors = linear_fit(
    lahman['log_RR'], lahman['log_odds'], constant=False)
lahman['xlog_odds'] = predictions

K = params.item()

print("Computed Linear Fit:")
print("====================")
s = "xlog_odds = {K:.2f} * log_RR".format(K=K)
print(s)

In [None]:
lahman.scatter('log_RR', select=['log_odds', 'xlog_odds', 'pythag_log_odds'])

The linear relationship of the log values is clear.  This is important because if this wasn't the case empirically, all our modeling would be bunk!

Our fit produces a value of $K = 1.87$ rather than 1.83 because we're looking at a different dataset than the one that produced 1.83.  However, you'll notice that $K=2$ is not too different from $K=1.87$.
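The `constant=False` argument suggests that the fit is forced through the origin.  Assuming that means ordinary least squares with no intercept (the helper's actual implementation may differ), the slope has a simple closed form, sketched below.  The function name `fit_through_origin` and the run ratios are hypothetical.

```python
import numpy as np


def fit_through_origin(x, y):
    """Least-squares slope for the no-intercept model y = K * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(x * y) / np.sum(x * x)


# Synthetic check: log-odds generated with an exponent of exactly 2
log_rr = np.log(np.array([0.8, 0.9, 1.0, 1.1, 1.25]))
log_odds = 2.0 * log_rr

print(round(fit_through_origin(log_rr, log_odds), 4))
```

Dropping the intercept builds in the requirement that a team with a Run Ratio of 1 has even odds of winning.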

## 5. Pythagorean Luck

One thing we can use Pythagorean Expectation for is determining the extent of luck or lack of it.  Given that we have a formula for expected winning percentage given a run scoring profile, we consider deviations from the expectation to be attributable to luck.  

_Question_

Before reading on, think about that assumption.  What are we assuming about the way a team scores runs both game to game but also in relation to its opponents?  Are we assuming that the team has an ability to control how it sequences or spaces out its runs to maximize winning?  Or are we assuming a team _cannot_ perform magic and optimally allocate runs so that it wins many games by only one run?


It turns out that 162 games is not that many.  Runs can cluster enough that the model's expected number of wins and the actual number of wins may differ by several games.  For example, a team that managed to win a lot of one-run games despite a relatively poor run differential can potentially be considered "lucky" in how its runs clustered together to produce wins.

We compute Pythagorean Luck as the difference in wins and expected wins:
$$
    \text{Pythagorean Luck} = \text{Games} \times (\text{Win Pct} - \text{Pythagorean Win Pct})
$$
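As a worked example of the formula, consider a hypothetical team (the numbers are made up for illustration) that wins 90 of 162 games while scoring exactly as many runs as it allows.  The function name `pythagorean_luck` is also illustrative.

```python
def pythagorean_luck(games, wins, runs_scored, runs_allowed):
    """Actual wins minus Pythagorean expected wins (exponent 2)."""
    wpct = wins / games
    rr = runs_scored / runs_allowed
    pythag_wpct = rr**2 / (1 + rr**2)
    return games * (wpct - pythag_wpct)


# Hypothetical team: 90 wins, with runs scored equal to runs allowed
luck = pythagorean_luck(games=162, wins=90, runs_scored=700, runs_allowed=700)
print(round(luck, 1))
```

With equal runs scored and allowed, the expectation is 81 wins, so this team finishes about 9 wins ahead of its Pythagorean expectation.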

_Questions_

1. What is one way we could employ the Pythagorean Win Pct formula and the notion of Pythagorean Luck midseason?
2. Based on the histogram below, how much do you think luck plays a role in determining the final standings?

In [None]:
lahman['pythag_luck'] = (lahman['G'] * (lahman['Wpct'] - lahman['pythag_Wpct'])).round()

In [None]:
lahman.select('yearID', 'name', 'W', 'L', 'RD', 'pythag_luck').\
    sort('pythag_luck', descending=True).\
    show(10)

In [None]:
lahman.select('yearID', 'name', 'W', 'L', 'RD', 'pythag_luck').\
    sort('pythag_luck', descending=False).\
    show(10)

In [None]:
lahman.hist('pythag_luck', bins=70)

## 6. 10 Runs per Win?

There is a common approximation in baseball analysis that it takes about 10 extra runs to generate a win.  That is, all things being equal, if you add 10 runs to a team's season total (or take 10 away from its runs allowed total, or in general add 10 to its differential), you should expect an increase of about 1 win.  Can we derive this empirically?  Yes!

### 10 Runs per Win from the Linear Model

First off, from the linear formula for expected winning percentage, it is clear that increasing the run differential by 10 runs translates to about 1 win (this is found by $1 / \beta$).  A similar value holds for the Pythagorean Expectation, which we explore here.

### Runs per Win from Pythagorean Expectation

To find the number of runs needed to increase wins, we first take a derivative of the Pythagorean Expectation with respect to the run ratio.  This will give us the change in wins per change in run ratio.  From a little calculus and algebra:
\begin{align*}
    \text{Change in Wins per Run Ratio}
      & = \frac{d}{d\text{Run Ratio}}\text{Games}\times
        \left(
            \frac
                {\text{Run Ratio}^2}
                {1 + \text{Run Ratio}^2}
        \right) \\
      & = \frac{2 \times \text{Games}}{\text{Run Ratio}^3}
        \left(
            \frac
                {\text{Run Ratio}^2}
                {1 + \text{Run Ratio}^2}
        \right)^{2} \\
      & = \frac{2 \times \text{Games}}{\text{Run Ratio}^3}
        \left(\text{Pythagorean Win Pct}\right)^{2}
\end{align*}

We invert the formula to get the change in run ratio per win.
\begin{align*}
    \text{Change in Run Ratio per Win}
      & = \frac{1}{2 \times \text{Pythagorean Wins}} \times
        \frac{\text{Run Ratio}^3}{ \text{Pythagorean Win Pct} }
\end{align*}

Finally, we multiply by runs allowed per game and total games to get what we're interested in.
\begin{align*}
  \text{Change in Runs Scored per Win} = \text{Games} \times \text{Runs Allowed per Game} \times \text{Change in Runs Ratio per Win}
\end{align*}
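One way to sanity-check the final formula is to compare it with a numerical slope.  The sketch below uses hypothetical per-game scoring rates and a central finite difference on the Pythagorean expected wins, and compares the result with the closed form $\text{Runs Allowed per Game} \times \text{Run Ratio}^3 / (2 \times \text{Pythagorean Win Pct}^2)$.  The helper names are illustrative.

```python
def pythag_wins(runs_scored, runs_allowed, games=162):
    """Expected wins from season run totals under the exponent-2 formula."""
    rr = runs_scored / runs_allowed
    return games * rr**2 / (1 + rr**2)


def runs_per_win(r_per_game, ra_per_game):
    """Closed form: RA per game * RR^3 / (2 * pythag win pct^2)."""
    rr = r_per_game / ra_per_game
    pyth = rr**2 / (1 + rr**2)
    return ra_per_game * rr**3 / (2 * pyth**2)


games = 162
r_pg, ra_pg = 4.5, 5.0                 # hypothetical per-game scoring rates
R, RA = r_pg * games, ra_pg * games    # corresponding season totals

# Central finite difference: expected wins gained per extra run scored
step = 1e-3
wins_per_run = (pythag_wins(R + step, RA, games)
                - pythag_wins(R - step, RA, games)) / (2 * step)

numeric = 1 / wins_per_run             # runs needed per extra win
analytic = runs_per_win(r_pg, ra_pg)
print(round(numeric, 4), round(analytic, 4))
```

The two values should agree closely, which supports the closed form without redoing the calculus by hand.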

Okay, great!  But what does this formula tell us?  It's hard to interpret too much but we can plug in some values and get a feel for it.

In the next cell, we produce a runs-to-wins table for various run scoring environments.

In [None]:
def runs_per_win(R=5, RA=5):
    RR = R / RA
    pyth = RR**2 / (1 + RR**2)
    return RA * RR**3 / 2 / pyth**2

runs = np.arange(3, 6.5, .5)

table_data = ['RA', runs]
for r in runs:
    col = 'R: ' + str(r)
    data = []
    for ra in runs:
        data.append(runs_per_win(R=r, RA=ra))
    
    table_data.extend([col, data])

t = Table()
t = t.with_columns(table_data)
t

_Questions_

1. What do you observe about the lower right corner of the table where more runs are scored by teams?
2. What about the upper left corner when fewer runs are scored?
3. How does the value of an extra run change as more runs are scored?



To put this in historical perspective, in 1968 runs per game was as low as 3.4 and in 2000 runs per game was as high as 5.2.  The varying level of offense in baseball shows that the value of a hitter's performance can vary considerably depending on the run scoring environment.  For example, in 2000 a hitter with 30 home runs wouldn't even sniff the top ten, but in 1968 he would be near the top 5.  And things can change pretty quickly, not just over multiple decades!
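To connect these two seasons to the runs-per-win formula, the sketch below plugs in the per-game figures quoted above.  It assumes a league-average team, so runs scored per game equal runs allowed per game; that simplification is ours and is not a claim about any specific team.

```python
def runs_per_win(r_per_game, ra_per_game):
    """Runs needed per extra win under the exponent-2 Pythagorean formula."""
    rr = r_per_game / ra_per_game
    pyth = rr**2 / (1 + rr**2)
    return ra_per_game * rr**3 / (2 * pyth**2)


# Assumes a league-average team: runs scored equal runs allowed
for year, rpg in [(1968, 3.4), (2000, 5.2)]:
    print(year, round(runs_per_win(rpg, rpg), 1))
# prints:
# 1968 6.8
# 2000 10.4
```

For an average team the formula reduces to twice the runs per game, so a win cost noticeably fewer runs in 1968 than in 2000.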