<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Classical-Baseball-Statistics" data-toc-modified-id="Classical-Baseball-Statistics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Classical Baseball Statistics</a></span><ul class="toc-item"><li><span><a href="#Computing-Classical-Statistics" data-toc-modified-id="Computing-Classical-Statistics-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Computing Classical Statistics</a></span></li><li><span><a href="#Classical-Stats-and-Runs-Scored" data-toc-modified-id="Classical-Stats-and-Runs-Scored-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Classical Stats and Runs Scored</a></span><ul class="toc-item"><li><span><a href="#Correlation-with-Team-Runs" data-toc-modified-id="Correlation-with-Team-Runs-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Correlation with Team Runs</a></span></li></ul></li><li><span><a href="#Correlation-Between-Statistics" data-toc-modified-id="Correlation-Between-Statistics-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Correlation Between Statistics</a></span></li></ul></li><li><span><a href="#Advanced-Statistics" data-toc-modified-id="Advanced-Statistics-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Advanced Statistics</a></span><ul class="toc-item"><li><span><a href="#OPS" data-toc-modified-id="OPS-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>OPS</a></span></li><li><span><a href="#wOBA" data-toc-modified-id="wOBA-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>wOBA</a></span></li><li><span><a href="#Advanced-Stats-and-Runs-Scored" data-toc-modified-id="Advanced-Stats-and-Runs-Scored-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Advanced Stats and Runs Scored</a></span></li></ul></li><li><span><a href="#Wrapping-up" data-toc-modified-id="Wrapping-up-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Wrapping up</a></span></li></ul></div>

# Demo - Offensive Metrics in Baseball


The classical baseball statistics are _Batting Average_, _On-Base Percentage_, and _Slugging Percentage_.  Together they make up what is commonly known as a batter's _Slash Line_, so called because the stats are displayed in order with a "/" between them.  

This demo will explore how these metrics perform in quantifying team run scoring performance.  We will look at how well a statistic correlates with team run scoring to determine how well it serves as a measurement of a team's ability to score runs.  We will then look at advanced metrics like OPS and an empirical, data-driven metric known as wOBA.  We will show how these stats improve on the classical metrics.  

The demo shows how we should think about evaluating metrics and provides motivation for exploring how the advanced metrics are constructed.

For this demo, we assume the user is familiar with the usual acronyms, shorthand abbreviations, and definitions of commonly used categories in baseball like BA for Batting Average, 1B for singles, and PA for plate appearances. 

## Classical Baseball Statistics

In this first part, we study the classical baseball statistics that have been used extensively over the years.  We use team-level data from the Lahman Database which is obtainable here in CSV form: http://www.seanlahman.com/baseball-archive/statistics/

The cell below loads the dataset and performs a few extra operations:
+ We restrict to the year 2000 and later.  We could take the analysis back further, say to 1962, but we definitely should not consider the entire dataset without careful consideration (What happened in 1962 to the schedule? What was baseball like in 1900? What about 1994?).*  
+ The dataset does not contain a field for singles, but this is easy to compute from the available data as $H - \mathit{2B} - \mathit{3B} - \mathit{HR}$.
+ $\mathit{HBP}$ is a field that is not guaranteed to be recorded in the dataset.  From 2000 onward it is, but if you change the year restriction in your explorations, you'll want to fill the missing values with 0 (we know the true value shouldn't be zero but we have no better option right now).
+ Finally, we need to add plate appearances.  The quantity we compute, $\mathit{AB} + \mathit{BB} + \mathit{HBP} + \mathit{SF}$, is only an approximate value for PA.  Other events that count as plate appearances (sacrifice bunts, for example) are not included in this sum, and holes like this are one unfortunate limitation of working with the Lahman Database.

*If you like, after working through this part of the notebook, rerun the analyses and see what happens if you comment out the year restriction.  If you know your ancient baseball history, think about the nature of the game at the end of the 19th century and the beginning of the 20th century and how it's different from today.

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
from datascience import Table
from datascience.util import table_apply

import numpy as np

# custom functions that will help do some simple tasks
from datascience_utils import *
from datascience_stats import *
from datascience_topic import fast_run_expectancy, most_common_lineup_position

In [None]:
# Load lahman_teams.csv obtained from the Lahman databank.  We only need a selection of the columns.
lahman = Table.read_table("lahman_teams.csv", usecols=[0, 3, 6] + list(range(14, 28)))

# Restrict to the year 2000 and later
lahman = lahman.where(lahman['yearID'] >= 2000).copy()

# Need to add two fields, singles and PA (which is only approximate)
lahman['1B'] = lahman['H'] - lahman['2B'] - lahman['3B'] - lahman['HR']
lahman['HBP'] = fill_null(lahman, fill_column='HBP', fill_value=0)
lahman['PA'] = lahman['AB'] + lahman['BB'] + lahman['HBP'] + lahman['SF']

lahman.show(5)

### Computing Classical Statistics

Using the formulas for BA, OBP, and SLG, we add the classical stats to the dataset.

In [None]:
# Batting Average
lahman['BA'] = lahman['H'] / lahman['AB']
# On-Base Percentage
lahman['OBP'] = (lahman['H'] + lahman['BB'] + lahman['HBP']) / lahman['PA']
# Slugging Percentage
lahman['SLG'] = (lahman['1B'] + 2 * lahman['2B'] +
                 3 * lahman['3B'] + 4 * lahman['HR']) / lahman['AB']
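
To make the three formulas concrete, here is a minimal sketch that applies them to a single set of season totals.  The function name `slash_line` and the numbers are hypothetical and chosen only for illustration; PA is approximated the same way as in the cell above.

```python
def slash_line(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Return (BA, OBP, SLG) from counting stats, mirroring the formulas above."""
    singles = h - doubles - triples - hr
    pa = ab + bb + hbp + sf  # approximate PA, as in the notebook
    ba = h / ab
    obp = (h + bb + hbp) / pa
    slg = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab
    return ba, obp, slg


# Hypothetical season totals for an illustrative team (not real data)
ba, obp, slg = slash_line(ab=5500, h=1430, doubles=290, triples=25, hr=180,
                          bb=520, hbp=60, sf=45)

# Display in the usual slash-line order: BA/OBP/SLG
print(f"{ba:.3f}/{obp:.3f}/{slg:.3f}")
```

Note that BA and SLG divide by at-bats while OBP divides by plate appearances, which matters later when the stats are combined.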

We can visualize the typical team values for the stats to get a feel for what we can expect from teams.  Team values will be inherently more concentrated than player values since teams are made up of an assortment of player ability levels.

In [None]:
lahman.hist('BA', bins=20)

In [None]:
lahman.hist('OBP', bins=20)

In [None]:
lahman.hist('SLG', bins=20)

### Classical Stats and Runs Scored

We can look at simple scatter plots between the classic stats and runs scored to see how they relate.  We will also compute some metrics that quantify the relationship, but suffice it to say that teams with higher values of these stats tend to score more runs.  The most tenuous relationship appears to be with batting average.


In [None]:
stats = ['BA', 'OBP', 'SLG']
scatterplot_by_x(lahman, stats, 'R', title='Classical Stats vs Runs')

#### Correlation with Team Runs

This is the most involved step of Part I of the lab.  For each of the three classical stats, we need to do four things:
1. Compute a linear relationship between Team Runs and the statistic.  For example, for Batting Average:
$$
    \text{Predicted Team Runs} = \alpha + \beta \cdot \text{Team Batting Average}
$$
Each statistic will have its own $\alpha$ and $\beta$ values.  We are not so interested in those values but rather in the predicted value of Team Runs given a Team Batting Average, and in the error of that prediction.
2. The error of the prediction in 1:
$$
    \text{Error} = \text{Team Runs} - \text{Predicted Team Runs}
$$
3. The correlation between the statistic and Team Runs
4. Plots of the relationship between Team Runs and the statistic, the linear relationship, and the errors.
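
The first three steps can be sketched with plain NumPy.  The notebook itself uses the `linear_fit` and `correlation` helpers from `datascience_stats`, whose internals are not shown here, so the functions below (`simple_linear_fit`, `pearson_correlation`) are hypothetical stand-ins that assume an ordinary least-squares line and a Pearson correlation.  The five data points are made up.

```python
import numpy as np


def simple_linear_fit(x, y):
    """Least-squares line y ~ alpha + beta * x.

    Returns the parameters, the predictions, and the errors (actual - predicted).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    beta, alpha = np.polyfit(x, y, deg=1)  # polyfit returns the slope first
    predictions = alpha + beta * x
    errors = y - predictions
    return (alpha, beta), predictions, errors


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]


# Made-up values for five teams (not real data)
team_ba = np.array([0.245, 0.252, 0.260, 0.268, 0.275])
team_runs = np.array([650, 700, 720, 760, 800])

(alpha, beta), predicted, errors = simple_linear_fit(team_ba, team_runs)
r = pearson_correlation(team_ba, team_runs)

print(f"alpha = {alpha:.1f}, beta = {beta:.1f}")
print(f"correlation = {r:.3f}")
print("errors:", np.round(errors, 1))
```

With an intercept in the fit, the errors sum to (numerically) zero, which is why we summarize them by their spread rather than by their mean.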


##### 1. and 2. Linear Relationship and Error

In [None]:
linear_relationships = {}
errors = {}

linear_fits = Table().with_column('R', lahman['R'])

for stat in stats:
    # Linear fit
    params, predictions, error = linear_fit(lahman[stat], lahman['R'])
    linear_relationships[stat] = params
    linear_fits = linear_fits.with_column(stat, lahman[stat])
    linear_fits = linear_fits.with_column(stat + '_pred', predictions)
    linear_fits = linear_fits.with_column(stat + '_err', error)

linear_fits.show(10)

##### 3. Correlation

In [None]:
correlations = {}

for stat in stats:
    # Correlation
    correlations[stat] = correlation(lahman[stat], lahman['R'])

##### 4. Plotting the results

We put the results together into a simple set of paired plots:
+ In the first plot we show a scatter plot and the linear relationship between a statistic and team runs.  What is immediately clear is that these stats are in fact related to scoring and not complete nonsense. In fact, if they were all you had, they would be okay.  But as we will see, you can do better.
+ In the second plot, we show the errors between the actual team run values and the predicted values from the linear relationship.  We do this because we want to see that OBP and SLG do in fact improve on BA in terms of the size of the errors.  Their errors tend to be smaller, which shows that OBP and SLG are associated with run scoring more strongly than batting average is.

In [None]:
stat = 'BA'
linear_fits.scatter(stat, select='R', fit_line=True, color='C0')
linear_fits.scatter(stat, select=stat + '_err', color='C1')

In [None]:
stat = 'OBP'
linear_fits.scatter(stat, select='R', fit_line=True, color='C0')
linear_fits.scatter(stat, select=stat + '_err', color='C1')

In [None]:
stat = 'SLG'
linear_fits.scatter(stat, select='R', fit_line=True, color='C0')
linear_fits.scatter(stat, select=stat + '_err', color='C1')

We also print out some results for each statistic.

In [None]:
def stat_summary_print(stat, corr, err_std):
    print(f"Stat: {stat}")
    print("=" * 20)
    print(f"Correlation with Runs: {corr:.3f}")
    print(f"Std dev of errors (in Runs): {err_std:.3f}")
    print()
    
for stat in stats:
    corr = correlations[stat]
    err_std = np.std(linear_fits[stat + '_err'])
    # Print summary
    stat_summary_print(stat, corr, err_std)

The correlations show that OBP and SLG have a stronger link to run scoring than BA does.  The standard deviation of the prediction errors measures the overall magnitude of the errors (in units of runs), and it shows that the errors are dramatically reduced by considering OBP and SLG.  A difference in standard deviation between BA and SLG of 15 runs translates to about 1.5 wins under the common rule of thumb of roughly 10 runs per win, which is no small matter.
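
The two numbers printed for each stat are closely tied together.  For a least-squares line with an intercept, the standard deviation of the errors equals the standard deviation of Runs multiplied by $\sqrt{1 - r^2}$, where $r$ is the correlation, so a higher correlation necessarily means a smaller error spread.  The sketch below checks this on synthetic data; the values are randomly generated, and `np.polyfit` and `np.corrcoef` are used here as assumed stand-ins for the notebook's own helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "OBP-like" stat and "runs-like" outcome (not real data)
x = rng.normal(0.330, 0.012, size=300)
y = 750 + 4000 * (x - 0.330) + rng.normal(0, 30, size=300)

# Least-squares line and its residuals
beta, alpha = np.polyfit(x, y, deg=1)
residuals = y - (alpha + beta * x)

# Pearson correlation between the stat and the outcome
r = np.corrcoef(x, y)[0, 1]

error_std = np.std(residuals)                      # what the summary prints
implied_std = np.std(y) * np.sqrt(1 - r ** 2)      # what the correlation implies

print(f"std of errors:        {error_std:.3f}")
print(f"std(y) * sqrt(1-r^2): {implied_std:.3f}")
```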

_Questions_

+ If we came up with a metric that can perfectly predict team run scoring in the period we have been exploring (post year 2000), why should we be suspicious of this metric?
+ What do you think of the implicit assumption that if a metric works well in measuring team performance, it also works well at the player level?  Do you have a problem with the assumption and if so, why?  If not, give a justification for the assumption.

### Correlation Between Statistics

One thing we might wonder is since OBP and SLG are seemingly both improvements on BA, are they telling us something different?  Well, for one, based on the construction we know that has to be true.  The extra weighting of extra base hits means they cannot possibly be the same.  But to what extent are they different?  

Well, we can look at the correlations between the three statistics to see how closely they relate to each other.  We find that OBP is pretty closely linked to BA, and we know that whatever difference exists there comes almost entirely from BA ignoring walks and hit-by-pitch events. 

SLG shows a lower correlation with both BA and OBP, clearly due to its weighting for extra base hits.

In [None]:
rho_obp_ba = correlation(lahman["OBP"], lahman['BA'])
rho_obp_slg = correlation(lahman["OBP"], lahman['SLG'])
rho_ba_slg = correlation(lahman["BA"], lahman['SLG'])

print(f" BA and OBP: {rho_obp_ba:.3f}")
print(f" BA and SLG: {rho_ba_slg:.3f}")
print(f"OBP and SLG: {rho_obp_slg:.3f}")

_Question_
+ Why does it matter that OBP and SLG correlate with team runs but not completely with each other?

## Advanced Statistics

### OPS

OPS stands for "On-Base Plus Slugging".  The formula for OPS is pretty obvious:
$$
    \mathit{OPS} = \mathit{OBP} + \mathit{SLG}
$$
OPS is probably the most well-known advanced metric since it is often the first foray into advanced stats for people.

As we've seen above, OBP and SLG are both good metrics for run scoring.  Because they're not perfectly correlated, we can try to combine them in some way (in this case, just adding them) and hopefully gain some extra power in measuring performance.

One thing that should be pointed out about OPS is that it adds two metrics that measure two entirely different things: times on-base per plate appearance vs. total bases per at-bat.  Therefore, the actual number coming out of OPS has no direct interpretation other than that higher is better. 
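
Substituting the definitions used earlier in this notebook makes the mismatch explicit (with $\mathit{PA}$ approximated as $\mathit{AB} + \mathit{BB} + \mathit{HBP} + \mathit{SF}$, as above):
$$
    \mathit{OPS} = \frac{H + \mathit{BB} + \mathit{HBP}}{\mathit{PA}} + \frac{\mathit{1B} + 2\cdot\mathit{2B} + 3\cdot\mathit{3B} + 4\cdot\mathit{HR}}{\mathit{AB}}
$$
The first term is a rate per plate appearance and the second is a rate per at-bat, so their sum is not itself a rate per anything.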

In [None]:
# Team OPS
lahman['OPS'] = lahman['OBP'] + lahman['SLG']

As before, we can visualize the typical team values for OPS to get a feel for what we can expect from teams.  Since the actual numeric value of OPS has no meaning, it's important to know what typical team OPS values are.  

In [None]:
lahman.hist('OPS', bins=20)

### wOBA

wOBA stands for "Weighted On Base Average" and is one of the premier advanced stats out there.  It is empirically driven, meaning that it was created using play-by-play baseball data and is specifically designed to perform well.  This is in contrast to BA, OBP, SLG, and OPS, which were designed on intuition to measure something we had good reason to think mattered.  And to a large degree, BA, OBP, and SLG do measure things that matter, as the above analysis showed.

The formula for wOBA is given by,
$$
    \mathit{wOBA} =
    \frac{0.72\cdot \mathit{BB} + 0.75\cdot \mathit{HBP} + 0.90\cdot \mathit{1B} + 1.24\cdot\mathit{2B} + 1.56\cdot\mathit{3B} + 1.95\cdot\mathit{HR}}{\mathit{PA}}
$$
Notice how wOBA, like SLG, gives different events different values, but in different proportions from SLG.  A HR is only worth about 2.2x as much as a single instead of 4x as in SLG.  We'll dive into why wOBA doesn't weight HR as strongly, but the main point is that it's actually the data telling us not to weight HR as strongly, as opposed to a (sort of) arbitrary choice to weight each hit by its number of bases.  

Also notice that BB and HBP are worth not quite as much as a single.  This is actually pretty easy to grasp: we naturally prefer (and thus weight more heavily) a single to a walk because while both events leave you at first, a single puts the ball in play and therefore can lead to more advancement of runners and thus more scoring.  
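
To see the proportions side by side, the short sketch below expresses each event's weight relative to a single for both stats.  The wOBA coefficients are copied from the formula above; the helper `relative_to_single` is a hypothetical name used only for this illustration.  Walks and hit-by-pitch events contribute nothing to the SLG numerator, so they are entered as 0 for comparison.

```python
# wOBA coefficients copied from the formula above
woba_weights = {'BB': 0.72, 'HBP': 0.75, '1B': 0.90,
                '2B': 1.24, '3B': 1.56, 'HR': 1.95}

# SLG weights are the number of bases; BB and HBP do not count toward SLG
slg_weights = {'BB': 0.0, 'HBP': 0.0, '1B': 1.0,
               '2B': 2.0, '3B': 3.0, 'HR': 4.0}


def relative_to_single(weights):
    """Express each event's weight as a multiple of the weight of a single."""
    return {event: w / weights['1B'] for event, w in weights.items()}


woba_rel = relative_to_single(woba_weights)
slg_rel = relative_to_single(slg_weights)

for event in woba_weights:
    print(f"{event:>3}: wOBA {woba_rel[event]:.2f}x   SLG {slg_rel[event]:.2f}x")
```

The extra-base hits are compressed toward the single in wOBA, and the walk lands a bit below a single rather than at zero.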

In [None]:
# Team wOBA
lahman['wOBA'] = (
    .72 * lahman['BB'] + .75 * lahman['HBP'] + 
    .9 * lahman['1B'] + 1.24 * lahman['2B'] +  
    1.56 * lahman['3B'] + 1.95 * lahman['HR']
) / lahman['PA']

You may notice with wOBA that it appears to be very similar in scale to OBP.  In fact, that is by design: a typical good OBP for a player or a team is also a typical good wOBA value.  Ditto for bad OBP values.

In [None]:
lahman.hist('wOBA', bins=20)

### Advanced Stats and Runs Scored

1. We'll reuse the previous correlation and linear fit analysis for BA, OBP, and SLG and compute the corresponding quantities for OPS and wOBA.
2. Then we'll generate the same scatter plots of OPS and wOBA vs Runs Scored as well as the scatter plots on the errors.
3. We'll use the `stat_summary_print` function to print out the results
4. Finally, we'll generate the correlation between wOBA and OPS

In [None]:
adv_stats = ['OPS', 'wOBA']

for stat in adv_stats:
    # Linear fit
    params, predictions, error = linear_fit(lahman[stat], lahman['R']) 
    linear_relationships[stat] = params
    correlations[stat] = correlation(lahman[stat], lahman['R'])
    linear_fits = linear_fits.with_column(stat, lahman[stat])
    linear_fits = linear_fits.with_column(stat + '_pred', predictions)
    linear_fits = linear_fits.with_column(stat + '_err', error)

In [None]:
stat = 'OPS'
linear_fits.scatter(stat, select='R', fit_line=True, color='C0')
linear_fits.scatter(stat, select=stat + '_err', color='C1')

In [None]:
stat = 'wOBA'
linear_fits.scatter(stat, select='R', fit_line=True, color='C0')
linear_fits.scatter(stat, select=stat + '_err', color='C1')

In [None]:
for stat in adv_stats:
    corr = correlations[stat]
    err_std = np.std(linear_fits[stat + '_err'])
    # Print summary
    stat_summary_print(stat, corr, err_std)

In [None]:
rho_ops_wOBA = correlation(lahman["OPS"], lahman['wOBA'])
print(f"wOBA and OPS: {rho_ops_wOBA:.3f}")

## Wrapping up

Okay, this was a pretty fun entry into developing some empirically driven approaches to measuring performance.

Some final questions:

+ What does our analysis say about future performance?  Does it say anything? 
+ Say a player has a .400 wOBA for a season.  What might we want to know about the metric wOBA when evaluating the player and valuing the player's future performance?