# Offensive Metrics in Baseball

This demo explores metrics for quantifying team run-scoring performance.  We will look at the classical slash line statistics and get a feel for how they work, then examine how effective they are at measuring performance.  Finally, we will look at two advanced metrics and how they improve on the classical statistics.  

The demo shows how we should think about evaluating metrics and provides motivation for exploring how the advanced metrics are constructed.

For this demo, we assume the user is familiar with the usual acronyms, shorthand abbreviations, and definitions of commonly used categories in baseball like BA for Batting Average, 1B for singles, and PA for plate appearances. 

In [None]:
# Setup
%run ../../utils/notebook_setup.py

from datascience import Table
import numpy as np

# custom functions that will help do some simple tasks
from datascience_utils import *
from datascience_stats import *

## 1. The Slash Line

We're going to study the classical baseball statistics that have been used extensively over the years.

The classical baseball statistics are _Batting Average_, _On-Base Percentage_, and _Slugging Percentage_, which make up what is commonly known as a batter's _Slash Line_ because the stats are displayed in that order with a "/" between them.  

In [None]:
# Load lahman_teams.csv obtained from the Lahman databank.  We only need a selection of the columns.
lahman = Table.read_table("lahman_teams.csv", usecols=[0, 3, 6, 8, 9] + list(range(14, 28)))

# Compute runs per game
lahman['Rpg'] = lahman['R'] / lahman['G']
lahman.show(5)

After loading the dataset we need to perform a few extra operations:
+ We restrict to year 1940 and later.  
+ The dataset does not contain a field for singles, but this is easy to compute from the available data as $H - \mathit{2B} - \mathit{3B} - \mathit{HR}$.
+ Some fields like $\mathit{HBP}$ are not guaranteed to be recorded in the dataset.  We'll fill in those values as 0 even though we know the true values aren't zero.  We have no better option right now.
+ Finally, we need to add plate appearances.  The quantity we compute is only an approximation of PA because other events that count as plate appearances, such as sacrifice bunts, are not in our formula.  One unfortunate limitation of the Lahman Database is that it can sometimes have holes like this.

In [None]:
# First we'll restrict to after the year 1940
lahman = lahman.where(lahman['yearID'] >= 1940).copy()

We want the years grouped into five-year periods, so we use a helper function.

In [None]:
# bucket by five year increments
lahman['year5'] = floor_to_nearest(lahman['yearID'], 5) 
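
The `floor_to_nearest` helper comes from the course utility module, whose source is not shown here.  A minimal sketch of what such a bucketing helper might do, assuming it rounds each year down to a multiple of the bucket width (the name `floor_to_bucket` is hypothetical and the real helper may differ):

```python
def floor_to_bucket(years, width):
    """Round each year down to the nearest multiple of `width`.

    Hypothetical stand-in for the notebook's `floor_to_nearest` helper;
    the real helper may behave differently.
    """
    return [year - year % width for year in years]


buckets = floor_to_bucket([1940, 1943, 1944, 1945, 2019], 5)
print(buckets)
```

Under this assumption, every season from 1940 through 1944 lands in the 1940 bucket, 1945 through 1949 in the 1945 bucket, and so on.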

Unfortunately, there are some null values because some statistics weren't reliably recorded in earlier seasons.  We'll fill those with 0s.

In [None]:
# fill in null values
lahman['HBP'] = fill_null(lahman, fill_column='HBP', fill_value=0)
lahman['SF'] = fill_null(lahman, fill_column='SF', fill_value=0)
lahman['CS'] = fill_null(lahman, fill_column='CS', fill_value=0)
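
The `fill_null` helper is also part of the course utilities.  As a rough sketch of the idea, assuming missing entries show up as `None` or `NaN`, the replacement could look like this (`fill_missing` is a hypothetical name, not the helper's actual implementation):

```python
import math


def fill_missing(values, fill_value=0):
    """Replace None and NaN entries with `fill_value`.

    Hypothetical stand-in for the notebook's `fill_null` helper.
    """
    filled = []
    for value in values:
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        filled.append(fill_value if is_missing else value)
    return filled


# Made-up HBP column with two unrecorded seasons
hbp = [31, None, float("nan"), 45]
print(fill_missing(hbp))
```

Filling with 0 keeps the later arithmetic from failing on missing entries, at the cost of slightly understating OBP for seasons where HBP was not recorded.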

We also need to add two fields: singles and PA (which is only approximate)

In [None]:
# compute some missing columns
lahman['1B'] = lahman['H'] - lahman['2B'] - lahman['3B'] - lahman['HR']
lahman['PA'] = lahman['AB'] + lahman['BB'] + lahman['HBP'] + lahman['SF']

### Computing the Slash Line

Using the formula for BA, OBP, and SLG, we add the classical stats to the table.

In [None]:
# Batting Average
lahman['BA'] = lahman['H'] / lahman['AB']
# On-Base Percentage
lahman['OBP'] = (lahman['H'] + lahman['BB'] + lahman['HBP']) / lahman['PA']
# Slugging Percentage
lahman['SLG'] = (lahman['1B'] + 2 * lahman['2B'] +
                 3 * lahman['3B'] + 4 * lahman['HR']) / lahman['AB']
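
To make the formulas concrete, here is a small worked example on a single made-up team.  The season totals are invented for illustration and the `fmt` helper is hypothetical; only the formulas match the cell above.

```python
# Made-up season totals for a single hypothetical team
team = {"AB": 5500, "H": 1450, "2B": 280, "3B": 30, "HR": 180,
        "BB": 520, "HBP": 50, "SF": 45}

singles = team["H"] - team["2B"] - team["3B"] - team["HR"]
pa = team["AB"] + team["BB"] + team["HBP"] + team["SF"]  # approximate PA

ba = team["H"] / team["AB"]
obp = (team["H"] + team["BB"] + team["HBP"]) / pa
total_bases = singles + 2 * team["2B"] + 3 * team["3B"] + 4 * team["HR"]
slg = total_bases / team["AB"]


def fmt(rate):
    """Format a rate the baseball way: three decimals, no leading zero."""
    return f"{rate:.3f}".lstrip("0")


slash_line = "/".join(fmt(x) for x in (ba, obp, slg))
print(slash_line)  # .264/.330/.424
```

Note that BA and SLG divide by at-bats while OBP divides by (approximate) plate appearances, so the three numbers are not on a common denominator.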

### Histograms

We can visualize the typical team values for the stats to get a feel for what we can expect from teams.  Team values will be inherently more concentrated than player values since teams are made up of an assortment of player ability levels.  For instance, only one team since 1940 has had a BA over .300 yet many players have.  Ted Williams had a BA over .400!

In [None]:
lahman.hist('BA', bins=20)

In [None]:
lahman.hist('OBP', bins=20)

In [None]:
lahman.hist('SLG', bins=20)

### Historical Trends/Cycles

We can also see how team values change over time according to various eras in baseball.

In [None]:
boxplots(lahman, column='BA', by='year5', figsize=(12, 4), rot=90)

In [None]:
boxplots(lahman, column='OBP', by='year5', figsize=(12, 4), rot=90)

In [None]:
boxplots(lahman, column='SLG', by='year5', figsize=(12, 4), rot=90)

### Best and Worst Teams

#### Batting Average

In [None]:
lahman.sort('BA').\
    select('yearID', 'franchID', 'G', 'W', 'L', 'R', 'Rpg', 'BA').\
    show(10)

In [None]:
lahman.sort('BA', descending=True).\
    select('yearID', 'franchID', 'G', 'W', 'L', 'R', 'Rpg', 'BA').\
    show(10)

#### On-Base Percentage

In [None]:
lahman.sort('OBP').\
    select('yearID', 'franchID', 'G', 'W', 'L', 'R', 'Rpg', 'OBP').\
    show(10)

In [None]:
lahman.sort('OBP', descending=True).\
    select('yearID', 'franchID', 'G', 'W', 'L', 'R', 'Rpg', 'OBP').\
    show(10)

#### Slugging Percentage

In [None]:
lahman.sort('SLG').\
    select('yearID', 'franchID', 'G', 'W', 'L', 'R', 'Rpg', 'SLG').\
    show(10)

In [None]:
lahman.sort('SLG', descending=True).\
    select('yearID', 'franchID', 'G', 'W', 'L', 'R', 'Rpg', 'SLG').\
    show(10)

## 2. Classical Stats and Runs Scored

We can look at simple scatter plots between the classic stats and runs scored to see how they relate.  We will also compute some quantities that measure the strength of the relationship, but suffice it to say that teams with higher values for these stats tend to score more runs.  The most tenuous relationship definitely appears to be with batting average.

In [None]:
stats = ['BA', 'OBP', 'SLG']
scatterplot_by_x(lahman, stats, 'R', title='Classical Stats vs Runs')

Note that due to the variable number of games a team might have played, we cannot directly consider runs scored.  Instead, we use Runs per Game.

In [None]:
scatterplot_by_x(lahman, stats, 'Rpg', title='Classical Stats vs Runs per Game')
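
A quick illustration of why the normalization matters, using two made-up teams: one plays a full schedule and one plays a shortened season.  The totals are invented for the example.

```python
# Made-up totals: a full 162-game season vs. a shortened 113-game season
full_season = {"R": 700, "G": 162}
short_season = {"R": 520, "G": 113}

rpg_full = full_season["R"] / full_season["G"]
rpg_short = short_season["R"] / short_season["G"]

print(f"Full season:  {full_season['R']} runs, {rpg_full:.2f} runs per game")
print(f"Short season: {short_season['R']} runs, {rpg_short:.2f} runs per game")
```

The team with fewer total runs is the better offense on a per-game basis, which raw run totals would hide.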

### Correlation with Team Runs

This is the most involved step of the demo.  For each of the three classical stats, we need to do four things:
1. Compute a linear relationship between Team Runs and the statistic.  For example, for Batting Average:
$$
    \text{Predicted Team Runs} = \alpha + \beta \cdot \text{Team Batting Average}
$$
Each statistic will have its own $\alpha$ and $\beta$ values.  We are not so interested in those values but rather in the predicted value of Team Runs given a Team Batting Average, and in the error of that prediction.
2. The error of the prediction in 1:
$$
    \text{Error} = \text{Team Runs} - \text{Predicted Team Runs}
$$
3. The correlation between the statistic and Team Runs
4. Plots of the relationship between Team Runs and the statistic, the linear relationship, and the errors.
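
Steps 1 and 2 can be sketched in plain Python as an ordinary least-squares fit.  This is an assumption about what the `linear_fit` helper does internally, not its actual source: the function name `simple_linear_fit`, the `(alpha, beta)` ordering of the parameters, and the data are all illustrative.

```python
def simple_linear_fit(x, y):
    """Ordinary least squares for y ~ alpha + beta * x.

    Hypothetical stand-in for the notebook's `linear_fit` helper; returns
    ((alpha, beta), predictions, errors) with errors = actual - predicted.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    predictions = [alpha + beta * xi for xi in x]
    errors = [yi - pi for yi, pi in zip(y, predictions)]
    return (alpha, beta), predictions, errors


# Made-up team batting averages and runs per game
ba = [0.240, 0.250, 0.260, 0.270, 0.280]
rpg = [3.9, 4.3, 4.4, 4.8, 5.1]

(alpha, beta), predictions, errors = simple_linear_fit(ba, rpg)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

With an intercept in the model, the errors always sum to zero, so their spread (standard deviation) rather than their mean is what tells us how well a statistic tracks scoring.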


#### 1. and 2. Linear Relationship and Error

In [None]:
linear_relationships = {}

# Start a table with runs per game
linear_fits = Table().with_column('Rpg', lahman['Rpg'])

# Compute the linear fit for BA
params, predictions, error = linear_fit(lahman['BA'], lahman['Rpg'])
# add columns for BA, BA model predictions, and BA model errors
linear_fits = linear_fits.with_columns(
    'BA', lahman['BA'],
    'BA_pred', predictions,
    'BA_err', error
)
# save the slope/intercept parameters
linear_relationships['BA'] = params

# Compute the linear fit for OBP
params, predictions, error = linear_fit(lahman['OBP'], lahman['Rpg'])
# add columns for OBP, OBP model predictions, and OBP model errors
linear_fits = linear_fits.with_columns(
    'OBP', lahman['OBP'],
    'OBP_pred', predictions,
    'OBP_err', error
)
# save the slope/intercept parameters
linear_relationships['OBP'] = params

# Compute the linear fit for SLG
params, predictions, error = linear_fit(lahman['SLG'], lahman['Rpg'])
# add columns for SLG, SLG model predictions, and SLG model errors
linear_fits = linear_fits.with_columns(
    'SLG', lahman['SLG'],
    'SLG_pred', predictions,
    'SLG_err', error
)
# save the slope/intercept parameters
linear_relationships['SLG'] = params

linear_fits.show(10)

#### 3. Correlation with Runs per Game

In [None]:
correlations = {}

# Compute the correlations of each stat with runs per game
correlations['BA'] = correlation(lahman['BA'], lahman['Rpg'])
correlations['OBP'] = correlation(lahman['OBP'], lahman['Rpg'])
correlations['SLG'] = correlation(lahman['SLG'], lahman['Rpg'])
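
The `correlation` helper is not shown in this notebook; assuming it computes the usual Pearson correlation coefficient, a minimal sketch looks like this (`pearson_r` and the data are illustrative):

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up values
obp = [0.310, 0.320, 0.330, 0.340, 0.350]
rpg = [4.0, 4.2, 4.5, 4.6, 5.0]
r = pearson_r(obp, rpg)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

For a one-variable least-squares fit with an intercept, the variance of the errors equals $(1 - r^2)$ times the variance of the target, which is why a higher correlation and a smaller error spread go together.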

#### 4. Plotting the results

We put together the results into a simple set of pairs of plots
+ In the first plot we show a scatter plot and the linear relationship between a statistic and team runs.  What is immediately clear is that these stats are in fact related to scoring and not complete nonsense. In fact, if they were all you had, they would be okay.  But as we will see, you can do better.
+ In the second plot, we show the errors between the actual team run values and the predicted values from the linear relationship.  We do this because we want to visually see that OBP and SLG do in fact improve on BA in terms of the size of the errors.  By tending to have smaller error, this shows OBP and SLG correlate/associate with run scoring in a stronger manner than just batting average.

In [None]:
linear_fits.scatter('BA', select='Rpg', fit_line=True, color='C0')
linear_fits.scatter('BA', select='BA_err', color='C1')

In [None]:
linear_fits.scatter('OBP', select='Rpg', fit_line=True, color='C0')
linear_fits.scatter('OBP', select='OBP_err', color='C1')

In [None]:
linear_fits.scatter('SLG', select='Rpg', fit_line=True, color='C0')
linear_fits.scatter('SLG', select='SLG_err', color='C1')

We also print out some results for each statistic.

In [None]:
def stat_summary_print(stat, corr, err_std):
    print(f"Stat: {stat}")
    print("=" * 20)
    # The fits above use runs per game, so both quantities are per game
    print(f"Correlation with Runs per Game: {corr:.3f}")
    print(f"Std dev of errors (in Runs per Game): {err_std:.3f}")
    print()


# Print a summary for each classical stat
for stat in ['BA', 'OBP', 'SLG']:
    err_std = np.std(linear_fits[stat + '_err'])
    stat_summary_print(stat, correlations[stat], err_std)

The correlations show that OBP and SLG have a stronger link to run scoring than BA.  The standard deviation of the prediction errors measures the overall magnitude of the errors, and it shows that errors are dramatically reduced by considering OBP and SLG.  Because the fits use runs per game, that standard deviation is in runs per game.  Scaled up to a full season, a difference in standard deviation between BA and SLG of 15 runs translates to about 1.5 wins (using the common rule of thumb of roughly 10 runs per win), no small matter.
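
The conversion from a per-game error to season wins is simple arithmetic.  The sketch below uses made-up standard deviations (they are not the notebook's output) together with a 162-game season and the common rule of thumb of roughly 10 runs per win, which is an approximation rather than an exact constant.

```python
GAMES_PER_SEASON = 162  # modern regular-season length
RUNS_PER_WIN = 10       # common rule of thumb, not an exact constant

# Illustrative (made-up) error standard deviations in runs per game
err_std_ba = 0.40
err_std_slg = 0.31

diff_rpg = err_std_ba - err_std_slg
diff_runs = diff_rpg * GAMES_PER_SEASON
diff_wins = diff_runs / RUNS_PER_WIN

print(f"{diff_rpg:.2f} runs/game -> {diff_runs:.1f} runs/season -> {diff_wins:.2f} wins")
```

To apply this to the real results, substitute the values printed by `stat_summary_print`.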

_Questions_

+ If we came up with a metric that can perfectly predict team run scoring for all seasons post-2000, why should we be suspicious of this metric?
+ What do you think of the implicit assumption that if a metric works well in measuring team performance, it also works well at the player level?  Do you have a problem with the assumption and if so, why?  If not, give a justification for the assumption.

### Correlation Between Statistics

One thing we might wonder is since OBP and SLG are seemingly both improvements on BA, are they telling us something different?  Well, for one, based on the construction we know that has to be true.  The extra weighting of extra base hits means they cannot possibly be the same.  But to what extent are they different?  

Well, we can look at the correlations between the three statistics to see how closely they relate to each other.  We find that OBP is pretty closely linked to BA, and we know that whatever difference exists there is due almost entirely to BA ignoring walks and hit-by-pitch events. 

SLG shows a lower correlation with either BA or OBP, clearly due to its weighting for extra base hits.

In [None]:
rho_obp_ba = correlation(lahman["OBP"], lahman['BA'])
rho_obp_slg = correlation(lahman["OBP"], lahman['SLG'])
rho_ba_slg = correlation(lahman["BA"], lahman['SLG'])

print(f" BA and OBP: {rho_obp_ba:.3f}")
print(f" BA and SLG: {rho_ba_slg:.3f}")
print(f"OBP and SLG: {rho_obp_slg:.3f}")

_Question_
+ Why does it matter that OBP and SLG correlate with team runs but not completely with each other?

## 3. Advanced Statistics

### OPS

OPS stands for "On-Base Plus Slugging".  The formula for OPS is pretty obvious:
$$
    \mathit{OPS} = \mathit{OBP} + \mathit{SLG}
$$
OPS is probably the most well-known advanced metric since it is often the first foray into advanced stats for people.

As we've seen above, OBP and SLG are both good metrics for run scoring.  Because they're not perfectly correlated, we can try to combine them in some way (in this case, just adding them) and hopefully gain some extra power in measuring performance.

One thing that should be pointed out about OPS is that it adds two metrics that measure two entirely different things: times on-base per plate appearance vs. total bases per at-bat.  Therefore, the actual number coming out of OPS is meaningless other than that higher is better. 

In [None]:
# Team OPS
lahman['OPS'] = lahman['OBP'] + lahman['SLG']

As before, we can visualize the typical team values for OPS to get a feel for what we can expect from teams.  Since the actual numeric value of OPS has no meaning, it's important to know what typical team OPS values are.  

In [None]:
lahman.hist('OPS', bins=20)

### wOBA

wOBA stands for "Weighted On-Base Average" and is one of the premier advanced stats out there.  It is empirically driven, meaning that it was created using play-by-play baseball data and is specifically designed to perform well.  This is in contrast to BA, OBP, SLG, and OPS, which were designed on intuition to measure something we had good reason to think mattered.  And to a large degree, BA, OBP, and SLG do measure things that matter, as the above analysis showed.

The formula for wOBA is given by,
$$
    \mathit{wOBA} =
    \frac{0.72\cdot \mathit{BB} + 0.75\cdot \mathit{HBP} + 0.90\cdot \mathit{1B} + 1.24\cdot\mathit{2B} + 1.56\cdot\mathit{3B} + 1.95\cdot\mathit{HR}}{\mathit{PA}}
$$
Notice that wOBA, like SLG, values events differently, but in different proportions.  A HR is only worth about 2.2x as much as a single instead of 4x as in SLG.  We'll dive into why wOBA doesn't weight HR as strongly, but the main point is that it's the data telling us not to weight HR as strongly, compared to the (sort of) arbitrary choice in SLG to weight each hit by its number of bases.  

Also notice that BB and HBP are worth not quite as much as a single.  This is actually pretty easy to grasp: we naturally prefer (and thus weight more heavily) a single to a walk because, while both events leave the batter at first, a single puts the ball in play and therefore can lead to more advancement of runners and thus more scoring.  

In [None]:
# Team wOBA
lahman['wOBA'] = (
    .72 * lahman['BB'] + .75 * lahman['HBP'] + 
    .9 * lahman['1B'] + 1.24 * lahman['2B'] +  
    1.56 * lahman['3B'] + 1.95 * lahman['HR']
) / lahman['PA']
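
To see the weights in action, the sketch below compares how wOBA and SLG value each hit relative to a single and then computes wOBA for one made-up team.  The weights are the ones from the formula above; the season totals and the `woba` function are illustrative.

```python
# wOBA weights from the formula above
WOBA_WEIGHTS = {"BB": 0.72, "HBP": 0.75, "1B": 0.90, "2B": 1.24, "3B": 1.56, "HR": 1.95}
# SLG weights: bases per hit (walks and HBP do not count toward SLG)
SLG_WEIGHTS = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}

# Value of each hit type relative to a single under each metric
for event in ("2B", "3B", "HR"):
    woba_ratio = WOBA_WEIGHTS[event] / WOBA_WEIGHTS["1B"]
    slg_ratio = SLG_WEIGHTS[event] / SLG_WEIGHTS["1B"]
    print(f"{event}: {woba_ratio:.2f}x a single in wOBA, {slg_ratio:.0f}x in SLG")


def woba(counts, pa):
    """Weighted sum of events divided by plate appearances."""
    return sum(WOBA_WEIGHTS[event] * counts[event] for event in WOBA_WEIGHTS) / pa


# Made-up season totals for one hypothetical team
counts = {"BB": 520, "HBP": 50, "1B": 960, "2B": 280, "3B": 30, "HR": 180}
team_woba = woba(counts, pa=6115)
print(f"wOBA = {team_woba:.3f}")
```

For this made-up team, (H + BB + HBP) / PA = 2020 / 6115, so its OBP is about .330, essentially the same as its wOBA.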

You may notice with wOBA that it appears to be very similar in scale to OBP.  In fact, that is by design: a typical good OBP for a player or a team is also a typical good wOBA value.  Ditto for bad OBP values.

In [None]:
lahman.hist('wOBA', bins=20)

### Advanced Stats and Runs Scored

1. We'll reuse the previous correlation and linear fit analysis for BA, OBP, and SLG and compute the corresponding quantities for OPS and wOBA.
2. Then we'll generate the same scatter plots of OPS and wOBA vs Runs Scored as well as the scatter plots on the errors.
3. We'll use the `stat_summary_print` function to print out the results
4. Finally, we'll generate the correlation between wOBA and OPS

#### 1. and 2. Linear Relationship and Error

In [None]:
# Compute the linear fit for OPS
params, predictions, error = linear_fit(lahman['OPS'], lahman['Rpg'])
# add columns for OPS, OPS model predictions, and OPS model errors
linear_fits = linear_fits.with_columns(
    'OPS', lahman['OPS'],
    'OPS_pred', predictions,
    'OPS_err', error
)
# save the slope/intercept parameters
linear_relationships['OPS'] = params

# Compute the linear fit for wOBA
params, predictions, error = linear_fit(lahman['wOBA'], lahman['Rpg'])
# add columns for wOBA, wOBA model predictions, and wOBA model errors
linear_fits = linear_fits.with_columns(
    'wOBA', lahman['wOBA'],
    'wOBA_pred', predictions,
    'wOBA_err', error
)
# save the slope/intercept parameters
linear_relationships['wOBA'] = params

In [None]:
stat = 'OPS'
linear_fits.scatter(stat, select='Rpg', fit_line=True, color='C0')
linear_fits.scatter(stat, select=stat + '_err', color='C1')

In [None]:
stat = 'wOBA'
linear_fits.scatter(stat, select='Rpg', fit_line=True, color='C0')
linear_fits.scatter(stat, select=stat + '_err', color='C1')

#### 3. Correlation with Runs per Game

In [None]:
# Compute the correlations of each stat with runs per game
# (added to the existing `correlations` dict so the classical stats are kept)
correlations['OPS'] = correlation(lahman['OPS'], lahman['Rpg'])
correlations['wOBA'] = correlation(lahman['wOBA'], lahman['Rpg'])

# Print summaries
for stat in ['OPS', 'wOBA']:
    err_std = np.std(linear_fits[stat + '_err'])
    stat_summary_print(stat, correlations[stat], err_std)

### Similarity between OPS and wOBA

In [None]:
rho_ops_wOBA = correlation(lahman["OPS"], lahman['wOBA'])
print(f"wOBA and OPS: {rho_ops_wOBA:.3f}")

It turns out OPS and wOBA are highly correlated.  So why do we care about wOBA if we already had OPS?  The construction of wOBA is far more justified and scientific.  OPS works partly on the justification that OBP and SLG measure two different concepts and that you can benefit by combining them.  But why combine them equally?  Why not weight OBP more (as empirical results have suggested)?  Beyond that, it's a bit of luck that OPS happens to work and land close to wOBA, rather than some other concocted stat doing the same thing.
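
One way to probe the "why combine them equally?" question is to treat the OBP weight as a free parameter and watch how the correlation with runs per game changes.  The sketch below is only an illustration: `weighted_ops`, `pearson_r`, the weight grid, and the toy numbers are all assumptions, and the toy data says nothing about which weight is actually best.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def weighted_ops(obp, slg, obp_weight=1.0):
    """obp_weight * OBP + SLG; obp_weight=1 recovers ordinary OPS."""
    return [obp_weight * o + s for o, s in zip(obp, slg)]


# Made-up team values, for illustration only
obp = [0.310, 0.325, 0.330, 0.340, 0.350, 0.320]
slg = [0.380, 0.400, 0.430, 0.410, 0.450, 0.420]
rpg = [4.0, 4.3, 4.6, 4.6, 5.0, 4.4]

for w in (1.0, 1.5, 2.0):
    r = pearson_r(weighted_ops(obp, slg, w), rpg)
    print(f"OBP weight {w:.1f}: correlation with Rpg = {r:.3f}")
```

On the real table, the same experiment would amount to calling the notebook's `correlation` helper on `w * lahman['OBP'] + lahman['SLG']` and `lahman['Rpg']` for a few values of `w`.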

## Wrapping up

Some final questions:

+ What does our analysis say about future performance?  Does it say anything? 
+ Say a player has a .400 wOBA for a season.  What do we require of wOBA for us to be able to reliably project next season's wOBA?