# The Figgins-Hill Convergence: Properly Weighting **OBP** and **SLG** in **OPS**

In [1]:
# import the necessary packages
import os
import sys
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

In [2]:
# set up the file paths
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
raw_data_dir = os.path.join(project_root, 'data', 'raw')
processed_data_dir = os.path.join(project_root, 'data', 'processed')

# test the paths
# print(f'Project Root: {project_root}')
# print(f'Raw Data Directory: {raw_data_dir}')
# print(f'Processed Data Directory: {processed_data_dir}')

In [3]:
# read in the csv for all qualified seasons from 2006 - 2015
# data courtesy of stathead
filename = 'mlb_qualified_batters_2006_2015.csv'
csv_path = os.path.join(raw_data_dir, filename)
batters = pd.read_csv(csv_path)
batters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1495 entries, 0 to 1494
Data columns (total 40 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 1495 non-null   int64  
 1   Player             1495 non-null   object 
 2   Season             1495 non-null   int64  
 3   Age                1495 non-null   int64  
 4   Team               1495 non-null   object 
 5   Lg                 1495 non-null   object 
 6   G                  1495 non-null   int64  
 7   PA                 1495 non-null   int64  
 8   AB                 1495 non-null   int64  
 9   R                  1495 non-null   int64  
 10  H                  1495 non-null   int64  
 11  1B                 1495 non-null   int64  
 12  2B                 1495 non-null   int64  
 13  3B                 1495 non-null   int64  
 14  HR                 1495 non-null   int64  
 15  RBI                1495 non-null   int64  
 16  SB                 1495 

## Question 1: Which of the three Triple Slash Statistics (AVG / OBP / SLG) is most strongly correlated with run expectancy?

My goal is to determine which of **AVG**, **OBP**, and **SLG** carries the most weight in a batter's overall contribution to **run expectancy**. Each of **AVG**, **OBP**, and **SLG** will serve in turn as the independent (predictor) variable. The dependent (response) variable will be **RE24**, the metric that tracks the total net change in **run expectancy** across a player's plate appearances. I chose **RE24** over **runs scored** or **RBI** for that reason. **Runs scored**, for the most part, don't occur during a batter's own plate appearance unless they hit a **home run**. **RBI** are situational: you can't amass **RBI** unless your teammates are on base in front of you, and that's not within a player's control. The outcome of an individual plate appearance, however, is mostly within the batter's control.

In short, I'll use **linear regression** to determine an equation of the form:

$$ \hat{\text{RE}24} = a \times (\text{AVG or OBP or SLG}) + b $$

Here, $a$ and $b$ are constants.

I will be using the `statsmodels` package to construct my model.

In [4]:
def model_run_expectancy(slash_stat):
    # set up the independent and dependent variables
    X = batters[slash_stat]
    X = sm.add_constant(X)
    y = batters['RE24']

    # construct the model
    model = sm.OLS(y, X).fit()

    # extract the parameters
    intercept = model.params['const']
    slope = model.params[slash_stat]
    r2 = model.rsquared
    r = np.sign(slope) * np.sqrt(r2)

    # print clean output
    print(f"{slash_stat} MODEL")
    print(f"Equation: RE24 = {intercept:.4f} + {slope:.4f}·{slash_stat}")
    print(f"R²: {r2:.4f}")
    print(f"r:  {r:.4f}")
    print()

In [5]:
# the function only needs the column names, so a plain list is enough
slash_stats = ['AVG', 'OBP', 'SLG']

for stat in slash_stats:
    model_run_expectancy(stat)

AVG MODEL
Equation: RE24 = -106.4302 + 431.2676·AVG
R²: 0.3467
r:  0.5888

OBP MODEL
Equation: RE24 = -148.4414 + 467.8851·OBP
R²: 0.6520
r:  0.8074

SLG MODEL
Equation: RE24 = -97.3306 + 246.7514·SLG
R²: 0.6841
r:  0.8271



To evaluate how well each Triple Slash stat predicts, or is correlated with, **run expectancy**, I will use three complementary measures:

- $R^2$: Tells us how much of the variation in **run expectancy** is explained by a single metric
    - "How much does this stat *explain*?"
- $r$ (correlation): Reflects the strength and direction of the linear relationship
    - "How tightly do runs follow this stat?"
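- $a$ (slope): Tells us how much we would predict **run expectancy** to change as a single slash stat changes. The fitted slope is expressed per full unit (1.000), so dividing it by 1,000 gives the predicted change per additional point (.001)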
    - "How *valuable* is this stat in terms of runs?"
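
For a one-predictor fit, these three measures are tied together: $R^2 = r^2$ and $a = r \cdot s_y / s_x$, where $s_x$ and $s_y$ are the standard deviations of the stat and of the response. Here is a minimal sketch that checks those identities. It uses made-up synthetic data (not the Stathead file) and `np.polyfit` as a stand-in for the `statsmodels` fit, so the specific numbers it prints mean nothing for baseball.

```python
import numpy as np

# Synthetic data only: a made-up rate stat and a made-up run value.
rng = np.random.default_rng(0)
x = rng.normal(0.330, 0.035, size=500)
y = -150 + 470 * x + rng.normal(0, 12, size=500)

# Least-squares line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation between the stat and the response.
r = np.corrcoef(x, y)[0, 1]

# R² computed from the residuals of the fitted line.
resid = y - (intercept + slope * x)
r2 = 1 - resid.var() / y.var()

# Identities that hold for any one-predictor least-squares fit.
slope_from_r = r * y.std() / x.std()

# The slope is per 1.000 of the stat; one "point" is .001.
runs_per_point = slope / 1000

print(f"R² = {r2:.4f}, r² = {r**2:.4f}")
print(f"slope = {slope:.2f}, r·s_y/s_x = {slope_from_r:.2f}")
print(f"runs per point = {runs_per_point:.3f}")
```

The second identity matters for interpretation: a slope can be large simply because the stat has a narrow spread, which is why $a$ and $r$ answer different questions.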

---

## INSIGHTS

### Batting Average

- $R^2 = 0.3467$: Only about 35% of the variance in **RE24** is accounted for by **AVG**, leaving about 65% of the variability unexplained.
- $r = 0.5888$: There is a positive association between **AVG** and **RE24** (as expected), but its strength is modest at best. The points still show noticeable scatter around the fitted line.
- $a=431.2676$: Every 1-point increase in **AVG** (.001) is associated with about **0.43 runs**. While **AVG** appears valuable because a single point carries a large estimated effect, it simply isn’t a strong predictor of run expectancy. The slope tells us what a point of **AVG** is “worth” if it changes, but the weak correlation tells us **AVG** doesn’t consistently move with run production across players.

### On-Base Percentage

- $R^2 = 0.6520$: About 65% of the variance in run expectancy is accounted for by **OBP**. That is about **30 percentage points more** than **AVG** explains.
- $r = 0.8074$: Again a positive relationship (as expected), and **OBP** is a much stronger predictor than **AVG**. There is a **strong** correlation between **OBP** and **RE24**.
- $a=467.8851$: Every 1-point increase in **OBP** is associated with about **0.47 runs**. Unlike **AVG**, **OBP** is both valuable and reliable, meaning changes in **OBP** are much more consistently reflected in run production.

### Slugging Percentage

- $R^2 = 0.6841$: About 68% of the variance in run expectancy is accounted for by **SLG**. That is only about **3 percentage points more** than **OBP**, a marginal increase. The strong fit makes sense because **SLG** captures the rate at which a player collects bases per at bat, and extra bases contribute heavily to run-scoring potential.
- $r = 0.8271$: **SLG** is strongly correlated with **run expectancy**, as expected.
- $a=246.7514$: Every 1-point increase in **SLG** is associated with about **0.25 runs**. While **SLG** reliably predicts run production, a point of **SLG** is associated with less run value than a point of **OBP**. This is consistent with a key baseball principle: avoiding outs has a larger impact on run expectancy than advancing extra bases.
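
One caveat when comparing slopes across the three single-stat fits: a point of **SLG** and a point of **OBP** are not equally large steps, because the stats have different spreads. For a one-predictor fit, $a = r \cdot s_y / s_x$, so the ratio $r / a$ recovers each stat's spread relative to **RE24**. The sketch below is a back-of-the-envelope check that uses only the slopes and $r$ values printed above. It assumes those rounded figures are precise enough for a rough comparison, and the variable names are my own.

```python
# Reported single-stat fits from the output above: (slope per 1.000, r).
fits = {
    "AVG": (431.2676, 0.5888),
    "OBP": (467.8851, 0.8074),
    "SLG": (246.7514, 0.8271),
}

# slope = r * sd(RE24) / sd(stat)  =>  sd(stat) / sd(RE24) = r / slope
relative_spread = {stat: r / slope for stat, (slope, r) in fits.items()}

# How much wider is the spread of SLG than the spread of OBP?
spread_ratio = relative_spread["SLG"] / relative_spread["OBP"]

for stat, spread in relative_spread.items():
    print(f"{stat}: sd(stat) / sd(RE24) = {spread:.6f}")
print(f"sd(SLG) / sd(OBP) = {spread_ratio:.2f}")
```

By this estimate **SLG** varies roughly twice as much across player seasons as **OBP** does, so part of the gap between the two slopes reflects scale rather than value. Fitting both stats in the same model, as in Question 2, is the more direct comparison.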

---

Using these equations, let's see who overperformed and underperformed expectations by computing each model's predicted value and the corresponding residual (actual minus predicted).

In [6]:
# predict the re24 based on the coefficients from each model
batters['pRE24_AVG'] = -106.4302 + 431.2676 * batters['AVG']
batters['pRE24_OBP'] = -148.4414 + 467.8851 * batters['OBP']
batters['pRE24_SLG'] = -97.3306 + 246.7514 * batters['SLG']

# calculate their residuals
batters['rRE24_AVG'] = batters['RE24'] - batters['pRE24_AVG']
batters['rRE24_OBP'] = batters['RE24'] - batters['pRE24_OBP']
batters['rRE24_SLG'] = batters['RE24'] - batters['pRE24_SLG']

Those will be used for later analysis.
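
The coefficients in that cell are retyped from the printed output, so the predictions inherit its four-decimal rounding. A possible alternative is to compute predictions and residuals straight from the fitted parameters. The sketch below illustrates the idea on made-up synthetic data, with `np.polyfit` standing in for the `statsmodels` fit; none of the names or numbers come from the real dataset.

```python
import numpy as np

# Synthetic data only: a made-up OBP-like stat and run value.
rng = np.random.default_rng(1)
obp = rng.normal(0.335, 0.035, size=300)
re24 = -148 + 468 * obp + rng.normal(0, 12, size=300)

# Fit the line and keep the full-precision parameters.
slope, intercept = np.polyfit(obp, re24, deg=1)

# Predictions and residuals (actual minus predicted) from those parameters.
predicted = intercept + slope * obp
residual = re24 - predicted

# Same predictions, but with coefficients rounded to four decimals,
# which mimics retyping them from printed output.
predicted_rounded = round(intercept, 4) + round(slope, 4) * obp
max_gap = np.max(np.abs(predicted - predicted_rounded))

print(f"mean residual: {residual.mean():.2e}")
print(f"largest gap caused by rounding: {max_gap:.2e}")
```

With four decimals the gap is far too small to change any ranking of over- and underperformers, so the hard-coded version is safe here. The main benefit of the direct approach is that the numbers stay in sync if the data or the model changes.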

---

## Question 2: How much more valuable is OBP compared to SLG in terms of run expectancy?

**On-Base Plus Slugging (OPS)** is the sum of **OBP** and **SLG**, which implicitly treats a point of each as equally valuable. As we observed earlier, though, the single-stat slopes suggest the relationship isn't really 1:1. I'll use **linear regression** to determine an equation of the form:

$$ \hat{\text{RE}24} = a \times \text{OBP} + b \times \text{SLG} + c $$

Here, $a$, $b$, and $c$ are coefficients. What I’m especially interested in is the relative size of $a$ and $b$.

I will be using the `statsmodels` package to construct my model.

In [7]:
# set up the independent and dependent variables
X = batters[['OBP', 'SLG']]
X = sm.add_constant(X)
y = batters['RE24']

# construct the model
model = sm.OLS(y, X).fit()

# extract the metrics
intercept = model.params['const']
slope_OBP = model.params['OBP']
slope_SLG = model.params['SLG']
r2 = model.rsquared
r = np.sqrt(r2)

# print clean output
print(f'Equation: RE24 = {intercept:.2f} + {slope_OBP:.2f}·OBP + {slope_SLG:.2f}·SLG')
print(f'Every point of OBP is about {slope_OBP / slope_SLG:.2f} times more valuable than every point of SLG.')
print(f'R²: {r2:.3f}')
print(f'r: {r:.3f}')

Equation: RE24 = -149.52 + 269.40·OBP + 155.75·SLG
Every point of OBP is about 1.73 times more valuable than every point of SLG.
R²: 0.807
r: 0.898


## INSIGHTS

- $R^2 = 0.807$: About 81% of the variability in run expectancy is explained by this model. Combining **OBP** and **SLG** captures more of the variation than either metric alone, showing that both reaching base **and** collecting extra bases are important contributors to run production.
- $r=0.898$: This combination of **OBP** and **SLG** is **strongly** correlated with run expectancy.
- By taking the ratio of the respective slopes for **OBP** and **SLG**, we see that every point of **OBP** is worth roughly **1.73 times** as much as a point of **SLG**, reinforcing that the relationship between these two slash stats and run production is **not** simply 1:1.

So, it would appear that when calculating **OPS**, we should really be weighting **OBP** more heavily than **SLG**. Meaning . . .

$$ \text{trOPS} = 1.73 \times \text{OBP} + \text{SLG} $$

Here, I am calculating what I call **True OPS (trOPS)**, which properly weights **OBP** based on the linear regression analysis.
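
To make the effect of the weight concrete, here is a small worked example with two hypothetical stat lines (invented for illustration, not real players) that share the same **OPS** but reach it in different ways. The function names are my own.

```python
def ops(obp, slg):
    """Standard OPS: OBP and SLG weighted equally."""
    return obp + slg


def tr_ops(obp, slg, obp_weight=1.73):
    """True OPS as defined above: OBP weighted by the regression ratio."""
    return obp_weight * obp + slg


# Two hypothetical (OBP, SLG) lines with identical OPS.
on_base_type = (0.400, 0.400)
slugger_type = (0.320, 0.480)

for label, (obp, slg) in [("on-base type", on_base_type),
                          ("slugger type", slugger_type)]:
    print(f"{label}: OPS = {ops(obp, slg):.3f}, trOPS = {tr_ops(obp, slg):.3f}")
```

Both lines come out to an **OPS** of .800, but **trOPS** rates the on-base-heavy line higher (about 1.092 versus 1.034) because each point of **OBP** counts 1.73 times.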

---

## Question 3: Does properly weighting OBP even matter?

We reweighted **OBP** to reflect its larger contribution, but does the reweighting really add much? Let's compare **trOPS** with vanilla **OPS**!

In [8]:
# calculate trOPS
batters['trOPS'] = 1.73 * batters['OBP'] + batters['SLG']

In [9]:
# the function only needs the column names, so a plain list is enough
slash_stats = ['OPS', 'trOPS']

for stat in slash_stats:
    model_run_expectancy(stat)

OPS MODEL
Equation: RE24 = -138.2760 + 191.0486·OPS
R²: 0.7958
r:  0.8921

trOPS MODEL
Equation: RE24 = -149.5198 + 155.7339·trOPS
R²: 0.8072
r:  0.8984



## INSIGHTS

$R^2$ for **trOPS** is higher, but only by about 1 percentage point (0.8072 vs. 0.7958). Properly weighting **OBP** lets us account for more of the variability in run value, but the gain is marginal at best.
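
It is worth noting why the **trOPS** fit looks so much like the two-stat model from Question 2 (intercept near -149.52, slope near 155.7, $R^2$ near 0.807). When the weight on **OBP** equals the ratio of the fitted coefficients, the composite stat reproduces the two-predictor fit exactly, so its $R^2$ is the ceiling for any weighted sum of the two stats. A minimal sketch of that property, using made-up synthetic data and plain `numpy` least squares rather than the real dataset or `statsmodels`:

```python
import numpy as np


def r_squared(y, fitted):
    """Share of the variance in y explained by the fitted values."""
    return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)


def fit(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, fitted)."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coefs = np.linalg.lstsq(design, y, rcond=None)[0]
    return coefs, design @ coefs


# Synthetic data only: two correlated made-up stats and a response.
rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(0.335, 0.035, n)
x2 = 0.1 + x1 + rng.normal(0, 0.05, n)
y = -150 + 270 * x1 + 155 * x2 + rng.normal(0, 10, n)

# Two-predictor model.
(b0, b1, b2), fitted_two = fit([x1, x2], y)
r2_two = r_squared(y, fitted_two)

# Composite stat weighted by the fitted ratio b1 / b2.
z = (b1 / b2) * x1 + x2
(c0, c1), fitted_composite = fit([z], y)
r2_composite = r_squared(y, fitted_composite)

# Equal-weight sum, the analogue of vanilla OPS.
(_, _), fitted_sum = fit([x1 + x2], y)
r2_sum = r_squared(y, fitted_sum)

print(f"two predictors: {r2_two:.4f}")
print(f"ratio-weighted: {r2_composite:.4f}")
print(f"equal weights:  {r2_sum:.4f}")
```

In other words, the comparison between **OPS** and **trOPS** is really a comparison between equal weights and the best possible weights, and in the real data the best possible weights buy only about one percentage point.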

---

Let's find the over- and underperformers for these two stats!

In [10]:
# predict the re24 based on the coefficients from each model
batters['pRE24_OPS'] = -138.2760 + 191.0486 * batters['OPS']
batters['pRE24_trOPS'] = -149.5198 + 155.7339 * batters['trOPS']

# calculate their residuals
batters['rRE24_OPS'] = batters['RE24'] - batters['pRE24_OPS']
batters['rRE24_trOPS'] = batters['RE24'] - batters['pRE24_trOPS']

## Export to CSV

In [12]:
filename = 'mlb_qualified_batters_2006_2015_processed.csv'
csv_path = os.path.join(processed_data_dir, filename)
batters.to_csv(csv_path)