# Simulation

## 9.1 Introduction

- Baseball's nested structure (seasons made of games, games of innings, innings of plate appearances) lends itself well to simulation.
- This chapter covers simulating half-innings with Markov chains and full seasons using the Bradley-Terry model.

## 9.2 Simulating a Half Inning

### 9.2.1 Markov chains

- A Markov chain models transitions between states, where each state describes the runners on base and the number of outs.
    - There are 24 such combinations (8 base configurations × 3 out counts), plus the inning-ending three-out state.
- The key assumption is that the probability of moving to a new state depends only on the current state, not what happened earlier in the inning.
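The state space described above can be enumerated directly. The following is a minimal sketch, assuming the label format `"<bases> <outs>"` used later in this section and that the three digits stand for first, second, and third base (1 = occupied); the variable names are illustrative only.

```python
from itertools import product

# Eight base configurations, one digit per base (assumed order: first, second, third).
bases = ["".join(bits) for bits in product("01", repeat=3)]
outs = [0, 1, 2]

# Transient states are labelled "<bases> <outs>", e.g. "101 1".
states = [f"{b} {o}" for b in bases for o in outs]

# The inning-ending three-out state is absorbing and gets its own label.
states.append("3")

print(len(states))   # 25
print(states[:4])    # ['000 0', '000 1', '000 2', '001 0']
```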

### 9.2.2 Review of work in run expectancy

Using Retrosheet play-by-play data, we can calculate how many runs are expected to score from each runners-and-outs state. These empirical frequencies form the foundation for our transition matrix.

In [3]:
import pandas as pd
import sys
sys.path.append('../src')
from retrosheet_utils import retrosheet_add_states

fields = pd.read_csv('../data/fields.csv')
headers = fields['Header'].str.lower().tolist()

retro2016 = pd.read_csv('../data/all2016.csv', names=headers, low_memory=False)
retro2016 = retrosheet_add_states(retro2016)

- `state` gives the runner locations and the number of outs at the beginning of each play.
- `new_state` contains the same information at the conclusion of the play.

In [4]:
retro2016['runs'] = retro2016['away_score_ct'] + retro2016['home_score_ct']
retro2016['half_inning_id'] = (retro2016['game_id'] + ' ' + 
                               retro2016['inn_ct'].astype(str) + ' ' + 
                               retro2016['bat_home_id'].astype(str))

half_innings = (retro2016
    .groupby('half_inning_id')
    .agg(
        outs_inning=('event_outs_ct', 'sum'),
        runs_inning=('runs_scored', 'sum'),
        runs_start=('runs', 'first')
    )
    .reset_index()
)

half_innings['max_runs'] = half_innings['runs_inning'] + half_innings['runs_start']

- `half_inning_id` is a unique identifier for each half-inning of each game.
- `half_innings` is a new dataframe that contains data aggregated over each half-inning of baseball played in 2016.

In [5]:
retro2016 = retro2016.merge(half_innings, on='half_inning_id', how='inner')

retro2016_complete = retro2016[
    ((retro2016['state'] != retro2016['new_state']) | (retro2016['runs_scored'] > 0)) &
    (retro2016['outs_inning'] == 3) &
    (retro2016['bat_event_fl'] == 'T')
].copy()


In [None]:
retro2016_complete['new_state'] = retro2016_complete['new_state'].str.replace(
    r'[0-1]{3} 3', '3', regex=True
)

- The runner locations in `new_state` are irrelevant once there are three outs.
    - After this replacement, `new_state` always has the value `3` when the number of outs is equal to 3.
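The effect of that replacement can be seen on a few hand-written labels. This is a small sketch with made-up values, not rows from the Retrosheet data:

```python
import pandas as pd

# Hypothetical end-of-play labels: two end the inning, two do not.
new_state = pd.Series(["000 3", "101 3", "100 2", "011 1"])

# Any "<bases> 3" label collapses to the single absorbing label "3".
collapsed = new_state.str.replace(r"[0-1]{3} 3", "3", regex=True)

print(collapsed.tolist())  # ['3', '3', '100 2', '011 1']
```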

### 9.2.3 Computing the transition probabilities

The first step is to compute the frequencies of all possible transitions between states.

In [9]:
T_matrix = pd.crosstab(retro2016_complete['state'], retro2016_complete['new_state'])
T_matrix.shape

(24, 25)

There are 24 possible values of the beginning state `state`, and 25 values of the final state `new_state` including the 3-outs state.

Dividing each row by its total converts the counts into transition probabilities:

In [10]:
P_matrix = T_matrix.div(T_matrix.sum(axis=1), axis=0)
P_matrix.shape

(24, 25)

In [11]:
# Add a row for the absorbing three-out state: from '3' the chain stays in '3' with probability 1
P_matrix.loc['3'] = [0] * 24 + [1]

In [12]:
P_matrix.shape

(25, 25)

The matrix `P_matrix` now has two important properties that allow it to model transitions between states in a Markov chain:
   1. it is square
   2. the entries in each of its rows sum to 1.
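The same construction (count transitions, normalise each row, append an absorbing row) can be traced on a toy example. The six plays below are invented for illustration, and the absorbing row is built from the column labels rather than by position, which is a small departure from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical plays: two transient states and the absorbing state "3".
plays = pd.DataFrame({
    "state":     ["000 2", "000 2", "000 2", "000 2", "100 2", "100 2"],
    "new_state": ["100 2", "100 2", "3",     "3",     "3",     "000 2"],
})

counts = pd.crosstab(plays["state"], plays["new_state"])  # 2 x 3 count matrix
probs = counts.div(counts.sum(axis=1), axis=0)            # each row now sums to 1

# Absorbing row: probability 1 of staying in "3", 0 elsewhere.
probs.loc["3"] = [float(col == "3") for col in probs.columns]

print(probs.shape)                              # (3, 3)
print(np.allclose(probs.sum(axis=1), 1.0))      # True
```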

In [13]:
P_matrix.sum(axis=1)

state
000 0    1.0
000 1    1.0
000 2    1.0
001 0    1.0
001 1    1.0
001 2    1.0
010 0    1.0
010 1    1.0
010 2    1.0
011 0    1.0
011 1    1.0
011 2    1.0
100 0    1.0
100 1    1.0
100 2    1.0
101 0    1.0
101 1    1.0
101 2    1.0
110 0    1.0
110 1    1.0
110 2    1.0
111 0    1.0
111 1    1.0
111 2    1.0
3        1.0
dtype: float64

In [14]:
(P_matrix.loc[['000 0']]
    .T
    .reset_index()
    .rename(columns={'index': 'new_state', '000 0': 'Prob'})
    .query('Prob > 0')
)

|    | new_state | Prob     |
|---:|:----------|---------:|
|  0 | 000 0     | 0.033364 |
|  1 | 000 1     | 0.675935 |
|  3 | 001 0     | 0.005627 |
|  6 | 010 0     | 0.050267 |
| 12 | 100 0     | 0.234807 |


In [16]:
(P_matrix.loc[['010 2']]
    .T
    .reset_index()
    .rename(columns={'index': 'new_state', '010 2': 'Prob'})
    .query('Prob > 0')
)

|    | new_state | Prob     |
|---:|:----------|---------:|
|  2 | 000 2     | 0.023319 |
|  5 | 001 2     | 0.005867 |
|  8 | 010 2     | 0.05762  |
| 11 | 011 2     | 0.000451 |
| 14 | 100 2     | 0.07447  |
| 17 | 101 2     | 0.032496 |
| 20 | 110 2     | 0.155709 |
| 24 | 3         | 0.650068 |


### 9.2.4 Simulating the Markov chain

The number of runs scored on a transition equals the sum of runners and outs before the play, minus the sum after, plus one (for the batter):

$$\text{runs} = (N_{runners}^{before} + O^{before} + 1) - (N_{runners}^{after} + O^{after})$$
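A few hand-checked transitions illustrate the formula. This is a sketch with hypothetical helper names; it covers only transitions between the 24 runners-and-outs states, since the notebook assigns zero runs to any transition into the three-out state separately.

```python
def runners_plus_outs(state: str) -> int:
    """Sum the digits of a label such as '110 1' (two runners + one out = 3)."""
    return sum(int(c) for c in state if c.isdigit())


def runs_on_transition(state: str, new_state: str) -> int:
    """Apply runs = (before + 1) - after to a single transition."""
    return runners_plus_outs(state) + 1 - runners_plus_outs(new_state)


print(runs_on_transition("100 0", "000 0"))  # two-run home run: 2
print(runs_on_transition("000 0", "100 0"))  # batter reaches first, nobody scores: 0
print(runs_on_transition("111 1", "010 1"))  # bases-clearing double: 3
```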

In [17]:
def num_havent_scored(s):
    return sum(int(c) for c in s if c.isdigit())

runners_out = {state: num_havent_scored(state) for state in T_matrix.index}
runners_out = pd.Series(runners_out)

In [18]:
import numpy as np

R_runs = np.subtract.outer(runners_out.values + 1, runners_out.values)
R_runs = pd.DataFrame(R_runs, index=runners_out.index, columns=runners_out.index)
R_runs['3'] = 0


The simulation randomly samples state transitions using the probability matrix until reaching 3 outs, accumulating runs along the way.

In [19]:
def simulate_half_inning(P, R, start=0):
    s = start
    runs = 0
    while s < 24:  # 24 is the "3" absorbing state index
        s_new = np.random.choice(25, p=P.iloc[s].values)
        runs += R.iloc[s, s_new]
        s = s_new
    return runs


In [20]:
np.random.seed(111653)
simulated_runs = [simulate_half_inning(P_matrix, R_runs) for _ in range(10000)]

In [21]:
pd.Series(simulated_runs).value_counts().sort_index()

0    7245
1    1473
2     729
3     323
4     131
5      53
6      24
7      17
8       2
9       3
Name: count, dtype: int64

**Key findings from 10,000 simulated half-innings:**
- ~72% of half-innings produce zero runs
- Just under 1% produce 5+ runs
- Average runs per half-inning ≈ 0.5

In [22]:
sum(np.array(simulated_runs) >= 5) / 10000

np.float64(0.0099)

In [23]:
np.mean(simulated_runs)

np.float64(0.4995)
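As a cross-check, the same summary figures can be recomputed from the frequency table alone. The sketch below copies the counts from the output above; the standard error line is an added illustration of how much Monte Carlo noise a 10,000-draw mean carries, not a calculation from the original notebook.

```python
import numpy as np

runs = np.arange(10)
counts = np.array([7245, 1473, 729, 323, 131, 53, 24, 17, 2, 3])
n = counts.sum()

share_zero = counts[0] / n                   # share of scoreless half-innings
share_5plus = counts[runs >= 5].sum() / n    # share with five or more runs
mean_runs = (runs * counts).sum() / n        # average runs per half-inning

# Standard error of the simulated mean (population variance / n, then square root).
var_runs = (runs**2 * counts).sum() / n - mean_runs**2
se_mean = np.sqrt(var_runs / n)

print(round(share_zero, 4), round(share_5plus, 4), round(mean_runs, 4))  # 0.7245 0.0099 0.4995
print(round(se_mean, 4))  # roughly 0.01
```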

In [24]:
def runs_j(j):
    return np.mean([simulate_half_inning(P_matrix, R_runs, j) for _ in range(10000)])

erm_2016_mc = pd.DataFrame({
    'state': P_matrix.index[:24],
    'mean_run_value': [runs_j(j) for j in range(24)]
})

erm_2016_mc['bases'] = erm_2016_mc['state'].str[:3]
erm_2016_mc['outs_ct'] = erm_2016_mc['state'].str[4].astype(int)
erm_2016_mc = erm_2016_mc.drop(columns='state')

erm_2016_mc.pivot(index='bases', columns='outs_ct', values='mean_run_value')

| bases \ outs_ct | 0      | 1      | 2      |
|:----------------|-------:|-------:|-------:|
| 000             | 0.4672 | 0.2533 | 0.099  |
| 001             | 1.3139 | 0.9218 | 0.3236 |
| 010             | 1.0939 | 0.6433 | 0.3015 |
| 011             | 1.9009 | 1.3133 | 0.4937 |
| 100             | 0.8413 | 0.5094 | 0.2181 |
| 101             | 1.7102 | 1.1466 | 0.4542 |
| 110             | 1.3793 | 0.8608 | 0.4153 |
| 111             | 2.1934 | 1.4757 | 0.6853 |


**Validating the model:** Simulate 10,000 half-innings from each starting state and compare the mean runs to the empirical run expectancy matrix from Chapter 5.

In [25]:
erm_2016 = pd.read_pickle('../data/erm_2016.pkl')

comparison = erm_2016.merge(erm_2016_mc, on=['bases', 'outs_ct'], suffixes=('_emp', '_mc'))
comparison['run_value_diff'] = (comparison['mean_run_value_emp'] - comparison['mean_run_value_mc']).round(2)

comparison[['bases', 'outs_ct', 'run_value_diff']].pivot(
    index='bases', 
    columns='outs_ct', 
    values='run_value_diff'
)

| bases \ outs_ct | 0     | 1    | 2    |
|:----------------|------:|-----:|-----:|
| 000             | 0.03  | 0.01 | 0.01 |
| 001             | 0.03  | 0.02 | 0.05 |
| 010             | 0.04  | 0.03 | 0.01 |
| 011             | 0.03  | 0.04 | 0.05 |
| 100             | 0.02  | 0.0  | 0.0  |
| 101             | 0.01  | 0.05 | 0.02 |
| 110             | 0.07  | 0.06 | -0.0 |
| 111             | -0.09 | 0.06 | 0.01 |


All differences are within ±0.1 runs (the largest is -0.09, for bases loaded with no outs), so the Markov chain simulation closely matches the empirical run expectancy.