# Simulation

## 9.1 Introduction

- Baseball has a nested structure (seasons are made of games, games of innings, and innings of plate appearances) that lends itself well to simulation.
- This chapter covers simulating half-innings with Markov chains and full seasons using the Bradley-Terry model.

## 9.2 Simulating a Half Inning

### 9.2.1 Markov chains

- A Markov chain models transitions between states, where each state describes the runners on base and the number of outs.
    - There are 24 possible combinations (8 runner configurations × 3 out counts), plus the inning-ending 3-outs state.
- The key assumption is that the probability of moving to a new state depends only on the current state, not what happened earlier in the inning.
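
As a quick illustration of the state space described above, the following sketch enumerates the 24 runners-and-outs states plus the absorbing 3-outs state. The `"RRR O"` label format mirrors the state labels used later in these notes; reading the digits as first, second, and third base is an assumption, and the variable names are illustrative only.

```python
from itertools import product

# Runner configurations: one digit per base, 1 = occupied.
# Assumption: digits are ordered first, second, third base.
runner_configs = ["".join(bits) for bits in product("01", repeat=3)]

# Pair every runner configuration with 0, 1, or 2 outs.
states = [f"{runners} {outs}" for runners in runner_configs for outs in range(3)]

# Add the single absorbing state that ends the half-inning.
all_states = states + ["3"]

print(len(states), len(all_states))  # prints: 24 25
```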

### 9.2.2 Review of work in run expectancy

Using Retrosheet play-by-play data, we can calculate how many runs are expected to score from each runners-and-outs state. The observed transitions between states in the same data form the foundation for our transition matrix.

In [3]:
import pandas as pd
import sys
sys.path.append('../src')
from retrosheet_utils import retrosheet_add_states

fields = pd.read_csv('../data/fields.csv')
headers = fields['Header'].str.lower().tolist()

retro2016 = pd.read_csv('../data/all2016.csv', names=headers, low_memory=False)
retro2016 = retrosheet_add_states(retro2016)

- `state` gives the runner locations and the number of outs at the beginning of each play.
- `new_state` contains the same information at the conclusion of the play.
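
The helper `retrosheet_add_states` is not shown in these notes, so the following is only a sketch of how a state label of this form could be built. The column names `on_1b`, `on_2b`, `on_3b`, and `outs` and the function `encode_state` are hypothetical, not the real Retrosheet headers or the helper's actual implementation.

```python
import pandas as pd

# Toy plays with hypothetical column names (not the real Retrosheet headers).
plays = pd.DataFrame({
    "on_1b": [False, True, True],
    "on_2b": [False, False, True],
    "on_3b": [False, False, False],
    "outs":  [0, 0, 1],
})

def encode_state(df):
    """Build 'RRR O' labels: one 0/1 digit per base, then the out count."""
    bases = (df["on_1b"].astype(int).astype(str)
             + df["on_2b"].astype(int).astype(str)
             + df["on_3b"].astype(int).astype(str))
    return bases + " " + df["outs"].astype(str)

plays["state"] = encode_state(plays)
print(plays["state"].tolist())  # prints: ['000 0', '100 0', '110 1']
```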

In [4]:
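# Combined score of both teams before each play.
retro2016['runs'] = retro2016['away_score_ct'] + retro2016['home_score_ct']
# Unique key per half-inning: game, inning number, and batting side.
retro2016['half_inning_id'] = (retro2016['game_id'] + ' ' +
                               retro2016['inn_ct'].astype(str) + ' ' +
                               retro2016['bat_home_id'].astype(str))

# Per half-inning: outs recorded, runs scored, and the score at the start.
half_innings = (retro2016
    .groupby('half_inning_id')
    .agg(
        outs_inning=('event_outs_ct', 'sum'),
        runs_inning=('runs_scored', 'sum'),
        runs_start=('runs', 'first')
    )
    .reset_index()
)

# Combined score at the end of the half-inning.
half_innings['max_runs'] = half_innings['runs_inning'] + half_innings['runs_start']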

- `half_inning_id` is a unique identifier for each half-inning of each game.
- `half_innings` is a new dataframe that aggregates the data over each half-inning played in 2016.

In [5]:
retro2016 = retro2016.merge(half_innings, on='half_inning_id', how='inner')

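# Keep plays where the state changed or runs scored, restricted to
# half-innings that reached three outs and to batting events.
retro2016_complete = retro2016[
    ((retro2016['state'] != retro2016['new_state']) | (retro2016['runs_scored'] > 0)) &
    (retro2016['outs_inning'] == 3) &
    (retro2016['bat_event_fl'] == 'T')
].copy()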


In [None]:
retro2016_complete['new_state'] = retro2016_complete['new_state'].str.replace(
    r'[0-1]{3} 3', '3', regex=True
)

- Runner locations no longer matter once there are three outs.
    - Every `new_state` with 3 outs is therefore recoded to the single value `3`.
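
To see what this recoding does, here is a minimal sketch on a toy series of state labels (the values are made up for illustration). Any label with three outs collapses to `3`, while other labels are left untouched.

```python
import pandas as pd

# Toy end-of-play states; two of them have three outs.
new_state = pd.Series(["100 1", "000 3", "110 3", "011 2"])

# Replace "<runners> 3" with the single absorbing label "3".
collapsed = new_state.str.replace(r"[0-1]{3} 3", "3", regex=True)

print(collapsed.tolist())  # prints: ['100 1', '3', '3', '011 2']
```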

### 9.2.3 Computing the transition probabilities

Compute the frequencies of all observed transitions between states by cross-tabulating `state` against `new_state`:

In [9]:
T_matrix = pd.crosstab(retro2016_complete['state'], retro2016_complete['new_state'])
T_matrix.shape

(24, 25)

There are 24 possible values of the beginning state `state`, and 25 values of the final state `new_state` including the 3-outs state.

Convert the counts to a probability matrix by dividing each row by its row total:

In [10]:
P_matrix = T_matrix.div(T_matrix.sum(axis=1), axis=0)
P_matrix.shape

(24, 25)
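
The count-then-normalize step can be checked by hand on a tiny example. The sketch below uses a made-up two-state chain (labels `A`, `B`, and the absorbing `3` are illustrative, not real game states) and applies the same `pd.crosstab` and `div` calls.

```python
import pandas as pd

# Five toy transitions: A->A, A->B, A->B, B->B, B->3.
toy = pd.DataFrame({
    "state":     ["A", "A", "A", "B", "B"],
    "new_state": ["A", "B", "B", "B", "3"],
})

# Rows are starting states, columns are ending states, cells are counts.
counts = pd.crosstab(toy["state"], toy["new_state"])

# Divide each row by its total so that every row sums to 1.
probs = counts.div(counts.sum(axis=1), axis=0)

print(probs)
```

Row `A` has three transitions, two of them to `B`, so its `B` entry is 2/3; row `B` splits evenly between `B` and `3`.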

In [11]:
P_matrix.loc['3'] = [0] * 24 + [1]

In [12]:
P_matrix.shape

(25, 25)

With the added row for the absorbing 3-outs state, which returns to itself with probability 1, the matrix `P_matrix` now has two important properties that allow it to model transitions between states in a Markov chain:
   1. It is square.
   2. The entries in each of its rows sum to 1.

In [13]:
P_matrix.sum(axis=1)

state
000 0    1.0
000 1    1.0
000 2    1.0
001 0    1.0
001 1    1.0
001 2    1.0
010 0    1.0
010 1    1.0
010 2    1.0
011 0    1.0
011 1    1.0
011 2    1.0
100 0    1.0
100 1    1.0
100 2    1.0
101 0    1.0
101 1    1.0
101 2    1.0
110 0    1.0
110 1    1.0
110 2    1.0
111 0    1.0
111 1    1.0
111 2    1.0
3        1.0
dtype: float64
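
These properties can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the matrix is a pandas DataFrame whose row and column labels are in the same order; the function name `is_valid_transition_matrix` and the toy matrix are illustrative only.

```python
import numpy as np
import pandas as pd

def is_valid_transition_matrix(P, absorbing="3"):
    """Check squareness, non-negativity, unit row sums, and the absorbing state."""
    square = P.shape[0] == P.shape[1] and list(P.index) == list(P.columns)
    rows_sum_to_one = np.allclose(P.sum(axis=1), 1.0)
    non_negative = bool((P.to_numpy() >= 0).all())
    absorbs = P.loc[absorbing, absorbing] == 1
    return bool(square and rows_sum_to_one and non_negative and absorbs)

# Toy 3-state chain with made-up probabilities.
labels = ["000 0", "000 1", "3"]
toy_P = pd.DataFrame(
    [[0.3, 0.7, 0.0],
     [0.0, 0.4, 0.6],
     [0.0, 0.0, 1.0]],
    index=labels, columns=labels,
)

print(is_valid_transition_matrix(toy_P))  # prints: True
```

Using `np.allclose` instead of `==` avoids false failures from floating-point rounding in the row sums.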