<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Run-Expectancy" data-toc-modified-id="Run-Expectancy-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Run Expectancy</a></span><ul class="toc-item"><li><span><a href="#Load-Play-by-Play-Data-from-Retrosheet" data-toc-modified-id="Load-Play-by-Play-Data-from-Retrosheet-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load Play-by-Play Data from Retrosheet</a></span></li><li><span><a href="#Runs-in-Remainder-of-Inning" data-toc-modified-id="Runs-in-Remainder-of-Inning-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Runs in Remainder of Inning</a></span></li><li><span><a href="#Run-Expectancy-Matrix" data-toc-modified-id="Run-Expectancy-Matrix-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Run Expectancy Matrix</a></span></li><li><span><a href="#On-Your-Own:-Evaluating-Strategies" data-toc-modified-id="On-Your-Own:-Evaluating-Strategies-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>On Your Own: Evaluating Strategies</a></span><ul class="toc-item"><li><span><a href="#Stealing-Bases" data-toc-modified-id="Stealing-Bases-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Stealing Bases</a></span></li><li><span><a href="#Bunting" data-toc-modified-id="Bunting-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>Bunting</a></span></li></ul></li><li><span><a href="#On-Your-Own:-End-of-Game-Stealing/Bunting" data-toc-modified-id="On-Your-Own:-End-of-Game-Stealing/Bunting-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>On Your Own: End of Game Stealing/Bunting</a></span><ul class="toc-item"><li><span><a href="#End-of-Game-Stealing" data-toc-modified-id="End-of-Game-Stealing-1.5.1"><span class="toc-item-num">1.5.1&nbsp;&nbsp;</span>End of Game Stealing</a></span></li><li><span><a href="#End-of-game-bunting" data-toc-modified-id="End-of-game-bunting-1.5.2"><span class="toc-item-num">1.5.2&nbsp;&nbsp;</span>End of game bunting</a></span></li></ul></li></ul></li></ul></div>

# Demo - Run Expectancy

This demo explores the concept of run expectancy and the _Run Expectancy Matrix_, an empirically driven measurement of how many runs we should expect to score in a given out/baserunner state.  We use run expectancy to explore basic baseball strategies like bunting and stealing.

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
from datascience import Table
import numpy as np

# custom functions that will help do some simple tasks
from datascience_utils import *
from datascience_stats import *
from datascience_topic import fast_run_expectancy

## Run Expectancy

### Load Play-by-Play Data from Retrosheet

Raw Retrosheet data (http://www.retrosheet.org/) contains play-by-play event logs, which record the events of a baseball game in a very raw form. Luckily for us, the software program Chadwick (found here: http://chadwick.sourceforge.net/doc/index.html) was created to handle much of the messy work of compiling the data into a usable form.  Chadwick converts the raw logs into CSV, which is what we use here.  Chadwick also computes some important quantities that we make use of.

Note: This notebook uses data from 2001.  We could have used more recent data but Barry Bonds is a baseball god so part of this notebook is an excuse to revel in his statistical absurdity.

For computing the Run Expectancy Matrix, as well as for other analysis, we only need a few columns.  The relevant columns are:
+ EVENT_ID - An ID for the event in the dataset
+ INN_CT - Inning number
+ EVENT_CD - A code for what happened in the event
+ OUTS_CT - Number of outs
+ BAT_ID - Retrosheet ID of the batter
+ BAT_LINEUP_ID - Place in the batting order.  1 through 9.
+ BAT_EVENT_FL - A T/F flag as to whether the play-by-play event corresponds to a plate appearance (T) or some other type of event (F).
+ START_BASES_CD - An integer code representing the state of the runners BEFORE the event, e.g., runner on 2nd
+ END_BASES_CD - An integer code representing the state of the runners AFTER the event ends.
+ EVENT_OUTS_CT - Number of outs recorded on this event
+ EVENT_RUNS_CT - Number of runs scored on this event
+ FATE_RUNS_CT - Number of runs scored in the rest of the half-inning AFTER this event

In [None]:
cols = ['EVENT_ID', 'INN_CT', 'EVENT_CD', 'OUTS_CT', 'BAT_ID', 'BAT_LINEUP_ID',
        'BAT_EVENT_FL', 'START_BASES_CD', 'END_BASES_CD', 'EVENT_OUTS_CT',
        'EVENT_RUNS_CT', 'FATE_RUNS_CT']
retro = Table.read_table('retrosheet_events-2001.csv.gz', sep=',', usecols=cols)

new_cols = ['ID', 'Inning', 'Event_Type', 'Outs', 'Batter_ID', 'Lineup_Order',
            'PA_Flag', 'Start_Bases', 'End_Bases', 'Event_Outs', 'Event_Runs',
            'Future_Runs']
retro.relabel(cols, new_cols)

retro.show(10)

The play-by-play data contains entries that are not plate appearances.  One example is balks.  We want to drop these entries because they are not relevant to the question we are trying to answer.  

We also want to drop all data from the ninth inning or later, because we want data that represents the regular course of play.  We hypothesize that end-of-game events violate some of our assumptions about regular strategic play, and that some of these innings are cut short by walk-off events.

These changes are not hugely impactful (for example, there are not that many non-batting events) but in the interest of completeness and proper data analysis, we perform this cleaning.

In [None]:
bat_mask = (retro['PA_Flag'] == "T")
retro = retro.where(bat_mask).copy()

inning_mask = (retro['Inning'] < 9)
retro = retro.where(inning_mask).copy()

Another thing we want to do is convert the integer codes for baserunner situations into recognizable strings.  This information is available from http://www.retrosheet.org/.

In [None]:
base_runner_codes = {
    0: "None on",  # No one on
    1: "1st",  # runner on 1st
    2: "2nd",  # runner on 2nd
    3: "1st and 2nd",  # runners on 1st & 2nd
    4: "3rd",  # runner on 3rd
    5: "1st and 3rd",  # runners on 1st & 3rd
    6: "2nd and 3rd",  # runners on 2nd & 3rd
    7: "Bases Loaded"  # bases loaded
}
# Replace the numeric code with a string code
retro['Start_Bases'] = replace(retro, 'Start_Bases', base_runner_codes)
retro['End_Bases'] = replace(retro, 'End_Bases', base_runner_codes)
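
The integer codes in the mapping above follow a bit pattern: 1 marks first base, 2 marks second, 4 marks third, and a code is the sum of the occupied bases. As a rough illustration only (the names `BASE_BITS` and `occupied_bases` are made up for this sketch and are not part of the notebook's utilities), the codes can be decoded like this:

```python
# Sketch: decode a base-state code (0-7) by testing its bits.
# BASE_BITS and occupied_bases are hypothetical names used only here.
BASE_BITS = {1: "1st", 2: "2nd", 4: "3rd"}


def occupied_bases(code):
    """Return the list of occupied bases for a base-state code from 0 to 7."""
    return [name for bit, name in BASE_BITS.items() if code & bit]


for code in range(8):
    print(code, occupied_bases(code))
```

This agrees with the dictionary above: code 5 decodes to 1st and 3rd, and code 7 to all three bases.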

In [None]:
event_codes = {
    0: 'Unknown',
    1: 'None',
    2: 'Generic out',
    3: 'K',  # Strikeout
    4: 'SB',  # Stolen Base
    5: 'Defensive indifference',
    6: 'CS',  # Caught stealing
    7: 'Pickoff error',
    8: 'Pickoff',
    9: 'Wild pitch',
    10: 'Passed ball',
    11: 'Balk',
    12: 'Other advance/out advancing',
    13: 'Foul error',
    14: 'BB',  # Walk
    15: 'IBB',  # Intentional walk
    16: 'HBP',  # Hit by pitch
    17: 'Interference',
    18: 'RBOE',  # Reached base on error
    19: 'FC',  # Fielder's choice
    20: '1B',  # Single
    21: '2B',  # Double
    22: '3B',  # Triple
    23: 'HR',  # Home run
    24: 'Missing play',
}

# Replace numeric code with string
retro['Event_Type'] = replace(retro, 'Event_Type', event_codes)

### Runs in Remainder of Inning

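To compute the Run Expectancy Matrix, we need the number of runs scored from each event through the end of the inning.  That is just the sum of `Event_Runs` (runs scored on the event) and `Future_Runs` (runs scored after it).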

In [None]:
retro['Runs_ROI'] = retro['Future_Runs'] + retro['Event_Runs']
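
To see why the sum works, here is a small sketch on a made-up half-inning. The list of runs per event is invented for illustration and the function name is hypothetical; in the real data Chadwick has already computed the future-runs column for us.

```python
def runs_rest_of_inning(event_runs):
    """For each event in one half-inning, return (event, future, total) runs."""
    rows = []
    for i, scored in enumerate(event_runs):
        future = sum(event_runs[i + 1:])  # runs scored after this event
        rows.append((scored, future, scored + future))
    return rows


# Hypothetical half-inning: runs scored on each of six plate appearances.
half_inning = [0, 2, 0, 0, 1, 0]
for event, future, total in runs_rest_of_inning(half_inning):
    print(f"Event_Runs: {event}  Future_Runs: {future}  Runs_ROI: {total}")
```

The total only drops once a run has actually scored, so every plate appearance before the first scoring play is credited with all of the inning's runs.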

### Run Expectancy Matrix

The Run Expectancy Matrix is computed by grouping by `Outs` and `Start_Bases` and computing an average.  For each out and baserunner combination, this collects all plate appearances in our dataset that began in that state and averages the runs scored from that plate appearance through the end of the inning.  We are left with the 24 values (3 out counts times 8 baserunner states) that comprise the Run Expectancy Matrix.

In [None]:
re = retro.select('Outs', 'Start_Bases', 'Runs_ROI').\
    group(['Outs', 'Start_Bases'], np.mean)
re.relabel('Runs_ROI mean', 'RE')
re.pivot('Outs', 'Start_Bases', values='RE', collect=np.sum).\
    sort('0')
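
For readers who want to see what the group-and-average step does without the `Table` machinery, the following is a minimal plain-Python sketch. The rows are toy values invented for illustration, not Retrosheet data, and `run_expectancy` is a hypothetical name.

```python
from collections import defaultdict


def run_expectancy(rows):
    """Average Runs_ROI for each (outs, start_bases) state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [sum of runs, count]
    for outs, bases, runs_roi in rows:
        totals[(outs, bases)][0] += runs_roi
        totals[(outs, bases)][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


# Toy rows of (Outs, Start_Bases, Runs_ROI); values are made up.
toy_rows = [
    (0, "None on", 0),
    (0, "None on", 1),
    (0, "None on", 2),
    (1, "1st", 0),
    (1, "1st", 1),
]
for state, value in sorted(run_expectancy(toy_rows).items()):
    print(state, f"{value:.3f}")
```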

_Questions_
1. What does the Run Expectancy Matrix describe about the nature of baseball?  
2. What can you observe about the difference between having a runner on second base vs third base (consider 2nd vs 3rd and 1st and 2nd vs 1st and 3rd)?  
3. How do things change with 2 outs?
4. What does run expectancy tell us generally about outs?  That is, how valuable is one extra base compared to an out?

### On Your Own: Evaluating Strategies
#### Stealing Bases

We can perform a simple analysis of the strategy of stealing bases by computing a success rate for a base stealer that makes the run expectancy of the steal attempt equivalent to the run expectancy of the current state.

The current state has run expectancy $\mathit{RE}_{\text{Current}}$ while the steal attempt has run expectancy of
$$
    \mathit{RE}_{\text{Steal Attempt}} = \mathit{RE}_{\text{SB}} \cdot p_{\text{SB}} +
        \mathit{RE}_{\text{CS}} \cdot (1 - p_{\text{SB}})
$$
Equating $\mathit{RE}_{\text{Current}} = \mathit{RE}_{\text{Steal Attempt}}$ gives the equalizing probability
$$
    p_{\text{SB}} = \frac{\mathit{RE}_{\text{Current}} - \mathit{RE}_{\text{CS}}}{\mathit{RE}_{\text{SB}} - \mathit{RE}_{\text{CS}}}
$$
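
As a quick numeric check of the break-even formula, here is a sketch that plugs in three run expectancy values directly. The numbers are hypothetical placeholders, not values from the 2001 matrix, and the function name is an assumption of this sketch.

```python
def break_even_steal_probability(re_current, re_sb, re_cs):
    """Success rate at which attempting the steal matches staying put."""
    if re_sb == re_cs:
        raise ValueError("RE after a steal and after a caught stealing must differ")
    return (re_current - re_cs) / (re_sb - re_cs)


# Hypothetical run expectancies, chosen only to illustrate the arithmetic.
p = break_even_steal_probability(re_current=0.9, re_sb=1.1, re_cs=0.3)
print(f"P(SB): {p:.3f}")

# Sanity check: at the break-even rate, the attempt is worth RE_current.
re_attempt = 1.1 * p + 0.3 * (1 - p)
print(f"RE of attempt at break-even: {re_attempt:.3f}")
```

A runner who succeeds more often than this rate adds expected runs by attempting the steal; one who succeeds less often costs runs.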

To do:
+ A helper function has been provided: give it the RE table, a number of outs, and a baserunner situation, and it returns the corresponding RE value.
+ Write a function to compute $p_{\text{SB}}$.  The function should take these inputs:
    + RE table (the non-pivoted version)
    + Starting baserunner situation
    + Caught stealing baserunner situation
    + Stolen base baserunner situation.
+ Consider three baserunner situations: "1st", "2nd", and "1st and 2nd" (i.e., a double steal). Compute the break-even success probability for each baserunner situation at each number of outs.  In the case of the double steal, we consider the natural case where the lead runner is the one thrown out.
+ For each baserunner situation, iterate over the number of outs and print the results using this string:  
`f"Outs: {outs}  P(SB): {p:.3f}"`

In [None]:
def get_matrix_val(table, outs, base):
    """Return the RE value for (outs, base), or None if that state is not in the table."""
    for o, b, v in table.to_array():
        if outs == o and base == b:
            return v
    return None
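
One thing to watch for: the helper returns `None` when the state is not in the table, which happens when a caught stealing with 2 outs produces a third out. A possible way to handle this, sketched here under the assumption that a finished inning is worth 0 further runs, is a lookup that short-circuits at 3 outs. The name `get_matrix_val_or_zero` is hypothetical, and `rows` stands in for any iterable of `(outs, bases, value)` triples.

```python
def get_matrix_val_or_zero(rows, outs, base):
    """Look up RE for a state, treating a finished inning (3 outs) as 0 runs."""
    if outs >= 3:
        return 0.0  # assumption: no more runs can score once the inning is over
    for o, b, v in rows:
        if outs == o and base == b:
            return v
    raise KeyError(f"no entry for outs={outs}, base={base!r}")


# Made-up values, only to exercise the lookup.
toy_re = [(2, "1st", 0.2), (2, "2nd", 0.3)]
print(get_matrix_val_or_zero(toy_re, 2, "1st"))
print(get_matrix_val_or_zero(toy_re, 3, "None on"))
```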

_Questions_
1. At about what success probability do most situations balance out?
2. What about for the double steal?  What does this analysis suggest about the risk-reward of a double steal?
3. What about with a runner at second with 2 outs?  Does this make sense based on how you understand the baseball run scoring process?

#### Bunting

For bunting, we take a simpler view.  Assume we can execute the bunt strategy 100% of the time with a regular hitter (i.e., not a pitcher at the plate).  Should we use this strategy?

To do:
+ Write a function that prints a string comparing the run expectancy for various baserunner and out situations.  Use the string `f"Outs: {outs} RE_curr: {val_curr:.3f} RE_bunt: {val_bunt:.3f}"` to print out the results.  The function should take as inputs:
    + RE table (the non-pivoted version)
    + Starting baserunner situation
    + Ending baserunner situation following successful bunt
+ Consider the three baserunner situations: "1st", "2nd", and "1st and 2nd". Compute the 4 values for each of the baserunner situations with varying out situations.    For each baserunner situation, iterate over the number of outs and print the results.
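
One way to think about the comparison: a successful sacrifice bunt trades an out for a base, so the state moves from `(outs, start)` to `(outs + 1, end)`. The sketch below assumes the usual outcomes (each runner advances one base and the batter is out) and uses a plain dictionary with invented values in place of the RE table; `BUNT_TRANSITIONS`, `compare_bunt`, and `toy_re` are hypothetical names.

```python
# Assumed outcome of a successful sacrifice bunt: runners advance one base.
BUNT_TRANSITIONS = {
    "1st": "2nd",
    "2nd": "3rd",
    "1st and 2nd": "2nd and 3rd",
}


def compare_bunt(re_lookup, start_bases, end_bases):
    """Return one comparison line per out count where a bunt keeps the inning alive."""
    lines = []
    for outs in range(2):  # with 2 outs, the bunt out would end the inning
        val_curr = re_lookup[(outs, start_bases)]
        val_bunt = re_lookup[(outs + 1, end_bases)]
        lines.append(
            f"Outs: {outs} RE_curr: {val_curr:.3f} RE_bunt: {val_bunt:.3f}"
        )
    return lines


# Invented values keyed by (outs, bases); not taken from the 2001 data.
toy_re = {(0, "1st"): 0.9, (1, "1st"): 0.5, (1, "2nd"): 0.7, (2, "2nd"): 0.3}
for line in compare_bunt(toy_re, "1st", BUNT_TRANSITIONS["1st"]):
    print(line)
```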

_Questions_
1. What does run expectancy say about whether you should bunt?  What does this tell us about the value of an out?
2. What if we consider pitchers?  Does this analysis make sense?

### On Your Own: End of Game Stealing/Bunting

Let's consider bunts or steals in the context of an end-of-game strategy.  Let's say we are the home team batting in the last half of the ninth inning (or later) and the game is tied.  We get a runner on base.  Should we bunt or steal?

In this situation we are no longer interested in expected runs but rather win probability.  And in this case, our probability of winning is just the probability of scoring at least one run.  We can compute a run/win probability matrix instead of a run expectancy matrix and use that to analyze the strategies.

To do:
+ Compute the win probability matrix, but instead of using `np.mean` as we did with the table `re`, use a function that computes
$$
    \text{Probability of at least 1 run}
        = \frac{\text{Number of times scoring $\geq$ 1 run}}
                {\text{Total number of observations}}
$$
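
The ratio above can be written as a small aggregation function. This is a sketch with a hypothetical name, `prob_at_least_one_run`; it takes any sequence of runs-in-remainder-of-inning values, so it could plausibly be passed wherever `np.mean` was used as the collecting function.

```python
def prob_at_least_one_run(runs):
    """Fraction of observations in which at least one run scored."""
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one observation")
    return sum(1 for r in runs if r >= 1) / len(runs)


# Made-up observations of runs scored in the remainder of the inning.
print(prob_at_least_one_run([0, 0, 1, 3]))
```

Unlike the mean, this treats a three-run inning the same as a one-run inning, which matches the end-of-game setting where only the first run matters.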

#### End of Game Stealing

Perform the exact same base stealing analysis as before but use the win probability matrix instead.  You should be able to reuse almost all of the code from before but with the WP matrix.

_Questions_
1. What changed?  What do the new results say about getting a runner into scoring position in order to win?

#### End of game bunting

Repeat the bunting analysis but use the win probability matrix instead.

_Questions_
1. Does it now make sense to bunt when you only need 1 run?
2. What does the analysis say about an out now in an end of game situation?
3. Would you use a bunt strategy?

_Questions_
1. What are the limits of this kind of analysis?  What kind of caveats are there?  To what kind of hitter and team circumstances does this analysis apply?  Can this analysis still be useful?
2. Brainstorm some ways you might try to augment the analysis to improve it.