# Lab 2: Theory of Baseball

Author: Jeff Xiang

# Due: [MM/DD/YY]

Today, we'll learn the math and statistics concepts behind several tools used in baseball analytics. After today's lab, you should have a solid grasp of the following statistical concepts:

- Expected values
- Bayes' theorem
- Run expectancy

As usual, submit this lab by running the tests at the very bottom.

In [1]:
import math
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from datascience import *
import seaborn as sns

from client.api.notebook import Notebook

In [23]:
mlb_data = Table.read_table("Teams.csv").where('yearID', are.above(2005))
totalruns = mlb_data.column("R").sum()
totalinnings = (mlb_data.column("IPouts")/3).sum()
runsperinning = totalruns/totalinnings
totalruns

237465

In [4]:
pitching_data = Table.read_table("Pitching.csv")
pitching_data

playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,SHO,SV,IPouts,H,ER,HR,BB,SO,BAOpp,ERA,IBB,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP
bechtge01,1871,1,PH1,,1,2,3,3,2,0,0,78,43,23,0,11,1,,7.96,,,,0,,,42,,,
brainas01,1871,1,WS3,,12,15,30,30,30,0,0,792,361,132,4,37,13,,4.5,,,,0,,,292,,,
fergubo01,1871,1,NY2,,0,0,1,0,0,0,0,3,8,3,0,0,0,,27.0,,,,0,,,9,,,
fishech01,1871,1,RC1,,4,16,24,24,22,1,0,639,295,103,3,31,15,,4.35,,,,0,,,257,,,
fleetfr01,1871,1,NY2,,0,1,1,1,1,0,0,27,20,10,0,3,0,,10.0,,,,0,,,21,,,
flowedi01,1871,1,TRO,,0,0,1,0,0,0,0,3,1,0,0,0,0,,0.0,,,,0,,,0,,,
mackde01,1871,1,RC1,,0,1,3,1,1,0,0,39,20,5,0,3,1,,3.46,,,,0,,,30,,,
mathebo01,1871,1,FW1,,6,11,19,19,19,1,0,507,261,97,5,21,17,,5.17,,,,2,,,243,,,
mcbridi01,1871,1,PH1,,18,5,25,25,25,0,0,666,285,113,3,40,15,,4.58,,,,0,,,223,,,
mcmuljo01,1871,1,TRO,,12,15,29,29,28,0,0,747,430,153,4,75,12,,5.53,,,,0,,,362,,,


# Part 1: Expected Values

In probability and statistics, a Bernoulli trial is an experiment with only two possible outcomes: success or failure. A binomial random variable $X$ counts the number of successes in $N$ independent trials that each succeed with the same probability. Its expected value, $E(X)$, is the number of successes we expect over those $N$ trials. The formula for $E(X)$ is shown below:



$E(X) = P(X) * N$



where $P(X)$ is the probability of success per trial, and $N$ is the number of trials.
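As a quick numerical check of the formula, here is a minimal sketch. The numbers are hypothetical (a success probability of 0.25 over 200 trials, not taken from this lab), and the helper name `expected_successes` is chosen only for illustration.

```python
def expected_successes(p_success, n_trials):
    """Expected number of successes, E(X) = P(X) * N."""
    if not 0 <= p_success <= 1:
        raise ValueError("p_success must be between 0 and 1")
    return p_success * n_trials

# Hypothetical example: 200 trials, each succeeding with probability 0.25.
example_expectation = expected_successes(0.25, 200)
print(example_expectation)  # 50.0
```

Note that the expected value is a long-run average, so it does not have to be a whole number even though the count of successes always is.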

**Question 1:** Suppose the probability of an unfair coin landing on tails is 0.63. What is the expected number of tails in 1000 flips?

**Question 2a:** The table `pitching_data` contains stats for every MLB pitcher since 1871. Suppose we're back in 2008, when Clayton Kershaw (kershcl01) had his fantastic debut season. Based only on his 2008 strikeouts (SO) and batters faced by pitcher (BFP), what was the probability that Kershaw would strike out a batter he faced?

**Question 2b:** Among all pitchers from 2008 to 2016 who played at least 20 games (G) in a season, what is the average number of batters faced per season?

**Question 2c:** Suppose Clayton Kershaw faced the average number of batters per season from 2008 to 2016, the number you calculated in Question 2b. Using the strikeout probability from his debut season, find the expected value of his total number of strikeouts between 2008 and 2016.
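The three parts of Question 2 follow one pattern: estimate a per-trial probability from counts, estimate a number of trials, then multiply. Below is a minimal sketch of that pattern in plain Python. The rows are made up for hypothetical pitchers (they are not values from `pitching_data`), and the variable names are illustrative only.

```python
# Made-up season rows: (playerID, yearID, G, SO, BFP). Not real data.
rows = [
    ("pitchaa01", 2010, 30, 150, 600),
    ("pitchbb01", 2010, 25, 100, 500),
    ("pitchcc01", 2010, 5, 10, 60),    # too few games; filtered out below
    ("pitchbb01", 2011, 32, 160, 700),
]

# Step 1: per-batter strikeout probability for one pitcher-season.
_, _, _, so, bfp = rows[0]
p_strikeout = so / bfp  # 150 / 600

# Step 2: average batters faced per season among rows with at least 20 games.
qualified_bfp = [r[4] for r in rows if r[2] >= 20]
avg_bfp = sum(qualified_bfp) / len(qualified_bfp)  # (600 + 500 + 700) / 3

# Step 3: expected strikeouts over a number of seasons at that workload.
n_seasons = 2  # hypothetical span
expected_so = p_strikeout * avg_bfp * n_seasons

print(p_strikeout, avg_bfp, expected_so)  # 0.25 600.0 300.0
```

In the notebook, the same filtering would be done on a `Table` with the `where` and `column` methods already used in the cells above.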

# Part 2: Bayes' Theorem

Bayes' theorem is a useful tool for finding the probability of an event given that a related condition is known to hold. It states that the probability of an event A occurring, given that condition B is true, equals the probability of B given that A is true, multiplied by the probability of A without regard to B, divided by the probability of B without regard to A. In mathematical notation, it is stated as follows:

$P(A | B) = P(B | A) * P(A) / P(B)$
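The formula translates directly into a small function. This is a sketch with hypothetical inputs, none of which come from the questions in this lab; the function name `bayes` is an illustrative choice.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical inputs: P(B | A) = 0.5, P(A) = 0.2, P(B) = 0.25.
posterior = bayes(0.5, 0.2, 0.25)
print(round(posterior, 4))  # 0.4
```

Here observing B doubles the probability of A, from 0.2 to 0.4, because B is twice as likely when A is true (0.5) as it is overall (0.25).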

**Question 1a:** Berkeley is notorious for rainy winters. On the morning of one of your final exams, you look outside and notice that it is cloudy. To decide whether to bring an umbrella, you calculate the probability that it will be a rainy day, given the following information about Berkeley's winter weather patterns:

- 75% of all rainy days start off cloudy
- 63% of days in December are rainy
- Cloudy mornings are common: 70% of all mornings start off cloudy, regardless of whether it rains later in the day

Given the information above, calculate the probability that it will be a rainy day, given that the morning is cloudy.

In [13]:
batting_data = Table.read_table("Batting.csv")

**Question 2:** Given Bryce Harper's stats over the course of his career since his MLB debut in 2012, what is the probability that Bryce Harper hits a home run given that the ball was indeed hit? Relevant data can be found in `batting_data`.

# Part 3: Run Expectancy

In [29]:
play_by_play = Table.read_table("retrosheet-events-plus-woba-2005_2015.csv")

Run expectancy describes the number of runs a team is expected to score by the end of an inning from a given base/out state. For example, at the beginning of an inning, the base/out state is 0 outs with the bases empty. Given only this information, one can estimate the number of runs an average MLB team would score by the end of that inning.

Run expectancy is therefore just the expected value of the number of runs a team would score by the end of an inning, given the current base/out state. In mathematical notation, it is:

$E(R) = P(S) * 1$

where $S$ is the event that a team scores before the end of an inning from a given base/out state. Since base/out states reset at the end of every inning, we're only concerned with one inning at a time, so we multiply the probability $P(S)$ by 1 to obtain the expected runs.
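One way to estimate $P(S)$ for a state is to count: among the times a team was in that state, how often did it go on to score before the inning ended? The sketch below does this with made-up observations. The state labels and the data layout are assumptions for illustration, not the schema of `play_by_play`.

```python
from collections import defaultdict

# Made-up observations: (base/out state, whether the team scored afterward
# in that same inning). Not drawn from the Retrosheet file.
observations = [
    ("empty/0 outs", True),
    ("empty/0 outs", False),
    ("empty/0 outs", False),
    ("empty/0 outs", False),
    ("1B/0 outs", True),
    ("1B/0 outs", True),
    ("1B/0 outs", False),
    ("1B/0 outs", True),
]

counts = defaultdict(lambda: [0, 0])  # state -> [times scored, times seen]
for state, scored in observations:
    counts[state][0] += int(scored)
    counts[state][1] += 1

# Under this lab's definition, E(R) = P(S) * 1 for each state.
run_expectancy = {s: scored / seen * 1 for s, (scored, seen) in counts.items()}
print(run_expectancy)  # {'empty/0 outs': 0.25, '1B/0 outs': 0.75}
```

With real play-by-play data the counts would be far larger, but the estimate for each state is still a simple ratio of this kind.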

**Question 1:** Between 2006 and 2015, MLB teams played a total of 477482 innings in the regular season, scoring 237465 runs in the process. Find the expected number of runs a team would score in an inning.

*Hint:* The event of scoring a run in an inning can be characterized as scoring a run from a base/out state of 0/0.

Note: Since all innings begin with a base/out state of 0/0, the expected runs scored by a team in a single inning would also be the expected runs value for a base/out state of 0/0.

We can also figure out the run expectancy of any other base/out state. For example, if the leadoff batter reaches first base, the base/out state changes to 1B/0 outs. This increases the expected number of runs scored by the end of the inning relative to your answer above.

**Question 2a:** Calculate the number of runs a team can expect to score before the end of the inning, given that it finds itself in a base/out state of 1B/0 outs. Use the information below for your calculations:

- When a team scores a run in an inning, there is an 88% chance that it was in a base/out state of 1B/0 outs at some point during the inning.
- The probability that a team scores before the end of the inning from any base/out state is 65%.
- The probability that a team finds itself in a base/out state of 1B/0 outs at some point during any given inning is 72%.

**Question 2b:** Give an intuitive explanation (in words) for the difference between your answer to Question 1 and your answer to Question 2a.