In [1]:
import numpy as np
import scipy.stats as stats


In-game factoid: “63% of game one winners have gone on to win the World Series”

###  Monte Carlo

First, let's brute-force this with a Monte Carlo simulation. We build a function that simulates the outcome of a series, assuming that a team has already won the first game. It takes a win probability as its argument (which we presume to be 50% under our null hypothesis).

In [2]:
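def series_outcome(win_prob=0.5):
    # Simulate the 6 games remaining after the game one win.
    games_won = 0
    for i in range(6):
        random_float = np.random.rand()
        # The team wins a given game with probability win_prob.
        if random_float < win_prob:
            games_won = games_won + 1
    # Three more wins (four in total) take the series.
    outcome = 0
    if games_won >= 3:
        outcome = 1
    return outcome
series_outcome()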

0

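This next function calculates the fraction of series won in a sample of a given size. We have set the default to 119 since there have been 119 World Series. This includes all Series in the modern "2-3-2" format. Since the team in question has won 1 game already, they must win at least 3 of the remaining 6 contests to win the series. (A real series stops as soon as one team reaches four wins, but playing out all six remaining games in the simulation does not change who wins.)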

In [3]:
def series_sample(num_series=119):
    # Fraction of simulated series won by the game one winner.
    series_list = [series_outcome() for _ in range(num_series)]
    return np.mean(series_list)
series_sample()

0.5966386554621849

Finally, we run this experiment 50,000 times (`num_trials`) and average the resulting success fractions. This average is our estimate of the probability that the game one winner takes the series. The standard deviation of the fractions is the standard deviation of the mean (for our sample size of 119).

In [4]:
num_trials = 50000
trial_probs = []
for i in range(num_trials):
    trial_probs.append(series_sample())
trial_mean, trial_stdev = np.mean(trial_probs), np.std(trial_probs)
trial_mean, trial_stdev

(0.656246050420168, 0.04352605259547632)

Our brute-force calculation puts the expected share at about 65.6%, so at 63% game one winners have actually performed slightly *worse* than expected.
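The nested loops above can also be collapsed into a single array operation. The following is a minimal sketch, not the notebook's own method: the helper name `simulate_series_win_rates` and its `seed` argument are hypothetical, and it assumes NumPy's `default_rng` generator is available. It draws the number of wins in the six remaining games directly from a binomial distribution instead of simulating each game.

```python
import numpy as np

def simulate_series_win_rates(num_trials=50_000, num_series=119, win_prob=0.5, seed=0):
    """Return one simulated series-win fraction per trial (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Wins in the 6 remaining games, for every series in every trial.
    games_won = rng.binomial(n=6, p=win_prob, size=(num_trials, num_series))
    # The game one winner takes the series with at least 3 more wins.
    return (games_won >= 3).mean(axis=1)

win_rates = simulate_series_win_rates()
print(win_rates.mean(), win_rates.std())
```

With the default arguments the mean and standard deviation should land close to the values from the loop version, up to sampling noise.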

###  Theory

If the teams in question are equally matched, each team wins any given game with probability $p=0.5$. This will be our null hypothesis. Since one game has already been played, the rest of the series consists of $n=6$ independent contests, and we need the probability that the team wins $k=3$ or more of them. To get it, we subtract the probability of winning 0, 1, or 2 games from 1. Since $p=0.5$ the binomial distribution is symmetric, so this is equivalent to evaluating the cdf at 3 (the probability of winning 0 to 3 games). As a check, the two cdf values sum to 1:

In [5]:
stats.binom.cdf(2,  6,.5)+stats.binom.cdf(3,  6,.5)

1.0

So what is the theoretical probability that the first game winner wins the World Series? How does that compare to the measured (sampled) probability of 63% that was mentioned during the game?

In [6]:
theo_trial_mean = 1-stats.binom.cdf(2,  6,.5)
theo_trial_stdev = np.sqrt(theo_trial_mean *(1-theo_trial_mean))/np.sqrt(119)
z = np.abs((0.63-theo_trial_mean)/theo_trial_stdev)
theo_trial_mean, theo_trial_stdev, z

(0.65625, 0.04353940912620225, 0.6029020725548295)
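The value 0.65625 can be cross-checked by counting outcomes directly. This is a small sketch using only the standard library; the variable names are illustrative. With $p=0.5$ every one of the $2^6$ win/loss sequences is equally likely, so the probability is the number of sequences with at least 3 wins divided by 64.

```python
from fractions import Fraction
from math import comb

n_remaining = 6   # games left after game one
wins_needed = 3   # additional wins the game one winner needs

# P(X >= 3) for X ~ Binomial(6, 1/2): favorable sequences out of 2**6.
favorable = sum(comb(n_remaining, k) for k in range(wins_needed, n_remaining + 1))
p_series = Fraction(favorable, 2 ** n_remaining)
print(p_series, float(p_series))  # 21/32 0.65625
```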

Our takeaway is that first game winners actually *underperform* slightly (by $\approx 0.6 \sigma_\mathrm{m}$). Since we have the exact theoretical distribution and its parameters here, we use standard normal statistics to calculate the two-sided p-value: the probability, if the null hypothesis is true, of a sample fraction at least this far from the expected value:

In [7]:
from scipy.stats import norm
1-norm.cdf(z)+norm.cdf(-z)

0.5465738371408011
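The expression `1-norm.cdf(z)+norm.cdf(-z)` adds the upper and lower tail areas of the standard normal, which equals $2\,(1-\Phi(z)) = \mathrm{erfc}(z/\sqrt{2})$. As a sketch, the same numbers can be reproduced with the standard library alone; the variable names below are illustrative.

```python
from math import erfc, sqrt

p_null = 21 / 32   # exact series-win probability under the null (0.65625)
n_series = 119     # number of World Series in the sample
observed = 0.63    # fraction quoted in the in-game factoid

# Standard error of a sample fraction over n_series series.
std_err = sqrt(p_null * (1 - p_null) / n_series)
z = abs(observed - p_null) / std_err
# Two-sided tail area of the standard normal: 2 * (1 - Phi(z)).
p_value = erfc(z / sqrt(2))
print(z, p_value)
```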

So, 54.7% of the time, a fluctuation at least this large will occur from random sampling alone. The shortfall may be indicative of something causal (such as adjustments that players, coaches, and other team staff make after the game), but it is certainly nothing conclusive.
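The normal approximation above can be compared with an exact binomial calculation. The sketch below rests on an assumption that is not in the factoid: it takes the observed count to be 75 of 119 series (0.63 × 119 rounded), since only the percentage was quoted. It sums the probability of every outcome that is no more likely than the observed one, which is one common definition of a two-sided exact p-value.

```python
from math import comb

n_series = 119
p_null = 21 / 32   # exact series-win probability under the null
k_observed = 75    # ASSUMPTION: round(0.63 * 119); the factoid gives only a percentage

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

pmf_observed = binom_pmf(k_observed, n_series, p_null)
# Add up every outcome that is no more likely than the observed count.
p_value = sum(
    binom_pmf(k, n_series, p_null)
    for k in range(n_series + 1)
    if binom_pmf(k, n_series, p_null) <= pmf_observed * (1 + 1e-9)
)
print(p_value)
```

Under that assumed count, the exact p-value should be of the same order as the normal-approximation result, leading to the same conclusion.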