#### How many wins would either of the two drivers need for you to conclude that one of them is better than the other, assuming a significance level of 5%? (Essentially, find the rejection regions or the confidence interval for our MV/LV experiment.)

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.stats.api as sms
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

%matplotlib inline

Let's simulate a sequence of 100 races in which a driver wins 50 times. The t-tests further down compare every other sequence against this one.

In [6]:
fifty_ones = [1] * 50
fifty_zeros = [0] * 50
one_hundred_races_with_50_wins = fifty_ones + fifty_zeros
np.random.shuffle(one_hundred_races_with_50_wins)
print(one_hundred_races_with_50_wins)

[0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]


Next, let's simulate a sequence of 100 races in which the same driver wins 55 times.

In [7]:
fiftyfive_ones = [1] * 55
fourtyfive_zeros = [0] * 45
one_hundred_races_with_55_wins = fiftyfive_ones + fourtyfive_zeros
np.random.shuffle(one_hundred_races_with_55_wins)
print(one_hundred_races_with_55_wins)

[0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]


Next, let's simulate a sequence of 100 races in which the same driver wins 60 times.

In [8]:
sixty_ones = [1] * 60
fourty_zeros = [0] * 40
one_hundred_races_with_60_wins = sixty_ones + fourty_zeros
np.random.shuffle(one_hundred_races_with_60_wins)
print(one_hundred_races_with_60_wins)

[0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]


Finally, let's simulate a sequence of 100 races in which the same driver wins 64 times.

In [9]:
sixtyfour_ones = [1] * 64
thirtysix_zeros = [0] * 36
one_hundred_races_with_64_wins = sixtyfour_ones + thirtysix_zeros
np.random.shuffle(one_hundred_races_with_64_wins)
print(one_hundred_races_with_64_wins)

[1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
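
The four cells above repeat the same pattern, so it can be wrapped in one helper. This is a minimal sketch: the function name `make_race_sequence` and its `seed` parameter are hypothetical and not part of the original notebook. It uses NumPy's `default_rng` so that the shuffle is reproducible when a seed is given.

```python
import numpy as np


def make_race_sequence(n_wins, n_races=100, seed=None):
    """Return a shuffled list of n_races outcomes containing exactly n_wins ones.

    Hypothetical helper: the name and the seed argument are assumptions
    made for illustration only.
    """
    if not 0 <= n_wins <= n_races:
        raise ValueError("n_wins must be between 0 and n_races")
    outcomes = np.array([1] * n_wins + [0] * (n_races - n_wins))
    rng = np.random.default_rng(seed)  # passing a seed fixes the shuffle order
    rng.shuffle(outcomes)              # in-place shuffle, win count unchanged
    return outcomes.tolist()


# Build the same four sequences used in this notebook.
sequences = {wins: make_race_sequence(wins, seed=0) for wins in (50, 55, 60, 64)}
for wins, seq in sequences.items():
    print(wins, sum(seq), len(seq))
```

Shuffling only changes the order of the outcomes, not their mean or variance, so it has no effect on the t-test results.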


We use `ttest_ind` from `scipy.stats`, a two-sample t-test, to check our hypothesis. The null hypothesis is that Lewis and Max are equally likely to win any given race, so we compare each simulated sequence against the 50-win sequence and look for the win count at which the p-value drops below the significance level of 0.05, i.e. 5%.

Looking at the t-test below, if we assume 55 wins for either Max or Lewis, the p-value comes out close to 50%, far above the 5% threshold, so we cannot reject the null hypothesis.

In [10]:
stats.ttest_ind(a=one_hundred_races_with_55_wins, b=one_hundred_races_with_50_wins, equal_var=True)

TtestResult(statistic=np.float64(0.7053278933842974), pvalue=np.float64(0.48143514316269365), df=np.float64(198.0))
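
To see where the statistic and p-value above come from, the pooled-variance (Student) t-test can be recomputed by hand. This is a sketch under the same `equal_var=True` assumption. The arrays are rebuilt unshuffled here, which is fine because order does not affect the mean or variance. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same win counts as in the notebook; shuffling is irrelevant for the test.
wins_55 = np.array([1] * 55 + [0] * 45)
wins_50 = np.array([1] * 50 + [0] * 50)

n_a, n_b = len(wins_55), len(wins_50)
df = n_a + n_b - 2  # degrees of freedom for the pooled test

# Difference in observed win rates (0.55 - 0.50).
mean_diff = wins_55.mean() - wins_50.mean()

# Pooled sample variance, using the unbiased (ddof=1) variance of each sample.
pooled_var = ((n_a - 1) * wins_55.var(ddof=1)
              + (n_b - 1) * wins_50.var(ddof=1)) / df

# Standard error of the difference in means.
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_stat = mean_diff / std_err
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value

print(t_stat, p_value)
```

The result should agree with the `TtestResult` above: a 5-point difference in win rate is small relative to a standard error of roughly 0.07.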

With 60 wins for either Lewis or Max, the p-value (about 0.16) moves closer to 0.05 but is still above it, so we still cannot reject the null hypothesis and we keep going.

In [11]:
stats.ttest_ind(a=one_hundred_races_with_60_wins, b=one_hundred_races_with_50_wins, equal_var=True)

TtestResult(statistic=np.float64(1.421410624438028), pvalue=np.float64(0.15677053340775413), df=np.float64(198.0))

With 64 wins, the p-value (about 0.046) finally falls below the 5% significance level, so we can reject the null hypothesis that the two drivers are equally good.

In [12]:
stats.ttest_ind(a=one_hundred_races_with_64_wins, b=one_hundred_races_with_50_wins, equal_var=True)

TtestResult(statistic=np.float64(2.0097597007986314), pvalue=np.float64(0.04581400584048895), df=np.float64(198.0))
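
Instead of trying win counts one at a time, the same test can be swept over every possible win count to find the first one that crosses the 5% level. This is a sketch that assumes the same setup as above (100 races, a 50-win comparison sequence, `equal_var=True`). The helper name `p_value_for` is hypothetical.

```python
from scipy import stats

N_RACES = 100
ALPHA = 0.05

# Comparison sequence: 50 wins out of 100, as in the notebook.
baseline = [1] * 50 + [0] * 50


def p_value_for(wins, n_races=N_RACES):
    """Two-sided p-value of the pooled t-test for `wins` wins vs. the 50-win baseline."""
    sample = [1] * wins + [0] * (n_races - wins)
    return stats.ttest_ind(a=sample, b=baseline, equal_var=True).pvalue


# Smallest win count above 50 whose p-value falls below ALPHA.
threshold = next(w for w in range(51, N_RACES + 1) if p_value_for(w) < ALPHA)

print(threshold, p_value_for(threshold))
```

Under these assumptions the sweep lands on 64, matching the cell above. Because the test is two-sided and the variance of a 36-win sequence equals that of a 64-win sequence, the mirrored count of 36 wins or fewer gives the same p-value in the other direction.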

In [14]:
import pingouin as pg

pg.ttest(x=one_hundred_races_with_64_wins, y=one_hundred_races_with_50_wins, paired=False)

              T  dof  alternative     p-val        CI95%   cohen-d   BF10     power
T-test  2.00976  198    two-sided  0.045814  [0.0, 0.28]  0.284223  1.008  0.516008
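
The `CI95%` column in the pingouin output is the 95% confidence interval for the difference in win rates (64-win sequence minus 50-win sequence). It can be reproduced by hand as a check. This is a sketch that assumes the pooled-variance t interval, which is consistent with the 198 degrees of freedom reported above. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

wins_64 = np.array([1] * 64 + [0] * 36)
wins_50 = np.array([1] * 50 + [0] * 50)

n_a, n_b = len(wins_64), len(wins_50)
df = n_a + n_b - 2

# Observed difference in win rates.
mean_diff = wins_64.mean() - wins_50.mean()

# Pooled variance and standard error of the difference.
pooled_var = ((n_a - 1) * wins_64.var(ddof=1)
              + (n_b - 1) * wins_50.var(ddof=1)) / df
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided 95% interval: difference +/- critical t value * standard error.
t_crit = stats.t.ppf(1 - 0.05 / 2, df)
ci_low = mean_diff - t_crit * std_err
ci_high = mean_diff + t_crit * std_err

print(round(ci_low, 2), round(ci_high, 2))
```

The lower bound is only slightly above zero, so it rounds to 0.0 in the table. An interval that excludes zero corresponds to a p-value below 0.05, which is why 64 wins sits right at the edge of the rejection region for this test.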
