## Method 2: using external reference distribution + t test

Let's recall the case study. We have 10 data points for method A and 10 data points for method B.

The means are different: 84.24 for method A vs. 85.54 for method B.

The reference distribution has 210 data points with a mean of 84.12.

Question: are 84.24 and 85.54 statistically different based on the reference distribution?

To answer with method 2, we will do the following:
- define a new population based on the reference distribution, consisting of the differences between the averages of 2 consecutive sets of 10 samples
- this population will be approximately normally distributed because of the central limit theorem (each value is built from averages of 10 samples)
- its mean will be assumed to be known and equal to 0
- we can reasonably assume that the differences between averages are distributed approximately independently
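
The steps above can be sketched as a small helper. This is only an illustration: the function name `block_mean_differences` and its parameters `block_size` and `gap` are hypothetical and are not part of the notebook, and the demo input is a made-up sequence, not the yield data.

```python
import numpy as np

def block_mean_differences(values, block_size, gap=1):
    """Differences between the means of two consecutive blocks.

    Each pair uses 2 * block_size consecutive values; `gap` values
    are skipped between one pair and the next to reduce correlation.
    (Hypothetical helper, written for illustration only.)
    """
    values = np.asarray(values, dtype=float)
    stride = 2 * block_size + gap
    diffs = []
    # Only start a pair where two full blocks still fit in the data
    for start in range(0, len(values) - 2 * block_size + 1, stride):
        first = values[start:start + block_size]
        second = values[start + block_size:start + 2 * block_size]
        diffs.append(second.mean() - first.mean())
    return np.array(diffs)

# Made-up demo sequence 0, 1, ..., 20 (NOT the yield series)
demo = block_mean_differences(np.arange(21), block_size=5, gap=1)
print(demo)
```

With `block_size=10` and `gap=1` on a 210-point series, this yields 10 differences, using the same start positions as the index list built in the next cell.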

In [1]:
import pandas as pd
import numpy as np
import math
import scipy.stats as st
%config Completer.use_jedi = False
pd.set_option('display.max_rows', 500)
y_210 = pd.read_excel('yield 210.xlsx')
y_AB = pd.read_excel('yield 20.xlsx')

In [2]:
# Build the list of (1-based) start indices.
# Each index starts a pair of consecutive sets of 10 samples; we skip 1 sample
# between pairs to reduce correlation, so the stride is 21.
sa=[]
k=1
while (k + 20*(k-1)) < len(y_210):
    sa.append(k + 20*(k-1))
    k += 1

print('the list of index is {}'.format(sa))

the list of index is [1, 22, 43, 64, 85, 106, 127, 148, 169, 190]


In [3]:
# now we create the list of differences between the averages of 2 consecutive sets of 10 samples
d_sa1 = lambda x: y_210.iloc[x-1:x+9,0].mean(axis=0)
d_sa2 = lambda x: y_210.iloc[x+9:x+19,0].mean(axis=0)
y1, y2 =[],[]
for i in sa:
    y1.append(d_sa1(i))
    y2.append(d_sa2(i))
y_m2 = pd.DataFrame({'y1': y1,
                     'y2': y2,})
y_m2['diff']=y_m2['y2']-y_m2['y1']
y_m2

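| | y1 | y2 | diff |
|---|---|---|---|
| 0 | 83.94 | 83.51 | -0.43 |
| 1 | 83.99 | 84.42 | 0.43 |
| 2 | 84.19 | 84.01 | -0.18 |
| 3 | 85.18 | 84.28 | -0.90 |
| 4 | 83.58 | 84.38 | 0.80 |
| 5 | 84.42 | 83.99 | -0.43 |
| 6 | 84.72 | 84.21 | -0.51 |
| 7 | 84.78 | 83.96 | -0.82 |
| 8 | 84.09 | 84.58 | 0.49 |
| 9 | 83.62 | 84.26 | 0.64 |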


Let's understand what we just did. We started from a set of 210 samples and built a sample of 10 values drawn from the population of differences between 2 consecutive averages.

We know this new population:
- is approximately normally distributed thanks to the central limit theorem
- has a known mean, which is 0
- has unknown variance

This sounds familiar: we must define the t statistic using the sample variance. But with how many degrees of freedom? 10, not 9: since the population mean is known, no degree of freedom is spent estimating it from the sample.
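
Written out, the quantities computed in the next cell are, as a sketch, the following. The symbols are my own notation, not the notebook's: $d_i$ are the $n = 10$ differences, $s$ corresponds to `s_dot`, and $\bar{y}_A$, $\bar{y}_B$ are the observed means of methods A and B.

$$
s^2 = \frac{1}{n}\sum_{i=1}^{n}(d_i - 0)^2, \qquad t = \frac{(\bar{y}_B - \bar{y}_A) - 0}{s}
$$

Under the null hypothesis, $t$ is referred to a t distribution with $n$ degrees of freedom. The divisor is $n$ rather than $n - 1$ because deviations are taken from the known mean 0, not from the sample mean.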

In [4]:
# compute the sample standard deviation around the known mean 0 (divisor 10, not 9)
s_dot = math.sqrt((1/10)*((y_m2['diff']-0)**2).sum(axis=0))
print('sample standard deviation is {:.02f}'.format(s_dot))

# compute t statistic
# 1.3 is the observed difference between the means: 85.54 - 84.24
t = (1.3-0)/s_dot
p_value = 1 - st.t.cdf(t,df=10,loc=0,scale=1)
print('p value for p(t > {:.02f}) is {:.03f}'.format(t,p_value))

sample standard deviation is 0.60
p value for p(t > 2.16) is 0.028


The p value of 0.028 is of the same order as the 0.047 found with method 1. At the 5% level it allows us to reject the null hypothesis: a difference of 1.3 or more would be unlikely if the 2 means were equal.

Thus the test supports the conclusion that method B is better than method A.
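
As a cross-check, the same numbers can be recomputed from the differences shown in the table above. This is a minimal sketch that assumes the two-decimal values in the table are accurate enough; it uses `st.t.sf`, the survival function, which is equivalent to `1 - st.t.cdf` for the upper tail.

```python
import math
import scipy.stats as st

# Differences copied from the table above (rounded to two decimals)
diffs = [-0.43, 0.43, -0.18, -0.90, 0.80, -0.43, -0.51, -0.82, 0.49, 0.64]
n = len(diffs)

# Standard deviation around the known mean 0, with n degrees of freedom
s0 = math.sqrt(sum(d ** 2 for d in diffs) / n)

# Observed difference between the two method means
observed = 85.54 - 84.24
t_stat = observed / s0

# One-sided p value P(T > t_stat) for a t distribution with n degrees of freedom
p_value = st.t.sf(t_stat, df=n)
print('s0 = {:.2f}, t = {:.2f}, p = {:.3f}'.format(s0, t_stat, p_value))
```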

### Application exercise

Just to make sure we get it, it's good to do a quick additional application exercise from the book

- Six temperature readings (in °F) taken on a patient at 5-min intervals, before and after taking a drug:
- note: temperatures are recorded as 10(T-98.0)
- Before: 4, 3, 7
- After: 10, 6, 8
- reference data: 5,5,9,7,3,4,5,8,9,12,14,8,9,11,14,9,10,10,6,5,4,2,3,3,3,8,2,3,4,6,5,3,2,4,6,4

Using the reference data in five sets of six, leaving a gap of one between sets, determine the significance level for the null hypothesis $\eta_{B} = \eta_{A}$ when the alternative is $\eta_{B} > \eta_{A}$.

In [65]:
d_before = [4, 3, 7]
d_after = [10, 6, 8]
d_ref = [5,5,9,7,3,4,5,8,9,12,14,8,9,11,14,9,10,10,6,5,4,2,3,3,3,8,2,3,4,6,5,3,2,4,6,4]
dn_ref = np.array(d_ref)

# now we build the new sample consisting of differences between the means of 2 consecutive sets of 3 data points

# compute the index values
p_ind = []
k = 1
while k + 6*(k-1) < len(d_ref):
    p_ind.append(k + 6*(k-1))
    k += 1
print('list of index is{}'.format(p_ind))
    
# creates lambdas to compute the means
m1 = lambda x: dn_ref[x-1:x+2].mean()
m2 = lambda x: dn_ref[x+2:x+5].mean()
d_diff = pd.DataFrame({'diff': [m2(k)-m1(k) for k in p_ind]})
print('new sample of differences between 2 consecutive samples of size 3 is {}'.format(d_diff))

# now that we have the new sample. We assume it comes from normal population of mean 0 but unknown variance
s_dot = math.sqrt((1/5)*((d_diff['diff']-0)**2).sum(axis=0))
print('s_dot standard deviation with 5 degrees of freedom (since we know the mean) is {:.02f}'.format(s_dot))

# now we are statistically equipped to interpret the temperature difference observed before and after the drug
t = (1/3)*(sum(d_after)-sum(d_before))/s_dot
p_value = 1 - st.t.cdf(t,df=5,loc=0,scale=1)
print('t statistic is {:0.3f} and p value is {:0.03f}'.format(t,p_value))

list of index is[1, 8, 15, 22, 29]
new sample of differences between 2 consecutive samples of size 3 is        diff
0 -1.666667
1  0.666667
2 -4.000000
3  1.666667
4 -2.000000
s_dot standard deviation with 5 degrees of freedom (since we know the mean) is 2.28
t statistic is 1.462 and p value is 0.102


With a p value of 0.102, greater than 0.05, we do not have sufficient evidence to reject the null hypothesis at the 5% level.
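
Another way to read the same result is to compare the t statistic with the critical value of the t distribution. The sketch below assumes a one-sided 5% level and takes the t statistic from the output above. The variable names are mine.

```python
import scipy.stats as st

t_stat = 1.462   # t statistic reported in the output above
df = 5           # degrees of freedom (mean of the differences is known)
alpha = 0.05     # assumed one-sided significance level

# Critical value: the point with 5% of the distribution above it
t_crit = st.t.ppf(1 - alpha, df)
print('critical value is {:.3f}'.format(t_crit))
print('reject null hypothesis: {}'.format(t_stat > t_crit))
```

Since the observed t statistic falls below the critical value, this agrees with the p value reading: the null hypothesis is not rejected.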

Notice how I used numpy arrays more than pandas dataframes here, as I am trying to use the right tool for the job. There is no need for a big dataframe when we just have a few integers to manipulate. I created `d_diff` as a pandas dataframe because it displays much better in the Jupyter notebook.