In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Scenario 1

$H_0: \mu_s = \mu_i = \mu_l$  
$H_A: \text{at least one brand has a significantly different mean price}$

In [43]:
x_s = np.array([15, 12, 14, 11]) # snapzi clothing prices
x_i = np.array([39, 45, 48, 60]) # irisa clothing prices
x_l = np.array([65, 45, 32, 38]) # lolamoon clothing prices
num_groups = 3
num_observations = len(x_s) + len(x_i) + len(x_l)

In [44]:
print('x_s mean: ', x_s.mean())
print('x_i mean: ', x_i.mean())
print('x_l mean: ', x_l.mean())

# Grand mean over all observations; concatenating also works for unequal group sizes
x_g_mean = np.concatenate([x_s, x_i, x_l]).mean()
print('x_g mean: ', x_g_mean)

x_s mean:  13.0
x_i mean:  48.0
x_l mean:  45.0
x_g mean:  35.333333333333336


In [45]:
# Compute sum of squares between groups (samples)
ss_between = 0
for x in [x_s, x_i, x_l]:
    ss_between += len(x)*np.sum((x.mean() - x_g_mean)**2)
print('SS_between: ', ss_between)

SS_between:  3010.666666666667


In [46]:
# Compute sum of squares within groups (samples)
ss_within = 0
for x in [x_s, x_i, x_l]:
    ss_within += np.sum((x - x.mean())**2)
print('SS_within: ', ss_within)

SS_within:  862.0


In [47]:
df_between = num_groups - 1
print('df_between: ', df_between)

df_within = num_observations - num_groups
print('df_within: ', df_within)

df_between:  2
df_within:  9


In [48]:
ms_between = ss_between / df_between
print('MS_between: ', ms_between)

ms_within = ss_within / df_within
print('MS_within: ', ms_within)

MS_between:  1505.3333333333335
MS_within:  95.77777777777777


In [49]:
f = ms_between / ms_within
print('F: ', f)

F:  15.716937354988401
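
For reference, the cells above compute the following quantities. This is a summary sketch in which $k$ is the number of groups, $n_i$ the size of group $i$, $\bar{x}_i$ its mean, $\bar{x}_G$ the grand mean, and $N$ the total number of observations (the symbols $n_i$, $\bar{x}_i$, and $N$ are labels chosen here to match the code variables):

$$\text{SS}_{\text{between}} = \sum\limits_{i=1}^k n_i(\bar{x}_i-\bar{x}_G)^2, \qquad \text{SS}_{\text{within}} = \sum\limits_{i=1}^k\sum\limits_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2$$

$$F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\text{SS}_{\text{between}}/(k-1)}{\text{SS}_{\text{within}}/(N-k)} = \frac{3010.67/2}{862/9} \approx \frac{1505.33}{95.78} \approx 15.72$$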


Using the [F Table](http://socr.ucla.edu/Applets.dir/F_Table.html), find the critical value $F^*$ for $\alpha=0.05$ with $df_{\text{between}}=2$ and $df_{\text{within}}=9$.

$F^*=4.2565$

### Question
Do we reject or fail to reject $H_0$?  
Answer: Reject $H_0$, since $F \approx 15.72 > F^* = 4.2565$
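
The critical value can also be computed instead of read from a table. The sketch below assumes SciPy is installed (the notebook itself only imports NumPy, pandas, and matplotlib); it uses `scipy.stats.f.ppf` for $F^*$ and `scipy.stats.f_oneway` as a cross-check of the hand-computed $F$.

```python
import numpy as np
from scipy import stats  # assumption: SciPy is available in this environment

x_s = np.array([15, 12, 14, 11])  # snapzi clothing prices
x_i = np.array([39, 45, 48, 60])  # irisa clothing prices
x_l = np.array([65, 45, 32, 38])  # lolamoon clothing prices

alpha = 0.05
df_between = 3 - 1   # number of groups - 1
df_within = 12 - 3   # number of observations - number of groups

# Critical value: the (1 - alpha) quantile of the F distribution
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

# One-way ANOVA computed by SciPy, for comparison with the manual result
f_stat, p_value = stats.f_oneway(x_s, x_i, x_l)

print('F*: ', f_crit)
print('F: ', f_stat)
print('reject H0: ', f_stat > f_crit)
```

Comparing `f_stat` with `f_crit` is equivalent to comparing `p_value` with `alpha`.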

### Scenario 2

A researcher is attempting to determine what type of food cows prefer by measuring the amount of food consumed (lbs) over 8 hours.

| Food A | Food B | Food C |
| --- | --- | --- |
| 2 | 6 | 8 |
| 4 | 5 | 9 |
| 3 | 7 | 10 |

##### Question 1
What is the independent variable?  
Answer: Type of food

##### Question 2
What is the dependent variable?  
Answer: Amount of food eaten (lbs)

##### Question 3
What is the null hypothesis?  
Answer: $H_0: \text{cows will eat similar amounts of each food type}$

##### Question 4
What is the alternative hypothesis?  
Answer: $H_A: \text{cows will eat more or less of at least one food type}$

##### Question 5
Compute ANOVA.

In [50]:
a = np.array([2, 4, 3])
b = np.array([6, 5, 7])
c = np.array([8, 9, 10])
num_groups = 3
num_observations = len(a) + len(b) + len(c)

# Grand mean over all observations; concatenating also works for unequal group sizes
x_g_mean = np.concatenate([a, b, c]).mean()
print('a.mean(): ', a.mean())
print('b.mean(): ', b.mean())
print('c.mean(): ', c.mean())
print('x_g_mean: ', x_g_mean)

ss_between = 0
for x in [a, b, c]:
    ss_between += len(x)*np.sum((x.mean() - x_g_mean)**2)
print('SS_between: ', ss_between)

ss_within = 0
for x in [a, b, c]:
    ss_within += np.sum((x - x.mean())**2)
print('SS_within: ', ss_within)

df_between = num_groups - 1
print('df_between: ', df_between)
df_within = num_observations - num_groups
print('df_within: ', df_within)

ms_between = ss_between / df_between
print('MS_between: ', ms_between)
ms_within = ss_within / df_within
print('MS_within: ', ms_within)

f = ms_between / ms_within
print('F: ', f)

a.mean():  3.0
b.mean():  6.0
c.mean():  9.0
x_g_mean:  6.0
SS_between:  54.0
SS_within:  6.0
df_between:  2
df_within:  6
MS_between:  27.0
MS_within:  1.0
F:  27.0


##### Question 6
Do we reject or fail to reject the null hypothesis for $\alpha=0.05$?  
Answer: $F^*=5.1433$, so because $F = 27 > F^*$ we reject the null hypothesis. This suggests that there is a statistically significant difference between the food options.

##### Question 7
What is the sum of squared differences of each value from the grand mean?  
Answer:

In [51]:
diff = 0
for x in [a, b, c]:
    diff += np.sum((x - x_g_mean)**2)
print('np.sum((x_i - x_g_mean)**2):', diff)

np.sum((x_i - x_g_mean)**2): 60.0


Note that $\sum\limits_{i=1}^k\sum\limits_{j=1}^{n_i}(x_{ij}-\bar{x}_G)^2 = \text{SS}_{\text{between}} + \text{SS}_{\text{within}} = \text{SS}_{\text{total}}$
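
The identity in this note can be checked numerically. The following is a minimal sketch that recomputes the three sums of squares for the cow data; the variable names `groups`, `all_values`, and `grand_mean` are introduced here for illustration.

```python
import numpy as np

a = np.array([2, 4, 3])
b = np.array([6, 5, 7])
c = np.array([8, 9, 10])
groups = [a, b, c]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total: squared deviation of every observation from the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)
# Between: squared deviation of each group mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within: squared deviation of every observation from its own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print('SS_total: ', ss_total)
print('SS_between + SS_within: ', ss_between + ss_within)
```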

##### Question 8
What can we conclude?  
Answer: At least two food options significantly differ from each other in terms of the amount eaten by the cows

##### Question 9
Compute Tukey's HSD for $\alpha=0.05$.

Note: $k=3$, $df_{\text{within}}=N-k=6$  
Using these, we find $q^*=4.34$

In [52]:
q_star = 4.34
tukeys_hsd = q_star * np.sqrt(ms_within/(len(a)))
print('tukeys_hsd: ', tukeys_hsd)

tukeys_hsd:  2.5057001682829756


##### Question 10
Which means are honestly significantly different?  
Answer: A and B ($|\bar{x}_A-\bar{x}_B| = 3 \gt 2.506$), B and C ($|\bar{x}_B-\bar{x}_C| = 3 \gt 2.506$), and A and C ($|\bar{x}_A-\bar{x}_C| = 6 \gt 2.506$)
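
The value of $q^*$ and the pairwise comparisons can also be reproduced in code. This sketch assumes SciPy 1.7 or later, which provides `scipy.stats.studentized_range`; the dictionary `groups` and the name `significant` are illustrative. The formula used here relies on all groups having the same size.

```python
from itertools import combinations

import numpy as np
from scipy.stats import studentized_range  # assumption: SciPy >= 1.7

groups = {
    'A': np.array([2, 4, 3]),
    'B': np.array([6, 5, 7]),
    'C': np.array([8, 9, 10]),
}
k = len(groups)
n_per_group = 3                      # equal group sizes
df_within = k * n_per_group - k      # N - k
ms_within = sum(np.sum((g - g.mean()) ** 2) for g in groups.values()) / df_within

# Critical value of the studentized range distribution for alpha = 0.05
q_star = studentized_range.ppf(0.95, k, df_within)
hsd = q_star * np.sqrt(ms_within / n_per_group)
print('q*: ', q_star)
print('HSD: ', hsd)

# A pair of means is honestly significantly different if |difference| > HSD
significant = {}
for (name_1, g_1), (name_2, g_2) in combinations(groups.items(), 2):
    diff = abs(g_1.mean() - g_2.mean())
    significant[(name_1, name_2)] = diff > hsd
    print(name_1, name_2, diff, diff > hsd)
```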

##### Question 11
Compute Cohen's d for multiple comparisons for each difference in means.

In [53]:
print('Cohen\'s d (A-B): ', (a.mean()-b.mean())/np.sqrt(ms_within))
print('Cohen\'s d (A-C): ', (a.mean()-c.mean())/np.sqrt(ms_within))
print('Cohen\'s d (B-C): ', (b.mean()-c.mean())/np.sqrt(ms_within))

Cohen's d (A-B):  -3.0
Cohen's d (A-C):  -6.0
Cohen's d (B-C):  -3.0


##### Question 12
Compute $\eta^2$.

In [54]:
print('eta squared: ', ss_between/(ss_between+ss_within))

eta squared:  0.9


### Scenario 3

Researchers conduct a study to reduce tumor sizes in breast cancer. They provided a placebo and three different drugs to 20 people and measured the reduction of tumor diameter in centimeters.

| Placebo | Drug 1 | Drug 2 | Drug 3 |
| --- | --- | --- | --- |
| 1.5 | 1.6 | 2.0 | 2.9 |
| 1.3 | 1.7 | 1.4 | 3.1 |
| 1.8 | 1.9 | 1.5 | 2.8 |
| 1.6 | 1.2 | 1.5 | 2.7 |
| 1.3 | | 1.8 | |
| | | 1.7 | |
| | | 1.4 | |

$$H_0: \mu_p = \mu_1 = \mu_2 = \mu_3$$
$$H_A: \text{at least one of the means is significantly different from one of the others}$$

##### Question 1
Compute the F-statistic for this study and determine whether there is a statistically significant result for $\alpha=0.05$.

$F^*=3.2389$

In [58]:
x_p = np.array([1.5, 1.3, 1.8, 1.6, 1.3])
x_1 = np.array([1.6, 1.7, 1.9, 1.2])
x_2 = np.array([2.0, 1.4, 1.5, 1.5, 1.8, 1.7, 1.4])
x_3 = np.array([2.9, 3.1, 2.8, 2.7])
num_groups = 4
num_observations = len(x_p) + len(x_1) + len(x_2) + len(x_3)

x_g_mean = (np.sum(x_p) + np.sum(x_1) + np.sum(x_2) + np.sum(x_3)) / num_observations

print('x_p.mean(): ', x_p.mean())
print('x_1.mean(): ', x_1.mean())
print('x_2.mean(): ', x_2.mean())
print('x_3.mean(): ', x_3.mean())
print('x_g_mean: ', x_g_mean)

ss_between = 0
for x in [x_p, x_1, x_2, x_3]:
    ss_between += len(x)*np.sum((x.mean() - x_g_mean)**2)
print('SS_between: ', ss_between)

ss_within = 0
for x in [x_p, x_1, x_2, x_3]:
    ss_within += np.sum((x - x.mean())**2)
print('SS_within: ', ss_within)

df_between = num_groups - 1
print('df_between: ', df_between)
df_within = num_observations - num_groups
print('df_within: ', df_within)

ms_between = ss_between / df_between
print('MS_between: ', ms_between)
ms_within = ss_within / df_within
print('MS_within: ', ms_within)

f = ms_between / ms_within
print('F: ', f)

x_p.mean():  1.4999999999999998
x_1.mean():  1.5999999999999999
x_2.mean():  1.6142857142857143
x_3.mean():  2.875
x_g_mean:  1.8350000000000002
SS_between:  5.449428571428573
SS_within:  0.8360714285714287
df_between:  3
df_within:  16
MS_between:  1.816476190476191
MS_within:  0.05225446428571429
F:  34.7621244482415


Thus, we reject $H_0$ as $F > F^*$.
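
Since the same steps are repeated in every scenario, they could be wrapped in a function. The helper below, `one_way_anova`, is a hypothetical name introduced for this sketch and is not part of NumPy; it follows the same computation as the cell above and handles unequal group sizes.

```python
import numpy as np

def one_way_anova(*groups):
    """Hypothetical helper: one-way ANOVA summary for any number of groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    num_groups = len(groups)
    num_observations = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    # Sums of squares between and within groups
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

    df_between = num_groups - 1
    df_within = num_observations - num_groups
    f = (ss_between / df_between) / (ss_within / df_within)
    return {
        'ss_between': ss_between, 'ss_within': ss_within,
        'df_between': df_between, 'df_within': df_within, 'F': f,
    }

x_p = [1.5, 1.3, 1.8, 1.6, 1.3]
x_1 = [1.6, 1.7, 1.9, 1.2]
x_2 = [2.0, 1.4, 1.5, 1.5, 1.8, 1.7, 1.4]
x_3 = [2.9, 3.1, 2.8, 2.7]

result = one_way_anova(x_p, x_1, x_2, x_3)
print(result)
```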

##### Question 2
What proportion of the variance in tumor reduction is due to the type of drug?  

In [59]:
print('ss_between/ss_total: ', ss_between/(ss_between + ss_within))

ss_between/ss_total:  0.8669841017307408


### Scenario 4

Desert ants have a nifty way of finding their way back home after a foray out of the nest to find food -- they count their steps. To prove it, some scientists devised a creative experiment that showed just how the little guys do it.

[Can Ants Count?](http://www.youtube.com/embed/7DDF8WZFnoU)

Let's work with some fictitious data that is consistent with the study described in the video. The table below lists how close each ant got to its nest (distance from the nest in centimeters) with short, long, or normal legs.

**Leg Length**

| Short | Long | Normal |
| --- | --- | --- |
| -8 | 12 | 0.5 |
| -11 | 9 | 0.0 |
| -17 | 16 | -1.0 |
| -9 | 8 | 1.5 |
| -10 | 15 | 0.5 |
| -5 | | -0.1 |
| | | 0.0 |

Use $\alpha=0.05$.

$F^*=3.6823$ ([F Table](http://socr.ucla.edu/Applets.dir/F_Table.html))

In [69]:
short = np.array([-8, -11, -17, -9, -10, -5])
long = np.array([12, 9, 16, 8, 15])
normal = np.array([.5, 0, -1, 1.5, .5, -.1, 0])
num_groups = 3
num_observations = len(short) + len(long) + len(normal)

x_g_mean = np.sum(np.concatenate([short, long, normal]))/num_observations
print('x_g_mean: ', x_g_mean)

ss_between = 0
for x in [short, long, normal]:
    ss_between += len(x)*np.sum((x.mean() - x_g_mean)**2)
print('SS_between: ', ss_between)

ss_within = 0
for x in [short, long, normal]:
    ss_within += np.sum((x - x.mean())**2)
print('SS_within: ', ss_within)

df_between = num_groups - 1
print('df_between: ', df_between)
df_within = num_observations - num_groups
print('df_within: ', df_within)

ms_between = ss_between / df_between
print('MS_between: ', ms_between)
ms_within = ss_within / df_within
print('MS_within: ', ms_within)

f = ms_between / ms_within
print('F: ', f)


x_g_mean:  0.07777777777777778
SS_between:  1320.171111111111
SS_within:  133.48
df_between:  2
df_within:  15
MS_between:  660.0855555555555
MS_within:  8.898666666666665
F:  74.1780291679153


Thus, we reject $H_0$ as $F > F^*$. Leg length has a significant impact on the ability of ants to navigate to their nests.
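
The same decision can be expressed with a p-value instead of a table lookup. This sketch assumes SciPy is installed and uses `scipy.stats.f.sf` (the survival function, $1 - \text{CDF}$) on the hand-computed statistic; the `_ants` suffixes are only there to avoid overwriting the notebook's variables.

```python
import numpy as np
from scipy import stats  # assumption: SciPy is available in this environment

short = np.array([-8, -11, -17, -9, -10, -5])
long = np.array([12, 9, 16, 8, 15])
normal = np.array([.5, 0, -1, 1.5, .5, -.1, 0])
groups_ants = [short, long, normal]

n_ants = sum(len(g) for g in groups_ants)
grand_mean_ants = np.concatenate(groups_ants).mean()
df_between_ants = len(groups_ants) - 1
df_within_ants = n_ants - len(groups_ants)

ms_between_ants = sum(len(g) * (g.mean() - grand_mean_ants) ** 2
                      for g in groups_ants) / df_between_ants
ms_within_ants = sum(np.sum((g - g.mean()) ** 2)
                     for g in groups_ants) / df_within_ants
f_ants = ms_between_ants / ms_within_ants

# Probability of an F at least this large if H0 were true
p_ants = stats.f.sf(f_ants, df_between_ants, df_within_ants)
# Critical value for alpha = 0.05, for comparison with the table
f_crit_ants = stats.f.ppf(0.95, df_between_ants, df_within_ants)

print('F: ', f_ants)
print('F*: ', f_crit_ants)
print('p < 0.05: ', p_ants < 0.05)
```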

In [70]:
eta_squared = ss_between/(ss_between + ss_within)
print('eta_squared: ', eta_squared)

eta_squared:  0.908176041018554


From the $\eta^2$ value, we can see that $\approx 91\%$ of the total variance is due to between-group variability (explained variance).