In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Scenario 1

You and your friends want to go out to eat, but you don't want to pay a lot. You decide to go to either Gettysburg or Wilma. You look online and find the average meal prices at 18 restaurants in Gettysburg and 14 restaurants in Wilma.

$$H_0: \mu_G = \mu_W$$
$$H_A: \mu_G \neq \mu_W$$

##### Question 1
What do we need to know to compare these samples?
Answer: Sample averages, sample standard deviations, sample sizes

##### Question 2
What is the average meal price for each sample of restaurants?

In [3]:
df = pd.read_csv('../../Data/Food-Prices-Lesson-11.csv')
g_mean = df['Gettysburg'].mean()
w_mean = df['Wilma'].mean()
print('Gettysburg Mean: ', g_mean)
print('Wilma Mean: ', w_mean)

Gettysburg Mean:  8.944444444444445
Wilma Mean:  11.142857142857142


##### Question 3
What is the sample standard deviation for each sample of restaurants?

In [4]:
# ddof=1 for sample standard deviation
g_std = df['Gettysburg'].std(ddof=1)
w_std = df['Wilma'].std(ddof=1)
print('Gettysburg SD: ', g_std)
print('Wilma SD: ', w_std)

Gettysburg SD:  2.6451336499586917
Wilma SD:  2.1788191176076888


##### Question 4
Calculate the standard error:
$$\text{SE}_{\bar{x}_G-\bar{x}_W} = \sqrt{\frac{s_G^2}{n_G}+\frac{s_W^2}{n_W}}$$

In [5]:
# count() ignores missing values, so it gives the true sample size of each column
se = np.sqrt((g_std**2)/df['Gettysburg'].count() + (w_std**2)/df['Wilma'].count())
print('Standard Error: ', se)

Standard Error:  0.8531100847677227
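The sample sizes in the denominators matter because the two columns hold different numbers of observations. The following is a minimal sketch of how `len()` and `.count()` can disagree, assuming the shorter column is padded with `NaN` when loaded (typical for ragged columns in pandas, but not confirmed for this CSV). The values are the Drug A and Drug B figures from Scenario 2 later in this notebook, and the names `demo`, `n_rows`, `n_obs`, and `se_demo` are illustrative.

```python
import numpy as np
import pandas as pd

# Drug A / Drug B values from Scenario 2; Drug B has one fewer observation,
# so its column is padded with NaN (assumed to mirror how a ragged CSV loads).
demo = pd.DataFrame({
    'A': [.4, .36, .2, .32, .45, .28],
    'B': [.41, .39, .18, .23, .35, np.nan],
})

n_rows = len(demo['B'])      # counts every row, including the NaN
n_obs = demo['B'].count()    # counts only non-missing values

# Series.var() skips NaN; ddof=1 gives the sample variance
se_demo = np.sqrt(demo['A'].var(ddof=1)/demo['A'].count()
                  + demo['B'].var(ddof=1)/demo['B'].count())
print('len:', n_rows, 'count:', n_obs, 'SE:', se_demo)
```

Dividing by `len()` for the padded column would understate its contribution to the standard error, so `.count()` is the safer choice for both columns.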


##### Question 5
Which formula is correct for calculating the t-statistic?  
Answer: $$\frac{\bar{x}_G-\bar{x}_W}{\text{SE}}\quad\text{OR}\quad \frac{\bar{x}_W-\bar{x}_G}{\text{SE}}$$

Either form works because this is a two-tailed test; the choice only changes the sign of $t$.

##### Question 6
Calculate the t-statistic.

In [6]:
print('t-statistic (g-w): ', (g_mean - w_mean)/se)
print('t-statistic (w-g): ', (w_mean - g_mean)/se)

t-statistic (g-w):  -2.5769390582356815
t-statistic (w-g):  2.5769390582356815


##### Question 7
What are the $t^*$ values? Use a t-table for a two-tailed test with $\alpha=0.05$.  
Answer: $\pm 2.042$ as $df=n_G+n_W-2=30$
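As an alternative to a printed t-table, the critical value can be looked up numerically. This is a sketch that assumes SciPy is installed (the notebook does not import it); `critical_t` is a hypothetical helper name. It uses `scipy.stats.t.ppf`, the inverse CDF of the t-distribution, and is evaluated at the degrees of freedom used in this notebook's scenarios.

```python
from scipy import stats

def critical_t(alpha, dof):
    """Two-tailed critical value t* for significance level alpha."""
    # Half of alpha sits in each tail, so look up the 1 - alpha/2 quantile
    return stats.t.ppf(1 - alpha / 2, dof)

# Degrees of freedom used in Scenarios 1, 2 and 3
for dof in (30, 9, 16):
    print('df =', dof, ' t* =', round(critical_t(0.05, dof), 3))
```

The rounded values should agree with the table entries quoted in the answers.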

##### Question 8
Do we *reject* or *fail to reject* $H_0$?  
Answer: Reject as $\lvert t\rvert \gt \lvert t^*\rvert$

### Scenario 2

Imagine two dermatologists are evaluating two acne drugs on their ability to reduce acne. The acne reduction percentages for Drug A and Drug B are presented below.

| Drug A | Drug B |
| ------ | ------ |
| 40% | 41% |
| 36% | 39% |
| 20% | 18% |
| 32% | 23% |
| 45% | 35% |
| 28% | |

$$\bar{x}_A=33.5\% \qquad \bar{x}_B=31.2\%$$
$$s_A=8.89\% \qquad s_B=10.16\%$$
$$H_0: \mu_A = \mu_B$$
$$H_A: \mu_A \neq \mu_B$$

##### Question 1
Calculate the t-statistic.
$$t = \frac{\bar{x}_A-\bar{x}_B}{\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}}}$$

In [7]:
dA = np.array([.4, .36, .2, .32, .45, .28])
dB = np.array([.41, .39, .18, .23, .35])

meanA = dA.mean()
meanB = dB.mean()
print('Mean A: ', meanA)
print('Mean B: ', meanB)

stdA = dA.std(ddof=1) # ddof=1 for sample standard deviation, n-1 degrees of freedom
stdB = dB.std(ddof=1) # ddof=1 for sample standard deviation, n-1 degrees of freedom
print('Standard Deviation A: ', stdA)
print('Standard Deviation B: ', stdB)

se = np.sqrt((stdA**2)/len(dA) + (stdB**2)/len(dB))
print('Standard Error: ', se)

t = (meanA - meanB)/se
print('t-statistic (A-B): ', t)

Mean A:  0.33499999999999996
Mean B:  0.312
Standard Deviation A:  0.08893818077743663
Standard Deviation B:  0.10158740079360236
Standard Error:  0.05815783122962318
t-statistic (A-B):  0.39547554497329196


##### Question 2
What are the $t^*$ values for a two-tailed test with a significance level of $5$% ($\alpha=0.05$, $95$% CI)?  
Answer: $\pm 2.262$  
(Remember: $df = n_A + n_B - 2$)

##### Question 3
Do we *reject* or *fail to reject* $H_0$?  
Answer: We fail to reject as $\lvert t\rvert \lt \lvert t^*\rvert$
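The hand computation can be cross-checked against a library routine. The sketch below assumes SciPy is available (it is not imported in this notebook) and uses `scipy.stats.ttest_ind` with `equal_var=False`, which is Welch's test. Welch's test uses the same unpooled standard error, so the t-statistic should match. Its p-value is based on the Welch degrees of freedom rather than $n_A+n_B-2$, so it is only an approximate check on the decision.

```python
import numpy as np
from scipy import stats

dA = np.array([.4, .36, .2, .32, .45, .28])
dB = np.array([.41, .39, .18, .23, .35])

# Manual statistic, same formula as in Question 1
se_check = np.sqrt(dA.var(ddof=1)/len(dA) + dB.var(ddof=1)/len(dB))
t_manual = (dA.mean() - dB.mean())/se_check

# Welch's two-sample t-test (unpooled variances)
t_scipy, p_scipy = stats.ttest_ind(dA, dB, equal_var=False)
print('manual t:', t_manual)
print('scipy t: ', t_scipy, ' p-value:', p_scipy)
```

A p-value above $0.05$ is consistent with failing to reject $H_0$.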

### Scenario 3
Who has more shoes - males or females?
$$H_0: \mu_F = \mu_M \quad (\mu_F - \mu_M = 0)$$
$$H_A: \mu_F \neq \mu_M \quad (\mu_F - \mu_M \neq 0)$$

In [9]:
df = pd.read_csv('../../Data/Shoes-Lesson 11.csv')
df.head()

Unnamed: 0,females,males
0,90.0,4
1,28.0,120
2,30.0,5
3,10.0,3
4,5.0,10


##### Question 1
Compute the mean and standard deviation for the number of shoes owned by males and females.

In [18]:
meanM = df['males'].mean()
meanF = df['females'].mean()
print('male mean: ', meanM)
print('female mean: ', meanF)

stdM = df['males'].std(ddof=1)
stdF = df['females'].std(ddof=1)
print('male standard deviation: ', stdM)
print('female standard deviation: ', stdF)

male mean:  18.0
female mean:  33.142857142857146
male standard deviation:  34.27243790569909
female standard deviation:  31.360423952430722


##### Question 2
Compute the standard error.

In [23]:
se = np.sqrt(stdM**2/df['males'].count() + stdF**2/df['females'].count())
print('standard error: ', se)

standard error:  15.725088769901236


##### Question 3
Compute the t-statistic. $$t=\frac{\bar{x}_F-\bar{x}_M}{\text{SE}}$$

In [24]:
t = (meanF - meanM)/se
print('t-statistic (F-M): ', t)

t-statistic (F-M):  0.9629743503795974


##### Question 4
Determine the t-critical values, $t^*$, for $\alpha=0.05$. Remember this is a two-tailed test and $df = n_F+n_M-2$.  
Answer: $\pm 2.120$  

##### Question 5
Compute the $95$% CI for the difference in means. Remember $\text{CI}=(\bar{x}_F-\bar{x}_M) \pm t^*\cdot\text{SE}$.  
NOTE: The difference in sample means could be taken as $-15.14$ ($\bar{x}_M-\bar{x}_F$) or $15.14$ ($\bar{x}_F-\bar{x}_M$). Either direction is acceptable as long as it is reported and used consistently throughout the computation.

In [29]:
t_crit = 2.12
(lbound, ubound) = (meanF - meanM - t_crit*se, meanF - meanM + t_crit*se)
print('95% confidence interval: ', (lbound, ubound))

95% confidence interval:  (-18.194331049333478, 48.48004533504777)
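The confidence interval and the t-test are two views of the same comparison: zero lies inside the $95$% CI exactly when $\lvert t\rvert \lt t^*$. A small sketch of that check follows, using the values printed in the cells above. The variable names are illustrative and the numbers are copied from the outputs rather than recomputed from the CSV.

```python
# Values reported in the cells above
mean_f, mean_m = 33.142857142857146, 18.0
se_shoes = 15.725088769901236
t_crit = 2.12                      # t* for df = 16, alpha = 0.05

diff = mean_f - mean_m
lower, upper = diff - t_crit*se_shoes, diff + t_crit*se_shoes
contains_zero = lower <= 0 <= upper

t_stat = diff/se_shoes
print('CI contains 0:', contains_zero)
print('|t| < t*:     ', abs(t_stat) < t_crit)
```

Both checks agree, which corresponds to failing to reject $H_0$ for the shoe data at $\alpha=0.05$.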


##### Question 6
What proportion of the difference in pairs of shoes owned can be attributed to gender? Remember
$$r^2=\frac{t^2}{t^2+df}$$

In [31]:
r2 = t**2/(t**2 + df['males'].count() + df['females'].count() - 2)
print('r^2: ', r2)

r^2:  0.05478242400037162


### Scenario 4

Pooled variance.

In [None]:
x = np.array([5, 6, 1, -4])
y = np.array([3, 7, 8])

x_mean = x.mean()
y_mean = y.mean()
print('x mean: ', x_mean)
print('y mean: ', y_mean)

x_std = x.std(ddof=1)
y_std = y.std(ddof=1)
print('x standard deviation: ', x_std)
print('y standard deviation: ', y_std)
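The cell above stops after the means and standard deviations. One way the calculation might continue is sketched below, assuming the standard textbook definition of pooled variance (the notebook itself does not state the formula):

$$s_p^2=\frac{(n_x-1)s_x^2+(n_y-1)s_y^2}{n_x+n_y-2}, \qquad \text{SE}=\sqrt{s_p^2\left(\frac{1}{n_x}+\frac{1}{n_y}\right)}$$

The names `pooled_var`, `se_pooled`, and `t_pooled` are illustrative.

```python
import numpy as np

x = np.array([5, 6, 1, -4])
y = np.array([3, 7, 8])
n_x, n_y = len(x), len(y)

# Pooled variance: average of the sample variances,
# weighted by each sample's degrees of freedom
pooled_var = ((n_x - 1)*x.var(ddof=1) + (n_y - 1)*y.var(ddof=1)) / (n_x + n_y - 2)

# Standard error and t-statistic under the equal-variance assumption
se_pooled = np.sqrt(pooled_var*(1/n_x + 1/n_y))
t_pooled = (x.mean() - y.mean())/se_pooled

print('pooled variance:  ', pooled_var)
print('standard error:   ', se_pooled)
print('t-statistic (x-y):', t_pooled)
```

Unlike the unpooled standard error used in the earlier scenarios, this version assumes both populations share a common variance.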

