# Two Sample Hypothesis Test

## Do two samples of sizes $n_{1}$ and $n_{2}$, drawn from populations 1 and 2, come from populations with different mean vectors?

> $H_{0}: \mu_{1} = \mu_{2}$

**We assume that both populations share the same covariance matrix.**

### Whales!

In [1]:
# blue whales
blue.l <- c(24.3, 24.96, 25.36, 25.74)
blue.w <- c(109.74, 108.95, 109.12, 109.44)
blue.f <- c(2.46, 1.95, 1.75, 2.35)

# bowhead whales
bow.l <- c(22.39, 22.45, 22.75, 20.92, 21.64)
bow.w <- c(83.07, 81.84, 82.81, 81.9, 82.65)
bow.f <- c(2.53, 2.62, 3.39, 2.94, 2.19)

In [4]:
blue <- data.frame(blue.l, blue.w, blue.f)
bow <- data.frame(bow.l, bow.w, bow.f)

In [7]:
n.blue <- 4
n.bow <- 5

# three variables measured per whale: body length, weight, flipper length
p <- 3

In [31]:
blue

blue.l,blue.w,blue.f
24.3,109.74,2.46
24.96,108.95,1.95
25.36,109.12,1.75
25.74,109.44,2.35


In [32]:
bow

bow.l,bow.w,bow.f
22.39,83.07,2.53
22.45,81.84,2.62
22.75,82.81,3.39
20.92,81.9,2.94
21.64,82.65,2.19


## Steps

1) Calculate Pooled Unbiased Estimate for the Covariance Matrix

> $S_{U} = \dfrac{(n_{1}-1)S_{1U} + (n_{2}-1)S_{2U}}{n_{1} + n_{2} - 2}$

2) Calculate Mahalanobis distance between the two sample means

> $D^{2}(\bar{\mathbf{x}_{1}}, \bar{\mathbf{x}_{2}}) = \left ( \bar{\mathbf{x}_{1}} - \bar{\mathbf{x}_{2}} \right )^{T} S_{U}^{-1} \left ( \bar{\mathbf{x}_{1}} - \bar{\mathbf{x}_{2}} \right )$

3) Calculate $T^{2}$

> $T^{2} = \dfrac{n_{1}n_{2}}{n_{1} + n_{2}} D^{2}(\bar{\mathbf{x}_{1}}, \bar{\mathbf{x}_{2}})$

4) Run the F-test: Reject $H_{0}$ at the $100\alpha\%$ level of significance if:

> $T^{2} > \dfrac{(n_{1} + n_{2} - 2)p}{n_{1} + n_{2} - p - 1} F_{p, n_{1} + n_{2} - p - 1, \alpha}$

*or, equivalently,*

> $ \dfrac{n_{1} + n_{2} - p - 1}{(n_{1} + n_{2} - 2)p} T^{2} > F_{p, n_{1} + n_{2} - p - 1, \alpha}$
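The four steps above can be collected into one function. The following is a minimal sketch, not part of the original notebook: the name `hotelling.two.sample` and the fields of the returned list are hypothetical, and it relies on the equal-covariance assumption stated earlier. The whale data are repeated so that the block runs on its own.

```r
# Hypothetical helper (an illustration, not the notebook's own code):
# two-sample Hotelling T^2 test assuming a common covariance matrix.
hotelling.two.sample <- function(x1, x2) {
  x1 <- as.matrix(x1)
  x2 <- as.matrix(x2)
  n1 <- nrow(x1)
  n2 <- nrow(x2)
  p <- ncol(x1)

  # Step 1: pooled unbiased estimate of the covariance matrix
  S.U <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)

  # Step 2: squared Mahalanobis distance between the sample mean vectors
  d <- colMeans(x1) - colMeans(x2)
  D.2 <- drop(t(d) %*% solve(S.U, d))

  # Step 3: Hotelling's T^2
  T.2 <- (n1 * n2) / (n1 + n2) * D.2

  # Step 4: scaled statistic and its upper-tail F probability
  df2 <- n1 + n2 - p - 1
  F.stat <- df2 / ((n1 + n2 - 2) * p) * T.2
  p.value <- pf(F.stat, p, df2, lower.tail = FALSE)

  list(D.2 = D.2, T.2 = T.2, F.stat = F.stat, df = c(p, df2), p.value = p.value)
}

# Whale measurements copied from the cells above
blue <- data.frame(blue.l = c(24.3, 24.96, 25.36, 25.74),
                   blue.w = c(109.74, 108.95, 109.12, 109.44),
                   blue.f = c(2.46, 1.95, 1.75, 2.35))
bow <- data.frame(bow.l = c(22.39, 22.45, 22.75, 20.92, 21.64),
                  bow.w = c(83.07, 81.84, 82.81, 81.9, 82.65),
                  bow.f = c(2.53, 2.62, 3.39, 2.94, 2.19))

res <- hotelling.two.sample(blue, bow)
res$T.2
res$p.value
```

Using `solve(S.U, d)` rather than forming the inverse explicitly is a design choice of this sketch; the statistic is the same as in the step-by-step cells that follow.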



In [19]:
S.1U <- cov(blue)

S.2U <- cov(bow)

S.U <- ((n.blue-1)*S.1U + (n.bow-1)*S.2U) / (n.blue + n.bow - 2)

S.U.inv <- solve(S.U)

In [24]:
x.bar.1 <- apply(blue, 2, mean)

x.bar.2 <- apply(bow, 2, mean)

In [34]:
D.2 <-  t(x.bar.1 - x.bar.2) %*% S.U.inv %*% (x.bar.1 - x.bar.2)
D.2

0
3381.794


In [35]:
T.2 <- ((n.blue*n.bow) / (n.blue + n.bow)) * D.2
T.2

0
7515.099


In [36]:
F <- ((n.blue + n.bow - p - 1) / ((n.blue + n.bow - 2)*p)) * T.2

In [37]:
F

0
1789.309


In [39]:
pf(F, p, n.blue+n.bow-p-1, lower.tail = FALSE)

0
5.380012e-08


**The p-value is below 0.001, so we can reject the null hypothesis (that the two population mean vectors are equal) at the 0.1% level of significance.**
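The first form of the rejection rule compares $T^{2}$ directly with a scaled F quantile instead of computing a p-value. A minimal sketch for the whale example, assuming $\alpha = 0.001$: `qf` is base R's F quantile function, and the names `alpha`, `F.crit` and `T.2.crit` are introduced here for illustration only.

```r
# Values taken from the whale example above
n.blue <- 4
n.bow <- 5
p <- 3
T.2 <- 7515.099

alpha <- 0.001  # assumed significance level (0.1%)

# Upper-alpha quantile of the F distribution with (p, n1 + n2 - p - 1) df
df2 <- n.blue + n.bow - p - 1
F.crit <- qf(alpha, p, df2, lower.tail = FALSE)

# Critical value on the T^2 scale
T.2.crit <- ((n.blue + n.bow - 2) * p / df2) * F.crit

# TRUE means H0 is rejected at level alpha
reject <- T.2 > T.2.crit
reject
```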

---


## Problem Sheet 2 Questions

### Q1) One-Sample Hotelling $T^{2}$ Test

**Hypothesis test**:

> $H_{0}: \mu = \mu_{0} = [180, 113, 750]^{T}$

### Steps

1) Calculate Mahalanobis distance between the sample mean and the hypothesised mean $\mu_{0}$

> $D^{2}(\bar{\mathbf{x}}, \mu_{0}) = \left ( \bar{\mathbf{x}} - \mu_{0} \right )^{T} S_{U}^{-1} \left ( \bar{\mathbf{x}} - \mu_{0} \right )$

2) Calculate $T^{2}$

> $T^{2} = n D^{2}(\bar{\mathbf{x}}, \mu_{0})$

3) Run the F-test: Reject $H_{0}$ at the $100\alpha\%$ level of significance if:

> $T^{2} > \dfrac{(n-1)p}{n-p} F_{p, n-p, \alpha}$

*or, equivalently,*

> $ \dfrac{n-p}{(n-1)p} T^{2} > F_{p, n-p, \alpha}$

In [45]:
groceries <- c(227.01, 241.42, 188.08, 238.23, 235.86)
leisure <- c(96.98, 140.44, 85.13, 158.22, 103.06)
income <- c(741.29, 854.07, 812.07, 813.69, 731.42)

spend <- data.frame(groceries, leisure, income)

In [46]:
mu.0 <- c(180, 113, 750)

In [47]:
spend

groceries,leisure,income
227.01,96.98,741.29
241.42,140.44,854.07
188.08,85.13,812.07
238.23,158.22,813.69
235.86,103.06,731.42


In [57]:
n <- 5
p <- 3

S.U <- var(spend)

S.U.inv <- solve(S.U)

x.bar <- apply(spend, 2, mean)

D.2 <- t(x.bar - mu.0) %*% S.U.inv %*% (x.bar - mu.0)
D.2

0
25.23166


In [58]:
T.2 <- n * D.2
T.2

0
126.1583


In [60]:
F <- ((n-p) / ((n-1)*p))*T.2
F

0
21.02638


In [61]:
pf(F, p, n-p, lower.tail = FALSE)

0
0.04574171


Since the p-value (0.0457) is below 0.05, we can reject the null hypothesis at the 5% level of significance and conclude that the population mean vector is not $[180, 113, 750]^{T}$. At the 1% level we could not reject it.
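The one-sample steps can be wrapped up in the same way. This is a sketch under the same assumptions as the cells above; the function name `hotelling.one.sample` and its return fields are hypothetical, and the spending data are repeated so that the block is self-contained.

```r
# Hypothetical helper (an illustration, not the notebook's own code):
# one-sample Hotelling T^2 test of H0: mu = mu.0.
hotelling.one.sample <- function(x, mu.0) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)

  # Step 1: squared Mahalanobis distance from the sample mean to mu.0
  d <- colMeans(x) - mu.0
  D.2 <- drop(t(d) %*% solve(cov(x), d))

  # Step 2: Hotelling's T^2
  T.2 <- n * D.2

  # Step 3: scaled statistic and its upper-tail F probability
  F.stat <- (n - p) / ((n - 1) * p) * T.2
  p.value <- pf(F.stat, p, n - p, lower.tail = FALSE)

  list(D.2 = D.2, T.2 = T.2, F.stat = F.stat, df = c(p, n - p), p.value = p.value)
}

# Spending data copied from the cells above
spend <- data.frame(groceries = c(227.01, 241.42, 188.08, 238.23, 235.86),
                    leisure = c(96.98, 140.44, 85.13, 158.22, 103.06),
                    income = c(741.29, 854.07, 812.07, 813.69, 731.42))

res <- hotelling.one.sample(spend, mu.0 = c(180, 113, 750))
res$T.2
res$p.value
```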

## Question 2 looks disgusting, let's crack Q3 next, and finish on Q4.

### Q3) Testing ratios between the components of the mean vector

> $H_{0}: \dfrac{1}{6} \mu_{1} = \dfrac{1}{4} \mu_{2} = \mu_{3}$

Rewrite the constraints as $2\mu_{1} - 3\mu_{2} = 0$ and $\mu_{1} - 6\mu_{3} = 0$, and collect the coefficients as the rows of a matrix $C$. The hypothesis then becomes:

> $H_{0}: C\mu = \mathbf{0}$

and is tested by modifying our $T^{2}$ statistic (written for the general hypothesis $H_{0}: C\mu = \phi$; here $\phi = \mathbf{0}$):

> $T^{2} = n (C \bar{X} - \phi)^{T} (CS_{U}C^{T})^{-1} (C \bar{X} - \phi)$

where $m$ is the number of rows of $C$ and, under $H_{0}$, $T^{2} \sim T^{2}_{m}(n-1)$.

**IMPORTANT**

Converting from $T^{2} \sim T^{2}_{m}(n-1)$ to **F-stat**:

> $T^{2}_{m}(n-1) = \dfrac{(n-1)m}{n-m}F_{m, n-m}$

Therefore reject $H_{0}$ at $100\alpha\%$ level of significance when:

> $\dfrac{n-m}{(n-1)m}T^{2} > F_{m, n-m, \alpha}$

In [64]:
b.height <- c(78, 76, 92, 81, 81, 84)
b.chest <- c(60.6, 58.1, 63.2, 59, 60.8, 59.5)
b.muac <- c(16.5, 12.5, 14.5, 14, 15.5, 14)
boys <- data.frame(b.height, b.chest, b.muac)

In [87]:
n <- 6
p <- 3
m <- 2

In [65]:
boys

b.height,b.chest,b.muac
78,60.6,16.5
76,58.1,12.5
92,63.2,14.5
81,59.0,14.0
81,60.8,15.5
84,59.5,14.0


In [94]:
C <- matrix(c(2,1,-3,0,0,-6), nrow=2, ncol=3)

In [95]:
C

0,1,2
2,-3,0
1,0,-6
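Because `matrix()` fills column by column, the constraint coefficients in the cell above are interleaved. As a quick sanity check, the sketch below builds the same matrix row by row and confirms that a mean vector satisfying $H_{0}$ is mapped to zero. The names `C.col`, `C.row` and `mu.example` are introduced here for illustration, and `mu.example` is an arbitrary vector chosen to satisfy $\frac{1}{6}\mu_{1} = \frac{1}{4}\mu_{2} = \mu_{3}$.

```r
# Column-major fill, as in the cell above
C.col <- matrix(c(2, 1, -3, 0, 0, -6), nrow = 2, ncol = 3)

# Row-wise fill: each row is one constraint
C.row <- matrix(c(2, -3,  0,    # 2*mu1 - 3*mu2 = 0
                  1,  0, -6),   # mu1 - 6*mu3 = 0
                nrow = 2, byrow = TRUE)

# Both constructions give the same matrix
same <- identical(C.col, C.row)
same

# Illustrative mean vector with mu1/6 = mu2/4 = mu3 = 1
mu.example <- c(6, 4, 1)
C.row %*% mu.example
```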


In [96]:
x.bar <- apply(boys, 2, mean)

In [97]:
S.U <- var(boys)

CSC <- C %*% S.U %*% t(C)

CSC.inv <- solve(CSC)

In [98]:
CSC

0,1
58.468,56.66
56.66,94.0


In [99]:
T.2 <- n * t(C%*%x.bar) %*% CSC.inv %*% (C%*%x.bar)

In [100]:
T.2

0
47.1434


In [101]:
F <- ((n-m)/ ((n-1)*m)) * T.2

In [102]:
pf(F, m, n-m, lower.tail = FALSE)

0
0.009194778


**The p-value (0.0092) is below 0.01 but above 0.001, so we can reject the null hypothesis at the 1% level of significance, but not at the 0.1% level.**
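The contrast test generalises to any full-row-rank $C$ and any target vector $\phi$. A minimal sketch, with the hypothetical name `hotelling.contrast` and a default of $\phi = \mathbf{0}$ chosen for this example; the boys' data are repeated so that the block runs on its own.

```r
# Hypothetical helper (an illustration, not the notebook's own code):
# Hotelling T^2 test of H0: C mu = phi, where C has m rows.
hotelling.contrast <- function(x, C, phi = rep(0, nrow(C))) {
  x <- as.matrix(x)
  n <- nrow(x)
  m <- nrow(C)

  # Deviation of the transformed sample mean from phi
  d <- drop(C %*% colMeans(x)) - phi

  # Covariance matrix of the transformed variables
  CSC <- C %*% cov(x) %*% t(C)

  T.2 <- n * drop(t(d) %*% solve(CSC, d))

  # Convert T^2_m(n - 1) to an F statistic with (m, n - m) df
  F.stat <- (n - m) / ((n - 1) * m) * T.2
  p.value <- pf(F.stat, m, n - m, lower.tail = FALSE)

  list(T.2 = T.2, F.stat = F.stat, df = c(m, n - m), p.value = p.value)
}

# Boys' measurements copied from the cells above
boys <- data.frame(b.height = c(78, 76, 92, 81, 81, 84),
                   b.chest = c(60.6, 58.1, 63.2, 59, 60.8, 59.5),
                   b.muac = c(16.5, 12.5, 14.5, 14, 15.5, 14))

C <- matrix(c(2, 1, -3, 0, 0, -6), nrow = 2, ncol = 3)

res <- hotelling.contrast(boys, C)
res$T.2
res$p.value
```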

## Q4: 2-Sample Hotelling $T^{2}$ Test

**Comparing Boys with Girls**

## Steps

1) Calculate Pooled Unbiased Estimate for the Covariance Matrix

> $S_{U} = \dfrac{(n_{1}-1)S_{1U} + (n_{2}-1)S_{2U}}{n_{1} + n_{2} - 2}$

2) Calculate Mahalanobis distance between the two sample means

> $D^{2}(\bar{\mathbf{x}_{1}}, \bar{\mathbf{x}_{2}}) = \left ( \bar{\mathbf{x}_{1}} - \bar{\mathbf{x}_{2}} \right )^{T} S_{U}^{-1} \left ( \bar{\mathbf{x}_{1}} - \bar{\mathbf{x}_{2}} \right )$

3) Calculate $T^{2}$

> $T^{2} = \dfrac{n_{1}n_{2}}{n_{1} + n_{2}} D^{2}(\bar{\mathbf{x}_{1}}, \bar{\mathbf{x}_{2}})$

4) Run the F-test: Reject $H_{0}$ at the $100\alpha\%$ level of significance if:

> $T^{2} > \dfrac{(n_{1} + n_{2} - 2)p}{n_{1} + n_{2} - p - 1} F_{p, n_{1} + n_{2} - p - 1, \alpha}$

*or, equivalently,*

> $ \dfrac{n_{1} + n_{2} - p - 1}{(n_{1} + n_{2} - 2)p} T^{2} > F_{p, n_{1} + n_{2} - p - 1, \alpha}$

In [103]:
g.height <- c(80, 75, 78, 75, 79, 78, 75, 64, 80)
g.chest <- c(58.4, 59.2, 60.3, 57.4, 59.5, 58.1, 58, 55.5, 59.2)
g.muac <- c(14, 15, 15, 13, 14, 14.5, 12.5, 11, 12.5)
girls <- data.frame(g.height, g.chest, g.muac)

In [105]:
# 1 = boys, 2 = girls

S.1.U <- var(boys)
S.2.U <- var(girls)

n.1 <- 6
n.2 <- 9
p <- 3

S.U <- ((n.1-1)*S.1.U + (n.2-1)*S.2.U) / (n.1 + n.2 - 2)

# sample pooled estimate for the covariance matrix for the two groups
S.U

,b.height,b.chest,b.muac
b.height,27.230769,6.561538,2.846154
b.chest,6.561538,2.432308,1.4
b.muac,2.846154,1.4,1.846154


In [106]:
x.1.bar <- apply(boys, 2, mean)
x.2.bar <- apply(girls, 2, mean)

D.2 <- t(x.1.bar - x.2.bar) %*% solve(S.U) %*% (x.1.bar - x.2.bar)
D.2

0
1.475479


In [109]:
T.2 <- (n.1 * n.2) / (n.1 + n.2) * D.2
T.2

0
5.311726


In [110]:
F <- ((n.1 + n.2 - p - 1) / ((n.1 + n.2 - 2)*p))*T.2
F

0
1.498179


In [111]:
pf(F, p, n.1 + n.2 - p - 1, lower.tail = FALSE)

0
0.2692616


**The p-value (0.27) is well above 0.05, so we cannot reject the null hypothesis that the two mean vectors are the same.**