# Chapter 10
Analysis of Variance

## One-Way Analysis of Variance

Consider $m$ independent samples, each of size $n$, where the members
of the $i$th sample, $X_{i1}, X_{i2}, ..., X_{in}$, are *normal* random
variables, each with unknown mean $\mu_i$ and unknown common variance $\sigma^2$.

$ X_{ij} \sim N(\mu_i, \sigma^2) $

$ i = 1, ..., m $

$ j = 1, ..., n $

We want to test the following hypotheses:

$H_0: \mu_1 = \mu_2 = ... = \mu_m$

versus

$H_1: $ not all means are equal.

### Sum of the Squares of their Standard Versions

Note, since there are a total of $n \times m$ independent r.v.s
$X_{ij}$, the sum of the squares of their standard versions will
have a *chi-square* distribution with $mn$ degrees of freedom!

$ \sum^m_{i = 1}\sum^n_{j = 1}(X_{ij}-\mu_i)^2/\sigma^2 \sim \chi_{mn}^2$ 

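Doesn't make sense? Just think about the fact that each individual r.v. $X_{ij}$ is
a normal r.v., so its standard version $(X_{ij}-\mu_i)/\sigma$ is standard normal. The square of a
standard normal is chi-square with 1 degree of freedom, and a sum of $mn$ independent such squares
is chi-square with $mn$ degrees of freedom.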
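
The chi-square claim can be checked with a small simulation. This is a minimal sketch, not part of the original notes: the means, the standard deviation, the sample size and the seed below are all hypothetical values chosen for illustration. It relies only on the standard facts that a $\chi_{k}^2$ r.v. has mean $k$ and variance $2k$.

```python
import random

def standardized_sum_of_squares(mus, sigma, n, rng):
    """Draw n values per sample and return the sum of ((X_ij - mu_i) / sigma)^2."""
    total = 0.0
    for mu in mus:
        for _ in range(n):
            x = rng.gauss(mu, sigma)
            total += ((x - mu) / sigma) ** 2
    return total

rng = random.Random(0)   # fixed seed (assumption) so the run is repeatable
mus = [1.0, 5.0, -2.0]   # hypothetical means, so m = 3
sigma = 2.0              # hypothetical common standard deviation
n = 4                    # hypothetical sample size, so mn = 12

draws = [standardized_sum_of_squares(mus, sigma, n, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)

# A chi-square r.v. with mn = 12 degrees of freedom has mean 12 and variance 24.
print("simulated mean:", round(mean, 2))
print("simulated variance:", round(var, 2))
```

The simulated mean and variance should land close to $mn$ and $2mn$, whatever the individual means $\mu_i$ are.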


### Average of All Elements in One Sample

Let the sample average of all the elements in sample $i$ (also written $\overline{X_{i.}}$) be

$ \overline{X_i} = \sum^n_{j = 1}\frac{X_{ij}}{n}$


## Within Samples Sum of Squares

If we replace the population means in our sum of squares of the
standard versions with the sample means, we get a value built on the *within samples sum of squares*.

$ \sum^m_{i = 1}\sum^n_{j = 1}(X_{ij}-\overline{X_i})^2/\sigma^2 \sim \chi_{mn-m}^2$ 

Notice, this new distribution has $nm-m$ degrees of freedom, since estimating each of the
$m$ population means by its sample mean costs 1 degree of freedom, $ 1 \times m $ in total.

The *within samples sum of squares* is defined as

$ SS_W = \sum^m_{i = 1}\sum^n_{j = 1}(X_{ij}-\overline{X_i})^2 $

NOTE, $SS_W/(nm - m)$ ALWAYS ESTIMATES $\sigma^2$, WHETHER OR NOT $H_0$ IS TRUE.
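
As a small illustration of why this estimate does not depend on the means, here is a sketch that computes $SS_W$ on a tiny made-up data set and then shifts one whole sample by a constant. The helper name `within_ss` and the numbers are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
def within_ss(data):
    """Sum over samples of squared deviations from each sample's own mean."""
    total = 0.0
    for row in data:
        row_mean = sum(row) / len(row)
        total += sum((x - row_mean) ** 2 for x in row)
    return total

data = [[1, 2, 3], [4, 6, 8]]   # hypothetical data: m = 2 samples of size n = 3
m, n = len(data), len(data[0])

ss_w = within_ss(data)           # (1 + 0 + 1) + (4 + 0 + 4) = 10
sigma2_hat = ss_w / (n * m - m)  # 10 / 4 = 2.5

# Shifting one sample by a constant changes its mean but not its spread,
# so SS_W (and the estimate of sigma^2) stays the same.
shifted = [data[0], [x + 100 for x in data[1]]]
ss_w_shifted = within_ss(shifted)

print(ss_w, sigma2_hat, ss_w_shifted)
```

Because each sample is centered on its own mean, moving the samples apart leaves $SS_W$ untouched.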

## Between Samples Sum of Squares

The statistic below is defined as the *between samples sum of squares*.

$ SS_b = n\sum^m_{i = 1}(\overline{X_{i.}}-X_{..})^2 $

where $\overline{X_{i.}}$ is the $i$th sample mean and $X_{..}$, the average of all $nm$ values, is defined as $\sum^m_{i = 1}\frac{\overline{X_{i.}}}{m}$

WHEN $H_0$ IS TRUE, $SS_b/(m-1)$ BECOMES AN ESTIMATOR OF $\sigma^2$.

OTHERWISE, $SS_b/(m-1)$ WILL TEND TO EXCEED $\sigma^2$.
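
The two statements above can be illustrated by simulation. This is a rough sketch under assumed values (the means, $\sigma = 2$, $n = 5$, the seed and the helper names are all hypothetical): it averages $SS_b/(m-1)$, with the factor $n$ included in $SS_b$, over many simulated data sets, once with equal means and once with unequal means.

```python
import random

def mean_square_between(mus, sigma, n, rng):
    """Simulate one data set and return SS_b / (m - 1)."""
    m = len(mus)
    sample_means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for mu in mus]
    grand_mean = sum(sample_means) / m
    ss_b = n * sum((xbar - grand_mean) ** 2 for xbar in sample_means)
    return ss_b / (m - 1)

def average_msb(mus, sigma, n, reps, rng):
    """Average SS_b / (m - 1) over reps simulated data sets."""
    return sum(mean_square_between(mus, sigma, n, rng) for _ in range(reps)) / reps

rng = random.Random(1)            # fixed seed (assumption) for repeatability
sigma, n, reps = 2.0, 5, 20000    # hypothetical settings; sigma^2 = 4

msb_h0 = average_msb([3.0, 3.0, 3.0], sigma, n, reps, rng)  # H_0 true
msb_h1 = average_msb([0.0, 1.0, 2.0], sigma, n, reps, rng)  # H_0 false

print("average SS_b/(m-1), equal means:  ", round(msb_h0, 2))
print("average SS_b/(m-1), unequal means:", round(msb_h1, 2))
```

With equal means the average should sit near $\sigma^2 = 4$; with unequal means it should come out clearly larger.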

## Test Statistic

We can formulate the test statistic $TS$ by comparing the values of $SS_b/(m-1)$ and $SS_W/(nm - m)$.
Their ratio should be close to $1$ when $H_0$ is true (remember, this is when all means are equal), since both then estimate $\sigma^2$.

$ TS = \frac{SS_b/(m-1)}{SS_W/(nm - m)}$

When this value is substantially greater than 1, it gives us sufficient reason to reject $H_0$.

## Significance Level Test of $H_0$

To determine how large $TS$ has to be to reject our null hypothesis $H_0$, we use two facts: $SS_W$ and $SS_b$
are independent of one another, and when $H_0$ is true, $SS_b/\sigma^2 \sim \chi_{m-1}^2$.

Thus, when $H_0$ is true, $TS$ is a ratio of two independent chi-squares, each divided by its degrees of freedom, so it has an F-distribution with $m-1$ and $nm-m$ degrees of freedom!

Skipping the rest of the derivation, we construct a test for significance level $\alpha$ as follows:

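- Reject $H_0$ if $\frac{SS_b/(m-1)}{SS_W/(nm-m)} > F_{m-1, nm-m, \alpha}$, where $F_{m-1, nm-m, \alpha}$ is the value that an F r.v. with $m-1$ and $nm-m$ degrees of freedom exceeds with probability $\alpha$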
- **Do not reject** $H_0$ otherwise.
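
One way to sanity-check this rule is against SciPy's built-in one-way ANOVA. The sketch below is only an illustration on made-up numbers (the `samples` data is hypothetical): `scipy.stats.f_oneway` returns the F statistic and its p-value, and `scipy.stats.f.isf` gives the critical value $F_{m-1, nm-m, \alpha}$.

```python
from scipy.stats import f, f_oneway

samples = [[1, 2, 3], [4, 6, 8]]   # hypothetical data: m = 2 samples of size n = 3
alpha = 0.05
m, n = len(samples), len(samples[0])

result = f_oneway(*samples)        # library one-way ANOVA
ts = float(result.statistic)       # (SS_b / (m - 1)) / (SS_W / (nm - m)) = 24 / 2.5
p_value = float(result.pvalue)

# Critical value: the point an F(m - 1, nm - m) r.v. exceeds with probability alpha
crit = float(f.isf(alpha, m - 1, n * m - m))

reject = ts > crit
print("TS:", ts, "critical value:", crit, "p-value:", p_value, "reject H_0:", reject)
```

Comparing $TS$ with the critical value and comparing the p-value with $\alpha$ are two views of the same decision, so the two always agree.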

## Example

*An auto rental firm is using 15 identical motors that are adjusted to run at a fixed speed to test 3 different brands of gasoline.
Each brand of gasoline is assigned to exactly 5 of the motors. Each motor runs on 10 gallons of gasoline until it is out of fuel.*

The following total mileages were obtained by the different motors:

Gas 1: 220 251 226 246 260

Gas 2: 244 235 232 242 225

Gas 3: 252 272 250 238 256

Test the hypothesis that the average mileage obtained is not affected by the type of gas used. Use the 5 percent level of significance.

Here, note $X_{ij}$ denotes the mileage obtained by the $j$th motor running on gas $i$, with $m = 3$ brands and $n = 5$ motors per brand.

Also, our $H_0: \mu_1 = \mu_2 = \mu_3$, where $\mu_i$ denotes the average mileage obtained with gas $i$.

```python
from scipy.stats import f


def reject_h_0(data, alpha):
    """Return True if we must reject the null hypothesis at level alpha.

    data is a matrix (list of lists) where each row holds the n values for one gas.
    """
    m = len(data)
    n = len(data[0])
    x_means = [sum(row) / n for row in data]
    x_dot_dot = sum(x_means) / m
    SS_b = n * sum((x_mean - x_dot_dot) ** 2 for x_mean in x_means)
    square_sum = sum(x ** 2 for row in data for x in row)
    # Total sum of squares (square_sum - nm * X_..^2) minus SS_b leaves SS_W
    SS_W = square_sum - n * m * x_dot_dot ** 2 - SS_b
    TS = (SS_b / (m - 1)) / (SS_W / (n * m - m))
    # p-value: upper tail of the F-distribution at the observed TS (not at alpha)
    p_val = f.sf(TS, m - 1, n * m - m)
    # Critical value that TS must reach for rejection at level alpha
    F = f.isf(alpha, m - 1, n * m - m)
    print("alpha: " + str(alpha))
    print("X_i.: " + str(x_means))
    print("X_..: " + str(x_dot_dot))
    print("square sum: " + str(square_sum))
    print("SS_b: " + str(SS_b))
    print("SS_W: " + str(SS_W))
    print("p-value: " + str(p_val))
    print("TS: " + str(TS))
    print("F: " + str(F))
    return TS >= F


reject_h_0([[220, 251, 226, 246, 260], [244, 235, 232, 242, 225], [252, 272, 250, 238, 256]], 0.05)
```

For the mileages above, the sample means are 240.6, 235.6 and 253.6, with $X_{..} \approx 243.27$ and a square sum of 890535. This gives $SS_b \approx 863.33$ and $SS_W \approx 1991.6$, so $TS \approx 2.60$ with a p-value of about 0.115, while the critical value is $F_{2, 12, 0.05} \approx 3.885$.

Since $TS < F_{2, 12, 0.05}$, the function returns False: at the 5 percent level of significance we cannot reject the hypothesis that the average mileage is the same for all 3 gases.