# Analysis of Variance (ANOVA)

In [1]:
from data.create_data import read_frmgham
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
data = read_frmgham()
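# Drop rows with a missing education level. Calling dropna(inplace=True)
# on the column alone does not remove those rows from the DataFrame.
data = data.dropna(subset=['educ'])
data['educ'] = data['educ'].astype('int')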
data

Unnamed: 0,randid,sex,totchol,age,sysbp,diabp,cursmoke,cigpday,bmi,diabetes,...,cvd,hyperten,timeap,timemi,timemifc,timechd,timestrk,timecvd,timedth,timehyp
0,2448,male,195.0,39,106.0,70.0,0,0.0,26.97,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
1,2448,male,209.0,52,121.0,66.0,0,0.0,,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2,6238,female,250.0,46,121.0,81.0,0,0.0,28.73,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
3,6238,female,260.0,52,105.0,69.5,0,0.0,29.43,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
4,6238,female,237.0,58,108.0,66.0,0,0.0,28.50,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
5,9428,male,245.0,48,127.5,80.0,1,20.0,25.34,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6,9428,male,283.0,54,141.0,89.0,1,30.0,25.34,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
7,10552,female,225.0,61,150.0,95.0,1,30.0,28.58,0,...,1,1,2956,2956,2956,2956,2089,2089,2956,0
8,10552,female,232.0,67,183.0,109.0,1,20.0,30.18,0,...,1,1,2956,2956,2956,2956,2089,2089,2956,0
9,11252,female,285.0,46,130.0,84.0,1,23.0,23.10,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,4285


### Research Question
1. Is there any difference in the mean blood pressures across the various levels of education? 
  * Do any of the group means differ from one another?
  * Can a significant proportion of the overall variability found in the blood pressures be attributed to the known differences between the groups?  
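
The second sub-question is commonly quantified with eta squared, the share of total variability that lies between groups (SS<sub>between</sub> / SS<sub>total</sub>). The following is a minimal sketch on made-up readings, not on the Framingham data; the `groups` dictionary and its values are illustrative assumptions.

```python
# Toy blood-pressure readings (mmHg) for three hypothetical groups;
# these numbers are made up for illustration, not taken from the dataset.
groups = {
    "a": [120.0, 122.0, 124.0],
    "b": [130.0, 132.0, 134.0],
    "c": [140.0, 142.0, 144.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Total variability: squared deviation of every reading from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-group variability: squared deviation of each group mean from the
# grand mean, weighted by the group size.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

eta_squared = ss_between / ss_total
print("eta squared = %.3f" % eta_squared)  # prints: eta squared = 0.962
```

In this toy example almost all of the variability is between groups because the groups barely overlap; real data will usually give a much smaller value.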

#### Legend: Factor/Treatment = `Education Level`
The group or categorical variable is `educ`, which consists of 4 levels:
  * **`1`**: 0-11 years of education (denoted as `elem`)
  * **`2`**: High School Diploma, GED (denoted as `high`)
  * **`3`**: some college, vocational school (denoted as `some_college`)
  * **`4`**: College degree or more (denoted as `college`)
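
For readable output, the numeric codes can be mapped to the labels in the legend. This is a minimal sketch assuming pandas; `EDUC_LABELS`, `toy`, and the `educ_label` column are illustrative names, and the rows are made up rather than read from the dataset.

```python
import pandas as pd

# Labels taken from the legend above; the small frame is illustrative only.
EDUC_LABELS = {1: "elem", 2: "high", 3: "some_college", 4: "college"}

toy = pd.DataFrame({
    "educ": [1, 2, 3, 4, 2],
    "sysbp": [106.0, 121.0, 105.0, 127.5, 141.0],
})

# Series.map looks up each code in the dictionary; unknown codes become NaN.
toy["educ_label"] = toy["educ"].map(EDUC_LABELS)
print(toy)
```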

## One-way ANOVA
One-way ANOVA tests whether the mean of systolic or diastolic blood pressure differs across the various levels of education.

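The ANOVA test compares more than two group means in a single test, which keeps the overall Type I error rate at the chosen significance level (α = 0.05) instead of letting it grow with each additional pairwise test.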
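
To illustrate why one test is preferred over many, the sketch below estimates the chance of at least one false positive when every pair of the four groups is tested separately at α = 0.05. It assumes the tests are independent, which pairwise comparisons that share a group are not, so the figure is a rough illustration rather than an exact rate.

```python
alpha = 0.05
n_groups = 4

# Number of distinct pairs among the groups: n * (n - 1) / 2.
n_pairs = n_groups * (n_groups - 1) // 2

# Probability of at least one Type I error across all pairwise tests,
# under the simplifying assumption that the tests are independent.
familywise = 1 - (1 - alpha) ** n_pairs

print("%d pairwise tests -> familywise error of about %.3f" % (n_pairs, familywise))
# prints: 6 pairwise tests -> familywise error of about 0.265
```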

### Test-statistic: `F-statistic`
The **F-statistic** or **F-ratio** is the ratio of the variance between the group means to the variance within the groups.  

`F = variance`<sub>between groups</sub>` / variance`<sub>within groups</sub>

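Interpretations:
  * **F ≈ 1.0**: The between-group variance is about the same as the within-group variance, as expected when there is no difference in the mean blood pressures of individuals in the various education groups.
    * Consistent with H<sub>0</sub>
  * **F well above 1.0**: The group means vary more than the within-group variance alone would explain, suggesting a difference in the mean blood pressures of at least one education group.
    * Evidence against H<sub>0</sub>; the p-value determines whether it is statistically significant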
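
The ratio can be computed by hand to see what `stats.f_oneway` reports. The following is a sketch on made-up readings for three hypothetical groups (not the notebook's data), assuming NumPy and SciPy are available; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Made-up readings (mmHg) for three hypothetical groups.
groups = [
    np.array([118.0, 121.0, 125.0, 130.0]),
    np.array([122.0, 128.0, 131.0, 135.0]),
    np.array([129.0, 134.0, 138.0, 143.0]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# The p-value is the upper tail of the F distribution with (k-1, N-k) df.
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual: F = %.4f, p = %.4f" % (f_manual, p_manual))
print("scipy:  F = %.4f, p = %.4f" % (f_scipy, p_scipy))
```

The two lines should agree, since `f_oneway` implements the same ratio of mean squares.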

#### Systolic Blood Pressure

1) Hypotheses
  * **H<sub>0</sub>**: The population mean systolic blood pressure is the same at every education level: μ<sub>1</sub> = μ<sub>2</sub> = μ<sub>3</sub> = μ<sub>4</sub>.
  * **H<sub>A</sub>**: The population mean systolic blood pressure of at least one education level differs from the others, i.e. not all of μ<sub>1</sub>, μ<sub>2</sub>, μ<sub>3</sub>, μ<sub>4</sub> are equal.
  
  
2) Compute the **test statistic** (F-statistic/F-ratio) & **p-value**.

In [3]:
elem_sysbp = data[data['educ'] == 1]['sysbp']
high_sysbp = data[data['educ'] == 2]['sysbp']
some_college_sysbp = data[data['educ'] == 3]['sysbp']
college_sysbp = data[data['educ'] == 4]['sysbp']

In [4]:
f_sysbp, fpval_sysbp = stats.f_oneway(elem_sysbp, high_sysbp, 
                                      some_college_sysbp, college_sysbp)
print("One-way ANOVA F-statistic = %.2f" % f_sysbp)
print("One-way ANOVA p-value = %.2f" % fpval_sysbp)

One-way ANOVA F-statistic = 55.53
One-way ANOVA p-value = 0.00
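
A p-value printed with two decimals can show as `0.00` without being exactly zero. The sketch below shows two alternative ways to report a small p-value; `p_example` is a hypothetical value chosen for illustration, not the notebook's result, and `format_p` is an illustrative helper.

```python
# Hypothetical small p-value chosen for illustration only.
p_example = 3.2e-7

print("fixed:      p = %.2f" % p_example)  # prints: fixed:      p = 0.00
print("scientific: p = %.2e" % p_example)  # prints: scientific: p = 3.20e-07


def format_p(p, floor=0.001):
    """Report p as an inequality when it falls below `floor` (illustrative helper)."""
    if p < floor:
        return "p < %g" % floor
    return "p = %.3f" % p


print(format_p(p_example))  # prints: p < 0.001
print(format_p(0.0312))     # prints: p = 0.031
```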


3) Results
  * test-statistic: **F-statistic** = 55.53
  * **p-value** < 0.005 (displayed as 0.00 after rounding to two decimal places)

The **p-value** is small (p-value < 0.05), providing substantial evidence against the null hypothesis (H<sub>0</sub>). Hence, H<sub>0</sub> is rejected in favor of the alternative hypothesis (H<sub>A</sub>).

Additionally, the test statistic (**F-statistic/ratio**) is far greater than 1, which is consistent with rejecting H<sub>0</sub>.

##### Conclusion
Not all of μ<sub>1</sub>, μ<sub>2</sub>, μ<sub>3</sub>, μ<sub>4</sub> are equal.

Based on the results, there is a statistically significant difference in mean systolic blood pressure (mmHg) across the education levels (groups 1, 2, 3, 4). At least one group's mean differs from the others; the ANOVA alone does not identify which groups differ.
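
One possible follow-up, not part of the original analysis, is to compare the groups pairwise with a corrected significance level. The sketch below uses Welch's t-test from `scipy.stats.ttest_ind` with a Bonferroni correction on made-up samples; the `samples` dictionary and its values are assumptions for illustration, not Framingham data.

```python
from itertools import combinations
from scipy import stats

# Made-up samples (mmHg) standing in for the four education groups.
samples = {
    "elem": [140.0, 138.0, 145.0, 150.0, 142.0],
    "high": [130.0, 133.0, 128.0, 135.0, 131.0],
    "some_college": [129.0, 134.0, 127.0, 132.0, 130.0],
    "college": [126.0, 131.0, 125.0, 129.0, 128.0],
}

pairs = list(combinations(samples, 2))  # all 6 distinct pairs of group names
alpha = 0.05
adjusted_alpha = alpha / len(pairs)     # Bonferroni: split alpha across the tests

for a, b in pairs:
    # equal_var=False selects Welch's t-test (no equal-variance assumption).
    t_stat, p_val = stats.ttest_ind(samples[a], samples[b], equal_var=False)
    verdict = "differs" if p_val < adjusted_alpha else "no clear difference"
    print("%s vs %s: p = %.4f (%s)" % (a, b, p_val, verdict))
```

Bonferroni is conservative; it is used here only because it is simple to state and verify.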

#### Diastolic Blood Pressure

1) Hypotheses
  * **H<sub>0</sub>**: The population mean diastolic blood pressure is the same at every education level: μ<sub>1</sub> = μ<sub>2</sub> = μ<sub>3</sub> = μ<sub>4</sub>.
  * **H<sub>A</sub>**: The population mean diastolic blood pressure of at least one education level differs from the others, i.e. not all of μ<sub>1</sub>, μ<sub>2</sub>, μ<sub>3</sub>, μ<sub>4</sub> are equal.
  
  
2) Compute the **test statistic** (F-statistic/F-ratio) & **p-value**.

In [5]:
elem_dbp = data[data['educ'] == 1]['diabp']
high_dbp = data[data['educ'] == 2]['diabp']
some_college_dbp = data[data['educ'] == 3]['diabp']
college_dbp = data[data['educ'] == 4]['diabp']

In [6]:
f_diabp, fpval_diabp = stats.f_oneway(elem_dbp, high_dbp, 
                                      some_college_dbp, college_dbp)
print("One-way ANOVA F-statistic = %.2f" % f_diabp)
print("One-way ANOVA p-value = %.2f" % fpval_diabp)

One-way ANOVA F-statistic = 8.91
One-way ANOVA p-value = 0.00
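
The per-level slicing used for both outcomes can also be expressed once with `groupby`. The following is a sketch on a small made-up frame whose column names mirror the notebook's `data`; `anova_by_educ` is a hypothetical helper, not a function from the notebook or from SciPy.

```python
import pandas as pd
from scipy import stats

# Small illustrative frame; the values are made up.
toy = pd.DataFrame({
    "educ":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "sysbp": [140.0, 145.0, 150.0, 130.0, 134.0, 138.0,
              128.0, 131.0, 135.0, 124.0, 127.0, 130.0],
    "diabp": [90.0, 88.0, 92.0, 84.0, 86.0, 85.0,
              82.0, 83.0, 85.0, 80.0, 81.0, 79.0],
})


def anova_by_educ(frame, outcome):
    """Run a one-way ANOVA of `outcome` across the `educ` levels (hypothetical helper)."""
    # One array per education level; missing values are dropped so they
    # do not propagate into the result.
    groups = [g[outcome].dropna().values for _, g in frame.groupby("educ")]
    return stats.f_oneway(*groups)


for outcome in ["sysbp", "diabp"]:
    f_stat, p_val = anova_by_educ(toy, outcome)
    print("%s: F = %.2f, p = %.2e" % (outcome, f_stat, p_val))
```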


3) Results
  * test-statistic: **F-statistic** = 8.91
  * **p-value** < 0.005 (displayed as 0.00 after rounding to two decimal places)

The **p-value** is small (p-value < 0.05), providing substantial evidence against the null hypothesis (H<sub>0</sub>). Hence, H<sub>0</sub> is rejected in favor of the alternative hypothesis (H<sub>A</sub>).

Additionally, the test statistic (**F-statistic/ratio**) is well above 1, which is consistent with rejecting H<sub>0</sub>.

##### Conclusion
Not all of μ<sub>1</sub>, μ<sub>2</sub>, μ<sub>3</sub>, μ<sub>4</sub> are equal.

Based on the results, there is a statistically significant difference in mean diastolic blood pressure (mmHg) across the education levels (groups 1, 2, 3, 4). At least one group's mean differs from the others; the ANOVA alone does not identify which groups differ.