## ANOVA & Multiple Hypothesis Testing
---
- Binary data: comparing multiple proportions
- Categorical data: comparing multiple sets of categorical outcomes
- Continuous data: comparing multiple means

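ANOVA is used to compare three or more groups. To run an ANOVA, the explanatory variable must be categorical and the dependent variable must be continuous.<br>
ANOVA assumes that the population in each group is normally distributed; if that assumption does not hold, a nonparametric method should be used instead.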
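
One way to check the normality assumption and fall back to a nonparametric test is sketched below. The data frame `dat`, its columns `y` and `g`, and the choice of the Shapiro-Wilk and Kruskal-Wallis tests are assumptions made for illustration; the notes above do not name a specific method.

```r
# Hypothetical example data: three groups with skewed (exponential) responses
set.seed(1)
dat <- data.frame(
  y = c(rexp(30, rate = 1), rexp(30, rate = 1), rexp(30, rate = 0.5)),
  g = factor(rep(c("A", "B", "C"), each = 30))
)

# Shapiro-Wilk normality test within each group (a small p-value suggests non-normality)
normality_p <- tapply(dat$y, dat$g, function(v) shapiro.test(v)$p.value)
print(normality_p)

# Kruskal-Wallis rank-sum test: a rank-based alternative to one-way ANOVA
kw <- kruskal.test(y ~ g, data = dat)
print(kw)
```

With large samples the F test is fairly robust to mild non-normality, so the rank-based test matters most for small or strongly skewed groups.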

---

### ANOVA: classified as n-way ANOVA by the number of factors
|One-way ANOVA|Two-way ANOVA|Three-way ANOVA|
|-------------|-------------|---------------|
|ex) smoking status|ex) gender and smoking status|ex) gender, smoking and beer consumption|
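
In R, the number of factors simply shows up as the number of terms on the right-hand side of the model formula. The sketch below uses a made-up, balanced data frame `survey` with a hypothetical response `score`; only the factor names echo the table above.

```r
# Hypothetical balanced data: 2 genders x 3 smoking levels, 20 observations per cell
set.seed(42)
survey <- data.frame(
  gender  = factor(rep(c("F", "M"), each = 60)),
  smoking = factor(rep(c("never", "former", "current"), times = 40)),
  score   = rnorm(120, mean = 50, sd = 10)
)

# One-way ANOVA: a single factor
one_way <- aov(score ~ smoking, data = survey)

# Two-way ANOVA: two factors plus their interaction
two_way <- aov(score ~ gender * smoking, data = survey)

summary(one_way)
summary(two_way)
```

A three-way model would add a third factor in the same way, for example `score ~ gender * smoking * beer` with a hypothetical `beer` column.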

#### Hypothesis testing with ANOVA
* Is there truly a difference in means `across groups`?

| Hypothesis | Statement |
|----|-----------------------|
| H0 | all group means are equal |
| Ha | at least one group mean differs from the others |
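
Written out for $k$ groups, the hypotheses and the classical (equal-variance) F statistic look as follows. The symbols are introduced here for illustration: $n_i$ is the size of group $i$, $N$ the total sample size, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the overall mean.

$$
H_0:\ \mu_1 = \mu_2 = \cdots = \mu_k
\qquad
H_a:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
$$

$$
F = \frac{\sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2 / (k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 / (N-k)}
$$

A large $F$ means the variation between group means is large relative to the variation within groups, which counts as evidence against $H_0$.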

---

### One-way ANOVA

In [1]:
#rnorm(n, mean, sd) generates the random numbers used in the analysis

set.seed(12345)

df <- data.frame(norm = rnorm(400,0,10),
                t = rt(400,10),
                group = rep(1:2,200),
                group2 = c(rep('A',100),rep('B',100),rep('C',100),rep('D',100)))

head(df,10)

norm,t,group,group2
5.855288,0.66159902,1,A
7.09466,0.39241905,2,A
-1.093033,-0.61929381,1,A
-4.534972,-0.63997732,2,A
6.058875,0.09406951,1,A
-18.17956,0.94656523,2,A
6.300986,-0.93458005,1,A
-2.761841,-0.05962288,2,A
-2.841597,1.02120868,1,A
-9.19322,0.35627552,2,A


In [2]:
#Run a one-way ANOVA to test whether the mean of norm differs between the groups
oneway.test(norm~group, data = df)


	One-way analysis of means (not assuming equal variances)

data:  norm and group
F = 0.16256, num df = 1.00, denom df = 397.54, p-value = 0.687


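The p-value (0.687) is greater than .05, so the null hypothesis is not rejected: there is no evidence that the mean of `norm` differs between the two groups. Note that `oneway.test` does not assume equal variances by default (Welch's test), as the output header shows.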
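
The same data frame also contains `group2`, which has four levels and is closer to the typical ANOVA setting. The sketch below rebuilds `df` so that it runs on its own, then compares the Welch version with the classical equal-variance version; the variable names `welch`, `classic`, and `fit` are introduced here for illustration.

```r
# Rebuild the data frame from the first cell so the example is self-contained
set.seed(12345)
df <- data.frame(norm = rnorm(400, 0, 10),
                 t = rt(400, 10),
                 group = rep(1:2, 200),
                 group2 = factor(rep(c("A", "B", "C", "D"), each = 100)))

# Welch one-way test (default: equal variances not assumed)
welch <- oneway.test(norm ~ group2, data = df)

# Classical one-way ANOVA (equal variances assumed)
classic <- oneway.test(norm ~ group2, data = df, var.equal = TRUE)

# aov() fits the same classical model and gives the full ANOVA table
fit <- aov(norm ~ group2, data = df)

print(welch)
print(classic)
summary(fit)
```

The F statistic reported by `oneway.test(..., var.equal = TRUE)` matches the one in the `aov()` table, since both fit the same model.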

---
### Multiple Hypothesis testing
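* With α = 0.05, a single test has a 5% chance of a Type I error (false positive) when the null hypothesis is true
* For a single test this risk is negligible, but it grows rapidly when the test is repeated many times
> e.g. testing 10,000 genes at a cutoff of p = .05 is expected to produce about 500 false positives (10,000 × 0.05), even if no gene has a real effect<br>
> `A correction such as Bonferroni or FDR is therefore needed`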
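
The arithmetic behind this growth can be sketched in a few lines. The calculation assumes that the tests are independent and that every null hypothesis is true; the names `m`, `fwer`, and `expected_fp` are introduced here for illustration.

```r
alpha <- 0.05
m <- c(1, 10, 100, 10000)          # number of tests performed

# Probability of at least one false positive among m independent tests
fwer <- 1 - (1 - alpha)^m

# Expected number of false positives when every null hypothesis is true
expected_fp <- alpha * m

print(data.frame(tests = m, fwer = round(fwer, 4), expected_false_positives = expected_fp))
```

With 10 tests the chance of at least one false positive is already about 40%, and with 10,000 tests roughly 500 false positives are expected.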

#### Bonferroni correction
* A possible correction for multiple comparisons
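* Test each hypothesis at level α* = α/n: the cutoff is the significance level divided by the number of tests performed (equivalently, each p-value is multiplied by n and compared with α)
* The adjustment ensures that the overall (family-wise) Type I error rate does not exceed α = .05
* However, it is very conservative, so true effects are easily missed when n is large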
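
The guarantee follows from the union bound. In the sketch below, $I_0$ denotes the set of true null hypotheses and $n_0 \le n$ its size; both symbols are introduced here for illustration.

$$
\mathrm{FWER}
= P\Big(\bigcup_{i \in I_0} \{p_i \le \alpha/n\}\Big)
\le \sum_{i \in I_0} P(p_i \le \alpha/n)
\le n_0 \cdot \frac{\alpha}{n}
\le \alpha
$$

No independence assumption is needed, which is why the bound always holds and also why it tends to be conservative.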

In [3]:
Input = ("
Factor   Raw.p
 A        .001
 B        .01
 C        .025
 D        .05
 E        .1
")

In [4]:
df2 = read.table(textConnection(Input), header=TRUE)

In [5]:
df2

Factor,Raw.p
A,0.001
B,0.01
C,0.025
D,0.05
E,0.1


In [6]:
df2$Bonferroni = p.adjust(df2$Raw.p, method = "bonferroni")

In [7]:
df2

Factor,Raw.p,Bonferroni
A,0.001,0.005
B,0.01,0.05
C,0.025,0.125
D,0.05,0.25
E,0.1,0.5
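
The adjusted values above can be reproduced by hand: each raw p-value is multiplied by the number of tests and capped at 1. The sketch below uses its own vector `raw_p` with the same five values so that it runs independently of `df2`.

```r
raw_p <- c(0.001, 0.01, 0.025, 0.05, 0.1)
n <- length(raw_p)

# Bonferroni adjustment by hand: multiply by the number of tests, cap at 1
manual <- pmin(1, raw_p * n)

# Built-in equivalent
builtin <- p.adjust(raw_p, method = "bonferroni")

print(manual)
print(all.equal(manual, builtin))
```

Comparing an adjusted p-value with α gives the same decision as comparing the raw p-value with α/n.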


---
#### FDR
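* Corrects by controlling the false discovery rate; the adjusted p-values are often referred to as q-values
* Less strict about Type I errors than FWER-controlling methods such as the Bonferroni correction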
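
The difference between the two error rates can be stated compactly. Here $V$ is the number of false positives and $R$ the total number of rejected hypotheses; both symbols are introduced here for illustration.

$$
\mathrm{FWER} = P(V \ge 1)
\qquad
\mathrm{FDR} = E\!\left[\frac{V}{\max(R, 1)}\right]
$$

Because $V / \max(R, 1) \le \mathbf{1}\{V \ge 1\}$, it follows that $\mathrm{FDR} \le \mathrm{FWER}$: any procedure that controls the FWER at level α also controls the FDR at level α, while the reverse does not hold.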

In [8]:
df2$FDR = p.adjust(df2$Raw.p, method = "fdr")

In [9]:
df2

Factor,Raw.p,Bonferroni,FDR
A,0.001,0.005,0.005
B,0.01,0.05,0.025
C,0.025,0.125,0.04166667
D,0.05,0.25,0.0625
E,0.1,0.5,0.1
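
In `p.adjust`, `"fdr"` is an alias for the Benjamini-Hochberg (`"BH"`) method. A minimal sketch of that adjustment by hand is shown below, using its own vector `raw_p` with the same five values; the intermediate names are introduced here for illustration.

```r
raw_p <- c(0.001, 0.01, 0.025, 0.05, 0.1)
n <- length(raw_p)

ord   <- order(raw_p, decreasing = TRUE)   # indices from largest to smallest p
ranks <- n:1                               # rank of each sorted p (n for the largest)

# Scale each p by n / rank, then take a running minimum so that the
# adjusted values never increase as the raw p-value gets smaller
adj_sorted <- pmin(1, cummin(raw_p[ord] * n / ranks))

# Put the adjusted values back in the original input order
manual_bh <- adj_sorted[order(ord)]

print(manual_bh)
print(all.equal(manual_bh, p.adjust(raw_p, method = "fdr")))
```

For the smallest p-value the scaling factor is n, the same as Bonferroni, which is why factor A gets 0.005 under both methods, while larger p-values are scaled by progressively smaller factors.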
