# MATH 3350 Course Notes - Module S7

## Type I & Type II Errors

Statistical conclusions are based on _probability_ and _strength of evidence_.  For instance, a p-value of 0.06 will be considered significant at $\alpha = 0.1$ but not at $\alpha = 0.05$, leading to different conclusions depending on the strength of evidence required.  
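
As a quick illustration of that comparison, the sketch below checks a single p-value against both significance levels. The variable names are illustrative only.

```r
# Compare one p-value against two significance levels
p_value <- 0.06
alphas  <- c(0.10, 0.05)

# TRUE means "reject H0" at that alpha
decision <- p_value < alphas
names(decision) <- paste0("alpha = ", alphas)
decision
```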

The table below shows the four possible outcomes of a statistical test.

<span style="color:blue">Decision</span> | $H_0$ is TRUE | $H_0$ is FALSE
---------|---------------|---------------
<span style="color:blue">Reject $H_0$</span> |**<span style="color:red">NOT INTENDED</span>** | **<span style="color:green">Desired Outcome</span>**
<span style="color:blue">Fail to reject $H_0$</span> | **<span style="color:green">Desired Outcome</span>** | **<span style="color:red">NOT INTENDED</span>**

Outcomes that are not intended are called "Errors", not because the test was conducted or interpreted incorrectly, but because the inherent uncertainty of probability will sometimes lead to the wrong conclusion, even when the procedure is followed correctly.  These errors are called Type I and Type II Errors, as identified below.  

<span style="color:blue">Decision</span> | $H_0$ is TRUE | $H_0$ is FALSE
---------|---------------|---------------
<span style="color:blue">Reject $H_0$</span> |**<span style="color:red">Type I Error</span>** | 
<span style="color:blue">Fail to reject $H_0$</span> |  | **<span style="color:red">Type II Error</span>**

We reject $H_0$ when $p < \alpha$ because $p$ represents the probability of obtaining a result at least as extreme as our sample _when the null hypothesis is true_.  When $H_0$ is actually true, the probability of drawing a sample with $p < \alpha$, and therefore rejecting $H_0$, is $\alpha$.  This means $\alpha$ is the probability of a Type I Error.  

We use $\beta$ to represent the probability of a Type II Error.  When $H_0$ is true, the possible outcomes have probabilities $\alpha$ and its complement, $1-\alpha$. When $H_0$ is false, the possible outcomes have probabilities $\beta$ and its complement, $1-\beta$.  These probabilities are given in the table below.  

| PROBABILITIES |
|---------------|

<span style="color:blue">Decision</span> | $H_0$ is TRUE | $H_0$ is FALSE
---------|---------------|---------------
<span style="color:blue">Reject $H_0$</span> |**<span style="color:red">$\alpha$</span>** | **<span style="color:green">$1-\beta$</span>**
<span style="color:blue">Fail to reject $H_0$</span> | **<span style="color:green">$1-\alpha$</span>** | **<span style="color:red">$\beta$</span>**
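
One way to make these four probabilities concrete is to simulate many studies and count how often $H_0$ is rejected. The sketch below is only an illustration under assumed settings (a one-sample t-test, normally distributed data, $n = 25$, and a true mean of 0.5 standard deviations when $H_0$ is false); none of these values come from the course data.

```r
# Estimate both columns of the probability table by simulation
set.seed(3350)
alpha  <- 0.05
n      <- 25      # sample size of each simulated study (assumed)
n_sims <- 5000    # number of simulated studies

# Returns TRUE when a one-sample t-test of H0: mu = 0 rejects at level alpha
rejects <- function(true_mean) {
  x <- rnorm(n, mean = true_mean, sd = 1)
  t.test(x, mu = 0)$p.value < alpha
}

# H0 true (mu = 0): the rejection rate estimates alpha
type1_rate <- mean(replicate(n_sims, rejects(0)))

# H0 false (mu = 0.5): the rejection rate estimates 1 - beta
power_est <- mean(replicate(n_sims, rejects(0.5)))

cat("Estimated Type I error rate:", type1_rate, "\n")
cat("Estimated 1 - beta:         ", power_est, "\n")
cat("Estimated beta:             ", 1 - power_est, "\n")
```

The first estimate should land close to the chosen $\alpha$, while the second depends on how far the true mean is from the null value and on the sample size.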

### Selecting $\alpha$

The value we select for $\alpha$ governs how likely we are to reject $H_0$.  The following table identifies several consequences of lowering or raising $\alpha$ values.

DECREASE $\alpha$ values | INCREASE $\alpha$ values
----------------------|-----------------------
Harder to reject $H_0$ | Easier to reject $H_0$
Type I Error less likely | Type I Error more likely
Type II Error more likely | Type II Error less likely
$\beta$ _increases_ | $\beta$ _decreases_
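
The trade-off in the table can be checked numerically. The sketch below uses base R's `power.t.test()` with made-up inputs (a one-sample test, $n = 25$, a true difference of 0.5 standard deviations) and changes only the significance level.

```r
# Beta for a one-sample t-test at several alpha levels (n, delta, sd held fixed)
alphas <- c(0.10, 0.05, 0.01)

power_at <- sapply(alphas, function(a) {
  power.t.test(n = 25, delta = 0.5, sd = 1, sig.level = a,
               type = "one.sample", alternative = "two.sided")$power
})

# beta is the complement of the probability of rejecting a false H0
tradeoff <- data.frame(alpha = alphas,
                       reject_prob = round(power_at, 3),
                       beta = round(1 - power_at, 3))
tradeoff
```

With everything else held fixed, the `beta` column grows as `alpha` shrinks.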


## Power of a Statistical Test

The power of a test is the probability that the test correctly rejects $H_0$ when the null hypothesis is false.  

As we can see in the probabilities given above:

<center>
POWER = $1-\beta$
</center>

The following table gives historically accepted thresholds for power, as established by Cohen (1969). 
    
Power | Rating
------|--------
0.6 | Poor
0.7 | Adequate
0.8 | Good
0.9 | Excellent

However, recent researchers (_e.g., Correll et al., 2020_) have suggested that fixed levels for power are arbitrary and not meaningful, and that the desired power chosen should be contextual (as the choice for $\alpha$ is contextual).


### Effect Size 
The calculation of power depends on an important construct known as **_effect size_**, most commonly measured by Cohen's $d$.  In general, if $G$ is the gap (or "gain" in some contexts) between the sample statistic and the value stated in the null hypothesis, Cohen's $d$ is computed as:

<center>
$d=\frac{G}{s}$
</center>
    
where $s$ represents the standard deviation of the sample (pooled in 2-sample scenarios). 
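
For the 2-sample case, one common way to pool is to weight each group's variance by its degrees of freedom, $s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$. The sketch below applies this to two small made-up samples; the function name `cohens_d` and the data are hypothetical and not part of the course materials.

```r
# Cohen's d for two independent samples using the pooled standard deviation
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  # Pooled SD: weight each group's variance by its degrees of freedom
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / s_pooled
}

# Made-up data for illustration only
group_a <- c(12, 15, 11, 14, 13)
group_b <- c(10, 9, 12, 11, 8)
cohens_d(group_a, group_b)
```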

Again, Cohen proposed the following thresholds for effect size. These thresholds are still treated as convention.

$d$ | Effect Size 
------|--------
0.2 | Small
0.5 | Medium
0.8 | Large


#### Example 1. Pre-Test and Post-Test Scores
Suppose a teacher gives students a pre-test before a lesson and a post-test after the lesson. A hypothesis test to determine the average learning gains associated with this lesson would be a **_matched pairs_** scenario with these hypotheses:

<center>
$H_{0}: \mu_\delta = 0$  
</center>
<center>
$H_{a}: \mu_\delta > 0$
</center>

where $\delta = \text{post} - \text{pre}$ 

Suppose our sample resulted in the following statistics:

Sample statistic | Value
---------------|-----
$\overline{x}_\delta$ | 12.4
$s_\delta$ | 10.7

This would result in the following effect size:

<center>
$d=\frac{12.4}{10.7} \approx 1.16$
</center>
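
The same arithmetic can be done in R. In the sketch below, the labels treat Cohen's thresholds as lower cut points (for example, anything at or above 0.8 is labeled "large"), which is an assumption made here for illustration.

```r
# Effect size for the pre-test/post-test example
mean_gain <- 12.4   # sample mean of (post - pre)
sd_gain   <- 10.7   # sample standard deviation of (post - pre)

d <- mean_gain / sd_gain

# Label d using Cohen's conventional cut points (0.2, 0.5, 0.8)
label <- cut(abs(d),
             breaks = c(0, 0.2, 0.5, 0.8, Inf),
             labels = c("below small", "small", "medium", "large"),
             right = FALSE)

cat("d =", round(d, 2), "->", as.character(label), "\n")
```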




The following 4 values are all part of the equation when conducting a power analysis:
* Sample size
* Significance level
* Effect size
* Power

R has a package, "pwr", that will compute any one of the above values when the other 3 are given.
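
The same "supply three, solve for the fourth" pattern can be tried without installing anything, since base R's `stats` package includes `power.t.test()`. The sketch below is illustrative only: the inputs (30 per group, effect size 0.5) are made-up values, and `sd = 1` is set so that `delta` plays the role of Cohen's $d$.

```r
# In each call, exactly one of n, delta, power is left out; power.t.test() solves for it.
# Defaults used here: two-sample test, two-sided alternative.

# Solve for power, given n per group, effect size, and significance level
res_power <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)

# Solve for n per group, given effect size, significance level, and power
res_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)

# Solve for the detectable effect size, given n, significance level, and power
res_delta <- power.t.test(n = 30, sd = 1, sig.level = 0.05, power = 0.8)

cat("Power with n = 30 per group:", round(res_power$power, 3), "\n")
cat("n per group for power 0.8:  ", ceiling(res_n$n), "\n")
cat("Detectable effect at n = 30:", round(res_delta$delta, 3), "\n")
```

Sample sizes are rounded up because a fractional observation cannot be collected.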

#### Example 2. Car Speeds (2 Independent Samples)
We will illustrate with the mtcars data set, comparing quarter-mile times (`qsec`) between automatic (`am = 0`) and manual (`am = 1`) vehicles.

In [None]:
# Examine quarter mile time in seconds (qsec) by transmission type
boxplot(qsec~am, data=mtcars, horizontal=TRUE)


In [None]:
#Examine spread of each group and spread overall
sd(mtcars$qsec[mtcars$am==0])
sd(mtcars$qsec[mtcars$am==1])
sd(mtcars$qsec)

In [None]:
#Compare Student and Welch t-test results
t.test(qsec ~ factor(am), data=mtcars, var.equal=TRUE)
t.test(qsec ~ factor(am), data=mtcars)


The two group standard deviations are similar and the Student and Welch results agree closely, so pooled variance is reasonable in this scenario. This will make effect size more straightforward to compute.

In [None]:
# Run t test and store the results for future reference
tt <- t.test(qsec ~ factor(am), data=mtcars, var.equal=TRUE)

# View summary of the t.test object we just created (we named the result tt)
tt



In [None]:
# Show elements of the t.test object called tt
str(tt)

First, we will load the 'pwr' package.  You *may* need to install it first; if so, temporarily un-comment the 'install' line.

In [None]:
#install.packages("pwr")   #Run this command AT MOST one time, IF the command below fails to load the "pwr" package
library(pwr)

Below we collect the information needed for a power analysis and use _**pwr**_ to determine the power of the t-test we have been looking at. Remember that power is the probability of rejecting $H_0$ when $H_0$ is false.

In [None]:
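n1 <- length(mtcars$qsec[mtcars$am==0])
n2 <- length(mtcars$qsec[mtcars$am==1])
cat("Sample sizes: ", n1, n2, "\n")

# Difference in group means (automatic minus manual), taken from the t.test object
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])
cat("Difference in Means: ", mean_diff, "\n")

# Pooled standard deviation (the s in Cohen's d for 2-sample scenarios)
s1 <- sd(mtcars$qsec[mtcars$am==0])
s2 <- sd(mtcars$qsec[mtcars$am==1])
s_pooled <- sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))

effect <- mean_diff/s_pooled
cat("Effect size: ", effect, "\n")

#pwr.t2n.test(n1, n2, d, sig.level = 0.05, power, alternative = "two.sided") 
# n1, n2: sizes of samples
# d: effect size (Cohen's d)
# sig.level: significance level (default 0.05)
# power: power of the test (left out here, so it is the value computed)
# alternative: direction of Ha (less, greater, or two.sided; default is two.sided)
pwr.t2n.test(n1 = n1, n2 = n2, d = effect)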

Below are some additional ways to use the **_pwr_** package.

In [None]:
#How large should samples be to detect an effect size this large?

#pwr.t.test(n, d, sig.level = 0.05, power, type, alternative = "two.sided")
# n: size of EACH sample (assumes equal sample sizes for 2-sample tests)
# d: effect size (Cohen's d)
# sig.level: significance level (default 0.05)
# type: type of test (one.sample, two.sample, or paired)
# alternative: direction of Ha (less, greater, or two.sided; default is two.sided)

#Target: power of 0.7 (Looking for sample size)
pwr.t.test(d=effect, power=0.7, type="two.sample")

In [None]:
#Target: power of 0.8 (Looking for sample size)
pwr.t.test(d=effect, power=0.8, type="two.sample")

In [None]:
#Target: power of 0.8 at significance level of 0.01 (Looking for sample size)
pwr.t.test(d=effect, sig.level = 0.01, power=0.8, type="two.sample")

In [None]:
#Target: power of 0.8 with sample size of 100 (Looking for significance level)
pwr.t.test(n=100, d=effect, sig.level=NULL, power=0.8, type="two.sample")

In [None]:
#Target: power of 0.8 with sample size of 50 and significance level 0.01 (Looking for effect size needed)
pwr.t.test(n=50, sig.level=0.01, power=0.8, type="two.sample")