<img src="materials/images/introduction-to-statistics-II-cover.png"/>


# 👋 Welcome, before you start
<br>

### 📚 Module overview

We will go through eleven lessons with you:
    
- [**Lesson 1: Z-score**](Lesson_1_Z-score.ipynb)

- [**Lesson 2: P-value**](Lesson_2_P-value.ipynb)

- [**Lesson 3: Welch's T-test**](Lesson_3_Welchs_T-test.ipynb)

- [**Lesson 4: Log2 Fold Change**](Lesson_4_Log2_Fold_Change.ipynb)

- [**Lesson 5: Pearson Correlation**](Lesson_5_Pearson_Correlation.ipynb)

- [**Lesson 6: Spearman Correlation**](Lesson_6_Spearman_Correlation.ipynb)

- <font color=#E98300>**Lesson 7: False Discovery Rate**</font>    `📍You are here.`

- [**Lesson 8: Benjamini Hochberg**](Lesson_8_Benjamini_Hochberg.ipynb)

- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>



<div class="alert alert-block alert-info">
<h3>⌨️ Keyboard shortcuts</h3>

These common shortcuts can save you time as you go through this notebook:
- Run the current cell: **`Shift + Enter`**.
- Add a cell above the current cell: Press **`A`**.
- Add a cell below the current cell: Press **`B`**.
- Change a code cell to markdown cell: Select the cell, and then press **`M`**.
- Delete a cell: Press **`D`** twice.

Need more help with keyboard shortcuts? Press **`H`** to look them up.
</div>

---

# Lesson 7: False Discovery Rate

`🕒 This module should take about 15 minutes to complete.`

`✍️ This notebook is written using Python.`

<mark>**False Discovery Rate (FDR)**</mark> is a measure of accuracy when multiple hypotheses are being tested at once. 

In classical statistical testing, we begin with the **null hypothesis** as the formal basis for testing statistical significance. The null hypothesis states that there is no association between the predictor and outcome variables in the population. By starting with the proposition that there is no association, statistical tests can estimate the probability that an observed association could be due to chance. 

After a study is completed, based on the data collected, the investigator uses statistical tests to determine whether there is sufficient evidence to reject the null hypothesis in favor of the **alternative hypothesis** that there is an association in the population.

When running a statistical test, any time a null hypothesis is rejected, the result is considered a "significant" finding, or a "discovery": the measured difference would be highly unlikely if random chance alone were at work, which suggests that the treatment is influencing the metric. Outcomes that do not reach significance are not considered a "discovery" since we aren't able to reject the null hypothesis.

<div class="alert alert-block alert-warning">
    <b>Alert:</b>  The alternative hypothesis cannot be tested directly. It is implicitly accepted when the test of statistical significance rejects the null hypothesis.
</div>

Importantly, an investigator’s conclusion may be wrong. Sometimes, by chance alone, a sample is not representative of the population. Thus, the results in the sample do not reflect reality in the population, and the random error leads to an erroneous inference. A false positive (type I error) occurs if an investigator rejects a null hypothesis that is actually true.

## Level of statistical significance 

When conducting hypothesis tests, for example to see whether two means are significantly different, we calculate a p-value, which is the probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis is true.
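To make this definition concrete, here is a minimal sketch that converts a z-score (see Lesson 1) into a two-sided p-value. It assumes the test statistic follows a standard normal distribution under the null hypothesis; the helper name `two_sided_p_from_z` is ours and is used only for illustration.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a z-score, assuming a standard normal null distribution."""
    # P(|Z| >= |z|) for Z ~ N(0, 1), written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

# The larger the z-score in absolute value, the smaller the p-value
for z in (0.5, 1.96, 3.0):
    print(f"z = {z}: p = {two_sided_p_from_z(z):.4f}")
```

A z-score of about 1.96 gives a p-value very close to 0.05, which is why that value often appears as a cutoff for two-sided tests.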

The level of statistical significance for rejecting the null hypothesis is typically set at 0.05. We reject the null hypothesis when the p-value falls below this threshold. In other words, for each individual test we've set <mark>5% as the maximum chance of **incorrectly rejecting a true null hypothesis, that is, having a false positive**</mark>.

The **False Discovery Rate** is the proportion of all outcomes deemed to be significant that are <mark>falsely significant</mark>. Only the positive outcomes matter for the false discovery rate. False negatives don't influence the false discovery rate.

## The multiple testing problem

<img src="materials/images/images_false_discovery_rate/genomic_testing.png"/>

False Discovery Rate comes into play particularly when lots of hypothesis tests are being conducted. For example, when analyzing results from genome-wide studies, a typical microarray experiment might result in performing 10,000 separate hypothesis tests. If we use a p-value threshold of 0.05, then even if no gene were truly differentially expressed, we’d expect about 500 genes (5% of 10,000) to be deemed “significant” by chance alone. The implication is that if you run enough tests, you’re going to find apparent effects even where none actually exist. This is called **the multiple testing problem**.
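The 500-gene figure can be checked with a small simulation. This is a sketch under simplifying assumptions: every null hypothesis is true, the tests are independent, and each p-value is drawn uniformly between 0 and 1 (which is how p-values behave under a true null for continuous test statistics). The variable and function names are ours.

```python
import random

alpha = 0.05       # significance threshold
n_tests = 10_000   # number of genes tested, all with a true null hypothesis

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumption: under a true null, each p-value is uniform on [0, 1]
p_values = [rng.random() for _ in range(n_tests)]

false_positives = sum(p < alpha for p in p_values)
expected_false_positives = alpha * n_tests

def prob_at_least_one_false_positive(m, alpha=0.05):
    """Chance of one or more false positives across m independent true-null tests."""
    return 1 - (1 - alpha) ** m

print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"Simulated false positives: {false_positives}")
print(f"P(at least one false positive in 100 tests): "
      f"{prob_at_least_one_false_positive(100):.3f}")
```

The simulated count should land close to 500 even though no real effect exists anywhere in the data. The last line shows how quickly the chance of at least one false positive approaches certainty as the number of tests grows.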

---

## Example calculation of the false discovery rate (FDR)

<img src="materials/images/images_false_discovery_rate/fdr_equation.png"/>

Imagine that we are doing a genome-wide study looking at differential gene expression between tumor tissue and healthy tissue, and we tested 1000 genes. The image below shows that for 950 (95%) of the genes the null hypothesis is actually true, and for 50 (5%) of the genes the null hypothesis is actually false (a truly "significant" difference exists).

Of the 950 observations where the null hypothesis was actually true, 19 were incorrectly rejected, or deemed as significant (box at bottom left). 

Of the 50 observations that were truly significant, 45 were correctly identified as significant (box at bottom right). 

<img src="materials/images/images_false_discovery_rate/fdr_figure.png"/>

Therefore, out of the 1000 tests, our analysis identified 45 true positive results and 19 false positive results for a total of 64 positive results. Of these results, 19/64 are false positives, so the false discovery rate is about **30%**: the percentage of the rejected null hypotheses that were erroneously rejected.
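The arithmetic above can be reproduced in a few lines of Python. The counts come directly from the example; the variable names are ours. As a point of comparison, the sketch also computes the false positive rate (false positives divided by all true nulls), which is a different quantity from the false discovery rate.

```python
# Counts taken from the worked example above
true_nulls = 950       # genes where the null hypothesis is actually true
false_nulls = 50       # genes where the null hypothesis is actually false
false_positives = 19   # true nulls that were incorrectly rejected
true_positives = 45    # false nulls that were correctly rejected

discoveries = false_positives + true_positives   # everything called "significant"
fdr = false_positives / discoveries              # share of discoveries that are false

# For contrast: share of true nulls that were rejected
false_positive_rate = false_positives / true_nulls

# False negatives exist, but they do not appear anywhere in the FDR formula
false_negatives = false_nulls - true_positives

print(f"Discoveries: {discoveries}")
print(f"False discovery rate: {fdr:.1%}")
print(f"False positive rate: {false_positive_rate:.1%}")
print(f"False negatives (not used in FDR): {false_negatives}")
```

Notice that only 2% of the true nulls were rejected, yet roughly 30% of the discoveries are false. When true effects are rare, even a small false positive rate can produce a large false discovery rate.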

Once again, the **False Discovery Rate** is the proportion of all outcomes deemed to be significant that are <mark>falsely significant</mark>. 

<div class="alert alert-block alert-warning">
    <b>Alert:</b> Only the positive outcomes matter for the false discovery rate--only false positives or true positives. False negatives don't influence the false discovery rate.
</div>

<div class="alert alert-block alert-success">
    <b>Note:</b> There are various ways to control the False Discovery Rate. A common approach is known as the <b>Benjamini-Hochberg procedure</b>.
</div>

---

# 🌟 Ready for the next one?
<br>

- [**Lesson 8: Benjamini Hochberg**](Lesson_8_Benjamini_Hochberg.ipynb)

- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>

# Contributions & acknowledgment

Thanks to Antony Ross for contributing the content for this notebook.

---

Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.