<img src="materials/images/introduction-to-statistics-II-cover.png"/>


# 👋 Welcome, before you start
<br>

### 📚 Module overview

We will go through eleven lessons with you:
    
- [**Lesson 1: Z-score**](Lesson_1_Z-score.ipynb)

- [**Lesson 2: P-value**](Lesson_2_P-value.ipynb)

- [**Lesson 3: Welch's T-test**](Lesson_3_Welchs_T-test.ipynb)

- [**Lesson 4: Log2 Fold Change**](Lesson_4_Log2_Fold_Change.ipynb)

- [**Lesson 5: Pearson Correlation**](Lesson_5_Pearson_Correlation.ipynb)

- [**Lesson 6: Spearman Correlation**](Lesson_6_Spearman_Correlation.ipynb)

- [**Lesson 7: False Discovery Rate**](Lesson_7_False_Discovery_Rate.ipynb)

- <font color=#E98300>**Lesson 8: Benjamini Hochberg**</font>    `📍You are here.`

- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>



<div class="alert alert-block alert-info">
<h3>⌨️ Keyboard shortcut</h3>

These common shortcuts could save you time going through this notebook:
- Run the current cell: **`Shift + Enter`**.
- Add a cell above the current cell: Press **`A`**.
- Add a cell below the current cell: Press **`B`**.
- Change a code cell to a markdown cell: Select the cell, and then press **`M`**.
- Delete a cell: Press **`D`** twice.

Need more help with keyboard shortcuts? Press **`H`** to look them up.
</div>

---

# Lesson 8: Benjamini Hochberg

`🕒 This module should take about 15 minutes to complete.`

`✍️ This notebook is written using Python.`

One powerful tool for controlling the false discovery rate is the **Benjamini-Hochberg Procedure**. In hypothesis testing, the **false discovery rate (FDR)** is the expected proportion of <mark>falsely significant</mark> outcomes among all the outcomes that are deemed significant.

It is a measure of accuracy when multiple hypotheses are being tested at once. 

<img src="materials/images/images_benjamini_hochberg/fdr_equation.png"/>

The more hypotheses you test, the higher the chance that at least one true null hypothesis is falsely identified as significant. A p-value of 0.05 means that, if the null hypothesis were true, there would be only a 5% chance of getting a result at least as extreme as the one you observed. That chance is small but not zero, so across many tests some true null hypotheses will cross the threshold.

For example, let’s say you have a group of 1000 genes that you know are free of a certain condition. For each gene, your null hypothesis is that the gene is free of the condition, and your alternative hypothesis is that the condition is present. If you ran 1000 statistical tests (one per gene) at the 0.05 significance level, roughly 50 (5%) of your results would falsely be identified as significant.
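
To make the "roughly 50 out of 1000" figure concrete, here is a minimal simulation sketch. It assumes that under a true null hypothesis a valid test's p-value is uniformly distributed between 0 and 1, so it draws p-values directly instead of running real tests. The variable names and the seed are chosen only for illustration.

```python
import random

# Assumption: under a true null hypothesis, p-values are uniform on [0, 1],
# so we draw them directly rather than running 1000 real tests.
random.seed(0)  # fixed seed so the run is repeatable

n_tests = 1000
alpha = 0.05  # significance level

null_p_values = [random.random() for _ in range(n_tests)]

# Every "significant" result here is a false positive,
# because all 1000 null hypotheses are true by construction.
false_positives = sum(p < alpha for p in null_p_values)

print(f"{false_positives} of {n_tests} true nulls were called significant")
```

The exact count varies with the seed, but it should land near 50. Note that in this all-null setting every significant result is a false discovery.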

<img src="materials/images/images_benjamini_hochberg/hypothesis_testing.png"/>

Thus, some false positives are unavoidable and will occur simply because of random variation. However, it's important to control the proportion of false positives among the set of rejected hypotheses. The goal is a method that lets the investigator identify as many significant comparisons as possible while still maintaining a low false discovery rate.

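The <mark>**Benjamini-Hochberg Procedure**</mark> calculates a critical value for each test based on the rank of its p-value, and compares each p-value against its critical value. A closely related quantity is the **q-value**: the smallest false discovery rate at which a given test would still be called significant. Thresholding on q-values instead of raw p-values usually results in far fewer false positives among the results you report.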

---

## The Benjamini-Hochberg procedure is performed as follows:

**To control FDR at level Q** (your chosen false discovery rate):

**Step 1**: Conduct all of your statistical tests, and find the p-value for each test.

<img src="materials/images/images_benjamini_hochberg/chart_1.png"/>

**Step 2**: Arrange the p-values in order <mark>**from smallest to largest**</mark>, assigning a rank to each one with <mark>**the smallest p-value having a rank of 1**</mark>, etc.

<img src="materials/images/images_benjamini_hochberg/chart_2.png"/>
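
As a rough sketch of Step 2 in Python, the snippet below sorts a list of p-values and attaches ranks starting at 1. The p-values are hypothetical and are not the ones shown in the chart above.

```python
# Hypothetical p-values for illustration only
p_values = [0.21, 0.003, 0.04, 0.6, 0.012]

# Sort from smallest to largest and attach ranks starting at 1
ranked = [(rank, p) for rank, p in enumerate(sorted(p_values), start=1)]

for rank, p in ranked:
    print(rank, p)
```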

**Step 3**: Calculate the Benjamini-Hochberg critical value for each p-value, using the formula <mark>**(i/m)*Q**</mark>


>   i = the individual p-value’s rank

>   m = total number of tests

>   Q = your chosen false discovery rate

Below, we use 10 tests with Q set to a desired false discovery rate of 10% (0.1).

<img src="materials/images/images_benjamini_hochberg/chart_3.png"/>
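
The critical values for this setup (m = 10, Q = 0.1) can be reproduced with a few lines of Python. This is only a sketch of the formula `(i/m)*Q`; the rounding is added to hide floating-point noise.

```python
m = 10    # total number of tests
Q = 0.1   # chosen false discovery rate

# B-H critical value for rank i is (i / m) * Q
critical_values = [round(i / m * Q, 4) for i in range(1, m + 1)]

for i, cv in enumerate(critical_values, start=1):
    print(f"rank {i}: critical value {cv}")
```

The critical values grow linearly with rank, from 0.01 for the smallest p-value up to Q itself for the largest.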

**Step 4**: Find the largest p-value that is less than or equal to its own B-H critical value.

<img src="materials/images/images_benjamini_hochberg/chart_4.png"/>

**Step 5**: Designate every p-value that is smaller than or equal to the p-value identified above to be significant (the null hypothesis can be rejected). This includes any smaller p-values that happen to exceed their own critical values.

<img src="materials/images/images_benjamini_hochberg/chart_5.png"/>
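
The five steps can be put together in a short function. The sketch below is one possible plain-Python implementation, not the notebook's official code; the function name `benjamini_hochberg` and the p-values are hypothetical and chosen for illustration.

```python
def benjamini_hochberg(p_values, Q):
    """Return a list of booleans: True where the test is called significant."""
    m = len(p_values)

    # Step 2: indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda idx: p_values[idx])

    # Steps 3 and 4: find the largest rank i whose p-value is <= (i / m) * Q
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * Q:
            cutoff_rank = rank

    # Step 5: every test ranked at or below the cutoff is significant
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values from 10 tests (Step 1 is assumed to be done already)
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.061, 0.074, 0.205, 0.212, 0.216]

significant = benjamini_hochberg(p_values, Q=0.1)

for p, sig in zip(p_values, significant):
    print(p, "significant" if sig else "not significant")
```

With these numbers the cutoff falls at rank 5 (0.042 ≤ 0.05), so the first five tests are significant, including 0.039 and 0.041 even though they exceed their own critical values of 0.03 and 0.04. If `statsmodels` is available, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` provides a library implementation that can be used instead.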

The Benjamini-Hochberg procedure controls the false discovery rate so that FDR ≤ Q when the tests are independent (or positively dependent). Choosing Q lets us decide what proportion of false positives we are willing to accept among all of the tests that we call significant.

---

## P-values vs. Q-values

<div class="alert alert-block alert-warning">
    <b>Alert:</b> A <b>q-value</b> can be interpreted as follows. In an array study testing genes for differential expression, if gene X has a q-value of 0.028, we expect about 2.8% of the genes with <b>p-values</b> at least as small as gene X's to be false positives.
</div>
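
One common way to obtain q-values in the Benjamini-Hochberg setting is to compute the B-H adjusted p-value: for the p-value at rank i, take the minimum of `p * m / rank` over rank i and all larger ranks. The sketch below assumes that definition; the function name `bh_q_values` and the p-values are hypothetical.

```python
def bh_q_values(p_values):
    """Return B-H adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])

    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum
    # of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical p-values from 10 tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.061, 0.074, 0.205, 0.212, 0.216]

q_values = bh_q_values(p_values)

for p, q in zip(p_values, q_values):
    print(f"p = {p:.3f}  q = {q:.3f}")
```

Calling a test significant when its q-value is at most Q gives the same set of significant tests as running the step-by-step procedure at that Q.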

<div class="alert alert-block alert-success">
    <b>Note:</b> The significance threshold we set for our <mark>p-value</mark> controls what is known as our <b>False Positive Rate</b> (the expected proportion of all truly null hypotheses that are falsely identified to be significant).
    
Similarly, we set a threshold for our <mark>q-value</mark> which controls our <b>False Discovery Rate</b> (the expected proportion of all hypotheses deemed to be significant that are falsely identified to be significant).
</div>
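
The difference between the two rates comes down to the denominator. The sketch below uses made-up counts from a single hypothetical experiment to show the observed proportions; the rates themselves are expectations over many such experiments.

```python
# Hypothetical outcome of 1000 tests (counts made up for illustration)
true_nulls = 900        # tests where the null hypothesis is actually true
false_positives = 45    # true nulls that were called significant
true_positives = 80     # real effects that were called significant

discoveries = false_positives + true_positives

# False positive proportion: false positives among all true nulls
false_positive_rate = false_positives / true_nulls

# False discovery proportion: false positives among all significant results
false_discovery_rate = false_positives / discoveries

print(f"False positive rate:  {false_positive_rate:.3f}")
print(f"False discovery rate: {false_discovery_rate:.3f}")
```

In this example only 5% of the true nulls were falsely called significant, yet 36% of the significant results are false, which is why a p-value threshold alone can leave many false discoveries.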

---

# 🌟 Ready for the next one?
<br>


- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>

---

# Contributions & acknowledgment

Thanks to Antony Ross for contributing the content for this notebook.

---

Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.