# Project 1 - Statistical testing and multiple testing correction

Notebook version: `25.0` (please don't change)


Alzheimer's disease (AD) is the most common form of dementia. AD is a progressive neurodegenerative disease characterized by loss of cognitive functions and autonomy, eventually leading to death. Genome-wide gene expression profiling of the brains of individuals with AD can provide insight into how they differ from cognitively healthy individuals.
Hokama et al. (Cerebral Cortex, Volume 24, Issue 9, September 2014, Pages 2476-2488) measured the genome-wide (RNA) expression profiles of 79 individuals (32 with Alzheimer's) from three brain regions: frontal cortex, temporal cortex and hippocampus, using Affymetrix Human Gene 1.0ST arrays.

For the project you should use the data provided to:

1. Analyze the data distributions in the two sets of samples (disease and control) in order to motivate your choice of statistical test.
2. Perform a differential expression test to find genes that are differentially expressed between individuals with and without AD.
3. Apply a multiple testing correction to account for the number of tests you performed and obtain corrected p-values.
4. Perform an enrichment analysis to find functions that are enriched or depleted among the differentially expressed genes, and reason about how these functions relate to AD.

Hint: For functional enrichment analysis of genes you can make use of PANTHER or Enrichr (http://amigo.geneontology.org/amigo, https://maayanlab.cloud/Enrichr).
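
For step 3, one commonly used option is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a minimal NumPy version applied to made-up p-values; the function name `benjamini_hochberg` is hypothetical, and whether FDR control (rather than, for example, a Bonferroni correction) suits your analysis is a choice you should motivate yourself.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)  # indices that sort the p-values ascending
    # Scale each sorted p-value by (number of tests / rank).
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: an adjusted p-value may not exceed that of a larger raw p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)  # back to the original order
    return adjusted


# Made-up p-values, purely for illustration.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_p)
print(example_adjusted)
```

Genes whose adjusted p-value falls below your chosen threshold would then be called differentially expressed. Library implementations of this correction exist as well, and using one is usually preferable once you have understood what it does.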

---
 
The results should be summarized in a poster. Make sure that you: motivate the choices you made during the analyses (aim of the performed analysis, type of algorithm, parameter settings etc.); explain and discuss your findings; explain what is represented in figures (what is on the axes etc.); and discuss how your results relate to the original paper of Hokama et al.

---

## Dataset

The gene expression data from Hokama et al. can be found in `Alzheimer_dataset.csv`. This is a CSV file in which the rows represent the measured genes and the columns represent the individuals. Information about the individuals can be found in `Alzheimer_metadata.csv`.

In [None]:
!wget -nc https://github.com/brmprnk/LB2292/raw/main/project1/Alzheimer_dataset.csv
!wget -nc https://github.com/brmprnk/LB2292/raw/main/project1/Alzheimer_metadata.csv

In [6]:
import pandas as pd
import numpy as np

# ... here you can add more imports, like the ones from the Lab modules if you need them!

# Below we read both csv files and store them in two pandas dataframes.
data = pd.read_csv("Alzheimer_dataset.csv", delimiter=";", index_col=0, decimal=",")
metadata = pd.read_csv("Alzheimer_metadata.csv", delimiter=";")
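
After loading, it can be worth checking that the values were parsed as numbers (a wrong `decimal` setting would leave them as strings) and that every expression column has a matching metadata row. The sketch below does this on tiny made-up stand-ins named `toy_data` and `toy_metadata`. The `individual` and `group` column names follow the examples in this notebook, while the gene names, the values and the `"Control"` label are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-ins with the same layout as the real files: genes x individuals.
toy_data = pd.DataFrame(
    {
        "individual_1": [5.1, 7.3],
        "individual_2": [4.9, 7.8],
        "individual_3": [5.4, 6.9],
    },
    index=["GENE_A", "GENE_B"],
)
toy_metadata = pd.DataFrame(
    {
        "individual": ["individual_1", "individual_2", "individual_3"],
        "group": ["Alzheimer's Disease", "Control", "Control"],  # "Control" is an assumed label
    }
)

# Shape is (number of genes, number of individuals).
print(toy_data.shape)

# Every column should hold numbers, not strings.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in toy_data.dtypes)
print("All columns numeric:", all_numeric)

# Individuals present in one table but not in the other (both sets should be empty).
missing_in_metadata = set(toy_data.columns) - set(toy_metadata["individual"])
missing_in_data = set(toy_metadata["individual"]) - set(toy_data.columns)
print("Missing in metadata:", missing_in_metadata)
print("Missing in data:", missing_in_data)
```

Replacing `toy_data` and `toy_metadata` with `data` and `metadata` runs the same checks on the real files.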

## Pandas Tips

The dataset is loaded using the `pandas` library in Python. This library provides a lot of utility functions that can be very useful, but it takes a bit of getting used to. We've provided a couple of examples for the most common operations below, but for an overview of all the operations you can do with a pandas dataset, please check the [pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf). 

In [None]:
# To get an idea of what the data looks like, we can print the first 5 rows of each dataframe using the head() method.
# And if we use display() instead of print(), we get a nicer-looking output!
display(data.head(5))

In [None]:
# We can also print the first 5 rows of the metadata dataframe.
display(metadata.head(5))

In [None]:
# If we want to select a row from the data dataframe, in pandas we can use the gene name as an index!
display(data.loc["OR4F5"])

# We can also select a row by its integer position instead of its name. For that we use iloc:
display(data.iloc[0])

In [None]:
# If we want to select a column, we can use the column name as an index. Here we select the first 5 rows of the first column.
display(data["individual_1"][:5])

# Or we do it by index, using iloc:
display(data.iloc[:, 0][:5])

In [None]:
# Pandas can also store "categorical" data, instead of just numbers. 
# This is used for example in the metadata to show if a patient has Alzheimer's or not.

# We can see the different categories in a column and how many times they appear using the value_counts() method.
display(metadata["group"].value_counts())

In [None]:
# As a final example, we'll look at how to use pandas to select a subset of the data based on a condition.
# Let's select the gene expression of the FAM87A gene (index=2) for the patients that have Alzheimer's.

# First we find all AD patients in the metadata dataframe:
ad_patients = metadata[metadata["group"] == "Alzheimer's Disease"]

# The ad_patients variable now contains a dataframe with only the patients that have Alzheimer's:
display(ad_patients.head(5))

# We can use it to select the gene expression of the FAM87A gene for these patients.
# For this, we use the "individual" column of the sliced dataframe we made above to select just the columns corresponding to these AD patients.
# We then select the row corresponding to the FAM87A gene using .loc again:
data[ad_patients["individual"]].loc["FAM87A"]
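
The same masking idea extends from one gene to all genes at once, which is a natural starting point for comparing the two groups. The sketch below splits a made-up expression table into AD and non-AD columns and computes a per-gene difference in means. The dataframes `toy_data` and `toy_metadata`, the gene names, the values and the `"Control"` label are assumptions for illustration, and a difference in means is only a descriptive summary, not a statistical test.

```python
import pandas as pd

# Made-up expression table (genes x individuals) and matching metadata.
toy_data = pd.DataFrame(
    {
        "individual_1": [5.0, 7.0],
        "individual_2": [6.0, 7.0],
        "individual_3": [4.0, 9.0],
        "individual_4": [5.0, 8.0],
    },
    index=["GENE_A", "GENE_B"],
)
toy_metadata = pd.DataFrame(
    {
        "individual": ["individual_1", "individual_2", "individual_3", "individual_4"],
        "group": ["Alzheimer's Disease", "Alzheimer's Disease", "Control", "Control"],
    }
)

# Boolean mask over the metadata rows: True for AD patients.
is_ad = toy_metadata["group"] == "Alzheimer's Disease"

# Column names of the expression table that belong to each group.
ad_columns = toy_metadata.loc[is_ad, "individual"]
other_columns = toy_metadata.loc[~is_ad, "individual"]

# Two sub-tables with all genes, one per group.
ad_expression = toy_data[ad_columns]
other_expression = toy_data[other_columns]

# Per-gene mean in each group (axis=1 averages across individuals).
mean_difference = ad_expression.mean(axis=1) - other_expression.mean(axis=1)
print(mean_difference)
```

Using `~is_ad` for the second group avoids hard-coding the control label, but check `value_counts()` on the real metadata first to confirm that there are only two groups.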