# GWAS HW
## 12/03/19
## Due 12/10/19 @ 11:59 PM

In [None]:
# Load the modules we'll need
from datascience import *
import numpy as np
import random
import seaborn as sns
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.manifold import TSNE
import statsmodels.api as sm
plt.style.use('fivethirtyeight')
from client.api.notebook import Notebook

Genome-wide association studies (GWAS) are used to identify the regions of the genome that affect traits (phenotypes). These analyses require both genotypes and phenotypes across a range of individuals. Many statistical problems arise in these types of studies, including the large multiple testing burden that appears when looking at genome-sized data, linkage disequilibrium between genomic loci, confounding effects of population structure, and the difficulty of identifying non-linear and/or complicated effects (e.g. epistasis, pleiotropy). Examples of traits that we might want to investigate are height, eye color, skin pigmentation, disease susceptibility, and gene expression levels.

To identify which mutations influence specific traits, we work with something called a "SNP matrix". Here, SNP refers to single nucleotide polymorphisms, i.e. point mutations. This is a matrix of 0s and 1s where 1 indicates a mutation with respect to the so-called "ancestral allele". Each row corresponds to a SNP and each column corresponds to a single chromosome, so each pair of adjacent columns comes from a single individual. These data come from the HapMap project and contain SNPs measured on chromosome 22 in CEU (Utah residents with Northern and Western European ancestry) and YRI (Yoruba in Ibadan, Nigeria) individuals.

In [None]:
# Load SNP data
SNP_df = pd.read_csv('https://raw.githubusercontent.com/ds-connectors/Data88-Genetics_and_Genomics/master/Lab09/chr22_CEU_YRI.csv')

Linkage disequilibrium (LD) occurs when SNPs co-occur (or fail to co-occur) at rates different from what would be expected if they were inherited independently. One way to measure this is the Pearson correlation between pairs of SNPs across chromosomes. Plot the correlation matrix for the first 1000 SNPs and see if you notice anything unusual along the diagonal.
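To see why the Pearson correlation measures LD, here is a minimal sketch on a made-up SNP matrix (the data and the helper name `ld_r` are hypothetical, not part of the HapMap file). For 0/1 alleles, the correlation equals the excess co-occurrence D = P(AB) - P(A)P(B), scaled by the allele frequencies.

```python
import numpy as np

# Hypothetical toy SNP matrix: rows are SNPs, columns are chromosomes.
# SNP 0 and SNP 1 are usually inherited together; SNP 2 is unrelated to SNP 0.
toy_haplotypes = np.array([
    [1, 1, 0, 0, 1, 0, 1, 0],   # SNP 0
    [1, 1, 0, 0, 1, 0, 0, 0],   # SNP 1
    [0, 1, 0, 1, 0, 1, 1, 0],   # SNP 2
])

def ld_r(a, b):
    """Pearson correlation between two 0/1 allele vectors, computed via D."""
    p_a, p_b = a.mean(), b.mean()   # frequency of the mutation at each SNP
    p_ab = (a * b).mean()           # frequency of chromosomes carrying both
    d = p_ab - p_a * p_b            # excess co-occurrence over independence
    return d / np.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

r_01 = ld_r(toy_haplotypes[0], toy_haplotypes[1])
r_02 = ld_r(toy_haplotypes[0], toy_haplotypes[2])
print(r_01, r_02)
```

SNPs 0 and 1 are strongly correlated (high LD), while SNPs 0 and 2 co-occur exactly as often as independence predicts, so their correlation is zero.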

In [None]:
# Compute the LD matrix by taking the Pearson correlations.
LD_mat = SNP_df.iloc[0:1000, :].T.corr()

f = plt.figure(figsize=(15, 15))
plt.matshow(LD_mat, fignum=f.number)
cb = plt.colorbar()
plt.xlabel('SNP #')
plt.ylabel('SNP #')
plt.title('Linkage disequilibrium')
plt.show()

# Describe the pattern you observe. What can we say about linkage disequilibrium as a function of distance between the SNPs? Does this make sense? What process (that we've covered) could account for this pattern?

# Answer:

To perform a GWAS, we usually want to work with the genotype matrix. This requires that we combine the two chromosomes from each individual, so that each SNP takes a value of 0, 1, or 2: the number of mutated copies that individual carries.

In [None]:
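# Create genotype matrix
# Even-numbered columns hold the first chromosome of each individual, odd-numbered columns the second
x = SNP_df.iloc[:,np.arange(0,SNP_df.shape[1],2)]
y = SNP_df.iloc[:,np.arange(1,SNP_df.shape[1],2)]
# Give both halves the same column labels so the addition lines them up individual by individual
y.columns = x.columns
genotype_df = x+y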
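The combination step can be checked by hand on a tiny example. This is a sketch with invented data and invented column names (`ind1_a`, `ind1_b`, and so on), assuming the same layout as above: SNPs in rows, and two adjacent columns per individual.

```python
import pandas as pd

# Hypothetical toy SNP matrix: 3 SNPs (rows) x 4 chromosomes (columns),
# i.e. two individuals with two chromosomes each.
toy_snp_df = pd.DataFrame(
    [[0, 1, 1, 1],
     [0, 0, 1, 0],
     [1, 1, 0, 0]],
    columns=['ind1_a', 'ind1_b', 'ind2_a', 'ind2_b'],
)

# Add each individual's two chromosomes; working on the raw arrays
# avoids any dependence on how the column labels align.
first_copy = toy_snp_df.iloc[:, 0::2].to_numpy()
second_copy = toy_snp_df.iloc[:, 1::2].to_numpy()
toy_genotype_df = pd.DataFrame(first_copy + second_copy, columns=['ind1', 'ind2'])
print(toy_genotype_df)
```

Individual 1 ends up with genotypes 1, 0, 2 and individual 2 with genotypes 2, 1, 0 at the three SNPs.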

Let's quickly check whether we can see evidence that we have two different populations in our data by computing and inspecting the correlation between the individuals' genotypes.

In [None]:
# Population structure matrix
individual_corr_df = genotype_df.corr()

f = plt.figure(figsize=(12, 12))
plt.matshow(individual_corr_df, fignum=f.number)
cb = plt.colorbar()
plt.xlabel('Individual #')
plt.ylabel('Individual #')
plt.title('Correlation matrix of genotypes')
plt.show()

# How do you interpret what you observe? What might this portend for our further analyses?

# Answer:

Now we're ready to perform our GWAS. Let's start by loading the phenotype for all our individuals.

In [None]:
# Load phenotype data and look at a histogram
pheno_df = list(pd.read_csv('phenotype_obs.csv').iloc[0,:])

plt.hist(pheno_df)
plt.xlabel('Diastolic blood pressure')
plt.ylabel('Frequency')
plt.show()

We want to determine the effect of each SNP on diastolic blood pressure. Since blood pressure is a quantitative trait, the naive way to do this is to perform a linear regression marginally for each SNP and then look at which ones are statistically significant.

In [None]:
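# Perform linear regression for each SNP and extract the p-value
p_vals_vec = []
for i in range(genotype_df.shape[0]):
    SNP_vec = list(genotype_df.iloc[i,:])    # genotypes (0, 1, or 2) of every individual at SNP i
    X = sm.add_constant(SNP_vec)             # prepend an intercept column
    model = sm.OLS(pheno_df, X).fit()
    p_vals_vec.append(model.pvalues[1])      # index 0 is the intercept, index 1 is the SNP effect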
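To build intuition for what a single marginal regression returns, here is a small simulation. Everything in it is an assumption made for illustration: the sample size, the effect of 3 units per mutated copy, the noise level, and the seed. It uses `scipy.stats.linregress` in place of `sm.OLS`. For a single predictor plus an intercept, both fit the same model, so the slope p-value is the same quantity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily

n_individuals = 200
causal_snp = rng.integers(0, 3, size=n_individuals)   # genotypes 0, 1, or 2
null_snp = rng.integers(0, 3, size=n_individuals)     # no effect on the trait

# Simulated phenotype: baseline of 80, +3 per mutated copy at the causal SNP, plus noise.
phenotype = 80 + 3.0 * causal_snp + rng.normal(0, 5, size=n_individuals)

causal_fit = stats.linregress(causal_snp, phenotype)
null_fit = stats.linregress(null_snp, phenotype)

print('causal SNP: slope', causal_fit.slope, 'p-value', causal_fit.pvalue)
print('null SNP:   slope', null_fit.slope, 'p-value', null_fit.pvalue)
```

The causal SNP should give a slope near the simulated effect and a very small p-value, while the null SNP's p-value is a draw from a roughly uniform distribution and can fall anywhere between 0 and 1.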

# Here we've computed our p-values from linear regression. However, we could've also obtained a different set of them from a permutation test. Let's say we have computed the correlation between the value of our trait of interest and the number of mutations at a given location. Describe how you would perform the permutation test with correlation as the test statistic. (Hint: look back at some older labs if you're stuck).

# Answer:

We often examine our results with something called a "Manhattan plot". This plots the -log10 p-value for each SNP over the entire region of interest. Our usual threshold of 0.05 is no longer suitable because of how many tests we're performing. Typically we adjust our cutoff, in this case using the Bonferroni correction, which divides the threshold by the number of tests.
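As a worked example of the arithmetic, here is a sketch with a made-up number of tests and made-up p-values (neither comes from the data above):

```python
import numpy as np

alpha = 0.05
n_tests = 10_000   # hypothetical number of SNPs tested

# Bonferroni: divide the significance level by the number of tests.
bonferroni_cutoff = alpha / n_tests

# Height of the corresponding horizontal line on a Manhattan plot.
cutoff_height = -np.log10(bonferroni_cutoff)

# Three hypothetical p-values: only the last one clears the corrected cutoff.
example_p_vals = np.array([0.03, 1e-4, 2e-7])
is_significant = example_p_vals < bonferroni_cutoff

print(bonferroni_cutoff, cutoff_height, is_significant)
```

With 10,000 tests the cutoff drops from 0.05 to 0.000005, which sits at a height of about 5.3 on the -log10 scale.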

In [None]:
plt.figure(figsize=(15, 8))
plt.plot(np.arange(0, genotype_df.shape[0]), -np.log10(p_vals_vec), 'k.')
plt.xlabel('SNP #')
plt.ylabel('-log10 p-value')
plt.title('Manhattan plot')
plt.axhline(y = -np.log10(.05/genotype_df.shape[0]), linewidth=1, color='r')     # plot Bonferroni-corrected cutoff
plt.show()

# What does the plot look like? Why might it be called a Manhattan plot? Can you explain what could be causing this phenomenon?

# Answer:

# The high cutoff for significance comes from the fact that we have to perform many tests. However, the calculation of this cutoff makes the assumption that all the tests are independent. From our LD matrix plot earlier, we know that these tests are not independent because many SNPs are in LD with one another. Given these bits of information, can you suggest a strategy that mostly preserves our ability to detect regions associated with traits while reducing the number of tests we have to perform?

# Answer:

In [None]:
ok = Notebook('hw_gwas.ok')
_ = ok.auth(inline=True)

In [None]:
# Submit the assignment.
_ = ok.submit()