# Explore and annotate GWAS results

<span style="color:gray">Disclaimer: Any third-party module or library installed by the user (e.g., using `pip install` or `install.packages()`) and not included in an image build or maintained by DNAnexus may be subject to updates or changes with downstream implications. Please verify dependencies and build before using any third-party data source.</span>

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

[MIT License](https://github.com/dnanexus/UKB_RAP/blob/main/LICENSE) applies to this notebook.

#### Download and import necessary packages for use in Python

In [None]:
%%bash
# Visualization for GWAS results
pip -q --no-cache-dir install https://github.com/khramts/assocplots/archive/master.zip

In [None]:
from assocplots.manhattan import manhattan
from assocplots.qqplot import qqplot, get_lambda
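import matplotlib as plt # Plotting; note this is the top-level package, so pyplot is reached as plt.pyplot
import matplotlib.pyplot # Load the pyplot submodule explicitly so plt.pyplot is always available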
import numpy as np       # Numerical operations
import pandas as pd      # Manipulate tables of data (e.g. R-like data frames, SQL tables) -- can also use R directly!
import warnings

warnings.filterwarnings('ignore')

### GWAS results

We will read the results of the GWAS analysis into a dataframe.

Flat files from a GWAS analysis can be read directly into a dataframe in both Python and R notebooks.

If you prefer to work with **R**, refer to the **_*gwas_results_R.ipynb_** notebook.


In [None]:
%%bash
dx download -f gwas_results/multiple_assoc_edit_tab.all.regenie

# View head of results file
head -3 multiple_assoc_edit_tab.all.regenie

# Let's remove "#" from the first row so that pandas reads the header row correctly
sed -i -e "1 s/\\#//" multiple_assoc_edit_tab.all.regenie

In [None]:
gwas = pd.read_csv("multiple_assoc_edit_tab.all.regenie",sep='\t')
gwas.head()
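
As an alternative to editing the file with `sed`, the leading `#` can be stripped from the column names after loading. The sketch below uses a small in-memory stand-in for the results file; the column names and values are illustrative only.

```python
import io
import pandas as pd

# Toy stand-in for a tab-delimited results file whose header starts with "#"
toy_file = io.StringIO(
    "#CHROM\tGENPOS\tID\tLOG10P\n"
    "19\t1000\trs1\t2.5\n"
    "19\t2000\trs2\t0.3\n"
)

toy = pd.read_csv(toy_file, sep="\t")

# Strip the leading "#" from column names instead of modifying the file on disk
toy = toy.rename(columns=lambda name: name.lstrip("#"))
print(list(toy.columns))
```

This leaves the downloaded file untouched, which may be preferable if it is reused by other tools.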

Add a `P` column by inverting the negative base-10 logarithm stored in `LOG10P`, i.e. `P = 10^(-LOG10P)`.

In [None]:
gwas['P'] = 10**(-gwas.LOG10P)
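
A minimal sketch of this conversion on made-up `LOG10P` values. It also illustrates one caveat: for very large `LOG10P`, the resulting p-value is too small for a 64-bit float and becomes exactly `0.0`.

```python
import pandas as pd

# Made-up -log10(p) values; the last one is deliberately extreme
toy = pd.DataFrame({"LOG10P": [0.0, 2.0, 7.3, 400.0]})

# Invert the negative base-10 logarithm
toy["P"] = 10.0 ** (-toy["LOG10P"])

# Rows whose p-value underflowed to zero
underflowed = toy[toy["P"] == 0.0]
print(toy)
print(f"{len(underflowed)} value(s) underflowed to 0.0")
```

When ranking the strongest associations, sorting on `LOG10P` directly avoids ties caused by underflow.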

In [None]:
gwas = gwas[gwas.TEST == "ADD"]
gwas.head()

#### Dataframe of results
Let's subset the dataframe to exclude rows with missing values (NAs) in the columns needed for plotting.

In [None]:
# Let's remove rows with NAs in P, CHROM or GENPOS from the dataframe
gwas = gwas.dropna(axis=0, how='any', subset=['P', 'CHROM','GENPOS']).reset_index(drop=True)
gwas.head()
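
To make the effect of the `subset` argument concrete, here is a small sketch on a toy dataframe. The `EXTRA` column is a hypothetical column that is entirely missing; because it is not listed in `subset`, it does not cause rows to be dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CHROM": [19, 19, np.nan],
    "GENPOS": [100, 200, 300],
    "P": [0.5, np.nan, 0.01],
    "EXTRA": [np.nan, np.nan, np.nan],  # hypothetical column, ignored by the filter
})

# Drop a row only if P, CHROM or GENPOS is missing
kept = toy.dropna(axis=0, how="any", subset=["P", "CHROM", "GENPOS"]).reset_index(drop=True)
print(kept)
```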

In [None]:
# Let's check the minimum p-value in our results
gwas['P'].min()

### Visual exploration of results

#### Manhattan and QQ plots
We'll use [assocplots](https://academic.oup.com/bioinformatics/article/33/3/432/2593901), a Python module for visualizing GWAS results.

In [None]:
# Set figure parameters
plt.rcParams['figure.dpi'] = 150
plt.rcParams['figure.figsize'] = 5,5
plt.rcParams['legend.fontsize'] = 'small'

# Generate QQ plot
qqplot(data=[gwas['P']], labels=['QQ plot'], title='QQ plot',  color=['b'], fill_dens=[0.2])
plt.pyplot.savefig('qq.png', dpi=300)                   ## Save QQ plot as .png file
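
For intuition about what a QQ plot displays, the sketch below computes the two axes by hand for simulated null p-values: observed `-log10(p)` values sorted from most to least significant, against the quantiles expected under a uniform distribution. This is a generic construction, not necessarily the exact computation `assocplots` performs; the `(rank - 0.5) / n` plotting position is one common choice and is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)  # simulated p-values under the null hypothesis

n = len(p)

# Observed values: smallest p-value first, so -log10(p) is in descending order
observed = -np.log10(np.sort(p))

# Expected values under uniformity, using the (rank - 0.5) / n plotting position
expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)

# Under the null, the points (expected, observed) should lie close to the diagonal
print(expected[:3], observed[:3])
```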

In [None]:
# Calculate lambda - genomic inflation factor
get_lambda(gwas['P'], definition = 'median')
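
As a cross-check on the reported value, the median-based genomic inflation factor can be sketched by hand: convert each p-value to a 1-degree-of-freedom chi-square statistic and divide the observed median by the median expected under the null (about 0.455). This is one common definition and is not guaranteed to match `get_lambda` digit for digit; `genomic_inflation` is a hypothetical helper name.

```python
from statistics import NormalDist

import numpy as np


def genomic_inflation(p_values):
    """Median-based lambda from p-values, assuming 1-df chi-square tests."""
    std_normal = NormalDist()
    # For 1 df, the chi-square statistic is the squared normal quantile of p / 2
    chi2 = np.array([std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values])
    # Median of a 1-df chi-square distribution (approximately 0.455)
    expected_median = std_normal.inv_cdf(0.25) ** 2
    return float(np.median(chi2) / expected_median)


rng = np.random.default_rng(0)
null_p = rng.uniform(low=1e-12, high=1.0, size=5000)  # simulated null p-values
lam = genomic_inflation(null_p)
print(round(lam, 3))
```

For simulated null p-values the result should be close to 1; values well above 1 in real data can indicate inflation from population structure or other confounding.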

In [None]:
# Set the dimensions for Manhattan plot
plt.rcParams['figure.dpi']=300
plt.rcParams['figure.figsize']=15,6
plt.rcParams['font.size'] = 10
plt.rcParams['legend.fontsize'] = 'large'
plt.rcParams['figure.titlesize'] = 'large'

# To select a different color map: http://matplotlib.org/examples/color/colormaps_reference.html 
cmap = plt.pyplot.get_cmap('PuRd')
colors = [cmap(i) for i in [0.25,0.45,0.60,0.80]]

# Generate Manhattan plot
manhattan(p1=gwas['P'],
          pos1=gwas['GENPOS'],
          chr1=gwas['CHROM'].astype(str), 
          label1='Trait', 
          cut = 0, 
          top1=0,
          title='GWAS Manhattan Plot - Chr19', 
          xlabel='Chromosome',
          ylabel='-log10(p-value)',
          lines=[6,8], 
          colors = colors, 
          scaling = '-log10')

# Subset GWAS candidate variants by thresholding

In [None]:
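# Subset GWAS results to select top variants (P < 0.001)
# .copy() makes sig_gwas an independent dataframe, so the column assignment below
# does not operate on a view of gwas
sig_gwas = gwas.loc[gwas['P'] < 0.001].copy()
# ClinVar stores chromosomes as strings, so convert CHROM before merging
sig_gwas["CHROM"] = sig_gwas["CHROM"].astype(str)
sig_gwas.head()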
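
The choice of cut-off strongly affects how many candidates are carried forward. The sketch below counts variants in a toy dataframe at three cut-offs: the exploratory `0.001` used in this notebook and the two reference lines drawn on the Manhattan plot (`-log10(p)` of 6 and 8). The p-values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": ["rs1", "rs2", "rs3", "rs4"],
    "P": [0.2, 5e-4, 3e-7, 1e-9],  # made-up p-values
})

# Cut-offs on the p-value scale; 1e-6 and 1e-8 correspond to lines=[6, 8]
cutoffs = {"P < 1e-3": 1e-3, "P < 1e-6": 1e-6, "P < 1e-8": 1e-8}

counts = {label: int((toy["P"] < cutoff).sum()) for label, cutoff in cutoffs.items()}
print(counts)
```

A lenient cut-off such as `0.001` is convenient for exploring annotations, but hits at that level should not be read as statistically significant associations.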

# Annotating GWAS results with ClinVar

## Downloading ClinVar Annotation Files

We will use a tab-delimited report based on each variant at a location on the genome for which data have been submitted to ClinVar.

1. Download the `variant_summary.txt.gz` file from ClinVar with `wget` and decompress it
2. Load `variant_summary.txt` in pandas
3. Subset `variant_summary` to only include single nucleotide variants
4. Merge with `sig_gwas` using Chromosome and Position
5. Select relevant columns in merged table


In [None]:
%%bash
wget --no-verbose https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
gunzip variant_summary.txt.gz

In [None]:
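# Read Chromosome as a string: it holds values such as "1", "X" and "MT", and must match the string CHROM column when merging
clinvar = pd.read_csv("variant_summary.txt", delimiter="\t", dtype={"Chromosome": str})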
clinvar.head()

In [None]:
clinvar = clinvar[clinvar.Type == "single nucleotide variant"]
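
ClinVar's `variant_summary.txt` reports coordinates for more than one genome assembly in its `Assembly` column, so the same variant can appear with different positions. The sketch below shows how the subset could additionally be restricted to one build, using a toy dataframe with made-up positions. Which build matches your GWAS coordinates is an assumption you need to verify; `GRCh38` is used here purely for illustration.

```python
import pandas as pd

# Toy stand-in for the ClinVar table; positions are made up
toy_clinvar = pd.DataFrame({
    "Type": ["single nucleotide variant", "single nucleotide variant", "Deletion"],
    "Assembly": ["GRCh37", "GRCh38", "GRCh38"],
    "Chromosome": ["19", "19", "19"],
    "Start": [1000, 1100, 1200],
})

build = "GRCh38"  # assumption: replace with the build used for your GWAS positions

snv_on_build = toy_clinvar[
    (toy_clinvar["Type"] == "single nucleotide variant")
    & (toy_clinvar["Assembly"] == build)
]
print(snv_on_build)
```

Without such a filter, a position-based merge can match GWAS variants against coordinates from a different assembly.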

In [None]:
list(clinvar.columns)

In [None]:
clinvar_candidates = pd.merge(sig_gwas, clinvar, left_on=["CHROM", "GENPOS"], right_on=["Chromosome", "Start"])
clinvar_candidates[["GeneSymbol", "Type", "P","Chromosome", "Start", "ReferenceAllele", "AlternateAllele", "ClinicalSignificance"]]
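
The merge only finds matches when the key columns have the same dtype on both sides. The sketch below uses toy dataframes with made-up values to show the dtype alignment, and uses a left join with `indicator=True` to report how many GWAS variants found a ClinVar entry. The notebook's own merge uses the default inner join, which keeps matched rows only.

```python
import pandas as pd

toy_gwas = pd.DataFrame({
    "CHROM": [19, 19],          # integers, as read from the results file
    "GENPOS": [1000, 2000],
    "P": [1e-5, 2e-4],
})
toy_clinvar = pd.DataFrame({
    "Chromosome": ["19", "19"],  # strings, as in ClinVar
    "Start": [1000, 3000],
    "ClinicalSignificance": ["Benign", "Pathogenic"],
})

# Align the key dtype before merging
toy_gwas["CHROM"] = toy_gwas["CHROM"].astype(str)

annotated = pd.merge(
    toy_gwas,
    toy_clinvar,
    how="left",                  # keep unmatched GWAS variants for inspection
    left_on=["CHROM", "GENPOS"],
    right_on=["Chromosome", "Start"],
    indicator=True,              # adds a _merge column: "both" or "left_only"
)
n_matched = int((annotated["_merge"] == "both").sum())
print(f"{n_matched} of {len(annotated)} variants matched a ClinVar entry")
```

Matching on chromosome and position alone does not guarantee that the alleles agree; comparing the alleles as well would be a stricter check.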

## Saving our annotated results

Finally, we'll use the `.to_csv()` method to write a CSV file and then use `dx upload` to copy the result back to the platform.

In [None]:
# index=False omits the dataframe's row index, which carries no information here
clinvar_candidates.to_csv("clinvar_annotated_candidates.csv", index=False)

In [None]:
%%bash
dx upload clinvar_annotated_candidates.csv --path gwas_results/