# Explore and annotate GWAS results

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

[MIT License](https://github.com/dnanexus/UKB_RAP/blob/main/LICENSE) applies to this notebook.

## Merging our multiple regenie files

We have provided a short shell script (`process_regenie_results.sh`) that merges regenie results from multiple chromosomes into a single file. Depending on your naming conventions, you may have to adjust the wildcard expression used to match your files.

We will proceed assuming that you have this merged file.

## Working with flat files on DNAnexus in R

To work with files from a DNAnexus project in R, either read them directly into a data.frame with `pipe("dx cat <filename>")`, or download them to the local instance with `dx download`, run from the terminal or from a notebook.

Let's download the regenie results file to the JupyterLab instance:

In [None]:
system("dx download -f gwas_results/multiple_assoc_edit_tab.all.regenie")

# View first few rows:
system('head -3 multiple_assoc_edit_tab.all.regenie', intern = T)

# Let's remove "#" from the first row so that the header row is read in correctly in R
system('sed -i -e "1 s/\\#//" multiple_assoc_edit_tab.all.regenie', intern = T)
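
As an alternative to downloading, the `pipe("dx cat <filename>")` route mentioned above can be sketched as follows. This is a minimal sketch, not part of the original workflow: `read_from_command()` is a hypothetical helper name, and the `dx cat` call is left commented out because it requires the `dx` command line tool and the file path used in the download step.

```r
# Hypothetical helper: read a tab-delimited table from the output of a shell command.
read_from_command <- function(cmd) {
  df <- read.table(pipe(cmd), header = TRUE, sep = "\t",
                   comment.char = "",     # keep a header line that starts with "#"
                   check.names = FALSE,   # do not mangle the column names
                   stringsAsFactors = FALSE)
  # Drop a leading "#" from the column names instead of editing the file with sed
  names(df) <- sub("^#", "", names(df))
  df
}

# On the platform (assumes the dx CLI is available and the path matches your project):
# gwas_streamed <- read_from_command(
#   "dx cat gwas_results/multiple_assoc_edit_tab.all.regenie")
```

Streaming avoids keeping a second copy of the file on the instance, at the cost of re-reading from the project each time the cell is run.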

# Install Needed Packages

The following packages are required for this JupyterLab notebook. They are not installed by default; note that you will need to decide whether their licenses are appropriate for your application.

Installing these packages from an R code cell can sometimes produce errors. We recommend that you open a terminal using the JupyterLab launcher, launch R on the command line, and then copy and paste the code cell. Another option is to install a specific version of a library.

In [None]:
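# install.packages() has no `version` argument, so pin a version with remotes::install_version()
install.packages("remotes")
remotes::install_version("rlang", version = "1.0.1")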
install.packages("qqman")
install.packages("tidyr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("manhattanly")
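
To avoid reinstalling packages that are already present, one option is to check first which ones are missing. The following is a minimal sketch; `missing_packages()` is a hypothetical helper name, and the install call is left commented out so the cell has no side effects.

```r
# Packages used in this notebook
needed <- c("rlang", "qqman", "tidyr", "dplyr", "ggplot2", "manhattanly")

# Hypothetical helper: return the packages in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```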

# Loading the Required Packages

We'll do a little bit of data wrangling using `{tidyr}` and `{dplyr}`. Make sure that you've loaded the correct snapshot for this.

`{manhattanly}` lets us produce an interactive plot using `{plotly}`. A nice feature of this package is that the resulting plot can be shared in a Jupyter notebook.

In [None]:
# load packages 
library(rlang)
library(qqman, quietly = TRUE)
library(repr, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(ggplot2)
library(manhattanly)

## Reading in GWAS Result File from Jupyter Storage

We'll read in the GWAS result file that we downloaded using the `read.table()` function.

In [None]:
gwas = read.table("multiple_assoc_edit_tab.all.regenie", header = T, as.is = T, sep = '\t')
# Look at the head of the gwas dataframe
head(gwas)

Add a `P` column by inverting the negative base-10 logarithm stored in `LOG10P`, that is, `P = 10^(-LOG10P)`.

In [None]:
gwas <- 
    gwas %>% mutate(P = (10^(-LOG10P)))
head(gwas)
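
To make the conversion concrete, here is a small worked example on made-up values (the `toy` data frame is illustrative only and not taken from the results file). Note that for very large `LOG10P` values the converted p-value underflows to 0 in double precision, so `LOG10P` itself is the safer column for ranking extreme hits.

```r
# Toy values: LOG10P is -log10(P), so P = 10^(-LOG10P)
toy <- data.frame(ID = c("rs1", "rs2", "rs3"),
                  LOG10P = c(0, 2, 7.30103))
toy$P <- 10^(-toy$LOG10P)
print(toy)

# Extremely significant results underflow to exactly 0
tiny <- 10^(-400)
print(tiny == 0)
```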

Regenie output may contain multiple rows per variant, one for each predictor in the model, as indicated by the `TEST` column. Let's filter the results to look at the additive effect of each variant.

In [None]:
# Subset dataframe
gwas_additive <- 
    gwas %>%
        filter(TEST == "ADD") %>%
        tidyr::drop_na(LOG10P)

In [None]:
# Dimensions of the dataframe
dim(gwas_additive)

In [None]:
head(gwas_additive)

What is the lowest P-value in our set of variants? 

In [None]:
# Lowest P-value
min(gwas_additive$P)

## Generating a Q-Q plot

We can generate a Q-Q plot to check our p-value distribution.

In [None]:
# Generate QQ plot with the GWAS results
qq(gwas_additive$P, main = "Q-Q plot of case-control GWAS p-values")
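
For intuition, a Q-Q plot of p-values compares the sorted observed values with the quantiles expected under a uniform distribution, both on the `-log10` scale. The sketch below computes those two axes by hand on simulated null p-values; it is an illustration of the idea under that assumption, not a description of how any particular plotting function is implemented.

```r
set.seed(1)
p <- runif(1000)                        # simulated p-values with no true signal

observed <- -log10(sort(p))             # smallest p-value first
expected <- -log10(ppoints(length(p)))  # quantiles expected under uniformity

qq_points <- data.frame(expected, observed)
head(qq_points, 3)
```

When the p-values are well calibrated the points fall close to the diagonal; a systematic departure along the whole line suggests inflation, while a departure only in the upper tail is what true associations look like.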

# Plotting a Manhattan Plot

We can use the `manhattan()` function from the `{qqman}` package to generate a Manhattan plot.

Let's first define a couple of color palettes for distinguishing the different chromosomes in our Manhattan plot.

In [None]:
# Adjust plot size
options(repr.plot.width=12, repr.plot.height=8)

# Select Manhattan plot color palette 
# w = warmer tones
# n = neutral
# c = cooler tones

# Reds
reds.w <- c("#FFAD7E", "#E9874F", "#D96726", "#AE4A12", "#873100") 
reds.n <- c("#FF817E", "#E9534F", "#D92B26", "#AE1612", "#870300") 
reds.c <- c("#E2709A", "#CB4577", "#BD215B", "#970F42", "#75002B") 

In [None]:
# Make the Manhattan plot on the gwas results dataframe
# Use reds.c as our color palette
manhattan(gwas_additive, chr="CHROM", 
          bp="GENPOS", snp="ID", p="P", ylim=c(0,10), suggestiveline=FALSE,
          col=reds.c,main="Manhattan Plot for case control GWAS")


We can zoom into chromosome 1 using a `filter()` operation:

In [None]:
# Keep only the variants on chromosome 1
gwas_additive_chr1 <- 
    gwas_additive %>%
        filter(CHROM %in% c("1"))

manhattan(gwas_additive_chr1, chr="CHROM", 
          bp="GENPOS", snp="ID", p="P", ylim=c(0,10),  suggestiveline=FALSE,
          col=reds.w,main="Manhattan Plot for case control GWAS")

## Interactive Manhattan Plot with the `{manhattanly}` package

The `{manhattanly}` package uses `{plotly}` under the hood to make an interactive Manhattan plot.

We can control what the tooltip shows by passing column names to the `annotation1` and `annotation2` arguments.

In [None]:
# By default, the `manhattanly` function assumes columns are named CHR, BP and P. 
# These can be specified by the user if they are different, like below:
library(manhattanly)

subset_gwas <- gwas_additive %>%
    filter(CHROM %in% c(1:2))

manhattanly(subset_gwas, chr = "CHROM", bp = "GENPOS", 
            snp = "ID", annotation1 = "CHISQ", suggestiveline = FALSE, 
            annotation2 = "BETA", p = "P")

In [None]:
qqly(
    subset(gwas, CHROM %in% 1:2), chr = "CHROM", bp = "GENPOS", snp = "ID", 
    annotation1 = "CHISQ", annotation2 = "BETA"
)

# Filtering our Candidate Variant List

In [None]:
# Subset results showing suggestive association
gwas_top <- gwas %>%
    filter(P < 0.001) %>%
    arrange(P)

dim(gwas_top)

head(gwas_top)

# Annotating GWAS results with ClinVar

## Downloading ClinVar Annotation Files

We will use ClinVar's tab-delimited `variant_summary` report, which has entries for each variant location on the genome for which data have been submitted to ClinVar.

1. Download the `variant_summary.txt.gz` file with `wget` and unzip it
2. Load the `variant_summary` table
3. Subset `variant_summary` to only include SNPs
4. Merge with `gwas_top` using chromosome and position
5. Select relevant columns in the merged table

In [None]:
system("wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz")
system("gunzip variant_summary.txt.gz")

In [None]:
clinvar <- read.delim("variant_summary.txt", sep="\t") 

In [None]:
colnames(clinvar)

First we need to filter `clinvar` to only contain SNPs. We do that with `filter()`, keeping the rows where `Type == "single nucleotide variant"`.

In [None]:
clinvar <- clinvar %>%
    filter(Type == "single nucleotide variant") %>%
    mutate(Chromosome = as.character(Chromosome))

Here we merge our `gwas_top` table with `clinvar` using `dplyr::inner_join()`, matching the `CHROM` and `GENPOS` columns in our data to the `Chromosome` and `Start` columns in `clinvar`.

In [None]:
gwas_top_annotated <- gwas_top %>%
    mutate(CHROM = as.character(CHROM)) %>%
    inner_join(y=clinvar, by=c("CHROM"="Chromosome", "GENPOS"="Start")) %>%
    mutate(CHROM = as.numeric(CHROM))

colnames(gwas_top_annotated)
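
One thing to watch for in this join: the `variant_summary` table reports positions for more than one genome assembly (see its `Assembly` column), so the same variant can match more than once, or match on the wrong build. The sketch below shows the idea on made-up toy tables; `genome_build` is a hypothetical variable, and its value is an assumption that you should set to the build your GWAS positions are on.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy stand-ins for gwas_top and clinvar (values are made up)
toy_gwas <- data.frame(CHROM = c("1", "2"),
                       GENPOS = c(100L, 200L),
                       P = c(1e-5, 1e-4))
toy_clinvar <- data.frame(
  Chromosome = c("1", "1", "2"),
  Start = c(100L, 100L, 999L),
  Assembly = c("GRCh37", "GRCh38", "GRCh38"),
  ClinicalSignificance = c("Benign", "Benign", "Pathogenic")
)

genome_build <- "GRCh38"  # assumption: change to match your data

# Restrict ClinVar to one assembly before joining on chromosome and position
toy_annotated <- toy_gwas %>%
  inner_join(filter(toy_clinvar, Assembly == genome_build),
             by = c("CHROM" = "Chromosome", "GENPOS" = "Start"))
print(toy_annotated)
```

Without the `Assembly` filter, the toy variant on chromosome 1 would appear twice in the joined table, once per assembly.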

Now that we have our tables merged, we can pass the `ClinicalSignificance` column to the `annotation1` argument and `BETA` to the `annotation2` argument in `manhattanly()` to further understand our candidates.

In [None]:
manhattanly(gwas_top_annotated, chr = "CHROM", bp = "GENPOS", 
            snp = "ID", suggestiveline = FALSE, annotation1 = "ClinicalSignificance", 
            annotation2 = "BETA")

## Saving our annotated results

Finally, we'll use the `write.csv()` function to write a CSV file and then use `dx upload` to get this result back onto the platform.

In [None]:
write.csv(gwas_top_annotated, "clinvar_annotated_candidates.csv")

In [None]:
system("dx upload clinvar_annotated_candidates.csv --path gwas_results/")