# Annotate SNPs with dbSNP and interactively profile ontologies

In this notebook we will annotate SNPs with RefSNP IDs from dbSNP, using the list of significant variants from the GWAS on participant height in **Notebook G202**.

We will also retrieve and plot overrepresented GO terms for these variants.

- runtime: 30m (largely software install)
- recommended instance: mem1_ssd1_v2_x16
- estimated cost: <£0.70

This notebook depends on:
* **Bioconductor install**
* **Notebook G202** - height_signif_snp.csv

## Install required R packages

The `p_load` function from `pacman` loads packages into R.
If a given package is missing, it is installed automatically. This can take a considerable amount of time for packages that need C or FORTRAN code compilation.
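
As a rough illustration of this install-if-missing behaviour, the base-R sketch below checks whether each package is available, installs it if not, and then attaches it. The helper name `load_or_install` and the CRAN mirror URL are assumptions made for this example; the sketch is not how `pacman` is implemented, and the real `p_load` does more.

```r
# Hypothetical helper that mimics the install-if-missing idea behind pacman::p_load.
load_or_install <- function(pkgs, repos = "https://cloud.r-project.org") {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE (instead of raising an error) when a package is absent
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = repos)
    }
    # character.only = TRUE lets us pass the package name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# 'stats' and 'utils' ship with R, so nothing is installed here
loaded <- load_or_install(c("stats", "utils"))
```

Packages that are already installed are simply attached, so calling the helper again is cheap.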

The following packages are needed to run this notebook:

- `reticulate` - R-Python interface, required to use the `dxdata` package, which connects to the Spark database and allows retrieval of phenotypic data
- `dplyr` - tabular data manipulation in R, required for pre-processing and filtering of phenotypic data
- `readr` - read and write tabular file formats: CSV, TSV, TDF, etc.
- `skimr` - provides summary statistics about variables in data frames, `tibble` objects, data tables and vectors
- `gprofiler2` - a toolset for functional enrichment analysis and visualization of genes and variants
- `GenomicRanges` - representation of genomic positions and intervals, used here to query dbSNP by variant position
- `SNPlocs.Hsapiens.dbSNP155.GRCh38` - a snapshot of dbSNP (build 155) for the GRCh38 reference genome


## Install required Bioconductor packages

In [2]:
if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr, skimr, readr, gprofiler2)

In [15]:
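# Choose the Bioconductor release that matches the running R version:
# Bioconductor 3.16 pairs with R 4.2, Bioconductor 3.17 with R 4.3.
# The variable is named bioc_version to avoid masking base R's `version` object.
bioc_version <- if (getRversion() < "4.3.0") "3.16" else "3.17"

# Make sure BiocManager itself is available before calling it
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(version = bioc_version, ask = FALSE)
if (!require(GenomicRanges)) BiocManager::install("GenomicRanges", version = bioc_version, ask = FALSE)
if (!require(SNPlocs.Hsapiens.dbSNP155.GRCh38)) BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38", version = bioc_version, ask = FALSE)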

In [4]:
pacman::p_load(GenomicRanges, SNPlocs.Hsapiens.dbSNP155.GRCh38)

## Load output of GWAS

In this step, we load a list of variants. In this example, we use the list of significant variants from the GWAS on participant height in **Notebook G202**.

In [None]:
system('dx download gwas/height_signif_snp.csv') #From Notebook G202
snp <- readr::read_csv('height_signif_snp.csv', show_col_types = FALSE)
head(snp)

## Convert variant list to GenomicRanges format

In this step, we use the genomic coordinates (chromosome and physical position) of the variants to construct a `GRanges` object.
This allows us to query dbSNP by variant position and assign RSIDs to known variants.

In [6]:
snp_gr <- makeGRangesFromDataFrame(
    snp, 
    seqnames.field = 'chromosome', 
    start.field = 'physical.pos', 
    end.field = 'physical.pos', 
    keep.extra.columns = TRUE)

The GRanges object consists of 3 mandatory fields: `seqnames` - the name of the chromosome, `ranges` - position on the chromosome and `strand` - the strand, where `*` denotes any strand.
In addition, there can be an arbitrary number of additional annotation fields. We use them to store marker ID, the information about alleles and statistics from GWAS analysis. 

In [None]:
snp_gr
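
To make the structure of a `GRanges` object concrete, here is a small sketch on a toy table. The column names `chromosome` and `physical.pos` follow the call above; `marker.ID` and `p.value` are assumed names and all values are invented, so the real `snp` table may look different.

```r
library(GenomicRanges)

# Toy table; marker IDs and p-values are invented for illustration
toy <- data.frame(
  chromosome   = c("1", "1", "2"),
  physical.pos = c(1000L, 2500L, 300L),
  marker.ID    = c("m1", "m2", "m3"),
  p.value      = c(1e-9, 3e-8, 2e-10)
)

toy_gr <- makeGRangesFromDataFrame(
  toy,
  seqnames.field = "chromosome",
  start.field = "physical.pos",
  end.field = "physical.pos",
  keep.extra.columns = TRUE
)

as.character(seqnames(toy_gr))  # chromosome of each variant
start(toy_gr)                   # position; start equals end for single-base variants
as.character(strand(toy_gr))    # "*" because the table has no strand column
mcols(toy_gr)$p.value           # extra columns are kept as metadata
```

The same accessors (`seqnames`, `start`, `strand`, `mcols`) can be applied to `snp_gr`.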

## Load and inspect dbSNP

The following code loads the locally installed R package containing dbSNP, build 155 for the GRCh38 reference genome.
We can see how many variants are annotated on each chromosome.

In [None]:
snps <- SNPlocs.Hsapiens.dbSNP155.GRCh38
snpcount(snps) %>% as_tibble(rownames = 'chr') %>% head

## Annotate variants with RefSNP IDs based on genomic coordinates

In [9]:
my_snps <- snpsByOverlaps(snps, snp_gr)

This function creates a new GRanges object with a `RefSNP_id` annotation field:

In [None]:
my_snps
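
The object returned by `snpsByOverlaps` holds positions and RefSNP IDs but not the GWAS statistics stored in `snp_gr`. One possible way to bring the two together is a position-based overlap, sketched below on toy data. Here `gwas_gr` stands in for `snp_gr` and `db_gr` for `my_snps`; the rs IDs are placeholders and the p-values are invented. This is an illustrative sketch, not a step of the original workflow.

```r
library(GenomicRanges)

# Toy stand-ins: gwas_gr plays the role of snp_gr, db_gr the role of my_snps
gwas_gr <- GRanges(
  seqnames = c("1", "1", "2"),
  ranges = IRanges(start = c(1000, 2500, 300), width = 1),
  p.value = c(1e-9, 3e-8, 2e-10)
)
db_gr <- GRanges(
  seqnames = c("1", "2"),
  ranges = IRanges(start = c(1000, 300), width = 1),
  RefSNP_id = c("rsA", "rsB")  # placeholder IDs, not real rs numbers
)

# Pair each GWAS variant with the dbSNP record at the same position
hits <- findOverlaps(gwas_gr, db_gr)

# Keep the matched GWAS variants and attach the corresponding RefSNP IDs
annotated <- gwas_gr[queryHits(hits)]
annotated$RefSNP_id <- db_gr$RefSNP_id[subjectHits(hits)]
annotated
```

Variants without a dbSNP record at their position (the second toy variant here) drop out of `annotated`.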

## Retrieve the overrepresented GO terms for variants

In this step, we use functions from the `gprofiler2` package to find ontology terms that are overrepresented among our SNP IDs and visualize them as interactive plots.

In [11]:
# significant = FALSE returns all tested terms; set it to TRUE to keep only significant terms
gostres <- gost(query = my_snps$RefSNP_id, organism = "hsapiens", significant = FALSE)

## Visualize the statistical significance of GO terms

In [12]:
p <- gostplot(gostres, capped = FALSE)

The following plot visualizes the overrepresented terms. It is similar to a Manhattan plot: the terms are dispersed and grouped by ontology on the X-axis, while the negative base-10 logarithm of the adjusted p-value from the hypergeometric test for overrepresentation is plotted on the Y-axis. This transformation makes statistical significance easier to read, with a higher value on the Y-axis denoting a more significant term.

In [None]:
p
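
To illustrate the quantities behind the Y-axis, the sketch below runs a hypergeometric overrepresentation test for a single term in base R and applies the negative log10 transformation. All counts are invented, and unlike `gost` the sketch applies no multiple-testing correction, so it shows the idea rather than reproducing g:Profiler's numbers.

```r
# Made-up counts for a single term, for illustration only
N <- 20000  # genes in the background
K <- 200    # background genes annotated to the term
n <- 50     # genes in the query
k <- 5      # query genes annotated to the term

# Upper tail P(X >= k): phyper() gives P(X > q) with lower.tail = FALSE, hence q = k - 1
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Y-axis transformation: smaller p-values map to larger plotted values
neg_log10_p <- -log10(p_value)

# Reference points: p = 0.05, 0.001 and 1e-8 map to roughly 1.3, 3 and 8
-log10(c(0.05, 0.001, 1e-8))
```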

These results can also be obtained as a table:

In [None]:
gostres$result
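
Because `significant = FALSE` was used above, the table contains every tested term from every data source. The base-R sketch below shows one way to keep only the significant GO terms and sort them by p-value. It runs on a toy data frame standing in for `gostres$result`: the column names (`term_id`, `term_name`, `source`, `p_value`, `significant`) are assumed to match the table returned by `gost`, and all values are invented.

```r
# Toy stand-in for gostres$result; all values are invented
res <- data.frame(
  term_id     = c("TERM:1", "TERM:2", "TERM:3"),
  term_name   = c("term one", "term two", "term three"),
  source      = c("GO:BP", "GO:MF", "KEGG"),
  p_value     = c(0.2, 1e-4, 0.03),
  significant = c(FALSE, TRUE, TRUE)
)

# Keep rows from GO sources (GO:BP, GO:MF, GO:CC) that are flagged as significant
is_go <- startsWith(res$source, "GO:")
go_sig <- res[res$significant & is_go, ]

# Most significant terms first, keeping only the columns of interest
go_sig <- go_sig[order(go_sig$p_value), c("term_id", "term_name", "source", "p_value")]
go_sig
```

With the real output, replace `res` with `gostres$result`. Alternatively, the query can be restricted up front through the `sources` argument of `gost`.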

## Annotate variants with gene IDs

Finally, we can use the `gsnpense` function to check which genes these variants map to, along with their predicted variant effects:

In [None]:
gsnpense(query = my_snps$RefSNP_id)