# Functional annotations for variants

This notebook shows how to use genome annotations and gene models to translate the genomic coordinates of variants into functional annotations.


- runtime: 30m
- recommended instance: mem1_ssd1_v2_x16
- estimated cost: <£0.70

This notebook depends on:
* **Bioconductor install**
* **Notebook 203** - height_signif_snp.csv

## Install required R packages

The `p_load` function from `pacman` loads packages into R.
If a given package is missing, it is installed automatically. This can take a considerable amount of time for packages that need to compile C or Fortran code.

The following packages are needed to run this notebook:

- `dplyr` - tabular data manipulation in R, required for pre-processing and filtering of phenotypic data
- `readr` - reads and writes tabular file formats: CSV, TSV, etc.
- `skimr` - provides summary statistics about variables in data frames, `tibble` structures, data tables and vectors
- `gprofiler2` - a tool set for functional enrichment analysis and visualization of genes and variants
- `VariantAnnotation` - Bioconductor package for variant annotations
- `TxDb.Hsapiens.UCSC.hg38.knownGene` - gene position for hg38 human genome release
- `BSgenome.Hsapiens.UCSC.hg38` - the DNA sequence of hg38 human genome release



In [None]:
if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr, skimr, readr, gprofiler2)
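For readers who prefer not to depend on `pacman`, the install-if-missing behaviour described above can be approximated in base R. This is only a minimal sketch: `load_or_install` is a hypothetical helper name, not part of `pacman`, and it is demonstrated on packages that ship with R so that nothing is actually installed.

```r
# Hypothetical helper (not part of pacman): install a package only if it is
# missing, then attach it. Roughly mirrors what p_load does for CRAN packages.
load_or_install <- function(pkgs) {
    for (pkg in pkgs) {
        if (!requireNamespace(pkg, quietly = TRUE)) {
            install.packages(pkg)
        }
        library(pkg, character.only = TRUE)
    }
    invisible(pkgs)
}

# Demonstrated on packages bundled with R, so no installation is triggered
loaded <- load_or_install(c("stats", "utils"))
```

Bioconductor packages are not covered by this sketch; they are installed through `BiocManager::install`, as in the next cell.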

In [None]:
# Install BioConductor contingent on R version
if(as.double(R.version$minor) < 3.0) {version <- '3.16'} else {version <- '3.17'}

if(!require(GenomicRanges)) BiocManager::install("GenomicRanges", version=version, ask=FALSE)
if(!require(VariantAnnotation)) BiocManager::install("VariantAnnotation", version=version, ask=FALSE)
if(!require(TxDb.Hsapiens.UCSC.hg38.knownGene)) BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene", version=version, ask=FALSE)
if(!require(BSgenome.Hsapiens.UCSC.hg38)) BiocManager::install("BSgenome.Hsapiens.UCSC.hg38", version=version, ask=FALSE)

In [3]:
pacman::p_load(TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38, VariantAnnotation, GenomicRanges)

## Load output of GWAS

In the first step, we load a list of variants. In this example, we use the list of significant variants from the GWAS on participant height in **Notebook 203**.

In [None]:
system('dx download gwas/height_signif_snp.csv')
snp <- readr::read_csv('./height_signif_snp.csv', show_col_types = FALSE)
head(snp)

## Construct the GenomicRanges object from the list of variants

In this step, we use the genomic coordinates (chromosome and physical position) of variants to construct the GenomicRanges object.

In [5]:
snp_gr <- makeGRangesFromDataFrame(
    snp, 
    seqnames.field = 'chromosome', 
    start.field = 'physical.pos', 
    end.field = 'physical.pos', 
    keep.extra.columns = TRUE)

The `GRanges` object consists of 3 mandatory fields: `seqnames` - the name of the chromosome, `ranges` - the position on the chromosome, and `strand` - the strand, where `*` denotes any strand.
In addition, there can be an arbitrary number of extra annotation fields. We use them to store the marker ID, the allele information and the statistics from the GWAS analysis.

In [None]:
head(snp_gr)
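To make the three mandatory fields and the extra annotation columns concrete, here is a small sketch on made-up data. The column names `chromosome`, `physical.pos`, `allele1` and `allele2` follow the notebook; the `marker.ID` column name and all values are assumptions chosen for illustration only.

```r
library(GenomicRanges)

# Toy input: two invented variants (values are illustrative, not real data)
toy <- data.frame(
    chromosome   = c("1", "2"),
    physical.pos = c(1000L, 2000L),
    marker.ID    = c("toy_snp_1", "toy_snp_2"),   # hypothetical column name
    allele1      = c("A", "G"),
    allele2      = c("C", "T"))

# Same call pattern as above: start and end both point at the SNP position
toy_gr <- makeGRangesFromDataFrame(
    toy,
    seqnames.field = "chromosome",
    start.field = "physical.pos",
    end.field = "physical.pos",
    keep.extra.columns = TRUE)

as.character(seqnames(toy_gr))   # chromosome names
start(toy_gr)                    # positions; each range has width 1
as.character(strand(toy_gr))     # "*" because no strand column was supplied
names(mcols(toy_gr))             # the extra columns kept as annotations
```

Because the start and end fields are the same column, every range covers a single base, which is the appropriate representation for SNPs.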

Next, we convert the `GRanges` object to a `VRanges` object. `VRanges` is a specialized extension of `GRanges`, designed specifically to hold information about genomic variation, such as the reference and alternative alleles.

In [7]:
vr <- VRanges(
    seqnames = seqnames(snp_gr),
    ranges = ranges(snp_gr),
    ref = snp_gr$allele1, 
    alt = snp_gr$allele2)

seqlevelsStyle(vr) <- seqlevelsStyle(TxDb.Hsapiens.UCSC.hg38.knownGene)

In [None]:
head(vr)
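The `seqlevelsStyle` assignment above matters because the annotation packages use UCSC-style chromosome names (`chr1`), while GWAS output often uses bare numbers (`1`). The following sketch shows the renaming on a toy `VRanges` object, assuming the input uses NCBI-style names; the positions and alleles are invented for illustration.

```r
library(VariantAnnotation)

# Toy VRanges with NCBI-style chromosome names (values are illustrative)
toy_vr <- VRanges(
    seqnames = Rle(c("1", "2")),
    ranges = IRanges(start = c(1000L, 2000L), width = 1L),
    ref = c("A", "G"),
    alt = c("C", "T"))

style_before <- seqlevels(toy_vr)   # "1" "2"

# Switch to the UCSC naming convention used by the TxDb and BSgenome packages
seqlevelsStyle(toy_vr) <- "UCSC"

style_after <- seqlevels(toy_vr)    # "chr1" "chr2"
```

If the chromosome names of the variants and of the annotation do not match, overlap-based functions will typically report no hits rather than an error, so it is worth checking `seqlevels(vr)` before annotating.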

## Predict coding variants

The `predictCoding` function computes the amino acid coding changes for variants that fall completely within a coding region. For further information on `predictCoding`, see the [documentation](https://www.rdocumentation.org/packages/VariantAnnotation/versions/1.18.5/topics/predictCoding).

In [None]:
coding <- predictCoding(vr, TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38)
head(coding)

## Locate variants

The `locateVariants` function reports where each variant falls relative to gene features such as coding regions, UTRs, splice sites and introns ([more info](https://www.rdocumentation.org/packages/VariantAnnotation/versions/1.18.5/topics/locateVariants)).
In the examples below, we select different classes of variants based on their functional annotation.

### Coding variants

In [None]:
cds <- locateVariants(vr, TxDb.Hsapiens.UCSC.hg38.knownGene, CodingVariants())
head(cds)

### 5' UTR variants

In [None]:
five <- locateVariants(vr, TxDb.Hsapiens.UCSC.hg38.knownGene, FiveUTRVariants())
head(five)

### Variants overlapping splice sites

In [None]:
splice <- locateVariants(vr, TxDb.Hsapiens.UCSC.hg38.knownGene, SpliceSiteVariants())
head(splice)

### Intronic variants

In [None]:
intron <- locateVariants(vr, TxDb.Hsapiens.UCSC.hg38.knownGene, IntronVariants())
head(intron)

## Summarise functional annotations

We can summarise the number of variants in each functional class.

In [None]:
lengths(list(cds=cds, five=five, splice=splice, intron=intron))
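Note that `locateVariants` returns one row per variant-transcript match, so a variant overlapping several transcripts is counted several times by `lengths`. The sketch below shows one way to count distinct variants instead, using the `QUERYID` column that maps each row back to the input. The toy object and the helper name `count_unique_variants` are assumptions for illustration.

```r
library(GenomicRanges)

# Toy stand-in for a locateVariants() result: the variant with QUERYID 1
# matches two transcripts, so it appears in two rows (values are invented)
toy_hits <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(100L, 100L, 250L), width = 1L),
    QUERYID = c(1L, 1L, 2L),
    TXID = c("tx1", "tx2", "tx3"))

# Hypothetical helper: number of distinct input variants in a result
count_unique_variants <- function(hits) length(unique(hits$QUERYID))

n_rows <- length(toy_hits)                      # 3 variant-transcript rows
n_variants <- count_unique_variants(toy_hits)   # 2 distinct variants
```

On the real results, the same helper could be applied with `sapply(list(cds = cds, five = five, splice = splice, intron = intron), count_unique_variants)`.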

## Get more info about coding genes

The `gconvert` function from `gprofiler2` converts the Entrez gene IDs of the coding variants to Ensembl IDs, gene names and short functional descriptions.

In [None]:
gconvert(query = unique(coding$GENEID), organism = "hsapiens", numeric_ns = 'ENTREZGENE_ACC')