# Download genome annotations for the Y chromosome

Links:

* info about different coordinate types: https://groups.google.com/forum/#!msg/biomart-users/OtQbAx3y9CA/wrF19ID1AgAJ
* https://www.biostars.org/p/2005/
* http://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html
* http://www.ensembl.org/info/data/biomart/biomart_r_package.html
* http://www.ensembl.info/blog/2015/06/01/biomart-or-how-to-access-the-ensembl-data-from-r/
* for checking with manually downloaded data: http://www.ensembl.org/info/data/biomart/how_to_use_biomart.html
* biotypes FAQ: http://www.ensembl.org/Help/Glossary
* http://www.ensembl.org/info/genome/funcgen/regulatory_build.html

Following [this](http://www.ensembl.org/info/data/biomart/biomart_r_package.html) tutorial, I want to extract coordinates of exonic and regulatory regions from the Ensembl database and then calculate the density of such regions in a defined window around each SNP.
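
The density calculation itself could look roughly like the sketch below. It is not the final implementation: the coordinates are made up, and the helper `window_density()` and its `flank` parameter are hypothetical names introduced only for illustration. It assumes the annotation ranges have already been merged so that no base is counted twice.

```r
library(GenomicRanges)

# Toy inputs with made-up coordinates; the real inputs would be the
# annotation ranges fetched below and the SNP positions
regions <- IRanges::reduce(GRanges("Y", IRanges(start = c(100, 400, 450),
                                                end   = c(199, 499, 549))))
snps <- GRanges("Y", IRanges(start = c(150, 500, 1000), width = 1))

# Fraction of each window (site +/- flank bp) covered by `regions`.
# `regions` must be non-overlapping (e.g. after reduce()), otherwise
# shared bases would be counted more than once.
window_density <- function(sites, regions, flank) {
    windows <- resize(sites, width = 2 * flank + 1, fix = "center")
    hits <- findOverlaps(windows, regions)
    overlap_bp <- width(pintersect(windows[queryHits(hits)],
                                   regions[subjectHits(hits)]))
    covered <- numeric(length(windows))
    per_window <- tapply(overlap_bp, queryHits(hits), sum)
    covered[as.integer(names(per_window))] <- per_window
    covered / width(windows)
}

density <- window_density(snps, regions, flank = 100)
density
```

Windows that overlap no annotated region get a density of 0, so every SNP contributes a value.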

These densities will then be used as predictors in a linear model predicting the Nea. ancestry at each site.

Alternatively, I could just test whether the distributions of densities for different regions differ depending on the frequency of Nea. alleles at each site.
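
Both options could be sketched as follows. Everything here is simulated: the column names (`exon_density`, `reg_density`, `nea_freq`), the effect sizes and the median split are assumptions made only so that the snippet runs, not results or choices from the real analysis.

```r
set.seed(1)

# Simulated stand-in data, one row per SNP; none of these values come
# from the real analysis
n <- 200
sites <- data.frame(
    exon_density = runif(n),
    reg_density  = runif(n)
)
# Hypothetical Neanderthal allele frequency with a weak, made-up
# dependence on exon density, clamped to [0, 1]
raw_freq <- 0.05 - 0.03 * sites$exon_density + rnorm(n, sd = 0.02)
sites$nea_freq <- pmin(pmax(raw_freq, 0), 1)

# Option 1: densities as predictors of Neanderthal ancestry
fit <- lm(nea_freq ~ exon_density + reg_density, data = sites)
coef(summary(fit))

# Option 2: compare the density distributions between sites with high and
# low Neanderthal allele frequency (arbitrary split at the median)
sites$high_nea <- sites$nea_freq > median(sites$nea_freq)
test <- wilcox.test(exon_density ~ high_nea, data = sites)
test$p.value
```

A rank-based test is used for the second option because window densities are bounded and unlikely to be normally distributed; that choice is also an assumption.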

# Fetch coordinates of different genomic regions

In [1]:
suppressMessages(suppressWarnings({
    library(biomaRt)
    library(rtracklayer)
    library(BSgenome.Hsapiens.UCSC.hg19)
    library(tidyverse)
    library(stringr)
    library(magrittr)
    library(here)
}))

Show all the available biomarts for hg19:

In [2]:
listMarts(host="grch37.ensembl.org")

biomart,version
<chr>,<chr>
ENSEMBL_MART_ENSEMBL,Ensembl Genes 97
ENSEMBL_MART_SNP,Ensembl Variation 97
ENSEMBL_MART_FUNCGEN,Ensembl Regulation 97


Connect to the human gene Ensembl dataset:

In [3]:
ensembl_mart_genes <- useMart("ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org")
listDatasets(ensembl_mart_genes) %>% filter(str_detect(dataset, "sapiens"))

dataset,description,version
<I<chr>>,<I<chr>>,<I<chr>>
hsapiens_gene_ensembl,Human genes (GRCh37.p13),GRCh37.p13


In [4]:
mart <- useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl_mart_genes)

### How many protein-coding genes are on chromosome Y?

In [5]:
getBM(c("chromosome_name", "ensembl_gene_id"),
      filters = c("chromosome_name", "biotype"),
      values = list("Y", "protein_coding"),
      mart = mart) %>%
count(chromosome_name)

chromosome_name,n
<chr>,<int>
Y,54
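
The query above is restricted to the `protein_coding` biotype. To also tally pseudogenes and other biotypes, one option is to fetch a biotype column for all chromosome Y genes and count per category. The sketch below runs on a toy data frame standing in for the `getBM()` result; the gene IDs and labels are placeholders, and it assumes that a `gene_biotype` attribute is available in this mart (`listAttributes(mart)` would confirm).

```r
library(dplyr)

# Toy stand-in for the result of a query such as
#   getBM(c("ensembl_gene_id", "gene_biotype"),
#         filters = "chromosome_name", values = "Y", mart = mart)
# The IDs and biotype labels below are placeholders, not real records.
y_genes <- data.frame(
    ensembl_gene_id = paste0("GENE", 1:6),
    gene_biotype = c("protein_coding", "pseudogene", "pseudogene",
                     "lincRNA", "protein_coding", "pseudogene")
)

# Number of genes per biotype, most frequent first
biotype_counts <- y_genes %>%
    count(gene_biotype, sort = TRUE)
biotype_counts
```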


### Protein coding gene coordinates on the Y

In [6]:
# Fetch start/end coordinates (whole gene span, introns included) of
# protein-coding genes on chromosome Y, then merge overlapping genes
genes <-
    getBM(c("chromosome_name", "start_position", "end_position"),
          filters = c("chromosome_name", "biotype"),
          values = list("Y", "protein_coding"),
          mart = mart) %>%
    filter(complete.cases(.)) %>%
    select(chrom = chromosome_name, start = start_position, end = end_position) %>%
    arrange(chrom, start) %>%
    makeGRangesFromDataFrame %>%
    IRanges::reduce()

### Total size of protein-coding gene regions on the Y chromosome?

In [7]:
genes %>% as.data.frame %>% group_by(seqnames) %>% summarise(total = sum(width))

seqnames,total
<fct>,<int>
Y,2836755
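
This total is the merged footprint of the gene spans, because `reduce()` collapses overlapping genes into single ranges before the widths are summed. A small sketch with made-up coordinates shows why this matters for the sum:

```r
library(GenomicRanges)

# Two overlapping toy "genes" plus one separate gene (made-up coordinates)
toy_genes <- GRanges("Y", IRanges(start = c(1, 51, 201),
                                  end   = c(100, 150, 300)))

# Without merging, bases 51-100 are counted twice
naive_total <- sum(width(toy_genes))

# reduce() merges 1-100 and 51-150 into a single range 1-150
merged <- IRanges::reduce(toy_genes)
merged_total <- sum(width(merged))

c(naive = naive_total, merged = merged_total)
```

As in the notebook code, `reduce()` is called with the `IRanges::` prefix so that it is not masked by `purrr::reduce()` when the tidyverse is attached.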


## Coordinates of primate phastCons elements

How to retrieve them: https://support.bioconductor.org/p/25587/

Per-base vs elements difference: https://www.biostars.org/p/2129/#2143

[From](http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgTables?db=hg19&hgta_group=compGeno&hgta_track=cons46way&hgta_table=phastConsElements46wayPrimates&hgta_doSchema=describe+table+schema):

_PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors. As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites. The two methods have different strengths and weaknesses. **PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements.** PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites)._

_Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution). In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The **phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1.**_

[...]

#### Conserved Elements

_The conserved elements were predicted by running phastCons with the --viterbi option. **The predicted elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM.** Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full"._


In [8]:
library(rtracklayer)

In [9]:
session <- browserSession()
genome(session) <- "hg19"

In [10]:
query <- ucscTableQuery(session, "cons46way", GRangesForUCSCGenome("hg19", chrom = c("chr21", "chrY")))

In [11]:
tableNames(query)

In [12]:
tableName(query) <- "phastConsElements46way"

In [13]:
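# Download the element table, drop columns not needed for coordinates,
# convert UCSC 0-based starts to 1-based, and keep only chrY
phastcons <-
    getTable(query) %>%
    select(-bin, -name, -score) %>%
    makeGRangesFromDataFrame(starts.in.df.are.0based = TRUE) %>%
    .[seqnames(.) == "chrY"]
seqlevels(phastcons) <- "chrY"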
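
The `starts.in.df.are.0based = TRUE` argument is needed because UCSC tables store intervals as 0-based, half-open coordinates, whereas `GRanges` objects are 1-based and closed. A minimal sketch on a toy table (made-up coordinates, with the start and end columns named explicitly) shows the effect:

```r
library(GenomicRanges)

# Toy UCSC-style table: 0-based, half-open intervals [chromStart, chromEnd)
ucsc <- data.frame(chrom      = "chrY",
                   chromStart = c(0, 100),
                   chromEnd   = c(10, 150))

gr <- makeGRangesFromDataFrame(ucsc,
                               start.field = "chromStart",
                               end.field   = "chromEnd",
                               starts.in.df.are.0based = TRUE)

# Starts are shifted by +1, ends are unchanged, so the widths equal
# chromEnd - chromStart
data.frame(start = start(gr), end = end(gr), width = width(gr))
```

Without the conversion every element would be one base too long, which would inflate the summed widths by one base per element.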

### Total size of phastCons sequence on the Y chromosome?

In [14]:
phastcons %>% as.data.frame %>% group_by(seqnames) %>% summarise(total = sum(width))

seqnames,total
<fct>,<int>
chrY,1212827
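
To put the two totals into context, they can be expressed as a percentage of the chromosome length. The sketch below hard-codes the hg19 chrY length (59,373,566 bp) so that it runs without the BSgenome package; `seqlengths(BSgenome.Hsapiens.UCSC.hg19)["chrY"]` should return the same value. Note that this denominator is the full assembly length, including unsequenced gaps.

```r
# Totals reported in the outputs above, in bp
totals <- c(genes = 2836755, phastcons = 1212827)

# Length of chrY in hg19, hard-coded (an assumption made so that the
# snippet does not depend on BSgenome.Hsapiens.UCSC.hg19)
chrY_length <- 59373566

# Percentage of the chromosome covered by each annotation
pct <- round(100 * totals / chrY_length, 2)
pct
```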
