# 1. General

Exercises belong to two categories, each marked with its own symbol:
* Pen-and-paper exercises
* Programming exercises


For questions regarding this exercise, feel free to contact corinna.losert@helmholtz-muenchen.de or katharina.schmid@helmholtz-muenchen.de.

# 2. Requirements

1. R packages
    - VariantAnnotation (install with: BiocManager::install("VariantAnnotation"))
    - snpStats (install with: BiocManager::install("snpStats"))
    - biomaRt (install with: BiocManager::install("biomaRt"))
    - Gviz (install with: BiocManager::install("Gviz"))
    - optional, but very useful for creating PDF reports: markdown & knitr (needs a valid TeX installation)
2. Data
    - filtered 1000 Genomes genotypes: VCF file (e-geuv-1_filtered.vcf.bgz and e-geuv-1_filtered.vcf.bgz.tbi, 1.4 MB / 114 KB)

In case you have problems installing R or any package, have a look at the instructions on Moodle (called “installation.pdf”) or use Google Colab instead (you can copy the notebook from this template).

In [None]:
# Install required packages if missing

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

if (!("VariantAnnotation" %in% rownames(installed.packages()))) {
    BiocManager::install("VariantAnnotation")
}
if (!("snpStats" %in% rownames(installed.packages()))) {
    BiocManager::install("snpStats")
}
if (!("biomaRt" %in% rownames(installed.packages()))) {
    BiocManager::install("biomaRt")
}
if (!("Gviz" %in% rownames(installed.packages()))) {
    BiocManager::install("Gviz")
}

# 3. Quick R refresher

Several of the following exercises will be in R, so let’s make sure you all know the basic R commands. For help, have a look at the large collection of cheatsheets from RStudio https://www.rstudio.com/resources/cheatsheets/, such as the one with base R commands http://github.com/rstudio/cheatsheets/raw/master/base-r.pdf.

Try to solve these short exercises to make sure you know the basic R commands:
* Load the ’VariantAnnotation’ package which will also be used later on in the script
* Get a list of the functions within the package
* Check out the documentation for the function "readVcf"
* Get your current working directory

### Install required packages if missing

In [None]:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

### Load the ’VariantAnnotation’ package which will also be used later on in the script

In [None]:
library(VariantAnnotation, quietly = TRUE)

### Get a list of the functions within the package

In [None]:
lsf.str("package:VariantAnnotation")

### Check out the documentation for the function "readVcf"

In [None]:
?readVcf
# help(readVcf)

### Get your current working directory

In [None]:
getwd()

### Assign the sum of 2, 3 and 4 to variable x

In [None]:
x <- 2 + 3 + 4
x

x <- sum(2:4)
x

x <- sum(c(2, 3, 4))
x

### Make a character vector of the gene names PAX6, ZIC2, OCT4 and SOX2 and a second numeric count vector of the same length containing randomly sampled numbers between 1 and 10 (set the seed to 42)

In [None]:
genes <- c("PAX6", "ZIC2", "OCT4", "SOX2")
# use the seed requested in the exercise so the sample is reproducible
set.seed(42)
counts <- sample(1:10, length(genes))

genes
counts
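
To see why the seed matters, here is a small sketch that is not part of the original solution. It draws the sample twice with the same seed and checks that both draws agree. The names `draw_a` and `draw_b` are made up for illustration.

```r
# Re-seeding the random number generator makes sample() reproducible.
set.seed(42)
draw_a <- sample(1:10, 4)

set.seed(42)
draw_b <- sample(1:10, 4)

# Both draws start from the same RNG state, so they match element by element.
identical(draw_a, draw_b)

# sample() draws without replacement by default, so no value is repeated.
anyDuplicated(draw_a) == 0
```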

### Subset the gene vector using [] notation, and get the 2nd and 4th element

In [None]:
genes[c(2, 4)]

### Generate a dataframe out of the two generated vectors

In [None]:
genes_df <- data.frame(genes, counts)
genes_df

### Select the genes of the generated dataframe for which the corresponding count value is greater than 5

In [None]:
# genes_df['counts']
# genes_df['counts'] > 5

# genes_df[genes_df$counts > 5, ]
subset(genes_df, counts > 5)
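
The commented-out lines above hint at how the selection works step by step. The following sketch spells this out on a toy data frame. `toy_df` and its count values are made up, so the result does not depend on the random sample drawn earlier.

```r
# Toy data frame with the same column names as genes_df (values are made up).
toy_df <- data.frame(
    genes = c("PAX6", "ZIC2", "OCT4", "SOX2"),
    counts = c(1, 7, 5, 9)
)

# Step 1: the comparison returns one TRUE/FALSE value per row.
keep <- toy_df$counts > 5
keep

# Step 2: the logical vector selects rows; the empty column index keeps all columns.
toy_df[keep, ]

# subset() evaluates the condition inside the data frame and selects the same rows.
subset(toy_df, counts > 5)
```

Note that the comparison is strict, so a count of exactly 5 is not selected.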

### Make a boxplot of the distribution of the generated count values in the dataframe

In [None]:
boxplot(genes_df$counts)

### Write a function that takes a gene name and a dataframe with one column named "genes" as input and checks whether this gene name occurs in the column. It returns TRUE if the gene name occurs in the dataset and FALSE if it doesn’t. Test the function with the dataset generated above.

In [None]:
is_gene_present <- function(gene_name, genes_df) {
    return(gene_name %in% genes_df$genes)
}

.test_gene_present <- function(gene_name, genes_df) {
    if (is_gene_present(gene_name, genes_df)) {
        print(paste0("Gene ", gene_name, " is present"))
    } else {
        print(paste0("Gene ", gene_name, " is not present"))
    }
}

.test_gene_present("BURUNDU", genes_df)
.test_gene_present("PAX6", genes_df)

# 4. Quick genetics refresher

Please briefly explain the following genetic terms:
* Central Dogma of Molecular Biology
* gene
* allele
* genotype
* heterozygous
* phenotype
* SNP

| **TERM**                           | **DEFINITION**                                                                                                  |
|------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Central Dogma of Molecular Biology | DNA is transcribed into RNA, which is translated into protein                                                   |
| Gene                               | A specific DNA region of a chromosome that is transcribed into RNA                                              |
| Allele                             | One of the alternative sequence variants that can occur at a gene or locus                                      |
| Genotype                           | The set of alleles an individual carries at a gene or locus (e.g. AA, Aa, aa)                                   |
| Heterozygous                       | The individual carries two different alleles at a gene or locus (e.g. Aa)                                       |
| Phenotype                          | The observable traits of an individual, resulting from its genotype and the environment                         |
| SNP                                | Single nucleotide polymorphism, a genetic variant in which a single base differs (it may affect the phenotype)  |

# 5. VCF (Variant Call format) files

## 5.1 VCF Format

In the following, you can find the first few lines of a VCF file.

![vcf_image](./assets/vcf.png)


### What is saved in a VCF file?

VCF = Variant Call Format

A file format for storing information about genetic variants.

The file starts with a header of explanatory meta-information lines. After that, each row represents a single variant: the first columns contain the annotations describing the variant, while the remaining columns contain the genotype data of the individual samples for that variant.

### Which are the eight mandatory columns in the header line? Explain their meaning and specify their data format!

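* CHROM: identifier of the chromosome on which the variant lies (String)
* POS: position of the variant on the reference sequence, 1-based and counted in bases (Integer)
* ID: identifier of the variant, for example an rs number, or "." if none is available (String)
* REF: reference allele, i.e. the base(s) found in the reference genome at this position (String)
* ALT: comma-separated list of the alternate alleles observed at this position (String)
* QUAL: Phred-scaled quality score of the variant call (Numeric)
* FILTER: filter status, PASS if the variant passed all filters, otherwise the names of the failed filters (String)
* INFO: additional information as a semicolon-separated list of key=value pairs (String)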
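
To make the column layout concrete, here is a minimal base R sketch that splits one data line into the eight mandatory fields. The line is modelled on the example in the VCF specification and is assumed, not confirmed, to match the figure above. Names such as `vcf_line` and `fixed_names` are made up for illustration, and INFO flags without a value are left out for simplicity.

```r
# One tab-separated data line, modelled on the VCF specification example
# (an assumption; compare it with the figure above).
vcf_line <- paste(
    "20", "14370", "rs6054257", "G", "A", "29", "PASS", "NS=3;DP=14;AF=0.5",
    sep = "\t"
)

fixed_names <- c("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")
fields <- setNames(strsplit(vcf_line, "\t", fixed = TRUE)[[1]], fixed_names)

# POS is an integer and QUAL is numeric; the other fields stay strings.
pos <- as.integer(fields[["POS"]])
qual <- as.numeric(fields[["QUAL"]])

# INFO is a semicolon-separated list of key=value pairs.
info_pairs <- strsplit(
    strsplit(fields[["INFO"]], ";", fixed = TRUE)[[1]],
    "=",
    fixed = TRUE
)
info <- setNames(
    vapply(info_pairs, function(p) p[2], character(1)),
    vapply(info_pairs, function(p) p[1], character(1))
)
info
```

In practice readVcf() does this parsing for you; the sketch only illustrates what the columns contain.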

### What are the genotypes of samples NA00001, NA00002 and NA00003 for the variant rs6054257 (write down nucleotides)?

* NA00001: _G G_
* NA00002: _A G_
* NA00003: _A A_
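
The answer above follows from translating the allele indices in the GT field into nucleotides. Here is a small sketch of that lookup in base R. The REF, ALT and GT values are taken from the VCF specification example for rs6054257 and are assumed to match the figure. `decode_gt` is a hypothetical helper, not a VariantAnnotation function.

```r
# REF/ALT and GT values as in the VCF specification example for rs6054257
# (assumed to match the figure).
ref_allele <- "G"
alt_alleles <- c("A")
gt <- c(NA00001 = "0|0", NA00002 = "1|0", NA00003 = "1/1")

# Allele index 0 is REF, index 1 is the first ALT allele, and so on.
alleles <- c(ref_allele, alt_alleles)

# Hypothetical helper: translate one GT string into nucleotides.
decode_gt <- function(g) {
    # "|" separates phased alleles, "/" unphased ones.
    idx <- as.integer(strsplit(g, "[|/]")[[1]])
    paste(alleles[idx + 1], collapse = " ")
}

nucleotides <- vapply(gt, decode_gt, character(1))
nucleotides
```

The helper does not handle missing calls ("."), which a real parser would need to cover.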

## 5.2 VCF in R

Read in the VCF file using the package VariantAnnotation and answer the following questions:

* Get the number of samples and variants.
* Get the first 5 SNPs from the first 3 samples.
* Get the reference and alternative alleles for these first 5 SNPs.
* Get the genotypes of samples HG00351, HG00353 and HG00355 for the variant rs17042098
* Get the frequencies of the genotypes for SNP rs17042098
* Convert the genotypes 0/0, 0/1, 1/1 for SNP rs17042098 to 0, 1, 2.

For help, check the Bioconductor documentation:
[http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html](http://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html)

In [None]:
vcf <- readVcf("data/e-geuv-1_filtered.vcf")

### Get the number of samples and variants.

In [None]:
.tmp <- dim(vcf)
n_variants <- .tmp[1]
n_samples <- .tmp[2]

cat("n_variants", n_variants, "\n", "n_samples", n_samples, "\n")

### Get the first 5 SNPs from the first 3 samples.

In [None]:
vcf_gen <- geno(vcf)$GT

vcf_gen[1:5, 1:3]

### Get the reference and alternative alleles for these first 5 SNPs.

In [None]:
ref(vcf)[1:5]
alt(vcf)[1:5]

### Get the genotypes of samples HG00351, HG00353 and HG00355 for the variant rs17042098

In [None]:
vcf_gen["rs17042098", c("HG00351", "HG00353", "HG00355")]

### Get the frequencies of the genotypes for SNP rs17042098

In [None]:
table(vcf_gen["rs17042098", ])
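
table() returns absolute counts. If relative frequencies are wanted, prop.table() can be applied on top. The sketch below shows this on made-up genotype calls (`toy_gt` is not taken from the exercise data) and also derives the alternate allele frequency from the counts.

```r
# Toy genotype calls (made up; not taken from the exercise data).
toy_gt <- c("0/0", "0/1", "0/0", "1/1", "0/1", "0/0", "0/0", "0/1")

# Absolute counts per genotype.
gt_counts <- table(toy_gt)
gt_counts

# Relative frequencies, which sum to 1.
gt_freq <- prop.table(gt_counts)
gt_freq

# Alternate allele frequency: each 0/1 carries one ALT copy, each 1/1 two,
# and every sample contributes two alleles in total.
alt_copies <- gt_counts[["0/1"]] + 2 * gt_counts[["1/1"]]
alt_freq <- alt_copies / (2 * length(toy_gt))
alt_freq
```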

### Convert the genotypes 0/0, 0/1, 1/1 for SNP rs17042098 to 0, 1, 2.

In [None]:
head(as(genotypeToSnpMatrix(vcf["rs17042098", ])$genotypes, "numeric"))
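
As an alternative way to think about this conversion, the 0/1/2 coding is simply the number of alternate alleles per genotype. The following base R sketch does the lookup by hand on made-up calls. `toy_gt` and `dosage_map` are illustrative names, and the sketch assumes a biallelic SNP.

```r
# Toy genotype calls (made up); "./." marks a missing call.
toy_gt <- c(s1 = "0/0", s2 = "0/1", s3 = "1/1", s4 = "1|0", s5 = "./.")

# Lookup table from genotype string to the number of ALT alleles.
dosage_map <- c("0/0" = 0, "0/1" = 1, "1/0" = 1, "1/1" = 2)

# Treat phased and unphased calls alike, then look up each genotype.
unphased <- gsub("|", "/", toy_gt, fixed = TRUE)
dosage <- unname(dosage_map[unphased])
names(dosage) <- names(toy_gt)

# Genotypes that are not in the lookup table (here "./.") become NA.
dosage
```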

## 5.3 Genomic Ranges in R

GRanges objects are representations of genomic regions in R, consisting of a chromosome (called seqnames), a start position in base pairs (bp) and a width in bp. Additional information that further describes the regions can be added as metadata columns. For more details, see:

[https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html)

Get all SNPs in the region of chromosome 4, 2 000 000 - 3 000 000 bp.

Hints:
* you can extract a GRanges object containing SNP annotations from the vcf file using rowRanges()
* create a GRanges object of the region of interest and calculate overlaps with findOverlaps()

In [None]:
# define the search region (2 000 000 - 3 000 000 bp, both ends included) as a GRanges object
search_range <- GRanges(
    seqnames = "4",
    ranges = IRanges(start = 2000000, end = 3000000)
)

# get GRanges object with all SNPs annotated in the vcf file
snp_regions <- rowRanges(vcf)

# find overlaps between both GRanges objects and subset the vcf GRanges object
# (queryHits() returns the indices of the SNPs that overlap the search region)
ov <- findOverlaps(snp_regions, search_range)
snp_regions[queryHits(ov)]
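
To illustrate what the overlap test does for single-base variants, here is a base R sketch with made-up SNP positions (`toy_snps` is hypothetical and not derived from the vcf file). A SNP counts as overlapping if it lies on the same chromosome and within the closed interval of the search region.

```r
# Hypothetical SNP positions (made up for illustration).
toy_snps <- data.frame(
    id = c("snpA", "snpB", "snpC", "snpD"),
    chrom = c("4", "4", "4", "5"),
    pos = c(1999999, 2000000, 3000000, 2500000)
)

region_chrom <- "4"
region_start <- 2000000
region_end <- 3000000

# A SNP overlaps the region if it is on the same chromosome and its
# position lies within the closed interval [region_start, region_end].
in_region <- toy_snps$chrom == region_chrom &
    toy_snps$pos >= region_start &
    toy_snps$pos <= region_end

toy_snps[in_region, ]
```

Both boundary positions are included, while the SNP on chromosome 5 is excluded even though its position falls inside the numeric range.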

# 6. Biomart annotation

biomaRt is a package to retrieve gene annotations from Biomart in R.

Use this to look up the position (chromosome_name, start_position, end_position) and HGNC gene symbols for
the genes with Ensembl gene ids (ensembl_gene_id) ENSG00000196620, ENSG00000109787, ENSG00000241163
and ENSG00000000938.

Hints:
* get Ensembl GRCh37 annotations: useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl",GRCh=37)
* select desired genes with getBM()

In [None]:
library(biomaRt)

In [None]:
ensembl <- useEnsembl(
    biomart = "ensembl",
    dataset = "hsapiens_gene_ensembl", GRCh = 37
)
genes <- c(
    "ENSG00000196620", "ENSG00000109787",
    "ENSG00000241163", "ENSG00000000938"
)
getBM(attributes = c(
    "ensembl_gene_id", "chromosome_name",
    "start_position", "end_position", "hgnc_symbol"
), filters = "ensembl_gene_id", values = genes, mart = ensembl)

# 7. Visualization of genomic data

Visualize the region surrounding rs17042098 (+/- 500 000 bp) on the genome using the package Gviz. Possible tracks
that you could use are: IdeogramTrack to depict the whole chromosome, GenomeAxisTrack to add a bp axis and
BiomartGeneRegionTrack to add gene annotations.

Use the user guide for more information:
[https://bioconductor.org/packages/release/bioc/vignettes/Gviz/inst/doc/Gviz.html](https://bioconductor.org/packages/release/bioc/vignettes/Gviz/inst/doc/Gviz.html)


In [None]:
library(Gviz)

In [None]:
# position of the SNP, read with the start() accessor instead of direct slot access
snp_pos <- start(rowRanges(vcf)["rs17042098"])
start_region <- snp_pos - 5e05
end_region <- snp_pos + 5e05
# get ideogram of the chromosome 4
itrack <- IdeogramTrack(genome = "hg19", chromosome = "4")
# get track for axis labeling
axtrack <- GenomeAxisTrack()
# get track with all genes with biomart annotation in the region
biom_track <- BiomartGeneRegionTrack(
    genome = "hg19", name = "gene model",
    chromosome = 4,
    start = start_region, end = end_region,
    transcriptAnnotation = "symbol", frame = TRUE
)
# plot all tracks
plotTracks(list(itrack, axtrack, biom_track),
    from = start_region, to = end_region
)