![UoE](dc.png)

# DNAseq Coursework - Processing and analysis of high throughput DNA sequencing data

KS Singh<sup>1\*</sup>

<sup>1</sup>College of Life and Environmental Sciences;
Penryn Campus, University of Exeter TR10 9FE Cornwall UK

<sup>\*</sup> Correspondence: <ks575@exeter.ac.uk>


## Abstract

Next-generation sequencing (NGS) makes it possible to routinely sequence and re-sequence the whole genome of a single individual in a single laboratory within a couple of weeks and at comparatively low cost. The ability to sequence massive amounts of DNA lets us investigate almost any question associated with the genetic sequence. First, it allows us to determine the nucleotide sequence of a target region (e.g., all exonic regions, that is, the whole exome) or of the complete genome, and to identify known as well as novel single nucleotide polymorphisms (SNPs) in the sequenced region. Furthermore, paired reads facilitate the investigation of larger structural variants such as inversions, deletions, and insertions. Analysis of variants reveals the genetic makeup of a particular species and also accounts for differences from other organisms or differences arising under different conditions. This coursework summarizes the main approaches to analyzing DNAseq data using the GATK pipeline and Samtools. It demonstrates how to map short-read paired-end sequencing data against a reference genome and how to call high-quality variants from the sample(s) of interest.

1. [Introduction]()
2. [Know your data]()
3. [Variant calling approaches]()
4. [File formats]()
5. [Short-read mapping]()
6. [Variant calling]()
7. [Variant filtering]()
8. [Variant annotation using SnpEff]()
9. [Process VCF using Python]()


## Introduction

### General Terminologies

**Variant** 
A genetic variant is a difference between a genome and a "reference" genome. As an example, imagine we are sequencing a "sample". Here "sample" can mean anything you are interested in studying, from a cell culture to a mouse or a cancer patient. It is standard procedure to compare your sample sequences against the corresponding reference genome; for instance, you may compare a cancer patient's genome against the reference genome. In a typical sequencing experiment, you will find many places in the genome where your sample differs from the reference genome. These are called "genomic variants" or just "variants".

Typically, variants are categorized as follows*:

|Type|Meaning|Example|
| --- | --- | --- |
| SNP | Single-Nucleotide Polymorphism | Reference = A; Sample = C |
| INS | Insertion | Reference = A; Sample = AGT |
| DEL | Deletion | Reference = AC; Sample = C |
| MNP | Multiple-Nucleotide Polymorphism | Reference = ATA; Sample = GTC |
| MIXED | Multiple-nucleotide change combined with an insertion or deletion | Reference = ATA; Sample = GTCAGT |

*This list is not comprehensive; it is only meant to give an idea of the main categories.
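
The categories in the table can be told apart from the reference and sample alleles alone. The following is a minimal sketch of one way to do that, not the rule set used by any particular tool: the function name `classify_variant` and the prefix/suffix heuristic for separating pure insertions and deletions from MIXED variants are assumptions made for illustration.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Return the variant type from the table above for one REF/ALT pair.

    Hypothetical helper: the heuristic below is an illustrative assumption,
    not the classification logic of GATK, Samtools or SnpEff.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical; this is not a variant")

    # Same length: one substituted base is a SNP, several are an MNP
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"

    # Different length: a pure insertion or deletion keeps the shorter
    # allele intact at one end of the longer allele
    shorter, longer = sorted((ref, alt), key=len)
    if longer.startswith(shorter) or longer.endswith(shorter):
        return "INS" if len(alt) > len(ref) else "DEL"

    # Length change plus substituted bases
    return "MIXED"


# The five example rows from the table
for ref, alt in [("A", "C"), ("A", "AGT"), ("AC", "C"), ("ATA", "GTC"), ("ATA", "GTCAGT")]:
    print(ref, alt, classify_variant(ref, alt))
```

Run on the table's own examples, the sketch returns SNP, INS, DEL, MNP and MIXED in that order.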

**Haplotype**
A haplotype is a set of DNA variations, or polymorphisms, that tend to be inherited together. A haplotype can refer to a combination of alleles or to a set of single nucleotide polymorphisms (SNPs) found on the same chromosome.
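
To make the idea concrete, the sketch below rebuilds the two haplotypes of a diploid sample from phased genotypes at three sites on the same chromosome. It assumes biallelic sites and genotypes written in the VCF phased notation (`0|1`, where `|` separates the two chromosome copies); the function name `haplotypes_from_phased` and the example sites are hypothetical.

```python
def haplotypes_from_phased(sites):
    """Build both haplotypes from (ref, alt, phased_genotype) tuples.

    Assumes biallelic sites on one chromosome and phased genotypes such
    as "0|1", where 0 is the reference allele and 1 the alternate allele.
    """
    hap_a, hap_b = [], []
    for ref, alt, genotype in sites:
        alleles = (ref, alt)
        left, right = genotype.split("|")  # "|" marks a phased genotype
        hap_a.append(alleles[int(left)])
        hap_b.append(alleles[int(right)])
    return "".join(hap_a), "".join(hap_b)


# Hypothetical example: three SNPs on the same chromosome
sites = [("A", "G", "0|1"), ("C", "T", "0|1"), ("G", "A", "1|0")]
print(haplotypes_from_phased(sites))  # ('ACA', 'GTG')
```

Each returned string is the set of alleles carried together on one chromosome copy, which is what "inherited together" refers to.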

**SNP Calling** 
The process of identifying variable sites.

**Genotype Calling**
The process of determining the genotype of each individual at each site.
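
The difference between the two definitions above can be sketched with a toy model. The example assumes a diploid, biallelic site, a simple binomial read-sampling model and a fixed per-base error rate; this is a deliberate simplification for teaching and is not how GATK or Samtools compute their calls. All function names, sample names and read counts are hypothetical.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Binomial likelihood of the read counts under each diploid genotype."""
    n = ref_reads + alt_reads
    likelihoods = {}
    for genotype, alt_copies in (("0/0", 0), ("0/1", 1), ("1/1", 2)):
        frac = alt_copies / 2  # fraction of chromosomes carrying ALT
        # Probability that a single read shows the ALT base
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        likelihoods[genotype] = (
            comb(n, alt_reads) * p_alt**alt_reads * (1 - p_alt) ** ref_reads
        )
    return likelihoods


def call_genotype(ref_reads, alt_reads, error_rate=0.01):
    """Genotype calling: pick the most likely genotype for one individual."""
    likelihoods = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)


def call_site(sample_counts, error_rate=0.01):
    """SNP calling: the site is variable if any individual carries ALT."""
    genotypes = {
        sample: call_genotype(ref, alt, error_rate)
        for sample, (ref, alt) in sample_counts.items()
    }
    is_variable = any(gt != "0/0" for gt in genotypes.values())
    return is_variable, genotypes


# Hypothetical (ref_reads, alt_reads) counts at one site
counts = {"sample_1": (10, 0), "sample_2": (6, 5), "sample_3": (0, 12)}
print(call_site(counts))
```

`call_genotype` answers the per-individual question (genotype calling), while `call_site` answers the per-site question (SNP calling) by asking whether any individual differs from the reference.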

![gatk](gatk.png)


## Know your data

- Calling variants from high coverage DNAseq data
- Calling variants from low/shallow coverage DNAseq data
- Calling variants from RNAseq data
- Calling variants from Linkage mapping data (RADseq)
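
Whether a dataset counts as high or low coverage can be estimated before any mapping from the standard relation coverage = (number of reads × read length) / genome size. The sketch below applies that formula; the function name `mean_coverage` and the example numbers are hypothetical and are not taken from the coursework data.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Expected mean depth of coverage: total sequenced bases / genome size.

    n_reads counts individual reads, so a paired-end run with N read
    pairs contributes 2 * N reads.
    """
    return n_reads * read_length / genome_size


# Hypothetical run: 10 million read pairs of 2 x 150 bp on a 120 Mb genome
n_pairs = 10_000_000
print(mean_coverage(2 * n_pairs, 150, 120_000_000))  # 25.0
```

This is only an expectation over the whole genome; the depth actually achieved at any given site varies and is best checked after mapping.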

## Variant calling approaches

### GATK

![gatk_workflow](gatk_workflow.png)

### Samtools

![samtools](samtools.png)

In [2]:
#check the file structure
!ls -la ../

total 20
drwxr-xr-x 10 ks575 domain^users  141 May 28 22:22 .
drwxr-xr-x  4 ks575 domain^users   44 May 28 18:14 ..
drwxr-xr-x  2 ks575 domain^users 4096 May 28 18:14 Bay-S
drwxr-xr-x  3 ks575 domain^users  135 May 28 22:22 docs
drwxr-xr-x  2 ks575 domain^users 4096 May 28 18:15 genome
drwxr-xr-x  2 ks575 domain^users 4096 May 28 18:14 Nl33
drwxr-xr-x  2 ks575 domain^users 4096 May 28 18:14 Nl55
drwxr-xr-x  2 ks575 domain^users   67 May 28 18:16 snpeff
drwxr-xr-x  3 ks575 domain^users   28 May 28 18:15 snpeff_data
drwxr-xr-x  2 ks575 domain^users 4096 May 28 18:14 VCF


In [3]:
#print the current working directory
!pwd

/home/ISAD/ks575/Data/BUF19/DNAseq/docs
