# NGS Data formats and QC

## Introduction
There are several file formats for storing Next Generation Sequencing (NGS) data. In this tutorial we will look at some of the most common formats for storing NGS reads and variant data. We will cover the following formats:

__FASTQ__ - This format stores unaligned read sequences with base qualities  
__SAM/BAM__ - This format stores unaligned or aligned reads (text and binary formats)  
__CRAM__ - This format is similar to BAM but achieves better compression  
__VCF/BCF__ - Flexible variant call format for storing SNPs, indels and structural variations (text and binary formats)  

Following this, we will work through some examples of converting between the different formats.  

In addition to understanding the different file formats, it is important to remember that all sequencing platforms have technical limitations that can introduce biases into your sequencing data. It is therefore very important to check the quality of the data before starting any analysis, whether you sequenced it yourself or are using publicly available data. In the latter part of this tutorial we will describe how to perform a QC assessment of your NGS data.

## Learning outcomes
On completion of the tutorial, you can expect to be able to:

* Describe the different NGS data formats available (FASTQ, SAM/BAM, CRAM, VCF/BCF)
* Perform conversions between the different data formats
* Perform a QC assessment of high throughput sequence data

## Tutorial sections
This tutorial comprises the following sections:   
 1. [Data formats](formats.ipynb)   
 2. [File conversion](conversion.ipynb)   
 3. [QC assessment](assessment.ipynb)    

## Authors
This tutorial was written by [Jacqui Keane](https://github.com/jacquikeane) and [Sara Sjunnebo](https://github.com/ssjunnebo) based on material from [Petr Danecek](https://github.com/pd3) and [Thomas Keane](https://github.com/tk2).

## Running the commands from this tutorial
You can follow this tutorial by typing all the commands you see into a terminal window. This is similar to the "Command Prompt" window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, open a new terminal on your computer and type the command below:

In [None]:
cd /home/manager/pathogen-informatics-training/Notebooks/QC/

Now you can follow the instructions in the tutorial from here.

## Let’s get started!
This tutorial assumes that you have samtools, bcftools and Picard tools installed on your computer. These are already installed on the VM you are using. To check that these are installed, you can run the following commands:

In [None]:
samtools --help

In [None]:
bcftools --help

In [None]:
PicardCommandLine -h

Each command should return the help message for the corresponding tool: samtools, bcftools and Picard tools respectively.
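
If you would rather check all three tools in one go, something like the following may help. This is only a minimal sketch, assuming a POSIX-compatible shell; `check_tool` is a hypothetical helper name introduced here for illustration and is not part of samtools, bcftools or Picard. It only reports whether each command can be found on your `PATH`, not whether it runs correctly.

```shell
# Hypothetical helper (not part of the tutorial material):
# report whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found"
        return 1
    fi
}

# Check each tool this tutorial relies on; keep going even if one is missing.
for tool in samtools bcftools PicardCommandLine; do
    check_tool "$tool" || true
done
```

Any tool reported as "NOT found" would need to be installed, or added to your `PATH`, before you continue.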

To get started with the tutorial, head to the first section: [Data formats](formats.ipynb)