# Data Formats and QC tutorial

## Introduction
There are many different file formats for storing NGS data and analysis results. In this tutorial we will look at the most common ones used for storing NGS reads and variant calling results:

__FASTQ__ - Unaligned read sequences with base qualities  
__SAM/BAM__ - Unaligned or aligned reads (text and binary formats)  
__CRAM__ - Better compression than BAM  
__VCF/BCF__ - SNPs, indels, structural variations (text and binary formats)  
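
As a small taste of what these formats look like, the sketch below writes a single made-up FASTQ record to a temporary file and counts the reads in it. The read name, sequence and qualities are invented for illustration only. It relies on the usual FASTQ layout of four lines per read (name, sequence, `+` separator, base qualities), which is covered properly in the Data formats section.

```shell
# Write one made-up FASTQ record (four lines) to a temporary file.
# The read name, bases and qualities below are illustrative, not real data.
tmp_fastq=$(mktemp)
printf '%s\n' '@read1' 'GATTACA' '+' 'IIIIIII' > "$tmp_fastq"

# Each record spans four lines, so the number of reads is
# the number of lines divided by four.
n_lines=$(wc -l < "$tmp_fastq")
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Clean up the temporary file.
rm -f "$tmp_fastq"
```

The same line-count trick is a handy sanity check on real uncompressed FASTQ files, assuming each record is not wrapped across more than four lines.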

All sequencing platforms have technical limitations that can introduce errors in your sequencing data. Because of this, it is very important to check the quality of the data before starting any analysis, whether it's something you have sequenced yourself or publicly available data. We will look at how you can perform a QC assessment of your NGS data, and also how to identify possible contamination.

## Learning outcomes
On completion of the tutorial, you can expect to be able to:

* Describe the different NGS data formats available (FASTQ, SAM/BAM, CRAM, VCF/BCF)
* Perform conversions between the different data formats
* Perform a QC assessment of high throughput sequence data
* Identify possible contamination in high throughput sequence data

## Tutorial sections
This tutorial comprises the following stages:
1. [Data formats](formats.ipynb)
2. [File conversion](conversion.ipynb)
3. [QC assessment](assessment.ipynb)
4. [Identifying contamination](contamination.ipynb)

## Authors
This tutorial was written by [Sara Sjunnebo](https://github.com/ssjunnebo) based on material from [Petr Danecek](https://github.com/pd3) and [Thomas Keane](https://github.com/tk2).

## Running the commands from this tutorial
You can run the commands in this tutorial either directly from the Jupyter notebook (if using Jupyter), or by typing the commands in your terminal window. 

### Running commands on Jupyter
If you are using Jupyter, command cells (like the ones below) can be run by selecting the cell and either clicking _Cell -> Run_ in the menu above or pressing _Ctrl+Enter_. Let's give this a try by printing our working directory using the _pwd_ command and listing the files within it. Run the commands in the two cells below.

In [None]:
pwd

In [None]:
ls -l

### Running commands in the terminal
You can also follow this course by typing all the commands you see into a terminal window. This is similar to the "Command Prompt" window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, select the cell below with the mouse and then either press control and enter or choose Cell -> Run in the menu at the top of the page.

In [None]:
echo cd $PWD

Now open a new terminal on your computer and type the command that was output by the previous cell followed by the enter key. The command will look similar to this:

    cd /home/manager/pathogen-informatics-training/Notebooks/QC/
    
Now you can follow the instructions in the tutorial from here.

## Let’s get started!
This tutorial assumes that you have samtools, bcftools and bwa installed on your computer. For download and installation instructions, please see:

* The [samtools website](http://samtools.sourceforge.net/)
* The [bcftools website](http://www.htslib.org/download/)
* The [bwa GitHub page](https://github.com/lh3/bwa)

To check that you have installed samtools and bcftools correctly, you can run the following commands:

In [None]:
samtools --help

In [None]:
bcftools --help

This should return the help message for samtools and bcftools, respectively.

Similarly, to check that you have installed bwa correctly, you can run:

In [None]:
bwa

This should return the help message for bwa.
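
If you would rather check all three tools in one go, the following is a minimal sketch that only tests whether each program can be found on your `PATH`. The helper name `check_tools` is made up for this example, and the check does not verify versions or that the tools run correctly.

```shell
# check_tools is a hypothetical helper: it reports whether each named
# program can be found on the PATH and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Check the three tools this tutorial assumes are installed.
check_tools samtools bcftools bwa || echo "Install the missing tools before continuing."
```

Any tool reported as missing can be installed by following the links above.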

To get started with the tutorial, head to the first section: [Data formats](formats.ipynb)  
The answers to all questions in the tutorial can be found [here](answers.ipynb).