# QC assessment of NGS data
Quality control (QC) is an important part of any analysis. In this section we are going to look at some of the metrics and graphs that can be used to assess the quality of NGS data.


## Base quality
[Illumina sequencing](https://en.wikipedia.org/wiki/Illumina_dye_sequencing) technology relies on sequencing by synthesis. One of the most common problems with this is __dephasing__. For each sequencing cycle, there is a possibility that the replication machinery slips and either incorporates more than one nucleotide or fails to incorporate one at all. The more cycles that are run (i.e. the longer the read), the more these errors accumulate. This leads to a heterogeneous population of molecules in the cluster and a decreased signal purity, which in turn reduces the precision of the base calling. The figure below shows an example of this.

![Mean Base Quality](img/QC_good.png)

Because of dephasing, it is possible to have high-quality data at the beginning of the read but really low-quality data towards the end of the read. In those cases you can decide to trim off the low-quality bases, for example using a tool called [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic).



The figure below shows an example of a good sequencing run (left) and a poor sequencing run (right).

![Base quality](img/QC.png)

## Generating QC stats
Now let's try this out! 
It is good practice to check the quality of the sequences by plotting the quality (Q) scores by position in the read. In general, a Q score above 30 is considered good.
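To make the Q-score threshold concrete, here is a minimal Python sketch of how Q scores relate to error probabilities and how a per-position mean could be computed. It assumes the standard Phred+33 encoding (the same one selected by `-phred33` later in this tutorial) and the usual definition Q = -10 log10(P). The function names and the example quality strings are hypothetical, made up for illustration; FastQC does this work for you in practice.

```python
def phred33_to_scores(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Q score q is wrong."""
    return 10 ** (-q / 10)


def mean_quality_per_position(quality_strings):
    """Mean Q score at each read position; reads may differ in length."""
    totals, counts = [], []
    for qs in quality_strings:
        for i, q in enumerate(phred33_to_scores(qs)):
            if i == len(totals):
                # First read that reaches this position: extend the accumulators.
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Hypothetical quality strings, not taken from the tutorial data.
example_reads = ["IIII5", "III5", "II?5+"]
print(phred33_to_scores("I?5+"))
print(error_probability(30))
print(mean_quality_per_position(example_reads))
```

Under this definition, Q30 corresponds to an estimated error rate of about 1 in 1,000 base calls, and Q20 to about 1 in 100.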

To generate a plot, we will run `FastQC` on the raw FASTQ files from SRA run [SRR13882963](https://www.ncbi.nlm.nih.gov/sra/SRR13882963), namely `SRR13882963_1.fastq` and `SRR13882963_2.fastq`.

- Install on your [laptop](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- The analysis [modules](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/)
- Example of [good data](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html)
- Example of [bad data](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)

```bash
fastqc data/SRR13882963_1.fastq
```

FastQC writes its report as an HTML file that can be opened in a web browser.
If the quality isn't good enough, we can proceed to trim off the low-quality bases using [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic).

## Trimming the fastq sequences

Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.

There are two major modes of the program: paired-end mode and single-end mode. Paired-end mode maintains the correspondence of read pairs and also uses the additional information contained in paired reads to better find adapter or PCR primer fragments introduced by the library preparation process.

Trimmomatic works with FASTQ files. Files compressed using either `gzip` or `bzip2` are supported, and are identified by their `.gz` or `.bz2` file extensions.

Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters is supplied on the command line.

The current trimming steps are:
- ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read
- SLIDINGWINDOW: Perform a sliding window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold
- MAXINFO: An adaptive quality trimmer which balances read length and error rate to maximise the value of each read
- LEADING: Cut bases off the start of a read, if below a threshold quality
- TRAILING: Cut bases off the end of a read, if below a threshold quality
- CROP: Cut the read to a specified length by removing bases from the end
- HEADCROP: Cut the specified number of bases from the start of the read
- MINLEN: Drop the read if it is below a specified length
- AVGQUAL: Drop the read if the average quality is below the specified level
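The sliding-window idea described in the list above can be sketched in a few lines of Python. This is a simplified illustration of the description given here, not Trimmomatic's actual implementation: the function names are hypothetical, the read is cut at the start of the first failing window, and reads shorter than the window are left untouched (both of these are assumptions made for the sketch).

```python
def sliding_window_cut(qualities, window_size, required_quality):
    """Return how many bases to keep from the 5' end of a read.

    Scans windows from the 5' end and cuts at the start of the first
    window whose mean quality falls below required_quality.
    """
    if len(qualities) < window_size:
        # Assumption for this sketch: too-short reads are not trimmed.
        return len(qualities)
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < required_quality:
            return start
    return len(qualities)


def passes_minlen(kept_length, min_length):
    """MINLEN-style check: keep the read only if it is still long enough."""
    return kept_length >= min_length


# Hypothetical read: six good bases followed by four poor ones.
qualities = [30] * 6 + [10] * 4
kept = sliding_window_cut(qualities, window_size=4, required_quality=17)
print(kept, passes_minlen(kept, min_length=35))
```

With a window of 4 and a threshold of 17, the window only fails once it contains three poor bases, so a single low-quality base in an otherwise good stretch does not trigger a cut.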

For more information, you can go to: http://www.usadellab.org/cms/?page=trimmomatic

### Paired End Mode

For paired-end data, two input files and four output files are specified: two for the 'paired' output, where both reads survived the processing, and two for the corresponding 'unpaired' output, where one read survived but its partner did not.
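The routing of read pairs into the four outputs can be illustrated with a small Python sketch. The function name is hypothetical, and the labels `1P`, `1U`, `2P` and `2U` simply mirror the output file suffixes used in the commands in this section; this only illustrates the logic described above and is not Trimmomatic code.

```python
def route_pair(forward_survived, reverse_survived):
    """Return the output labels that the reads of one pair are written to."""
    if forward_survived and reverse_survived:
        return ["1P", "2P"]  # both reads kept: paired outputs
    if forward_survived:
        return ["1U"]        # only the forward read kept: unpaired output
    if reverse_survived:
        return ["2U"]        # only the reverse read kept: unpaired output
    return []                # both reads dropped: written nowhere


for flags in [(True, True), (True, False), (False, True), (False, False)]:
    print(flags, route_pair(*flags))
```

Keeping the paired files in step like this matters because downstream tools such as aligners generally expect the n-th read in the forward file to be the mate of the n-th read in the reverse file.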

```bash
# Command for paired-end trimming:
java -jar <path-to-file-trimmomatic-0.35.jar> PE -phred33 \
    <path-to-file-input_R1.fq.gz> <path-to-file-input_R2.fq.gz> \
    <path-to-file-output_1P.fq.gz> <path-to-file-output_1U.fq.gz> \
    <path-to-file-output_2P.fq.gz> <path-to-file-output_2U.fq.gz> \
    (set-of-trimming-parameters)
```

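Example for variant calling. This removes leading and trailing bases with quality below 3, clips the read once the average quality within a 4-base window falls below 17, and drops reads shorter than 35 bases: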

```bash
java -jar Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 \
    data/SRR13882963_1.fastq data/SRR13882963_2.fastq \
    data/SRR13882963_1P.fastq data/SRR13882963_1U.fastq \
    data/SRR13882963_2P.fastq data/SRR13882963_2U.fastq \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:17 MINLEN:35
```
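If you need to run the same trimming on several samples, it can help to build the command programmatically. The following is a minimal sketch, assuming the file naming used in this tutorial (`<prefix>_1.fastq`, `<prefix>_2.fastq` and the `1P`/`1U`/`2P`/`2U` output suffixes). The helper `trimmomatic_pe_command` is a hypothetical name, not part of Trimmomatic, and the sketch only prints the command rather than running it.

```python
import shlex


def trimmomatic_pe_command(jar, prefix, steps):
    """Build the argument list for a paired-end Trimmomatic run."""
    outputs = [f"{prefix}_{tag}.fastq" for tag in ("1P", "1U", "2P", "2U")]
    return [
        "java", "-jar", jar, "PE", "-phred33",
        f"{prefix}_1.fastq", f"{prefix}_2.fastq",
        *outputs,
        *steps,
    ]


cmd = trimmomatic_pe_command(
    "Trimmomatic-0.39/trimmomatic-0.39.jar",
    "data/SRR13882963",
    ["LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:17", "MINLEN:35"],
)
# Print the command for inspection; to execute it, pass `cmd` to
# subprocess.run(cmd, check=True) on a machine with Java and Trimmomatic.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems if a path ever contains spaces.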

__Congratulations!__ You have reached the end of the Data formats and QC tutorial!