# Contents
1. [Evaluate Fastq properties](#Evaluate Fastq properties)
2. [NGS data processing: mapping and optimizing the WES data.](#NGS data processing: mapping and optimizing the WES data.)
    1. [Mapping with BWA](#Mapping with BWA)
    2. [Marking duplicates](#Marking duplicates)
    3. [Adding ReadGroups](#Adding ReadGroups)
3. [NGS data processing: mapping and optimizing the WGS data](#NGS data processing: mapping and optimizing the WGS data)
4. [Variant Calling & filtering](#Variant Calling & filtering)

Exercise processing, visualization and interpretation of NGS data.
==================================================================

The raw output of an NGS sequencer is a FASTQ file, a flat text file in which every read takes up four lines: the read name, the sequence, a separator line starting with `+`, and the quality scores.

![The Fastq Format](https://kscbioinformatics.files.wordpress.com/2017/02/rawilluminadatafastqfiles.png?w=840)
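
As a quick illustration of the four-line record layout (read name, sequence, `+` separator, quality scores), the sketch below writes a tiny two-read FASTQ file and counts its reads with standard shell tools. The file name `demo.fastq`, the read contents and the helper `count_reads` are made up for illustration and are not part of the course data.

```bash
# Write a tiny two-read FASTQ file (made-up reads, for illustration only).
printf '%s\n' '@read1/1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
              '@read2/1' 'TTGACCGTAA' '+' 'IIIIIHHHGG' > demo.fastq

# Each read occupies exactly four lines, so reads = lines / 4.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Show the first record and the number of reads in the file.
head -n 4 demo.fastq
echo "reads: $(count_reads demo.fastq)"
```

On the course data you could call the same helper on one of the files in datafiles, for example `count_reads datafiles/na12878_wes_brcagenes-1.fastq`.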

A first step in the analysis of next-generation sequencing reads concerns the mapping of the sequence reads to a reference genome. We have generated three types of sequencing data:
1.	Partial whole exome sequencing (WES): datafiles/na12878_wes_brcagenes-1.fastq **and** datafiles/na12878_wes_brcagenes-2.fastq
2.	Partial RNA-sequencing (RNA)
3.	Partial Whole genome sequencing (WGS): datafiles/na12878_wgs_brcagenes-1.fastq **and** datafiles/na12878_wgs_brcagenes-2.fastq

Each of these datasets was derived from a public human sample, and the sequence reads must be mapped to the human reference genome. The output of the mapping is a so-called Binary Alignment Map (BAM) file, which is not human-readable but can be visualized using the Integrative Genomics Viewer.

<span style="color:red">***IMPORTANT: $filename markings indicate that you should replace them with YOUR filename, either of an existing file or of a new file.***</span>



1. Evaluate Fastq properties<a name="Evaluate Fastq properties"></a>
=============
A very well-known tool for evaluating the quality of sequence data is FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). Because processing NGS data is compute-intensive, it is advisable to check the quality of the data before proceeding.


<span style="color:green">[DO]: Run fastqc on the fastq files: ‘fastqc datafiles/*.fastq’ to generate an HTML report for each fastq file in the folder datafiles.</span>

To create links to the resulting HTML pages, take your Amazon computer name from your web browser's address bar (the part between https:// and amazonaws.com) and copy it into the instructions below:

for i in datafiles/*.html; 
    do echo https://COMPUTERNAMEHERE.amazonaws.com:8888/files/$i;
done

https://COMPUTERNAMEHERE.amazonaws.com:8888/files/datafiles/na12878_wes_brcagenes-1_fastqc.html
https://COMPUTERNAMEHERE.amazonaws.com:8888/files/datafiles/na12878_wes_brcagenes-2_fastqc.html
https://COMPUTERNAMEHERE.amazonaws.com:8888/files/datafiles/na12878_wgs_brcagenes-1_fastqc.html
https://COMPUTERNAMEHERE.amazonaws.com:8888/files/datafiles/na12878_wgs_brcagenes-2_fastqc.html


<span style="color:green">[DO]: Take 10 minutes to review the result pages. What is your opinion on the quality of the data? If needed, compare it to an (old) example of poor data (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) or good data (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html).</span>

2 NGS data processing: mapping and optimizing the WES data.<a name="NGS data processing: mapping and optimizing the WES data."></a>
==============================
2A Mapping with BWA<a name="Mapping with BWA"></a>
------------------
BWA needs a number of files to start the alignment: execute ‘bwa mem’ to see them. We will use default settings, but still at least 3 files are required.


<span style="color:purple">[Q]: What files are required to start the mapping?</span>

<span style="color:purple">[Q]: What is the default output location of the resulting alignment file?</span>

<span style="color:green">[DO]: start an alignment of the WES (!) sequence reads in paired end mode, and store the output file. Remember to name it properly so you can see from which experiment the alignment was.</span>

<span style="color:green">[DO]: show the first 30 lines of the SAM file you’ve obtained by executing 
```bash 
head -n 30 $file
```
where \$file is your samfile. You’ll clearly see a header section and the start of the reads section. </span>

<span style="color:purple">[Q]: What does this entry in the header mean?</span>

@SQ     SN:21   LN:48129895

<span style="color:purple">[Q]: Describe some read details from the SAM file for the first read. If you need more explanation, see the format specifications [here](https://samtools.github.io/hts-specs/SAMv1.pdf).</span>

-	<span style="color:purple">What is its orientation?</span>
-	<span style="color:purple">Where does it map to in the genome?</span>
-	<span style="color:purple">Does it have mismatches compared to the reference?</span>
-	<span style="color:purple">What is the distance to its mate?</span>
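
Questions about orientation and pairing are answered by the FLAG field (column 2 of each read line), which the SAM specification defines as a sum of bit values. The sketch below decodes a few of those bits. The helper name `decode_flag` and the example value 99 are illustrative assumptions and are not taken from the course data.

```bash
# Decode selected bits of a SAM FLAG value (bit meanings from the SAM specification).
decode_flag() {
    flag=$1
    [ $(( flag & 1 )) -ne 0 ]    && echo "paired"
    [ $(( flag & 2 )) -ne 0 ]    && echo "properly paired"
    [ $(( flag & 4 )) -ne 0 ]    && echo "unmapped"
    [ $(( flag & 16 )) -ne 0 ]   && echo "reverse strand"
    [ $(( flag & 32 )) -ne 0 ]   && echo "mate on reverse strand"
    [ $(( flag & 64 )) -ne 0 ]   && echo "first in pair"
    [ $(( flag & 128 )) -ne 0 ]  && echo "second in pair"
    [ $(( flag & 1024 )) -ne 0 ] && echo "duplicate"
    return 0
}

# 99 = 1 + 2 + 32 + 64
decode_flag 99
```

For 99 this prints "paired", "properly paired", "mate on reverse strand" and "first in pair"; a read without the 16 bit lies on the forward strand.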

SAM files are not a very efficient way to handle this kind of data: the information can be compressed and indexed by converting the SAM (Sequence Alignment Map) file into a BAM (Binary Alignment Map) file.

<span style="color:green">[DO]: Convert your SAM file into a BAM file with ‘sambamba view -S -f bam \$samfile > \$bamfile’, where \$samfile is your SAM file and \$bamfile is the BAM file to be generated. Give it a proper name.</span>

<span style="color:purple"> [Q]: Compare the filesizes of the SAM and BAM file. What is the compression ratio?</span>
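
One way to compute such a ratio is to divide the two byte counts. The sketch below is a minimal helper, assuming only standard tools (`wc`, `awk`, `head`); the name `size_ratio` and the two stand-in files are hypothetical and only serve to show the arithmetic.

```bash
# Print the size ratio first/second of two files, e.g. SAM size / BAM size.
size_ratio() {
    a=$(wc -c < "$1")
    b=$(wc -c < "$2")
    awk -v a="$a" -v b="$b" 'BEGIN { printf "%.1f\n", a / b }'
}

# Stand-in files of 1000 and 250 bytes (NOT real SAM/BAM files).
head -c 1000 /dev/zero > demo_large.bin
head -c 250 /dev/zero > demo_small.bin
size_ratio demo_large.bin demo_small.bin
```

With your own files you would call it as `size_ratio $samfile $bamfile`.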

To be able to make use of the indexed searching options, we need to sort the BAM file by coordinates (default). 

<span style="color:green">[DO]: Sort your BAM file by executing</span>
```bash
sambamba sort $bamfile
```
<span style="color:green">This automatically writes the sorted result to a new file with the extension .sorted.bam.</span>

<span style="color:purple">[Q]: Review the headers of the sorted and unsorted \$bamfile by executing</span>
```bash
sambamba view -H $bamfile
```
<span style="color:purple">for both and compare. Notice the header field stating that the BAM file is sorted (and how).</span>
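
In the SAM format the sort order is recorded in the `SO` tag of the `@HD` header line. The sketch below extracts that tag from header text read on standard input. The helper name `sort_order` and the two-line example header are assumptions for illustration; your own header will contain more lines.

```bash
# Print the value of the SO tag from the @HD line of a SAM/BAM header on stdin.
sort_order() {
    awk -F '\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SO:/) print substr($i, 4)
    }'
}

# Made-up two-line header, only to show the parsing.
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:21\tLN:48129895\n' | sort_order
```

On real data the header can be piped in directly: `sambamba view -H $bamfile | sort_order`.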

## 2B Marking duplicates <a name="Marking duplicates"></a>

We need to mark (but not remove) the duplicate reads in our (sorted) BAM file so that analysis tools will not use them later on.

<span style="color:green">[DO]: Mark the duplicates by executing 
```bash
sambamba markdup $sorted_bamfile $dedupped_sorted_bamfile
```
</span>

<span style="color:purple">[Q]: What is the percentage of duplicate reads in this dataset?</span>
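
Marked duplicates are recorded in the FLAG field as bit 1024 (0x400). A minimal sketch of how a percentage could be derived from that bit is shown below. The helper name `dup_percent` and the four example records are made up, and the records are truncated to three columns for brevity.

```bash
# Percentage of alignment records with the duplicate bit (1024) set.
# Reads SAM text from stdin; header lines (starting with @) are skipped.
dup_percent() {
    awk -F '\t' '
        /^@/ { next }
        { total++ }
        int($2 / 1024) % 2 == 1 { dups++ }
        END { if (total > 0) printf "%.1f\n", 100 * dups / total }
    '
}

# Four made-up records: flags 1123 and 1171 carry the duplicate bit.
printf 'r1\t99\t21\nr1\t147\t21\nr2\t1123\t21\nr2\t1171\t21\n' | dup_percent
```

On real data you could feed it the SAM text of your marked file, for example `sambamba view $dedupped_sorted_bamfile | dup_percent`. The `sambamba flagstat` subcommand is an alternative that reports duplicate counts directly.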

## 2C Adding ReadGroups <a name="Adding ReadGroups"></a>
Previously you’ve examined the header of the SAM file and probably noticed that it contains very useful information. Remember, it is very important to know from what sequence data and genome the alignment was generated. Additionally, you want to know which tools and settings were used, so that results can be replicated (see the F.A.I.R. principles later in the course). In the ReadGroups section of the header you can store information on the sample and experiment (for details see the [SAM specifications](http://samtools.github.io/hts-specs/SAMv1.pdf)). This is considered so important that GATK, for instance, requires additional information in these readgroups. In advanced analyses you can actively use this information from the readgroups.

In this part we are going to fill in the minimally required readgroup fields using PicardTools ($PICARD): AddOrReplaceReadGroups. In real life we advise being much more elaborate and specific.
- LB (library): WES
- PL (platform): illumina
- PU (platform unit): slide_barcode
- SM (sample name): na12878


<span style="color:green">[DO]: Execute Picardtools AddOrReplaceReadGroups with the following parameters (adjust to your own BAM file) and the data above:
```bash
java -jar $PICARD AddOrReplaceReadGroups \
I=$inputfile \
O=$outputfile \
CREATE_INDEX=true \
RGLB=$LB \
RGPL=$PL \
RGSM=$SM \
RGPU=$PU
```
</span>
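
To verify afterwards that the readgroup fields ended up in the header, you can inspect its `@RG` line. The sketch below reports, for header text read on standard input, which of the tags ID, LB, PL, PU and SM are present. The helper name `check_rg` and the example header are assumptions for illustration; in particular the `ID:1` value is a placeholder and may differ in your file.

```bash
# Report which read-group tags are present in a SAM/BAM header read from stdin.
check_rg() {
    awk -F '\t' '
        $1 == "@RG" {
            for (i = 2; i <= NF; i++) seen[substr($i, 1, 2)] = substr($i, 4)
        }
        END {
            n = split("ID LB PL PU SM", want, " ")
            for (k = 1; k <= n; k++)
                print want[k] ": " ((want[k] in seen) ? seen[want[k]] : "MISSING")
        }
    '
}

# Made-up header using the values listed above.
printf '@HD\tVN:1.6\tSO:coordinate\n@RG\tID:1\tLB:WES\tPL:illumina\tPU:slide_barcode\tSM:na12878\n' | check_rg
```

On real data the header can be piped in with `sambamba view -H $outputfile | check_rg`.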



# 3. NGS data processing: mapping and optimizing the WGS data <a name="NGS data processing: mapping and optimizing the WGS data"></a>
Now that we have successfully performed the steps to align and optimize the results of the exome sequencing, we want to analyse the whole genome sequencing (WGS) data in the same way.

<span style="color:green">[DO]: repeat the steps to get to a sorted, deduplicated BAM file with readgroups for the WGS fastqs. Take care not to overwrite your WES results, and clearly mark WGS in the appropriate readgroup.</span>
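
One way to avoid overwriting earlier results is to derive every intermediate file name from a single experiment prefix. The sketch below only prints such a naming plan; it does not run any tool. The helper name `plan_files`, the prefixes and the file-name suffixes are all hypothetical choices, not names required by the tools.

```bash
# Derive every intermediate file name from one experiment prefix, so that
# WES and WGS outputs cannot collide. This only prints names; nothing is run.
plan_files() {
    prefix=$1
    echo "sam=${prefix}.sam"
    echo "bam=${prefix}.bam"
    echo "sorted=${prefix}.sorted.bam"
    echo "dedup=${prefix}.sorted.dedup.bam"
    echo "final=${prefix}.sorted.dedup.rg.bam"
}

plan_files na12878_wes
plan_files na12878_wgs
```

Because the prefix appears in every name, switching from `na12878_wes` to `na12878_wgs` changes all outputs at once.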


# 4. Variant Calling & filtering <a name="Variant Calling & filtering"></a>