# Amplicon Sequencing Report

> Please take a look [here](https://github.com/francisglee/amplicon-ngs-workflow) for code samples and Jupyter notebooks.

## Table of Contents

- [Introduction](#introduction)
- [Initial Findings](#initial-findings)
- [Exploration](#exploration)
- [Discussion](#discussion)
- [Methods](#methods)
- [References](#references)

## Introduction

__Background Information__:

1. The data contain NGS reads of human DNA regions.
2. The sequenced file is paired-end, from the MiSeq sequencing platform.
3. This file is the result of an amplicon sequencing design.
4. The primers used in this file are:
   
| Forward Primer | Reverse Primer |
|:---------------|:---------------|
| TTGCCAGTTAACGTCTTCCTTCTCTCTCTG | GAGAAAAGGTGGGCCTGAGGTTCAGAGCCA |   
| CCCTTGTCTCTGTGTTCTTGTCCCCCCCA | CCCCACCAGACCATGAGAGGCCCTGCGGCC |    
| TGATCTGTCCCTCACAGCAGGGTCTTCTCT | TGACCTAAAGCCACCTCCTTA |
| CACACTGACGTGCCTCTCCCTCCCTCCA | CCGTATCTCCCTTCCCTGATTA |
   
   
5. The adaptor sequences used in this file are:
   
| Adaptor 1 | Adaptor 2 |
|:----------|:----------|
| AAGACTCGGCAGCATCTCCA | GCGATCGTCACTGTTCTCCA| 

__Starting Files__:

- `aln.bam`: BAM file
- `aln.bam.bai`: BAM index file
- `aln1.fastq`: Forward sequencing reads
- `aln2.fastq`: Reverse sequencing reads

__Outstanding Questions:__

- What is the composition of each read, for example (adaptor + primer + amplified region)?
    - primer + amplified region (the adaptors appear to have been trimmed)
- Which gene region did the reads map to?
    - EGFR exon
- What is the percentage of reads that were mapped to the human genome? Can you give some comments on the unmapped reads?
    - [see this](https://www.biostars.org/p/335579/)
    - 91.56% of reads mapped to the genome.
    - comment on unmapped reads
- What is the coverage of each amplicon? Can you find any explanations or clues as to why the coverage varies among amplicons, especially for those with big differences?
    - Amplicon 1
    - Amplicon 2
    - Amplicon 3
    - Amplicon 4
- What variants/mutations can you identify from this data? How can you evaluate which variants are more likely to be real than others?
    - Amplicon 1 has a 15 bp deletion in one allele
    - Amplicon 2
    - Amplicon 3
    - Amplicon 4

## Initial Findings

### FASTQ Analysis

We can use [FASTQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to perform a quick quality control on the FASTQ files.  Listed below is the summary.

| Measure | `aln1.fastq` | `aln2.fastq` | 
|:--------|:-------------|:-------------|
| Filename | aln1.fastq | aln2.fastq |
| File type | Conventional base calls | Conventional base calls |
| Encoding | Sanger / Illumina 1.9 | Sanger / Illumina 1.9 |
| Total Sequences | 575002 | 575002 | 
| Sequences flagged as poor quality | 0 | 0 | 
| Sequence length | 151 | 151 | 
| %GC | 53 | 52 |

| Aln1.FASTQ Per-base Sequence Quality | Aln2.FASTQ Per-base Sequence Quality |
|:------------------------------------:|:------------------------------------:|
| ![aln1-Per-base-sequence-quality](./../docs/images/aln1perbaseseqqual.png) | ![aln2-Per-base-sequence-quality](./../docs/images/perbaseseqqual.png) |

#### Overrepresented sequences

| Sequence | Count | Percentage | File |
|:---------|:------|:-----------|:-----|
| TTGCCAGTTAACGTCTTCCTTCTCTCTCTGTCATAGGGACTCTGGATCCC | 192894 | 33.546665924640266 | aln1.FASTQ |	
| CACACTGACGTGCCTCTCCCTCCCTCCAGGAAGCCTACGTGATGGCCAGC | 135419 | 23.55104851809211 | aln1.FASTQ |	
| TGATCTGTCCCTCACAGCAGGGTCTTCTCTGTTTCAGGGCATGAACTACT | 92426 | 16.074031046848532 | aln1.FASTQ |	
| CACACTGACGTGCCTCTCCTAGACTATGCTGCATGGTGTAGTACCATTAA | 11143 | 1.9379063029345986 | aln1.FASTQ |
| CCCTTGTCTCTGTGTTCTTGTCCCCCCCAGCTTGTGGAGCCTCTTACACC | 3845 | 0.6686933262840824 | aln1.FASTQ |
| CACACTGACGTGCCTCTCCTGCCACCCTCTTCCACTTTGAAGGCCCCTGT | 2278 | 0.39617253505205197	| aln1.FASTQ |
| CACACTGACGTGCCTCTCCCCCTTGTCTCATGGTCTGGTGGGGTGGAGAA | 2151 | 0.3740856553542422 | aln1.FASTQ |	
| CACACTGACGTGCCTCTCATGGTCTGGTGGGGTGGAGAACAGTGACGATC | 2021 | 0.35147703834073624 | aln1.FASTQ |
| CACACTGACGTGCCTCTCCTGGTCTGGTGGGGTGGAGAACAGTGACGATC | 1779 | 0.3093902282079019 | aln1.FASTQ |
| GAGAAAAGGTGGGCCTGAGGTTCAGAGCCATGGACCCCCACACAGCAAAG | 173011 | 30.088764908643796 | aln2.FASTQ |
| CCGTATCTCCCTTCCCTGATTACCTTTGCGATCTGCACACACCAGTTGAG | 117060 | 20.35819005846936 | aln2.FASTQ |
| TGACCTAAAGCCACCTCCTTACTTTGCCTCCTTCTGCATGGTATTCTTTC | 82682 | 14.379428245466972 | aln2.FASTQ |
| CCGTATCTCCCTTCCCTGATGTTTTGTGGGAACAGGGAAGAAGGATTACC | 10219 | 1.7772112097001402 | aln2.FASTQ |
| CCCCACCAGACCATGAGAGGCCCTGCGGCCCAGCCCAGAGGCCTGTGCCA | 7657 | 1.3316475420955056 | aln2.FASTQ |
| CCCCACCAGACCATGAGACAAGGGGGAGAGGCACGTCAGTGTGTGGAGAT | 3253 | 0.5657371626533473 | aln2.FASTQ |
| CCCCACCAGACCATGAGAGGCACGTCAGTGTGTGGAGATGCTGCCGAGTC | 2518 | 0.43791152030775543 | aln2.FASTQ |
| GAGAGAAGGTGGGCCTGAGGTTCAGAGCCATGGACCCCCACACAGCAAAG | 2469 | 0.42938981081804933 | aln2.FASTQ |
| CCCCACCAGACCAGGAGAGGCACGTCAGTGTGTGGAGATGCTGCCGAGTC | 2375 | 0.4130420415928988 | aln2.FASTQ |

We can see that the top overrepresented sequences correspond to the primer sequences that target the four amplicons.
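This observation can be checked programmatically. The sketch below is a minimal illustration, not part of the original analysis: it copies the primer sequences from the Introduction table and reports which primer shares the longest prefix with a given overrepresented sequence. The labels (`row1_fwd`, ...) simply follow the table's row order, and the helper names are made up for this example.

```python
# Primer sequences copied from the Introduction table, keyed by table row.
PRIMERS = {
    "row1_fwd": "TTGCCAGTTAACGTCTTCCTTCTCTCTCTG",
    "row1_rev": "GAGAAAAGGTGGGCCTGAGGTTCAGAGCCA",
    "row2_fwd": "CCCTTGTCTCTGTGTTCTTGTCCCCCCCA",
    "row2_rev": "CCCCACCAGACCATGAGAGGCCCTGCGGCC",
    "row3_fwd": "TGATCTGTCCCTCACAGCAGGGTCTTCTCT",
    "row3_rev": "TGACCTAAAGCCACCTCCTTA",
    "row4_fwd": "CACACTGACGTGCCTCTCCCTCCCTCCA",
    "row4_rev": "CCGTATCTCCCTTCCCTGATTA",
}


def shared_prefix_length(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_primer(sequence):
    """Return (primer label, shared prefix length) for the closest primer."""
    label = max(PRIMERS, key=lambda k: shared_prefix_length(PRIMERS[k], sequence))
    return label, shared_prefix_length(PRIMERS[label], sequence)


# Most frequent sequence in aln1.FASTQ: starts with a complete primer.
top = "TTGCCAGTTAACGTCTTCCTTCTCTCTCTGTCATAGGGACTCTGGATCCC"
# A lower-count sequence that matches only the first part of a primer.
partial = "CACACTGACGTGCCTCTCCTAGACTATGCTGCATGGTGTAGTACCATTAA"

print(best_primer(top))
print(best_primer(partial))
```

Sequences that match only part of a primer, like the second example, may be worth a closer look as possible off-target or chimeric products.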

### BAM Analysis

The BAM file metadata indicates that the sample was aligned with Bowtie2 (`Bowtie2 v2.0.2`).  The sample ID is `Bordet_EGFR_S1`.  This hints at the genome region being targeted, but we can't be sure from the name alone.  Querying EGFR in [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene) returns the epidermal growth factor receptor.  `_S1` most likely refers to the sequencing sample.

Upon submitting the primer sequences as BLAST queries into the UCSC genome browser, we can clearly see where they map to in the human genome.   

![primer blast results](./../docs/images/primerblast.png)

These line up with the exon regions of the EGFR gene on chromosome 7 of the human genome.

- What is the percentage of reads that were mapped to the human genome? Can you give some comments on the unmapped reads?
- What is the coverage of each amplicon? Can you find any explanations or clues as to why the coverage varies among amplicons, especially for those with big differences?
- What variants/mutations can you identify from this data? How can you evaluate which variants are more likely to be real than others?

We can use `samtools flagstat` to view a summary of `aln.bam`.

```
1150004 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1052983 + 0 mapped (91.56% : N/A)
1150004 + 0 paired in sequencing
575002 + 0 read1
575002 + 0 read2
1046208 + 0 properly paired (90.97% : N/A)
1049400 + 0 with itself and mate mapped
3583 + 0 singletons (0.31% : N/A)
646 + 0 with mate mapped to a different chr
583 + 0 with mate mapped to a different chr (mapQ>=5)
```
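The percentages in this output are plain ratios of the counts, so they can be rechecked by hand. The sketch below is only an illustration: it hardcodes the counts from the output above instead of parsing the file, and the helper name `pct` is made up for this example.

```python
# Counts copied from the flagstat output above (QC-passed column).
TOTAL = 1150004
MAPPED = 1052983
PROPERLY_PAIRED = 1046208
SINGLETONS = 3583


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


unmapped = TOTAL - MAPPED

print("mapped:          ", pct(MAPPED, TOTAL))
print("properly paired: ", pct(PROPERLY_PAIRED, TOTAL))
print("singletons:      ", pct(SINGLETONS, TOTAL))
print("unmapped reads:  ", unmapped, "=", pct(unmapped, TOTAL), "%")
```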

In total, 91.56% of the reads mapped to the genome.

### Trimming the BAM file

Looking at the BAM file, we can see alignments on a number of different chromosomes.  Since we know that the intended target was the EGFR gene on chromosome 7, we can subset the BAM file using samtools:

```
samtools view -b aln_sorted.bam 7:55242300-55242600 > amplicon1.bam
samtools view -b aln_sorted.bam 7:55241500-55241850 > amplicon2.bam
samtools view -b aln_sorted.bam 7:55259300-55259600 > amplicon3.bam
samtools view -b aln_sorted.bam 7:55248900-55249250 > amplicon4.bam
```

We can look at the specific intervals for each primer pair in the Broad Institute's Integrative Genomics Viewer (IGV).

| Primer Pair | Interval (Chr. 7) |
|:------------|:-------------------|
| 1 | 55242300-55242600 |
| 2 | 55241500-55241850 |
| 3 | 55259300-55259600 |
| 4 | 55248900-55249250 |
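Keeping these intervals in one place makes it easier to reuse them, for example to build the `samtools view` region strings or to decide which amplicon a position belongs to. The following is a small sketch under the assumption that the chromosome is named `7`, as in the samtools commands above; the function names are hypothetical.

```python
# Intervals copied from the table above: primer pair -> (chrom, start, end).
AMPLICONS = {
    1: ("7", 55242300, 55242600),
    2: ("7", 55241500, 55241850),
    3: ("7", 55259300, 55259600),
    4: ("7", 55248900, 55249250),
}


def region_string(pair):
    """Format an interval as a samtools-style region, e.g. '7:100-200'."""
    chrom, start, end = AMPLICONS[pair]
    return f"{chrom}:{start}-{end}"


def amplicon_for(chrom, pos):
    """Return the primer pair whose interval contains pos, or None."""
    for pair, (a_chrom, start, end) in AMPLICONS.items():
        if chrom == a_chrom and start <= pos <= end:
            return pair
    return None


for pair in AMPLICONS:
    print(pair, region_string(pair))
```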

We can list all the distinct flags that occur in the first amplicon:

```
samtools view amplicon1.bam | cut -f2 | sort -u > bamflags.txt
```

- 133: read paired, read unmapped, second in pair
- 145: read paired, read reverse strand, second in pair
- 147: read paired, read mapped in proper pair, read reverse strand, second in pair
- 153: read paired, mate unmapped, read reverse strand, second in pair
- 177: read paired, read reverse strand, mate reverse strand, second in pair
- 65: read paired, first in pair
- 69: read paired, read unmapped, first in pair
- 73: read paired, mate unmapped, first in pair
- 97: read paired, mate reverse strand, first in pair
- 99: read paired, read mapped in proper pair, mate reverse strand, first in pair
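Each flag is a bitwise combination of the properties defined in the SAM specification, so the descriptions above can be derived rather than looked up. A minimal sketch follows (the function name `decode_flag` is made up for this example, and the wording of the labels is chosen to match the list above):

```python
# Bit values from the SAM specification, with labels matching the list above.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the list of properties whose bit is set in a SAM flag."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


for flag in (65, 69, 73, 97, 99, 133, 145, 147, 153, 177):
    print(flag, ", ".join(decode_flag(flag)))
```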



__Primer Pair 1__

The first amplicon seems to contain two sets of reads, one with a 15-bp deletion and one without.  Only a small handful of reads appear clean.

![primer-pair1.1](./../docs/images/primerpair1-1.png)
![primer-pair1.2](./../docs/images/primerpair1-2.png)
![primer-pair1.3](./../docs/images/primerpair1-3.png)
![primer-pair1.4](./../docs/images/primerpair1-4.png)

__Primer Pair 2__

![primer-pair2.1](./../docs/images/primerpair2-1.png)
![primer-pair2.2](./../docs/images/primerpair2-2.png)
![primer-pair2.3](./../docs/images/primerpair2-3.png)
![primer-pair2.4](./../docs/images/primerpair2-4.png)
![primer-pair2.5](./../docs/images/primerpair2-5.png)

__Primer Pair 3__


![primer-pair3.1](./../docs/images/primerpair3-1.png)
![primer-pair3.2](./../docs/images/primerpair3-2.png)
![primer-pair3.3](./../docs/images/primerpair3-3.png)
![primer-pair3.4](./../docs/images/primerpair3-4.png)


__Primer Pair 4__

The amplicon for Primer Pair 4 seems to indicate a G > A mutation at position 55,249,063 of chromosome 7.

![primer-pair4.1](./../docs/images/primerpair4-1.png)
![primer-pair4.2](./../docs/images/primerpair4-2.png)
![primer-pair4.3](./../docs/images/primerpair4-3.png)
![primer-pair4.4](./../docs/images/primerpair4-4.png)


### Calling Variants

```
freebayes --fasta-reference chrm7.fa amplicon1.bam > amplicon1.vcf
bgzip amplicon1.vcf
tabix -p vcf amplicon1.vcf.gz
```

## Methods

### Workflow

1. [Quality Control and Preprocessing](#quality-control-and-preprocessing)
2. [Alignment](#alignment)
3. [Alignment Postprocessing](#alignment-postprocessing)
4. [Variant Calling](#variant-calling)
5. [Variant Filtering](#variant-filtering)

### Tools

__Database Access__:

- ~~[cruzdb](https://github.com/brentp/cruzdb): UCSC genomes database~~ 
- [pyensembl](https://github.com/openvax/pyensembl): Ensembl
- ~~[bioservices](https://github.com/cokelaer/bioservices): ArrayExpress, BioModels, ChEBI, KEGG, PDB, Uniprot, UniChem, NCBIBlast~~

__File Handling__:
 
- [bamnostic](https://github.com/betteridiot/bamnostic): `BAM`
- [pyVCF](https://github.com/jamescasbon/PyVCF): `VCF`
- [pyfaidx](https://github.com/mdshw5/pyfaidx): `FASTA`
- [fastq-and-furious](https://github.com/lgautier/fastq-and-furious): `FASTQ`
- ~~[cyvcf2](https://github.com/brentp/cyvcf2): `VCF` and `BCF`~~
- [pysam](https://github.com/pysam-developers/pysam): `SAM`, `BAM`, `VCF`, and `BCF`
- [pybedtools](https://github.com/daler/pybedtools): `BAM`, `SAM`, `BED`, `BCF`, `VCF`, `GFF`, and `GTF` 
- [htseq](https://github.com/simon-anders/htseq): `FASTQ`, `BAM`, `SAM`, `VCF`, `BED`
- [biopython](https://github.com/biopython/biopython): DNA, RNA, Proteins, `FASTQ`
- [scikit-allel](https://github.com/cggh/scikit-allel): `VCF`

* trouble installing cyVCF2

__Quality Control__:

- ~~[AfterQC](https://github.com/OpenGene/AfterQC): `FASTQ` file Processing~~
- [fastqp](https://github.com/mdshw5/fastqp): `FASTQ`, `BAM`, and `SAM` file Processing
- ~~[bam-toolbox](https://github.com/AndersenLab/bam-toolbox): `BAM` coverage~~
- ~~[mergesam](https://github.com/DarwinAwardWinner/mergesam): `SAM` and `BAM` conversion and merging and sorting~~

__Variant Calling__:

- [freebayes](https://github.com/ekg/freebayes): a haplotype-based variant detector
- [gatk](https://software.broadinstitute.org/gatk/): assortment of variant callers
- [bcftools](http://www.htslib.org/): assortment of variant callers

__Visualization__:

- ~~[Biodalliance](https://github.com/dasmoth/dalliance): visualize `BAM`, `VCF`~~
- [GenomeView](https://github.com/nspies/genomeview): visualize `BAM`
- [plotly](https://github.com/plotly): general-purpose graphing library
- [seaborn](https://github.com/mwaskom/seaborn): general-purpose graphing library
- [altair](https://github.com/altair-viz/altair): statistical visualization

__Report Generation__:

- [MultiQC](https://github.com/ewels/MultiQC): Aggregate results from bioinformatics analyses across many samples into a single report
- [Jupyter Notebook](https://github.com/jupyter/notebook)
- [pyPipeline](https://github.com/AndersenLab/pyPipeline)
- [bcbio](https://github.com/bcbio/bcbio-nextgen)

__Math/Stat__:

- [pandas](https://github.com/pandas-dev/pandas)
- [scipy](https://github.com/scipy/scipy)
- [cython](https://github.com/cython/cython)

__Tutorials__:

- [samtools tutorial](http://quinlanlab.org/tutorials/samtools/samtools.html)
- [ngs alignment and variant calling](https://github.com/ekg/alignment-and-variant-calling-tutorial)
- [scikit-allel](http://alimanfoo.github.io/2017/06/14/read-vcf.html)
- [genomic data visualization](http://fullstackdatascientist.io/15/03/2016/genomic-data-visualization-using-python/)

### Quality Control and Preprocessing

The main purpose of pre-processing is to keep low-quality reads from entering the variant evaluation procedure. Read quality is typically measured by metrics such as average base quality score, mapping quality score, and number of mismatches from the reference genome.

| __Input__ | __Output__ |
|:----------|:-----------|
| `FASTQ` | `FASTQ` |

1. __Trimming__: removal of low quality sequences
2. __Demultiplexing__:
3. __Removal of Adapters and Primers__: already done.
4. __Error Correction__:
5. __Detection of Enrichment Biases__:

### Alignment

This is the initial alignment, which produces the first `BAM`/`SAM` file.  For reads from 70 bp up to a few megabases we recommend using BWA-MEM to map the data to a given reference genome.  `aln.bam`, however, was aligned with Bowtie2.

| __Input__ | __Output__ |
|:----------|:-----------|
| `FASTQ` | `SAM`, `BAM` |

### Alignment Postprocessing

- A correctly aligned region (reads are shown as gray vertical bars with SNPs indicated as colored letters). 
- A spurious alignment where reads exhibit many small insertions (indicated as purple Is), deletions (shown as black horizontal lines) and SNPs.

| __Input__ | __Output__ |
|:----------|:-----------|
| `SAM`, `BAM` | `SAM`, `BAM` |

1. __Local realignment around indels or__:
2. __Calculation of per-base Base Quality scores (BAQ)__:
3. __Removal of Duplications__: Sorted `BAM` file is needed for this.
4. __Recalibration of Base Quality Scores__:

### Variant Calling

Depending on the type of input data and the intended application, the algorithms can be grouped into four categories:

- matched tumor-normal variant calling
- single-sample variant calling 
- UMI-based variant calling
- RNAseq variant calling

As the samples analyzed in this report were generated from single-sample amplicon sequencing, we will only look at single-sample variant calling methods.

| __Input__ | __Output__ |
|:----------|:-----------|
| `SAM`, `BAM` | `VCF`, `BCF` |

| Name | Type of Variant | Description | Reference |
|:-----|:----------------|:------------|:----------|
| FaSD-somatic |SNV | Joint genotype analysis | [[Link]](http://jjwanglab.org/FaSD-somatic/) |
| FreeBayes | SNV, indel | Haplotype analysis | [[GitHub]](https://github.com/ekg/freebayes) |
| HapMuC | SNV, indel | Haplotype analysis | [[GitHub]](https://github.com/usuyama/hapmuc) |
| LoFreq | SNV, indel | Allele frequency analysis | [[GitHub]](https://github.com/CSB5/lofreq) |
| MuTect | SNV | Allele frequency analysis | [[Link]](https://software.broadinstitute.org/cancer/cga/mutect)[[GitHub]](https://github.com/broadinstitute/mutect) |
| SAMtools | SNV, indel | Joint genotype analysis | [[Link]](http://www.htslib.org/)[[GitHub]](https://github.com/samtools/samtools) | 
| Platypus | SNV, indel, SV | Haplotype analysis | [[GitHub]](https://github.com/andyrimmer/Platypus) |
| SNooPer | SNV, indel | Machine learning | [[Link]](https://sourceforge.net/p/snooper/wiki/User%20Manual/) |
| SNVSniffer | SNV, indel | Joint genotype analysis | [[Link]](http://snvsniffer.sourceforge.net/homepage.htm) |
| VarDict | SNV, indel, SV | Heuristic threshold | [[GitHub]](https://github.com/AstraZeneca-NGS/VarDict) |
| VarScan2 | SNV, indel | Heuristic threshold | [[Link]](http://dkoboldt.github.io/varscan/) |

1. __Variant Analysis__: To convert the BAM file into genomic positions, we first use `bcftools mpileup` to produce a BCF stream that contains all of the locations in the genome. Passing this stream into `bcftools call` calls genotypes and reduces the list of sites to those found to be variant.

    ```
    bcftools mpileup -Ou -f Homo_sapiens.GRCh37.dna_sm.toplevel.fa aln.bam | bcftools call -mv -Ob -o alncalls.bcf
    ```

    - __Copy Number Variation__: sensitive detection of copy number alterations, aneuploidy and contamination.
    - __Consequence Calling__: haplotype-aware variant calling
    - __Consensus Calling__: Sometimes there is the need to create a consensus sequence for an individual where the sequence incorporates variants typed for this individual.
    - __ROH Calling__: detects regions of autozygosity in sequencing data, including exome data, using a hidden Markov model.
    - __Variant Simulation__: Add mutations to existing BAM files to test mutation callers.  `Bam Surgeon` and `wgsim` are good candidate tools.

2. __Annotation Mining__: The location is `1:[326216,326357)/-` for `hg19`.  Can we try to mine the literature for possible targets?

3. __Experiment QC__: Is there something in the metadata that can help explain the data?

4. __FDR Correction__: Variants are passed if they've cleared all three of FreeBayes, GATK HaplotypeCaller, and VarDict.
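The consensus rule in step 4 amounts to a set intersection over the variants reported by each caller. The sketch below illustrates the idea only: variants are represented as `(chrom, pos, ref, alt)` tuples, the function name is hypothetical, and the call sets are placeholder values invented for the example (only the G > A at 7:55249063 comes from this report; which callers actually report it has not been established here).

```python
def consensus_calls(calls_by_caller):
    """Keep only the variants that every caller reported."""
    call_sets = [set(calls) for calls in calls_by_caller.values()]
    if not call_sets:
        return []
    return sorted(set.intersection(*call_sets))


# Placeholder call sets, for illustration only (not real results).
example_calls = {
    "freebayes": [("7", 55249063, "G", "A"), ("7", 55241700, "C", "T")],
    "gatk_haplotypecaller": [("7", 55249063, "G", "A")],
    "vardict": [("7", 55249063, "G", "A"), ("7", 55259500, "T", "G")],
}

print(consensus_calls(example_calls))
```

In practice the tuples would be read from each caller's VCF output. Requiring agreement between all callers trades sensitivity for specificity, which may or may not be the right balance for low-frequency variants.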


### Post-Filtering

Sequencing or alignment artifacts may appear to have strong read evidence and trick the statistical model into passing them as real variants.  Most variant callers apply a set of filters to identify these artifacts and hence improve specificity.

- Strand bias filter
- hard thresholds set by variant callers (empirically defined)
- filters focus on repetitive regions
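As an illustration of the first filter, strand bias is commonly assessed with Fisher's exact test on a 2x2 table of reference/alternate allele counts split by forward/reverse strand. The sketch below is a plain-Python version written for this report, not the implementation used by any particular caller; the function names are hypothetical and any cut-off applied to the score would be an assumption.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one. Uses exact integers, so it is
    slow for very deep coverage.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def strand_bias_phred(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled strand bias score: higher means stronger bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return max(0.0, -10.0 * log10(p)) if p > 0 else float("inf")


print(strand_bias_phred(10, 10, 10, 10))  # balanced strands
print(strand_bias_phred(10, 0, 0, 10))    # alternate allele on one strand only
```

Note that amplicon data can show strong strand imbalance by design, so a strand bias filter may need to be relaxed or interpreted with care here.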



- __Coverage histogram in amplicon region__: The distribution of coverage inside the amplicon regions. The function parses the regions one by one, gets the coverage of every sample provided, and renders a histogram with coverage on the X axis.
- __Cumulative coverage saturation plot__: The same data rendered as a cumulative histogram. It is useful for spotting poorly covered samples, such as the NSG sample in this example.
- __Mapping qualities in amplicon regions__: A stacked histogram of the mapping qualities inside the amplicon regions. The NSG sample is again failing, and the mapping qualities are around 42 in this particular example.
- __Targeted regions coverage__: The cumulative distribution function (CDF) of coverage per sample inside the target regions.
- __Coverage heatmap inside amplicon regions__: Another view of coverage inside the amplicon regions. It places all the samples side by side and renders each row as a heatmap. I made two examples here, one that includes an outlier sample like NSG and one that does not.
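The histogram and cumulative views above both start from a list of per-base depths, for example the third column of `samtools depth` output for one amplicon. The sketch below shows how the two summaries could be computed before plotting; the function names, the bin width, and the depth values are made up for illustration.

```python
from collections import Counter


def coverage_histogram(depths, bin_width=100):
    """Count positions per coverage bin; keys are the lower bin edges."""
    bins = Counter((d // bin_width) * bin_width for d in depths)
    return dict(sorted(bins.items()))


def fraction_at_least(depths, thresholds):
    """Fraction of positions covered at or above each threshold."""
    n = len(depths)
    return {t: sum(d >= t for d in depths) / n for t in thresholds}


# Toy per-base depths, for illustration only.
depths = [0, 50, 120, 130, 250]

print(coverage_histogram(depths))
print(fraction_at_least(depths, [1, 100, 200]))
```

The first result feeds a bar chart and the second a saturation curve; either could be drawn with the plotting libraries listed under Tools.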

## Exploration

### Bash History

__To Index Reference__:

```
bwa index reference.fa
```

__To Sort BAM__:

```bash
samtools sort ./../../notebooks/data/original_files/aln.bam > ./../../notebooks/data/samtools_output/aln_sorted.bam
```

__To Index BAM__:

```bash
samtools index ./../../notebooks/data/samtools_output/aln_sorted.bam ./../../notebooks/data/samtools_output/aln_sorted.bam.bai
```

__Call Variants__:

```
freebayes --fasta-reference ./../../Reference_Sequence/Homo_sapiens.GRCh37/Homo_sapiens.GRCh37.dna_sm.toplevel.fa ./../../notebooks/data/samtools_output/aln_sorted.bam > ./../../notebooks/data/freebayes_output/var.vcf
```

```
bcftools mpileup -Ou -d 150004 -f ./../../Reference_Sequence/Homo_sapiens.GRCh37/Homo_sapiens.GRCh37.dna_sm.toplevel.fa  ./../../notebooks/data/samtools_output/aln_sorted.bam | bcftools call -mv -Ob -o ./../../notebooks/data/bcftools_output/var150004.bcf
```

```
bcftools mpileup -d 150004 -f ./../../Reference_Sequence/Homo_sapiens.GRCh37/Homo_sapiens.GRCh37.dna_sm.toplevel.fa  ./../../notebooks/data/samtools_output/aln_sorted.bam | bcftools call -mv -Oz -o ./../../notebooks/data/bcftools_output/var150004.vcf.gz
```


__Call Filtering__:

## Discussion

## References

Schirmer, Melanie, et al. "**Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform.**" Nucleic acids research 43.6 (2015): e37-e37.

Laehnemann, David, Arndt Borkhardt, and Alice Carolyn McHardy. "**Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction.**" Briefings in bioinformatics 17.1 (2015): 154-179.

Xu, Chang. "**A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data.**" Computational and structural biotechnology journal 16 (2018): 15-24.

Pfeifer, S. P. "**From next-generation resequencing reads to a high-quality variant data set.**" Heredity 118.2 (2017): 111.

Tian, Rui, Malay K. Basu, and Emidio Capriotti. "**Computational methods and resources for the interpretation of genomic variants in cancer.**" BMC genomics 16.8 (2015): S7.



__From BCFTools__

- Copy number variation
- Consequence calling 
- Detecting runs of homozygosity
- Variant Calling
- ASSAY_TYPES = ("presence/absence", "SNP", "gene variant", "ROI", "mixed")


### Workflow 

1.  Adapter trimming and, optionally, quality trimming are performed using Trimmomatic.
2.  Reads are aligned to the reference amplicon sequences.
3.  BAM files are analyzed with a custom Python script that uses the pysam and scikit-bio libraries.
4.  The script combines the alignment data in the BAM file with the assay data in the JSON file and interprets the results.
    - alignment evaluation
    - read evaluation