## Alignment with BWA, Rebaler and BLAST

Many algorithms align short reads efficiently, but they are not optimal for long reads (e.g. ONT reads). Put simply, long-read alignments are complicated by structural variation and by indels that may be due to sequencing errors. For this reason, NanoDJ relies on tools such as BWA, Rebaler and BLAST, which are better at finding local matches.

## Alignment with BWA

Burrows-Wheeler Aligner ([BWA](https://github.com/lh3/bwa)) is designed to achieve a good balance between performance and accuracy in the alignment. BWA is a software package for mapping reads to a reference, and it includes several alternative alignment algorithms. Some of them are ideal for short reads, while others are better suited for long reads such as those produced by ONT. BWA-MEM is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and chimeras, and is applicable to a wide range of sequence lengths, from 70 bp to a few Mb.

BWA-MEM is used as a step in several NanoDJ notebooks. BWA needs the sequence reads and a reference as inputs, and supports multithreaded execution with the <font color='blue'>-t</font> option.

Before running BWA-MEM, the user will first need to index the reference genome (FASTA):

In [1]:
bwa index data/agalactiae/reference/NZ_CP010867.1_Ref.fasta

[bwa_index] Pack FASTA... 0.02 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.61 seconds elapse.
[bwa_index] Update BWT... 0.01 sec
[bwa_index] Pack forward-only FASTA... 0.01 sec
[bwa_index] Construct SA from BWT and Occ... 0.28 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index data/agalactiae/reference/NZ_CP010867.1_Ref.fasta
[main] Real time: 1.074 sec; CPU: 0.937 sec
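
Indexing only has to be done once per reference. As a minimal sketch, a notebook could check whether the index files already exist before calling `bwa index` again. The helper names below are hypothetical, and the suffix list assumes the five files that `bwa index` (0.7.x) writes next to the FASTA:

```python
from pathlib import Path

# Suffixes of the files that `bwa index` writes next to the reference FASTA
# (assumption: BWA 0.7.x with default settings).
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_files(reference):
    """Return the expected BWA index files that do not exist yet."""
    expected = [Path(str(reference) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    return [path for path in expected if not path.exists()]


def needs_indexing(reference):
    """True if at least one index file is missing, so `bwa index` should be run."""
    return bool(missing_index_files(reference))
```

For example, `needs_indexing("data/agalactiae/reference/NZ_CP010867.1_Ref.fasta")` would be expected to return `False` after the cell above has completed.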


Once the reference is indexed, BWA-MEM can be run using the <font color='blue'>-t</font> option to allow multithreaded execution. One must specify the (previously indexed) reference, the reads file (either as FASTA or FASTQ), and redirect the output to a file (SAM format):

In [2]:
bwa mem -t 2 data/agalactiae/reference/NZ_CP010867.1_Ref.fasta data/agalactiae/reads.fastq > data/agalactiae/bwa_output.sam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2650 sequences (20048748 bp)...
[M::process] read 1070 sequences (14793475 bp)...
[M::mem_process_seqs] Processed 2650 reads in 41.787 CPU sec, 21.757 real sec
[M::mem_process_seqs] Processed 1070 reads in 14.884 CPU sec, 7.734 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -t 2 data/agalactiae/reference/NZ_CP010867.1_Ref.fasta data/agalactiae/reads.fastq
[main] Real time: 29.989 sec; CPU: 56.849 sec
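
The SAM file can be inspected directly from Python. The sketch below is one possible way to summarize the alignments by reading the FLAG column (second field) of each record; the function name and the sample records are made up for illustration and are not NanoDJ output:

```python
def summarize_sam(lines):
    """Classify SAM records by the bits set in their FLAG field."""
    counts = {"mapped": 0, "unmapped": 0, "secondary": 0, "supplementary": 0}
    for line in lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:        # read is unmapped
            counts["unmapped"] += 1
        elif flag & 0x800:    # supplementary (chimeric) alignment
            counts["supplementary"] += 1
        elif flag & 0x100:    # secondary alignment
            counts["secondary"] += 1
        else:                 # primary alignment of a mapped read
            counts["mapped"] += 1
    return counts


# Hypothetical records, only to exercise the function.
SAMPLE_SAM = "\n".join([
    "@SQ\tSN:NZ_CP010867.1\tLN:2183395",
    "read1\t0\tNZ_CP010867.1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "read1\t2048\tNZ_CP010867.1\t5000\t60\t5H5M\t*\t0\t0\tACGTA\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "read3\t16\tNZ_CP010867.1\t900\t60\t4M\t*\t0\t0\tTTGA\t*",
])

print(summarize_sam(SAMPLE_SAM.splitlines()))
```

On real data, the same function can be given an open file handle, for instance `summarize_sam(open("data/agalactiae/bwa_output.sam"))`.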


## Reference-based assembly with Rebaler

[Rebaler](https://github.com/rrwick/Rebaler) is used to obtain reference-based assemblies but can also reassemble/polish an assembly of long reads, using a reference assembly to guide the large-scale structure. Another advantage of Rebaler is that the reference assembly sequence does not influence the sequence of the resulting assembly.

After loading the reference, Rebaler uses minimap2 to align long reads to the reference. Then, it removes lower-quality alignments (judged by length, identity and size of indels) until the reference is just covered. Any given position in the reference should now have a coverage of 1 or 2 (or 0 if the reads failed to cover a spot). The reference sequence is then replaced with the corresponding read fragments to produce an unpolished assembly. If parts of the reference had no read coverage, the original reference sequence is left in place.
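
The culling idea can be illustrated with a toy example. This is not Rebaler's actual code: alignment quality is collapsed into a single hypothetical score, coordinates are half-open reference intervals, and an alignment is dropped only when every base it covers is still covered by another kept alignment.

```python
def cull_alignments(alignments, ref_length):
    """Greedily drop low-scoring alignments while keeping the reference covered.

    `alignments` is a list of (start, end, score) tuples using half-open
    reference coordinates. Returns the kept alignments and the final
    per-base depth.
    """
    # Per-base depth with every alignment still in place.
    depth = [0] * ref_length
    for start, end, _ in alignments:
        for pos in range(start, end):
            depth[pos] += 1

    kept = list(alignments)
    # Visit the worst-scoring alignments first so they are removed first.
    for aln in sorted(alignments, key=lambda a: a[2]):
        start, end, _ = aln
        # Removable only if no base would drop to zero coverage.
        if all(depth[pos] >= 2 for pos in range(start, end)):
            kept.remove(aln)
            for pos in range(start, end):
                depth[pos] -= 1
    return sorted(kept), depth


# Hypothetical alignments on a 100 bp reference.
toy = [(0, 60, 0.95), (10, 50, 0.80), (40, 100, 0.90), (55, 90, 0.85)]
kept, depth = cull_alignments(toy, ref_length=100)
print(kept)
print(min(depth), max(depth))
```

In this toy input the two lowest-scoring alignments are redundant, so only two alignments remain and every base ends with a depth of 1 or 2, which mirrors the behaviour described above.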

Once the Rebaler assembly is built, multiple [Racon](https://github.com/isovic/racon) rounds are run to polish the consensus sequence.

In [3]:
rebaler -h

usage: rebaler [-t THREADS] [--keep] [--random] [-h] [--version]
               reference reads

Rebaler: reference-based long read assemblies of bacterial genomes

Positional arguments:
  reference               FASTA file of reference assembly
  reads                   FASTA/FASTQ file of long reads

Optional arguments:
  -t THREADS, --threads THREADS
                          Number of threads to use for alignment and polishing
                          (default: 2)
  --keep                  Do not delete temp directory of intermediate files
                          (default: delete temp directory)
  --random                If a part of the reference is missing, replace it
                          with random sequence (default: leave it as the
                          reference sequence)

Help:
  -h, --help              Show this help message and exit
  --version               Show program's version number and exit


In [4]:
rebaler -t 48 data/agalactiae/reference/NZ_CP010867.1_Ref.fasta data/agalactiae/reads.fastq > data/agalactiae/assembly_with_rebaler.fasta



[93m[1m[4mLoading reference[0m (2018-11-10 08:24:32)
    This reference sequence will be used as a template for the Rebaler
assembly.

[1m[4mReference contig   Circular      Length[0m
NZ_CP010867.1      no         2,183,395


[93m[1m[4mBuilding unpolished assembly[0m (2018-11-10 08:24:32)
    Rebaler first aligns long reads to the reference using minimap2. It then
selects high quality alignments and replaces the reference sequence with the
corresponding read sequence. This creates an unpolished assembly made directly
from read fragments, similar to what would be produced by miniasm.

Loading reads...                             3,720 reads
Aligning reads to reference with minimap2... 1,274 initial alignments
                                             2.43x depth
Culling alignments to a non-redundant set... 343 alignments remain

Constructing unpolished assembly:

NZ_CP010867.1:
reference(+):0-9 → 603a0efb(+):0-4373 → bbfedfc3(-):4309-6274 → ...

## Alignment with BLAST

Basic Local Alignment Search Tool (BLAST) is a classic local sequence alignment tool that compares query sequences against sequence databases to find regions of similarity. BLAST uses a heuristic algorithm, so there is no guarantee that it finds the optimal alignment. However, it calculates the statistical significance of each result, providing a parameter with which to score the hits obtained.
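
As a rough illustration of how significance relates to the alignment score, the E-value can be approximated from a bit score S' as E = m · n · 2^(-S'), where m and n are the query and database lengths. The sketch below uses raw lengths for simplicity, whereas BLAST itself applies corrected (effective) lengths, so the numbers will not match its output exactly. The function name is hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S') for a bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Each extra bit of score halves the number of hits expected by chance.
for score in (20, 40, 60):
    print(score, expected_hits(score, query_length=1_000, database_length=2_000_000))
```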

The BLAST algorithm has three main stages: seeding (finds short local matches, called words), extension (the alignment is extended on both sides of each word) and evaluation (assesses the statistical significance of the resulting alignments and discards the inconsistent ones). For this, BLAST needs a database where the reference sequences (all in one FASTA file) are indexed for comparison. BLAST has many command-line options available (https://www.ncbi.nlm.nih.gov/books/NBK279684/). However, we will focus on a few basic ones that may be useful for the user:

| Option | Type | Description | Notes |
| :------: | :----: | :-----------: | :-----: |
| evalue | real | Expect value (E) for saving hits (default: 10.0) | E-value: expected number of chance alignments; the smaller the E-value, the better the match. |
| html | flag | Produce HTML output | |
| outfmt | string | Alignment view (default: 0, pairwise) | The tabular and CSV views (6, 7 and 10) can be customized with format specifiers such as `length` (alignment length) and `pident` (percentage of identical matches). For example: -outfmt "6 qseqid sseqid pident length" |
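
To show how these options combine on the command line, the following sketch assembles a `blastn` argument list in Python without running it. The function name and the default values chosen here are assumptions for illustration, and the E-value threshold of 1e-5 is only an example.

```python
import shlex


def build_blastn_command(query, db, evalue=10.0,
                         outfmt="6 qseqid sseqid pident length evalue",
                         html=False):
    """Assemble a blastn argument list from the options in the table."""
    cmd = ["blastn", "-query", query, "-db", db,
           "-evalue", str(evalue),
           "-outfmt", outfmt]
    if html:
        # -html is a plain flag and takes no value.
        cmd.append("-html")
    return cmd


cmd = build_blastn_command(
    "data/metagenomics/sample.fasta",
    "data/metagenomics/reference/metagenomics_references_withnames.fasta",
    evalue=1e-5,
)
# shlex.join quotes the multi-word -outfmt value so it can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the arguments as a list means the same object could later be handed to `subprocess.run(cmd)` without extra quoting.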

Before proceeding to the alignment step, the user first needs to create a local database from the reference sequences against which the queries will be aligned. This can be done with the following arguments:

In [1]:
!makeblastdb -in data/metagenomics/reference/metagenomics_references_withnames.fasta -parse_seqids -dbtype nucl



Building a new DB, current time: 11/10/2018 08:31:38
New DB name:   /home/jovyan/notebooks/data/metagenomics/reference/metagenomics_references_withnames.fasta
New DB title:  data/metagenomics/reference/metagenomics_references_withnames.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 7 sequences in 0.0083971 seconds.


Once the local database is created, the user is ready to query it. The input is a FASTA file with unmapped sequences. To query the database, the following arguments can be used:

In [None]:
!blastn -query data/metagenomics/sample.fasta -db /home/jovyan/notebooks/data/metagenomics/reference/metagenomics_references_withnames.fasta -task blastn -dust no -outfmt "10 qseqid positive sseqid" -max_hsps 1 -max_target_seqs 1 -num_threads 2 > blast_metagenomics.csv

The output is written as a CSV file, which is easy to load with the pandas package for further processing and plotting.
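
As a minimal sketch of that post-processing step, the snippet below counts hits per reference using only the standard library. The column names follow the `-outfmt "10 qseqid positive sseqid"` string used in the query, the file has no header row, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Column order taken from the -outfmt string of the blastn call.
BLAST_COLUMNS = ["qseqid", "positive", "sseqid"]


def count_hits_per_reference(handle):
    """Count how many query sequences were assigned to each reference."""
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS)
    return Counter(row["sseqid"] for row in reader)


# Hypothetical rows standing in for blast_metagenomics.csv.
sample = io.StringIO("read_1,812,ref_A\nread_2,455,ref_B\nread_3,990,ref_A\n")
print(count_hits_per_reference(sample).most_common())
```

With pandas, the equivalent table could be loaded with `pandas.read_csv("blast_metagenomics.csv", header=None, names=BLAST_COLUMNS)` and summarized with `value_counts()` on the `sseqid` column.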

### References

[Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.](https://www.ncbi.nlm.nih.gov/pubmed/19451168)


[Ryan Wick. Rebaler (GitHub repository)](https://github.com/rrwick/Rebaler)

[Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410](https://www.sciencedirect.com/science/article/pii/S0022283605803602?via%3Dihub)