## Basecalling with Albacore

[Albacore](#albacore) is a processing pipeline that provides the basecalling algorithms needed to convert raw electrical data from an Oxford Nanopore Technologies (ONT) sequencer into a DNA sequence. FAST5 files containing event data are processed by Albacore, which outputs FASTQ sequence files. The transducer basecaller, which helps with homopolymers, was added in v1.0 of the tool, and basecalling directly from the raw signal (without segmenting the signal into events) appeared in v2.0.

The Albacore pipeline ([de.NBI Nanopore Training Course](#deNBI)) contains:

**Basecalling**: Implements the basecalling algorithms and includes the configuration files for each basecalling chemistry.<br>

**Calibration Strand Detection**: Reads are aligned against a calibration strand reference via the integrated minimap2 aligner. Calibration strands serve as a quality control for pore and experiment. If the current read is identified as a calibration strand, no barcoding or alignment steps are performed.<br>

**Barcoding/Demultiplexing**: The beginning and the end of each strand are aligned against the barcodes currently provided by Oxford Nanopore Technologies. The reads are then demultiplexed according to the barcoding results.<br>

**Alignment**: The user can provide a reference file in FASTA, lastdb or minimap2 index format. If so, the reads are aligned against this reference via the integrated minimap2 aligner.<br>


### Executing Albacore

Let's see the usage message for read_fast5_basecaller.py:

In [1]:
read_fast5_basecaller.py -h

usage: read_fast5_basecaller.py [-h] [-l] [-v] [-i INPUT] -t WORKER_THREADS -s
                                SAVE_PATH [--resume [X]] [-f FLOWCELL]
                                [-k KIT] [--barcoding] [-c CONFIG]
                                [-d DATA_PATH] [-b] [-r]
                                [-n FILES_PER_BATCH_FOLDER] [-o OUTPUT_FORMAT]
                                [-q READS_PER_FASTQ_BATCH]
                                [--disable_filtering] [--disable_pings]

ONT Albacore Sequencing Pipeline Software

optional arguments:
  -h, --help            show this help message and exit
  -l, --list_workflows  List standard flowcell / kit combinations.
  -v, --version         Print the software version.
  -i INPUT, --input INPUT
                        Folder containing read fast5 files (if not present,
                        will expect file names on stdin).
  -t WORKER_THREADS, --worker_threads WORKER_THREADS
                        Number of worker threads to use.
  -s SA

A list of the available flowcell and kit combinations, together with the configuration file used for each, can be obtained with the option *-l*:

In [2]:
read_fast5_basecaller.py -l

Parsing config files in /opt/albacore.
Available flowcell + kit combinations are:
flowcell    kit         barcoding  config file
FLO-MIN106  SQK-DCS108             r94_450bps_linear.cfg
FLO-MIN106  SQK-LRK001             r94_450bps_linear.cfg
FLO-MIN106  SQK-LSK108             r94_450bps_linear.cfg
FLO-MIN106  SQK-LSK109             r94_450bps_linear.cfg
FLO-MIN106  SQK-LWB001  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-LWP001             r94_450bps_linear.cfg
FLO-MIN106  SQK-PBK004  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-PCS108             r94_450bps_linear.cfg
FLO-MIN106  SQK-PSK004             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAB201  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-RAB204  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-RAD002             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAD003             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAD004             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAS201             r94_450bps_linear.cfg
FLO-MIN106  SQK-
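
The listing can also be queried from a script. The following is a minimal sketch, assuming the column layout shown above (flowcell in the first column, kit in the second, config file in the last). `lookup_config` is a hypothetical helper name and is not part of Albacore.

```shell
# lookup_config is a hypothetical helper, not part of Albacore.
# It reads the output of `read_fast5_basecaller.py -l` on stdin and prints
# the config file listed for the given flowcell ($1) and kit ($2).
lookup_config() {
    awk -v fc="$1" -v kit="$2" '$1 == fc && $2 == kit { print $NF }'
}

# Example call:
#   read_fast5_basecaller.py -l | lookup_config FLO-MIN106 SQK-LSK108
```

For the rows shown in the listing, the example call would print `r94_450bps_linear.cfg`. Taking the last field rather than a fixed column keeps the lookup working for rows where the barcoding column is filled in.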

Since the v1.1 release, Albacore can basecall directly to FASTQ or FAST5 files with the options *-o fastq* or *-o fast5*, respectively. The FASTQ option saves disk space and is usually more convenient, but both file types can be requested at once with *-o fastq,fast5*.
We run the tool with the following options:

| <p style='text-align: left;'>What?</p>                     | <p style='text-align: left;'>parameter</p> | <p style='text-align: left;'>value</p>                                      |
|:-----------------------------------------------------------|:----------|:----------------------------------------|
| <p style='text-align: left;'>Output file type you want (fast5, fastq, or both) </p>         | <p style='text-align: left;'>-o</p>        | <p style='text-align: left;'>fastq,fast5</p>                             |
| <p style='text-align: left;'>Full path to directory containing the input raw read files</p> | <p style='text-align: left;'>-i</p>        | <p style='text-align: left;'>data/Agalactiae/Data_MinION/raw_1D/pass</p> |
| <p style='text-align: left;'>Recursive search through subfolders for input data files</p>   | <p style='text-align: left;'>-r</p>        |                                         |
| <p style='text-align: left;'>Full path to directory where the output basecalled files are saved</p>   | <p style='text-align: left;'>-s</p>        | <p style='text-align: left;'>data/Agalactiae/Outputs/Albacore</p>        |
| <p style='text-align: left;'>Number of worker threads to use</p>                            | <p style='text-align: left;'>-t</p>        | <p style='text-align: left;'>48</p>                                      |
| <p style='text-align: left;'>Configuration file for basecalling chemistry</p>               | <p style='text-align: left;'>-c</p>        | <p style='text-align: left;'>r94_450bps_linear.cfg</p>                   |

#### REMARK: **Data should not be included in the /data folder of this project (see the "Additional data" section in the README file)**


In [4]:
read_fast5_basecaller.py -o fastq,fast5 \
                         -i data/sample/fast5 \
                         -r \
                         -s data/sample/output \
                         -t 48 \
                         -c r94_450bps_linear.cfg
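
The usage message also lists *-f FLOWCELL* and *-k KIT*, so a run can be described by flowcell and kit instead of naming the configuration file with *-c*. The following is a minimal sketch of such a run. The wrapper name `basecall_by_kit`, the default thread count and the flowcell and kit in the example call are assumptions for illustration, so substitute the values that match your library.

```shell
# basecall_by_kit is a hypothetical wrapper, not part of Albacore.
# Arguments: $1 = flowcell, $2 = kit, $3 = input directory,
#            $4 = output directory, $5 = worker threads (optional, default 4).
basecall_by_kit() {
    read_fast5_basecaller.py \
        -f "$1" \
        -k "$2" \
        -o fastq \
        -i "$3" \
        -r \
        -s "$4" \
        -t "${5:-4}"
}

# Example call (flowcell and kit are assumed values):
#   basecall_by_kit FLO-MIN106 SQK-LSK108 data/sample/fast5 data/sample/output 48
```

Only options that appear in the usage message are passed through; the output format is fixed to FASTQ here to keep the sketch short.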

The output is often split across several FASTQ/FASTA files. To perform QC, alignments or assemblies, the sequences need to be in a single file. We can easily join the output files using bash commands:

In [5]:
cat data/sample/*.fastq > data/sample/merged-output.fastq
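
Albacore writes its results beneath the directory given with *-s*, and the exact subfolder layout may depend on the Albacore version and on whether read filtering is enabled. The following is a minimal sketch that avoids hard-coding that layout by searching the output directory recursively, then counts the merged reads as a sanity check. `merge_fastq` and `count_reads` are hypothetical helper names, and the read count assumes four lines per FASTQ record.

```shell
# merge_fastq and count_reads are hypothetical helpers, not part of Albacore.

# Concatenate every *.fastq file found under directory $1 into file $2.
# Keep $2 outside $1, otherwise the merged file would be picked up as input.
merge_fastq() {
    find "$1" -type f -name '*.fastq' -exec cat {} + > "$2"
}

# Print the number of reads in FASTQ file $1, assuming four lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Example calls:
#   merge_fastq data/sample/output data/sample/merged-output.fastq
#   count_reads data/sample/merged-output.fastq
```

Because the search is recursive, reads from every subfolder are merged. If the run sorts reads into separate subfolders and only some of them are wanted, pass the narrower directory as the first argument.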

### Comparison of ONT basecalling tools

A comparison of different ONT basecalling tools is presented in the GitHub repository by [Wick et al., 2018](#wick_et_al). The authors use a bacterial genome to assess the read accuracy and consensus sequence accuracy of several basecallers. In their comparison, Albacore, Guppy and Scrappie raw were the best performers for read accuracy, whereas Chiron produced the best assemblies. The authors suggest that Albacore v2.1.10 is probably the best basecaller choice for most users: it has very good read accuracy, produces acceptable assemblies, runs quickly, is simple to use and has many useful features such as barcode demultiplexing. If a GPU is available, Guppy v0.5.1 can produce the same basecalls in much less time, but at the time of the comparison it was not yet publicly available and lacked barcode demultiplexing.

### References

<a id='albacore'>[1]</a> Oxford Nanopore Technologies (2017, September 4). New basecaller now performs ‘raw basecalling’, for improved sequencing accuracy. URL https://nanoporetech.com/about-us/news/new-basecaller-now-performs-raw-basecalling-improved-sequencing-accuracy

<a id='wick_et_al'>[2]</a> Wick R., Judd L.M. and Holt K.E. (2018, March 5). Comparison of Oxford Nanopore basecalling tools. GitHub. URL https://github.com/rrwick/Basecalling-comparison 

<a id='deNBI'>[3]</a> de.NBI Nanopore Training Course (2017). Online documentation. URL https://denbi-nanopore-training-course.readthedocs.io/en/latest/index.html