## Basecalling

The characteristic electric signals, also called events, are recorded in a digital file in FAST5 format. FAST5 is in fact HDF5, a very flexible format for storing and manipulating data. These files contain not only the signal events but also logs and metadata. To make them readable for us, the first step after sequencing is to translate the recorded electric signals into nucleotide letters (stored in FASTQ/FASTA files), a step known as basecalling.
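Since FAST5 is HDF5 underneath, a FAST5 file can be recognised by the 8-byte HDF5 signature at its start. The following is a minimal sketch of such a check; `is_hdf5` is a hypothetical helper name, and the sketch assumes the signature sits at offset 0 (that is, the file has no HDF5 user block).

```shell
# Succeed (exit status 0) if the file starts with the 8-byte HDF5 signature
# \x89 H D F \r \n \x1a \n, and fail otherwise.
# Assumption: the signature is at offset 0 (no HDF5 user block).
is_hdf5() {
    [ "$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')" = "894844460d0a1a0a" ]
}

# Example call (the path is a placeholder):
#   is_hdf5 path/to/read.fast5 && echo "looks like HDF5"
```

If the HDF5 command-line tools are installed, `h5ls -r` or `h5dump` can then be used to browse the groups and metadata stored inside the file.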

###### NOTE:  Basecalling is computationally intensive in terms of memory and CPU. This task should be run on a GPU or a sufficiently powerful machine to avoid long execution times.

### Albacore

For this step NanoDJ uses Albacore, the official basecaller for ONT data at the time of writing this notebook. Albacore uses neural networks and runs from the command line. Its most important arguments are the directory where the FAST5 files are, the directory where the output should be written, the computing resources to use (number of threads), and information about the flowcell and kit used to prepare the sample.

In [1]:
read_fast5_basecaller.py -h

usage: read_fast5_basecaller.py [-h] [-l] [-v] [-i INPUT] -t WORKER_THREADS -s
                                SAVE_PATH [--resume [X]] [-f FLOWCELL]
                                [-k KIT] [--barcoding] [-c CONFIG]
                                [-d DATA_PATH] [-b] [-r]
                                [-n FILES_PER_BATCH_FOLDER] [-o OUTPUT_FORMAT]
                                [-q READS_PER_FASTQ_BATCH]
                                [--disable_filtering] [--disable_pings]

ONT Albacore Sequencing Pipeline Software

optional arguments:
  -h, --help            show this help message and exit
  -l, --list_workflows  List standard flowcell / kit combinations.
  -v, --version         Print the software version.
  -i INPUT, --input INPUT
                        Folder containing read fast5 files (if not present,
                        will expect file names on stdin).
  -t WORKER_THREADS, --worker_threads WORKER_THREADS
                        Number of worker threads to use.
  -s SA

To run Albacore properly, the user first needs to know some details of the experiment, such as the flowcell and the sequencing kit that were used. The following command lists the supported flowcell and kit combinations together with the configuration file Albacore uses for each.

In [4]:
read_fast5_basecaller.py -l

Parsing config files in /opt/albacore.
Available flowcell + kit combinations are:
flowcell    kit         barcoding  config file
FLO-MIN106  SQK-DCS108             r94_450bps_linear.cfg
FLO-MIN106  SQK-LRK001             r94_450bps_linear.cfg
FLO-MIN106  SQK-LSK108             r94_450bps_linear.cfg
FLO-MIN106  SQK-LSK109             r94_450bps_linear.cfg
FLO-MIN106  SQK-LWB001  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-LWP001             r94_450bps_linear.cfg
FLO-MIN106  SQK-PBK004  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-PCS108             r94_450bps_linear.cfg
FLO-MIN106  SQK-PSK004             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAB201  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-RAB204  included   r94_450bps_linear.cfg
FLO-MIN106  SQK-RAD002             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAD003             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAD004             r94_450bps_linear.cfg
FLO-MIN106  SQK-RAS201             r94_450bps_linear.cfg
FLO-MIN106  SQK-
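Because the listing above is plain whitespace-separated text, the configuration file for a given flowcell and kit can be looked up with a short filter. The sketch below is only an illustration: `albacore_config_for` is a hypothetical helper, not part of Albacore, and it assumes that the listing is written to standard output in the column layout shown above.

```shell
# Read the text of `read_fast5_basecaller.py -l` on standard input and print
# the config file for the given flowcell ($1) and kit ($2).
# The config file is taken as the last field, because the "barcoding"
# column is empty for most kits and so the field count varies per row.
albacore_config_for() {
    awk -v fc="$1" -v kit="$2" '$1 == fc && $2 == kit { print $NF; exit }'
}

# Example call (assumes the listing goes to standard output):
#   read_fast5_basecaller.py -l | albacore_config_for FLO-MIN106 SQK-LSK108
```

If nothing is printed, the flowcell and kit pair was not found in the listing.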

The main Albacore command runs the basecalling process. Its options are:
- **-i** : input directory containing the FAST5 files
- **-r** : search the input directory recursively
- **-t** : number of worker threads to use
- **-s** : output directory
- **-o** : output format(s)
- **-c** : configuration file to be used in the execution. This file should match one of the files listed by the command above.

#### REMARK: **Data should not be included in the /data folder of this project (see the "Additional data" section in the README file)**


In [1]:
read_fast5_basecaller.py -i data/edudata/fast5 \
                         -r \
                         -t 48 \
                         -s data/edudata/albacore_output/ \
                         -o fastq,fast5 \
                         -c r94_450bps_linear.cfg


| 1556 of 1556|##############################################|100% Time: 0:04:54
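Before launching a long run like the one above, it can be worth confirming that the input directory really contains FAST5 files, including those in subdirectories that `-r` would pick up. A minimal sketch, where `count_fast5` is a hypothetical helper name:

```shell
# Count the *.fast5 files under a directory, descending into subdirectories
# (the same files a recursive search of the input directory would find).
count_fast5() {
    find "$1" -type f -name '*.fast5' | wc -l | tr -d ' '
}

# Example call:
#   count_fast5 data/edudata/fast5
```

A result of 0 usually means the input path is wrong or the data have not been downloaded yet.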


The output is often split across more than one FASTQ/FASTA file. To perform QC, alignments or assemblies, the sequences need to be gathered into a single file. The user can easily join the output files using the following bash commands:

In [2]:
cat data/edudata/albacore_output/workspace/pass/*.fastq > data/edudata/reads.fastq
cat data/edudata/albacore_output/workspace/fail/*.fastq >> data/edudata/reads.fastq
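A quick sanity check after merging is to compare the number of reads in the joined file with the number in the original pieces. The sketch below assumes standard four-line FASTQ records (header, sequence, `+`, qualities); `count_fastq_reads` is a hypothetical helper name.

```shell
# Print the number of reads in one or more FASTQ files.
# Assumption: every record spans exactly four lines, so the read count
# is the total number of lines divided by four.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$@"
}

# Example calls; the two numbers should agree:
#   count_fastq_reads data/edudata/albacore_output/workspace/pass/*.fastq \
#                     data/edudata/albacore_output/workspace/fail/*.fastq
#   count_fastq_reads data/edudata/reads.fastq
```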

### References

<a id='albacore'>[1]</a> Oxford Nanopore Technologies (2017, September 4). New basecaller now performs ‘raw basecalling’, for improved sequencing accuracy. URL https://nanoporetech.com/about-us/news/new-basecaller-now-performs-raw-basecalling-improved-sequencing-accuracy

<a id='wick_et_al'>[2]</a> Wick R., Judd L.M. and Holt K.E. (2018, March 5). Comparison of Oxford Nanopore basecalling tools. GitHub. URL https://github.com/rrwick/Basecalling-comparison 

<a id='deNBI'>[3]</a> de.NBI Nanopore Training Course (2017). GitHub repository. URL https://denbi-nanopore-training-course.readthedocs.io/en/latest/index.html