## Basecalling

The characteristic electric signals, also called events, are recorded in a digital file in FAST5 format. FAST5 is built on HDF5, a very flexible format for storing and manipulating data. These files contain not only the signal events but also logs and metadata. To make this information readable, the first step after sequencing is to translate the electric signals into nucleotide letters (stored in FASTQ/FASTA files), a step known as basecalling.

###### NOTE: Basecalling is computationally intensive in terms of memory and CPU. Run this task on a GPU or a sufficiently powerful machine to avoid long execution times.

### Albacore

For this step NanoDJ uses Albacore, the official basecaller for ONT data at the time of writing this notebook. Albacore uses neural networks and runs from the command line. Its most important arguments are: the directory where the FAST5 files are stored, the directory where the output should be written, the computing resources to use (number of threads), and the reagents used to prepare the sample (flowcell and sequencing kit).

In [None]:
read_fast5_basecaller.py -h

To run Albacore properly, the user first needs to know some details of the experiment, such as the flowcell and the sequencing kit that were used. The following command lists the names of the configuration files available to Albacore.

In [None]:
read_fast5_basecaller.py -l

The main Albacore command runs the basecalling process. Its arguments are:
- -i : input directory containing the FAST5 files
- -r : search the input directory recursively
- -t : number of execution threads to use
- -s : output directory
- -o : output data format(s)
- -c : configuration file to use in the execution. This file should match one of the files listed by the command above.

#### REMARK: **Data should not be included in /data folder of this project (see "Additional data" section in README file)**


In [None]:
read_fast5_basecaller.py -i data/sample/fast5 \
                         -r \
                         -t 48 \
                         -s data/albacore_output/ \
                         -o fastq,fast5 \
                         -c r94_450bps_linear.cfg
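
Because basecalling can take a long time, it may be worth confirming beforehand that the recursive search (`-r`) will actually find input files. The following is a minimal sketch, not part of Albacore: `count_fast5_files` is a hypothetical helper name, and it assumes the reads carry the `.fast5` extension.

```shell
# Hypothetical helper (not an Albacore command): count the FAST5 files
# under a directory, descending into subfolders like the -r option does.
# Assumes read files use the .fast5 extension.
count_fast5_files() {
    # tr strips the padding that some wc implementations add
    find "$1" -type f -name '*.fast5' | wc -l | tr -d ' '
}
```

For example, `count_fast5_files data/sample/fast5` should print a number greater than zero before the basecaller is launched.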


The output is often split across several FASTQ/FASTA files. To perform QC, alignments or assemblies, the sequences need to be combined into a single file. The user can join the output files with the following bash command:

In [None]:
cat data/albacore_output/*.fastq > data/sample/reads.fastq
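
After joining the files, a quick sanity check can confirm that the merged file looks complete. The sketch below assumes the standard FASTQ layout of four lines per read with no wrapped sequence lines. `count_fastq_reads` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: report the number of reads in a FASTQ file.
# Assumes 4 lines per record (header, sequence, '+', qualities) and
# that the file ends with a newline.
count_fastq_reads() {
    lines=$(wc -l < "$1" | tr -d ' ')
    if [ $((lines % 4)) -ne 0 ]; then
        # A line count that is not a multiple of 4 suggests a truncated file
        echo "warning: $1 has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

Running `count_fastq_reads data/sample/reads.fastq` prints the read count, or a warning if the line count suggests a truncated or malformed file.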