## Hybrid assemblies

##### NOTE: This notebook has high memory requirements

Obtaining the assembled sequence of a complete genome is a complex, multi-step task. In de novo assembly, the simplest elements of the hierarchy are the reads produced by sequencing. The next level is the contig: a contiguous sequence built from overlapping reads. The top level is the scaffold, which orders and orients two or more contigs to give a (near) complete structure of the genome under study. Ideally, one obtains a single contig for each chromosome or plasmid present in the genome. In practice, most assemblies are incomplete, especially those built from short reads alone, either because of how the material was prepared or because of technological limitations. In particular, when a repeat region is longer than the reads, all copies of the repeat collapse into a single contig with multiple connections in the assembly graph. ONT long reads help to build bridges between the contigs generated from short reads, improving the contiguity of the assembly. Put simply, hybrid assembly uses short reads to produce accurate contigs and long reads to provide the information needed to join them.
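
The repeat problem can be illustrated with a toy De Bruijn graph. The sketch below is not taken from any assembler: the function `branching_nodes`, the example sequence and the use of the k-mer size as a stand-in for read length are all assumptions made for illustration. When k is shorter than the repeat, both copies collapse onto the same nodes and the last repeat node gains two different successors; when k spans the repeat plus some flanking sequence, the ambiguity disappears.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return the (k-1)-mer nodes that have more than one distinct successor."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Each k-mer is an edge from its prefix node to its suffix node.
        successors[kmer[:-1]].add(kmer[1:])
    return {node: nxt for node, nxt in successors.items() if len(nxt) > 1}


# Hypothetical sequence: two copies of the repeat GGGCCC in different contexts.
repeat = "GGGCCC"
genome = "ATGACGTA" + repeat + "TTAGCATC" + repeat + "AACTGTCA"

# k shorter than the repeat: the end of the repeat is ambiguous.
print(branching_nodes(genome, 5))
# k longer than the repeat: every node has a single successor.
print(branching_nodes(genome, 9))
```

In the first case the graph alone cannot tell which flank follows which copy of the repeat, which is the information a long read spanning the repeat would supply.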

## Hybrid assembly with Unicycler

Unicycler is a hybrid assembler for bacterial genomes that combines short (Illumina) and long (ONT) reads and uses depth and connectivity information to produce accurate, complete assemblies. Unicycler has a very low misassembly rate. However, it is not particularly fast: the run time depends on the number of long reads and on the size and complexity of the genome. It can use long reads of any depth and quality. A long-read coverage of more than 10× is recommended to reliably complete a genome. Even so, Unicycler can assemble nearly complete genomes from fewer long reads, provided that the short-read coverage is good. It is also easy to use and produces an assembly graph.
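
To check whether a long-read set reaches the recommended coverage, the depth can be estimated as the total number of sequenced bases divided by the genome size. The helper below is a minimal sketch, not part of Unicycler: the function name `estimate_depth` is hypothetical, the genome size must be supplied by the user, and the parser assumes plain four-line FASTQ records.

```python
import io


def estimate_depth(fastq_handle, genome_size_bp):
    """Estimate mean depth as total read bases / genome size.

    Assumes each FASTQ record occupies exactly four lines.
    """
    total_bases = 0
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # the second line of each record is the sequence
            total_bases += len(line.strip())
    return total_bases / genome_size_bp


# Tiny in-memory example: two reads (10 bp and 5 bp) against a 5 bp "genome".
example = io.StringIO("@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTA\n+\nIIIII\n")
print(estimate_depth(example, genome_size_bp=5))
```

For a real data set, the handle would come from `open("reads.fastq")` (or `gzip.open(..., "rt")` for compressed files) and the genome size from a prior estimate for the organism.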

Unicycler relies on SPAdes to perform the short-read assembly, which builds De Bruijn graphs over a range of k-mer sizes, and later runs multiple rounds of short-read polishing with Pilon. Unicycler removes contigs whose depth is less than half the median depth of the graph, which avoids possible contamination. It also guards against incorrect bridges by assigning a quality score to each bridge and applying the bridges in decreasing order of quality. The quality of a bridge depends on the number of reads that support it, the quality of the alignment between the read consensus and the graph path, the length of the two contigs being bridged, the length and quality of the read alignments to the contigs, and the consistency of the reads between the contigs. An ideal long-read bridge therefore connects two long contigs of the same depth.
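
The two filtering ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not Unicycler's implementation: the function names, the contig-end labels and the quality scores are hypothetical, the median is taken over contigs rather than over bases, and the only conflict rule modelled is that a contig end can be used by one bridge.

```python
from statistics import median


def filter_low_depth(contig_depths, fraction=0.5):
    """Keep contigs whose depth is at least `fraction` of the median depth."""
    cutoff = fraction * median(contig_depths.values())
    return {name: depth for name, depth in contig_depths.items() if depth >= cutoff}


def apply_bridges(bridges):
    """Apply bridges from best to worst, skipping any that reuse a contig end."""
    used_ends = set()
    applied = []
    for start, end, quality in sorted(bridges, key=lambda b: b[2], reverse=True):
        if start in used_ends or end in used_ends:
            continue  # a better bridge already claimed one of these ends
        used_ends.update((start, end))
        applied.append((start, end, quality))
    return applied


# Hypothetical contigs: c4 looks like low-depth contamination.
depths = {"c1": 40.0, "c2": 42.0, "c3": 38.0, "c4": 5.0}
print(sorted(filter_low_depth(depths)))

# Hypothetical bridges as (contig end, contig end, quality score).
bridges = [
    ("c1:right", "c2:left", 0.9),
    ("c1:right", "c3:left", 0.4),
    ("c2:right", "c3:left", 0.7),
]
print(apply_bridges(bridges))
```

Applying the best bridges first means that a weak, conflicting bridge (here the 0.4 one) is discarded rather than competing with a well-supported one.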

Unicycler uses SeqAn for semi-global alignment and for producing a consensus sequence for each bridge, and it finds the best graph path connecting the contigs with a branch-and-bound algorithm. If a circular replicon has been completely assembled, it is now a single contig with a link joining its end to its beginning. In that case, Unicycler uses TBLASTN to search for dnaA or repA alleles in each completed replicon and rotates the sequence to start at that gene, which gives consistently oriented assemblies and reduces the risk of a gene being split across the start and end of the sequence. As a final step, Unicycler uses Bowtie2 and Pilon to polish the assembly and reduce the rate of small errors.

Unicycler can be run in three modes: Conservative, which uses only very high-quality bridges; Bold, which also accepts lower-quality bridges; and Normal, which uses an intermediate quality threshold and is recommended in most cases.

Short-read sets (Illumina or IonTorrent) can be combined with ONT read sets. The basic Unicycler command performs both the assembly and the polishing steps and takes the following inputs:

- <font color='blue'>-1</font> and <font color='blue'>-2</font> : Paired-end Illumina reads (first and second read of each pair).
- <font color='blue'>-l</font> : Long reads (ONT in our case).
- <font color='blue'>-s</font> : Unpaired short reads (for example, IonTorrent reads).
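
The three modes are selected with Unicycler's `--mode` option (`conservative`, `normal` or `bold`). The sketch below only builds and prints the command line for each mode; it does not run Unicycler. The helper name `unicycler_command`, the output directory names and the choice of printing rather than executing are assumptions for illustration, while the input paths are the sample paths used in this notebook.

```python
import shlex

MODES = ("conservative", "normal", "bold")


def unicycler_command(mode, out_dir):
    """Build the argument list for a hybrid Unicycler run in the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        "./unicycler-runner.py",
        "-1", "data/sample/short/reads_1.fastq.gz",  # first reads of each pair
        "-2", "data/sample/short/reads_2.fastq.gz",  # second reads of each pair
        "-l", "data/sample/reads.fastq",             # long (ONT) reads
        "--mode", mode,
        "-o", out_dir,
    ]


for mode in MODES:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(unicycler_command(mode, f"hybrid_output_{mode}")))
```

Running the same data in more than one mode and comparing the resulting assembly graphs is one way to judge how much a result depends on lower-quality bridges.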

#### REMARK: **Data should not be included in the /data folder of this project (see the "Additional data" section of the README file)**

In [None]:
#Data not included in repository
./unicycler-runner.py \
                      -1 data/sample/short/reads_1.fastq.gz \
                      -2 data/sample/short/reads_2.fastq.gz \
                      -l data/sample/reads.fastq \
                      -o hybrid_output


## Hybrid assembly with MaSuRCA

[MaSuRCA](http://www.genome.umd.edu/masurca.html) is a de novo assembler that can assemble either short reads alone or a mixture of short (Illumina) and long (ONT) reads, including those from animal or plant genomes. It combines the efficiency of the De Bruijn graph (DBG) approach with the Overlap-Layout-Consensus (OLC) approach. OLC attempts to compute all pairwise overlaps between reads, using sequence similarity to determine each overlap. DBG instead represents the overlaps between sequences through shared k-mers, which avoids computing the pairwise overlaps. The computational requirements vary with the size of the genome to be assembled: memory usage scales linearly with genome size, and execution time scales linearly with depth of coverage.
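
The difference between the two approaches can be seen on three toy reads. The code below is a simplified sketch, not MaSuRCA's algorithm: the function names and the reads are hypothetical, overlaps are exact suffix-prefix matches, and reverse complements are ignored. The OLC side compares every ordered pair of reads, while the DBG side makes a single pass over the reads and lets shared k-mers express the overlaps.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if none)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads, min_len):
    """OLC style: test every ordered pair of reads."""
    found = {}
    for i, j in permutations(range(len(reads)), 2):
        length = overlap(reads[i], reads[j], min_len)
        if length:
            found[(i, j)] = length
    return found


def debruijn_edges(reads, k):
    """DBG style: one pass over the reads, no pairwise comparison."""
    return {(r[i:i + k - 1], r[i + 1:i + k]) for r in reads for i in range(len(r) - k + 1)}


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(all_overlaps(reads, min_len=3))
print(len(debruijn_edges(reads, k=4)))
```

The number of pairwise comparisons grows quadratically with the number of reads, whereas the k-mer pass grows linearly with the total number of bases, which is why DBG is favoured for large short-read sets.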

MaSuRCA uses a modified version of the CABOG assembler for the overlap-based assembly that follows the construction of super-reads. The basic idea of a super-read is to extend each original read forward and backward, base by base, for as long as the extension is unique. Every original read is contained in a super-read, and many original reads produce the same super-read, so the resulting data set is much smaller than the original one.
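
The super-read idea can be sketched as follows. This is a simplified illustration, not MaSuRCA's code: the function names, the toy genome and the reads are hypothetical, sequencing errors and reverse complements are ignored, and a read is extended only while exactly one k-mer in the data continues it.

```python
def kmer_set(reads, k):
    """Collect every k-mer observed in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend_read(read, kmers, k):
    """Extend a read base by base while the extension is unique."""
    seq = read
    seen = set()  # guards against looping forever on circular sequences
    while True:  # forward extension
        tail = seq[-(k - 1):]
        candidates = [tail + b for b in "ACGT" if tail + b in kmers]
        if len(candidates) != 1 or candidates[0] in seen:
            break
        seen.add(candidates[0])
        seq += candidates[0][-1]
    while True:  # backward extension
        head = seq[:k - 1]
        candidates = [b + head for b in "ACGT" if b + head in kmers]
        if len(candidates) != 1 or candidates[0] in seen:
            break
        seen.add(candidates[0])
        seq = candidates[0][0] + seq
    return seq


# Hypothetical repeat-free genome and four overlapping reads drawn from it.
genome = "ATGACGTAGGGCCCTTAGCATC"
reads = [genome[0:10], genome[5:15], genome[10:20], genome[12:22]]
kmers = kmer_set(reads, k=5)
super_reads = {extend_read(r, kmers, k=5) for r in reads}
print(len(reads), "reads ->", len(super_reads), "super-read")
```

Because this toy genome has no repeats, all four reads extend to the same sequence; at a repeat boundary the extension would stop, leaving several shorter super-reads for the overlap-based step to resolve.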

MaSuRCA can use BioNano Genomics (BNG) maps to improve the assembly of highly repetitive genomes: the positions of, and distances between, restriction sites in the map are compared with a restriction map built computationally. The BNG map can reveal large-scale rearrangements, insertions or deletions. The assembly algorithms can find the correct location for repetitive elements if the input includes a read long enough to span one full copy of the repeat plus unique flanking regions on each side. The last important step in MaSuRCA is gap filling, which closes gaps in the scaffolds that are relatively short and do not contain complicated repetitive structures.

MaSuRCA is driven by a configuration file that holds the parameters and the paths to the input data. Once the configuration file is ready, the "masurca" script checks it, loads the parameters and generates the "assemble.sh" script. In this example, the configuration file provides only the arguments needed to build an assembly from Illumina short reads and ONT long reads. MaSuRCA assemblies tend to be better when several paired-end (PE) libraries are used. Finally, the "assemble.sh" script is executed to build the hybrid assembly.

The config file consists of two sections: DATA and PARAMETERS. Each section concludes with an END statement. Users should copy the sample config file to the directory chosen for running the assembly and then modify it according to the specifications of the assembly project.
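
Before launching a long run, it can be useful to check that a config file has the structure described above. The parser below is a hypothetical helper, not part of MaSuRCA: it only checks that DATA and PARAMETERS are both present and each closed by END, and it treats `#` as the start of a comment. The file names in the example are placeholders.

```python
def parse_masurca_config(text):
    """Split a config into its DATA and PARAMETERS sections, checking END markers."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line in ("DATA", "PARAMETERS"):
            if current is not None:
                raise ValueError(f"{current} section is missing its END")
            current = line
            sections[current] = []
        elif line == "END":
            if current is None:
                raise ValueError("END found outside of a section")
            current = None
        else:
            if current is None:
                raise ValueError(f"entry outside of a section: {line}")
            key, _, value = line.partition("=")
            # A list keeps repeated keys, such as several PE libraries.
            sections[current].append((key.strip(), value.strip()))
    if current is not None:
        raise ValueError(f"{current} section is missing its END")
    missing = {"DATA", "PARAMETERS"} - sections.keys()
    if missing:
        raise ValueError(f"missing sections: {sorted(missing)}")
    return sections


example = """DATA
PE= pe 300 15 r1.fastq.gz r2.fastq.gz
NANOPORE=ont.fastq
END
PARAMETERS
NUM_THREADS = 8
END
"""
print(parse_masurca_config(example))
```

A check like this catches a forgotten END or a misplaced entry in seconds, instead of after the assembly has already been queued.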


In [None]:
#The following command writes the configuration file: every line between the "cat" command and the closing "EOL" marker (from "DATA" to the last "END") is saved to config_file.txt.

cat > config_file.txt << EOL
DATA
PE= pe 300 15  data/test/short/Sample_S5_L001_R1_001.fastq.gz data/test/short/Sample_S5_L001_R2_001.fastq.gz
NANOPORE=data/agalactiae/reads.fastq
END
PARAMETERS
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 0
LIMIT_JUMP_COVERAGE = 300
CA_PARAMETERS =  cgwErrorRate=0.15
KMER_COUNT_THRESHOLD = 1
NUM_THREADS = 48
JF_SIZE = 63000000
END
EOL
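
JF_SIZE is the size of the Jellyfish hash used for k-mer counting and has to be adapted to each project. The sample configuration distributed with MaSuRCA suggests, in its comments, a safe value of roughly 20 times the estimated genome size; that multiplier is quoted here as an assumption and should be checked against the installed version. The helper name `jf_size` and the 2.1 Mb genome size are hypothetical.

```python
def jf_size(genome_size_bp, multiplier=20):
    """Return a Jellyfish hash size as a multiple of the estimated genome size.

    The default multiplier is an assumption; check the sample config
    shipped with your MaSuRCA version.
    """
    if genome_size_bp <= 0 or multiplier <= 0:
        raise ValueError("genome size and multiplier must be positive")
    return int(genome_size_bp * multiplier)


# Hypothetical bacterial genome of about 2.1 Mb.
print(f"JF_SIZE = {jf_size(2_100_000)}")
```

A value that is too small makes k-mer counting slower or more memory-hungry rather than wrong, so a generous estimate is usually the safer choice.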

In [None]:
#The config file is loaded to generate assemble.sh, which is then run from the current directory
masurca config_file.txt && ./assemble.sh

### References

[1] Wick R.R., Judd L.M., Gorrie C.L., Holt K.E. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 2017. DOI: https://doi.org/10.1371/journal.pcbi.1005595

[2] Wick R.R., Judd L.M., Gorrie C.L., Holt K.E. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom 2017. DOI: https://doi.org/10.1099/mgen.0.000132

[3] Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. The MaSuRCA genome assembler. Bioinformatics 2013;29(21):2669–2677. DOI: https://doi.org/10.1093/bioinformatics/btt476

[4] Zimin A.V., Puiu D., Luo M.-C., Zhu T., Koren S., Marçais G., Yorke J.A., Dvorak J., Salzberg S.L. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res 2017. DOI: https://doi.org/10.1101/gr.213405.116

