# De novo assembly with Flye

Flye is a de novo assembler for long, noisy reads such as those produced by PacBio and ONT, developed by [Kolmogorov et al. (2018)](https://www.biorxiv.org/content/early/2018/01/12/247148). Flye implements an algorithm that builds an A-Bruijn (assembly) graph from long error-prone reads, along with a number of improvements. It also has a polishing module that produces the final assembly. Unlike other existing assemblers, Flye does not attempt to construct accurate contigs at the initial assembly stage. Instead, it generates arbitrary paths in the still unknown assembly graph and then builds an assembly graph from these paths. Although the overlapping contigs constructed at the initial stage may contain assembly errors, they are combined into an accurate assembly graph. Repeats that were left unresolved during the initial assembly stage are then resolved using small variations between the repeat instances. Finally, the algorithm constructs a new, less tangled assembly graph in which accurate contigs are represented as paths.

Flye has been tested against several assemblers, showing that it generates better or comparable assemblies while running 2-10 times faster than hierarchical assembly pipelines. In our own experience, Flye was one of the most accurate options (along with Canu) while being much less computationally expensive. Flye can generate contiguous and accurate assemblies for any type of genome, including large genomes.

### Flye basic arguments:

<font color='blue'>--nano-raw</font> File of raw ONT reads. The input reads can be in FASTA or FASTQ format, either uncompressed or compressed with gzip.

<font color='blue'>--out-dir</font> Output directory.

<font color='blue'>--genome-size</font> An estimate of the genome size, which must be provided (e.g. 5m or 2.6g). The estimate can be rough (e.g. within a 0.5x-2x range of the true size) and does not affect the other assembly stages.

<font color='blue'>--threads</font> Number of threads to be used in the assembly.

<font color='blue'>--subassemblies</font> Input consisting of multiple sets of high-quality contigs. This option performs a consensus assembly of several contig sets, given as multiple files separated by spaces.

<font color='blue'>--asm-coverage</font> Assembling large genomes at high coverage typically requires a lot of RAM. To reduce memory consumption, this option restricts the initial contig assembly to a subset of the longest reads at the given coverage (the developers recommend 40x, which is enough to produce good draft contigs). Regardless of this parameter, all reads are used later for repeat graph analysis.
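To judge whether `--asm-coverage` is worth setting, it can help to estimate the read depth first. The sketch below is not part of Flye: `estimate_coverage` is a hypothetical helper that assumes an uncompressed FASTQ file with exactly four lines per record and a genome size given in base pairs. The reads path and the 4.7 Mb genome size mirror the example used in this notebook, and the output directory name is made up for illustration.

```shell
# Hypothetical helper (not a Flye command): estimate read depth as
# total sequenced bases / genome size in bp.
# Assumes an uncompressed FASTQ with exactly 4 lines per record.
estimate_coverage() {
    awk -v g="$2" 'NR % 4 == 2 { total += length($0) }
                   END { printf "%.1f\n", total / g }' "$1"
}

reads="data/agalactiae/merged-output.fastq"   # reads file used in this notebook
out_dir="data/agalactiae/flye_output_cov40"   # made-up output directory
genome_bp=4700000                             # 4.7 Mb, matching --genome-size 4.7m

if command -v flye >/dev/null 2>&1 && [ -f "$reads" ]; then
    echo "estimated coverage: $(estimate_coverage "$reads" "$genome_bp")x"
    # Use only the longest reads, up to 40x, for the initial contig assembly
    flye --nano-raw "$reads" --out-dir "$out_dir" \
         --genome-size 4.7m --threads 2 --asm-coverage 40
else
    echo "flye or $reads not available; skipping the run" >&2
fi
```

For a genome as small as 4.7 Mb the option is rarely needed for memory reasons. It is shown here only to illustrate the syntax on the same dataset.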

Currently, Flye supports both raw and corrected reads. The expected error rates are <30% for raw reads and <2% for corrected reads.

```shell
flye --nano-raw data/agalactiae/merged-output.fastq --out-dir data/agalactiae/flye_output --genome-size 4.7m --threads 2
```

```
Traceback (most recent call last):
  File "/usr/local/bin/flye", line 31, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/flye/main.py", line 505, in main
    os.mkdir(args.out_dir)
OSError: [Errno 2] No such file or directory: 'data/agalactiae/flye_output'
```

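The traceback above comes from `os.mkdir`, which creates only the last component of a path. Errno 2 here therefore most likely means that the parent directory `data/agalactiae` does not exist relative to the current working directory. A minimal sketch of a workaround, assuming the notebook is run from the directory that is supposed to contain `data/`, is to create the parent directory before calling Flye:

```shell
out_dir="data/agalactiae/flye_output"
reads="data/agalactiae/merged-output.fastq"

# Create the parent directory; Flye creates the final component itself.
mkdir -p "$(dirname "$out_dir")"

if command -v flye >/dev/null 2>&1 && [ -f "$reads" ]; then
    flye --nano-raw "$reads" --out-dir "$out_dir" --genome-size 4.7m --threads 2
else
    echo "flye or $reads not found; check the working directory and input path" >&2
fi
```

Note that the reads file lives under the same parent directory. If that directory was missing, the input is missing as well, so checking the working directory (for example with `pwd`) is probably the real fix.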

Flye also outputs an assembly graph in .gv format that can be visualized using Graphviz. The graph edges represent genomic sequences and nodes serve as junctions. Each edge is labelled with its ID, length and coverage. Repetitive edges are shown in color, while unique edges are shown in black.

```shell
dot -Tpng -O assembly_graph.gv
```
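A slightly more defensive version of the command above can be sketched as follows. `render_graph` is a hypothetical wrapper, not part of Flye or Graphviz, and the path assumes the `--out-dir` used earlier in this notebook. With `-O`, Graphviz derives the output name from the input file and the format, so the image is expected at `assembly_graph.gv.png` next to the graph file.

```shell
# Hypothetical wrapper around the Graphviz call shown above.
# Returns 1 if the graph file is missing, 2 if 'dot' is not installed.
render_graph() {
    gv_file="$1"
    if [ ! -f "$gv_file" ]; then
        echo "graph file not found: $gv_file" >&2
        return 1
    fi
    if ! command -v dot >/dev/null 2>&1; then
        echo "Graphviz 'dot' not found on PATH" >&2
        return 2
    fi
    # -O writes the image next to the input as <input>.png
    dot -Tpng -O "$gv_file" && echo "wrote ${gv_file}.png"
}

# Path assumes the output directory used earlier in this notebook
render_graph data/agalactiae/flye_output/assembly_graph.gv || true
```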

### Related publications:

[Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin and Pavel Pevzner, "Assembly of Long Error-Prone Reads Using Repeat Graphs", bioRxiv, 2018](https://www.biorxiv.org/content/early/2018/01/12/247148)


[Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, & Pavel Pevzner (2018). Supplementary files for the manuscript "Assembly of Long Error-Prone Reads Using Repeat Graphs" (Version 2.0) [Data set]. Zenodo.](http://doi.org/10.5281/zenodo.1422834)

[Yu Lin, Jeffrey Yuan, Mikhail Kolmogorov, Max W Shen, Mark Chaisson and Pavel Pevzner, "Assembly of Long Error-Prone Reads Using de Bruijn Graphs", PNAS, 2016](https://www.pnas.org/content/113/52/E8396)
