MIX: Combining multiple assemblies from NGS data


Mix

Finishing bacterial genome assemblies with Mix

Mix is a tool that combines two or more draft assemblies without relying on a reference genome. Its goal is to reduce contig fragmentation and thus speed up genome finishing. The algorithm builds an extension graph in which vertices represent contig extremities and edges represent alignments between these extremities. These alignment edges are used for contig extension. The output assembly corresponds to a path in the extension graph that maximizes the cumulative contig length.
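To make the idea concrete, here is a toy sketch in Python. It is not Mix's actual implementation. It assumes a simplified model in which each contig is a single node (Mix itself models contig extremities), the extension edges are directed and acyclic, and overlaps are ignored when summing lengths. The contig names, lengths, edges and the function `best_path` are all hypothetical.

```python
from functools import lru_cache

# Hypothetical contigs (name -> length in bp) from two draft assemblies.
contig_lengths = {"A1": 5000, "A2": 3000, "B1": 4500, "B2": 6000}

# Hypothetical extension edges: the end of the key contig aligns with the
# start of each listed contig, so the first can be extended into the second.
extensions = {"A1": ["B1", "B2"], "B1": ["A2"], "B2": [], "A2": []}


@lru_cache(maxsize=None)
def best_path(contig):
    """Return (cumulative_length, path) for the best path starting at contig.

    Assumes the extension graph is acyclic; a real assembly graph needs
    cycle handling, which this sketch leaves out.
    """
    best = (contig_lengths[contig], (contig,))
    for nxt in extensions[contig]:
        length, path = best_path(nxt)
        candidate = (contig_lengths[contig] + length, (contig,) + path)
        if candidate[0] > best[0]:
            best = candidate
    return best


# Pick the start contig whose best path has the largest cumulative length.
total, path = max(best_path(c) for c in contig_lengths)
print(total, path)
```

In this toy graph the path A1, B1, A2 wins with 12500 bp, even though B2 is the longest single contig, because the cumulative length is what is maximized.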

The Mix algorithm, approach and results were published in BMC Bioinformatics: http://www.biomedcentral.com/1471-2105/14/S15/S16. If you use Mix, please cite the following paper:

@article{24564706,
Author = {Soueidan, Hayssam and Maurier, Florence and Groppi, Alexis and Sirand-Pugnet, Pascal and Tardy, Florence and Citti, Christine and Dupuy, Virginie and Nikolski, Macha},
Journal = {BMC Bioinformatics},
Number = {Suppl 15},
Pages = {S16},
Title = {Finishing bacterial genome assemblies with Mix},
Volume = {14},
Year = {2013}}

Getting help

For any information or help running Mix, you can get in touch with:

Datasets for the Mix publication

Datasets corresponding to the benchmarks analyzed in the manuscript, as well as supplementary figures, are available at the accompanying website.

Installation

System Requirements

Mix is implemented in Python and was tested under Linux. It has the following requirements:

Obtaining Mix

Mix can be downloaded at:

You can get the latest development version via this GitHub link.

Installing Mix

To install the software, download the file Mix-master.zip (see previous section) and extract it:

unzip Mix-master.zip

Usage

Mix runs in a command line environment. The package contains one main script called Mix.py that coordinates the execution of the whole process. This section describes how the user must call it.

Data input:

Mix takes as input an alignment file and a file containing all the contigs (in FASTA format). The alignment file must be in coords format, obtained with MUMmer using the following commands:

nucmer -prefix=alignments contigs.fa contigs.fa
show-coords -rcl alignments.delta > alignments.coords

The file "contigs.fa" contains the contigs of all the assemblies that are to be mixed (two or more). This file, which concatenates all the assemblies, is generated by running the script "preprocessing.py".
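The concatenation step can be pictured with the following minimal Python sketch. It is an illustration only: the function `concatenate_assemblies` is hypothetical, and prefixing each contig name with an assembly label (to keep names unique across assemblies) is an assumption, not necessarily what preprocessing.py does.

```python
import io


def concatenate_assemblies(assemblies):
    """Merge several FASTA texts into one multi-FASTA string.

    `assemblies` maps a label (e.g. "Assembly1") to FASTA text. Each header
    is prefixed with its label so contig names stay unique; this naming
    scheme is an assumption made for the sketch.
    """
    out = io.StringIO()
    for label, fasta_text in assemblies.items():
        for line in fasta_text.splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                out.write(">{}_{}\n".format(label, line[1:]))
            else:
                out.write(line + "\n")
    return out.getvalue()


# Two tiny hypothetical assemblies that reuse the same contig name.
merged = concatenate_assemblies({
    "Assembly1": ">contig_1\nACGT\n",
    "Assembly2": ">contig_1\nGGCC\nTTAA\n",
})
print(merged)
```

Without some renaming, two assemblers that both emit a contig called "contig_1" would collide in the merged file and in the alignment tags.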

Options:

Usage: Mix.py [options]

Options:
  -h, --help            show this help message and exit
  -a ALN, --aln=ALN     the file containing the alignments (.coords)
  -o OUT, --out=OUT     the output directory where the scaffolds will be
                        written (it must already exist)
  -c CTG, --ctg=CTG     the multi FASTA file containing all the contigs that
                        were used in the alignment
  -A ATH, --ath=ATH     minimum length of alignment (default:0)
  -C CTH, --cth=CTH     minimum length of contig (default:500)
  -d, --dot             write the graphs in dot format
  -g, --graph           write the graphs in cytoscape format
  -r, --restrict-to-aligned
                        If on, restrict output to aligned coords

For the current version applied to bacterial assemblies, the recommended parameter values are -A 500 -C 0.
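The effect of the two thresholds can be illustrated with a small Python sketch. It makes several assumptions: the sample text mimics the usual layout of `show-coords -rcl` output, an alignment is dropped when either aligned length is below the -A value or either contig is shorter than the -C value, and self-hits (a contig aligned to itself) are discarded. The functions `parse_coords` and `filter_alignments` are hypothetical and are not taken from Mix.

```python
# Hypothetical excerpt of a .coords file (contigs.fa aligned against itself).
SAMPLE_COORDS = """\
contigs.fa contigs.fa
NUCMER

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
       1     2000  |        1     2000  |     2000     2000  |   100.00  |     2000     2000  |   100.00   100.00  | ctgA\tctgA
    1401     2000  |        1      600  |      600      600  |    99.50  |     2000     1500  |    30.00    40.00  | ctgA\tctgB
     101      250  |      300      449  |      150      150  |    98.00  |     2000      800  |     7.50    18.75  | ctgA\tctgC
"""


def parse_coords(text):
    """Parse alignment rows, skipping the header and separator lines."""
    rows = []
    for line in text.splitlines():
        fields = [f.split() for f in line.split("|")]
        if len(fields) != 7 or not fields[0] or not fields[0][0].isdigit():
            continue  # not a data row
        rows.append({
            "len_ref_aln": int(fields[2][0]),
            "len_qry_aln": int(fields[2][1]),
            "len_ref": int(fields[4][0]),
            "len_qry": int(fields[4][1]),
            "ref": fields[6][0],
            "qry": fields[6][1],
        })
    return rows


def filter_alignments(rows, min_aln=0, min_ctg=500):
    """Keep rows passing both thresholds (defaults mirror -A 0 -C 500)."""
    kept = []
    for r in rows:
        if r["ref"] == r["qry"]:
            continue  # assumption: self-hits carry no extension information
        if min(r["len_ref_aln"], r["len_qry_aln"]) < min_aln:
            continue
        if min(r["len_ref"], r["len_qry"]) < min_ctg:
            continue
        kept.append(r)
    return kept


rows = parse_coords(SAMPLE_COORDS)
default_pairs = [(r["ref"], r["qry"]) for r in filter_alignments(rows)]
recommended_pairs = [(r["ref"], r["qry"])
                     for r in filter_alignments(rows, min_aln=500, min_ctg=0)]
print(default_pairs)
print(recommended_pairs)
```

With the defaults both non-self alignments survive, while the recommended values drop the 150 bp alignment to ctgC and keep only the 600 bp overlap between ctgA and ctgB.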

Outputs:

The output assembly will be in the file:

<specified_output_dir>/Mix_results_A<aln_threshold>_C<contig_threshold>/scaffolds.fa

along with statistics and graphs about the final assembly in the files

<specified_output_dir>/Mix_results_A<aln_threshold>_C<contig_threshold>/initial_assembly_graph.gml
<specified_output_dir>/Mix_results_A<aln_threshold>_C<contig_threshold>/reduced_assembly_graph.gml
<specified_output_dir>/Mix_results_A<aln_threshold>_C<contig_threshold>/all_alignments.csv
<specified_output_dir>/Mix_results_A<aln_threshold>_C<contig_threshold>/all_contigs.csv

The two assembly graphs can also be saved in Cytoscape or DOT format by setting the corresponding command line flags (-g and -d).
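When scripting around Mix, it can help to compute the expected output locations up front. The sketch below follows the naming pattern listed above. The function `mix_output_paths` is hypothetical, and it assumes the thresholds appear in the directory name exactly as they were passed on the command line.

```python
import os

# File names listed in the Outputs section.
OUTPUT_FILES = [
    "scaffolds.fa",
    "initial_assembly_graph.gml",
    "reduced_assembly_graph.gml",
    "all_alignments.csv",
    "all_contigs.csv",
]


def mix_output_paths(out_dir, aln_threshold, ctg_threshold):
    """Map each output file name to its expected path for one Mix run."""
    result_dir = os.path.join(
        out_dir, "Mix_results_A{}_C{}".format(aln_threshold, ctg_threshold)
    )
    return {name: os.path.join(result_dir, name) for name in OUTPUT_FILES}


# Thresholds taken from the example session (-A 200 -C 300).
paths = mix_output_paths("output_dir", 200, 300)
print(paths["scaffolds.fa"])
```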

Example Session

<your_path_to_Mix>/Mix-<version>/bin/preprocessing.py -o contigs.fa Assembly1.fa Assembly2.fa [Assembly3.fa]
nucmer --maxmatch -c 30 -l 30 -banded -prefix=alignments contigs.fa contigs.fa
show-coords -rcl alignments.delta > alignments.coords
<your_path_to_Mix>/Mix-<version>/bin/Mix.py -a alignments.coords -c contigs.fa -o output_dir/ -C 300 -A 200 
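The same session can be driven from Python, for instance to run several parameter combinations. This is a sketch under assumptions: the functions `build_commands` and `run_pipeline` are hypothetical, the directory "path/to/Mix/bin" is a placeholder for your own installation, and the command-line flags are copied verbatim from the session above. The output directory is created first because Mix.py requires it to exist.

```python
import os
import subprocess


def build_commands(mix_bin_dir, assemblies, out_dir, min_aln=200, min_ctg=300):
    """Return the four pipeline steps as (argv, stdout_file_or_None) pairs."""
    preprocessing = os.path.join(mix_bin_dir, "preprocessing.py")
    mix = os.path.join(mix_bin_dir, "Mix.py")
    return [
        ([preprocessing, "-o", "contigs.fa"] + list(assemblies), None),
        (["nucmer", "--maxmatch", "-c", "30", "-l", "30", "-banded",
          "-prefix=alignments", "contigs.fa", "contigs.fa"], None),
        # show-coords writes to stdout, so its output is redirected to a file.
        (["show-coords", "-rcl", "alignments.delta"], "alignments.coords"),
        ([mix, "-a", "alignments.coords", "-c", "contigs.fa", "-o", out_dir,
          "-C", str(min_ctg), "-A", str(min_aln)], None),
    ]


def run_pipeline(commands, out_dir):
    """Run each step in order, stopping at the first failure."""
    os.makedirs(out_dir, exist_ok=True)  # Mix.py expects an existing directory
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


# Only build and display the commands here; call run_pipeline() to execute.
commands = build_commands("path/to/Mix/bin",
                          ["Assembly1.fa", "Assembly2.fa"], "output_dir/")
for argv, stdout_path in commands:
    print(" ".join(argv), "> " + stdout_path if stdout_path else "")
```

Keeping command construction separate from execution makes it easy to inspect or log the exact commands before anything is run.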

Evaluations over Mycoplasmas and GAGE-B

  • This requires QUAST (>= 2.1) in addition to the Mix requirements. QUAST can be downloaded from its SourceForge website.

  • The provided Makefile rules for regenerating the results presented in the RECOMB-CG paper expect QUAST to be installed in bin/quast2-1. If QUAST is installed elsewhere, update the headers of all Makefiles to point to that installation.

  • The script expects MUMmer in bin/MUMmer. We recommend creating a symlink from quast-2.1/libs/MUMmer3.23-ARC to bin/MUMmer.

  • Datasets for the Mycoplasmas studies are available on the accompanying website. We provide a batch download script in "datasets/Mycoplasmas/get_mycoplasmas.sh".

  • Once QUAST, MUMmer and the datasets have been obtained, you can generate all the QUAST reports for the 10 mycoplasmas by running "make ALLMYCOQUASTS" in the root Mix folder. This takes some time (~10 min on a MacBook Pro i5) and generates and populates the following files and folders:

  • temp_assemblies/*.fasta: Concatenated fasta files of the input assemblies

  • temp_assemblies/*.coords *.delta: MUMmer results for the self-alignment of the FASTA files

  • temp_assemblies/*.mix.log.txt: Log of Mix execution

  • result_assemblies/*_mix.fasta: Merged assembly resulting from the execution of Mix

  • result_statistics/all_myco/*_QUAST: QUAST report for each species, in which the 19 resulting assemblies (single assemblies, GAA, GAM-NGS and Mix) are compared
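Before launching the evaluation, a quick check that the expected layout is in place can save a failed run. The sketch below only tests for the paths named in the list above, relative to the root Mix folder. The function `missing_evaluation_paths` is hypothetical, and the demonstration uses a temporary directory in place of a real checkout.

```python
import os
import tempfile

# Paths the evaluation setup is described as expecting, relative to the
# root Mix folder.
EXPECTED_PATHS = [
    os.path.join("bin", "quast2-1"),
    os.path.join("bin", "MUMmer"),
    os.path.join("datasets", "Mycoplasmas", "get_mycoplasmas.sh"),
]


def missing_evaluation_paths(mix_root):
    """Return the expected paths that do not exist under mix_root."""
    return [p for p in EXPECTED_PATHS
            if not os.path.exists(os.path.join(mix_root, p))]


# Demonstration on a temporary folder where only bin/MUMmer is present.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "bin", "MUMmer"))
    missing = missing_evaluation_paths(root)

print(missing)
```

On a real installation, call `missing_evaluation_paths` with the path to your Mix folder; an empty list means the three locations were found.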

LICENSE

Copyright (c) 2013 Hayssam Soueidan (1) (massyah@gmail.com) 
            Florence Maurier (2) (florence.maurier@u-bordeaux2.fr)
            Alexis Groppi (2) (alexis.groppi@u-bordeaux2.fr)
            Macha Nikolski (3) (macha@labri.fr)
(1) NKI-AVL, Plesmanlaan 121,
1066 CX Amsterdam, Netherlands

(2) CBiB - Universite Victor Segalen Bordeaux,
146, rue Leo Saignat, 33076 Bordeaux, France

(3) CNRS / LaBRI, Universite Bordeaux 1, 351 cours de la Liberation,
33405 Talence Cedex, France 

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.