Alternative alignment tool #40

RenzoTale88 · 2019-03-08T12:19:39Z

Good day,
I'm currently working on the assembly of a large mammalian genome. I would like to perform the reference-assisted scaffolding using ragout. However, I cannot run progressive cactus on the genomes involved due to computational constraints, and the assembly are simply too large to be efficiently processed with sibelia.

My question is: is there a way to create alignments in a way that is suitable for Ragout2?

For example:
pairwise lastz alignments -> combined multi-species MAF file -> maf2hal -> ragout2

Thank you in advance,
Andrea

fenderglass · 2019-03-12T20:30:05Z

Hi Andrea,

It is possible to use .maf as input (I was planning to retire this functionality, but it is still in the code) - but the problem is that Ragout relies on properties of the Cactus alignments. In particular, alignment blocks should be non-intersecting - which is not true for most of the other aligners (since it is in fact a very hard problem to generate such alignments).

What are your constraints? Cactus should take ~120 CPU days per mammalian genome, which translates into a few actual days on a single cluster node.

There is a recent Sibelia update (https://github.com/medvedevgroup/SibeliaZ) that in fact should handle large genomes and is capable of producing the right maf alignment. I have not tested it yet - but you might try to use it to generate alignment and run Ragout with it.

Best,
Mikhail

RenzoTale88 · 2019-03-12T21:52:53Z

Hi Mikhail,
Thank you for your reply.
My problem with cactus is not related to computational power/time, but to incompatibilities with toil and the cluster, that prevent it from running properly.
I think that using sibeliaZ seems a good solution. I guess that you would proceed as follow: sibeliaZ > maf > hal > ragout

Am I right?
Thank you in advance,
Andrea

fenderglass · 2019-03-13T02:41:41Z

You can use maf directly by adding an extra recipe parameter .maf = path_to_maf, and using -s maf switch.

RenzoTale88 · 2019-03-13T16:17:23Z

I've just tried to run Ragout on the alignments produced using SibeliaZ (maf format) using the genomes provided in example/E.Coli folder. I've changed the header of the respective fasta so that it shows the organism in the format "dh1.gi", "ms1655.seq1", etc.
The maf file is properly parsed, even if raising errors of overlapping blocks, and finally crashes while refining the assembly graph.
I though that the problem was due to the presence of self-alignments within the same assembly. But even after filtering those out from the MAF file, the result doesn't change, still crashing while detecting chimeric adjacencies.
Below, you can find the log produced, the error, the configuration file, the two fai with the names used and the top 100 lines of the sibeliaZ maf file. I hope these information can be of use to understand the problem.
Thank you in advance for your help,
Andrea

[16:09:31] DEBUG: >>With block size: 5000
[16:09:31] DEBUG: Checking 2 target adjacencies
[16:09:31] DEBUG: Found 1 target-specific adjacencies, 1 broken as chimeric
[16:09:31] DEBUG: >>With block size: 500
[16:09:31] DEBUG: Checking 16 target adjacencies
[16:09:31] DEBUG: Found 3 target-specific adjacencies, 1 broken as chimeric
[16:09:31] DEBUG: >>With block size: 100
[16:09:31] DEBUG: Checking 105 target adjacencies
[16:09:31] DEBUG: Found 11 target-specific adjacencies, 3 broken as chimeric
[16:09:31] DEBUG: >>With block size: 100
[16:09:31] DEBUG: Checking 105 target adjacencies
[16:09:31] DEBUG: Found 11 target-specific adjacencies, 3 broken as chimeric
Traceback (most recent call last):
File "/exports/applications/apps/community/roslin/ragout/2.1.1/bin/ragout", line 35, in
sys.exit(main())
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/main.py", line 290, in main
_run_ragout(args)
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/main.py", line 198, in _run_ragout
chim_detect = ChimeraDetector(raw_bp_graphs, run_stages, target_sequences)
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/breakpoint_graph/chimera_detector.py", line 26, in init
self._make_hierarchical_breaks()
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/breakpoint_graph/chimera_detector.py", line 62, in _make_hierarchical_breaks
break_pos = self._optimal_break(seq_name, *adjusted_break)
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/breakpoint_graph/chimera_detector.py", line 70, in _optimal_break
seq = self.target_seqs[seq_name]
KeyError: 'seq13'

ragout.log
testRagout.rcp.txt
dh1.fasta.fai.txt
mg1655.fasta.fai.txt
top100.maf.txt

fenderglass · 2019-03-17T01:51:18Z

Looks like there are some caveats with using SibeliaZ. I will be testing it in the near future.

RenzoTale88 · 2019-03-17T13:25:17Z

Hi,
The error found was due to the input fasta files provided. Basically, the input fasta that I provided included the species name in each sequence header (e.g. >spp1.chr1), but ragout needed them as simple sequence name (e.g. >seq1). After using this fasta file, the program worked without issues.

The only thing that concerns me is this warning:

Too few overlaps (18) between contigs were detected -- refine procedure will be useless

Is there a way to improve this? Thank you in advance,

Andrea

fenderglass · 2019-03-18T18:55:44Z

Great, I'm glad that you got it working! Did it complain on the synteny block coverage?

Too few overlaps (18) between contigs were detected -- refine procedure will be useless

This step is optional - it only works good with certain assemblers (like SPAdes) that output contigs with overlapping ends. We usually don't use --refine option for large genomes.

Mar-y · 2019-08-06T13:12:23Z

hi,
I have similar issues. I was using sibeliaz as well for the alignment of large genomes and now I am trying to run ragout with the maf file.
I was following the instructions above, adding the recipe parameter .maf and the -s maf switch, and I also changed the headers in a simple format like that

scaffold4594
but ragout always breaks after some time with the following error
ERROR: permutation ids do not follow naming convention: 'genome.chromosome'

then I tried to change the headers in a format like

WB42.scaffold1
WB42.scaffold6
WB42.scaffold9
WB42.scaffold13

but then that error appeared:KeyError: 'scaffold830'

Do I name the headers still in a wrong way or does anybody have any ideas how to solve that problem?
Thank you in advance

if helpful or necessary the code I used for ragout: ragout -o ragout_out -s maf -t 10 recipe_file_for_ragout.txt

fenderglass · 2019-08-07T23:03:26Z

Hi,

It's a bit tricky - the MAF alignment should have the genome prefixes, but the FASTA files - should not. I think in your case it will be enough to remove the prefixes from fasta, also make sure that genome names are matching with ones provided in the recipe.

Let me know if this helps. Hopefully in the near future I will properly integrate SibeliaZ into the pipeline.

Mar-y · 2019-08-08T09:52:12Z

Hi Mikhail,

thank you so much for your reply! Ragout finished successfully after removing the prefixes from the fasta files.

Best regards

ipetrushin · 2019-12-17T02:59:14Z

Dear Mikhail!

I'm dealing with number of green algae genome. Ragout w/ Sibelia (single-threaded) is running for days. I tried to use SibeliaZ and got MAF alignments. My refences are from NCBI (GCF* GCA* FASTA files), target assembly made with SPAdes.

What I need to do with input files (FASTA/alignment) to use them with Ragout?

ipetrushin · 2019-12-17T03:14:53Z

How to understand following error (debug don't give a hint)?
ERROR: No sequences read for genome c2. Check recipe for correctness.
File exists, recipe looks like:

.references = c2
.target = s68
.maf = sz_395/alignment.maf

s68.fasta = algs68.fasta
c2.fasta = GCA_001021125.1_ASM102112v1_genomic.fna

fenderglass · 2019-12-19T03:45:28Z

Ok, it seems that you have found the .maf option :)

The sequence headers in MAF should be formatted as genome_name.contig_name, and I believe SibeliaZ just outputs contig_name. So you'd need to add genome names to MAF (according to the naming in the recipe).

Sorry for the inconvenience! SibeliaZ support is in the plans, but I didn't get a chance to work on that yet.

fergsc mentioned this issue Mar 17, 2020

Input is larger than 1 GB, can't proceed #58

Closed

fenderglass mentioned this issue Mar 24, 2020

Hal to Maf manual conversion #59

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative alignment tool #40

Alternative alignment tool #40

RenzoTale88 commented Mar 8, 2019

fenderglass commented Mar 12, 2019

RenzoTale88 commented Mar 12, 2019

fenderglass commented Mar 13, 2019

RenzoTale88 commented Mar 13, 2019 •

edited

fenderglass commented Mar 17, 2019

RenzoTale88 commented Mar 17, 2019

fenderglass commented Mar 18, 2019

Mar-y commented Aug 6, 2019

fenderglass commented Aug 7, 2019

Mar-y commented Aug 8, 2019

ipetrushin commented Dec 17, 2019

ipetrushin commented Dec 17, 2019

fenderglass commented Dec 19, 2019

Alternative alignment tool #40

Alternative alignment tool #40

Comments

RenzoTale88 commented Mar 8, 2019

fenderglass commented Mar 12, 2019

RenzoTale88 commented Mar 12, 2019

fenderglass commented Mar 13, 2019

RenzoTale88 commented Mar 13, 2019 • edited

fenderglass commented Mar 17, 2019

RenzoTale88 commented Mar 17, 2019

fenderglass commented Mar 18, 2019

Mar-y commented Aug 6, 2019

fenderglass commented Aug 7, 2019

Mar-y commented Aug 8, 2019

ipetrushin commented Dec 17, 2019

ipetrushin commented Dec 17, 2019

fenderglass commented Dec 19, 2019

RenzoTale88 commented Mar 13, 2019 •

edited