Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative alignment tool #40

Open
RenzoTale88 opened this issue Mar 8, 2019 · 13 comments
Open

Alternative alignment tool #40

RenzoTale88 opened this issue Mar 8, 2019 · 13 comments

Comments

@RenzoTale88
Copy link

Good day,
I'm currently working on the assembly of a large mammalian genome. I would like to perform the reference-assisted scaffolding using ragout. However, I cannot run progressive cactus on the genomes involved due to computational constraints, and the assembly are simply too large to be efficiently processed with sibelia.

My question is: is there a way to create alignments in a way that is suitable for Ragout2?

For example:
pairwise lastz alignments -> combined multi-species MAF file -> maf2hal -> ragout2

Thank you in advance,
Andrea

@fenderglass
Copy link
Owner

Hi Andrea,

It is possible to use .maf as input (I was planning to retire this functionality, but it is still in the code) - but the problem is that Ragout relies on properties of the Cactus alignments. In particular, alignment blocks should be non-intersecting - which is not true for most of the other aligners (since it is in fact a very hard problem to generate such alignments).

What are your constraints? Cactus should take ~120 CPU days per mammalian genome, which translates into a few actual days on a single cluster node.

There is a recent Sibelia update (https://github.com/medvedevgroup/SibeliaZ) that in fact should handle large genomes and is capable of producing the right maf alignment. I have not tested it yet - but you might try to use it to generate alignment and run Ragout with it.

Best,
Mikhail

@RenzoTale88
Copy link
Author

Hi Mikhail,
Thank you for your reply.
My problem with cactus is not related to computational power/time, but to incompatibilities with toil and the cluster, that prevent it from running properly.
I think that using sibeliaZ seems a good solution. I guess that you would proceed as follow: sibeliaZ > maf > hal > ragout

Am I right?
Thank you in advance,
Andrea

@fenderglass
Copy link
Owner

You can use maf directly by adding an extra recipe parameter .maf = path_to_maf, and using -s maf switch.

@RenzoTale88
Copy link
Author

RenzoTale88 commented Mar 13, 2019

I've just tried to run Ragout on the alignments produced using SibeliaZ (maf format) using the genomes provided in example/E.Coli folder. I've changed the header of the respective fasta so that it shows the organism in the format "dh1.gi", "ms1655.seq1", etc.
The maf file is properly parsed, even if raising errors of overlapping blocks, and finally crashes while refining the assembly graph.
I though that the problem was due to the presence of self-alignments within the same assembly. But even after filtering those out from the MAF file, the result doesn't change, still crashing while detecting chimeric adjacencies.
Below, you can find the log produced, the error, the configuration file, the two fai with the names used and the top 100 lines of the sibeliaZ maf file. I hope these information can be of use to understand the problem.
Thank you in advance for your help,
Andrea

[16:09:31] DEBUG: >>With block size: 5000
[16:09:31] DEBUG: Checking 2 target adjacencies
[16:09:31] DEBUG: Found 1 target-specific adjacencies, 1 broken as chimeric
[16:09:31] DEBUG: >>With block size: 500
[16:09:31] DEBUG: Checking 16 target adjacencies
[16:09:31] DEBUG: Found 3 target-specific adjacencies, 1 broken as chimeric
[16:09:31] DEBUG: >>With block size: 100
[16:09:31] DEBUG: Checking 105 target adjacencies
[16:09:31] DEBUG: Found 11 target-specific adjacencies, 3 broken as chimeric
[16:09:31] DEBUG: >>With block size: 100
[16:09:31] DEBUG: Checking 105 target adjacencies
[16:09:31] DEBUG: Found 11 target-specific adjacencies, 3 broken as chimeric
Traceback (most recent call last):
File "/exports/applications/apps/community/roslin/ragout/2.1.1/bin/ragout", line 35, in
sys.exit(main())
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/main.py", line 290, in main
_run_ragout(args)
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/main.py", line 198, in _run_ragout
chim_detect = ChimeraDetector(raw_bp_graphs, run_stages, target_sequences)
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/breakpoint_graph/chimera_detector.py", line 26, in init
self._make_hierarchical_breaks()
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/breakpoint_graph/chimera_detector.py", line 62, in _make_hierarchical_breaks
break_pos = self._optimal_break(seq_name, *adjusted_break)
File "/exports/eddie3_apps_local/apps/community/roslin/ragout/2.1.1/ragout/breakpoint_graph/chimera_detector.py", line 70, in _optimal_break
seq = self.target_seqs[seq_name]
KeyError: 'seq13'

ragout.log
testRagout.rcp.txt
dh1.fasta.fai.txt
mg1655.fasta.fai.txt
top100.maf.txt

@fenderglass
Copy link
Owner

Looks like there are some caveats with using SibeliaZ. I will be testing it in the near future.

@RenzoTale88
Copy link
Author

Hi,
The error found was due to the input fasta files provided. Basically, the input fasta that I provided included the species name in each sequence header (e.g. >spp1.chr1), but ragout needed them as simple sequence name (e.g. >seq1). After using this fasta file, the program worked without issues.

The only thing that concerns me is this warning:

Too few overlaps (18) between contigs were detected -- refine procedure will be useless

Is there a way to improve this? Thank you in advance,

Andrea

@fenderglass
Copy link
Owner

Great, I'm glad that you got it working! Did it complain on the synteny block coverage?

Too few overlaps (18) between contigs were detected -- refine procedure will be useless

This step is optional - it only works good with certain assemblers (like SPAdes) that output contigs with overlapping ends. We usually don't use --refine option for large genomes.

@Mar-y
Copy link

Mar-y commented Aug 6, 2019

hi,
I have similar issues. I was using sibeliaz as well for the alignment of large genomes and now I am trying to run ragout with the maf file.
I was following the instructions above, adding the recipe parameter .maf and the -s maf switch, and I also changed the headers in a simple format like that

scaffold4594
but ragout always breaks after some time with the following error
ERROR: permutation ids do not follow naming convention: 'genome.chromosome'

then I tried to change the headers in a format like

WB42.scaffold1
WB42.scaffold6
WB42.scaffold9
WB42.scaffold13

but then that error appeared:KeyError: 'scaffold830'

Do I name the headers still in a wrong way or does anybody have any ideas how to solve that problem?
Thank you in advance

if helpful or necessary the code I used for ragout: ragout -o ragout_out -s maf -t 10 recipe_file_for_ragout.txt

@fenderglass
Copy link
Owner

Hi,

It's a bit tricky - the MAF alignment should have the genome prefixes, but the FASTA files - should not. I think in your case it will be enough to remove the prefixes from fasta, also make sure that genome names are matching with ones provided in the recipe.

Let me know if this helps. Hopefully in the near future I will properly integrate SibeliaZ into the pipeline.

@Mar-y
Copy link

Mar-y commented Aug 8, 2019

Hi Mikhail,

thank you so much for your reply! Ragout finished successfully after removing the prefixes from the fasta files.

Best regards

@ipetrushin
Copy link

Dear Mikhail!

I'm dealing with number of green algae genome. Ragout w/ Sibelia (single-threaded) is running for days. I tried to use SibeliaZ and got MAF alignments. My refences are from NCBI (GCF* GCA* FASTA files), target assembly made with SPAdes.

What I need to do with input files (FASTA/alignment) to use them with Ragout?

@ipetrushin
Copy link

How to understand following error (debug don't give a hint)?
ERROR: No sequences read for genome c2. Check recipe for correctness.
File exists, recipe looks like:

.references = c2
.target = s68
.maf = sz_395/alignment.maf

s68.fasta = algs68.fasta
c2.fasta = GCA_001021125.1_ASM102112v1_genomic.fna

@fenderglass
Copy link
Owner

Ok, it seems that you have found the .maf option :)

The sequence headers in MAF should be formatted as genome_name.contig_name, and I believe SibeliaZ just outputs contig_name. So you'd need to add genome names to MAF (according to the naming in the recipe).

Sorry for the inconvenience! SibeliaZ support is in the plans, but I didn't get a chance to work on that yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants