GFA file of only gap records segfaults #30

Open
sjackman opened this issue May 4, 2018 · 5 comments
@sjackman
Contributor

sjackman commented May 4, 2018

A GFA file ought to include both segment and gap records, but it'd be preferable if gfakluge didn't segfault when it encounters a file that contains only gap records.

H	VN:Z:2.0
G	*	6+	50+	121	58	FC:i:1
G	*	6+	225+	-57	58	FC:i:1
G	*	6+	298-	-83	8	FC:i:55
G	*	6-	62-	-80	9	FC:i:47
G	*	6-	171-	-67	41	FC:i:2
❯❯❯ gfak stats -A gaps.gfa
[1]    98433 segmentation fault
@edawson
Owner

edawson commented May 8, 2018

Yikes. I assume we'd prefer an error (e.g. "Segment not found for gap <gap_id>")?

Just to verify I understand correctly: this is not valid GFA, and we should never get GFA that has the records spread across multiple files like this, right?
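
As a rough illustration of the error suggested above, here is a minimal Python sketch (not gfakluge's actual C++ code) that checks gap records against the segments defined in the same input and raises a readable error instead of crashing. The names `MissingSegmentError` and `check_gap_references` are hypothetical, and the sketch assumes tab-separated GFA 2 records in which each gap endpoint is a segment ID followed by an orientation sign.

```python
class MissingSegmentError(ValueError):
    """Raised when a gap record names a segment that was never defined."""


def check_gap_references(lines):
    """Return the gap records in `lines`; raise if one references an unknown segment."""
    records = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # First pass: collect every segment ID, so record order does not matter.
    segment_ids = {fields[1] for fields in records if fields[0] == "S"}
    gaps = []
    # Second pass: verify both endpoints of every gap record.
    for fields in records:
        if fields[0] != "G":
            continue
        gap_id = fields[1]
        for ref in (fields[2], fields[3]):
            segment_id = ref[:-1]  # drop the trailing orientation sign (+ or -)
            if segment_id not in segment_ids:
                raise MissingSegmentError(
                    f"Segment {segment_id} not found for gap {gap_id}"
                )
        gaps.append(fields)
    return gaps
```

With the gaps-only file from this issue, the first gap record would raise `MissingSegmentError` because segment `6` is never defined.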

@edawson edawson added this to the v0.2.4 milestone May 12, 2018
@sjackman
Contributor Author

Short answer, yes. It's not valid GFA.

Long answer. ABySS produces a GFA file of the segment records and edge records. For large genomes this file can be quite large. In a second step, ABySS then uses the paired-end and mate-pair reads to estimate the distances between segments and outputs the gap records. Rather than make a copy of the potentially large S+E records, it outputs only the gap records. ABySS can handle reading a GFA file spread across multiple files for this reason. It'd be useful to me if Gfakluge could also read these split files. Your call of course whether you want to support that or not. It's easy enough to use either awk or abyss-todot (a misnomer now since it handles more than just GraphViz files) to combine these two GFA files into a single file for Gfakluge.
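
For anyone who wants to do the combining step in a script rather than with awk or abyss-todot, a minimal Python sketch might look like the following. The function name `merge_gfa` is hypothetical, and dropping header lines that have already been emitted is an assumption of this sketch, not something ABySS or gfakluge is known to require.

```python
def merge_gfa(paths):
    """Yield the lines of several GFA files as one stream, skipping repeated headers."""
    seen_headers = set()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                if not line.strip():
                    continue  # ignore blank lines
                # Make sure every emitted record ends with a newline.
                record = line if line.endswith("\n") else line + "\n"
                if record.startswith("H\t"):
                    if record in seen_headers:
                        continue  # identical header was already written
                    seen_headers.add(record)
                yield record
```

The result can be written out with `writelines`, passing the S+E file first and the gaps file second, and the merged file can then be given to `gfak stats`.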

@edawson
Owner

edawson commented May 17, 2018

Interesting. How big are these two files?

I have been thinking about restructuring the command line tools to not build the GFAKluge object when the graph isn't being modified. When I get around to this I'll add support for breaking the graph into multiple files (with a stern warning, of course).

Concatenating (`cat`) the gaps file onto the seqs/edges file sounds like it might work as-is, unless I missed something.

@edawson
Owner

edawson commented May 17, 2018

I guess I should mention: tools that don't modify the graph are:

  • stats
  • extract
  • diff

These tools would support ABySS's split-file output, with a warning. The rest of the tools should support the complete ((S + E) + (G)) file, even if it is very large, and should handle its records in any order. I didn't intend to enforce a record order on GFA files in GFAKluge, but it seems I've done so by accident for gap records (and probably edges as well).
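
To make the "no graph object needed" idea concrete, here is a minimal Python sketch of a stats-style pass that only tallies record types. It is not how `gfak stats` is implemented; `count_records` is a hypothetical name, and the sketch assumes each record's type is the first tab-separated field. Because nothing is resolved against anything else, the records can arrive in any order and from any number of files.

```python
from collections import Counter


def count_records(streams):
    """Tally GFA record types across several line iterables without building a graph."""
    counts = Counter()
    for stream in streams:
        for line in stream:
            if line.strip():
                # The record type (H, S, E, G, ...) is the first tab-separated field.
                counts[line.split("\t", 1)[0].strip()] += 1
    return counts
```

Each element of `streams` can be an open file handle, so the S+E file and the gaps file could be passed together without concatenating them first.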

@sjackman
Contributor Author

sjackman commented May 18, 2018

> Interesting. How big are these two files?

For a human genome:
FASTA: 2.9 GB
S+E with * for sequences: 137 MB
G: 10 MB

Thanks again, Eric!

@edawson edawson modified the milestones: v0.2.4, v0.3 Sep 18, 2018