Document all implementations of GFA #3
Comments
I implemented GFA in a fork of daligner: https://github.com/jts/daligner |
That's amazing! I had just said to @pmelsted that I thought GFA could/should be the format for long read overlap alignments. @JustinChu |
@jts I'm curious though, why fork and patch DALIGNER rather than create a DALIGNER format to GFA format conversion script? Or was there insufficient information stored in the DALIGNER output? |
The latter. The DALIGNER output files don't store full alignments, only anchors that can rebuild the alignment on demand. |
Ah. Was the modification a new |
Cool. Would you add them to |
@jts I got my wires crossed above doing too many things at once. Did you add a GFA output format to |
@pmelsted good to see that you have a (GFA-producing) tool too. So I need to implement that 1-character change to bring myself into concordance with the other implementations. |
|
@sjackman it is a converter from .las to .gfa: https://github.com/jts/DALIGNER/blob/master/LA2gfa.c
@ekg What's the change? Just curious. |
@jts Any idea what's going on here?
|
My edge orientations aren't written the same as other tools. I show L 1 - 2
|
This makes me want to change this "+/-" thing. What about |
I'd prefer not to use 3 and 5 for the ends of the nodes. Better to avoid But then I assumed that the link was from the plus end to minus end. I'm concerned there will be problems representing things like links from
|
Yes + and - refer to whether you should use the nucl. sequence of the segment as presented in the (stealing this from @jts LA2gfa.c)
I'm also open to using ~ for reverse complement (this was in fastg) but I like that both orientations + and - have to be specified. |
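To make the orientation convention above concrete, here is a minimal Python sketch. It assumes the interpretation in which `+` means "use the segment sequence as stored" and `-` means "use its reverse complement". The function names are hypothetical and are not taken from LA2gfa.c or any other GFA tool.

```python
# Hypothetical helpers illustrating one reading of the +/- orientation.
# Assumption: "+" = sequence as stored, "-" = its reverse complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def oriented_sequence(seq: str, orientation: str) -> str:
    """Return the sequence a link or path step refers to."""
    if orientation == "+":
        return seq
    if orientation == "-":
        return reverse_complement(seq)
    raise ValueError(f"orientation must be '+' or '-', got {orientation!r}")
```

Under this reading both orientations are always written out explicitly, which is the property the comment above says it likes.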
Wait, is there any way to have a link from the 5' end of a node to the 5' If not then GFA will be incompatible with what the GA4GH group is
|
That is There are at least two contradicting ways to interpret this +/-. @ekg, Richard Durbin and a few others are using the other interpretation. It is more straightforward to say the beginning of segment 1 joins the end of segment 2. |
Thought about that, but this is awkward to |
I prefer to view the overlaps as alignments. Some portion of sequence A aligns to some portion of sequence B. That portion of A and B is either reverse complemented |
@sjackman LA2gfa is very out of date so I suspect something in the underlying files changed |
Yes, I believe the database did change at one point.
|
per discussion @ GFA-spec/GFA-spec#3
@ekg Regarding 5' to 5' links. This is not possible right now and I think we should talk about it in another issue. One way to represent it would be to have a separate record for the inverted segment. |
This would be easy to represent if the links were defined as between ends rather than as relative to the natural orientation of the nodes. Would we lose anything relative to the current representation by changing these semantics? |
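As a rough sketch of how the two views relate, the Python function below rewrites an orientation-based link as a pair of segment ends. It assumes the reading in which a `+` source is left through its end (3') and a `+` target is entered through its start (5'), with `-` flipping each side. The name `link_to_ends` and the labels `"start"` and `"end"` are hypothetical, chosen only for illustration.

```python
# Hypothetical conversion from an orientation-based link to an end-based one.
# Assumption: "+" reads a segment start -> end, "-" reads it end -> start.
def link_to_ends(from_seg: str, from_orient: str, to_seg: str, to_orient: str):
    for orient in (from_orient, to_orient):
        if orient not in ("+", "-"):
            raise ValueError(f"orientation must be '+' or '-', got {orient!r}")
    # A forward source is exited at its end; a reversed source at its start.
    from_end = "end" if from_orient == "+" else "start"
    # A forward target is entered at its start; a reversed target at its end.
    to_end = "start" if to_orient == "+" else "end"
    return (from_seg, from_end), (to_seg, to_end)
```

With these assumptions, `link_to_ends("1", "-", "2", "+")` gives `(("1", "start"), ("2", "start"))`, a start-to-start (5' to 5') join, so every combination of ends can be expressed in either form.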
5' to 5' links are supported in the current spec. The following snippet connects the leftmost end of
And a path:
|
No, @sjackman what @ekg is referring to is if you have an inversion, like
But what if the TAG was inverted: how would you encode the links and path for |
When you say inverted, do you mean reversed without being complemented? Does that happen? How would that happen? |
I am a little confused by the GFA implementation in daligner. From the assembly pipeline point of view, daligner's output is the overlap information, not the assembly graph. Maybe @jts can help me understand why we use GFA to store overlaps. Thanks a lot. The FALCON assembler that I implemented does convert the overlap output into explicit overlap information (as a small variation of the The graph representation used in FALCON is read-end based. It is more like the first version Heng Li proposed, but I don't explicitly store the sequence. The sequences can be inferred from the read-end labelling and the original sequences. There are pros and cons to storing sequences with the linkage information. Currently, it is easier for me to code if I store them separately. It makes the I/O for graph analysis a little easier, since the sequence data need not be read. |
@pb-jchin one of the original goals of GFA (and SQG, iirc) was that it could act as an intermediate format for the entire assembly pipeline from input reads to an output graph. We could then write a pipeline of standalone software with GFA input/output for each stage. |
@jts thanks for the clarification. Currently, for PacBio-related assembly work, we basically separate the sequences themselves and the metadata into different files, for example the daligner db file and las file. I use the same approach downstream too. (I think CA/WGS has a couple of different storages too.) I am just wondering what people think about such separation. I am not sure what the working draft of the current GFA format is. The linkage information and the sequence information are on different lines. It may be a good idea to put them in different files (or a bad idea, perhaps). |
@jts maybe we are actually talking about a common format for storing overlap data rather than the final assembly graph? |
I'm in favour of storing the sequences separately from the graph information. I'm not a fan of duplicating information. You may put |
From my perspective an ideal non-redundant representation would describe a This would also allow merging the cigars of the links into the graph |
I believe the current GFA graph format can be used to describe such a string graph. The CIGAR strings can all be |
If anyone finds additional implementations of GFA, please open a pull request to add it/them to the README.md. |