Path line syntax ambiguous? #37
The only way I can see to separate segment/edge ids in a list is with spaces. All other printable characters are legitimate in ids. I would rather not have orientation symbols in the list. As far as I can see there is no need for them if you allow edge ids, and you ban edges that Richard |
Richard,
you seem to
|
I propose requiring each node traversal or mapping in the path to be on its own line. Something like: This is how vg produces GFA. It makes heavy use of paths and so this was really important to get right. Here is an example in a mini graph.
In addition to solving this issue, there are other benefits to this organization: This format makes many command line operations on the GFA file easier, as it allows us to sort the graph by node names (in the second column) and visually inspect approximate subgraphs including the paths through them. For example, I generated this by doing:
If the path records come on one line then there is no available technique to subset the graph (with paths) on the command line. Putting one path element per line also removes the need for a custom path format. We just use the same tab-delimited parsing machinery to read in the path. This also allows each path element to have a CIGAR. This is not strictly necessary but it adds flexibility and makes the paths that we can store in GFA graphs nearly equivalent to those which we use for alignments. It also makes it easier for us to add other annotations to the path elements, as each element is serialized with its own id. This doesn't matter so much for GFA but rather in RDF-based serializations of assembly graphs, where the objective is to allow linking to specific components of the sequence graph and walks through it. In other words, if the path elements weren't first-class entities, we would be unable to use them in semantic linkages. NB: In vg, paths are composed of a series of mappings. A mapping consists of a rank (in the path), a position, and edits. Edits are like CIGARs. Positions describe a node strand and an offset. It is not possible to support this data model with comma-delimited paths in GFA. (The vg schema vg.proto defines this. It might be good to write it out in BNF.) We are basically stuck on this issue because others prefer the one-line-per-path format. As a result we are using incompatible versions of P. I would shift to another namespace but think it is a waste of a common format to do so. What GFA implementations are using the current one-line-per-path format in P? What are the advantages of the comma-delimited path format? As far as I understand, it removes the need for the path rank elements (saving some space) and makes it so we can parse each path in one line. |
I forget whether I have said anything on the issue where you and others attempted to define Paths. As I see it now, I don't have a strong opinion either way. One advantage of the one-line format is that, at least to me, it is more natural to see a path as one object. When you put a path on multiple lines, the parser needs to read all lines into memory and then compose a Path object. That said, I can see your multi-line version has benefits in other cases.
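To make the trade-off concrete, here is a minimal sketch of the "read all lines, then compose a Path object" step for a one-element-per-line layout. The column order used here (record type, node ID, path name, rank, orientation, CIGAR) is an assumption for illustration only; it is not taken from this thread or from the spec, and `read_paths` is a hypothetical helper name.

```python
from collections import defaultdict


def read_paths(lines):
    """Group per-element P lines by path name and order them by rank.

    Assumed (hypothetical) columns: P, node_id, path_name, rank,
    orientation, cigar, separated by tabs.
    """
    elements = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "P":
            continue  # skip S, L and other record types
        node_id, path_name, rank, orientation, cigar = fields[1:6]
        elements[path_name].append((int(rank), node_id, orientation, cigar))
    # Sorting by rank restores path order even if the file was sorted by node ID.
    return {
        name: [step[1:] for step in sorted(steps)]
        for name, steps in elements.items()
    }


example = [
    "S\t1\tACGT",
    "P\t1\tx\t1\t+\t4M",
    "P\t2\ty\t1\t-\t3M",
    "P\t2\tx\t3\t+\t3M",
    "P\t3\tx\t2\t-\t2M",
]
paths = read_paths(example)
```

The whole file has to be scanned before any path is known to be complete, which is the cost described above; in exchange, the lines can be sorted or filtered by node ID with ordinary command-line tools without breaking the parse.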
I am not sure why you'd like to have Edits/CIGARs. Do you allow mismatches/gaps between a path and a segment mapped to the path? If not, could we just specify overlap lengths in case of graphs with multiple edges? In my view, the primary goal of a Path is to spell a sequence. Using CIGARs complicates this primary goal.
We have to disallow a pattern |
The point of my raising this issue was not to come up with a new syntax or extension of GFA. I simply want to know how to write a GFA parser given that the grammar is ambiguous. Any "fix" should be minimal, but presumably y'all have written parsers, so what did you assume? |
@thegenemyers Good catch, Gene. To resolve this we'd need to either remove comma |
@richarddurbin wrote…
The orientations are necessary as it's common to have a segment that begins and ends with the same inverted repeat with unique sequence in the middle between the two repeats. It yields the graph below. The paths Graphviz
GFA
|
@ekg The argument for having a path in single/multiple lines was debated in #22. The argument for a single line as I recall was that a path represents a single conceptual entity. It makes parsing easier if that entity exists on a single line. If it's spread over multiple lines, the parser would likely have to create a new path object and modify it as each new element for that path arrives in the input stream. There were good arguments for both formats, but the popular format was a single line per path. That being said, you have a use case for multiple lines per path, and it's quite similar in format to the containment |
We can always repurpose a namespace but then we can't read each other's How many implementations are using the one line per path format? I understand it is sometimes easier to read the path in one step, but One line per path element solves the issue described here without any |
Can we escape commas inside of the segment name using a backslash?
I think this is parsable and clear to the human eye too. |
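A minimal sketch of how a parser could split a path field under this backslash-escaping idea. The escape rules assumed here (a backslash makes the next character literal, including a comma or another backslash) are an assumption for illustration, and `split_escaped` is a hypothetical helper, not part of any GFA implementation.

```python
def split_escaped(field, sep=",", esc="\\"):
    """Split field on sep, treating esc + <char> as a literal <char>."""
    parts, current = [], []
    chars = iter(field)
    for ch in chars:
        if ch == esc:
            # Take the next character literally; a trailing backslash is kept.
            current.append(next(chars, esc))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


# A segment named "P1+,P2" followed by "+", then segment "P3" followed by "-".
steps = split_escaped(r"P1+\,P2+,P3-")
```

Each resulting step can then be split into segment ID and orientation by taking its last character, since the comma ambiguity has been removed by the escape.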
It is clear to the human eye until the path reaches more than a handful of elements. If the path has a million elements in it then it will be very uncomfortable to try to read in a terminal when it is written on one line. If the path elements each have their own line then we can sort the file by ID in the second column and maintain a pseudo-local view of the graph and walks through it in text. This is helpful when there are many overlapping paths. I beg the group not to add more custom subformats to GFA. We can do everything we need with paths using only tab as a delimiter. I have no appetite for supporting a complex ad-hoc serialization format. The whole point of GFA is that it is easy to parse and write from C and easy for humans to read. Otherwise we could define the format in BNF and write JSON/XML/RDF/protobuf/capnproto/etc. from our code. Barring any interest in making the elements of paths first-class objects, I guess we are going to have to fork our representation of paths. Any suggestions for a letter? Is W taken? |
My recommendation is for GFA1, we test regex pattern |
I would agree with @lh3's proposal. |
I think that's a pragmatic solution for GFA1, Heng. If we keep the one-line-per-path format for GFA 2, we may consider changing the separator to space as @richarddurbin suggested. |
Commas are used to separate segment IDs in a path. Closes GFA-spec#37.
I've submitted PR #40 to resolve this issue for GFA1 as per Heng's suggestion. Your vote is appreciated. |
I am replying to @sjackman's comment about distinguishing A+,R+,B+ from A+,R-,B+. |
Yes, that's right. -- Gene |
Edge identifiers are needed when specifying paths through a multigraph, and GFA1 is not a multigraph. See my comment at #49 (comment) |
I do not see any reason for requiring GFA1 not to represent a multigraph. Until now, it is possible to represent a multigraph, and I think it is useful. There are different cases of multiple links connecting two segments. One is complement links, such as Also, connecting segments in different orientations (e.g. However, multiple dovetails between two sequences, e.g. |
I agree with @ggonnella that GFA1 can be a multigraph. This leads to complications in implementations, but it is necessary. Individual tools may choose to squash or remove multiple edges, but the spec should not forbid them. |
Do you mean
100% agree.
Yes, using such edges does make the graph a multigraph. @richarddurbin made this point as well. "This assumes that you don’t allow multiple different link lines between the same pair of nodes with the same +/- but different overlap numbers".
I'm okay with supporting such multigraphs in GFA1. It would be helpful to the implementation though to say in the header line whether the graph is a strict graph or a multigraph. |
Yes, sorry, typo, I fixed that. |
I have nothing against this, if it is useful for some applications. |
Commas are used to separate segment IDs in a path. Closes #37.
The path line specifies a comma-separated list of segment IDs, each followed immediately by an orientation symbol (+ or -). Since commas, + and - can all appear in a segment ID, e.g. "S P1+,P2 acgt" defines P1+,P2 as a segment ID, I don't see how it is possible to unambiguously parse the path list. Please advise.
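The ambiguity can be shown with a short sketch that enumerates every way a path field can be read as `<segment ID><orientation>` steps separated by commas. The segment set below extends the example above with two extra segments, P1 and P2, which are assumptions added for illustration; `parses` is a hypothetical helper.

```python
def parses(path_field, segment_ids):
    """Return every reading of path_field as comma-separated
    <segment id><orientation> steps, given the known segment ids."""
    results = []

    def walk(pos, steps):
        if pos == len(path_field):
            results.append(steps)
            return
        for seg in segment_ids:
            for orient in "+-":
                step = seg + orient
                if not path_field.startswith(step, pos):
                    continue
                end = pos + len(step)
                if end == len(path_field):
                    # The step consumes the rest of the field.
                    walk(end, steps + [(seg, orient)])
                elif path_field[end] == "," and end + 1 < len(path_field):
                    # A comma must be followed by another step.
                    walk(end + 1, steps + [(seg, orient)])

    walk(0, [])
    return results


# Three segments are defined: P1, P2, and one literally named "P1+,P2".
readings = parses("P1+,P2+", ["P1", "P2", "P1+,P2"])
```

Here `readings` contains two valid interpretations: the two-step path P1+ then P2+, and the one-step path through the segment named `P1+,P2` in forward orientation. A parser cannot choose between them from the path field alone.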