Path line syntax ambiguous ? #37

thegenemyers · 2016-09-17T11:09:43Z

The path line specifies a comma separated list of segment id's followed immediately by an orientation symbol (+ or -). Since both comma's and + and - can be in a segment id, e.g. "S P1+,P2 acgt" defines P1+,P2 as a segment id, I don't see how it is possible to unambiguously parse the path list. Please advise.

richarddurbin · 2016-09-17T11:35:42Z

The only way I can see to separate segment/edge ids in a list is with spaces. All other printable characters are legitimate in ids.
So I advocate that the lists in paths or subgraphs are space separated. This is clearly distinguishable by a parser from a tab, but not so clear to
the human eye. So this list should be the last of the fixed fields, only followed by optional SAM tag:type:data fields, which are clearly distinctive.

I would rather not have orientation symbols in the list. As far as I can see there is no need for them if you allow edge ids, and you ban edges that
connect a segment to itself (these can always be split by an artificial segment if you really want them).

Richard

thegenemyers · 2016-09-17T12:39:34Z

Richard,

 My issue was for GFA 1.0 where there seems to be a problem, whereas

you seem to
be speaking to what to do for DAS/GFA 2.0. There I would proceed as you
suggest.

 -- Gene

ekg · 2016-09-17T13:36:25Z

I propose requiring each node traversal or mapping in the path to be on its own line.

Something like: P [node name] [path name] [rank in path] [orientation] [mapping description]

This is how vg produces GFA. It makes heavy use of paths and so this was really important to get right.

Here is an example in a mini graph.

H       VN:Z:1.0
S       1       CAAA
P       1       x       1       +       4M
L       1       +       2       +       0M
S       2       TAAG
P       2       x       2       +       4M
L       2       +       4       +       0M
L       2       +       3       +       0M
S       3       G
P       3       x       3       +       1M
L       3       +       5       +       0M
L       3       +       6       +       0M
S       4       A
L       4       +       5       +       0M
L       4       +       6       +       0M
S       5       C
P       5       x       4       +       1M
L       5       +       7       +       0M
S       6       T
L       6       +       7       +       0M
S       7       TTG
P       7       x       5       +       3M

In addition to solving this issue, there are other benefits to this organization:

This format makes many command line operations on the GFA file easier, as it allows us to sort the graph by node names (in the second column) and visually inspect approximate subgraphs including the paths through them.

For example, I generated this by doing:

vg construct -r tiny/tiny.fa -v tiny/tiny.vcf.gz -m 5  | vg view - | head -22

If the path records come on one line then there is no available technique to subset the graph (with paths) on the command line.

Putting one path element per line also resolves the need for a custom path format. We just use the same tab delimited parsing machinery to read in the path.

This also allows each path element to have a CIGAR. This is not strictly necessary but it adds flexibility and makes the paths that we can store in GFA graphs nearly equivalent to those which we use for alignments.

It also makes it easier for us to add other annotations to the path elements, as each element is serialized with its own id. This doesn't matter so much for GFA but rather in RDF based serializations of assembly graphs, where the objective is to allow linking to specific components of the sequence graph and walks through it. In other words, if the path elements weren't first class entities, we would be unable to use them in semantic linkages.

NB: In vg, paths are composed of a series of mappings. Mappings are ranks (in the lath) positions and edits. Edits are like CIGARs. Positions describe a node strand and an offset. It is not possible to support this data model with comma-delimited paths in GFA. (The vg schema vg.proto defines this. It might be good to write it out in BNF.)

We are basically stuck on this issue because others prefer the one line per path format. As a result we are using incompatible versions of P. I would shift to another namespace but think it is a waste of a common format to do so.

What GFA implementations are using the current one line per path format in P? What are the advantages of the comma delimited path format? As far as I understand it removes the need for the path rank elements (saving some space) and makes it so we can parse each path in one line.

lh3 · 2016-09-19T01:31:37Z

We are basically stuck on this issue because others prefer the one line per path format.

I forgot if I have said anything on the issue where you and others attempted to define Paths. As I see it now, I don't have a strong opionin either way. One advantage of the one-line format is that, at least to me, it is more natural to see a path as one object. When you put a path on multiple lines, the parser needs to read all lines into memory and then compose a Path object. That said, I can see your multi-line version has benefits in other cases.

This also allows each path element to have a CIGAR.

I am not sure why you'd like to have Edits/CIGARs. Do you allow mismatches/gaps between a path and a segment mapped to the path? If not, could we just specify overlap lengths in case of graphs with multiple edges? In my view, the primary goal of a Path is to spell a sequence. Using CIGARs complicates this primary goal.

The path line specifies a comma separated list of segment id's followed immediately by an orientation symbol (+ or -). Since both comma's and + and - can be in a segment id

We have to disallow a pattern /[+-],/ in segment names. As long as a segment name does not match this pattern, parsing a GFA1 Path should be unambiguous.

thegenemyers · 2016-09-19T06:29:10Z

The point of my raising this issue was not to come up with a new syntax or extension of GFA. I simply want to know how to write a GFA parser given that the grammar is ambiguous. Any "fix" should be minimal, but presumably y'all have written parsers, so what did you assume?

sjackman · 2016-09-20T16:43:16Z

@thegenemyers Good catch, Gene. To resolve this we'd need to either remove comma , from valid segment ID characters. I'm hesitant to do that because comma , is a valid FASTA ID character, and I'd like FASTA and GFA to be compatible. I like your suggestion, Richard @richarddurbin, of making the separator space rather than comma,. I wonder whether there any implementations of the current pathP line specification.

sjackman · 2016-09-20T16:56:00Z

@richarddurbin wrote…

I would rather not have orientation symbols in the list. As far as I can see there is no need for them if you allow edge ids, and you ban edges that
connect a segment to itself (these can always be split by an artificial segment if you really want them).

The orientations are necessary as it's common to have a segment that begins and ends with the same inverted repeat with unique sequence in the middle between the two repeats. It yields the graph below. The paths A+ R+ B+ and A+ R- B+ yield two different sequences. Note that the segment R begins and ends with the same inverted repeat, but is not a palindrome. A palindrome yields the same graph, but in that case, both paths yield the same sequence.

Graphviz

digraph {
graph [rankdir = "LR"]
"A+" -> "R+" -> "B+"
"A+" -> "R-" -> "B+"
}

GFA

S   A   *
S   B   *
S   R   *
L   A   +   R   +
L   A   +   R   -
L   R   +   B   +
L   R   -   B   +
P   P1  A+,R+,B+    *
P   P2  A+,R-,B+    *

sjackman · 2016-09-20T17:00:44Z

@ekg The argument for having a path in single/multiple lines was debated in #22. The argument for a single line as I recall was that a path represents a single conceptual entity. It makes parsing easier if that entity exists on a single line. If it's spread over multiple lines, that parser would likely have to create a new path object and modify it as each new element for that path arrives in the input stream. They were good arguments for both formats, but the popular format was a single line per path.

That being said, you have a use case for multi lines per path, and it's quite similar in format to the containment C and proposed fragmentF record. Could a set of containment/fragment records be used to compose a path, Erik, if they were grouped together with a common path identifier?

sjackman · 2016-09-20T17:28:46Z

@ekg Let's use this issue to discuss the ambiguity of the current path P record and move the discussion of a new path record format to #36 "Paths in GFA v2". @pb-jchin and @rrwick have thoughts on this topic as well.

ekg · 2016-09-20T18:29:17Z

We can always repurpose a namespace but then we can't read each other's
paths...

How many implementations are using the one line per path format?

I understand it is sometimes easier to read the path in one step, but
accumulating it (like we do for everything else in the graph) seems equally
simple.

One line per path element solves the issue described here without any
additional format complexity.

ggonnella · 2016-09-21T08:41:23Z

Can we escape commas inside of the segment name using a backslash?
Then we would have:

S P1+,P2 acgt
S S3 cgtt
L P1+,P2 S3 *
P X P1+\,P2+,S3+ *

I think this is parsable and clear to the human eye too.

ekg · 2016-09-21T09:42:03Z

It is clear to the human eye until the path reaches more than a handful of elements. If the path has a million elements in it then it will be very uncomfortable to try to read in a terminal when it is written on one line.

If the path elements each have their own line then we can sort the file by ID in the second column and maintain a pseudo-local view of the graph and walks though it in text. This is helpful when there are many overlapping paths.

I beg the group not to add more custom sub formats to GFA. We can do everything we need with paths using only tab as a delimiter. I have no appetite for supporting a complex ad-hoc serialization format. The whole point of GFA is that it is easy to parse and write from C and easy for humans to read. Otherwise we could define the format in BNF and write JSON/XML/RDF/protobuf/capnproto/etc. from our codes.

Barring any interest in making the elements of paths first class objects, I guess we are going to have to fork our representation of paths. Any suggestions for a letter? Is W taken?

lh3 · 2016-09-21T16:35:00Z

My recommendation is for GFA1, we test regex pattern /[+-],/ at an S-line. If we find a match, we throw an error, saying the current spec does not allow it. I think this treatment is working most of time in practice. I have seen commas in sequence names, but I am yet to see a /[+-],/ pattern – it is very rare.

ggonnella · 2016-09-21T16:39:56Z

I would agree with @lh3 proposal.

sjackman · 2016-09-21T16:51:30Z

I think that's a pragmatic solution for GFA1, Heng. If we keep the one-line-per-path format for GFA 2, we may consider changing the separator to space as @richarddurbin suggested.

Commas are used to separate segment IDs in a path. Closes GFA-spec#37.

sjackman · 2016-09-21T17:11:14Z

I've submitted PR #40 to resolve this issue for GFA1 as per Heng's suggestion. Your vote is appreciated.

richarddurbin · 2016-09-25T18:35:19Z

I am replying to @sjackman's comment about distinguishing A+,R+,B+ from A+,R-,B+.
There are different edges that link A+ to R+ and A+ to R-. In the GFA2 pull request you distinguish between the two paths by giving different identifiers to the two edges (links for GFA1) and explicitly giving the relevant identifier in the path specification. If there is only one edge from A to R then you can omit the edge identifier. This way a path is just a list of identifiers.

thegenemyers · 2016-09-25T20:03:53Z

Yes, that's right. -- Gene

sjackman · 2016-09-26T04:44:53Z

Edge identifiers are needed when specifying paths through a multigraph, and GFA1 is not a multigraph. See my comment at #49 (comment)

ggonnella · 2016-09-26T13:43:20Z

I do not see any reason for requiring GFA1 not to represent a multigraph. Until now, it is possible to represent a multigraph, and I think it is useful.

There are different cases of multiple links connecting two segments.

One are complement links, such as L A + B + 100M and L B - A - 100M, they are simply different representations of the same information (in RGFA if I encounter both during parsing, I ignore them). Having both in the graph does not make it a multigraph.

Also, connecting segments in different orientations (e.g. L A + B+ and L A + B -) does not make it a multigraph, as the vertex definition is (segment, orientation).

However, multiple dovetail between two sequences , e.g. L A + B + 100M and L A + B + 50M, are not a rarity in assembly and make it a multigraph. In my own assembler, Readjoiner, I am working on an output of each stage of the assembly as GFA, and of course the overlap graph of the reads can contain such multiple overlaps. Also, the multigraph still satisfy the requirements of the SSG definition, as both of the L records define two edges in the graph and the simmetry is still satisfied.

lh3 · 2016-09-26T14:35:41Z

I agree with @ggonnella that GFA1 can be a multigraph. This leads to complications in implementations, but it is necessary. Individual tools may choose to squash or remove multiple edges, but the spec should not forbid them.

sjackman · 2016-09-26T17:28:34Z

One are complement links, such as L A + B + 100M and L A - B – 100M, they are simply different representations of the same information (in RGFA if I encounter both during parsing, I ignore them). Having both in the graph does not make it a multigraph.

Do you mean L A + B + 100M and L B - A – 100M? Yes, those are the two representations of the same pair of complementary edges, (u, v) and (σ(u), σ(v)), where u = A+ and v = B+.

Also, connecting segments in different orientations (e.g. L A + B + and L A + B -) does not make it a multigraph, as the vertex definition is (segment, orientation).

100% agree.

However, multiple dovetail between two sequences , e.g. L A + B + 100M and L A + B + 50M, are not a rarity in assembly and make it a multigraph.

Yes, using such edges does make the graph a multigraph. @richarddurbin made this point as well. "This assumes that you don’t allow multiple different link lines between the same pair of nodes with the same +/- but different overlap numbers".

I agree with @ggonnella that GFA1 can be a multigraph. This leads to complications in implementations, but it is necessary. Individual tools may choose to squash or remove multiple edges, but the spec should not forbid them.

I'm okay with supporting such multigraphs in GFA1. It would be helpful to the implementation though to say in the header line whether the graph is a strict graph or a multigraph.

ggonnella · 2016-09-26T17:35:12Z

Do you mean L A + B + 100M and L B - A – 100M?

Yes, sorry, typo, I fixed that.

ggonnella · 2016-09-26T17:36:26Z

I'm okay with supporting such multigraphs in GFA1. It would be helpful to the implementation though to say in the header line whether the graph is a strict graph or a multigraph.

I have nothing against this, if it is useful for some applications.

Commas are used to separate segment IDs in a path. Closes #37.

sjackman added the question label Sep 20, 2016

sjackman self-assigned this Sep 20, 2016

sjackman added a commit to sjackman/GFA-spec that referenced this issue Sep 21, 2016

Segment name must not contain +, nor -,

398df9f

Commas are used to separate segment IDs in a path. Closes GFA-spec#37.

sjackman mentioned this issue Sep 21, 2016

Segment name must not contain +, nor -, #40

Merged

sjackman closed this as completed in #40 Sep 27, 2016

sjackman added a commit that referenced this issue Sep 27, 2016

Segment name must not contain +, nor -,

e7576cd

Commas are used to separate segment IDs in a path. Closes #37.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Path line syntax ambiguous ? #37

Path line syntax ambiguous ? #37

thegenemyers commented Sep 17, 2016

richarddurbin commented Sep 17, 2016 •

edited by sjackman

Loading

thegenemyers commented Sep 17, 2016 •

edited by sjackman

Loading

ekg commented Sep 17, 2016

lh3 commented Sep 19, 2016

thegenemyers commented Sep 19, 2016

sjackman commented Sep 20, 2016 •

edited

Loading

sjackman commented Sep 20, 2016 •

edited

Loading

sjackman commented Sep 20, 2016 •

edited

Loading

sjackman commented Sep 20, 2016 •

edited

Loading

ekg commented Sep 20, 2016 •

edited by sjackman

Loading

ggonnella commented Sep 21, 2016

ekg commented Sep 21, 2016

lh3 commented Sep 21, 2016

ggonnella commented Sep 21, 2016

sjackman commented Sep 21, 2016

sjackman commented Sep 21, 2016 •

edited

Loading

richarddurbin commented Sep 25, 2016

thegenemyers commented Sep 25, 2016 •

edited by sjackman

Loading

sjackman commented Sep 26, 2016 •

edited

Loading

ggonnella commented Sep 26, 2016 •

edited

Loading

lh3 commented Sep 26, 2016

sjackman commented Sep 26, 2016

ggonnella commented Sep 26, 2016

ggonnella commented Sep 26, 2016

Path line syntax ambiguous ? #37

Path line syntax ambiguous ? #37

Comments

thegenemyers commented Sep 17, 2016

richarddurbin commented Sep 17, 2016 • edited by sjackman Loading

thegenemyers commented Sep 17, 2016 • edited by sjackman Loading

ekg commented Sep 17, 2016

lh3 commented Sep 19, 2016

thegenemyers commented Sep 19, 2016

sjackman commented Sep 20, 2016 • edited Loading

sjackman commented Sep 20, 2016 • edited Loading

Graphviz

GFA

sjackman commented Sep 20, 2016 • edited Loading

sjackman commented Sep 20, 2016 • edited Loading

ekg commented Sep 20, 2016 • edited by sjackman Loading

ggonnella commented Sep 21, 2016

ekg commented Sep 21, 2016

lh3 commented Sep 21, 2016

ggonnella commented Sep 21, 2016

sjackman commented Sep 21, 2016

sjackman commented Sep 21, 2016 • edited Loading

richarddurbin commented Sep 25, 2016

thegenemyers commented Sep 25, 2016 • edited by sjackman Loading

sjackman commented Sep 26, 2016 • edited Loading

ggonnella commented Sep 26, 2016 • edited Loading

lh3 commented Sep 26, 2016

sjackman commented Sep 26, 2016

ggonnella commented Sep 26, 2016

ggonnella commented Sep 26, 2016

richarddurbin commented Sep 17, 2016 •

edited by sjackman

Loading

thegenemyers commented Sep 17, 2016 •

edited by sjackman

Loading

sjackman commented Sep 20, 2016 •

edited

Loading

sjackman commented Sep 20, 2016 •

edited

Loading

sjackman commented Sep 20, 2016 •

edited

Loading

sjackman commented Sep 20, 2016 •

edited

Loading

ekg commented Sep 20, 2016 •

edited by sjackman

Loading

sjackman commented Sep 21, 2016 •

edited

Loading

thegenemyers commented Sep 25, 2016 •

edited by sjackman

Loading

sjackman commented Sep 26, 2016 •

edited

Loading

ggonnella commented Sep 26, 2016 •

edited

Loading