Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path line syntax ambiguous ? #37

Closed
thegenemyers opened this issue Sep 17, 2016 · 24 comments
Closed

Path line syntax ambiguous ? #37

thegenemyers opened this issue Sep 17, 2016 · 24 comments
Assignees
Labels

Comments

@thegenemyers
Copy link
Contributor

The path line specifies a comma separated list of segment id's followed immediately by an orientation symbol (+ or -). Since both comma's and + and - can be in a segment id, e.g. "S P1+,P2 acgt" defines P1+,P2 as a segment id, I don't see how it is possible to unambiguously parse the path list. Please advise.

@richarddurbin
Copy link

richarddurbin commented Sep 17, 2016

The only way I can see to separate segment/edge ids in a list is with spaces. All other printable characters are legitimate in ids.
So I advocate that the lists in paths or subgraphs are space separated. This is clearly distinguishable by a parser from a tab, but not so clear to
the human eye. So this list should be the last of the fixed fields, only followed by optional SAM tag:type:data fields, which are clearly distinctive.

I would rather not have orientation symbols in the list. As far as I can see there is no need for them if you allow edge ids, and you ban edges that
connect a segment to itself (these can always be split by an artificial segment if you really want them).

Richard

@thegenemyers
Copy link
Contributor Author

thegenemyers commented Sep 17, 2016

Richard,

 My issue was for GFA 1.0 where there seems to be a problem, whereas 

you seem to
be speaking to what to do for DAS/GFA 2.0. There I would proceed as you
suggest.

 -- Gene

@ekg
Copy link
Collaborator

ekg commented Sep 17, 2016

I propose requiring each node traversal or mapping in the path to be on its own line.

Something like: P [node name] [path name] [rank in path] [orientation] [mapping description]

This is how vg produces GFA. It makes heavy use of paths and so this was really important to get right.

Here is an example in a mini graph.

image

H       VN:Z:1.0
S       1       CAAA
P       1       x       1       +       4M
L       1       +       2       +       0M
S       2       TAAG
P       2       x       2       +       4M
L       2       +       4       +       0M
L       2       +       3       +       0M
S       3       G
P       3       x       3       +       1M
L       3       +       5       +       0M
L       3       +       6       +       0M
S       4       A
L       4       +       5       +       0M
L       4       +       6       +       0M
S       5       C
P       5       x       4       +       1M
L       5       +       7       +       0M
S       6       T
L       6       +       7       +       0M
S       7       TTG
P       7       x       5       +       3M

In addition to solving this issue, there are other benefits to this organization:

This format makes many command line operations on the GFA file easier, as it allows us to sort the graph by node names (in the second column) and visually inspect approximate subgraphs including the paths through them.

For example, I generated this by doing:

vg construct -r tiny/tiny.fa -v tiny/tiny.vcf.gz -m 5  | vg view - | head -22

If the path records come on one line then there is no available technique to subset the graph (with paths) on the command line.

Putting one path element per line also resolves the need for a custom path format. We just use the same tab delimited parsing machinery to read in the path.

This also allows each path element to have a CIGAR. This is not strictly necessary but it adds flexibility and makes the paths that we can store in GFA graphs nearly equivalent to those which we use for alignments.

It also makes it easier for us to add other annotations to the path elements, as each element is serialized with its own id. This doesn't matter so much for GFA but rather in RDF based serializations of assembly graphs, where the objective is to allow linking to specific components of the sequence graph and walks through it. In other words, if the path elements weren't first class entities, we would be unable to use them in semantic linkages.

NB: In vg, paths are composed of a series of mappings. Mappings are ranks (in the lath) positions and edits. Edits are like CIGARs. Positions describe a node strand and an offset. It is not possible to support this data model with comma-delimited paths in GFA. (The vg schema vg.proto defines this. It might be good to write it out in BNF.)

We are basically stuck on this issue because others prefer the one line per path format. As a result we are using incompatible versions of P. I would shift to another namespace but think it is a waste of a common format to do so.

What GFA implementations are using the current one line per path format in P? What are the advantages of the comma delimited path format? As far as I understand it removes the need for the path rank elements (saving some space) and makes it so we can parse each path in one line.

@lh3
Copy link
Collaborator

lh3 commented Sep 19, 2016

We are basically stuck on this issue because others prefer the one line per path format.

I forgot if I have said anything on the issue where you and others attempted to define Paths. As I see it now, I don't have a strong opionin either way. One advantage of the one-line format is that, at least to me, it is more natural to see a path as one object. When you put a path on multiple lines, the parser needs to read all lines into memory and then compose a Path object. That said, I can see your multi-line version has benefits in other cases.

This also allows each path element to have a CIGAR.

I am not sure why you'd like to have Edits/CIGARs. Do you allow mismatches/gaps between a path and a segment mapped to the path? If not, could we just specify overlap lengths in case of graphs with multiple edges? In my view, the primary goal of a Path is to spell a sequence. Using CIGARs complicates this primary goal.

The path line specifies a comma separated list of segment id's followed immediately by an orientation symbol (+ or -). Since both comma's and + and - can be in a segment id

We have to disallow a pattern /[+-],/ in segment names. As long as a segment name does not match this pattern, parsing a GFA1 Path should be unambiguous.

@thegenemyers
Copy link
Contributor Author

The point of my raising this issue was not to come up with a new syntax or extension of GFA. I simply want to know how to write a GFA parser given that the grammar is ambiguous. Any "fix" should be minimal, but presumably y'all have written parsers, so what did you assume?

@sjackman sjackman self-assigned this Sep 20, 2016
@sjackman
Copy link
Collaborator

sjackman commented Sep 20, 2016

@thegenemyers Good catch, Gene. To resolve this we'd need to either remove comma , from valid segment ID characters. I'm hesitant to do that because comma , is a valid FASTA ID character, and I'd like FASTA and GFA to be compatible. I like your suggestion, Richard @richarddurbin, of making the separator space rather than comma,. I wonder whether there any implementations of the current pathP line specification.

@sjackman
Copy link
Collaborator

sjackman commented Sep 20, 2016

@richarddurbin wrote…

I would rather not have orientation symbols in the list. As far as I can see there is no need for them if you allow edge ids, and you ban edges that
connect a segment to itself (these can always be split by an artificial segment if you really want them).

The orientations are necessary as it's common to have a segment that begins and ends with the same inverted repeat with unique sequence in the middle between the two repeats. It yields the graph below. The paths A+ R+ B+ and A+ R- B+ yield two different sequences. Note that the segment R begins and ends with the same inverted repeat, but is not a palindrome. A palindrome yields the same graph, but in that case, both paths yield the same sequence.
ir

Graphviz

digraph {
graph [rankdir = "LR"]
"A+" -> "R+" -> "B+"
"A+" -> "R-" -> "B+"
}

GFA

S   A   *
S   B   *
S   R   *
L   A   +   R   +
L   A   +   R   -
L   R   +   B   +
L   R   -   B   +
P   P1  A+,R+,B+    *
P   P2  A+,R-,B+    *

@sjackman
Copy link
Collaborator

sjackman commented Sep 20, 2016

@ekg The argument for having a path in single/multiple lines was debated in #22. The argument for a single line as I recall was that a path represents a single conceptual entity. It makes parsing easier if that entity exists on a single line. If it's spread over multiple lines, that parser would likely have to create a new path object and modify it as each new element for that path arrives in the input stream. They were good arguments for both formats, but the popular format was a single line per path.

That being said, you have a use case for multi lines per path, and it's quite similar in format to the containment C and proposed fragmentF record. Could a set of containment/fragment records be used to compose a path, Erik, if they were grouped together with a common path identifier?

@sjackman
Copy link
Collaborator

sjackman commented Sep 20, 2016

@ekg Let's use this issue to discuss the ambiguity of the current path P record and move the discussion of a new path record format to #36 "Paths in GFA v2". @pb-jchin and @rrwick have thoughts on this topic as well.

@ekg
Copy link
Collaborator

ekg commented Sep 20, 2016

We can always repurpose a namespace but then we can't read each other's
paths...

How many implementations are using the one line per path format?

I understand it is sometimes easier to read the path in one step, but
accumulating it (like we do for everything else in the graph) seems equally
simple.

One line per path element solves the issue described here without any
additional format complexity.

@ggonnella
Copy link
Contributor

Can we escape commas inside of the segment name using a backslash?
Then we would have:

S P1+,P2 acgt
S S3 cgtt
L P1+,P2 S3 *
P X P1+\,P2+,S3+ *

I think this is parsable and clear to the human eye too.

@ekg
Copy link
Collaborator

ekg commented Sep 21, 2016

It is clear to the human eye until the path reaches more than a handful of elements. If the path has a million elements in it then it will be very uncomfortable to try to read in a terminal when it is written on one line.

If the path elements each have their own line then we can sort the file by ID in the second column and maintain a pseudo-local view of the graph and walks though it in text. This is helpful when there are many overlapping paths.

I beg the group not to add more custom sub formats to GFA. We can do everything we need with paths using only tab as a delimiter. I have no appetite for supporting a complex ad-hoc serialization format. The whole point of GFA is that it is easy to parse and write from C and easy for humans to read. Otherwise we could define the format in BNF and write JSON/XML/RDF/protobuf/capnproto/etc. from our codes.

Barring any interest in making the elements of paths first class objects, I guess we are going to have to fork our representation of paths. Any suggestions for a letter? Is W taken?

@lh3
Copy link
Collaborator

lh3 commented Sep 21, 2016

My recommendation is for GFA1, we test regex pattern /[+-],/ at an S-line. If we find a match, we throw an error, saying the current spec does not allow it. I think this treatment is working most of time in practice. I have seen commas in sequence names, but I am yet to see a /[+-],/ pattern – it is very rare.

@ggonnella
Copy link
Contributor

I would agree with @lh3 proposal.

@sjackman
Copy link
Collaborator

I think that's a pragmatic solution for GFA1, Heng. If we keep the one-line-per-path format for GFA 2, we may consider changing the separator to space as @richarddurbin suggested.

sjackman added a commit to sjackman/GFA-spec that referenced this issue Sep 21, 2016
Commas are used to separate segment IDs in a path.

Closes GFA-spec#37.
@sjackman
Copy link
Collaborator

sjackman commented Sep 21, 2016

I've submitted PR #40 to resolve this issue for GFA1 as per Heng's suggestion. Your vote is appreciated.

@richarddurbin
Copy link

I am replying to @sjackman's comment about distinguishing A+,R+,B+ from A+,R-,B+.
There are different edges that link A+ to R+ and A+ to R-. In the GFA2 pull request you distinguish between the two paths by giving different identifiers to the two edges (links for GFA1) and explicitly giving the relevant identifier in the path specification. If there is only one edge from A to R then you can omit the edge identifier. This way a path is just a list of identifiers.

@thegenemyers
Copy link
Contributor Author

thegenemyers commented Sep 25, 2016

Yes, that's right. -- Gene

@sjackman
Copy link
Collaborator

sjackman commented Sep 26, 2016

Edge identifiers are needed when specifying paths through a multigraph, and GFA1 is not a multigraph. See my comment at #49 (comment)

@ggonnella
Copy link
Contributor

ggonnella commented Sep 26, 2016

I do not see any reason for requiring GFA1 not to represent a multigraph. Until now, it is possible to represent a multigraph, and I think it is useful.

There are different cases of multiple links connecting two segments.

One are complement links, such as L A + B + 100M and L B - A - 100M, they are simply different representations of the same information (in RGFA if I encounter both during parsing, I ignore them). Having both in the graph does not make it a multigraph.

Also, connecting segments in different orientations (e.g. L A + B+ and L A + B -) does not make it a multigraph, as the vertex definition is (segment, orientation).

However, multiple dovetail between two sequences , e.g. L A + B + 100M and L A + B + 50M, are not a rarity in assembly and make it a multigraph. In my own assembler, Readjoiner, I am working on an output of each stage of the assembly as GFA, and of course the overlap graph of the reads can contain such multiple overlaps. Also, the multigraph still satisfy the requirements of the SSG definition, as both of the L records define two edges in the graph and the simmetry is still satisfied.

@lh3
Copy link
Collaborator

lh3 commented Sep 26, 2016

I agree with @ggonnella that GFA1 can be a multigraph. This leads to complications in implementations, but it is necessary. Individual tools may choose to squash or remove multiple edges, but the spec should not forbid them.

@sjackman
Copy link
Collaborator

One are complement links, such as L A + B + 100M and L A - B – 100M, they are simply different representations of the same information (in RGFA if I encounter both during parsing, I ignore them). Having both in the graph does not make it a multigraph.

Do you mean L A + B + 100M and L B - A – 100M? Yes, those are the two representations of the same pair of complementary edges, (u, v) and (σ(u), σ(v)), where u = A+ and v = B+.

Also, connecting segments in different orientations (e.g. L A + B + and L A + B -) does not make it a multigraph, as the vertex definition is (segment, orientation).

100% agree.

However, multiple dovetail between two sequences , e.g. L A + B + 100M and L A + B + 50M, are not a rarity in assembly and make it a multigraph.

Yes, using such edges does make the graph a multigraph. @richarddurbin made this point as well. "This assumes that you don’t allow multiple different link lines between the same pair of nodes with the same +/- but different overlap numbers".

I agree with @ggonnella that GFA1 can be a multigraph. This leads to complications in implementations, but it is necessary. Individual tools may choose to squash or remove multiple edges, but the spec should not forbid them.

I'm okay with supporting such multigraphs in GFA1. It would be helpful to the implementation though to say in the header line whether the graph is a strict graph or a multigraph.

@ggonnella
Copy link
Contributor

Do you mean L A + B + 100M and L B - A – 100M?

Yes, sorry, typo, I fixed that.

@ggonnella
Copy link
Contributor

I'm okay with supporting such multigraphs in GFA1. It would be helpful to the implementation though to say in the header line whether the graph is a strict graph or a multigraph.

I have nothing against this, if it is useful for some applications.

sjackman added a commit that referenced this issue Sep 27, 2016
Commas are used to separate segment IDs in a path.

Closes #37.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants