Meaning should not depend on the order of records #54

sjackman · 2017-01-11T19:16:11Z

I propose to strike the text

U/O-lines with the same name are considered to be concatenated together in the order in which they appear, and

See its context here under the heading Group.

The meaning of a GFA file should not depend on the order of its records. It should be possible to sort or shuffle a GFA file without affecting the meaning of the GFA file. Imagine if the meaning of a SAM file depended on the order of the alignments in the SAM file. Sorting by read name or target position would not be possible.

Consider the following example, in which O P1 2+ 1+ is split across two lines:

O P1 2+
O P1 1+

after a UNIX sort it becomes

O P1 1+
O P1 2+

which has changed the meaning of this path P1 from 2+ 1+ to 1+ 2+.
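
To make the effect concrete, here is a minimal sketch of a reader that follows the quoted rule, joining O-lines with the same name in the order they appear. The function name `concatenated_paths` is made up for this illustration, and fields are split on whitespace for simplicity; it is not taken from any GFA implementation.

```python
from collections import defaultdict

def concatenated_paths(lines):
    """Join O-lines that share a name, in the order in which they appear."""
    paths = defaultdict(list)
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "O":
            # fields[1] is the path name; the rest are the path's items
            paths[fields[1]].extend(fields[2:])
    return dict(paths)

original = ["O P1 2+", "O P1 1+"]
print(concatenated_paths(original))          # {'P1': ['2+', '1+']}
print(concatenated_paths(sorted(original)))  # {'P1': ['1+', '2+']}
```

The same two records yield two different paths depending only on their order in the file.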

sjackman · 2017-01-11T19:19:15Z

@ekg @ggonnella @jts @lh3 @pb-jchin @rchikhi @richarddurbin @rrwick @sjackman @thegenemyers
Your vote on this proposal is appreciated.

sjackman · 2017-01-11T19:23:21Z

This text was added to address Erik's use case of walks. His use case specifically included sorting the path records by their components, and so I believe this text does not address his use case, and as such should be removed.

Erik's use case should, I feel, be addressed with a new walk record type. Heng Li proposed such a record at #47 (comment). I suggest we define a walk record type, with input from Erik, that specifically addresses his use case, in a future backward-compatible revision of GFA (e.g. GFA 2.1).

rrwick · 2017-01-12T00:19:43Z

I vote in favour of this proposal!

ggonnella · 2017-01-12T11:03:43Z

I am not against your proposal of deleting the quoted text (i.e. I am rather neutral on this).

As GFA2 allows for user-specific line types (a core parser ignores any line that does not start with one of the standard record type specifiers), we should decide whether the W line shall be an application-specific line or whether it shall really be included in a future specification.

sjackman · 2017-01-12T17:45:33Z

I'm inclined to defer that question to a future date. More immediately I think it's quite important that we resolve that the meaning of a GFA file should not depend on the order of its lines. The walk record could start as application specific, and later be proposed using a pull request for adoption by the standard. The benefit of an extensible standard is that we don't have to nail down every possible application right away.

thegenemyers · 2017-01-13T09:22:24Z

Shaun, I don't really understand the use case that is being discussed. I read the 'W' line description and it does not convey the precise meaning to me. A path can be in pieces, i.e. on different lines, and still be reconstructed with the understanding that edges between path pieces are not inferred, which is the case in your rather extreme example. For example,

O blue V3 V4 V5
O blue V1 V2 V3
O red e34 35
O red e12 e23

work just fine. I'm not sure why order independence is important, nor what it buys you. Basically, the way the standard is currently, you have to read the entire file to know if it's even valid, i.e. there are no undefined vertices, etc. If you are going to read the whole file (sequentially) and build an internal representation in order to semantically check it, etc. then ???

-- Gene

rrwick · 2017-01-13T13:18:15Z

Gene,

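Using your 'blue' example, if this is the full graph structure...

[screenshot: https://cloud.githubusercontent.com/assets/7053555/21931192/334cf606-d9ee-11e6-9077-02ead4f93180.png]

...then this path is V3 -> V4 -> V5 -> V1 -> V2 -> V3: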

O	blue	V3 V4 V5
O	blue	V1 V2 V3

...and this path is V1 -> V2 -> V3 -> V3 -> V4 -> V5:

O	blue	V1 V2 V3
O	blue	V3 V4 V5

Both paths are valid, so it seems like the definition does depend on the line order. I guess I'm unsure of what you mean by 'work just fine'. Am I missing something?

thegenemyers · 2017-01-13T14:37:19Z

No, I wrote that if we define it so that one *does not* infer edges between line components, then it is order independent. So the path is V1->V2->V3->V4->V5. The reference to V3 twice connects the two segments in an unambiguous way. If eij is the edge Vi->Vj, then the red path also defines the same path. -- Gene

sjackman · 2017-01-13T16:32:56Z

In graphs that have been disambiguated using paired-end (or other long-range information), paths may contain cycles. A typical example is an untangled repeat. The paths of vertices A X B Y C and A Y B X C are both valid and quite different. If the path record is recorded in five lines, then disturbing the order of those lines (by for example sorting) changes the path.

Just as it's possible to concatenate two FASTA files, I think it should be possible to concatenate two GFA files to get the union of those two graphs. If those two files unexpectedly share a path ID and the user doesn't realize, I think it should give an error of the duplicated path ID, not silently concatenate the two paths into a single path. Imagine if FASTA implementations silently concatenated any two sequences with the same ID.
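
A rough sketch of the behaviour argued for here, assuming one line per group and that every group carries a name. The function name `read_groups` and the error wording are invented for illustration and do not come from any GFA library.

```python
def read_groups(lines):
    """Collect O/U-lines, one line per group; a repeated ID is an error."""
    groups = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields[0] not in ("O", "U"):
            continue  # other record types are out of scope for this sketch
        name, items = fields[1], fields[2:]
        if name in groups:
            raise ValueError(f"line {number}: duplicate group ID {name!r}")
        groups[name] = items
    return groups

# Two files that happen to reuse the path ID P1 for different paths.
file_a = ["O P1 A X B Y C"]
file_b = ["O P1 A Y B X C"]

try:
    read_groups(file_a + file_b)
except ValueError as err:
    print(err)  # line 2: duplicate group ID 'P1'
```

Under the concatenation rule, the same input would instead be read silently as a single ten-item path.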

thegenemyers · 2017-01-13T17:05:07Z

That is only if you put one vertex per line. If you wrote

O blue A X B
O blue B Y C
O red A Y B
O red B X C

it would be fine, and you would still have the convenience of not having to have a line with 20,000 vertices on it (for example, some paths in real graphs are really that long with PacBio tech).

-- Gene

rchikhi · 2017-01-13T20:46:13Z

The discussion boils down to whether the following lines

O blue A B
O blue B C

would encode the path A -> B -> B -> C (1) or the path A -> B -> C (2).
The spec seems to leave some room for interpretation here.

rchikhi · 2017-01-13T21:21:42Z

Anyhow I vote in favor of the proposal. It wouldn't be very elegant if the order of the lines inside a GFA2 file mattered just because of those O/U-lines.

rrwick · 2017-01-13T22:06:13Z

I understand better now, but even if separate lines of a path must share a segment (as Gene and Rayan said) the line order can still matter.

Here's the simplest case I can think of:

This path is V1 -> V2 -> V1 -> V3 -> V1:

O	blue	V1 V2 V1
O	blue	V1 V3 V1

And this path is V1 -> V3 -> V1 -> V2 -> V1:

O	blue	V1 V3 V1
O	blue	V1 V2 V1

Complexities like this feel awkward, so I'm still in favour of the proposal to require groups to be defined on one line.
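
As a small sketch of why the shared-segment reading does not remove the order dependence: the helper below (a hypothetical name, `join_on_shared_endpoint`) joins fragments only when each one starts with the segment the previous one ended on, and keeps that shared segment once.

```python
def join_on_shared_endpoint(fragments):
    """Join path fragments, requiring each to start where the previous ended."""
    path = list(fragments[0])
    for fragment in fragments[1:]:
        if fragment[0] != path[-1]:
            raise ValueError(f"{fragment[0]} does not continue from {path[-1]}")
        path.extend(fragment[1:])  # keep the shared segment only once
    return path

a = ["V1", "V2", "V1"]
b = ["V1", "V3", "V1"]
print(join_on_shared_endpoint([a, b]))  # ['V1', 'V2', 'V1', 'V3', 'V1']
print(join_on_shared_endpoint([b, a]))  # ['V1', 'V3', 'V1', 'V2', 'V1']
```

Both orders pass the shared-endpoint check, yet they produce different paths.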

sjackman · 2017-01-15T15:27:10Z

My count of the votes is
Strike the text: @rchikhi @rrwick @sjackman
Keep the text: @pb-jchin @thegenemyers
Abstain: @ggonnella
@lh3 @richarddurbin Do you wish to vote on this proposal?
Unless anyone else speaks up, the proposal to strike the text is accepted. We can of course address this use case again in a future proposal.

richarddurbin · 2017-01-15T18:12:39Z

I am frustrated because as far as I can see everyone has voted to accept the standard, which should mean that they take it as is. But somehow at the same time a subset have argued to change it, which is incompatible with accepting it as is. That said, I hear the concern about order dependence. So here is a proposed way forwards. Rather than remove all mention of the multiline idea, we remove it from the current formal syntax specification, but in the descriptive text for the group lines in the official spec document we say that there was discussion about allowing multiple group lines with the same identifier which would define consecutive segments of an ordered group, and that this was not included in the current version, but that if people want to explore use of this structure they could do so using another single letter record code. This would permit compatible exploration, since I understand that records with letters outside the current set get ignored by a formal parser. If someone does explore multiline groups this way and makes a case for them, with current syntax or a variation of that, then we can debate how to incorporate that into the official spec. I like this approach because I don’t want all mention of this idea to be lost from the specification document. Richard

sjackman · 2017-01-16T00:53:41Z

I'm fine with moving the multiline path discussion to a non-formal descriptive text. A multiline path could be useful, but I think its format needs more discussion.

sjackman · 2017-01-18T01:19:10Z

I've added the following text:

Note: It was discussed whether U/O-lines with the same name could be considered to be concatenated together in the order in which they appear (see #54 and #47). This multi-line path format was not included in the current version of this specification, but if people want to explore use of this structure, they can do so using a different single letter record code.

thegenemyers · 2017-01-18T14:29:55Z

I'm OK with this and with the text that Shaun put in as a place-holder for the discussion. -- Gene

ggonnella · 2017-02-03T13:21:47Z

If we allow a multiline path at some point, one could code the order with an optional tag, only included in multiline paths. I do not see any disadvantages in that case, other than some more validation checks (is the tag included in all lines with the same ID? are the tag values all different? is the encoded path resulting from the concatenation a valid one?).
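
One way this could look, purely as a sketch: the tag name `ix` is hypothetical and not defined by any GFA specification, and for simplicity the code requires the tag on every O-line rather than only on multiline paths. It covers the first two checks listed above; checking that the joined path is valid in the graph is left out.

```python
from collections import defaultdict

ORDER_TAG = "ix:i:"  # hypothetical tag name, not part of any GFA specification

def assemble_tagged_paths(lines):
    """Rebuild multi-line paths from an explicit per-line order tag."""
    pieces = defaultdict(dict)  # path ID -> {order index: items}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "O":
            continue
        name = fields[1]
        tags = [f for f in fields[2:] if f.startswith(ORDER_TAG)]
        items = [f for f in fields[2:] if not f.startswith(ORDER_TAG)]
        if len(tags) != 1:
            raise ValueError(f"{name}: expected exactly one order tag per line")
        index = int(tags[0][len(ORDER_TAG):])
        if index in pieces[name]:
            raise ValueError(f"{name}: order tag value {index} is repeated")
        pieces[name][index] = items
    # Concatenate each path's pieces by tag value, not by file position.
    return {name: [item for index in sorted(parts) for item in parts[index]]
            for name, parts in pieces.items()}

lines = ["O P1 2+ ix:i:0", "O P1 1+ ix:i:1"]
print(assemble_tagged_paths(lines))          # {'P1': ['2+', '1+']}
print(assemble_tagged_paths(sorted(lines)))  # {'P1': ['2+', '1+']}
```

With the order carried by the tag, sorting the file no longer changes the path.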

sjackman added the needs quorum label Jan 11, 2017

sjackman self-assigned this Jan 11, 2017

rrwick mentioned this issue Jan 12, 2017

GFA2 Pull Request for Review #48

Merged

sjackman added quorum and removed needs quorum labels Jan 15, 2017

sjackman closed this as completed Jan 18, 2017
