Meaning should not depend on the order of records #54

Closed
sjackman opened this issue Jan 11, 2017 · 19 comments

@sjackman
Collaborator

sjackman commented Jan 11, 2017

I propose to strike the text

U/O-lines with the same name are considered to be concatenated together in the order in which they appear, and

See its context here under the heading Group.

The meaning of a GFA file should not depend on the order of its records. It should be possible to sort or shuffle a GFA file without affecting the meaning of the GFA file. Imagine if the meaning of a SAM file depended on the order of the alignments in the SAM file. Sorting by read name or target position would not be possible.

Consider the following example of O P1 2+ 1+ split across two lines:

O P1 2+
O P1 1+

after a UNIX sort it becomes

O P1 1+
O P1 2+

which has changed the meaning of this path P1 from 2+ 1+ to 1+ 2+.
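
The effect can be reproduced with a minimal sketch. It assumes the struck rule (O-lines with the same name are concatenated in file order) and whitespace-separated fields; `path_from_lines` is a hypothetical helper, not part of any GFA library.

```python
def path_from_lines(lines, name):
    """Concatenate the references of all O-lines named `name`, in file order."""
    refs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "O" and fields[1] == name:
            refs.extend(fields[2:])
    return refs


original = ["O P1 2+", "O P1 1+"]
print(path_from_lines(original, "P1"))          # ['2+', '1+']
print(path_from_lines(sorted(original), "P1"))  # ['1+', '2+']
```

Sorting the two records is enough to reverse the path, which is the order dependence the proposal removes.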

@sjackman sjackman self-assigned this Jan 11, 2017
@sjackman
Collaborator Author

@ekg @ggonnella @jts @lh3 @pb-jchin @rchikhi @richarddurbin @rrwick @sjackman @thegenemyers
Your vote on this proposal is appreciated.

@sjackman
Collaborator Author

sjackman commented Jan 11, 2017

This text was added to address Erik's use case of walks. His use case specifically included sorting the path records by their components, and so I believe this text does not address his use case, and as such should be removed.

I feel Erik's use case should be addressed with a new walk record type. Heng Li proposed such a record at #47 (comment). I suggest we define a walk record type in a future backward-compatible revision of GFA (e.g. GFA 2.1), with input from Erik, that specifically addresses his use case.

@rrwick

rrwick commented Jan 12, 2017

I vote in favour of this proposal!

@ggonnella
Contributor

I am not against your proposal of deleting the quoted text (i.e. I am rather neutral on this).

As GFA2 allows for user-specific line types (a core parser ignores any line that does not start with a standard record type specifier), we should decide whether the W line should be an application-specific line or whether it should really be included in a future specification.

@sjackman
Collaborator Author

sjackman commented Jan 12, 2017

I'm inclined to defer that question to a future date. More immediately I think it's quite important that we resolve that the meaning of a GFA file should not depend on the order of its lines. The walk record could start as application specific, and later be proposed using a pull request for adoption by the standard. The benefit of an extensible standard is that we don't have to nail down every possible application right away.

@thegenemyers
Contributor

thegenemyers commented Jan 13, 2017 via email

@rrwick

rrwick commented Jan 13, 2017

Gene,

Using your 'blue' example, if this is the full graph structure...

[Image: screenshot of the graph]

...then this path is V3 -> V4 -> V5 -> V1 -> V2 -> V3:

O	blue	V3 V4 V5
O	blue	V1 V2 V3

...and this path is V1 -> V2 -> V3 -> V3 -> V4 -> V5:

O	blue	V1 V2 V3
O	blue	V3 V4 V5

Both paths are valid, so it seems like the definition does depend on the line order. I guess I'm unsure of what you mean by 'work just fine'. Am I missing something?

@thegenemyers
Contributor

thegenemyers commented Jan 13, 2017 via email

@sjackman
Collaborator Author

In graphs that have been disambiguated using paired-end (or other long-range) information, paths may contain cycles. A typical example is an untangled repeat. The paths of vertices A X B Y C and A Y B X C are both valid and quite different. If the path is recorded in five lines, then disturbing the order of those lines (for example, by sorting) changes the path.

Just as it's possible to concatenate two FASTA files, I think it should be possible to concatenate two GFA files to get the union of those two graphs. If those two files unexpectedly share a path ID and the user doesn't realize, I think it should raise an error about the duplicated path ID, not silently concatenate the two paths into a single path. Imagine if FASTA implementations silently concatenated any two sequences with the same ID.
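
A rough sketch of that behaviour, assuming each path must be defined on a single O-line and that fields are whitespace-separated. `read_paths` and the two example files are hypothetical and only illustrate the check.

```python
def read_paths(lines):
    """Map each O-line name to its references; reject a repeated name."""
    paths = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] != "O":
            continue  # other record types are ignored in this sketch
        name, refs = fields[1], fields[2:]
        if name in paths:
            raise ValueError(f"duplicate path ID: {name}")
        paths[name] = refs
    return paths


file_a = ["O P1 A+ X+ B+"]
file_b = ["O P1 C+ D+"]  # hypothetical second file that happens to reuse P1

try:
    read_paths(file_a + file_b)  # concatenating the files exposes the clash
except ValueError as err:
    print(err)
```

With this rule the result does not depend on which file comes first: either order fails with the same error instead of producing two different merged paths.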

@thegenemyers
Contributor

thegenemyers commented Jan 13, 2017 via email

@rchikhi

rchikhi commented Jan 13, 2017

The discussion boils down to whether the following lines

O blue A B
O blue B C

would encode the path A -> B -> B -> C (1) or the path A -> B -> C (2).
The spec seems to leave some room for interpretation here.
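
The two readings can be written out as a small sketch. Both function names are hypothetical, and the second one assumes that consecutive lines must overlap by exactly one segment.

```python
def concat(chunks):
    """Interpretation (1): plain concatenation of the lines."""
    return [ref for chunk in chunks for ref in chunk]


def join_on_shared_segment(chunks):
    """Interpretation (2): consecutive lines overlap by one shared segment."""
    path = list(chunks[0])
    for chunk in chunks[1:]:
        if path[-1] != chunk[0]:
            raise ValueError(f"no shared segment: {path[-1]} vs {chunk[0]}")
        path.extend(chunk[1:])  # drop the repeated segment
    return path


chunks = [["A", "B"], ["B", "C"]]
print(concat(chunks))                  # ['A', 'B', 'B', 'C']
print(join_on_shared_segment(chunks))  # ['A', 'B', 'C']
```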

@rchikhi

rchikhi commented Jan 13, 2017

Anyhow I vote in favor of the proposal. It wouldn't be very elegant if the order of the lines inside a GFA2 file mattered just because of those O/U-lines.

@rrwick

rrwick commented Jan 13, 2017

I understand better now, but even if separate lines of a path must share a segment (as Gene and Rayan said) the line order can still matter.

Here's the simplest case I can think of:
[Image: screenshot of the graph]

This path is V1 -> V2 -> V1 -> V3 -> V1:

O	blue	V1 V2 V1
O	blue	V1 V3 V1

And this path is V1 -> V3 -> V1 -> V2 -> V1:

O	blue	V1 V3 V1
O	blue	V1 V2 V1

Complexities like this feel awkward, so I'm still in favour of the proposal to require groups to be defined on one line.
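
A quick check of this case, as a sketch under the assumption that consecutive lines are joined on their shared segment. `join_lines` is a hypothetical helper.

```python
def join_lines(chunks):
    """Join consecutive lines that overlap by one shared segment."""
    path = list(chunks[0])
    for chunk in chunks[1:]:
        if path[-1] != chunk[0]:
            raise ValueError("consecutive lines must share a segment")
        path.extend(chunk[1:])
    return path


first = [["V1", "V2", "V1"], ["V1", "V3", "V1"]]
print(" -> ".join(join_lines(first)))        # V1 -> V2 -> V1 -> V3 -> V1
print(" -> ".join(join_lines(first[::-1])))  # V1 -> V3 -> V1 -> V2 -> V1
```

Both orders pass the shared-segment check, so that check alone cannot recover the intended order.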

@sjackman
Collaborator Author

sjackman commented Jan 15, 2017

My count of the votes is
Strike the text: @rchikhi @rrwick @sjackman
Keep the text: @pb-jchin @thegenemyers
Abstain: @ggonnella
@lh3 @richarddurbin Do you wish to vote on this proposal?
Unless anyone else speaks up, the proposal to strike the text is accepted. We can of course address this use case again in a future proposal.

@richarddurbin

richarddurbin commented Jan 15, 2017 via email

@sjackman
Collaborator Author

I'm fine with moving the multiline path discussion to a non-formal descriptive text. A multiline path could be useful, but I think its format needs more discussion.

@sjackman
Collaborator Author

I've added the following text:

Note: It was discussed whether U/O-lines with the same name could be considered to be concatenated together in the order in which they appear (see #54 and #47). This multi-line path format was not included in the current version of this specification, but if people want to explore use of this structure, they can do so using a different single letter record code.

@thegenemyers
Contributor

thegenemyers commented Jan 18, 2017 via email

@ggonnella
Contributor

If we allow a multiline path at some point, one could encode the order with an optional tag that is included only in multiline paths. I do not see any disadvantages in that case, other than some more validation checks (is the tag included in all lines with the same ID? are the tag values all different? is the path resulting from the concatenation a valid one?).
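
One way this could look, purely as a sketch: the tag name `ix` and its `ix:i:<n>` form are invented here for illustration and are not defined by any GFA specification, and `ordered_path` is a hypothetical helper. It covers the first two validation checks; checking that the resulting path is valid in the graph is left out.

```python
import re

# Hypothetical order tag; not part of any GFA specification.
ORDER_TAG = re.compile(r"^ix:i:(\d+)$")


def ordered_path(lines, name):
    """Rebuild a multiline path from its order tags, ignoring line order."""
    chunks = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] != "O" or fields[1] != name:
            continue
        match = ORDER_TAG.match(fields[-1])
        if match is None:
            raise ValueError(f"missing order tag: {line}")
        index = int(match.group(1))
        if index in chunks:
            raise ValueError(f"repeated order tag value: {index}")
        chunks[index] = fields[2:-1]
    return [ref for index in sorted(chunks) for ref in chunks[index]]


lines = ["O P1 2+ ix:i:1", "O P1 1+ ix:i:2"]
print(ordered_path(lines, "P1"))          # ['2+', '1+']
print(ordered_path(sorted(lines), "P1"))  # ['2+', '1+']
```

Because the order is carried by the tag rather than by position in the file, sorting or shuffling the records leaves the path unchanged.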


6 participants