Meaning should not depend on the order of records #54
Comments
@ekg @ggonnella @jts @lh3 @pb-jchin @rchikhi @richarddurbin @rrwick @sjackman @thegenemyers |
This text was added to address Erik's use case of walks. His use case specifically included sorting the path records by their components, so I believe this text does not address his use case and should be removed. Erik's use case should, I feel, be addressed with a new walk record type. Heng Li proposed such a record at #47 (comment). I suggest we define a walk record type in a future backward-compatible revision of GFA (e.g. GFA 2.1), with input from Erik, that specifically addresses his use case. |
I vote in favour of this proposal! |
I am not against your proposal of deleting the quoted text (i.e. I am rather neutral on this). As GFA2 allows for user-specific line types (a core parser ignores any line that does not start with one of the standard record type specifiers), we should decide whether the W line should be an application-specific line or whether it should really be included in a future specification. |
I'm inclined to defer that question to a future date. More immediately I think it's quite important that we resolve that the meaning of a GFA file should not depend on the order of its lines. The walk record could start as application specific, and later be proposed using a pull request for adoption by the standard. The benefit of an extensible standard is that we don't have to nail down every possible application right away. |
Shaun,
I don't really understand the use case that is being discussed. I read the 'W' line description and it does not convey to me the precise meaning.
A path can be in pieces, i.e. on different lines, and still be reconstructed with the understanding that edges between path pieces are not inferred, which is the case in your rather extreme example. For example,
O blue V3 V4 V5
O blue V1 V2 V3
O red e34 e45
O red e12 e23
work just fine.
I'm not sure why order independence is important, nor what it buys you. Basically, the way the standard is currently, you have to read the entire file to know if it's even valid, i.e. there are no undefined vertices, etc. If you are going to read the whole file (sequentially) and build an internal representation in order to semantically check it, etc., then ???
-- Gene
(In reply to Shaun Jackman's comment of 1/12/17, 6:45 PM.)
|
Gene,
Using your 'blue' example, if this is the full graph structure...
[screenshot of the graph structure: https://cloud.githubusercontent.com/assets/7053555/21931192/334cf606-d9ee-11e6-9077-02ead4f93180.png]
...then this path is V3 -> V4 -> V5 -> V1 -> V2 -> V3:
O blue V3 V4 V5
O blue V1 V2 V3
...and this path is V1 -> V2 -> V3 -> V3 -> V4 -> V5:
O blue V1 V2 V3
O blue V3 V4 V5
Both paths are valid, so it seems like the definition does depend on the line order. I guess I'm unsure of what you mean by 'work just fine'. Am I missing something? |
No, I wrote that if we define it so that one *does not* infer edges between line components then it is order independent. So the path is V1->V2->V3->V4->V5. The reference to V3 twice connects the two segments in an unambiguous way. If eij is the edge Vi->Vj, then the red path also defines the same path.
-- Gene
(In reply to Ryan Wick's comment of 1/13/17, 2:18 PM.)
|
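Gene's reading, in which pieces of a path are joined wherever one piece ends on the vertex that another starts with, can be sketched as follows. This is only one possible interpretation of his comment, not anything defined by the GFA specification; the function name `stitch` and the choice to raise an error on ambiguous input are assumptions made for illustration. Pieces are assumed to be non-empty lists of vertex names.

```python
def stitch(pieces):
    """Chain pieces so that each one starts on the vertex the previous one ended on.

    Hypothetical helper: the result does not depend on the order of `pieces`.
    """
    remaining = [list(p) for p in pieces]
    ends = [p[-1] for p in remaining]
    # The first piece is the one whose start vertex is not the end of any piece.
    starts = [p for p in remaining if p[0] not in ends]
    if len(starts) != 1:
        raise ValueError("cannot pick a unique first piece")
    remaining.remove(starts[0])
    path = list(starts[0])
    while remaining:
        candidates = [p for p in remaining if p[0] == path[-1]]
        if len(candidates) != 1:
            raise ValueError("pieces do not chain unambiguously")
        remaining.remove(candidates[0])
        path.extend(candidates[0][1:])  # skip the shared vertex
    return path


blue = [["V3", "V4", "V5"], ["V1", "V2", "V3"]]
print(stitch(blue))                  # ['V1', 'V2', 'V3', 'V4', 'V5']
print(stitch(list(reversed(blue))))  # ['V1', 'V2', 'V3', 'V4', 'V5']
```

Under this reading the two line orders give the same path, but the sketch has to give up when no unique first piece exists (for example, a circular path) or when two pieces start on the same vertex.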
In graphs that have been disambiguated using paired-end (or other long-range information), paths may contain cycles. A typical example is an untangled repeat. The paths of vertices A X B Y C and A Y B X C are both valid and quite different. If the path record is recorded in five lines, then disturbing the order of those lines (for example, by sorting) changes the path.
Just as it's possible to concatenate two FASTA files, I think it should be possible to concatenate two GFA files to get the union of those two graphs. If those two files unexpectedly share a path ID and the user doesn't realize, I think it should give an error about the duplicated path ID, not silently concatenate the two paths into a single path. Imagine if FASTA implementations silently concatenated any two sequences with the same ID. |
That holds only if you put one vertex per line. If you wrote
O blue A X B
O blue B Y C
O red A Y B
O red B X C
it would be fine, and you would still have the convenience of not needing a line with 20,000 vertices on it (for example, some paths in real graphs are really that long with PacBio tech).
-- Gene
(In reply to Shaun Jackman's comment of 1/13/17, 5:32 PM.)
|
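The behaviour Shaun asks for, where concatenating two GFA files yields the union of their paths and a shared path ID is reported as an error, might look like the sketch below. The function name `read_paths` is hypothetical, fields are split on any whitespace for simplicity, and only O- and U-lines are examined; none of this is taken from an existing parser.

```python
def read_paths(gfa_text):
    """Return {path_id: [items]}, rejecting a path ID that appears twice."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ("O", "U"):
            continue  # other record types are out of scope for this sketch
        path_id, items = fields[1], fields[2:]
        if path_id in paths:
            # Report the clash instead of silently joining the two records.
            raise ValueError(f"duplicate path ID: {path_id}")
        paths[path_id] = items
    return paths


file_a = "O blue A X B Y C\n"
file_b = "O red A Y B X C\n"
print(read_paths(file_a + file_b))

clash = file_a + "O blue A Y B X C\n"
try:
    read_paths(clash)
except ValueError as err:
    print(err)  # duplicate path ID: blue
```

Because each path ID may appear only once, the result is the same whichever file comes first in the concatenation.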
The discussion boils down to whether the following lines would encode the path A -> B -> B -> C (1) or the path A -> B -> C (2). |
Anyhow I vote in favor of the proposal. It wouldn't be very elegant if the order of the lines inside a GFA2 file mattered just because of those O/U-lines. |
My count of the votes is |
I am frustrated because as far as I can see everyone has voted to accept the standard, which should mean that they take it as is. But somehow at the same time a subset have argued to change it, which is incompatible with accepting it as is.
That said, I hear the concern about order dependence.
So here is a proposed way forwards. Rather than remove all mention of the multiline idea, we remove it from the current formal syntax specification, but in the descriptive text for the group lines in the official spec document we say that there was discussion about allowing multiple group lines with the same identifier which would define consecutive segments of an ordered group, and that this was not included in the current version, but that if people want to explore use of this structure they could do so using another single letter record code. This would permit compatible exploration, since I understand that records with letters outside the current set get ignored by a formal parser. If someone does explore multiline groups this way and makes a case for them, with current syntax or a variation of that, then we can debate how to incorporate that into the official spec.
I like this approach because I don’t want all mention of this idea to be lost from the specification document.
Richard
|
I'm fine with moving the multiline path discussion to a non-formal descriptive text. A multiline path could be useful, but I think its format needs more discussion. |
I've added the following text: Note: It was discussed whether U/O-lines with the same name could be considered to be concatenated together in the order in which they appear (see #54 and #47). This multi-line path format was not included in the current version of this specification, but if people want to explore use of this structure, they can do so using a different single letter record code. |
I'm OK with this and with the text that Shaun put in as a placeholder for the discussion.
-- Gene
|
If we allow a multiline path at some point, one could encode the order with an optional tag, included only in multiline paths. I do not see any disadvantages in that case, other than some more validation checks (is the tag included in all lines with the same ID? are the tag values all different? is the encoded path resulting from the concatenation a valid one?). |
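A rough sketch of how such an order tag could be read and validated is given below. The tag name `or` and the function `join_tagged_paths` are invented for illustration and are not part of any GFA specification; the sketch assumes tab-separated fields with space-separated path items. It covers the first two checks listed above (tag present on every line of a multiline path, tag values all different) and leaves out the third, which needs the graph itself.

```python
from collections import defaultdict

ORDER_TAG = "or"  # hypothetical tag name, not defined by any GFA specification


def join_tagged_paths(gfa_text):
    """Join O-lines that share an ID, ordered by the tag rather than by file position."""
    pieces = defaultdict(list)  # path ID -> [(order value or None, items)]
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "O":
            continue
        tags = {}
        for tag in fields[3:]:
            name, _type, value = tag.split(":", 2)
            tags[name] = value
        order = int(tags[ORDER_TAG]) if ORDER_TAG in tags else None
        pieces[fields[1]].append((order, fields[2].split()))

    paths = {}
    for path_id, parts in pieces.items():
        if len(parts) > 1:  # the tag is only required for multiline paths
            orders = [order for order, _ in parts]
            if None in orders:
                raise ValueError(f"{path_id}: order tag missing on some lines")
            if len(set(orders)) != len(orders):
                raise ValueError(f"{path_id}: order tag values are not all different")
            parts = sorted(parts, key=lambda part: part[0])
        paths[path_id] = [item for _, items in parts for item in items]
    return paths


shuffled = "O\tP1\t1+\tor:i:2\nO\tP1\t2+\tor:i:1\n"
print(join_tagged_paths(shuffled))  # {'P1': ['2+', '1+']}
```

With the order carried in the tag, sorting or shuffling the lines leaves the decoded path unchanged.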
I propose to strike the text
U/O-lines with the same name are considered to be concatenated together in the order in which they appear, and
See its context here under the heading Group.
The meaning of a GFA file should not depend on the order of its records. It should be possible to sort or shuffle a GFA file without affecting the meaning of the GFA file. Imagine if the meaning of a SAM file depended on the order of the alignments in the SAM file. Sorting by read name or target position would not be possible.
Consider the following example of O P1 2+ 1+ split across two lines:
O P1 2+
O P1 1+
After a UNIX sort it becomes
O P1 1+
O P1 2+
which has changed the meaning of this path P1 from 2+ 1+ to 1+ 2+.
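The effect described in the issue text can be reproduced with a few lines of Python. This is a minimal sketch: the function name `concatenated_paths` is hypothetical, fields are split on any whitespace, and Python's `sorted` stands in for a byte-wise UNIX sort.

```python
from collections import defaultdict


def concatenated_paths(gfa_text):
    """Join O-lines sharing a name in file order (the rule proposed for removal)."""
    paths = defaultdict(list)
    for line in gfa_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "O":
            paths[fields[1]].extend(fields[2:])
    return dict(paths)


original = "O P1 2+\nO P1 1+\n"
sorted_text = "\n".join(sorted(original.splitlines())) + "\n"

print(concatenated_paths(original))     # {'P1': ['2+', '1+']}
print(concatenated_paths(sorted_text))  # {'P1': ['1+', '2+']}
```

The same records yield two different paths depending only on line order, which is the behaviour the proposal aims to rule out.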