Conversation
…calculating the top-level digest
…hat create inconsistencies or are plainly incorrect now
…ell as consistent with RefSeq Sequences
… to me this should depend on the use case/implementation
|
Decisions from today:
|
nsheff
left a comment
There was a problem hiding this comment.
Just a few clarifications suggested on your edits, around the same_order section of the comparison results.
| Because the sequence array would contain the same sequences, the comparison function will count them all as matching. | ||
| Similarly, the names arrays contain the same names and so all will be counted as a match. | ||
| However, the same_order will *not* be true, so it will yield false. | ||
| However, the same_order will *not* be true; it will yield false for all attributes. |
There was a problem hiding this comment.
This is again incorrect. Here's an example:
it does not yield false for all attributes. The lengths and sequences are in the same order.
There was a problem hiding this comment.
so it will yield false only for the names attribute.
There was a problem hiding this comment.
I agree
EDIT: I agree, based on your original example. My idea was to change the example slightly (which also unfortunately makes it more complex) so that the line "This is the same output as a comparison of two sequence collections in different orders" becomes true (see comment below)
| ## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | ||
|
|
||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | ||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
This correctly is not correct. This example was not about different orders, just about a name swap.
There was a problem hiding this comment.
My suggestion is to change the example slightly, see: https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123
| ## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | ||
|
|
||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | ||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? | |
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but where two of the sequences have swapped names? |
There was a problem hiding this comment.
Another consequence of changing the example (see https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123)
| Because the sequence array would contain the same sequences, the comparison function will count them all as matching. | ||
| Similarly, the names arrays contain the same names and so all will be counted as a match. | ||
| However, the same_order will *not* be true, so it will yield false. | ||
| However, the same_order will *not* be true; it will yield false for all attributes. |
There was a problem hiding this comment.
| However, the same_order will *not* be true; it will yield false for all attributes. | |
| However, the same_order will *not* necessarily be true; it will yield false for at least one attribute (the `names`). |
There was a problem hiding this comment.
This is another consequence of changing the example slightly
|
@sveinugu I made a few suggestions on top of yours. Can you accept/confirm my corrections to your edits now? Then we can merge your changes. |
| However, the same_order will *not* be true, so it will yield false. | ||
| However, the same_order will *not* be true; it will yield false for all attributes. | ||
|
|
||
| This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing. |
There was a problem hiding this comment.
This is the same output as a comparison of two sequence collections in different orders, without the name swap.
My main problem is with this line. In the example where everything is the same except for a name swap, then the output will NOT be the same as a comparison of two sequence collections in different order. As you mention above, in the main example, only the same_order value for the names attribute will be false. In the comparison of two sequence collections in different orders, all same_order values will be false!
There was a problem hiding this comment.
yes you are right. I think this may be an outdated explanation from before we split the same_order into a per-array field.
That removes this particular limitation with the compare function.
sveinugu
left a comment
There was a problem hiding this comment.
Sorry for spamming you with single comments, I should have started the batch review straight away.
So I think we are on the same page except for the last example in the "Compare two sequence collections" (under "limitations"). I suggest to change the example to make the sentence "This is the same output as a comparison of two sequence collections in different orders" correct, which I do not think it is now, and all the small changes are just a consequence of that major change. However, there might be a simpler way to get the message across, as my suggestion makes the whole text slightly more complidated. Either way, perhaps adding an explicit JSON example will simplify things?
| ## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | ||
|
|
||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | ||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
My suggestion is to change the example slightly, see: https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123
| ## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | ||
|
|
||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | ||
| One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
Another consequence of changing the example (see https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123)
Changes from @nsheff I agree with Co-authored-by: Nathan Sheffield <nsheff@users.noreply.github.com>
@nsheff @andrewyatz @tcezard
I should of course have done this earlier, but I have now read the spec detailedly and found a number of issues, errors, and inconsistencies.
The two most important of which is that:
the recommended bas schema includes the
accessionattribute. While I am not against this attribute per se, the issue on this is still open (New schema term: accessions #79) and there is nothing in the decision record about this. On the top of my head, I have at least two issues with this attribute (other than the formalities): 1. Should it allow for more than one accession per sequence? 2. Should the accessions somehow be namespaced, e.g. be in CURIE form?The
lengthattribute is now digested, even though it is notinherent. Related to this is the presentation of the digestion algorithm, where I think the filtering step should be moved to later in order to allow for the creation of level 1 digests for inherent attributes.Please look at the different commits for other issues.
Hope there is still time to fix these issues!