Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise decision record: sorted_sequences #77

Open
ahwagner opened this issue May 17, 2024 · 2 comments
Open

Revise decision record: sorted_sequences #77

ahwagner opened this issue May 17, 2024 · 2 comments

Comments

@ahwagner
Copy link
Member

https://github.com/ga4gh/seqcol-spec/blob/4722c713a9a5692b255f8bdf9ab27c18ca297042/docs/decision_record.md?plain=1#L19-L21

The above text conveys some opinions about the relative unimportance and utility of this feature.

From my perspective, using this feature is important, as it allows us to design federated applications with a reduced overhead on implementations: if a collection of sequences from which a dataset was generated can be universally identified by a GA4GH digest, based purely on sequence content (and not names, lengths, or order of sequences), then a one-time digest is sufficient for answering the question "do these resources represent variants in the same coordinate space?". I think it is much more likely that two implementations across a network will have the same underlying sequence content, but describe sequences in different orders or with different names (and therefore may not be able to compute /comparison/:digest1/:digest2 if digest2 is an unrecognized permutation of digest1 by that instance). It therefore seems to me to that implementing this feature represents the most obvious case for describing Sequence Collections across a federated service network, instead of a case with limited / fringe utility.

I recognize that as someone looking at this spec from the outside, I may not be grasping the larger importance of the comparison function or how Sequence Collections are intended to be used. And in deference to the work and internal discussions of the SeqCol team I am not suggesting that this feature become required or an exemplar case for Sequence Collections. However, I think it would be more inclusive of potential adopters with a similar view as my own to simply describe this feature and its use in the decision record, and tone down any language that may appear dismissive of this use case.

@ahwagner ahwagner changed the title Revise decision record: Revise decision record: sorted_sequences May 17, 2024
@nsheff
Copy link
Member

nsheff commented May 17, 2024

Ok, I can definitely clarify that rationale better, or tone down the dismissive language. It may actually be useful to try to dissuade people from using this, because, in my opinion, using this is a bad move that is less useful than using the other, better features of sequence collections. So, I think the dismissive language here may be warranted and useful and instructive, but I haven't effectively explained why/the alternative.

To understand your use case, one thing I'm confused about: You're looking to compare coordinate spaces:

"do these resources represent variants in the same coordinate space?".

But then wouldn't you use sorted_name_length_pairs? It's exactly for looking for shared coordinate systems. I guess you're saying you don't have sequence names -- but if there are no sequence names, then you don't have a coordinate space at all, and I don't see how you have variant definitions, either. How can you describe variants without sequence name? The variants have to annotate a certain sequence in the collection. So I'm missing your point here. You'd need sequence names, and sorted_sequences won't help you anyway, right?

A few other thoughts:

  • You're neglecting the complexity of subsets -- what if one collection has an additional sequence added? Then the digests won't match anyway. I think you're overestimating the probability that people have different collections with the same sequence content, but in different order and with different names and no extra or missing sequences in the collection, which are the requirements you'd need to match on sorted_sequences. On the other hand, all this information is immediately provided by the comparison function.
  • The comparison function can be calculated even if the service only recognizes one of the digests, as long as you have the canonical representation of the other. If you don't have either representation, then that's where you have the problem. For that situation, we would need 'meta-servers' that broker digest requests to servers. But I think even if you're using sorted_sequences, you're still going to be in this situation a lot of the time, due to the above point.

nsheff added a commit that referenced this issue May 17, 2024
@nsheff
Copy link
Member

nsheff commented May 17, 2024

I toned down the dismissive language in the ADR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants