Update digest serialization rules in docs #410

korikuzma · 2022-11-01T18:03:48Z

The digest serialization docs do not explicitly say how to handle arrays of objects that are not identifiable (e.g., the GenotypeMember objects in Genotype.members) when the array still has the property ordered=False. Below is a proposed example from @ahwagner where we serialize each GenotypeMember and sort Genotype.members based on the serialized strings.


larrybabb · 2022-11-02T17:57:51Z

@ahwagner sorry to need reminding (again), but why is the ordering important to the genotypeMembers array?

ahwagner · 2022-11-02T20:00:45Z

@larrybabb the members array has property ordered: false (here), indicating that the order of elements in the Genotype.members array is not meaningful. The consequence of this is that when creating a computed digest for the Genotype, we need to have a consistent approach to ordering these elements so we get the same digest regardless of the element order.

To date, we have only had identifiable objects in arrays, and for those we would compute the digests of the array objects and then sort the array lexicographically. In this new case, Genotype.members elements are GenotypeMember JSON objects that are not identifiable, so we need a strategy for ordering them. My proposal, captured by @korikuzma above, is that we extend our existing serialization strategy to first serialize non-identifiable objects in arrays at the same time we compute digests for identifiable objects in arrays. Then all array objects are sorted lexicographically. I think the PR associated with this issue should also include some documentation updates for the digest serialization strategy to make this clear.
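A minimal Python sketch of this proposal, for readers following along. It is not the VRS-Python implementation: the canonical JSON form, the `IDENTIFIABLE_TYPES` set, the helper names, and the simplified member objects are all assumptions made for illustration. Identifiable elements are reduced to a truncated SHA-512 digest, non-identifiable elements to their serialized string, and the resulting strings are sorted lexicographically.

```python
import base64
import hashlib
import json

# Assumption for illustration: which "type" values count as identifiable.
IDENTIFIABLE_TYPES = {"Allele"}


def canonical_json(obj):
    """Serialize with sorted keys and compact separators (an assumed canonical form)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def truncated_digest(text):
    """Base64url-encode the first 24 bytes of the SHA-512 of the text."""
    raw = hashlib.sha512(text.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def sortable_string(element):
    """Digest identifiable objects; serialize non-identifiable ones."""
    serialized = canonical_json(element)
    if element.get("type") in IDENTIFIABLE_TYPES:
        return truncated_digest(serialized)
    return serialized


def serialize_unordered_array(elements):
    """Return the elements' strings in lexicographic order."""
    return sorted(sortable_string(e) for e in elements)


# Hypothetical, heavily simplified objects; the field values are made up.
member_a = {"type": "GenotypeMember", "variation": "ga4gh:VA.aaa", "count": 1}
member_b = {"type": "GenotypeMember", "variation": "ga4gh:VA.bbb", "count": 1}

first = serialize_unordered_array([member_a, member_b])
second = serialize_unordered_array([member_b, member_a])
assert first == second  # input order no longer matters
```

Because every element is mapped to a string before sorting, the same comparison rule applies to identifiable and non-identifiable elements alike.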

larrybabb · 2022-11-02T20:59:00Z

@ahwagner got it. thank you for re-explaining that.
Question: what if we dropped the ordered property altogether and presumed that all arrays are unordered (I presume that will be the rule rather than the exception)? If an attribute that is an array comes along that needs to be ordered, then we can add a special attribute to it to describe how it should be sorted or ordered (assuming there might be more than one way?). If there is only going to be one way to "order" arrays that need to be ordered, then we can possibly name the attribute with the prefix orderedXXX, thus avoiding the need for this additional boolean property.

It's just that an ordered: t/f attribute in a json message doesn't tell me much in terms of being able to validate the information on the receiving end. If we have to presume the order is correct and must be preserved then qualifying the attribute with ordered should/would provide just as much information as a boolean flag. We've avoided flags like this to date. It seems like we may want to continue avoiding them.

if the genotype.members was an ordered array then we would call it genotype.orderedMembers and presume that whatever order the members are placed in the array must be preserved. Another possibility is to define a class called OrderedArray that can be used for any attributes that need to have their order preserved.

Thoughts?

reece · 2022-11-03T17:09:45Z

Do I understand correctly that the proposal is to add a non-standard "ordered" attribute to the members property of the message?

larrybabb · 2022-11-03T17:23:15Z

I think so. My questions above are for @ahwagner and offer some other options possibly. Let's see how he responds.

ahwagner · 2022-11-03T21:28:13Z

@reece and @larrybabb there are two concerns here. Tagging @andreasprlic because this is an important technical implementation discussion he should also weigh in on.

Concern 1. we need to sort some JSON arrays and not sort others for digest serialization

JSON Schema does not differentiate between arrays and sets, but VRS does. We represent all sets in VRS (e.g. VariationSet.members, Genotype.members) as arrays in JSON Schema, and this represents the typical case for an array in VRS. However, when we introduced ComposedSequenceExpressions (CSEs), we needed a way of differentiating between ordered arrays for CSEs and unordered arrays for everything else. At the time, our digest serialization rules were sufficient to resolve this, as we only reordered arrays if they contained identifiers (as pointed out by @reece, here). Since CSEs were not identifiable, they did not get changed into identifiers, and no sorting took place.

Later, however, we added Genotype, which contained an unordered members array with non-identifiable GenotypeMember objects. Because these are non-identifiable, our current digest serialization rules treat this array the same way they treat a CSE array: as ordered (in this case, incorrectly). This issue (#410) and the associated PR (#409) were created to prompt discussion about updating the serialization rules to handle this situation. In addition to the need to sort this array, we need to define a mechanism that allows us to sort JSON Objects (which are evaluated as dicts in Python and have no default sorting comparison behavior). The proposal above defines a sorting behavior.
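To illustrate the point about dicts, here is a short Python sketch. It shows that `sorted` cannot compare dicts directly, and that supplying a key function over a serialized form gives a well-defined order. The `sort_key` helper and the sample objects are hypothetical and stand in for whatever serialization the rules end up specifying.

```python
import json

# Hypothetical, simplified array elements; the values are made up.
members = [{"variation": "ga4gh:VA.bbb"}, {"variation": "ga4gh:VA.aaa"}]

# Python dicts do not define "<", so sorting them directly fails.
try:
    sorted(members)
    dicts_are_orderable = True
except TypeError:
    dicts_are_orderable = False


def sort_key(obj):
    """Map a JSON object to a string that can be compared lexicographically."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


# With an explicit key, the order is defined and independent of input order.
ordered_members = sorted(members, key=sort_key)
```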

As an aside, one potential decision that would sidestep this issue (for now) is to make GenotypeMember objects identifiable, but I think this is not the correct design choice as GenotypeMembers are intended to be used only as a nested class within Genotype. Making them globally identifiable is contrary to that intended use. If there is disagreement on that underlying decision the conversation should start there.

Concern 2. define a mechanism that allows us to uniformly indicate sort behavior across classes

During the GKS-Pilot work we opted to use the ordered attribute in the JSON Schema (as opposed to within-message solutions) to handle the challenge of Concern 1. As the JSON Schema is parsed by VRS-Python and the JSON Schema specification allows for custom attributes, this seemed like a good solution to explicitly define ordered/unordered array behavior on a per-array level without increasing message size or changing digests of VRS objects. We discussed the advantages of a schema-level attribute on this thread. @larrybabb I think we saw eye-to-eye on this at the time, but @reece and @andreasprlic did not weigh in there, so it seems appropriate to me to revisit this decision and get their comments and come to a consensus decision.
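As a rough illustration of the schema-level approach, the Python sketch below reads a custom `ordered` keyword from a schema fragment that has already been parsed into a dict. The fragment, the `$ref` path, the `array_is_ordered` helper, and the fallback value used when the keyword is absent are assumptions for this example, not the actual VRS schema or VRS-Python code.

```python
# Hypothetical, trimmed-down schema fragment; only the "ordered" keyword
# itself comes from the discussion above.
GENOTYPE_SCHEMA = {
    "type": "object",
    "properties": {
        "members": {
            "type": "array",
            "ordered": False,
            "items": {"$ref": "#/definitions/GenotypeMember"},
        },
        "count": {"type": "integer"},
    },
}


def array_is_ordered(class_schema, property_name, default=True):
    """Look up the custom "ordered" keyword for an array property.

    The default used when the keyword is missing is an assumption here.
    """
    prop = class_schema["properties"][property_name]
    if prop.get("type") != "array":
        raise ValueError(f"{property_name!r} is not an array property")
    return prop.get("ordered", default)


members_ordered = array_is_ordered(GENOTYPE_SCHEMA, "members")
```

Since the keyword lives in the schema rather than in the message, a lookup like this does not change message size or the content that gets digested.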

There are many approaches we may take to address this concern, including some previously suggested ones:

Schema-based approaches (not in message)

ordered boolean property for arrays in JSON Schema (currently implemented solution)

Message-based approaches (defined in schema and explicit in message)

index property added to ordered messages (proposed by @larrybabb here)
ordered prefix for properties with arrays where order is meaningful (proposed in this issue, #410, here)
orderedArray class that contains an array that is ordered (proposed on 12/13/21 call here)

Documentation approaches (implementation concern only; not in message or schema)

Revise serialization rules and/or class-specific implementation guidance to describe the expected sort behavior of arrays

I ask that we keep the discussion on Concern 1: array sorting proposal in this thread and discuss Concern 2: indicating sort behavior (which is dependent on resolving the first concern) in a separate issue (#411).

larrybabb · 2022-11-04T16:16:37Z

@ahwagner I get what you are saying. Thank you (again) for taking the time and effort to lay out the details above.

It seems like the 2 issues are:

1. When an array contains items that have no specified ordering and are also un-identifiable (no computed digest), we need a mechanism to make sure we can digest the parent class of the array element consistently. This is the Genotype.members problem we are currently trying to address.
2. When an array needs to be in a specific order, whether the elements are identifiable or not, we need a mechanism to ensure the ordering is preserved. This one is less clear to me here because in the CSE example you cite above there is no value in the json that would allow one to understand what the intended order is. I'm guessing we simply presume it is in the correct order with no way of verifying it? This seems a bit risky, and I would need some more clarity on why this is acceptable to us. If we are saying that we are not responsible for the ordering, that we presume the order in which the data was created is the correct one, and that the data does not need to contain any values (i.e. indices) to confirm the order, then we really shouldn't order anything ever.

The CSE case should probably be set up as a linked list or nesting construct to make sure that the chain/hierarchy is semantically included in the data. Maybe all ordered lists should use a designed construct to preserve those semantics?

For any data that has no ordering we would only need to solve the issue of digesting un-identifiable components in lists. For this we should assume that any items in a Value object array would meet the requirement of being a value object too, and we should simply digest these elements and sort them by their value object digest, even though we don't ever persist or preserve these non-identifiable elements' ids.

Again, I apologize for re-surfacing these issues. I think the idea of having an ordered flag attribute in any/all array parent classes is a bit difficult to accept as a final solution to this issue.

ahwagner · 2022-11-06T13:05:31Z

Separating out these distinct concerns between this thread and #411.

Addressing here:

When an array contains items with no specified ordering and are also un-identifiable (no computed digest) we need a mechanism to make sure we can digest the parent class of the array element consistently. This is the Genotype.member[] problem we are currently trying to address.
...
For any data that has no ordering we would only need to solve the issue of digesting un-identifiable components in lists. For this we should assume that any items in a Value object array would meet the requirement of being a value object too, and we should simply digest these elements and sort them by their value object digest, even though we don't ever persist or preserve these non-identifiable elements' ids.

I think this approach is very similar to the proposal laid out above. The array content needs to be digest serialized, then digested. I had proposed we sort on the digest serialized outputs to save on the extra compute of creating digests (since these are not persisted anyways). Is there a reason we want to take the extra step to create digests for these objects, e.g. consistency with the approach for identifiable objects? I'm okay with that, but just want to acknowledge the extra compute expense associated with this decision.

ahwagner · 2022-11-07T17:11:19Z

On the 11/7 leads call, @larrybabb, @andreasprlic, and @ahwagner agreed that sorting on digests (even for non-identifiable objects) is the more consistent approach.

Proposed Resolution: digest ALL JSON Objects in arrays for sorting during digest serialization, UNLESS the array order is meaningful (indicated as described in resolution of #411).
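The resolution can be sketched in Python as follows. This is an illustration under stated assumptions, not the code merged in #409: the canonical JSON form, the helper names (`digest`, `serialize_array`, `parent_digest`), and the simplified Genotype objects are hypothetical. Every object in the array is digested; the digests are sorted only when the array's order is not meaningful.

```python
import base64
import hashlib
import json


def canonical_json(obj):
    """Serialize with sorted keys and compact separators (an assumed canonical form)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def digest(obj):
    """Base64url-encode the first 24 bytes of the SHA-512 of the serialized object."""
    blob = canonical_json(obj).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def serialize_array(items, ordered):
    """Digest every object; keep the given order only if it is meaningful."""
    digests = [digest(item) for item in items]
    return digests if ordered else sorted(digests)


def parent_digest(parent, array_property, ordered):
    """Digest a parent object after replacing its array with element digests."""
    prepared = dict(parent)
    prepared[array_property] = serialize_array(parent[array_property], ordered)
    return digest(prepared)


# Hypothetical, heavily simplified objects; the field values are made up.
m1 = {"type": "GenotypeMember", "variation": "ga4gh:VA.aaa", "count": 1}
m2 = {"type": "GenotypeMember", "variation": "ga4gh:VA.bbb", "count": 1}
g1 = {"type": "Genotype", "members": [m1, m2]}
g2 = {"type": "Genotype", "members": [m2, m1]}

# Unordered: both element orders give the same parent digest.
assert parent_digest(g1, "members", ordered=False) == parent_digest(g2, "members", ordered=False)
# Ordered: element order is preserved, so the parent digests differ.
assert parent_digest(g1, "members", ordered=True) != parent_digest(g2, "members", ordered=True)
```

Digesting each element costs an extra hash per object compared with sorting on the serialized strings, which is the compute trade-off noted earlier in the thread.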

korikuzma · 2022-11-08T17:39:03Z

Here is an updated visualization of the proposed solution. Since ordered=False for Genotype.members, the final serialization will always be the same regardless of the order of Genotype.members.

ahwagner · 2022-11-09T12:04:54Z

Implemented in #409

korikuzma · 2022-11-09T12:05:56Z

@ahwagner I did not update the docs with this change. Did you want me to do this in a separate PR?

ahwagner · 2022-11-09T12:07:58Z

Yes, good catch. Reopening this issue until the documentation is updated.

github-actions · 2024-01-09T02:08:26Z

This issue was marked stale due to inactivity.

korikuzma assigned ahwagner Nov 1, 2022

This was referenced Nov 2, 2022

feat: digests take ordered property in arrays into consideration (#116) ga4gh/vrs-python#118

Merged

fix: models.yaml to take ordered property in account #409

Merged

ahwagner mentioned this issue Nov 3, 2022

indicating sort behavior across VRS classes #411

Closed

ahwagner closed this as completed Nov 9, 2022

ahwagner reopened this Nov 9, 2022

github-actions bot added the Stale See .github/workflows/stale.yml label Jan 9, 2024

ahwagner removed the Stale See .github/workflows/stale.yml label Jan 9, 2024
