Skip to content
This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

[schema] aggregation-share/sum-part has incorrect cardinality of sum and batch_uuid #3

Closed
yuriks opened this issue Sep 18, 2020 · 3 comments
Assignees
Labels
p1 Must be fixed for the corresponding milestone to be reached. schema-protocol issues with encoding formats or protocol between servers requiring coordination outside ISRG
Milestone

Comments

@yuriks
Copy link
Contributor

yuriks commented Sep 18, 2020

In the original protobuf schema defined in the IDL document, PrioSumPart contained a single bytes value_sum field, and a repeated batch_uuid list. The batch_uuid list represents the uuids of all of the batches (tracing back to the ingestor) that participated in this aggregation.

At some point in the conversion to Avro, this got flipped and the current schema now has only a single batch_uuid, but an array of sums.

Not only does this not let us properly represent the batch_uuids when we do the final multi-batch reduction sum, the sums array will also always contain only one value (because all aggregations we'll do always produce a single output sum per file, both the per-batch sum, and the overall sum for a given time range), so this seems like an oversight and should be fixed to match the intent of the original schema.

tgeoghegan added a commit that referenced this issue Sep 18, 2020
 - The sum part contains UUIDs of all the data share batches that were
   accumulated and so "batch_uuid" becomes an array.
 - Update the sums field to make it more clear what it represents.

Addresses #3
tgeoghegan added a commit that referenced this issue Sep 18, 2020
 - The sum part contains UUIDs of all the data share batches that were
   accumulated and so "batch_uuid" becomes an array.
 - Update the sums field to make it more clear what it represents.

Addresses #3
@tlepoint
Copy link

Actually, the sums array needs to be an array, because the sum is done over a vector containing n bins. We can choose the terminology that you prefer (one sum of a vector, or many individual sums, the current avro file choosing the latter formulation). In the original proto schema, sum was represented as bytes (assuming proper serialization): I modified this to be int64 to be consistent with the f_r, g_r, h_r, and r_pit terms.

Thanks for catching the error for the batch_uuid, this indeed should be an array. I may have been thrown off by the comment above the batch_uuid field.

@tgeoghegan
Copy link
Collaborator

You're quite right @tlepoint, and Yuri and I realized as much last Friday, but failed to say as much in this issue. I think what I currently have in #2 is consistent with what you're saying.

@tgeoghegan tgeoghegan added the schema-protocol issues with encoding formats or protocol between servers requiring coordination outside ISRG label Sep 21, 2020
@tgeoghegan tgeoghegan added this to the pha-integration milestone Sep 21, 2020
@tgeoghegan tgeoghegan added the p1 Must be fixed for the corresponding milestone to be reached. label Sep 21, 2020
@tgeoghegan tgeoghegan modified the milestones: pha-integration, prototype Sep 21, 2020
@tgeoghegan tgeoghegan self-assigned this Sep 21, 2020
tgeoghegan added a commit that referenced this issue Sep 21, 2020
 - The sum part contains UUIDs of all the data share batches that were
   accumulated and so "batch_uuid" becomes an array.
 - Update the sums field to make it more clear what it represents.

Addresses #3
@tgeoghegan
Copy link
Collaborator

Addressed in #2. Please file a new issue for further schema amendments, referencing this one if necessary.

aaomidi pushed a commit that referenced this issue Feb 2, 2021
aaomidi pushed a commit that referenced this issue Feb 3, 2021
* Explore upgrading to ureq 2

* Feedback #1

* Feedback #2

Co-authored-by: Tim Geoghegan <timg@letsencrypt.org>

* Feedback #3

Co-authored-by: Tim Geoghegan <timg@letsencrypt.org>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
p1 Must be fixed for the corresponding milestone to be reached. schema-protocol issues with encoding formats or protocol between servers requiring coordination outside ISRG
Projects
None yet
Development

No branches or pull requests

3 participants