-
Notifications
You must be signed in to change notification settings - Fork 2
Refactor and clarify the count and count_distinct code a bit #142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc474f2 to
503d57b
Compare
| { | ||
| int64 max_flattening; | ||
| int64 max_flattened_count; | ||
| int64 max_flattened_count_with_max_flattening; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some verbosity is good, but such a long field name is very weird to work with. Makes all the lines it appears in extra long.
| if (flattening >= accumulator->max_flattening) | ||
| { | ||
| accumulator->max_flattening = flattening; | ||
| /* Get the largest flattened count from the ones with the maximum flattening. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting this comment next to the field definition would have been better IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was on the fence, opted for less comments. As this is contentious, I'll revert at nearest opportunity and move/copy the comment 👍. Comment only here is definitely not enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach here would be to group related fields into nested structs. Not sure if this works, but it would shorter and clearer:
typedef struct CountResultAccumulator
{
struct
{
int64 amount;
int64 count;
} max_flattened;
};There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, it seems like the pg_diffix implementation is different from reference. Once we are able to confirm this (slack), we'll have to redo the naming accordingly. Let's postpone the naming thread till then.
| static const int COUNT_DISTINCT_AIDS_OFFSET = 2; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can omit the COUNT_DISTINCT_ prefix since this constant is private to this module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And why OFFSET? We use index everywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the convention from count.c https://github.com/diffix/pg_diffix/blob/master/src/aggregation/count.c#L39. So INDEX is for a single thing, OFFSET is to indicate where the series of variadic things begins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prefix there makes sense because we have 2 aggs in the same file. Here it's just one so it can be removed safely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So INDEX is for a single thing, OFFSET is to indicate where the series of variadic things begins.
OK, makes sense.
| /* Maps values per-AID given the list of low-count tracker entries and an AID values set index. */ | ||
| static List *transpose_lc_values_per_aid(List *lc_entries, int aidvs_index, uint32 *lc_values_true_count) | ||
| static List *transpose_lc_values_per_aid(List *lc_entries, int aid_index, uint32 *lc_values_true_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed these changes. The index is for the set (comment above), not for some particular AID value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should call the aid parameter index something else. aid_pos/aid_no/aid_offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, I think we covered this in #139 (comment). aidvs is more naturally expanded to AID valueS, so could be confused with the "vertical" dimension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That as well. I can't tell if it's pluralized or the set.
| * The number of low count values has to be anonymized. | ||
| */ | ||
| static CountDistinctResult count_distinct_calculate_final(DistinctTracker_hash *tracker, int aidvs_count) | ||
| static CountDistinctResult count_distinct_calculate_final(DistinctTracker_hash *tracker, int aids_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. this is the number of aid instances, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
answered in the other thread
Squashed version of #139, see there for discussion. Diff should be the same as #139, retaining #139 intact to keep the discussions up.