Skip to content

Conversation

@pdobacz
Copy link
Contributor

@pdobacz pdobacz commented Jan 4, 2022

This is a loose collection of things I've picked up along the way, while exploring and figuring out how things work in here.

It's all split by commits, so we can pick only those which make sense. I tried to use my common sense understanding of things and (a little bit) reference code.

review commit by commit; they should be rather atomic steps. Also, if you don't like a commit codewise, let me know if I was on the correct track.

If anything worthy of master comes out of this exercise, I'll squash.

@pdobacz pdobacz requested a review from edongashi January 4, 2022 17:21
{
int64 max_flattening;
int64 max_flattened_count;
int64 max_flattened_count_with_max_flattening;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is somewhat along the lines of noise_with_max_sigma. This field doesn't hold the max_flattened_count, but a max from the subset having max_flattening. I thought I could trade removing a comment for a longer name.

{
Datum value; /* Unique value */
List *aidvs; /* AID value sets for the unique value */
List *aidvs; /* List of (hashes of) AID value lists, one for each AID instance */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the items are hash sets, not lists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be wrong. I have no clue how count distinct works here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, they're lists, it confused me on my reading, so I prefer an explicit version

bool insufficient_data = false;
CountResultAccumulator result_accumulator = {0};

for (int aidvs_index = 0; aidvs_index < aids_count; aidvs_index++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: aidvs means AID value set. aid_index also works to know that we're referring to some particular AID type instead of an aid value in a list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, this clarifies this a bit. I thought this is AID values. I think aid_index dispels the ambiguity

if (list_length(aidv) == max_size) // set is full, value is not low-count
return aidv;
return list_append_unique_ptr(aidv, (void *)aid);
return list_append_unique_ptr(aidv, (void *)aid_hash);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is wrong here (not your changes). list_append_unique_ptr is being called for a value type. This assumes we're running on a 64 bit system, otherwise we lose half of the AID when storing to the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd need some more ramp up to figure out a fix for this - could you open an issue pls?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will do after investigating how count distinct works.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to assume 64-bit system. We should have a compile time assert for that, if one doesn't exist already.


/* ----------------------------------------------------------------
* anon_count(any, aids...)
* anon_count_any(value, aids...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep this any to mirror postgres.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But ok since it matches with the oid comments below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a back and forth here. I think value is consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but value is not a valid type. Neither is aids but we can't help it because we need the any polymorphism.

*/

CREATE FUNCTION diffix.anon_count_any_transfn(internal, "any", variadic "any")
CREATE FUNCTION diffix.anon_count_any_transfn(internal, "any", variadic aids "any")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put the value name here instead: anon_count_any_transfn(internal, value "any", variadic aids "any")? Does this work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did so, and it worked, then I took it back to not overthink. If you like it, I'll reintroduce 👍

edongashi
edongashi previously approved these changes Jan 5, 2022
*/

CREATE FUNCTION diffix.anon_count_distinct_transfn(internal, "any", variadic aids "any")
CREATE FUNCTION diffix.anon_count_distinct_transfn(internal, value "any", variadic aids "any")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have value here now I'd revert to any in the comments (both sql and C). I think it's better because we have no typing in C-land and that's the closest we can have to a signature. This will become noticeable when we get more UDFs and possibly overloads. Imagine we have a max function of datetime and integer, would you rather write them as

max(value)
max(value)

or

max(integer)
max(datetime)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you put any, aids... there it doesn't make much sense, neither does any, any.... Documenting "what" the argument represents is more useful than the any type. Since those are caption comments, maybe we can drop the arguments - everything is properly documented in the CREATE ... statements anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aid has special meaning for us because we produce the aid adapters in code and we know right away what to expect. any, aids... is perfectly fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let it be, you mean like this, correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit looks nice. We have experimented with making it a proper postgres type but without success and ended up with our current design with the AidSpec stuff.

@pdobacz
Copy link
Contributor Author

pdobacz commented Jan 5, 2022

superseded by #142

@pdobacz pdobacz closed this Jan 5, 2022
@pdobacz pdobacz deleted the piotr/clarify-aid-aidv-naming branch January 5, 2022 13:05
@pdobacz pdobacz restored the piotr/clarify-aid-aidv-naming branch January 5, 2022 13:05
@pdobacz pdobacz deleted the piotr/clarify-aid-aidv-naming branch May 19, 2022 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants