Skip to content

Made update_with_coupon public#113

Merged
notfilippo merged 1 commit intoapache:mainfrom
fulmicoton-dd:main
Apr 10, 2026
Merged

Made update_with_coupon public#113
notfilippo merged 1 commit intoapache:mainfrom
fulmicoton-dd:main

Conversation

@fulmicoton-dd
Copy link
Copy Markdown
Contributor

We need update_with_coupon to be public for the following reason.

We are running the equivalent of a

SELECT DISTINCT(some_field)
GROUP BY bucket_field;

some_field is dictionary encoded. Fetching the actual string value from value id associated to it is not a cheap operation.
For this reason, we first perform the group by, collecting, for each bucket the value ids encounterred.

We then compute once and for all the mapping from term_id to term and store it in a cache.
We can then reuse this mapping when building the hll sketch for each independent bucket.

In order to avoid dealing with strings, we would like to store, not the strings, but their coupons.

Copy link
Copy Markdown
Member

@notfilippo notfilippo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case makes sense.

I was thinking we could also potentially accept in the sketch any value that implements a trait with a coupon() method. Then we could implement this trait by default for all Hash, and leave it implementable by users to support this use case with a bit more intentionality.

Anyhow that's for a follow-up. This change does not break the current API so it looks good on my side.

@fulmicoton-dd
Copy link
Copy Markdown
Contributor Author

Thank you!

@notfilippo notfilippo merged commit bd0e8b5 into apache:main Apr 10, 2026
9 checks passed
@jmalkin
Copy link
Copy Markdown

jmalkin commented Apr 10, 2026

We've historically treated this as an anti-pattern. It makes it really, really easy to destroy the contents of a sketch by, say, mixing hash functions. For this use case, I'd have said you're better off just submitting hashes and letting the sketch hash again.

@notfilippo
Copy link
Copy Markdown
Member

@jmalkin I definitely agree in principle. Would a better comment on the method be enough to warn about the side effects of this?

In my opinion preventing advanced use cases like this hinders adoption and I would prefer an approach where we supply all the tools, while warning about the risks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants