Improve distinct/uniq behaviour #1478

Merged
merged 3 commits into from
May 17, 2022

Conversation

buixor
Contributor

@buixor buixor commented Apr 25, 2022

In the existing implementation, the effectiveness of distinct is directly tied to the cache_size parameter: the filter scans the bucket's queue of objects "live" to determine uniqueness, so it only sees as many past objects as cache_size retains. For a scenario such as crawl-http-non-statics, this can have a significant impact in terms of potential false positives. This PR gives the distinct filter its own cache, so that it is no longer affected by cache_size.
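
To make the failure mode concrete, here is a minimal sketch of the two approaches. It is not the project's actual code: the types `queueDistinct` and `cachedDistinct`, the `Pour` method, and the `countAccepted` helper are hypothetical names chosen for illustration, and the model assumes the bounded queue keeps only the most recent `cacheSize` accepted values.

```go
package main

import "fmt"

// queueDistinct is a hypothetical model of the old behaviour: uniqueness is
// decided by scanning a queue that retains at most cacheSize accepted values.
type queueDistinct struct {
	cacheSize int
	queue     []string
}

// Pour reports whether v is considered distinct, and records it if so.
func (q *queueDistinct) Pour(v string) bool {
	for _, old := range q.queue {
		if old == v {
			return false // still in the queue: correctly rejected
		}
	}
	q.queue = append(q.queue, v)
	// Drop the oldest entries once the queue exceeds cacheSize.
	if len(q.queue) > q.cacheSize {
		q.queue = q.queue[len(q.queue)-q.cacheSize:]
	}
	return true
}

// cachedDistinct is a hypothetical model of the new behaviour: the filter
// keeps its own set of seen values, independent of any queue size.
type cachedDistinct struct {
	seen map[string]struct{}
}

// Pour reports whether v has never been seen before, and records it if so.
func (c *cachedDistinct) Pour(v string) bool {
	if _, ok := c.seen[v]; ok {
		return false
	}
	c.seen[v] = struct{}{}
	return true
}

// countAccepted feeds every value to pour and counts how many were accepted.
func countAccepted(pour func(string) bool, values []string) int {
	n := 0
	for _, v := range values {
		if pour(v) {
			n++
		}
	}
	return n
}

func main() {
	// Three distinct paths, two of them requested twice.
	values := []string{"/a", "/b", "/c", "/a", "/b"}

	old := &queueDistinct{cacheSize: 2}
	fixed := &cachedDistinct{seen: map[string]struct{}{}}

	fmt.Println("queue-based:", countAccepted(old.Pour, values))
	fmt.Println("own cache:  ", countAccepted(fixed.Pour, values))
}
```

With a `cacheSize` of 2, the queue-based model forgets `/a` and `/b` before they reappear and counts all five requests as distinct, while the dedicated cache counts three. In a leaky bucket, those extra counts are what would push it toward overflow and produce a false positive.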

Contributor

@sabban sabban left a comment

lgtm

@sabban sabban added this to the 1.4.0 milestone Apr 27, 2022
@buixor buixor merged commit fbcb2ed into master May 17, 2022
@buixor buixor deleted the improve_distinct branch May 17, 2022 10:45
mmetc pushed a commit that referenced this pull request May 19, 2022
* make uniq/distinct use a cache that is independent of the bucket's cache_size

* add testing specifically for cache_size
3 participants