Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roaring bitmaps by default #9548

Merged
merged 4 commits into from
Mar 24, 2020
Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Mar 20, 2020

Description

I think it is finally time to switch to using Roaring bitmaps instead of using CONCISE by default. Using Druid with Roaring is rather well tested by now, and in most cases I think it is going to provide a better out of the box experience, where the speed is generally worth the potential for larger segment sizes that come with high cardinalities. There will still exist cases of datasets with ultra high cardinality columns where CONCISE might produce smaller segments due to the overhead of the Roaring format, but it makes sense to me for the operator to opt into the decision of wanting the smallest possible segments at the potential cost of speed, rather than that being the default.

Related: http://db.ucsd.edu/wp-content/uploads/2017/03/sidm338-wangA.pdf (though this paper is using a from scratch custom version of roaring apparently)


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • DefaultBitmapSerdeFactory

@himanshug
Copy link
Contributor

sounds good, but can you document a version of PR description in the doc file where configuration is (or maybe create a separate bitmaps.md file and link to that), so that users can make an informed decision. I know the paper is linked but it would be useful if summary of pros/cons is listed in Druid docs itself.

@clintropolis
Copy link
Member Author

sounds good, but can you document a version of PR description in the doc file where configuration is (or maybe create a separate bitmaps.md file and link to that), so that users can make an informed decision. I know the paper is linked but it would be useful if summary of pros/cons is listed in Druid docs itself.

I added a 'compression' section to the segment documentation page, that attempts to dissuade people from changing the defaults unless they verify that the settings are in fact better for their use case (which is why I left this off at first, because people probably shouldn't be changing these unless they know what they are doing imo).

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 after CI.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, including the new docs.

@jihoonson jihoonson merged commit bf85ea1 into apache:master Mar 24, 2020
@himanshug
Copy link
Contributor

I added a 'compression' section to the segment .....

LGTM, thanks

@clintropolis clintropolis deleted the roaring-by-default branch March 24, 2020 02:44
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants