Make Doubles aggregators use 64bits by default #5478
Change-Id: Ia4f442037052add178f6ac68138c9d52f96c6e09
@b-slim can you please include examples of when 64 vs 32 made a difference, and also record the impact in size and performance for switching from float to double?
LGTM. I am also interested in the numbers @drcrallen asked for, but I am ok with the PR regardless since the change has been planned for some time.
docs/content/configuration/index.md
doubleSum, doubleMin, and doubleMax aggregators at indexing time. To instead use 64-bit floats
for these columns, please set the system-wide property `druid.indexing.doubleStorage=double`.
This will become the default behavior in a future version of Druid.
Prior to version `0.13.0` Druid's storage layer uses a 32-bit float representation to store columns created by the
Grammar / formatting:
- "used a 32-bit", not uses
- I'd say don't backtick-format 0.13.0
- borderline on whether doubleSum/doubleMin/doubleMax should be backtick-formatted; I probably wouldn't
- "keep the old format" not "keep old format"
- I think it'd be clearer to replace the last sentence with "Support for 64-bit floating point columns was released in Druid 0.11.0, so if you use this feature then older versions of Druid will not be able to read your data segments."
LGTM.
Also labelled |
Change-Id: I5a588f7364f236bf22f2b138e9d743bfb27c67fe
Thanks, I have fixed the doc and added a sentence about the pros and cons of using Double columns.
@b-slim What I was thinking was that as people decide if they want to use floats or doubles, they will weigh the tradeoff of accuracy against storage size / performance. Presumably doubles use more space. They may be slower too (?). People can do their own tests but it might still be useful for us to publish some numbers.
@gianm Fair enough. For the speed part, I thought the effect would be negligible since we are using 64-bit primitive doubles to aggregate, but you are right, we may take a hit from deserialization and loading into memory. Will add that to my list.
@b-slim @drcrallen @gianm @leventov I've been running benchmarks and collecting speeds and sizes for integer columns, following up on #4080 (comment). I got curious late one night last week and ran the same(ish) benchmarks on floats and doubles, and the results are pretty much what I expected. To summarize, I see a 10-50% decrease in select speed, depending on the number of rows selected and the type of value distribution, and up to double the encoded size. Note: the dip at the end of the plots is an artifact of the parameters for number of filtered rows the benchmarks were run with didn't include
In my experimental branch, I've modified the zipf value generator to use Math.random positive distribution:
[Benchmark plots not preserved. Panels: Zipf distribution; Zipf distribution scaled with Math.PI; Zipf distribution scaled with Math.PI (half zeros); Zipf distribution scaled with Math.PI (mostly zeros).]
To put that into percentage slower: [figures not preserved]
Don't let this dissuade anyone else from running benchmarks to validate what I see here.
@clintropolis Thanks for doing this! Quick question: when you say floats, does that mean the 32-bit Double columns or the new 32-bit Float aggregators?
@b-slim these were just run directly against |
@clintropolis ColumnarFloats uses 32 bits to both aggregate and store, i.e. true 32-bit floats. I thought the ask here was to compare the two variants of Double columns: variant 1 is the old-school one, stored as 32 bits and aggregated as 64 bits; variant 2 is stored as 64 bits and aggregated as 64 bits. I hope that makes sense.
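To make the two Double variants concrete, here is a minimal sketch in plain Java. It is not Druid code: the class and method names are hypothetical, and the narrowing cast only stands in for what a 32-bit storage format does to a value before it is read back and aggregated in a 64-bit accumulator.

```java
public class DoubleStorageSketch {

    // Variant 1 (hypothetical model): each value is narrowed to a 32-bit float
    // when stored, then widened back to double when it is aggregated.
    static double sumStoredAsFloat(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            float stored = (float) v; // precision is lost here, at "storage" time
            sum += stored;            // aggregation itself still runs in 64 bits
        }
        return sum;
    }

    // Variant 2 (hypothetical model): values are stored and aggregated as 64-bit doubles.
    static double sumStoredAsDouble(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 16777217 is 2^24 + 1: exactly representable as a double, but not as a float.
        double[] values = {16777217.0};
        System.out.println("stored as float:  " + sumStoredAsFloat(values));
        System.out.println("stored as double: " + sumStoredAsDouble(values));
    }
}
```

With the single input 16777217 (2^24 + 1), the float-stored variant returns 16777216 while the double-stored variant returns the input unchanged, so the loss comes from storage, not from the 64-bit aggregation.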
Issue description is here #5462