Make Doubles aggregators use 64bits by default #5478

Merged: 2 commits into apache:master on Mar 20, 2018

Conversation

@b-slim (Contributor) commented Mar 11, 2018

The issue description is in #5462.


Change-Id: Ia4f442037052add178f6ac68138c9d52f96c6e09
@drcrallen (Contributor)

@b-slim can you please include examples of when 64 vs 32 made a difference, and also record the impact in size and performance for switching from float to double?

@gianm (Contributor) left a comment

LGTM. I am also interested in the numbers @drcrallen asked for, but I am ok with the PR regardless since the change has been planned for some time.

Prior to version `0.13.0` Druid's storage layer uses a 32-bit float representation to store columns created by the
doubleSum, doubleMin, and doubleMax aggregators at indexing time. To instead use 64-bit floats
for these columns, please set the system-wide property `druid.indexing.doubleStorage=double`.
This will become the default behavior in a future version of Druid.
Review comment (Contributor):

Grammar / formatting:

  • "used a 32-bit", not uses
  • I'd say don't backtick-format 0.13.0
  • borderline on whether doubleSum/doubleMin/doubleMax should be backtick-formatted; I probably wouldn't
  • "keep the old format" not "keep old format"
  • I think it'd be clearer to replace the last sentence with "Support for 64-bit floating point columns was released in Druid 0.11.0, so if you use this feature then older versions of Druid will not be able to read your data segments."
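
For reference, enabling the behavior described in the quoted doc text is a one-line configuration change; a minimal sketch, assuming it goes in the common runtime properties file (the file location is an assumption, not part of the PR):

# common.runtime.properties (assumed location; any Druid runtime properties file works)
druid.indexing.doubleStorage=double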

@nishantmonu51 (Member) left a comment

LGTM.

@leventov (Member)

Also labelled Incompatible because I think semantically it is. To retain the former behaviour, some changes to configs are needed.

Change-Id: I5a588f7364f236bf22f2b138e9d743bfb27c67fe
@b-slim (Contributor, Author) commented Mar 14, 2018

Thanks, I have fixed the doc and added a sentence about the pros/cons of using Double columns.
I'm not sure what kind of numbers you guys expect; please let me know and I will address it in a follow-up.

@gianm (Contributor) commented Mar 14, 2018

@b-slim What I was thinking was that as people decide if they want to use floats or doubles, they will weigh the tradeoff of accuracy against storage size / performance. Presumably doubles use more space. They may be slower too (?). People can do their own tests but it might still be useful for us to publish some numbers.
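
As a toy illustration of the accuracy side of that tradeoff (a hypothetical example, not taken from the PR or from Druid's code): summing the same values with a 64-bit accumulator, but first rounding each one to 32-bit float as if it had been stored in a float column, drifts away from the sum over the values kept at double precision.

public class StoragePrecisionDemo
{
  public static void main(String[] args)
  {
    double doubleStoredSum = 0.0;
    double floatStoredSum = 0.0;
    for (int i = 0; i < 10_000_000; i++) {
      double value = 0.1 * i;           // hypothetical ingested measurement
      doubleStoredSum += value;         // value kept at 64-bit precision
      floatStoredSum += (float) value;  // value rounded to 32-bit precision first
    }
    System.out.println("sum over double-stored values: " + doubleStoredSum);
    System.out.println("sum over float-stored values:  " + floatStoredSum);
    System.out.println("difference:                    " + (doubleStoredSum - floatStoredSum));
  }
}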

@b-slim (Contributor, Author) commented Mar 14, 2018

@gianm fair enough. On the speed side, I thought the effect would be negligible since we use 64-bit primitive doubles to aggregate either way, but you are right that there may be a hit from deserialization and loading into memory. Will add that to my list.

@clintropolis (Member)

@b-slim @drcrallen @gianm @leventov I've been running benchmarks and collecting speeds and sizes for integer columns as a follow-up to #4080 (comment), so I got curious late one night last week and ran the same(ish) benchmarks on floats and doubles; the results are pretty much what I expected. To summarize, I see a 10-50% decrease in select speed, depending on the number of rows selected and the type of value distribution, and up to double the encoded size. Note: the dip at the end of the plots is an artifact of the benchmark parameters for the number of filtered rows not including rows - 1; it is actually a cliff from the filter being null in the benchmark:

if (filter == null) {
  // unfiltered case: read every row sequentially
  for (int i = 0; i < rows; i++) {
    blackhole.consume(data.get(i));
  }
} else {
  // filtered case: visit only the rows whose bits are set in the filter
  for (int i = filter.nextSetBit(0); i >= 0; i = filter.nextSetBit(i + 1)) {
    blackhole.consume(data.get(i));
  }
}

In my experimental branch, I've modified the zipf value generator to sample the distribution directly instead of enumerating it, so that large cardinalities can be tested, hence the 'lazy' in the name. (The modified benchmarking code is not pushed anywhere yet.)
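
A minimal sketch of that sampling approach, assuming Apache Commons Math is on the classpath; the class, method, and parameter names below are illustrative, not the actual benchmark generator:

import org.apache.commons.math3.distribution.ZipfDistribution;

public class LazyZipfDoublesGenerator
{
  public static double[] generate(int rows, int cardinality, double exponent, double scale)
  {
    // sample the distribution directly instead of enumerating all `cardinality`
    // probabilities up front, so very large cardinalities stay cheap to set up
    ZipfDistribution zipf = new ZipfDistribution(cardinality, exponent);
    double[] values = new double[rows];
    for (int i = 0; i < rows; i++) {
      values[i] = zipf.sample() * scale; // e.g. scale = 1.0, or Math.PI for the scaled variants
    }
    return values;
  }
}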

[Plots omitted: float vs. double select benchmarks for the normal, uniform, Math.random positive, and Zipf distributions (integer values stored in float/double columns; scaled with Math.PI; scaled with Math.PI, half zeros; scaled with Math.PI, mostly zeros).]

[Plot omitted: the same results expressed as percentage slower, shown for the normal distribution; the other distributions are similar.]

Don't let this dissuade anyone else from running benchmarks to validate what I see here.

@b-slim (Contributor, Author) commented Mar 20, 2018

@clintropolis Thanks for doing this! Now a quick question: when you say floats, does that mean Double columns stored as 32 bits, or the new Float 32-bit aggregators?

b-slim merged commit 17c71a2 into apache:master on Mar 20, 2018
@clintropolis (Member)

@b-slim these were just run directly against the ColumnarDoubles and ColumnarFloats implementations, and are similar in nature to what FloatCompressionBenchmark and CompressedColumnarIntsBenchmark do rather than to full query benchmarks. If I have a chance later I'll try to collect some more data and see if I can paint a more complete picture.

@b-slim (Contributor, Author) commented Mar 20, 2018

@clintropolis ColumnarFloats uses 32 bits to store and aggregate, i.e. true 32-bit floats. I thought the ask here was to compare the two variants of Double columns: variant one is the old-school approach of storing as 32 bits and aggregating as 64 bits; variant two stores as 64 bits and aggregates as 64 bits. I hope that makes sense.
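
To make those two variants concrete, here is a minimal sketch (hypothetical names, not Druid's actual column or aggregator classes): both variants accumulate in a 64-bit double and differ only in the width of the stored values.

public class DoubleSumVariantsSketch
{
  // variant 1 ("old school"): values stored as 32-bit floats, widened to double when aggregating
  static double sumFromFloatStorage(float[] column)
  {
    double sum = 0.0;
    for (float v : column) {
      sum += v; // implicit float -> double widening
    }
    return sum;
  }

  // variant 2 (the new default in this PR): values stored and aggregated as 64-bit doubles
  static double sumFromDoubleStorage(double[] column)
  {
    double sum = 0.0;
    for (double v : column) {
      sum += v;
    }
    return sum;
  }
}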
