Make Doubles aggregators use 64bits by default #5478

Merged: 2 commits into apache:master on Mar 20, 2018

Conversation

@b-slim (Contributor) commented Mar 11, 2018

The issue description is in #5462.


Change-Id: Ia4f442037052add178f6ac68138c9d52f96c6e09
@drcrallen (Contributor)

@b-slim can you please include examples of when 64 vs 32 made a difference, and also record the impact in size and performance for switching from float to double?

@gianm (Contributor) left a comment

LGTM. I am also interested in the numbers @drcrallen asked for, but I am ok with the PR regardless since the change has been planned for some time.

Prior to version `0.13.0` Druid's storage layer uses a 32-bit float representation to store columns created by the
doubleSum, doubleMin, and doubleMax aggregators at indexing time. To instead use 64-bit floats
for these columns, please set the system-wide property `druid.indexing.doubleStorage=double`.
This will become the default behavior in a future version of Druid.
Review comment (Contributor):

Grammar / formatting:

  • "used a 32-bit", not uses
  • I'd say don't backtick-format 0.13.0
  • borderline on whether doubleSum/doubleMin/doubleMax should be backtick-formatted; I probably wouldn't
  • "keep the old format" not "keep old format"
  • I think it'd be clearer to replace the last sentence with "Support for 64-bit floating point columns was released in Druid 0.11.0, so if you use this feature then older versions of Druid will not be able to read your data segments."
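
For reference, enabling the behavior described in the quoted doc text is a one-line configuration change; a minimal sketch, assuming it goes in the common runtime properties file (the file location is an assumption, not part of the PR):

# common.runtime.properties (assumed location; any Druid runtime properties file works)
druid.indexing.doubleStorage=double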

@nishantmonu51 (Member) left a comment

LGTM.

@leventov (Member)

Also labelled Incompatible because I think semantically it is. To retain the former behaviour, some changes to configs are needed.

Change-Id: I5a588f7364f236bf22f2b138e9d743bfb27c67fe
@b-slim (Contributor, Author) commented Mar 14, 2018

Thanks, I have fixed the doc and added a sentence about the pros/cons of using Double columns.
I'm not sure what kind of numbers you guys expect; please let me know and I will address it in a follow-up.

@gianm (Contributor) commented Mar 14, 2018

@b-slim What I was thinking was that as people decide if they want to use floats or doubles, they will weigh the tradeoff of accuracy against storage size / performance. Presumably doubles use more space. They may be slower too (?). People can do their own tests but it might still be useful for us to publish some numbers.
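
As a toy illustration of the accuracy side of that tradeoff (a hypothetical example, not taken from the PR or from Druid's code): summing the same values with a 64-bit accumulator, but first rounding each one to 32-bit float as if it had been stored in a float column, drifts away from the sum over the values kept at double precision.

public class StoragePrecisionDemo
{
  public static void main(String[] args)
  {
    double doubleStoredSum = 0.0;
    double floatStoredSum = 0.0;
    for (int i = 0; i < 10_000_000; i++) {
      double value = 0.1 * i;           // hypothetical ingested measurement
      doubleStoredSum += value;         // value kept at 64-bit precision
      floatStoredSum += (float) value;  // value rounded to 32-bit precision first
    }
    System.out.println("sum over double-stored values: " + doubleStoredSum);
    System.out.println("sum over float-stored values:  " + floatStoredSum);
    System.out.println("difference:                    " + (doubleStoredSum - floatStoredSum));
  }
}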

@b-slim (Contributor, Author) commented Mar 14, 2018

@gianm fair enough. On the speed side, I thought the effect would be negligible since we use 64-bit primitive doubles to aggregate either way, but you are right that there may be a hit from deserialization and loading into memory. Will add that to my list.

@clintropolis (Member)

@b-slim @drcrallen @gianm @leventov I've been running benchmarks and collecting speeds and sizes for integer columns as a follow-up to #4080 (comment), so I got curious late one night last week and ran the same(ish) benchmarks on floats and doubles; the results are pretty much what I expected. To summarize, I see a 10-50% decrease in select speed, depending on the number of rows selected and the type of value distribution, and up to double the encoded size. Note: the dip at the end of the plots is an artifact of the benchmark parameters for the number of filtered rows not including rows - 1; it is actually a cliff from the filter being null in the benchmark:

if (filter == null) {
  // unfiltered case: read every row sequentially
  for (int i = 0; i < rows; i++) {
    blackhole.consume(data.get(i));
  }
} else {
  // filtered case: visit only the rows whose bits are set in the filter
  for (int i = filter.nextSetBit(0); i >= 0; i = filter.nextSetBit(i + 1)) {
    blackhole.consume(data.get(i));
  }
}

In my experimental branch, I've modified the zipf value generator to sample the distribution directly instead of enumerating it, so that large cardinalities can be tested, hence the 'lazy' in the name. (The modified benchmarking code is not pushed anywhere yet.)
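
A minimal sketch of that sampling approach, assuming Apache Commons Math is on the classpath; the class, method, and parameter names below are illustrative, not the actual benchmark generator:

import org.apache.commons.math3.distribution.ZipfDistribution;

public class LazyZipfDoublesGenerator
{
  public static double[] generate(int rows, int cardinality, double exponent, double scale)
  {
    // sample the distribution directly instead of enumerating all `cardinality`
    // probabilities up front, so very large cardinalities stay cheap to set up
    ZipfDistribution zipf = new ZipfDistribution(cardinality, exponent);
    double[] values = new double[rows];
    for (int i = 0; i < rows; i++) {
      values[i] = zipf.sample() * scale; // e.g. scale = 1.0, or Math.PI for the scaled variants
    }
    return values;
  }
}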

[Plots omitted: float vs. double select benchmarks for the normal, uniform, Math.random positive, and Zipf distributions (integer values stored in float/double columns; scaled with Math.PI; scaled with Math.PI, half zeros; scaled with Math.PI, mostly zeros).]

[Plot omitted: the same results expressed as percentage slower, shown for the normal distribution; the other distributions are similar.]

Don't let this dissuade anyone else from running benchmarks to validate what I see here.

@b-slim (Contributor, Author) commented Mar 20, 2018

@clintropolis Thanks for doing this! Now a quick question: when you say floats, does that mean Double columns stored as 32 bits, or the new Float 32-bit aggregators?

b-slim merged commit 17c71a2 into apache:master on Mar 20, 2018
@clintropolis (Member)

@b-slim these were just run directly against the ColumnarDoubles and ColumnarFloats implementations, and are similar in nature to what FloatCompressionBenchmark and CompressedColumnarIntsBenchmark do rather than to full query benchmarks. If I have a chance later I'll try to collect some more data and see if I can paint a more complete picture.

@b-slim (Contributor, Author) commented Mar 20, 2018

@clintropolis ColumnarFloats uses 32 bits to store and aggregate, i.e. true 32-bit floats. I thought the ask here was to compare the two variants of Double columns: variant one is the old-school approach of storing as 32 bits and aggregating as 64 bits; variant two stores as 64 bits and aggregates as 64 bits. I hope that makes sense.
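
To make those two variants concrete, here is a minimal sketch (hypothetical names, not Druid's actual column or aggregator classes): both variants accumulate in a 64-bit double and differ only in the width of the stored values.

public class DoubleSumVariantsSketch
{
  // variant 1 ("old school"): values stored as 32-bit floats, widened to double when aggregating
  static double sumFromFloatStorage(float[] column)
  {
    double sum = 0.0;
    for (float v : column) {
      sum += v; // implicit float -> double widening
    }
    return sum;
  }

  // variant 2 (the new default in this PR): values stored and aggregated as 64-bit doubles
  static double sumFromDoubleStorage(double[] column)
  {
    double sum = 0.0;
    for (double v : column) {
      sum += v;
    }
    return sum;
  }
}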
