
Adding feature thetaSketchConstant to do some set operation in PostAgg #5551

Merged: 7 commits merged into apache:master on Apr 6, 2018

Conversation

lssenthilkumar
Contributor

Problem statement: we need to get the top-N page hit count from the set of filtered users who accessed the page.

Solution: this use case requires intersecting two datasets: the filtered dataset and the top-N page counts.

Hence I execute two queries: the first with finalize = false to get the theta sketch, then pass that sketch as a constant to the second query.
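
For illustration, a minimal sketch (not from the PR) of how the two steps could be wired together, assuming the SketchSetPostAggregator and FieldAccessPostAggregator constructors from the datasketches module; base64Sketch and the field name "users" are hypothetical:

// Step 1 (first query, finalize = false) yields a base64-encoded theta sketch.
// Step 2 intersects that constant sketch with a sketch aggregated in the second query.
PostAggregator constant = new SketchConstantPostAggregator("filtered_users", base64Sketch);
PostAggregator users = new FieldAccessPostAggregator("users", "users");
PostAggregator intersection = new SketchSetPostAggregator(
    "page_hits_by_filtered_users",
    "INTERSECT",
    null, // size: null to use the default max sketch size
    Arrays.asList(constant, users)
);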

@gianm gianm added the Feature label Mar 31, 2018
{
this.name = name;
Preconditions.checkNotNull(sketchValue);
this.sketchValue = SketchHolder.deserialize(sketchValue);
Contributor

It'd be better to write this as,

this.sketchValue = SketchHolder.deserialize(Preconditions.checkNotNull(sketchValue, "value"));

For a couple of reasons:

  1. Users that don't provide "value" will get a better error message.
  2. Preconditions.checkNotNull is designed to be used inline (it returns the value).

Contributor

It looks like you copied this construct from ConstantPostAggregator, which has the same flaws. If you want to improve that as part of this PR, go for it too.

Contributor Author

Thanks Gian. Updated the code for ConstantPostAggregator and SketchConstantPostAggregator.

@Override
public Set<String> getDependentFields()
{
return Sets.newHashSet();
Contributor

Collections.emptySet() returns a singleton so is preferred here.
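
In other words, a minimal sketch of the suggested change:

@Override
public Set<String> getDependentFields()
{
  // Collections.emptySet() returns a shared immutable singleton,
  // so no new set is allocated on each call.
  return Collections.emptySet();
}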

Contributor Author

Updated the code for SketchConstantPostAggregator.

}

@JsonProperty("value")
public String getSketchValue()
Contributor

You can just return sketchValue here. Jackson knows how to serialize SketchHolder objects. That way, all the serialization code is located in one place (SketchHolderJsonSerializer)
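
That is, a minimal sketch of the suggested getter, assuming sketchValue is the SketchHolder field:

@JsonProperty("value")
public SketchHolder getSketchValue()
{
  // Jackson serializes SketchHolder via SketchHolderJsonSerializer,
  // so all serialization code stays in one place.
  return sketchValue;
}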

Contributor Author

Incorporated the same; it's working fine.

public int hashCode()
{
int result = name != null ? name.hashCode() : 0;
result = 37 * result + getSketchValue().hashCode();
Contributor

It looks like SketchHolder overrides equals but not hashCode. This is a bug, although I'm not sure if it has a visible effect in production before this patch (I can't think of anything offhand that would depend on SketchHolder's hash code being consistent with equals). But it does affect the correctness of this new class's equals/hashCode methods. So please fix SketchHolder in this patch - I think something that just delegates to the underlying Sketch's hashCode method would be enough.

Contributor Author

A couple of things:

  1. I updated the equals method for SketchHolder, since it internally called Sketch's equals method, which is not implemented.
  2. I added a hashCode method for SketchHolder.

Please review and let me know your comments.

@gianm (Contributor) commented Mar 31, 2018

Thanks for the contribution @lssenthilkumar! Could you please fill out our project CLA at: http://druid.io/community/cla.html

@gianm (Contributor) left a comment

Had a couple more comments @lssenthilkumar. Thanks for bearing with me. It is looking almost ready to go.

public SketchConstantPostAggregator(@JsonProperty("name") String name, @JsonProperty("value") String sketchValue)
{
this.name = name;
this.sketchValue = SketchHolder.deserialize(Preconditions.checkNotNull(sketchValue));
Contributor

Minor comment, but ideally this should be Preconditions.checkNotNull(sketchValue, "value") so the user gets a nicer error message (including the word "value").

Contributor Author

I am using Preconditions.checkArgument to check for null or empty, rather than just a null check, and provided a nicer error message.
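
For example, a minimal sketch of that validation, assuming Guava's Strings.isNullOrEmpty; the exact message is illustrative:

// Reject both null and empty values with a descriptive message
// before attempting to deserialize.
Preconditions.checkArgument(
    !Strings.isNullOrEmpty(sketchValue),
    "Constant value cannot be null or empty, expecting base64 encoded sketch string"
);
this.sketchValue = SketchHolder.deserialize(sketchValue);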

@Override
public byte[] getCacheKey()
{
return new CacheKeyBuilder(PostAggregatorIds.THETA_SKETCH_CONSTANT).appendInt(hashCode()).build();
Contributor

The hashCode() isn't a good cache key here, since it can collide easily, and collisions are bad in cache keys since they cause wires to get crossed. It would be better to include a sha1sum of the entire base64-formatted sketch constant. The odds of collision in that case are vanishingly small.

Contributor Author

I am using the DigestUtils.sha1Hex method instead of hashCode.
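
A minimal sketch of the revised cache key, assuming commons-codec's DigestUtils and that the base64 string passed to the constructor is retained (here as a hypothetical field sketchValueString):

@Override
public byte[] getCacheKey()
{
  // sha1 of the entire base64-encoded sketch constant; unlike hashCode(),
  // collisions here are vanishingly unlikely.
  return new CacheKeyBuilder(PostAggregatorIds.THETA_SKETCH_CONSTANT)
      .appendString(DigestUtils.sha1Hex(sketchValueString))
      .build();
}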

return this.getSketch().equals(((SketchHolder) o).getSketch());
// Can't use Sketch's equals method because it is not implemented.
// return this.getSketch().equals(((SketchHolder) o).getSketch());
return Arrays.equals(this.getSketch().toByteArray(), ((SketchHolder) o).getSketch().toByteArray());
Contributor

I think it would be fine to keep these equals and hashCode methods just based on Sketch. So do equals as:

return this.getSketch().equals(((SketchHolder) o).getSketch());

And add hashCode as:

return this.getSketch().hashCode();

The reason I think this is ok is that at least they're consistent. They're reference-based, not value-based, but I think that's ok for now as long as they're consistent. I think we don't have a need for them to be value-based at this time.

And my feeling is that if we ever do need value-based equals/hashCode methods for SketchHolder, it should be done as part of Sketch.

Contributor Author

I have added some unit tests for post-aggregation serde in the SketchAggregationTest class:
  * testSketchEstimatePostAggregatorSerde
  * testSketchSetPostAggregatorSerde

Since the Sketch class doesn't have an equals implementation, the above unit tests will fail. Hence I updated SketchHolder's equals and hashCode methods not to use Sketch's equals or hashCode.

If I implement the above changes to use Sketch's equals and hashCode, then I need to remove the serde unit tests for SketchConstantPostAggregator.

Please check the unit test class and suggest how to proceed.

Contributor

Ah, I see the problem you are facing. Okay, in that case I would say go ahead with how you did it: base the equals and hashCode off the toByteArray form. Just please include a comment that it's done this way because Sketch impls don't tend to have value-based hashCode and equals, yet we want the SketchHolder to have one.
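
Concretely, a minimal sketch of that approach for SketchHolder, with the requested comment:

// Sketch implementations tend not to have value-based equals/hashCode
// (they are reference-based), yet we want SketchHolder to have them,
// so both are based on the serialized byte form.
@Override
public boolean equals(Object o)
{
  if (this == o) {
    return true;
  }
  if (o == null || getClass() != o.getClass()) {
    return false;
  }
  return Arrays.equals(this.getSketch().toByteArray(), ((SketchHolder) o).getSketch().toByteArray());
}

@Override
public int hashCode()
{
  return Arrays.hashCode(this.getSketch().toByteArray());
}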

Contributor Author

Updated the comments; please check once.

Thanks for your time and review comments, Gian 👍. Hope I didn't trouble you much. :)

@@ -295,6 +296,14 @@ public boolean equals(Object o)
if (o == null || getClass() != o.getClass()) {
return false;
}
return this.getSketch().equals(((SketchHolder) o).getSketch());
// Can't use Sketch's equals method because it is not implemented.
// return this.getSketch().equals(((SketchHolder) o).getSketch());
Contributor

Please don't include commented-out code - it should be deleted.

Contributor Author

Removed the commented-out code.

@lssenthilkumar (Contributor, Author)

I am not sure why this CI build is failing; re-triggering the build by closing and re-opening the PR.

@gianm (Contributor) left a comment

LGTM, thanks @lssenthilkumar!

@gianm gianm merged commit 371c672 into apache:master Apr 6, 2018
surekhasaharan pushed a commit to surekhasaharan/druid that referenced this pull request Apr 6, 2018 (apache#5551).

gianm pushed a commit that referenced this pull request May 3, 2018.

sathishsri88 pushed a commit to sathishs/druid that referenced this pull request May 8, 2018.

@lssenthilkumar (Contributor, Author)

@gianm, I don't see this change in the latest release. Can you please tell me when this change will be part of a GA release?

@dclim dclim added this to the 0.13.0 milestone Oct 8, 2018