
Adding feature thetaSketchConstant to do some set operation in PostAgg #5551

Merged: 7 commits merged into apache:master on Apr 6, 2018

Conversation

lssenthilkumar
Contributor

Problem statement: we need to get the top-N page hit count from the set of filtered users who accessed the page.

Solution: this use case requires intersecting two datasets: the filtered dataset and the top-N page counts.

Hence I execute two queries: the first with finalize = false to get the theta sketch, then pass that sketch as a constant to the second query.
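
For illustration, a minimal sketch (not from the PR) of how the two steps could be wired together, assuming the SketchSetPostAggregator and FieldAccessPostAggregator constructors from the datasketches module; base64Sketch and the field name "users" are hypothetical:

// Step 1 (first query, finalize = false) yields a base64-encoded theta sketch.
// Step 2 intersects that constant sketch with a sketch aggregated in the second query.
PostAggregator constant = new SketchConstantPostAggregator("filtered_users", base64Sketch);
PostAggregator users = new FieldAccessPostAggregator("users", "users");
PostAggregator intersection = new SketchSetPostAggregator(
    "page_hits_by_filtered_users",
    "INTERSECT",
    null, // size: null to use the default max sketch size
    Arrays.asList(constant, users)
);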

@gianm gianm added the Feature label Mar 31, 2018
{
this.name = name;
Preconditions.checkNotNull(sketchValue);
this.sketchValue = SketchHolder.deserialize(sketchValue);
Contributor

It'd be better to write this as,

this.sketchValue = SketchHolder.deserialize(Preconditions.checkNotNull(sketchValue, "value"));

For a couple of reasons:

  1. Users that don't provide "value" will get a better error message.
  2. Preconditions.checkNotNull is designed to be used inline (it returns the value).

Contributor

It looks like you copied this construct from ConstantPostAggregator, which has the same flaws. If you want to improve that as part of this PR, go for it too.

Contributor Author

Thanks Gian. Updated the code for ConstantPostAggregator and SketchConstantPostAggregator.

@Override
public Set<String> getDependentFields()
{
return Sets.newHashSet();
Contributor

Collections.emptySet() returns a singleton so is preferred here.
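
In other words, a minimal sketch of the suggested change:

@Override
public Set<String> getDependentFields()
{
  // Collections.emptySet() returns a shared immutable singleton,
  // so no new set is allocated on each call.
  return Collections.emptySet();
}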

Contributor Author

Updated the code for SketchConstantPostAggregator.

}

@JsonProperty("value")
public String getSketchValue()
Contributor

You can just return sketchValue here. Jackson knows how to serialize SketchHolder objects. That way, all the serialization code is located in one place (SketchHolderJsonSerializer)
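
That is, a minimal sketch of the suggested getter, assuming sketchValue is the SketchHolder field:

@JsonProperty("value")
public SketchHolder getSketchValue()
{
  // Jackson serializes SketchHolder via SketchHolderJsonSerializer,
  // so all serialization code stays in one place.
  return sketchValue;
}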

Contributor Author

Incorporated the same; it's working fine.

public int hashCode()
{
int result = name != null ? name.hashCode() : 0;
result = 37 * result + getSketchValue().hashCode();
Contributor

It looks like SketchHolder overrides equals but not hashCode. This is a bug, although I'm not sure if it has a visible effect in production before this patch (I can't think of anything offhand that would depend on SketchHolder's hash code being consistent with equals). But it does affect the correctness of this new class's equals/hashCode methods. So please fix SketchHolder in this patch - I think something that just delegates to the underlying Sketch's hashCode method would be enough.

Contributor Author

A couple of things:

  1. I updated the equals method for SketchHolder, since it internally called Sketch's equals method, which is not implemented.
  2. I added a hashCode method for SketchHolder.

Please review and let me know your comments.

@gianm (Contributor) commented Mar 31, 2018

Thanks for the contribution @lssenthilkumar! Could you please fill out our project CLA at: http://druid.io/community/cla.html

@gianm (Contributor) left a comment

Had a couple more comments @lssenthilkumar. Thanks for bearing with me. It is looking almost ready to go.

public SketchConstantPostAggregator(@JsonProperty("name") String name, @JsonProperty("value") String sketchValue)
{
this.name = name;
this.sketchValue = SketchHolder.deserialize(Preconditions.checkNotNull(sketchValue));
Contributor

Minor comment, but ideally this should be Preconditions.checkNotNull(sketchValue, "value") so the user gets a nicer error message (including the word "value").

Contributor Author

I am using Preconditions.checkArgument to check for null or empty, rather than just a null check, and provided a nicer error message.
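
For example, a minimal sketch of that validation, assuming Guava's Strings.isNullOrEmpty; the exact message is illustrative:

// Reject both null and empty values with a descriptive message
// before attempting to deserialize.
Preconditions.checkArgument(
    !Strings.isNullOrEmpty(sketchValue),
    "Constant value cannot be null or empty, expecting base64 encoded sketch string"
);
this.sketchValue = SketchHolder.deserialize(sketchValue);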

@Override
public byte[] getCacheKey()
{
return new CacheKeyBuilder(PostAggregatorIds.THETA_SKETCH_CONSTANT).appendInt(hashCode()).build();
Contributor

The hashCode() isn't a good cache key here, since it can collide easily, and collisions are bad in cache keys since they cause wires to get crossed. It would be better to include a sha1sum of the entire base64-formatted sketch constant. The odds of collision in that case are vanishingly small.

Contributor Author

I am using the DigestUtils.sha1Hex method instead of hashCode.
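
A minimal sketch of the revised cache key, assuming commons-codec's DigestUtils and that the base64 string passed to the constructor is retained (here as a hypothetical field sketchValueString):

@Override
public byte[] getCacheKey()
{
  // sha1 of the entire base64-encoded sketch constant; unlike hashCode(),
  // collisions here are vanishingly unlikely.
  return new CacheKeyBuilder(PostAggregatorIds.THETA_SKETCH_CONSTANT)
      .appendString(DigestUtils.sha1Hex(sketchValueString))
      .build();
}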

return this.getSketch().equals(((SketchHolder) o).getSketch());
// Can't use Sketch's equals method because it is not implemented.
// return this.getSketch().equals(((SketchHolder) o).getSketch());
return Arrays.equals(this.getSketch().toByteArray(), ((SketchHolder) o).getSketch().toByteArray());
Contributor

I think it would be fine to keep these equals and hashCode methods just based on Sketch. So do equals as:

return this.getSketch().equals(((SketchHolder) o).getSketch());

And add hashCode as:

return this.getSketch().hashCode();

The reason I think this is ok is that at least they're consistent. They're reference-based, not value-based, but I think that's ok for now as long as they're consistent. I think we don't have a need for them to be value-based at this time.

And my feeling is that if we ever do need value-based equals/hashCode methods for SketchHolder, it should be done as part of Sketch.

Contributor Author

I have added some unit tests for post-aggregation serde in the SketchAggregationTest class:
  * testSketchEstimatePostAggregatorSerde
  * testSketchSetPostAggregatorSerde

Since the Sketch class doesn't have an equals implementation, the above unit tests will fail. Hence I updated SketchHolder's equals and hashCode methods not to use Sketch's equals or hashCode.

If I implement the above changes to use Sketch's equals and hashCode, then I need to remove the serde unit tests for SketchConstantPostAggregator.

Please check the unit test class and suggest how to proceed.

Contributor

Ah, I see the problem you are facing. Okay, in that case I would say go ahead with how you did it: base the equals and hashCode off the toByteArray form. Just please include a comment that it's done this way because Sketch impls don't tend to have value-based hashCode and equals, yet we want the SketchHolder to have one.
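
Concretely, a minimal sketch of that approach for SketchHolder, with the requested comment:

// Sketch implementations tend not to have value-based equals/hashCode
// (they are reference-based), yet we want SketchHolder to have them,
// so both are based on the serialized byte form.
@Override
public boolean equals(Object o)
{
  if (this == o) {
    return true;
  }
  if (o == null || getClass() != o.getClass()) {
    return false;
  }
  return Arrays.equals(this.getSketch().toByteArray(), ((SketchHolder) o).getSketch().toByteArray());
}

@Override
public int hashCode()
{
  return Arrays.hashCode(this.getSketch().toByteArray());
}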

Contributor Author

Updated the comments; please check once.

Thanks for your time and review comments, Gian 👍. Hope I didn't trouble you much. :)

@@ -295,6 +296,14 @@ public boolean equals(Object o)
if (o == null || getClass() != o.getClass()) {
return false;
}
return this.getSketch().equals(((SketchHolder) o).getSketch());
// Can't use Sketch's equals method because it is not implemented.
// return this.getSketch().equals(((SketchHolder) o).getSketch());
Contributor

Please don't include commented-out code - it should be deleted.

Contributor Author

Removed the commented-out code.

@lssenthilkumar (Contributor, Author)

I am not sure why this CI build is failing; re-triggering the build by closing and re-opening the PR.

@gianm (Contributor) left a comment

LGTM, thanks @lssenthilkumar!

@gianm gianm merged commit 371c672 into apache:master Apr 6, 2018
surekhasaharan pushed a commit to surekhasaharan/druid that referenced this pull request Apr 6, 2018 (apache#5551).

gianm pushed a commit that referenced this pull request May 3, 2018.

sathishsri88 pushed a commit to sathishs/druid that referenced this pull request May 8, 2018.

@lssenthilkumar (Contributor, Author)

@gianm, I don't see this change in the latest release. Can you please tell me when this change will be part of a GA release?

@dclim dclim added this to the 0.13.0 milestone Oct 8, 2018