thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384 #7607
Comments
Could you be more specific please? What exactly do you do? What is the Druid version? How did you update sketches-core? |
@pzhdfy |
Just to double-check, I ran characterization tests against the Theta Union sketch (0.13.1) using both datum updates and sketch updates, both on-heap and off-heap. These tests exercise a 4K sketch up to a million uniques, and all 4 test suites produce identical accuracy pitch-fork plots. The plot looks a little noisy because I am only running 1024 trials at each point. This is an absolutely normal accuracy characterization plot and well within accuracy specifications. Unless you can provide more clarity and information as to how you are using the sketch, there is nothing else we can do. |
1. Use the newest Druid version (compiled from the master branch, with sketches-core-0.13.1).
2. Use this Python script to generate test data.
3. Use this ingestion spec.
4. Run groupBy and topN queries.
|
In groupBy vs topN, as far as aggregators are concerned, one major difference is that groupBy uses |
|
I was able to reproduce this as well. Downgrading to sketches-core-0.13.0 fixed the problem. I also noticed that adding a limit to the groupBy fixed it as well. I'm not sure why, although it does change the code paths. In Druid SQL, this query exhibits the issue:
SELECT 'beep', APPROX_COUNT_DISTINCT_DS_THETA("user_id") FROM test_theta GROUP BY 1
And this one doesn't:
SELECT 'beep', APPROX_COUNT_DISTINCT_DS_THETA("user_id") FROM test_theta GROUP BY 1 LIMIT 1
(The 'beep' is to force the SQL planner to use a groupBy rather than a timeseries query type.) |
Thank you, @pzhdfy, for the detailed instructions on how to reproduce this problem. |
Very puzzling. We need to simplify the problem environment to the point where I can reproduce the problem outside Druid. I suspect that somehow theta is being reset to 1.0, which would cause this. |
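To illustrate why a theta reset would produce exactly this symptom, here is a minimal sketch in Java. It is a toy KMV-style sketch written for illustration, not the sketches-core implementation; `ToySketch`, `hashToUnit`, and `estimateIfThetaReset` are hypothetical names, and the hash is a SplitMix64-style mixer chosen only for convenience. The estimate is the retained-entry count divided by theta, so if theta is wrongly set back to 1.0 while the retained entries stay in place, the estimate collapses to the retained count, which can never exceed k.

```java
import java.util.TreeSet;

/** Toy KMV-style theta sketch (illustrative only, NOT sketches-core). */
final class ThetaResetDemo {

    static final class ToySketch {
        final int k;                  // nominal entries, e.g. 16384
        double theta = 1.0;           // sampling threshold in (0, 1]
        final TreeSet<Double> hashes = new TreeSet<>();

        ToySketch(int k) {
            this.k = k;
        }

        void update(long item) {
            double h = hashToUnit(item);
            if (h >= theta) {
                return;               // at or above the threshold: not sampled
            }
            hashes.add(h);
            if (hashes.size() > k) {
                // Evict the largest retained hash and lower theta to it.
                theta = hashes.pollLast();
            }
        }

        /** Correct estimate: retained count scaled up by 1/theta. */
        double estimate() {
            return hashes.size() / theta;
        }

        /** What the estimate degrades to if theta is (wrongly) 1.0. */
        double estimateIfThetaReset() {
            return hashes.size() / 1.0;
        }
    }

    /** SplitMix64-style mixer mapped to [0, 1); an assumption of this toy. */
    static double hashToUnit(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        x ^= (x >>> 31);
        return (x >>> 11) * 0x1.0p-53;
    }

    public static void main(String[] args) {
        ToySketch sketch = new ToySketch(16384);
        for (long i = 0; i < 1_000_000L; i++) {
            sketch.update(i);
        }
        System.out.println("estimate with correct theta: " + sketch.estimate());
        System.out.println("estimate if theta were 1.0:  " + sketch.estimateIfThetaReset());
    }
}
```

With a correct theta the toy estimate should land close to one million, while the reset variant is bounded by k. That matches the reported behavior of results never exceeding the configured size, though it only shows the hypothesis is plausible, not where the regression actually is.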
@leerho, please let me know if the following is helpful, or if I could do anything else to help. What the Druid query is doing is something like this:
I scattered a bunch of sketch toStrings around the code and found that in step (2) they look like this: The object built up from the segment scan,
The initial state of the sketch in the merge buffer (should be empty),
The final state of the sketch in the merge buffer (should match the original sketch from the segment scan),
It's changed a bit, but doesn't match up. The code that printed this was:
@Override
public void aggregate(ByteBuffer buf, int position)
{
  Object update = selector.getObject();
  if (update == null) {
    return;
  }
  Union union = getOrCreateUnion(buf, position);
  // Capture the union's state before the update so it can be compared afterwards.
  final String initialUnionResult = update instanceof SketchHolder ? union.getResult().toString() : null;
  SketchAggregator.updateUnion(union, update);
  // Only log sketch-to-sketch merges, not raw datum updates.
  if (update instanceof SketchHolder) {
    log.info(
        "Aggregate called with buffer[%s], position[%s], update = %s, union starts as = %s, union ends as = %s",
        System.identityHashCode(buf),
        position,
        ((SketchHolder) update).getSketch(),
        initialUnionResult,
        union.getResult()
    );
  }
}
|
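The expectation described above, that merging one sketch into an empty union should leave the union matching that sketch, can be sketched as follows. This is a toy model under stated assumptions, not the sketches-core `Union`: `State`, `merge`, and `example` are hypothetical names, the hash values are made up, and trimming back to k entries is omitted.

```java
import java.util.TreeSet;

/** Toy model of a theta union merge (illustrative only, NOT sketches-core). */
final class ToyUnionDemo {

    /** Toy sketch state: a threshold theta and the retained hashes below it. */
    static final class State {
        double theta = 1.0;
        final TreeSet<Double> hashes = new TreeSet<>();

        double estimate() {
            return hashes.size() / theta;
        }
    }

    /** Merge src into dst: adopt the smaller theta, keep only hashes below it. */
    static void merge(State dst, State src) {
        dst.theta = Math.min(dst.theta, src.theta);
        dst.hashes.addAll(src.hashes);
        dst.hashes.tailSet(dst.theta, true).clear(); // drop hashes >= theta
    }

    /** A made-up sketch in estimation mode: 3 retained hashes, theta = 0.25. */
    static State example() {
        State s = new State();
        s.theta = 0.25;
        s.hashes.add(0.05);
        s.hashes.add(0.10);
        s.hashes.add(0.20);
        return s;
    }

    public static void main(String[] args) {
        State scanned = example();   // stands in for the segment-scan sketch
        State union = new State();   // stands in for the empty merge-buffer union
        merge(union, scanned);
        // The union must inherit theta; otherwise the estimate drops to 3.
        System.out.println(union.theta + " " + union.estimate()); // prints "0.25 12.0"
    }
}
```

The key invariant is that the union's theta ends up as the minimum of the input thetas. If the merge kept the entries but left theta at 1.0, the estimate here would be 3 instead of 12.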
Thank you!! We have been able to reproduce the problem. Now I can dig in to see what went wrong. |
This is a regression in Theta sketch code. So I would think you don't want to approve the 0.14.1 release candidate as it is now. We will fix the sketches-core shortly. |
pull request: |
0.14.1 is too far gone, the artifacts are already propagated to maven and the apache mirrors, so I'm going to go ahead and do the release anyway. I've modified the release notes to warn about upgrading if relying on theta sketches. This issue does seem severe enough to go ahead and do a 0.14.2 since we can probably drive that through a lot quicker than we can wrap up and validate 0.15.0, so I will create an rc and start a vote as soon as possible. |
Fixed by #7619 |
* fix issue apache#7607 * exclude com.google.code.findbugs:annotations
Affected Version
The Druid version with sketches-core-0.13.1
Description
We updated to sketches-core-0.13.1 because it includes a bug fix for Quantiles Sketches in direct mode.
Then we found that thetaSketch in groupBy always returns a value no greater than 16384 (the size).
If size is set to another value, such as 32768, thetaSketch returns a value <= 32768.
But thetaSketch in topN and timeseries returns the expected data.
Then we rolled back to sketches-core-0.12.0, and thetaSketch works well.
But then the Quantiles Sketches still have the bug.
@AlexanderSaydakov