Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384 #7607

Closed
pzhdfy opened this issue May 7, 2019 · 16 comments · Fixed by #7619
Closed
Labels
Milestone

Comments

@pzhdfy
Copy link
Contributor

pzhdfy commented May 7, 2019

Affected Version

The Druid version with sketches-core-0.13.1

Description

we updated to sketches-core-0.13.1 , because it Bug fix for Quantiles Sketches in direct mode.
then we found using thetaSketch in groupBy always return value no more than 16384(the size).
if set size to another value, such as 32768, the thetaSketch return <= 32768.
But thetaSketch in topN and timeseries return expected data

Then we roll back to sketches-core-0.12.0, thetaSketch works well.
But the Quantiles Sketches will has bug

@AlexanderSaydakov

@AlexanderSaydakov
Copy link
Contributor

Could you be more specific please? What exactly do you do? What is the Druid version? How did you update sketches-core?

@leerho
Copy link
Contributor

leerho commented May 7, 2019

@pzhdfy
This is very strange. Can you reproduce this in a smaller environment? We already have unit tests where N >> K and that show no problem. We need much more detail in order to help you troubleshoot this.

@leerho
Copy link
Contributor

leerho commented May 7, 2019

@pzhdfy

Just to double check I have run characterization tests against the Theta Union sketch (0.13.1) using both datum updates as well as sketch updates, both on-heap and off-heap.

These tests exercise a 4K sketch up to a million uniques and all 4 test suites produce the same identical accuracy pitch-fork plot as follows:

ThetaUnionAccuracy1

The plot looks a little noisy because I am only running 1024 trials at each point. This is an absolutely normal accuracy characterization plot and well within accuracy specifications.

Unless you can provide more clarity and information as to how you are using the sketch there is nothing else we can do.

@pzhdfy
Copy link
Contributor Author

pzhdfy commented May 8, 2019

@AlexanderSaydakov

1.use the newest Druid version( compiled from master branch, with sketches-core-0.13.1)

2.use this python generate test data

import json
f=open('test_theta_data','w')
d={}
for i in xrange(100000):
  #date_id,user_id
  d['date_id']='20190407'
  d['user_id']=i
  f.write('%s\n' % json.dumps(d))

3.use this Ingestion spec

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "test_theta",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "date_id",
            "format": "yyyyMMdd",
            "missingValue": 0
          },
          "dimensionsSpec": {
            "dimensionExclusions": [],
            "dimensions": [
              "date_id",
              "user_id"
            ]
          }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "DAY",
        "rollup": true,
        "intervals": [
          "2019-04-07/2019-04-08"
        ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "children": [
          {
            "paths": "/home/druid/druid_test_histogram/test_theta_data",
            "type": "static"
          }
        ],
        "type": "multi"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.am.push.alter.metrics": "druid_indexer_config#druid.dataSource.name#druid.hive.table#mapreduce.input.multipleinputs.dir.formats:druidIndexerConfig#druidDataSource#druidHiveTable#multipleinputsDir",
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.",
        "mapreduce.job.queuename": "root.offline.hdp.hdp.L_0",
        "mapreduce.map.java.opts": "-Xmx6G -Djava.net.preferIPv4Stack=true",
        "mapreduce.map.memory.mb": "10240",
        "mapreduce.reduce.java.opts": "-Xmx6G -Djava.net.preferIPv4Stack=true",
        "mapreduce.reduce.memory.mb": "10240",
        "mapreduce.reduce.speculative": "false"
      }
    }
  },
  "context": {
    "druid.indexer.fork.property.druid.processing.buffer.sizeBytes": 1024,
    "druid.indexer.fork.property.druid.processing.numMergeBuffers": 1,
    "druid.indexer.fork.property.druid.processing.numThreads": 1
  }
}

4.use groupby and topN

{
  "dimensions": [
    "date_id"
  ],
  "aggregations": [
    {
      "type": "thetaSketch",
      "fieldName": "user_id",
      "name": "count_distinct__user_id"
    }
  ],
  "intervals": "2019-04-07/2019-04-08",
  "limitSpec": {
    "limit": 5000,
    "type": "default",
    "columns": [
      {
        "direction": "descending",
        "dimension": "count_distinct__user_id"
      }
    ]
  },
  "granularity": "all",
  "queryType": "groupBy",
  "dataSource": "test_theta"
}
{
  "dimension": "date_id", 
  "aggregations": [
    {
      "type": "thetaSketch", 
      "fieldName": "user_id", 
      "name": "count_distinct__user_id"
    }
  ], 
  "intervals": "2019-04-07/2019-04-08", 
 "metric": "count_distinct__user_id", 
 "threshold": 5000, 
  "granularity": "all",  
  "queryType": "topN", 
  "dataSource": "test_theta"
}

the result for groupby and topN
image
image

@gianm
Copy link
Contributor

gianm commented May 8, 2019

In groupBy vs topN, as far as aggregators are concerned, one major difference is that groupBy uses relocate and topN does not. However, since you have just one date_id, I don't think it's likely that relocate would be called. So that's probably not it.

@leerho
Copy link
Contributor

leerho commented May 8, 2019

@pzhdfy

  1. Please try sketches release 0.13.0. This will narrow down the possible changes that might be causing this.
  2. At the points where the sketch is reporting the good result and the bad result print out the sketch summary using toString(). This might provide some clues.

@gianm
Copy link
Contributor

gianm commented May 8, 2019

I was able to reproduce this as well. Downgrading to sketches-core-0.13.0 fixed the problem. I also noticed that adding a limit to the groupBy fixed it as well. I'm not sure why - it does change the code paths, however. In Druid SQL, this query exhibits the issue:

SELECT 'beep', APPROX_COUNT_DISTINCT_DS_THETA("user_id") FROM test_theta GROUP BY 1

And this one doesn't:

SELECT 'beep', APPROX_COUNT_DISTINCT_DS_THETA("user_id") FROM test_theta GROUP BY 1 LIMIT 1

(The 'beep' is to force the SQL planner to use a groupBy rather than timeseries query type.)

@gianm
Copy link
Contributor

gianm commented May 8, 2019

Thank you, @pzhdfy, for the detailed instructions on how to reproduce this problem.

@leerho
Copy link
Contributor

leerho commented May 8, 2019

Very puzzling. Se need to simplify the problem environment to where I can reproduce the problem outside Druid. I suspect that somehow theta is being reset to 1.0, which would cause this.

@gianm
Copy link
Contributor

gianm commented May 8, 2019

@leerho, please let me know if the following is helpful, or if I could do anything else to help.

What the Druid query is doing is something like this:

  1. Iterating over all rows in a Druid segment, and building up a theta sketch object. This object looks fine.
  2. Taking that object and merging it into the 'merge buffer', which starts off initialized to an empty sketch. This is where it goes off the rails.

I scattered a bunch of sketch toStrings around the code and found that in step (2) they look like this:

The object built up from the segment scan,

### HeapCompactOrderedSketch SUMMARY:
   Estimate                : 100086.81001356241
   Upper Bound, 95% conf   : 101530.78009013624
   Lower Bound, 95% conf   : 98663.31883662633
   Theta (double)          : 0.16369789383615946
   Theta (long)            : 1509846576500454824
   Theta (long) hex        : 14f40d7639a635a8
   EstMode?                : true
   Empty?                  : false
   Array Size Entries      : 16384
   Retained Entries        : 16384
   Seed Hash               : 93cc | 37836
### END SKETCH SUMMARY

The initial state of the sketch in the merge buffer (should be empty),

### HeapCompactOrderedSketch SUMMARY:
   Estimate                : 0.0
   Upper Bound, 95% conf   : 0.0
   Lower Bound, 95% conf   : 0.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : true
   Array Size Entries      : 0
   Retained Entries        : 0
   Seed Hash               : 93cc | 37836
### END SKETCH SUMMARY

The final state of the sketch in the merge buffer (should match the original sketch from the segment scan),

### HeapCompactOrderedSketch SUMMARY:
   Estimate                : 16384.0
   Upper Bound, 95% conf   : 16384.0
   Lower Bound, 95% conf   : 16384.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : false
   Array Size Entries      : 16384
   Retained Entries        : 16384
   Seed Hash               : 93cc | 37836
### END SKETCH SUMMARY

It's changed a bit, but doesn't match up.

The code that printed this was the aggregate method in SketchBufferAggregator, which looks like this after the debugging code I added:

  @Override
  public void aggregate(ByteBuffer buf, int position)
  {
    Object update = selector.getObject();
    if (update == null) {
      return;
    }

    Union union = getOrCreateUnion(buf, position);
    final String initialUnionResult = update instanceof SketchHolder ? union.getResult().toString() : null;

    SketchAggregator.updateUnion(union, update);

    if (update instanceof SketchHolder) {
      log.info(
          "Aggregate called with buffer[%s], position[%s], update = %s, union starts as = %s, union ends as = %s",
          System.identityHashCode(buf),
          position,
          ((SketchHolder) update).getSketch(),
          initialUnionResult,
          union.getResult()
      );
    }
  }

@leerho
Copy link
Contributor

leerho commented May 8, 2019

Thank you!! We have been able to reproduce the problem. Now I can dig in to see what went wrong.

@AlexanderSaydakov
Copy link
Contributor

This is a regression in Theta sketch code. So I would think you don't want to approve the 0.14.1 release candidate as it is now. We will fix the sketches-core shortly.

@leerho
Copy link
Contributor

leerho commented May 8, 2019

@pzhdfy @gianm
DataSketches sketches-core 0.13.3 is now released to Maven Central with the fix.
Thank you @pzhdfy and @gianm for your help in finding this!!

AlexanderSaydakov added a commit to AlexanderSaydakov/druid that referenced this issue May 9, 2019
@AlexanderSaydakov
Copy link
Contributor

AlexanderSaydakov commented May 9, 2019

pull request:
#7619

@clintropolis
Copy link
Member

This is a regression in Theta sketch code. So I would think you don't want to approve the 0.14.1 release candidate as it is now. We will fix the sketches-core shortly.

0.14.1 is too far gone, the artifacts are already propagated to maven and the apache mirrors, so I'm going to go ahead and do the release anyway. I've modified the release notes to warn about upgrading if relying on theta sketches.

This issue does seem severe enough to go ahead and do a 0.14.2 since we can probably drive that through a lot quicker than we can wrap up and validate 0.15.0, so I will create an rc and start a vote as soon as possible.

clintropolis pushed a commit that referenced this issue May 10, 2019
* fix issue #7607

* exclude com.google.code.findbugs:annotations
@clintropolis
Copy link
Member

Fixed by #7619

clintropolis pushed a commit that referenced this issue May 10, 2019
* fix issue #7607

* exclude com.google.code.findbugs:annotations
gianm pushed a commit to implydata/druid-public that referenced this issue May 10, 2019
* fix issue apache#7607

* exclude com.google.code.findbugs:annotations
jihoonson pushed a commit to jihoonson/druid that referenced this issue May 11, 2019
* fix issue apache#7607

* exclude com.google.code.findbugs:annotations
clintropolis pushed a commit that referenced this issue May 11, 2019
* fix issue #7607

* exclude com.google.code.findbugs:annotations
@clintropolis clintropolis added this to the 0.14.2 milestone May 24, 2019
jihoonson added a commit to implydata/druid-public that referenced this issue Jun 4, 2019
* fix issue apache#7607

* exclude com.google.code.findbugs:annotations
jihoonson pushed a commit to implydata/druid-public that referenced this issue Jun 26, 2019
* fix issue apache#7607

* exclude com.google.code.findbugs:annotations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants