thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384 #7607

pzhdfy · 2019-05-07T06:50:33Z

Affected Version

The Druid version with sketches-core-0.13.1

Description

we updated to sketches-core-0.13.1 , because it Bug fix for Quantiles Sketches in direct mode.
then we found using thetaSketch in groupBy always return value no more than 16384(the size).
if set size to another value, such as 32768, the thetaSketch return <= 32768.
But thetaSketch in topN and timeseries return expected data

Then we roll back to sketches-core-0.12.0, thetaSketch works well.
But the Quantiles Sketches will has bug

@AlexanderSaydakov

AlexanderSaydakov · 2019-05-07T16:46:17Z

Could you be more specific please? What exactly do you do? What is the Druid version? How did you update sketches-core?

leerho · 2019-05-07T17:03:05Z

@pzhdfy
This is very strange. Can you reproduce this in a smaller environment? We already have unit tests where N >> K and that show no problem. We need much more detail in order to help you troubleshoot this.

leerho · 2019-05-07T23:57:04Z

@pzhdfy

Just to double check I have run characterization tests against the Theta Union sketch (0.13.1) using both datum updates as well as sketch updates, both on-heap and off-heap.

These tests exercise a 4K sketch up to a million uniques and all 4 test suites produce the same identical accuracy pitch-fork plot as follows:

The plot looks a little noisy because I am only running 1024 trials at each point. This is an absolutely normal accuracy characterization plot and well within accuracy specifications.

Unless you can provide more clarity and information as to how you are using the sketch there is nothing else we can do.

pzhdfy · 2019-05-08T03:54:37Z

@AlexanderSaydakov

1.use the newest Druid version( compiled from master branch, with sketches-core-0.13.1)

2.use this python generate test data

import json
f=open('test_theta_data','w')
d={}
for i in xrange(100000):
  #date_id,user_id
  d['date_id']='20190407'
  d['user_id']=i
  f.write('%s\n' % json.dumps(d))

3.use this Ingestion spec

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "test_theta",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "date_id",
            "format": "yyyyMMdd",
            "missingValue": 0
          },
          "dimensionsSpec": {
            "dimensionExclusions": [],
            "dimensions": [
              "date_id",
              "user_id"
            ]
          }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "DAY",
        "rollup": true,
        "intervals": [
          "2019-04-07/2019-04-08"
        ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "children": [
          {
            "paths": "/home/druid/druid_test_histogram/test_theta_data",
            "type": "static"
          }
        ],
        "type": "multi"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.am.push.alter.metrics": "druid_indexer_config#druid.dataSource.name#druid.hive.table#mapreduce.input.multipleinputs.dir.formats:druidIndexerConfig#druidDataSource#druidHiveTable#multipleinputsDir",
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.",
        "mapreduce.job.queuename": "root.offline.hdp.hdp.L_0",
        "mapreduce.map.java.opts": "-Xmx6G -Djava.net.preferIPv4Stack=true",
        "mapreduce.map.memory.mb": "10240",
        "mapreduce.reduce.java.opts": "-Xmx6G -Djava.net.preferIPv4Stack=true",
        "mapreduce.reduce.memory.mb": "10240",
        "mapreduce.reduce.speculative": "false"
      }
    }
  },
  "context": {
    "druid.indexer.fork.property.druid.processing.buffer.sizeBytes": 1024,
    "druid.indexer.fork.property.druid.processing.numMergeBuffers": 1,
    "druid.indexer.fork.property.druid.processing.numThreads": 1
  }
}

4.use groupby and topN

{
  "dimensions": [
    "date_id"
  ],
  "aggregations": [
    {
      "type": "thetaSketch",
      "fieldName": "user_id",
      "name": "count_distinct__user_id"
    }
  ],
  "intervals": "2019-04-07/2019-04-08",
  "limitSpec": {
    "limit": 5000,
    "type": "default",
    "columns": [
      {
        "direction": "descending",
        "dimension": "count_distinct__user_id"
      }
    ]
  },
  "granularity": "all",
  "queryType": "groupBy",
  "dataSource": "test_theta"
}

{
  "dimension": "date_id", 
  "aggregations": [
    {
      "type": "thetaSketch", 
      "fieldName": "user_id", 
      "name": "count_distinct__user_id"
    }
  ], 
  "intervals": "2019-04-07/2019-04-08", 
 "metric": "count_distinct__user_id", 
 "threshold": 5000, 
  "granularity": "all",  
  "queryType": "topN", 
  "dataSource": "test_theta"
}

the result for groupby and topN

gianm · 2019-05-08T15:47:32Z

In groupBy vs topN, as far as aggregators are concerned, one major difference is that groupBy uses relocate and topN does not. However, since you have just one date_id, I don't think it's likely that relocate would be called. So that's probably not it.

leerho · 2019-05-08T15:48:47Z

@pzhdfy

Please try sketches release 0.13.0. This will narrow down the possible changes that might be causing this.
At the points where the sketch is reporting the good result and the bad result print out the sketch summary using toString(). This might provide some clues.

gianm · 2019-05-08T16:17:37Z

I was able to reproduce this as well. Downgrading to sketches-core-0.13.0 fixed the problem. I also noticed that adding a limit to the groupBy fixed it as well. I'm not sure why - it does change the code paths, however. In Druid SQL, this query exhibits the issue:

SELECT 'beep', APPROX_COUNT_DISTINCT_DS_THETA("user_id") FROM test_theta GROUP BY 1

And this one doesn't:

SELECT 'beep', APPROX_COUNT_DISTINCT_DS_THETA("user_id") FROM test_theta GROUP BY 1 LIMIT 1

(The 'beep' is to force the SQL planner to use a groupBy rather than timeseries query type.)

gianm · 2019-05-08T16:18:07Z

Thank you, @pzhdfy, for the detailed instructions on how to reproduce this problem.

leerho · 2019-05-08T17:08:39Z

Very puzzling. Se need to simplify the problem environment to where I can reproduce the problem outside Druid. I suspect that somehow theta is being reset to 1.0, which would cause this.

gianm · 2019-05-08T17:19:11Z

@leerho, please let me know if the following is helpful, or if I could do anything else to help.

What the Druid query is doing is something like this:

Iterating over all rows in a Druid segment, and building up a theta sketch object. This object looks fine.
Taking that object and merging it into the 'merge buffer', which starts off initialized to an empty sketch. This is where it goes off the rails.

I scattered a bunch of sketch toStrings around the code and found that in step (2) they look like this:

The object built up from the segment scan,

### HeapCompactOrderedSketch SUMMARY:
   Estimate                : 100086.81001356241
   Upper Bound, 95% conf   : 101530.78009013624
   Lower Bound, 95% conf   : 98663.31883662633
   Theta (double)          : 0.16369789383615946
   Theta (long)            : 1509846576500454824
   Theta (long) hex        : 14f40d7639a635a8
   EstMode?                : true
   Empty?                  : false
   Array Size Entries      : 16384
   Retained Entries        : 16384
   Seed Hash               : 93cc | 37836
### END SKETCH SUMMARY

The initial state of the sketch in the merge buffer (should be empty),

### HeapCompactOrderedSketch SUMMARY:
   Estimate                : 0.0
   Upper Bound, 95% conf   : 0.0
   Lower Bound, 95% conf   : 0.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : true
   Array Size Entries      : 0
   Retained Entries        : 0
   Seed Hash               : 93cc | 37836
### END SKETCH SUMMARY

The final state of the sketch in the merge buffer (should match the original sketch from the segment scan),

### HeapCompactOrderedSketch SUMMARY:
   Estimate                : 16384.0
   Upper Bound, 95% conf   : 16384.0
   Lower Bound, 95% conf   : 16384.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : false
   Array Size Entries      : 16384
   Retained Entries        : 16384
   Seed Hash               : 93cc | 37836
### END SKETCH SUMMARY

It's changed a bit, but doesn't match up.

The code that printed this was the aggregate method in SketchBufferAggregator, which looks like this after the debugging code I added:

  @Override
  public void aggregate(ByteBuffer buf, int position)
  {
    Object update = selector.getObject();
    if (update == null) {
      return;
    }

    Union union = getOrCreateUnion(buf, position);
    final String initialUnionResult = update instanceof SketchHolder ? union.getResult().toString() : null;

    SketchAggregator.updateUnion(union, update);

    if (update instanceof SketchHolder) {
      log.info(
          "Aggregate called with buffer[%s], position[%s], update = %s, union starts as = %s, union ends as = %s",
          System.identityHashCode(buf),
          position,
          ((SketchHolder) update).getSketch(),
          initialUnionResult,
          union.getResult()
      );
    }
  }

leerho · 2019-05-08T18:58:10Z

Thank you!! We have been able to reproduce the problem. Now I can dig in to see what went wrong.

AlexanderSaydakov · 2019-05-08T22:56:06Z

This is a regression in Theta sketch code. So I would think you don't want to approve the 0.14.1 release candidate as it is now. We will fix the sketches-core shortly.

leerho · 2019-05-08T23:55:44Z

@pzhdfy @gianm
DataSketches sketches-core 0.13.3 is now released to Maven Central with the fix.
Thank you @pzhdfy and @gianm for your help in finding this!!

AlexanderSaydakov · 2019-05-09T00:16:44Z

pull request:
#7619

clintropolis · 2019-05-09T02:40:44Z

This is a regression in Theta sketch code. So I would think you don't want to approve the 0.14.1 release candidate as it is now. We will fix the sketches-core shortly.

0.14.1 is too far gone, the artifacts are already propagated to maven and the apache mirrors, so I'm going to go ahead and do the release anyway. I've modified the release notes to warn about upgrading if relying on theta sketches.

This issue does seem severe enough to go ahead and do a 0.14.2 since we can probably drive that through a lot quicker than we can wrap up and validate 0.15.0, so I will create an rc and start a vote as soon as possible.

* fix issue #7607 * exclude com.google.code.findbugs:annotations

clintropolis · 2019-05-10T00:48:39Z

Fixed by #7619

* fix issue #7607 * exclude com.google.code.findbugs:annotations

* fix issue apache#7607 * exclude com.google.code.findbugs:annotations

* fix issue #7607 * exclude com.google.code.findbugs:annotations

* fix issue apache#7607 * exclude com.google.code.findbugs:annotations

AlexanderSaydakov added a commit to AlexanderSaydakov/druid that referenced this issue May 9, 2019

fix issue apache#7607

a976bae

clintropolis mentioned this issue May 9, 2019

0.14.1-incubating release notes #7553

Closed

clintropolis pushed a commit that referenced this issue May 10, 2019

fix issue #7607 (#7619)

59f9ff3

* fix issue #7607 * exclude com.google.code.findbugs:annotations

AlexanderSaydakov mentioned this issue May 10, 2019

fix issue with theta sketch in druid-datasketches extension #7619

Merged

clintropolis closed this as completed May 10, 2019

clintropolis pushed a commit that referenced this issue May 10, 2019

fix issue #7607 (#7619)

cf9db11

* fix issue #7607 * exclude com.google.code.findbugs:annotations

gianm pushed a commit to implydata/druid-public that referenced this issue May 10, 2019

fix issue apache#7607 (apache#7619)

e55e9cf

* fix issue apache#7607 * exclude com.google.code.findbugs:annotations

jihoonson pushed a commit to jihoonson/druid that referenced this issue May 11, 2019

fix issue apache#7607 (apache#7619)

483f05b

* fix issue apache#7607 * exclude com.google.code.findbugs:annotations

clintropolis pushed a commit that referenced this issue May 11, 2019

fix issue #7607 (#7619) (#7636)

4116917

* fix issue #7607 * exclude com.google.code.findbugs:annotations

clintropolis mentioned this issue May 24, 2019

0.14.2-incubating release notes #7628

Closed

clintropolis added the Bug label May 24, 2019

clintropolis added this to the 0.14.2 milestone May 24, 2019

jihoonson added a commit to implydata/druid-public that referenced this issue Jun 4, 2019

fix issue apache#7607 (apache#7619) (apache#7636)

efaf109

* fix issue apache#7607 * exclude com.google.code.findbugs:annotations

jihoonson pushed a commit to implydata/druid-public that referenced this issue Jun 26, 2019

fix issue apache#7607 (apache#7619)

a78f269

* fix issue apache#7607 * exclude com.google.code.findbugs:annotations

burningice2012 mentioned this issue Jul 18, 2019

Incorrect quantiles result produced by quantilesDoublesSketch #8095

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384 #7607

thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384 #7607

pzhdfy commented May 7, 2019 •

edited

Loading

AlexanderSaydakov commented May 7, 2019

leerho commented May 7, 2019

leerho commented May 7, 2019 •

edited

Loading

pzhdfy commented May 8, 2019 •

edited

Loading

gianm commented May 8, 2019

leerho commented May 8, 2019 •

edited

Loading

gianm commented May 8, 2019

gianm commented May 8, 2019

leerho commented May 8, 2019

gianm commented May 8, 2019

leerho commented May 8, 2019

AlexanderSaydakov commented May 8, 2019

leerho commented May 8, 2019 •

edited

Loading

AlexanderSaydakov commented May 9, 2019 •

edited

Loading

clintropolis commented May 9, 2019

clintropolis commented May 10, 2019

thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384 #7607

thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384 #7607

Comments

pzhdfy commented May 7, 2019 • edited Loading

Affected Version

Description

AlexanderSaydakov commented May 7, 2019

leerho commented May 7, 2019

leerho commented May 7, 2019 • edited Loading

pzhdfy commented May 8, 2019 • edited Loading

1.use the newest Druid version( compiled from master branch, with sketches-core-0.13.1)

2.use this python generate test data

3.use this Ingestion spec

gianm commented May 8, 2019

leerho commented May 8, 2019 • edited Loading

gianm commented May 8, 2019

gianm commented May 8, 2019

leerho commented May 8, 2019

gianm commented May 8, 2019

leerho commented May 8, 2019

AlexanderSaydakov commented May 8, 2019

leerho commented May 8, 2019 • edited Loading

AlexanderSaydakov commented May 9, 2019 • edited Loading

clintropolis commented May 9, 2019

clintropolis commented May 10, 2019

pzhdfy commented May 7, 2019 •

edited

Loading

leerho commented May 7, 2019 •

edited

Loading

pzhdfy commented May 8, 2019 •

edited

Loading

leerho commented May 8, 2019 •

edited

Loading

leerho commented May 8, 2019 •

edited

Loading

AlexanderSaydakov commented May 9, 2019 •

edited

Loading