Fix histograms for sketches where min and max are equal #15381

Merged
merged 1 commit into from Nov 16, 2023

legoscia
Contributor

There is a problem with Quantiles sketches and KLL Quantiles sketches. Queries using the histogram post-aggregator fail if:

  • the sketch contains at least one value, and
  • the values in the sketch are all equal, and
  • the splitPoints argument is not passed to the post-aggregator, and
  • the numBins argument is greater than 2 (or not specified, which leads to the default of 10 being used)

In that case, the query fails and returns this error:

{
  "error": "Unknown exception",
  "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
  "host": null,
  "errorCode": "legacyQueryException",
  "persona": "OPERATOR",
  "category": "RUNTIME_FAILURE",
  "errorMessage": "Values must be unique, monotonically increasing and not NaN.",
  "context": {
    "host": null,
    "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
    "legacyErrorCode": "Unknown exception"
  }
}

This behaviour is undesirable, since the caller doesn't necessarily know in advance whether the sketch contains more than one distinct value. With this change, the post-aggregators return [N, 0, 0, ...] instead of failing, where N is the number of values in the sketch and the length of the list is equal to numBins. That is what they already returned for numBins = 2.
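
The error message hints at why the failure depends on numBins. Below is a minimal Python sketch of the mechanism, assuming that when splitPoints is omitted the post-aggregator derives numBins - 1 evenly spaced split points from the sketch's minimum and maximum. The helper names and the exact formula are illustrative assumptions, not taken from the Druid source.

```python
def default_split_points(min_value, max_value, num_bins):
    """Evenly spaced split points between min and max (illustrative assumption)."""
    delta = (max_value - min_value) / num_bins
    return [min_value + delta * (i + 1) for i in range(num_bins - 1)]


def is_strictly_increasing(points):
    """True if every point is strictly greater than its predecessor."""
    return all(a < b for a, b in zip(points, points[1:]))


# Distinct min and max: the split points are unique and increasing.
print(default_split_points(0.0, 30.0, 3))   # [10.0, 20.0]

# min == max: delta is 0, so every split point collapses onto the same value.
print(default_split_points(42.0, 42.0, 3))  # [42.0, 42.0]

# numBins = 2 yields a single split point, which cannot contain duplicates.
print(default_split_points(42.0, 42.0, 2))  # [42.0]
```

Under this assumption, any numBins greater than 2 produces at least two identical split points when all values are equal, which matches the "Values must be unique" complaint, while numBins = 2 passes the check.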

Here is an example of a query that would fail:

{
  "queryType": "timeseries",
  "dataSource": {
    "type": "inline",
    "columnNames": ["foo", "bar"],
    "rows": [
      ["abc", 42.0],
      ["def", 42.0]
    ]
  },
  "intervals": ["0000/3000"],
  "granularity": "all",
  "aggregations": [
    {"name": "the_sketch", "fieldName": "bar", "type": "quantilesDoublesSketch"}
  ],
  "postAggregations": [
    {
      "name": "the_histogram",
      "type": "quantilesDoublesSketchToHistogram",
      "field": {"type": "fieldAccess", "fieldName": "the_sketch"},
      "numBins": 3
    }
  ]
}

I believe this also fixes issue #10585.

Description

I noticed this error when trying to get histograms from quantiles sketches. At first it seemed intermittent and random, as it would go away when I changed the query a bit, but eventually I realised that it depends on the underlying data. topN queries are particularly susceptible, as it's enough for one of the dimension values to have a sketch with a single value for the entire query to fail.

I check for the case where splitPoints isn't explicitly specified and the minimum and maximum values of the sketch are equal. In that case, I don't call the sketch's getPMF method, since the result is already known. Instead, I return an array whose first element is the number of values in the sketch and whose remaining elements are zero.
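
The guard can be sketched in Python as follows. This is an illustration of the idea only, not the Java code in this PR: `toy_get_pmf` is a hypothetical stand-in for the sketch's getPMF method that works on a plain list of values, and its bin boundaries and the evenly spaced default split points are assumptions.

```python
def toy_get_pmf(values, split_points):
    """Toy stand-in for getPMF: fraction of values falling into each bin.

    It rejects split points that are not strictly increasing, mirroring
    the error message quoted in this PR.
    """
    if any(a >= b for a, b in zip(split_points, split_points[1:])):
        raise ValueError("Values must be unique, monotonically increasing and not NaN.")
    counts = [0] * (len(split_points) + 1)
    for v in values:
        # Bin index = number of split points strictly below the value (assumed).
        counts[sum(1 for s in split_points if v > s)] += 1
    return [c / len(values) for c in counts]


def histogram(values, num_bins=10):
    """Histogram counts for a non-empty list of values, with the min == max guard."""
    n = len(values)
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are equal: the result is known, so skip getPMF entirely.
        return [float(n)] + [0.0] * (num_bins - 1)
    delta = (hi - lo) / num_bins
    split_points = [lo + delta * (i + 1) for i in range(num_bins - 1)]
    return [fraction * n for fraction in toy_get_pmf(values, split_points)]


# The two rows from the failing example query, with numBins = 3.
print(histogram([42.0, 42.0], num_bins=3))  # [2.0, 0.0, 0.0]
```

Without the `lo == hi` branch, the same call would reach `toy_get_pmf` with the split points [42.0, 42.0] and raise.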

I considered changing the list of split points to something that getPMF would accept, e.g. setting delta to 1.0, or setting max to Double.MAX_VALUE and calculating delta from that. In the end, I thought that there is no obvious choice, and any way of coming up with artificial split points could cause problems depending on which values are in the sketch. (For example, if the minimum value is greater than 2^53, adding 1.0 becomes a no-op.)
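
A short Python illustration of the floating-point concern, using IEEE 754 doubles (Python floats behave the same way as Java doubles here). The choice of 2^60 as the minimum value and the delta of 1.0 are example inputs, not values from the PR.

```python
# At 2^53 the gap between adjacent doubles is already 2.0,
# so adding 1.0 rounds back to the same value.
base = 2.0 ** 53
print(base + 1.0 == base)  # True

# With a larger minimum, artificial split points built with delta = 1.0
# all collapse onto the minimum and are no longer unique.
min_value = 2.0 ** 60
delta = 1.0
points = [min_value + delta * (i + 1) for i in range(2)]
print(points[0] == points[1])  # True
```

So an artificial delta of 1.0 would reintroduce duplicate split points for sketches that hold a single very large value.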

Release note

Fixed: Histogram post-aggregators for Quantiles and KLL sketches no longer fail if all values in the sketch are equal.


Key changed/added classes in this PR
  • DoublesSketchToHistogramPostAggregator
  • KllDoublesSketchToHistogramPostAggregator
  • KllFloatsSketchToHistogramPostAggregator

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula
Contributor

Thank you @legoscia!

@AmatyaAvadhanula AmatyaAvadhanula merged commit 67f45fa into apache:master Nov 16, 2023
53 checks passed
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
writer-jill pushed a commit to writer-jill/druid that referenced this pull request Nov 20, 2023
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024