Weird query results when using the DataSketches Quantiles Sketch #8659

Open
QiuMM opened this issue Oct 10, 2019 · 4 comments

@QiuMM
Member

QiuMM commented Oct 10, 2019

We used the DataSketches quantiles sketch to compute quantiles and got very strange query results.

Affected Version

0.12.2

Description

Metrics spec at ingestion time:

"metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "doubleSum",
        "name": "cm_value",
        "fieldName": "cm_value",
        "expression": null
      },
      {
        "type": "quantilesDoublesSketch",
        "name": "cm_value_sketch",
        "fieldName": "cm_value",
        "k": 128
      }
]

My query:

 "aggregations": [
    {
      "type": "quantilesDoublesSketch",
      "name": "custom_value_sketch",
      "fieldName": "cm_value"
    },
    {
      "type": "doubleSum",
      "name": "count",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "cm_value_sum",
      "fieldName": "cm_value"
    }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantiles",
      "name": "quantiles",
      "fractions": [
        0.1,
        0.2,
        0.3,
        0.4,
        0.5,
        0.6,
        0.7,
        0.8,
        0.9,
        1
      ],
      "field": {
        "type": "fieldAccess",
        "fieldName": "custom_value_sketch"
      }
    }
  ]

The query result:

"result" : {
    "count" : 4223.0,
    "cm_value_sum" : 667109.0,
    "quantiles" : [ 52.0, 179.0, 515.0, 929.0, 1185.0, 1426.0, 1680.0, 2047.0, 2601.0, 6000.0 ],
    "custom_value_sketch" : 529
  }

As we can see, the 0.5-quantile is 1185.0, so roughly half of the cm_value values must be greater than or equal to 1185.0. However, multiplying 1185 by 2111 (half of the count) gives 2501535, which is far greater than 667109, the sum of cm_value. Assuming cm_value is never negative, that is impossible, so this should not happen. We loaded the same data into Hive, and querying Hive returned:

"result" : {
    "count" : 4223.0,
    "cm_value_sum" : 667109.0,
    "quantiles" : [ 70.0, 82.0, 96.0, 112.0, 136.0, 160.0, 189.0, 229.0, 274.8000000000002, 3368.0 ]
  }
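
The consistency argument above can be written as a small check. This is only a sketch, assuming cm_value is never negative; the helper name `min_sum_implied_by_quantile` is made up for illustration and is not part of Druid or DataSketches. If a fraction `f` of the values lies below a quantile value `q`, then about `count * (1 - f)` values are at least `q`, so the sum cannot be smaller than `q` times that number.

```python
def min_sum_implied_by_quantile(quantile_value, fraction, count):
    """Lower bound on the sum of non-negative values, given that about
    (1 - fraction) of the `count` values are >= quantile_value."""
    at_or_above = int(count * (1 - fraction))  # round down to stay conservative
    return quantile_value * at_or_above


# Numbers copied from the two results above.
count = 4223
total = 667109.0
druid_median = 1185.0
hive_median = 136.0

druid_bound = min_sum_implied_by_quantile(druid_median, 0.5, count)
hive_bound = min_sum_implied_by_quantile(hive_median, 0.5, count)

# A bound larger than the reported sum means the quantile cannot be right.
print("Druid:", druid_bound, "consistent" if druid_bound <= total else "inconsistent")
print("Hive: ", hive_bound, "consistent" if hive_bound <= total else "inconsistent")
```

With these numbers the Druid median implies a sum of at least 2501535, well above 667109, while the Hive median implies at least 287096, which is compatible with the reported sum.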

@AlexanderSaydakov is there a bug in the DataSketches Quantiles Sketch, or am I using it the wrong way?

@AlexanderSaydakov
Contributor

AlexanderSaydakov commented Oct 10, 2019 via email

@QiuMM
Member Author

QiuMM commented Oct 10, 2019

@AlexanderSaydakov thanks. Would it be enough to upgrade only the DataSketches extension rather than the whole of Druid?

@AlexanderSaydakov
Contributor

I am not sure which version of the extension is compatible with which version of Druid. It is always built as part of the whole Druid package.

@QiuMM
Member Author

QiuMM commented Oct 10, 2019

Okay, I'll give it a try. Thanks, @AlexanderSaydakov.
