[Lens] Can't create a Lens visualization that produces accurate count(distinct) on field values #179934

bradquarry · 2024-04-03T13:39:09Z

I'm trying to figure out how to run a count(distinct) query in Lens that produces precise results. The Unique Count aggregation in Lens produces wrong results compared with an external deterministic unique count computed in Python. I imagine this is because Lens uses the approximate cardinality aggregation, which chooses between two estimation algorithms for counts and does not use a deterministic approach. If true, this is a problem: financial services customers need Lens to produce precise high-cardinality counts (millions of accounts) for daily reporting, and inaccurate counts are not acceptable.

Anyway, to get an accurate distinct count I have to run a terms aggregation followed by a sum_bucket agg like so. The partitioning will be required as the counts are in the millions and I don't want to hit circuit breakers due to memory constraints.

How can I execute the type of query below in Lens to guarantee accurate unique count output every time from very high cardinality unique count aggregations? What if I had a billion unique values? Auto partitioning would be great here.

GET test/_search
{
  "aggs": {
    "counts": {
      "terms": {
        "field": "field.keyword",
        "include": {
          "partition": 0,
          "num_partitions": 2
        }
      }
    },
    "sum_buckets": {
      "sum_bucket": {
        "buckets_path": "counts>_count"
      }
    }
  }
}

GET test/_search
{
  "aggs": {
    "counts": {
      "terms": {
        "field": "field.keyword",
        "include": {
          "partition": 1,
          "num_partitions": 2
        }
      }
    },
    "sum_buckets": {
      "sum_bucket": {
        "buckets_path": "counts>_count"
      }
    }
  }
}
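The two requests above differ only in the partition number, so the request bodies can be generated in a loop. Here is a minimal Python sketch of that idea, not an official client API: `build_partition_query` and `count_distinct` are hypothetical helper names, and `size` is an assumed per-partition bucket limit that must be at least as large as the number of distinct values in any single partition. One deliberate difference from the requests above, as far as I can tell from how `sum_bucket` works: summing `counts>_count` adds up document counts, so the sketch counts the buckets returned per partition instead (one bucket per distinct value).

```python
from typing import Any, Dict, Iterable


def build_partition_query(field: str, partition: int, num_partitions: int,
                          size: int = 10000) -> Dict[str, Any]:
    """Build the search body for one partition of a terms aggregation."""
    if not 0 <= partition < num_partitions:
        raise ValueError("partition must be in [0, num_partitions)")
    return {
        "size": 0,  # no hits needed, only aggregation buckets
        "aggs": {
            "counts": {
                "terms": {
                    "field": field,
                    "size": size,  # assumed limit; must cover the partition
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                }
            }
        },
    }


def count_distinct(responses: Iterable[Dict[str, Any]]) -> int:
    """Add up the number of buckets (distinct values) across partitions."""
    return sum(len(r["aggregations"]["counts"]["buckets"]) for r in responses)


# One request body per partition, ready to send with any HTTP client.
bodies = [build_partition_query("field.keyword", p, 2) for p in range(2)]
```

Each partition holds a disjoint subset of the terms, so the per-partition bucket counts can be added without double counting, provided no partition was truncated by `size`.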


elasticmachine · 2024-04-03T13:39:20Z

Pinging @elastic/kibana-visualizations (Team:Visualizations)

markov00 · 2024-04-03T16:11:11Z

Thanks @bradquarry, I've found the original issue. I will link this one there so that we also have your suggestion/POV #179934

bradquarry · 2024-04-03T16:26:18Z

@markov00 I don't think my issue is directly related to the issue you linked to. No matter how you use the cardinality aggregation in Elastic, it does not guarantee deterministic results; it simply chooses between a "more accurate or less accurate guesstimation". Even if you enable the parameter suggested, you are still using guesstimation algorithms. At least, this is my understanding of the cardinality aggregation algorithms.

This issue focuses on giving customers the ability to get reliable and deterministic results for any unique count in Lens, using a different aggregation strategy and regardless of the data's cardinality.

Some customers don't trust our product in financial services due to the use of the cardinality agg in Lens. Their unique counts that must be exact aren't exact and we need to change this perception to grow in this market.
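For comparison, this is roughly what the cardinality request discussed in this comment looks like at its highest precision setting. It is a minimal Python sketch under assumptions: I am assuming the "parameter suggested" is `precision_threshold` (the linked issue title mentions a precision threshold), `build_cardinality_query` and the aggregation name `unique_count` are hypothetical, and 40000 is the documented maximum for that parameter. Above the threshold the result is still an estimate, which is the point made above.

```python
from typing import Any, Dict

# Documented upper limit of precision_threshold for the cardinality aggregation.
MAX_PRECISION_THRESHOLD = 40000


def build_cardinality_query(
    field: str, precision_threshold: int = MAX_PRECISION_THRESHOLD
) -> Dict[str, Any]:
    """Build a search body for an approximate unique count on `field`."""
    if precision_threshold < 0:
        raise ValueError("precision_threshold must be non-negative")
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "unique_count": {
                "cardinality": {
                    "field": field,
                    # Larger values are capped at the documented maximum.
                    "precision_threshold": min(
                        precision_threshold, MAX_PRECISION_THRESHOLD
                    ),
                }
            }
        },
    }
```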

dej611 · 2024-04-05T09:45:28Z

Lens uses the aggregations available in Elasticsearch, because an exact unique count cannot be computed client-side (unless the whole set of values is loaded into client memory...).
I've found a related issue on Elasticsearch: elastic/elasticsearch#15876

If Elasticsearch decides to address it, then we could consider supporting it in Kibana Lens as well.

bradquarry · 2024-04-05T12:00:34Z

I provided an example above of how to do this using a terms aggregation and a sum bucket. Use that plus msearch and some automatic or manual partitioning, then sum the msearch results on the client side. It works as a workaround for my customer, but they can't use Kibana. I'm experimenting with Vega to help, but it's a learning experience.
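The msearch part of this workaround could be sketched in Python as follows. This is an illustration under assumptions, not the customer's actual code: `build_msearch_payload` and `combine_msearch` are hypothetical names, the payload follows the newline-delimited header/body format that `_msearch` expects, and the example combines partitions by counting term buckets named `counts` (an assumption; any per-response extractor can be passed in).

```python
import json
from typing import Any, Callable, Dict, List


def build_msearch_payload(index: str, bodies: List[Dict[str, Any]]) -> str:
    """Serialize search bodies into the NDJSON format used by _msearch."""
    lines = []
    for body in bodies:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(body))              # search body line
    return "\n".join(lines) + "\n"  # the payload must end with a newline


def combine_msearch(response: Dict[str, Any],
                    per_response: Callable[[Dict[str, Any]], int]) -> int:
    """Sum a per-search value across all entries of an msearch response."""
    total = 0
    for i, item in enumerate(response["responses"]):
        if "error" in item:
            # A failed partition would silently undercount, so stop here.
            raise RuntimeError(f"search {i} failed: {item['error']}")
        total += per_response(item)
    return total


def bucket_count(item: Dict[str, Any]) -> int:
    """Number of term buckets in one search response (assumed agg name)."""
    return len(item["aggregations"]["counts"]["buckets"])
```

The payload string can be POSTed to `test/_msearch` with any HTTP client, and `combine_msearch(response_json, bucket_count)` then gives the combined total.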

dej611 · 2024-04-05T12:19:22Z

As far as I know terms aggregation is approximate as well: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#terms-agg-doc-count-error
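Whether a given terms response was affected by that approximation can be checked from two fields the aggregation returns, `doc_count_error_upper_bound` and `sum_other_doc_count`. A minimal Python sketch of such a check follows; `terms_result_is_complete` is a hypothetical helper, and treating "both fields are zero" as "every term was returned with an exact count" is my reading of the linked documentation rather than a documented guarantee.

```python
from typing import Any, Dict


def terms_result_is_complete(terms_agg: Dict[str, Any]) -> bool:
    """Return True if the terms aggregation reports no error and no
    documents left in unreturned buckets."""
    # A missing field yields None, which is treated as "not verified".
    return (
        terms_agg.get("doc_count_error_upper_bound") == 0
        and terms_agg.get("sum_other_doc_count") == 0
    )


# Example with a hand-written response fragment (not real cluster output).
example = {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [{"key": "a", "doc_count": 2}],
}
complete = terms_result_is_complete(example)
```

If the check fails for a partition, raising `size` or `num_partitions` and re-running that partition is one way to get back to an exact result.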

bradquarry added Team:Visualizations Visualization editors, elastic-charts and infrastructure Feature:Lens labels Apr 3, 2024

kibanamachine added this to Long-term goals in Lens Apr 3, 2024

bradquarry changed the title ~~Can't create a Lens visualization that produces accurate count(distinct) on field values~~ [Lens] Can't create a Lens visualization that produces accurate count(distinct) on field values Apr 3, 2024

markov00 mentioned this issue Apr 3, 2024

[Lens] Unique count aggregation should have control for precision threshold and warning about estimates #69832

Open

markov00 closed this as completed Apr 3, 2024

bradquarry reopened this Apr 3, 2024

dej611 added the enhancement New value added to drive a business result label Apr 4, 2024

wchaparro mentioned this issue Apr 8, 2024

[ES|QL] High accuracy cardinality aggregation elastic/elasticsearch#107231

Open
