Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lens] Can't create a Lens visualization that produces accurate count(distinct) on field values #179934

Open
bradquarry opened this issue Apr 3, 2024 · 6 comments
Labels
enhancement New value added to drive a business result Feature:Lens Team:Visualizations Visualization editors, elastic-charts and infrastructure
Projects

Comments

@bradquarry
Copy link

bradquarry commented Apr 3, 2024

I’m trying to figure out how I can run a count(distinct) query in Lens and generate precise results. Our Unique Count aggregation in Lens is producing wrong results vs an external deterministic unique count using python. I imagine it’s because we are using the imprecise cardinality aggregation in Lens, which chooses between two estimation algorithms for counts and does not use a deterministic approach. If true, this isn't good as financial services customers need to use Lens for precise high cardinality counts (millions of accounts) for daily reporting needs and you can't have inaccurate counts.

Anyway, to get an accurate distinct count I have to run a terms aggregation followed by a sum_bucket agg like so. The partitioning will be required as the counts are in the millions and I don't want to hit circuit breakers due to memory constraints.

How can I execute the type of query below in Lens to guarantee accurate unique count output every time from very high cardinality unique count aggregations? What if I had a billion unique values? Auto partitioning would be great here.

GET test/_search
{
  “aggs”: {
    “counts”: {
      “terms”: { “field”: “field.keyword”,
      “include”: {
               “partition”: 0,
               “num_partitions”: 2
            }
      }
    },“sum_buckets”: {
      “sum_bucket”: {
        “buckets_path”: “counts>_count”
      }
    }
  }
}

GET test/_search
{
  “aggs”: {
    “counts”: {
      “terms”: { “field”: “field.keyword”,
      “include”: {
               “partition”: 1,
               “num_partitions”: 2
            }
      }
    },“sum_buckets”: {
      “sum_bucket”: {
        “buckets_path”: “counts>_count”
      }
    }
  }
}
@bradquarry bradquarry added Team:Visualizations Visualization editors, elastic-charts and infrastructure Feature:Lens labels Apr 3, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-visualizations (Team:Visualizations)

@kibanamachine kibanamachine added this to Long-term goals in Lens Apr 3, 2024
@bradquarry bradquarry changed the title Can't create a Lens visualization that produces accurate count(distinct) on field values [Lens] Can't create a Lens visualization that produces accurate count(distinct) on field values Apr 3, 2024
@markov00
Copy link
Member

markov00 commented Apr 3, 2024

Thanks @bradquarry, I've found the original issue, I will link this there so we can have also your suggestion/pow #179934

@markov00 markov00 closed this as completed Apr 3, 2024
@bradquarry
Copy link
Author

bradquarry commented Apr 3, 2024

@markov00 I don't think my issue is directly related to the issue you linked to. No matter how you use the cardinality aggregation in Elastic it does does not guarantee deterministic results, it simply chooses between a "more accurate or less accurate guesstimation". Even if you enable the parameter suggested you are still using guestimation algorithms. At least, this is my understanding of the cardinality aggregation algorithms.

This issue focuses on providing customers with the ability to reliable and deterministic results for any unique count in Lens using a different aggregation strategy. Also, no matter what the data cardinality.

Some customers don't trust our product in financial services due to the use of the cardinality agg in Lens. Their unique counts that must be exact aren't exact and we need to change this perception to grow in this market.

@bradquarry bradquarry reopened this Apr 3, 2024
@dej611 dej611 added the enhancement New value added to drive a business result label Apr 4, 2024
@dej611
Copy link
Contributor

dej611 commented Apr 5, 2024

Lens is using available aggregations from Elasticsearch, as that cannot be computed client-side (unless loading the whole values dataset in client memory...).
I've found a related issue on Elasticsearch: elastic/elasticsearch#15876

If the Elasticsearch decides to address it then we could consider to support it in Kibana Lens as well.

@bradquarry
Copy link
Author

bradquarry commented Apr 5, 2024

I provided an example of how to do this above using a term's aggregation and a sum bucket. Use the above plus msearch and some automatic or manual partitioning then sum the results of msearch on the client side. It works as a workaround for my customer, but they can't use Kibana. I'm experimenting with Vega to help, but it's a learning experience.

@dej611
Copy link
Contributor

dej611 commented Apr 5, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Lens Team:Visualizations Visualization editors, elastic-charts and infrastructure
Projects
No open projects
Lens
  
Long-term goals
Development

No branches or pull requests

4 participants