[Lens] Allow time shift with terms #103063

flash1293 · 2021-06-23T10:10:47Z

Right now terms can’t be used along with time shifts because we can’t guarantee data will be returned from all shards. The problem is the following:

When a terms agg with a sort metric is used, it’s not actually sorting all terms globally. Instead, each shard sorts the local terms locally and only sends the top shard_size terms to the coordinating node handling the request from Kibana. Here, the results from all shards are merged into a single consistent list of length size. By default the shard size parameter is a little higher than the size parameter to account for differences in local distribution. However, it can’t guarantee that the final response will be correct because it’s always possible a shard didn’t send some terms to the coordinating node that would have made it into the final list. This is the case today for regular terms aggregations on high cardinality fields with multiple shards and local differences in distribution of these values.

However for time shift, this problem becomes much more obvious in some cases:
Let’s say the user configures the following chart:
Top 3 values of “myField” ordered by count of records
Count of records
Count of records shifted by 1 week

Data for a week ago is in a different index and thus a different shard than data from today.
Both shards (a week ago and today) will locally build a list of 3 myField values and send them to the coordinating node. The coordinating node will order them by the count of records of todays data and send them to kibana.

This is how data in todays shard looks like - it will send X, Y and Z:
A 1
B 2
C 3
X 99
Y 98
Z 97

This is how data in the shard from a week ago looks like - it will send A, B and C:
A 99
B 98
C 97
X 3
Y 2
Z 1

The coordinating node tries to put together a result and sends the following:

myField	Count of records	Count of records - 1 week
X	99	-
Y	98	-
Z	97	-

The final table does not include any data for X, Y and Z - it was there, but the shard thought it wouldn’t get relevant because it had small local values.
If a user knows for sure there should be data for X, Y and Z from a week ago, they will be very confused in this case (besides misleading information that can impact business decisions).
The error will not always look this straightforward - it’s possible some of the terms make it from the previous shards while some don’t, or some terms make it on some shards, but not on all of them so the cell is filled, but the number isn’t actually correct. For sum and count of records the number can only be too small, but for average/median/min/max the number can be completely wrong as well.

Mitigating the case

By default the shard_size parameter is set to 1.5 * size + 10. We can bump up this number in case a time shift is used - maybe something like min(1000, 1.5 * size + 10). This will decrease the chance of this case happening without fully removing it - at the expense of lower performance and higher memory usage in the Elasticsearch cluster (elastic/elasticsearch#72684) We can work with the Elasticsearch team to do this in a safe manner. It depends a lot on the data the user is working with how much this will help.

Warning about the case

Elasticsearch tells us when it’s possible this is happening right now. The doc_count_error_upper_bound in the response will be 0 if it’s not possible there is missing data. For any other value there’s a chance some data is missing, but we can’t say for sure. This is a runtime error - we can’t say for sure whether it will happen and it can depend on the time range / filter / incoming data. When the case is happening, we can tell the user to either ignore it (because they know it’s not relevant to them), or do the following things:
Narrow down search scope - if there are less documents, the chance for a complete terms list is higher
Increase “size” of top values - more data transferred, but higher chance of getting all of it
I can imagine this as a warning label of some kind overlaying the chart.

As this is an issue in general (prominent for time shifts but can happen in any configuration), we should warn about it in general. See #94918

The warning could include the existing “fix action” to make terms static by turning them into filters - I still think that’s a helpful feature in general

Example bad case

A use case specific example where this problem could show up: A security analyst is looking at suspicious user activity. The build a chart: top 20 values of ips ordered by number of times the “login” endpoint has been accessed (filtered count of records metric). They wonder whether these ips have been active yesterday as well. They duplicate the count of records metric and shift it by one day. The second column doesn’t show any activity for these ips yesterday. The analyst concludes that the attack began today. However, they were actually probing the login endpoint yesterday as well, but only a small number of times. Because of this, the shard holding yesterdays data won’t return these IPs when queried for the top ips ordered by number of login attempts.

Implementation steps

Make sure terms agg will sort by the time range of the selected metric first, then by the unshifted time range (two sorting metrics) to get the best possible result (in case the distribution is similar in both shards) - otherwise the sorting of the shard not containing data for the shifted time range will fall back to alphabetical

kibana/src/plugins/data/common/search/aggs/buckets/terms.ts

Line 189 in 8cb3dbc

orderAgg = aggs.createAggConfig(
In this scenario, also set the shard_size parameter as explained above
Implement [Lens] Expose Elasticsearch accuracy warnings to the user #94918 in a general case (including a "fix it" action in the warning to turn top values into filters
Remove the error in Lens and allow top values together with multiple time shifts
Add a validation in Lens to make sure top values is never nested inside a date histogram (this would still be an error case) with a "fix it" action to swap date histogram and top values

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-06-23T10:10:49Z

Pinging @elastic/kibana-app (Team:KibanaApp)

markov00 · 2024-06-11T09:27:57Z

@stratoula can this be fixed by using ESQL instead?

stratoula · 2024-06-11T11:07:02Z

We are returning all the terms when grouped by so I think yes but it will be better to test it or ask ES as I am not 100% sure.

flash1293 added enhancement New value added to drive a business result Team:Visualizations Visualization editors, elastic-charts and infrastructure Feature:Lens labels Jun 23, 2021

flash1293 added this to Long-term goals in Lens via automation Jun 23, 2021

flash1293 mentioned this issue Jun 23, 2021

[Meta][Lens] Data Modelling #57708

Closed

drewdaemon added the impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. label Feb 9, 2023

kpatticha mentioned this issue Jun 8, 2023

[APM] Limitations impeding our migration to Lens #159295

Open

4 tasks

stratoula added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. and removed impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. labels Jun 12, 2023

markov00 added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jun 4, 2024

markov00 added the triage_needed label Jun 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Lens] Allow time shift with terms #103063

[Lens] Allow time shift with terms #103063

flash1293 commented Jun 23, 2021

elasticmachine commented Jun 23, 2021

markov00 commented Jun 11, 2024

stratoula commented Jun 11, 2024

[Lens] Allow time shift with terms #103063

[Lens] Allow time shift with terms #103063

Comments

flash1293 commented Jun 23, 2021

Mitigating the case

Warning about the case

Example bad case

Implementation steps

elasticmachine commented Jun 23, 2021

markov00 commented Jun 11, 2024

stratoula commented Jun 11, 2024