Fix bug in filtering Rules' log history in stack management #171409

Closed
maryam-saeidi opened this issue Nov 16, 2023 · 4 comments · Fixed by #172228

Comments

maryam-saeidi (Member) commented Nov 16, 2023

🐞 Summary

When checking Rules' logs in Stack Management, the filter does not work as expected (slack, QA environment).

Screen.Recording.2023-11-14.at.09.32.27.mov

Besides the incorrect filtering, the table showed rows for the first search but none for the second.

maryam-saeidi added the Team:ResponseOps label Nov 16, 2023
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

Zacqary self-assigned this Nov 20, 2023
Zacqary (Contributor) commented Nov 29, 2023

I'm only able to reproduce this when there are more than 1000 logs to filter through, and even then only some filters fail while others succeed. It appears to be related to data volume. Investigating further.

Zacqary (Contributor) commented Nov 29, 2023

Welp, here's our problem:

```ts
const DEFAULT_MAX_BUCKETS_LIMIT = 1000; // do not retrieve more than this number of executions
const DEFAULT_MAX_KPI_BUCKETS_LIMIT = 10000;
```

I'm not sure why the limits for log buckets and KPI buckets are different. That seems odd, and a quick change of the limit to 10000 doesn't seem to break anything when testing locally.

The first goal is to find a better failure mode for when we attempt to aggregate over more than 1000 logs; after that, we should determine whether syncing this bucket limit with the KPI limit causes any problems.
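
If we do sync them, a minimal sketch of the change (sharing the KPI value is an assumption under local testing, not a settled decision):

```ts
// Hypothetical sketch: align the execution log bucket limit with the KPI limit.
// Constant names come from the snippet above; the shared value is an assumption.
const DEFAULT_MAX_KPI_BUCKETS_LIMIT = 10000;
const DEFAULT_MAX_BUCKETS_LIMIT = DEFAULT_MAX_KPI_BUCKETS_LIMIT; // was 1000
```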

@Zacqary
Copy link
Contributor

Zacqary commented Nov 29, 2023

Ok, this is the problem: we have two different aggregations, one determining the total number of logs and one returning the actual logs themselves:

```ts
executionUuidCardinality: {
  filter: {
    size: DEFAULT_MAX_BUCKETS_LIMIT,
    bool: {
      ...(dslFilterQuery ? { filter: dslFilterQuery } : {}),
      must: [getProviderAndActionFilter('alerting', 'execute')],
    },
  },
  aggs: {
    executionUuidCardinality: {
      cardinality: {
        field: EXECUTION_UUID_FIELD,
      },
    },
  },
},
executionUuid: {
  // Bucket by execution UUID
  terms: {
    field: EXECUTION_UUID_FIELD,
    size: DEFAULT_MAX_BUCKETS_LIMIT,
    order: formatSortForTermSort(sort),
  },
  // ...
},
```

`executionUuidCardinality` is used to determine the total number of buckets, but it isn't subject to `DEFAULT_MAX_BUCKETS_LIMIT`, so it isn't an accurate count of the logs actually available to `executionUuid`. There doesn't seem to be a way to apply a `size` prop to a cardinality metric, so I don't know how to make this an accurate total.

The logs API sends back the cardinality as the data table's total, and the logs as the actual rows to fill it with. When the total exceeds the length of the data, the table renders the difference as empty rows.
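
To make the failure mode concrete, here is a minimal sketch of that mismatch (all names here are hypothetical illustrations, not the actual table code):

```ts
// Hypothetical illustration of the table behavior described above.
interface LogRow {
  executionUuid: string;
  status: string;
}

function buildTableRows(
  total: number, // cardinality reported by the API, not capped by the bucket limit
  data: LogRow[], // aggregated logs, capped at DEFAULT_MAX_BUCKETS_LIMIT buckets
  maxDisplay = 1000
): Array<LogRow | null> {
  const rowCount = Math.min(total, maxDisplay);
  // When the terms agg hit its bucket limit, data.length < rowCount and the
  // remainder renders as blank rows -- the symptom in the screen recording.
  return Array.from({ length: rowCount }, (_, i) => data[i] ?? null);
}

// e.g. total = 1200 (uncapped cardinality), data.length = 0 (terms agg at its
// limit) => 1000 blank rows.
```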

I recommend syncing up `DEFAULT_MAX_BUCKETS_LIMIT` and `DEFAULT_MAX_KPI_BUCKETS_LIMIT`. The UI already caps the display at 1000 rows on its own, and since we're doing the same amount of work to fetch the KPIs, the extra buckets aren't much wasted effort.

We can then pull the data table total from the KPI aggregation instead of the cardinality.
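
One way to report a total that matches what the terms agg can actually return is to sum the bucket doc counts with a sibling pipeline aggregation; a rough sketch, reusing the field and constant names from the snippet above (this is not the exact shape of the eventual fix):

```ts
// Sketch only: derive the total from the same limit-capped buckets the table
// rows come from, instead of from an uncapped cardinality.
const aggs = {
  executionUuid: {
    terms: {
      field: EXECUTION_UUID_FIELD,
      size: DEFAULT_MAX_BUCKETS_LIMIT,
      order: formatSortForTermSort(sort),
    },
  },
  executionUuidTotal: {
    // sum_bucket is a sibling pipeline aggregation: it sums a metric across
    // the (limit-capped) buckets of executionUuid. Assuming one qualifying
    // doc per execution, this matches the number of rows the table can get.
    sum_bucket: {
      buckets_path: 'executionUuid._count',
    },
  },
};
```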

XavierM added a commit that referenced this issue Jan 9, 2024
[RAM] Fix rule log aggregations when filtering >1k logs (#172228)
## Summary

Fixes #171409 

This fixes weird behavior when attempting to filter more than 1000 rule
execution logs.

The execution log aggregator had two problems:

- It had a max bucket limit of 1000, while the KPI agg had a max bucket
limit of 10000, leading to inconsistent results
- For pagination, it reported the total number of logs by using a
cardinality aggregator, which wasn't subject to any bucket limit at all

The endpoint responds with a `total` (the cardinality) and an array of
`data` (a paginated slice of the aggregated logs). The way the data
table UI works is that it takes the `total` value, caps it at 1000, and
then generates that many blank rows to fill with whatever's in the
`data` array.

So as seen in the issue, we could easily run into a situation where:

1. User enters a filter query that last appeared more than 1000 logs ago
2. The cardinality agg accurately reports the number of logs that should
be fetched, and generates enough blank rows for these logs
3. The actual execution log agg hits the 1000 bucket limit, and returns
an empty `data` array

This PR fixes this by using a bucket sum aggregation to determine the
data table's `total` instead of a cardinality aggregation.

It also ups the bucket limit from 1000 to 10000, to match the bucket
limit from the summary KPI endpoint. This prevents a situation where:

1. User enters a filter query that last appeared in <1000 logs, but more
than 1000 logs ago
2. Summary KPI widget, with a bucket limit of 10000, accurately displays
a count of the <1000 logs this query matches
3. The data table is empty because the execution log bucket limit tops
out at 1000

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Xavier Mouligneau <xavier.mouligneau@elastic.co>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jan 9, 2024 (cherry picked from commit 9d32f88; same description as the commit above)
kibanamachine added a commit that referenced this issue Jan 10, 2024

# Backport

This will backport the following commit from `main` to `8.12`:
- [RAM] Fix rule log aggregations when filtering >1k logs (#172228) (#174557)

Co-authored-by: Zacqary Adam Xeper <Zacqary@users.noreply.github.com>
nreese pushed a commit to nreese/kibana that referenced this issue Jan 10, 2024 (same description as the commit above)
delanni pushed a commit to delanni/kibana that referenced this issue Jan 11, 2024 (same description as the commit above)
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this issue Feb 15, 2024 (same description as the commit above)