# Fix bug in filtering Rules' log history in stack management #171409
Pinging @elastic/response-ops (Team:ResponseOps)
Only able to reproduce this when there are more than 1000 logs to filter through, and even then only some filters fail and some succeed. It's somehow related to having a lot of data. Investigating further.
Welp, here's our problem:

```ts
const DEFAULT_MAX_BUCKETS_LIMIT = 1000; // do not retrieve more than this number of executions
const DEFAULT_MAX_KPI_BUCKETS_LIMIT = 10000;
```

I'm not sure why the limits for log buckets and KPI buckets are different. That seems odd, and a quick change of the limit to 10000 doesn't seem to break anything when testing locally. The first goal is to figure out a better failure mode for when we attempt to aggregate over more than 1000 logs (see the sketch below), and then we should figure out whether syncing this bucket limit with the KPI one is a major issue or not.
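A minimal sketch of what that failure mode could look like, assuming we read the standard `sum_other_doc_count` field that Elasticsearch returns on `terms` aggregations. The interface and helper names here are hypothetical, not actual Kibana code:

```ts
// Hypothetical helper: detect when the terms aggregation dropped executions
// because it hit the bucket limit, and fail loudly instead of silently
// returning an incomplete page.
interface TermsAggResult {
  buckets: Array<{ key: string; doc_count: number }>;
  sum_other_doc_count: number; // docs in buckets beyond the ones returned
}

function assertNotTruncated(agg: TermsAggResult, limit: number): void {
  if (agg.sum_other_doc_count > 0) {
    throw new Error(
      `Execution log aggregation exceeded the ${limit}-bucket limit; ` +
        `results would be incomplete. Narrow the date range or filter.`
    );
  }
}
```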
OK, this is the problem. We have two different aggregations: one determining the total number of logs, and one returning the actual logs themselves:

```ts
executionUuidCardinality: {
  filter: {
    bool: {
      ...(dslFilterQuery ? { filter: dslFilterQuery } : {}),
      must: [getProviderAndActionFilter('alerting', 'execute')],
    },
  },
  aggs: {
    executionUuidCardinality: {
      cardinality: {
        field: EXECUTION_UUID_FIELD,
      },
    },
  },
},
executionUuid: {
  // Bucket by execution UUID
  terms: {
    field: EXECUTION_UUID_FIELD,
    size: DEFAULT_MAX_BUCKETS_LIMIT,
    order: formatSortForTermSort(sort),
  },
  // ...
},
```

Note that the `cardinality` side has no bucket limit, while the `terms` side is capped at `DEFAULT_MAX_BUCKETS_LIMIT`.
Anyway, the logs API sends back the cardinality as the data table's `total`. I recommend syncing up the two bucket limits so the log agg matches the KPI agg's 10000. We can then pull the data table total from the KPI aggregation instead of the cardinality.
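A rough sketch of that direction, assuming Elasticsearch's `sum_bucket` sibling pipeline aggregation (an illustration, not necessarily the exact change that eventually landed): derive the reported total from the same bucketed aggregation that produces the page of logs, so the two can never disagree. The field name is an assumption:

```ts
// Sketch only: derive `total` from the bucketed data rather than from an
// unbounded cardinality aggregation.
const DEFAULT_MAX_BUCKETS_LIMIT = 10000; // raised to match the KPI limit
const EXECUTION_UUID_FIELD = 'kibana.alert.rule.execution.uuid'; // assumed field name

const aggs = {
  executionUuid: {
    // Bucket by execution UUID, capped at the (now larger) bucket limit.
    terms: { field: EXECUTION_UUID_FIELD, size: DEFAULT_MAX_BUCKETS_LIMIT },
  },
  executionTotal: {
    // Sibling pipeline agg: sums the per-bucket doc counts of the terms agg,
    // so it can only count documents that actually landed in a bucket.
    sum_bucket: { buckets_path: 'executionUuid>_count' },
  },
};
```

Unlike the cardinality value, a pipeline aggregation over the terms buckets is inherently capped by the bucket limit, so `total` and `data` stay consistent.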
## Summary

Fixes #171409

This fixes weird behavior when attempting to filter more than 1000 rule execution logs.

The execution log aggregator had two problems:

- It had a max bucket limit of 1000, while the KPI agg had a max bucket limit of 10000, leading to inconsistent results
- For pagination, it reported the total number of logs using a cardinality aggregator, which wasn't subject to any bucket limit at all

The endpoint responds with a `total` (the cardinality) and an array of `data` (a paginated slice of the aggregated logs). The data table UI takes the `total` value, caps it at 1000, and then generates that many blank rows to fill with whatever's in the `data` array.

So, as seen in the issue, we could easily run into a situation where:

1. The user enters a filter query that last appeared more than 1000 logs ago
2. The cardinality agg accurately reports the number of logs that should be fetched, and the UI generates enough blank rows for these logs
3. The actual execution log agg hits the 1000-bucket limit and returns an empty `data` array

This PR fixes that by using a bucket sum aggregation to determine the data table's `total` instead of a cardinality aggregation.

It also raises the bucket limit from 1000 to 10000 to match the bucket limit of the summary KPI endpoint. This prevents a situation where:

1. The user enters a filter query that last appeared in <1000 logs, but more than 1000 logs ago
2. The summary KPI widget, with a bucket limit of 10000, accurately displays a count of the <1000 logs the query matches
3. The data table is empty because the execution log bucket limit tops out at 1000

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Xavier Mouligneau <xavier.mouligneau@elastic.co>
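The blank-row behavior described in the summary above can be illustrated with a small hypothetical sketch (names are invented; this is not the actual data table code):

```ts
// Hypothetical model of the UI behavior: allocate rows from `total`
// (capped at 1000), then fill them from `data`. When the server's `total`
// outruns what `data` can deliver, the remaining rows stay blank.
const MAX_ROWS = 1000;

function buildRows<T>(total: number, data: T[]): Array<T | undefined> {
  const rows: Array<T | undefined> = new Array(Math.min(total, MAX_ROWS)).fill(undefined);
  data.slice(0, rows.length).forEach((item, i) => {
    rows[i] = item;
  });
  return rows; // trailing entries remain blank if data.length < rows.length
}
```

With the pre-fix behavior, `buildRows(1500, [])` yields 1000 blank rows and an empty-looking table, which is exactly what the issue reports.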
…72228) (#174557)

# Backport

This will backport the following commits from `main` to `8.12`:

- [[RAM] Fix rule log aggregations when filtering >1k logs (#172228)](#172228)

### Questions?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Zacqary Adam Xeper <Zacqary@users.noreply.github.com>
## 🐞 Summary
When checking Rules' logs in stack management, the filter does not work as expected (slack, QA environment)
(Attached screen recording: Screen.Recording.2023-11-14.at.09.32.27.mov)
Besides not filtering correctly, in the first search, the table had rows, but in the second one, it didn’t.