Fix bug in filtering Rules' log history in stack management #171409

Closed
maryam-saeidi opened this issue Nov 16, 2023 · 4 comments · Fixed by #172228

Comments

maryam-saeidi (Member) commented Nov 16, 2023

🐞 Summary

When checking Rules' logs in Stack Management, the filter does not work as expected (slack, QA environment).

Screen.Recording.2023-11-14.at.09.32.27.mov

Besides the incorrect filtering, the table showed rows for the first search but none for the second.

maryam-saeidi added the Team:ResponseOps label Nov 16, 2023
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

Zacqary self-assigned this Nov 20, 2023
Zacqary (Contributor) commented Nov 29, 2023

I'm only able to reproduce this when there are more than 1000 logs to filter through, and even then only some filters fail while others succeed. It appears to be related to data volume. Investigating further.

Zacqary (Contributor) commented Nov 29, 2023

Welp, here's our problem:

```ts
const DEFAULT_MAX_BUCKETS_LIMIT = 1000; // do not retrieve more than this number of executions
const DEFAULT_MAX_KPI_BUCKETS_LIMIT = 10000;
```

I'm not sure why the limits for log buckets and KPI buckets are different. That seems odd, and a quick change of the limit to 10000 doesn't seem to break anything when testing locally.

The first goal is to find a better failure mode for when we attempt to aggregate over more than 1000 logs; after that, we should determine whether syncing this bucket limit with the KPI limit causes any problems.
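
If we do sync them, a minimal sketch of the change (sharing the KPI value is an assumption under local testing, not a settled decision):

```ts
// Hypothetical sketch: align the execution log bucket limit with the KPI limit.
// Constant names come from the snippet above; the shared value is an assumption.
const DEFAULT_MAX_KPI_BUCKETS_LIMIT = 10000;
const DEFAULT_MAX_BUCKETS_LIMIT = DEFAULT_MAX_KPI_BUCKETS_LIMIT; // was 1000
```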

@Zacqary
Copy link
Contributor

Zacqary commented Nov 29, 2023

Ok, this is the problem: we have two different aggregations, one determining the total number of logs and one returning the actual logs themselves:

```ts
executionUuidCardinality: {
  filter: {
    size: DEFAULT_MAX_BUCKETS_LIMIT,
    bool: {
      ...(dslFilterQuery ? { filter: dslFilterQuery } : {}),
      must: [getProviderAndActionFilter('alerting', 'execute')],
    },
  },
  aggs: {
    executionUuidCardinality: {
      cardinality: {
        field: EXECUTION_UUID_FIELD,
      },
    },
  },
},
executionUuid: {
  // Bucket by execution UUID
  terms: {
    field: EXECUTION_UUID_FIELD,
    size: DEFAULT_MAX_BUCKETS_LIMIT,
    order: formatSortForTermSort(sort),
  },
  // ...
},
```

`executionUuidCardinality` is used to determine the total number of buckets, but it isn't subject to `DEFAULT_MAX_BUCKETS_LIMIT`, so it isn't an accurate count of the logs actually available to `executionUuid`. There doesn't seem to be a way to apply a `size` prop to a cardinality metric, so I don't know how to make this an accurate total.

The logs API sends back the cardinality as the data table's total, and the logs as the actual rows to fill it with. When the total exceeds the length of the data, the table renders the difference as empty rows.
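
To make the failure mode concrete, here is a minimal sketch of that mismatch (all names here are hypothetical illustrations, not the actual table code):

```ts
// Hypothetical illustration of the table behavior described above.
interface LogRow {
  executionUuid: string;
  status: string;
}

function buildTableRows(
  total: number, // cardinality reported by the API, not capped by the bucket limit
  data: LogRow[], // aggregated logs, capped at DEFAULT_MAX_BUCKETS_LIMIT buckets
  maxDisplay = 1000
): Array<LogRow | null> {
  const rowCount = Math.min(total, maxDisplay);
  // When the terms agg hit its bucket limit, data.length < rowCount and the
  // remainder renders as blank rows -- the symptom in the screen recording.
  return Array.from({ length: rowCount }, (_, i) => data[i] ?? null);
}

// e.g. total = 1200 (uncapped cardinality), data.length = 0 (terms agg at its
// limit) => 1000 blank rows.
```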

I recommend syncing up `DEFAULT_MAX_BUCKETS_LIMIT` and `DEFAULT_MAX_KPI_BUCKETS_LIMIT`. The UI already caps the display at 1000 rows on its own, and since we're doing the same amount of work to fetch the KPIs, the extra buckets aren't much wasted effort.

We can then pull the data table total from the KPI aggregation instead of the cardinality.
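
One way to report a total that matches what the terms agg can actually return is to sum the bucket doc counts with a sibling pipeline aggregation; a rough sketch, reusing the field and constant names from the snippet above (this is not the exact shape of the eventual fix):

```ts
// Sketch only: derive the total from the same limit-capped buckets the table
// rows come from, instead of from an uncapped cardinality.
const aggs = {
  executionUuid: {
    terms: {
      field: EXECUTION_UUID_FIELD,
      size: DEFAULT_MAX_BUCKETS_LIMIT,
      order: formatSortForTermSort(sort),
    },
  },
  executionUuidTotal: {
    // sum_bucket is a sibling pipeline aggregation: it sums a metric across
    // the (limit-capped) buckets of executionUuid. Assuming one qualifying
    // doc per execution, this matches the number of rows the table can get.
    sum_bucket: {
      buckets_path: 'executionUuid._count',
    },
  },
};
```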

XavierM added a commit that referenced this issue Jan 9, 2024
[RAM] Fix rule log aggregations when filtering >1k logs (#172228)
## Summary

Fixes #171409 

This fixes weird behavior when attempting to filter more than 1000 rule
execution logs.

The execution log aggregator had two problems:

- It had a max bucket limit of 1000, while the KPI agg had a max bucket
limit of 10000, leading to inconsistent results
- For pagination, it reported the total number of logs by using a
cardinality aggregator, which wasn't subject to any bucket limit at all

The endpoint responds with a `total` (the cardinality) and an array of
`data` (a paginated slice of the aggregated logs). The way the data
table UI works is that it takes the `total` value, caps it at 1000, and
then generates that many blank rows to fill with whatever's in the
`data` array.

So as seen in the issue, we could easily run into a situation where:

1. User enters a filter query that last appeared more than 1000 logs ago
2. The cardinality agg accurately reports the number of logs that should
be fetched, and generates enough blank rows for these logs
3. The actual execution log agg hits the 1000 bucket limit, and returns
an empty `data` array

This PR fixes this by using a bucket sum aggregation to determine the
data table's `total` instead of a cardinality aggregation.

It also ups the bucket limit from 1000 to 10000, to match the bucket
limit from the summary KPI endpoint. This prevents a situation where:

1. User enters a filter query that last appeared in <1000 logs, but more
than 1000 logs ago
2. Summary KPI widget, with a bucket limit of 10000, accurately displays
a count of the <1000 logs this query matches
3. The data table is empty because the execution log bucket limit tops
out at 1000

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Xavier Mouligneau <xavier.mouligneau@elastic.co>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jan 9, 2024 (cherry picked from commit 9d32f88; same description as the commit above)
kibanamachine added a commit that referenced this issue Jan 10, 2024

# Backport

This will backport the following commit from `main` to `8.12`:
- [RAM] Fix rule log aggregations when filtering >1k logs (#172228) (#174557)

Co-authored-by: Zacqary Adam Xeper <Zacqary@users.noreply.github.com>
nreese pushed a commit to nreese/kibana that referenced this issue Jan 10, 2024 (same description as the commit above)
delanni pushed a commit to delanni/kibana that referenced this issue Jan 11, 2024 (same description as the commit above)
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this issue Feb 15, 2024 (same description as the commit above)