[7.14] SO migration takes too long on pre-7.13 upgrades with heavy alerting usage #106308

Closed
LeeDr opened this issue Jul 20, 2021 · 16 comments · Fixed by #106534
Assignees
joshdover
Labels
blocker bug Fixes for quality problems that affect the customer experience Feature:Saved Objects Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.14.0

Comments

@LeeDr

LeeDr commented Jul 20, 2021

Kibana version: 7.14.0

Elasticsearch version: 7.14.0

Server OS version: all, but mostly impacts Cloud

Browser version: n/a

Browser OS version: n/a

Original install method (e.g. download page, yum, from source, etc.): n/a

Describe the bug: A bug in 7.11 caused failed tasks to not clean up unneeded docs in both the .kibana and .kibana_task_manager indices. These could build up over time to hundreds of thousands of docs.

A task was added in 7.13 that would clean these up over time (the throttled task could take a month to clean up everything).

But 7.14 introduces saved object sharing, which requires every saved object to be modified. In prior releases, only saved objects that had migrations defined for that release were modified. So a 7.11 -> 7.13 migration works, but a 7.11 -> 7.14 migration can fail due to timeouts while waiting for Kibana to become available.

If a user tries to upgrade from 7.11 or 7.12 to 7.14 (skipping the 7.13 release with the task for cleaning up those objects), the saved object migration will take longer than Cloud allows for Kibana to become available; Cloud will restart the Kibana instance several times and report the upgrade as failed. It appears from one case that the migration would actually complete if left to run.

For on-prem installations, I believe the migration will complete but may take 15 - 20 minutes or longer depending on how many saved objects there are.
I don't know the impact on ECE, ECK, or other installation methods.

Steps to reproduce:

  1. Install the stack at version 7.11
  2. create tasks (TBD exact steps to create the types of tasks that cause the problem)
  3. upgrade to 7.14.0

Expected behavior: Kibana should complete saved object migration in a "reasonable time" to not cause timeouts on Elastic Cloud or other installation methods.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

@LeeDr LeeDr added bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 20, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote

mikecote commented Jul 20, 2021

Some queries to help clean this up:

1. Get the task.runAt of the oldest alert action in the queue (an action that hasn't run yet). If nothing is returned, use a time captured just before running this query (e.g. store Date.now() before running the query and use that if there are no results).

GET .kibana_task_manager/_search
{
  "size": 1,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
              }
            },
            {
              "term": {
                "type": "task"
              }
            },
            {
              "term": {
                "task.status": "idle"
              }
            }
          ]
        }
      }
    }
  },
  "sort": [
    { "task.runAt": { "order": "asc" } }
  ]
}
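
For reference, the date to carry into steps 2 and 3 is the task.runAt of the single hit returned. A response would look roughly like this (values and index name here are made up for illustration; most _source fields are omitted):

{
  "hits": {
    "hits": [
      {
        "_index": ".kibana_task_manager_7.11.1_001",
        "_id": "task:<uuid>",
        "_source": {
          "type": "task",
          "task": {
            "taskType": "actions:.email",
            "status": "idle",
            "runAt": "2021-07-18T14:03:11.210Z"
          }
        }
      }
    ]
  }
}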

2. Delete action_task_params where updated_at is older than the oldest alert action in the queue (use date from first query)

POST /.kibana/_delete_by_query
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "type": "action_task_params"
              }
            },
            {
              "range": {
                "updated_at": {
                  "lte": "{{ put date from first query here }}"
                }
              }
            }
          ]
        }
      }
    }
  }
}

3. Delete tasks where task.runAt is older than the oldest alert action in the queue and status is failed

POST /.kibana_task_manager/_delete_by_query
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
              }
            },
            {
              "term": {
                "task.status": "failed"
              }
            },
            {
              "term": {
                "type": "task"
              }
            },
            {
              "range": {
                "task.runAt": {
                  "lte": "{{ put date from first query here }}"
                }
              }
            }
          ]
        }
      }
    }
  }
}
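
If these indices hold hundreds of thousands of stale documents, two standard _delete_by_query options (not part of the steps above, just general Elasticsearch behaviour) can help: conflicts=proceed skips version conflicts instead of aborting, and wait_for_completion=false runs the delete as a background task, e.g.:

POST /.kibana_task_manager/_delete_by_query?conflicts=proceed&wait_for_completion=false
{
  ... same request body as query (3) above ...
}

The response then contains a task id that can be polled with GET _tasks/<task id> until the delete finishes.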

@LeeDr

LeeDr commented Jul 20, 2021

@mikecote what's the purpose of the terms filter in step 1? Is this all the taskTypes? Or is it excluding some?

     "terms": {
          "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
        }

@mikecote

mikecote commented Jul 20, 2021

@LeeDr It's a limited set of taskTypes: basically all the types that relate to running alert actions. There are many more around running rules (alerts), telemetry, login sessions, etc. that I didn't want as part of the query (so I opted in only the ones I wanted).

@LeeDr

LeeDr commented Jul 20, 2021

@crisdarocha if possible, could you try to run a query against the .kibana_task_manager index (in your pre-upgrade 7.11.1 cluster) to get the count of docs that match the terms "task.taskType" like in step 1 above? I just want to confirm it covers a majority of the docs.
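
For example, a _count request along these lines (a rough sketch reusing the terms filter from step 1, not verified against a real cluster) should return that number:

GET .kibana_task_manager/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "task" } },
        {
          "terms": {
            "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
          }
        }
      ]
    }
  }
}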

@joshdover joshdover self-assigned this Jul 21, 2021
@joshdover

joshdover commented Jul 21, 2021

EDIT: thought of a better way, see #106308 (comment)

After thinking through this a bit, I think it's best if we implement this outside the migration algo, as a cleanup procedure that runs before the migration starts. The reason is that the algo's current architecture assumes it's only interacting with one source index, and it would require many changes to coordinate the migrations between these two indices to support queries that need to span both of them.

The caveat here is that we'll be touching the source index of the prior Kibana version, even in cases where the migration ultimately fails. This makes the upgrade process a bit less 'hermetic' but I think the risks here are minimal considering that we already have logic in place that is doing similar cleanup work on 7.13.

We'll also have to run these operations before we can set the write blocks on the indices, which means that the older Kibana nodes could potentially still be writing documents while these delete operations are in process. Though this behavior isn't officially supported, users will run into this situation. @mikecote are there any risks to running these _delete_by_query operations against indices that may still be in active use by the older Kibana nodes?

@mikecote

@mikecote are there any risks to running these _delete_by_query operations against indices that may still be in active use by the older Kibana nodes?

I don't foresee any risks, as any new document should have a timestamp of ~now, meaning the query will leave those new docs untouched.

@joshdover

As I've started to work through this, I've found a somewhat better solution which should allow us to continue to support 'hermetic' upgrades:

  1. Just before the migration starts, make the query to find the oldest alert action in the queue. Fall back to Date.now() if none are returned.
  2. Add a new excludeFilter option to the migration system. This filter will be added to the bool.must_not clause that is already used to filter out objects during the initial "client-side" migration step, which copies documents from the source index to the temporary index.
  3. Provide the appropriate filters, based on the oldest alert action from step (1), to each migration:
    • For the migration of the .kibana index, use the query filter from query (2) above (sketched below)
    • For the migration of the .kibana_task_manager index, use the query filter from query (3) above

I believe this is much safer as it won't change any of our basic assumptions about how the SO algo works and will greatly simplify the implementation. My goal is to have a working draft of this up sometime tomorrow.
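
For illustration only (the exact plumbing will live in the PR), the exclusion filter provided for the .kibana index would look roughly like the bool filter from query (2) above, with the date resolved in step (1) substituted in:

{
  "bool": {
    "must": [
      { "term": { "type": "action_task_params" } },
      { "range": { "updated_at": { "lte": "<oldest queued action runAt from step 1>" } } }
    ]
  }
}

The migration would then wrap this in its existing bool.must_not clause so that matching documents are never copied into the temporary index.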

@joshdover joshdover changed the title [7.14] SO migration takes too long on pre-7.13 upgrades [7.14] SO migration takes too long on pre-7.13 upgrades with heavy alerting usage Jul 21, 2021
@mikecote

@joshdover brilliant! ❤️

@joshdover

@mikecote if you or someone on your team has time, it'd be helpful to prepare an Elasticsearch data archive that has something like 1k documents in this state. That way I can add an integration test pretty easily that verifies that these old docs are cleaned up.

By "data archive" I essentially mean a .zip of the data directory from an Elasticsearch instance that has this data stored. You'll want to shutdown Elasticsearch before creating the archive. We have some examples here: src/core/server/saved_objects/migrationsv2/integration_tests/archives. This should be done with a 7.14 Elasticsearch snapshot so that it will work with both 7.x and 8.x for backporting. You can test if the data archive works by running yarn es snapshot --data-archive=path/to/archive.zip.

If creating the archive isn't going well (sorry we don't have docs on this), even just a script to create this data (using Kibana APIs or ES APIs) would be helpful.

@mikecote

@joshdover thanks, I'll look into it and provide an archive.

@joshdover

Thanks a ton 😄

@mikecote

mikecote commented Jul 21, 2021

@joshdover Here's an archive containing ~1500 tasks, of which ~1000 have failed (stale) and ~500 are still in queue. There's also ~1500 action_task_params. The test/queries should reduce the tasks to ~500 and the action_task_params to ~500.

es.zip

@crisdarocha

@crisdarocha if possible, could you try to run a query against the .kibana_task_manager index (in your pre-upgrade 7.11.1 cluster) to get the count of docs that match the terms "task.taskType" like in step 1 above? I just want to confirm it covers a majority of the docs.

Quickly replying to @LeeDr's question. Most documents in my .kibana_task_manager have "task.taskType": "actions:.index" and no document has "task.status": "idle". Most of them have "task.status": "failed".

I tried three options:

  1. Upgrade from 7.11 to 7.13 and wait for the cleanup. This is unfortunately not fast enough for 850k documents in each of the Kibana indices: less than 600 documents per hour per index were cleaned up.
  2. Ran the delete-by-query steps above against both indices, and the upgrade then happened normally (small indices, so no problem)
  3. Started a larger than usual Kibana instance (8GB instead of the ESS default of 1GB RAM) and the upgrade happened without timeouts.

Solution sounds quite elegant, I hope it works! Thanks @LeeDr @joshdover and @mikecote for taking good care of that!

@joshdover

I've got this working end-to-end with an integration test passing over in #106534 and I believe the risk here is fairly minimal. No fundamental changes to the migration algo were needed to implement this. 🎉

I'll have this cleaned up and the test coverage completed for review sometime on Monday, but any early eyes on the implementation now would be helpful.
