[7.14] SO migration takes too long on pre-7.13 upgrades with heavy alerting usage #106308

Closed
LeeDr opened this issue Jul 20, 2021 · 16 comments · Fixed by #106534
Assignees
joshdover
Labels
blocker bug Fixes for quality problems that affect the customer experience Feature:Saved Objects Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.14.0

Comments

@LeeDr

LeeDr commented Jul 20, 2021

Kibana version: 7.14.0

Elasticsearch version: 7.14.0

Server OS version: all, but mostly impacts Cloud

Browser version: n/a

Browser OS version: n/a

Original install method (e.g. download page, yum, from source, etc.): n/a

Describe the bug: A bug in 7.11 caused failed tasks to not clean up unneeded docs in both the .kibana and .kibana_task_manager indices. These could build up over time to hundreds of thousands of docs.

A task was added in 7.13 that would clean these up over time (the throttled task could take a month to clean up everything).

But 7.14 introduces saved object sharing, which requires every saved object to be modified. In prior releases, only saved objects that had migrations defined for that release were modified. So a 7.11 -> 7.13 migration works, but a 7.11 -> 7.14 migration can fail due to timeouts while waiting for Kibana to become available.

If a user tries to upgrade from 7.11 or 7.12 to 7.14 (skipping the 7.13 release with the task for cleaning up those objects), the saved object migration will take longer than Cloud allows for Kibana to become available; Cloud will restart the Kibana instance several times and report the upgrade as failed. It appears from one case that the migration would actually complete if left to run.

For on-prem installations, I believe the migration will complete but may take 15 - 20 minutes or longer depending on how many saved objects there are.
I don't know the impact on ECE, ECK, or other installation methods.

Steps to reproduce:

  1. Install the stack at version 7.11
  2. create tasks (TBD exact steps to create the types of tasks that cause the problem)
  3. upgrade to 7.14.0

Expected behavior: Kibana should complete saved object migration in a "reasonable time" to not cause timeouts on Elastic Cloud or other installation methods.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

@LeeDr LeeDr added bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 20, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote

mikecote commented Jul 20, 2021

Some queries to help clean this up:

1. Get the task.runAt of the oldest alert action in the queue (an action that hasn't run yet). If nothing is returned, use a time captured just before running this query (e.g. store Date.now() before running the query and use that if there are no results).

GET .kibana_task_manager/_search
{
  "size": 1,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
              }
            },
            {
              "term": {
                "type": "task"
              }
            },
            {
              "term": {
                "task.status": "idle"
              }
            }
          ]
        }
      }
    }
  },
  "sort": [
    { "task.runAt": { "order": "asc" } }
  ]
}
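
For reference, the date to carry into steps 2 and 3 is the task.runAt of the single hit returned. A response would look roughly like this (values and index name here are made up for illustration; most _source fields are omitted):

{
  "hits": {
    "hits": [
      {
        "_index": ".kibana_task_manager_7.11.1_001",
        "_id": "task:<uuid>",
        "_source": {
          "type": "task",
          "task": {
            "taskType": "actions:.email",
            "status": "idle",
            "runAt": "2021-07-18T14:03:11.210Z"
          }
        }
      }
    ]
  }
}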

2. Delete action_task_params where updated_at is older than the oldest alert action in the queue (use date from first query)

POST /.kibana/_delete_by_query
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "type": "action_task_params"
              }
            },
            {
              "range": {
                "updated_at": {
                  "lte": "{{ put date from first query here }}"
                }
              }
            }
          ]
        }
      }
    }
  }
}

3. Delete tasks where task.runAt is older than the oldest alert action in the queue and status is failed

POST /.kibana_task_manager/_delete_by_query
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
              }
            },
            {
              "term": {
                "task.status": "failed"
              }
            },
            {
              "term": {
                "type": "task"
              }
            },
            {
              "range": {
                "task.runAt": {
                  "lte": "{{ put date from first query here }}"
                }
              }
            }
          ]
        }
      }
    }
  }
}
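
If these indices hold hundreds of thousands of stale documents, two standard _delete_by_query options (not part of the steps above, just general Elasticsearch behaviour) can help: conflicts=proceed skips version conflicts instead of aborting, and wait_for_completion=false runs the delete as a background task, e.g.:

POST /.kibana_task_manager/_delete_by_query?conflicts=proceed&wait_for_completion=false
{
  ... same request body as query (3) above ...
}

The response then contains a task id that can be polled with GET _tasks/<task id> until the delete finishes.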

@LeeDr

LeeDr commented Jul 20, 2021

@mikecote what's the purpose of the terms filter in step 1? Is this all the taskTypes? Or is it excluding some?

     "terms": {
          "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
        }

@mikecote

mikecote commented Jul 20, 2021

@LeeDr It's a limited set of taskTypes: basically all the types that relate to running alert actions. There are many more around running rules (alerts), telemetry, login sessions, etc. that I didn't want as part of the query (so I opted in only the ones I wanted).

@LeeDr

LeeDr commented Jul 20, 2021

@crisdarocha if possible, could you try to run a query against the .kibana_task_manager index (in your pre-upgrade 7.11.1 cluster) to get the count of docs that match the terms "task.taskType" like in step 1 above? I just want to confirm it covers a majority of the docs.
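
For example, a _count request along these lines (a rough sketch reusing the terms filter from step 1, not verified against a real cluster) should return that number:

GET .kibana_task_manager/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "task" } },
        {
          "terms": {
            "task.taskType": ["actions:.email","actions:.index","actions:.pagerduty","actions:.swimlane","actions:.server-log","actions:.slack","actions:.webhook","actions:.servicenow","actions:.servicenow-sir","actions:.jira","actions:.resilient","actions:.teams"]
          }
        }
      ]
    }
  }
}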

@joshdover joshdover self-assigned this Jul 21, 2021
@joshdover

joshdover commented Jul 21, 2021

EDIT: thought of a better way, see #106308 (comment)

After thinking through this a bit, I think it's best if we implement this outside the migration algo, as a cleanup procedure that runs before the migration starts. The reason is that the algo's current architecture assumes it's only interacting with one source index, and it would require many changes to coordinate the migrations between these two indices to support queries that need to span both of them.

The caveat here is that we'll be touching the source index of the prior Kibana version, even in cases where the migration ultimately fails. This makes the upgrade process a bit less 'hermetic' but I think the risks here are minimal considering that we already have logic in place that is doing similar cleanup work on 7.13.

We'll also have to run these operations before we can set the write blocks on the indices, which means that the older Kibana nodes could potentially still be writing documents while these delete operations are in process. Though this behavior isn't officially supported, users will run into this situation. @mikecote are there any risks to running these _delete_by_query operations against indices that may still be in active use by the older Kibana nodes?

@mikecote

@mikecote are there any risks to running these _delete_by_query operations against indices that may still be in active use by the older Kibana nodes?

I don't foresee any risks, as any new document should have a timestamp of ~now, meaning the query will leave those new docs untouched.

@joshdover

As I've started to work through this, I've found a somewhat better solution which should allow us to continue to support 'hermetic' upgrades:

  1. Just before the migration starts, make the query to find the oldest alert action in the queue. Fall back to Date.now() if none are returned.
  2. Add a new excludeFilter option to the migration system. This filter will be added to the bool.must_not clause that is already used to filter out objects during the initial "client-side" migration step, which copies documents from the source index to the temporary index.
  3. Provide the appropriate filters, based on the oldest alert action from step (1), to each migration:
    • For the migration of the .kibana index, use the query filter from query (2) above (sketched below)
    • For the migration of the .kibana_task_manager index, use the query filter from query (3) above

I believe this is much safer as it won't change any of our basic assumptions about how the SO algo works and will greatly simplify the implementation. My goal is to have a working draft of this up sometime tomorrow.
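
For illustration only (the exact plumbing will live in the PR), the exclusion filter provided for the .kibana index would look roughly like the bool filter from query (2) above, with the date resolved in step (1) substituted in:

{
  "bool": {
    "must": [
      { "term": { "type": "action_task_params" } },
      { "range": { "updated_at": { "lte": "<oldest queued action runAt from step 1>" } } }
    ]
  }
}

The migration would then wrap this in its existing bool.must_not clause so that matching documents are never copied into the temporary index.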

@joshdover joshdover changed the title [7.14] SO migration takes too long on pre-7.13 upgrades [7.14] SO migration takes too long on pre-7.13 upgrades with heavy alerting usage Jul 21, 2021
@mikecote

@joshdover brilliant! ❤️

@joshdover

@mikecote if you or someone on your team has time, it'd be helpful to prepare an Elasticsearch data archive that has something like 1k documents in this state. That way I can add an integration test pretty easily that verifies that these old docs are cleaned up.

By "data archive" I essentially mean a .zip of the data directory from an Elasticsearch instance that has this data stored. You'll want to shutdown Elasticsearch before creating the archive. We have some examples here: src/core/server/saved_objects/migrationsv2/integration_tests/archives. This should be done with a 7.14 Elasticsearch snapshot so that it will work with both 7.x and 8.x for backporting. You can test if the data archive works by running yarn es snapshot --data-archive=path/to/archive.zip.

If creating the archive isn't going well (sorry we don't have docs on this), even just a script to create this data (using Kibana APIs or ES APIs) would be helpful.

@mikecote

@joshdover thanks, I'll look into it and provide an archive.

@joshdover

Thanks a ton 😄

@mikecote

mikecote commented Jul 21, 2021

@joshdover Here's an archive containing ~1500 tasks, of which ~1000 have failed (stale) and ~500 are still in queue. There's also ~1500 action_task_params. The test/queries should reduce the tasks to ~500 and the action_task_params to ~500.

es.zip

@crisdarocha

@crisdarocha if possible, could you try to run a query against the .kibana_task_manager index (in your pre-upgrade 7.11.1 cluster) to get the count of docs that match the terms "task.taskType" like in step 1 above? I just want to confirm it covers a majority of the docs.

Quickly replying to @LeeDr's question. Most documents in my .kibana_task_manager have "task.taskType": "actions:.index" and no document has "task.status": "idle". Most of them have "task.status": "failed".

I tried three options:

  1. Upgrade from 7.11 to 7.13 and wait for the cleanup. This is unfortunately not fast enough for 850k documents in each of the Kibana indices: less than 600 documents per hour per index were cleaned up.
  2. Ran the delete-by-query steps above against both indices, and the upgrade then happened normally (small indices, so no problem)
  3. Started a larger than usual Kibana instance (8GB instead of the ESS default of 1GB RAM) and the upgrade happened without timeouts.

Solution sounds quite elegant, I hope it works! Thanks @LeeDr @joshdover and @mikecote for taking good care of that!

@joshdover

I've got this working end-to-end with an integration test passing over in #106534 and I believe the risk here is fairly minimal. No fundamental changes to the migration algo were needed to implement this. 🎉

I'll have this cleaned up and the test coverage completed for review sometime on Monday, but any early eyes on the implementation now would be helpful.
