[ResponseOps] Investigate auto-healing when no write index is set for alerts as data alias #184161

Merged: 10 commits merged into elastic:main on May 29, 2024

Conversation

@doakalexi (Contributor) commented May 23, 2024

Resolves #179829

Summary

We've run into multiple SDHs where concrete indices exist for an alerts-as-data resource but none of them are set as the write index for an alias. This PR adds code to pick a concrete index and set it as the write index to avoid these types of failures.
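
For context, here is a minimal sketch of the auto-heal idea (illustrative only; the function name, parameters, and Kibana type imports are assumptions rather than the exact code added by this PR): when concrete indices for the alias exist but none has "is_write_index": true, pick one and promote it via the aliases API.

import type { ElasticsearchClient, Logger } from '@kbn/core/server';

// Illustrative sketch: promote one of the existing concrete indices to be the
// write index for the alerts-as-data alias when none is currently set.
async function setConcreteWriteIndexSketch(
  esClient: ElasticsearchClient,
  logger: Logger,
  alias: string, // e.g. '.alerts-stack.alerts-default'
  concreteIndices: string[] // concrete indices already attached to the alias
): Promise<void> {
  // Sort and take the last entry so the newest rollover index wins
  // (see the review discussion about index selection further down).
  const sorted = [...concreteIndices].sort();
  const indexToPromote = sorted[sorted.length - 1];

  await esClient.indices.updateAliases({
    actions: [
      { remove: { index: indexToPromote, alias } },
      { add: { index: indexToPromote, alias, is_write_index: true } },
    ],
  });

  logger.info(`Successfully set ${indexToPromote} as the write index for alias ${alias}`);
}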

Checklist

To verify

  1. Go to dev tools
  2. Create an ES Query rule
POST kbn:/api/alerting/rule
{
  "params": {
    "searchType": "esQuery",
    "timeWindowSize": 5,
    "timeWindowUnit": "m",
    "threshold": [
      -1
    ],
    "thresholdComparator": ">",
    "size": 100,
    "esQuery": "{\n    \"query\":{\n      \"match_all\" : {}\n    }\n  }",
    "aggType": "count",
    "groupBy": "all",
    "termSize": 5,
    "excludeHitsFromPreviousRun": false,
    "sourceFields": [],
    "index": [
      ".kibana"
    ],
    "timeField": "created_at"
  },
  "consumer": "stackAlerts",
  "schedule": {
    "interval": "1m"
  },
  "tags": [],
  "name": "test",
  "rule_type_id": ".es-query",
  "actions": []
}
  3. Run the following commands to set "is_write_index": false
POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": ".internal.alerts-stack.alerts-default-000001",
        "alias": ".alerts-stack.alerts-default"
      }
    }, {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000001",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    }
  ]
}

GET .internal.alerts-stack.alerts-default-000001/_alias/*
  4. Stop Kibana, but keep ES running
  5. Start Kibana and verify that the rule runs successfully
  6. Run the GET alias command to verify "is_write_index": true
GET .internal.alerts-stack.alerts-default-000001/_alias/*
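
For reference, after restarting Kibana the alias lookup should return something roughly like the following (illustrative output; the exact index names depend on your cluster state):

{
  ".internal.alerts-stack.alerts-default-000001": {
    "aliases": {
      ".alerts-stack.alerts-default": {
        "is_write_index": true
      }
    }
  }
}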

@doakalexi (Contributor Author) commented: /ci

@doakalexi (Contributor Author) commented: /ci

@doakalexi (Contributor Author) commented: /ci

@doakalexi doakalexi changed the title Initial commit for auto-healing when no write index is set for alerts… [ResponseOps] Investigate auto-healing when no write index is set for alerts as data alias May 23, 2024
@doakalexi doakalexi added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) release_note:skip Skip the PR/issue when compiling release notes v8.15.0 labels May 23, 2024
@doakalexi (Contributor Author) commented: /ci

@doakalexi doakalexi marked this pull request as ready for review May 24, 2024 13:59
@doakalexi doakalexi requested a review from a team as a code owner May 24, 2024 13:59
@elasticmachine (Contributor) commented: Pinging @elastic/response-ops (Team:ResponseOps)

@umbopepato (Member) left a comment:

Works as expected, LGTM!

@pmuellr (Member) left a comment:

I'd like to get the existing error message removed - or I guess probably changed to DEBUG:

[ERROR][plugins.alerting] Indices matching pattern .internal.alerts-stack.alerts-default-* exist but none are set as the write index for alias .alerts-stack.alerts-default

and replaced with an INFO message indicating that we successfully set the write index.

@@ -186,9 +187,11 @@ async function createAliasStream(opts: CreateConcreteWriteIndexOpts): Promise<vo
     // If there are some concrete indices but none of them are the write index, we'll throw an error
     // because one of the existing indices should have been the write target.
     if (concreteIndicesExist && !concreteWriteIndicesExist) {
-      throw new Error(
+      logger.error(

@pmuellr (Member) commented on the diff:

I don't think we should log this as an error, since presumably we "fix" this in the next step. I was a little scared when I ran the test scenario, and the error got logged but nothing indicating it got fixed got logged. Though it did get fixed :-)

I think we should put a try/catch around the setConcreteWriteIndex() call, log a message (probably an error) if the attempt to fix fails, and log the success - with the index and alias name - as info.

Not sure if we should continue to throw an error in the case we log the error. Seems like we'd want to continue on with other alert indices/aliases, not sure if that's how it's structured or not. So seems like throwing is probably not required in the error case here ...

@doakalexi (Contributor Author) replied:

Resolved in this commit 9096c2f

I'd like to get the existing error message removed - or I guess probably changed to DEBUG:

[ERROR][plugins.alerting] Indices matching pattern .internal.alerts-stack.alerts-default-* exist but none are set as the write index for alias .alerts-stack.alerts-default

and replaced with an INFO message indicating that we successfully set the write index.

I set the message to debug and added the INFO success message to log that we set the write index inside setConcreteWriteIndex().

I think we should put a try/catch around the setConcreteWriteIndex() call, log a message (probably an error) if the attempt to fix fails, and log the success - with the index and alias name - as info.

Not sure if we should continue to throw an error in the case we log the error. Seems like we'd want to continue on with other alert indices/aliases, not sure if that's how it's structured or not. So seems like throwing is probably not required in the error case here ...

There is a try/catch inside setConcreteWriteIndex() and I am wondering if we should still throw the error because it breaks the alerting rules? I am not sure, maybe I should remove it like you mentioned.
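
To make the described flow concrete, here is a rough sketch of the discussed logging behavior (illustrative only; it reuses the imports and the setConcreteWriteIndexSketch helper from the sketch in the summary above, and the real PR keeps the try/catch and the INFO log inside setConcreteWriteIndex() itself):

// Sketch only; names and structure are assumptions, not the merged code.
async function healIfNoWriteIndex(
  esClient: ElasticsearchClient,
  logger: Logger,
  pattern: string, // e.g. '.internal.alerts-stack.alerts-default-*'
  alias: string, // e.g. '.alerts-stack.alerts-default'
  concreteIndices: string[],
  concreteWriteIndicesExist: boolean
): Promise<void> {
  if (concreteIndices.length > 0 && !concreteWriteIndicesExist) {
    // Downgraded from ERROR to DEBUG, since the next step attempts to heal it.
    logger.debug(
      `Indices matching pattern ${pattern} exist but none are set as the write index for alias ${alias}`
    );
    try {
      await setConcreteWriteIndexSketch(esClient, logger, alias, concreteIndices);
      // On success the helper logs an INFO message naming the chosen index and alias.
    } catch (error) {
      logger.error(
        `Failed to set the write index for alias ${alias}: ${(error as Error).message}`
      );
      // Whether to also rethrow here (and fail the rule's index setup) was left
      // as an open question in this thread.
    }
  }
}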

@@ -220,9 +223,10 @@ async function createAliasStream(opts: CreateConcreteWriteIndexOpts): Promise<vo
       { logger }
     );
     if (!existingIndices[indexPatterns.name]?.aliases?.[indexPatterns.alias]?.is_write_index) {
-      throw Error(
+      logger.error(

@pmuellr (Member) commented on the diff:

This one is a little confusing to me. I think we fall into this code path when a concrete write index doesn't exist, so we create one. We check whether it already got created - which is highly likely at startup with multiple Kibanas - and if we find it but it's not the write index, we fix it so it is.

I'd think we could depend on whichever other Kibana created the index to make it the write index, but I'm not positive. So it's not clear to me that we need this here.

@doakalexi (Contributor Author) replied:

Resolved in this commit 9096c2f

I removed the change here to set the write index. It should just throw the error like before.

@pmuellr (Member) commented May 28, 2024

I was curious how we were going to pick from the available indices behind the alias, to mark as the write index.

I extended the test in the top comment to add more indices to the alias:

PUT /.internal.alerts-stack.alerts-default-000002
PUT /.internal.alerts-stack.alerts-default-000003
PUT /.internal.alerts-stack.alerts-default-000004

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000002",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    },
    {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000003",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    },
    {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000004",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    }
  ]
}

Then remove the write index per original instructions, and restart Kibana. Then run the following to see what happened:

GET _alias/.alerts-stack.alerts-default

shows that the -000001 index becomes the write index.

I would have expected -000004 instead. I'm not sure that -000001 is specifically wrong - it probably doesn't matter which index things are written to, technically - but of course it could be confusing. What's really scaring me about this is what happens when ILM rolls it over again, for real. Will it try to create a -000002 index (which already exists, so maybe an error?) or a -000005?

The other scary part about picking the "right" index (assuming we need to), is that we have seen people do their own aliasing to add non-alerting indices to the alias, for some reason. So it's possible we could be marking those as the write index, depending on how these are chosen, which would be very wrong.

Feels like what we need to do is to look for indices in the alias matching .internal.alerts-${patternWithNamespace}-*, and use the last one sorted alphabetically. Or something :-)
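
A minimal sketch of that selection rule (illustrative; the helper name and exact matching logic are assumptions, not necessarily what was merged):

// Sketch: keep only indices that follow the alerting naming convention for this
// alias, sort them, and pick the last one so the newest rollover index wins.
function pickWriteIndexCandidate(
  concreteIndices: string[],
  baseName: string // e.g. '.internal.alerts-stack.alerts-default-'
): string | undefined {
  const candidates = concreteIndices.filter((index) => index.startsWith(baseName)).sort();
  // With the -00000N suffix convention, the alphabetically last entry is the
  // most recent rollover index, e.g. -000004 rather than -000001.
  return candidates[candidates.length - 1];
}

// Example:
// pickWriteIndexCandidate(
//   ['.internal.alerts-stack.alerts-default-000001', '.internal.alerts-stack.alerts-default-000004', 'someones-custom-index'],
//   '.internal.alerts-stack.alerts-default-'
// ) returns '.internal.alerts-stack.alerts-default-000004'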

@doakalexi (Contributor Author) commented May 29, 2024

The other scary part about picking the "right" index (assuming we need to), is that we have seen people do their own aliasing to add non-alerting indices to the alias, for some reason. So it's possible we could be marking those as the write index, depending on how these are chosen, which would be very wrong.

Feels like what we need to do is to look for indices in the alias matching .internal.alerts-${patternWithNamespace}-*, and use the last one sorted alphabetically. Or something :-)

Resolved in this commit 9096c2f

You are right, that is probably a good idea! I updated the setConcreteWriteIndex function to sort the concrete indices and select the last one as the write index.

The code already searches for the indices using the query below, which I think should prevent us from picking up bad indices when people add non-alerting indices to the alias.

esClient.indices.getAlias({
  index: indexPatterns.pattern,
  name: indexPatterns.basePattern,
}),

@pmuellr (Member) left a comment:

Thanks for the changes. LGTM!

@kibana-ci (Collaborator) commented:

💛 Build succeeded, but was flaky

Metrics [docs]: ✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@doakalexi doakalexi merged commit 5a5f9bd into elastic:main May 29, 2024
36 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 29, 2024