[ResponseOps] Investigate auto-healing when no write index is set for alerts as data alias #184161

Merged: 10 commits merged into elastic:main on May 29, 2024

Conversation

@doakalexi (Contributor) commented May 23, 2024

Resolves #179829

Summary

We've run into multiple SDHs where concrete indices exist for an alerts-as-data resource but none of them are set as the write index for an alias. This PR adds code to pick a concrete index and set it as the write index to avoid these types of failures.
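
For context, here is a minimal sketch of the auto-heal idea (illustrative only; the function name, parameters, and Kibana type imports are assumptions rather than the exact code added by this PR): when concrete indices for the alias exist but none has "is_write_index": true, pick one and promote it via the aliases API.

import type { ElasticsearchClient, Logger } from '@kbn/core/server';

// Illustrative sketch: promote one of the existing concrete indices to be the
// write index for the alerts-as-data alias when none is currently set.
async function setConcreteWriteIndexSketch(
  esClient: ElasticsearchClient,
  logger: Logger,
  alias: string, // e.g. '.alerts-stack.alerts-default'
  concreteIndices: string[] // concrete indices already attached to the alias
): Promise<void> {
  // Sort and take the last entry so the newest rollover index wins
  // (see the review discussion about index selection further down).
  const sorted = [...concreteIndices].sort();
  const indexToPromote = sorted[sorted.length - 1];

  await esClient.indices.updateAliases({
    actions: [
      { remove: { index: indexToPromote, alias } },
      { add: { index: indexToPromote, alias, is_write_index: true } },
    ],
  });

  logger.info(`Successfully set ${indexToPromote} as the write index for alias ${alias}`);
}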

Checklist

To verify

  1. Go to dev tools
  2. Create an ES Query rule
POST kbn:/api/alerting/rule
{
  "params": {
    "searchType": "esQuery",
    "timeWindowSize": 5,
    "timeWindowUnit": "m",
    "threshold": [
      -1
    ],
    "thresholdComparator": ">",
    "size": 100,
    "esQuery": "{\n    \"query\":{\n      \"match_all\" : {}\n    }\n  }",
    "aggType": "count",
    "groupBy": "all",
    "termSize": 5,
    "excludeHitsFromPreviousRun": false,
    "sourceFields": [],
    "index": [
      ".kibana"
    ],
    "timeField": "created_at"
  },
  "consumer": "stackAlerts",
  "schedule": {
    "interval": "1m"
  },
  "tags": [],
  "name": "test",
  "rule_type_id": ".es-query",
  "actions": []
}
  3. Run the following commands to set "is_write_index": false
POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": ".internal.alerts-stack.alerts-default-000001",
        "alias": ".alerts-stack.alerts-default"
      }
    }, {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000001",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    }
  ]
}

GET .internal.alerts-stack.alerts-default-000001/_alias/*
  4. Stop Kibana, but keep ES running
  5. Start Kibana and verify that the rule runs successfully
  6. Run the GET alias command to verify "is_write_index": true
GET .internal.alerts-stack.alerts-default-000001/_alias/*
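
For reference, after restarting Kibana the alias lookup should return something roughly like the following (illustrative output; the exact index names depend on your cluster state):

{
  ".internal.alerts-stack.alerts-default-000001": {
    "aliases": {
      ".alerts-stack.alerts-default": {
        "is_write_index": true
      }
    }
  }
}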

@doakalexi (Contributor Author) commented: /ci

@doakalexi (Contributor Author) commented: /ci

@doakalexi (Contributor Author) commented: /ci

@doakalexi doakalexi changed the title Initial commit for auto-healing when no write index is set for alerts… [ResponseOps] Investigate auto-healing when no write index is set for alerts as data alias May 23, 2024
@doakalexi doakalexi added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) release_note:skip Skip the PR/issue when compiling release notes v8.15.0 labels May 23, 2024
@doakalexi (Contributor Author) commented: /ci

@doakalexi doakalexi marked this pull request as ready for review May 24, 2024 13:59
@doakalexi doakalexi requested a review from a team as a code owner May 24, 2024 13:59
@elasticmachine (Contributor) commented: Pinging @elastic/response-ops (Team:ResponseOps)

@umbopepato (Member) left a comment:

Works as expected, LGTM!

@pmuellr (Member) left a comment:

I'd like to get the existing error message removed - or I guess probably changed to DEBUG:

[ERROR][plugins.alerting] Indices matching pattern .internal.alerts-stack.alerts-default-* exist but none are set as the write index for alias .alerts-stack.alerts-default

and replaced with an INFO message indicating that we successfully set the write index.

@@ -186,9 +187,11 @@ async function createAliasStream(opts: CreateConcreteWriteIndexOpts): Promise<vo
     // If there are some concrete indices but none of them are the write index, we'll throw an error
     // because one of the existing indices should have been the write target.
     if (concreteIndicesExist && !concreteWriteIndicesExist) {
-      throw new Error(
+      logger.error(

@pmuellr (Member) commented on the diff:

I don't think we should log this as an error, since presumably we "fix" this in the next step. I was a little scared when I ran the test scenario, and the error got logged but nothing indicating it got fixed got logged. Though it did get fixed :-)

I think we should put a try/catch around the setConcreteWriteIndex() call, log a message (probably an error) if the attempt to fix fails, and log the success - with the index and alias name - as info.

Not sure if we should continue to throw an error in the case we log the error. Seems like we'd want to continue on with other alert indices/aliases, not sure if that's how it's structured or not. So seems like throwing is probably not required in the error case here ...

@doakalexi (Contributor Author) replied:

Resolved in this commit 9096c2f

I'd like to get the existing error message removed - or I guess probably changed to DEBUG:

[ERROR][plugins.alerting] Indices matching pattern .internal.alerts-stack.alerts-default-* exist but none are set as the write index for alias .alerts-stack.alerts-default

and replaced with an INFO message indicating that we successfully set the write index.

I set the message to debug and added the INFO success message to log that we set the write index inside setConcreteWriteIndex().

I think we should put a try/catch around the setConcreteWriteIndex() call, log a message (probably an error) if the attempt to fix fails, and log the success - with the index and alias name - as info.

Not sure if we should continue to throw an error in the case we log the error. Seems like we'd want to continue on with other alert indices/aliases, not sure if that's how it's structured or not. So seems like throwing is probably not required in the error case here ...

There is a try/catch inside setConcreteWriteIndex() and I am wondering if we should still throw the error because it breaks the alerting rules? I am not sure, maybe I should remove it like you mentioned.
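
To make the described flow concrete, here is a rough sketch of the discussed logging behavior (illustrative only; it reuses the imports and the setConcreteWriteIndexSketch helper from the sketch in the summary above, and the real PR keeps the try/catch and the INFO log inside setConcreteWriteIndex() itself):

// Sketch only; names and structure are assumptions, not the merged code.
async function healIfNoWriteIndex(
  esClient: ElasticsearchClient,
  logger: Logger,
  pattern: string, // e.g. '.internal.alerts-stack.alerts-default-*'
  alias: string, // e.g. '.alerts-stack.alerts-default'
  concreteIndices: string[],
  concreteWriteIndicesExist: boolean
): Promise<void> {
  if (concreteIndices.length > 0 && !concreteWriteIndicesExist) {
    // Downgraded from ERROR to DEBUG, since the next step attempts to heal it.
    logger.debug(
      `Indices matching pattern ${pattern} exist but none are set as the write index for alias ${alias}`
    );
    try {
      await setConcreteWriteIndexSketch(esClient, logger, alias, concreteIndices);
      // On success the helper logs an INFO message naming the chosen index and alias.
    } catch (error) {
      logger.error(
        `Failed to set the write index for alias ${alias}: ${(error as Error).message}`
      );
      // Whether to also rethrow here (and fail the rule's index setup) was left
      // as an open question in this thread.
    }
  }
}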

@@ -220,9 +223,10 @@ async function createAliasStream(opts: CreateConcreteWriteIndexOpts): Promise<vo
       { logger }
     );
     if (!existingIndices[indexPatterns.name]?.aliases?.[indexPatterns.alias]?.is_write_index) {
-      throw Error(
+      logger.error(

@pmuellr (Member) commented on the diff:

This one is a little confusing to me. I think we fall into this code path when a concrete write index doesn't exist, so we create one. We check whether it already got created - which is highly likely at startup with multiple Kibanas - and if we find it but it's not the write index, we fix it so it is.

I'd think we could depend on whichever other Kibana created the index to make it the write index, but I'm not positive. So it's not clear to me that we need this here.

@doakalexi (Contributor Author) replied:

Resolved in this commit 9096c2f

I removed the change here to set the write index. It should just throw the error like before.

@pmuellr (Member) commented May 28, 2024

I was curious how we were going to pick from the available indices behind the alias, to mark as the write index.

I extended the test in the top comment to add more indices to the alias:

PUT /.internal.alerts-stack.alerts-default-000002
PUT /.internal.alerts-stack.alerts-default-000003
PUT /.internal.alerts-stack.alerts-default-000004

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000002",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    },
    {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000003",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    },
    {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000004",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    }
  ]
}

Then remove the write index per original instructions, and restart Kibana. Then run the following to see what happened:

GET _alias/.alerts-stack.alerts-default

shows that the -000001 index becomes the write index.

I would have expected -000004 instead. I'm not sure that -000001 is specifically wrong - it probably doesn't matter which index things are written to, technically - but of course it could be confusing. What's really scaring me about this is what happens when ILM rolls it over again, for real. Will it try to create a -000002 index (which already exists, so maybe an error?) or a -000005?

The other scary part about picking the "right" index (assuming we need to), is that we have seen people do their own aliasing to add non-alerting indices to the alias, for some reason. So it's possible we could be marking those as the write index, depending on how these are chosen, which would be very wrong.

Feels like what we need to do is to look for indices in the alias matching .internal.alerts-${patternWithNamespace}-*, and use the last one sorted alphabetically. Or something :-)
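
A minimal sketch of that selection rule (illustrative; the helper name and exact matching logic are assumptions, not necessarily what was merged):

// Sketch: keep only indices that follow the alerting naming convention for this
// alias, sort them, and pick the last one so the newest rollover index wins.
function pickWriteIndexCandidate(
  concreteIndices: string[],
  baseName: string // e.g. '.internal.alerts-stack.alerts-default-'
): string | undefined {
  const candidates = concreteIndices.filter((index) => index.startsWith(baseName)).sort();
  // With the -00000N suffix convention, the alphabetically last entry is the
  // most recent rollover index, e.g. -000004 rather than -000001.
  return candidates[candidates.length - 1];
}

// Example:
// pickWriteIndexCandidate(
//   ['.internal.alerts-stack.alerts-default-000001', '.internal.alerts-stack.alerts-default-000004', 'someones-custom-index'],
//   '.internal.alerts-stack.alerts-default-'
// ) returns '.internal.alerts-stack.alerts-default-000004'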

@doakalexi (Contributor Author) commented May 29, 2024

The other scary part about picking the "right" index (assuming we need to), is that we have seen people do their own aliasing to add non-alerting indices to the alias, for some reason. So it's possible we could be marking those as the write index, depending on how these are chosen, which would be very wrong.

Feels like what we need to do is to look for indices in the alias matching .internal.alerts-${patternWithNamespace}-*, and use the last one sorted alphabetically. Or something :-)

Resolved in this commit 9096c2f

You are right, that is probably a good idea! I updated the setConcreteWriteIndex function to sort the concrete indices and select the last one as the write index.

The code already searches for the indices using the query below, which I think should prevent us from picking up bad indices when people add non-alerting indices to the alias.

esClient.indices.getAlias({
  index: indexPatterns.pattern,
  name: indexPatterns.basePattern,
}),

@pmuellr (Member) left a comment:

Thanks for the changes. LGTM!

@kibana-ci (Collaborator) commented:

💛 Build succeeded, but was flaky

Metrics [docs]: ✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@doakalexi doakalexi merged commit 5a5f9bd into elastic:main May 29, 2024
36 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 29, 2024