-
Notifications
You must be signed in to change notification settings - Fork 8.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ResponseOps] Investigate auto-healing when no write index is set for alerts as data alias #184161
[ResponseOps] Investigate auto-healing when no write index is set for alerts as data alias #184161
Conversation
/ci |
/ci |
…ana into auto-heal-alerting-indices
/ci |
/ci |
Pinging @elastic/response-ops (Team:ResponseOps) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to get the existing error message removed - or I guess probably changed to DEBUG:
[ERROR][plugins.alerting] Indices matching pattern .internal.alerts-stack.alerts-default-* exist but none are set as the write index for alias .alerts-stack.alerts-default
and replaced with an INFO message indicating that we successfully set the write index.
@@ -186,9 +187,11 @@ async function createAliasStream(opts: CreateConcreteWriteIndexOpts): Promise<vo | |||
// If there are some concrete indices but none of them are the write index, we'll throw an error | |||
// because one of the existing indices should have been the write target. | |||
if (concreteIndicesExist && !concreteWriteIndicesExist) { | |||
throw new Error( | |||
logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should log this as an error, since presumably we "fix" this in the next step. I was a little scared when I ran the test scenario, and the error got logged but nothing indicating it got fixed got logged. Though it did get fixed :-)
I think we should put a try/catch around the setConcreteWriteIndex()
call, log an error if the attempt to fix fails, as I guess an error, but log the success - with the index and alias name - as info.
Not sure if we should continue to throw an error in the case we log the error. Seems like we'd want to continue on with other alert indices/aliases, not sure if that's how it's structured or not. So seems like throwing is probably not required in the error case here ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved in this commit 9096c2f
I'd like to get the existing error message removed - or I guess probably changed to DEBUG:
[ERROR][plugins.alerting] Indices matching pattern .internal.alerts-stack.alerts-default-* exist but none are set as the write index for alias .alerts-stack.alerts-default
and replaced with an INFO message indicating that we successfully set the write index.
I set the message to debug and added the INFO success message to log that we set the write index inside setConcreteWriteIndex()
.
I think we should put a try/catch around the setConcreteWriteIndex() call, log an error if the attempt to fix fails, as I guess an error, but log the success - with the index and alias name - as info.
Not sure if we should continue to throw an error in the case we log the error. Seems like we'd want to continue on with other alert indices/aliases, not sure if that's how it's structured or not. So seems like throwing is probably not required in the error case here ...
There is a try/catch inside setConcreteWriteIndex()
and I am wondering if we should still throw the error bc it breaks the alerting rules? I am not sure, maybe I should remove like you mentioned.
@@ -220,9 +223,10 @@ async function createAliasStream(opts: CreateConcreteWriteIndexOpts): Promise<vo | |||
{ logger } | |||
); | |||
if (!existingIndices[indexPatterns.name]?.aliases?.[indexPatterns.alias]?.is_write_index) { | |||
throw Error( | |||
logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is a little confusing to me. I think we fall in this code when a concrete write index doesn't exist, so we create one. We check in the code if it already got created - which is highly likely at startup with multiple Kibanas - and if we find it, but it's not the write index, we fix it so it is.
I'd think we could depend on whichever other Kibana created the index to make it the write index, but not positive. So not clear to me we need this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved in this commit 9096c2f
I removed the change here to set the write index. It should just throw the error like before.
I was curious how we were going to pick from the available indices behind the alias, to mark as the write index. I extended the test in the top comment to add more indices to the alias:
Then remove the write index per original instructions, and restart Kibana. Then run the following to see what happened:
shows that the I would have expected The other scary part about picking the "right" index (assuming we need to), is that we have seen people do their own aliasing to add non-alerting indices to the alias, for some reason. So it's possible we could be marking those as the write index, depending on how these are chosen, which would be very wrong. Feels like what we need to do is to look for indices in the alias matching |
…ana into auto-heal-alerting-indices
Resolved in this commit 9096c2f You are right, that is prob a good idea! I updated the The code already searches for the indices using the query below that I think should prevent us from picking bad indices from people adding non-alerting indices to the alias.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. LGTM!
💛 Build succeeded, but was flaky
Failed CI StepsMetrics [docs]
History
To update your PR or re-run it, just comment with: |
Resolves #179829
Summary
We've run into multiple SDHs where concrete indices exist for an alerts-as-data resource but none of them are set as the write index for an alias. This PR adds code to pick a concrete index and set it as the write index to avoid these types of failures.
Checklist
To verify
"is_write_index": false
"is_write_index": true