-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding potential impacts to remaining health indicators #86197
Adding potential impacts to remaining health indicators #86197
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @masseyke, I've created a changelog YAML for you. |
@elasticmachine run elasticsearch-ci/part-2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding these Keith.
new HealthIndicatorImpact( | ||
2, | ||
"Snapshots in corrupted repositories cannot be restored. Data loss is possible.", | ||
List.of(ImpactArea.SEARCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a BACKUP
impact area? Maybe the corrupted repository should be severity 1?
I'm also wondering if we should include the repository names in the impact (similarly to how we do it in the shards_availability
indicator)
Line 762 in 1cdb03d
"Cannot add data to %d %s [%s]. Searches might return incomplete results.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a working definition of impact severity levels? My thinking was that it wasn't severity 1 because the system is still responsive and usable. But it would certainly be terrifying to not have a functioning snapshot repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah not really for now. The severity will be very important when we'll have a global impacts section and we'll possibly only report severity: 1
(requirements TBD)
new HealthIndicatorImpact( | ||
3, | ||
"Indices are not being rolled over, which could lead to future instability.", | ||
List.of(ImpactArea.SEARCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a DEPLOYMENT MANAGEMENT
impact area?
I wonder about the impact message. Rollover is an implementation detail. Maybe we shouldn't mention it in this case.
How about?
Automatic index lifecycle and data retention management is disabled. The performance and stability of your system could be impacted.
new HealthIndicatorImpact( | ||
3, | ||
"Scheduled snapshots are not running, which could lead to future data loss.", | ||
List.of(ImpactArea.SEARCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BACKUP
impact area?
List<HealthIndicatorImpact> impacts = Collections.singletonList( | ||
new HealthIndicatorImpact( | ||
3, | ||
"Indices are not being rolled over, which could lead to future instability.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything more specific that could be said about "future instability"? As in, could you get more specific around potential for running out of resources or data loss or something that might be coming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking through what happens when shards get too big:
- Slow query response times
- OutOfMemoryErrors
- Disks filling up (leading to maybe slow query response times at first, server crashes eventually)
- Slow recovery time on node failure (which means data might be under-replicated for longer, so slower query response times).
I'm actually not sure if all of those are still the case in 8.0. I know I had some 1 TB shards in 1.x or 2.x, and the whole cluster was just an unstable mess at that point. Maybe slow query times is the thing to highlight here? @andreidan does that list above look right? Does slow query times seem like the thing to highlight here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this list based on feedback from @Leaf-Lin at #86197 (review).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ Keith - I think for a user the "data retention" aspect would be very important (data staying for too long in the system could breach some user data laws and such) and financially, data not moving through the tiers will have a very unpleasant impact on the bills
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just made a few suggestions around impact wording...
@@ -27,6 +31,7 @@ public class InstanceHasMasterHealthIndicatorService implements HealthIndicatorS | |||
|
|||
private static final String INSTANCE_HAS_MASTER_GREEN_SUMMARY = "Health coordinating instance has a master node."; | |||
private static final String INSTANCE_HAS_MASTER_RED_SUMMARY = "Health coordinating instance does not have a master node."; | |||
private static final String NO_MASTER_IMPACT = "The cluster cannot create, delete, or rebalance indices, and is likely to be unstable"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impact area should be "ingest", "deployment management" and "backup", "connected service".
For "ingest"
- In addition to "create", I think the cluster won't be able to accept any new writes to existing indices (append/update) either?
For "deployment management"
- All the management scheduled tasks (watcher/ilm/slm) during this time will not work.
- All the
_cat
(and some other APIs) will not work.
For "backup"
- Snapshot or restore won't work when master is missing.
For "connected service"
- Feel free to call this some other name.
- When master is down, all connected services like EnterpriseSearch/Kibana/APM/Fleet/Integration etc won't be accessible.
For "search"
Maybe somehow counterintuitively, search will continue to work when the master is missing. Is it worth pointing this out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to "create", I think the cluster won't be able to accept any new writes to existing indices (append/update) either?
Ohh, you're right. Writes fail.
All the _cat (and some other APIs) will not work.
I didn't realize this, but you're right about that, too.
For "connected service"
Do we need to call this out separately? They only don't work because the various APIs we're already talking about don't work, right?
Maybe somehow counterintuitively, search will continue to work when the master is missing. Is it worth pointing this out?
If you don't have a master, you probably ought to drop everything and fix that problem. There's no point in making it sound not quite so bad is there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For "connected service"
Do we need to call this out separately? They only don't work because the various APIs we're already talking about don't work, right?
I personally would prefer to make this a bit more explicitly but I will defer this to @shubhaat .
The question is when master is not available, all connected services like "Kibana/EnterpriseSearch/Fleet" will stop working. Should these connected services be listed as part of the impact area
? and should the description of impact includes a statement on the "connected services"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me that we'd want to tell users that. But we don't usually reference Enterprise Search in Elasticsearch code. That seems like something we'd do at the Cloud level -- it sees that Elasticsearch is broken, so it knows to tell the user that all of the applications it is hosting that are built on Elasticsearch are not working. I'll go ahead and merge this for now, and can open a separate PR about this if the consensus is that we want to start referencing applications built on top of Elasticsearch more inside of Elasticsearch.
@@ -54,6 +59,10 @@ public HealthIndicatorResult calculate(boolean includeDetails) { | |||
|
|||
HealthStatus instanceHasMasterStatus = masterNode == null ? HealthStatus.RED : HealthStatus.GREEN; | |||
String instanceHasMasterSummary = masterNode == null ? INSTANCE_HAS_MASTER_RED_SUMMARY : INSTANCE_HAS_MASTER_GREEN_SUMMARY; | |||
List<HealthIndicatorImpact> impacts = new ArrayList<>(); | |||
if (masterNode == null) { | |||
impacts.add(new HealthIndicatorImpact(1, NO_MASTER_IMPACT, List.of(ImpactArea.INGEST))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impact area should be "ingest", "deployment management" and "backup", "connected service".
1, | ||
String.format( | ||
Locale.ROOT, | ||
"Snapshots in corrupted repositories %s cannot be restored. Data loss is possible.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a suggestion, the sentence talked about "Data loss is possible" feels really scary. When snapshot repo is corrupted, the data on ES nodes are not lost yet, maybe:
"Snapshots in corrupted repositories %s cannot be restored. Data loss is possible.", | |
"Data in corrupted snapshot repository %s may be lost and cannot be restored.", |
Also, should the repository here be singular or plural?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could make the plurality dependent on how many are actually corrupted. We can have more than one repository though, so it's possible that more than one is corrupted.
My intention was to make it sound really scary. I can tone that down a little like you suggest though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify if the check only looks at the single repository found-snapshot
that is defined by ESS? or does it check all possible repositories from GET _snapshot/_all
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It checks all repositories in the cluster state, which is equivalent to GET _snapshot/_all
. I've changed it to use "repository" if there is only one corrupted, and "repositories" if there are more than one.
Collections.singletonList( | ||
new HealthIndicatorImpact( | ||
1, | ||
"Snapshots in corrupted repositories [corrupted-repo] cannot be restored. Data loss is possible.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous comment, can we make this wording be less scary?
"Snapshots in corrupted repositories [corrupted-repo] cannot be restored. Data loss is possible.", | |
"Data in corrupted snapshot repository [corrupted-repo] may be lost and cannot be restored.", |
List<HealthIndicatorImpact> impacts = Collections.singletonList( | ||
new HealthIndicatorImpact( | ||
3, | ||
"Scheduled snapshots are not running, which could lead to future data loss.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Scheduled snapshots are not running, which could lead to future data loss.", | |
"Scheduled snapshots are not running. Enable schedule snapshot management to prevent loss of data in the future.", |
I don't think disabling SLM leads to future data loss 😛 , more of insurance to prevent data loss... Not sure how to best word this?
Collections.singletonList( | ||
new HealthIndicatorImpact( | ||
3, | ||
"Scheduled snapshots are not running, which could lead to future data loss.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my comment above
"Scheduled snapshots are not running, which could lead to future data loss.", | |
"Scheduled snapshots are not running. Enable schedule snapshot management to prevent loss of data in the future." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is the impacts block rather than user actions, how about Scheduled snapshots are not running. There might not be backups of the data.
? Or Scheduled snapshots are not running. There might not be backups of the data that could be used to restore if data is lost in the future.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's a bit wordy, but this feels more correct than what we had previous 👍 .
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working on this Keith
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The ability to add impacts to indicators was added in #84899, but impacts for all indicators other than shards availability were left empty. This commit adds potential impacts for the other indicators.