Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding potential impacts to remaining health indicators #86197

Merged

Conversation

masseyke
Copy link
Member

The ability to add impacts to indicators was added in #84899, but impacts for all indicators other than shards availability were left empty. This commit adds potential impacts for the other indicators.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke
Copy link
Member Author

masseyke commented May 2, 2022

@elasticmachine run elasticsearch-ci/part-2

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these Keith.

new HealthIndicatorImpact(
2,
"Snapshots in corrupted repositories cannot be restored. Data loss is possible.",
List.of(ImpactArea.SEARCH)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a BACKUP impact area? Maybe the corrupted repository should be severity 1?

I'm also wondering if we should include the repository names in the impact (similarly to how we do it in the shards_availability indicator)

"Cannot add data to %d %s [%s]. Searches might return incomplete results.",

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a working definition of impact severity levels? My thinking was that it wasn't severity 1 because the system is still responsive and usable. But it would certainly be terrifying to not have a functioning snapshot repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah not really for now. The severity will be very important when we'll have a global impacts section and we'll possibly only report severity: 1 (requirements TBD)

new HealthIndicatorImpact(
3,
"Indices are not being rolled over, which could lead to future instability.",
List.of(ImpactArea.SEARCH)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a DEPLOYMENT MANAGEMENT impact area?

I wonder about the impact message. Rollover is an implementation detail. Maybe we shouldn't mention it in this case.

How about?

Automatic index lifecycle and data retention management is disabled. The performance and stability of your system could be impacted. 

new HealthIndicatorImpact(
3,
"Scheduled snapshots are not running, which could lead to future data loss.",
List.of(ImpactArea.SEARCH)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BACKUP impact area?

@andreidan andreidan requested a review from Leaf-Lin May 3, 2022 10:50
List<HealthIndicatorImpact> impacts = Collections.singletonList(
new HealthIndicatorImpact(
3,
"Indices are not being rolled over, which could lead to future instability.",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything more specific that could be said about "future instability"? As in, could you get more specific around potential for running out of resources or data loss or something that might be coming?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking through what happens when shards get too big:

  • Slow query response times
  • OutOfMemoryErrors
  • Disks filling up (leading to maybe slow query response times at first, server crashes eventually)
  • Slow recovery time on node failure (which means data might be under-replicated for longer, so slower query response times).

I'm actually not sure if all of those are still the case in 8.0. I know I had some 1 TB shards in 1.x or 2.x, and the whole cluster was just an unstable mess at that point. Maybe slow query times is the thing to highlight here? @andreidan does that list above look right? Does slow query times seem like the thing to highlight here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this list based on feedback from @Leaf-Lin at #86197 (review).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ Keith - I think for a user the "data retention" aspect would be very important (data staying for too long in the system could breach some user data laws and such) and financially, data not moving through the tiers will have a very unpleasant impact on the bills

@masseyke masseyke requested a review from andreidan May 3, 2022 20:15
Copy link
Contributor

@Leaf-Lin Leaf-Lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just made a few suggestions around impact wording...

@@ -27,6 +31,7 @@ public class InstanceHasMasterHealthIndicatorService implements HealthIndicatorS

private static final String INSTANCE_HAS_MASTER_GREEN_SUMMARY = "Health coordinating instance has a master node.";
private static final String INSTANCE_HAS_MASTER_RED_SUMMARY = "Health coordinating instance does not have a master node.";
private static final String NO_MASTER_IMPACT = "The cluster cannot create, delete, or rebalance indices, and is likely to be unstable";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impact area should be "ingest", "deployment management" and "backup", "connected service".

For "ingest"

  • In addition to "create", I think the cluster won't be able to accept any new writes to existing indices (append/update) either?

For "deployment management"

  • All the management scheduled tasks (watcher/ilm/slm) during this time will not work.
  • All the _cat (and some other APIs) will not work.

For "backup"

  • Snapshot or restore won't work when master is missing.

For "connected service"

  • Feel free to call this some other name.
  • When master is down, all connected services like EnterpriseSearch/Kibana/APM/Fleet/Integration etc won't be accessible.

For "search"

Maybe somehow counterintuitively, search will continue to work when the master is missing. Is it worth pointing this out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to "create", I think the cluster won't be able to accept any new writes to existing indices (append/update) either?

Ohh, you're right. Writes fail.

All the _cat (and some other APIs) will not work.

I didn't realize this, but you're right about that, too.

For "connected service"

Do we need to call this out separately? They only don't work because the various APIs we're already talking about don't work, right?

Maybe somehow counterintuitively, search will continue to work when the master is missing. Is it worth pointing this out?

If you don't have a master, you probably ought to drop everything and fix that problem. There's no point in making it sound not quite so bad is there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For "connected service"

Do we need to call this out separately? They only don't work because the various APIs we're already talking about don't work, right?

I personally would prefer to make this a bit more explicitly but I will defer this to @shubhaat .
The question is when master is not available, all connected services like "Kibana/EnterpriseSearch/Fleet" will stop working. Should these connected services be listed as part of the impact area? and should the description of impact includes a statement on the "connected services"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me that we'd want to tell users that. But we don't usually reference Enterprise Search in Elasticsearch code. That seems like something we'd do at the Cloud level -- it sees that Elasticsearch is broken, so it knows to tell the user that all of the applications it is hosting that are built on Elasticsearch are not working. I'll go ahead and merge this for now, and can open a separate PR about this if the consensus is that we want to start referencing applications built on top of Elasticsearch more inside of Elasticsearch.

@@ -54,6 +59,10 @@ public HealthIndicatorResult calculate(boolean includeDetails) {

HealthStatus instanceHasMasterStatus = masterNode == null ? HealthStatus.RED : HealthStatus.GREEN;
String instanceHasMasterSummary = masterNode == null ? INSTANCE_HAS_MASTER_RED_SUMMARY : INSTANCE_HAS_MASTER_GREEN_SUMMARY;
List<HealthIndicatorImpact> impacts = new ArrayList<>();
if (masterNode == null) {
impacts.add(new HealthIndicatorImpact(1, NO_MASTER_IMPACT, List.of(ImpactArea.INGEST)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impact area should be "ingest", "deployment management" and "backup", "connected service".

1,
String.format(
Locale.ROOT,
"Snapshots in corrupted repositories %s cannot be restored. Data loss is possible.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a suggestion, the sentence talked about "Data loss is possible" feels really scary. When snapshot repo is corrupted, the data on ES nodes are not lost yet, maybe:

Suggested change
"Snapshots in corrupted repositories %s cannot be restored. Data loss is possible.",
"Data in corrupted snapshot repository %s may be lost and cannot be restored.",

Also, should the repository here be singular or plural?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make the plurality dependent on how many are actually corrupted. We can have more than one repository though, so it's possible that more than one is corrupted.
My intention was to make it sound really scary. I can tone that down a little like you suggest though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify if the check only looks at the single repository found-snapshot that is defined by ESS? or does it check all possible repositories from GET _snapshot/_all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It checks all repositories in the cluster state, which is equivalent to GET _snapshot/_all. I've changed it to use "repository" if there is only one corrupted, and "repositories" if there are more than one.

Collections.singletonList(
new HealthIndicatorImpact(
1,
"Snapshots in corrupted repositories [corrupted-repo] cannot be restored. Data loss is possible.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the previous comment, can we make this wording be less scary?

Suggested change
"Snapshots in corrupted repositories [corrupted-repo] cannot be restored. Data loss is possible.",
"Data in corrupted snapshot repository [corrupted-repo] may be lost and cannot be restored.",

List<HealthIndicatorImpact> impacts = Collections.singletonList(
new HealthIndicatorImpact(
3,
"Scheduled snapshots are not running, which could lead to future data loss.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Scheduled snapshots are not running, which could lead to future data loss.",
"Scheduled snapshots are not running. Enable schedule snapshot management to prevent loss of data in the future.",

I don't think disabling SLM leads to future data loss 😛 , more of insurance to prevent data loss... Not sure how to best word this?

Collections.singletonList(
new HealthIndicatorImpact(
3,
"Scheduled snapshots are not running, which could lead to future data loss.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment above

Suggested change
"Scheduled snapshots are not running, which could lead to future data loss.",
"Scheduled snapshots are not running. Enable schedule snapshot management to prevent loss of data in the future."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the impacts block rather than user actions, how about Scheduled snapshots are not running. There might not be backups of the data.? Or Scheduled snapshots are not running. There might not be backups of the data that could be used to restore if data is lost in the future.?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a bit wordy, but this feels more correct than what we had previous 👍 .

@masseyke
Copy link
Member Author

masseyke commented May 5, 2022

@elasticmachine update branch

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this Keith

Copy link
Contributor

@Leaf-Lin Leaf-Lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@masseyke masseyke added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label May 9, 2022
@elasticsearchmachine elasticsearchmachine merged commit 5af8c93 into elastic:master May 9, 2022
@masseyke masseyke deleted the feature/health-api-impacts branch May 9, 2022 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Data Management/Health >feature Team:Data Management Meta label for data/management team v8.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants