Adding potential impacts to remaining health indicators #86197

masseyke · 2022-04-26T19:19:34Z

The ability to add impacts to indicators was added in #84899, but impacts for all indicators other than shards availability were left empty. This commit adds potential impacts for the other indicators.

elasticmachine · 2022-04-26T19:19:39Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2022-04-26T19:20:00Z

Hi @masseyke, I've created a changelog YAML for you.

masseyke · 2022-05-02T18:25:21Z

@elasticmachine run elasticsearch-ci/part-2

andreidan

Thanks for adding these Keith.

andreidan · 2022-05-03T10:37:53Z

server/src/main/java/org/elasticsearch/snapshots/RepositoryIntegrityHealthIndicatorService.java

+            new HealthIndicatorImpact(
+                2,
+                "Snapshots in corrupted repositories cannot be restored. Data loss is possible.",
+                List.of(ImpactArea.SEARCH)


Should we add a BACKUP impact area? Maybe the corrupted repository should be severity 1?

I'm also wondering if we should include the repository names in the impact (similarly to how we do it in the shards_availability indicator)

elasticsearch/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

Line 762 in 1cdb03d

"Cannot add data to %d %s [%s]. Searches might return incomplete results.",

Do we have a working definition of impact severity levels? My thinking was that it wasn't severity 1 because the system is still responsive and usable. But it would certainly be terrifying to not have a functioning snapshot repo.

Ah, not really for now. The severity will be very important once we have a global impacts section, where we'll possibly only report severity: 1 (requirements TBD).
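For illustration only, here is a rough sketch of what "only report severity: 1" could look like if a global impacts section were built. The class `GlobalImpactsSketch`, the record `Impact`, and the method `mostSevere` are hypothetical names, not part of this PR, and the sketch assumes that a lower number means a more severe impact, as the discussion above implies. The actual requirements are still TBD.

```java
import java.util.List;
import java.util.stream.Collectors;

public class GlobalImpactsSketch {

    // Hypothetical stand-in for an impact entry; not the real HealthIndicatorImpact class.
    public record Impact(int severity, String description) {}

    // Keeps only impacts at or above the given severity level.
    // Assumption: severity 1 is the most severe, so "at or above" means severity <= maxSeverity.
    public static List<Impact> mostSevere(List<Impact> all, int maxSeverity) {
        return all.stream()
            .filter(impact -> impact.severity() <= maxSeverity)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Impact> all = List.of(
            new Impact(1, "Snapshots in corrupted repositories cannot be restored."),
            new Impact(3, "Scheduled snapshots are not running.")
        );
        // Prints only the severity 1 entry.
        mostSevere(all, 1).forEach(impact -> System.out.println(impact.description()));
    }
}
```

Passing the threshold as a parameter keeps the sketch neutral about where the cutoff ends up once the requirements are settled.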

andreidan · 2022-05-03T10:45:52Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IlmHealthIndicatorService.java

+                new HealthIndicatorImpact(
+                    3,
+                    "Indices are not being rolled over, which could lead to future instability.",
+                    List.of(ImpactArea.SEARCH)


Should we add a DEPLOYMENT MANAGEMENT impact area?

I wonder about the impact message. Rollover is an implementation detail. Maybe we shouldn't mention it in this case.

How about?

Automatic index lifecycle and data retention management is disabled. The performance and stability of your system could be impacted.

andreidan · 2022-05-03T10:46:08Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

+                new HealthIndicatorImpact(
+                    3,
+                    "Scheduled snapshots are not running, which could lead to future data loss.",
+                    List.of(ImpactArea.SEARCH)


BACKUP impact area?

elastichelix · 2022-05-03T13:54:36Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IlmHealthIndicatorService.java

+            List<HealthIndicatorImpact> impacts = Collections.singletonList(
+                new HealthIndicatorImpact(
+                    3,
+                    "Indices are not being rolled over, which could lead to future instability.",


Is there anything more specific that could be said about "future instability"? As in, could you get more specific around potential for running out of resources or data loss or something that might be coming?

Just thinking through what happens when shards get too big:

Slow query response times

OutOfMemoryErrors

Disks filling up (leading to maybe slow query response times at first, server crashes eventually)

Slow recovery time on node failure (which means data might be under-replicated for longer, so slower query response times).

I'm actually not sure if all of those are still the case in 8.0. I know I had some 1 TB shards in 1.x or 2.x, and the whole cluster was just an unstable mess at that point. Maybe slow query times is the thing to highlight here? @andreidan does that list above look right? Does slow query times seem like the thing to highlight here?

I've updated this list based on feedback from @Leaf-Lin at #86197 (review).

++ Keith - I think for a user the "data retention" aspect would be very important (data staying for too long in the system could breach some user data laws and such) and financially, data not moving through the tiers will have a very unpleasant impact on the bills

Leaf-Lin

Just made a few suggestions around impact wording...

Leaf-Lin · 2022-05-05T00:22:22Z

...ain/java/org/elasticsearch/cluster/coordination/InstanceHasMasterHealthIndicatorService.java

@@ -27,6 +31,7 @@ public class InstanceHasMasterHealthIndicatorService implements HealthIndicatorS

    private static final String INSTANCE_HAS_MASTER_GREEN_SUMMARY = "Health coordinating instance has a master node.";
    private static final String INSTANCE_HAS_MASTER_RED_SUMMARY = "Health coordinating instance does not have a master node.";
+    private static final String NO_MASTER_IMPACT = "The cluster cannot create, delete, or rebalance indices, and is likely to be unstable";


Impact areas should be "ingest", "deployment management", "backup", and "connected service".

For "ingest"

In addition to "create", I think the cluster won't be able to accept any new writes to existing indices (append/update) either?

For "deployment management"

All the management scheduled tasks (watcher/ilm/slm) during this time will not work.

All the _cat (and some other APIs) will not work.

For "backup"

Snapshot or restore won't work when master is missing.

For "connected service"

Feel free to call this some other name.

When master is down, all connected services like EnterpriseSearch/Kibana/APM/Fleet/Integration etc won't be accessible.

For "search"

Maybe somehow counterintuitively, search will continue to work when the master is missing. Is it worth pointing this out?

> In addition to "create", I think the cluster won't be able to accept any new writes to existing indices (append/update) either?

Ohh, you're right. Writes fail.

> All the _cat (and some other APIs) will not work.

I didn't realize this, but you're right about that, too.

> For "connected service"

Do we need to call this out separately? They only don't work because the various APIs we're already talking about don't work, right?

> Maybe somehow counterintuitively, search will continue to work when the master is missing. Is it worth pointing this out?

If you don't have a master, you probably ought to drop everything and fix that problem. There's no point in making it sound not quite so bad is there?

> For "connected service"
>
> Do we need to call this out separately? They only don't work because the various APIs we're already talking about don't work, right?

I personally would prefer to make this a bit more explicit, but I will defer to @shubhaat.
The question is this: when the master is not available, all connected services like "Kibana/EnterpriseSearch/Fleet" stop working. Should these connected services be listed as part of the impact area? And should the impact description include a statement about the "connected services"?

It makes sense to me that we'd want to tell users that. But we don't usually reference Enterprise Search in Elasticsearch code. That seems like something we'd do at the Cloud level -- it sees that Elasticsearch is broken, so it knows to tell the user that all of the applications it is hosting that are built on Elasticsearch are not working. I'll go ahead and merge this for now, and can open a separate PR about this if the consensus is that we want to start referencing applications built on top of Elasticsearch more inside of Elasticsearch.

Leaf-Lin · 2022-05-05T00:34:24Z

...ain/java/org/elasticsearch/cluster/coordination/InstanceHasMasterHealthIndicatorService.java

@@ -54,6 +59,10 @@ public HealthIndicatorResult calculate(boolean includeDetails) {

        HealthStatus instanceHasMasterStatus = masterNode == null ? HealthStatus.RED : HealthStatus.GREEN;
        String instanceHasMasterSummary = masterNode == null ? INSTANCE_HAS_MASTER_RED_SUMMARY : INSTANCE_HAS_MASTER_GREEN_SUMMARY;
+        List<HealthIndicatorImpact> impacts = new ArrayList<>();
+        if (masterNode == null) {
+            impacts.add(new HealthIndicatorImpact(1, NO_MASTER_IMPACT, List.of(ImpactArea.INGEST)));


Impact areas should be "ingest", "deployment management", "backup", and "connected service".
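As a rough sketch of this suggestion, the impact could list several areas instead of only INGEST. The types below are hypothetical stand-ins so the example compiles on its own. `DEPLOYMENT_MANAGEMENT` and `BACKUP` are the area names proposed in this review and are assumed here, not confirmed enum constants. "connected service" is left out because the thread above defers that question.

```java
import java.util.ArrayList;
import java.util.List;

public class NoMasterImpactSketch {

    // Hypothetical stand-in for the real ImpactArea enum. DEPLOYMENT_MANAGEMENT and BACKUP
    // are names proposed in the review, assumed here for illustration.
    public enum ImpactArea { SEARCH, INGEST, DEPLOYMENT_MANAGEMENT, BACKUP }

    // Hypothetical stand-in for HealthIndicatorImpact: severity, description, affected areas.
    public record HealthIndicatorImpact(int severity, String description, List<ImpactArea> areas) {}

    static final String NO_MASTER_IMPACT =
        "The cluster cannot create, delete, or rebalance indices, and is likely to be unstable";

    // Returns one severity 1 impact covering several areas when there is no master,
    // and an empty list otherwise.
    public static List<HealthIndicatorImpact> impactsFor(boolean hasMaster) {
        List<HealthIndicatorImpact> impacts = new ArrayList<>();
        if (hasMaster == false) {
            impacts.add(
                new HealthIndicatorImpact(
                    1,
                    NO_MASTER_IMPACT,
                    List.of(ImpactArea.INGEST, ImpactArea.DEPLOYMENT_MANAGEMENT, ImpactArea.BACKUP)
                )
            );
        }
        return impacts;
    }

    public static void main(String[] args) {
        System.out.println(impactsFor(false));
    }
}
```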

Leaf-Lin · 2022-05-05T00:39:17Z

server/src/main/java/org/elasticsearch/snapshots/RepositoryIntegrityHealthIndicatorService.java

+                1,
+                String.format(
+                    Locale.ROOT,
+                    "Snapshots in corrupted repositories %s cannot be restored. Data loss is possible.",


This is just a suggestion: the sentence "Data loss is possible" feels really scary. When a snapshot repo is corrupted, the data on the ES nodes is not lost yet. Maybe:

Suggested change

- "Snapshots in corrupted repositories %s cannot be restored. Data loss is possible.",
+ "Data in corrupted snapshot repository %s may be lost and cannot be restored.",

Also, should the repository here be singular or plural?

I could make the plurality dependent on how many are actually corrupted. We can have more than one repository though, so it's possible that more than one is corrupted.
My intention was to make it sound really scary. I can tone that down a little like you suggest though.

Could you please clarify whether the check only looks at the single repository found-snapshot that is defined by ESS, or whether it checks all repositories returned by GET _snapshot/_all?

It checks all repositories in the cluster state, which is equivalent to GET _snapshot/_all. I've changed it to use "repository" if only one is corrupted, and "repositories" if more than one is.
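A minimal sketch of the singular/plural choice described here, assuming the message receives the list of corrupted repository names. The class `RepositoryImpactMessage` and the method `corruptedRepositoriesImpact` are hypothetical names for illustration and are not the code merged in this PR. The wording follows the suggested change above.

```java
import java.util.List;
import java.util.Locale;

public class RepositoryImpactMessage {

    // Hypothetical helper: builds the impact text for a non-empty list of corrupted repository names.
    public static String corruptedRepositoriesImpact(List<String> corrupted) {
        // Choose the noun based on how many repositories are corrupted.
        String noun = corrupted.size() == 1 ? "repository" : "repositories";
        // List.toString() renders the names in brackets, e.g. [corrupted-repo].
        return String.format(
            Locale.ROOT,
            "Data in corrupted snapshot %s %s may be lost and cannot be restored.",
            noun,
            corrupted
        );
    }

    public static void main(String[] args) {
        System.out.println(corruptedRepositoriesImpact(List.of("corrupted-repo")));
        System.out.println(corruptedRepositoriesImpact(List.of("repo-a", "repo-b")));
    }
}
```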

Leaf-Lin · 2022-05-05T00:41:10Z

...rc/test/java/org/elasticsearch/snapshots/RepositoryIntegrityHealthIndicatorServiceTests.java

+                    Collections.singletonList(
+                        new HealthIndicatorImpact(
+                            1,
+                            "Snapshots in corrupted repositories [corrupted-repo] cannot be restored. Data loss is possible.",


Similar to the previous comment, can we make this wording less scary?

Suggested change

- "Snapshots in corrupted repositories [corrupted-repo] cannot be restored. Data loss is possible.",
+ "Data in corrupted snapshot repository [corrupted-repo] may be lost and cannot be restored.",

Leaf-Lin · 2022-05-05T01:00:10Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

+            List<HealthIndicatorImpact> impacts = Collections.singletonList(
+                new HealthIndicatorImpact(
+                    3,
+                    "Scheduled snapshots are not running, which could lead to future data loss.",


Suggested change

- "Scheduled snapshots are not running, which could lead to future data loss.",
+ "Scheduled snapshots are not running. Enable schedule snapshot management to prevent loss of data in the future.",

I don't think disabling SLM leads to future data loss 😛; SLM is more of an insurance against data loss... Not sure how best to word this?

Leaf-Lin · 2022-05-05T01:01:31Z

x-pack/plugin/ilm/src/test/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorServiceTests.java

+                    Collections.singletonList(
+                        new HealthIndicatorImpact(
+                            3,
+                            "Scheduled snapshots are not running, which could lead to future data loss.",


Similar to my comment above

Suggested change

- "Scheduled snapshots are not running, which could lead to future data loss.",
+ "Scheduled snapshots are not running. Enable schedule snapshot management to prevent loss of data in the future."

Since this is the impacts block rather than user actions, how about "Scheduled snapshots are not running. There might not be backups of the data."? Or "Scheduled snapshots are not running. There might not be backups of the data that could be used to restore if data is lost in the future."?

I know it's a bit wordy, but this feels more correct than what we had previously 👍 .

masseyke · 2022-05-05T18:34:47Z

@elasticmachine update branch

andreidan

LGTM, thanks for working on this Keith

Leaf-Lin

LGTM

masseyke added 2 commits April 26, 2022 14:08

Adding potential impacts

89ca863

Fixed wording

5edfd57

masseyke added >feature :Data Management/Health v8.3.0 labels Apr 26, 2022

elasticmachine added the Team:Data Management label Apr 26, 2022

masseyke added 2 commits April 26, 2022 14:20

Update docs/changelog/86197.yaml

7026538

merging master

bf14544

andreidan reviewed May 3, 2022

andreidan requested a review from Leaf-Lin May 3, 2022 10:50

elastichelix reviewed May 3, 2022

code review feedback

f174334

masseyke requested a review from andreidan May 3, 2022 20:15

masseyke added 2 commits May 3, 2022 15:19

checkstyle

f568403

spotlessApply

4e17cb5

Leaf-Lin reviewed May 5, 2022

code review feedback

085c524

Merge branch 'master' into feature/health-api-impacts

646ff39

masseyke requested review from Leaf-Lin and elastichelix May 5, 2022 18:37

fixing a unit test

1203cc8

andreidan approved these changes May 6, 2022

Leaf-Lin approved these changes May 9, 2022

masseyke added 2 commits May 9, 2022 10:31

merging master

ea4f550

spotlessApply

f273666

masseyke added the auto-merge label May 9, 2022

elasticsearchmachine merged commit 5af8c93 into elastic:master May 9, 2022

masseyke deleted the feature/health-api-impacts branch May 9, 2022 16:39
