Skip to content

Commit

Permalink
Remove help_url,rename summary to symptom, and user_actions to diagno…
Browse files Browse the repository at this point in the history
…sis (#88553)

Remove help_url,rename summary->symptom,user_actions->diagnosis
Separate the diagnosis `message` field in `cause` and `action`
Co-authored-by: Mary Gouseti <mgouseti@gmail.com>
  • Loading branch information
andreidan committed Jul 25, 2022
1 parent b3cc4ef commit da765ce
Show file tree
Hide file tree
Showing 25 changed files with 287 additions and 357 deletions.
6 changes: 6 additions & 0 deletions docs/changelog/88553.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
pr: 88553
summary: "Remove help_url,rename summary to symptom, and `user_actions` to diagnosis"
area: Health
type: feature
issues:
- 88474
36 changes: 19 additions & 17 deletions docs/reference/health/health.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -45,10 +45,11 @@ Each impact carries with it a severity level, an area of the system that is affe
description of the impact on the system.

Some health indicators can determine the root cause of a health problem and prescribe a set of
user actions that can be performed in order to improve the health of the system. User
actions contain a message detailing a root cause analysis, the list of affected resources (if
applicable) and steps to take that will improve the health of your cluster. User actions can
also optionally provide URLs to troubleshooting documentation.
steps that can be performed in order to improve the health of the system. The root cause and remediation
steps are encapsulated in a `diagnosis`.
A diagnosis contains a cause detailing a root cause analysis, an action containing a brief description
of the steps to take to fix the problem, the list of affected resources (if applicable), and a detailed
step by step troubleshooting guide to fix the diagnosed problem.

[[health-api-path-params]]
==== {api-path-parms-title}
Expand Down Expand Up @@ -199,12 +200,9 @@ for health status set `explain` to `false` to disable the more expensive analysi
`red`:::
The indicator is experiencing an outage or certain features are unavailable for use.

`summary`::
`symptom`::
(string) A message providing information about the current health status.

`help_url`::
(Optional, string) A link to additional troubleshooting guides for this indicator.

`details`::
(Optional, object) An object that contains additional information about the cluster that
has lead to the current health status result. This data is unstructured, and each
Expand Down Expand Up @@ -238,25 +236,29 @@ for health status set `explain` to `false` to disable the more expensive analysi
========

`user_actions`::
`diagnosis`::
(Optional, array) If a non-healthy status is returned, indicators may include a list of
user actions to take in order to remediate the health issue. User actions and root cause
analysis will not be calculated if the `explain` property is false.
diagnosis that encapsulate the cause of the health issue and an action to take in order to remediate the problem.
The diagnosis will not be calculated if the `explain` property is false.
+
.Properties of `user_actions`
.Properties of `diagnosis`
[%collapsible%open]
========
`message`::
(string) A description of a root cause of this health status and the steps that should
be taken to remediate the problem.
`cause`::
(string) A description of a root cause of this health problem.
`action`::
(string) A brief description the steps that should be taken to remediate the problem.
A more detailed step by step guide to remediate the problem is provided by the
`help_url` field.
`affected_resources`::
(Optional, array of strings) If the root cause pertains to multiple resources in the
cluster (like indices, shards, nodes, etc...) this will hold all resources that this
user action is applicable for.
diagnosis is applicable for.
`help_url`::
(string) A link to additional troubleshooting information for this user action.
(string) A link to the troubleshooting guide that'll fix the healh problem.
========
=======
======
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,4 @@
- match: { status: "green" }
- match: { components.cluster_coordination.status: "green" }
- match: { components.cluster_coordination.indicators.master_is_stable.status: "green" }
- match: { components.cluster_coordination.indicators.master_is_stable.summary: "The cluster has a stable master node" }
- match: { components.cluster_coordination.indicators.master_is_stable.symptom: "The cluster has a stable master node" }
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,6 @@
- is_true: cluster_name
- match: { components.cluster_coordination.status: "green" }
- match: { components.cluster_coordination.indicators.master_is_stable.status: "green" }
- match: { components.cluster_coordination.indicators.master_is_stable.summary: "The cluster has a stable master node" }
- match: { components.cluster_coordination.indicators.master_is_stable.symptom: "The cluster has a stable master node" }
- is_true: components.cluster_coordination.indicators.master_is_stable.details.current_master
- is_true: components.cluster_coordination.indicators.master_is_stable.details.recent_masters
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@

- is_true: cluster_name
- match: { components.cluster_coordination.indicators.master_is_stable.status: "green" }
- match: { components.cluster_coordination.indicators.master_is_stable.summary: "The cluster has a stable master node" }
- match: { components.cluster_coordination.indicators.master_is_stable.symptom: "The cluster has a stable master node" }
- is_true: components.cluster_coordination.indicators.master_is_stable.details.current_master
- is_true: components.cluster_coordination.indicators.master_is_stable.details.recent_masters

Expand All @@ -23,6 +23,6 @@

- is_true: cluster_name
- match: { components.cluster_coordination.indicators.master_is_stable.status: "green" }
- match: { components.cluster_coordination.indicators.master_is_stable.summary: "The cluster has a stable master node" }
- match: { components.cluster_coordination.indicators.master_is_stable.symptom: "The cluster has a stable master node" }
- is_false: components.cluster_coordination.indicators.master_is_stable.details

Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
---
"Diagnosis":
- skip:
version: "- 8.3.99"
reason: "diagnosis was only added in 8.4.0"

- do:
indices.create:
index: red_index
body:
settings:
number_of_shards: 1
number_of_replicas: 0
index.routing.allocation.enable: none

- do:
_internal.health:
component: data
feature: shards_availability

- is_true: cluster_name
- match: { components.data.indicators.shards_availability.status: "red" }
- match: { components.data.indicators.shards_availability.symptom: "This cluster has 1 unavailable primary." }
- is_true: components.data.indicators.shards_availability.diagnosis
- length: { components.data.indicators.shards_availability.diagnosis: 1 }
- is_true: components.data.indicators.shards_availability.diagnosis.0.affected_resources
- length: { components.data.indicators.shards_availability.diagnosis.0.affected_resources: 1 }
- match: { components.data.indicators.shards_availability.diagnosis.0.affected_resources.0: "red_index" }

This file was deleted.

Original file line number Diff line number Diff line change
Expand Up @@ -133,7 +133,7 @@ private void assertMasterStability(Client client, HealthStatus expectedStatus, S
assertThat(debugInformation, healthResponse.getStatus(), equalTo(expectedStatus));
assertThat(
debugInformation,
healthResponse.findComponent("cluster_coordination").findIndicator("master_is_stable").summary(),
healthResponse.findComponent("cluster_coordination").findIndicator("master_is_stable").symptom(),
containsString(expectedSummarySubstring)
);
});
Expand Down Expand Up @@ -317,11 +317,11 @@ public void onFailure(Exception e) {
/**
* This helper method creates a 3-node cluster where all nodes are master-eligible, and then simulates a long GC on the master node 5
* times (forcing another node to be elected master 5 times). It then asserts that the master stability health indicator status is
* YELLOW, and that expectedMasterStabilitySummarySubstring is contained in the summary.
* @param expectedMasterStabilitySummarySubstring A string to expect in the master stability health indictor summary
* YELLOW, and that expectedMasterStabilitySymptomSubstring is contained in the symptom.
* @param expectedMasterStabilitySymptomSubstring A string to expect in the master stability health indicator symptom
* @throws Exception
*/
public void testRepeatedMasterChanges(String expectedMasterStabilitySummarySubstring) throws Exception {
public void testRepeatedMasterChanges(String expectedMasterStabilitySymptomSubstring) throws Exception {
final List<String> nodes = internalCluster().startNodes(
3,
Settings.builder()
Expand Down Expand Up @@ -407,7 +407,7 @@ public void testRepeatedMasterChanges(String expectedMasterStabilitySummarySubst
* other node(s) were master, it only saw itself as master. So we want to check with another node.
*/
Client client = internalCluster().client(randomFrom(nodeNamesExceptFirstMaster));
assertMasterStability(client, HealthStatus.YELLOW, expectedMasterStabilitySummarySubstring);
assertMasterStability(client, HealthStatus.YELLOW, expectedMasterStabilitySymptomSubstring);
}

public void testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable() throws Exception {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -53,10 +53,6 @@ public class GetHealthActionIT extends ESIntegTestCase {
private static final String INSTANCE_HAS_MASTER_INDICATOR_NAME = "instance_has_master";
private static final String NONEXISTENT_INDICATOR_NAME = "test_nonexistent_indicator";

private static final String ILM_INDICATOR_HELP_URL = "http-colon-slash-slash-ilm";
private static final String SLM_INDICATOR_HELP_URL = "http-colon-slash-slash-slm";
private static final String INSTANCE_HAS_MASTER_INDICATOR_HELP_URL = "http-colon-slash-slash-instance_has_master";

@Override
protected Collection<Class<? extends Plugin>> nodePlugins() {
return appendToCopy(super.nodePlugins(), TestHealthPlugin.class);
Expand Down Expand Up @@ -129,20 +125,17 @@ public static class TestHealthIndicatorService implements HealthIndicatorService
private final ClusterService clusterService;
private final String componentName;
private final String indicatorName;
private final String helpURL;
private final Setting<HealthStatus> statusSetting;

public TestHealthIndicatorService(
ClusterService clusterService,
String componentName,
String indicatorName,
String helpURL,
Setting<HealthStatus> statusSetting
) {
this.clusterService = clusterService;
this.componentName = componentName;
this.indicatorName = indicatorName;
this.helpURL = helpURL;
this.statusSetting = statusSetting;
}

Expand All @@ -156,11 +149,6 @@ public String component() {
return componentName;
}

@Override
public String helpURL() {
return helpURL;
}

@Override
public HealthIndicatorResult calculate(boolean explain) {
var status = clusterService.getClusterSettings().get(statusSetting);
Expand All @@ -176,13 +164,13 @@ public HealthIndicatorResult calculate(boolean explain) {

public static final class IlmHealthIndicatorService extends TestHealthIndicatorService {
public IlmHealthIndicatorService(ClusterService clusterService) {
super(clusterService, DATA_COMPONENT_NAME, ILM_INDICATOR_NAME, ILM_INDICATOR_HELP_URL, ILM_HEALTH_STATUS_SETTING);
super(clusterService, DATA_COMPONENT_NAME, ILM_INDICATOR_NAME, ILM_HEALTH_STATUS_SETTING);
}
}

public static final class SlmHealthIndicatorService extends TestHealthIndicatorService {
public SlmHealthIndicatorService(ClusterService clusterService) {
super(clusterService, DATA_COMPONENT_NAME, SLM_INDICATOR_NAME, SLM_INDICATOR_HELP_URL, SLM_HEALTH_STATUS_SETTING);
super(clusterService, DATA_COMPONENT_NAME, SLM_INDICATOR_NAME, SLM_HEALTH_STATUS_SETTING);
}
}

Expand All @@ -192,7 +180,6 @@ public ClusterCoordinationHealthIndicatorService(ClusterService clusterService)
clusterService,
CLUSTER_COORDINATION_COMPONENT_NAME,
INSTANCE_HAS_MASTER_INDICATOR_NAME,
INSTANCE_HAS_MASTER_INDICATOR_HELP_URL,
CLUSTER_COORDINATION_HEALTH_STATUS_SETTING
);
}
Expand Down Expand Up @@ -276,7 +263,6 @@ private void testRootLevel(
DATA_COMPONENT_NAME,
ilmIndicatorStatus,
"Health is set to [" + ilmIndicatorStatus + "] by test plugin",
ilmIndicatorStatus.indicatesHealthProblem() ? ILM_INDICATOR_HELP_URL : null,
new SimpleHealthIndicatorDetails(Map.of("explain", explain)),
Collections.emptyList(),
Collections.emptyList()
Expand All @@ -286,7 +272,6 @@ private void testRootLevel(
DATA_COMPONENT_NAME,
slmIndicatorStatus,
"Health is set to [" + slmIndicatorStatus + "] by test plugin",
slmIndicatorStatus.indicatesHealthProblem() ? SLM_INDICATOR_HELP_URL : null,
new SimpleHealthIndicatorDetails(Map.of("explain", explain)),
Collections.emptyList(),
Collections.emptyList()
Expand All @@ -307,7 +292,6 @@ private void testRootLevel(
CLUSTER_COORDINATION_COMPONENT_NAME,
clusterCoordinationIndicatorStatus,
"Health is set to [" + clusterCoordinationIndicatorStatus + "] by test plugin",
clusterCoordinationIndicatorStatus.indicatesHealthProblem() ? INSTANCE_HAS_MASTER_INDICATOR_HELP_URL : null,
new SimpleHealthIndicatorDetails(Map.of("explain", explain)),
Collections.emptyList(),
Collections.emptyList()
Expand Down Expand Up @@ -337,7 +321,6 @@ private void testComponentAndIndicator(Client client, HealthStatus ilmIndicatorS
DATA_COMPONENT_NAME,
ilmIndicatorStatus,
"Health is set to [" + ilmIndicatorStatus + "] by test plugin",
ilmIndicatorStatus.indicatesHealthProblem() ? ILM_INDICATOR_HELP_URL : null,
new SimpleHealthIndicatorDetails(Map.of("explain", explain)),
Collections.emptyList(),
Collections.emptyList()
Expand Down Expand Up @@ -366,7 +349,6 @@ private void testComponentNoIndicator(Client client, HealthStatus ilmIndicatorSt
DATA_COMPONENT_NAME,
ilmIndicatorStatus,
"Health is set to [" + ilmIndicatorStatus + "] by test plugin",
ilmIndicatorStatus.indicatesHealthProblem() ? ILM_INDICATOR_HELP_URL : null,
new SimpleHealthIndicatorDetails(Map.of("explain", explain)),
Collections.emptyList(),
Collections.emptyList()
Expand All @@ -376,7 +358,6 @@ private void testComponentNoIndicator(Client client, HealthStatus ilmIndicatorSt
DATA_COMPONENT_NAME,
slmIndicatorStatus,
"Health is set to [" + slmIndicatorStatus + "] by test plugin",
slmIndicatorStatus.indicatesHealthProblem() ? SLM_INDICATOR_HELP_URL : null,
new SimpleHealthIndicatorDetails(Map.of("explain", explain)),
Collections.emptyList(),
Collections.emptyList()
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,13 +9,13 @@
package org.elasticsearch.cluster.coordination;

import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.health.Diagnosis;
import org.elasticsearch.health.HealthIndicatorDetails;
import org.elasticsearch.health.HealthIndicatorImpact;
import org.elasticsearch.health.HealthIndicatorResult;
import org.elasticsearch.health.HealthIndicatorService;
import org.elasticsearch.health.HealthStatus;
import org.elasticsearch.health.ImpactArea;
import org.elasticsearch.health.UserAction;

import java.util.Collection;
import java.util.List;
Expand All @@ -37,12 +37,12 @@
public class StableMasterHealthIndicatorService implements HealthIndicatorService {

public static final String NAME = "master_is_stable";
private static final String HELP_URL = "https://ela.st/fix-master";
public static final String GET_HELP_GUIDE = "https://ela.st/getting-help";
public static final UserAction CONTACT_SUPPORT_USER_ACTION = new UserAction(
new UserAction.Definition(
public static final Diagnosis CONTACT_SUPPORT_USER_ACTION = new Diagnosis(
new Diagnosis.Definition(
"contact_support",
"The Elasticsearch cluster does not have a stable master node. Get help at " + GET_HELP_GUIDE,
"The Elasticsearch cluster does not have a stable master node.",
"Get help at " + GET_HELP_GUIDE,
GET_HELP_GUIDE
),
null
Expand Down Expand Up @@ -87,11 +87,6 @@ public String component() {
return CLUSTER_COORDINATION;
}

@Override
public String helpURL() {
return HELP_URL;
}

@Override
public HealthIndicatorResult calculate(boolean explain) {
CoordinationDiagnosticsService.CoordinationDiagnosticsResult coordinationDiagnosticsResult = coordinationDiagnosticsService
Expand All @@ -113,8 +108,8 @@ HealthIndicatorResult getHealthIndicatorResult(
HealthStatus status = HealthStatus.fromCoordinationDiagnosticsStatus(coordinationDiagnosticsResult.status());
HealthIndicatorDetails details = getDetails(coordinationDiagnosticsResult.details(), explain);
Collection<HealthIndicatorImpact> impacts = status.indicatesHealthProblem() ? UNSTABLE_MASTER_IMPACTS : List.of();
List<UserAction> userActions = status.indicatesHealthProblem() ? getContactSupportUserActions(explain) : List.of();
return createIndicator(status, coordinationDiagnosticsResult.summary(), details, impacts, userActions);
List<Diagnosis> diagnosis = status.indicatesHealthProblem() ? getContactSupportUserActions(explain) : List.of();
return createIndicator(status, coordinationDiagnosticsResult.summary(), details, impacts, diagnosis);
}

/**
Expand Down Expand Up @@ -181,7 +176,7 @@ private HealthIndicatorDetails getDetails(
* @param explain If true, the returned list includes a UserAction to contact support, otherwise an empty list
* @return a single UserAction instructing users to contact support.
*/
private List<UserAction> getContactSupportUserActions(boolean explain) {
private List<Diagnosis> getContactSupportUserActions(boolean explain) {
if (explain) {
return List.of(CONTACT_SUPPORT_USER_ACTION);
} else {
Expand Down

0 comments on commit da765ce

Please sign in to comment.