[[health-api]]
=== Health API
++++
<titleabbrev>Health</titleabbrev>
++++
An API that reports the health status of an {es} cluster.
[[health-api-request]]
==== {api-request-title}
`GET /_health_report` +
`GET /_health_report/<indicator>` +
[[health-api-prereqs]]
==== {api-prereq-title}
* If the {es} {security-features} are enabled, you must have the `monitor` or
`manage` <<privileges-list-cluster,cluster privilege>> to use this API.
[[health-api-desc]]
==== {api-description-title}
The health API returns a report with the health status of an Elasticsearch cluster. The report
contains a list of indicators that compose Elasticsearch functionality.
Each indicator has a health status of: `green`, `unknown`, `yellow` or `red`. The indicator will
provide an explanation and metadata describing the reason for its current health status.
The cluster's status is controlled by the worst indicator status.
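The aggregation rule can be sketched in a few lines of Python. The severity ordering used here (`green`, then `unknown`, then `yellow`, then `red`) is an assumption based on the order in which the statuses are listed above, and `worst_status` is a hypothetical helper, not part of any client library.

```python
# Assumed severity order, from least to most severe.
SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


def worst_status(indicators):
    """Return the most severe status among the indicator results.

    `indicators` maps indicator names to objects that carry a `status` key,
    mirroring the `indicators` object of the response body.
    """
    statuses = [result["status"] for result in indicators.values()]
    # With nothing to aggregate, report `unknown` (a choice made for this sketch).
    return max(statuses, key=SEVERITY.__getitem__, default="unknown")
```

For example, a report with one `green` and one `yellow` indicator aggregates to `yellow`.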
If an indicator's status is non-green, the indicator result may include a list of impacts
that detail the functionality negatively affected by the health issue.
Each impact carries with it a severity level, an area of the system that is affected, and a simple
description of the impact on the system.
Some health indicators can determine the root cause of a health problem and prescribe a set of
steps that can be performed in order to improve the health of the system. The root cause and remediation
steps are encapsulated in a `diagnosis`.
A diagnosis contains a cause detailing a root cause analysis, an action containing a brief description
of the steps to take to fix the problem, the list of affected resources (if applicable), and a detailed
step-by-step troubleshooting guide to fix the diagnosed problem.
NOTE: The health indicators perform root cause analysis of non-green health statuses. This can
be computationally expensive when called frequently. When setting up automated polling of the API
for health status, set `verbose` to `false` to disable the more expensive analysis logic.
[[health-api-path-params]]
==== {api-path-parms-title}
`<indicator>`::
(Optional, string) Limit the information returned to
a specific indicator. Supported indicators are:
+
--
`master_is_stable`::
Reports health issues regarding
the stability of the node that is seen as the master by the node handling
the health request. If enough master changes are observed in a short period of time,
this indicator attempts to diagnose the cluster formation issues it detects
and report useful information about them.
`shards_availability`::
Reports health issues regarding shard assignments.
`disk`::
Reports health issues caused by lack of disk space.
`ilm`::
Reports health issues related to
Index Lifecycle Management.
`repository_integrity`::
Tracks repository integrity and reports health issues
that arise if repositories become corrupted, unknown, or invalid.
`slm`::
Reports health issues related to
Snapshot Lifecycle Management.
`shards_capacity`::
Reports health issues related to the shards
capacity of the cluster.
--
[[health-api-query-params]]
==== {api-query-parms-title}
`verbose`::
(Optional, Boolean) If `true`, the response includes additional details that help explain the status of each non-green indicator.
These details include additional troubleshooting metrics and sometimes a root cause analysis of a health status.
Defaults to `true`.
`size`::
(Optional, integer) The maximum number of affected resources to return.
Because a diagnosis can return multiple types of affected resources, this parameter limits the number of
resources returned for each type to the configured value (for example, a diagnosis could return
`1000` affected indices and `1000` affected nodes).
Defaults to `1000`.
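As a minimal sketch of how the path and query parameters combine into a request URL, the following Python helper builds the address without sending anything. The function name `health_report_url` and the base address `http://localhost:9200` are assumptions made for illustration.

```python
from urllib.parse import quote, urlencode


def health_report_url(base_url, indicator=None, verbose=None, size=None):
    """Build a health report URL from the documented parameters.

    Parameters left as None are omitted so that the server defaults apply.
    """
    path = "/_health_report"
    if indicator is not None:
        # Restrict the report to a single indicator, for example "disk".
        path += "/" + quote(indicator, safe="")
    params = {}
    if verbose is not None:
        params["verbose"] = "true" if verbose else "false"
    if size is not None:
        params["size"] = str(size)
    query = "?" + urlencode(params) if params else ""
    return base_url.rstrip("/") + path + query


# A low-overhead URL suitable for automated polling.
print(health_report_url("http://localhost:9200", verbose=False))
```

Passing `indicator="disk"` and `size=10` instead yields a URL ending in `/_health_report/disk?size=10`.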
[role="child_attributes"]
[[health-api-response-body]]
==== {api-response-body-title}
`cluster_name`::
(string) The name of the cluster.
`status`::
(Optional, string) Health status of the cluster, based on the aggregated status of all indicators
in the cluster. If the health of a specific indicator is being requested, this top
level status will be omitted. Statuses are:
`green`:::
The cluster is healthy.
`unknown`:::
The health of the cluster could not be determined.
`yellow`:::
The functionality of a cluster is in a degraded state and may need remediation
to avoid the health becoming `red`.
`red`:::
The cluster is experiencing an outage or certain features are unavailable for use.
`indicators`::
(object) Information about the health of the cluster indicators.
+
.Properties of `indicators`
[%collapsible%open]
====
`<indicator>`::
(object) Contains health results for an indicator.
+
.Properties of `<indicator>`
[%collapsible%open]
=======
`status`::
(string) Health status of the indicator. Statuses are:
`green`:::
The indicator is healthy.
`unknown`:::
The health of the indicator could not be determined.
`yellow`:::
The functionality of an indicator is in a degraded state and may need remediation
to avoid the health becoming `red`.
`red`:::
The indicator is experiencing an outage or certain features are unavailable for use.
`symptom`::
(string) A message providing information about the current health status.
`details`::
(Optional, object) An object that contains additional information about the cluster that
has led to the current health status result. This data is unstructured, and each
indicator returns <<health-api-response-details, a unique set of details>>. Details will not be calculated if the
`verbose` property is set to `false`.
`impacts`::
(Optional, array) If a non-healthy status is returned, indicators may include a list of
impacts that this health status will have on the cluster.
+
.Properties of `impacts`
[%collapsible%open]
========
`severity`::
(integer) How important this impact is to the functionality of the cluster. A value of 1
is the highest severity, with larger values indicating lower severity.
`description`::
(string) A description of the impact on the cluster.
`impact_areas`::
(array of strings) The areas of cluster functionality that this impact affects.
Possible values are:
+
--
* `search`
* `ingest`
* `backup`
* `deployment_management`
--
========
`diagnosis`::
(Optional, array) If a non-healthy status is returned, indicators may include a list of
diagnoses that encapsulate the cause of the health issue and an action to take in order to remediate the problem.
The diagnosis will not be calculated if the `verbose` property is `false`.
+
.Properties of `diagnosis`
[%collapsible%open]
========
`cause`::
(string) A description of a root cause of this health problem.
`action`::
(string) A brief description of the steps that should be taken to remediate the problem.
A more detailed step-by-step guide to remediate the problem is provided by the
`help_url` field.
`affected_resources`::
(Optional, array of strings) If the root cause pertains to multiple resources in the
cluster (such as indices, shards, or nodes), this field holds all the resources that this
diagnosis applies to.
`help_url`::
(string) A link to the troubleshooting guide for fixing the health problem.
========
=======
====
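The response structure described above can be walked with a short Python function. This is a sketch that assumes the response has already been parsed into a dictionary; `summarize_problems` is a hypothetical helper, and it relies only on the field names documented in this section.

```python
def summarize_problems(report):
    """Return human-readable lines for every non-green indicator."""
    lines = []
    for name, result in report.get("indicators", {}).items():
        if result["status"] == "green":
            continue
        lines.append(f"{name}: {result['status']} ({result.get('symptom', '')})")
        # A severity of 1 is the highest, so sort ascending to list it first.
        impacts = sorted(result.get("impacts", []), key=lambda i: i["severity"])
        for impact in impacts:
            areas = ", ".join(impact.get("impact_areas", []))
            lines.append(
                f"  impact [severity {impact['severity']}] {areas}: {impact['description']}"
            )
        # `diagnosis` is absent when `verbose` is false, hence the default.
        for diagnosis in result.get("diagnosis", []):
            lines.append(f"  cause: {diagnosis['cause']}")
            lines.append(f"  action: {diagnosis['action']} (see {diagnosis['help_url']})")
    return lines
```

Because `impacts` and `diagnosis` are optional, the function reads them with defaults rather than assuming they are present.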
[role="child_attributes"]
[[health-api-response-details]]
==== Indicator Details
Each health indicator in the health API returns a set of details that further explains the state of the system. The
details have contents and a structure that is unique to each indicator.
[[health-api-response-details-master-is-stable]]
===== master_is_stable
`current_master`::
(object) Information about the currently elected master.
+
.Properties of `current_master`
[%collapsible%open]
====
`node_id`::
(string) The node id of the currently elected master, or null if no master is elected.
`name`::
(string) The node name of the currently elected master, or null if no master is elected.
====
`recent_masters`::
(Optional, array) A list of nodes that have been elected or replaced as master in a recent
time window. This field is present if the master
is changing rapidly enough to cause problems, and also present as additional information
when the indicator is `green`. This array includes only elected masters, and does _not_
include empty entries for periods when there was no elected master.
+
.Properties of `recent_masters`
[%collapsible%open]
====
`node_id`::
(string) The node id of a recently active master node.
`name`::
(string) The node name of a recently active master node.
====
`exception_fetching_history`::
(Optional, object) If the node being queried sees that the elected master has stepped down
repeatedly, the master history is requested from the most recently elected master node for
diagnosis purposes. If fetching this remote history fails, the exception information is
returned in this detail field.
+
.Properties of `exception_fetching_history`
[%collapsible%open]
====
`message`::
(string) The exception message for the failed history fetch operation.
`stack_trace`::
(string) The stack trace for the failed history fetch operation.
====
`cluster_formation`::
(Optional, array) If there has been no elected master node recently, the node being queried attempts to
gather information about why the cluster has been unable to form, or why the node being queried has been
unable to join the cluster if it has formed. This array can contain an entry for each master-eligible
node's view of cluster formation.
+
.Properties of `cluster_formation`
[%collapsible%open]
====
`node_id`::
(string) The node id of a master-eligible node.
`name`::
(Optional, string) The node name of a master-eligible node.
`cluster_formation_message`::
(string) A detailed description explaining what went wrong with cluster formation, or why this node was
unable to join the cluster if it has formed.
====
[[health-api-response-details-shards-availability]]
===== shards_availability
`unassigned_primaries`::
(int) The number of primary shards that are unassigned for reasons other than initialization or relocation.
`initializing_primaries`::
(int) The number of primary shards that are initializing or recovering.
`creating_primaries`::
(int) The number of primary shards that are unassigned because they have been very recently created.
`restarting_primaries`::
(int) The number of primary shards that are relocating because of a node shutdown operation.
`started_primaries`::
(int) The number of primary shards that are active and available on the system.
`unassigned_replicas`::
(int) The number of replica shards that are unassigned for reasons other than initialization or relocation.
`initializing_replicas`::
(int) The number of replica shards that are initializing or recovering.
`restarting_replicas`::
(int) The number of replica shards that are relocating because of a node shutdown operation.
`started_replicas`::
(int) The number of replica shards that are active and available on the system.
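To get a per-state view across primaries and replicas, the counters above can be grouped by their prefix. The following Python sketch assumes the `details` object of the `shards_availability` indicator has been parsed into a dictionary; `shard_state_totals` is a hypothetical helper.

```python
def shard_state_totals(details):
    """Sum primary and replica counts for each shard state.

    Keys such as `unassigned_primaries` are split into a state
    (`unassigned`) and a shard kind (`primaries`).
    """
    totals = {}
    for key, count in details.items():
        state, _, kind = key.rpartition("_")
        # Ignore any key that does not follow the <state>_<kind> pattern.
        if kind in ("primaries", "replicas"):
            totals[state] = totals.get(state, 0) + count
    return totals
```

A state that exists only for primaries, such as `creating`, still appears in the result with the primary count alone.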
[[health-api-response-details-disk]]
===== disk
`indices_with_readonly_block`::
(int) The number of indices the system enforced a read-only index block (`index.blocks.read_only_allow_delete`) on
because the cluster is running out of space.
`nodes_with_enough_disk_space`::
(int) The number of nodes that have enough available disk space to function.
`nodes_over_high_watermark`::
(int) The number of nodes that are running low on disk space and are likely to run out. Their disk usage
has tripped the <<cluster-routing-watermark-high, high watermark threshold>>.
`nodes_over_flood_stage_watermark`::
(int) The number of nodes that have run out of disk. Their disk usage has tripped the <<cluster-routing-flood-stage, flood stage
watermark threshold>>.
`unknown_nodes`::
(int) The number of nodes for which it was not possible to determine their disk health.
[[health-api-response-details-repository-integrity]]
===== repository_integrity
`total_repositories`::
(Optional, int) The number of currently configured repositories on the system. If there are no repositories
configured then this detail is omitted.
`corrupted_repositories`::
(Optional, int) The number of repositories on the system that have been determined to be corrupted. If there are
no corrupted repositories detected, this detail is omitted.
`corrupted`::
(Optional, array of strings) If corrupted repositories have been detected in the system, the names of up to ten of
them are displayed in this field. If no corrupted repositories are found, this detail is omitted.
`unknown_repositories`::
(Optional, int) The number of repositories that have been determined to be unknown by at least one node.
If there are no unknown repositories detected, this detail is omitted.
`invalid_repositories`::
(Optional, int) The number of repositories that have been determined to be invalid by at least one node.
If there are no invalid repositories detected, this detail is omitted.
[[health-api-response-details-ilm]]
===== ilm
`ilm_status`::
(string) The current status of the Index Lifecycle Management feature. Either `STOPPED`, `STOPPING`, or `RUNNING`.
`policies`::
(int) The number of index lifecycle policies that the system is managing.
`stagnating_indices`::
(int) The number of indices managed by {ilm} that have been stagnant longer than expected.
`stagnating_indices_per_action`::
(Optional, map) Summary of the number of indices, grouped by action, that have been stagnant longer than
expected.
+
.Properties of `stagnating_indices_per_action`
[%collapsible%open]
=======
`downsample`::
(int) The number of stagnant indices in the `downsample` action.
`allocate`::
(int) The number of stagnant indices in the `allocate` action.
`shrink`::
(int) The number of stagnant indices in the `shrink` action.
`searchable_snapshot`::
(int) The number of stagnant indices in the `searchable_snapshot` action.
`rollover`::
(int) The number of stagnant indices in the `rollover` action.
`forcemerge`::
(int) The number of stagnant indices in the `forcemerge` action.
`delete`::
(int) The number of stagnant indices in the `delete` action.
`migrate`::
(int) The number of stagnant indices in the `migrate` action.
=======
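One way to use `stagnating_indices_per_action` is to rank the actions by how many indices are stuck in each. The Python sketch below assumes the `details` object of the `ilm` indicator has been parsed into a dictionary; `stagnant_actions` is a hypothetical helper.

```python
def stagnant_actions(details):
    """Return (action, count) pairs with stagnant indices, largest count first."""
    # The map is optional, so fall back to an empty one.
    per_action = details.get("stagnating_indices_per_action", {})
    ranked = sorted(per_action.items(), key=lambda item: item[1], reverse=True)
    return [(action, count) for action, count in ranked if count > 0]
```

Actions with a count of zero are dropped so that the result lists only the actions that need attention.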
[[health-api-response-details-slm]]
===== slm
`slm_status`::
(string) The current status of the Snapshot Lifecycle Management feature. Either `STOPPED`, `STOPPING`, or `RUNNING`.
`policies`::
(int) The number of snapshot policies that the system is managing.
`unhealthy_policies`::
(map) A detailed view of the policies that are considered unhealthy due to having
several consecutive unsuccessful invocations.
The `count` key represents the number of unhealthy policies (int).
The `invocations_since_last_success` key reports a map where the unhealthy policy
name is the key and its corresponding number of failed invocations is the value.
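The shape of `unhealthy_policies` lends itself to a short lookup. The Python sketch below assumes the `details` object of the `slm` indicator has been parsed into a dictionary; `failing_policies` is a hypothetical helper, and any policy names passed to it are placeholders.

```python
def failing_policies(details):
    """Return (policy_name, failed_invocations) pairs, worst first."""
    unhealthy = details.get("unhealthy_policies", {})
    failures = unhealthy.get("invocations_since_last_success", {})
    # Sort by the number of failed invocations, highest first.
    return sorted(failures.items(), key=lambda item: item[1], reverse=True)
```

The length of the returned list should match the `count` key, since both describe the same set of unhealthy policies.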
[[health-api-response-details-shards-capacity]]
===== shards_capacity
`data`::
(map) A view with information about the current capacity of shards for data nodes that do not belong to the frozen tier.
+
.Properties of `data`
[%collapsible%open]
=====
`max_shards_in_cluster`::
(int) Indicates the maximum number of shards that the cluster can hold.
`current_used_shards`::
(Optional, int) The total number of shards held by the cluster. Only displayed when the indicator's status is `red` or `yellow`.
=====
`frozen`::
(map) A view with information about the current capacity of shards for data nodes that belong to the frozen tier.
+
.Properties of `frozen`
[%collapsible%open]
=====
`max_shards_in_cluster`::
(int) Indicates the maximum number of shards the cluster can hold for the partially mounted indices.
`current_used_shards`::
(Optional, int) The total number of shards the partially mounted indices have in the cluster. Only displayed when the indicator's status is `red` or `yellow`.
=====
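Because `current_used_shards` is only present for non-green statuses, code that computes shard usage has to tolerate its absence. The Python sketch below assumes the `details` object of the `shards_capacity` indicator has been parsed into a dictionary; `shard_usage` is a hypothetical helper.

```python
def shard_usage(details):
    """Return the used fraction of shard capacity for each node group.

    The value is None when `current_used_shards` is missing, which is
    expected while the indicator is green.
    """
    usage = {}
    for group in ("data", "frozen"):
        view = details.get(group, {})
        used = view.get("current_used_shards")
        limit = view.get("max_shards_in_cluster")
        if used is None or not limit:
            usage[group] = None
        else:
            usage[group] = used / limit
    return usage
```

A result of `None` therefore means that the usage was not reported, not that no shards are in use.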
[[health-api-example]]
==== {api-examples-title}
[source,console]
--------------------------------------------------
GET _health_report
--------------------------------------------------
The API returns a response with all the indicators regardless
of current status.
[source,console]
--------------------------------------------------
GET _health_report/shards_availability
--------------------------------------------------
The API returns a response for just the shard availability indicator.
[source,console]
--------------------------------------------------
GET _health_report?verbose=false
--------------------------------------------------
The API returns a response with all health indicators but will
not calculate details or root cause analysis for the response. This is helpful
if you would like to monitor the health API and do not want the overhead of
calculating additional troubleshooting details on each call.