Troubleshooting Elasticsearch Upgrades #6396
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -37,9 +37,25 @@ For managed deployments: | |||||
|
|
||||||
| When a node wins the master election, it logs a message containing `elected-as-master` and all nodes log a message containing `master node changed` identifying the new elected master node. | ||||||
|
|
||||||
| If there is no elected master node and no node can win an election, all nodes repeatedly log messages about the problem using a logger called `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By default, this happens every 10 seconds. | ||||||
|
|
||||||
| Master elections only involve master-eligible nodes, so focus your attention on the master-eligible nodes in this situation. These nodes' logs indicate the requirements for a master election, such as the discovery of a certain set of nodes. The [Health]({{es-apis}}operation/operation-health-report) API on these nodes also provides useful information about the situation. | ||||||
| If there is no elected master node and no node can win an election, all nodes repeatedly log messages about the problem using a [logger](/deploy-manage/monitor/logging-configuration.md) called `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By default, this happens every 10 seconds. | ||||||
|
|
||||||
| During this time, {{es}} returns `MasterNotDiscoveredException` errors, which its API reports like this: | ||||||
|
Contributor
Grammar issue: Suggested rewrite:
Suggested change
|
||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "error" : { | ||||||
| "root_cause" : [ { | ||||||
| "type" : "master_not_discovered_exception", | ||||||
| "reason" : null | ||||||
| } ], | ||||||
| "type" : "master_not_discovered_exception", | ||||||
| "reason" : null | ||||||
| }, | ||||||
| "status" : 503 | ||||||
| } | ||||||
| ``` | ||||||
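As a rough illustration of how a client-side check might recognize this response, the following Python sketch inspects a parsed error body. The function name `is_master_not_discovered` and the `sample` value are hypothetical; only the field names and the exception type come from the JSON above.

```python
def is_master_not_discovered(body: dict) -> bool:
    """Return True if a parsed error body reports master_not_discovered_exception."""
    error = body.get("error")
    if not isinstance(error, dict):
        return False
    wanted = "master_not_discovered_exception"
    # Check the top-level error type first, then each root cause.
    if error.get("type") == wanted:
        return True
    return any(
        isinstance(cause, dict) and cause.get("type") == wanted
        for cause in (error.get("root_cause") or [])
    )


# Hypothetical parsed response, mirroring the JSON shown above.
sample = {
    "error": {
        "root_cause": [
            {"type": "master_not_discovered_exception", "reason": None}
        ],
        "type": "master_not_discovered_exception",
        "reason": None,
    },
    "status": 503,
}

print(is_master_not_discovered(sample))
```

A check like this only classifies the response; the logs on the master-eligible nodes are still needed to find out why no master could be elected.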
|
|
||||||
| Master elections only involve master-eligible nodes, so focus your attention on the [master-eligible nodes](/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles.md#master-node-role) in this situation. These nodes' logs indicate the requirements for a master election, such as the discovery of a certain set of nodes. The [Health]({{es-apis}}operation/operation-health-report) API on these nodes also provides useful information about the situation. | ||||||
|
|
||||||
| If the logs or the health report indicate that {{es}} can't discover enough nodes to form a quorum, you must address the reasons preventing {{es}} from discovering the missing nodes. The missing nodes are needed to reconstruct the cluster metadata. Without the cluster metadata, the data in your cluster is meaningless. The cluster metadata is stored on a subset of the master-eligible nodes in the cluster. If a quorum can't be discovered, the missing nodes were the ones holding the cluster metadata. | ||||||
|
|
||||||
|
|
@@ -66,7 +82,7 @@ If the logs suggest that the node cannot discover or join the cluster due to tim | |||||
|
|
||||||
| ## Node joins cluster and leaves again [discovery-node-leaves] | ||||||
|
|
||||||
| If a node joins the cluster but {{es}} determines it to be faulty, it is removed from the cluster again. Refer to [Troubleshooting an unstable cluster](/troubleshoot/elasticsearch/troubleshooting-unstable-cluster.md) for more information. | ||||||
| If a node joins the cluster but {{es}} determines it to be faulty, it is removed from the cluster again. The elected master node logs this as a `node-join` message followed by a `node-left` message. Refer to [Troubleshooting an unstable cluster](/troubleshoot/elasticsearch/troubleshooting-unstable-cluster.md) for more information. | ||||||
|
|
||||||
|
|
||||||
| ## Investigate timeout and network issues [investigate-timeout-and-network-issues] | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,140 @@ | ||||||
| --- | ||||||
| navigation_title: "Troubleshoot upgrades" | ||||||
| description: "Common upgrade issues and resolutions." | ||||||
| type: troubleshooting | ||||||
| applies_to: | ||||||
| stack: | ||||||
|
|
||||||
| products: | ||||||
| - id: elasticsearch | ||||||
| --- | ||||||
|
|
||||||
| # Troubleshoot upgrades [troubleshooting-upgrades] | ||||||
|
|
||||||
| Usually, [{{es}} upgrades](/deploy-manage/upgrade/deployment-or-cluster/elasticsearch.md) proceed smoothly thanks to [planning](/deploy-manage/upgrade/plan-upgrade.md) and [preparation](/deploy-manage/upgrade/prepare-to-upgrade.md) due diligence. | ||||||
|
Contributor
Typo:
Suggested change
|
||||||
|
|
||||||
| This guide outlines {{es}} logs which indicate either upgrade blocking issues or fatal node start-up errors. | ||||||
|
|
||||||
|
|
||||||
| ## | ||||||
|
Contributor
This H2 heading is empty ( |
||||||
|
|
||||||
| To monitor which nodes have been upgraded, use the [CAT nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes) API: | ||||||
|
|
||||||
| ```console | ||||||
| GET _cat/nodes?v=true&h=name,ip,version,uptime&s=uptime | ||||||
| ``` | ||||||
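To show how that output could be read during a rolling upgrade, here is a minimal Python sketch that groups node names by version. It assumes the whitespace-separated text format with a header row that `v=true` requests and node names without spaces. The function name `nodes_by_version`, the node names, the IP addresses, and the version numbers are all hypothetical.

```python
from collections import defaultdict


def nodes_by_version(cat_output: str) -> dict[str, list[str]]:
    """Group node names by version from `_cat/nodes?v=true` text output."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_col = header.index("name")
    version_col = header.index("version")
    groups = defaultdict(list)
    for line in lines[1:]:
        columns = line.split()
        groups[columns[version_col]].append(columns[name_col])
    return dict(groups)


# Hypothetical output for h=name,ip,version,uptime&s=uptime.
SAMPLE_OUTPUT = """\
name   ip       version uptime
node-1 10.0.0.1 8.18.0  5m
node-2 10.0.0.2 8.17.4  12d
node-3 10.0.0.3 8.17.4  12d
"""

for version, names in nodes_by_version(SAMPLE_OUTPUT).items():
    print(version, names)
```

More than one key in the result means the cluster is still running mixed versions.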
|
|
||||||
|
|
||||||
| ## Rolling upgrades considerations [upgrade-issues] | ||||||
|
|
||||||
| {{es}} supports running two versions during a rolling upgrade, from an earlier version to a later version. It never supports running more than two versions, and it does not support running two versions beyond the duration of the rolling upgrade. | ||||||
|
|
||||||
| During a rolling upgrade, the cluster continues to operate normally. New functionality is either inactive or operates in a backward-compatible mode until the last node of earlier version leaves the cluster. New functionality becomes operational when all nodes in the cluster are running the later version. | ||||||
|
|
||||||
| Usually, the earlier version nodes only leave the cluster when you shut them down to upgrade them. In this case, the last earlier version node leaves the cluster when there are no more nodes to upgrade. However, [cluster fault detection](/deploy-manage/distributed-architecture/discovery-cluster-formation/cluster-fault-detection.md) might cause an earlier version node to leave the cluster, temporarily or until you intervene, before you purposely shut it down. | ||||||
|
|
||||||
| If all the remaining earlier version nodes unexpectedly leave the cluster during an upgrade, the cluster will consider itself to be fully-upgraded, automatically activate new functionality, and leave its backward-compatible mode. | ||||||
|
|
||||||
| Once that has happened, there is no way to return the cluster to a state that is compatible with the earlier version nodes. Nodes running the earlier version will not be able to join this fully-upgraded cluster. Their {{es}} logs will report `failed to join` issues due to `Caused by` errors like | ||||||
|
|
||||||
| * `node version [x.x.x] may not join a cluster comprising only nodes of version [y.y.y] or greater` | ||||||
| * `node with version [x.x.x] may not join a cluster with minimum version [y.y.y]` | ||||||
| * `node with system index mappings versions [y.y.y] may not join a cluster with minimum system index mappings versions [x.x.x]` | ||||||
| * `handshake with [NODE_ID] failed: remote node version [x.x.x] is incompatible with local node version [y.y.y]` | ||||||
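One way to spot these rejections across collected logs is to search for the fixed parts of the messages listed above. The Python sketch below does that with regular expressions. The function name `find_version_join_failures` and the sample log lines are hypothetical, and real log lines carry extra prefixes and stack traces that this sketch ignores.

```python
import re

# Fixed fragments of the four messages listed above; version numbers vary.
JOIN_REJECTION_PATTERNS = [
    re.compile(r"may not join a cluster comprising only nodes of version"),
    re.compile(r"may not join a cluster with minimum version"),
    re.compile(r"may not join a cluster with minimum system index mappings versions"),
    re.compile(r"remote node version \[[^\]]+\] is incompatible with local node version"),
]


def find_version_join_failures(log_lines):
    """Return the log lines that match any version-related join rejection."""
    return [
        line
        for line in log_lines
        if any(pattern.search(line) for pattern in JOIN_REJECTION_PATTERNS)
    ]


# Hypothetical log excerpts for illustration only.
sample_log = [
    "failed to join ... Caused by: node version [8.17.4] may not join a cluster "
    "comprising only nodes of version [8.18.0] or greater",
    "started",
    "handshake with [node-2] failed: remote node version [8.17.4] is incompatible "
    "with local node version [8.18.0]",
]

for line in find_version_join_failures(sample_log):
    print(line)
```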
|
|
||||||
| {{es}} maintains the data in the data paths of the older nodes and will recover the cluster to health using this data after the nodes are fully upgraded. Therefore, to bring these nodes back into the cluster, upgrade them. | ||||||
|
|
||||||
| :::{note} :applies_to: { ece:, ess: } | ||||||
| Usually you can "Reapply" your latest [Deployment activity](/deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) {{es}} upgrade to finish upgrading. If the node out of cluster causes [Cluster health](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-health) status of `red`, then plans will be blocked for data safety. If this is the case, kindly [contact us](/troubleshoot/index.md#contact-us) with {{ech}} deployment ID or [{{ece}} diagnostic](/troubleshoot/deployments/cloud-enterprise/run-ece-diagnostics-tool.md) flagged `--deployments` for problematic deployment. | ||||||
|
Contributor
Avoid "kindly": the Elastic style guide treats it the same as "please", which should be omitted unless asking users to wait or tolerate inconvenience.
Suggested change
|
||||||
| ::: | ||||||
|
|
||||||
| If you stop half or more of the master-eligible nodes all at once during the upgrade, the cluster will [become unavailable](/troubleshoot/elasticsearch/discovery-troubleshooting.md#discovery-no-master) due to insufficient [voting configurations](/deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-voting.md). | ||||||
|
|
||||||
| You must restart all the stopped master-eligible nodes to allow the cluster to re-form. If the re-formed cluster comprises only upgraded nodes, then the cluster will consider itself to be fully-upgraded, automatically activate new functionality, and leave its backward-compatible mode. In this case, upgrade all other nodes running the old version to enable them to join the re-formed cluster. Upgrade the master-eligible nodes last to make it less likely that this occurs. | ||||||
|
|
||||||
| In a testing or development environment with only one or two master-eligible nodes, you cannot avoid stopping half or more of the master-eligible nodes, so the cluster will always become unavailable at some point during the upgrade. When you restart the master-eligible nodes after this unavailability, the cluster will re-form with a single upgraded node, which is therefore fully-upgraded and will reject older nodes' attempts to re-join the cluster. Upgrade the master-eligible nodes last to avoid these rejections. | ||||||
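The "half or more" rule in the preceding paragraphs can be checked with simple arithmetic before you stop nodes. The Python sketch below is a simplification that counts master-eligible nodes directly; the actual voting configuration can differ from that count, so treat the result as a rough guide. The function name `stops_half_or_more` is hypothetical.

```python
def stops_half_or_more(master_eligible_total: int, stopped_at_once: int) -> bool:
    """Return True if stopping this many master-eligible nodes at once
    removes half or more of them, which makes the cluster unavailable
    according to the rule described above."""
    if master_eligible_total < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    if not 0 <= stopped_at_once <= master_eligible_total:
        raise ValueError("stopped_at_once must be between 0 and the total")
    return 2 * stopped_at_once >= master_eligible_total


# Three master-eligible nodes: stopping one at a time keeps a majority.
print(stops_half_or_more(3, 1))
# Two master-eligible nodes: stopping either one is already half.
print(stops_half_or_more(2, 1))
```

With one or two master-eligible nodes, every stop reaches the threshold, which matches the note above about testing and development environments.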
|
|
||||||
|
|
||||||
| ## Symptoms | ||||||
|
Contributor
The required Symptoms and Resolution sections (and the optional Diagnosis, Best practices, Resources sections) contain only template placeholder comments. The resolution section still has the literal stepper code block from the template. These need to be filled in before the page goes live; the page is currently non-functional for users who land on it from the link added in |
||||||
|
|
||||||
| <!-- REQUIRED | ||||||
|
|
||||||
| Describe what users observe when the problem occurs. Focus on the symptoms themselves, not their causes. Use bullet points. If applicable, include: | ||||||
|
|
||||||
| - Error messages | ||||||
| - Log output | ||||||
| - Missing or unexpected behavior | ||||||
| - Timeouts or performance issues | ||||||
| --> | ||||||
|
|
||||||
| ## Diagnosis | ||||||
|
|
||||||
| <!-- OPTIONAL | ||||||
|
|
||||||
| Use a Diagnosis section when you need to help users narrow down the problem from an initial symptom before providing the resolution. This is especially useful when: | ||||||
|
|
||||||
| - The initial symptom requires diagnostic steps to identify the specific cause | ||||||
| - Multiple resolutions depend on diagnostic findings | ||||||
| - The same symptom can have multiple root causes | ||||||
|
|
||||||
| Use numbered steps or bullet points to guide users through the diagnostic process. | ||||||
| --> | ||||||
|
|
||||||
| ## Resolution | ||||||
|
|
||||||
| <!-- REQUIRED | ||||||
|
|
||||||
| Provide clear, actionable steps to resolve the issue. | ||||||
|
|
||||||
| - Order steps from most common to least common to resolve the issue | ||||||
| - Numbered instructions that begin with imperative verb phrases | ||||||
| - Keep each step focused on a single action | ||||||
| - Use the stepper component | ||||||
| - Avoid diagnostic branching unless the problem cannot be resolved linearly. | ||||||
|
|
||||||
| For complex scenarios, consider these patterns: | ||||||
|
|
||||||
| - Multiple resolutions or "combo" resolutions: When multiple solutions can be applied together or independently, present them as a list of options (users can choose one or combine multiple approaches). | ||||||
|
|
||||||
| - Resolutions that differ by deployment type: When steps differ significantly by deployment type ({{ecloud}} versus self-managed versus ECK), organize by deployment type using clear headings. | ||||||
|
|
||||||
| - Separating diagnosis from causes and resolution: When multiple resolutions depend on diagnostic findings, use a separate Diagnosis section before Resolution to help users identify their specific situation first. | ||||||
|
|
||||||
| For more information about the stepper component, refer to [the syntax guide](https://elastic.github.io/docs-builder/syntax/stepper/). | ||||||
| --> | ||||||
|
|
||||||
| ```markdown | ||||||
| :::::{stepper} | ||||||
|
|
||||||
| ::::{step} [Step title] | ||||||
| [Step description or instruction - begin with an imperative verb] | ||||||
| :::: | ||||||
|
|
||||||
| ::::{step} [Step title] | ||||||
| [Step description or instruction - begin with an imperative verb] | ||||||
| :::: | ||||||
|
|
||||||
| ::::{step} [Step title] | ||||||
| [Step description or instruction - begin with an imperative verb] | ||||||
| :::: | ||||||
|
|
||||||
| ::::: | ||||||
| ``` | ||||||
|
|
||||||
| ## Best practices | ||||||
|
|
||||||
| <!-- OPTIONAL BUT RECOMMENDED | ||||||
|
|
||||||
| Explain how to avoid this issue in the future. Use bullet points. Do not restate general product best practices or guidance that applies broadly beyond this issue. | ||||||
| --> | ||||||
|
|
||||||
| ## Resources | ||||||
|
|
||||||
| <!-- OPTIONAL | ||||||
|
|
||||||
| Link to related documentation for deeper context. These links are supplementary — all information required to fix the issue should already be on this page. | ||||||
|
|
||||||
| Avoid linking to GitHub issues, pull requests, or internal discussions. Resources should be stable, user-facing documentation. | ||||||
| --> | ||||||
|
|
||||||
| - [Related documentation link] | ||||||
|
Contributor
Placeholder links should be filled in with real targets or removed before publishing:
|
||||||
| - [Contrib/upstream reference] | ||||||
Typo: `earlier ersion` should be `earlier version`.