18 changes: 5 additions & 13 deletions deploy-manage/upgrade/deployment-or-cluster/elasticsearch.md
@@ -29,7 +29,7 @@

* _(Recommended)_ **A rolling restart**

This option allows you to upgrade your cluster one node at a time without interrupting service. Running multiple versions of {{es}} in the same cluster beyond the duration of an upgrade is not supported, as shards cannot be replicated from upgraded nodes to nodes running the old-version. Running more than two versions of {{es}} in the same cluster is not supported.
This option allows you to upgrade your cluster one node at a time without interrupting service. Running multiple versions of {{es}} in the same cluster beyond the duration of an upgrade is not supported, as shards cannot be replicated from upgraded nodes to nodes running the earlier ersion. Running more than two versions of {{es}} in the same cluster is not supported.
Contributor

Typo: `earlier ersion` should be `earlier version`.

Suggested change
This option allows you to upgrade your cluster one node at a time without interrupting service. Running multiple versions of {{es}} in the same cluster beyond the duration of an upgrade is not supported, as shards cannot be replicated from upgraded nodes to nodes running the earlier ersion. Running more than two versions of {{es}} in the same cluster is not supported.
This option allows you to upgrade your cluster one node at a time without interrupting service. Running multiple versions of {{es}} in the same cluster beyond the duration of an upgrade is not supported, as shards cannot be replicated from upgraded nodes to nodes running the earlier version. Running more than two versions of {{es}} in the same cluster is not supported.


* **A full restart**

@@ -246,20 +246,12 @@
To monitor which nodes have been upgraded, use the [CAT nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes) API:

```console
GET _cat/nodes?v=true&h=name,ip,version,uptime
GET _cat/nodes?v=true&h=name,ip,version,uptime&s=uptime
```

## Rolling upgrades considerations [upgrade-issues]

During a rolling upgrade, the cluster continues to operate normally. New functionality is either inactive or operates in a backward-compatible mode until the last old-version node leaves the cluster. New functionality becomes operational when all nodes in the cluster are running the new version.

Usually, the old-version nodes only leave the cluster when you shut them down to upgrade them. In this case, the last old-version node leaves the cluster when there are no more nodes to upgrade. However, it is possible that an old-version node might temporarily or permanently (until intervened) leave the cluster before you purposely shut it down due to [cluster fault detection](/deploy-manage/distributed-architecture/discovery-cluster-formation/cluster-fault-detection.md).

If all the remaining old-version nodes unexpectedly leave the cluster during an upgrade, the cluster will consider itself to be fully-upgraded, automatically activate new functionality, and leave its backward-compatible mode. Once that has happened, there is no way to return the cluster to a state that is compatible with the old-version nodes. Nodes running the earlier version will not be able to join this fully-upgraded cluster. To bring these nodes back into the cluster, upgrade them. {{es}} maintains the data in the data paths of the older nodes and will recover the cluster to health using this data after the nodes are fully upgraded.

If you stop half or more of the master-eligible nodes all at once during the upgrade, the cluster will become unavailable due to insufficient [voting configurations](/deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-voting.md). You must restart all the stopped master-eligible nodes to allow the cluster to re-form. If the re-formed cluster comprises only upgraded nodes, then the cluster will consider itself to be fully-upgraded, automatically activate new functionality, and leave its backward-compatible mode. In this case, upgrade all other nodes running the old version to enable them to join the re-formed cluster. Upgrade the master-eligible nodes last to make it less likely that this occurs.

In a testing or development environment with only one or two master-eligible nodes, you cannot avoid stopping half or more of the master-eligible nodes, so the cluster will always become unavailable at some point during the upgrade. When you restart the master-eligible nodes after this unavailability, the cluster will re-form with a single upgraded node, which is therefore fully-upgraded and will reject older nodes' attempts to re-join the cluster. Upgrade the master-eligible nodes last to avoid these rejections.
:::{tip}
If you encounter issues during rolling upgrade, refer to [Troubleshoot upgrades](troubleshoot/elasticsearch/troubleshooting-upgrades.md) for common issues.

Check failure on line 253 in deploy-manage/upgrade/deployment-or-cluster/elasticsearch.md (GitHub Actions / build / build):

`troubleshoot/elasticsearch/troubleshooting-upgrades.md` does not exist. If it was recently removed add a redirect. resolved to `/github/workspace/deploy-manage/upgrade/deployment-or-cluster/troubleshoot/elasticsearch/troubleshooting-upgrades.md`
:::

## Archived settings [archived-settings]

1 change: 1 addition & 0 deletions troubleshoot/elasticsearch.md
@@ -36,6 +36,7 @@ This section helps you fix issues with {{es}} deployments.
* [](/troubleshoot/elasticsearch/start-ilm.md)
* [](/troubleshoot/elasticsearch/index-lifecycle-management-errors.md)
* [](/troubleshoot/elasticsearch/file-based-recovery.md)
* [](/troubleshoot/elasticsearch/troubleshooting-upgrades.md)

## Capacity [troubleshooting-capacity]

24 changes: 20 additions & 4 deletions troubleshoot/elasticsearch/discovery-troubleshooting.md
@@ -37,9 +37,25 @@ For managed deployments:

When a node wins the master election, it logs a message containing `elected-as-master` and all nodes log a message containing `master node changed` identifying the new elected master node.

If there is no elected master node and no node can win an election, all nodes repeatedly log messages about the problem using a logger called `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By default, this happens every 10 seconds.

Master elections only involve master-eligible nodes, so focus your attention on the master-eligible nodes in this situation. These nodes' logs indicate the requirements for a master election, such as the discovery of a certain set of nodes. The [Health]({{es-apis}}operation/operation-health-report) API on these nodes also provides useful information about the situation.
If there is no elected master node and no node can win an election, all nodes repeatedly log messages about the problem using a [logger](/deploy-manage/monitor/logging-configuration.md) called `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By default, this happens every 10 seconds.

During this time the {{es}} will induce `MasterNotDiscoveredException` errors and which its API will report like:
Contributor

Grammar issue: `the {{es}} will induce ... errors and which its API will report like:` has two problems: the article "the" before {{es}} and the spurious "and which".

Suggested rewrite:

Suggested change
During this time the {{es}} will induce `MasterNotDiscoveredException` errors and which its API will report like:
During this time, {{es}} returns `MasterNotDiscoveredException` errors. Its API reports:


```json
{
"error" : {
"root_cause" : [ {
"type" : "master_not_discovered_exception",
"reason" : null
} ],
"type" : "master_not_discovered_exception",
"reason" : null
},
"status" : 503
}
```

Master elections only involve master-eligible nodes, so focus your attention on the [master-eligible nodes](/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles.md#master-node-role) in this situation. These nodes' logs indicate the requirements for a master election, such as the discovery of a certain set of nodes. The [Health]({{es-apis}}operation/operation-health-report) API on these nodes also provides useful information about the situation.

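For example, on a master-eligible node you can query the `master_is_stable` indicator of the health report directly (a minimal sketch, assuming a version that includes the health report API, 8.7 or later):

```console
GET _health_report/master_is_stable
```

When the indicator is not `green`, its `diagnosis` entries suggest actions for the current formation problem.
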
If the logs or the health report indicate that {{es}} can't discover enough nodes to form a quorum, you must address the reasons preventing {{es}} from discovering the missing nodes. The missing nodes are needed to reconstruct the cluster metadata. Without the cluster metadata, the data in your cluster is meaningless. The cluster metadata is stored on a subset of the master-eligible nodes in the cluster. If a quorum can't be discovered, the missing nodes were the ones holding the cluster metadata.

@@ -66,7 +82,7 @@ If the logs suggest that the node cannot discover or join the cluster due to tim

## Node joins cluster and leaves again [discovery-node-leaves]

If a node joins the cluster but {{es}} determines it to be faulty, it is removed from the cluster again. Refer to [Troubleshooting an unstable cluster](/troubleshoot/elasticsearch/troubleshooting-unstable-cluster.md) for more information.
If a node joins the cluster but {{es}} determines it to be faulty, it is removed from the cluster again. This will log as `node-join` then afterwards as `node-left` by the elected-master node. Refer to [Troubleshooting an unstable cluster](/troubleshoot/elasticsearch/troubleshooting-unstable-cluster.md) for more information.


## Investigate timeout and network issues [investigate-timeout-and-network-issues]
140 changes: 140 additions & 0 deletions troubleshoot/elasticsearch/troubleshooting-upgrades.md
@@ -0,0 +1,140 @@
---
navigation_title: "Troubleshoot upgrades"
description: "Common upgrade issues and resolutions."
type: troubleshooting
applies_to:
stack:
Contributor

`applies_to: stack:` is missing a lifecycle value. Refer to the cumulative-docs reference for valid values (e.g., `ga`, `beta`, `coming`). Without a value this will likely render incorrectly or fail validation.

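For instance, adding a `ga` lifecycle would be one valid form (illustrative; use whichever lifecycle actually applies to this page):

```yaml
applies_to:
  stack: ga
```
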
products:
- id: elasticsearch
---

# Troubleshoot upgrades [troubleshooting-upgrades]

Usually, [{{es}} upgrades](/deploy-manage/upgrade/deployment-or-cluster/elasticsearch.md) proceed smoothly due to [planning](/deploy-manage/upgrade/plan-upgrade.md) and [preparation](/deploy-manage/upgrade/prepare-to-upgrade.md) due dilligence.

Check warning on line 13 in troubleshoot/elasticsearch/troubleshooting-upgrades.md (GitHub Actions / build / vale):

Elastic.Spelling: 'dilligence' is a possible misspelling.
Contributor

Typo: `due dilligence` should be `due diligence`.

Suggested change
Usually, [{{es}} upgrades](/deploy-manage/upgrade/deployment-or-cluster/elasticsearch.md) proceed smoothly due to [planning](/deploy-manage/upgrade/plan-upgrade.md) and [preparation](/deploy-manage/upgrade/prepare-to-upgrade.md) due dilligence.
Usually, [{{es}} upgrades](/deploy-manage/upgrade/deployment-or-cluster/elasticsearch.md) proceed smoothly due to [planning](/deploy-manage/upgrade/plan-upgrade.md) and [preparation](/deploy-manage/upgrade/prepare-to-upgrade.md) due diligence.


This guide outlines {{es}} logs which indicate either upgrade blocking issues or fatal node start-up errors.


##
Contributor

This H2 heading is empty (`##` with no title). It should either be given a title (e.g., `## Monitor upgrade progress`) or removed. An untitled heading will also likely fail docs build validation.


To monitor which nodes have been upgraded, use the [CAT nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes) API:

```console
GET _cat/nodes?v=true&h=name,ip,version,uptime&s=uptime
```
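
Because `s=uptime` sorts ascending, the most recently restarted (that is, just-upgraded) nodes appear first. A response might look like the following; the node names, addresses, and versions are illustrative:

```text
name   ip        version uptime
node-2 10.0.0.12 9.0.0   32m
node-1 10.0.0.11 9.0.0   1.1h
node-0 10.0.0.10 8.18.0  14d
```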


## Rolling upgrades considerations [upgrade-issues]

{{es}} supports running two versions during a rolling upgrade, from an earlier version to later version. It does not ever support running more than two versions. It does not support two versions beyond the duration of the rolling upgrade.

During a rolling upgrade, the cluster continues to operate normally. New functionality is either inactive or operates in a backward-compatible mode until the last node of earlier version leaves the cluster. New functionality becomes operational when all nodes in the cluster are running the later version.

Usually, the earlier version nodes only leave the cluster when you shut them down to upgrade them. In this case, the last earlier version node leaves the cluster when there are no more nodes to upgrade. However, it is possible that an earlier version node might temporarily or permanently (until intervened) leave the cluster before you purposely shut it down due to [cluster fault detection](/deploy-manage/distributed-architecture/discovery-cluster-formation/cluster-fault-detection.md).

If all the remaining earlier version nodes unexpectedly leave the cluster during an upgrade, the cluster will consider itself to be fully-upgraded, automatically activate new functionality, and leave its backward-compatible mode.

Once that has happened, there is no way to return the cluster to a state that is compatible with the earlier version nodes. Nodes running the earlier version will not be able to join this fully-upgraded cluster. Their {{es}} logs will report `failed to join` issues due to `Caused by` errors like

* `node version [x.x.x] may not join a cluster comprising only nodes of version [y.y.y] or greater`
* `node with version [x.x.x] may not join a cluster with minimum version [y.y.y]`
* `node with system index mappings versions [y.y.y] may not join a cluster with minimum system index mappings versions [x.x.x]`
* `handshake with [NODE_ID] failed: remote node version [x.x.x] is incompatible with local node version [y.y.y]`

{{es}} maintains the data in the data paths of the older nodes and will recover the cluster to health using this data after the nodes are fully upgraded. Therefore, to bring these nodes back into the cluster, upgrade them.

:::{note} :applies_to: { ece:, ess: }
Usually you can "Reapply" your latest [Deployment activity](/deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) {{es}} upgrade to finish upgrading. If the node out of cluster causes [Cluster health](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-health) status of `red`, then plans will be blocked for data safety. If this is the case, kindly [contact us](/troubleshoot/index.md#contact-us) with {{ech}} deployment ID or [{{ece}} diagnostic](/troubleshoot/deployments/cloud-enterprise/run-ece-diagnostics-tool.md) flagged `--deployments` for problematic deployment.
Contributor

Avoid "kindly" — the Elastic style guide treats it the same as "please", which should be omitted unless asking users to wait or tolerate inconvenience.

Suggested change
Usually you can "Reapply" your latest [Deployment activity](/deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) {{es}} upgrade to finish upgrading. If the node out of cluster causes [Cluster health](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-health) status of `red`, then plans will be blocked for data safety. If this is the case, kindly [contact us](/troubleshoot/index.md#contact-us) with {{ech}} deployment ID or [{{ece}} diagnostic](/troubleshoot/deployments/cloud-enterprise/run-ece-diagnostics-tool.md) flagged `--deployments` for problematic deployment.
Usually you can "Reapply" your latest [Deployment activity](/deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) {{es}} upgrade to finish upgrading. If the node out of cluster causes [Cluster health](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-health) status of `red`, then plans will be blocked for data safety. If this is the case, [contact us](/troubleshoot/index.md#contact-us) with {{ech}} deployment ID or [{{ece}} diagnostic](/troubleshoot/deployments/cloud-enterprise/run-ece-diagnostics-tool.md) flagged `--deployments` for problematic deployment.

:::

If you stop half or more of the master-eligible nodes all at once during the upgrade, the cluster will [become unavailable](/troubleshoot/elasticsearch/discovery-troubleshooting.md#discovery-no-master) due to insufficient [voting configurations](/deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-voting.md).

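While the cluster still has an elected master (for example, before you begin stopping nodes), you can check which nodes currently hold a vote by inspecting the committed voting configuration in the cluster state; a minimal sketch:

```console
GET _cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
```

The response lists node IDs rather than names; if needed, map them with `GET _cat/nodes?v=true&h=id,name&full_id=true`.
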
You must restart all the stopped master-eligible nodes to allow the cluster to re-form. If the re-formed cluster comprises only upgraded nodes, then the cluster will consider itself to be fully-upgraded, automatically activate new functionality, and leave its backward-compatible mode. In this case, upgrade all other nodes running the old version to enable them to join the re-formed cluster. Upgrade the master-eligible nodes last to make it less likely that this occurs.

In a testing or development environment with only one or two master-eligible nodes, you cannot avoid stopping half or more of the master-eligible nodes, so the cluster will always become unavailable at some point during the upgrade. When you restart the master-eligible nodes after this unavailability, the cluster will re-form with a single upgraded node, which is therefore fully-upgraded and will reject older nodes' attempts to re-join the cluster. Upgrade the master-eligible nodes last to avoid these rejections.


## Symptoms
Contributor

The required Symptoms and Resolution sections (and the optional Diagnosis, Best practices, Resources sections) contain only template placeholder comments. The Resolution section still has the literal stepper code block from the template. These need to be filled in before the page goes live — the page is currently non-functional for users who land on it from the link added in deploy-manage/upgrade/deployment-or-cluster/elasticsearch.md.


<!-- REQUIRED

Describe what users observe when the problem occurs. Focus on the symptoms themselves, not their causes. Use bullet points. If applicable, include:

- Error messages
- Log output
- Missing or unexpected behavior
- Timeouts or performance issues
-->

## Diagnosis

<!-- OPTIONAL

Use a Diagnosis section when you need to help users narrow down the problem from an initial symptom before providing the resolution. This is especially useful when:

- The initial symptom requires diagnostic steps to identify the specific cause
- Multiple resolutions depend on diagnostic findings
- The same symptom can have multiple root causes

Use numbered steps or bullet points to guide users through the diagnostic process.
-->

## Resolution

<!-- REQUIRED

Provide clear, actionable steps to resolve the issue.

- Order steps from most common to least common to resolve the issue
- Numbered instructions that begin with imperative verb phrases
- Keep each step focused on a single action
- Use the stepper component
- Avoid diagnostic branching unless the problem cannot be resolved linearly.

For complex scenarios, consider these patterns:

- Multiple resolutions or "combo" resolutions: When multiple solutions can be applied together or independently, present them as a list of options (users can choose one or combine multiple approaches).

- Resolutions that differ by deployment type: When steps differ significantly by deployment type ({{ecloud}} versus self-managed versus ECK), organize by deployment type using clear headings.

- Separating diagnosis from causes and resolution: When multiple resolutions depend on diagnostic findings, use a separate Diagnosis section before Resolution to help users identify their specific situation first.

For more information about the stepper component, refer to [the syntax guide](https://elastic.github.io/docs-builder/syntax/stepper/).
-->

```markdown
:::::{stepper}

::::{step} [Step title]
[Step description or instruction - begin with an imperative verb]
::::

::::{step} [Step title]
[Step description or instruction - begin with an imperative verb]
::::

::::{step} [Step title]
[Step description or instruction - begin with an imperative verb]
::::

:::::
```

## Best practices

<!-- OPTIONAL BUT RECOMMENDED

Explain how to avoid this issue in the future. Use bullet points. Do not restate general product best practices or guidance that applies broadly beyond this issue.
-->

## Resources

<!-- OPTIONAL

Link to related documentation for deeper context. These links are supplementary — all information required to fix the issue should already be on this page.

Avoid linking to GitHub issues, pull requests, or internal discussions. Resources should be stable, user-facing documentation.
-->

- [Related documentation link]
Contributor

Placeholder links should be filled in with real targets or removed before publishing:

  • [Related documentation link]
  • [Contrib/upstream reference]

- [Contrib/upstream reference]
1 change: 1 addition & 0 deletions troubleshoot/toc.yml
@@ -55,6 +55,7 @@ toc:
- file: elasticsearch/start-ilm.md
- file: elasticsearch/index-lifecycle-management-errors.md
- file: elasticsearch/file-based-recovery.md
- file: elasticsearch/troubleshooting-upgrades.md
- file: elasticsearch/security.md
children:
- file: elasticsearch/security/security-trb-settings.md