
Unable to force delete cluster node that is stuck in NEEDS UPGRADE status while others are in UPGRADING status. #410

jeremybusk opened this issue Sep 4, 2024 · 0 comments

Issue Report

Summary

One node in my MicroCeph cluster is inaccessible, unbootable, or corrupt. When I try to force-remove it from the cluster using the microceph cluster remove --force command, I am unable to do so because every node except that one is stuck in UPGRADING status. Manual intervention steps such as microceph cluster sql or ceph status are also unavailable, because the API refuses requests while the bad node remains in NEEDS UPGRADE status.
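
For reference, these are the manual-intervention commands I would normally reach for; both are blocked while the daemon is waiting on the upgrade (the SQL query is only an illustrative example):

    sudo microceph cluster sql "SELECT name FROM sqlite_master WHERE type='table'"   # illustrative query only
    sudo ceph status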

I was hoping the --force option on the microceph cluster remove command would remove the NEEDS UPGRADE node and give me access to the cluster again. The data is not a big concern, and the OSDs are redundant, so I would expect to be able to destroy the node in order to regain access to the cluster and run ceph and disk commands. Perhaps an additional option could be added to bypass the safety checks in cases like this.

MicroCeph Version

root@microceph6:~# snap list | grep microceph
microceph  18.2.4+snapc9f2b08f92  1139   reef/stable    canonical**  -

Steps to Reproduce

  1. Attempt to upgrade all nodes in a MicroCeph cluster.

  2. Ensure that one node is not upgraded and becomes inaccessible.

  3. Try to force-remove the unresponsive node using the following command:

    sudo microceph cluster remove microceph7 --force
  4. Observe the error (a consolidated sketch of these reproduction steps follows this list):

    Error: Database is waiting for an upgrade: 1 cluster member has not yet received the update
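
A consolidated sketch of the same sequence on one of the surviving nodes (the snap refresh channel comes from the version listing above; the exact upgrade path is an assumption on my part):

    # Upgrade the reachable nodes while one node stays offline (upgrade path assumed):
    sudo snap refresh microceph --channel=reef/stable
    # The offline node now shows NEEDS UPGRADE and the rest show UPGRADING:
    sudo microceph cluster list
    # Force removal fails with the database-upgrade error shown above:
    sudo microceph cluster remove microceph7 --force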
    

Cluster Status

The last two nodes in the cluster list are as follows (total: 3 voters, 3 stand-by):

sudo microceph cluster list
+------------+-------------------+----------+------------------------------------------------------------------+---------------+
|    NAME    |      ADDRESS      |   ROLE   |                           FINGERPRINT                            |    STATUS     |
+------------+-------------------+----------+------------------------------------------------------------------+---------------+

...

| microceph6 | 10.7.8.57:7443    | stand-by | 62769b6ex21b2c04ed06200845f69d089a4fdd540f4dd043effba7d0cac84a6f | UPGRADING     |
+------------+-------------------+----------+------------------------------------------------------------------+---------------+
| microceph7 | 10.7.8.66:7443    | stand-by | ad6576a5fa6ff2ay740e668f8a2657de233320bba8e17e767a4fe92e0e5a4750 | NEEDS UPGRADE |
+------------+-------------------+----------+------------------------------------------------------------------+---------------+

Observed Behavior

The command:

sudo microceph cluster remove microceph7 --force

...does not remove the problematic node. The cluster is stuck in the UPGRADING status, and I'm unable to perform necessary maintenance tasks.

Expected Behavior

I expected the following command:

sudo microceph cluster remove microceph7 --force

...to actually force-remove the node stuck in NEEDS UPGRADE status. This would allow the cluster to complete the upgrade process and return to a healthy state, enabling access to the normal Ceph command API for further clean-up. If this is not possible with the current command, there should be another command or option to achieve it.
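
A hypothetical sketch of the kind of bypass I have in mind; no such flag exists today as far as I know, this is only to illustrate the request:

sudo microceph cluster remove microceph7 --force --skip-upgrade-check   # --skip-upgrade-check is hypothetical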

Relevant Logs & Error Output

The local API at https://10.7.8.57:7443 is waiting for the other node to upgrade, but I can't remove the problematic node without access to the API.

journalctl -f
Sep 04 09:26:38 microceph6 systemd[1]: Started Service for snap application microceph.osd.
Sep 04 09:26:56 microceph6 microceph.daemon[591]: time="2024-09-04T09:26:56Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://10.7.8.57:7443"

If the logs are considerably long, I will paste them to Gist and insert the link here.
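
If fuller logs would help, this is how I would collect them (the systemd unit name is assumed to follow the usual snap.<snap>.<app> convention):

sudo journalctl -u snap.microceph.daemon -n 200 --no-pager   # unit name assumed
sudo snap logs microceph -n 200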

Additional Comments

Great project! A fix for this issue would be greatly appreciated.

