
Unable to force delete cluster node that is stuck in NEEDS UPGRADE status while others are in UPGRADING status. #410

jeremybusk opened this issue Sep 4, 2024 · 0 comments

Issue Report

Summary

One node in my MicroCeph cluster is inaccessible, unbootable, or corrupt. When I try to force-remove it from the cluster using the microceph cluster remove --force command, I am unable to do so because every node except that one is stuck in UPGRADING status. Manual intervention steps such as microceph cluster sql or ceph status are also unavailable, because the API refuses requests while the bad node remains in NEEDS UPGRADE status.
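
For reference, these are the manual-intervention commands I would normally reach for; both are blocked while the daemon is waiting on the upgrade (the SQL query is only an illustrative example):

    sudo microceph cluster sql "SELECT name FROM sqlite_master WHERE type='table'"   # illustrative query only
    sudo ceph status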

I was hoping the --force option on the microceph cluster remove command would remove the NEEDS UPGRADE node and give me access to the cluster again. The data is not a big concern, and the OSDs are redundant, so I would expect to be able to destroy the node in order to regain access to the cluster and run ceph and disk commands. Perhaps an additional option could be added to bypass the safety checks in cases like this.

MicroCeph Version

root@microceph6:~# snap list | grep microceph
microceph  18.2.4+snapc9f2b08f92  1139   reef/stable    canonical**  -

Steps to Reproduce

  1. Attempt to upgrade all nodes in a MicroCeph cluster.

  2. Ensure that one node is not upgraded and becomes inaccessible.

  3. Try to force-remove the unresponsive node using the following command:

    sudo microceph cluster remove microceph7 --force
  4. Observe the error (a consolidated sketch of these reproduction steps follows this list):

    Error: Database is waiting for an upgrade: 1 cluster member has not yet received the update
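
A consolidated sketch of the same sequence on one of the surviving nodes (the snap refresh channel comes from the version listing above; the exact upgrade path is an assumption on my part):

    # Upgrade the reachable nodes while one node stays offline (upgrade path assumed):
    sudo snap refresh microceph --channel=reef/stable
    # The offline node now shows NEEDS UPGRADE and the rest show UPGRADING:
    sudo microceph cluster list
    # Force removal fails with the database-upgrade error shown above:
    sudo microceph cluster remove microceph7 --force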
    

Cluster Status

The last two nodes in the cluster list are as follows (total: 3 voters, 3 stand-by):

sudo microceph cluster list
+------------+-------------------+----------+------------------------------------------------------------------+---------------+
|    NAME    |      ADDRESS      |   ROLE   |                           FINGERPRINT                            |    STATUS     |
+------------+-------------------+----------+------------------------------------------------------------------+---------------+

...

| microceph6 | 10.7.8.57:7443    | stand-by | 62769b6ex21b2c04ed06200845f69d089a4fdd540f4dd043effba7d0cac84a6f | UPGRADING     |
+------------+-------------------+----------+------------------------------------------------------------------+---------------+
| microceph7 | 10.7.8.66:7443    | stand-by | ad6576a5fa6ff2ay740e668f8a2657de233320bba8e17e767a4fe92e0e5a4750 | NEEDS UPGRADE |
+------------+-------------------+----------+------------------------------------------------------------------+---------------+

Observed Behavior

The command:

sudo microceph cluster remove microceph7 --force

...does not remove the problematic node. The cluster is stuck in the UPGRADING status, and I'm unable to perform necessary maintenance tasks.

Expected Behavior

I expected the following command:

sudo microceph cluster remove microceph7 --force

...to actually force-remove the node stuck in NEEDS UPGRADE status. This would allow the cluster to complete the upgrade process and return to a healthy state, enabling access to the normal Ceph command API for further clean-up. If this is not possible with the current command, there should be another command or option to achieve it.
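
A hypothetical sketch of the kind of bypass I have in mind; no such flag exists today as far as I know, this is only to illustrate the request:

sudo microceph cluster remove microceph7 --force --skip-upgrade-check   # --skip-upgrade-check is hypothetical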

Relevant Logs & Error Output

The local API at https://10.7.8.57:7443 is waiting for the other node to upgrade, but I can't remove the problematic node without access to the API.

journalctl -f
Sep 04 09:26:38 microceph6 systemd[1]: Started Service for snap application microceph.osd.
Sep 04 09:26:56 microceph6 microceph.daemon[591]: time="2024-09-04T09:26:56Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://10.7.8.57:7443"

If the logs are considerably long, I will paste them to Gist and insert the link here.
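
If fuller logs would help, this is how I would collect them (the systemd unit name is assumed to follow the usual snap.<snap>.<app> convention):

sudo journalctl -u snap.microceph.daemon -n 200 --no-pager   # unit name assumed
sudo snap logs microceph -n 200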

Additional Comments

Great project! A fix for this issue would be greatly appreciated.

