One node in my MicroCeph cluster is inaccessible, unbootable, or corrupt. When I try to force-remove it from the cluster using the microceph cluster remove --force command, I am unable to do so because every node except the bad one is stuck in UPGRADING status. Manual intervention steps such as microceph cluster sql or ceph status are also unavailable, because the API stays down while the bad node sits in NEEDS UPGRADE status.
I was hoping the --force option on the microceph cluster remove command would remove the NEEDS UPGRADE node and give me access to the cluster again. The data is not a big concern, and the OSDs are redundant, so I would expect to be able to destroy the node to regain access to the cluster and run ceph and disk commands. Perhaps an additional option could be added to bypass the safety checks in cases like this.
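To make the request concrete, here is a minimal, purely illustrative shell sketch of the decision such a bypass option could make. The helper name can_force_remove and the handled status strings are assumptions for illustration only; nothing here is part of the real microceph CLI:

```shell
#!/bin/sh
# Hypothetical helper (NOT part of microceph): decide whether a forced
# removal should be allowed to bypass the upgrade safety check.
# A node stuck in "NEEDS UPGRADE" (or simply unreachable) is exactly the
# case where the operator wants the removal to proceed anyway.
can_force_remove() {
  case "$1" in
    "NEEDS UPGRADE"|UNREACHABLE) echo yes ;;
    *)                           echo no  ;;
  esac
}

can_force_remove "NEEDS UPGRADE"   # prints: yes
can_force_remove "UPGRADING"       # prints: no
```

The point of the sketch is only that the safety check could special-case the one status that, by definition, means the node cannot participate in the upgrade anyway.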
Observed Behavior
The command sudo microceph cluster remove microceph7 --force does not remove the problematic node. The cluster is stuck in the UPGRADING status, and I'm unable to perform necessary maintenance tasks.
Expected Behavior
I expected the following command:
sudo microceph cluster remove microceph7 --force
...to actually force-remove the node stuck in NEEDS UPGRADE mode. This would allow the cluster to complete the upgrade process and return to a healthy state, enabling access to the normal Ceph command API for further clean-up. If this is not possible with the current command, there should be another command to achieve this.
Relevant Logs & Error Output
The local API at https://10.7.8.57:7443 is waiting for the other node to upgrade, but I can't remove the problematic node without access to the API.
journalctl -f
Sep 04 09:26:38 microceph6 systemd[1]: Started Service for snap application microceph.osd.
Sep 04 09:26:56 microceph6 microceph.daemon[591]: time="2024-09-04T09:26:56Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://10.7.8.57:7443"
If the logs are considerably long, I will paste them to Gist and insert the link here.
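When sifting through a longer journal capture, the address the daemon is waiting on can be pulled out of the warning line. A small sketch, assuming the line format shown above and standard grep/cut availability:

```shell
#!/bin/sh
# Extract the address the daemon is waiting on from a captured journalctl
# line. The line content is copied from the warning shown above.
line='Sep 04 09:26:56 microceph6 microceph.daemon[591]: time="2024-09-04T09:26:56Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://10.7.8.57:7443"'

# Grab the quoted value of the address= field.
addr=$(printf '%s\n' "$line" | grep -o 'address="[^"]*"' | cut -d'"' -f2)
echo "$addr"   # prints: https://10.7.8.57:7443
```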
Additional Comments
Great project! A fix for this issue would be greatly appreciated.
Steps to Reproduce
1. Attempt to upgrade all nodes in a MicroCeph cluster.
2. Ensure that one node is not upgraded and becomes inaccessible.
3. Try to force-remove the unresponsive node: sudo microceph cluster remove microceph7 --force
4. Observe the error.
Cluster Status
The last two nodes in the cluster list are as follows (total: 3 voters, 3 stand-by):