Health API reports disk information `Unknown` status / `No disk usage data` symptom (#98926)
Pinging @elastic/es-data-management (Team:Data Management)
Having no disk information when the health node changes (on a master failover, or when the health node is disabled and re-enabled) is normal: upon election, the new health node needs to receive the disk data from the nodes in the cluster. Until this data is received, there is "no disk data" to display, and that is what the indicator is signaling.
Reopening this: once the health report gets into the "no disk data" case, it stays there.
I can reproduce this - thanks for the great bug report @romain-chanu
Relates to #92193
The issue seems to be that when we do a snapshot restore that includes cluster state (which is what ESS does when you create a new deployment based on a snapshot from another deployment), we overwrite the persistent tasks stored in the cluster state.
I think option 3 is the most flexible solution, as it allows individual task implementations to choose whether they should be kept after a restore, but it might be a bit overkill/overengineered. I'm afraid option 2 could result in unwanted side effects. So I'd be leaning towards either option 1 or 3. Curious to hear what others think here. Also pinging @elastic/es-distributed, as this relates to snapshots.
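For context, the restore path described above is a global-state restore, which replaces the cluster-state customs (including persistent tasks) with the snapshot's. A sketch of such a restore request (the repository and snapshot names here are hypothetical):

```
POST _snapshot/my_repository/my_snapshot/_restore
{
  "include_global_state": true
}
```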
Thanks for the analysis @nielsbauman. Do you know why the task wasn't recreated? Is this something we can reproduce by simply deleting the task?
Maybe we just want to always re-add it when it's missing. The task should always be there unless the health node is disabled. What do you think?
This question made me have another look at the code, specifically lines 162 to 166 in 4cf8942.
If I comment out line 163, this whole issue is resolved, as the cluster state listener just re-adds the task. My guess would be that this check/line was put there simply because we thought we didn't need the listener anymore after registering the task once, but maybe I'm missing something there.
Removing lines 162 through 164 would do just that, and I think I'd be in favour of doing so. The downside is that we would then always execute this listener (which only adds the task if it doesn't already exist), but I think the impact is minimal, as the statements in the listener don't seem to be too heavy. Do you know if there are better ways to achieve this (i.e. other than using a cluster state listener)?
Thank you for following up on it, @nielsbauman. I agree with you: we removed the listener because we didn't think anyone would be able to remove the task. Let's remove those lines, then, to ensure that the task will be recreated. I also think it's not a heavy check to do.
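The shape of the agreed fix - keeping an idempotent cluster-state listener installed so the task is re-added whenever it goes missing - can be sketched as a simplified, self-contained analogue. This is not the actual Elasticsearch code; all class and method names here are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical, simplified stand-in for the cluster state: just the set of
// registered persistent tasks.
class ClusterState {
    final Map<String, String> persistentTasks = new ConcurrentHashMap<>();
}

// Hypothetical stand-in for the component that registers the health-node task.
class TaskRegistrar {
    static final String HEALTH_TASK = "health-node";
    private final List<Consumer<ClusterState>> listeners = new CopyOnWriteArrayList<>();

    TaskRegistrar() {
        // The buggy variant removed this listener after the first successful
        // registration. The fix keeps it installed, so the task is re-added
        // after e.g. a snapshot restore wipes the persistent tasks.
        listeners.add(state ->
            // Idempotent: registering an already-present task is a no-op,
            // so there is no harm in running this on every state update.
            state.persistentTasks.putIfAbsent(HEALTH_TASK, "params"));
    }

    // Invoked on every cluster-state change.
    void clusterChanged(ClusterState state) {
        listeners.forEach(l -> l.accept(state));
    }
}
```

Because the registration is a no-op when the task already exists, keeping the listener permanently installed is safe, which matches the observation below that the executor ensures the task is not started twice.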
We assumed that once the `HealthNode` persistent task is registered, we won't need to register it again. However, when, for instance, we restore from a snapshot (including cluster state) that was created in version <= 8.4.3, the task doesn't exist in that snapshot yet, which results in the task being removed after the restore. By keeping the listener active, we re-add the task after such a restore (or in any other situation where the task might get deleted). Fixes elastic#98926
@nielsbauman nice work on driving this home! 🚀 ++ on the solution of keeping the listener (sorry for the late reply here). The health executor makes sure the task is not started twice, so there's no harm in keeping the listener.
Elasticsearch Version: 8.9.1
Installed Plugins: No response
Java Version: bundled
OS Version: N.A.
Problem Description

Elasticsearch Service (ESS) users have observed the below error message when accessing the Health API of a deployment. Further investigation revealed that the Cloud UI is unable to gracefully handle the `disk` information returned by the Health API. While the above error can be better handled in ESS, it is unclear why the Health API reports such a disk status / symptom. This requires further investigation.
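For reference, the disk indicator can be queried directly; a sketch, assuming the Health API endpoint available in this version (`GET _health_report`, with an optional indicator name):

```
GET _health_report/disk
```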
Workaround

Users can disable and then re-enable the health node in Elasticsearch. These are the required APIs:

Once the two APIs are executed, observe that the Health page is accessible again.
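A sketch of the two calls, assuming the health node is toggled via the `health.node.enabled` persistent cluster setting:

```
PUT _cluster/settings
{
  "persistent": {
    "health.node.enabled": false
  }
}

PUT _cluster/settings
{
  "persistent": {
    "health.node.enabled": true
  }
}
```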
Steps to Reproduce

1. Execute `POST _slm/policy/cloud-snapshot-policy/_execute` to take a snapshot of the deployment.
2. Create a new deployment from that snapshot (select "Restore snapshot data" when creating the deployment).
3. Access the Health page and observe the above error message.

Logs (if relevant)
After setting the `logger.org.elasticsearch.health.node` logger to `DEBUG` and restarting the current master (instance 0 in our example), the below DEBUG logs were observed.
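The logger can be raised via the cluster update settings API; for example:

```
PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.health.node": "DEBUG"
  }
}
```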