Health API reports disk information Unknown status / No disk usage data symptom #98926

Closed
romain-chanu opened this issue Aug 28, 2023 · 11 comments · Fixed by #105449

@romain-chanu

romain-chanu commented Aug 28, 2023

Elasticsearch Version

8.9.1

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

Elasticsearch Service (ESS) users have observed the error message below when accessing the Health page of a deployment in the Cloud UI:

Error: Cannot read properties of undefined (reading 'replace')
    at https://cloud.elastic.co/app.070493148d8f8558ab6.js:2:3103121
    at U1 (https://cloud.elastic.co/app.070493148d8f8558ab6.js:2:3103141)
    at a (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5829841)
    at $ (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5881966)
    at Tl (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5869244)
    at El (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5869172)
    at _1 (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5869035)
    at yl (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5866022)
    at https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815760
    at t.unstable_runWithPriority (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:6649725)
    at Go (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815537)
    at Xo (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815705)
    at $o (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815640)
    at Ml (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5866343)
    at Object.notify (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5977078)
    at Object.notifyNestedSubs (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5977491)

Further investigation revealed that the Cloud UI is unable to gracefully handle the following disk information returned by the Health API:

      "disk" : {
        "status" : "Unknown",
        "symptom" : "No disk usage data."
      },

While the above error can be handled better in ESS, it is unclear why the Health API reports such a disk status/symptom. This requires further investigation.

Workaround

Users can disable and then re-enable the health node in Elasticsearch. These are the required API calls:

  • Disable health node:
PUT _cluster/settings
{
  "persistent": {
    "health.node.enabled": false
  }
}
  • Enable health node:
PUT _cluster/settings
{
  "persistent": {
    "health.node.enabled": true
  }
}

Once the two calls have been executed, observe that the Health page is accessible again.
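
To confirm the workaround took effect, the disk indicator can also be queried on its own (a minimal check; the GET _health_report/<indicator> form is assumed to be available on this version):

GET _health_report/disk

Once the newly elected health node has received disk usage data, the indicator should report a status other than "unknown".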

Steps to Reproduce

  • In ESS, create a deployment:
    • 2GB RAM x 2AZ (hot tier)
    • Version 7.17.6
  • Wait for the deployment to be created, then take a snapshot:
    • POST _slm/policy/cloud-snapshot-policy/_execute
  • Create another deployment:
    • Same topology (same region, 2GB RAM x 2AZ (hot tier))
    • Version 8.9.1
    • Restore snapshot from the 7.17.6 deployment (by toggling the Restore snapshot data option when creating the deployment)
  • Wait for the 8.9.1 deployment to be created, then access the Health page and observe the above error message.
  • Run the Health API (GET _health_report) and observe the disk indicator:
{
  "status": "unknown",
  "cluster_name": "xxx",
  "indicators": {
    "master_is_stable": {
      "status": "green",
      "symptom": "The cluster has a stable master node",
      "details": {
        "current_master": {
          "node_id": "4XoNLHsXR3aiO4kqb_x__g",
          "name": "instance-0000000001"
        },
        "recent_masters": [
          {
            "node_id": "YE16I4OrQP60r0A9hGYMhg",
            "name": "instance-0000000000"
          },
          {
            "node_id": "4XoNLHsXR3aiO4kqb_x__g",
            "name": "instance-0000000001"
          }
        ]
      }
    },
    "repository_integrity": {
      "status": "green",
      "symptom": "No corrupted snapshot repositories.",
      "details": {
        "total_repositories": 2
      }
    },
    "shards_availability": {
      "status": "green",
      "symptom": "This cluster has all shards available.",
      "details": {
        "started_primaries": 27,
        "unassigned_primaries": 0,
        "initializing_replicas": 0,
        "started_replicas": 27,
        "initializing_primaries": 0,
        "restarting_replicas": 0,
        "restarting_primaries": 0,
        "unassigned_replicas": 0,
        "creating_primaries": 0
      }
    },
    "disk": {
      "status": "unknown",
      "symptom": "No disk usage data."
    },
    "shards_capacity": {
      "status": "green",
      "symptom": "The cluster has enough room to add new shards.",
      "details": {
        "data": {
          "max_shards_in_cluster": 2000
        },
        "frozen": {
          "max_shards_in_cluster": 0
        }
      }
    },
    "ilm": {
      "status": "green",
      "symptom": "Index Lifecycle Management is running",
      "details": {
        "policies": 29,
        "stagnating_indices": 0,
        "ilm_status": "RUNNING"
      }
    },
    "slm": {
      "status": "green",
      "symptom": "Snapshot Lifecycle Management is running",
      "details": {
        "slm_status": "RUNNING",
        "policies": 1
      }
    }
  }
}

Logs (if relevant)

After setting the logger.org.elasticsearch.health.node log level to DEBUG and restarting the current master (instance 0 in our example), the DEBUG logs below were observed:

28 Aug, 2023 @ 09:33:08.070 [instance-0000000000] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:55.653 [instance-0000000001] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:55.145 [tiebreaker-0000000002] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:54.647 [instance-0000000001] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:54.630 [tiebreaker-0000000002] Resetting the health monitoring because the master node changed, current health node is null.
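
For reference, a sketch (using the standard cluster settings API) of how the DEBUG level mentioned above can be enabled:

PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.health.node": "DEBUG"
  }
}

Setting the value back to null restores the default log level.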
@romain-chanu added the >bug, needs:triage (Requires assignment of a team area label), and :Data Management/Health labels on Aug 28, 2023
@elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label and removed the needs:triage (Requires assignment of a team area label) label on Aug 28, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@andreidan
Contributor

Having no disk information when the health node changes (on a master failover, or when the health node is disabled and re-enabled) is normal: upon election, the new health node needs to receive the disk data from the nodes in the cluster. Until this data is received, there is "no disk data" to display, and that is what the indicator is signaling.
It can take (with enough "bad timing") up to 30 seconds for the health node to receive the disk usage data.

@andreidan
Contributor

andreidan commented Sep 5, 2023

Reopening this, as once the health report gets into the "no disk data" state, it stays there.

@andreidan andreidan reopened this Sep 5, 2023
@andreidan
Contributor

I can reproduce this - thanks for the great bug report @romain-chanu

@andreidan andreidan self-assigned this Sep 5, 2023
@andreidan
Contributor

Still digging here, but the problem seems to be that the health node isn't reassigned once persistent tasks are re-enabled (I believe we disable persistent task allocation and re-enable it as part of the "restore deployment from snapshot" operation).
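For anyone investigating a cluster in this state, a hedged sketch of two checks: whether the health-node task is present in the cluster state, and whether persistent task allocation (the cluster.persistent_tasks.allocation.enable setting) was left disabled:

GET _cluster/state/metadata?filter_path=metadata.persistent_tasks

GET _cluster/settings?include_defaults=true&flat_settings=true

If cluster.persistent_tasks.allocation.enable shows as "none", it can be re-enabled with a cluster settings update:

PUT _cluster/settings
{
  "persistent": {
    "cluster.persistent_tasks.allocation.enable": "all"
  }
}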

@mattc58
Contributor

mattc58 commented Jan 2, 2024

Relates to #92193

@nielsbauman nielsbauman self-assigned this Feb 3, 2024
@nielsbauman
Contributor

The issue seems to be that when we do a snapshot restore including cluster state (which is what ESS does when you create a new deployment based on a snapshot from another deployment), we overwrite the persistent tasks stored in the cluster Metadata (see here). This has probably been implemented this way for a reason, but the issue here is that if you create a snapshot in version <= 8.4.3 and restore it in >= 8.5.0, the health-node persistent task gets removed from the cluster state (because it didn't exist pre-8.5.0). I can think of three options:

  1. Explicitly re-add the health-node persistent task to the cluster state after a restore.
  2. Re-add all persistent tasks that existed before the restore.
  3. Add some kind of flag/method to the persistent task infrastructure that allows specifying whether a task should be re-added after a restore.

I think option 3 is the most flexible solution, as it allows individual task implementations to choose whether they should be kept after a restore or not, but it might be a bit overkill/overengineered. I'm afraid option 2 can result in unwanted side-effects. So I think I'd be leaning towards either option 1 or 3. Curious to hear what others think here.

Also pinging @elastic/es-distributed as this relates to snapshots.
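
For context, a sketch of the kind of restore that triggers this, using hypothetical repository and snapshot names; it is the include_global_state flag that brings the snapshot's cluster state, and therefore its persistent tasks, into the restored cluster:

POST _snapshot/my_repository/my_snapshot/_restore
{
  "include_global_state": true
}

Restoring global state from a <= 8.4.3 snapshot therefore replaces the persistent tasks with a set that contains no health-node entry.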

@andreidan andreidan removed their assignment Feb 4, 2024
@gmarouli
Contributor

gmarouli commented Feb 6, 2024

Thanks for the analysis, @nielsbauman. Do you know why the task wasn't recreated?

Is this something we can reproduce by simply deleting the task?

> Explicitly re-add the health-node persistent task to the cluster state after a restore.

Maybe we just want to always re-add it when it's missing. The task should always be there unless the health node is disabled.

What do you think?

@nielsbauman
Contributor

@gmarouli

> do you know why the task wasn't recreated?

This question made me have another look at the code, specifically, these lines:

if (isElectedMaster || healthNodeTaskExists) {
    clusterService.removeListener(taskStarter);
}
if (isElectedMaster && healthNodeTaskExists == false) {
    persistentTasksService.sendStartRequest(

If I comment out line 163 (the clusterService.removeListener(taskStarter) call above), this whole issue is resolved, as the cluster state listener just re-adds the task. My guess would be that this check was put there simply because we thought we didn't need the listener anymore after registering the task once, but maybe I'm missing something there.

Maybe we just want to always re-add it when it's missing.

Removing lines 162 through 164 (the first if block above) would do just that. I think I'd be in favour of doing that. The downside is that we would then always execute this listener (which only adds the task if it doesn't already exist), but I think the effect is minimal, as the statements in the listener don't seem too heavy. Do you know of better ways to achieve this (i.e. other than using a cluster state listener)?

@gmarouli
Contributor

Thank you for following up on it, @nielsbauman. I agree with you; we removed the listener because we didn't think anyone would be able to remove the task. Let's remove these lines, then, to ensure that the task will be recreated. I also think it's not a heavy check to do.

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Feb 13, 2024
We assumed that once the `HealthNode` persistent task is registered,
we won't need to register it again. However, when, for instance, we
restore from a snapshot (including cluster state) that was created
in version <= 8.4.3, that task doesn't exist yet, which will result
in the task being removed after the restore. By keeping the listener
active, we will re-add the task after such a restore (or any other
potential situation where the task might get deleted).

Fixes elastic#98926
@andreidan
Contributor

@nielsbauman nice work on driving this home! 🚀

++ on the solution of keeping the listener (sorry for the late reply here)

The health executor makes sure the task is not started twice, so there is no harm in keeping the listener:

    void startTask(ClusterChangedEvent event) {
        // Wait until every node in the cluster supports health checks
        if (event.state().clusterRecovered() && featureService.clusterHasFeature(event.state(), HealthFeatures.SUPPORTS_HEALTH)) {
            boolean healthNodeTaskExists = HealthNode.findTask(event.state()) != null;
            boolean isElectedMaster = event.localNodeMaster();
            if (isElectedMaster && healthNodeTaskExists == false) {
              ...
