Health API reports disk information Unknown status / No disk usage data symptom #98926

Closed
romain-chanu opened this issue Aug 28, 2023 · 11 comments · Fixed by #105449

@romain-chanu

romain-chanu commented Aug 28, 2023

Elasticsearch Version

8.9.1

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

Elasticsearch Service (ESS) users have observed the error message below when accessing the Health page of a deployment in the Cloud UI:

Error: Cannot read properties of undefined (reading 'replace')
    at https://cloud.elastic.co/app.070493148d8f8558ab6.js:2:3103121
    at U1 (https://cloud.elastic.co/app.070493148d8f8558ab6.js:2:3103141)
    at a (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5829841)
    at $ (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5881966)
    at Tl (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5869244)
    at El (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5869172)
    at _1 (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5869035)
    at yl (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5866022)
    at https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815760
    at t.unstable_runWithPriority (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:6649725)
    at Go (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815537)
    at Xo (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815705)
    at $o (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5815640)
    at Ml (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5866343)
    at Object.notify (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5977078)
    at Object.notifyNestedSubs (https://cloud.elastic.co/vendor.3ccc0c4579678245a4e.js:2:5977491)

Further investigation revealed that the Cloud UI is unable to gracefully handle the following disk information returned by the Health API:

      "disk" : {
        "status" : "Unknown",
        "symptom" : "No disk usage data."
      },

While the above error can be handled better in ESS, it is unclear why the Health API reports such a disk status/symptom. This requires further investigation.

Workaround

Users can disable and then re-enable the health node in Elasticsearch. These are the required API calls:

  • Disable health node:
PUT _cluster/settings
{
  "persistent": {
    "health.node.enabled": false
  }
}
  • Enable health node:
PUT _cluster/settings
{
  "persistent": {
    "health.node.enabled": true
  }
}

Once the two calls have been executed, observe that the Health page is accessible again.
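
To confirm the workaround took effect, the disk indicator can also be queried on its own (a minimal check; the GET _health_report/<indicator> form is assumed to be available on this version):

GET _health_report/disk

Once the newly elected health node has received disk usage data, the indicator should report a status other than "unknown".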

Steps to Reproduce

  • In ESS, create a deployment:
    • 2GB RAM x 2AZ (hot tier)
    • Version 7.17.6
  • Wait for the deployment to be created, then take a snapshot:
    • POST _slm/policy/cloud-snapshot-policy/_execute
  • Create another deployment:
    • Same topology (same region, 2GB RAM x 2AZ (hot tier))
    • Version 8.9.1
    • Restore snapshot from the 7.17.6 deployment (by toggling the Restore snapshot data option when creating the deployment)
  • Wait for the 8.9.1 deployment to be created, then access the Health page and observe the above error message.
  • Run the Health API (GET _health_report) and observe the disk indicator:
{
  "status": "unknown",
  "cluster_name": "xxx",
  "indicators": {
    "master_is_stable": {
      "status": "green",
      "symptom": "The cluster has a stable master node",
      "details": {
        "current_master": {
          "node_id": "4XoNLHsXR3aiO4kqb_x__g",
          "name": "instance-0000000001"
        },
        "recent_masters": [
          {
            "node_id": "YE16I4OrQP60r0A9hGYMhg",
            "name": "instance-0000000000"
          },
          {
            "node_id": "4XoNLHsXR3aiO4kqb_x__g",
            "name": "instance-0000000001"
          }
        ]
      }
    },
    "repository_integrity": {
      "status": "green",
      "symptom": "No corrupted snapshot repositories.",
      "details": {
        "total_repositories": 2
      }
    },
    "shards_availability": {
      "status": "green",
      "symptom": "This cluster has all shards available.",
      "details": {
        "started_primaries": 27,
        "unassigned_primaries": 0,
        "initializing_replicas": 0,
        "started_replicas": 27,
        "initializing_primaries": 0,
        "restarting_replicas": 0,
        "restarting_primaries": 0,
        "unassigned_replicas": 0,
        "creating_primaries": 0
      }
    },
    "disk": {
      "status": "unknown",
      "symptom": "No disk usage data."
    },
    "shards_capacity": {
      "status": "green",
      "symptom": "The cluster has enough room to add new shards.",
      "details": {
        "data": {
          "max_shards_in_cluster": 2000
        },
        "frozen": {
          "max_shards_in_cluster": 0
        }
      }
    },
    "ilm": {
      "status": "green",
      "symptom": "Index Lifecycle Management is running",
      "details": {
        "policies": 29,
        "stagnating_indices": 0,
        "ilm_status": "RUNNING"
      }
    },
    "slm": {
      "status": "green",
      "symptom": "Snapshot Lifecycle Management is running",
      "details": {
        "slm_status": "RUNNING",
        "policies": 1
      }
    }
  }
}

Logs (if relevant)

After setting the logger.org.elasticsearch.health.node log level to DEBUG and restarting the current master (instance 0 in our example), the DEBUG logs below were observed:

28 Aug, 2023 @ 09:33:08.070 [instance-0000000000] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:55.653 [instance-0000000001] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:55.145 [tiebreaker-0000000002] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:54.647 [instance-0000000001] Resetting the health monitoring because the master node changed, current health node is null.
28 Aug, 2023 @ 09:30:54.630 [tiebreaker-0000000002] Resetting the health monitoring because the master node changed, current health node is null.
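
For reference, a sketch (using the standard cluster settings API) of how the DEBUG level mentioned above can be enabled:

PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.health.node": "DEBUG"
  }
}

Setting the value back to null restores the default log level.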
@romain-chanu added the >bug, needs:triage (Requires assignment of a team area label), and :Data Management/Health labels on Aug 28, 2023
@elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label and removed the needs:triage (Requires assignment of a team area label) label on Aug 28, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@andreidan
Contributor

Having no disk information when the health node changes (on a master failover, or when the health node is disabled and re-enabled) is normal: upon election, the new health node needs to receive the disk data from the nodes in the cluster. Until this data is received, there is "no disk data" to display, and that is what the indicator is signaling.
It can take (with enough "bad timing") up to 30 seconds for the health node to receive the disk usage data.

@andreidan
Contributor

andreidan commented Sep 5, 2023

Reopening this, as once the health report gets into the "no disk data" state, it stays there.

@andreidan andreidan reopened this Sep 5, 2023
@andreidan
Contributor

I can reproduce this - thanks for the great bug report @romain-chanu

@andreidan andreidan self-assigned this Sep 5, 2023
@andreidan
Contributor

Still digging here, but the problem seems to be that the health node isn't reassigned once persistent tasks are re-enabled (I believe we disable persistent task allocation and re-enable it as part of the "restore deployment from snapshot" operation).
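For anyone investigating a cluster in this state, a hedged sketch of two checks: whether the health-node task is present in the cluster state, and whether persistent task allocation (the cluster.persistent_tasks.allocation.enable setting) was left disabled:

GET _cluster/state/metadata?filter_path=metadata.persistent_tasks

GET _cluster/settings?include_defaults=true&flat_settings=true

If cluster.persistent_tasks.allocation.enable shows as "none", it can be re-enabled with a cluster settings update:

PUT _cluster/settings
{
  "persistent": {
    "cluster.persistent_tasks.allocation.enable": "all"
  }
}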

@mattc58
Contributor

mattc58 commented Jan 2, 2024

Relates to #92193

@nielsbauman nielsbauman self-assigned this Feb 3, 2024
@nielsbauman
Contributor

The issue seems to be that when we do a snapshot restore including cluster state (which is what ESS does when you create a new deployment based on a snapshot from another deployment), we overwrite the persistent tasks stored in the cluster Metadata (see here). This has probably been implemented this way for a reason, but the issue here is that if you create a snapshot in version <= 8.4.3 and restore it in >= 8.5.0, the health-node persistent task gets removed from the cluster state (because it didn't exist pre-8.5.0). I can think of three options:

  1. Explicitly re-add the health-node persistent task to the cluster state after a restore.
  2. Re-add all persistent tasks that existed before the restore.
  3. Add some kind of flag/method to the persistent task infrastructure that allows specifying whether a task should be re-added after a restore.

I think option 3 is the most flexible solution, as it allows individual task implementations to choose whether they should be kept after a restore or not, but it might be a bit overkill/overengineered. I'm afraid option 2 can result in unwanted side-effects. So I think I'd be leaning towards either option 1 or 3. Curious to hear what others think here.

Also pinging @elastic/es-distributed as this relates to snapshots.
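
For context, a sketch of the kind of restore that triggers this, using hypothetical repository and snapshot names; it is the include_global_state flag that brings the snapshot's cluster state, and therefore its persistent tasks, into the restored cluster:

POST _snapshot/my_repository/my_snapshot/_restore
{
  "include_global_state": true
}

Restoring global state from a <= 8.4.3 snapshot therefore replaces the persistent tasks with a set that contains no health-node entry.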

@andreidan andreidan removed their assignment Feb 4, 2024
@gmarouli
Contributor

gmarouli commented Feb 6, 2024

Thanks for the analysis, @nielsbauman. Do you know why the task wasn't recreated?

Is this something we can reproduce by simply deleting the task?

> Explicitly re-add the health-node persistent task to the cluster state after a restore.

Maybe we just want to always re-add it when it's missing. The task should always be there unless the health node is disabled.

What do you think?

@nielsbauman
Contributor

@gmarouli

> do you know why the task wasn't recreated?

This question made me have another look at the code, specifically, these lines:

if (isElectedMaster || healthNodeTaskExists) {
    clusterService.removeListener(taskStarter);
}
if (isElectedMaster && healthNodeTaskExists == false) {
    persistentTasksService.sendStartRequest(

If I comment out line 163 (the clusterService.removeListener(taskStarter) call above), this whole issue is resolved, as the cluster state listener just re-adds the task. My guess would be that this check was put there simply because we thought we didn't need the listener anymore after registering the task once, but maybe I'm missing something there.

Maybe we just want to always re-add it when it's missing.

Removing lines 162 through 164 (the first if block above) would do just that. I think I'd be in favour of doing that. The downside is that we would then always execute this listener (which only adds the task if it doesn't already exist), but I think the effect is minimal, as the statements in the listener don't seem too heavy. Do you know of better ways to achieve this (i.e. other than using a cluster state listener)?

@gmarouli
Contributor

Thank you for following up on it, @nielsbauman. I agree with you; we removed the listener because we didn't think anyone would be able to remove the task. Let's remove these lines, then, to ensure that the task will be recreated. I also think it's not a heavy check to do.

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Feb 13, 2024
We assumed that once the `HealthNode` persistent task is registered,
we won't need to register it again. However, when, for instance, we
restore from a snapshot (including cluster state) that was created
in version <= 8.4.3, that task doesn't exist yet, which will result
in the task being removed after the restore. By keeping the listener
active, we will re-add the task after such a restore (or any other
potential situation where the task might get deleted).

Fixes elastic#98926
@andreidan
Contributor

@nielsbauman nice work on driving this home! 🚀

++ on the solution of keeping the listener (sorry for the late reply here)

The health executor makes sure the task is not started twice, so there is no harm in keeping the listener:

    void startTask(ClusterChangedEvent event) {
        // Wait until every node in the cluster supports health checks
        if (event.state().clusterRecovered() && featureService.clusterHasFeature(event.state(), HealthFeatures.SUPPORTS_HEALTH)) {
            boolean healthNodeTaskExists = HealthNode.findTask(event.state()) != null;
            boolean isElectedMaster = event.localNodeMaster();
            if (isElectedMaster && healthNodeTaskExists == false) {
              ...
