3.11.1 slurmctld core dumps with error message: double free or corruption (!prev) #6529

@gwolski

I am attempting to move to ParallelCluster 3.11.1. I have been using 3.9.1 (with its known bug since May) without this issue.
The 3.9.1 setup is a custom Rocky 8.9 image with the ParallelCluster 3.9.1 overlay.

The new setup is a custom Rocky 8.10 AMI onto which I have overlaid ParallelCluster 3.11.1 with pcluster build.

It deployed just fine and ran for a week with a small number of jobs submitted.
I then started loading it more heavily, with hundreds of jobs per day. Roughly every 24 hours, slurmctld dumps core; the final messages in /var/log/messages at the time of the dump are:

Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: double free or corruption (!prev)
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 2977/UID 0).
Oct 30 11:04:30 ip-10-6-11-248 systemd-coredump[2978]: Process 1374 (slurmctld) of user 401 dumped core.#12#012Stack trace of thread 2531:#12#0

I rebooted the machine, and about 24 hours later it crashed again:

Oct 31 10:48:04 ip-10-6-11-248 slurmctld[1383]: corrupted double-linked list
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 728511/UID 0).
Oct 31 10:48:04 ip-10-6-11-248 systemd-coredump[728512]: Process 1383 (slurmctld) of user 401 dumped core.#12#012Stack trace of thread 728510:#12#0

The error messages logged just before the first crash are of the form:
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-21:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-21"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-25:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-25"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
[duplicate messages not included, just different node names]

Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: slurmctld: agent/is_node_resp: node:sp-m7a-l-dy-sp-8-gb-2-cores-4 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication conne
[duplicate messages not included, just different node names]

Error messages in slurmctld.log from around the time of the second crash above:

[2024-10-31T10:47:04.008] error: unable to split forward hostlist
[2024-10-31T10:47:04.008] error: _thread_per_group_rpc: no ret_list given
[2024-10-31T10:47:05.134] error: slurm_receive_msg [10.6.2.57:50156]: Zero Bytes were transmitted or received
[2024-10-31T10:47:18.718] error: slurm_receive_msg [10.6.9.248:41134]: Zero Bytes were transmitted or received
[2024-10-31T10:47:20.862] error: slurm_receive_msg [10.6.14.229:53816]: Zero Bytes were transmitted or received
[2024-10-31T10:47:22.137] error: slurm_receive_msg [10.6.2.57:50996]: Zero Bytes were transmitted or received
[2024-10-31T10:48:04.000] cleanup_completing: JobId=4094 completion process took 134 seconds
[2024-10-31T10:48:04.000] error: Nodes sp-r7a-m-dy-sp-8-gb-1-cores-37 not responding, setting DOWN
[2024-10-31T10:48:04.003] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-10:6818) failed: Name or service not known
[2024-10-31T10:48:04.003] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-10"
[2024-10-31T10:48:04.005] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-11:6818) failed: Name or service not known
[2024-10-31T10:48:04.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-11"
[2024-10-31T10:48:04.007] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-33:6818) failed: Name or service not known
[2024-10-31T10:48:04.007] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-33"
[2024-10-31T10:48:04.009] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-36:6818) failed: Name or service not known
[2024-10-31T10:48:04.009] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-36"
[2024-10-31T10:48:04.010] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores
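
In case it helps anyone comparing logs, here is a minimal sketch for tallying which node names show up in the `slurm_set_addr: Unable to resolve` errors above. The function name `unresolved_hosts` and the sample list are illustrative only, not part of Slurm or ParallelCluster; the regular expression just assumes the message format seen in these excerpts.

```python
import re
from collections import Counter

# Matches the slurm_set_addr error text as it appears in the excerpts above.
UNRESOLVED = re.compile(r'slurm_set_addr: Unable to resolve "([^"]+)"')


def unresolved_hosts(lines):
    """Count how often each node name appears in an 'Unable to resolve' error."""
    counts = Counter()
    for line in lines:
        match = UNRESOLVED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the slurmctld.log excerpt above.
sample = [
    '[2024-10-31T10:48:04.003] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-10"',
    '[2024-10-31T10:48:04.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-11"',
    '[2024-10-31T10:47:04.008] error: unable to split forward hostlist',
]

for host, count in unresolved_hosts(sample).most_common():
    print(count, host)
```

The same function accepts an open file object, so it could be pointed at a full slurmctld.log to see whether the unresolvable names cluster on particular dynamic node groups before each crash.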

Has anyone seen this? I'm going back to 3.10.1 and will attempt to deploy that version.
