Description
I am attempting to move to parallelcluster 3.11.1. I have been using 3.9.1 (with its known bug since May) without this issue.
The 3.9.1 cluster is a custom Rocky 8.9 image with a parallelcluster 3.9.1 overlay.
The new setup is a custom Rocky 8.10 AMI onto which I have overlaid parallelcluster 3.11.1 with pcluster build.
It deployed just fine and ran for a week with a small number of jobs submitted.
I then started pushing it harder, with hundreds of jobs/day. About every 24 hours, slurmctld dumps core. The final messages in /var/log/messages at the time of the core dump are:
Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: double free or corruption (!prev)
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 2977/UID 0).
Oct 30 11:04:30 ip-10-6-11-248 systemd-coredump[2978]: Process 1374 (slurmctld) of user 401 dumped core.#12#012Stack trace of thread 2531:#12#0
I rebooted the machine, and 24 hours later:
Oct 31 10:48:04 ip-10-6-11-248 slurmctld[1383]: corrupted double-linked list
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 728511/UID 0).
Oct 31 10:48:04 ip-10-6-11-248 systemd-coredump[728512]: Process 1383 (slurmctld) of user 401 dumped core.#12#012Stack trace of thread 728510:#12#0
The error messages logged right before the crash are of the form:
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-21:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-21"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-25:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-25"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
[duplicate messages not included, just different node names]
Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: slurmctld: agent/is_node_resp: node:sp-m7a-l-dy-sp-8-gb-2-cores-4 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication conne
[duplicate messages not included, just different node names]
Error messages in slurmctld.log from the same time as the second crash above:
[2024-10-31T10:47:04.008] error: unable to split forward hostlist
[2024-10-31T10:47:04.008] error: _thread_per_group_rpc: no ret_list given
[2024-10-31T10:47:05.134] error: slurm_receive_msg [10.6.2.57:50156]: Zero Bytes were transmitted or received
[2024-10-31T10:47:18.718] error: slurm_receive_msg [10.6.9.248:41134]: Zero Bytes were transmitted or received
[2024-10-31T10:47:20.862] error: slurm_receive_msg [10.6.14.229:53816]: Zero Bytes were transmitted or received
[2024-10-31T10:47:22.137] error: slurm_receive_msg [10.6.2.57:50996]: Zero Bytes were transmitted or received
[2024-10-31T10:48:04.000] cleanup_completing: JobId=4094 completion process took 134 seconds
[2024-10-31T10:48:04.000] error: Nodes sp-r7a-m-dy-sp-8-gb-1-cores-37 not responding, setting DOWN
[2024-10-31T10:48:04.003] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-10:6818) failed: Name or service not known
[2024-10-31T10:48:04.003] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-10"
[2024-10-31T10:48:04.005] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-11:6818) failed: Name or service not known
[2024-10-31T10:48:04.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-11"
[2024-10-31T10:48:04.007] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-33:6818) failed: Name or service not known
[2024-10-31T10:48:04.007] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-33"
[2024-10-31T10:48:04.009] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-36:6818) failed: Name or service not known
[2024-10-31T10:48:04.009] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-36"
[2024-10-31T10:48:04.010] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores
Has anyone seen this? I'm going back to 3.10.1 and will attempt to deploy that version.
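In case it helps anyone trying to reproduce this, here is a minimal sketch for tallying which node names show up in the "Unable to resolve" errors and for checking whether a name resolves from the head node. The function names `count_unresolved` and `resolves` are my own, not part of Slurm or parallelcluster, and port 6818 is taken only from the getaddrinfo lines in the logs above.

```python
import re
import socket
from collections import Counter

# Matches the slurm_set_addr error format seen in the logs above.
UNRESOLVED = re.compile(r'slurm_set_addr: Unable to resolve "([^"]+)"')


def count_unresolved(lines):
    """Count how often each node name appears in 'Unable to resolve' errors."""
    counts = Counter()
    for line in lines:
        match = UNRESOLVED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def resolves(host, port=6818):
    """Return True if getaddrinfo succeeds for host:port, False on lookup failure."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


# Two lines copied from the slurmctld.log excerpt above, used as sample input.
sample = [
    '[2024-10-31T10:48:04.003] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-10"',
    '[2024-10-31T10:48:04.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-11"',
]
for node, hits in count_unresolved(sample).most_common():
    print(node, hits)
```

To run it against a real log, pass an open file handle for slurmctld.log to `count_unresolved` instead of `sample`, then call `resolves(node)` on each name to see whether the lookup still fails. This only summarizes the name resolution errors; it does not say anything about why slurmctld itself crashes.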