"slurmctld restart" stuck after scaling the nodes #57

Closed
mangov99 opened this issue Feb 18, 2021 · 1 comment

Comments

@mangov99

CycleCloud Version - 8.1.0-1275
Slurm - 19.05.8-1

Scenario:

  1. Change the max core count for the HPC array in the CycleCloud UI.
  2. Run the scale command (./cyclecloud_slurm.sh scale). We then see the behavior below:

sinfo doesn't show the newly added node, and slurmctld seems to be stuck in a restart:

{{{
[root@ip-0A060009 slurm]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
hpc* up infinite 2 alloc hpc-pg0-[1-2]
htc up infinite 2 idle~ htc-[1-2]

[root@ip-0A060009 slurm]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: failed (Result: exit-code) since Thu 2021-02-18 20:42:28 UTC; 3s ago
Process: 11278 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 11280 (code=exited, status=1/FAILURE)

Feb 18 20:42:28 ip-0A060009 systemd[1]: Starting Slurm controller daemon...
Feb 18 20:42:28 ip-0A060009 systemd[1]: Started Slurm controller daemon.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Feb 18 20:42:28 ip-0A060009 systemd[1]: Unit slurmctld.service entered failed state.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service failed.
}}}

@davestacionis

I can confirm this, but in my case the process is actually running; it is just listed as 'slurmctld restart'. The system works fine even though the daemon shows as failed.
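
For anyone trying to tell these two cases apart (unit failed and nothing running, versus unit failed but a slurmctld process still alive), here is a rough sketch of a check. The function name `classify_slurmctld` and its output strings are made up for illustration and are not part of CycleCloud or Slurm. The commented commands at the bottom are standard systemd/procps tools and assume the unit is named `slurmctld` as in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the combination of the systemd unit state
# and the list of running slurmctld PIDs.
# Usage: classify_slurmctld <unit_state> <pid_list>
classify_slurmctld() {
    local unit_state="$1" pids="$2"
    if [ "$unit_state" = "failed" ] && [ -n "$pids" ]; then
        # systemd considers the unit failed, yet a daemon process exists
        echo "mismatch: unit failed but slurmctld is running (pids: $pids)"
    elif [ "$unit_state" = "failed" ]; then
        echo "down: unit failed and no slurmctld process found"
    elif [ -n "$pids" ]; then
        echo "ok: unit is $unit_state and slurmctld is running"
    else
        echo "unknown: unit is $unit_state but no slurmctld process found"
    fi
}

# On the scheduler node, the inputs could be gathered like this
# (left commented out so that sourcing this file has no side effects):
#   state=$(systemctl is-active slurmctld)
#   pids=$(pgrep -x slurmctld | tr '\n' ' ')
#   classify_slurmctld "$state" "$pids"
#   journalctl -u slurmctld -n 50 --no-pager   # recent unit log, to see why the main PID exited
```

A "mismatch" result would correspond to what I am seeing, while "down" would match the original report, where sinfo does not pick up the new node.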
