I was testing reconfiguration without a restart, by reloading the resource and Fluxion modules, and ran into one issue.
Here's a test that somewhat randomly redistributes resources among queues while jobs are running. The queues are reconfigured and no running jobs are lost, but in the last step, all the resources are moved to a single queue (one) and a full-system job is submitted. Once the resources are available, the 100-node job should start, but instead it is stuck indefinitely in the SCHED state. If I submit any other job to that queue, the pending job is scheduled and things move along.
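To make the expected behavior concrete, here is a minimal Python sketch of the condition described above: the job is expected to start once its queue has at least as many free nodes as it requests. The function names `free_nodes_by_queue` and `job_should_start` are hypothetical helpers, not part of Flux. The parser assumes the six-column `STATE QUEUE NNODES NCORES NGPUS NODELIST` layout shown in the output in this report, and it ignores cores, GPUs, and any other constraints.

```python
from collections import defaultdict


def free_nodes_by_queue(resource_list_text):
    """Count free nodes per queue from resource-list style text.

    Assumes the six-column layout used in this report:
    STATE QUEUE NNODES NCORES NGPUS NODELIST
    Rows with fewer columns (e.g. "allocated 0 0 0") are skipped.
    """
    free = defaultdict(int)
    for line in resource_list_text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "free" and parts[2].isdigit():
            # A node set may belong to several queues ("one,four").
            for queue in parts[1].split(","):
                free[queue] += int(parts[2])
    return dict(free)


def job_should_start(resource_list_text, queue, nnodes):
    """True if the queue has at least `nnodes` free nodes (node count only)."""
    return free_nodes_by_queue(resource_list_text).get(queue, 0) >= nnodes


final_state = """\
STATE QUEUE NNODES NCORES NGPUS NODELIST
free one 100 9600 400 test[1-100]
allocated 0 0 0
down 0 0 0
"""

# By this check the 100-node job in queue "one" is runnable,
# yet in the report it remains in the SCHED state.
print(job_should_start(final_state, "one", 100))
```

This only restates the expectation; it does not explain why the scheduler fails to re-evaluate the pending job after the last reconfiguration.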
Output:
Scheduling is stopped
flux-module: remove sched-fluxion-qmanager: No such file or directory
flux-module: remove sched-fluxion-resource: No such file or directory
May 14 16:16:38.141986 sched-simple.err[0]: exiting due to resource update failure: the resource module was unloaded
three: Scheduling is started
two: Scheduling is started
four: Scheduling is started
one: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free one 25 2400 100 test[1-25]
free two 25 2400 100 test[26-50]
free three 25 2400 100 test[51-75]
free four 25 2400 100 test[76-100]
allocated 0 0 0
down 0 0 0
f2HvLhyh
f2HvLhyi
f2HwphG3
f2HyJgYP
f2HyJgYQ
f2Hznfpj
f2Hznfpk
f2J2Gf75
f2J2Gf76
f2J3kePR
f2J3kePS
f2J5Edfm
f2J5Edfn
f2J6icx7
f2J6icx8
f2J8CcET
f2J9gbWo
f2J9gbWp
f2JBAao9
f2JMYVmZ
f2JMYVma
f2JMYVmb
f2JMYVmc
f2JMYVmd
f2JMYVme
f2JMYVmf
f2JP2V3u
f2JP2V3v
f2JP2V3w
f2JP2V3x
f2JP2V3y
f2JP2V3z
f2Ki7qhq
f2Ki7qhr
f2Ki7qhs
f2Ki7qht
f2KjbpzB
f2KjbpzC
f2Km5pGX
f2Km5pGY
JOBID QUEUE USER NAME ST NTASKS NNODES TIME INFO
f2Km5pGY four grondo sleep R 2 2 0.272s test[81-82]
f2KjbpzC three grondo sleep R 2 2 0.273s test[56-57]
f2KjbpzB three grondo sleep R 2 2 0.273s test[58-59]
f2Ki7qht three grondo sleep R 2 2 0.274s test[60-61]
f2Ki7qhs three grondo sleep R 2 2 0.274s test[62-63]
f2Ki7qhr three grondo sleep R 2 2 0.274s test[64-65]
f2Km5pGX four grondo sleep R 2 2 0.279s test[83-84]
f2Ki7qhq three grondo sleep R 2 2 0.280s test[66-67]
f2JP2V3z three grondo sleep R 2 2 0.293s test[68-69]
f2JP2V3y three grondo sleep R 2 2 0.297s test[70-71]
f2JP2V3w three grondo sleep R 2 2 0.297s test[72-73]
f2JP2V3x four grondo sleep R 2 2 0.297s test[85-86]
f2JP2V3v four grondo sleep R 2 2 0.301s test[87-88]
f2JMYVmf four grondo sleep R 2 2 0.301s test[89-90]
f2JMYVmd four grondo sleep R 2 2 0.302s test[91-92]
f2JMYVmb four grondo sleep R 2 2 0.302s test[93-94]
f2JMYVmZ four grondo sleep R 2 2 0.302s test[95-96]
f2J9gbWp four grondo sleep R 2 2 0.304s test[97-98]
f2JMYVme two grondo sleep R 2 2 0.305s test[31-32]
f2JMYVmc two grondo sleep R 2 2 0.305s test[33-34]
f2JMYVma two grondo sleep R 2 2 0.306s test[35-36]
f2JBAao9 two grondo sleep R 2 2 0.306s test[37-38]
f2J9gbWo two grondo sleep R 2 2 0.306s test[39-40]
f2J6icx8 two grondo sleep R 2 2 0.307s test[41-42]
f2J6icx7 two grondo sleep R 2 2 0.311s test[43-44]
f2J5Edfn two grondo sleep R 2 2 0.316s test[45-46]
f2J5Edfm two grondo sleep R 2 2 0.316s test[47-48]
f2J3kePR one grondo sleep R 2 2 0.316s test[6-7]
f2J2Gf76 one grondo sleep R 2 2 0.326s test[8-9]
f2J2Gf75 one grondo sleep R 2 2 0.326s test[10-11]
f2Hznfpk one grondo sleep R 2 2 0.326s test[12-13]
f2Hznfpj one grondo sleep R 2 2 0.327s test[14-15]
f2HyJgYQ one grondo sleep R 2 2 0.327s test[16-17]
f2HyJgYP one grondo sleep R 2 2 0.327s test[18-19]
f2HwphG3 one grondo sleep R 2 2 0.327s test[20-21]
f2HvLhyi one grondo sleep R 2 2 0.328s test[22-23]
f2JP2V3u three grondo sleep R 2 2 0.328s test[74-75]
f2J8CcET four grondo sleep R 2 2 0.328s test[99-100]
f2J3kePS two grondo sleep R 2 2 0.328s test[49-50]
f2HvLhyh one grondo sleep R 2 2 0.329s test[24-25]
three: Scheduling is stopped
two: Scheduling is stopped
four: Scheduling is stopped
one: Scheduling is stopped
three: Scheduling is started
two: Scheduling is started
four: Scheduling is started
one: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free one 5 480 20 test[1-5]
free two 5 480 20 test[26-30]
free three 5 480 20 test[51-55]
free four 5 480 20 test[76-80]
allocated one 5 480 20 test[6-10]
allocated one,four 15 1440 60 test[11-25]
allocated two 20 1920 80 test[31-50]
allocated three 20 1920 80 test[56-75]
allocated four 20 1920 80 test[81-100]
down 0 0 0
JOBID QUEUE USER NAME ST NTASKS NNODES TIME INFO
f2Km5pGY four grondo sleep R 2 2 2.080s test[81-82]
f2KjbpzC three grondo sleep R 2 2 2.081s test[56-57]
f2KjbpzB three grondo sleep R 2 2 2.081s test[58-59]
f2Ki7qht three grondo sleep R 2 2 2.081s test[60-61]
f2Ki7qhs three grondo sleep R 2 2 2.082s test[62-63]
f2Ki7qhr three grondo sleep R 2 2 2.082s test[64-65]
f2Km5pGX four grondo sleep R 2 2 2.087s test[83-84]
f2Ki7qhq three grondo sleep R 2 2 2.087s test[66-67]
f2JP2V3z three grondo sleep R 2 2 2.101s test[68-69]
f2JP2V3y three grondo sleep R 2 2 2.104s test[70-71]
f2JP2V3w three grondo sleep R 2 2 2.104s test[72-73]
f2JP2V3x four grondo sleep R 2 2 2.105s test[85-86]
f2JP2V3v four grondo sleep R 2 2 2.109s test[87-88]
f2JMYVmf four grondo sleep R 2 2 2.109s test[89-90]
f2JMYVmd four grondo sleep R 2 2 2.109s test[91-92]
f2JMYVmb four grondo sleep R 2 2 2.109s test[93-94]
f2JMYVmZ four grondo sleep R 2 2 2.110s test[95-96]
f2J9gbWp four grondo sleep R 2 2 2.112s test[97-98]
f2JMYVme two grondo sleep R 2 2 2.113s test[31-32]
f2JMYVmc two grondo sleep R 2 2 2.113s test[33-34]
f2JMYVma two grondo sleep R 2 2 2.113s test[35-36]
f2JBAao9 two grondo sleep R 2 2 2.114s test[37-38]
f2J9gbWo two grondo sleep R 2 2 2.114s test[39-40]
f2J6icx8 two grondo sleep R 2 2 2.114s test[41-42]
f2J6icx7 two grondo sleep R 2 2 2.118s test[43-44]
f2J5Edfn two grondo sleep R 2 2 2.123s test[45-46]
f2J5Edfm two grondo sleep R 2 2 2.123s test[47-48]
f2J3kePR one grondo sleep R 2 2 2.124s test[6-7]
f2J2Gf76 one grondo sleep R 2 2 2.134s test[8-9]
f2J2Gf75 one grondo sleep R 2 2 2.134s test[10-11]
f2Hznfpk one grondo sleep R 2 2 2.134s test[12-13]
f2Hznfpj one grondo sleep R 2 2 2.134s test[14-15]
f2HyJgYQ one grondo sleep R 2 2 2.135s test[16-17]
f2HyJgYP one grondo sleep R 2 2 2.135s test[18-19]
f2HwphG3 one grondo sleep R 2 2 2.135s test[20-21]
f2HvLhyi one grondo sleep R 2 2 2.135s test[22-23]
f2JP2V3u three grondo sleep R 2 2 2.135s test[74-75]
f2J8CcET four grondo sleep R 2 2 2.136s test[99-100]
f2J3kePS two grondo sleep R 2 2 2.136s test[49-50]
f2HvLhyh one grondo sleep R 2 2 2.136s test[24-25]
three: Scheduling is stopped
two: Scheduling is stopped
four: Scheduling is stopped
one: Scheduling is stopped
three: Scheduling is started
two: Scheduling is started
four: Scheduling is started
one: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free one 5 480 20 test[1-5]
free two 5 480 20 test[26-30]
free three 5 480 20 test[51-55]
free four 5 480 20 test[76-80]
allocated one 5 480 20 test[6-10]
allocated one,three 1 96 4 test11
allocated one,four 14 1344 56 test[12-25]
allocated two 20 1920 80 test[31-50]
allocated three 20 1920 80 test[56-75]
allocated four 20 1920 80 test[81-100]
down 0 0 0
three: Scheduling is stopped
two: Scheduling is stopped
four: Scheduling is stopped
one: Scheduling is stopped
three: Scheduling is started
two: Scheduling is started
four: Scheduling is started
one: Scheduling is started
STATE QUEUE NNODES NCORES NGPUS NODELIST
free one 20 1920 80 test[1-5,26-30,51-55,76-80]
allocated one 20 1920 80 test[6-25]
allocated one,two 20 1920 80 test[31-50]
allocated one,three 20 1920 80 test[56-75]
allocated one,four 20 1920 80 test[81-100]
down 0 0 0
JOBID QUEUE USER NAME ST NTASKS NNODES TIME INFO
f4xFwwXm one grondo sleep S 100 100 -
f2Km5pGY four grondo sleep R 2 2 6.195s test[81-82]
f2KjbpzC three grondo sleep R 2 2 6.196s test[56-57]
f2KjbpzB three grondo sleep R 2 2 6.196s test[58-59]
f2Ki7qht three grondo sleep R 2 2 6.196s test[60-61]
f2Ki7qhs three grondo sleep R 2 2 6.196s test[62-63]
f2Ki7qhr three grondo sleep R 2 2 6.197s test[64-65]
f2Km5pGX four grondo sleep R 2 2 6.202s test[83-84]
f2Ki7qhq three grondo sleep R 2 2 6.202s test[66-67]
f2JP2V3z three grondo sleep R 2 2 6.216s test[68-69]
f2JP2V3y three grondo sleep R 2 2 6.219s test[70-71]
f2JP2V3w three grondo sleep R 2 2 6.219s test[72-73]
f2JP2V3x four grondo sleep R 2 2 6.219s test[85-86]
f2JP2V3v four grondo sleep R 2 2 6.224s test[87-88]
f2JMYVmf four grondo sleep R 2 2 6.224s test[89-90]
f2JMYVmd four grondo sleep R 2 2 6.224s test[91-92]
f2JMYVmb four grondo sleep R 2 2 6.224s test[93-94]
f2JMYVmZ four grondo sleep R 2 2 6.225s test[95-96]
f2J9gbWp four grondo sleep R 2 2 6.227s test[97-98]
f2JMYVme two grondo sleep R 2 2 6.227s test[31-32]
f2JMYVmc two grondo sleep R 2 2 6.228s test[33-34]
f2JMYVma two grondo sleep R 2 2 6.228s test[35-36]
f2JBAao9 two grondo sleep R 2 2 6.228s test[37-38]
f2J9gbWo two grondo sleep R 2 2 6.229s test[39-40]
f2J6icx8 two grondo sleep R 2 2 6.229s test[41-42]
f2J6icx7 two grondo sleep R 2 2 6.233s test[43-44]
f2J5Edfn two grondo sleep R 2 2 6.238s test[45-46]
f2J5Edfm two grondo sleep R 2 2 6.238s test[47-48]
f2J3kePR one grondo sleep R 2 2 6.239s test[6-7]
f2J2Gf76 one grondo sleep R 2 2 6.248s test[8-9]
f2J2Gf75 one grondo sleep R 2 2 6.249s test[10-11]
f2Hznfpk one grondo sleep R 2 2 6.249s test[12-13]
f2Hznfpj one grondo sleep R 2 2 6.249s test[14-15]
f2HyJgYQ one grondo sleep R 2 2 6.249s test[16-17]
f2HyJgYP one grondo sleep R 2 2 6.250s test[18-19]
f2HwphG3 one grondo sleep R 2 2 6.250s test[20-21]
f2HvLhyi one grondo sleep R 2 2 6.250s test[22-23]
f2JP2V3u three grondo sleep R 2 2 6.250s test[74-75]
f2J8CcET four grondo sleep R 2 2 6.251s test[99-100]
f2J3kePS two grondo sleep R 2 2 6.251s test[49-50]
f2HvLhyh one grondo sleep R 2 2 6.251s test[24-25]
STATE QUEUE NNODES NCORES NGPUS NODELIST
free one 100 9600 400 test[1-100]
allocated 0 0 0
down 0 0 0
JOBID QUEUE USER NAME ST NTASKS NNODES TIME INFO
f4xFwwXm one grondo sleep S 100 100 -
1715728605.926426 submit userid=6885 urgency=16 flags=0 version=1
1715728605.938349 validate
1715728605.949083 depend
1715728605.949115 priority priority=16
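The eventlog above ends at the `priority` event with no `alloc`, which is what a job waiting in the SCHED state looks like. A small Python sketch for spotting this in captured eventlog text follows. The helpers `parse_eventlog` and `waiting_in_sched` are hypothetical names for illustration, and the parser assumes the plain `TIMESTAMP NAME key=value ...` line format shown above.

```python
def parse_eventlog(text):
    """Parse lines of the form 'TIMESTAMP NAME key=value ...'.

    Returns a list of (timestamp, name, context) tuples. Lines that do
    not start with a numeric timestamp are ignored.
    """
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            timestamp = float(parts[0])
        except ValueError:
            continue
        context = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
        events.append((timestamp, parts[1], context))
    return events


def waiting_in_sched(text):
    """True if the job has a priority event but no alloc, exception, or clean."""
    names = [name for _, name, _ in parse_eventlog(text)]
    if "priority" not in names:
        return False
    return not any(n in names for n in ("alloc", "exception", "clean"))


stuck_log = """\
1715728605.926426 submit userid=6885 urgency=16 flags=0 version=1
1715728605.938349 validate
1715728605.949083 depend
1715728605.949115 priority priority=16
"""

print(waiting_in_sched(stuck_log))
```

A job that is briefly pending would produce the same result, so this check is only meaningful when, as here, the resources it needs are already free.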
I can then attach to this instance with flux proxy pid:BROKER_PID and submit one job to queue one, which causes the stuck job to be scheduled.