
R_lite distribution failure causes all future jobs to stall at submitted or reserved #1425

Closed
trws opened this issue Apr 4, 2018 · 13 comments

Comments


trws commented Apr 4, 2018

This error from the updated wreck appears to happen when there are fewer tasks than nodes, among other cases. Once I see it, though, the instance is no longer capable of scheduling or running work, even if the scheduler is reloaded.

sched.err[0]: job 60 bad state transition from runrequest to failed
2018-04-04T00:03:38.209169Z lwj.60.emerg[184]: failed to distribute tasks over R_lite

garlick commented Apr 4, 2018

I think the transition to failed is the expected one when tasks < nodes, after #1403. It sounds like sched's state machine doesn't like it?

I'm curious, does unloading sched fix it, e.g. can you run with wreckrun then?


trws commented Apr 4, 2018

That seems to be part of it. If I use wreckrun immediate, that seems to work, so it should be a sched thing, but unloading and reloading doesn't fix it for some reason.


dongahn commented Apr 4, 2018

It is an issue in sched: its state machine doesn't currently expect this state transition. I think the job isn't removed from the pending queue, and the scheduler state probably ends up in some sort of limbo.

I don't understand why unloading/reloading doesn't work though.

If someone can tell me from which states a job can get into the failed state, I can work on a patch, maybe tomorrow morning.

In the next round, we probably want to document all of the state changes a job can go through somewhere to firm up the contract.


garlick commented Apr 4, 2018

Is the problem with reloading sched that sched doesn't pick up the existing jobs? Or do new jobs not run?


trws commented Apr 4, 2018

New jobs don't run. It may be that sched tries to schedule the first job that's in the reserved state and doesn't know what to do with it?


dongahn commented Apr 4, 2018

> Is the problem with reloading sched that sched doesn't pick up the existing jobs?

Currently, a reloaded sched won't pick up the existing jobs. (That should be part of future resilience work.)

> New jobs don't run. It may be that it tries to schedule the first job that's in reserved state and doesn't know what to do with it?

Hmm... which sched version is yours based on? A submitted job shouldn't start from the reserved state; it should start from submitted...


trws commented Apr 4, 2018

It does, but if there are jobs waiting when the oddness happens, they somehow end up in the reserved state rather than allocated, and then the newly loaded sched doesn't know what to do with them.


garlick commented Apr 4, 2018

I don't see FAILED in the big action() switch in sched/sched.c at all.

I think STARTING is the only state that will transition to FAILED?


dongahn commented Apr 4, 2018

OK, thanks. I will try to reproduce it tomorrow morning as well, then.


dongahn commented Apr 4, 2018

> I don't see FAILED in the big action() switch in sched/sched.c at all.

Yeah... and it needs to be covered. By the way, the switch statement is keyed on the old state of a job.

> I think STARTING is the only state that will transition to FAILED?

Is it starting or runrequest?


garlick commented Apr 4, 2018

Sorry, @trws's report does show runrequest transitioning to failed.

It also looks like a job could transition to failed from whatever state it is in when sched sends the wrexec.run event (presumably including runrequest; that state is not used anywhere in wreck).

So it sounds like we should allow either runrequest or starting to transition to failed.


dongahn commented Apr 4, 2018

@trws: I added support for the failed event and force-pushed it to flux-framework/flux-sched#305.

The commit is flux-framework/flux-sched@69df09f

My testing shows it resolves the failed event emitted under the condition you described, but I couldn't reproduce the load/unload oddity. My guess is that it was a bug cascading from the lack of failed-event support. Please give it a try.


dongahn commented Apr 10, 2018

This has been fixed in flux-framework/flux-sched#305.

@dongahn dongahn closed this as completed Apr 10, 2018