
R_lite distribution failure causes all future jobs to stall at submitted or reserved #1425

Closed
trws opened this issue Apr 4, 2018 · 13 comments

Comments


trws commented Apr 4, 2018

This error from the updated wreck appears to happen when there are fewer tasks than nodes, among other cases. Once I see it, though, the instance is no longer capable of scheduling or running work, even if the scheduler is reloaded.

sched.err[0]: job 60 bad state transition from runrequest to failed
2018-04-04T00:03:38.209169Z lwj.60.emerg[184]: failed to distribute tasks over R_lite

garlick commented Apr 4, 2018

I think the transition to failed is the expected one when tasks < nodes, after #1403. It sounds like sched's state machine doesn't like it?

I'm curious, does unloading sched fix it, e.g. can you run with wreckrun then?


trws commented Apr 4, 2018

That seems to be part of it. If I use wreckrun immediate, that seems to work, so it should be a sched thing, but unloading and reloading doesn't fix it for some reason.


dongahn commented Apr 4, 2018

It is an issue in sched: its state machine doesn't currently expect this state transition. I think the job isn't removed from the pending queue, and the scheduler state probably ends up in some sort of limbo.

I don't understand why unloading/reloading doesn't work though.

If someone can tell me from which states a job can get into the failed state, I can work on a patch, maybe tomorrow morning.

In the next round, we probably want to document all of the state changes a job can go through somewhere to firm up the contract.


garlick commented Apr 4, 2018

Is the problem with reloading sched that sched doesn't pick up the existing jobs? Or do new jobs not run?


trws commented Apr 4, 2018

New jobs don't run. It may be that sched tries to schedule the first job that's in the reserved state and doesn't know what to do with it?


dongahn commented Apr 4, 2018

> Is the problem with reloading sched that sched doesn't pick up the existing jobs?

Currently, a reloaded sched won't pick up the existing jobs. (That should be part of future resilience work.)

> New jobs don't run. It may be that it tries to schedule the first job that's in reserved state and doesn't know what to do with it?

Hmm... which sched version is yours based on? A submitted job shouldn't start from the reserved state; it should start from submitted...


trws commented Apr 4, 2018

It does, but if there are jobs waiting when the oddness happens, they somehow end up in the reserved state rather than allocated, and then the newly loaded sched doesn't know what to do with them.


garlick commented Apr 4, 2018

I don't see FAILED in the big action() switch in sched/sched.c at all.

I think STARTING is the only state that will transition to FAILED?


dongahn commented Apr 4, 2018

OK, thanks. I will try to reproduce it tomorrow morning as well, then.


dongahn commented Apr 4, 2018

> I don't see FAILED in the big action() switch in sched/sched.c at all.

Yeah... and it needs to be covered. By the way, the switch statement is keyed on the old state of a job.

> I think STARTING is the only state that will transition to FAILED?

Is it starting or runrequest?


garlick commented Apr 4, 2018

Sorry, @trws's report does show runrequest transitioning to failed.

It also looks like a job could transition to failed from whatever state it is in when sched sends the wrexec.run event (presumably including runrequest; that state is not used anywhere in wreck).

So it sounds like we should allow either runrequest or starting to transition to failed.


dongahn commented Apr 4, 2018

@trws: I added support for the failed event and force-pushed it to flux-framework/flux-sched#305.

The commit is flux-framework/flux-sched@69df09f

My testing shows it resolves the failed event emitted under the condition you described, but I couldn't reproduce the load/unload oddity. My guess is that it was a bug cascading from the lack of failed-event support. Please give it a try.


dongahn commented Apr 10, 2018

This has been fixed in flux-framework/flux-sched#305.

@dongahn dongahn closed this as completed Apr 10, 2018