Job is not ending - Bypassing signal SIGTERM #1677
Comments
The way SLURM stops a job is cluster specific. On our cluster, submitit relies on SIGUSR1 to signal an upcoming preemption, and SLURM then sends SIGKILL to kill the job. We trigger checkpointing on SIGUSR1 and do nothing when receiving SIGTERM other than logging it. Do you know when your cluster sends SIGTERM? And why isn't it sending SIGKILL? In any case, you can look at how submitit sets up the default signal handlers on startup: submitit/submitit/core/job_environment.py, lines 130 to 141 in ddf0626
In your job you can also register your own signal handler (a sketch follows below). The main reason we are doing it this way is historical. Some clusters (and the previous setting on our cluster) send SIGTERM then SIGUSR1 to signal a preemption, and only SIGTERM for a timeout. So we couldn't stop on the first SIGTERM and had to skip it. We now have different logic to detect timeouts, so maybe we should reconsider how we handle SIGTERM.
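Not the exact snippet referenced above, but a minimal sketch of registering a custom SIGTERM handler inside the submitted function; `shard_file` and its cleanup logic are hypothetical placeholders, not part of submitit:

```python
import signal
import sys

def shard_file(input_path: str) -> None:  # hypothetical job function
    def handle_sigterm(signum, frame):
        # Instead of submitit's default "bypass" behaviour, log and exit
        # cleanly so that SLURM can reclaim the allocation.
        print(f"Received signal {signum}, shutting down", flush=True)
        sys.exit(0)

    # Installed inside the job, this overrides the handler submitit set up
    # in job_environment.py at startup.
    signal.signal(signal.SIGTERM, handle_sigterm)

    # ... actual sharding work would go here ...
```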
I think the main bug is that the job is not finishing even though the script finished.
This script takes in a large text file and shards it into small text files. For some text files it works and for some it does not. Can you please provide me with an example of how to register such a custom signal handler (hopefully in a way that won't require many code changes)?
There is a snippet in my previous message that shows how to register a signal handler.
Hi, I'm facing a similar problem on a slurm cluster where I canceled a job with |
I think that there was a multiprocessing issue on my part. Now it seems to be solved. Closing (feel free to reopen).
@yuvalkirstain Can you share some info on what your multiprocessing problem/fix was?
As a result, the resources are not released, so other queued jobs never start, i.e. the entire system deadlocks.

UPDATE: I think it has to do with spawned subprocesses not being killed properly. I'm using a 3rd-party library that starts many processes via multiprocessing.

@gwenzek Given that multiprocessing uses submitit/submitit/core/job_environment.py, line 140 in ddf0626
I had the same issue. I was using the hydra-submitit-plugin to submit jobs to the Slurm cluster; the plugin uses submitit internally. My code had multiple places where workers/subprocesses were created using the default start method, which on Linux is fork. When a subprocess is created with fork, it inherits the signal handlers of its parent, so all these subprocesses inherited the SIGTERM handler set by submitit, which basically printed a statement and ignored the SIGTERM. The workers therefore would not respect the graceful terminate (SIGTERM) call from the main process. To solve this issue, you can do one of two things: stop the workers from inheriting submitit's handler in the first place, or undo it inside the workers.
Together, these two methods worked for me.
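As an illustration (not necessarily this commenter's exact changes), those two fixes usually amount to forcing the spawn start method so workers do not inherit the parent's handlers, and resetting the SIGTERM disposition to its default inside each worker; the pool and work function below are hypothetical:

```python
import multiprocessing as mp
import signal

def _worker_init():
    # Undo any SIGTERM handler inherited from the submitit parent process
    # so the worker terminates normally when the pool is shut down.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)

def work(x):
    return x * x

if __name__ == "__main__":
    # "spawn" starts workers in a fresh interpreter, so submitit's handlers
    # are not inherited in the first place.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4, initializer=_worker_init) as pool:
        print(pool.map(work, range(8)))
```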
Hello,
I am sending a job to my Slurm cluster with submitit. The job runs as it is supposed to (you can see the `Finished script` log), but the Slurm job itself does not finish. Instead, I get these `Bypassing signal` messages. Because I need this job to finish before moving on to other jobs, I am in a deadlock. I am really not sure what I should do and would appreciate the help. Here are logs from my never-ending job :(
To provide more information, this happens when I run job arrays:
Some of them finish successfully, and others get stuck (until I had to clear the queue and `scancel` them):
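For reference, a job-array submission of this kind typically uses submitit's `map_array`; the folder, parameters, file list, and `shard_file` function in this sketch are placeholders, not the actual script:

```python
import submitit

def shard_file(path: str) -> int:
    # placeholder for the real sharding logic
    return 0

executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(timeout_min=60, slurm_partition="dev")

# map_array submits one SLURM array task per input file
jobs = executor.map_array(shard_file, ["a.txt", "b.txt", "c.txt"])
results = [job.result() for job in jobs]  # blocks until every task ends
```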