Job is not ending - Bypassing signal SIGTERM #1677

Closed
yuvalkirstain opened this issue Mar 31, 2022 · 8 comments

yuvalkirstain commented Mar 31, 2022

Hello,

I am sending a job to my Slurm cluster with submitit. The job runs as it is supposed to (you can see the "Finished script" log), but the Slurm job itself does not finish. Instead, I get these "Bypassing signal" messages. Because I need this job to finish before moving on to other jobs, I am in a deadlock. I am really not sure what I should do and would appreciate the help.

Here are logs from my neverending job :(

03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,093) - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM

To provide more information, this happens when I run job arrays:

jobs = executor.map_array(fn, configs)
job2cfg = {job: cfg for job, cfg in zip(jobs, configs)}
list(tqdm(submitit.helpers.as_completed(jobs), total=len(jobs)))

Some of them finish successfully, and others get stuck (until I had to clear the queue and scancel them):

❯ tail /home/olab/kirstain/masking/log_test/71604_*/71604_*_0_log.err
==> /home/olab/kirstain/masking/log_test/71604_0/71604_0_0_log.err <==
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Flattening
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 292793 blocks to write
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Batching
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 3 shards to write
100%|██████████| 3/3 [00:00<00:00, 3093.14it/s]
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Writing 3 shards to /home/olab/kirstain/masking/data/5/1024/shards/ArXiv
100%|██████████| 3/3 [00:08<00:00,  2.78s/it]
03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished script
03/31/2022 20:53:18 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_10/71604_10_0_log.err <==
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,179) - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:56:55 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_11/71604_11_0_log.err <==
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,093) - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:00:50 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_12/71604_12_0_log.err <==
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_13/71604_13_0_log.err <==
100%|██████████| 65/65 [00:00<00:00, 10539.67it/s]
03/31/2022 21:06:43 - INFO - masking.scripts.shard_corpus - Writing 65 shards to /home/olab/kirstain/masking/data/5/1024/shards/Wikipedia_en
100%|██████████| 65/65 [01:37<00:00,  1.50s/it]
03/31/2022 21:08:22 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 21:08:23 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 21:08:23,557) - Bypassing signal SIGTERM
03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:08:23,561) - Bypassing signal SIGTERM
03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:08:31 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_1/71604_1_0_log.err <==
03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 20:59:04,562) - Bypassing signal SIGTERM
03/31/2022 20:59:04 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,084) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71606.0 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
submitit WARNING (2022-03-31 21:25:40,085) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
slurmstepd: error: *** JOB 71606 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_2/71604_2_0_log.err <==
03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Batching
03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - We have 11 shards to write
100%|██████████| 11/11 [00:00<00:00, 3551.76it/s]
03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Writing 11 shards to /home/olab/kirstain/masking/data/5/1024/shards/Books3
100%|██████████| 11/11 [00:33<00:00,  3.09s/it]
03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 21:20:33,823) - Bypassing signal SIGTERM
03/31/2022 21:20:33 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:20:34 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_3/71604_3_0_log.err <==
submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_4/71604_4_0_log.err <==
03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,096) - Bypassing signal SIGCONT
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 71609.0 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
submitit WARNING (2022-03-31 21:25:40,106) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
slurmstepd: error: *** JOB 71609 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_5/71604_5_0_log.err <==
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,092) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71610.0 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
submitit WARNING (2022-03-31 21:25:40,093) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
slurmstepd: error: *** JOB 71610 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_6/71604_6_0_log.err <==
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,089) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71611.0 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
submitit WARNING (2022-03-31 21:25:40,091) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
slurmstepd: error: *** JOB 71611 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_7/71604_7_0_log.err <==
100%|██████████| 13/13 [00:00<00:00, 5749.26it/s]
03/31/2022 20:53:43 - INFO - masking.scripts.shard_corpus - Writing 13 shards to /home/olab/kirstain/masking/data/5/1024/shards/OpenWebText2
100%|██████████| 13/13 [00:27<00:00,  2.08s/it]
03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 20:54:11,320) - Bypassing signal SIGTERM
03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:54:11,321) - Bypassing signal SIGTERM
03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:54:14 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_8/71604_8_0_log.err <==
submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,086) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71613.0 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
submitit WARNING (2022-03-31 21:25:40,088) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
slurmstepd: error: *** JOB 71613 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_9/71604_9_0_log.err <==
100%|██████████| 21/21 [00:00<00:00, 4838.52it/s]
03/31/2022 20:56:39 - INFO - masking.scripts.shard_corpus - Writing 21 shards to /home/olab/kirstain/masking/data/5/1024/shards/Pile-CC
100%|██████████| 21/21 [00:42<00:00,  2.01s/it]
03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 20:57:21,672) - Bypassing signal SIGTERM
03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:57:21,674) - Bypassing signal SIGTERM
03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:57:24 - INFO - submitit - Job completed successfully
yuvalkirstain changed the title from "Job is not ending - strange" to "Job is not ending - Bypassing signal SIGTERM" on Mar 31, 2022
gwenzek (Contributor) commented Apr 1, 2022

The way Slurm stops a job is cluster specific. On our cluster, submitit relies on SIGUSR1 to signal an upcoming preemption, and SIGKILL is then sent to kill the job. We trigger checkpointing on SIGUSR1 and do nothing when receiving SIGTERM; we just log it.

Do you know when your cluster is sending SIGTERM? And why isn't it sending SIGKILL?

In any case, you can look at how submitit sets up the default signal handlers on startup:

def _handle_signals(self, paths: JobPaths, submission: DelayedSubmission) -> None:
    """Set up signals handler for the current executable.
    The default implementation checkpoint the given submission and requeues it.
    @plugin-dev: Should be adapted to the signals used in this cluster.
    """
    handler = SignalHandler(self, paths, submission)
    signal.signal(signal.SIGUSR1, handler.checkpoint_and_try_requeue)
    # A priori we don't need other signals anymore,
    # but still log them to make it easier to debug.
    signal.signal(signal.SIGTERM, handler.bypass)
    signal.signal(signal.SIGCONT, handler.bypass)

So in your job you can also call signal.signal(signal.SIGTERM, my_custom_handler) to change the behavior.
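
For illustration, a minimal sketch of that (reusing the fn/configs names from your snippet above; my_custom_handler and its exit-on-SIGTERM behavior are just an example, not submitit's default behavior):

import signal
import sys

def my_custom_handler(signum, frame):
    # Example behavior: log the signal and exit cleanly instead of bypassing it.
    print(f"Received {signal.Signals(signum).name}, exiting", file=sys.stderr)
    sys.exit(0)

def fn(cfg):
    # Registering the handler inside the submitted function overrides the
    # handler submitit installed before calling it.
    signal.signal(signal.SIGTERM, my_custom_handler)
    ...  # actual job work goes here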

The main reason we are doing it this way is historical. Some clusters (and the previous setting on our cluster) send SIGTERM then SIGUSR1 to signal a preemption, and only SIGTERM for a timeout, so we couldn't stop on the first SIGTERM and had to skip it. We now have different logic for detecting timeouts, so maybe we should reconsider how we handle SIGTERM.

gwenzek (Contributor) commented Apr 5, 2022

I think the main bug is that the job does not end even though the script finished.
Can you provide more details on what the script is doing? Can you register a custom signal handler that prints a stack trace on SIGTERM?
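
For example, something along these lines (a hypothetical handler, registered at the top of your job function, that only logs where the code is when SIGTERM arrives):

import signal
import traceback

def dump_stack(signum, frame):
    # Log the signal and the current stack, then return without changing behavior.
    print(f"Got {signal.Signals(signum).name}:")
    traceback.print_stack(frame)

signal.signal(signal.SIGTERM, dump_stack)
# Alternatively, import faulthandler and call
# faulthandler.register(signal.SIGTERM, chain=True) to dump all thread stacks.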

yuvalkirstain (Author) commented

This script takes in a large text file and shards it into smaller text files. For some text files it works, and for others it does not.

Can you please provide an example of how to register such a custom signal handler (hopefully in a way that won't require many code changes)?

gwenzek (Contributor) commented Apr 6, 2022

There is a snippet in my previous message that shows how to register a signal handler.

Buzz-Beater commented

Hi, I'm facing a similar problem on a Slurm cluster where I canceled a job with scancel, but it kept holding the resources I was allocated. I was wondering how we can find out which signals are being sent and adjust the signal handling so that submitit ends correctly without holding the resources. Thanks!

yuvalkirstain (Author) commented

I think that there was a multiprocessing issue on my part. Now it seems to be solved. Closing (feel free to reopen).

nikhilxb commented Nov 14, 2022

@yuvalkirstain Can you share some info on what your multiprocessing problem/fix was?
I am also running a script with many multiprocessing.Process() instances, and the Slurm job is not stopping properly after the script presumably ends. I get one warning per spawned process (e.g. 4 in the example below):

[hydra.experimental.callbacks.LogJobReturnCallback][INFO] - Succeeded with return value: None
[submitit][INFO] - Job completed successfully
[submitit][WARNING] - Bypassing signal SIGTERM
[submitit][WARNING] - Bypassing signal SIGTERM
[submitit][WARNING] - Bypassing signal SIGTERM
[submitit][WARNING] - Bypassing signal SIGTERM

As a result, the resources are not released, so other queued jobs never start, i.e. the entire system deadlocks.

UPDATE: I think it has to do with spawned subprocesses not being killed properly. I'm using a third-party library that starts many processes via process.start() but doesn't properly process.kill() them at the end. (Btw, even if it called process.terminate() properly, that would not work because of handler.bypass.)

@gwenzek Given that multiprocessing uses SIGTERM by default to kill running subprocesses, is it right to hardcode this behavior?

signal.signal(signal.SIGTERM, handler.bypass)
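
For what it's worth, here is a small, self-contained reproduction of that interaction, mimicking the bypass handler rather than using submitit itself (fork start method assumed, as on Linux):

import multiprocessing as mp
import signal
import time

def bypass(signum, frame):
    print(f"Bypassing signal {signal.Signals(signum).name}")

def worker():
    time.sleep(60)

if __name__ == "__main__":
    # Install a bypass-style handler in the parent; forked children inherit it.
    signal.signal(signal.SIGTERM, bypass)
    child = mp.get_context("fork").Process(target=worker)
    child.start()
    child.terminate()      # sends SIGTERM, which the inherited handler swallows
    child.join(timeout=5)
    print("alive after terminate():", child.is_alive())  # True -> the step hangs

With the spawn start method, the child would not inherit the handler and terminate() would kill it as expected.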

dhruvdcoder commented

I had the same issue. I was using hydra-submitit-plugin to submit jobs to the Slurm cluster; the plugin uses submitit internally. My code had multiple places where workers/subprocesses were created with the default start method, which on Linux is fork. When a subprocess is created via fork, it inherits the signal handlers of its parent. Hence, all these subprocesses inherited the SIGTERM handler set by submitit, which basically printed a statement and ignored the SIGTERM, so the workers would not respect the graceful terminate (SIGTERM) call from the main process. To solve this, you can do one of two things:

  1. If you have access to the code that creates the subprocesses, use the worker init function to remove the SIGTERM handler from the subprocesses. This works for things like pytorch multiprocess dataloading etc.

  2. If (1) is not possible, try to modify the method of subprocess creation to spawn.

Together, these two methods worked for me; a rough sketch of both is below.
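
Here is a rough sketch of both options using plain multiprocessing (for a PyTorch DataLoader, the same reset can go in worker_init_fn):

import multiprocessing as mp
import signal
import time

def reset_sigterm(*_):
    # Option 1: run at worker startup so the forked worker drops the SIGTERM
    # handler it inherited from the submitit parent process.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)

def work(x):
    time.sleep(0.1)
    return x * x

if __name__ == "__main__":
    # Option 1: keep the default fork start method, but reset the handler
    # in every worker via the pool initializer.
    with mp.Pool(processes=4, initializer=reset_sigterm) as pool:
        print(pool.map(work, range(8)))

    # Option 2: use the "spawn" start method; spawned workers run a fresh
    # interpreter and never inherit the parent's signal handlers.
    with mp.get_context("spawn").Pool(processes=4) as pool:
        print(pool.map(work, range(8)))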
