Job is not ending - Bypassing signal SIGTERM #1677

Closed
yuvalkirstain opened this issue Mar 31, 2022 · 8 comments

yuvalkirstain commented Mar 31, 2022

Hello,

I am sending a job to my Slurm cluster with submitit. The job runs as it is supposed to (you can see the "Finished script" log), but the Slurm job itself does not finish. Instead, I get these "Bypassing signal" messages. Because I need this job to finish before moving on to other jobs, I am in a deadlock. I am really not sure what I should do and would appreciate the help.

Here are logs from my neverending job :(

03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 19:56:03 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,088) - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 19:56:03,093) - Bypassing signal SIGTERM
03/31/2022 19:56:03 - WARNING - submitit - Bypassing signal SIGTERM

To provide more information, this happens when I run job arrays:

jobs = executor.map_array(fn, configs)
job2cfg = {job: cfg for job, cfg in zip(jobs, configs)}
list(tqdm(submitit.helpers.as_completed(jobs), total=len(jobs)))

Some of them finish successfully, and others get stuck (until I had to clear the queue and scancel them):

❯ tail /home/olab/kirstain/masking/log_test/71604_*/71604_*_0_log.err
==> /home/olab/kirstain/masking/log_test/71604_0/71604_0_0_log.err <==
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Flattening
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 292793 blocks to write
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Batching
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - We have 3 shards to write
100%|██████████| 3/3 [00:00<00:00, 3093.14it/s]
03/31/2022 20:53:09 - INFO - masking.scripts.shard_corpus - Writing 3 shards to /home/olab/kirstain/masking/data/5/1024/shards/ArXiv
100%|██████████| 3/3 [00:08<00:00,  2.78s/it]
03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:53:17 - INFO - masking.scripts.shard_corpus - Finished script
03/31/2022 20:53:18 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_10/71604_10_0_log.err <==
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,179) - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:56:55,188) - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:56:55 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:56:55 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_11/71604_11_0_log.err <==
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,088) - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:00:42,093) - Bypassing signal SIGTERM
03/31/2022 21:00:42 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:00:50 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_12/71604_12_0_log.err <==
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:51:13,411) - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:51:13 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_13/71604_13_0_log.err <==
100%|██████████| 65/65 [00:00<00:00, 10539.67it/s]
03/31/2022 21:06:43 - INFO - masking.scripts.shard_corpus - Writing 65 shards to /home/olab/kirstain/masking/data/5/1024/shards/Wikipedia_en
100%|██████████| 65/65 [01:37<00:00,  1.50s/it]
03/31/2022 21:08:22 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 21:08:23 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 21:08:23,557) - Bypassing signal SIGTERM
03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:08:23,561) - Bypassing signal SIGTERM
03/31/2022 21:08:23 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:08:31 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_1/71604_1_0_log.err <==
03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:59:04 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 20:59:04,562) - Bypassing signal SIGTERM
03/31/2022 20:59:04 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,084) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71606.0 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***
submitit WARNING (2022-03-31 21:25:40,085) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
slurmstepd: error: *** JOB 71606 ON kilonova CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_2/71604_2_0_log.err <==
03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Batching
03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - We have 11 shards to write
100%|██████████| 11/11 [00:00<00:00, 3551.76it/s]
03/31/2022 21:19:58 - INFO - masking.scripts.shard_corpus - Writing 11 shards to /home/olab/kirstain/masking/data/5/1024/shards/Books3
100%|██████████| 11/11 [00:33<00:00,  3.09s/it]
03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 21:20:33 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 21:20:33,823) - Bypassing signal SIGTERM
03/31/2022 21:20:33 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:20:34 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_3/71604_3_0_log.err <==
submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:52:43,363) - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:52:43 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_4/71604_4_0_log.err <==
03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:05:52 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,096) - Bypassing signal SIGCONT
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 71609.0 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***
submitit WARNING (2022-03-31 21:25:40,106) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
slurmstepd: error: *** JOB 71609 ON rack-iscb-32 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_5/71604_5_0_log.err <==
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:58 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,092) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71610.0 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
submitit WARNING (2022-03-31 21:25:40,093) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
slurmstepd: error: *** JOB 71610 ON rack-iscb-33 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_6/71604_6_0_log.err <==
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:49:47 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,089) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71611.0 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
submitit WARNING (2022-03-31 21:25:40,091) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
slurmstepd: error: *** JOB 71611 ON rack-iscb-34 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_7/71604_7_0_log.err <==
100%|██████████| 13/13 [00:00<00:00, 5749.26it/s]
03/31/2022 20:53:43 - INFO - masking.scripts.shard_corpus - Writing 13 shards to /home/olab/kirstain/masking/data/5/1024/shards/OpenWebText2
100%|██████████| 13/13 [00:27<00:00,  2.08s/it]
03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:54:11 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 20:54:11,320) - Bypassing signal SIGTERM
03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:54:11,321) - Bypassing signal SIGTERM
03/31/2022 20:54:11 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:54:14 - INFO - submitit - Job completed successfully

==> /home/olab/kirstain/masking/log_test/71604_8/71604_8_0_log.err <==
submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:50:43,863) - Bypassing signal SIGTERM
03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:50:43 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 21:25:40,086) - Bypassing signal SIGCONT
slurmstepd: error: *** STEP 71613.0 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***
submitit WARNING (2022-03-31 21:25:40,088) - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 21:25:40 - WARNING - submitit - Bypassing signal SIGCONT
slurmstepd: error: *** JOB 71613 ON rack-iscb-36 CANCELLED AT 2022-03-31T21:25:40 ***

==> /home/olab/kirstain/masking/log_test/71604_9/71604_9_0_log.err <==
100%|██████████| 21/21 [00:00<00:00, 4838.52it/s]
03/31/2022 20:56:39 - INFO - masking.scripts.shard_corpus - Writing 21 shards to /home/olab/kirstain/masking/data/5/1024/shards/Pile-CC
100%|██████████| 21/21 [00:42<00:00,  2.01s/it]
03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished Writing
03/31/2022 20:57:21 - INFO - masking.scripts.shard_corpus - Finished script
submitit WARNING (2022-03-31 20:57:21,672) - Bypassing signal SIGTERM
03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
submitit WARNING (2022-03-31 20:57:21,674) - Bypassing signal SIGTERM
03/31/2022 20:57:21 - WARNING - submitit - Bypassing signal SIGTERM
03/31/2022 20:57:24 - INFO - submitit - Job completed successfully
yuvalkirstain changed the title from "Job is not ending - strange" to "Job is not ending - Bypassing signal SIGTERM" on Mar 31, 2022
gwenzek (Contributor) commented Apr 1, 2022

The way Slurm stops a job is cluster specific. On our cluster, submitit relies on SIGUSR1 to signal an upcoming preemption, and SIGKILL is then sent to kill the job. We trigger checkpointing on SIGUSR1 and do nothing when receiving SIGTERM; we just log it.

Do you know when your cluster is sending SIGTERM? And why isn't it sending SIGKILL?

In any case, you can look at how submitit sets up the default signal handlers on startup:

def _handle_signals(self, paths: JobPaths, submission: DelayedSubmission) -> None:
    """Set up signals handler for the current executable.
    The default implementation checkpoint the given submission and requeues it.
    @plugin-dev: Should be adapted to the signals used in this cluster.
    """
    handler = SignalHandler(self, paths, submission)
    signal.signal(signal.SIGUSR1, handler.checkpoint_and_try_requeue)
    # A priori we don't need other signals anymore,
    # but still log them to make it easier to debug.
    signal.signal(signal.SIGTERM, handler.bypass)
    signal.signal(signal.SIGCONT, handler.bypass)

So in your job you can also call signal.signal(signal.SIGTERM, my_custom_handler) to change the behavior.
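
For illustration, a minimal sketch of that (reusing the fn/configs names from your snippet above; my_custom_handler and its exit-on-SIGTERM behavior are just an example, not submitit's default behavior):

import signal
import sys

def my_custom_handler(signum, frame):
    # Example behavior: log the signal and exit cleanly instead of bypassing it.
    print(f"Received {signal.Signals(signum).name}, exiting", file=sys.stderr)
    sys.exit(0)

def fn(cfg):
    # Registering the handler inside the submitted function overrides the
    # handler submitit installed before calling it.
    signal.signal(signal.SIGTERM, my_custom_handler)
    ...  # actual job work goes here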

The main reason we are doing it this way is historical. Some clusters (and the previous setting on our cluster) send SIGTERM then SIGUSR1 to signal a preemption, and only SIGTERM for a timeout, so we couldn't stop on the first SIGTERM and had to skip it. We now have different logic for detecting timeouts, so maybe we should reconsider how we handle SIGTERM.

gwenzek (Contributor) commented Apr 5, 2022

I think the main bug is that the job does not end even though the script finished.
Can you provide more details on what the script is doing? Can you register a custom signal handler that prints a stack trace on SIGTERM?
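
For example, something along these lines (a hypothetical handler, registered at the top of your job function, that only logs where the code is when SIGTERM arrives):

import signal
import traceback

def dump_stack(signum, frame):
    # Log the signal and the current stack, then return without changing behavior.
    print(f"Got {signal.Signals(signum).name}:")
    traceback.print_stack(frame)

signal.signal(signal.SIGTERM, dump_stack)
# Alternatively, import faulthandler and call
# faulthandler.register(signal.SIGTERM, chain=True) to dump all thread stacks.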

yuvalkirstain (Author) commented

This script takes in a large text file and shards it into smaller text files. For some text files it works, and for others it does not.

Can you please provide an example of how to register such a custom signal handler (hopefully in a way that won't require many code changes)?

gwenzek (Contributor) commented Apr 6, 2022

There is a snippet in my previous message that shows how to register a signal handler.

Buzz-Beater commented

Hi, I'm facing a similar problem on a Slurm cluster where I canceled a job with scancel, but it kept holding the resources I was allocated. I was wondering how we can find out which signals are being sent and adjust the signal handling so that submitit ends correctly without holding the resources. Thanks!

yuvalkirstain (Author) commented

I think that there was a multiprocessing issue on my part. Now it seems to be solved. Closing (feel free to reopen).

nikhilxb commented Nov 14, 2022

@yuvalkirstain Can you share some info on what your multiprocessing problem/fix was?
I am also running a script with many multiprocessing.Process() instances, and the Slurm job is not stopping properly after the script presumably ends. I get one warning per spawned process (e.g. 4 in the example below):

[hydra.experimental.callbacks.LogJobReturnCallback][INFO] - Succeeded with return value: None
[submitit][INFO] - Job completed successfully
[submitit][WARNING] - Bypassing signal SIGTERM
[submitit][WARNING] - Bypassing signal SIGTERM
[submitit][WARNING] - Bypassing signal SIGTERM
[submitit][WARNING] - Bypassing signal SIGTERM

As a result, the resources are not released, so other queued jobs never start, i.e. the entire system deadlocks.

UPDATE: I think it has to do with spawned subprocesses not being killed properly. I'm using a third-party library that starts many processes via process.start() but doesn't properly process.kill() them at the end. (Btw, even if it called process.terminate() properly, that would not work because of handler.bypass.)

@gwenzek Given that multiprocessing uses SIGTERM by default to kill running subprocesses, is it right to hardcode this behavior?

signal.signal(signal.SIGTERM, handler.bypass)
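
For what it's worth, here is a small, self-contained reproduction of that interaction, mimicking the bypass handler rather than using submitit itself (fork start method assumed, as on Linux):

import multiprocessing as mp
import signal
import time

def bypass(signum, frame):
    print(f"Bypassing signal {signal.Signals(signum).name}")

def worker():
    time.sleep(60)

if __name__ == "__main__":
    # Install a bypass-style handler in the parent; forked children inherit it.
    signal.signal(signal.SIGTERM, bypass)
    child = mp.get_context("fork").Process(target=worker)
    child.start()
    child.terminate()      # sends SIGTERM, which the inherited handler swallows
    child.join(timeout=5)
    print("alive after terminate():", child.is_alive())  # True -> the step hangs

With the spawn start method, the child would not inherit the handler and terminate() would kill it as expected.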

dhruvdcoder commented

I had the same issue. I was using hydra-submitit-plugin to submit jobs to the Slurm cluster; the plugin uses submitit internally. My code had multiple places where workers/subprocesses were created with the default start method, which on Linux is fork. When a subprocess is created via fork, it inherits the signal handlers of its parent. Hence, all these subprocesses inherited the SIGTERM handler set by submitit, which basically printed a statement and ignored the SIGTERM, so the workers would not respect the graceful terminate (SIGTERM) call from the main process. To solve this, you can do one of two things:

  1. If you have access to the code that creates the subprocesses, use the worker init function to remove the SIGTERM handler from the subprocesses. This works for things like pytorch multiprocess dataloading etc.

  2. If (1) is not possible, try to modify the method of subprocess creation to spawn.

Together, these two methods worked for me; a rough sketch of both is below.
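
Here is a rough sketch of both options using plain multiprocessing (for a PyTorch DataLoader, the same reset can go in worker_init_fn):

import multiprocessing as mp
import signal
import time

def reset_sigterm(*_):
    # Option 1: run at worker startup so the forked worker drops the SIGTERM
    # handler it inherited from the submitit parent process.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)

def work(x):
    time.sleep(0.1)
    return x * x

if __name__ == "__main__":
    # Option 1: keep the default fork start method, but reset the handler
    # in every worker via the pool initializer.
    with mp.Pool(processes=4, initializer=reset_sigterm) as pool:
        print(pool.map(work, range(8)))

    # Option 2: use the "spawn" start method; spawned workers run a fresh
    # interpreter and never inherit the parent's signal handlers.
    with mp.get_context("spawn").Pool(processes=4) as pool:
        print(pool.map(work, range(8)))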
