Allow long runs #1062

JoranAngevaare · 2022-07-11T15:58:27Z

If we want to allow runs longer than 2 h, we should not kill bootstrax before that time. Instead, only kill bootstrax if it is still running long after the run has ended. Additionally, we can fail if no new file has been written for more than 30 minutes, which is a better indicator of a hang than the total processing time.

coveralls · 2022-07-11T16:15:49Z

Coverage increased (+0.03%) to 93.212% when pulling 5763ab7 on increased_run_duration into 2398287 on master.

JoranAngevaare · 2022-07-12T09:49:09Z

bin/bootstrax

+def kill_process(pid):
+    """Kill process pid"""
+    log.warning(f'Kill PID:{pid}')
    if not pid_exists(pid):
+        log.warning(f'No PID:{pid}')
        return

-    for sig in [signal.SIGTERM, signal.SIGKILL, 'die']:
-        time.sleep(wait_time)
-        if not pid_exists(pid):
-            return
-        if signal == 'die':
-            message = f"Could not kill process {pid}"
-            log_warning(message, priority='fatal')
-            raise RuntimeError(message)
-        os.kill(pid, sig)
+    parent = Process(pid)
+    for child in parent.children(recursive=True):
+        child.kill()
+    parent.kill()
+
+    # Just make it extra dead
+    os.kill(pid, signal.SIGKILL)
+
+    if pid_exists(pid):
+        message = f"Could not kill process {pid}?!"
+        log_warning(message, priority='fatal')        


The old version only killed the parent, leaving many child processes waiting to die.

darrylmasson

I'm sure there's an edge case we aren't considering, but otherwise looks good.
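
One candidate edge case, offered as a hedged sketch rather than a description of what bootstrax does: if the process exits and is reaped between `parent.kill()` and the follow-up `os.kill(pid, signal.SIGKILL)`, `os.kill` raises `ProcessLookupError`. The helper below shows one way to tolerate that with the standard library only. The name `kill_if_alive` is hypothetical and not part of this PR.

```python
import os
import signal


def kill_if_alive(pid: int) -> bool:
    """Send SIGKILL to pid; return False if the process was already gone.

    Hypothetical helper for illustration, not taken from bin/bootstrax.
    """
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        # The process exited (and was reaped) before the signal was sent.
        return False
    return True
```

Returning a flag instead of raising lets the caller decide whether "already dead" deserves a log line or can be ignored.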

darrylmasson · 2022-07-12T10:26:51Z

bin/bootstrax

+                # Fail because for some reason, we are not writing any new files to disk.
+                if (t0 < now(-timeouts['max_no_write_time'])
+                    and (last_write := last_file_write_time(os.path.join(output_folder, f'{run_id}*'))
+                        ) < now(-timeouts['max_no_write_time'])):


Is this set of booleans correct? Why should we care about both when we started processing and how old the newest file is?

~~Because if we start processing a run, say, 3 days later, we don't want to fail.~~
(That was the wrong reply for this `if`.)

Because we check the status of processing every 10 seconds, we want to be sure that we have actually had some time to write files. During the first minute or so we might not write any new files, and we don't want to fail immediately. We could use a different timeout, but I think 30 minutes is fine.
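
The two-part condition can be illustrated in isolation. This is a minimal sketch under assumptions: `is_stalled` and its parameters are hypothetical names, and it compares timezone-aware `datetime` objects directly instead of calling the `now(...)` helper used in the diff.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(started, last_write, max_no_write, current=None):
    """Return True only if processing began before the cutoff AND the
    newest file is also older than the cutoff.

    Hypothetical helper for illustration, not taken from bin/bootstrax.
    """
    current = current or datetime.now(timezone.utc)
    cutoff = current - max_no_write
    # The first clause protects a job that has only just started and has
    # not yet had a chance to write anything.
    return started < cutoff and last_write < cutoff
```

For example, with a 30 minute limit, a job that started one minute ago is not flagged even if no file exists yet, whereas a job that started an hour ago and last wrote 45 minutes ago is.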

darrylmasson · 2022-07-12T11:41:16Z

bin/bootstrax

+        chunk_files = sorted(glob(os.path.join(folder, '*')))[-3:]
+        for chunk in chunk_files:
+            # Check that we did not rename this file since the glob above
+            if os.path.exists(chunk):


Do we care about the edge case where all the files we check were renamed, so that we end up returning the dawn of time, or is that sufficiently unlikely?

I think the `os.path.exists(chunk)` check is already quite conservative. I doubt it ever actually returns False, and the edge case where all files are renamed is something I consider sufficiently unlikely. It's an easy fix if my imagination is ever proven to be too limited 😉
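
For reference, the "easy fix" could look something like the sketch below. It is an assumption about one possible shape, not the code in this PR: `newest_write_time` is a hypothetical name, and it returns `None` when no file could be checked so that the caller can tell "nothing checked" apart from "very old file".

```python
import os
from datetime import datetime, timezone
from glob import glob


def newest_write_time(folder):
    """Return the newest mtime (UTC) among the last three entries of
    folder, or None if none of them could be inspected.

    Hypothetical helper for illustration, not taken from bin/bootstrax.
    """
    newest = None
    # Mirror the diff: only look at the last three entries by name.
    for path in sorted(glob(os.path.join(folder, '*')))[-3:]:
        try:
            mtime = os.path.getmtime(path)
        except FileNotFoundError:
            # The file was renamed or removed since the glob above.
            continue
        stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
        if newest is None or stamp > newest:
            newest = stamp
    return newest
```

Catching `FileNotFoundError` around `getmtime` also closes the small window between an `os.path.exists` check and the actual stat call.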

allow long runs

50c859c

JoranAngevaare marked this pull request as draft July 11, 2022 15:58

JoranAngevaare added 4 commits July 12, 2022 08:49

check for last written file

a2304d7

check for any files

af18cc7

tz aware

1c61987

Fix timeouts

85473a2

JoranAngevaare commented Jul 12, 2022

add some comments

6706f8d

JoranAngevaare requested a review from darrylmasson July 12, 2022 09:55

JoranAngevaare marked this pull request as ready for review July 12, 2022 09:55

darrylmasson approved these changes Jul 12, 2022

JoranAngevaare added 2 commits July 12, 2022 14:02

add endtime helper

dacdbb8

Merge branch 'master' into increased_run_duration

5763ab7

JoranAngevaare merged commit dfbea23 into master Jul 12, 2022

JoranAngevaare deleted the increased_run_duration branch July 12, 2022 12:28

JoranAngevaare mentioned this pull request Jul 13, 2022

Bootstrax file-check fix #1064

Merged

JoranAngevaare added the enhancement New feature or request label Oct 13, 2022
