-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow long runs #1062
Allow long runs #1062
Conversation
def kill_process(pid): | ||
"""Kill process pid""" | ||
log.warning(f'Kill PID:{pid}') | ||
if not pid_exists(pid): | ||
log.warning(f'No PID:{pid}') | ||
return | ||
|
||
for sig in [signal.SIGTERM, signal.SIGKILL, 'die']: | ||
time.sleep(wait_time) | ||
if not pid_exists(pid): | ||
return | ||
if signal == 'die': | ||
message = f"Could not kill process {pid}" | ||
log_warning(message, priority='fatal') | ||
raise RuntimeError(message) | ||
os.kill(pid, sig) | ||
parent = Process(pid) | ||
for child in parent.children(recursive=True): | ||
child.kill() | ||
parent.kill() | ||
|
||
# Just make it extra dead | ||
os.kill(pid, signal.SIGKILL) | ||
|
||
if pid_exists(pid): | ||
message = f"Could not kill process {pid}?!" | ||
log_warning(message, priority='fatal') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old version only killed the parent - resulting in many child processes waiting to die
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure there's an edge case we aren't considering, but otherwise looks good.
# Fail because for some reason, we are not writing any new files to disk. | ||
if (t0 < now(-timeouts['max_no_write_time']) | ||
and (last_write := last_file_write_time(os.path.join(output_folder, f'{run_id}*')) | ||
) < now(-timeouts['max_no_write_time'])): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this set of booleans correct? Why should we care about both when we started processing and how old the newest file is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because if we start processing a run like 3 days later, we don't want to fail.
wrong reply for this if.
Because we check the status of processing every 10 seconds, we want to be sure that we have actually have had some times to write files. The first minute or so, we might not write new files and don't want to immediately fail. We could use a different timeout but I think 30 minutes is fine.
chunk_files = sorted(glob(os.path.join(folder, '*')))[-3:] | ||
for chunk in chunk_files: | ||
# Check that we did not rename this file since the glob above | ||
if os.path.exists(chunk): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about the edge case where all the files we check were renamed, so that we end up returning the dawn of time, or is that sufficiently unlikely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the os.path.exists(chunk)
check is already quite conservative - I doubt it every actually returns False, the edge case where all things are renamed is somethings I consider sufficiently unlikely. It's an easy fix if my imagination is ever proven to be too limited 😉
if we want to allow > 2h long runs, we should not kill bootstrax before that time. Instead - only kill bootstrax if it is still running for a long time despite the run having ended. Additionally, we can also fail if we did not write any new file for more than 30 minutes which would be a better indicator of hangs than the total processing time.