Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow long runs #1062

Merged
merged 8 commits into from
Jul 12, 2022
Merged

Allow long runs #1062

merged 8 commits into from
Jul 12, 2022

Conversation

JoranAngevaare
Copy link
Contributor

@JoranAngevaare JoranAngevaare commented Jul 11, 2022

if we want to allow > 2h long runs, we should not kill bootstrax before that time. Instead - only kill bootstrax if it is still running for a long time despite the run having ended. Additionally, we can also fail if we did not write any new file for more than 30 minutes which would be a better indicator of hangs than the total processing time.

@JoranAngevaare JoranAngevaare marked this pull request as draft July 11, 2022 15:58
@coveralls
Copy link

coveralls commented Jul 11, 2022

Coverage Status

Coverage increased (+0.03%) to 93.212% when pulling 5763ab7 on increased_run_duration into 2398287 on master.

Comment on lines +463 to +480
def kill_process(pid):
"""Kill process pid"""
log.warning(f'Kill PID:{pid}')
if not pid_exists(pid):
log.warning(f'No PID:{pid}')
return

for sig in [signal.SIGTERM, signal.SIGKILL, 'die']:
time.sleep(wait_time)
if not pid_exists(pid):
return
if signal == 'die':
message = f"Could not kill process {pid}"
log_warning(message, priority='fatal')
raise RuntimeError(message)
os.kill(pid, sig)
parent = Process(pid)
for child in parent.children(recursive=True):
child.kill()
parent.kill()

# Just make it extra dead
os.kill(pid, signal.SIGKILL)

if pid_exists(pid):
message = f"Could not kill process {pid}?!"
log_warning(message, priority='fatal')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old version only killed the parent - resulting in many child processes waiting to die

@JoranAngevaare JoranAngevaare marked this pull request as ready for review July 12, 2022 09:55
Copy link
Contributor

@darrylmasson darrylmasson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure there's an edge case we aren't considering, but otherwise looks good.

# Fail because for some reason, we are not writing any new files to disk.
if (t0 < now(-timeouts['max_no_write_time'])
and (last_write := last_file_write_time(os.path.join(output_folder, f'{run_id}*'))
) < now(-timeouts['max_no_write_time'])):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this set of booleans correct? Why should we care about both when we started processing and how old the newest file is?

Copy link
Contributor Author

@JoranAngevaare JoranAngevaare Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because if we start processing a run like 3 days later, we don't want to fail.
wrong reply for this if.

Because we check the status of processing every 10 seconds, we want to be sure that we have actually have had some times to write files. The first minute or so, we might not write new files and don't want to immediately fail. We could use a different timeout but I think 30 minutes is fine.

bin/bootstrax Outdated Show resolved Hide resolved
chunk_files = sorted(glob(os.path.join(folder, '*')))[-3:]
for chunk in chunk_files:
# Check that we did not rename this file since the glob above
if os.path.exists(chunk):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about the edge case where all the files we check were renamed, so that we end up returning the dawn of time, or is that sufficiently unlikely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the os.path.exists(chunk) check is already quite conservative - I doubt it every actually returns False, the edge case where all things are renamed is somethings I consider sufficiently unlikely. It's an easy fix if my imagination is ever proven to be too limited 😉

@JoranAngevaare JoranAngevaare merged commit dfbea23 into master Jul 12, 2022
@JoranAngevaare JoranAngevaare deleted the increased_run_duration branch July 12, 2022 12:28
@JoranAngevaare JoranAngevaare added the enhancement New feature or request label Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants