Log forced fitting progress #667

Open
ddobie opened this issue Oct 28, 2022 · 1 comment
Labels
discussion (topic to discuss openly), enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)

Comments

@ddobie
Contributor

ddobie commented Oct 28, 2022

Currently, Step 5 of the pipeline processing (forced fitting) does not produce any logging output, presumably because the step involves parallel execution, which makes conventional logging impractical.

However, Step 5 is also (in my experience at least!) the slowest part of the pipeline, so some form of progress indication would be useful to reassure users that their jobs are executing successfully and not just hanging. I am currently executing a pipeline run (https://dev.pipeline.vast-survey.org/piperuns/40/) of ~1800 images, and a quick back-of-the-envelope calculation suggests that the forced fit runtime is approximately 5 days. I am now 2.5 days in and have no way to determine whether my estimate is still on track or whether the job has stalled.

Naively, I would assume fixing this issue is as simple as using the dask ProgressBar object. All it would require is from dask.diagnostics import ProgressBar and adding

with ProgressBar():

before

forced_dfs = (
    bags.map(lambda x: extract_from_image(
        edge_buffer=edge_buffer,
        cluster_threshold=cluster_threshold,
        allow_nan=allow_nan,
        **x
    ))
    .compute()
)

However, I have minimal experience with dask and this might be completely the wrong implementation (e.g. if it adds a new line to the log file with each iteration).
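
For reference, a minimal sketch of the combined change, assuming the bag is computed with one of dask's local schedulers (which is what dask.diagnostics.ProgressBar hooks into); note it prints to stdout rather than the log file, so where the output ends up depends on how the run is launched:

from dask.diagnostics import ProgressBar

# Sketch only: ProgressBar works with the local (threaded/multiprocess)
# schedulers; a distributed Client would need dask.distributed.progress instead.
with ProgressBar():
    forced_dfs = (
        bags.map(lambda x: extract_from_image(
            edge_buffer=edge_buffer,
            cluster_threshold=cluster_threshold,
            allow_nan=allow_nan,
            **x
        ))
        .compute()
    )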

ddobie added the enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested) and discussion (topic to discuss openly) labels on Oct 28, 2022
@ajstewart
Contributor

ajstewart commented Oct 31, 2022

Looking at that run I suspect it timed out.

If you are running via the website there is in fact a maximum process time setting that will kill runs that take longer than, by default, 24 hours:

# Q_CLUSTER_TIMEOUT=86400

This is probably it, although I don't know what the setting is on the deployment. Of course, this should absolutely be communicated better on the website itself! The clue is that the run was attempted again; you can see the log on the website. If that wasn't you, then it was the retry from qcluster, but since the run was already marked as running it didn't bother running it again (actually a handy bug for not reprocessing timeouts 😅).
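
If the deployment allows it, the limit can be raised by uncommenting and changing that value in the environment settings (a hypothetical example; the value is in seconds and the exact file depends on the deployment):

# Hypothetical: allow website-launched runs of up to 7 days
Q_CLUSTER_TIMEOUT=604800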

There is also issue #594, which means that if a timeout occurs during a dask step the run cannot exit gracefully and it's difficult to update the status.

For really large runs I recommend running the job through the CLI on the server, as that bypasses all the rules that are in place to "protect" the website running method.
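
For example, something along these lines from the pipeline installation directory (an assumption on my part; check the pipeline docs for the exact management command and run name to use):

python manage.py runpipeline <run_name>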

I also think there is a little bit of logging for the forced step, though I might be wrong; something along the lines of reporting how many forced extractions are out of range of the image.

You can also tweak the forced extraction sensitivity in the config to avoid having an overwhelming number of forced extractions to perform in such a large run.

As for monitoring, your best bet is probably to serve some sort of monitoring software on the server itself so that you can see how busy the machine is. Or it's just a case of an admin logging on and running htop to see what's happening!
