fatal: hook Fail Dir:node_fail #1872

Closed
stemangiola opened this issue Feb 22, 2018 · 6 comments

@stemangiola

Hello,

I got this error and cannot understand the cause.

2018/02/22 20:03:13.66 makeflow[15674] fatal: hook Fail Dir:node_fail returned 1
received signal 15 (Terminated), cleaning up remote jobs and files...
Killed

Any suggestions are appreciated.

Here is the debug file:

makeflow.zip

@nhazekam nhazekam self-assigned this Feb 22, 2018
@nhazekam
Contributor

The feature behind this error, Fail Dir, takes the output of a failed node and stores it in a directory so that the user can more easily diagnose the cause. It knows how to clean up previous failed nodes, and when Makeflow is rerun it will clean them up if needed.

The error message you got is likely the result of changing the Makeflow file and rerunning it without cleaning up the previous run. After changing the makeflow, you may have gotten a message, referring to the makeflowlog, that the makeflow was corrupted. If you then deleted the makeflowlog, the fail directories would have been left behind: Makeflow would see them, assume they existed before it ran, and not remove them. There may be other reasons why they persisted, and I will look into those as well.

Unfortunately, this will happen for each existing fail dir whenever that same node fails on a later run.
Solutions:
1: Clean up these directories. Running rm -rf makeflow.failed.* should remove them.
2: Turn off saving failed outputs. Add the --do-not-save-failed-output option to your makeflow command; these directories will then not be created, so nothing should collide with the existing ones.

Makeflow is conservative about files it did not create. As a result, when the makeflowlog that recorded the creation of these files is deleted, Makeflow treats them as pre-existing and will not remove them.
This can be avoided by running makeflow -c before changing a makeflow, but that will remove any partial progress that was made. I will discuss this with my team.
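
As a rough illustration of the first solution, here is a minimal shell sketch that removes leftover fail directories before a rerun. It assumes the directories match the makeflow.failed.* pattern mentioned above and sit directly in the given directory; the function name clean_fail_dirs is made up for this example and is not part of Makeflow.

```shell
#!/bin/sh
# Sketch only: clean_fail_dirs is a hypothetical helper, not a Makeflow command.
# Assumes fail directories are named makeflow.failed.* (pattern from this thread).

clean_fail_dirs() {
    # Directory to clean; defaults to the current directory.
    dir="${1:-.}"
    for d in "$dir"/makeflow.failed.*; do
        # When nothing matches, the glob stays unexpanded; skip it.
        [ -e "$d" ] || continue
        echo "removing $d"
        rm -rf -- "$d"
    done
}
```

After sourcing this, one could run clean_fail_dirs in the workflow directory and then rerun makeflow as usual. Unlike makeflow -c, this only touches the fail directories and leaves other outputs alone.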

@stemangiola
Author

Thanks,

That is indeed what happened.

Just one further question:

When I get the "corrupted makeflowlog" error, perhaps because I changed the Makeflow file, how can I preserve the commands that have already been executed? I have 5000+ jobs, and starting again each time is a risk I cannot take.

Bw.

@nhazekam
Contributor

This does not directly answer your question, but is there a reason you need all 5000+ jobs in a single Makeflow? Are they structured as a set of pipelines or as an intricate tree of tasks?

@stemangiola
Author

My pipeline is just 3 independent benchmarks, each with many combinations of parameters, which all feed into one script that computes descriptive statistics for all of them.

@Nekel-Seyew
Contributor

@nhazekam Can this be closed?

@nhazekam
Contributor

nhazekam commented Jul 5, 2019

@stemangiola Please let us know if this is still an issue for you.

@nhazekam nhazekam closed this as completed Jul 5, 2019