Retrying failed remote jobs does not work #437
Comments
That snippet of the log certainly seems suspect:
If it may be of any help, here is the patch that I use to inject random remote failures. Here are a few more logs:
This tries to address a problem described in fastbuild#437.
Trying to bump this issue with more information. Here is what I think happens:
The patch in #644 works for me, but I must admit I’m not 100% sure about the consequences. I’d be happy to rework it if you have comments.
I was finally able to spend the time to investigate and fix this issue. Thank you for providing this report and accompanying logs - they were helpful in understanding the issue. I confirmed, with a repro case, that the following sequence of events was not correctly handled:
The result of such a sequence was usually a "BUILD FAILED" but with no error message about the failure. Given the nature of the bug and the resulting bad internal state, it could potentially manifest in other ways too. Change 556c31a fixes this issue along with some other improvements and fixes relating to the -monitor, -distverbose and -profile command line options that relate to this job flow. These fixes and improvements will be in the next release (v1.06). I'll leave this issue open until that version is released. Thank you for reporting this issue!
Although the core (and most severe) issue is fixed, some problems still remain.
followed by this assert:
I will continue to investigate this.
I've fixed two more problems (c51a009):
With these issues fixed, I think this issue can finally be considered resolved. These changes will also be in the next release (v1.06). I'll leave this issue open until that version is released.
v1.06 has now been released. I believe all the problems covered by this issue are resolved in that release.
The mechanism that blacklists hosts and retries jobs up to a certain limit does not appear to work properly. I injected random failures in `Client::Process` to force retries (`result = false; systemError = true;`) and noticed several things happen:

- `ASSERT( Job::GetTotalLocalDataMemoryUsage() == 0 );` fails in `JobQueue::~JobQueue`
- `job->GetDistributionState() == Job::DIST_COMPLETED_REMOTELY` should probably account for `Job::DIST_RACE_WON_REMOTELY` too in `JobQueue::ReturnUnfinishedDistributableJob`
- `ASSERT( distState == Job::DIST_RACING );` fails in `JobQueue::FinalizeCompletedJobs` because `distState == Job::DIST_BUILDING_LOCALLY` sometimes.

I am attaching a build log of FASTBuild with `-dist -verbose -distverbose -summary`. The relevant parts are probably:
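
For anyone who wants to reason about the retry behaviour without a full FASTBuild setup, here is a minimal, self-contained sketch of the general idea described in this report: a remote call that is randomly forced to fail with a system error, a per-job retry limit, and a blacklist of hosts that failed. It is a toy model, not FASTBuild code. `FailureInjector`, `RunWithRetries`, `Outcome` and `RemoteResult` are hypothetical names invented for illustration; only the `result` / `systemError` pair mirrors the flags mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model, NOT FASTBuild's implementation.
struct RemoteResult
{
    bool result;      // did the job succeed?
    bool systemError; // was the failure caused by the worker, not the job?
};

// Fails with a fixed probability, mimicking the forced
// "result = false; systemError = true;" injection from the report.
class FailureInjector
{
public:
    FailureInjector( double failureProbability, unsigned seed )
        : m_Rng( seed )
        , m_Fail( failureProbability )
    {}

    RemoteResult Process()
    {
        if ( m_Fail( m_Rng ) )
        {
            return { false, true }; // injected system error
        }
        return { true, false };
    }

private:
    std::mt19937 m_Rng;
    std::bernoulli_distribution m_Fail;
};

struct Outcome
{
    bool succeeded;                    // true if some attempt succeeded
    std::size_t attempts;              // number of remote attempts made
    std::set<std::string> blacklisted; // hosts that returned a system error
};

// Tries each host at most once, stopping on success or when the
// retry limit is reached. Hosts that fail with a system error are
// recorded as blacklisted.
Outcome RunWithRetries( const std::vector<std::string> & hosts,
                        std::size_t retryLimit,
                        FailureInjector & injector )
{
    Outcome out{ false, 0, {} };
    for ( const std::string & host : hosts )
    {
        if ( out.attempts == retryLimit )
        {
            break; // limit reached: the caller would fail or build locally
        }
        ++out.attempts;
        const RemoteResult r = injector.Process();
        if ( r.result )
        {
            out.succeeded = true;
            break;
        }
        if ( r.systemError )
        {
            out.blacklisted.insert( host );
        }
    }
    assert( out.attempts <= retryLimit ); // invariant of the retry loop
    return out;
}
```

With a failure probability of `0.0` the first host succeeds and nothing is blacklisted; with `1.0` every attempt fails until the limit is hit. The bugs discussed in this issue sit in the state transitions around that second case, which this sketch deliberately does not model.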