Retrying failed remote jobs does not work #437

samhocevar · 2018-06-07T12:09:50Z

The mechanism that blacklists hosts and retries jobs up to a certain limit does not appear to work properly. I injected random failures in Client::Process to force retries (result = false; systemError = true;) and noticed several things happen:

ASSERT( Job::GetTotalLocalDataMemoryUsage() == 0 ); fails in JobQueue::~JobQueue
job->GetDistributionState() == Job::DIST_COMPLETED_REMOTELY should probably account for Job::DIST_RACE_WON_REMOTELY too in JobQueue::ReturnUnfinishedDistributableJob
ASSERT( distState == Job::DIST_RACING ); fails in JobQueue::FinalizeCompletedJobs because distState == Job::DIST_BUILDING_LOCALLY sometimes.
and of course, builds fail even if the job is retried and succeeds locally

I am attaching a build log of FASTBuild with -dist -verbose -distverbose -summary. The relevant parts are probably:

-> Obj: C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj <REMOTE: farmagent13>
10>Obj: C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj <LOCAL RACE>
Got Result: farmagent13 - C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj (Won Race)
Remote System Failure!
 - Blacklisted Worker: farmagent13
 - Node              : C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj
 - Job Error Count   : 1 / 3
7> Obj: C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj <LOCAL>

The text was updated successfully, but these errors were encountered:

ffulin · 2018-06-08T02:12:33Z

That snippet of the log certainly seems suspect:

we seem to consider the remote race "won" even though it failed. Could be just a mistake in the log, or a legit problem
it looks while we were already racing the job, we started compiling it again locally. That should never happen

samhocevar · 2018-06-08T12:44:57Z

If it may be of any help, here is the patch that I use to inject random remote failures. Here are a few more logs:

A build log that simply exits with Error: BUILD FAILED without any explicit error:
build004.log
A build log that triggers Assert: job->GetDistributionState() == Job::DIST_BUILDING_REMOTELY: build001.log
After ignoring the above, a build log that triggers Assert: distState == Job::DIST_RACING:
build003.log

This tries to address a problem described in fastbuild#437.

samhocevar · 2019-10-15T16:01:56Z

Trying to bump this issue with more information.

Here is what I think happens:

Client::Process() reads result and systemError
Client::Process() calls OnReturnRemoteJob() regardless of the value of result
- OnReturnRemoteJob() cancels the local job if the state is DIST_RACING (this is already bad, because it could be worth keeping the job)
Client::Process() calls ReturnUnfinishedDistributableJob() to retry the job, possibly locally

The patch in #644 works for me, but I must admit I’m not 100% sure about the consequences, I’d be happy to rework it if you have comments.

ffulin · 2021-12-27T03:12:21Z

I was finally able to spend the time to investigate and fix this issue. Thank you for providing this report and accompanying logs - they were helpful in understanding the issue.

I confirmed, with a repro case that the following sequence of events was not correctly handled:

a job starts remotely
a local race of that job begins
the job fails remotely with a "system failure"
the job is returned while still being raced

The result of such a sequence was usually a "BUILD FAILED" but with no error message about the failure. Given the nature of the bug and the resulting bad internal state, it could also potentially manifest in other ways too.

Change 556c31a fixes this issue along with some other improvements and fixes relating to -monitor, -distverbose and -profile command line options that related to this job flow.

These fixes and improvements will be in the next release (v1.06). I'll leave this issue open until that version is released.

Thank you for reporting this issue!

ffulin · 2021-12-27T08:10:51Z

Although the core (and most severe issue is fixed), some problems still remain.
The following assert can still trigger:

ASSERT( distState == Job::DIST_RACING );

followed by this assert:

ASSERT( Job::GetTotalLocalDataMemoryUsage() == 0 );

I will continue to investigate this.

ffulin · 2022-01-08T22:42:36Z

I've fixed two more problems (c51a009):

a build could fail without error when a remote race was returned if the job was returned at a specific point during the completion of the same job locally
a memory leak which could occur if a remote race was lost and returned during a specific point during local completion of the job

With these issues fixed, I think this issue can finally be considered resolved. There changes will also be in the next release (v1.06). I'll leave this issue open until that version is released.

ffulin · 2022-04-15T07:45:31Z

v1.06 has now been released. I believe all the problems covered by this issue are resolved in that release.

samhocevar changed the title ~~Retrying jobs does not work~~ Retrying failed remote jobs does not work Jun 7, 2018

ffulin added the Bug label Jun 8, 2018

ffulin mentioned this issue Jun 23, 2018

Possible data race between OnReturnRemoteJob and FinalizeCompletedJobs in JobQueue #447

Closed

dummyunit mentioned this issue Jun 24, 2018

Add support for running tests on Travis CI #450

Merged

samhocevar added a commit to dontnod/fork-fastbuild that referenced this issue Oct 15, 2019

Fix the logic for retries after remote system errors.

5c9ebf3

This tries to address a problem described in fastbuild#437.

samhocevar mentioned this issue Oct 15, 2019

Fix the logic for retries after remote system errors. #644

Closed

ffulin added the Resolved In Next Version label Dec 27, 2021

ffulin mentioned this issue Jan 6, 2022

CPU cores endless use #874

Open

ffulin closed this as completed Apr 15, 2022

ffulin removed the Resolved In Next Version label Apr 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retrying failed remote jobs does not work #437

Retrying failed remote jobs does not work #437

samhocevar commented Jun 7, 2018

ffulin commented Jun 8, 2018

samhocevar commented Jun 8, 2018

samhocevar commented Oct 15, 2019

ffulin commented Dec 27, 2021

ffulin commented Dec 27, 2021 •

edited

ffulin commented Jan 8, 2022

ffulin commented Apr 15, 2022

Retrying failed remote jobs does not work #437

Retrying failed remote jobs does not work #437

Comments

samhocevar commented Jun 7, 2018

ffulin commented Jun 8, 2018

samhocevar commented Jun 8, 2018

samhocevar commented Oct 15, 2019

ffulin commented Dec 27, 2021

ffulin commented Dec 27, 2021 • edited

ffulin commented Jan 8, 2022

ffulin commented Apr 15, 2022

ffulin commented Dec 27, 2021 •

edited