Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrying failed remote jobs does not work #437

Closed
samhocevar opened this issue Jun 7, 2018 · 7 comments
Closed

Retrying failed remote jobs does not work #437

samhocevar opened this issue Jun 7, 2018 · 7 comments
Labels

Comments

@samhocevar
Copy link
Contributor

The mechanism that blacklists hosts and retries jobs up to a certain limit does not appear to work properly. I injected random failures in Client::Process to force retries (result = false; systemError = true;) and noticed several things happen:

  • ASSERT( Job::GetTotalLocalDataMemoryUsage() == 0 ); fails in JobQueue::~JobQueue
  • job->GetDistributionState() == Job::DIST_COMPLETED_REMOTELY should probably account for Job::DIST_RACE_WON_REMOTELY too in JobQueue::ReturnUnfinishedDistributableJob
  • ASSERT( distState == Job::DIST_RACING ); fails in JobQueue::FinalizeCompletedJobs because distState == Job::DIST_BUILDING_LOCALLY sometimes.
  • and of course, builds fail even if the job is retried and succeeds locally

I am attaching a build log of FASTBuild with -dist -verbose -distverbose -summary. The relevant parts are probably:

-> Obj: C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj <REMOTE: farmagent13>
10>Obj: C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj <LOCAL RACE>
Got Result: farmagent13 - C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj (Won Race)
Remote System Failure!
 - Blacklisted Worker: farmagent13
 - Node              : C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj
 - Job Error Count   : 1 / 3
7> Obj: C:\Users\s.hocevar\fastbuild\tmp\x64-Release\FBuildCore\Graph\ObjectListNode.obj <LOCAL>
@samhocevar samhocevar changed the title Retrying jobs does not work Retrying failed remote jobs does not work Jun 7, 2018
@ffulin
Copy link
Contributor

ffulin commented Jun 8, 2018

That snippet of the log certainly seems suspect:

  • we seem to consider the remote race "won" even though it failed. Could be just a mistake in the log, or a legit problem
  • it looks while we were already racing the job, we started compiling it again locally. That should never happen

@ffulin ffulin added the Bug label Jun 8, 2018
@samhocevar
Copy link
Contributor Author

If it may be of any help, here is the patch that I use to inject random remote failures. Here are a few more logs:

  • A build log that simply exits with Error: BUILD FAILED without any explicit error:
    build004.log
  • A build log that triggers Assert: job->GetDistributionState() == Job::DIST_BUILDING_REMOTELY: build001.log
  • After ignoring the above, a build log that triggers Assert: distState == Job::DIST_RACING:
    build003.log

@samhocevar
Copy link
Contributor Author

Trying to bump this issue with more information.

Here is what I think happens:

  • Client::Process() reads result and systemError
  • Client::Process() calls OnReturnRemoteJob() regardless of the value of result
    • OnReturnRemoteJob() cancels the local job if the state is DIST_RACING (this is already bad, because it could be worth keeping the job)
  • Client::Process() calls ReturnUnfinishedDistributableJob() to retry the job, possibly locally

The patch in #644 works for me, but I must admit I’m not 100% sure about the consequences, I’d be happy to rework it if you have comments.

@ffulin
Copy link
Contributor

ffulin commented Dec 27, 2021

I was finally able to spend the time to investigate and fix this issue. Thank you for providing this report and accompanying logs - they were helpful in understanding the issue.

I confirmed, with a repro case that the following sequence of events was not correctly handled:

  • a job starts remotely
  • a local race of that job begins
  • the job fails remotely with a "system failure"
  • the job is returned while still being raced

The result of such a sequence was usually a "BUILD FAILED" but with no error message about the failure. Given the nature of the bug and the resulting bad internal state, it could also potentially manifest in other ways too.

Change 556c31a fixes this issue along with some other improvements and fixes relating to -monitor, -distverbose and -profile command line options that related to this job flow.

These fixes and improvements will be in the next release (v1.06). I'll leave this issue open until that version is released.

Thank you for reporting this issue!

@ffulin
Copy link
Contributor

ffulin commented Dec 27, 2021

Although the core (and most severe issue is fixed), some problems still remain.
The following assert can still trigger:

ASSERT( distState == Job::DIST_RACING );

followed by this assert:

ASSERT( Job::GetTotalLocalDataMemoryUsage() == 0 );

I will continue to investigate this.

@ffulin
Copy link
Contributor

ffulin commented Jan 8, 2022

I've fixed two more problems (c51a009):

  • a build could fail without error when a remote race was returned if the job was returned at a specific point during the completion of the same job locally
  • a memory leak which could occur if a remote race was lost and returned during a specific point during local completion of the job

With these issues fixed, I think this issue can finally be considered resolved. There changes will also be in the next release (v1.06). I'll leave this issue open until that version is released.

@ffulin
Copy link
Contributor

ffulin commented Apr 15, 2022

v1.06 has now been released. I believe all the problems covered by this issue are resolved in that release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants