Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race condition in build wait timers for substantiated workers #6344

Merged
merged 1 commit into from Jan 31, 2022

Conversation

RyanDraves
Copy link

Patched an issue where latent workers that have already been substantiated will start a build_wait_timer during substantiation of a new build. This timer starts after the build_wait_timer is cleared from a new build arriving, disconnecting the worker build_wait_timeout seconds later, mid-build or not.

Contributor Checklist:

  • I have updated the unit tests
  • I have created a file in the newsfragments directory (and read the README.txt in that directory)
  • I have updated the appropriate documentation
    • added a nice comment

@codecov
Copy link

codecov bot commented Jan 25, 2022

Codecov Report

Merging #6344 (ad4989f) into master (2497d2b) will decrease coverage by 0.09%.
The diff coverage is n/a.

❗ Current head ad4989f differs from pull request most recent head 1592ac2. Consider uploading reports for the commit 1592ac2 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6344      +/-   ##
==========================================
- Coverage   91.99%   91.89%   -0.10%     
==========================================
  Files         335      335              
  Lines       37583    37582       -1     
==========================================
- Hits        34573    34536      -37     
- Misses       3010     3046      +36     
Impacted Files Coverage Δ
master/buildbot/worker/latent.py 96.70% <ø> (-0.02%) ⬇️
master/buildbot/__init__.py 76.92% <0.00%> (-15.39%) ⬇️
worker/buildbot_worker/__init__.py 77.61% <0.00%> (-14.93%) ⬇️
master/buildbot/db/test_results.py 89.54% <0.00%> (-6.54%) ⬇️
master/buildbot/db/enginestrategy.py 69.10% <0.00%> (-6.51%) ⬇️
master/buildbot/db/connector.py 96.51% <0.00%> (-3.49%) ⬇️
master/buildbot/util/queue.py 96.51% <0.00%> (+5.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2497d2b...1592ac2. Read the comment docs.

@@ -236,7 +236,9 @@ def substantiate(self, wfb, build):
return defer.succeed(False)

if self.state == States.SUBSTANTIATED and self.conn is not None:
self._setBuildWaitTimer()
# Skip the wait timer if a build is already queued
if not self.building:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is equivalent to deleting self._setBuildWaitTimer(). self.building checks whether at least one of WorkerForBuilder instances associated with the worker are busy (via isBusy). They will necessarily be busy, because at least one got its buildStarted called in the path to Build.startBuild which eventually indirectly calls LatentWorker.substantiate().

However, this also means that _setBuildWaitTimer is completely unnecessary here. If buildStarted is guaranteed to be called when we are here, then regardless of what happens buildFinished will be called too for that particular build. That is the place for _setBuildWaitTimer which we already do there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all, thanks for the PR, it identified an important bug and did 95% of the work fixing it. I'll finish the polishing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me, thanks. Do you think self.building / isBusy should be a part of the state machine? I found that lack of information in the worker's state to be terribly confusing to debug, which led to me exploring way too much of the call tree to find the root issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.building is quite separate from the internal state machine in that the states aren't dependent on each other. If we merge them then the state machine would grow double in size.

I agree that the latent worker itself is quite confusing. It became such once all edge cases got covered. It seems that there should be a way to refactor everything for simplicity, but nobody found any such way yet.

@p12tic p12tic force-pushed the feature/latent-race-condition branch from d191fa8 to 05b507b Compare January 31, 2022 18:00
Calling _setBuildWaitTimer() here in substantiate() may cause the worker
to be insubstantiated later while build is running. Three places where
the timer is reset are _setBuildWaitTimer() itself, stopService() and
buildStarted(). The first two are not interesting and buildStarted() has
already been called for a particular build when we are in
substantiate():

- buildStarted() is called via WorkerForBuilder.buildStarted() which is
called from Builder._startBuildFor()
- substantiate() is called via
LatentWorkerForBuilder.substantiate_if_needed() via Build.startBuild()
from Builder._startBuildFor().

Once the current build is finished for any reason, _setBuildWaitTimer()
will be called from buildFinished(). Thus calling _setBuildWaitTimer()
from substantiate() is not needed.

Commit adjusted by: Povilas Kanapickas <povilas@radix.lt>
@p12tic
Copy link
Member

p12tic commented Jan 31, 2022

This PR fixes #5988.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants