buildbot stop is not reliable in buildbot 0.9 #3535

Closed
aj062 opened this issue Aug 16, 2017 · 17 comments

Comments

@aj062
Contributor

aj062 commented Aug 16, 2017

I have noticed multiple times that buildbot stop never finishes. Running 'buildbot stop' again results in the error "twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running". The only option left at that point is "kill -9 pid".

'buildbot stop' should be more reliable.
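Until that is the case, the manual "kill -9 pid" fallback could be scripted roughly as follows. This is only a sketch and not part of Buildbot: the function name `stop_with_escalation`, its parameters, and the returned strings are all made up for illustration, and it assumes a POSIX system where the master's pid is already known.

```python
import os
import signal
import time


def stop_with_escalation(pid, timeout=60.0, poll=0.5, kill=os.kill, sleep=time.sleep):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    Hypothetical helper (not a Buildbot API). Returns "already-gone",
    "terminated" or "killed". `kill` and `sleep` are injectable for testing.
    """
    try:
        kill(pid, signal.SIGTERM)  # the polite request
    except ProcessLookupError:
        return "already-gone"

    waited = 0.0
    while waited < timeout:
        sleep(poll)
        waited += poll
        try:
            kill(pid, 0)  # signal 0 only checks that the process still exists
        except ProcessLookupError:
            return "terminated"

    try:
        kill(pid, signal.SIGKILL)  # last resort, same as `kill -9 pid`
    except ProcessLookupError:
        return "terminated"
    return "killed"
```

SIGKILL gives the master no chance to clean up, so a generous timeout is the safer choice.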

For example, below are logs from a test instance (single master) which was not running any build at the time of 'buildbot stop'. Even so, 'buildbot stop' never finished, and issuing another 'buildbot stop' didn't help either.

2017-08-16 09:42:23-0700 [-] Received SIGTERM, shutting down.
2017-08-16 09:42:23-0700 [-] doing housekeeping for master 1 build.hostname.com:[path_here]
2017-08-16 09:42:24-0700 [-] Initiating clean shutdown
2017-08-16 09:42:24-0700 [-] No running jobs, starting shutdown immediately
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:24-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:25-0700 [-] (ignored) while invoking Triggerable schedulers:
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: Triggerable scheduler stopped before build was complete
	
2017-08-16 09:42:26-0700 [-] (TCP Port 8010 Closed)
2017-08-16 09:42:26-0700 [-] Stopping factory <buildbot.www.service.BuildbotSite instance at 0x7ff4580b0dd0>
2017-08-16 09:42:26-0700 [-] (TCP Port 9989 Closed)
2017-08-16 09:42:26-0700 [-] Stopping factory <twisted.spread.pb.PBServerFactory instance at 0x3545878>



2017-08-16 09:51:21-0700 [-] Received SIGTERM, shutting down.
2017-08-16 09:51:21-0700 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 396, in startReactor
	    self.config, oldstdout, oldstderr, self.profiler, reactor)
	  File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 311, in runReactorWithLogging
	    reactor.run()
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1243, in run
	    self.mainLoop()
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1252, in mainLoop
	    self.runUntilCurrent()
	--- <exception caught here> ---
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 851, in runUntilCurrent
	    f(*a, **kw)
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 627, in stop
	    "Can't stop reactor that isn't running.")
	twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.
@tardyp
Member

tardyp commented Aug 28, 2017

This is a known issue related to txrequests and the cleanup of txrequests threads. I was never able to find a reliable method to shut down the threads correctly: sometimes it works, sometimes it doesn't.

This is why I added support for treq, which doesn't use threads. You may switch to treq as the backend for httpclientservice.

pip install treq
pip uninstall txrequests

@aj062
Contributor Author

aj062 commented Sep 29, 2017

We seem to be using treq. (Though I never manually installed either of them. Probably buildbot used treq by default in my case.)

[bash]$ pip freeze | grep txrequests
[bash]$
[bash]$ pip freeze | grep treq
treq==17.7.0

Do you expect this problem with treq as well?
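The same check can be done from Python, in the interpreter the master actually runs under. This is a small sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and it assumes Python 3.8 or newer for `importlib.metadata`.

```python
from importlib import metadata


def installed_versions(names=("treq", "txrequests")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(name, version if version is not None else "not installed")
```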

@seankelly
Member

Are you using any latent workers? EC2 and OpenStack use threads I recall.

@aj062
Contributor Author

aj062 commented Nov 8, 2017

@seankelly no.

@tardyp This problem happens frequently on a heavily loaded master. I think the problem is that even after the buildbot stop command is run, new builds keep starting. So the load doesn't decrease, there is always a long list of pending_builders, and buildbot stop is not able to acquire activity_lock.

The workaround which works for me is:
buildbot stop --clean
#wait for 3-4 minutes and then press control+c
buildbot stop

buildbot stop --clean helps in reducing the load, and then buildbot stop works.
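The two-step workaround could be wrapped in a small script along these lines. It is a sketch under assumptions, not a tested recipe: `clean_then_stop`, `basedir` and `clean_timeout` are names invented here, the master directory is assumed to be passed as a positional argument, and `subprocess.run` kills the timed-out command rather than sending the Ctrl+C used in the manual steps.

```python
import subprocess


def clean_then_stop(basedir, clean_timeout=240, run=subprocess.run):
    """Try `buildbot stop --clean`, then fall back to a plain `buildbot stop`.

    Hypothetical helper. Returns "clean" if the clean stop command returned
    within `clean_timeout` seconds, otherwise "forced". `run` is injectable
    so the logic can be exercised without a real master.
    """
    try:
        run(["buildbot", "stop", "--clean", basedir],
            timeout=clean_timeout, check=False)
        return "clean"
    except subprocess.TimeoutExpired:
        pass  # the clean stop is taking too long; give up waiting for it

    # By now the clean stop should have reduced the load on the master.
    run(["buildbot", "stop", basedir], check=False)
    return "forced"
```

The default of 240 seconds mirrors the 3-4 minute wait described above; a busier master may need longer.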

I think new builds should not start once buildbot stop is issued.

@tardyp
Member

tardyp commented Nov 9, 2017

The workaround which works for me is
[..]
ahah. Interesting findings. Do you still have my change in the brd in your prod?
https://github.com/buildbot/buildbot/pull/3684/files

The activity lock should be held for a much shorter time there.

@aj062
Contributor Author

aj062 commented Feb 16, 2018

@tardyp The workaround I mentioned above (using --clean) has been working perfectly fine for me for the last few months. I just tried a regular "buildbot stop" and it failed again (as expected).

ahah. Interesting findings. Do you still have my change in the brd in your prod?

Yes, I have that change.

Also, I am not sure whether buildbot is waiting for activity_lock or stuck somewhere else. I notice the line "doing housekeeping for master" in the logs. I added a log statement at https://git.io/vdKlC, but it wasn't printed in the logs (at least this time). So buildbot might be stuck in _masterDeactivatedHousekeeping() https://git.io/vAWTf or somewhere else. Note that the number of currently running builds doesn't decrease for me (except for a few builds which finish normally).

Here is the redacted twistd.log: https://goo.gl/NwzJcC

One problem that is surely present is that even after the buildbot stop command is run, new builds keep starting. The behavior of "buildbot stop" should be the same as "buildbot stop --clean", in the sense that new builds shouldn't start after the stop command is issued.
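To make the requested behavior concrete, here is a toy model of a shutdown gate. Nothing in it is Buildbot code: the class `Dispatcher` and its methods are hypothetical, and the sketch only illustrates the idea that once a stop has been requested, pending requests stay queued instead of being started.

```python
class Dispatcher:
    """Toy stand-in for whatever hands pending build requests to workers."""

    def __init__(self):
        self.shutting_down = False
        self.pending = []
        self.running = []

    def submit(self, request):
        # Requests are always accepted; whether they start is decided later.
        self.pending.append(request)

    def begin_shutdown(self):
        # Called as soon as a stop is requested, clean or not.
        self.shutting_down = True

    def start_pending(self):
        """Start everything that is pending, unless a shutdown has begun."""
        if self.shutting_down:
            return []  # leave requests queued for the next master start
        started, self.pending = self.pending, []
        self.running.extend(started)
        return started
```

With such a gate the set of running builds can only shrink after a stop is requested, which is the property the plain stop appears to lack on a busy master.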

@aj062
Contributor Author

aj062 commented Apr 18, 2018

@tardyp, for sure, "buildbot stop --clean" prevents new builds from starting, while "buildbot stop" doesn't. Can you please check why?

@aj062
Contributor Author

aj062 commented Aug 26, 2018

@tardyp any idea why "buildbot stop" doesn't prevent new builds from starting, while "buildbot stop --clean" does?

@seankelly
Member

My guess is that it's because a plain stop instructs Buildbot to stop immediately while a clean stop doesn't. Because a plain stop should be immediate, it's assumed a build won't be able to start. That is less likely to be true on a busy or large Buildbot instance though.

@aj062
Contributor Author

aj062 commented Aug 28, 2018

Makes sense.

Is this the code which prevents new builds from starting (while using "buildbot stop --clean")?

https://github.com/buildbot/buildbot/blob/master/master/buildbot/process/botmaster.py#L105

@seankelly
Member

That highlighted line adds all running builds to a list so they can be monitored. I don't know where builds are queued; it appears to be somewhere other than this file. My guess is the BuildRequestDistributor can't distribute them because it's disowned, so the builds naturally queue.

@openSourceBugs

Wow. 5 year old issue and this is still happening for me. I probably can't use buildbot at this point. I'm not even on section 1.2.3 of the tutorial and open source quality rears its head.

@p12tic
Member

p12tic commented Aug 9, 2022

@openSourceBugs Which version are you using? The reliability of buildbot stop has been quite improved around version 2.10.2 / 3.0 released 1.5 years ago.

Note that you need to run buildbot stop with --no-wait in order for it to not wait for builds to complete.

@openSourceBugs

@openSourceBugs Which version are you using? The reliability of buildbot stop has been quite improved around version 2.10.2 / 3.0 released 1.5 years ago.

Note that you need to run buildbot stop with --no-wait in order for it to not wait for builds to complete.

This was with whatever I pip installed yesterday following the tutorial docs. There was a "waiting for build to stop" status that never went away for 30 minutes after attempting to stop a failed build. I had to stop the build because the tutorial and the code it uses are obsolete (using git:// instead of https) and haven't been updated for at least 5 years. There was a timeout when running the tutorial as it is now. I probably won't be using buildbot moving forward and this will be my last comment on this.

@aj062
Contributor Author

aj062 commented Aug 11, 2022

The reliability of buildbot stop has been quite improved around version 2.10.2 / 3.0 released 1.5 years ago.

Interesting. It would be nice if #6029 could also be fixed.

@mokibit
Collaborator

mokibit commented Aug 24, 2022

For the record, the issue with obsolete code in the tutorial was recently fixed: #6599.

@mokibit
Collaborator

mokibit commented Aug 26, 2022

This issue was fixed in #6620.

p12tic closed this as completed Aug 26, 2022