Builds don't get scheduled even when worker is idle #3661

Closed
aj062 opened this Issue Sep 29, 2017 · 10 comments

2 participants

aj062 commented Sep 29, 2017

I have noticed many times that some workers stay idle for a long time, even when the corresponding builder has a lot of pending builds. The builder has just one worker associated with it, and that worker is idle (not running any build). Still, builds do not get scheduled on the worker even though the builder has pending buildrequests.

twistd.log on the worker and the master doesn't show any errors, just connection events, e.g.: "Worker bot11 attached to mybuilder-release-builder".

master running: Buildbot 0.9.11, Twisted: 17.5.0, Python: 2.7.5
worker running: buildbot-worker 0.9.11 or buildslave-0.8.12

What can I do to debug this further?

aj062 commented Sep 29, 2017

Can this issue be caused by not doing a clean "buildbot stop"?
#3535

Maybe buildbot still thinks that the worker is building something.

Is there a way to check in the database (Postgres) what the current state of the worker is, to debug this further?
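
One way to look for leftover state is to check for buildrequests that are still claimed by a master but never completed. The sketch below uses an in-memory SQLite mirror of a small subset of the Buildbot 0.9.x schema (table and column names as I understand them — verify against your actual Postgres database before running the SELECT there):

```python
import sqlite3

# Toy in-memory mirror of (a subset of) the Buildbot 0.9.x schema, used only
# to illustrate the query; run the SELECT against the real Postgres DB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, builderid INTEGER, complete INTEGER);
CREATE TABLE buildrequest_claims (brid INTEGER, masterid INTEGER, claimed_at INTEGER);
CREATE TABLE builds (id INTEGER PRIMARY KEY, buildrequestid INTEGER, workerid INTEGER, complete_at INTEGER);
""")
# Request 1: claimed long ago, never completed, no build row -- a candidate
# stale claim left behind by an unclean "buildbot stop".
conn.execute("INSERT INTO buildrequests VALUES (1, 10, 0)")
conn.execute("INSERT INTO buildrequest_claims VALUES (1, 1, 1506000000)")
# Request 2: claimed and completed normally.
conn.execute("INSERT INTO buildrequests VALUES (2, 10, 1)")
conn.execute("INSERT INTO buildrequest_claims VALUES (2, 1, 1506000100)")

# Claimed but incomplete buildrequests with no running build attached.
stale = conn.execute("""
    SELECT br.id, c.masterid, c.claimed_at
      FROM buildrequests br
      JOIN buildrequest_claims c ON c.brid = br.id
     WHERE br.complete = 0
       AND br.id NOT IN (SELECT buildrequestid FROM builds
                          WHERE complete_at IS NULL)
""").fetchall()
print(stale)  # -> [(1, 1, 1506000000)]
```

If such rows exist after a crash, deleting the corresponding `buildrequest_claims` rows (with the master stopped) is one way people unwedge them, but back up the DB first.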

aj062 commented Oct 6, 2017

Looking at https://git.io/vdEAm, I think there are multiple issues. This happens more frequently with a large number of builders.

  1. There is a delay (~5-30 s) between each iteration of activityLoop (probably acquiring locks, or _maybeStartBuildsOnBuilder (https://git.io/vdEs5) takes time).
  2. _pending_builders is not processed in FIFO order, so activityLoop might never reach some of the builders, because it just keeps processing new builders added to _pending_builders. It seems to be sorted here: https://git.io/vdEGL
  3. Efficiency: buildrequestdistributor processes just one builder per iteration of activityLoop (https://git.io/vdEGY); it could probably process several.
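
Point 2 can be shown with a toy model (hypothetical builder names, not Buildbot code): if each iteration pops one entry from a list that is re-sorted whenever a frequently-triggered builder is re-queued, a builder that sorts last can be starved forever.

```python
# Toy model of the activityLoop behaviour described in point 2: each
# iteration pops one builder from a re-sorted _pending_builders, so a
# builder that sorts last is starved indefinitely.
pending = sorted(["aaa-ci", "zzz-release"])
processed = []

for _ in range(10):                      # ten scheduler iterations
    builder = pending.pop(0)             # loop takes the first entry
    processed.append(builder)
    # Meanwhile a frequently-triggered builder gets re-queued and the
    # list is re-sorted, pushing "zzz-release" to the back again.
    if "aaa-ci" not in pending:
        pending.append("aaa-ci")
    pending.sort()

print(processed.count("aaa-ci"))         # -> 10
print("zzz-release" in processed)        # -> False
```

With FIFO processing (or draining the whole list before re-sorting), "zzz-release" would be reached after at most one other builder.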
Member

tardyp commented Oct 8, 2017

There is a delay (~5-30 s)

I think it might be related to the discussion we had a little while ago:
#3395
How many pending buildrequests do you have?
Is it a bit better if you flush out the old ones?

It seems to be sorted here:

I think it is sorted only when there is a new buildrequest or a finished build (i.e. a worker becomes available).

one builder per iteration

Well, processing several builders per iteration would just add a second loop, and the processing sequence would actually be the same.
pending_builders.pop(0) actually pops the processed builder, so the next loop will process the rest, unless another sort happens in between.
Maybe this is indeed our problem, and we should keep pending_builders locked until it is empty.

aj062 commented Oct 8, 2017

How many pending buildrequests do you have?

~10,000 pending buildrequests.

I think it is sorted only when there is a new buildrequest or a finished build (a new worker available)

New buildrequests keep coming because of frequent commits in the repository (or triggers from one builder to another). Builds also keep finishing frequently, as I have 200+ builds running in parallel most of the time.

processing several builders per iteration would just add a second loop, and this will actually be the same sequence of processing.

Agreed, the sequence will logically be the same. However, the locks wouldn't be acquired again and resetPendingBuildersList wouldn't be called in the meantime.

Maybe indeed this is our problem and we should have the pending_builder locked until it is empty.

Yes, I completely agree.
In fact, that's the workaround I applied on my instance for the time being, and it is working fine: calling pending_builders.pop(0) in a loop and invoking _maybeStartBuildsOnBuilder() on each builder (although I did it using manhole).

This would remove any dependency on the number of pending buildrequests as well.
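
The drain-until-empty workaround described above could look roughly like this. `Distributor` here is a stand-in for Buildbot's internal BuildRequestDistributor (which the comment reaches via manhole); the real attribute and method names are internal APIs and may differ between versions:

```python
# Sketch of the drain-until-empty workaround. "Distributor" is a stub
# standing in for Buildbot's BuildRequestDistributor; _pending_builders
# and _maybeStartBuildsOnBuilder mirror the internal names discussed in
# this thread but are not a stable API.
class Distributor:
    def __init__(self, pending):
        self._pending_builders = list(pending)
        self.started = []

    def _maybeStartBuildsOnBuilder(self, name):
        self.started.append(name)   # the real method starts builds

def drain(brd):
    # Process every currently-pending builder before anything can
    # re-sort or repopulate the list, so no builder is starved.
    while brd._pending_builders:
        brd._maybeStartBuildsOnBuilder(brd._pending_builders.pop(0))

brd = Distributor(["builder-a", "builder-b", "builder-c"])
drain(brd)
print(brd.started)  # -> ['builder-a', 'builder-b', 'builder-c']
```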

Member

tardyp commented Oct 9, 2017

~10,000 pending buildrequests.

You should definitely have a working collapser if your master can't keep up with the requests. 10k buildrequests is too many, as we already discussed.

In fact that's the workaround I did on my instance for the time-being and it is working fine (calling pending_builders.pop(0) in a loop and invoking _maybeStartBuildsOnBuilder() on them, although I did it using manhole).

I prepared #3685 as a PoC for this issue.

This would remove any dependency on number of pending buildrequests as well.

Probably not fully remove it, as we still have a large DB inefficiency when fetching the buildrequest list in the distributor.
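
For reference, enabling a collapser is a one-line master.cfg change. This fragment follows the `collapseRequests` option as documented for Buildbot 0.9.x (check your version's manual; the option also accepts a callable for a custom per-pair policy):

```python
# master.cfg fragment (sketch): merge redundant pending buildrequests so
# the queue stays small enough for the distributor to keep up.
c = BuildmasterConfig = {}

# Simplest form: collapse compatible requests master-wide. The option can
# also be set per-builder on BuilderConfig, or replaced with a callable.
c['collapseRequests'] = True
```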

Member

tardyp commented Nov 7, 2017

@aj062 did you try #3685?

aj062 commented Nov 7, 2017

@tardyp Yes, it works fine.

I still see delays, though (a worker being idle for up to 30 minutes), because of the long list of pending_builders and the slow processing of each builder. It might be related to the number of pending buildrequests; I currently have ~2,000 pending buildrequests.

However, this change ensures that each builder is processed in proper order, unlike before, when some were starved indefinitely.

aj062 commented Sep 14, 2018

@tardyp, I did some profiling, and it seems that the buildbot process is doing a lot of things and spending a lot of time in SQLAlchemy operations. Does that mean the process gets very little time to spend on the buildrequest distributor, or does the BRD run on a separate thread/core?

Here is the icicle graph (generated using py-spy): https://www.icloud.com/iclouddrive/0cB1xo8yMKXioR1jjAz4YH94Q#profile%5F2018-09-11-1403.svg
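
On the thread/core question: Twisted's reactor runs all callbacks on a single thread (DB queries themselves go through a threadpool, but result processing comes back to the main thread). The sketch below illustrates the effect with asyncio, used here only as an analogue of the reactor: blocking work on the loop thread delays every other scheduled callback.

```python
import asyncio
import time

# Illustration of a single-threaded event loop (asyncio standing in for
# Twisted's reactor): CPU-bound/blocking work on the loop thread delays
# every other scheduled callback, e.g. the distributor's next run.
async def main():
    loop = asyncio.get_running_loop()
    start = loop.time()
    fired = []
    # A callback that should run after 10 ms (stand-in for the BRD).
    loop.call_later(0.01, lambda: fired.append(loop.time() - start))

    time.sleep(0.2)            # blocking "DB/ORM work" on the loop thread
    await asyncio.sleep(0.05)  # yield so the (now overdue) callback runs
    return fired

delays = asyncio.run(main())
print(delays[0] > 0.1)  # -> True: the 10 ms callback fired ~200 ms late
```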

aj062 commented Sep 14, 2018

Also, is there a way to configure a multi-master setup with one master running just the buildrequest distributor?

Member

tardyp commented Sep 15, 2018

There is also buildbot_profiler, which generates similar graphs: https://github.com/tardyp/buildbot_profiler

The build request distributor must run on the same master as the workers it is distributing work to.
This is related to #3395; no need to create more issues about it.

@tardyp tardyp closed this Sep 15, 2018
