External Worker affinity on ATCs #2312

Closed
jama22 opened this issue Jun 21, 2018 · 13 comments
Labels
efficiency · enhancement · ops/dayN · release/documented (Documentation and release notes have been updated.) · size/medium (An easily manageable amount of work. Well-defined scope, few unknowns.)


jama22 commented Jun 21, 2018

We've observed situations where external workers remain unusually attached to a single web node (ATC/TSA) even after an additional web node is added. The re-balancing of workers doesn't seem to happen as it should.

cc @topherbullock

@jama22 jama22 created this issue from a note in Runtime (Backlog) Jun 21, 2018

topherbullock commented Jul 11, 2018

When registering externally, the worker proxies its Garden and Baggageclaim connections through the TSA. This lets operators only worry about ingress from Worker -> TSA when configuring external workers.

When registering externally in "forward mode", the worker registers the TSA's proxied addresses as its Baggageclaim and Garden addresses. The ATC needs the worker to keep that connection alive, because if the TCP connection closes in the middle of the ATC using it, panic will ensue! CHAOS! ... basically, the ATC won't be able to talk to workers if they're in the middle of a rebalance that closes off the connection to the TSA.
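
For illustration, a minimal sketch of forward mode, assuming the tunnel is plain SSH remote forwarding (addresses, ports, and auth are hypothetical; this is not Concourse's actual beacon code): the worker dials the TSA once, asks it to listen on a TSA-side address, and pipes any connection the ATC makes to that address back to the local Garden server. If this one TCP connection drops mid-use, every proxied Garden/Baggageclaim stream drops with it.

```go
package main

import (
	"io"
	"log"
	"net"

	"golang.org/x/crypto/ssh"
)

func main() {
	config := &ssh.ClientConfig{
		User:            "worker",
		Auth:            []ssh.AuthMethod{ /* worker private key */ },
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // sketch only; verify host keys in real life
	}

	// One TCP connection to the TSA; all proxied traffic rides on it.
	client, err := ssh.Dial("tcp", "tsa.example.com:2222", config)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Ask the TSA to listen on its side. This TSA-side address is what
	// the worker registers as its Garden address.
	remote, err := client.Listen("tcp", "0.0.0.0:7777")
	if err != nil {
		log.Fatal(err)
	}

	for {
		conn, err := remote.Accept()
		if err != nil {
			log.Fatal(err) // tunnel closed: the ATC can no longer reach this worker
		}
		go func(c net.Conn) {
			defer c.Close()
			local, err := net.Dial("tcp", "127.0.0.1:7777") // local Garden server
			if err != nil {
				return
			}
			defer local.Close()
			go io.Copy(local, c)
			io.Copy(c, local)
		}(conn)
	}
}
```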


@keithkroeger

Just a few related questions:

• When do we need to consider using 4 ATCs?
• Is there a downside to having more ATCs, e.g. more contention? Would it be better to scale up to 3 ATCs only when doing a deployment and then scale back down?
• We see an ‘expired’ column in the workers table. What is it for? We believe it means the worker must re-register with the ATC by that time or be listed as stalled, but is this the case?
• Can we force rebalancing without recreating the worker? (Restarting the worker doesn’t work unless we prune, perhaps.)


topherbullock commented Aug 7, 2018

Considerations here:

  • we need a way to "drain" when there are active connections (OR DO WE?!?!? ... could the current scheduler just be resilient to that?)
  • should this be a configurable timeout for the workers' connection with the TSA? (see the sketch after this list)
  • what's the cost of doing this all the time, regardless of whether there are many TSAs?
  • ideally this should be zero downtime (swap over to the new connection?)
  • should the worker be in some new state while it is rebalancing?
  • does the registration of the worker update the addresses if the state doesn't change?
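
On the configurable-timeout bullet: a minimal sketch, assuming the worker's connection to the TSA is backed by a *net.TCPConn (the function name and parameters are hypothetical, not the actual worker code). TCP keepalives detect a dead peer, and an idle deadline keeps a stale connection from lingering forever:

```go
package tsaconn

import (
	"fmt"
	"net"
	"time"
)

// configureConn enables TCP keepalives on the worker->TSA connection so a
// dead peer is detected, and sets an idle deadline so a stale connection
// eventually errors out instead of hanging the ATC mid-request.
func configureConn(c net.Conn, keepAlivePeriod, idleTimeout time.Duration) error {
	tcp, ok := c.(*net.TCPConn)
	if !ok {
		return fmt.Errorf("expected *net.TCPConn, got %T", c)
	}
	if err := tcp.SetKeepAlive(true); err != nil {
		return err
	}
	if err := tcp.SetKeepAlivePeriod(keepAlivePeriod); err != nil {
		return err
	}
	// Reads/writes after this deadline fail, surfacing the dead tunnel
	// to the caller rather than blocking forever.
	return tcp.SetDeadline(time.Now().Add(idleTimeout))
}
```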

@topherbullock topherbullock added the size/medium An easily manageable amount of work. Well-defined scope, few unknowns. label Aug 7, 2018
@xtremerui xtremerui moved this from Backlog to In Flight in Runtime Aug 7, 2018
@xtremerui (Contributor)

Hi @keithkroeger, when you were seeing all workers registered through one ATC, what was the resource consumption on the ATC nodes, e.g. CPU, memory, and network throughput? Thanks.

@xtreme-sameer-vohra (Contributor)

Branch - atc-affinity#2312

Summarizing 10 Failures:

[Fail] [#129726011] Worker landing with one worker restarting the worker with an interruptible build in-flight [It] does not wait for the build
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:133

[Fail] [#129726011] Worker landing with one worker restarting the worker with volumes and containers present [It] keeps volumes and containers after restart
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:112

[Fail] [#129726011] Worker landing with a single team worker restarting the worker with an interruptible build in-flight [It] does not wait for the build
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:133

[Fail] [#129726011] Worker landing with a single team worker restarting the worker with volumes and containers present [It] keeps volumes and containers after restart
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:104

[Fail] [AfterEach] an ATC with default resource limits set respects the default resource limits, overridding when specified
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] :life Garbage collecting resource cache volumes A resource that was removed from pipeline has its resource cache, resource cache uses and resource cache volumes cleared out
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] :life Garbage collecting resource cache volumes A resource in paused pipeline has its resource cache, resource cache uses and resource cache volumes cleared out
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] :life [#136140165] Container scope when the container is scoped to a team is only hijackable by someone in that team
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [BeforeEach] Hijacked containers does not delete hijacked resource containers from the database, and sets a 5 minute TTL on the container in garden
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] A build using an image_resource one-off builds does not garbage-collect the image immediately
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

Ran 60 of 76 Specs in 17828.034 seconds
FAIL! -- 50 Passed | 10 Failed | 0 Pending | 16 Skipped

Ginkgo ran 1 suite in 4h57m11.092423768s
Test Suite Failed

@jama22 jama22 added the paused label Aug 13, 2018
@jama22 jama22 removed the paused label Aug 20, 2018
@jama22 jama22 removed this from In Flight in Runtime Aug 27, 2018
@jama22 jama22 added this to Icebox in Operations via automation Aug 27, 2018
@jama22 jama22 moved this from Icebox to In Flight in Operations Aug 27, 2018
@keithkroeger

Hello @xtremerui,
We see CPU and memory (and goroutine counts) spike on the one ATC before it craters, while the other remains almost unused, strangely.

No information about network comes to mind, unfortunately.

@jama22 jama22 moved this from In Flight to Backlog in Operations Aug 30, 2018
@jama22 jama22 moved this from Backlog to In Flight in Operations Aug 30, 2018

YoussB commented Sep 12, 2018

Changes that will be required on the ATC:

  • drop the addr_when_running constraint on the workers table.
  • heartbeat should NOT update the Garden and Baggageclaim addresses.
  • stalled workers should not have their Garden and Baggageclaim addresses set to nil.

@xtreme-sameer-vohra (Contributor)

  • We're modifying the workers table constraint addr_when_running:
    • from:

      ADD CONSTRAINT "addr_when_running" CHECK (
        ((state <> 'stalled'::worker_state) AND (state <> 'landed'::worker_state)
          AND ((addr IS NOT NULL) OR (baggageclaim_url IS NOT NULL)))
        OR (((state = 'stalled'::worker_state) OR (state = 'landed'::worker_state))
          AND (addr IS NULL) AND (baggageclaim_url IS NULL))
      );

    • to:

      ADD CONSTRAINT "addr_when_running" CHECK (
        (state <> 'stalled'::worker_state) AND (state <> 'landed'::worker_state)
        AND ((addr IS NOT NULL) OR (baggageclaim_url IS NOT NULL))
      );

  • To write a down migration for this change, we'll either delete all the records that don't satisfy the restored constraint, or truncate the whole workers table (which would truncate other tables as well). This shouldn't be a huge problem, since the workers will eventually re-register themselves.

Any thoughts on that?
@vito @topherbullock
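
For illustration only, a sketch of the down migration described above, assuming a standard database/sql transaction against PostgreSQL (the function name and wiring are hypothetical, not the actual Concourse migration):

```go
package migrations

import "database/sql"

// DownAddrWhenRunning sketches the down migration: rows that would violate
// the restored constraint are deleted first, then the relaxed constraint is
// swapped back for the original. Workers re-register on their next
// heartbeat, so the deleted rows repopulate themselves.
func DownAddrWhenRunning(tx *sql.Tx) error {
	_, err := tx.Exec(`
		DELETE FROM workers
		WHERE NOT (
			((state <> 'stalled'::worker_state AND state <> 'landed'::worker_state)
				AND (addr IS NOT NULL OR baggageclaim_url IS NOT NULL))
			OR ((state = 'stalled'::worker_state OR state = 'landed'::worker_state)
				AND addr IS NULL AND baggageclaim_url IS NULL)
		)`)
	if err != nil {
		return err
	}

	_, err = tx.Exec(`ALTER TABLE workers DROP CONSTRAINT addr_when_running`)
	if err != nil {
		return err
	}

	_, err = tx.Exec(`
		ALTER TABLE workers ADD CONSTRAINT addr_when_running CHECK (
			((state <> 'stalled'::worker_state AND state <> 'landed'::worker_state)
				AND (addr IS NOT NULL OR baggageclaim_url IS NOT NULL))
			OR ((state = 'stalled'::worker_state OR state = 'landed'::worker_state)
				AND addr IS NULL AND baggageclaim_url IS NULL)
		)`)
	return err
}
```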

@xtreme-sameer-vohra (Contributor)

As part of this feature, we will be introducing a few noteworthy parameters to the worker.

Configurable by the user:

- rebalanceTime: the interval at which the worker creates a new connection. A value of 0 means the worker will not create additional connections.
	Default value: 0

- idleConnectionTimeout: the time after which a stale connection times out if no data is sent on it (e.g. the connection isn't being used to pull build logs). For comparison, the AWS LB default idle timeout is 60s, and the maximum is 4000s. Note that stale connections have heartbeating turned off.
	Default value: 1 hr

NOT configurable by the user:

- maxConnections: the maximum number of connections the worker will create to the TSA(s). We can use Alex's suggestion and set this to max(5, number of TSA addresses provided to the worker).

Let us know if you have any thoughts or concerns.
@vito @topherbullock
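
A rough sketch of the rebalancing loop these parameters imply (names and structure hypothetical, not the actual worker code): every rebalanceTime a fresh TSA connection is dialed and registered, and once more than maxConnections are open, the oldest goes stale and is closed after idleConnectionTimeout so in-flight streams can finish (a plain timer stands in here for the real idle-based timeout).

```go
package main

import (
	"log"
	"time"
)

// conn stands in for a live worker->TSA registration; close tears it down.
type conn struct{ close func() }

// dialTSA is hypothetical: it establishes and registers a new connection.
func dialTSA() (*conn, error) { return &conn{close: func() {}}, nil }

const maxConnections = 5

func rebalance(rebalanceTime, idleConnectionTimeout time.Duration) {
	if rebalanceTime == 0 {
		return // feature disabled: the initial connection is kept forever
	}

	var conns []*conn
	ticker := time.NewTicker(rebalanceTime)
	defer ticker.Stop()

	for range ticker.C {
		c, err := dialTSA() // the new connection takes over registration
		if err != nil {
			log.Println("rebalance failed, keeping current connections:", err)
			continue
		}
		conns = append(conns, c)

		// Older connections go stale but stay open so in-flight streams
		// (e.g. build logs) can finish; idleConnectionTimeout reaps them.
		if len(conns) > maxConnections {
			oldest := conns[0]
			conns = conns[1:]
			time.AfterFunc(idleConnectionTimeout, oldest.close)
		}
	}
}

func main() {
	rebalance(4*time.Hour, time.Hour)
}
```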

xtreme-sameer-vohra pushed a commit that referenced this issue Sep 21, 2018
    - drainer will check process every second
      This addresses a pre-existing race condition

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
xtreme-sameer-vohra pushed a commit that referenced this issue Sep 24, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Sep 25, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
xtreme-sameer-vohra pushed a commit that referenced this issue Sep 27, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 1, 2018
    - drainer will check process every second
      This addresses a pre-existing race condition

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 1, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 1, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
@xtreme-sameer-vohra (Contributor)

We updated the properties to simplify things, as follows.

Configurable by the user:

- rebalanceTime: the interval at which the worker creates a new connection. A value of 0 means the worker will not create additional connections. This value is also used as the idleConnectionTimeout for connections between the worker and the TSA.
	Default value: 0

maxConnections is set to 5. It is unlikely that multiple TSA addresses will be provided to a worker in the `forwarded` case, as these TSAs would generally sit behind a load balancer.

Operations automation moved this from In Flight to Done Oct 1, 2018

marco-m commented Oct 1, 2018

Hello @xtreme-sameer-vohra, do the various commits also contain some documentation for us poor users? :-)

YoussB added a commit that referenced this issue Oct 2, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB pushed a commit that referenced this issue Oct 2, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
@xtreme-sameer-vohra (Contributor)

Hey @marco-m,
Yep, it's available here.

YoussB added a commit that referenced this issue Oct 3, 2018
- beacon -> land, retire, delete

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 3, 2018
- fix keepalive strict checking for tcpConn

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
vito pushed a commit that referenced this issue Oct 23, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
cirocosta pushed a commit to concourse/concourse-bosh-deployment that referenced this issue Jan 11, 2019
This commit adds an ops file so that users can make use
of the configurable worker rebalancing interval.

concourse/concourse#2312

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
cirocosta pushed a commit to cirocosta/charts that referenced this issue Jan 18, 2019
- Make use of the worker healthcheck endpoint

	Previously, the liveness probe of the worker was based on logs, which
	could end up killing the whole worker in the case of a malformed
	pipeline.

	Now, making use of the native concourse healthchecking that the workers
	provide, we can delegate to concourse the task of telling k8s if it's
	alive or not.

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
	Signed-off-by: Saman Alvi <salvi@pivotal.io>

- Make use of signals to terminate workers

	Instead of making use of `concourse retire-worker` which initiates a
	connection to the TSA to then retire the worker, we can instead make use
	of the newly introduced mechanism of sending signals to the worker to
	tell it to retire, being less error-prone.

	The idea of this commit is to do similar to what's done for Concourse's
	official BOSH releases (see
	concourse/concourse-bosh-release@a3ebf6a?diff=split)

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Removes GARDEN_ env in favor of config file

	On the 5.x series of Concourse, there's no need to have all the garden
	flags specified as environment variables anymore.

	That's because the new image assumes that a `gdn` binary is shipped
	together, which can look for a set of configurations from a specific
	config file.

	The set of possible values that can be used in the configuration file
	can be found here [1].

	[1]: https://github.com/cloudfoundry/guardian/blob/c1f268e69cd204e891f29bb020e32284a0054606/gqt/runner/runner.go#L41-L93

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Removes references to fatal errors

	Previously, the Helm chart made use of a set of possible fatal errors
	that either `garden` or `baggageclaim` would produce, terminating the
	worker pod in those cases.

	Now, making use of the worker probing endpoint (see [1]), we're able to
	implement better strategies for determining whether the worker is up or
	not while not changing the contract that `health_ip:health_port` gives
	back the info that the worker is alive or not.

	[1]: concourse/concourse@c3b26a0

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Updates env to match concourse 5.x

	There were some changes to some variables in the next release of
	concourse.

- Introduces bitbucket cloud auth variables

	In Concourse 5 it becomes possible to make use of Bitbucket cloud as an
	authenticator (see [1]).

	This commit includes the variables necessary for doing so, as well as
	the necessary keys under `secrets` to have those variables injected into
	web's environment.

	[1]: concourse/concourse#2631

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Add captureErrorMetrics and rebalanceInterval flags

	- RebalanceInterval: concourse/concourse#2312
	- CaptureErrorMetrics: concourse/concourse#2754

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@vito vito added enhancement and removed bug labels Jan 22, 2019
@vito vito added this to the v5.0.0 milestone Jan 22, 2019

vito commented Jan 22, 2019

Re-branding this as an enhancement - we never supported worker rebalancing, so this was a whole new feature, not a misbehavior.

@vito vito added the release/documented Documentation and release notes have been updated. label Jan 22, 2019
cirocosta pushed a commit to cirocosta/charts that referenced this issue Mar 1, 2019
- Removes flags in web cmd and add missing flags
- Introduce TSA_DEBUG* variables
- uses *bind for healthcheck variables
- update worker debug flags
- update debug-bind-* variables for baggageclaim

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>