External Worker affinity on ATCs #2312

Closed
jama22 opened this issue Jun 21, 2018 · 13 comments
Labels
efficiency · enhancement · ops/dayN · release/documented (Documentation and release notes have been updated.) · size/medium (An easily manageable amount of work. Well-defined scope, few unknowns.)


jama22 commented Jun 21, 2018

We've observed situations where external workers remain unusually attached to a single web node (ATC/TSA) even after an additional web node is added. The re-balancing of workers doesn't seem to happen as it should.

cc @topherbullock

@jama22 jama22 created this issue from a note in Runtime (Backlog) Jun 21, 2018

topherbullock commented Jul 11, 2018

When registering externally, the worker proxies its Garden and Baggageclaim connections through the TSA. This lets operators only worry about ingress from Worker -> TSA when configuring external workers.

When registering externally in "forward mode", the worker registers the TSA's proxied addresses as its Baggageclaim and Garden addresses. The ATC needs the worker to keep that connection alive, because if the TCP connection closes in the middle of the ATC using it, panic will ensue! CHAOS! ... basically, the ATC won't be able to talk to workers if they're in the middle of a rebalance that closes off the connection to the TSA.
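
For illustration, a minimal sketch of forward mode, assuming the tunnel is plain SSH remote forwarding (addresses, ports, and auth are hypothetical; this is not Concourse's actual beacon code): the worker dials the TSA once, asks it to listen on a TSA-side address, and pipes any connection the ATC makes to that address back to the local Garden server. If this one TCP connection drops mid-use, every proxied Garden/Baggageclaim stream drops with it.

```go
package main

import (
	"io"
	"log"
	"net"

	"golang.org/x/crypto/ssh"
)

func main() {
	config := &ssh.ClientConfig{
		User:            "worker",
		Auth:            []ssh.AuthMethod{ /* worker private key */ },
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // sketch only; verify host keys in real life
	}

	// One TCP connection to the TSA; all proxied traffic rides on it.
	client, err := ssh.Dial("tcp", "tsa.example.com:2222", config)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Ask the TSA to listen on its side. This TSA-side address is what
	// the worker registers as its Garden address.
	remote, err := client.Listen("tcp", "0.0.0.0:7777")
	if err != nil {
		log.Fatal(err)
	}

	for {
		conn, err := remote.Accept()
		if err != nil {
			log.Fatal(err) // tunnel closed: the ATC can no longer reach this worker
		}
		go func(c net.Conn) {
			defer c.Close()
			local, err := net.Dial("tcp", "127.0.0.1:7777") // local Garden server
			if err != nil {
				return
			}
			defer local.Close()
			go io.Copy(local, c)
			io.Copy(c, local)
		}(conn)
	}
}
```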


@keithkroeger

Just a few related questions:

• When do we need to consider using 4 ATCs?
• Is there a downside to having more ATCs, e.g. more contention? Would it be better to scale up to 3 ATCs only when doing a deployment and then scale back down?
• We see an ‘expired’ column in the workers table. What is it for? We believe it means the worker must re-register with the ATC by that time or be listed as stalled, but is this the case?
• Can we force rebalancing without recreating the worker? (Restarting the worker doesn’t work unless we prune, perhaps.)


topherbullock commented Aug 7, 2018

Considerations here:

  • we need a way to "drain" when there are active connections (OR DO WE?!?!? ... could the current scheduler just be resilient to that?)
  • should this be a configurable timeout for the workers' connection with the TSA? (see the sketch after this list)
  • what's the cost of doing this all the time, regardless of whether there are many TSAs?
  • ideally this should be zero downtime (swap over to the new connection?)
  • should the worker be in some new state while it is rebalancing?
  • does the registration of the worker update the addresses if the state doesn't change?
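
On the configurable-timeout bullet: a minimal sketch, assuming the worker's connection to the TSA is backed by a *net.TCPConn (the function name and parameters are hypothetical, not the actual worker code). TCP keepalives detect a dead peer, and an idle deadline keeps a stale connection from lingering forever:

```go
package tsaconn

import (
	"fmt"
	"net"
	"time"
)

// configureConn enables TCP keepalives on the worker->TSA connection so a
// dead peer is detected, and sets an idle deadline so a stale connection
// eventually errors out instead of hanging the ATC mid-request.
func configureConn(c net.Conn, keepAlivePeriod, idleTimeout time.Duration) error {
	tcp, ok := c.(*net.TCPConn)
	if !ok {
		return fmt.Errorf("expected *net.TCPConn, got %T", c)
	}
	if err := tcp.SetKeepAlive(true); err != nil {
		return err
	}
	if err := tcp.SetKeepAlivePeriod(keepAlivePeriod); err != nil {
		return err
	}
	// Reads/writes after this deadline fail, surfacing the dead tunnel
	// to the caller rather than blocking forever.
	return tcp.SetDeadline(time.Now().Add(idleTimeout))
}
```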

@topherbullock topherbullock added the size/medium An easily manageable amount of work. Well-defined scope, few unknowns. label Aug 7, 2018
@xtremerui xtremerui moved this from Backlog to In Flight in Runtime Aug 7, 2018
@xtremerui (Contributor)

Hi @keithkroeger, when you were seeing all workers registered through one ATC, what was the resource consumption on the ATC nodes, e.g. CPU, memory, and network throughput? Thanks.

@xtreme-sameer-vohra (Contributor)

Branch - atc-affinity#2312

Summarizing 10 Failures:

[Fail] [#129726011] Worker landing with one worker restarting the worker with an interruptible build in-flight [It] does not wait for the build
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:133

[Fail] [#129726011] Worker landing with one worker restarting the worker with volumes and containers present [It] keeps volumes and containers after restart
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:112

[Fail] [#129726011] Worker landing with a single team worker restarting the worker with an interruptible build in-flight [It] does not wait for the build
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:133

[Fail] [#129726011] Worker landing with a single team worker restarting the worker with volumes and containers present [It] keeps volumes and containers after restart
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/worker_landing_test.go:104

[Fail] [AfterEach] an ATC with default resource limits set respects the default resource limits, overridding when specified
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] :life Garbage collecting resource cache volumes A resource that was removed from pipeline has its resource cache, resource cache uses and resource cache volumes cleared out
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] :life Garbage collecting resource cache volumes A resource in paused pipeline has its resource cache, resource cache uses and resource cache volumes cleared out
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] :life [#136140165] Container scope when the container is scoped to a team is only hijackable by someone in that team
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [BeforeEach] Hijacked containers does not delete hijacked resource containers from the database, and sets a 5 minute TTL on the container in garden
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

[Fail] [AfterEach] A build using an image_resource one-off builds does not garbage-collect the image immediately
/Users/pivotal/go/src/github.com/concourse/concourse/src/github.com/concourse/topgun/topgun_suite_test.go:430

Ran 60 of 76 Specs in 17828.034 seconds
FAIL! -- 50 Passed | 10 Failed | 0 Pending | 16 Skipped

Ginkgo ran 1 suite in 4h57m11.092423768s
Test Suite Failed

@jama22 jama22 added the paused label Aug 13, 2018
@jama22 jama22 removed the paused label Aug 20, 2018
@jama22 jama22 removed this from In Flight in Runtime Aug 27, 2018
@jama22 jama22 added this to Icebox in Operations via automation Aug 27, 2018
@jama22 jama22 moved this from Icebox to In Flight in Operations Aug 27, 2018
@keithkroeger

Hello @xtremerui,
We see CPU and memory (and goroutine counts) spike on the one ATC before it craters, while the other remains almost unused, strangely.

No information about network comes to mind, unfortunately.

@jama22 jama22 moved this from In Flight to Backlog in Operations Aug 30, 2018
@jama22 jama22 moved this from Backlog to In Flight in Operations Aug 30, 2018

YoussB commented Sep 12, 2018

Changes that will be required on the ATC:

  • drop the addr_when_running constraint on the workers table.
  • heartbeat should NOT update the Garden and Baggageclaim addresses.
  • stalled workers should not have their Garden and Baggageclaim addresses set to nil.

@xtreme-sameer-vohra (Contributor)

  • We're modifying the workers table constraint addr_when_running:
    • from:

      ADD CONSTRAINT "addr_when_running" CHECK (
        ((state <> 'stalled'::worker_state) AND (state <> 'landed'::worker_state)
          AND ((addr IS NOT NULL) OR (baggageclaim_url IS NOT NULL)))
        OR (((state = 'stalled'::worker_state) OR (state = 'landed'::worker_state))
          AND (addr IS NULL) AND (baggageclaim_url IS NULL))
      );

    • to:

      ADD CONSTRAINT "addr_when_running" CHECK (
        (state <> 'stalled'::worker_state) AND (state <> 'landed'::worker_state)
        AND ((addr IS NOT NULL) OR (baggageclaim_url IS NOT NULL))
      );

  • To write a down migration for this change, we'll either delete all the records that don't satisfy the restored constraint, or truncate the whole workers table (which would truncate other tables as well). This shouldn't be a huge problem, since the workers will eventually re-register themselves.

Any thoughts on that?
@vito @topherbullock
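
For illustration only, a sketch of the down migration described above, assuming a standard database/sql transaction against PostgreSQL (the function name and wiring are hypothetical, not the actual Concourse migration):

```go
package migrations

import "database/sql"

// DownAddrWhenRunning sketches the down migration: rows that would violate
// the restored constraint are deleted first, then the relaxed constraint is
// swapped back for the original. Workers re-register on their next
// heartbeat, so the deleted rows repopulate themselves.
func DownAddrWhenRunning(tx *sql.Tx) error {
	_, err := tx.Exec(`
		DELETE FROM workers
		WHERE NOT (
			((state <> 'stalled'::worker_state AND state <> 'landed'::worker_state)
				AND (addr IS NOT NULL OR baggageclaim_url IS NOT NULL))
			OR ((state = 'stalled'::worker_state OR state = 'landed'::worker_state)
				AND addr IS NULL AND baggageclaim_url IS NULL)
		)`)
	if err != nil {
		return err
	}

	_, err = tx.Exec(`ALTER TABLE workers DROP CONSTRAINT addr_when_running`)
	if err != nil {
		return err
	}

	_, err = tx.Exec(`
		ALTER TABLE workers ADD CONSTRAINT addr_when_running CHECK (
			((state <> 'stalled'::worker_state AND state <> 'landed'::worker_state)
				AND (addr IS NOT NULL OR baggageclaim_url IS NOT NULL))
			OR ((state = 'stalled'::worker_state OR state = 'landed'::worker_state)
				AND addr IS NULL AND baggageclaim_url IS NULL)
		)`)
	return err
}
```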

@xtreme-sameer-vohra (Contributor)

As part of this feature, we will be introducing a few noteworthy parameters to the worker.

Configurable by the user:

- rebalanceTime: the interval at which the worker creates a new connection. A value of 0 means the worker will not create additional connections.
	Default value: 0

- idleConnectionTimeout: the time after which a stale connection times out if no data is sent on it (e.g. the connection isn't being used to pull build logs). For comparison, the AWS LB default idle timeout is 60s, and the maximum is 4000s. Note that stale connections have heartbeating turned off.
	Default value: 1 hr

NOT configurable by the user:

- maxConnections: the maximum number of connections the worker will create to the TSA(s). We can use Alex's suggestion and set this to max(5, number of TSA addresses provided to the worker).

Let us know if you have any thoughts or concerns.
@vito @topherbullock
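
A rough sketch of the rebalancing loop these parameters imply (names and structure hypothetical, not the actual worker code): every rebalanceTime a fresh TSA connection is dialed and registered, and once more than maxConnections are open, the oldest goes stale and is closed after idleConnectionTimeout so in-flight streams can finish (a plain timer stands in here for the real idle-based timeout).

```go
package main

import (
	"log"
	"time"
)

// conn stands in for a live worker->TSA registration; close tears it down.
type conn struct{ close func() }

// dialTSA is hypothetical: it establishes and registers a new connection.
func dialTSA() (*conn, error) { return &conn{close: func() {}}, nil }

const maxConnections = 5

func rebalance(rebalanceTime, idleConnectionTimeout time.Duration) {
	if rebalanceTime == 0 {
		return // feature disabled: the initial connection is kept forever
	}

	var conns []*conn
	ticker := time.NewTicker(rebalanceTime)
	defer ticker.Stop()

	for range ticker.C {
		c, err := dialTSA() // the new connection takes over registration
		if err != nil {
			log.Println("rebalance failed, keeping current connections:", err)
			continue
		}
		conns = append(conns, c)

		// Older connections go stale but stay open so in-flight streams
		// (e.g. build logs) can finish; idleConnectionTimeout reaps them.
		if len(conns) > maxConnections {
			oldest := conns[0]
			conns = conns[1:]
			time.AfterFunc(idleConnectionTimeout, oldest.close)
		}
	}
}

func main() {
	rebalance(4*time.Hour, time.Hour)
}
```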

xtreme-sameer-vohra pushed a commit that referenced this issue Sep 21, 2018
    - drainer will check process every second
      This addresses a pre-existing race condition

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
xtreme-sameer-vohra pushed a commit that referenced this issue Sep 24, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Sep 25, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
xtreme-sameer-vohra pushed a commit that referenced this issue Sep 27, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 1, 2018
    - drainer will check process every second
      This addresses a pre-existing race condition

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 1, 2018
- applies to forward workers

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 1, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
@xtreme-sameer-vohra (Contributor)

We updated the properties to simplify things, as follows.

Configurable by the user:

- rebalanceTime: the interval at which the worker creates a new connection. A value of 0 means the worker will not create additional connections. This value is also used as the idleConnectionTimeout for connections between the worker and the TSA.
	Default value: 0

maxConnections is set to 5. It is unlikely that multiple TSA addresses will be provided to a worker in the `forwarded` case, as these TSAs would generally sit behind a load balancer.

Operations automation moved this from In Flight to Done Oct 1, 2018

marco-m commented Oct 1, 2018

Hello @xtreme-sameer-vohra, do the various commits also contain some documentation for us poor users? :-)

YoussB added a commit that referenced this issue Oct 2, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB pushed a commit that referenced this issue Oct 2, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
@xtreme-sameer-vohra (Contributor)

Hey @marco-m,
Yep, it's available here.

YoussB added a commit that referenced this issue Oct 3, 2018
- beacon -> land, retire, delete

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
YoussB added a commit that referenced this issue Oct 3, 2018
- fix keepalive strict checking for tcpConn

Signed-off-by: Sameer Vohra <svohra@pivotal.io>
vito pushed a commit that referenced this issue Oct 23, 2018
Signed-off-by: Sameer Vohra <svohra@pivotal.io>
cirocosta pushed a commit to concourse/concourse-bosh-deployment that referenced this issue Jan 11, 2019
This commit adds an ops file so that users can make use
of the configurable worker rebalancing interval.

concourse/concourse#2312

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
cirocosta pushed a commit to cirocosta/charts that referenced this issue Jan 18, 2019
- Make use of the worker healthcheck endpoint

	Previously, the liveness probe of the worker was based on logs, which
	could end up killing the whole worker in the case of a malformed
	pipeline.

	Now, making use of the native concourse healthchecking that the workers
	provide, we can delegate to concourse the task of telling k8s if it's
	alive or not.

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
	Signed-off-by: Saman Alvi <salvi@pivotal.io>

- Make use of signals to terminate workers

	Instead of making use of `concourse retire-worker` which initiates a
	connection to the TSA to then retire the worker, we can instead make use
	of the newly introduced mechanism of sending signals to the worker to
	tell it to retire, being less error-prone.

	The idea of this commit is to do similar to what's done for Concourse's
	official BOSH releases (see
	concourse/concourse-bosh-release@a3ebf6a?diff=split)

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Removes GARDEN_ env in favor of config file

	On the 5.x series of Concourse, there's no need to have all the garden
	flags specified as environment variables anymore.

	That's because the new image assumes that a `gdn` binary is shipped
	together, which can look for a set of configurations from a specific
	config file.

	The set of possible values that can be used in the configuration file
	can be found here [1].

	[1]: https://github.com/cloudfoundry/guardian/blob/c1f268e69cd204e891f29bb020e32284a0054606/gqt/runner/runner.go#L41-L93

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Removes references to fatal errors

	Previously, the Helm chart made use of a set of possible fatal errors
	that either `garden` or `baggageclaim` would produce, terminating the
	worker pod in those cases.

	Now, making use of the worker probing endpoint (see [1]), we're able to
	implement better strategies for determining whether the worker is up or
	not while not changing the contract that `health_ip:health_port` gives
	back the info that the worker is alive or not.

	[1]: concourse/concourse@c3b26a0

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Updates env to match concourse 5.x

	There were some changes to some variables in the next release of
	concourse.

- Introduces bitbucket cloud auth variables

	In Concourse 5 it becomes possible to make use of Bitbucket cloud as an
	authenticator (see [1]).

	This commit includes the variables necessary for doing so, as well as
	the necessary keys under `secrets` to have those variables injected into
	web's environment.

	[1]: concourse/concourse#2631

	Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

- Add captureErrorMetrics and rebalanceInterval flags

	- RebalanceInterval: concourse/concourse#2312
	- CaptureErrorMetrics: concourse/concourse#2754

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@vito vito added enhancement and removed bug labels Jan 22, 2019
@vito vito added this to the v5.0.0 milestone Jan 22, 2019

vito commented Jan 22, 2019

Re-branding this as an enhancement - we never supported worker rebalancing, so this was a whole new feature, not a misbehavior.

@vito vito added the release/documented Documentation and release notes have been updated. label Jan 22, 2019
cirocosta pushed a commit to cirocosta/charts that referenced this issue Mar 1, 2019
- Removes flags in web cmd and add missing flags
- Introduce TSA_DEBUG* variables
- uses *bind for healthcheck variables
- update worker debug flags
- update debug-bind-* variables for baggageclaim

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>