
atc: behaviour: emit tasks waiting prometheus metric #5448

Merged

merged 54 commits into concourse:master on May 12, 2020

Conversation

tenjaa
Contributor

@tenjaa tenjaa commented Apr 16, 2020

Existing Issue

Fixes #5057

Why do we need this PR?

Using the limit-active-tasks placement strategy gives the perfect opportunity
to dynamically scale workers based on tasks running. This metric makes
it easier for operators to implement such a scaling.

Changes proposed in this pull request

  • add prometheus metric for waiting builds when using limit-active-tasks placement strategy

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the BOSH and Helm packaging; otherwise, they are ignored by the integration tests (for example, if they are Garden configs that are not displayed in the --help text).

@tenjaa tenjaa requested review from a team April 16, 2020 07:07
Member

@jamieklassen jamieklassen left a comment


Initially I had some concerns about the ability to use a gauge (the number of tasks currently waiting, which can increase and decrease) instead of a counter (the number of attempts to schedule a task container which have failed due to all workers being at capacity), because I thought there could be some misbehaviour when multiple web nodes were in play. So I tested it out manually.

diff --git a/docker-compose.yml b/docker-compose.yml
index 5618630a7..aeac1ed14 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -11,7 +11,7 @@ services:
       POSTGRES_USER: dev
       POSTGRES_PASSWORD: dev
 
-  web:
+  web-1:
     build: .
     image: concourse/concourse:local
     volumes:
@@ -20,6 +20,7 @@ services:
     depends_on: [db]
     ports:
     - 8080:8080
+    - 9100:9100
     environment:
       CONCOURSE_LOG_LEVEL: debug
       CONCOURSE_POSTGRES_HOST: db
@@ -32,20 +33,53 @@ services:
       CONCOURSE_CLUSTER_NAME: dev
       CONCOURSE_CLIENT_SECRET: Y29uY291cnNlLXdlYgo=
       CONCOURSE_TSA_CLIENT_SECRET: Y29uY291cnNlLXdvcmtlcgo=
+      CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: limit-active-tasks
+      CONCOURSE_MAX_ACTIVE_TASKS_PER_WORKER: 1
+      CONCOURSE_PROMETHEUS_BIND_IP: "0.0.0.0"
+      CONCOURSE_PROMETHEUS_BIND_PORT: "9100"
+      CONCOURSE_PEER_ADDRESS: web-1
+
+  web-2:
+    build: .
+    image: concourse/concourse:local
+    volumes:
+    - .:/src
+    command: web
+    depends_on: [db]
+    ports:
+    - 8081:8080
+    - 9101:9100
+    environment:
+      CONCOURSE_LOG_LEVEL: debug
+      CONCOURSE_POSTGRES_HOST: db
+      CONCOURSE_POSTGRES_USER: dev
+      CONCOURSE_POSTGRES_PASSWORD: dev
+      CONCOURSE_POSTGRES_DATABASE: concourse
+      CONCOURSE_EXTERNAL_URL: http://localhost:8081
+      CONCOURSE_ADD_LOCAL_USER: test:test,guest:guest
+      CONCOURSE_MAIN_TEAM_LOCAL_USER: test
+      CONCOURSE_CLUSTER_NAME: dev
+      CONCOURSE_CLIENT_SECRET: Y29uY291cnNlLXdlYgo=
+      CONCOURSE_TSA_CLIENT_SECRET: Y29uY291cnNlLXdvcmtlcgo=
+      CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: limit-active-tasks
+      CONCOURSE_MAX_ACTIVE_TASKS_PER_WORKER: 1
+      CONCOURSE_PROMETHEUS_BIND_IP: "0.0.0.0"
+      CONCOURSE_PROMETHEUS_BIND_PORT: "9100"
+      CONCOURSE_PEER_ADDRESS: web-2
 
   worker:
     build: .
     image: concourse/concourse:local
     command: worker
     privileged: true
-    depends_on: [web]
+    depends_on: [web-1,web-2]
     ports:
     - 7777:7777
     - 7788:7788
     stop_signal: SIGUSR2
     environment:
       CONCOURSE_LOG_LEVEL: debug
-      CONCOURSE_TSA_HOST: web:2222
+      CONCOURSE_TSA_HOST: web-1:2222
 
       # avoid using loopbacks
       CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
diff --git a/hack/overrides/prometheus.yml b/hack/overrides/prometheus.yml
index bd4ceb2ae..71f2dc762 100644
--- a/hack/overrides/prometheus.yml
+++ b/hack/overrides/prometheus.yml
@@ -8,13 +8,6 @@
 version: '3'
 
 services:
-  web:
-    environment:
-      CONCOURSE_PROMETHEUS_BIND_IP: "0.0.0.0"
-      CONCOURSE_PROMETHEUS_BIND_PORT: "9100"
-    ports:
-      - '9100:9100'
-
   prometheus:
     image: prom/prometheus
     entrypoint:
@@ -30,7 +23,8 @@ services:
           - job_name: 'concourse'
             static_configs:
               - targets:
-                - 'web:9100'
+                - 'web-1:9100'
+                - 'web-2:9100'
         " > config.yml
 
           exec prometheus \

And what was striking was that it totally worked! I guess the logic that acquires a lock when attempting to schedule a container actually synchronizes through the database -- I was concerned we might "over-report" the number of waiting tasks since two different web nodes might be attempting to start the same task.

I ran

for i in {1..10}; do
  fly -t web2 execute -c web/wats/fixtures/smoke-task.yml -i some-input=web/wats/fixtures/some-input &
  fly -t dev execute -c web/wats/fixtures/smoke-task.yml -i some-input=web/wats/fixtures/some-input &
done

where dev was localhost:8080 and web2 was localhost:8081. And what I saw in prometheus was quite promising!

[Screenshot: Prometheus graph, 2020-04-28 4:28 PM]

Each web node schedules its own containers. Now I need to attempt a similar exercise using job builds instead of one-offs (fly execute).
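
(As an aside on the gauge-versus-counter point above, here is a minimal client_golang sketch of the kind of gauge in question. The metric name, namespace, and helper functions are assumptions for illustration, not necessarily what this PR registers.)

package metric

import "github.com/prometheus/client_golang/prometheus"

// A gauge (rather than a counter) fits here because the number of waiting
// tasks can go both up and down as tasks start waiting and get placed.
var tasksWaiting = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "concourse",
	Subsystem: "tasks",
	Name:      "waiting", // full name: concourse_tasks_waiting (assumed)
	Help:      "Number of tasks waiting for a worker to become available.",
})

func init() {
	prometheus.MustRegister(tasksWaiting)
}

// Hypothetical helpers called around the wait-for-worker loop.
func TaskStartedWaiting()  { tasksWaiting.Inc() }
func TaskFinishedWaiting() { tasksWaiting.Dec() }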

Member

@jamieklassen jamieklassen left a comment


This continues to resist my attempts to break it - similar good behaviour when I trigger a bunch of job builds:

[Screenshot: Prometheus graph, 2020-04-28 4:38 PM]

Before I get into it, the implementation itself looks pretty much gold - very idiomatic, small diff, no big WTFs, and it works! Very well done.

I'd love to see some documentation for these metrics, maybe around here (source) or here (source) -- would you mind submitting patches for those?

In terms of testing, I think you could certainly add unit tests for a few things.

Finally, there's a merge conflict in the release notes - probably my bad for taking so long to get to this review. I was crunching on releasing 5.5.11, sorry! Anyway, I'm sure a simple rebase would take care of it.

release-notes/latest.md (review comment, outdated, resolved)
This configuration allows using an EU-based NewRelic account.

Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
NewRelicEmitter URL is the complete URL and not the base URL as configured in the NewRelicConfig

Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
Add NewRelicEmitter changes to release notes

Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
@tenjaa
Contributor Author

tenjaa commented May 4, 2020

Hi @pivotal-jamie-klassen,
we added some tests and also fixed the merge conflict.
We will also create a PR for the docs today (or tomorrow).

Josh Winters and others added 8 commits May 4, 2020 15:29
* NOTE: we don't log failed type conversions. This is probably something
we want to start doing once we allow users to bypass dex. For now it's
probably ok to fail silently because our token structure will be
consistent for the time being.

https://github.com/concourse/concourse/projects/49#card-31357785
https://github.com/concourse/concourse/projects/49#card-36796693

Co-authored-by: Ciro S. Costa <cscosta@pivotal.io>
Signed-off-by: Josh Winters <jwinters@pivotal.io>
Using the limit-active-tasks placement strategy gives the perfect opportunity
to dynamically scale workers based on tasks running. This metric makes
it easier for operators to implement such a scaling.

#5057

Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- To avoid test pollution all previously registered collectors are unregistered in
  garbage collection test

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
…to be available

- Make metrics test more stable

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- introduce wait group so we know the context is cancelled before running the tests

Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
@shyamz-22
Member

shyamz-22 commented May 5, 2020

@pivotal-jamie-klassen we have added tests as mentioned in the last review.

However, our test that checks whether the task-waiting metric is gauged correctly while waiting for a worker to become available is a bit flaky. It seems to succeed only after 2 or 3 attempts. Can you help us with that?
Thanks :)

- Close sshClient Conn on keepalive -> SendRequest operation error
- Add 5 second timeout for keepalive -> SendRequest operation

Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
- pull out keepalive from tsa client
- add test coverage for new behaviour

Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
- Changed the timeout from 5s to 5m as 5s might be too aggressive for
  remote workers that don't have reliable connections.

  This timeout is to detect the condition where we observed SendRequest
  to hang indefinitely, which can still be reliably detected with a
  timeout of 5m.

Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
@jamieklassen
Member

tl;dr - making this properly testable may require some dependency-breaking refactors. Or maybe a more sophisticated stub on the fakePool's FindOrChooseWorkerForContainer method.

@shyamz-22 I don't have time to go deeply into a solution right now, but I will come back to this and I just want to bring the challenge to the surface.

The full scenario we're trying to simulate is this:

  1. the client cannot find a worker
  2. the waiting tasks gauge is incremented
  3. the step is aborted
  4. the waiting tasks gauge is decremented

In the JustBeforeEach, client.RunTaskStep executes, and is blocking. When we look in the implementation, we are getting a race between the context being canceled and the metric being incremented. Somehow we want to wait long enough for the metric to actually be incremented before aborting the step.

For the record, I was able to get the test to consistently pass by writing it like this:

Context("when waiting for worker to be available", func() {
	BeforeEach(func() {
		fakePool.FindOrChooseWorkerForContainerReturns(nil, nil)
		go func() {
			Eventually(metric.TasksWaiting.Max, 6*time.Second).Should(Equal(float64(1)))
			cancel()
		}()
	})

	It("metric task waiting is gauged and lock is released", func() {
		Expect(metric.TasksWaiting.Max()).To(Equal(float64(0)))
		Expect(fakeLock.ReleaseCallCount()).To(Equal(fakeLockFactory.AcquireCallCount()))
	})
})

However, I strongly recommend against keeping this approach, because instead of failing it just hangs indefinitely. But it serves as an interesting illustration of how to orchestrate the concurrent processes to achieve the simulation described.
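
For what it's worth, a small variation on the snippet above (just a sketch, untested) keeps the same orchestration but should surface a failure instead of hanging: recover inside the goroutine and always cancel, so the blocking RunTaskStep gets unblocked even when the gauge never reaches 1.

BeforeEach(func() {
	fakePool.FindOrChooseWorkerForContainerReturns(nil, nil)
	go func() {
		// GinkgoRecover reports an Eventually failure from this goroutine
		// to Ginkgo instead of crashing the suite.
		defer GinkgoRecover()
		// Always cancel, even on failure, so RunTaskStep in the
		// JustBeforeEach stops blocking and the spec can finish.
		defer cancel()
		Eventually(metric.TasksWaiting.Max, 6*time.Second).Should(Equal(float64(1)))
	}()
})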

@shyamz-22
Member

shyamz-22 commented May 6, 2020

@pivotal-jamie-klassen thanks for the reply.

We were of the same opinion as you, and we also had an idea for a refactor that would make this behaviour easier to test. Could we explore this refactor, or would you suggest a lead on how we can proceed?

thanks

xtreme-sameer-vohra and others added 6 commits May 8, 2020 10:06
Signed-off-by: Taylor Silva <tsilva@pivotal.io>
We should switch this over to using the API directly as well, since it can
make the build logs very noisy.

Signed-off-by: Taylor Silva <tsilva@pivotal.io>
When using -node, everything in SynchronizedBeforeSuite is only run on
one node, so other nodes don't have a pointer to kubeClient; it was
never set on those nodes!

Putting it in the suite's BeforeEach() ensures each node initializes the
kubeClient.

Signed-off-by: Taylor Silva <tsilva@pivotal.io>
Replaced with directly calling the API for the logs of each pod

Signed-off-by: Taylor Silva <tsilva@pivotal.io>
@shyamz-22
Member

@pivotal-jamie-klassen
Hi,

Thank you so much for the feedback. I have made some changes here.

Please consider this as a draft.

Changes:

  1. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/worker/client.go#L557
  2. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/worker/client.go#L112
  3. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L648
  4. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L876
  5. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L149
  6. https://github.com/volkswagen/concourse/blob/draft-waiting-tasks-refactor/atc/worker/choose_task_worker_test.go

Some differences between the existing implementation and this draft:

Previously the order of execution was

  1. Acquire lock
  2. Find Existing container
  3. Choose worker

However, steps 1 and 2 are needed only when a worker is found, and there are no tests proving that the order matters. The current refactor translates roughly to the following (lock release omitted; you can find it in the code), and there is a rough sketch after the list below:

  1. Choose worker
  2. Acquire lock
  3. If a worker is found, increase its active tasks and exit
  4. If no worker is found:
    a. increase the waiting-tasks metric
    b. wait for a worker to become available and continue
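
A rough, self-contained sketch of that flow (the interface and function names below are simplified stand-ins for illustration, not the real Concourse types -- the actual code is in the links above):

package worker

import (
	"context"
	"errors"
	"time"
)

// Simplified stand-ins for the real types; everything here is illustrative only.
type Worker interface{ IncreaseActiveTasks() error }

type Pool interface {
	FindOrChooseWorkerForContainer(ctx context.Context) (Worker, error)
}

type Lock interface{ Release() }

type LockFactory interface{ Acquire() (Lock, bool, error) }

type Gauge interface {
	Inc()
	Dec()
}

// chooseTaskWorker mirrors the order above: choose a worker first, then take the
// active-tasks lock; if no worker is free, bump the waiting gauge once and poll
// until a worker shows up or the step is aborted.
func chooseTaskWorker(ctx context.Context, pool Pool, locks LockFactory, tasksWaiting Gauge, pollInterval time.Duration) (Worker, error) {
	waiting := false
	defer func() {
		if waiting {
			tasksWaiting.Dec()
		}
	}()

	for {
		chosen, err := pool.FindOrChooseWorkerForContainer(ctx)
		if err != nil {
			return nil, err
		}

		if chosen != nil {
			lock, acquired, err := locks.Acquire()
			if err != nil {
				return nil, err
			}
			if acquired {
				defer lock.Release()
				if err := chosen.IncreaseActiveTasks(); err != nil {
					return nil, err
				}
				return chosen, nil
			}
		}

		// No worker (or no lock) yet: report that we are waiting, then retry.
		if !waiting {
			tasksWaiting.Inc()
			waiting = true
		}

		select {
		case <-ctx.Done():
			return nil, errors.New("aborted while waiting for a worker")
		case <-time.After(pollInterval):
		}
	}
}

The waiting flag ensures the gauge is incremented at most once per call and always decremented on exit, whether a worker was eventually found or the step was aborted.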

Observations:

  • Tests are no longer flaky and run faster (~2 seconds)
  • The current gocyclo score is:
➜  concourse git:(draft-waiting-tasks-refactor) gocyclo atc/worker/client.go
16 worker (*client).RunTaskStep atc/worker/client.go:249:1
12 worker (*client).chooseTaskWorker atc/worker/client.go:557:1

I also reordered the code: struct declarations and vars come first, followed by exported methods, then private receiver methods, and finally functions.

Please let me know how to proceed further.

taylorsilva and others added 4 commits May 11, 2020 14:00
Signed-off-by: James Thomson <jthomson@pivotal.io>
k8s/topgun: update prometheus dependencies and disable worker
@jamieklassen
Member

@shyamz-22 I quite like the reorganizing of that file, as well as the various *Dummy() methods in the test. The test is also much easier to read. Just for fun, I checked out the coverage, and while atc/worker/client.go has ~80% coverage, all of the functions involved in RunTaskStep are quite thoroughly covered.

I had one concern, and this mostly shows the limited degree to which I actually understand the affected components of Concourse. By acquiring the lock after choosing the worker, I can imagine this reducing the thread-safety of the worker-choosing routine. Since the worker-choosing part is no longer synchronized, can multiple web nodes race each other and do anything weird?

I think the answer is no, because the only code path that actually calls RunTaskStep is the one involving the Run method on a TaskStep, which ultimately only gets called from engineBuild.Run. This method itself involves acquiring a tracking lock. Since there can only be a single goroutine that possesses the tracking lock for a build, there is no risk of two webs accidentally scheduling the same task on different workers. It seems like the active task lock is really just to avoid two webs racing to increment/decrement the active tasks on a worker, and having that number potentially go out of sync with the actual number of task containers on the worker.

Since you are making the retry and reporting interval configurable, there are some jobs in our pipeline that will complain unless the appropriate configuration is exposed via the helm chart and bosh release. So it would be helpful to make PRs to those repos as well, though I can help with that. Otherwise, the configuration doesn't actually need to be passed all the way to the operator. One could still externalize them from the method under test but have them always be constant in production. I suggest considering this because it seems like the kind of configuration an operator would have no idea how or why to tune -- indeed I can't think of why I would ever change that variable! Maybe you can?

@shyamz-22
Member

shyamz-22 commented May 12, 2020

One could still externalize them from the method under test but have them always be constant in production.

My points of confusion in achieving this are:

  1. chooseTaskWorker is not exported and is only called from RunTaskStep, which is an interface method. RunTaskStep is the method under test, which means its signature would have to be modified.

  2. If we export the chooseTaskWorker method on the client, the client struct itself is still not exported and only implements the Client interface.

  3. All the test packages use the _test suffix, and I do not want to break that convention by adding a test with the same package name as the production code.

This is why I decided to pass it as configuration. I cannot think of any other way, as my understanding of all the components and code conventions of Concourse is not that strong.

@jamieklassen
Member

@shyamz-22 sorry, I was waving my hands there. What I meant was that instead of adding these parameters you could make a change around here:

https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L86-L94

adding some lines defining constants:

var workerAvailabilityPollingInterval = 5 * time.Second
var workerStatusPublishInterval = 1 * time.Minute

and then the code here:

https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L648

could instead look like

	workerClient := worker.NewClient(pool, workerProvider, compressionLib, workerAvailabilityPollingInterval, workerStatusPublishInterval)

and similarly here.

Does this make sense? I'm also open to making those things operator-configurable if you think there's a precedent, I just can't think of any.

@shyamz-22
Member

@pivotal-jamie-klassen I also cannot find a valid reason why an operator would need to configure the polling interval. I will add the defaults as you suggested and merge the changes. Hopefully this closes the PR 🤞 😄

Hannes Hasselbring and others added 9 commits May 12, 2020 19:10
Using the limit-active-tasks placement strategy gives the perfect opportunity
to dynamically scale workers based on tasks running. This metric makes
it easier for operators to implement such a scaling.

#5057

Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- To avoid test pollution all previously registered collectors are unregistered in
  garbage collection test

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
…to be available

- Make metrics test more stable

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- introduce wait group so we know the context is cancelled before running the tests

Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- Externalize worker polling and status configuration
  to have more stable tests

Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
@tenjaa tenjaa requested a review from vito as a code owner May 12, 2020 17:31
@tenjaa tenjaa requested a review from a team May 12, 2020 17:31
Member

@jamieklassen jamieklassen left a comment


Thank you for being so responsive and thorough, and really taking the time for craftsmanship. It makes a huge difference!

@shyamz-22
Member

@pivotal-jamie-klassen thank you too for providing so much information in each review, I really appreciate it :)

@jamieklassen jamieklassen merged commit 587339a into concourse:master May 12, 2020
@taylorsilva taylorsilva added the release/documented Documentation and release notes have been updated. label May 25, 2020