
atc: behaviour: emit tasks waiting prometheus metric #5448

Merged

merged 54 commits into concourse:master on May 12, 2020

Conversation

tenjaa
Contributor

@tenjaa tenjaa commented Apr 16, 2020

Existing Issue

Fixes #5057

Why do we need this PR?

Using the limit-active-tasks placement strategy gives the perfect opportunity
to dynamically scale workers based on tasks running. This metric makes
it easier for operators to implement such a scaling.

Changes proposed in this pull request

  • add prometheus metric for waiting builds when using limit-active-tasks placement strategy

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the BOSH and Helm packaging; otherwise, they are ignored by the integration tests (for example, if they are Garden configs that are not displayed in the --help text).

@tenjaa tenjaa requested review from a team April 16, 2020 07:07
Member

@jamieklassen jamieklassen left a comment


Initially I had some concerns about the ability to use a gauge (the number of tasks currently waiting, which can increase and decrease) instead of a counter (the number of attempts to schedule a task container which have failed due to all workers being at capacity), because I thought there could be some misbehaviour when multiple web nodes were in play. So I tested it out manually.

diff --git a/docker-compose.yml b/docker-compose.yml
index 5618630a7..aeac1ed14 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -11,7 +11,7 @@ services:
       POSTGRES_USER: dev
       POSTGRES_PASSWORD: dev
 
-  web:
+  web-1:
     build: .
     image: concourse/concourse:local
     volumes:
@@ -20,6 +20,7 @@ services:
     depends_on: [db]
     ports:
     - 8080:8080
+    - 9100:9100
     environment:
       CONCOURSE_LOG_LEVEL: debug
       CONCOURSE_POSTGRES_HOST: db
@@ -32,20 +33,53 @@ services:
       CONCOURSE_CLUSTER_NAME: dev
       CONCOURSE_CLIENT_SECRET: Y29uY291cnNlLXdlYgo=
       CONCOURSE_TSA_CLIENT_SECRET: Y29uY291cnNlLXdvcmtlcgo=
+      CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: limit-active-tasks
+      CONCOURSE_MAX_ACTIVE_TASKS_PER_WORKER: 1
+      CONCOURSE_PROMETHEUS_BIND_IP: "0.0.0.0"
+      CONCOURSE_PROMETHEUS_BIND_PORT: "9100"
+      CONCOURSE_PEER_ADDRESS: web-1
+
+  web-2:
+    build: .
+    image: concourse/concourse:local
+    volumes:
+    - .:/src
+    command: web
+    depends_on: [db]
+    ports:
+    - 8081:8080
+    - 9101:9100
+    environment:
+      CONCOURSE_LOG_LEVEL: debug
+      CONCOURSE_POSTGRES_HOST: db
+      CONCOURSE_POSTGRES_USER: dev
+      CONCOURSE_POSTGRES_PASSWORD: dev
+      CONCOURSE_POSTGRES_DATABASE: concourse
+      CONCOURSE_EXTERNAL_URL: http://localhost:8081
+      CONCOURSE_ADD_LOCAL_USER: test:test,guest:guest
+      CONCOURSE_MAIN_TEAM_LOCAL_USER: test
+      CONCOURSE_CLUSTER_NAME: dev
+      CONCOURSE_CLIENT_SECRET: Y29uY291cnNlLXdlYgo=
+      CONCOURSE_TSA_CLIENT_SECRET: Y29uY291cnNlLXdvcmtlcgo=
+      CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: limit-active-tasks
+      CONCOURSE_MAX_ACTIVE_TASKS_PER_WORKER: 1
+      CONCOURSE_PROMETHEUS_BIND_IP: "0.0.0.0"
+      CONCOURSE_PROMETHEUS_BIND_PORT: "9100"
+      CONCOURSE_PEER_ADDRESS: web-2
 
   worker:
     build: .
     image: concourse/concourse:local
     command: worker
     privileged: true
-    depends_on: [web]
+    depends_on: [web-1,web-2]
     ports:
     - 7777:7777
     - 7788:7788
     stop_signal: SIGUSR2
     environment:
       CONCOURSE_LOG_LEVEL: debug
-      CONCOURSE_TSA_HOST: web:2222
+      CONCOURSE_TSA_HOST: web-1:2222
 
       # avoid using loopbacks
       CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
diff --git a/hack/overrides/prometheus.yml b/hack/overrides/prometheus.yml
index bd4ceb2ae..71f2dc762 100644
--- a/hack/overrides/prometheus.yml
+++ b/hack/overrides/prometheus.yml
@@ -8,13 +8,6 @@
 version: '3'
 
 services:
-  web:
-    environment:
-      CONCOURSE_PROMETHEUS_BIND_IP: "0.0.0.0"
-      CONCOURSE_PROMETHEUS_BIND_PORT: "9100"
-    ports:
-      - '9100:9100'
-
   prometheus:
     image: prom/prometheus
     entrypoint:
@@ -30,7 +23,8 @@ services:
           - job_name: 'concourse'
             static_configs:
               - targets:
-                - 'web:9100'
+                - 'web-1:9100'
+                - 'web-2:9100'
         " > config.yml
 
           exec prometheus \

And what was striking was that it totally worked! I guess the logic that acquires a lock when attempting to schedule a container actually synchronizes through the database -- I was concerned we might "over-report" the number of waiting tasks since two different web nodes might be attempting to start the same task.

I ran

for i in {1..10}; do
  fly -t web2 execute -c web/wats/fixtures/smoke-task.yml -i some-input=web/wats/fixtures/some-input &
  fly -t dev execute -c web/wats/fixtures/smoke-task.yml -i some-input=web/wats/fixtures/some-input &
done

where dev was localhost:8080 and web2 was localhost:8081. And what I saw in prometheus was quite promising!

[Screenshot: Prometheus graph, 2020-04-28 4:28 PM]

Each web node schedules its own containers. Now I need to attempt a similar exercise using job builds instead of one-offs (fly execute).
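
(As an aside on the gauge-versus-counter point above, here is a minimal client_golang sketch of the kind of gauge in question. The metric name, namespace, and helper functions are assumptions for illustration, not necessarily what this PR registers.)

package metric

import "github.com/prometheus/client_golang/prometheus"

// A gauge (rather than a counter) fits here because the number of waiting
// tasks can go both up and down as tasks start waiting and get placed.
var tasksWaiting = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "concourse",
	Subsystem: "tasks",
	Name:      "waiting", // full name: concourse_tasks_waiting (assumed)
	Help:      "Number of tasks waiting for a worker to become available.",
})

func init() {
	prometheus.MustRegister(tasksWaiting)
}

// Hypothetical helpers called around the wait-for-worker loop.
func TaskStartedWaiting()  { tasksWaiting.Inc() }
func TaskFinishedWaiting() { tasksWaiting.Dec() }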

Member

@jamieklassen jamieklassen left a comment


This continues to resist my attempts to break it - similar good behaviour when I trigger a bunch of job builds:

[Screenshot: Prometheus graph, 2020-04-28 4:38 PM]

Before I get into it, the implementation itself looks pretty much gold - very idiomatic, small diff, no big WTFs, and it works! Very well done.

I'd love to see some documentation for these metrics, maybe around here (source) or here (source) -- would you mind submitting patches for those?

In terms of testing, I think you could certainly add unit tests for a few things.

Finally, there's a merge conflict in the release notes - probably my bad for taking so long to get to this review. I was crunching on releasing 5.5.11, sorry! Anyway, I'm sure a simple rebase would take care of it.

release-notes/latest.md (review comment, outdated, resolved)
This configuration allows using an EU-based NewRelic account.

Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
NewRelicEmitter URL is the complete URL and not the base URL as configured in the NewRelicConfig

Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
Add NewRelicEmitter changes to release notes

Signed-off-by: Syamala Umamaheswaran <shykart2203@gmail.com>
@tenjaa
Contributor Author

tenjaa commented May 4, 2020

Hi @pivotal-jamie-klassen,
we added some tests and also fixed the merge conflict.
We will also create a PR for the docs today (or tomorrow).

Josh Winters and others added 8 commits May 4, 2020 15:29
* NOTE: we don't log failed type conversions. This is probably something
we want to start doing once we allow users to bypass dex. For now it's
probably ok to fail silently because our token structure will be
consistent for the time being.

https://github.com/concourse/concourse/projects/49#card-31357785
https://github.com/concourse/concourse/projects/49#card-36796693

Co-authored-by: Ciro S. Costa <cscosta@pivotal.io>
Signed-off-by: Josh Winters <jwinters@pivotal.io>
Using the limit-active-tasks placement strategy gives the perfect opportunity
to dynamically scale workers based on tasks running. This metric makes
it easier for operators to implement such a scaling.

#5057

Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- To avoid test pollution all previously registered collectors are unregistered in
  garbage collection test

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
…to be available

- Make metrics test more stable

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- introduce wait group so we know the context is cancelled before running the tests

Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
@shyamz-22
Member

shyamz-22 commented May 5, 2020

@pivotal-jamie-klassen we have added tests as mentioned in the last review.

However, our test that checks whether the task-waiting metric is gauged correctly while waiting for a worker to become available is a bit flaky. It seems to succeed only after 2 or 3 attempts. Can you help us with that?
Thanks :)

- Close sshClient Conn on keepalive -> SendRequest operation error
- Add 5 second timeout for keepalive -> SendRequest operation

Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
- pull out keepalive from tsa client
- add test coverage for new behaviour

Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
- Changed the timeout from 5s to 5m as 5s might be too aggressive for
  remote workers that don't have reliable connections.

  This timeout is to detect the condition where we observed SendRequest
  to hang indefinitely, which can still be reliably detected with a
  timeout of 5m.

Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
Signed-off-by: Sameer Vohra <vohra.sam@gmail.com>
@jamieklassen
Member

tl;dr - making this properly testable may require some dependency-breaking refactors. Or maybe a more sophisticated stub on the fakePool's FindOrChooseWorkerForContainer method.

@shyamz-22 I don't have time to go deeply into a solution right now, but I will come back to this and I just want to bring the challenge to the surface.

The full scenario we're trying to simulate is this:

  1. the client cannot find a worker
  2. the waiting tasks gauge is incremented
  3. the step is aborted
  4. the waiting tasks gauge is decremented

In the JustBeforeEach, client.RunTaskStep executes, and is blocking. When we look in the implementation, we are getting a race between the context being canceled and the metric being incremented. Somehow we want to wait long enough for the metric to actually be incremented before aborting the step.

For the record, I was able to get the test to consistently pass by writing it like this:

Context("when waiting for worker to be available", func() {
	BeforeEach(func() {
		fakePool.FindOrChooseWorkerForContainerReturns(nil, nil)
		go func() {
			Eventually(metric.TasksWaiting.Max, 6*time.Second).Should(Equal(float64(1)))
			cancel()
		}()
	})

	It("metric task waiting is gauged and lock is released", func() {
		Expect(metric.TasksWaiting.Max()).To(Equal(float64(0)))
		Expect(fakeLock.ReleaseCallCount()).To(Equal(fakeLockFactory.AcquireCallCount()))
	})
})

However, I strongly recommend against keeping this approach, because instead of failing it just hangs indefinitely. But it serves as an interesting illustration of how to orchestrate the concurrent processes to achieve the simulation described.
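
For what it's worth, a small variation on the snippet above (just a sketch, untested) keeps the same orchestration but should surface a failure instead of hanging: recover inside the goroutine and always cancel, so the blocking RunTaskStep gets unblocked even when the gauge never reaches 1.

BeforeEach(func() {
	fakePool.FindOrChooseWorkerForContainerReturns(nil, nil)
	go func() {
		// GinkgoRecover reports an Eventually failure from this goroutine
		// to Ginkgo instead of crashing the suite.
		defer GinkgoRecover()
		// Always cancel, even on failure, so RunTaskStep in the
		// JustBeforeEach stops blocking and the spec can finish.
		defer cancel()
		Eventually(metric.TasksWaiting.Max, 6*time.Second).Should(Equal(float64(1)))
	}()
})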

@shyamz-22
Member

shyamz-22 commented May 6, 2020

@pivotal-jamie-klassen thanks for the reply.

We were of the same opinion as you, and we also had an idea for a refactor that would make this behaviour easier to test. Could we explore this refactor, or would you suggest a lead on how we can proceed?

thanks

xtreme-sameer-vohra and others added 6 commits May 8, 2020 10:06
Signed-off-by: Taylor Silva <tsilva@pivotal.io>
We should switch this over to using the API directly as well, since it can
make the build logs very noisy.

Signed-off-by: Taylor Silva <tsilva@pivotal.io>
When using -node, everything in SynchronizedBeforeSuite is only run on
one node, so other nodes don't have a pointer to kubeClient; it was
never set on those nodes!

Putting it in the suite's BeforeEach() ensures each node initializes the
kubeClient.

Signed-off-by: Taylor Silva <tsilva@pivotal.io>
Replaced with directly calling the API for the logs of each pod

Signed-off-by: Taylor Silva <tsilva@pivotal.io>
@shyamz-22
Member

@pivotal-jamie-klassen
Hi,

Thank you so much for the feedback. I have made some changes here.

Please consider this as a draft.

Changes:

  1. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/worker/client.go#L557
  2. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/worker/client.go#L112
  3. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L648
  4. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L876
  5. https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L149
  6. https://github.com/volkswagen/concourse/blob/draft-waiting-tasks-refactor/atc/worker/choose_task_worker_test.go

Some differences between the existing implementation and this draft:

Previously the order of execution was

  1. Acquire lock
  2. Find Existing container
  3. Choose worker

However, steps 1 and 2 are needed only when a worker is found, and there are no tests proving that the order matters. The current refactor translates roughly to the following (lock release omitted; you can find it in the code), and there is a rough sketch after the list below:

  1. Choose worker
  2. Acquire lock
  3. If a worker is found, increase its active tasks and exit
  4. If no worker is found:
    a. increase the waiting-tasks metric
    b. wait for a worker to become available and continue
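
A rough, self-contained sketch of that flow (the interface and function names below are simplified stand-ins for illustration, not the real Concourse types -- the actual code is in the links above):

package worker

import (
	"context"
	"errors"
	"time"
)

// Simplified stand-ins for the real types; everything here is illustrative only.
type Worker interface{ IncreaseActiveTasks() error }

type Pool interface {
	FindOrChooseWorkerForContainer(ctx context.Context) (Worker, error)
}

type Lock interface{ Release() }

type LockFactory interface{ Acquire() (Lock, bool, error) }

type Gauge interface {
	Inc()
	Dec()
}

// chooseTaskWorker mirrors the order above: choose a worker first, then take the
// active-tasks lock; if no worker is free, bump the waiting gauge once and poll
// until a worker shows up or the step is aborted.
func chooseTaskWorker(ctx context.Context, pool Pool, locks LockFactory, tasksWaiting Gauge, pollInterval time.Duration) (Worker, error) {
	waiting := false
	defer func() {
		if waiting {
			tasksWaiting.Dec()
		}
	}()

	for {
		chosen, err := pool.FindOrChooseWorkerForContainer(ctx)
		if err != nil {
			return nil, err
		}

		if chosen != nil {
			lock, acquired, err := locks.Acquire()
			if err != nil {
				return nil, err
			}
			if acquired {
				defer lock.Release()
				if err := chosen.IncreaseActiveTasks(); err != nil {
					return nil, err
				}
				return chosen, nil
			}
		}

		// No worker (or no lock) yet: report that we are waiting, then retry.
		if !waiting {
			tasksWaiting.Inc()
			waiting = true
		}

		select {
		case <-ctx.Done():
			return nil, errors.New("aborted while waiting for a worker")
		case <-time.After(pollInterval):
		}
	}
}

The waiting flag ensures the gauge is incremented at most once per call and always decremented on exit, whether a worker was eventually found or the step was aborted.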

Observations:

  • Tests are no longer flaky and run faster (~2 seconds)
  • The current gocyclo score is:
➜  concourse git:(draft-waiting-tasks-refactor) gocyclo atc/worker/client.go
16 worker (*client).RunTaskStep atc/worker/client.go:249:1
12 worker (*client).chooseTaskWorker atc/worker/client.go:557:1

I also reordered the code: struct declarations and vars come first, followed by exported methods, then private receiver methods, and finally functions.

Please let me know how to proceed further.

taylorsilva and others added 4 commits May 11, 2020 14:00
Signed-off-by: James Thomson <jthomson@pivotal.io>
k8s/topgun: update prometheus dependencies and disable worker
@jamieklassen
Member

@shyamz-22 I quite like the reorganizing of that file, as well as the various *Dummy() methods in the test. The test is also much easier to read. Just for fun, I checked out the coverage, and while atc/worker/client.go has ~80% coverage, all of the functions involved in RunTaskStep are quite thoroughly covered.

I had one concern, and this mostly shows the limited degree to which I actually understand the affected components of Concourse. By acquiring the lock after choosing the worker, I can imagine this reducing the thread-safety of the worker-choosing routine. Since the worker-choosing part is no longer synchronized, can multiple web nodes race each other and do anything weird?

I think the answer is no, because the only code path that actually calls RunTaskStep is the one involving the Run method on a TaskStep, which ultimately only gets called from engineBuild.Run. This method itself involves acquiring a tracking lock. Since there can only be a single goroutine that possesses the tracking lock for a build, there is no risk of two webs accidentally scheduling the same task on different workers. It seems like the active task lock is really just to avoid two webs racing to increment/decrement the active tasks on a worker, and having that number potentially go out of sync with the actual number of task containers on the worker.

Since you are making the retry and reporting interval configurable, there are some jobs in our pipeline that will complain unless the appropriate configuration is exposed via the helm chart and bosh release. So it would be helpful to make PRs to those repos as well, though I can help with that. Otherwise, the configuration doesn't actually need to be passed all the way to the operator. One could still externalize them from the method under test but have them always be constant in production. I suggest considering this because it seems like the kind of configuration an operator would have no idea how or why to tune -- indeed I can't think of why I would ever change that variable! Maybe you can?

@shyamz-22
Member

shyamz-22 commented May 12, 2020

One could still externalize them from the method under test but have them always be constant in production.

My points of confusion in achieving this are:

  1. chooseTaskWorker is not exported and is only called from RunTaskStep, which is an interface method. RunTaskStep is the method under test, which means its signature would have to be modified.

  2. If we export the chooseTaskWorker method on the client, the client struct itself is still not exported and only implements the Client interface.

  3. All the test packages use the _test suffix, and I do not want to break that convention by adding a test with the same package name as the production code.

This is why I decided to pass it as configuration. I cannot think of any other way, as my understanding of all the components and code conventions of Concourse is not that strong.

@jamieklassen
Member

@shyamz-22 sorry, I was waving my hands there. What I meant was that instead of adding these parameters you could make a change around here:

https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L86-L94

adding some lines defining constants:

var workerAvailabilityPollingInterval = 5 * time.Second
var workerStatusPublishInterval = 1 * time.Minute

and then the code here:

https://github.com/volkswagen/concourse/blob/0f7d6d1040658596568ba9b6f28fb8a65370041d/atc/atccmd/command.go#L648

could instead look like

	workerClient := worker.NewClient(pool, workerProvider, compressionLib, workerAvailabilityPollingInterval, workerStatusPublishInterval)

and similarly here.

Does this make sense? I'm also open to making those things operator-configurable if you think there's a precedent, I just can't think of any.

@shyamz-22
Member

@pivotal-jamie-klassen I also cannot find a valid reason why an operator would need to configure the polling interval. I will add the defaults as you suggested and merge the changes. Hopefully this closes the PR 🤞 😄

Hannes Hasselbring and others added 9 commits May 12, 2020 19:10
Using the limit-active-tasks placement strategy gives the perfect opportunity
to dynamically scale workers based on tasks running. This metric makes
it easier for operators to implement such a scaling.

#5057

Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Torben Neufeldt <torben.neufeldt@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- To avoid test pollution all previously registered collectors are unregistered in
  garbage collection test

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
…to be available

- Make metrics test more stable

Co-authored-by: Hannes Hasselbring <hannes@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- introduce wait group so we know the context is cancelled before running the tests

Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
- Externalize worker polling and status configuration
  to have more stable tests

Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
Signed-off-by: Syamala Umamaheswaran <shyam@oss.volkswagen.com>
@tenjaa tenjaa requested a review from vito as a code owner May 12, 2020 17:31
@tenjaa tenjaa requested a review from a team May 12, 2020 17:31
Member

@jamieklassen jamieklassen left a comment


Thank you for being so responsive and thorough, and really taking the time for craftsmanship. It makes a huge difference!

@shyamz-22
Member

@pivotal-jamie-klassen thank you too for providing so much information in each review, I really appreciate it :)

@jamieklassen jamieklassen merged commit 587339a into concourse:master May 12, 2020
@taylorsilva taylorsilva added the release/documented Documentation and release notes have been updated. label May 25, 2020