ListAllJobs endpoint can be watched for changes #5802

Closed
aoldershaw wants to merge 33 commits from the watch-endpoints branch

Conversation

aoldershaw
Contributor

@aoldershaw aoldershaw commented Jun 23, 2020

What does this PR accomplish?

Bug Fix | Feature | Documentation

Related RFC: concourse/rfcs#61

Changes proposed by this PR:

This PR aims to provide functionality similar to Kubernetes' ?watch API (e.g. kubectl get ... --watch) for the ListAllJobs endpoint.

(Demo recordings attached to the original PR: abort-job, set-pipeline)

The motivation for this is described more in depth in the RFC, but essentially, it should provide:

  • (Hopefully) better performance on the ATC and DB
  • Faster feedback in the UI dashboard

Eventually, I see this change applying to several endpoints - rather than polling in the UI, we have the ATC send us updates.
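
For a concrete sense of the model, here is a minimal sketch of what a client consuming such a watch stream might look like. The /api/v1/jobs path is the existing ListAllJobs route; negotiating the stream via the Accept header and the shape of each event are assumptions for illustration, not this PR's actual wire format.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumption: asking for text/event-stream switches the endpoint from a
	// one-shot JSON list to a stream of job updates.
	req, err := http.NewRequest("GET", "https://ci.example.com/api/v1/jobs", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Accept", "text/event-stream")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Server-sent events arrive as "event:"/"data:" lines separated by blank
	// lines; each data payload would describe a job that changed.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "data:") {
			fmt.Println("job update:", strings.TrimSpace(strings.TrimPrefix(line, "data:")))
		}
	}
}
```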

Notes to reviewer:

To test: visit the dashboard, and then modify some jobs! (e.g. trigger builds, set/delete pipelines, expose/hide pipelines, etc)

This is feature-flagged behind the --enable-watch-endpoints flag, which I've enabled by default in docker-compose.yml.
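
For illustration only, here is one way a flag like this could gate the behaviour on the server side; the handler wiring and the Accept-header check are hypothetical, not this PR's actual code.

```go
package api

import "net/http"

// Config is a stand-in for the ATC's real configuration; only the flag name
// (--enable-watch-endpoints) comes from the PR itself.
type Config struct {
	EnableWatchEndpoints bool
}

// ListAllJobs serves the plain JSON list by default and only upgrades to a
// watch stream when the feature flag is on and the client asked for one.
func ListAllJobs(cfg Config, list, watch http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if cfg.EnableWatchEndpoints && r.Header.Get("Accept") == "text/event-stream" {
			watch.ServeHTTP(w, r)
			return
		}
		list.ServeHTTP(w, r)
	})
}
```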

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the BOSH and Helm packaging; otherwise, ignored for the integration tests (for example, if they are Garden configs that are not displayed in the --help text).

@aoldershaw aoldershaw force-pushed the watch-endpoints branch 3 times, most recently from dd6431d to b26a5f6 Compare June 23, 2020 16:39
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 2 times, most recently from d775bd5 to caaf5fb Compare June 27, 2020 18:40
@aoldershaw aoldershaw changed the title atc: ListAllJobs endpoint can be watched for changes atc/web: ListAllJobs endpoint can be watched for changes Jun 29, 2020
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 2 times, most recently from 2344240 to 9adfb76 Compare July 2, 2020 23:40
@aoldershaw aoldershaw changed the title atc/web: ListAllJobs endpoint can be watched for changes ListAllJobs endpoint can be watched for changes Jul 3, 2020
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 3 times, most recently from 33197f3 to cd51702 Compare July 9, 2020 21:21
@aoldershaw aoldershaw marked this pull request as ready for review July 24, 2020 13:28
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 3 times, most recently from 4a770da to 5319fa5 Compare August 14, 2020 14:16
@jamieklassen
Member

I have come back to this several times and never really convinced myself I understood how it works. The subtleties of queuing/not queuing notifications and handling dropped/interrupted connections or other edge-casey things seem like they would be easy to get wrong, and I couldn't grasp what `dirty` meant. If I were designing something like this from the ground up, I'm sure I'd follow the example of Kubernetes and use something like etcd, which already has a "watch" system.

However, I don't want to block this potentially-valuable performance improvement just because my brain is too small. The feature flag seems to work properly, so this is totally backwards-compatible; I'd rather focus on getting it merged and load-tested than stare at it forever. So let's get it rebased, add the feature flag to the Helm chart and BOSH release, and turn it on in hush-house.

Maybe we can focus some effort on planning the load test experiment: what will we measure (memory on the web nodes, network I/O between the DB and web, number of DB connections, probably something else on the DB)? What do we expect to change? This exercise might motivate us to add some observability. A gauge for the number of open ListAllJobs connections might not be amiss.

jamieklassen previously approved these changes Sep 22, 2020
@aoldershaw
Contributor Author

@jamieklassen thanks for the review! Hmm, I think if you're struggling to fully understand it then there's definitely a need to improve the clarity - I'll try to think about that before merging.

I'd also like to run a load test experiment as you suggested, but haven't found the time. I think it's probably wise to validate the impact before adding this much complexity to the codebase.

@aoldershaw
Contributor Author

Opened #6084 to track load testing

Aidan Oldershaw added 7 commits October 16, 2020 09:40
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
This isn't really an intended feature - just a consequence of the
implementation. Probably doesn't make much sense to have a test for it.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
I realized that it's possible for the `drain` goroutine to leak when
trying to send the pending events to the notify channel (since nothing
will be draining the notify channel when it's unsubscribed)

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
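
The leak pattern described here is a common one in Go. As a sketch with made-up names (not this PR's actual code): a drain goroutine that only does a blocking send can hang forever once the subscriber goes away, whereas selecting on a stop channel gives it a way out.

```go
package watch

// Event stands in for whatever payload the watcher delivers.
type Event struct {
	Payload string
}

// drain flushes pending events to a subscriber's notify channel. A bare
// `notify <- ev` can block forever once the subscriber unsubscribes (nothing
// receives on notify anymore), leaking this goroutine; the stop case lets it
// bail out instead.
func drain(pending []Event, notify chan<- Event, stop <-chan struct{}) {
	for _, ev := range pending {
		select {
		case notify <- ev:
			// delivered to a live subscriber
		case <-stop:
			// subscriber went away; abandon the remaining events
			return
		}
	}
}
```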
Rather than in the watcher itself. Now the watcher is unaware of what
its subscribers can see, and instead just sends all watch events to all
subscribers. It's the responsibility of the subscriber to filter the
list.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
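
Roughly, the division of responsibility described here looks like the following (types and fields are illustrative, not the PR's actual ones): the watcher fans every event out to every subscriber, and each subscriber applies its own visibility filter.

```go
package watch

// JobEvent is a placeholder for the real watch event type.
type JobEvent struct {
	TeamName string
	Public   bool
}

// Watcher fans out every event to every subscriber, with no knowledge of
// what each subscriber is allowed to see.
type Watcher struct {
	subscribers []chan JobEvent
}

func (w *Watcher) Broadcast(ev JobEvent) {
	for _, sub := range w.subscribers {
		sub <- ev
	}
}

// Subscriber filters the firehose down to what its user can see.
type Subscriber struct {
	events    <-chan JobEvent
	teamNames map[string]bool
}

// Next returns the next visible event, skipping any the user cannot see.
func (s *Subscriber) Next() (JobEvent, bool) {
	for ev := range s.events {
		if ev.Public || s.teamNames[ev.TeamName] {
			return ev, true
		}
	}
	return JobEvent{}, false
}
```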
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Aidan Oldershaw added 2 commits October 17, 2020 12:17
by only recomputing the pipeline layers when the inputs change. On a
large deployment, I found that the massive amount of updates to job
builds (e.g. builds starting and completing) was resulting in a ton of
client-side CPU usage. There were ~100 updates/second. I speculate the
slowness was due to the pipeline layers computation, which would run
for each pipeline containing a job that got updated.

With this optimization, job updates that don't affect the job's inputs
do not trigger a recomputation. I suspect this'll be the most common
case - the shape of a pipeline isn't likely to change as frequently as
the jobs run.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
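
The actual change is in the Elm UI, but the underlying idea (only recompute an expensive derived value when the inputs it depends on actually change) is sketched below in Go with made-up types.

```go
package dashboard

import "reflect"

// JobInputs captures the part of a job that affects the pipeline's shape.
type JobInputs struct {
	JobName string
	Passed  []string
}

// layerCache memoizes the expensive pipeline-layers computation, reusing the
// cached result whenever the inputs are unchanged (the common case: builds
// start and finish constantly, but the pipeline's shape rarely changes).
type layerCache struct {
	lastInputs []JobInputs
	layers     [][]string
}

func (c *layerCache) Layers(inputs []JobInputs) [][]string {
	if c.layers != nil && reflect.DeepEqual(inputs, c.lastInputs) {
		return c.layers
	}
	c.lastInputs = inputs
	c.layers = computeLayers(inputs)
	return c.layers
}

// computeLayers stands in for the costly topological layering of the pipeline.
func computeLayers(inputs []JobInputs) [][]string {
	layers := [][]string{}
	for range inputs {
		// real implementation omitted
	}
	return layers
}
```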
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 2 times, most recently from b286dc2 to 41727be Compare October 17, 2020 21:47
Also, I realized I screwed up the optimization by still always calculating
pipelineLayers, even when it's not being used! Whoops, that's fixed now.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Aidan Oldershaw added 3 commits October 17, 2020 23:07
Just like we do in present.Build

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
I deployed an image to the load test environment with a bug (the wrong
JSON was being emitted to the event stream), so Elm was erroring and
hitting the "fallback to polling" path. Unfortunately, we never actually
closed the event stream. This meant we were hitting the ListAllJobs
endpoint every time an event batch came in (which was 5 times a
second!), all while keeping the event stream open. This makes the UI
more resilient in that case.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
rather than on every request. I found that the previous optimization
wasn't actually terribly effective at boosting performance - it only
seemed effective due to a bug in the API. However, not writing to
localStorage on every event seems to improve performance a fair bit.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
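
Again, the real change is Elm-side, but the batching idea (mark the cache dirty on each event and flush it on an interval, rather than writing on every event) translates roughly to this hedged sketch.

```go
package cache

import (
	"sync"
	"time"
)

// periodicSaver batches persistence: events mark the state dirty, and a ticker
// flushes it at most once per interval instead of once per event. Illustrative
// only; the real change throttles localStorage writes in the Elm UI.
type periodicSaver struct {
	mu    sync.Mutex
	dirty bool
	save  func() // persists the current cache state
}

func (p *periodicSaver) MarkDirty() {
	p.mu.Lock()
	p.dirty = true
	p.mu.Unlock()
}

func (p *periodicSaver) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p.mu.Lock()
			if p.dirty {
				p.save()
				p.dirty = false
			}
			p.mu.Unlock()
		case <-stop:
			return
		}
	}
}
```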
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
@aoldershaw
Contributor Author

aoldershaw commented Oct 18, 2020

@jamieklassen I've done some load testing and documented it here: #6084

The skinny is that, under heavy load (tested with 100 active "dashboards"), enabling watch endpoints seems to result in:

  • Reduced ATC CPU usage and, as a result, increased build scheduling (compared to when watch endpoints are disabled)
    • Without watch endpoints, the ATC's CPU was pegged at 100% (compared to 80% in the baseline), and build scheduling dropped by ~50% as a result
    • With watch endpoints, the ATC's CPU was around 85-95% (compared to 80% in the baseline), and build scheduling only dropped by ~10%
  • Reduced network transmission (ATC -> browser dropped by ~70%, DB -> ATC dropped by ~60%)
  • Slightly increased memory usage on both the ATC and DB (~10-15%)

When not under client load (just the regular build scheduling load), having watch endpoints enabled resulted in slightly higher DB CPU and memory (~5-8% for each metric) - worth noting that at least some of this could be ascribed to the watch endpoint test being run after the vanilla Concourse test, so the amount of data in the DB was a bit larger. The completed DB transaction rate didn't seem to be significantly affected, nor did the number of DB connections.


In doing the load testing, I noticed some performance issues on the UI, primarily caused by writing the jobs cache to localStorage after every batch of events. I added some optimizations to mitigate that (9af0bc7, 100e067).

this exercise might motivate us to add some observability. A gauge for the number of open ListAllJobs connections might not be amiss

While looking into this, I noticed we have a GaugeVec for concurrent requests as part of the concurrent request limiting you worked on. I wonder if we should just always emit the number of ListAllJobs concurrent requests even if there is no concurrent request limit set.

It also makes me wonder about how the watch endpoints should play with the concurrent request limit - it's no longer a "you'll get your turn eventually, so just keep retrying" sort of thing. What do you think?
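
As a rough illustration of that gauge idea (the metric name, label, and use of the Prometheus client directly are assumptions, not Concourse's actual metrics wiring):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// concurrentRequests tracks in-flight requests per endpoint; for a watch
// endpoint a "request" can stay open for a long time, so a gauge of open
// connections is more telling than a request-rate counter.
var concurrentRequests = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "concourse_concurrent_requests",
		Help: "Number of in-flight requests, by endpoint.",
	},
	[]string{"action"},
)

func init() {
	prometheus.MustRegister(concurrentRequests)
}

// TrackConcurrency wraps a handler and keeps the gauge up to date for the
// lifetime of each request.
func TrackConcurrency(action string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		concurrentRequests.WithLabelValues(action).Inc()
		defer concurrentRequests.WithLabelValues(action).Dec()
		next.ServeHTTP(w, r)
	})
}
```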

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
@chenbh
Contributor

chenbh commented Jul 21, 2021

Closing, as it looks like we won't be prioritizing this anytime soon and the merge conflicts look scary.

@chenbh chenbh closed this Jul 21, 2021