ListAllJobs endpoint can be watched for changes #5802

Closed
aoldershaw wants to merge 33 commits from the watch-endpoints branch

Conversation

aoldershaw
Contributor

@aoldershaw aoldershaw commented Jun 23, 2020

What does this PR accomplish?

Bug Fix | Feature | Documentation

Related RFC: concourse/rfcs#61

Changes proposed by this PR:

This PR aims to provide functionality similar to Kubernetes' ?watch API (e.g. kubectl get ... --watch) for the ListAllJobs endpoint.

(Demo recordings attached to the original PR: abort-job, set-pipeline)

The motivation for this is described more in depth in the RFC, but essentially, it should provide:

  • (Hopefully) better performance on the ATC and DB
  • Faster feedback in the UI dashboard

Eventually, I see this change applying to several endpoints - rather than polling in the UI, we have the ATC send us updates.
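
For a concrete sense of the model, here is a minimal sketch of what a client consuming such a watch stream might look like. The /api/v1/jobs path is the existing ListAllJobs route; negotiating the stream via the Accept header and the shape of each event are assumptions for illustration, not this PR's actual wire format.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumption: asking for text/event-stream switches the endpoint from a
	// one-shot JSON list to a stream of job updates.
	req, err := http.NewRequest("GET", "https://ci.example.com/api/v1/jobs", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Accept", "text/event-stream")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Server-sent events arrive as "event:"/"data:" lines separated by blank
	// lines; each data payload would describe a job that changed.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "data:") {
			fmt.Println("job update:", strings.TrimSpace(strings.TrimPrefix(line, "data:")))
		}
	}
}
```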

Notes to reviewer:

To test: visit the dashboard, and then modify some jobs! (e.g. trigger builds, set/delete pipelines, expose/hide pipelines, etc)

This is feature-flagged behind the --enable-watch-endpoints flag, which I've enabled by default in docker-compose.yml.
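
For illustration only, here is one way a flag like this could gate the behaviour on the server side; the handler wiring and the Accept-header check are hypothetical, not this PR's actual code.

```go
package api

import "net/http"

// Config is a stand-in for the ATC's real configuration; only the flag name
// (--enable-watch-endpoints) comes from the PR itself.
type Config struct {
	EnableWatchEndpoints bool
}

// ListAllJobs serves the plain JSON list by default and only upgrades to a
// watch stream when the feature flag is on and the client asked for one.
func ListAllJobs(cfg Config, list, watch http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if cfg.EnableWatchEndpoints && r.Header.Get("Accept") == "text/event-stream" {
			watch.ServeHTTP(w, r)
			return
		}
		list.ServeHTTP(w, r)
	})
}
```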

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the BOSH and Helm packaging; otherwise, ignored for the integration tests (for example, if they are Garden configs that are not displayed in the --help text).

@aoldershaw aoldershaw force-pushed the watch-endpoints branch 3 times, most recently from dd6431d to b26a5f6 Compare June 23, 2020 16:39
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 2 times, most recently from d775bd5 to caaf5fb Compare June 27, 2020 18:40
@aoldershaw aoldershaw changed the title atc: ListAllJobs endpoint can be watched for changes atc/web: ListAllJobs endpoint can be watched for changes Jun 29, 2020
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 2 times, most recently from 2344240 to 9adfb76 Compare July 2, 2020 23:40
@aoldershaw aoldershaw changed the title atc/web: ListAllJobs endpoint can be watched for changes ListAllJobs endpoint can be watched for changes Jul 3, 2020
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 3 times, most recently from 33197f3 to cd51702 Compare July 9, 2020 21:21
@aoldershaw aoldershaw marked this pull request as ready for review July 24, 2020 13:28
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 3 times, most recently from 4a770da to 5319fa5 Compare August 14, 2020 14:16
@jamieklassen
Member

I have come back to this several times and never really convinced myself I understood how it works. The subtleties of queuing/not queuing notifications and handling dropped/interrupted connections or other edge-casey things seem like they would be easy to get wrong, and I couldn't grasp what `dirty` meant. If I were designing something like this from the ground up, I'm sure I'd follow the example of Kubernetes and use something like etcd, which already has a "watch" system.

However, I don't want to block this potentially-valuable performance improvement just because my brain is too small. The feature flag seems to work properly, so this is totally backwards-compatible; I'd rather focus on getting it merged and load-tested than stare at it forever. So let's get it rebased, add the feature flag to the Helm chart and BOSH release, and turn it on in hush-house.

Maybe we can focus some effort on planning the load test experiment: what will we measure (memory on the web nodes, network I/O between the DB and web, number of DB connections, probably something else on the DB)? What do we expect to change? This exercise might motivate us to add some observability. A gauge for the number of open ListAllJobs connections might not be amiss.

jamieklassen previously approved these changes Sep 22, 2020
@aoldershaw
Contributor Author

@jamieklassen thanks for the review! Hmm, I think if you're struggling to fully understand it then there's definitely a need to improve the clarity - I'll try to think about that before merging.

I'd also like to run a load test experiment as you suggested, but haven't found the time. I think it's probably wise to validate the impact before adding this much complexity to the codebase.

@aoldershaw
Contributor Author

Opened #6084 to track load testing

Aidan Oldershaw added 7 commits October 16, 2020 09:40
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
This isn't really an intended feature - just a consequence of the
implementation. Probably doesn't make much sense to have a test for it.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
I realized that it's possible for the `drain` goroutine to leak when
trying to send the pending events to the notify channel (since nothing
will be draining the notify channel when it's unsubscribed)

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
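
The leak pattern described here is a common one in Go. As a sketch with made-up names (not this PR's actual code): a drain goroutine that only does a blocking send can hang forever once the subscriber goes away, whereas selecting on a stop channel gives it a way out.

```go
package watch

// Event stands in for whatever payload the watcher delivers.
type Event struct {
	Payload string
}

// drain flushes pending events to a subscriber's notify channel. A bare
// `notify <- ev` can block forever once the subscriber unsubscribes (nothing
// receives on notify anymore), leaking this goroutine; the stop case lets it
// bail out instead.
func drain(pending []Event, notify chan<- Event, stop <-chan struct{}) {
	for _, ev := range pending {
		select {
		case notify <- ev:
			// delivered to a live subscriber
		case <-stop:
			// subscriber went away; abandon the remaining events
			return
		}
	}
}
```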
Rather than in the watcher itself. Now the watcher is unaware of what
its subscribers can see, and instead just sends all watch events to all
subscribers. It's the responsibility of the subscriber to filter the
list.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
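
Roughly, the division of responsibility described here looks like the following (types and fields are illustrative, not the PR's actual ones): the watcher fans every event out to every subscriber, and each subscriber applies its own visibility filter.

```go
package watch

// JobEvent is a placeholder for the real watch event type.
type JobEvent struct {
	TeamName string
	Public   bool
}

// Watcher fans out every event to every subscriber, with no knowledge of
// what each subscriber is allowed to see.
type Watcher struct {
	subscribers []chan JobEvent
}

func (w *Watcher) Broadcast(ev JobEvent) {
	for _, sub := range w.subscribers {
		sub <- ev
	}
}

// Subscriber filters the firehose down to what its user can see.
type Subscriber struct {
	events    <-chan JobEvent
	teamNames map[string]bool
}

// Next returns the next visible event, skipping any the user cannot see.
func (s *Subscriber) Next() (JobEvent, bool) {
	for ev := range s.events {
		if ev.Public || s.teamNames[ev.TeamName] {
			return ev, true
		}
	}
	return JobEvent{}, false
}
```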
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Aidan Oldershaw added 2 commits October 17, 2020 12:17
by only recomputing the pipeline layers when the inputs change. On a
large deployment, I found that the massive amount of updates to job
builds (e.g. builds starting and completing) was resulting in a ton of
client-side CPU usage. There were ~100 updates/second. I speculate the
slowness was due to the pipeline layers computation, which would run
for each pipeline containing a job that got updated.

With this optimization, job updates that don't affect the job's inputs
do not trigger a recomputation. I suspect this'll be the most common
case - the shape of a pipeline isn't likely to change as frequently as
the jobs run.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
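
The actual change is in the Elm UI, but the underlying idea (only recompute an expensive derived value when the inputs it depends on actually change) is sketched below in Go with made-up types.

```go
package dashboard

import "reflect"

// JobInputs captures the part of a job that affects the pipeline's shape.
type JobInputs struct {
	JobName string
	Passed  []string
}

// layerCache memoizes the expensive pipeline-layers computation, reusing the
// cached result whenever the inputs are unchanged (the common case: builds
// start and finish constantly, but the pipeline's shape rarely changes).
type layerCache struct {
	lastInputs []JobInputs
	layers     [][]string
}

func (c *layerCache) Layers(inputs []JobInputs) [][]string {
	if c.layers != nil && reflect.DeepEqual(inputs, c.lastInputs) {
		return c.layers
	}
	c.lastInputs = inputs
	c.layers = computeLayers(inputs)
	return c.layers
}

// computeLayers stands in for the costly topological layering of the pipeline.
func computeLayers(inputs []JobInputs) [][]string {
	layers := [][]string{}
	for range inputs {
		// real implementation omitted
	}
	return layers
}
```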
@aoldershaw aoldershaw force-pushed the watch-endpoints branch 2 times, most recently from b286dc2 to 41727be Compare October 17, 2020 21:47
Also, I realized I screwed up the optimization by still always calculating
pipelineLayers, even when it's not being used! Whoops, that's fixed now.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Aidan Oldershaw added 3 commits October 17, 2020 23:07
Just like we do in present.Build

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
I deployed an image to the load test environment with a bug (the wrong
JSON was being emitted to the event stream), so Elm was erroring and
hitting the "fallback to polling" path. Unfortunately, we never actually
closed the event stream. This meant we were hitting the ListAllJobs
endpoint every time an event batch came in (which was 5 times a
second!), all while keeping the event stream open. This makes the UI
more resilient in that case.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
rather than on every request. I found that the previous optimization
wasn't actually terribly effective at boosting performance - it only
seemed effective due to a bug in the API. However, not writing to
localStorage on every event seems to improve performance a fair bit.

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
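
Again, the real change is Elm-side, but the batching idea (mark the cache dirty on each event and flush it on an interval, rather than writing on every event) translates roughly to this hedged sketch.

```go
package cache

import (
	"sync"
	"time"
)

// periodicSaver batches persistence: events mark the state dirty, and a ticker
// flushes it at most once per interval instead of once per event. Illustrative
// only; the real change throttles localStorage writes in the Elm UI.
type periodicSaver struct {
	mu    sync.Mutex
	dirty bool
	save  func() // persists the current cache state
}

func (p *periodicSaver) MarkDirty() {
	p.mu.Lock()
	p.dirty = true
	p.mu.Unlock()
}

func (p *periodicSaver) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p.mu.Lock()
			if p.dirty {
				p.save()
				p.dirty = false
			}
			p.mu.Unlock()
		case <-stop:
			return
		}
	}
}
```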
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
@aoldershaw
Contributor Author

aoldershaw commented Oct 18, 2020

@jamieklassen I've done some load testing and documented it here: #6084

The skinny is that, under heavy load (tested with 100 active "dashboards"), enabling watch endpoints seems to result in:

  • Reduced ATC CPU usage and, as a result, increased build scheduling (compared to when watch endpoints are disabled)
    • Without watch endpoints, the ATC's CPU was pegged at 100% (compared to 80% in the baseline), and build scheduling dropped by ~50% as a result
    • With watch endpoints, the ATC's CPU was around 85-95% (compared to 80% in the baseline), and build scheduling only dropped by ~10%
  • Reduced network transmission (ATC -> browser dropped by ~70%, DB -> ATC dropped by ~60%)
  • Slightly increased memory usage on both the ATC and DB (~10-15%)

When not under client load (just the regular build scheduling load), having watch endpoints enabled resulted in slightly higher DB CPU and memory (~5-8% for each metric) - worth noting that at least some of this could be ascribed to the watch endpoint test being run after the vanilla Concourse test, so the amount of data in the DB was a bit larger. The completed DB transaction rate didn't seem to be significantly affected, nor did the number of DB connections.


In doing the load testing, I noticed some performance issues on the UI, primarily caused by writing the jobs cache to localStorage after every batch of events. I added some optimizations to mitigate that (9af0bc7, 100e067).

this exercise might motivate us to add some observability. A gauge for the number of open ListAllJobs connections might not be amiss

While looking into this, I noticed we have a GaugeVec for concurrent requests as part of the concurrent request limiting you worked on. I wonder if we should just always emit the number of ListAllJobs concurrent requests even if there is no concurrent request limit set.

It also makes me wonder about how the watch endpoints should play with the concurrent request limit - it's no longer a "you'll get your turn eventually, so just keep retrying" sort of thing. What do you think?
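
As a rough illustration of that gauge idea (the metric name, label, and use of the Prometheus client directly are assumptions, not Concourse's actual metrics wiring):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// concurrentRequests tracks in-flight requests per endpoint; for a watch
// endpoint a "request" can stay open for a long time, so a gauge of open
// connections is more telling than a request-rate counter.
var concurrentRequests = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "concourse_concurrent_requests",
		Help: "Number of in-flight requests, by endpoint.",
	},
	[]string{"action"},
)

func init() {
	prometheus.MustRegister(concurrentRequests)
}

// TrackConcurrency wraps a handler and keeps the gauge up to date for the
// lifetime of each request.
func TrackConcurrency(action string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		concurrentRequests.WithLabelValues(action).Inc()
		defer concurrentRequests.WithLabelValues(action).Dec()
		next.ServeHTTP(w, r)
	})
}
```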

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
@chenbh
Contributor

chenbh commented Jul 21, 2021

Closing, as it looks like we won't be prioritizing this anytime soon and the merge conflicts look scary.

@chenbh chenbh closed this Jul 21, 2021