ListAllJobs endpoint can be watched for changes #5802
Conversation
I have come back to this several times and never really convinced myself I understood how it works. The subtleties of queuing/not queuing notifications and handling dropped/interrupted connections and other edge-casey things seem like they would be easy to get wrong, and I couldn't quite grasp how those cases are handled.

However, I don't want to block this potentially-valuable performance improvement just because my brain is too small. The feature flag seems to work properly, so this is totally backwards-compatible, and I'd rather focus on getting it merged and load-tested instead of staring at it forever. So let's get it rebased, add the feature flag to the Helm chart and BOSH release, and turn it on in hush-house.

Maybe we can focus some effort on planning the load test experiment - what will we measure (memory on the web nodes, network I/O between the DB and web, number of DB connections? probably something else on the DB)? What do we expect to change? This exercise might motivate us to add some observability. A gauge for the number of open ListAllJobs connections might not be amiss.
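As a rough illustration of the suggested gauge, a minimal sketch using the Prometheus Go client is shown below; the metric name and handler wrapper are hypothetical, not Concourse's actual metrics wiring.

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// openWatchConnections is a hypothetical gauge tracking how many
// ListAllJobs watch connections are currently open.
var openWatchConnections = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "concourse_list_all_jobs_watch_connections",
	Help: "Number of currently open ListAllJobs watch connections.",
})

func init() {
	prometheus.MustRegister(openWatchConnections)
}

// CountWatchConnection wraps a watch handler so the gauge tracks each
// connection for as long as its request is being served.
func CountWatchConnection(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		openWatchConnections.Inc()
		defer openWatchConnections.Dec()
		next.ServeHTTP(w, r)
	})
}
```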
@jamieklassen thanks for the review! Hmm, I think if you are struggling to fully understand it then there's definitely a need to improve upon the clarity - I'll try to think about that before merging. I'd also like to do a load test experiment as you suggested, but haven't found the time. I think it's probably wise to validate the impact it has before adding so much more complexity to the codebase.
Opened #6084 to track load testing.
This isn't really an intended feature - just a consequence of the implementation. Probably doesn't make much sense to have a test for it. Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
I realized that it's possible for the `drain` goroutine to leak when trying to send the pending events to the notify channel (since nothing will be draining the notify channel when it's unsubscribed) Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
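A minimal sketch of the shape of that fix, assuming a hypothetical `drain` helper and `Event` type rather than the actual Concourse code: the goroutine selects on a stop signal while forwarding pending events, so it can exit instead of blocking on a channel nobody is reading anymore.

```go
package watch

// Event is a stand-in for whatever payload the watcher emits.
type Event struct {
	Data string
}

// drain forwards pending events to a subscriber's notify channel.
// Selecting on stop lets the goroutine exit when the subscriber
// unsubscribes, instead of blocking forever on a channel nobody drains.
func drain(pending []Event, notify chan<- Event, stop <-chan struct{}) {
	for _, ev := range pending {
		select {
		case notify <- ev:
		case <-stop:
			// The subscriber went away; abandon the remaining events
			// rather than leaking this goroutine.
			return
		}
	}
}
```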
Rather than in the watcher itself. Now the watcher is unaware of what its subscribers can see, and instead just sends all watch events to all subscribers. It's the responsibility of the subscriber to filter the list. Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
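Roughly, the division of responsibility looks like the sketch below (made-up types, not the real watcher API): the watcher broadcasts everything, and each subscriber applies its own visibility filter.

```go
package watch

// JobEvent is a stand-in for whatever the watcher emits about a job.
type JobEvent struct {
	TeamName     string
	PipelineName string
	Public       bool
}

// broadcast sends every event to every subscriber, with no knowledge
// of what each subscriber is allowed to see.
func broadcast(ev JobEvent, subscribers []chan<- JobEvent) {
	for _, sub := range subscribers {
		sub <- ev
	}
}

// filterVisible is the subscriber's side: it drops events the client
// has no access to before passing them along.
func filterVisible(events <-chan JobEvent, canSee func(JobEvent) bool, out chan<- JobEvent) {
	for ev := range events {
		if canSee(ev) {
			out <- ev
		}
	}
}
```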
by only recomputing the pipeline layers when the inputs change. On a large deployment, I found that the massive volume of updates to job builds (e.g. builds starting and completing) was resulting in a ton of CPU usage client-side. There were ~100 updates/second. I speculate the slowness was due to the pipeline layers computation, which would run for each pipeline containing a job that got updated. With this optimization, job updates that don't affect the job's inputs do not trigger a recomputation. I suspect this'll be the most common case - the shape of a pipeline isn't likely to change as frequently as the jobs run. Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
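The change itself lives in the Elm dashboard code, but the idea is plain memoization: remember which inputs the layers were last computed from, and only recompute when those inputs actually differ. A rough, language-agnostic sketch of that idea (hypothetical types, Go used only for illustration):

```go
package dashboard

import "reflect"

// layerCache remembers the inputs the layers were last computed from,
// so job updates that don't change any inputs skip the recomputation.
type layerCache struct {
	lastInputs [][]string // per-job input names (illustrative)
	layers     [][]string // the previously computed layers (illustrative)
}

func (c *layerCache) layersFor(inputs [][]string, compute func([][]string) [][]string) [][]string {
	if c.layers != nil && reflect.DeepEqual(inputs, c.lastInputs) {
		// Inputs unchanged (the common case when builds merely start
		// or finish), so reuse the cached layers.
		return c.layers
	}
	c.lastInputs = inputs
	c.layers = compute(inputs)
	return c.layers
}
```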
Also, I realize I screwed up the optimization by still calculating pipelineLayers unconditionally, even when it's not being used! Whoops, that's fixed now. Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Just like we do in present.Build. Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
I deployed an image to the load test environment with a bug (the wrong JSON was being emitted to the event stream), so Elm was erroring and hitting the "fallback to polling" path. Unfortunately, we never actually closed the event stream. This meant we were hitting the list all jobs endpoint every time an event batch came in (which was 5 times a second!), all while keeping the event stream open. This makes the UI more resilient in that case. Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
rather than on every request. I found that the previous optimization wasn't actually terribly effective at boosting performance - it only seemed effective due to a bug in the API. However, not writing to localStorage on every event seems to improve performance a fair bit. Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
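The real code writes the jobs cache from Elm to localStorage, but the pattern is just throttled persistence: keep the latest snapshot in memory and flush it on a timer instead of on every event. A rough Go sketch of that pattern (not the actual UI code):

```go
package cache

import (
	"sync"
	"time"
)

// throttledStore keeps the latest snapshot in memory and persists it
// at most once per interval, rather than on every incoming event.
type throttledStore struct {
	mu       sync.Mutex
	snapshot []byte
	persist  func([]byte) // e.g. a localStorage-like write
}

// Update replaces the in-memory snapshot; it does not persist anything.
func (s *throttledStore) Update(snapshot []byte) {
	s.mu.Lock()
	s.snapshot = snapshot
	s.mu.Unlock()
}

// Run flushes the current snapshot on each tick until stopped.
func (s *throttledStore) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			s.mu.Lock()
			snap := s.snapshot
			s.mu.Unlock()
			if snap != nil {
				s.persist(snap)
			}
		case <-stop:
			return
		}
	}
}
```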
@jamieklassen I've done some load testing and documented it here: #6084. The skinny is that, under heavy load (tested with 100 active "dashboards"), enabling watch endpoints seems to result in:
Without client load (just the regular build scheduling load), having watch endpoints enabled resulted in slightly higher DB CPU and memory usage (~5-8% for each metric) - worth noting that at least some of this could be ascribed to the fact that the watch endpoint test was run after the vanilla Concourse test, so the amount of data in the DB was a bit larger. The completed DB transaction rate didn't seem to be significantly affected, nor did the number of DB connections.

In doing the load testing, I noticed some performance issues in the UI, primarily caused by writing the jobs cache to localStorage after every batch of events. I added some optimizations to mitigate that (9af0bc7, 100e067).
While looking into this, I noticed we have a concurrent request limit that applies to the ListAllJobs endpoint. It also makes me wonder about how the watch endpoints should play with the concurrent request limit - it's no longer a "you'll get your turn eventually, so just keep retrying" sort of thing. What do you think?
Closing as it looks like we won't be prioritizing this anytime soon and the merge conflicts look scary.
What does this PR accomplish?
Bug Fix | Feature | Documentation
Related RFC: concourse/rfcs#61
Changes proposed by this PR:
This PR aims to provide functionality similar to Kubernetes' `?watch` API (e.g. `kubectl get ... --watch`) for the `ListAllJobs` endpoint. The motivation for this is described more in depth in the RFC, but essentially, it should provide:
Eventually, I see this change applying to several endpoints - rather than polling in the UI, we have the ATC send us updates.
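As a rough sketch of what consuming such an endpoint could look like from a client's perspective - assuming the changes arrive as a streamed sequence of JSON payloads over a long-lived HTTP response, and using an illustrative URL and query parameter rather than the exact API exposed by this PR:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Hypothetical: ask the ListAllJobs endpoint to keep the connection
	// open and stream changes, instead of polling it repeatedly.
	resp, err := http.Get("http://localhost:8080/api/v1/jobs?watch=true")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		// Each line would carry a JSON-encoded batch of job updates.
		fmt.Println(scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		log.Println("stream ended:", err)
	}
}
```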
Notes to reviewer:
To test: visit the dashboard, and then modify some jobs! (e.g. trigger builds, set/delete pipelines, expose/hide pipelines, etc.)
This is feature-flagged behind the `--enable-watch-endpoints` flag, which I've enabled by default in `docker-compose.yml`.
Contributor Checklist
Reviewer Checklist
BOSH and Helm packaging; otherwise, ignored for the integration tests (for example, if they are Garden configs that are not displayed in the `--help` text).