Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stampeding API requests causes stalled workers #4372

Open
clarafu opened this issue Sep 4, 2019 · 8 comments
Open

Stampeding API requests causes stalled workers #4372

clarafu opened this issue Sep 4, 2019 · 8 comments
Assignees
Labels

Comments

@clarafu
Copy link
Contributor

clarafu commented Sep 4, 2019

Within our lidar-drills environment (which is running with Algorithm V3 and Lidar), we are putting the concourse deployment under a lot of stress (to put it in perspective we have about double the amount of jobs as wings). When we open the web UI to view the pipelines and click around, it causes a stampede of API requests to various endpoints (for example, GetJobs for the pipeline view or GetBuild from the build page).

Screen Shot 2019-09-04 at 2 37 31 PM

These requests build up because there is a new one sent every tick if we haven't received a response back yet as they are all stuck pending. This is becoming a problem now because we have set a dedicated connection pool for the api connections (default to 10 right now) and once it hits that max 10 connections, workers will become stalled because they arn't able to send their heartbeating requests.

Maybe instead of sending a new request every tick, we can just wait for a response after the first request is sent?

Thanks!

@clarafu clarafu added the bug label Sep 4, 2019
@jamieklassen
Copy link
Member

I'm confident we can ensure that single client doesn't have multiple in-flight requests to the same endpoint. This is a good web-ui issue.

@clarafu
Copy link
Contributor Author

clarafu commented Sep 6, 2019

That's great! I'm going to add this to the 6.0 milestone since we probably want get this in with the new algorithm.

@jamieklassen
Copy link
Member

related: #2776

@zoetian
Copy link
Member

zoetian commented Nov 4, 2019

Let's not forget to ensure that there is a FetchUser call every 5 seconds on the dashboard. nothing that comes from the /userinfo endpoint will change in any 5 second interval.

jamieklassen pushed a commit that referenced this issue Nov 4, 2019
#4372

Signed-off-by: Zoe Tian <ztian@pivotal.io>
Co-authored-by: Jamie Klassen <cklassen@pivotal.io>
jamieklassen pushed a commit that referenced this issue Nov 4, 2019
and remove .userState from Dashboard model.

#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
jamieklassen pushed a commit that referenced this issue Nov 4, 2019
#4372

Signed-off-by: Zoe Tian <ztian@pivotal.io>
Co-authored-by: Jamie Klassen <cklassen@pivotal.io>
jamieklassen pushed a commit that referenced this issue Nov 4, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
jamieklassen pushed a commit that referenced this issue Nov 4, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
@jomsie jomsie assigned zoetian and unassigned jomsie Nov 5, 2019
zoetian pushed a commit that referenced this issue Nov 5, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
zoetian pushed a commit that referenced this issue Nov 5, 2019
The display of the orange triangle on pipeline cards depended on the order in
which the APIDataFetched and AllResourcesFetched callbacks were handled.
Accordingly, we reversed the order in which the existing tests for this feature
received those callbacks. Technically we have no coverage for the scenario where
the callbacks are received in the other order now, but the covered case was the
more difficult of the two anyway.

#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
@zoetian zoetian added the paused label Nov 6, 2019
@zoetian zoetian removed the paused label Nov 19, 2019
zoetian pushed a commit that referenced this issue Nov 19, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
zoetian pushed a commit that referenced this issue Nov 20, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
zoetian pushed a commit that referenced this issue Nov 21, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
zoetian pushed a commit that referenced this issue Nov 25, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
zoetian pushed a commit that referenced this issue Nov 27, 2019
#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
jamieklassen pushed a commit that referenced this issue Nov 29, 2019
…n getting team names from groups

#4372

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Co-authored-by: Zoe Tian <ztian@pivotal.io>
@vito vito added this to To do in Runtime side-road Dec 11, 2019
@vito vito removed the web-ui label Dec 11, 2019
@vito
Copy link
Member

vito commented Dec 11, 2019

@pivotal-jamie-klassen I've extracted #4885 to cover the web UI aspect, but there's still a general concern over API requests starving the worker registration API requests of database connections which results in workers stalling.

@concourse/runtime Added this to the Runtime side-road - it's technically API related but I figured it has more to do with worker registration, and putting it in Runtime instead of Core could give us an opportunity to rethink how registration works if we want. (I have no strong suggestion.)

@vito vito removed this from the v6.0.0 milestone Dec 11, 2019
@odormond
Copy link
Contributor

@vito, I noticed that you removed it from the v6.0.0 milestone. As I'm not familiar with your development workflow, I'm wondering if that means it's getting postponed for later or not?

@vito
Copy link
Member

vito commented Jan 8, 2020

@odormond It just means it's not required for v6.0. It doesn't affect the priority; this is still high up in the runtime backlog.

@odormond
Copy link
Contributor

odormond commented Jan 9, 2020

Thanks @vito for the update.
Happy New Year and congratulation for the 5.8.0 release! We're excited by the containerd work and looking forward for a working implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Development

No branches or pull requests

6 participants