
CAPI performance issues at scale #17

Closed
XanderStrike opened this issue Mar 9, 2020 · 15 comments

@XanderStrike

One of our (cf-k8s-networking team) goals for GA is for the networking data and configuration planes to operate performantly at a scale of 1,000 routes and 2,000 AIs (app instances). To that end, we started doing some scaling tests, and in doing so we discovered some issues with CAPI that we thought we'd bring to your attention.

In a space with 1,000 apps and 1,000 external routes:

cf apps times out after 60 seconds. The CLI makes a single request to /v2/spaces/<guid>/summary that eventually times out with an nginx error.

cf v3-apps hangs seemingly forever. When we use -v we see a constant stream of shorter requests; the ones to /v3/processes/<guid>/stats seem to take a while, but not long enough to cause a timeout.

cf app <appname> fails after 3 minutes. It attempts /v3/processes/<guid>/stats 3 times and times out each time.

cf delete works great 😄

cf routes takes 20 seconds.


Some interesting things we found:

  • The log-cache client currently does three retries with 0.1 seconds of sleep between each retry. Since log-cache isn't working yet (we think), this might add a little extra time.
  • The instances reporter has a global workpool. We noticed that once we saturate the API with a Space Summary request, subsequent individual requests to /v3/processes/<guid>/stats time out (see the timing sketch after this list). Before the Space Summary request, these took about 1 second but seemed to succeed.
  • Eirini's way of fetching instance stats seems less performant than just hitting the Diego BBS.
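
For reference, a rough way to observe the workpool saturation from the outside is to time the raw endpoints with cf curl, independent of any CLI-side behavior. This is only a sketch: the space and app names are placeholders for whatever exists in the test environment, and it assumes the web process guid matches the app guid (typical, but worth verifying).

  # Placeholders: substitute the actual space and app names in the test env.
  SPACE_GUID="$(cf space perf-space --guid)"
  APP_GUID="$(cf app perf-app-0001 --guid)"

  # Baseline: time a single stats request against a quiet API.
  time cf curl "/v3/processes/${APP_GUID}/stats"

  # Kick off the expensive space summary (what `cf apps` hits) in the
  # background, then time the same stats request while the API is busy.
  cf curl "/v2/spaces/${SPACE_GUID}/summary" > /dev/null &
  time cf curl "/v3/processes/${APP_GUID}/stats"
  wait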

cc @tcdowney @rosenhouse @ndhanushkodi @rodolfo2488

@cf-gitbot
Collaborator

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/171706511

The labels on this github issue will be updated when the story is started.

@tcdowney
Member

tcdowney commented Mar 9, 2020

Another thing that would help us get better numbers is to include the response time in the nginx access logs. It didn't seem to be present when we checked, so I'm guessing it is only explicitly added in the BOSH-generated nginx config.
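
Something along these lines is what I have in mind. This is only an illustrative sketch (the format name, field order, and log path are placeholders, not the actual capi-k8s-release config), using the standard nginx $request_time and $upstream_response_time variables:

  # Sketch only -- fold into the existing capi-k8s-release nginx config.
  log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                   '$status $body_bytes_sent '
                   'rt=$request_time urt=$upstream_response_time';
  access_log /dev/stdout timed;   # log destination is a placeholder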

@piyalibanerjee
Contributor

piyalibanerjee commented Mar 17, 2020

Thanks for the suggestion, @tcdowney! We will add that nginx response time property to capi-k8s-release so we can get those numbers as we reproduce the error -- we'll get back to you with our findings.

We have a couple questions for you and @XanderStrike:

  1. Do you have the performance environment (or a script to build it) where you found this issue, so we can do some testing on our own? Currently we are testing in a cf-for-k8s environment where we deployed 1,000 apps.
  2. To reproduce this issue, is it critical for the apps to be started? We used the --no-start flag and pushed 1,000 apps, so each app was assigned an external route, as described in the GitHub issue (a rough sketch of the push loop is included at the end of this comment). Our findings so far:
  • cf apps did not time out.
  • cf v3-apps took much longer to execute than cf apps and sometimes timed out. We will investigate this further.
  • cf app <APP_NAME> and cf7 app <APP_NAME> both return results pretty fast, so we couldn't reproduce the performance issue you discovered (this may be related to us not starting the deployed apps?).
  • cf routes, as you observed, did take ~20 seconds. We have plans to cross-team with you all (Route CRD stories), which would improve performance for routes.

EDIT (from @jspawar): we observed all of the above on a small cluster, not at all the same size as the cluster you originally used. We will attempt again with a cluster of similar spec.
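
For reference, the push loop we used for item 2 was roughly the following. This is a sketch rather than the exact script; the app name, app source path, and batching are placeholders:

  # Push 1,000 unstarted apps; each push creates an external route by default.
  for i in $(seq 1 1000); do
    cf push "perf-app-${i}" --no-start -p ./static-app &   # placeholder app source
    (( i % 20 == 0 )) && wait   # keep a bounded number of concurrent pushes
  done
  wait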

@XanderStrike
Author

XanderStrike commented Mar 17, 2020

Thank you for taking a look at the issue!

I'll take your questions in order:

  1. We do! I've spruced it up for you to see here. It takes about 90 minutes to get it created and pushed, but I also have an environment up and available that we can look at if you like. Reach out on Slack.
  2. I'm not sure if it's critical for capi's purposes, but in our tests we do have the apps started. Our chief concern in doing these tests is istio control plane latency (time from cf push to route availability) so for us it is essential that they be running.
  • I just reproduced our issues with cf apps and cf v3-apps timing out, using the script/environment above with CLI v6.5. With cf7, cf apps seems to hang forever, with the same behavior as cf v3-apps.
  • I was unable to reproduce the timeout with cf app <appname> on v6.5; it seems to take about 3-5 seconds now. It does still take a long time (30-60 seconds) with cf7, though.

[two screenshots attached]

Let me know if you have any more questions and feel free to reach out to me (or the team!)!

@cwlbraa
Contributor

cwlbraa commented Mar 18, 2020

Have y'all tried running these tests with the apps spread out across spaces or even with a more realistic ratio of instances per app? We're creating bugs to work out the problems you've found with having many apps in one space, but I'm not sure that tells us very much about how a realistic environment might fail at scale.

@piyalibanerjee
Contributor

Hi @XanderStrike! We made a story in the CLI team's backlog, which we'll cross-team pair with them on, to mitigate the cf7 apps performance issues. We also filed a GitHub issue (which will become a bug in our backlog) so we can solve the 504 Gateway Timeout error we are seeing with cf6 apps in a cf-for-k8s env with many apps deployed in a single space. We'll likely need to collaborate with the Eirini team and/or you to fix it.

@cwlbraa
Contributor

cwlbraa commented Sep 8, 2020

This probably needs to be revisited.

  1. We ultimately decided not to support capi-k8s-release with the cf6 CLI.
  2. VAT did some work to make /v3/apps faster shortly after this issue was created and discussed.
  3. It's possible this is still slow due to Eirini instance reporter performance.

@njbennett
Contributor

@cwlbraa When you say "revisited," what's the next action here? Are we requesting that @XanderStrike retest? Or is there more work on our side? This issue is in the "accepted" state, so what's necessary to finish this out?

@cwlbraa
Contributor

cwlbraa commented Sep 29, 2020

@emalm pinged me out-of-band about this or something similar a few minutes ago. @XanderStrike and @astrieanna have revisited their scale tests, and apparently they are, in fact, still having issues.

@tcdowney
Member

cc @keshav-pivotal

@XanderStrike
Author

To give an unofficial, off-the-cuff status update about cf-for-k8s scaling: we're still struggling with either CAPI or Eirini at this scale, and getting a lot of errors like this:

Unexpected Response
Response Code: 500
Request ID:    2ff6d9ec-4e96-4a1e-bf17-5eb98f8dd1f0::01e36d07-393a-41d2-9439-bf4a2de2bdb1
Code: 0, Title: , Detail: {
  "errors": [
    {
      "title": "UnknownError",
      "detail": "An unknown error occurred.",
      "code": 10001
    }
  ]
}
FAILED

This prevents us from reaching our 1.0 goal of 1,000 apps and 2,000 routes because we often have many of these failures before we can even start testing networking components.

However, we've deprioritized this work and paused scale testing entirely because we're confident networking components can scale as well as or better than the rest of the platform, so we haven't spent much time looking into why these errors happen. We'd also like to have a post-1.0 discussion about who should own scaling tests and which teams should be running them, since we've spent as much time debugging other components as we have our own 😂

@cwlbraa
Contributor

cwlbraa commented Sep 29, 2020

We'd also like to have a post-1.0 discussion about who should own scaling tests and which teams should be running them, since we've spent as much time debugging other components as we have our own 😂

We have been relying on the fact that other folks have been scale testing and we deeply appreciate the work you've done. I empathize that it's super frustrating catching bugs that you're not equipped or empowered to fix.

We'd love to work with you to get these errors fixed and help unblock you, synchronously or asynchronously. A raw "500" from the CLI side is not enough for us to act on, though; we'd need some logs from Cloud Controller.
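
If it helps, something along these lines should pull the relevant Cloud Controller logs in a cf-for-k8s cluster so we can correlate them with the failures above. The namespace and deployment name here are assumptions (they vary by release version), and the request ID is the first portion of the "Request ID" from the CLI error pasted above:

  REQUEST_ID="2ff6d9ec-4e96-4a1e-bf17-5eb98f8dd1f0"   # from the CLI error above
  kubectl -n cf-system get pods | grep api            # confirm the actual pod/deployment names first
  kubectl -n cf-system logs deploy/cf-api-server --all-containers=true \
    | grep "${REQUEST_ID}"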

@KesavanKing

@cwlbraa From our scale tests, all the issues and relevant logs related to the 500s and 503s are documented in #67, and the latency issue in #70.

@jspawar
Contributor

jspawar commented Feb 18, 2021

Re: cf apps taking too long, we think we might have addressed some of that with these changes we just merged in: cloudfoundry/cloud_controller_ng#2123

@heycait

heycait commented Apr 15, 2021

Closing this out due to staleness. If there are more performance concerns, please open a new issue.
