
CAPI performance issues at scale #17

Closed
XanderStrike opened this issue Mar 9, 2020 · 15 comments

@XanderStrike

One of our (cf-k8s-networking team) goals for GA is for the networking data and configuration planes to operate performantly at a scale of 1,000 routes and 2,000 AIs (app instances). To that end, we started doing some scaling tests, and in doing so we discovered some issues with CAPI that we thought we'd bring to your attention.

In a space with 1,000 apps and 1,000 external routes:

cf apps times out after 60 seconds. The CLI makes a single request to /v2/spaces/<guid>/summary that eventually times out with an nginx error.

cf v3-apps hangs seemingly forever. When we use -v we see a constant stream of shorter requests; the ones to /v3/processes/<guid>/stats seem to take a while, but not long enough to cause a timeout.

cf app <appname> fails after 3 minutes. It attempts /v3/processes/<guid>/stats 3 times and times out each time.

cf delete works great 😄

cf routes takes 20 seconds.


Some interesting things we found:

  • The log-cache client currently does three retries with 0.1 seconds of sleep between each retry. Since log-cache isn't working yet (we think), this might add a little extra time.
  • The instances reporter has a global workpool. We noticed that once we saturate the API with a Space Summary request, subsequent individual requests to /v3/processes/<guid>/stats time out (see the timing sketch after this list). Before the Space Summary request, these took about 1 second but seemed to succeed.
  • Eirini's way of fetching instance stats seems less performant than just hitting the Diego BBS.
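
For reference, a rough way to observe the workpool saturation from the outside is to time the raw endpoints with cf curl, independent of any CLI-side behavior. This is only a sketch: the space and app names are placeholders for whatever exists in the test environment, and it assumes the web process guid matches the app guid (typical, but worth verifying).

  # Placeholders: substitute the actual space and app names in the test env.
  SPACE_GUID="$(cf space perf-space --guid)"
  APP_GUID="$(cf app perf-app-0001 --guid)"

  # Baseline: time a single stats request against a quiet API.
  time cf curl "/v3/processes/${APP_GUID}/stats"

  # Kick off the expensive space summary (what `cf apps` hits) in the
  # background, then time the same stats request while the API is busy.
  cf curl "/v2/spaces/${SPACE_GUID}/summary" > /dev/null &
  time cf curl "/v3/processes/${APP_GUID}/stats"
  wait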

cc @tcdowney @rosenhouse @ndhanushkodi @rodolfo2488

@cf-gitbot
Collaborator

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/171706511

The labels on this github issue will be updated when the story is started.

@tcdowney
Member

tcdowney commented Mar 9, 2020

Another thing that would help us get better numbers is to include the response time in the nginx access logs. It didn't seem to be present when we checked, so I'm guessing it is only explicitly added in the BOSH-generated nginx config.
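
Something along these lines is what I have in mind. This is only an illustrative sketch (the format name, field order, and log path are placeholders, not the actual capi-k8s-release config), using the standard nginx $request_time and $upstream_response_time variables:

  # Sketch only -- fold into the existing capi-k8s-release nginx config.
  log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                   '$status $body_bytes_sent '
                   'rt=$request_time urt=$upstream_response_time';
  access_log /dev/stdout timed;   # log destination is a placeholder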

@piyalibanerjee
Contributor

piyalibanerjee commented Mar 17, 2020

Thanks for the suggestion, @tcdowney! We will add that nginx response time property to capi-k8s-release so we can get those numbers as we reproduce the error -- we'll get back to you with our findings.

We have a couple questions for you and @XanderStrike:

  1. Do you have the performance environment (or a script to build it) where you found this issue, so we can do some testing on our own? Currently we are testing in a cf-for-k8s environment where we deployed 1,000 apps.
  2. To reproduce this issue, is it critical for the apps to be started? We used the --no-start flag and pushed 1,000 apps, so each app was assigned an external route, as described in the GitHub issue (a rough sketch of the push loop is included at the end of this comment). Our findings so far:
  • cf apps did not time out.
  • cf v3-apps took much longer to execute than cf apps and sometimes timed out. We will investigate this further.
  • cf app <APP_NAME> and cf7 app <APP_NAME> both return results pretty fast, so we couldn't reproduce the performance issue you discovered (this may be related to us not starting the deployed apps?).
  • cf routes, as you observed, did take ~20 seconds. We have plans to cross-team with you all (Route CRD stories), which would improve performance for routes.

EDIT (from @jspawar): we observed all of the above on a small cluster, not at all the same size as the cluster you originally used. We will attempt again with a cluster of similar spec.
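
For reference, the push loop we used for item 2 was roughly the following. This is a sketch rather than the exact script; the app name, app source path, and batching are placeholders:

  # Push 1,000 unstarted apps; each push creates an external route by default.
  for i in $(seq 1 1000); do
    cf push "perf-app-${i}" --no-start -p ./static-app &   # placeholder app source
    (( i % 20 == 0 )) && wait   # keep a bounded number of concurrent pushes
  done
  wait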

@XanderStrike
Author

XanderStrike commented Mar 17, 2020

Thank you for taking a look at the issue!

I'll take your questions in order:

  1. We do! I've spruced it up for you to see here. It takes about 90 minutes to get it created and pushed, but I also have an environment up and available that we can look at if you like. Reach out on Slack.
  2. I'm not sure if it's critical for capi's purposes, but in our tests we do have the apps started. Our chief concern in doing these tests is istio control plane latency (time from cf push to route availability) so for us it is essential that they be running.
  • I just reproduced our issues with cf apps and cf v3-apps timing out, using the script/environment above with CLI v6.5. With cf7, cf apps seems to hang forever, with the same behavior as cf v3-apps.
  • I was unable to reproduce the timeout with cf app <appname> on v6.5; it seems to take about 3-5 seconds now. It does still take a long time (30-60 seconds) with cf7, though.

[two screenshots attached]

Let me know if you have any more questions and feel free to reach out to me (or the team!)!

@cwlbraa
Contributor

cwlbraa commented Mar 18, 2020

Have y'all tried running these tests with the apps spread out across spaces or even with a more realistic ratio of instances per app? We're creating bugs to work out the problems you've found with having many apps in one space, but I'm not sure that tells us very much about how a realistic environment might fail at scale.

@piyalibanerjee
Contributor

Hi @XanderStrike! We made a story in the CLI team's backlog, which we'll cross-team pair with them on, to mitigate the cf7 apps performance issues. We also filed a GitHub issue (which will become a bug in our backlog) so we can solve the 504 Gateway Timeout error we are seeing with cf6 apps in a cf-for-k8s env with many apps deployed in a single space. We'll likely need to collaborate with the Eirini team and/or you to fix it.

@cwlbraa
Contributor

cwlbraa commented Sep 8, 2020

This probably needs to be revisited.

  1. We ultimately decided not to support capi-k8s-release with the cf6 CLI.
  2. VAT did some work to make /v3/apps faster shortly after this issue was created and discussed.
  3. It's possible this is still slow due to Eirini instance reporter performance.

@njbennett
Contributor

@cwlbraa When you say "revisited," what's the next action here? Are we requesting that @XanderStrike retest? Or is there more work on our side? This issue is in the "accepted" state, so what's necessary to finish this out?

@cwlbraa
Contributor

cwlbraa commented Sep 29, 2020

@emalm pinged me out-of-band about this or something similar a few minutes ago. @XanderStrike and @astrieanna have revisited their scale tests, and apparently they are, in fact, still having issues.

@tcdowney
Member

cc @keshav-pivotal

@XanderStrike
Author

To give an unofficial, off-the-cuff status update about cf-for-k8s scaling: we're still struggling with either CAPI or Eirini at this scale, and getting a lot of errors like this:

Unexpected Response
Response Code: 500
Request ID:    2ff6d9ec-4e96-4a1e-bf17-5eb98f8dd1f0::01e36d07-393a-41d2-9439-bf4a2de2bdb1
Code: 0, Title: , Detail: {
  "errors": [
    {
      "title": "UnknownError",
      "detail": "An unknown error occurred.",
      "code": 10001
    }
  ]
}
FAILED

This prevents us from reaching our 1.0 goal of 1,000 apps and 2,000 routes because we often have many of these failures before we can even start testing networking components.

However, we've deprioritized this work and paused scale testing entirely because we're confident networking components can scale as well as or better than the rest of the platform, so we haven't spent much time looking into why these errors happen. We'd also like to have a post-1.0 discussion about who should own scaling tests and which teams should be running them, since we've spent as much time debugging other components as we have our own 😂

@cwlbraa
Contributor

cwlbraa commented Sep 29, 2020

We'd also like to have a post-1.0 discussion about who should own scaling tests and which teams should be running them, since we've spent as much time debugging other components as we have our own 😂

We have been relying on the fact that other folks have been scale testing and we deeply appreciate the work you've done. I empathize that it's super frustrating catching bugs that you're not equipped or empowered to fix.

We'd love to work with you to get these errors fixed and help unblock you, synchronously or asynchronously. A raw "500" from the CLI side is not enough for us to act on, though; we'd need some logs from Cloud Controller.
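
If it helps, something along these lines should pull the relevant Cloud Controller logs in a cf-for-k8s cluster so we can correlate them with the failures above. The namespace and deployment name here are assumptions (they vary by release version), and the request ID is the first portion of the "Request ID" from the CLI error pasted above:

  REQUEST_ID="2ff6d9ec-4e96-4a1e-bf17-5eb98f8dd1f0"   # from the CLI error above
  kubectl -n cf-system get pods | grep api            # confirm the actual pod/deployment names first
  kubectl -n cf-system logs deploy/cf-api-server --all-containers=true \
    | grep "${REQUEST_ID}"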

@KesavanKing

@cwlbraa From our scale tests, all the issues and relevant logs related to the 500s and 503s are documented in #67, and the latency issue in #70.

@jspawar
Contributor

jspawar commented Feb 18, 2021

Re: cf apps taking too long, we think we might have addressed some of that with these changes we just merged in: cloudfoundry/cloud_controller_ng#2123

@heycait

heycait commented Apr 15, 2021

Closing this out due to staleness. If there are more performance concerns, please open a new issue.
