CAPI performance issues at scale #17
Comments
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/171706511 The labels on this github issue will be updated when the story is started.
Another thing that would help us get better numbers is to include the response time in the nginx access logs. It didn't seem to be present when we checked, so I'm guessing it is explicitly added in the BOSH-generated nginx config.
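As a point of reference, stock nginx exposes per-request latency via the `$request_time` variable; below is a minimal sketch of what adding and checking it might look like. The Kubernetes namespace, deployment, and container names are assumptions, not the actual capi-k8s-release layout.

```bash
# Example nginx log_format that includes response time ($request_time is a
# standard nginx variable); something along these lines would go in the
# capi-k8s-release nginx config template:
#   log_format capi '$remote_addr "$request" $status $body_bytes_sent $request_time';
#
# Quick way to confirm whether a request-time field already shows up in the
# access logs (namespace/deployment/container names are assumptions):
kubectl -n cf-system logs deploy/cf-api-server -c nginx --tail=50
```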
Thanks for the suggestion, @tcdowney! We will add that nginx response time property to capi-k8s-release so we can get those numbers as we reproduce the error -- we'll get back to you with our findings. We have a couple questions for you and @XanderStrike:
EDIT (from @jspawar): we observed all of the above on a small cluster, not one of the same size as the cluster you originally used. We will attempt again with a cluster of similar spec.
Thank you for taking a look at the issue! I'll take your questions in order:
Let me know if you have any more questions and feel free to reach out to me (or the team!)!
Have y'all tried running these tests with the apps spread out across spaces or even with a more realistic ratio of instances per app? We're creating bugs to work out the problems you've found with having many apps in one space, but I'm not sure that tells us very much about how a realistic environment might fail at scale.
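As a rough illustration (not a prescribed test plan), spreading the same load across spaces could look something like the sketch below; the org/space names, counts, and sample app path are placeholder assumptions.

```bash
#!/usr/bin/env bash
# Illustrative only: push apps spread across several spaces, with a couple of
# instances per app. Org/space names, counts, and app path are placeholders.
set -euo pipefail

ORG="scale-test"
SPACES=20          # spread apps across 20 spaces instead of 1
APPS_PER_SPACE=50  # 20 * 50 = 1,000 apps total
INSTANCES=2        # a more realistic instances-per-app ratio

for s in $(seq 1 "${SPACES}"); do
  cf create-space "space-${s}" -o "${ORG}"
  cf target -o "${ORG}" -s "space-${s}"
  for a in $(seq 1 "${APPS_PER_SPACE}"); do
    cf push "app-${s}-${a}" -i "${INSTANCES}" -p ./sample-app
  done
done
```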
Hi @XanderStrike! We made a story in the CLI team's backlog, which we'll cross-team pair with them on, to mitigate the
This probably needs to be revisited.
@cwlbraa When you say "revisited," what's the next action here? Are we requesting that @XanderStrike retest? Or is there more work on our side? This issue is in the "accepted" state, so what's necessary to finish this out?
@emalm pinged me out-of-band about this or something similar a few minutes ago. @XanderStrike @astrieanna have revisited their scale tests and apparently they are, in fact, still having issues.
cc @keshav-pivotal
To give an unofficial, off-the-cuff status update about cf-for-k8s scaling: we're still struggling with either CAPI or Eirini at this scale, getting a lot of these kinds of things:
This prevents us from reaching our 1.0 goal of 1,000 apps and 2,000 routes because we often have many of these failures before we can even start testing networking components. However, we've deprioritized this work and paused scale testing entirely because we're confident networking components can scale as well as or better than the rest of the platform, so we haven't spent much time looking into why these errors happen. We'd also like to have a post-1.0 discussion about who should own scaling tests and which teams should be running them, since we've spent as much time debugging other components as we have our own 😂
We have been relying on the fact that other folks have been scale testing, and we deeply appreciate the work you've done. I empathize that it's super frustrating catching bugs that you're not equipped or empowered to fix. We'd love to work with you to get these errors fixed and help unblock you, synchronously or asynchronously. A raw "500" from the CLI side is not enough for us to act on, though; we'd need some logs from Cloud Controller.
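For whoever picks this back up, pulling Cloud Controller logs out of a cf-for-k8s cluster might look roughly like this; the namespace, deployment, and container names are assumptions and may differ from the actual capi-k8s-release layout.

```bash
# Names below are assumptions; adjust to match the deployed capi-k8s-release.
kubectl get pods -n cf-system
kubectl logs -n cf-system deploy/cf-api-server -c cloud-controller-ng --since=1h > cc.log
grep -iE 'error|timeout|500' cc.log
```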
Re:
Closing this out due to staleness. If there are more performance concerns, please open a new issue. |
One of our (cf-k8s-networking team) goals for GA is for the networking data and configuration planes to operate performantly at a scale of 1,000 routes and 2,000 AIs (app instances). To that end we started doing some scaling tests, and in doing so we discovered some issues with CAPI that we thought we'd bring to your attention.
In a space with 1,000 apps and 1,000 external routes:

- `cf apps` times out after 60 seconds. The CLI makes a single request to `/v2/spaces/<guid>/summary` that eventually times out with an nginx error.
- `cf v3-apps` hangs seemingly forever. When we use `-v` we see a constant stream of shorter requests; the ones to `/v3/processes/<guid>/stats` seem to take a while, but not long enough to cause a timeout.
- `cf app <appname>` fails after 3 minutes. It attempts `/v3/processes/<guid>/stats` 3 times and times out each time.
- `cf delete` works great 😄
- `cf routes` takes 20 seconds.

Some interesting things we found:

- There are `0.1` seconds of sleep between each retry. This might add a little bit since it's not working yet (we think).
- The Space Summary request appears to cause requests to `/v3/processes/<guid>/stats` to time out. Before the Space Summary request these took about 1 second but seemed to succeed.

A rough sketch for reproducing these endpoint timings follows below.

cc @tcdowney @rosenhouse @ndhanushkodi @rodolfo2488
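As a minimal, hypothetical reproduction sketch for the timings above: the space name is a placeholder, `jq` is assumed to be installed, and the endpoints are the ones named in the list.

```bash
#!/usr/bin/env bash
# Hypothetical timing sketch for the endpoints discussed above.
# "scale-test-space" is a placeholder space name; jq is assumed to be installed.
set -euo pipefail

SPACE_GUID="$(cf space scale-test-space --guid)"

# Space Summary endpoint behind `cf apps` (v2)
time cf curl "/v2/spaces/${SPACE_GUID}/summary" > /dev/null

# v3 apps listing, roughly what `cf v3-apps` pages through
time cf curl "/v3/apps?space_guids=${SPACE_GUID}&per_page=100" > /dev/null

# Per-process stats endpoint that was timing out: grab one process and time it
APP_GUID="$(cf curl "/v3/apps?space_guids=${SPACE_GUID}&per_page=1" | jq -r '.resources[0].guid')"
PROCESS_GUID="$(cf curl "/v3/apps/${APP_GUID}/processes" | jq -r '.resources[0].guid')"
time cf curl "/v3/processes/${PROCESS_GUID}/stats" > /dev/null
```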