
Set memory and cpu requests and limit values for all containers #65

Merged
2 commits merged into master from qos-scaling on Sep 16, 2020

Conversation

@paulcwarren (Member) commented on Aug 28, 2020:

Relint is currently working on a scaling and Quality of Service (QoS) set of stories.

We are targeting 1.0 to be configured out-of-the-box as a "developer" edition aimed at those users who want to kick the tyres. As part of this, we would like to set limits on mem/cpu.

Since a "developer" edition may not be preferred by everyone, we want each component to be configurable to scale both horizontally (replicas) and vertically (mem/cpu). This will also allow users to deliver a Guaranteed QoS when required (although we are recommending that all of our pods and containers use the Burstable QoS). As part of this, we would like to ask you to do several things:

  1. consider which of your pods/containers you would like to expose scaling properties for.
  2. expose said configuration properties.
  3. set mem and CPU values for all containers in order to provide as much metadata to k8s as possible, so that its scheduler can do as good a job as possible. This PR is an initial attempt at setting these values (see the illustrative stanza below), although we know you are much more likely to have insight into your components' mem/cpu requirements than our guess.
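For reference, here is a minimal sketch of the kind of stanza this PR adds, using the request/limit values that appear in the reviewed hunk below; keeping requests below limits is what places a container in the Burstable QoS class. The CPU request here is a placeholder, not a value from this PR:

```yaml
# Minimal sketch of a Burstable-QoS resources stanza.
# Values mirror the hunk reviewed below except the CPU request,
# which is a placeholder; component teams should substitute their own numbers.
resources:
  requests:
    cpu: 300m       # placeholder: CPU the scheduler reserves for this container
    memory: 300Mi   # reserved memory used for scheduling decisions
  limits:
    cpu: 1000m      # container is throttled above this
    memory: 1.2Gi   # container is OOM-killed above this
```

Setting requests equal to limits for every container in a pod would instead give it the Guaranteed QoS class.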

If you have any questions or concerns, please let us know! Thanks!

#174462927

Co-Authored-By: Angela Chin <achin@pivotal.io>

@cf-gitbot (Collaborator) commented:

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174559282

The labels on this GitHub issue will be updated when the story is started.

```yaml
    memory: 300Mi
  limits:
    cpu: 1000m
    memory: 1.2Gi
```
A member commented:

I know you didn't change this, but for @cloudfoundry/cf-capi how did we choose this memory limit?

On BOSH the limit is ~4GB:
https://github.com/cloudfoundry/capi-release/blob/6f8899976561e64eab8c9804b1ce772083bbf68c/jobs/cloud_controller_ng/spec#L878-L886

In production envs I've seen a well-used CF API settle in at around 2.5-3GB (occasionally bursting higher for certain requests), so I think we should make it match the capi-release limits.
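For comparison, a more forgiving stanza in line with that suggestion might look like the sketch below; the 4Gi figure is an assumption read off the linked BOSH spec, not a measured recommendation:

```yaml
# Sketch only: raise the memory ceiling toward the ~4GB BOSH capi-release
# default instead of 1.2Gi, while keeping the request low so the pod
# stays in the Burstable QoS class.
resources:
  requests:
    memory: 300Mi
  limits:
    cpu: 1000m
    memory: 4Gi   # assumption based on the capi-release spec linked above
```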

@cwlbraa (Contributor) commented on Aug 28, 2020:

It was probably a raw multiple of the initially-deployed memory usage. We set these to keep the API from getting OOM killed during test, not based on any experiment to produce memory bloat or "realistic" workloads.

The member replied:

That's good context, thanks @cwlbraa.

I wonder if we should just make them very forgiving for now (either meeting or exceeding the BOSH defaults). The main use for them is just to be a good neighbor and avoid starving other Pods of resources, but I feel like if we're targeting a "developer edition" 1.0 they might be more frustrating than helpful if they're too strict.

@cwlbraa (Contributor) commented on Aug 28, 2020:

thanks for this! the main change I'd make to what's here is:

  1. worker, deployment_updater, and clock should all be allowed 1000m CPU, and request 300m (see the sketch below). They're GIL'd and sometimes do compute-intensive tasks, but are mostly idle.
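A sketch of that suggestion applied to one of those containers; only the 300m request and 1000m limit come from the item above, and the memory figures are placeholders:

```yaml
# Illustrative resources for e.g. the worker container: request a modest
# 300m of CPU, but allow bursts up to a full core for its occasional
# compute-heavy tasks.
resources:
  requests:
    cpu: 300m
    memory: 300Mi   # placeholder, not a value from this thread
  limits:
    cpu: 1000m
    memory: 1Gi     # placeholder, not a value from this thread
```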

are y'all interested in adding the configurability to this PR? maybe to build some empathy for how cf-for-k8s component teams add optional config parameters to cf-for-k8s?

Things that should be configurable:

  1. cf-api-server instance count
  2. worker instance count
  3. clock memory limit (and request?)
  4. deployment updater memory limit (and request?)
  5. cf-api-controllers memory limit (and request?) AND CPU limit/request

Do we expect to have 4 scaling parameters exposed for each multithreaded, system-singleton, vertical-scale-only pod? It seems a bit much.
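For illustration only, knobs like the ones listed above might surface as ytt data values along the lines of the sketch below; the key names and defaults are hypothetical, not the actual cf-for-k8s or capi-k8s-release schema:

```yaml
#@data/values
---
capi:
  cf_api_server:
    replicas: 2              # horizontal scaling
  worker:
    replicas: 1              # horizontal scaling
  clock:
    memory_request: 300Mi    # vertical scaling for the singleton
    memory_limit: 1Gi
  deployment_updater:
    memory_request: 300Mi
    memory_limit: 1Gi
  cf_api_controllers:
    cpu_request: 100m
    cpu_limit: 500m
    memory_request: 100Mi
    memory_limit: 500Mi
```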

@njbennett (Contributor) commented on Aug 28, 2020:

I'm not sure if this is in-scope for this PR but if we're talking about automatic horizontal scaling...

One of the more dramatic failure modes I've seen in CF-for-VMs is the behavior of the platform when the shared internal database runs out of configured connections. If UAA can't get connections, then the whole control plane becomes inaccessible. Does Kubernetes have any means of taking this shared resource into account in its scaling decisions?

If not, enabling horizontal autoscaling seems like it creates an availability threat to the platform, even as it resolves another one.

@jamespollard8 (Contributor) commented:

> are y'all interested in adding the configurability to this PR? maybe to build some empathy for how cf-for-k8s component teams add optional config parameters to cf-for-k8s?

We've worked on a proposal for the Scaling Interface and it's under review now. So for now, we'd like to place configurability out of scope.

Commit: limits to 300m/1000m, as requested

[#174559282](https://www.pivotaltracker.com/story/show/174559282)

Co-authored-by: Eric Promislow <epromislow@suse.com>
@jamespollard8 (Contributor) commented:

> worker, deployment_updater, and clock should all be allowed 1000m CPU, and request 300m. They're GIL'd and sometimes do compute-intensive tasks, but are mostly idle.

Done ✅

@jamespollard8 (Contributor) commented:

> I'm not sure if this is in-scope for this PR but if we're talking about automatic horizontal scaling...
>
> One of the more dramatic failure modes I've seen in CF-for-VMs is the behavior of the platform when the shared internal database runs out of configured connections. If UAA can't get connections, then the whole control plane becomes inaccessible. Does Kubernetes have any means of taking this shared resource into account in its scaling decisions?
>
> If not, enabling horizontal autoscaling seems like it creates an availability threat to the platform, even as it resolves another one.

Great thoughts, Nat. They're out of scope for this minimal PR but I'd recommend dropping these thoughts in #cf-for-k8s slack channel to get a conversation going.

FWIW, our initial take is that this is something Kubernetes should help handle. We'd hope the Horizontal Pod Autoscaler would help here.
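For context, a minimal HorizontalPodAutoscaler for the API deployment might look like the sketch below (autoscaling/v2beta2 was the current API at the time; the target name and thresholds are assumptions). Note that the HPA only reacts to pod metrics such as CPU, so a shared database connection budget would still need to be respected separately, for example by capping maxReplicas:

```yaml
# Hypothetical HPA sketch; the deployment name and numbers are assumptions.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: cf-api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cf-api-server
  minReplicas: 2
  maxReplicas: 5            # cap chosen with the DB connection budget in mind
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```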

@cwlbraa cwlbraa merged commit 19ce138 into master Sep 16, 2020
jspawar pushed a commit to cloudfoundry/cf-for-k8s that referenced this pull request Sep 16, 2020
- Bumps to kpack 0.1.2 which has multiple breaking changes from the previous version we used
- Bumps to a branch of capi-k8s-release which supports the breaking changes kpack introduced
  - We (CAPI) will merge this into our primary branch after this change has been merged into cf-for-k8s
  - This also encapsulates additional changes which happened in the interim for:
    - cloudfoundry/capi-k8s-release#44
    - cloudfoundry/capi-k8s-release#65

Co-authored-by: Sannidhi Jalukar <sjalukar@pivotal.io>
@jamespollard8 jamespollard8 deleted the qos-scaling branch September 18, 2020 21:25