
Add k8s resource limit patch lib #19

Merged
merged 40 commits into main from feature/resource-patch on Aug 3, 2022

Conversation

@sed-i (Contributor) commented Jun 21, 2022

Issue

Need to be able to limit resource usage of a charm.

Crossref: OPENG-272

Solution

  • Create lib to patch k8s compute resource limits and requests.
  • Turn the placeholder charm into a tester charm, and use it in itests.

Context

NTA.

Testing Instructions

  • Deploy prom/318.
  • Compare the patched StatefulSet to the one produced by kubectl patch statefulset prom -n welcome -p '{"spec": {"template": {"spec": {"containers": [{"name":"prometheus", "resources": {"limits": {"cpu": "2", "memory":"2Gi"}, "requests": {"cpu": "2", "memory": "1Gi"}} }] }}}}' (see the verification sketch below).
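The patched values can also be read back programmatically. A minimal verification sketch, assuming the kubernetes Python client and a reachable kubeconfig; the prom/welcome names come from the command above and everything else is illustrative, not part of this PR:

```python
# Illustrative only: read back the resources section of the patched
# StatefulSet so it can be compared with the values passed to `kubectl patch`.
# Assumes a kubeconfig is available to the test runner.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

sts = apps.read_namespaced_stateful_set(name="prom", namespace="welcome")
resources = sts.spec.template.spec.containers[0].resources
print("limits:  ", resources.limits)    # expect {'cpu': '2', 'memory': '2Gi'}
print("requests:", resources.requests)  # expect {'cpu': '2', 'memory': '1Gi'}
```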

Release Notes

Add k8s resource limit patch lib.

@sed-i (Contributor, Author) commented Jun 23, 2022

There's a complication:

  • When patching the StatefulSet with {"memory": "0.9Gi"}, k8s generates a PodSpec with {"memory": "966367641600m"}, i.e. millibytes (to convert it back to GiB: value = bitmath.Byte(float(memory[:-1]) / 1000).to_GiB().value).
  • The k8s docs claim that the m suffix is part of SI, but that seems to be false, at least for binary multiples.

The .is_ready() method compares the StatefulSet to the actual pod, and it is currently buggy because of the above.

Is there a way to force K8s to KEEP the same unit the user provided? @simskij @rbarry82

UPDATE: similarly, {"cpu": "0.30000000000000004"} -> {"cpu": "301m"}
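A minimal sketch of that round-trip, assuming the bitmath package; the helper name is hypothetical and only mirrors the inline expression above:

```python
# Illustrative only: convert a k8s "milli" quantity such as "966367641600m"
# back to GiB. Assumes the bitmath package is installed.
import bitmath

def milli_quantity_to_gib(memory: str) -> float:
    """Convert a k8s quantity with an 'm' (milli) suffix to GiB."""
    millibytes = float(memory[:-1])  # strip the trailing 'm'
    return bitmath.Byte(millibytes / 1000).to_GiB().value

print(milli_quantity_to_gib("966367641600m"))  # 0.9
```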

@simskij (Member) commented Jun 23, 2022

> There's a complication:
>
>   • When patching the StatefulSet with {"memory": "0.9Gi"}, k8s generates a PodSpec with {"memory": "966367641600m"}, i.e. millibytes.
>   • The k8s docs claim that the m suffix is part of SI, but that seems to be false, at least for binary multiples.
>
> The .is_ready() method compares the StatefulSet to the actual pod, and it is currently buggy because of the above.
>
> Is there a way to force K8s to KEEP the same unit the user provided? @simskij @rbarry82

Not that I know of, but I'll dig. What happens if you instead of 0.9Gi set it to 966367641600m from the get-go? Does it still bug?

@simskij (Member) commented Jun 23, 2022

> Is there a way to force K8s to KEEP the same unit the user provided?
>
> Not that I know of, but I'll dig. What happens if you instead of 0.9Gi set it to 966367641600m from the get-go? Does it still bug?

Ok, so I found it. The reason for this is that 0.9Gi (gibibytes), in its canonical form (i.e. without fractional digits, using the largest possible suffix), ends up with the m suffix. From the k8s docs:

> Before serializing, Quantity will be put in "canonical form". This means that Exponent/suffix will be adjusted up or down (with a corresponding increase or decrease in Mantissa) such that:
> a. No precision is lost
> b. No fractional digits will be emitted
> c. The exponent (or suffix) is as large as possible.
> The sign will be omitted unless the number is negative.

If we just disallow setting fractions in the first place, the problem is solved. For instance, 900Mi will not get converted, while 0.9Gi will. Likely, 900Mi was also what the user tried to express when they put in 0.9Gi, which isn't entirely accurate, as 0.9Gi = 921.6Mi.
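A minimal sketch of that validation, assuming quantities arrive as plain strings; the regex and function name are illustrative, not the lib's API:

```python
# Illustrative only: reject quantities with fractional digits so that k8s's
# canonicalization cannot change the unit behind the user's back
# (e.g. "900Mi" survives as-is, while "0.9Gi" would become "966367641600m").
import re

# integer value followed by an optional k8s suffix (binary or decimal SI)
_QUANTITY_RE = re.compile(r"^\d+(Ki|Mi|Gi|Ti|Pi|Ei|m|k|M|G|T|P|E)?$")

def validate_quantity(value: str) -> None:
    """Raise ValueError for fractional quantities such as '0.9Gi'."""
    if not _QUANTITY_RE.match(value):
        raise ValueError(
            f"{value!r} is not an integer k8s quantity; "
            "use e.g. '900Mi' instead of '0.9Gi'"
        )

validate_quantity("900Mi")   # ok
validate_quantity("0.9Gi")   # raises ValueError
```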

@sed-i (Contributor, Author) commented Jun 23, 2022

> If we just disallow setting fractions in the first place, the problem is solved.

There's also the case of 1000000Mi. We should probably support what k8s supports and do proper conversion.
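A minimal sketch of such a conversion, assuming both sides of the comparison are normalized to an exact count of milli base units (via Decimal) before comparing; the names and suffix table are illustrative, not the lib's implementation:

```python
# Illustrative only: normalize k8s quantities to an exact integer number of
# "milli" base units so that values like "0.9Gi" and "966367641600m" compare
# equal regardless of the suffix k8s chose when canonicalizing.
from decimal import Decimal

_MULTIPLIERS = {
    "m": Decimal("0.001"),
    "k": Decimal(1000), "M": Decimal(1000) ** 2, "G": Decimal(1000) ** 3,
    "T": Decimal(1000) ** 4, "P": Decimal(1000) ** 5, "E": Decimal(1000) ** 6,
    "Ki": Decimal(1024), "Mi": Decimal(1024) ** 2, "Gi": Decimal(1024) ** 3,
    "Ti": Decimal(1024) ** 4, "Pi": Decimal(1024) ** 5, "Ei": Decimal(1024) ** 6,
}

def to_millis(quantity: str) -> Decimal:
    """Return the quantity as an exact number of milli base units."""
    # check longer (binary) suffixes before single-letter ones
    for suffix in sorted(_MULTIPLIERS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * _MULTIPLIERS[suffix] * 1000
    return Decimal(quantity) * 1000  # bare number, e.g. "2" cpus

assert to_millis("0.9Gi") == to_millis("966367641600m")
assert to_millis("2") == to_millis("2000m")
```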

Abuelodelanada previously approved these changes Jun 29, 2022
sed-i requested a review from dstathis June 29, 2022 13:36
sed-i dismissed stale reviews from balbirthomas and Abuelodelanada via 4a8ab5a July 5, 2022 06:17
dstathis previously approved these changes Jul 6, 2022
@sed-i (Contributor, Author) commented Jul 7, 2022

Problem: when there is a single unit of prometheus and the user sets resource limits that are too high, juju is stuck in

prometheus-k8s/0  unknown   lost   0/1

because the pod cannot be scheduled, so there is no running charm to pick up juju config changes and re-patch the statefulset.

Any ideas @simskij @rbarry82?
For example, is there a way to ask K8s ahead of time whether a given limit is going to be a problem, so I could set BlockedStatus instead?

@rbarry82 (Contributor) commented Jul 7, 2022

> Problem: when there is a single unit of prometheus and the user sets resource limits that are too high, juju is stuck in
>
> prometheus-k8s/0  unknown   lost   0/1
>
> because the pod cannot be scheduled, so there is no running charm to pick up juju config changes and re-patch the statefulset.
>
> Any ideas @simskij @rbarry82? For example, is there a way to ask K8s ahead of time whether a given limit is going to be a problem, so I could set BlockedStatus instead?

This is kind of a consistent mess with the kube scheduler. That is, the kube scheduler is not aware of what else is happening on the system, and a trivial process which just malloc()s up to 80% of memory will happily let "guaranteed" pods be scheduled, which are then OOM killed.

Inside kube itself, though...

In general, kubelet itself can (and by best practice, does) reserve a certain amount of memory beyond which it will refuse to schedule because the kernel OOM killer may accidentally kill important things (like the kubelet, even, or anything else) otherwise, or swap it out.

Inside the pod, /proc is present, so at a basic level you can check /proc/meminfo and compare to see whether there is free memory. Then you'd need to determine where the pod is running (hoping/assuming that, once a limit is set, the pod can be rescheduled on the same node even if the other nodes in the cluster are under MemoryPressure, though that could also be racy), and describe the node to see whether there's room. Such as:

status:
  ...
  allocatable:
    cpu: "4"
    ephemeral-storage: "113180564088"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 7895304Ki
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 122808772Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 7997704Ki
    pods: "110"
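For reference, a minimal sketch of fetching those figures from Python, assuming the kubernetes client and a kubeconfig (or in-cluster config); the node name is a placeholder:

```python
# Illustrative only: describe a node to compare its allocatable memory with
# the limit we are about to request. Node name is hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the pod
core = client.CoreV1Api()

node = core.read_node("worker-node-1")
print("allocatable memory:", node.status.allocatable["memory"])
print("capacity memory:   ", node.status.capacity["memory"])
```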

Lastly, you'd need to check on the pod itself, which may be further limited (if there are no resource limits set in the podspec, this is probably unnecessary paperwork, but regardless), via /sys/fs/cgroup/memory/memory.limit_in_bytes inside the pod.

Trivially, you can check /proc/meminfo for the amount of free memory (which kube doesn't actually care about, but we may, lest the pod get OOM killed immediately) and compare it against status["allocatable"]["memory"] on a single node, under the guess that the cluster is homogeneous (and/or the Juju admin deployed with some constraints applied, so we'll end up on an identical worker node). Differencing /proc/meminfo's MemTotal against the node's allocatable memory tells you how much memory is reserved in the kubelet, if any; if there isn't a limit set, memory.limit_in_bytes will be something like 9.2 exabytes.

If there isn't a reservation, then all you have to go on is whether a particular node (the one you're running on, maybe) has enough free memory. Or you can check all of them and then 🤞 that some other pod which is pending or cannot be scheduled due to MemoryPressure doesn't steal it from you when you patch the StatefulSet. Or just set BlockedStatus with a message that the requested limits won't be applied.
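A minimal sketch of the in-pod checks described above, assuming cgroup v1 paths and that the code runs inside the pod; the function names and heuristic are illustrative, not part of the lib:

```python
# Illustrative only: rough pre-flight check before patching the StatefulSet
# with a higher memory limit. Reads /proc/meminfo and the (cgroup v1) pod
# memory limit; both paths are assumptions about the pod environment.
from pathlib import Path

def meminfo_kib(field: str) -> int:
    """Return a /proc/meminfo field (e.g. 'MemTotal', 'MemAvailable') in KiB."""
    for line in Path("/proc/meminfo").read_text().splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)

def cgroup_limit_bytes() -> int:
    """Pod memory limit from cgroup v1; ~9.2 EB means 'no limit set'."""
    return int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text())

def limit_looks_schedulable(requested_bytes: int) -> bool:
    """Heuristic only: the kube scheduler, not us, has the final word."""
    available = meminfo_kib("MemAvailable") * 1024
    return requested_bytes <= min(available, cgroup_limit_bytes())

if not limit_looks_schedulable(2 * 1024**3):  # e.g. a 2Gi limit
    print("Requested limit probably cannot be satisfied; consider BlockedStatus.")
```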

simskij previously approved these changes Aug 1, 2022
@simskij (Member) left a comment

Love the reduced complexity from dropping the scaling-factor functionality! Good job!

sed-i dismissed Abuelodelanada's stale review August 2, 2022 09:15

Resolved and updated the loki PR accordingly.

sed-i requested a review from simskij August 2, 2022 22:53
@sed-i (Contributor, Author) commented Aug 2, 2022

Ready for re-review, @simskij @rbarry82 @dstathis.

sed-i merged commit af7dffa into main Aug 3, 2022
sed-i deleted the feature/resource-patch branch August 3, 2022 14:16