
Cloud Burst Configuration -> Pod does not trigger auto-scale #202

Open
maaft opened this issue Nov 21, 2023 · 2 comments
maaft commented Nov 21, 2023

I have two clusters:

  1. on-prem: cluster with 2 GPU nodes
  2. cloud: cluster with 0-5 GPU nodes (auto-scaling is active)

I tried to use the proposed solution for "cloud bursting":

  1. I schedule two GPU workloads on on-prem with the multicluster.admiralty.io/elect: "" annotation.
  2. Both workloads are scheduled by Admiralty and run in my on-prem cluster as expected.
  3. I schedule another GPU workload. It now stays in Pending state forever, since Admiralty doesn't even try to schedule it on my cloud cluster due to apparently missing resources (Insufficient nvidia.com/gpu).
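For reference, the workloads were submitted with a manifest along these lines (a minimal sketch; the name and image are placeholders, but the annotation and GPU request are the ones described above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                        # placeholder name
  annotations:
    multicluster.admiralty.io/elect: ""     # opt this pod into Admiralty multicluster scheduling
spec:
  containers:
    - name: main
      image: my-gpu-image:latest            # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                 # one GPU per workload
```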

How can I tell Admiralty to schedule onto my cloud cluster even though there are currently no free resources there, so that the scheduled Pod can trigger auto-scaling?

If this is not possible, I don't see the point of "cloud bursting" here, since I'd need to keep my cloud resources "always on" and pay for them.


maaft commented Nov 21, 2023

Also, I noticed that even when my cloud cluster has nvidia.com/gpu = 1, this information is not propagated to the virtual node in my on-prem cluster.

cloud cluster

kubectl describe node aks-gpu-10181809-vmss000000

Name:               aks-gpu-10181809-vmss000000
Roles:              agent
Labels:             accelerator=nvidia
Unschedulable:      false
Lease:
  HolderIdentity:  aks-gpu-10181809-vmss000000
  AcquireTime:     <unset>
  RenewTime:       Tue, 21 Nov 2023 13:44:43 +0100

Capacity:
  cpu:                4
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             28736348Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487772Ki
  nvidia.com/gpu:     1
  pods:               110
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                670m (17%)   2700m (69%)
  memory             1240Mi (5%)  5660Mi (23%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     0            0

on-prem cluster

kubectl describe node admiralty-aks101cluster

Name:               admiralty-aks101cluster
Roles:              agent
Labels:             alpha.service-controller.kubernetes.io/exclude-balancer=true
                    kubernetes.io/role=agent
                    multicluster.admiralty.io/cluster-target-name=aks101cluster
                    node-role.kubernetes.io/agent=
                    node.kubernetes.io/exclude-from-external-load-balancers=true
                    type=virtual-kubelet
                    virtual-kubelet.io/provider=admiralty
Taints:             virtual-kubelet.io/provider=admiralty:NoSchedule

 
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)

Is this (another) bug?

adrienjt (Contributor) commented

The virtual node should have capacity and allocatable values that include nvidia.com/gpu, as implemented here: https://github.com/admiraltyio/admiralty/tree/master/pkg/controllers/resources

So I suspect a configuration issue. Are you able to run the quick start on these clusters?
