Feature Demo

Notes

  1. OpenPAI is leveraged to show these end-to-end scheduling feature demos, but it is not required by HiveD. All these features can also be reproduced on other platforms or even with raw pods.
  2. The HiveD config and job requests may need to be adjusted for your own cluster.

VC Safety

Description

HiveD guarantees quota safety for all VCs, in the sense that the requests to cells defined in each VC can always be satisfied.

A VC's cells can be described by Hardware Quantity, Topology, Type, Pinned Cells, etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's guaranteed jobs should never cause fragmentation inside other VCs:

Consider two DGX-2 nodes and two VCs, each owning one DGX-2 node. For a traditional scheduler, this translates into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to vc1, the user in vc2 might not be able to run a 16-GPU job, due to fragmentation caused by vc1. HiveD, by contrast, guarantees that each VC always has one entire node available for its dedicated use.
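As a rough illustration, such a setup might look like the following in a HiveD config (cell type names, node addresses, and the exact field layout are illustrative; follow your actual hived-config files):

```yaml
# Sketch only: two node-level cells, one per VC, so each VC is guaranteed a
# whole DGX-2 node rather than 16 loose GPUs. Field names are approximate.
physicalCluster:
  cellTypes:
    DGX2-NODE:
      childCellType: DGX2-V100   # leaf cell (one GPU), no internal topology
      childCellNumber: 16
      isNodeLevel: true
  physicalCells:
  - cellType: DGX2-NODE
    cellAddress: node-1
  - cellType: DGX2-NODE
    cellAddress: node-2
virtualClusters:
  vc1:
    virtualCells:
    - cellType: DGX2-NODE
      cellNumber: 1
  vc2:
    virtualCells:
    - cellType: DGX2-NODE
      cellNumber: 1
```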

Reproduce Steps

  1. Use hived-config-1.
  2. Submit 2 jobs, itc-safety-1 and itc-safety-2, to the same VC; all tasks will always run within the same node (10.151.41.26).

Pinned Cells

Description

One VC contains two DGX-2 node cells. The VC admin would like to pin one DGX-2 node cell in the physical cluster for dedicated use, i.e. that cell will be statically bound to a node. Without an explicit pinnedCellId specified, a job will not be allowed to run on the pinned node.

This is similar to K8S Taints and Tolerations, but with VC Safety guaranteed.
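A rough sketch of how the pinning could be expressed (identifiers and field names are illustrative; verify against your hived-config and job spec):

```yaml
# Sketch: bind one of vc1's node cells to a fixed physical node and give it a
# pinned cell id; only jobs that name this pinnedCellId land on that node.
physicalCluster:
  physicalCells:
  - cellType: DGX2-NODE
    cellAddress: 10.151.41.25
    pinnedCellId: vc1-pinned-node   # static binding to this node
  - cellType: DGX2-NODE
    cellAddress: 10.151.41.26
virtualClusters:
  vc1:
    virtualCells:
    - cellType: DGX2-NODE
      cellNumber: 1
    pinnedCells:
    - pinnedCellId: vc1-pinned-node
```

A job that wants the pinned node then sets the same pinnedCellId in its scheduling request; jobs that do not set it will be placed on the other, non-pinned cells.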

Reproduce Steps

  1. Use hived-config-8.
  2. Submit job itc-pin to vc1: all tasks in task role vc1pinned will be on node 10.151.41.25 (which is pinned), and all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.

SKU Type

Description

skuType is the leaf cellType, which has no internal topology.

If skuType is specified in the job, only leaf cells of that type will be allocated to the job; otherwise, leaf cells of any type can be allocated.

This is similar to K8S Labels and Selectors, but with VC Safety guaranteed.
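Since the demos can also be reproduced with raw pods, here is a minimal sketch of a pod that asks for a specific skuType via the HiveD scheduling annotation (the annotation fields and image are illustrative and may differ across HiveD versions):

```yaml
# Sketch: one pod requesting a single K80 leaf cell from HiveD.
apiVersion: v1
kind: Pod
metadata:
  name: k80-type-demo
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: vc1
      priority: 0
      leafCellType: K80      # the skuType; set to null to accept any type
      leafCellNumber: 1
      affinityGroup: null
spec:
  schedulerName: hivedscheduler
  containers:
  - name: worker
    image: nvidia/cuda:10.0-base   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1
```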

Reproduce Steps

skuType specified

  1. Use hived-config-2.
  2. Submit job itc-k80-type; it will be partially running (some tasks will be waiting because all the specified K80 GPUs are in use).

skuType not specified

  1. Use hived-config-2.
  2. Submit job itc-no-type; it will be fully running, with some tasks using K80 (10.151.41.18) and others using M60 (10.151.41.26).

Gang Scheduling

Description

A set of pods is scheduled as a gang, i.e. in an all-or-nothing fashion.

The gang is treated as an AffinityGroup, the scheduling unit of HiveD.

A job can specify that all its pods are in the same AffinityGroup, so the whole job is gang-scheduled.

This is useful for jobs that cannot perform any useful work, such as making progress or serving, until all pods are running. A typical example in deep learning workloads is distributed training.
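Concretely, the gang is expressed by giving every pod the same AffinityGroup in its scheduling spec, roughly like this (group name and member layout are illustrative):

```yaml
# Sketch: each of the 4 pods carries the same group, so HiveD allocates
# leaf cells for all 4 pods together, or for none of them.
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
  virtualCluster: vc1
  priority: 0
  leafCellNumber: 1
  affinityGroup:
    name: itc-gang-demo/worker   # identical on every member pod
    members:
    - podNumber: 4               # total pods in the gang
      leafCellNumber: 1          # leaf cells requested by each pod
```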

Reproduce Steps

Basic

  1. Use hived-config-2.
  2. Submit job itc-gang, which requests 6 single-GPU tasks. All tasks will be waiting without an IP associated, because the VC only has 4 GPUs of the specified type.
  3. Submit job itc-gang4, which requests 4 single-GPU tasks; all its tasks will be running, while itc-gang will still be waiting without an IP associated. This also shows that HiveD itself introduces no head-of-line blocking (i.e., itc-gang does not block itc-gang4).

TensorFlow Distributed Training

  1. Use hived-config-2.
  2. Submit job itc-dtf to the default VC; it will succeed.

Incremental Scheduling

Description

A set of pods is scheduled independently of each other, i.e. Gang Scheduling is not required.

A job can specify its pods in different AffinityGroups, so the whole job is scheduled incrementally (one AffinityGroup at a time).

This is used for jobs that can still perform useful work, such as making progress or serving, even if only one pod is running.
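In contrast to the gang sketch above, leaving the affinityGroup empty (or using a different group per pod) makes each pod its own scheduling unit; a minimal sketch:

```yaml
# Sketch: no shared gang — each pod can be scheduled as soon as a leaf cell
# for it alone is available.
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
  virtualCluster: vc1
  priority: 0
  leafCellNumber: 1
  affinityGroup: null   # each pod forms its own AffinityGroup
```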

Reproduce Steps

  1. Use hived-config-1.
  2. Submit job itc-elastic, whose total request is larger than its VC quota; it can still run partially.

Guaranteed Job

Description

Guaranteed Job: a job whose priority is non-negative. It can only use its own VC's quota; however, once it is allocated, it will not be preempted by other VCs' jobs.
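In the scheduling spec this is simply a non-negative priority, e.g. (a sketch; values are illustrative):

```yaml
# Sketch: a guaranteed job — non-negative priority, confined to vc1's quota,
# and not preemptible by other VCs' jobs once allocated.
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
  virtualCluster: vc1
  priority: 100
  leafCellNumber: 1
```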

Reproduce Steps

  1. Use hived-config-1.
  2. Submit job itc-elastic; it will not use more than one node.

Opportunistic Job

Description

Opportunistic Job: a job whose priority is -1. It can use other VCs' quota; however, once it is allocated, it may be preempted by other VCs' guaranteed jobs.
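The only difference from the guaranteed-job sketch above is the priority value (again a sketch):

```yaml
# Sketch: an opportunistic job — priority -1, may borrow idle cells from other
# VCs, but can be preempted by their guaranteed jobs at any time.
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
  virtualCluster: vc1
  priority: -1
  leafCellNumber: 1
```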

Reproduce Steps

  1. Use hived-config-1.
  2. Submit job itc-oppo; it will use more than one node, even though its VC has only one node.

Intra-VC Preemption

Description

Within one VC, a high-priority job can preempt low-priority jobs.

Reproduce Steps

Immediate Preemption

  1. Use hived-config-3.
  2. Submit itc-intra-imd-preempt-test, which requests 4 M60 GPUs in vc1 with test (0) priority.
  3. Submit itc-intra-imd-preempt-prod, which also requests 4 M60 GPUs in vc1, but with prod (100) priority. This job will preempt the test job immediately, so the test job is retried and left waiting for resources.

Lazy Preemption

  1. Use hived-config-3.
  2. Submit itc-intra-lazy-preempt-test, which requests 4 K80 GPUs in vc1 with test (0) priority.
  3. Submit itc-intra-lazy-preempt-prod, which also requests 4 K80 GPUs in vc1, but with prod (100) priority. This job will just downgrade the test job to an Opportunistic Job, instead of preempting it immediately, because all the jobs can still fit into the whole physical cluster.
  4. Submit itc-intra-lazy-preempt-prod2, which requests 3 * 4 K80 GPUs in the default VC with prod (100) priority. This job will preempt the test job immediately, because all the jobs can no longer fit into the whole physical cluster.

NOTE: The lazyPreemptionEnable option is disabled by default, because an earlier job may be downgraded to a low-priority job and then get preempted by later jobs, which may be confusing.
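If the lazy behavior is desired, it is opted into per job in the scheduling spec, roughly as below (the field name follows the HiveD pod scheduling spec; treat the exact placement as an assumption and verify against your HiveD version):

```yaml
# Sketch: a high-priority job opting into lazy preemption — lower-priority
# intra-VC victims are downgraded to opportunistic instead of being killed,
# as long as everything still fits in the physical cluster.
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
  virtualCluster: vc1
  priority: 100
  lazyPreemptionEnable: true
  leafCellNumber: 1
```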

Inter-VC Preemption

Description

One VC's Guaranteed Job can preempt other VCs' Opportunistic Jobs.

Reproduce Steps

  1. Use hived-config-3.
  2. Submit itc-inter-preempt-oppo, which requests 2 * 4 K80 GPUs in vc1 with oppo (-1) priority.
  3. Submit itc-inter-preempt-prod, which requests 3 * 4 K80 GPUs in the default VC with prod (100) priority. This job will preempt the oppo job immediately.

Topology-Aware Intra-VC Scheduling

Description

Within one VC, HiveD chooses the nearest leaf cells for an AffinityGroup on a best-effort basis.
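"Nearest" is defined by the cell hierarchy in the config: leaf cells that share a smaller ancestor cell (e.g. the same PCIe switch) are preferred, which is what the buddy GPUs in the step below are. A sketch of such a hierarchy (names and fan-outs are illustrative):

```yaml
# Sketch: a small topology. Two V100s under the same GPU-PAIR cell are
# "buddies"; HiveD tries to pack an AffinityGroup into the smallest cell
# that can hold it before spreading to farther cells.
physicalCluster:
  cellTypes:
    GPU-PAIR:                # e.g. GPUs under one PCIe switch
      childCellType: V100    # leaf cell / skuType
      childCellNumber: 2
    V100-NODE:
      childCellType: GPU-PAIR
      childCellNumber: 4
      isNodeLevel: true
```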

Reproduce Steps

  1. Use hived-config-2.
  2. Submit job itc-buddy, which requests 2 single-GPU tasks in the same AffinityGroup; the tasks will be allocated to 2 buddy GPUs.

Work-Preserving Reconfiguration

Description

HiveD can be reconfigured without unnecessary user impact, e.g. adding/updating/deleting physical/virtual clusters, changing device types/topologies, etc.

Reproduce Steps

PhysicalCluster Reconfig - Delete PhysicalCell

  1. Use hived-config-2.
  2. Submit job itc-reconfig-1, which requests the M60 skuType. Wait until it is running.
  3. Delete all M60 skuType related PhysicalCells and VirtualCells from hived-config-2, so that it becomes hived-config-33.
  4. Use hived-config-33, and restart HiveD.
  5. The job will still run without any impact, but its M60 usage is ignored by HiveD. However, the job will normally still fail if the corresponding physical node is later deleted from K8S or becomes unhealthy.

PhysicalCluster Reconfig - Add PhysicalCell

  1. Use hived-config-33.
  2. Submit job itc-k80-type, which requests the K80 skuType. Wait until it is running.
  3. Add all M60 skuType related PhysicalCells and VirtualCells into hived-config-33, so that it becomes hived-config-2.
  4. Use hived-config-2, and restart HiveD.
  5. The job will still run without any impact, and its K80 usage is still accounted for by HiveD.

PhysicalCluster Reconfig - Update PhysicalCell - Add Node

  1. Use hived-config-2.
  2. Submit job itc-reconfig-1, which requests the M60 skuType. Wait until it is running.
  3. Add one M60 node into a PhysicalCell, so that the config becomes hived-config-4.
  4. Use hived-config-4, and restart HiveD.
  5. The job will still run without any impact, and its M60 usage is still accounted for by HiveD.
  6. To confirm the job is not impacted (e.g., lazy preempted), submit job itc-reconfig-2, which requests all M60 nodes and has the same priority as itc-reconfig-1. The job will be waiting instead of preempting itc-reconfig-1.

PhysicalCluster Reconfig - Update PhysicalCell - Delete Node

  1. Use hived-config-2.
  2. Submit job itc-reconfig-3, which requests the K80 skuType. Wait until it is running.
  3. Delete one K80 node used by itc-reconfig-3 from a PhysicalCell, so that the config becomes hived-config-7.
  4. Use hived-config-7, and restart HiveD.
  5. The job will still run without any impact, but the usage of the deleted node is ignored by HiveD. However, the job will normally still fail if the corresponding physical node is later deleted from K8S or becomes unhealthy.

VirtualCluster Reconfig - Delete VirtualCluster

  1. Use hived-config-2.
  2. Submit job itc-reconfig-3 to the default VC. Wait until it is running.
  3. Delete the default VC and move its quota to vc1, so that the config becomes hived-config-5.
  4. Use hived-config-5, and restart HiveD.
  5. The job will still run without any interruption, but it is lazy preempted by HiveD.
  6. To confirm it is lazy preempted, submit job itc-reconfig-4 to vc1, which requests all K80 nodes. The job will immediately preempt itc-reconfig-3.

VirtualCluster Reconfig - Update VirtualCluster

  1. Use hived-config-2.
  2. Submit job itc-reconfig-3 to the default VC. Wait until it is running.
  3. Move one K80-NODE cell from the default VC to vc1, so that the config becomes hived-config-6.
  4. Use hived-config-6, and restart HiveD.
  5. The job will still run without any interruption, but it is lazy preempted by HiveD.
  6. To confirm it is lazy preempted, submit job itc-reconfig-5 to vc1, which requests all K80 nodes. The job will immediately preempt itc-reconfig-3.

Bad Hardware Awareness

Description

Avoid scheduling pods to bad hardware.

Reproduce Steps

  1. Use hived-config-2.
  2. Stop kubelet on 10.151.41.26 (the only M60 node) by sudo systemctl stop kubelet. Wait until this is detected by K8S.
  3. Submit job itc-badnode50, which requests an M60 node; it will be waiting without an IP associated.
  4. Bring back 10.151.41.26 by sudo systemctl start kubelet. Wait until this is detected by K8S.
  5. The waiting job will start running, without any retries.