[YUNIKORN-2] gang scheduling shim side implementation #219

Merged 17 commits on Jan 20, 2021
Changes from 16 commits
29 changes: 29 additions & 0 deletions Makefile
@@ -41,6 +41,9 @@ OUTPUT=_output
RELEASE_BIN_DIR=${OUTPUT}/bin
ADMISSION_CONTROLLER_BIN_DIR=${OUTPUT}/admission-controllers/
POD_ADMISSION_CONTROLLER_BINARY=scheduler-admission-controller
GANG_BIN_DIR=${OUTPUT}/gang
GANG_CLIENT_BINARY=simulation-gang-worker
GANG_SERVER_BINARY=simulation-gang-coordinator
LOCAL_CONF=conf
CONF_FILE=queues.yaml
REPO=github.com/apache/incubator-yunikorn-k8shim/pkg
@@ -168,6 +171,32 @@ adm_image: admission
	docker build ./deployments/image/admission -t ${REGISTRY}/yunikorn:admission-${VERSION}
	@rm -f ./deployments/image/admission/${POD_ADMISSION_CONTROLLER_BINARY}

# Build gang web server and client binary in a production ready version
.PHONY: simulation
simulation:
	@echo "building gang web client binary"
	CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
	go build -a -o=${GANG_BIN_DIR}/${GANG_CLIENT_BINARY} -ldflags \
	'-extldflags "-static" -X main.version=${VERSION} -X main.date=${DATE}' \
	-tags netgo -installsuffix netgo \
	./pkg/simulation/gang/gangclient
	@echo "building gang web server binary"
	go build -a -o=${GANG_BIN_DIR}/${GANG_SERVER_BINARY} -ldflags \
	'-extldflags "-static" -X main.version=${VERSION} -X main.date=${DATE}' \
	-tags netgo -installsuffix netgo \
	./pkg/simulation/gang/webserver

# Build gang test images based on the production ready version
.PHONY: simulation_image
simulation_image: simulation
	@echo "building gang test docker images"
	@cp ${GANG_BIN_DIR}/${GANG_CLIENT_BINARY} ./deployments/image/gang/gangclient
	@cp ${GANG_BIN_DIR}/${GANG_SERVER_BINARY} ./deployments/image/gang/webserver
	docker build ./deployments/image/gang/gangclient -t ${REGISTRY}/yunikorn:simulation-gang-worker-latest
	docker build ./deployments/image/gang/webserver -t ${REGISTRY}/yunikorn:simulation-gang-coordinator-latest
	@rm -f ./deployments/image/gang/gangclient/${GANG_CLIENT_BINARY}
	@rm -f ./deployments/image/gang/webserver/${GANG_SERVER_BINARY}

# Build all images based on the production ready version
.PHONY: image
image: sched_image adm_image
8 changes: 8 additions & 0 deletions deployments/examples/README.md
@@ -49,6 +49,14 @@ Deployment files for the driver and executor:
A simple example that runs a [kubeflow/tensorflow](./tfjob/tf-job-mnist.yaml) job.
This example runs a distributed mnist model as an e2e test; for more detail see the [dist-mnist](https://github.com/kubeflow/tf-operator/tree/master/examples/v1/dist-mnist) section.

## gang
A sample application that implements gang scheduling at the application level.
Start it via the [gangDeploy.sh](./gang/cmd/gangDeploy.sh) script; for more detail see the [gang](./gang/README.md) section.

Deployment files for the gang-coordinator and gang-job:
* [gang-coordinator](./gang/gang-coordinator.yaml)
* [gang-job](./gang/gang-job.yaml)

## volumes
The volumes directory contains three examples:
1. [local volumes](#local-volume)
28 changes: 28 additions & 0 deletions deployments/examples/gang/README.md
@@ -0,0 +1,28 @@
<!--
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
-->

# Gang scheduling at the application level

The following script runs a given number of jobs. Each one is a simulated job that requires gang scheduling support: the job starts executing its tasks only once the cluster has the minimum number of member instances running.

```shell script
./cmd/gangDeploy.sh <job number> <gang member> <task run time(sec)>
```
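The script reads its three positional arguments as-is. As a minimal sketch, assuming all three are meant to be positive integers, a check like the following could run before any `kubectl` call. `validate_gang_args` is a hypothetical helper and is not part of `gangDeploy.sh`.

```shell
# Hypothetical helper (an assumption, not part of gangDeploy.sh).
# Returns 0 when exactly three positive integers are given, 1 otherwise.
validate_gang_args() {
  if [ "$#" -ne 3 ]; then
    echo "usage: gangDeploy.sh <job number> <gang member> <task run time(sec)>" >&2
    return 1
  fi
  local arg
  for arg in "$@"; do
    case "$arg" in
      # Reject empty strings and anything containing a non-digit.
      ''|*[!0-9]*)
        echo "not a positive integer: $arg" >&2
        return 1
        ;;
    esac
    if [ "$arg" -eq 0 ]; then
      echo "value must be greater than zero: $arg" >&2
      return 1
    fi
  done
  return 0
}
```

Calling `validate_gang_args "$@" || exit 1` near the top of a script would stop early with a usage message rather than an unbound-variable error.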
Note: to launch such jobs manually, refer to `gang-coordinator.yaml` and `gang-job.yaml`, where:
* [gang-coordinator.yaml](./gang-coordinator.yaml): the coordinator service used to coordinate the execution of each job's tasks.
* [gang-job.yaml](./gang-job.yaml): the simulated job that requires its gang members to be started before executing its tasks.
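
The application-level gang behaviour described above can be pictured as a barrier: each worker waits until enough members are running, then starts its task. The sketch below is only an illustration under stated assumptions. `wait_for_gang` is a hypothetical function, and the command that reports the running member count is passed in by the caller, because how the simulated worker actually queries the coordinator is not described here.

```shell
# Hypothetical sketch of a worker-side gang barrier (not the simulation's code).
# Usage: wait_for_gang <min_member> <timeout_sec> <count_cmd> [args...]
# count_cmd is any command that prints the number of members currently running.
wait_for_gang() {
  local min_member=$1
  local timeout_sec=$2
  shift 2
  local waited=0
  local running
  while :; do
    running=$("$@")
    if [ "$running" -ge "$min_member" ]; then
      return 0   # gang satisfied: the task may start
    fi
    if [ "$waited" -ge "$timeout_sec" ]; then
      return 1   # gave up before the gang formed
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

A worker would call it as `wait_for_gang "$MEMBER_AMOUNT" 300 some_count_command` and begin its timed task only when the function returns 0; the timeout value and the count command are placeholders.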
100 changes: 100 additions & 0 deletions deployments/examples/gang/cmd/gangDeploy.sh
@@ -0,0 +1,100 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# gangDeploy.sh <job amount> <gang member> <task run time(sec)>
set -o errexit
set -o nounset
set -o pipefail

JOBAMOUNT=$1
GANGMEMBER=$2
RUNTIMESEC=$3

# create service
kubectl create -f <(cat << EOF
apiVersion: v1
kind: Service
metadata:
  name: gangservice
  labels:
    app: gang
spec:
  selector:
    app: gang
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8863
    targetPort: 8863
EOF)
# create job counter web server
kubectl create -f <(cat << EOF
apiVersion: v1
kind: Pod
metadata:
  name: gangweb
  labels:
    app: gang
    queue: root.sandbox
spec:
  schedulerName: yunikorn
  containers:
  - name: gangweb
    image: apache/yunikorn:simulation-gang-coordinator-latest
    imagePullPolicy: Never
    ports:
    - containerPort: 8863
EOF)
# wait for web server to be running
until grep 'Running' <(kubectl get pod gangweb -o=jsonpath='{.status.phase}'); do
  sleep 1
done
# create gang jobs
for i in $(seq "$JOBAMOUNT"); do
  kubectl create -f <(cat << EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-job-$i
  labels:
    app: gang
    queue: root.sandbox
spec:
  completions: $GANGMEMBER
  parallelism: $GANGMEMBER
  template:
    spec:
      containers:
      - name: gang
        image: apache/yunikorn:simulation-gang-worker-latest
        imagePullPolicy: Never
        env:
        - name: JOB_ID
          value: gang-job-$i
        - name: SERVICE_NAME
          value: gangservice
        - name: MEMBER_AMOUNT
          value: "$GANGMEMBER"
        - name: TASK_EXECUTION_SECONDS
          value: "$RUNTIMESEC"
      restartPolicy: Never
      schedulerName: yunikorn
EOF)
done
46 changes: 46 additions & 0 deletions deployments/examples/gang/gang-coordinator.yaml
@@ -0,0 +1,46 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: Service
metadata:
  name: gangservice
  labels:
    app: gang
spec:
  selector:
    app: gang
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8863
    targetPort: 8863
---
apiVersion: v1
kind: Pod
metadata:
  name: gangweb
  labels:
    app: gang
    queue: root.sandbox
spec:
  schedulerName: yunikorn
  containers:
  - name: gangweb
    image: apache/yunikorn:simulation-gang-coordinator-latest
    imagePullPolicy: Never
    ports:
    - containerPort: 8863
44 changes: 44 additions & 0 deletions deployments/examples/gang/gang-job.yaml
@@ -0,0 +1,44 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: gang-job-1
  labels:
    app: gang
    queue: root.sandbox
spec:
  completions: 10 # Number of pods that must complete
  parallelism: 10 # Number of pods that run in parallel
  template:
    spec:
      containers:
      - name: gang
        image: apache/yunikorn:simulation-gang-worker-latest
        imagePullPolicy: Never
        env:
        - name: JOB_ID
          value: gang-job-1 # This job's name
        - name: SERVICE_NAME
          value: gangservice # Name of the coordinator service
        - name: MEMBER_AMOUNT
          value: "10" # Required number of gang members; must not exceed the pod count. Must be a string.
        - name: TASK_EXECUTION_SECONDS
          value: "60" # Task execution time in seconds; the countdown starts once the gang member count is satisfied. Must be a string.
      restartPolicy: Never
      schedulerName: yunikorn
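
The manifest above hard-codes one job. As a rough sketch, assuming the same field layout, a helper could render it for any job name while enforcing the two constraints noted in the comments: the member amount must not exceed the pod count, and both env values must be quoted strings. `render_gang_job` is a hypothetical function and does not exist in the repository.

```shell
# Hypothetical helper: prints a gang Job manifest to stdout.
# Usage: render_gang_job <job_name> <pod_count> <min_member> <run_seconds>
render_gang_job() {
  local job_name=$1 pod_count=$2 min_member=$3 run_seconds=$4
  # The gang can never form if it needs more members than there are pods.
  if [ "$min_member" -gt "$pod_count" ]; then
    echo "min member ($min_member) exceeds pod count ($pod_count)" >&2
    return 1
  fi
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: $job_name
  labels:
    app: gang
    queue: root.sandbox
spec:
  completions: $pod_count
  parallelism: $pod_count
  template:
    spec:
      containers:
      - name: gang
        image: apache/yunikorn:simulation-gang-worker-latest
        imagePullPolicy: Never
        env:
        - name: JOB_ID
          value: $job_name
        - name: SERVICE_NAME
          value: gangservice
        - name: MEMBER_AMOUNT
          value: "$min_member"
        - name: TASK_EXECUTION_SECONDS
          value: "$run_seconds"
      restartPolicy: Never
      schedulerName: yunikorn
EOF
}
```

The output can be piped straight to the cluster, for example `render_gang_job gang-job-2 10 8 60 | kubectl create -f -`.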
21 changes: 21 additions & 0 deletions deployments/image/gang/gangclient/dockerfile
@@ -0,0 +1,21 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM golang:1.15.2
WORKDIR /gang/client
ADD . /gang/client
ENTRYPOINT ["./simulation-gang-worker"]
22 changes: 22 additions & 0 deletions deployments/image/gang/webserver/dockerfile
@@ -0,0 +1,22 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM golang:1.15.2
WORKDIR /gang/server
ADD . /gang/server
EXPOSE 8863
ENTRYPOINT ["./simulation-gang-coordinator"]
26 changes: 23 additions & 3 deletions deployments/yunikorn-application/application-definition.yaml
@@ -50,10 +50,10 @@ spec:
spec:
type: object
properties:
-policy:
+schedulingPolicy:
type: object
properties:
-policy:
+type:
type: string
parameters:
type: object
@@ -67,7 +67,7 @@ spec:
items:
type: object
properties:
-groupName:
+name:
type: string
minMember:
type: integer
@@ -79,6 +79,26 @@ spec:
- type: string
pattern: '^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$'
x-kubernetes-int-or-string: true
nodeSelector:
type: object
additionalProperties:
type: string
tolerations:
type: array
items:
type: object
properties:
effect:
type: string
key:
type: string
operator:
type: string
value:
type: string
tolerationSeconds:
format: int64
type: integer
status:
type: object
properties: