Feature/dev torque schema (#6)
* torque init

* support torque cluster create

* support centos and ubuntu base image for torque test case

* polymerization pbspro cmd
chriskery committed Oct 2, 2023
1 parent 5c2b38a commit 45b411f
Showing 16 changed files with 198 additions and 135 deletions.
10 changes: 4 additions & 6 deletions README.md
@@ -3,28 +3,26 @@
[![Coverage Status](https://coveralls.io/repos/github/chriskery/kubecluster/badge.svg?branch=master)](https://coveralls.io/github/chriskery/kubecluster?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/chriskery/kubecluster)](https://goreportcard.com/report/github.com/chriskery/kubecluster)

### The kubecluster implements a mechanism that makes it easy to build Slurm/Torque clusters on Kubernetes.
### The kubecluster implements a mechanism that makes it easy to build Slurm/pbspro clusters on Kubernetes.

## Features
Kubecluster uses Pods to simulate nodes in different clusters, currently supports the following cluster types :

- [Slurm](pkg/controller/slurm_schema)
- [Torque( PBS )](pkg/controller/torque_schema)
- [PBS professional](pkg/controller/pbspro_schema)
## Getting Started
You’ll need a Kubernetes cluster to run against. You can use [KIND](https://sigs.k8s.io/kind) to get a local cluster for testing, or run against a remote cluster.
**Note:** Your controller will automatically use the current context in your kubeconfig file (i.e. whatever cluster `kubectl cluster-info` shows).

## Installation

### Master Branch

```bash
```sh
kubectl apply -k "github.com/chriskery/kubecluster/manifests/default"
```

## Quick Start

Please refer to the [quick-start.md](docs/quick-start.md) and [Kubeflow Training User Guide](https://www.kubeflow.org/docs/guides/components/tftraining/) for more information.
Please refer to the [quick-start.md](docs/quick-start.md) for more information.


### How it works
65 changes: 42 additions & 23 deletions docs/quick-start.md
@@ -1,15 +1,15 @@
## Create a Torque Cluster
## Build a pbspro Cluster

**Create Torque YAML**
**Create pbspro YAML**

```
kubectl create -f ../manifests/samples/torque-centos.yaml
kubectl create -f ../manifests/samples/pbspro-centos.yaml
```

The torque centos example create a torque cluster with 1 server and 1 worker,
so it will create two pods to simulate two nodes for the torque cluster
The pbspro centos example creates a pbspro cluster with 1 server and 1 worker,
so it will create two pods to simulate the two nodes of the pbspro cluster

**Get Torque Status**
**Get kubeclusters Status**

Execute the following command:
```
@@ -19,40 +19,41 @@ The output is like:
```shell
> kubectl get kubeclusters
NAME AGE STATE
torque-centos-sample 3s Running
pbspro-centos-sample 3s Running
```

Now you can enter the " server node " and use this torque-centos-sample look like you're actually using a physical torque cluster
Now you can enter the "server node" and use it as if it were a physical pbspro cluster
```
> kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-5bc4c45dc9-npwxp 1/1 Running 16 46h
torque-centos-sample-cpu-0 1/1 Running 0 2m43s
torque-centos-sample-server-0 1/1 Running 0 2m43s
pbspro-centos-sample-cpu-0 1/1 Running 0 2m43s
pbspro-centos-sample-server-0 1/1 Running 0 2m43s
```
torque-centos-sample-server-0 is the server node of cluster torque-centos-sample
pbspro-centos-sample-server-0 is the server node of cluster pbspro-centos-sample
```
> kubectl exec -it torque-centos-sample-server-0 /bin/bash
> kubectl exec -it pbspro-centos-sample-server-0 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
[root@torque-centos-sample-server-0 /]#
[root@pbspro-centos-sample-server-0 /]#
```

**Using Torque Cluster**
Viewing Nodes' status of torque-centos-sample
**Using pbspro Cluster**

Viewing Nodes' status of pbspro-centos-sample
```
[root@torque-centos-sample-server-0 pbs]# pbsnodes -a
torque-centos-sample-server-0
Mom = torque-centos-sample-server-0
[root@pbspro-centos-sample-server-0 pbs]# pbsnodes -a
pbspro-centos-sample-server-0
Mom = pbspro-centos-sample-server-0
Port = 15002
pbs_version = 19.0.0
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = torque-centos-sample-server-0
resources_available.host = pbspro-centos-sample-server-0
resources_available.mem = 64756484kb
resources_available.ncpus = 16
resources_available.vnode = torque-centos-sample-server-0
resources_available.vnode = pbspro-centos-sample-server-0
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
@@ -63,8 +64,8 @@ torque-centos-sample-server-0
sharing = default_shared
last_state_change_time = Thu Sep 28 07:05:43 2023
torque-centos-sample-cpu-0
Mom = 10-244-0-56.torque-centos-sample-cpu-0.default.svc.cluster.local
pbspro-centos-sample-cpu-0
Mom = 10-244-0-56.pbspro-centos-sample-cpu-0.default.svc.cluster.local
Port = 15002
pbs_version = 19.0.0
ntype = PBS
@@ -74,7 +75,7 @@ torque-centos-sample-cpu-0
resources_available.host = 10-244-0-56
resources_available.mem = 64756484kb
resources_available.ncpus = 16
resources_available.vnode = torque-centos-sample-cpu-0
resources_available.vnode = pbspro-centos-sample-cpu-0
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
@@ -85,5 +86,23 @@ torque-centos-sample-cpu-0
sharing = default_shared
last_state_change_time = Thu Sep 28 07:05:43 2023
```
Switch to the normal user and submit the job using [qsub](https://www.jlab.org/hpc/PBS/qsub.html)
```shell
[root@pbspro-centos-sample-server-0 /]# useradd pbsexample
[root@pbspro-centos-sample-server-0 /]# su pbsexample
[pbsexample@pbspro-centos-sample-server-0 /]$ qsub -- hostname
2.pbspro-centos-sample-server-0
[pbsexample@pbspro-centos-sample-server-0 /]$
```
Use [qstat](https://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qstat.htm) to view the job we just submitted
```
[pbsexample@pbspro-centos-sample-server-0 /]$ qstat -a
pbspro-centos-sample-server-0:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
2.pbspro-centos pbsexamp workq STDIN 1377 1 1 -- -- E 00:00
```
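The pod names in the quick-start output above (`pbspro-centos-sample-server-0`, `pbspro-centos-sample-cpu-0`) appear to follow a `<cluster>-<replica type>-<index>` pattern. The Go sketch below only illustrates that observed pattern. `podName` is a hypothetical helper, not a function from kubecluster, and the `Cpu` replica type key is an assumption inferred from the pod name rather than taken from the sample YAML.

```go
package main

import (
	"fmt"
	"strings"
)

// podName is a hypothetical helper (not part of kubecluster). It joins the
// cluster name, the lower-cased replica type, and the replica index, which
// matches the pod names shown in the quick-start output.
func podName(cluster, replicaType string, index int) string {
	return fmt.Sprintf("%s-%s-%d", cluster, strings.ToLower(replicaType), index)
}

func main() {
	// "Server" appears in the sample manifests; "Cpu" is assumed from the
	// pod name pbspro-centos-sample-cpu-0.
	fmt.Println(podName("pbspro-centos-sample", "Server", 0))
	fmt.Println(podName("pbspro-centos-sample", "Cpu", 0))
}
```

If the pattern holds, the server pod to pass to `kubectl exec` can be derived from the KubeCluster name alone.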


4 changes: 2 additions & 2 deletions manifests/samples/kustomization.yaml
@@ -3,6 +3,6 @@ resources:
- slurm-centos.yaml
- slurm-centos-hostNetwork.yaml
- slurm-ubuntu.yaml
- torque-centos.yaml
- torque-ubuntu.yaml
- pbspro-centos.yaml
- pbspro-ubuntu.yaml
#+kubebuilder:scaffold:manifestskustomizesamples
@@ -5,9 +5,9 @@ metadata:
app.kubernetes.io/part-of: kubecluster
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/created-by: kubecluster
name: torque-centos-sample
name: pbspro-centos-sample
spec:
clusterType: torque
clusterType: pbspro
clusterReplicaSpec:
Server:
replicas: 1
@@ -5,9 +5,9 @@ metadata:
app.kubernetes.io/part-of: kubecluster
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/created-by: kubecluster
name: torque-pbspro-sample
name: pbspro-pbspro-sample
spec:
clusterType: torque
clusterType: pbspro
clusterReplicaSpec:
Server:
replicas: 1
@@ -5,9 +5,9 @@ metadata:
app.kubernetes.io/part-of: kubecluster
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/created-by: kubecluster
name: torque-ubuntu-sample
name: pbspro-ubuntu-sample
spec:
clusterType: torque
clusterType: pbspro
clusterReplicaSpec:
Server:
replicas: 1
2 changes: 1 addition & 1 deletion manifests/samples/slurm-centos-hostNetwork.yaml
@@ -5,7 +5,7 @@ metadata:
app.kubernetes.io/part-of: kubecluster
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/created-by: kubecluster
name: centos-hostnetwork-sample
name: slurm-centos-hostnetwork-sample
spec:
clusterType: slurm
clusterReplicaSpec:
2 changes: 1 addition & 1 deletion manifests/samples/slurm-centos.yaml
@@ -5,7 +5,7 @@ metadata:
app.kubernetes.io/part-of: kubecluster
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/created-by: kubecluster
name: centos-sample
name: slurm-centos-sample
spec:
clusterType: slurm
clusterReplicaSpec:
31 changes: 31 additions & 0 deletions pkg/controller/cluster_schema/pbspro_schema/config.go
@@ -0,0 +1,31 @@
package pbspro_schema

import "flag"

// config holds the flag-configurable settings for the pbspro schema's init container.
var config struct {
pbsproSchemaInitContainerTemplateFile string
pbsproSchemaInitContainerImage string
pbsproSchemaInitContainerMaxTries int
}

const (
// pbsproSchemaInitContainerImageDefault is the default image for the pbsproSchema
// init container.
pbsproSchemaInitContainerImageDefault = "registry.cn-shanghai.aliyuncs.com/eflops-bcp/pbs-minimal:v1"
// pbsproSchemaInitContainerTemplateFileDefault is the default template file for
// the pbsproSchema init container.
pbsproSchemaInitContainerTemplateFileDefault = "/etc/config/initContainer.yaml"
// pbsproSchemaInitContainerMaxTriesDefault is the default number of tries for the pbsproSchema init container.
pbsproSchemaInitContainerMaxTriesDefault = 100
)

func init() {
// pbsproSchema related flags
flag.StringVar(&config.pbsproSchemaInitContainerImage, "pbsproSchema-init-container-image",
pbsproSchemaInitContainerImageDefault, "The image for pbsproSchema init container")
flag.StringVar(&config.pbsproSchemaInitContainerTemplateFile, "pbsproSchema-init-container-template-file",
pbsproSchemaInitContainerTemplateFileDefault, "The template file for pbsproSchema init container")
flag.IntVar(&config.pbsproSchemaInitContainerMaxTries, "pbsproSchema-init-container-max-tries",
pbsproSchemaInitContainerMaxTriesDefault, "The number of tries for the pbsproSchema init container")
}
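As a rough illustration of how the flags registered in `config.go` behave, the sketch below registers the same flag names and defaults on a private `flag.FlagSet` and parses a sample argument list. The `initContainerConfig` type, the `parseFlags` helper, and the use of a separate `FlagSet` instead of the global `flag.CommandLine` are assumptions made for this example and are not part of the commit.

```go
package main

import (
	"flag"
	"fmt"
)

// initContainerConfig is a hypothetical stand-in for the unexported config
// struct in config.go; it carries the same three settings.
type initContainerConfig struct {
	image        string
	templateFile string
	maxTries     int
}

// parseFlags registers the flag names and defaults from config.go on a
// private FlagSet (an assumption, so the sketch leaves flag.CommandLine
// untouched) and parses the given arguments.
func parseFlags(args []string) (initContainerConfig, error) {
	var c initContainerConfig
	fs := flag.NewFlagSet("pbspro-sketch", flag.ContinueOnError)
	fs.StringVar(&c.image, "pbsproSchema-init-container-image",
		"registry.cn-shanghai.aliyuncs.com/eflops-bcp/pbs-minimal:v1",
		"The image for pbsproSchema init container")
	fs.StringVar(&c.templateFile, "pbsproSchema-init-container-template-file",
		"/etc/config/initContainer.yaml",
		"The template file for pbsproSchema init container")
	fs.IntVar(&c.maxTries, "pbsproSchema-init-container-max-tries",
		100, "The number of tries for the pbsproSchema init container")
	err := fs.Parse(args)
	return c, err
}

func main() {
	// Override only the retry count; the other two settings keep their defaults.
	c, err := parseFlags([]string{"-pbsproSchema-init-container-max-tries=5"})
	if err != nil {
		panic(err)
	}
	fmt.Println(c.image, c.templateFile, c.maxTries)
}
```

Because the commit registers its flags in `init()` on the global flag set, the values are presumably only populated once the binary calls `flag.Parse()`; the private `FlagSet` here just makes that step explicit and easy to exercise.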