add lammps experiments that run on size 128 cluster (well, 129)
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Sep 24, 2023
1 parent a2d6259 commit ee58cce
Showing 74 changed files with 60,683 additions and 0 deletions.
5 changes: 5 additions & 0 deletions google/kubecon/lammps/README.md
@@ -0,0 +1,5 @@
# Kubecon Experiments

- [run0](run0) is the original experiment setup, primarily used for testing ahead of the main run.
- [run1](run1) is the same experiment, run for real over 6 hours on 17 nodes.
- [run2](run2) is a subset of that: just the automated LAMMPS runs, at a larger scale.
184 changes: 184 additions & 0 deletions google/kubecon/lammps/run2/README.md
@@ -0,0 +1,184 @@
# Lammps Automated

This run only does the automated LAMMPS experiment, on a very large cluster (size 128).

- c3-standard-176 is USD 5.08 per node per hour
- TIER_1 networking has usually been about half the cost of the total cluster (at least in previous runs)

If I take 129 nodes this is ~$650.84/hour (rounded up), and 2 hours would be $1,301.68.
I'm going to assume ~$85/hour for TIER_1 networking. Here is a small table to keep track of times and costs.
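
As a quick sanity check on that budget, here is a back-of-envelope sketch using the numbers assumed above (not billing data; the per-node price is the c3-standard-176 rate even though this run actually uses c2d-standard-112, so treat it as a rough upper bound):

```python
# Back-of-envelope cost estimate for the cluster, using the list prices assumed above.
node_price_per_hour = 5.08  # USD, c3-standard-176 rate quoted above (rough upper bound for c2d-standard-112)
tier1_per_hour = 85.0       # USD, rough assumption for TIER_1 networking
nodes = 129                 # 128 for LAMMPS plus one extra

for hours in (1, 2):
    compute = node_price_per_hour * nodes * hours
    network = tier1_per_hour * hours
    print(f"{hours}h: compute ~${compute:,.2f} + TIER_1 ~${network:,.2f} = ~${compute + network:,.2f}")
```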

I absolutely cannot go over that. What I'm going to do is quickly bring up the cluster and immediately test on size 128.
If it doesn't work, I'm going to immediately bring it down and do the same experiment on size 64 first.
Note that the c2d-standard-112 instance type has:

- 112 vCPU which is 56 actual cores
- 448 GB memory

If I finish quickly, I'll try a large problem size or two (that failed on c3), although I don't expect a different outcome. :)
See [run1](../run1) for original notes and planning.

| Time | Event |
| -----|-------|
| 11:40am | Bring up cluster |


## Experiments

### Create the Cluster

Let's create a cluster of c2d-standard-112 nodes at size 129 (128 for LAMMPS plus one extra).
We are following [these best practices](https://cloud.google.com/architecture/best-practices-for-using-mpi-on-compute-engine).

```bash
GOOGLE_PROJECT=myproject

# Network with a large MTU (8896, close to 9k), per Google's MPI best practices
gcloud compute networks create mtu9k --mtu=8896

# 129 nodes: 128 for LAMMPS plus one extra, compact placement, one thread per core
gcloud container clusters create test-cluster \
    --threads-per-core=1 \
    --placement-type=COMPACT \
    --num-nodes=129 \
    --region=us-central1-a \
    --project=${GOOGLE_PROJECT} \
    --machine-type=c2d-standard-112 \
    --network-performance-configs=total-egress-bandwidth-tier=TIER_1 \
    --enable-gvnic \
    --network=mtu9k \
    --system-config-from-file=./crd/system-config.yaml
```

And save metadata about the nodes.

```bash
mkdir -p ./data
kubectl get nodes -o json > data/nodes.json
```
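
As a sanity check, the saved metadata can be summarized to confirm the node count and machine type (a sketch, assuming the standard `kubectl get nodes -o json` layout):

```python
# Summarize data/nodes.json (output of kubectl get nodes -o json) by machine type.
import json
from collections import Counter

with open("data/nodes.json") as fh:
    nodes = json.load(fh)["items"]

machine_types = Counter(
    node["metadata"]["labels"].get("node.kubernetes.io/instance-type", "unknown")
    for node in nodes
)

print(f"total nodes: {len(nodes)}")  # expecting 129
for machine_type, count in sorted(machine_types.items()):
    print(f"  {machine_type}: {count}")
```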

Install the Metrics Operator Python SDK. Version 0.0.19 added support for custom (raw) log parsing.

```bash
pip install metricsoperator==0.0.19
```

### Set Up the Metrics Operator

Install JobSet first:

```bash
VERSION=v0.2.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
```

Then install the metrics operator.

```bash
kubectl apply -f ./operator/metrics-operator.yaml
```

#### LAMMPS Automated

To run LAMMPS at different sizes, we can use the run-lammps.py script. I am only running a 2x2x2 problem size that
I know completes at the smaller sizes; the idea is to see how large the run can get before it stops working.


```bash
mkdir -p ./data/lammps

# This was my mistake - I should have scheduled both operators on ONE node, or asked for 2 additional nodes.
# Note that this data file is saved as crd-lammps-lammps-small-128-shared-node.json
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-128.yaml --sleep 60   # ~12 minutes each

# Note that for this run I verified the lammps pods were not running on any nodes with an operator pod scheduled
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-127.yaml --sleep 5    # ~6 minutes
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-126.yaml --sleep 5    # ~40 seconds
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-124.yaml --sleep 5    # ~38 seconds
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-120.yaml --sleep 5    # ~38 seconds
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-112.yaml --sleep 5    # ~35 seconds
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-96.yaml --sleep 5     # ~31 seconds(?)
time python run-lammps.py --iter 2 --out ./data/lammps --input ./crd/lammps/lammps-small-64.yaml --sleep 5     # ~24 seconds each
time python run-lammps.py --iter 3 --out ./data/lammps --input ./crd/lammps/lammps-small-32.yaml --sleep 5     # ~14 seconds each
time python run-lammps.py --iter 3 --out ./data/lammps --input ./crd/lammps/lammps-small-16.yaml --sleep 5     # ~10 seconds each
time python run-lammps.py --iter 3 --out ./data/lammps --input ./crd/lammps/lammps-small-8.yaml --sleep 5      # ~5 seconds each
time python run-lammps.py --iter 3 --out ./data/lammps --input ./crd/lammps/lammps-small-4.yaml --sleep 5      # ~7 seconds each
time python run-lammps.py --iter 3 --out ./data/lammps --input ./crd/lammps/lammps-small-2.yaml --sleep 5      # ~4 seconds each
time python run-lammps.py --iter 3 --out ./data/lammps --input ./crd/lammps/lammps-small-1.yaml --sleep 5      # ~2 seconds each
```
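
For reference, the `-np` value in each CRD below is just the pod count times the 56 physical cores per c2d-standard-112 node (one LAMMPS pod per node, hyperthreading disabled via `--threads-per-core=1`). A small sketch of that relationship:

```python
# MPI ranks per experiment size: pods (one per node) times 56 physical cores per node.
cores_per_node = 56
sizes = [128, 127, 126, 124, 120, 112, 96, 64, 32, 16, 8, 4, 2, 1]

for pods in sizes:
    print(f"pods={pods:3d} -> mpirun -np {pods * cores_per_node}")
# e.g. pods=128 -> -np 7168, pods=16 -> -np 896, pods=1 -> -np 56
```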

I'm worried that for the size 128 run there was an issue with a shared node. I later tried to explicitly prevent that by pinning the two operators to a single node, but that effort failed (see below). For reference, before that attempt, JobSet was running here:

```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
jobset-controller-manager-598f89f884-wdhtt 2/2 Running 0 61m 10.8.1.4 gke-test-cluster-default-pool-0b9fe0dd-032c <none> <none>
```

And metrics operator here:

```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-controller-manager-775bb5f45c-lcpkh 2/2 Running 0 61m 10.8.122.3 gke-test-cluster-default-pool-0b9fe0dd-x27g <none> <none>
```

I tried deleting:

```bash
VERSION=v0.2.0
kubectl delete -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
kubectl delete -f ./operator/metrics-operator.yaml
```

And then labeling a node to schedule operators on:

```bash
kubectl label nodes gke-test-cluster-default-pool-0b9fe0dd-032c operators=yes
kubectl get node gke-test-cluster-default-pool-0b9fe0dd-032c -o json > data/labeled-node.json
```

But then applying the labeled manifests did not work:

```bash
cd ./operator/labeled
kubectl apply -f jobset-operator.yaml
kubectl delete -f metrics-operator.yaml
```
```console
serviceaccount/jobset-controller-manager created
role.rbac.authorization.k8s.io/jobset-leader-election-role created
clusterrole.rbac.authorization.k8s.io/jobset-manager-role created
clusterrole.rbac.authorization.k8s.io/jobset-metrics-reader created
clusterrole.rbac.authorization.k8s.io/jobset-proxy-role created
rolebinding.rbac.authorization.k8s.io/jobset-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/jobset-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/jobset-proxy-rolebinding created
secret/jobset-webhook-server-cert created
service/jobset-controller-manager-metrics-service created
service/jobset-webhook-service created
mutatingwebhookconfiguration.admissionregistration.k8s.io/jobset-mutating-webhook-configuration created
validatingwebhookconfiguration.admissionregistration.k8s.io/jobset-validating-webhook-configuration created
Error from server (Invalid): error when creating "jobset-operator.yaml": CustomResourceDefinition.apiextensions.k8s.io "jobsets.jobset.x-k8s.io" is invalid: metadata.annotations: Too long: must have at most 262144 bytes
Error from server (BadRequest): error when creating "jobset-operator.yaml": Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal bool into Go struct field PodSpec.spec.template.spec.nodeSelector of type string
```
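
In hindsight, the first error is the usual client-side apply problem with very large CRDs (the earlier JobSet install used `kubectl apply --server-side`, which avoids the oversized `last-applied-configuration` annotation), and the second suggests the `nodeSelector` value `yes` was parsed as a YAML boolean, while Kubernetes expects string values. Below is a hedged sketch of how the labeled Deployment manifests could be patched to force string values; the `operators=yes` label matches the labeling step above, but this is not what actually ran:

```python
# Force nodeSelector values to be strings in Deployment manifests before applying them.
import sys
import yaml  # pip install pyyaml

def pin_to_labeled_node(path, out_path, selector=None):
    # Kubernetes nodeSelector is map[string]string, so "yes" must stay a quoted string.
    selector = selector or {"operators": "yes"}
    with open(path) as fh:
        docs = list(yaml.safe_load_all(fh))
    for doc in docs:
        if doc and doc.get("kind") == "Deployment":
            pod_spec = doc["spec"]["template"]["spec"]
            pod_spec["nodeSelector"] = {str(k): str(v) for k, v in selector.items()}
    with open(out_path, "w") as fh:
        yaml.safe_dump_all(docs, fh, default_flow_style=False)

if __name__ == "__main__":
    pin_to_labeled_node(sys.argv[1], sys.argv[2])
```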

I tried a bunch of variations at the time but nothing worked, and I couldn't keep the cluster up any longer.
Note that I verified that the size 127 pods were _not_ sharing nodes with the operators, and the run still slowed to 6 minutes, so this scaling issue might be real after all.
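
For reference, this is roughly how that check can be done (a sketch; it assumes the LAMMPS pods run in the default namespace and identifies the operators by their `controller-manager` pod names, as shown in the listings above):

```python
# Check whether any LAMMPS pod landed on a node that also hosts an operator pod.
import json
import subprocess

def pods(*args):
    out = subprocess.check_output(["kubectl", "get", "pods", "-o", "json", *args])
    return json.loads(out)["items"]

# Nodes hosting the jobset/metrics controller-manager pods (any namespace).
operator_nodes = {
    pod["spec"].get("nodeName")
    for pod in pods("--all-namespaces")
    if "controller-manager" in pod["metadata"]["name"]
}

# LAMMPS pods are assumed to run in the default namespace for these experiments.
shared = {
    pod["metadata"]["name"]: pod["spec"].get("nodeName")
    for pod in pods("--namespace", "default")
    if pod["spec"].get("nodeName") in operator_nodes
}
print("lammps pods sharing a node with an operator:", shared or "none")
```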

Notes:

- A new problem I haven't seen before: with 128 concurrent pulls, I got a lot of failed image pulls (ImagePullBackOff) from ghcr.io. I had to run it twice so the containers pulled in two groups.

#### Clean Up

```bash
gcloud container clusters delete test-cluster --region=us-central1-a --quiet
```

And that's it!


## Results

### LAMMPS

```bash
mkdir -p ./img/lammps
python plot-lammps.py --results ./data/lammps --out ./img/lammps
```
22 changes: 22 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-1.yaml
@@ -0,0 +1,22 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 1
  metrics:

    # Running more scaled lammps is our main goal
    - name: app-lammps

      # This has intel mpi on rocky linux, as suggested by Google
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky

      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 56 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
17 changes: 17 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-112.yaml
@@ -0,0 +1,17 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 112
  metrics:
    - name: app-lammps
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 6272 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
17 changes: 17 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-120.yaml
@@ -0,0 +1,17 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 120
  metrics:
    - name: app-lammps
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 6720 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
17 changes: 17 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-124.yaml
@@ -0,0 +1,17 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 124
  metrics:
    - name: app-lammps
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 6944 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
17 changes: 17 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-126.yaml
@@ -0,0 +1,17 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 126
  metrics:
    - name: app-lammps
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 7056 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
17 changes: 17 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-127.yaml
@@ -0,0 +1,17 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 127
  metrics:
    - name: app-lammps
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 7112 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
17 changes: 17 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-128.yaml
@@ -0,0 +1,17 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 128
  metrics:
    - name: app-lammps
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 7168 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
20 changes: 20 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-16.yaml
@@ -0,0 +1,20 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 16
  metrics:
    - name: app-lammps

      # This has intel mpi on rocky linux, as suggested by Google
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky

      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 896 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
22 changes: 22 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-2.yaml
@@ -0,0 +1,22 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 2
  metrics:

    # Running more scaled lammps is our main goal
    - name: app-lammps

      # This has intel mpi on rocky linux, as suggested by Google
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky

      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 112 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
18 changes: 18 additions & 0 deletions google/kubecon/lammps/run2/crd/lammps/lammps-small-32.yaml
@@ -0,0 +1,18 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 32
  metrics:
    - name: app-lammps

      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
      options:
        command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 1792 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS
        soleTenancy: "true"
