
Commit

add run4 lammps result - it scaled!
I have no idea why it works now. Going to go eat some worms.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Oct 3, 2023
1 parent 50735de commit 0d77684
Showing 83 changed files with 127,715 additions and 0 deletions.
93 changes: 93 additions & 0 deletions google/kubecon/lammps/run4-test/README.md
@@ -0,0 +1,93 @@
# Lammps Automated (test)

This will test running a larger problem size of LAMMPS on a smaller (9-node) cluster. This
is to ensure that when we bring the larger cluster up, we at least know it runs at size 8.

- c3-standard-176 costs USD 5.08 per node per hour
- TIER-1 networking is usually about half the cost of the total cluster (at least it was previously)

This should be ~$45/hour, which isn't bad for testing.
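
As a quick back-of-the-envelope check (assuming 9 nodes at the on-demand rate above, before any TIER_1 networking add-on):

```bash
# 9 nodes * 5.08 USD/node/hour ~= 45.7 USD/hour for compute alone
python3 -c 'print(round(9 * 5.08, 2))'
```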

- 112 vCPU, which is 56 physical cores
- 448 GB memory

Unlike [run3](../run3), we are going to use a larger problem size for this one, one that we know runs on 16 nodes (and should run on more too).

## Experiments

### 1. Create the Cluster

Let's test a cluster on c3-standard-176 at size 9.
We are following [these best practices](https://cloud.google.com/architecture/best-practices-for-using-mpi-on-compute-engine).

```bash
GOOGLE_PROJECT=myproject
gcloud compute networks create mtu9k --mtu=8896
gcloud container clusters create test-cluster \
--threads-per-core=1 \
--placement-type=COMPACT \
--num-nodes=9 \
--region=us-central1-a \
--project=${GOOGLE_PROJECT} \
--machine-type=c2d-standard-112 \
--network-performance-configs=total-egress-bandwidth-tier=TIER_1 \
--enable-gvnic \
--network=mtu9k \
--system-config-from-file=./crd/system-config.yaml
```
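
If kubectl is not already pointed at the new cluster, fetch credentials first (a small sketch; the name and location mirror the create command above):

```bash
gcloud container clusters get-credentials test-cluster \
    --region=us-central1-a \
    --project=${GOOGLE_PROJECT}
```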

And save metadata about the nodes.

```bash
mkdir -p ./data
kubectl get nodes -o json > data/nodes.json
```
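
Optionally, sanity check that all nodes are the expected machine type (assumes jq is installed; the instance type label is standard on GKE nodes):

```bash
jq -r '.items[].metadata.labels["node.kubernetes.io/instance-type"]' data/nodes.json | sort | uniq -c
```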

Install the Metrics Operator Python SDK. Version 0.0.19 adds support for custom (raw) log parsing.

```bash
pip install metricsoperator==0.0.19
```
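
To confirm the expected version is the one installed:

```bash
pip show metricsoperator | grep Version
```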

### 2. Set Up the Metrics Operator

Install JobSet first:

```bash
VERSION=v0.2.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
```
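
Optionally, wait for the JobSet controller to come up before installing the metrics operator (the release manifest should install it into the jobset-system namespace):

```bash
kubectl wait --for=condition=Available deployment --all -n jobset-system --timeout=120s
```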

Then install the metrics operator.

```bash
kubectl apply -f ./operator/metrics-operator.yaml
```
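
A quick check that the operator's CRDs were registered (the exact CRD names can vary with the operator version):

```bash
kubectl get crd | grep -i metric
```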

### 3. LAMMPS Automated

To run LAMMPS at different sizes, we can use the run-lammps.py script. I am only running a 2x2x2 problem
size that I know will run and complete at the smaller sizes; the idea is to see how large this
can go before it stops working.


```bash
mkdir -p ./data/lammps

# Note this takes about 1:22 to 1:23 (the sleep is 5 seconds each)
time python run-lammps.py --iter 3 --out ./data/lammps --input ./crd/lammps/lammps-small-8.yaml --sleep 5

# size 8 is 1:23
# size 4 is 2:31
# size 2 is 4:43
# size 1 is 9:05
```

We can use these times to estimate runs for tomorrow.
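
Those wall times also give a rough sense of scaling (a back-of-the-envelope sketch using the minute:second values above converted to seconds, with size 1 as the baseline):

```bash
python3 -c '
# size: total wall time in seconds (9:05, 4:43, 2:31, 1:23 from above)
times = {1: 545, 2: 283, 4: 151, 8: 83}
base = times[1]
for size, t in sorted(times.items()):
    print(f"size {size}: speedup {base/t:.2f}x, efficiency {base/(t*size):.0%}")
'
```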

### 4. Clean up

```bash
gcloud container clusters delete test-cluster --region=us-central1-a --quiet
```
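
If the mtu9k network was created only for this test, it can be cleaned up as well:

```bash
gcloud compute networks delete mtu9k --quiet
```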

325 changes: 325 additions & 0 deletions google/kubecon/lammps/run4-test/crd/lammps/data/hpctoolkit/lammps.log


