add LAMMPS experiment
This should work if/when the base or underlying
terraform modules are updated!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Apr 23, 2023
1 parent b421566 commit 970f5dd
Showing 10 changed files with 720 additions and 0 deletions.
6 changes: 6 additions & 0 deletions google/bare-metal-comparison/compute-engine/README.md
@@ -14,3 +14,9 @@ The following things we will want to do for each experiment:

- ensure that we have enough quota for instances, etc.
- do a cost estimation based on instance usage, storage, and time

## Comparison

- Flux nodes: the Flux Operator requires one container rebuild (identical across nodes); Compute Engine requires multiple VM rebuilds
- Interaction: the Flux Operator affords more programmatic interaction (Compute Engine requires a login)
-
@@ -0,0 +1,6 @@
terraform.tfstate
terraform.tfstate.backup
fuse-mounts.sh
basic.tfvars
.terraform
.terraform.lock.hcl
@@ -0,0 +1,153 @@
# Flux Framework LAMMPS Cluster Deployment

This example deploys a Flux Framework cluster on Google Cloud
to run LAMMPS. All components are included here.

# Usage

Copy the variables to make your own variant:

```bash
$ cp lammps.tfvars.example lammps.tfvars
```

Note that the machine types should match those you prepared in [build-images](../../build-images).

Initialize the deployment with the command:

```bash
$ terraform init
```

## Deploy

Then, deploy the cluster with the command:

```bash
terraform apply -var-file lammps.tfvars \
-var region=us-central1 \
-var project_id=$(gcloud config get-value core/project) \
-var network_name=foundation-net \
-var zone=us-central1-a
```

This will set up networking and all the instances! Note that
you can change any of the `-var` values to be appropriate for your environment.

Verify that the cluster is up:

```bash
gcloud compute ssh gffw-login-001 --zone us-central1-a
```
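
The same `-var` overrides are passed to both `terraform apply` and `terraform destroy`, so it can help to generate the command from one place. The following is a minimal Python sketch and is not part of this repository; the helper names `terraform_command` and `current_project` and the placeholder project ID `my-project` are assumptions for illustration.

```python
import shlex
import subprocess


def terraform_command(action, var_file, variables):
    """Build a `terraform <action>` argument list with -var overrides."""
    cmd = ["terraform", action, "-var-file", var_file]
    for name, value in variables.items():
        cmd += ["-var", f"{name}={value}"]
    return cmd


def current_project():
    """Ask gcloud for the active project (requires gcloud on PATH)."""
    result = subprocess.run(
        ["gcloud", "config", "get-value", "core/project"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # "my-project" is a placeholder; call current_project() to look it up.
    variables = {
        "region": "us-central1",
        "project_id": "my-project",
        "network_name": "foundation-net",
        "zone": "us-central1-a",
    }
    for action in ("apply", "destroy"):
        print(shlex.join(terraform_command(action, "lammps.tfvars", variables)))
```

Printing the command rather than running it keeps the sketch safe to try; pass the list to `subprocess.run` once the values look right.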

## Run Experiments

The easiest approach is to copy the experiment script to your home directory on the login node (replace `sochat1_llnl_gov` with your own username):

```bash
$ gcloud compute scp --zone us-central1-a ./run-experiments.py gffw-login-001:/home/sochat1_llnl_gov/run-experiments.py
```

Then shell in (as we did above):

```bash
gcloud compute ssh gffw-login-001 --zone us-central1-a
```

Go to the experiment directory with our files of interest:

```bash
cd /opt/lammps/examples/reaxff/HNS
```

Try running the LAMMPS experiment, given that LAMMPS is installed on the nodes and (for this example) we have only two nodes.
Note that by default, output data will be written to a "data" subfolder of the present working directory. Since
we don't have write access to the experiment files folder, we direct output to our home directory (the output directory will be created):

```bash
$ python3 $HOME/run-experiments.py --outdir /home/sochat1_llnl_gov/data \
--workdir /opt/lammps/examples/reaxff/HNS \
--times 10 -N 2 --tasks 2 lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite
```
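
The command above mixes the script's own options with the LAMMPS command that follows them. `run-experiments.py` itself is not shown here, so the following is only a sketch of how such a command line could be parsed with `argparse`; the defaults, help strings, and the `nodes`/`command`/`args` destination names are assumptions.

```python
import argparse


def get_parser():
    """Parser for the options used above, followed by the command to run."""
    parser = argparse.ArgumentParser(description="Submit a command to Flux several times")
    parser.add_argument("--outdir", default="data", help="directory for logs and job info")
    parser.add_argument("--workdir", default=None, help="working directory for each job")
    parser.add_argument("--times", type=int, default=1, help="number of submissions")
    parser.add_argument("-N", dest="nodes", type=int, default=1, help="nodes per job")
    parser.add_argument("--tasks", type=int, default=1, help="tasks per job")
    # The first free-standing token is the executable; everything after it,
    # including tokens that look like options, is passed through untouched.
    parser.add_argument("command", help="executable to run, e.g. lmp")
    parser.add_argument("args", nargs=argparse.REMAINDER, help="arguments for the executable")
    return parser


if __name__ == "__main__":
    args = get_parser().parse_args()
    print(args.nodes, args.tasks, args.times, [args.command] + args.args)
```

Collecting the trailing tokens with `argparse.REMAINDER` is what lets flags such as `-v` and `-in` reach `lmp` instead of being rejected as unknown options of the script.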

<details>

<summary>Example Output</summary>

```console
N: 2
times: 10
sleep: 10
outdir: /home/sochat1_llnl_gov/data
tasks: 2
command: lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite
workdir: /opt/lammps/examples/reaxff/HNS
dry-run: False
identifier: lammps
Submit ƒ31XLJ9fgb: 1 of 10
Submit ƒ31XQvVRh1: 2 of 10
Submit ƒ31XVVsD8j: 3 of 10
Submit ƒ31Xa6iyro: 4 of 10
Submit ƒ31Xehakas: 5 of 10
Submit ƒ31XjKvWbH: 6 of 10
Submit ƒ31XovnHKM: 7 of 10
Submit ƒ31XtXe43R: 8 of 10
Submit ƒ31XyCwncX: 9 of 10
Submit ƒ31Y439Ssh: 10 of 10

⭐️ Waiting for jobs to finish...
Still waiting on job ƒ31XLJ9fgb, has state RUN
No longer waiting on job ƒ31XLJ9fgb, FINISHED 0!
Still waiting on job ƒ31XQvVRh1, has state RUN
No longer waiting on job ƒ31XQvVRh1, FINISHED 0!
Still waiting on job ƒ31XVVsD8j, has state RUN
No longer waiting on job ƒ31XVVsD8j, FINISHED 0!
Still waiting on job ƒ31Xa6iyro, has state RUN
No longer waiting on job ƒ31Xa6iyro, FINISHED 0!
Still waiting on job ƒ31Xehakas, has state RUN
No longer waiting on job ƒ31Xehakas, FINISHED 0!
Still waiting on job ƒ31XjKvWbH, has state RUN
No longer waiting on job ƒ31XjKvWbH, FINISHED 0!
Still waiting on job ƒ31XovnHKM, has state RUN
No longer waiting on job ƒ31XovnHKM, FINISHED 0!
Still waiting on job ƒ31XtXe43R, has state RUN
No longer waiting on job ƒ31XtXe43R, FINISHED 0!
Still waiting on job ƒ31XyCwncX, has state RUN
No longer waiting on job ƒ31XyCwncX, FINISHED 0!
Still waiting on job ƒ31Y439Ssh, has state RUN
No longer waiting on job ƒ31Y439Ssh, FINISHED 0!
Jobs are complete, goodbye! 👋️
```

</details>
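
The output above shows two phases: every run is submitted first, and then the script waits on each job in turn. A rough Python sketch of that pattern follows. It is not the code in `run-experiments.py` (which is not shown here and may well use Flux's Python bindings); it assumes the `flux submit` command line, a caller-supplied `get_state` function, and Flux's terminal job state `INACTIVE`.

```python
import time


def submit_command(command, nodes, tasks, workdir, log_path):
    """Build a `flux submit` argument list for a single run."""
    return [
        "flux", "submit",
        "-N", str(nodes),
        "-n", str(tasks),
        f"--cwd={workdir}",
        f"--output={log_path}",
        f"--error={log_path}",
        *command,
    ]


def wait_for_jobs(job_ids, get_state, sleep=10, finished=("INACTIVE",)):
    """Poll each job, in submission order, until it reaches a finished state."""
    results = {}
    for job_id in job_ids:
        state = get_state(job_id)
        while state not in finished:
            print(f"Still waiting on job {job_id}, has state {state}")
            time.sleep(sleep)
            state = get_state(job_id)
        print(f"No longer waiting on job {job_id}, state {state}")
        results[job_id] = state
    return results
```

Passing `get_state` in as a function keeps the loop independent of how job state is looked up, so it can be exercised with a stub before pointing it at a real cluster.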

After the last submission, the script blocks until all of the jobs finish.
And that's it! The output directory in your home will have both log files (the job output and error)
and the job info (JSON) from Flux:

```bash
$ ls /home/sochat1_llnl_gov/data/
```
```console
lammps-0-info.json lammps-2-info.json lammps-4-info.json lammps-6-info.json lammps-8-info.json
lammps-0.log lammps-2.log lammps-4.log lammps-6.log lammps-8.log
lammps-1-info.json lammps-3-info.json lammps-5-info.json lammps-7-info.json lammps-9-info.json
lammps-1.log lammps-3.log lammps-5.log lammps-7.log lammps-9.log
```
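
The listing shows one `.log` and one `-info.json` file per run, numbered from 0. A small check like the following can confirm that nothing is missing before you tear the cluster down. It is a sketch that assumes only the naming pattern visible in the listing; the function names are hypothetical.

```python
import os


def expected_outputs(identifier, times):
    """File names expected for `times` runs, following the listing above."""
    names = []
    for i in range(times):
        names.append(f"{identifier}-{i}.log")
        names.append(f"{identifier}-{i}-info.json")
    return names


def missing_outputs(outdir, identifier, times):
    """Return the expected file names that are not present in outdir."""
    present = set(os.listdir(outdir))
    return [name for name in expected_outputs(identifier, times) if name not in present]


if __name__ == "__main__":
    # Adjust the directory to wherever --outdir pointed.
    outdir = os.path.expanduser("~/data")
    if os.path.isdir(outdir):
        print(missing_outputs(outdir, "lammps", 10))
```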

After you exit from the node, you can copy the results to your local machine:

```bash
$ mkdir -p ./data
$ gcloud compute scp --zone us-central1-a gffw-login-001:/home/sochat1_llnl_gov/data/* ./data
```

And that's really it :) When you are finished, destroy the cluster:

```bash
terraform destroy -var-file lammps.tfvars \
-var region=us-central1 \
-var project_id=$(gcloud config get-value core/project) \
-var network_name=foundation-net \
-var zone=us-central1-a
```
@@ -0,0 +1,135 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

locals {
  subnet = "${var.region}/${var.network_name}-subnet-01"
}

data "google_compute_default_service_account" "default" {
  project = var.project_id
}

data "google_compute_image" "rocky8" {
  project = "rocky-linux-cloud"
  family  = "rocky-linux-8-optimized-gcp"
}

module "network" {
  source       = "github.com/terraform-google-modules/terraform-google-network"
  project_id   = var.project_id
  network_name = var.network_name
  subnets = [
    {
      subnet_name   = "${var.network_name}-subnet-01"
      subnet_ip     = var.subnet_ip
      subnet_region = var.region
    }
  ]
}

module "nat" {
  source        = "github.com/terraform-google-modules/terraform-google-cloud-nat"
  project_id    = var.project_id
  region        = var.region
  network       = module.network.network_name
  create_router = true
  router        = "${module.network.network_name}-router"
}

module "firewall" {
  source       = "github.com/terraform-google-modules/terraform-google-network/modules/firewall-rules"
  project_id   = var.project_id
  network_name = module.network.network_name
  rules = [
    {
      name                    = "${var.network_name}-allow-ssh"
      direction               = "INGRESS"
      priority                = null
      description             = null
      ranges                  = ["0.0.0.0/0"]
      source_tags             = null
      source_service_accounts = null
      target_tags             = ["flux"]
      target_service_accounts = null
      allow = [
        {
          protocol = "tcp"
          ports    = ["22"]
        }
      ],
      deny = []
      log_config = {
        metadata = "INCLUDE_ALL_METADATA"
      }
    },
    {
      name        = "${var.network_name}-allow-internal-traffic"
      direction   = "INGRESS"
      priority    = null
      description = null
      # NOTE: 0.0.0.0/0 matches any source address, not only the subnet
      ranges                  = ["0.0.0.0/0"]
      source_tags             = null
      source_service_accounts = null
      target_tags             = ["ssh", "flux"]
      target_service_accounts = null
      allow = [
        {
          protocol = "icmp"
          ports    = []
        },
        {
          protocol = "udp"
          ports    = ["0-65535"]
        },
        {
          protocol = "tcp"
          ports    = ["0-65535"]
        }
      ]
      deny = []
      log_config = {
        metadata = "INCLUDE_ALL_METADATA"
      }
    }
  ]
}

module "nfs_server_instance_template" {
  source               = "github.com/terraform-google-modules/terraform-google-vm/modules/instance_template"
  region               = var.region
  project_id           = var.project_id
  name_prefix          = var.nfs_prefix
  subnetwork           = module.network.subnets[local.subnet].self_link
  tags                 = ["ssh", "flux", "nfs"]
  machine_type         = "e2-standard-4"
  disk_size_gb         = var.nfs_size
  source_image         = data.google_compute_image.rocky8.self_link
  source_image_project = data.google_compute_image.rocky8.project
  service_account = {
    email  = data.google_compute_default_service_account.default.email
    scopes = ["cloud-platform"]
  }
  startup_script = file("${path.module}/install_nfs.sh")
}

module "nfs_server_instance" {
  source              = "github.com/terraform-google-modules/terraform-google-vm/modules/compute_instance"
  region              = var.region
  zone                = var.zone
  hostname            = var.nfs_prefix
  add_hostname_suffix = true
  num_instances       = 1
  instance_template   = module.nfs_server_instance_template.self_link
  subnetwork          = module.network.subnets[local.subnet].self_link
}
@@ -0,0 +1,27 @@
#!/bin/bash

# This boot script installs LAMMPS on all nodes

# Install build dependencies, plus `time` for timed commands
sudo dnf update -y && sudo dnf install -y time cmake openmpi clang git-clang-format
sudo ldconfig

# Needed for ffmpeg
sudo dnf install -y https://download1.rpmfusion.org/free/el/rpmfusion-free-release-8.noarch.rpm
sudo dnf install -y https://download1.rpmfusion.org/nonfree/el/rpmfusion-nonfree-release-8.noarch.rpm
sudo dnf install -y ffmpeg

# Install LAMMPS
sudo git clone --depth 1 --branch stable_29Sep2021_update2 https://github.com/lammps/lammps.git /opt/lammps
cd /opt/lammps
sudo mkdir build
cd build

# The CMake prefix path is needed; otherwise OpenMPI is not found
sudo cmake ../cmake -DCMAKE_INSTALL_PREFIX:PATH=/usr -D PKG_REAXFF=yes -D BUILD_MPI=yes -D PKG_OPT=yes -D FFT=FFTW3 -DCMAKE_PREFIX_PATH=/usr/lib64/openmpi
sudo make
sudo make install

# Run from a node:
# cd /opt/lammps/examples/reaxff/HNS
# flux run -n 1 lmp -v x 1 -v y 1 -v z 1 -in in.reaxc.hns -nocite
@@ -0,0 +1,15 @@
#!/bin/bash

dnf install nfs-utils -y

mkdir -p /var/nfs/home
chown nobody:nobody /var/nfs/home

ip_addr=$(hostname -I)

echo "/var/nfs/home *(rw,no_subtree_check,no_root_squash)" >> /etc/exports

firewall-cmd --add-service={nfs,nfs3,mountd,rpc-bind} --permanent
firewall-cmd --reload

systemctl enable --now nfs-server rpcbind
@@ -0,0 +1,44 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

manager_machine_type = "e2-standard-2"
manager_name_prefix  = "gffw"
manager_scopes       = ["cloud-platform"]

login_node_specs = [
  {
    name_prefix  = "gffw-login"
    machine_arch = "x86-64"
    machine_type = "n2-standard-2"
    instances    = 1
    properties   = []
    boot_script  = "install_lammps.sh"
  },
]
login_scopes = ["cloud-platform"]

compute_node_specs = [
  {
    name_prefix  = "gffw-compute-a"
    machine_arch = "x86-64"
    machine_type = "n2-standard-2"
    gpu_type     = null
    gpu_count    = 0
    compact      = false
    instances    = 2
    properties   = []
    boot_script  = "install_lammps.sh"
  },
]
compute_scopes = ["cloud-platform"]
