# Using Weights & Biases' Sweeps module on Slurm

Weights & Biases (W&B) provides a number of tools that make tracking machine learning (ML) models a lot easier. One of its most popular tools is the Sweeps module, which lets you run state-of-the-art hyperparameter optimization across many machines in parallel using `wandb.agent()`. Many academic researchers have access to high performance computing (HPC) clusters that use a Slurm job scheduler, but spinning up multiple W&B agents within a Slurm job is not straightforward. Let me walk you through how to do just that.

This walkthrough will have two parts:

1. Setting up your own burstable Slurm cluster on Amazon Web Services (AWS) using their [aws-plugin-for-slurm](https://github.com/aws-samples/aws-plugin-for-slurm/tree/plugin-v2).
2. Formulating and submitting a W&B sweep on Slurm.

## Setting up your own Slurm cluster on AWS
Note: if you are using AWS's GPU instances for the first time (the P-family being the most common), you must [request a service limit increase](http://aws.amazon.com/contact-us/ec2-request). The limit you request should be at least equal to the number of nodes you'll make available to your Slurm cluster (more below).

AWS offers a [plugin](https://github.com/aws-samples/aws-plugin-for-slurm/tree/plugin-v2) that greatly simplifies the process of creating your own burstable Slurm cluster. The cluster keeps a 'headnode' running at all times; it runs the Slurm daemon (which manages the job queue and spins up resources) and a cron job that tears down unused resources. Your jobs run on compute nodes that are spun up when needed and torn down when idle, saving you money. The plugin also offers other useful abilities: it can extend an existing cluster to give you more compute power or specialized hardware when needed, and it can define partitions that use only spot instances, which saves you even more money.

Let's get started:

We will be setting up our cluster via a CloudFormation template, a YAML file that specifies all of the parameters of our cluster. The template still requires a few inputs:
- a Virtual Private Cloud (VPC)
- two subnets in the VPC but in different availability zones
- a headnode instance type
- a compute node instance type
- an SSH key

Your AWS account comes with a default VPC and default subnets in every availability zone, which you are more than welcome to use. If you want to set up new ones, follow these instructions:

### AWS setup

[Install version 2 of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html).

In your terminal:

1. Log in to the CLI: `aws configure`
2. Create an SSH key and upload it to AWS.
3. Create a VPC: `aws2 ec2 create-vpc --cidr-block 10.0.0.0/16`
4. Create subnet 1: `aws2 ec2 create-subnet --vpc-id [VPC_ID] --cidr-block 10.0.0.0/20 --availability-zone us-west-2a` (use `aws2 ec2 describe-vpcs` to show the VPC_ID)
5. Create subnet 2: `aws2 ec2 create-subnet --vpc-id [VPC_ID] --cidr-block 10.0.16.0/20 --availability-zone us-west-2b`
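
If you pick different CIDR blocks than the ones above, it can help to confirm that each subnet sits inside the VPC range and that the subnets do not overlap before calling AWS. The following is a minimal sketch using only Python's standard `ipaddress` module; the helper name `check_subnets` is hypothetical, and the example values are simply the CIDR blocks from the commands above.

```python
import ipaddress


def check_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems found; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must fall entirely within the VPC's address range.
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is not inside {vpc}")
    # No two subnets may share addresses.
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    # CIDR blocks taken from the create-vpc and create-subnet commands above.
    print(check_subnets("10.0.0.0/16", ["10.0.0.0/20", "10.0.16.0/20"]))
```

An empty list means the two `/20` subnets are valid for the `/16` VPC.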

### CloudFormation

Download the CloudFormation template: `wget -q https://github.com/elyall/wandb_on_slurm/raw/main/cloudformation-template.yaml`

Then edit the `cloudformation-template.yaml` to your liking. For instance you can change the maximum number of nodes allowed in your cluster at line 319, or the size of your headnode's filesystem where all of the compute nodes access your code at line 230. 

Next we'll set up your stack:

1. Go to the [AWS Console](https://console.aws.amazon.com/console/home), click "Services" at the top left and type in or select "CloudFormation".
2. Hit the "Create Stack" dropdown on the right side, then select "With new resources (standard)".
3. Under "Specify Template" click "Upload a template file", then hit "Choose file" and upload the `cloudformation-template.yaml`. 

Hit "Next".

![step1](imgs/cloudformation-step1.png)

1. Enter a "Stack Name" such as "slurm".
2. Select your VPC from the dropdown. (Note: the stack and VPC have to be in the same region)
3. Select two different subnets on different availability zones with the next two dropdowns.
4. If you wish, change the Headnode or Compute Node Instance Type to another [offered on EC2](https://aws.amazon.com/ec2/instance-types/). Make sure the value under "Compute Node vCPUs" matches the number of vCPUs available in the compute node instance type.
5. Select your SSH key from the "Key Pair" dropdown.

Hit "Next".

![step2](imgs/cloudformation-step2.png)

Hit "Next" again.

Agree to the acknowledgement and then select "Create stack".



### Log in to your headnode
1. In the AWS Console, go to "Services" -> "EC2" -> "Instances" on the left, then check the "headnode" instance and select "Connect" near the top.
2. Select "SSH client", copy the text under "Example", and paste it into your terminal.

## Using Weights & Biases on Slurm
I'm going to demonstrate one example of how to use W&B's Sweeps module on Slurm. This example assumes you are using a cluster built with the aws-plugin-for-slurm, but the basic principles apply to any Slurm cluster.

To start, log in to your cluster, then clone the repo: `git clone https://github.com/elyall/wandb_on_slurm.git`

Next we will download the example model we will optimize, install its dependencies into a virtual environment, and log in to wandb by running the `setup.sh` script:
```
cd ~/wandb_on_slurm
bash setup.sh
```

Here's the setup script, `setup.sh`:
```
#!/bin/bash

# update system packages and pip
sudo yum update -y
sudo yum install python3-devel -y # necessary for `pip install wandb`
sudo pip3 install --upgrade pip

# create accessible logs directory
sudo mkdir /nfs/logs
sudo chown -R ec2-user:ec2-user /nfs/logs

# create accessible code directory
sudo mkdir /nfs/code
sudo chown -R ec2-user:ec2-user /nfs/code
cp ~/wandb_on_slurm/wandb_on_slurm.py /nfs/code/
cp ~/wandb_on_slurm/start-agent.sh /nfs/code/
chmod +x /nfs/code/start-agent.sh
cd /nfs/code

# clone example to run
git clone https://github.com/wandb/examples.git

# create virtual environment with required dependencies
python3 -m venv wandb-venv
source wandb-venv/bin/activate
pip install --upgrade -r examples/examples/keras/keras-cnn-fashion/requirements.txt

# torch
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda install pytorch torchvision -c pytorch

# login and copy key to accessible folder
wandb login
python - << EOF
impor
```

### Running a Slurm job
To run a Slurm job we typically need two things:
1. An sbatch header detailing the resources the job needs
2. The code we want to execute

#### SBATCH Header
Slurm jobs are submitted via shell scripts that have headers specifying the resources the job needs. Here is an example header:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=aws
#SBATCH --time=0:20:0
#SBATCH --output=/nfs/logs/slurm-%j.log
#SBATCH --chdir=/nfs/code/
```
More information on the parameters you can set [can be found here](https://slurm.schedmd.com/sbatch.html). By default most of the parameters are optional, but your cluster manager has likely made some of them mandatory. Mandatory parameters often include:
- `partition` - specifies which subcluster of nodes to run on.
- `time` - specifies the maximum amount of time the job is allowed to run.
- `qos` - the quality of service to run under; on some clusters this also determines how the job is billed (the account to charge is set separately with `account`).
- `nodes` - the number of nodes to assign to the job.
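
Because a missing mandatory parameter is only reported when the job is submitted, it can be handy to check a script beforehand. The following is a minimal sketch, not part of the original repo: the function names are hypothetical, it only recognizes long-form options (`--name=value` or `--name value`), and the list of required options is an assumption you should replace with whatever your cluster manager enforces.

```python
import re

# Matches lines such as "#SBATCH --nodes=2" or "#SBATCH --time 0:20:0".
SBATCH_LONG_OPTION = re.compile(r"^#SBATCH\s+--([A-Za-z][\w-]*)")


def sbatch_options(script_text):
    """Collect the long option names set by #SBATCH lines in a job script."""
    names = set()
    for line in script_text.splitlines():
        match = SBATCH_LONG_OPTION.match(line.strip())
        if match:
            names.add(match.group(1))
    return names


def missing_options(script_text, required):
    """Return the required option names that the script does not set, sorted."""
    return sorted(set(required) - sbatch_options(script_text))


# The example header from above.
EXAMPLE_HEADER = """#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=aws
#SBATCH --time=0:20:0
#SBATCH --output=/nfs/logs/slurm-%j.log
#SBATCH --chdir=/nfs/code/
"""

if __name__ == "__main__":
    # The required set here is an assumption; ask your cluster manager for the real one.
    print(missing_options(EXAMPLE_HEADER, ["partition", "time", "qos", "nodes"]))
```

For the example header this reports `qos` as the only missing option, which is fine on the AWS cluster built above but would matter on a cluster that requires it.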

Your code goes after the header; it runs on the resources the job scheduler assigns based on the header. This code can be as simple as `python run.py`, but is often more complex. Here is an example job submission script:

`wandb_on_slurm.sbatch`:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=aws
#SBATCH --time=0:20:0
#SBATCH --output=/nfs/logs/slurm-%j.log
#SBATCH --chdir=/nfs/code/
date;hostname;id;pwd;ls

echo 'gathering node information'
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # get the node names
nodes_array=( $nodes )
echo "${nodes_array[@]}"

echo 'activating virtual environment'
source wandb-venv/bin/activate

config_yaml='/nfs/code/examples/examples/keras/keras-cnn-fashion/sweep-bayes-hyperband.yaml'
echo 'template:' $config_yaml

echo 'running script'
python wandb_on_slurm.py $config_yaml "${nodes_array[@]}"
```

In this script, the steps that would be the same on any machine are:

1. activate a virtual environment where the dependencies are installed
2. specify the parameters of our sweep as defined in a YAML file
3. run our sweep with Python

The one step unique to Slurm, needed to take advantage of its multi-node parallelism, is to determine which nodes are assigned to the job and pass that list of nodes into the Python script, which spins up a W&B agent on each node.
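
As an aside, the node lookup could also be done from Python instead of in the batch script. This is only a sketch of that alternative, not what the repo does: it assumes `scontrol` is on the PATH and, when no list is passed in, that `SLURM_JOB_NODELIST` is set (both hold inside a running Slurm job). The function name `expand_nodelist` and the injectable `run` parameter are hypothetical, the latter added so the parsing can be exercised off-cluster.

```python
import os
import subprocess


def expand_nodelist(nodelist=None, run=subprocess.run):
    """Expand a compact Slurm node list into individual hostnames.

    Wraps `scontrol show hostnames`, which prints one hostname per line.
    """
    if nodelist is None:
        nodelist = os.environ["SLURM_JOB_NODELIST"]
    result = run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True,
        text=True,
        check=True,  # raise if scontrol exits with an error
    )
    return result.stdout.split()
```

Passing the list in from the batch script, as done above, keeps the Python script free of Slurm-specific calls, which is a reasonable argument for the original design.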

Here is what the Python script looks like:

`wandb_on_slurm.py`:
```
import wandb
import subprocess
import click
import yaml
import os
import json

if os.path.exists("/nfs/code/keys.json"):
    with open("/nfs/code/keys.json") as file:
        api_key = json.load(file)["work_account"]
        os.environ["WANDB_API_KEY"] = api_key

@click.command()
@click.argument("config_yaml")
@click.argument("node_list", nargs=-1)
def run(config_yaml, node_list):
    project_name = "wandb_on_slurm"

    wandb.init(project=project_name)

    with open(config_yaml) as file:
        config_dict = yaml.load(file, Loader=yaml.FullLoader)
    config_dict['program'] = '/nfs/code/examples/examples/keras/keras-cnn-fashion/train.py'
    config_dict['parameters']['epochs']['value'] = 5

    sweep_id = wandb.sweep(config_dict, project=project_name)

    sp = []
    for node in node_list:
        sp.append(subprocess.Popen(['srun',
                        '--nodes=1',
                        '--ntasks=1',
                        '-w',
                        node,
```

The specialized part of this script is the for loop at the bottom where we iterate over the list of nodes and start an agent on each one using `start-agent.sh`:

```
#!/bin/bash

wandb agent $1 --project $2
```
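
The listing of `wandb_on_slurm.py` above is cut off inside the `subprocess.Popen` call. As a rough sketch only, and not the author's actual code, the loop could be completed along these lines: one `srun` job step per node, each running `start-agent.sh` with the sweep ID and project name as its two positional arguments. The function names, the default script path (taken from where `setup.sh` copies the file), and the injectable `popen` parameter are assumptions made for illustration.

```python
import subprocess


def build_agent_command(node, sweep_id, project, script="/nfs/code/start-agent.sh"):
    """Build the srun command that starts one W&B agent on a specific node."""
    return [
        "srun",
        "--nodes=1",
        "--ntasks=1",
        "-w", node,   # pin this job step to the given node
        script,
        sweep_id,     # becomes $1 in start-agent.sh
        project,      # becomes $2 in start-agent.sh
    ]


def launch_agents(node_list, sweep_id, project, popen=subprocess.Popen):
    """Start one agent per node, then block until every agent has exited."""
    # Launch everything first so the agents run in parallel.
    processes = [
        popen(build_agent_command(node, sweep_id, project)) for node in node_list
    ]
    # Wait on each process and collect its exit code.
    return [process.wait() for process in processes]
```

Starting all of the processes before waiting on any of them is what lets the agents run concurrently; waiting inside the launch loop would run them one node at a time.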