## Slurm Federation on AWS ParallelCluster 

Building on what you learned in the pcluster-athena++ and pcluster-athena++short notebooks, we will explore how to use Slurm federation on AWS ParallelCluster.

Many research institutions have existing on-prem HPC clusters that use the Slurm scheduler. Those clusters have a fixed size and sometimes need additional capacity to run workloads. "Bursting into the cloud" is one way to handle such requests.

In this notebook, we will
1. Build two AWS ParallelClusters - "awscluster" (as a worker cluster) and "onpremcluster" (to simulate an on-prem cluster)
1. Enable the Slurm REST API on "onpremcluster"
1. Enable Slurm accounting with MySQL as the data store on "onpremcluster"
1. Enable slurmdbd on "awscluster" and point it to the Slurm accounting endpoint on "onpremcluster"
1. Create a federation with the "awscluster" and "onpremcluster" clusters
1. Submit a job from "onpremcluster" to "awscluster"
1. Submit a job from "awscluster" to "onpremcluster"
1. Check job/queue status on both clusters



In [None]:
import boto3
import botocore
import json
import time
import os
import sys
import base64
import docker
import pandas as pd
import importlib
import project_path # path to helper methods
from lib import workshop
from botocore.exceptions import ClientError
from IPython.display import HTML, display

#sys.path.insert(0, '.')
import pcluster_athena
importlib.reload(pcluster_athena)


# unique name of the pcluster
onprem_pcluster_name = 'onpremcluster'
onprem_config_name = "config-simple"
onprem_post_install_script_prefix = "scripts/pcluster_post_install_onprem.sh"

# unique name of the pcluster
aws_pcluster_name = 'awscluster'
aws_config_name = "config-simple"
aws_post_install_script_prefix = "scripts/pcluster_post_install_aws.sh"

federation_name = "burstworkshop"


In [None]:
# Used during development to reload the module after it has changed
try:
    del sys.modules['pcluster_athena']
except KeyError:
    # ignore if the module is not loaded
    print('Module not loaded, ignore')
    
from pcluster_athena import PClusterHelper


In [None]:
# create the onprem cluster
onprem_pcluster_helper = PClusterHelper(onprem_pcluster_name, onprem_config_name, onprem_post_install_script_prefix, federation_name=federation_name)
onprem_pcluster_helper.create_before()
!pcluster create $onprem_pcluster_helper.pcluster_name -nr -c build/$onprem_config_name
onprem_pcluster_helper.create_after()


In [None]:
# Use the onprem cluster head node as the dbd server.

onprem_pcluster_status = !pcluster status $onprem_pcluster_name
# Grab the IP of the head node, where the Slurm REST endpoint runs. The returned IP is the private IP of the head node, so make sure your
# SageMaker notebook is created in the same VPC (default VPC)
dbd_host = onprem_pcluster_status.grep('MasterPrivateIP').s.split()[1]

print(dbd_host)


In [None]:

# create the aws cluster
aws_pcluster_helper = PClusterHelper(aws_pcluster_name, aws_config_name, aws_post_install_script_prefix, dbd_host=dbd_host, federation_name=federation_name)
aws_pcluster_helper.create_before()
!pcluster create $aws_pcluster_helper.pcluster_name -nr -c build/$aws_config_name
aws_pcluster_helper.create_after()


In [None]:
# Allow each cluster's head node security group as an ingress source on the other's. This only applies to the
# current configuration, where both clusters are in AWS.
# For a real on-prem environment, you will need to configure your network firewall to allow traffic between the two clusters.
# Each pcluster is created with a set of CloudFormation templates. We can get some detailed information from the stack.
#!pcluster status $aws_pcluster_name

cf_client = boto3.client("cloudformation")
aws_pcluster_head_sg = cf_client.describe_stack_resource(StackName='parallelcluster-'+aws_pcluster_name, LogicalResourceId='MasterSecurityGroup')['StackResourceDetail']['PhysicalResourceId']
onprem_pcluster_head_sg = cf_client.describe_stack_resource(StackName='parallelcluster-'+onprem_pcluster_name, LogicalResourceId='MasterSecurityGroup')['StackResourceDetail']['PhysicalResourceId']

print(aws_pcluster_head_sg)
print(onprem_pcluster_head_sg)

ec2_client = boto3.client("ec2")
try:
    resp = ec2_client.authorize_security_group_ingress(GroupId=aws_pcluster_head_sg , IpPermissions=[ {'FromPort': -1, 'IpProtocol': '-1', 'UserIdGroupPairs': [{'GroupId': onprem_pcluster_head_sg}] } ] ) 
except ClientError  as err:
    print(err , " this is ok , we can ignore")

try:
    resp = ec2_client.authorize_security_group_ingress(GroupId=onprem_pcluster_head_sg , IpPermissions=[ {'FromPort': -1, 'IpProtocol': '-1', 'UserIdGroupPairs': [{'GroupId': aws_pcluster_head_sg}] } ] ) 
except ClientError  as err:
    print(err , " this is ok , we can ignore")


### Add awscluster to the federation. 

After the two clusters are created and the security groups are attached, run the following commands on the awscluster head node:
```
su -c "/opt/slurm/bin/sacctmgr -i add cluster awscluster" slurm
su -c "/opt/slurm/bin/sacctmgr -i modify federation burstworkshop Clusters+=awscluster" slurm

# restart slurmctld - this needs to be done after slurmdbd starts, otherwise the cluster won't register
systemctl restart slurmctld

```
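The same commands can be assembled from the `aws_pcluster_name` and `federation_name` values defined earlier, which keeps the names in sync with the notebook. The following is a minimal sketch: `federation_commands` is a hypothetical helper (not part of `pcluster_athena`), and it only builds the command strings. You still have to run them on the head node.

```python
def federation_commands(cluster_name, federation_name,
                        sacctmgr="/opt/slurm/bin/sacctmgr"):
    """Return the shell commands that register a cluster and add it to a federation.

    Hypothetical helper: it mirrors the commands shown above and does not run anything.
    """
    return [
        f'su -c "{sacctmgr} -i add cluster {cluster_name}" slurm',
        f'su -c "{sacctmgr} -i modify federation {federation_name} '
        f'Clusters+={cluster_name}" slurm',
        # slurmctld must be restarted after slurmdbd is up, as noted above
        "systemctl restart slurmctld",
    ]


for cmd in federation_commands("awscluster", "burstworkshop"):
    print(cmd)
```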

## Integrate with Slurm REST API running on the head node

The Slurm REST API is currently running on the head node. The head node stores the JWT token in AWS Secrets Manager. You will need that JWT token in the header of all your REST API requests.

Don't forget to add the secretsmanager:GetSecretValue permission to the SageMaker execution role that runs this notebook.
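In this notebook the token handling is wrapped by `update_header_token()`. To illustrate what such a step involves, here is a minimal sketch that reads a token from Secrets Manager and builds the Slurm REST headers. The function names and the secret ID are hypothetical, and the sketch assumes the secret stores the raw token as a plain string. `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` are the header names Slurm uses for JWT authentication.

```python
def build_slurm_headers(user, token):
    """Build the HTTP headers slurmrestd expects for JWT authentication."""
    return {
        "X-SLURM-USER-NAME": user,
        "X-SLURM-USER-TOKEN": token,
        "Content-Type": "application/json",
    }


def fetch_jwt_token(secret_id, client=None):
    """Read the token from AWS Secrets Manager.

    Assumption: the secret holds the raw token in SecretString.
    Requires the secretsmanager:GetSecretValue permission.
    """
    if client is None:
        import boto3  # imported lazily so the helper can be tested with a stub client
        client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


# Placeholder token, only to show the header layout
print(build_slurm_headers("slurm", "<jwt-token>"))
```

With real credentials, you would call `fetch_jwt_token("<your-secret-id>")` and pass the result to `build_slurm_headers`. The actual secret ID depends on how the post-install script stored it.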

### Inspect the Slurm REST API Schema

We will start by examining the Slurm REST API schema.


In [None]:
import requests
import json

slurm_host = dbd_host
slurm_openapi_ep = 'http://'+slurm_host+':8082/openapi/v3'
slurm_rest_base='http://'+slurm_host+':8082/slurm/v0.0.35'

_, get_headers = onprem_pcluster_helper.update_header_token()

resp_api = requests.get(slurm_openapi_ep, headers=get_headers)
print(resp_api)

if resp_api.status_code != 200:
    # This means something went wrong; the body may not be valid JSON.
    print("Error", resp_api.status_code)
else:
    # Save the schema for later reference and print it
    with open('build/slurm_api.json', 'w') as outfile:
        json.dump(resp_api.json(), outfile)

    print(json.dumps(resp_api.json(), indent=2))


### Use REST API calls to interact with ParallelCluster

Next, we will make direct REST API requests to retrieve the partitions.

If you get server errors, you can log in to the head node and check the system logs of "slurmrestd", which runs as a service.


In [None]:

partition_info = ["name", "nodes", "nodes_online", "total_cpus", "total_nodes"]

# update the header in case the token has expired
_, get_headers = onprem_pcluster_helper.update_header_token()

##### call REST API directly
slurm_partitions_url= slurm_rest_base+'/partitions/'
partitions = onprem_pcluster_helper.get_response_as_json(slurm_partitions_url)
#print(partitions['partitions'])
#20.02.4 returns a dict, not an array
onprem_pcluster_helper.print_table_from_dict(partition_info, partitions['partitions'])


### Submit a job
The slurm_rest_api_client job submit function doesn't include the "script" parameter, so we have to POST to the REST API directly.

The body of the POST should look like this:

```
{"job": {"account": "test", "ntasks": 20, "name": "test18.1", "nodes": [2, 4],
"current_working_directory": "/tmp/", "environment": {"PATH": "/bin:/usr/bin/:/usr/local/bin/","LD_LIBRARY_PATH":
"/lib/:/lib64/:/usr/local/lib"} }, "script": "#!/bin/bash\necho it works"}
```
When a job is submitted through the REST API, it runs as the user "slurm". That is why the working directory "/shared/tmp" should be owned by "slurm:slurm", which is done in the post_install script.

fetch_and_run.sh fetches the sbatch script and the input file from S3 and puts them in /shared/tmp.
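The request body shown above can be built from a small function instead of a hand-written dict, which makes the split between the `job` properties and the top-level `script` field explicit. This is a minimal sketch: `build_job_payload` is a hypothetical helper, and the default `PATH` is taken from the example body above.

```python
import json


def build_job_payload(script, name, working_dir,
                      account=None, partition=None, environment=None):
    """Assemble the JSON body for a Slurm REST job submission.

    Hypothetical helper: "script" sits next to "job", not inside it.
    """
    job = {
        "name": name,
        "current_working_directory": working_dir,
        "environment": environment or {"PATH": "/bin:/usr/bin/:/usr/local/bin/"},
    }
    # Optional fields are only sent when they are set
    if account:
        job["account"] = account
    if partition:
        job["partition"] = partition
    return {"job": job, "script": script}


payload = build_job_payload(
    "#!/bin/bash\necho it works",
    name="test18.1",
    working_dir="/shared/tmp/",
    account="test",
)
print(json.dumps(payload, indent=2))
```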




### Program batch script, input and output files

The installation includes a pre-installed athena++ package with an input file under /shared/tmp. We will use it to submit a job from onpremcluster to awscluster.


In [None]:
#job_script = "#!/bin/bash\ncd /shared/tmp/orszag-tang-lowres\nsbatch -Mawscluster batch_athena.sh"
job_script = "/shared/tmp/batch_test.sh"

slurm_job_submit_base=slurm_rest_base+'/job/submit'

# To submit jobs through Slurm REST, the working directory (here /shared/tmp) must be owned by the user the job runs as (slurm:slurm, set in the post_install script).
data = {'job':{ 'account': 'testaccount', 'partition': 'q1', 'name': 'federation_test', 'current_working_directory':'/shared/tmp/', 'environment': {"PATH": "/bin:/usr/bin/:/usr/local/bin/:/opt/slurm/bin:/opt/amazon/openmpi/bin","LD_LIBRARY_PATH":
"/lib/:/lib64/:/usr/local/lib:/opt/slurm/lib:/opt/slurm/lib64"}}, 'script':job_script}

###
# This job submission generates two jobs: the job_id returned in the response is for the bash job itself;
# the sbatch job runs subsequently as job_id+1.
#
resp_job_submit = onprem_pcluster_helper.post_response_as_json(slurm_job_submit_base, data=json.dumps(data))


print(resp_job_submit)


### List recent jobs

In [None]:
# get the list of all the jobs immediately after the previous step. This should return two running jobs. 
slurm_jobs_base=slurm_rest_base+'/jobs'

jobs = onprem_pcluster_helper.get_response_as_json(slurm_jobs_base)
# print(jobs)
jobs_headers = [ 'job_id', 'job_state', 'account', 'batch_host', 'nodes', 'cluster', 'partition', 'current_working_directory']

# newer version of slurm 
#print_table_from_json_array(jobs_headers, jobs['jobs'])
onprem_pcluster_helper.print_table_from_json_array(jobs_headers, jobs)
                   

A medium-resolution simulation runs for about ten minutes, plus the time for the cluster to spin up. Wait until the job finishes, then move on to the next section.
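Rather than re-running the job listing by hand, the wait can be expressed as a simple polling loop. The sketch below assumes you supply a `fetch_state` callable that returns the current Slurm job state as a string; `wait_for_job` and `TERMINAL_STATES` are hypothetical names, and the states listed are a subset of the standard Slurm job states.

```python
import time

# Job states after which Slurm will not run the job any further (subset)
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def wait_for_job(fetch_state, interval_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll fetch_state() until it returns a terminal state or max_polls is reached.

    Returns the last state seen, which may be non-terminal if polling gave up.
    """
    state = None
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(interval_seconds)
    return state


# Demonstration with canned states and no real sleeping
fake_states = iter(["PENDING", "RUNNING", "COMPLETED"])
print(wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None))
```

In the notebook, `fetch_state` could wrap `onprem_pcluster_helper.get_response_as_json(slurm_jobs_base)` and pick out `job_state` for the job of interest. The response shape differs between Slurm versions, so that parsing is left to the caller.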

# Don't forget to clean up

1. Delete the ParallelCluster clusters
2. Delete the RDS database
3. Delete the S3 bucket
4. Delete the secrets used in this exercise

Deleting a VPC is risky, so it is left for you to clean up manually if you created a new one.

In [None]:
# this is used during development, to reload the module after a change in the module
#try:
#    del sys.modules['pcluster_athena']
#except:
#    #ignore if the module is not loaded
#    print('Module not loaded, ignore')
    
#from pcluster_athena import PClusterHelper
# We added those ingress rules after the clusters were created; if we don't remove them, pcluster delete will fail
try:
    resp = ec2_client.revoke_security_group_ingress(GroupId=aws_pcluster_head_sg , IpPermissions=[ {'FromPort': -1, 'IpProtocol': '-1', 'UserIdGroupPairs': [{'GroupId': onprem_pcluster_head_sg}] } ] ) 
except ClientError  as err:
    print(err , " this is ok , we can ignore")

try:
    resp = ec2_client.revoke_security_group_ingress(GroupId=onprem_pcluster_head_sg , IpPermissions=[ {'FromPort': -1, 'IpProtocol': '-1', 'UserIdGroupPairs': [{'GroupId': aws_pcluster_head_sg}] } ] ) 
except ClientError  as err:
    print(err , " this is ok , we can ignore")
    
    
aws_pcluster_helper = PClusterHelper(aws_pcluster_name, aws_config_name, aws_post_install_script_prefix)
!pcluster delete $aws_pcluster_helper.pcluster_name
aws_pcluster_helper.cleanup_after(KeepRDS=True)

onprem_pcluster_helper = PClusterHelper(onprem_pcluster_name, onprem_config_name, onprem_post_install_script_prefix)
!pcluster delete $onprem_pcluster_helper.pcluster_name
onprem_pcluster_helper.cleanup_after(KeepRDS=True)


