# Introduction to AWS ParallelCluster

## Create a cluster


Before you start, make sure the following prerequisites are in place.
* A VPC that has a public subnet with an internet gateway
* A MySQL RDS database in the same subnet (or you can use this notebook to create one)
* The SageMaker execution role used for this notebook has permission to create a ParallelCluster, an EC2 key pair, secrets in AWS Secrets Manager, and VPCs

Details about the policies are described in this document. 
https://docs.aws.amazon.com/parallelcluster/latest/ug/iam.html#parallelclusteruserpolicy

As an alternative, you can create an IAM user that has the policies mentioned above, and add its aws_access_key_id and aws_secret_access_key to the [aws] section of the ParallelCluster config file used below.

In [None]:
import boto3
import botocore
import json
import time
import os
import sys
import base64
import docker
import pandas as pd
import importlib
import project_path # path to helper methods
from lib import workshop
from botocore.exceptions import ClientError
from IPython.display import display

sys.path.insert(0, '.')
import pcluster_athena

session = boto3.session.Session()
region = session.region_name
ec2_client = boto3.client('ec2')
iam_client = boto3.client('iam')

# Get the AWS account number in which this notebook and the cluster run. It is used in IAM policy resources
my_account_id = boto3.client('sts').get_caller_identity().get('Account')

# specify the following names

# SSH key for accessing the pcluster. This key is not needed in this exercise, but it is useful if you need to ssh into the head node of the pcluster
key_name = 'pcluster-athena-key'
keypair_saved_path = './'+key_name+'.pem'
# unique name of the pcluster
pcluster_name = 'myTestCluster6g'
config_name = "config-6g"
# S3 key of the post-install script; later cells refer to it as post_install_script_prefix
post_install_script_prefix = 'scripts/post_install_script-6g.sh'


# the RDS instance for the Slurmdbd datastore. We will use a MySQL server as the data store. The server's hostname, username and password will be saved in a secret in Secrets Manager
rds_secret_name = 'slurm_dbd_credential'
# the Slurm REST token is generated on the head node and stored in Secrets Manager. This token is used when making REST API calls to the Slurm REST endpoint running on the head node
slurm_secret_name = "slurm_token_{}".format(pcluster_name)
# database name for the slurmdbd data store in the MySQL database
db_name = 'pclusterdb'
# We only need one subnet for the pcluster, but two subnets are needed for the RDS instance. If use_existing_vpc is True, we use the default VPC and put the cluster in its first subnet
use_existing_vpc = True

In [None]:
# During development, every time you update the workshop module, you need to call this:
importlib.reload(workshop)

# We will not need the ssh key in this exercise. However, the key can only be downloaded once, at creation time, so we save it just in case
try:
    workshop.create_keypair(region, session, key_name, keypair_saved_path)
except ClientError as e:
    if e.response['Error']['Code'] == "InvalidKeyPair.Duplicate":
        print("KeyPair with the name {} already exists. Skip".format(key_name))
    else:
        # surface any other error instead of silently ignoring it
        raise

## VPC

You can use the existing default VPC or create a new VPC with 2 subnets. 

We will only be using one of the subnets for the ParallelCluster, but both are used for the RDS database. 
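
RDS needs the two subnets of its DB subnet group to sit in different Availability Zones, while the cell below simply takes the first two subnets returned by `describe_subnets`. The following is a minimal sketch of a stricter selection; `pick_two_subnets` and the sample response are hypothetical and not part of the `workshop` module.

```python
def pick_two_subnets(describe_subnets_response):
    """Return two subnet IDs that live in different Availability Zones."""
    first_subnet_per_az = {}
    for subnet in describe_subnets_response["Subnets"]:
        # keep only the first subnet seen in each Availability Zone
        first_subnet_per_az.setdefault(subnet["AvailabilityZone"], subnet["SubnetId"])
    if len(first_subnet_per_az) < 2:
        raise ValueError("need subnets in at least two Availability Zones")
    first, second = list(first_subnet_per_az.values())[:2]
    return first, second


# Hypothetical response, trimmed to the two fields the helper reads
sample_response = {"Subnets": [
    {"SubnetId": "subnet-aaa", "AvailabilityZone": "us-east-1a"},
    {"SubnetId": "subnet-bbb", "AvailabilityZone": "us-east-1a"},
    {"SubnetId": "subnet-ccc", "AvailabilityZone": "us-east-1b"},
]}
print(pick_two_subnets(sample_response))  # ('subnet-aaa', 'subnet-ccc')
```

In the notebook, the argument would be the `subnets` dictionary returned by `ec2_client.describe_subnets(...)`.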

In [None]:


if use_existing_vpc:
    vpc_filter = [{'Name':'isDefault', 'Values':['true']}]
    default_vpc = ec2_client.describe_vpcs(Filters=vpc_filter)
    vpc_id = default_vpc['Vpcs'][0]['VpcId']

    subnet_filter = [{'Name':'vpc-id', 'Values':[vpc_id]}]
    subnets = ec2_client.describe_subnets(Filters=subnet_filter)
    subnet_id = subnets['Subnets'][0]['SubnetId']
    subnet_id2 = subnets['Subnets'][1]['SubnetId']    
else: 
    vpc, subnet1, subnet2 = workshop.create_and_configure_vpc()
    vpc_id = vpc.id
    subnet_id = subnet1.id
    subnet_id2 = subnet2.id


In [None]:
# Create the project bucket. 
# we will use this bucket for the scripts, input and output files 


bucket_prefix = pcluster_name.lower()+'-'+my_account_id

# use the bucket prefix as name, don't use uuid suffix
my_bucket_name = workshop.create_bucket(region, session, bucket_prefix, False)
print(my_bucket_name)


## Using Spot Instances
We will create two queues in this exercise, one using On-Demand Instances and one using Spot Instances. To use Spot, the account needs the AWSServiceRoleForEC2SpotFleet service-linked role.

In [None]:
try:
    iam_client.get_role(RoleName="AWSServiceRoleForEC2SpotFleet")
except ClientError as e:
    if e.response['Error']['Code'] == 'NoSuchEntity':
        print("AWSServiceRoleForEC2SpotFleet doesn't exist, creating it ... ")
        iam_client.create_service_linked_role(AWSServiceName='spotfleet.amazonaws.com')
        print("AWSServiceRoleForEC2SpotFleet created successfully")
    else:
        # any other error (for example a missing iam:GetRole permission) should not be hidden
        raise
else: 
    print("AWSServiceRoleForEC2SpotFleet exists")

## RDS Database (MySQL) - used with ParallelCluster for accounting

We will create a simple MySQL RDS database instance to use as the data store for Slurmdbd accounting. The username and password are stored as a secret in Secrets Manager. 
The secret is later used to configure Slurmdbd. 

The RDS instance is created asynchronously. While the secret is created immediately, the hostname becomes available only after the creation has completed, so we have to update the hostname in the secret afterwards. 

We will also update the security group to allow traffic to port 3306 from the cluster in the same VPC.
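
The "create the secret first, add the hostname later" step can be pictured as a small JSON update. This is only a sketch under assumptions: `add_host_to_secret` is a hypothetical helper, and the notebook's own `workshop.get_sgs_and_update_secret` may do this differently. The key names follow the secret layout shown later in this notebook.

```python
import json


def add_host_to_secret(secret_string, host, port=3306):
    """Return a new secret string with the database host and port filled in."""
    secret = json.loads(secret_string)
    secret["host"] = host
    secret["port"] = port
    return json.dumps(secret)


# Hypothetical secret as it looks right after creation (no host yet)
initial_secret = json.dumps({"username": "admin", "password": "example-only", "engine": "mysql"})
updated_secret = add_host_to_secret(initial_secret, "mydb.example.internal")
print(json.loads(updated_secret)["host"])  # mydb.example.internal
```

The updated string could then be written back with the Secrets Manager `put_secret_value` call once the RDS waiter has returned.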


In [None]:
# Create the RDS for cluster accounting
importlib.reload(workshop)


# create a simple MySQL RDS instance; the username and password will be stored in Secrets Manager as a secret
workshop.create_simple_mysql_rds(region, session, db_name, [subnet_id,subnet_id2] ,rds_secret_name)

rds_client = session.client('rds', region)
rds_waiter = rds_client.get_waiter('db_instance_available')

print("Waiting for the DB creation to finish ... ")
try:
    rds_waiter.wait(DBInstanceIdentifier=db_name) 
except botocore.exceptions.WaiterError as e:
    # the instance did not become available; the steps below depend on it, so stop here
    print(e)
    raise

print("Finished creating the db.")

# since the RDS creation is asynchronous, we need to wait until it is done to get the hostname, then update the secret with the hostname
vpc_sgs = workshop.get_sgs_and_update_secret(region, session, db_name, rds_secret_name)
print(vpc_sgs)

# get the VPC's local CIDR range
ec2 = boto3.resource('ec2')
vpc = ec2.Vpc(vpc_id)
cidr = vpc.cidr_block

# update the RDS security group to allow inbound traffic to port 3306
workshop.update_security_group(vpc_sgs[0]['VpcSecurityGroupId'], cidr, 3306)


In [None]:
print(vpc_sgs)

### Install pcluster CLI

If you have not installed the aws-parallelcluster command line tool, uncomment the two `pip` lines in the next cell and run it. You only need to do this once in this kernel. 

If the "pcluster" command is installed correctly, `pcluster version` should return "2.10.4".


In [None]:

#!pip install --upgrade pip
#!sudo pip3 install --upgrade aws-parallelcluster
!pcluster version

### ParallelCluster config file
Start with the configuration template file.




In [21]:
config_file_name=config_name+'.ini'

!cat config/$config_file_name

[aws]
aws_region_name = ${REGION}

[vpc public]
vpc_id = ${VPC_ID}
master_subnet_id = ${SUBNET_ID}

[cluster default]
key_name = ${KEY_NAME}
base_os = alinux2
scheduler = slurm
master_instance_type = c5.xlarge
s3_read_write_resource = *
vpc_settings = public
ebs_settings = myebs
queue_settings = q1, q2, q3
post_install = ${POST_INSTALL_SCRIPT_LOCATION}
post_install_args = ${POST_INSTALL_SCRIPT_ARGS}
additional_iam_policies = arn:aws:iam::aws:policy/SecretsManagerReadWrite

[queue q1]
compute_resource_settings = cr1
placement_group = DYNAMIC
enable_efa = true
disable_hyperthreading = true
compute_type = ondemand

[queue q2]
compute_resource_settings = cr2
placement_group = DYNAMIC
enable_efa = false
disable_hyperthreading = false
compute_type = spot

[compute_resource cr1]
instance_type = c5n.18xlarge
min_count = 0
initial_count = 0
max_count = 20

[compute_resource cr2]
instance_type = c5n.2xlarge
min_count = 0
initial_count = 0
max_count = 10

[ebs myebs]
shared_dir = /shared
volume_t

#### Setup parameters for PCluster

We will be using a relational database on AWS (RDS) for Slurm accounting (slurmdbd). Please refer to this blog for how to set it up https://aws.amazon.com/blogs/compute/enabling-job-accounting-for-hpc-with-aws-parallelcluster-and-amazon-rds/

Once you have set up the MySQL RDS, create a secret in Secrets Manager with the type "Credentials for RDS", so we don't need to expose the database username and password in plain text in this notebook. 

In [None]:

# this is used during development, to reload the module after a change in the module
try:
    del sys.modules['pcluster_athena']
except KeyError:
    # ignore if the module is not loaded
    print('Module not loaded, ignore')
    
from pcluster_athena import PClusterHelper
# Create the helper. You can rerun the rest of the notebook with no harm; there are checks in place for existing resources.
pcluster_helper = PClusterHelper(pcluster_name, config_name, post_install_script_prefix)

# the response is a json {"username": "xxxx", "password": "xxxx", "engine": "mysql", "host": "xxxx", "port": "xxxx", "dbInstanceIdentifier": "xxxx"}
rds_secret = json.loads(pcluster_helper.get_slurm_dbd_rds_secret())

post_install_script_location = "s3://{}/{}".format(pcluster_helper.my_bucket_name, post_install_script_prefix)
post_install_script_args = "'" + rds_secret['host']+' '+str(rds_secret['port']) +' ' + rds_secret['username'] + ' ' + rds_secret['password'] + ' ' + pcluster_name + ' ' + region +"'" 


### Post installation script
This script is used to recompile and configure Slurm with slurmrestd. We also automated the compilation of Athena++ in the script. 

Let's take a look at the script:

In [None]:
!cat scripts/pcluster_post_install.sh

#upload the script to S3
session = boto3.Session()
s3_client = session.client('s3')

try:
    resp = s3_client.upload_file('scripts/pcluster_post_install.sh', my_bucket_name, post_install_script_prefix)
except ClientError as e:
    print(e)


Replace the placeholders in the config template with the actual values.
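
Placeholder substitution of this kind is plain string replacement. The sketch below shows the idea with two hypothetical helpers, `render_template` and `unresolved_placeholders`; it is not the implementation of `pcluster_helper.template_to_file`, whose internals are not shown in this notebook.

```python
import re


def render_template(template_text, placeholders):
    """Replace every '${NAME}' key found in `placeholders` with its value."""
    rendered = template_text
    for key, value in placeholders.items():
        rendered = rendered.replace(key, str(value))
    return rendered


def unresolved_placeholders(text):
    """List any '${...}' markers that are still present after rendering."""
    return re.findall(r"\$\{[^}]+\}", text)


# Hypothetical values, for illustration only
template = "[aws]\naws_region_name = ${REGION}\n\n[vpc public]\nvpc_id = ${VPC_ID}\n"
rendered = render_template(template, {"${REGION}": "us-east-1"})
print(unresolved_placeholders(rendered))  # ['${VPC_ID}']
```

Checking for leftover markers before calling `pcluster create` catches a missing dictionary entry early, instead of letting it surface as a cluster creation failure.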

In [None]:
ph = {'${REGION}': region, 
      '${VPC_ID}': vpc_id, 
      '${SUBNET_ID}': subnet_id, 
      '${KEY_NAME}': key_name, 
      '${POST_INSTALL_SCRIPT_LOCATION}': post_install_script_location, 
      '${POST_INSTALL_SCRIPT_ARGS}': post_install_script_args
     }


!mkdir -p build
pcluster_helper.template_to_file("config/"+config_name+".ini", "build/"+config_name, ph)

In [None]:
#!cat build/config

#### Create a pcluster with the config file

The -nr flag tells CloudFormation not to roll back when there is an error. This is only needed for development. 

After the cluster is created, we will use boto3 to set up the following permissions:
1. Add IAM permission on the head-node instance role to allow access to Secrets Manager for storing the Slurm token 
2. Add an inbound rule to allow "All traffic" from the SageMaker notebook instance (for Slurm REST API access)


In [None]:
!pcluster create $pcluster_name -nr -c build/$config_name



In [None]:
!pcluster list

## Update IAM policy and security group 

Use boto3 to 
1. Update a policy in the ParallelCluster head-node instance role, to allow the head node to access Secrets Manager.
2. Add an inbound rule to allow access to the REST API from this notebook
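
For the inbound rule, a minimal sketch of the permission structure that the EC2 `authorize_security_group_ingress` call accepts is shown below. `build_ingress_rule` is a hypothetical helper; what `workshop.update_security_group` does internally is not shown here, so treat this as an illustration rather than its implementation.

```python
import ipaddress


def build_ingress_rule(cidr, port, description="Slurm REST API"):
    """Build one IpPermissions entry allowing TCP traffic on `port` from `cidr`."""
    ipaddress.ip_network(cidr)  # raises ValueError if the CIDR is malformed
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
    }


rule = build_ingress_rule("172.31.0.0/16", 8082)
print(rule["IpRanges"][0]["CidrIp"])  # 172.31.0.0/16

# With real resources the rule could be applied like this (not executed here):
# ec2_client.authorize_security_group_ingress(GroupId=head_sg_name, IpPermissions=[rule])
```

Restricting the rule to a single port and the VPC CIDR is narrower than an "All traffic" rule, which is usually preferable outside of a workshop setting.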


In [None]:
# update the lib during development
importlib.reload(workshop)
# Use the stack name to find the resources created with the parallelcluster. Use some of the information to update
# the IAM policy and security group
cluster_stack_name = 'parallelcluster-'+pcluster_name


# Get the head node's instance role and the head node security group
cf_client = boto3.client('cloudformation')
root_role_info = cf_client.describe_stack_resource(StackName=cluster_stack_name, LogicalResourceId='RootRole' )
sg_info = cf_client.describe_stack_resource(StackName=cluster_stack_name, LogicalResourceId='MasterSecurityGroup' )

#Root role  and security group physical resource id
root_role_name = root_role_info['StackResourceDetail']['PhysicalResourceId']
head_sg_name = sg_info['StackResourceDetail']['PhysicalResourceId']

# To put the head/compute nodes under managed instances in SSM, attach AmazonSSMManagedInstanceCore policy to the root_role
# Note - if you enable this, patch schedule might interrupt your computation if the scheduled patch happens during the computation
iam_client.attach_role_policy(RoleName=root_role_name, PolicyArn='arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore')

# Get the VPC's local CIDR range
ec2 = boto3.resource('ec2')
vpc = ec2.Vpc(vpc_id)
cidr = vpc.cidr_block

workshop.update_security_group(head_sg_name, cidr, 8082)


In [None]:
#### Get the pcluster head-node PrivateIP 
# ! cmd returns a IPython.utils.text.SList, which has grep, fields methods and s,n,p properties
####
pcluster_status = !pcluster status $pcluster_name

# get the second part of 'MasterPrivateIP: 172.16.2.92'
slurm_host = pcluster_status.grep('MasterPrivateIP').s.split()[1]

print(slurm_host)
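
The `grep(...).s.split()[1]` chain above relies on IPython's `SList`. Outside IPython, the same `key: value` output can be parsed with plain Python, as in this sketch. `parse_status` is a hypothetical helper, and apart from the `MasterPrivateIP` line quoted above, the sample lines are assumptions about what the command prints.

```python
def parse_status(lines):
    """Turn 'Key: value' lines into a dictionary, ignoring lines without a colon."""
    status = {}
    for line in lines:
        key, separator, value = line.partition(":")
        if separator:
            status[key.strip()] = value.strip()
    return status


# Sample output; only the MasterPrivateIP line is taken from this notebook
sample_output = [
    "Status: CREATE_COMPLETE",
    "MasterServer: RUNNING",
    "MasterPrivateIP: 172.16.2.92",
]
print(parse_status(sample_output)["MasterPrivateIP"])  # 172.16.2.92
```

A dictionary lookup fails loudly with a `KeyError` when the cluster is not ready yet, which is easier to diagnose than an `IndexError` from `split()[1]`.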

### Integrate with Slurm REST API running on the head node


![Slurmrestd_diagram](parallelcluster_restd_diagram.png "Slurm REST API on AWS ParallelCluster")

slurmrestd runs on the head node and uses JWT as the authentication mechanism. 
The post_install script enables slurmrestd to run as a daemon on the head node. 

We will make direct REST API calls with the JWT token (retrieved from Secrets Manager) in the header. 



## Integrate with Slurm REST API running on the head node

slurmrestd is currently running on the head node, using JWT as the auth mechanism. On the server side, a JWT token is created every 20 minutes by running ```scontrol token username=slurm``` on the head node. The same token is needed in the HTTP header of every GET/POST request from the notebook. We use AWS Secrets Manager to store the encrypted token so it can be accessed from the notebook. 

### JWT token

To pass the token securely to this notebook, we first create a cron job on the head node that retrieves the token and saves it in Secrets Manager under the name "slurm_token_{cluster_name}". The default JWT token lifespan is 1800 seconds (30 minutes). The following script runs on the head node as a cron job and updates the token every 20 minutes.
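
To see why a 20-minute refresh is safe with a 30-minute lifespan, it helps to look at the token's expiry. The sketch below decodes a JWT payload with the standard library only; it assumes the token carries the standard `exp` claim, does not verify the signature, and uses a hand-made demo token rather than a real Slurm token.

```python
import base64
import json
import time


def jwt_claims(token):
    """Decode the payload (second segment) of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def seconds_until_expiry(token, now=None):
    """Seconds left before the token's `exp` claim is reached."""
    now = time.time() if now is None else now
    return jwt_claims(token)["exp"] - now


def _encode_segment(obj):
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hypothetical token issued at t=0 with the default 1800-second lifespan
demo_token = ".".join([
    _encode_segment({"alg": "HS256", "typ": "JWT"}),
    _encode_segment({"iat": 0, "exp": 1800}),
    "signature-not-checked",
])

# Just before the 20-minute (1200 s) refresh, the old token still has 600 s left
print(seconds_until_expiry(demo_token, now=1200))  # 600
```

The 10-minute overlap means a request made right before a refresh still carries a valid token.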

The following steps are included in the post_install_script. You DO NOT need to run it. 
#### Step 1.  Add permission to the instance role for the head-node
We use additional_iam_policies in the pcluster configuration file to attach the SecretsManagerReadWrite policy to the instance role on the cluster. 

#### Step 2. Create a script "token_refresher.sh" 
Assume we save the following script at /shared/token_refresher.sh 

``` token_refresher.sh
#!/bin/bash

REGION=us-east-1
export $(/opt/slurm/bin/scontrol token -u slurm)

aws secretsmanager describe-secret --secret-id slurm_token --region $REGION

if [ $? -eq 0 ]
then
 aws secretsmanager update-secret --secret-id slurm_token --secret-string "$SLURM_JWT" --region $REGION
else
 aws secretsmanager create-secret --name slurm_token --secret-string "$SLURM_JWT" --region $REGION
fi
```

#### Step 3. Add a file "slurm-token" in /etc/cron.d/

```/etc/cron.d/slurm-token
# Run the slurm token update every 20 minutes 
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
*/20 * * * * root /shared/token_refresher.sh                                       
```

#### Step 4. Add permission to access SecretManager for this notebook

Don't forget to add the secretsmanager:GetSecretValue permission to the SageMaker execution role that runs this notebook.
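
On the notebook side, reading the token and turning it into request headers could look like the following sketch. slurmrestd's JWT authentication reads the `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` headers; `build_slurm_headers` and `FakeSecretsClient` are hypothetical names used for illustration, and the notebook's own `pcluster_helper.update_header_token` may be implemented differently.

```python
def build_slurm_headers(secrets_client, secret_name, user_name="slurm"):
    """Fetch the JWT from Secrets Manager and build the slurmrestd request headers."""
    token = secrets_client.get_secret_value(SecretId=secret_name)["SecretString"]
    return {
        "X-SLURM-USER-NAME": user_name,
        "X-SLURM-USER-TOKEN": token,
        "Content-Type": "application/json",
    }


class FakeSecretsClient:
    """Stand-in for boto3.client('secretsmanager') so the sketch runs offline."""

    def get_secret_value(self, SecretId):
        return {"SecretString": "demo-token-for-" + SecretId}


headers = build_slurm_headers(FakeSecretsClient(), "slurm_token_myTestCluster6g")
print(headers["X-SLURM-USER-NAME"])  # slurm
```

With real credentials, `boto3.client('secretsmanager')` would be passed in place of the fake client.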

### Inspect the Slurm REST API Schema

Note: The post-install script does several things, which take about 20 minutes in total. If you get a ResourceNotFoundException, please wait until the init process completes and run the following block again. 
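
Instead of rerunning the cell by hand, the wait can be automated with a small retry loop. This is a generic sketch: `retry_until_ready` is a hypothetical helper, and `LookupError` stands in for the botocore `ClientError` with code `ResourceNotFoundException` that the real call would raise.

```python
import time


def retry_until_ready(fetch, retriable=(LookupError,), attempts=5,
                      delay_seconds=60, sleep=time.sleep):
    """Call `fetch` until it succeeds, sleeping between retriable failures."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retriable:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay_seconds)


# Demo: a fetch that fails twice before the token "appears"
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise LookupError("token not available yet")
    return "token"


waits = []  # record the requested delays instead of really sleeping
result = retry_until_ready(flaky_fetch, sleep=waits.append)
print(result, waits)  # token [60, 60]
```

Passing `sleep` as a parameter keeps the helper testable without waiting for real time to pass.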

In [None]:

import requests


    

slurm_openapi_ep = 'http://'+slurm_host+':8082/openapi/v3'
slurm_rest_base='http://'+slurm_host+':8082/slurm/v0.0.35'

try: 
    _, get_headers = pcluster_helper.update_header_token()
except ClientError as e:
    if e.response['Error']['Code'] == 'ResourceNotFoundException':
        print("Token has not been added to Secrets Manager, please wait and try again")
    # re-raise in every case: the request below cannot work without the headers
    raise
#else:
#    print(get_headers)

try:
    resp_api = requests.get(slurm_openapi_ep, headers=get_headers)
except requests.exceptions.ConnectionError:
    # slurmrestd is not reachable yet, for example because the post-install script is still running
    print("Connection refused, please wait and try again")
    raise

#if resp_api.status_code != 200:
#    # This means something went wrong.
#    print("Error" , resp_api.status_code)
#    time.sleep(5)

#with open('build/slurm_api.json', 'w') as outfile:
#    json.dump(resp_api.json(), outfile)

print(json.dumps(resp_api.json(), indent=2))


### Use REST API calls to interact with ParallelCluster

Next, we make direct REST API requests to retrieve the partitions.

If you get server errors, check the following:
1. The cron job token_refresher.sh (every 20 minutes) may not have run yet since the IAM policy was updated. You can check for the slurm_token_yourClusterName secret in the AWS Secrets Manager console. 
2. Log in to the head node and check the system logs of "slurmrestd", which is running as a service. 


In [None]:
# this is used during development, to reload the module after a change in the module
try:
    del sys.modules['pcluster_athena']
except KeyError:
    # ignore if the module is not loaded
    print('Module not loaded, ignore')
    
from pcluster_athena import PClusterHelper
# Recreate the helper. You can rerun the rest of the notebook with no harm; there are checks in place for existing resources.
pcluster_helper = PClusterHelper(pcluster_name)

partition_info = ["name", "nodes", "nodes_online", "total_cpus", "total_nodes"]

##### call REST API directly
slurm_partitions_url= slurm_rest_base+'/partitions/'
partitions = pcluster_helper.get_response_as_json(slurm_partitions_url)

#print(partitions['partitions'])
#20.02.4 returns a dict, not an array
pcluster_helper.print_table_from_dict(partition_info, partitions['partitions'])

# newer slurmrest return proper array
# print_table_from_json_array(partition_info, [partitions['partitions']['q1'], partitions['partitions']['q2']] )
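
Because older slurmrestd versions return the partitions as a dictionary keyed by name and newer ones return an array, it can be convenient to normalize both shapes first. The helper below is a hypothetical sketch with made-up sample values, not part of `pcluster_helper`.

```python
def partitions_as_list(partitions):
    """Return partitions as a list of dicts, whatever shape the API returned."""
    if isinstance(partitions, dict):
        rows = []
        for name, info in partitions.items():
            row = dict(info)
            row.setdefault("name", name)  # keep the partition name with its data
            rows.append(row)
        return rows
    return list(partitions)


# Hypothetical responses in both shapes
dict_shape = {"q1": {"total_cpus": 0, "total_nodes": 20}, "q2": {"total_cpus": 0, "total_nodes": 10}}
list_shape = [{"name": "q1", "total_nodes": 20}, {"name": "q2", "total_nodes": 10}]

print([p["name"] for p in partitions_as_list(dict_shape)])  # ['q1', 'q2']
print([p["name"] for p in partitions_as_list(list_shape)])  # ['q1', 'q2']
```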


### Submit a job
The slurm_rest_api_client job submit function doesn't include the "script" parameter. We will have to use the REST API Post directly. 

The body of the post should be like this.  

```
{"job": {"account": "test", "ntasks": 20, "name": "test18.1", "nodes": [2, 4],
"current_working_directory": "/tmp/", "environment": {"PATH": "/bin:/usr/bin/:/usr/local/bin/","LD_LIBRARY_PATH":
"/lib/:/lib64/:/usr/local/lib"} }, "script": "#!/bin/bash\necho it works"}
```
When the job is submitted through the REST API, it runs as the user "slurm". That is why the work directory "/shared/tmp" must be owned by "slurm:slurm", which is done in the post_install script. 

fetch_and_run.sh fetches the sbatch script and the input file from S3 and puts them in /shared/tmp.
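
The request body described above can be assembled with a small function. This is a sketch only: `build_job_request` is a hypothetical helper, the field names are the ones used in this notebook, and the default environment is copied from the example body.

```python
import json

DEFAULT_ENVIRONMENT = {
    "PATH": "/bin:/usr/bin/:/usr/local/bin/",
    "LD_LIBRARY_PATH": "/lib/:/lib64/:/usr/local/lib",
}


def build_job_request(name, account, partition, script,
                      working_dir="/shared/tmp/", environment=None):
    """Assemble the JSON body for a job submission POST."""
    if not script.startswith("#!"):
        raise ValueError("script must start with a shebang line such as #!/bin/bash")
    job = {
        "name": name,
        "account": account,
        "partition": partition,
        "current_working_directory": working_dir,
        "environment": dict(environment or DEFAULT_ENVIRONMENT),
    }
    return {"job": job, "script": script}


request_body = build_job_request("test18.1", "test", "q1", "#!/bin/bash\necho it works")
print(json.dumps(request_body, indent=2))
```

Validating the shebang locally gives a clearer error than a rejected submission from the REST endpoint.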




### Program batch script, input and output files

To share the pcluster among different users and make sure users can only access their own input and output files, we will use each user's own S3 bucket for input and output files.

The job runs on the ParallelCluster in a shared directory (/shared/tmp in this notebook) through a fetch-and-run script, which fetches the batch script and input file from the S3 bucket. The output is stored in the same bucket under the "output" path. 

If the simulation results are stored in vtk files, the individual mesh block vtk files can be merged into single-block vtk files. The merging step is programmed into the batch script and runs after the simulation. 

In this notebook, we will use the hdf5 format for the output data.


In [None]:


# Where the batch script, input file, output files are uploaded to S3
job_name = "orszag-tang-mediumres-q1"
my_prefix = "athena/"+job_name
# fake account_name 
account_name = "test-account-1" 
partition = "q1"
use_efa="YES"
output_format="hdf" # or "vtk"

# template files for input and batch script
input_file_ini = "config/athinput_orszag_tang.ini"
batch_file_ini = "config/batch_athena_sh.ini"

# actual input and batch script files
input_file = "athinput_orszag_tang.input"
batch_file = "batch_athena.sh"
    
###
# Mesh/Meshblock parameters
# nx1, nx2, nx3 - number of zones in x, y, z
# mbnx1, mbnx2, mbnx3 - meshblock size 
# nx1/mbnx1 X nx2/mbnx2 X nx3/mbnx3 = number of meshblocks - this should be the number of cores you are running the simulation on 
# e.g. mesh 100 X 100 X 100 with meshblock size 50 X 50 X 50 will yield 2X2X2 = 8 blocks, run this on a cluster with 8 cores 
# test configurations: 
# highres : 512x512x512 on 64x64x64 meshblock needs 512 cores = 16 nodes on q1
# mediumres: 256x256x256 on 64x64x64 meshblock needs 64 cores = 2 nodes on q1 or 16 nodes on q2
# mediumres: 256x256x128 on 64x64x64 meshblock needs 32 cores = 1 node on q1 or 8 nodes on q2
# lowres: 128x128x128 on 64x64x64 meshblock needs 8 cores = 1 node on q1 or 2 nodes on q2 

#Mesh - actual domain of the problem 
# 512X512X512 cells with 64x64x64 meshblock - will have 8X8X8 = 512 meshblocks - if running on 32 cores/node
# 512/32=16 nodes
nx1=256
nx2=256
nx3=256

#Meshblock - each meshblock size - not too big 
mbnx1=64
mbnx2=64
mbnx3=64

# Make sure the mesh is divisible by the meshblock size
# e.g. num_blocks = (512/64)*(512/64)*(512/64) = 8 x 8 x 8 = 512
# integer division keeps num_blocks an int
num_blocks = (nx1//mbnx1)*(nx2//mbnx2)*(nx3//mbnx3)

###
# Batch file parameters
# num_nodes should be less than or equal to the max number of nodes in your cluster
# num_tasks_per_node should be less than or equal to the number of cores per node
# e.g. 512 meshblocks / 32 core/node * 1 core/meshblock = 16 nodes -  c5n.18xlarge
#num_nodes = 16

# e.g. 64 meshblocks / 4 core/node * 1 core/meshblock = 16 nodes - c5n.2xlarge (q2)
# the value below matches 64 meshblocks / 32 core/node = 2 nodes - c5n.18xlarge (q1)
num_nodes = 2
num_of_threads = 1

num_tasks_per_node = num_blocks/num_nodes/num_of_threads
cpus_per_task = num_of_threads



#This is where the program is installed on the cluster
exe_path = "/shared/athena-public-version/bin/athena"
#This is where the program is going to run on the cluster
work_dir = '/shared/tmp/'+job_name
ph = { '${nx1}': str(nx1), 
       '${nx2}': str(nx2),
       '${nx3}': str(nx3),
       '${mbnx1}': str(mbnx1),
       '${mbnx2}': str(mbnx2),
       '${mbnx3}': str(mbnx3), 
       '${num_of_threads}' : str(num_of_threads)}
pcluster_helper.template_to_file(input_file_ini, 'build/'+input_file, ph)

ph = {'${nodes}': str(num_nodes),
      '${ntasks-per-node}': str(int(num_tasks_per_node)),
      '${cpus-per-task}': str(cpus_per_task),
      '${account}': account_name,
      '${partition}': partition,
      '${job-name}': job_name,
      '${EXE_PATH}': exe_path,
      '${WORK_DIR}': work_dir,
      '${input-file}': input_file,
      '${BUCKET_NAME}': my_bucket_name,
      '${PREFIX}': my_prefix,
      '${USE_EFA}': use_efa,
      '${OUTPUT_FOLDER}': "output/",
      '${OUTPUT_FORMAT}': output_format,
      '${NUM_OF_THREADS}' : str(num_of_threads)}
pcluster_helper.template_to_file(batch_file_ini, 'build/'+batch_file, ph)

# upload the generated input file and batch script to S3
def upload_athena_files(input_file, batch_file):
    session = boto3.Session()
    s3_client = session.client('s3')

    try:
        resp = s3_client.upload_file('build/'+input_file, my_bucket_name, my_prefix+'/'+input_file)
        resp = s3_client.upload_file('build/'+batch_file, my_bucket_name, my_prefix+'/'+batch_file)
    except ClientError as e:
        print(e)

# upload to S3 for use later
upload_athena_files(input_file, batch_file)

job_script = "#!/bin/bash\n/shared/tmp/fetch_and_run.sh {} {} {} {} {}".format(my_bucket_name, my_prefix, input_file, batch_file, job_name)
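
The mesh arithmetic in the comments of the cell above (one core per meshblock, nodes derived from cores per node) can be checked with a short sketch. `count_meshblocks` and `nodes_needed` are hypothetical helpers; the 32 and 4 cores per node are the figures used in those comments for q1 and q2.

```python
def count_meshblocks(mesh, meshblock):
    """Number of meshblocks for a mesh (nx1, nx2, nx3) and block size (mbnx1, mbnx2, mbnx3)."""
    blocks = 1
    for cells, block_cells in zip(mesh, meshblock):
        if cells % block_cells != 0:
            raise ValueError("mesh size {} is not divisible by meshblock size {}".format(cells, block_cells))
        blocks *= cells // block_cells
    return blocks


def nodes_needed(num_blocks, cores_per_node):
    """Nodes required when every meshblock runs on its own core (rounded up)."""
    return -(-num_blocks // cores_per_node)


medium = count_meshblocks((256, 256, 256), (64, 64, 64))
print(medium)                    # 64
print(nodes_needed(medium, 32))  # 2  (q1 figure from the comments)
print(nodes_needed(medium, 4))   # 16 (q2 figure from the comments)
```

Raising on a non-divisible mesh catches a configuration mistake before any job is submitted.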


In [None]:

slurm_job_submit_base=slurm_rest_base+'/job/submit'

job_script = "#!/bin/bash\n/shared/tmp/fetch_and_run.sh {} {} {} {} {}".format(my_bucket_name,my_prefix, input_file, batch_file, job_name)

# To submit jobs through the Slurm REST API, the working directory (/shared/tmp here) must be owned by the user the job runs as
data = {'job':{ 'account': account_name, 'partition':partition , 'name': job_name, 'current_working_directory':'/shared/tmp/', 'environment': {"PATH": "/bin:/usr/bin/:/usr/local/bin/:/opt/slurm/bin:/opt/amazon/openmpi/bin","LD_LIBRARY_PATH":
"/lib/:/lib64/:/usr/local/lib:/opt/slurm/lib:/opt/slurm/lib64"}}, 'script':job_script}

###
# This job submission will generate two jobs , the job_id returned in the response is for the bash job itself. the sbatch will be the job_id+1 run subsequently.
#
resp_job_submit = pcluster_helper.post_response_as_json(slurm_job_submit_base, data=json.dumps(data))


print(resp_job_submit)


### List recent jobs

In [None]:
# get the list of all the jobs immediately after the previous step. This should return two running jobs. 
slurm_jobs_base=slurm_rest_base+'/jobs'

jobs = pcluster_helper.get_response_as_json(slurm_jobs_base)
# print(jobs)
jobs_headers = [ 'job_id', 'job_state', 'account', 'batch_host', 'nodes', 'cluster', 'partition', 'current_working_directory']

# newer version of slurm 
#print_table_from_json_array(jobs_headers, jobs['jobs'])
pcluster_helper.print_table_from_json_array(jobs_headers, jobs)
                   

# Visualize Athena++ Simulation Results
In this notebook, we are going to use the Python library that comes with Athena++ to read and visualize the simulation results.

In the previous notebook, we saved the simulation results in the s3://<bucketname>/athena/$job_name/output folder.

Import the hdf Python code that comes with Athena++.

In [None]:
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import clear_output
import h5py

# Do this once: clone the Athena++ source code. The hdf5 Python package we need is under the vis/python folder

if not os.path.isdir('athena-public-version'):
    !git clone https://github.com/PrincetonUniversity/athena-public-version
else:
    print("Athena++ code already cloned, skip")
    
sys.path.insert(0, 'athena-public-version/vis/python')
import athena_read

In [None]:
data_folder=job_name+'/output'
output_folder = my_bucket_name+'/athena/'+data_folder

if not os.path.isdir(job_name):
    !mkdir -p $job_name
else:
    !rm -rf $job_name/*
    print('project folder exists, remove all old files')
    
!aws s3 cp s3://$output_folder/ ./$data_folder/ --recursive


### Display the hst data
History data shows how the overall parameters change over time. The time interval can be different from that of the hdf5 files.

In OrszagTang simulations, the variables in the hst files are 'time', 'dt', 'mass', '1-mom', '2-mom', '3-mom', '1-KE', '2-KE', '3-KE', 'tot-E', '1-ME', '2-ME', '3-ME'

All the variables a

In [None]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

hst = athena_read.hst(data_folder+'/OrszagTang.hst')

# This count cannot be used reliably for the hdf files, because hst and hdf output can have different numbers of time steps. In this case, they have the same number of steps
num_timesteps = len(hst['time'])

print(hst.keys())

plt.plot(hst['time'], hst['dt'])


## Reading HDF5 data files 

The hdf5 data files contain all variables inside all meshblocks. There is some merging and calculating work to be done before we can visualize the result. Fortunately, the Athena++ vis/hdf package takes care of the hard part. 


In [None]:
# Let's examine the content of the hdf files

f = h5py.File(data_folder+'/OrszagTang.out2.00001.athdf', 'r')
# variable lists <KeysViewHDF5 ['B', 'Levels', 'LogicalLocations', 'prim', 'x1f', 'x1v', 'x2f', 'x2v', 'x3f', 'x3v']>
print(f.keys())

#<HDF5 dataset "B": shape (3, 512, 64, 64, 64), type "<f4"> 
print(f['prim'])

### Simulation result data 

Raw athdf data has the following keys
<KeysViewHDF5 ['B', 'Levels', 'LogicalLocations', 'prim', 'x1f', 'x1v', 'x2f', 'x2v', 'x3f', 'x3v']>

After athena_read.athdf() call, the result contains keys, which can be used as the field name
['Coordinates', 'DatasetNames', 'MaxLevel', 'MeshBlockSize', 'NumCycles', 'NumMeshBlocks', 'NumVariables', 'RootGridSize', 'RootGridX1', 'RootGridX2', 'RootGridX3', 'Time', 'VariableNames', 'x1f', 'x1v', 'x2f', 'x2v', 'x3f', 'x3v', 'rho', 'press', 'vel1', 'vel2', 'vel3', 'Bcc1', 'Bcc2', 'Bcc3']


In [None]:
def process_athdf(filename, num_step):
    print("Processing ", filename)
    athdf = athena_read.athdf(filename)
    return athdf

# extract a list of fields and take a slice in one dimension; dimension can be 'x', 'y' or 'z'
def read_all_timestep (data_file_name_template, num_steps, field_names, slice_number, dimension):

    if dimension not in ['x', 'y', 'z']:
        print("dimension can only be 'x/y/z'")
        return
    
    # We would ideally process all time steps together and store them in memory. However, they are too big, so we trade time for memory 
    result = {}
    for f in field_names:
        result[f] = list()
        
    for i in range(num_steps):
        fn = data_file_name_template.format(str(i).zfill(5))
        athdf = process_athdf(fn, i)
        for f in field_names:
            if dimension == 'x':
                result[f].append(athdf[f][slice_number,:,:])
            elif dimension == 'y':
                result[f].append(athdf[f][:, slice_number,:])
            else:
                result[f].append(athdf[f][:,:, slice_number])
                        
    return result

def animate_slice(data):
    plt.figure()
    for i in range(len(data)):
        plt.imshow(data[i])
        plt.title('Frame %d' % i)
        plt.show()
        plt.pause(0.2)
        clear_output(wait=True)




In [None]:

data_file_name_template = data_folder+'/OrszagTang.out2.{}.athdf'

# this is time-consuming, so try to run it only once
data = read_all_timestep(data_file_name_template, num_timesteps, ['press', 'rho'], 1, 'x')



In [None]:
# Cycle through the time steps and look at pressure
animate_slice(data['press'])

In [None]:
# Now look at density
animate_slice(data['rho'])

# Don't forget to clean up

1. Delete the ParallelCluster
2. Delete the RDS database
3. Delete the S3 bucket
4. Delete the secrets used in this exercise

Deleting a VPC is risky, so it is left out here. If you created a new VPC, clean it up manually. 

In [None]:
# need to do this first. 
iam_client.detach_role_policy(RoleName=root_role_name, PolicyArn='arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore')

!pcluster delete $pcluster_name

In [None]:
# delete the rds database - keep it if you want to have more records to look at
#workshop.detele_rds_instance(region, session, db_name)
#workshop.delete_secrets_with_force(region, session, [rds_secret_name])

#Delete the secrets
workshop.delete_secrets_with_force(region, session, [slurm_secret_name])

workshop.delete_bucket_completely(my_bucket_name)