This repository was archived by the owner on Mar 13, 2023. It is now read-only.
Merged
50 changes: 50 additions & 0 deletions docs/content/02-tutorials/07-slurm-rest-api.md
@@ -0,0 +1,50 @@
+++
title = "g. Slurm REST API 🌀"
weight = 27
+++

Enable the Slurm REST API. Requires Slurm Accounting.

## Step 1 - Set up Slurm Accounting

Slurm Accounting is required to enable the Slurm REST API. Follow the [instructions](https://pcluster.cloud/02-tutorials/02-slurm-accounting.html) to enable Slurm Accounting but **do not begin cluster creation** after completing Step 4.

## Step 2 - Create a Security Group to allow inbound HTTPS traffic

By default, your cluster will not be able to accept incoming HTTPS requests to the REST API. You will need to [create a security group](https://console.aws.amazon.com/ec2/v2/home?#CreateSecurityGroup:) to change this.

1. Under `Security group name`, enter "Slurm REST API" (or another name of your choosing)
2. Ensure `VPC` matches the cluster's VPC
3. Delete any outbound rules that may have been automatically generated
4. Add an inbound rule, select `HTTPS` under `Type`, and choose `My IP` under `Source`
5. Click `Create security group`

![Create Security Group](slurm-rest-api/create-security-group.png)
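The console steps above can also be sketched with the AWS CLI. This is a hedged sketch, not part of the tutorial's tooling: the VPC id, CIDR, and `<sg-id-from-create>` placeholder are assumptions you must replace, and the block only prints the commands for review rather than running them.

```shell
#!/bin/bash
set -eu

VPC_ID="vpc-0123456789abcdef0"   # placeholder: your cluster's VPC
MY_IP="203.0.113.10/32"          # placeholder: your IP in CIDR form

# Print (rather than execute) the two CLI calls so they can be reviewed first.
cat <<EOF | tee sg-commands.txt
aws ec2 create-security-group --group-name "Slurm REST API" \\
  --description "Inbound HTTPS for the Slurm REST API" --vpc-id ${VPC_ID}
aws ec2 authorize-security-group-ingress --group-id <sg-id-from-create> \\
  --protocol tcp --port 443 --cidr ${MY_IP}
EOF
```

Restricting the ingress CIDR to your own IP mirrors the `My IP` choice in the console and keeps the API off the open internet.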

## Step 3 - Configure your cluster

In your cluster configuration, return to the Head Node section and add your security group.

![HeadNode Setup](slurm-rest-api/add-security-group.png)
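In a cluster configuration file, this maps to the head node's `Networking` section. A sketch under assumptions — the ids below are placeholders, and `AdditionalSecurityGroups` is the ParallelCluster 3 key for attaching extra security groups alongside the defaults:

```yaml
HeadNode:
  Networking:
    SubnetId: subnet-0123456789abcdef0      # placeholder
    AdditionalSecurityGroups:
      - sg-0123456789abcdef0                # the "Slurm REST API" group from Step 2
```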

Under `Advanced options`, you should have already added a script for Slurm Accounting.
In the same multi-runner, click `Add Script` and select `Slurm REST API`.

![HeadNode Setup](slurm-rest-api/add-script.png)

Complete the rest of the configuration as described in the Slurm Accounting tutorial, then create your cluster.

## Step 4 - Submit a job

Once the cluster has been successfully created, go to the `Job Scheduling` tab and select `Submit Job`.

Choose a name for your job and the number of nodes to run on, then select `Run a script (manual entry)` and enter `srun sleep 30` on line 2, below `#!/bin/bash`.

![Submit Job](slurm-rest-api/submit-job.png)

Click `Submit`. If the job was successful, it will be listed as `COMPLETED` after about 30 seconds.
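Jobs can also be submitted over HTTPS without the UI, which is the point of enabling the REST API. The sketch below is hedged: the endpoint version (`v0.0.37`), the user name, the `slurm_token_<stack-name>` secret id, and the head-node address are assumptions that vary with your Slurm version and stack, so the AWS and curl calls are shown as comments and only the payload file is actually written.

```shell
#!/bin/bash
set -eu

# Build a job-submission payload for slurmrestd.
cat > job.json <<'EOF'
{
  "job": {
    "name": "rest-sleep-test",
    "nodes": 1,
    "current_working_directory": "/tmp",
    "environment": {"PATH": "/bin:/usr/bin"}
  },
  "script": "#!/bin/bash\nsrun sleep 30"
}
EOF

# Illustrative only -- placeholders in <> must be replaced:
# SLURM_JWT=$(aws secretsmanager get-secret-value --secret-id slurm_token_<stack-name> \
#   --query SecretString --output text)
# curl -k -X POST "https://<head-node-address>/slurm/v0.0.37/job/submit" \
#   -H "X-SLURM-USER-NAME: ec2-user" -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
#   -H "Content-Type: application/json" -d @job.json
```

The `-k` flag accepts the cluster's self-signed certificate; the two `X-SLURM-*` headers are forwarded to `slurmrestd` by the nginx proxy this PR installs.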

### Troubleshooting

If jobs aren't submitting, the most likely cause is security-group configuration. Try manually adding your IP to the new security group you created.
If you're still running into issues, you can select `Any IPv4` as the source (**WARNING:** this exposes the REST API to the entire internet).
1 change: 1 addition & 0 deletions docs/content/02-tutorials/_index.md
@@ -13,3 +13,4 @@ pre: "<b>II ⁃ </b>"
| 💾 [Memory Scheduling](02-tutorials/03-memory-scheduling.html) | Schedule using the `--mem` slurm flag. |
| 💰 [Cost Tags](02-tutorials/04-cost-tracking.html) | Track job costs in AWS Cost Explorer by user and project. |
| ⇓ [Downloading](02-tutorials/06-downloading.html) | Download files at cluster start. |
| 🌀 [Slurm API](02-tutorials/07-slurm-rest-api.html) | Enable advanced job submission. |
1 change: 1 addition & 0 deletions frontend/src/old-pages/Configure/Components.tsx
@@ -48,6 +48,7 @@ const multiRunner = 'https://raw.githubusercontent.com/aws-samples/pcluster-mana
const knownExtensions = [{name: 'Cloud9', path: 'cloud9.sh', description: 'Cloud9 Install', args: []},
{name: 'Downloader', path: 'downloader.sh', description: 'Downloader', args: [{name: 'Destination', default: '/tmp'}, {name: 'Source'}]},
{name: 'Slurm Accounting', path: 'slurm-accounting.sh', description: 'Slurm Accounting', args: [{name: 'Secret ARN'}, {name: 'RDS Endpoint'}, {name: 'RDS Port', default: '3306'}]},
{name: 'Slurm REST API', path: 'slurm-rest-api.sh', description: 'Enable Slurm REST API (Requires Slurm Accounting)', args: []},
{name: 'Spack', path: "spack.sh", description: 'Install Spack package manager.', args:[{name: 'Spack Root'}]},
{name: 'Memory', path: "mem.sh", description: 'Setup Memory Resource in Slurm.', args:[]},
{name: 'Cost Tags', path: "cost-tags.sh", description: 'Set cost tags on compute instances.', args:[]},
79 changes: 79 additions & 0 deletions resources/files/sacct/nginx.conf
@@ -0,0 +1,79 @@
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

# Load dynamic modules. See /usr/share/doc/nginx/README.dynamic.
include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}

http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 4096;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Load modular configuration files from the /etc/nginx/conf.d directory.
    # See http://nginx.org/en/docs/ngx_core_module.html#include
    # for more information.
    include /etc/nginx/conf.d/*.conf;

    # Settings for a TLS-enabled server.
    server {
        listen 443 ssl http2;
        listen [::]:443 ssl http2;
        server_name _;
        root /usr/share/nginx/html;

        location / {
            proxy_pass http://slurmrestd;

            # auth_request /validate/;
            # auth_request_set $user_name $upstream_http_x_slurm_user_name;
            # auth_request_set $user_token $upstream_http_x_slurm_user_token;
            # proxy_set_header X-SLURM-USER-NAME $user_name;
            # proxy_set_header X-SLURM-USER-TOKEN $user_token;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-Port $server_port;
            proxy_pass_request_headers on;
        }

        ssl_certificate /etc/ssl/certs/nginx-selfsigned.crt;
        ssl_certificate_key /etc/ssl/certs/nginx-selfsigned.key;
        ssl_client_certificate /etc/ssl/certs/ca.crt;
        ssl_session_cache shared:SSL:1m;
        ssl_session_timeout 10m;
        ssl_ciphers HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers on;

        # Load configuration files for the default server block.
        include /etc/nginx/default.d/*.conf;

        error_page 404 /404.html;
        location = /40x.html {
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
        }
    }

    upstream slurmrestd {
        server unix:/var/spool/socket/slurmrestd.sock;
    }
}
166 changes: 166 additions & 0 deletions resources/files/sacct/slurm_rest_api.rb
@@ -0,0 +1,166 @@
require 'json'

return if node['cluster']['node_type'] != 'HeadNode'

slurm_etc = '/opt/slurm/etc'
socket_location = '/var/spool/socket'
state_save_location = '/var/spool/slurm.state'
key_location = "#{state_save_location}/jwt_hs256.key"
certs_location = '/etc/ssl/certs'
key_and_crt_name = 'nginx-selfsigned'
id = 2005

# Configure Slurm for JWT authentication
directory state_save_location do
  owner 'slurm'
  group 'slurm'
  mode '0755'
end

ruby_block 'Create JWT key file' do
  block do
    shell_out!("dd if=/dev/random of=#{key_location} bs=32 count=1")
  end
  not_if { ::File.exist?(key_location) } # regenerating the key would invalidate existing tokens
end

file key_location do
  owner 'slurm'
  group 'slurm'
  mode '0600'
end

ruby_block 'Add JWT configuration to slurm.conf' do
  block do
    file = Chef::Util::FileEdit.new("#{slurm_etc}/slurm.conf")
    file.insert_line_after_match(/AuthType=/, "AuthAltParameters=jwt_key=#{key_location}")
    file.insert_line_after_match(/AuthType=/, 'AuthAltTypes=auth/jwt')
    file.write_file
  end
  not_if "grep -q auth/jwt #{slurm_etc}/slurm.conf"
end

ruby_block 'Add JWT configuration to slurmdbd.conf' do
  block do
    file = Chef::Util::FileEdit.new("#{slurm_etc}/slurmdbd.conf")
    file.insert_line_after_match(/AuthType=/, "AuthAltParameters=jwt_key=#{key_location}")
    file.insert_line_after_match(/AuthType=/, 'AuthAltTypes=auth/jwt')
    file.write_file
  end
  not_if "grep -q auth/jwt #{slurm_etc}/slurmdbd.conf"
end

service 'slurmctld' do
  action :restart
end

ruby_block 'Generate JWT token and create/update AWS secret' do
  block do
    token_name = "slurm_token_#{node['cluster']['stack_name']}"
    region = node['cluster']['region']

    # shell_out! both runs the command and raises on failure, so no extra
    # run_command call is needed; strip the trailing newline before
    # interpolating the token into another command line.
    jwt_token = shell_out!("/opt/slurm/bin/scontrol token lifespan=9999999999 \
      | grep -oP '^SLURM_JWT\\s*=\\s*\\K(.+)'").stdout.strip

    begin
      shell_out!("aws secretsmanager create-secret \
        --name #{token_name} \
        --region #{region} \
        --secret-string #{jwt_token}")
    rescue Mixlib::ShellOut::ShellCommandFailed
      # The secret already exists -- update it instead.
      shell_out!("aws secretsmanager update-secret \
        --secret-id #{token_name} \
        --region #{region} \
        --secret-string #{jwt_token}")
    end
  end
end

# NGINX installation and configuration
package 'nginx' do
  action :install
end

# Chef already runs as root, so no sudo is needed; quote the -subj value so
# stack names containing spaces do not break the command.
ruby_block 'Generate self-signed key' do
  block do
    shell_out!("openssl req -x509 -nodes -days 36500 -newkey rsa:2048 \
      -keyout #{certs_location}/#{key_and_crt_name}.key \
      -out #{certs_location}/#{key_and_crt_name}.crt \
      -subj \"/CN=#{node['cluster']['stack_name']}\"")
  end
end

group 'nginx' do
  comment 'nginx group'
  gid id + 1
  system true
end

user 'nginx' do
  comment 'nginx user'
  uid id + 1
  gid id + 1
  system true
end

file '/etc/nginx/nginx.conf' do
  owner 'nginx'
  group 'nginx'
  mode '0644'
  content ::File.read('/tmp/slurm_rest_api/nginx.conf')
end

service 'nginx' do
  action :start
end

# Enable slurmrestd
# TODO: Not idempotent if user is in process
group 'slurmrestd' do
  comment 'slurmrestd group'
  gid id
  system true
end

user 'slurmrestd' do
  comment 'slurmrestd user'
  uid id
  gid id
  system true
end

directory socket_location do
  owner 'nginx'
  group 'nginx'
  mode '0777'
end

file '/etc/systemd/system/slurmrestd.service' do
  owner 'slurmrestd'
  group 'slurmrestd'
  mode '0644'
  content ::File.read('/tmp/slurm_rest_api/slurmrestd.service')
end

service 'slurmrestd' do
  action :start
end

ruby_block 'Wait for slurmrestd' do
  block do
    iter = 0
    until ::File.exist?("#{socket_location}/slurmrestd.sock") || iter > 20
      sleep 1
      iter += 1
    end
    raise 'Timeout waiting for slurmrestd startup' unless ::File.exist?("#{socket_location}/slurmrestd.sock")
  end
end

ruby_block 'Modify socket permissions' do
  notifies :start, 'service[slurmrestd]', :before
  block do
    shell_out!("chmod 0666 #{socket_location}/slurmrestd.sock")
  end
end
16 changes: 16 additions & 0 deletions resources/files/sacct/slurmrestd.service
@@ -0,0 +1,16 @@
[Unit]
Description=Slurm REST daemon
After=network.target munge.service slurmctld.service
ConditionPathExists=/opt/slurm/etc/slurm.conf
Documentation=man:slurmrestd(8)

[Service]
Type=simple
User=slurmrestd
Group=slurmrestd
Environment="SLURM_JWT=daemon"
ExecStart=/opt/slurm/sbin/slurmrestd unix:/var/spool/socket/slurmrestd.sock -a rest_auth/jwt
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target
27 changes: 27 additions & 0 deletions resources/scripts/slurm-rest-api.sh
@@ -0,0 +1,27 @@
#!/bin/bash

set -x
set -e

mkdir -p /tmp/slurm_rest_api
pushd /tmp/slurm_rest_api

# Copy Slurm configuration files
source_path=https://raw.githubusercontent.com/aws-samples/pcluster-manager/main/resources/files
files=(slurmrestd.service slurm_rest_api.rb nginx.conf)
for file in "${files[@]}"
do
wget -qO- ${source_path}/sacct/${file} > ${file}
done

sudo cinc-client \
--local-mode \
--config /etc/chef/client.rb \
--log_level auto \
--force-formatter \
--no-color \
--chef-zero-port 8889 \
-j /etc/chef/dna.json \
-z slurm_rest_api.rb

set +e