- Setting up the repo
- Running the code on KI-SLURM
- Debugging on KI-SLURM
- Running Jupyter Notebook on a GPU on the cluster
- Cluster usage policy
## Setting up the repo

- Clone this repository on the remote machine
- Set up Miniconda
- Create a new environment: `conda create --name hello_cluster_env python=3.9`
- Activate the new environment with `conda activate hello_cluster_env` and install the requirements with `pip install -r requirements.txt`
- You should now be able to run the code as follows: `python main.py --<cmd-line-arguments> <values>`
IMPORTANT: Remember, you are not supposed to run your Python scripts this way on the clusters. Scripts should always be submitted as jobs (see the next section). The login nodes should never be used for any kind of compute (not even, say, to run TensorBoard). The last step above (running the script directly) is only for local installations on your computer.
On the KI-SLURM (Meta) cluster, your account will be penalized with limited CPU usage for a while if you run compute-intensive processes on a login node.
## Running the code on KI-SLURM

There are two ways to run scripts on the cluster:
- Submit a job using `sbatch`
- Run your code in a SLURM interactive session using `srun`
Let's begin by looking at `sbatch`. First, you need to know which partitions you have access to. You can use `sinfo` to find this information. A sample output might look like this:
| PARTITION | AVAIL | TIMELIMIT | NODES | STATE | NODELIST |
|---|---|---|---|---|---|
| partitionXYZ | up | 2-00:00:00 | 1 | idle | xyzgpu[0-6] |
| partitionABC | up | 01:00:00 | 1 | idle | xyzgpu[10,20] |
See `./scripts/meta/run.sh` for an example of a job script. To submit a job:
- Edit `run.sh`:
  - Add the partition you want to run the job on
  - Adjust the path to your miniconda installation
  - Create the directories required for the logs
- Submit the job: `sbatch scripts/meta/run.sh`
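For orientation, a job script along these lines might look like the sketch below. This is not the actual content of `./scripts/meta/run.sh`; the partition, conda path, log directory, and resource requests are placeholders that you would adapt to your setup.

```bash
#!/bin/bash
#SBATCH --partition=partitionXYZ        # a partition you have access to (see sinfo)
#SBATCH --job-name=hello_cluster
#SBATCH --cpus-per-task=1
#SBATCH --mem=6GB
#SBATCH --gres=gpu:1                    # request one GPU
#SBATCH --time=01:00:00
#SBATCH --output=logs/%x.%j.out         # create the logs/ directory beforehand
#SBATCH --error=logs/%x.%j.err

# Activate the conda environment (adjust the path to your miniconda installation)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate hello_cluster_env

python main.py --device cuda
```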
Some useful commands for monitoring and managing your jobs:

```bash
squeue              # See all the jobs in the queue
squeue -u user      # See only user's jobs
scancel -u my_user  # Cancel all your jobs
scancel <jobid>     # Cancel a specific job
sfree
```
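A couple of optional conveniences on top of these (`watch` and `scontrol` are standard tools, but check their availability on your cluster; `myuser` and `<jobid>` are placeholders):

```bash
watch -n 30 'squeue -u myuser'   # refresh your job list every 30 seconds
scontrol show job <jobid>        # inspect the full details of a single job
```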
You can run an interactive session using `srun`. You can specify the parameters of the job using the same switches seen in `run.sh`. You only need an additional `--pty bash` to start a bash session:

`srun --partition <your_partition> --mem 6GB --job-name HelloClusterInteractiveSession --pty bash`
You will see that you are now logged into a compute node. From here, you may run Python scripts as usual:

`python main.py --device cuda`

Remember that you should only do this from a compute node that you acquired using `srun`, never from a login node.
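Once inside the interactive session, it can be worth confirming which node and GPU you actually got before launching anything. A small sketch (`nvidia-smi` assumes the node has NVIDIA GPUs and that your request included one):

```bash
hostname        # which compute node am I on?
nvidia-smi      # which GPU(s) are visible to this session?
python main.py --device cuda
```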
## Debugging on KI-SLURM

To remotely debug code that resides on the cluster, we need the `Remote - SSH` extension for VSCode. Install it as follows:
- Bring up the Extensions view (Ctrl+Shift+X / Cmd+Shift+X), or `View` > `Extensions`
- Install the `Remote - SSH` extension (Extension ID: `ms-vscode-remote.remote-ssh`)
You can run your code that resides on the cluster while debugging it with VSCode as follows:
- Start an interactive session: `srun -p <partition_name> --pty bash <...other options>`
- This will log you into a node with the requested resources: `myuser@dlcgpuxyz:~$`. Here, `myuser` is logged into node `dlcgpuxyz`.
- Now, use the `Remote - SSH` extension to log into that node: `View` > `Command Palette` > `Remote-SSH: Connect to Host...` > `+ Add New SSH Host` > `ssh myuser@dlcgpuxyz`. If VSCode cannot reach the compute node directly from your machine, see the SSH config sketch after this list.
- Once you're logged into the node, you can navigate to the directory of your repo in the Explorer (Ctrl+Shift+E / Cmd+Shift+E / `View` > `Explorer`)
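If your local machine cannot SSH into the compute node directly (compute nodes are often only reachable via the login node, which is why the Jupyter section below tunnels through it), an entry along these lines in your local `~/.ssh/config` may help. The host names, user name, and login address are placeholders:

```
# Login node of the cluster (replace the address and user with your own)
Host ki-login
    HostName kisxxx.xx.xx.xx
    User myuser

# Reach any dlcgpu* compute node by hopping through the login node
Host dlcgpu*
    User myuser
    ProxyJump ki-login
```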
You can now run and debug your code using the same workflows as when the code resides (and runs) on your local machine. Make sure you select your Python interpreter (Ctrl+Shift+P > `Python: Select Interpreter`) from your conda environment before you run your code.
Your session might get terminated if there is no activity in it for too long (scripts running on the node via VSCode remote SSH do not count). The easiest way to get around this is to start a `screen` session once your interactive session begins.
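For example, a minimal `screen` workflow (the session name `debug` is arbitrary):

```bash
screen -S debug     # start a named screen session on the compute node
# ... run your long-lived processes here ...
# detach with Ctrl-a d; later, reattach with:
screen -r debug
```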
## Running Jupyter Notebook on a GPU on the cluster

You can run a Jupyter notebook on a GPU on the cluster and access it from the browser on your local machine using SSH port forwarding. Here's how you can do this (the instructions are taken from this tutorial):
1. Run an interactive session as shown in the section above.
2. Once you are logged into a node, start your Jupyter notebook: `myuser@dlcgpu50:/path/to/my/directory$ jupyter notebook --no-browser --port=9001`. Here, `myuser` is logged into node `dlcgpu50` and starts a Jupyter notebook that listens on port 9001.
3. Create an SSH tunnel to the node that runs your Jupyter notebook. Open a new terminal on your local machine and run: `ssh -t -t myuser@kisxxx.xx.xx.xx -L 9001:localhost:9001 ssh dlcgpu50 -L 9001:localhost:9001`
4. Once the tunnel is established, you can open `localhost:9001` on your local machine and work on the Jupyter notebook as if it were running on your machine, while still accessing the code and data available on the cluster.
Steps 2 and 3 can be made easier by adding the following entries to your `.bash_profile` on the remote and the local machine, respectively.
- On the remote machine (i.e., the cluster):

  ```bash
  # Start a Jupyter notebook without a browser on the given port
  function jpt(){
      jupyter notebook --no-browser --port=$1
  }
  ```
- On your local machine:

  ```bash
  # Forwards port $1 from node $2 to port $1 on the local machine and listens to it
  function sshtojptnode(){
      ssh -t -t youruser@kisxxxx.xx.xx.xxx -L $1:localhost:$1 ssh $2 -L $1:localhost:$1
  }
  ```
Once these are ready, remotely running Jupyter notebooks is easy!
- On the remote machine: `myuser@dlcgpu13:/path/to/your/directory$ jpt 9001`
- On your local machine: `sshtojptnode 9001 dlcgpu13`
- Now, simply fire up `localhost:9001` in the browser on your local machine.
## Cluster usage policy

- Disk storage: Always use workspaces for storing experiment data and for any I/O during a program run. This keeps the load on the login node low and allows faster I/O. Check `man ws_list`, `man ws_allocate` and `man ws_extend` (see the workspace sketch at the end of this section).
- Shared workspace: Use this when working on a collaborative project. Creating and executing this script allows read-write access for other users.
- Resource allocation request (SBATCH parameter advice): It is important to know how many resources a job requests, given what is currently available:
  - CPUs and GPUs requested: Check `sinfo` to see the available resources. Your requests should leave some resources free; if you do fill up a partition, notify the cluster communication channel accordingly.
    - Each GPU requested allocates 8 CPUs, overriding the `--cpus-per-task` or `-c` flags
  - Memory requirement: Specifying `--mem` explicitly affects the resources that are actually requested
    - A node has multiple CPU cores (~20) and the node's RAM is split among these cores (~6 GB per CPU)
    - Requesting more than 6 GB can therefore implicitly request more than 1 CPU, overriding the `--cpus-per-task` or `-c` flags
    - A good practice is to use `srun` or a test `sbatch` job to measure the actual memory requirements
    - Alternatively, to keep estimates simple, `-c` can be used directly to request resources (e.g. `-c 2` == `--mem 12G` and `-c 4` == `--mem 24G`)
  - Time limits: Note the default time limit of a job on a partition (shown by `sinfo`) when you do not specify one explicitly
    - The scheduler priority assigned to a job is often inversely proportional to the time limit specified
- Array jobs: To avoid filling up a partition and to leave resources for other users, using an array job is a must when deploying many jobs (see the array job sketch at the end of this section)
  - Using the `%n` throttle and the resource calculations above, a near-exact estimate of a job's runtime can be made
  - After deployment, the resource request of a job can still be updated
    - Update the number of concurrent tasks in an array job: `scontrol update ArrayTaskThrottle=[new n] JobId=[XXX]`
    - Jobs can be moved to an emptier partition: `scontrol update partition=[new partition] JobId=[XXX]`
- For any confusion regarding cluster usage and behaviour:
  - First search on the internet (or ChatGPT, of course)
  - Ask in the Mattermost channel
  - Raise a ticket (if you have the permissions) here
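As referenced above, here is a rough sketch of the workspace workflow. The workspace name and durations are placeholders and the exact option syntax may differ on your cluster, so check the man pages first:

```bash
ws_allocate my_experiments 30   # allocate a workspace named my_experiments for 30 days
ws_list                         # list your workspaces and their remaining lifetimes
ws_extend my_experiments 30     # extend the workspace before it expires
```

And a hedged sketch of an array job script that applies the advice above. The partition, resource numbers, conda path, and config layout are placeholders, and `--config` is only an illustrative way of mapping the array index to a run configuration, not an actual flag of this repo's `main.py`:

```bash
#!/bin/bash
#SBATCH --partition=partitionXYZ
#SBATCH --job-name=hello_cluster_array
#SBATCH --array=0-99%8              # 100 tasks, at most 8 running at once (the %n throttle)
#SBATCH -c 2                        # 2 CPUs, roughly equivalent to --mem 12G on this cluster
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x.%A_%a.out  # %A = array job id, %a = array task id

source ~/miniconda3/etc/profile.d/conda.sh
conda activate hello_cluster_env

# Each task picks its own configuration based on its array index
python main.py --config configs/run_${SLURM_ARRAY_TASK_ID}.yaml
```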