MLO: Getting started with the EPFL Cluster

This repository contains the basic steps to start running scripts and notebooks on the EPFL Cluster (RCP) -- so that you don't have to go through the countless pages of documentation by yourself! We also provide scripts that make your life easier by automating a lot of things. It is based on a similar setup from our friends at TML and CLAIRE, and scripts created by Atli :)

The RCP cluster has A100 (80GB), H100 (80GB), H200 (140GB) and V100 GPUs that you can choose from. The system is built on top of Docker (containers), Kubernetes (automating deployment of containers) and run:ai (scheduler on top of Kubernetes).

For starters, we recommend that you go through the minimal basic setup first and then read the important notes.

If you have a question about the cluster or the setup that is not answered here, check the frequently asked questions page. Also, please do not hesitate to reach out to any of your colleagues. There are some more resources under the quick links below.

Tip

If you have little prior experience with ML workflows, the setup below may seem daunting at first. But the guide tries to make it as simple as possible for you by providing you all commands in order, and with a script that does most of the work for you. The only requirement is that you have a basic understanding of how to use a terminal and git.

Caution

Using the cluster creates costs. Please be mindful of the resources you use. Do not forget to stop your jobs when not used!

Content overview:

MLO: Getting started with the EPFL Cluster
Minimal basic setup
Managing Workflows and Advanced Topics
File overview of this repository
Quick links
- Other cluster-related code repositories

Minimal basic setup

The step-by-step instructions for first time users to quickly get a job running.

Tip

After completing the setup, the TL;DR of the interaction with the cluster (using the scripts in this repo) is:

Get a running job with one GPU that is reserved for you: python csub.py -n sandbox
Connect to a terminal inside your job: runai exec sandbox -it -- zsh
Run your code: cd /mloscratch/homes/<your username>; python main.py
In one go, you can also do: python csub.py -n experiment --train --command "cd /mloscratch/homes/<your username>/<your code>; python main.py "

Important

Make sure you are on the EPFL wifi or connected to the VPN. The cluster is otherwise not accessible.

1: Pre-setup (access, repository)

Group access: You need to have access to the cluster. For that, ask Jennifer or Martin (or someone else) to add you to the group runai-mlo: https://groups.epfl.ch/

Prepare your code: While you are waiting to get access, create a GitHub repository where you will implement your code. Irrespective of our cluster or this guide, it is best practice to keep track of your code with a GitHub repo.

Prepare Weights and Biases or HuggingFace: For logging the results of your experiments, you can use Weights and Biases. Create an account if you don't already have one. You will need an API key to later log your experiments. The same goes for the Huggingface Hub if you want to use their hosted models.

The following are just a bunch of commands you need to run to get started. If you do not understand them in detail, you can copy-paste them into your terminal :)

2: Setup the tools on your own machine

Important

The setup below was tested on macOS with Apple Silicon. If you are using a different system, you may need to adapt the commands. For Windows, we have no experience with the setup and thereby recommend WSL (Windows Subsystem for Linux) to run the commands.

Install kubectl. To make sure the version matches that of the cluster (status: 15.12.2023), on macOS with Apple Silicon, run the following commands. For other systems, you will need to change the URL in the command below (check https://kubernetes.io/docs/tasks/tools/install-kubectl/). Make sure that the version matches the version of the cluster!

# Sketch for macOS with Apple Silicon.
# Download a specific version (here v1.29.6 for Apple Silicon macOS)
curl -LO "https://dl.k8s.io/release/v1.29.6/bin/darwin/arm64/kubectl"
# Linux: curl -LO "https://dl.k8s.io/release/v1.29.6/bin/linux/amd64/kubectl"
# Give it the right permissions and move it.
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl

Set up the kube config file: Take our template file kubeconfig.yaml as your config in the home folder ~/.kube/config. Note that the file on your machine has no suffix.

# Create the config folder if it does not exist yet, then download the template.
mkdir -p ~/.kube
curl -o ~/.kube/config https://raw.githubusercontent.com/epfml/getting-started/main/kubeconfig.yaml

Install the run:ai CLI for RCP:

# Sketch for macOS with Apple Silicon
# Download the CLI from the link shown in the help section.
# for Linux: replace `darwin` with `linux`
wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin
# Give it the right permissions and move it.
chmod +x ./runai
sudo mv ./runai /usr/local/bin/runai
sudo chown root: /usr/local/bin/runai

3: Login

Log in to the RCP cluster and check that you can see your projects.

# Login to the cluster
runai login
# Check that things worked fine
runai list projects
# Set your default project
runai config project mlo-$GASPAR_USERNAME

Run a quick test to see that you can launch jobs. In the command below, replace $UID with your EPFL user ID (in bash, $UID otherwise expands to the user ID on your local machine). You can find your personal UID under the administrative data in your profile on people.epfl.ch (e.g. https://people.epfl.ch/alexander.hagele). The group ID is 83070 for the MLO group. If you are using this guide from another group, you should change the group ID as well.

# Try to submit a job that mounts our shared storage and see its content.
runai submit \
  --name setup-test-storage \
  --image ubuntu \
  --run-as-uid $UID \
  --run-as-gid 83070 \
  --pvc mlo-scratch:/mloscratch \
  -- ls -la /mloscratch/homes
# Check the status of the job
runai describe job setup-test-storage

# Check its logs to see that it ran.
runai logs setup-test-storage

# Delete the successful jobs
runai delete jobs setup-test-storage

The runai submit command already suffices to run jobs. If that is fine for you, you can jump to the section on using provided images and the run:ai CLI here.

However, we provide a few scripts in this repository to make your life easier to get started.

4: Use this repo to start a job

Clone this repository and create a user.yaml file in the root folder of the repo using the template in templates/user_template.yaml.

git clone https://github.com/epfml/getting-started.git
cd getting-started
touch user.yaml # then copy the content from templates/user_template.yaml inside here and update

Fill in user.yaml with your username and user ID, and update working_dir with your username. Do not change anything else in the yaml file. You can find your information in your profile on people.epfl.ch (e.g. https://people.epfl.ch/alexander.hagele) under “Administrative data”. Important for logging: if you want to use wandb, get an API key from Weights and Biases and add it to the yaml. There's also a field for the Huggingface token (like an API key).
Create a pod with 1 GPU (you may need to install pyyaml with pip install pyyaml first).

rcp-cluster # switch to RCP cluster context
python csub.py -n sandbox

Wait until the pod has a 'running' status -- this can take a bit (max ~5 min or so). Check the status of the job with

runai list # shows all jobs
runai describe job sandbox # shows the status of the job sandbox

When it is running, connect to the pod with the command:

runai exec sandbox -it -- zsh

If everything worked correctly, you should be inside a terminal on the cluster!

5: Cloning and running your code

Clone your fork of your GitHub repository (where you have your experiment code) into the pod inside your home folder.

# Inside the pod
cd /mloscratch/homes/<your_username>
git clone https://github.com/<your username>/<your code>.git
cd <your code>

Conda should be automatically installed. To create an environment that contains the packages needed for your experiments, you can do something like

# inside the pod
conda create -n env python=3.10
conda activate env
# inside /mloscratch/homes/<your username>/<your code>
pip install -r requirements.txt

Now you can run the code as you would on your local machine. For example, to run a main.py script (assuming you wrote it in your code), you simply do:

# Inside the pod, inside /mloscratch/homes/<your username>/<your code>
python main.py

Hopefully, this should work and you're up and running! If you set up Weights and Biases, the API key in the user.yaml file should also automatically enable tracking your job on your wandb dashboard (so you can see the loss going down :) )

For remote development (changing code, debugging, etc.), we recommend using VSCode. You can find more information on how to set it up in the VSCode section.

Tip

Generally, the workflow we recommend is simple: develop your code locally or on the cluster (e.g. with VS Code), then push it to your repository. When you want to try it, run it on the cluster in the terminal that is attached via runai exec sandbox -it -- zsh. This way, you can keep your code and experiments organized and reproducible.

Note that your pods can be killed anytime. This means you might need to restart an experiment (with the python csub.py command we give above). You can see the status of your jobs with runai list. If a job has status "Failed", you have to delete it via runai delete job sandbox before being able to start the same job again.

Keep your files inside your home folder: Importantly, when a job is restarted or killed, everything inside the container folders of ~/ is lost. This is why you need to work inside /mloscratch/homes/<your username>. For conda and other things (e.g. ~/.zshrc), we have set up automatic symlinks to files that are persistent on scratch.

To have a job that can run in the background, do python csub.py -n sandbox --train --command "cd /mloscratch/homes/<your username>/<your code>; python main.py "

You're good to go now! :) It's up to you to customize your environment and install the packages you need. Read up on the rest of this README to learn more about the cluster and the scripts.

Caution

Using the cluster creates costs. Please do not forget to stop your jobs when not used!

Managing Workflows and Advanced Topics

Using VSCODE

To easily attach a VSCODE window to a pod we recommend the following steps:

Install the Kubernetes and Dev Containers extensions.
From your VSCODE window, click on Kubernetes -> rcp-cluster -> Workloads -> Pods, and you should be able to see all your running pods.
Right-click on the pod you want to access and select Attach Visual Studio Code. This will start a VSCODE session attached to your pod.
The symlinks ensure that settings and extensions are stored in /mloscratch/homes/<gaspar username> and therefore shared across pods.
Note that when opening the VS code window, it opens the home folder of the pod (not scratch!). You can navigate to your working directory (code) by navigating to /mloscratch/homes/<your username>.

You can also see a pictorial description here.

Managing pods

After starting pods with the script, you can manage your pods using run:ai and the following commands:

runai exec pod_name -it -- zsh # - opens an interactive shell on the pod 
runai delete job pod_name # kills the job and removes it from the list of jobs
runai describe job pod_name # shows information on the status/execution of the job
runai list jobs # list all jobs and their status 
runai logs pod_name # shows the output/logs for the job

Some commands that might come in handy (credits to Thijs):

# Clean up succeeded jobs from run:ai.
runai list | grep " Succeeded " | awk '{print $1}' | parallel runai delete job {}
# Overview of active jobs that fits on your screen.
runai list jobs | sed '1d' | awk '{printf "%-42s %-20s\n", $1, $2}'
# Auto-updating listing of jobs and their states.
watch -n 10 "runai list | sed 1d | awk '{printf \"%-40s %-20s\n\", \$1, \$2}'"
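
If you prefer Python over awk, the same column extraction can be sketched as below. This is only an illustration: it assumes, as the one-liners above do, that the listing is whitespace-separated with a header row, the job name in the first column and the status in the second. The helper names parse_job_table and jobs_with_status are made up for this example and are not part of run:ai or this repository.

```python
from typing import List, Tuple


def parse_job_table(text: str) -> List[Tuple[str, str]]:
    """Return (name, status) pairs from whitespace-separated job listing text."""
    rows = []
    # Skip the header row, like `sed 1d` in the shell one-liners.
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2:  # ignore blank or malformed lines
            rows.append((fields[0], fields[1]))
    return rows


def jobs_with_status(text: str, status: str) -> List[str]:
    """Return the names of all jobs whose status column equals `status`."""
    return [name for name, job_status in parse_job_table(text) if job_status == status]
```

You could feed it the captured output of runai list jobs (for example via subprocess.run with capture_output=True and text=True) and then pass each name returned for "Failed" or "Succeeded" to runai delete job.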

Important notes and workflow

We provide the script in this repo as a convenient way of creating jobs (see more details in the section below).

The default job is just an interactive one (with sleep) that you can use for development.
- 'Interactive' jobs are a concept from run:ai. Every user can have 1 interactive GPU. They have higher priority than other jobs and can live up to 12 hours. You can use them for debugging. If you need more than 1 GPU, you need to submit a training job.
For a training job, use the flag --train, and replace the command with your training command. Using a training job allows you to use more than 1 GPU (up to 8 on one node). Moreover, a training job makes sure that the pod is killed when your code/experiment is finished in order to save money.
When choosing the type of GPU on the RCP cluster, you have a handful of options. Consider the cost as well as the memory and compute requirements of your job when choosing among them.
- High-end GPUs like H100 and H200 come with significantly higher costs, so they should be used with care.
- A100 GPUs are good enough for most use cases. If your job does not require large memory, A100 40GB or V100 may be more cost-effective and faster to schedule. If your code is not heavily compute-bound and works well on older hardware, using V100 is preferred.
- For memory-intensive workloads, H200 with 140GB RAM is recommended.
- Overall, if you plan to run a series of jobs, it's a good idea to inform your supervisor in advance.
To specify the type of GPU when submitting with csub.py, use the flag --node_type. If you are submitting directly through the CLI, use the flag --node-pools instead. In both cases, choose from [v100|h100|h200|default|a100-40g], where default corresponds to A100 GPUs. For example, to use A100 GPUs, add --node-pools default to a CLI submission or --node_type default when submitting with csub.py.
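
Because the flag name differs between csub.py and the run:ai CLI, a tiny helper can keep the two straight when you script your submissions. The sketch below only encodes the flag names and choices listed above; the function name node_type_args and the tool labels "csub" and "runai" are made up for this example.

```python
from typing import List

# GPU pool names as listed in this README ("default" corresponds to A100).
NODE_TYPES = ("v100", "h100", "h200", "default", "a100-40g")

# Flag used by each submission tool, as described above.
FLAG_BY_TOOL = {"csub": "--node_type", "runai": "--node-pools"}


def node_type_args(node_type: str, tool: str = "csub") -> List[str]:
    """Return the command-line arguments that select a GPU type for `tool`."""
    if node_type not in NODE_TYPES:
        raise ValueError(f"unknown node type {node_type!r}, expected one of {NODE_TYPES}")
    if tool not in FLAG_BY_TOOL:
        raise ValueError(f"unknown tool {tool!r}, expected one of {tuple(FLAG_BY_TOOL)}")
    return [FLAG_BY_TOOL[tool], node_type]


# Example: argument list for a training job on A100 GPUs submitted via csub.py.
csub_command = ["python", "csub.py", "-n", "experiment", "--train"] + node_type_args("default")
```

The resulting list can be passed to subprocess.run as is, which avoids quoting problems compared to assembling one long shell string.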

The script is just one suggested workflow that tries to maximize productivity and minimize costs -- you're free to find your own workflow. Whichever workflow you go for, keep these things in mind:

Important

Work within /mloscratch. This is the shared storage that is mounted to your pod.
- Create a directory with your GASPAR username in the /mloscratch/ folder. This will be your personal folder. Except under special circumstances, all your files should be kept inside your personal folder (e.g. /mloscratch/nicolas if your username is nicolas) or in your personal home folder (e.g. /mloscratch/homes/nicolas).
- Should you use the csub.py script, the first run will automatically create a working directory with your username inside /mloscratch/homes.
- Suggestion: use a GitHub repo to store your code and clone it inside your folder.
Moving things onto the cluster or between folders can also be done easily via the HaaS machine. For more details on storage, see file management.
Remember that your job can get killed anytime if run:ai needs to make space for other users. Make sure to implement checkpointing and recovery into your scripts.
CPU-only pods are cheap, approx 3 CHF/month, so we recommend creating a CPU-only machine that you can let run and use for code development/debugging through VSCODE.
When your code is ready and you want to run some experiments or you need to debug on GPU, you can create one or more new pods with GPU. Simply specify the command in the python launch script.
Using a training job makes sure that you kill the pod when your code/experiment is finished in order to save money.
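
Since jobs can be killed at any time, here is a minimal sketch of what checkpointing and recovery could look like. It is deliberately framework-free: the state is a small dictionary stored as JSON, and the "training step" is a placeholder. The function names and the checkpoint layout are assumptions made for this example, not something csub.py provides. The checkpoint is written to a temporary file first and then renamed, so a job that is killed mid-write does not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write `state` as JSON atomically: temp file first, then rename."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, path)  # atomic on the same filesystem


def load_checkpoint(path, default):
    """Return the saved state, or a copy of `default` if no checkpoint exists."""
    path = Path(path)
    if not path.exists():
        return dict(default)
    with path.open() as f:
        return json.load(f)


def train(checkpoint_path, total_steps, save_every=10):
    """Run (or resume) a toy training loop up to `total_steps`."""
    state = load_checkpoint(checkpoint_path, {"step": 0, "loss": None})
    for step in range(state["step"], total_steps):
        state["loss"] = 1.0 / (step + 1)  # placeholder for a real training step
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(checkpoint_path, state)
    save_checkpoint(checkpoint_path, state)
    return state
```

In a real project you would store the checkpoint somewhere under /mloscratch/homes/<your username> and save whatever your framework needs to resume (typically model and optimizer state) instead of a JSON dictionary. Restarting the same command after the pod was killed then continues from the last saved step.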

Most importantly:

Caution

Using the cluster creates costs. Please do not forget to stop your jobs when not used!

The HaaS machine

The HaaS machine is provided by IT and allows you to move files, create folders, and copy files between mlodata1, mloraw1, and mloscratch without needing to create a pod. You can access it via:

  # For basic file movement, folder creation, or
  # copying from/to mlodata1 to/from scratch:
  ssh <gaspar_username>@haas001.rcp.epfl.ch

The volumes are mounted inside the folders /mnt/mlo/mlodata1, /mnt/mlo/mloraw1, /mnt/mlo/scratch. See below for what the spaces are used for.

File management

Reminder: the cluster uses kubernetes pods, which means that in principle, any file created inside a pod will be deleted when the pod is killed.

To store files permanently, you need to mount network disks to your pod. In our case, this is mloscratch -- all code and experimentation should be stored there. Except under special circumstances, all your files should be kept inside your personal folder (e.g. /mloscratch/nicolas if your username is nicolas) or in your personal home folder (e.g. /mloscratch/homes/nicolas). Scratch is high-performance storage that is meant to be accessed/mounted from pods. Even though it is called "scratch", you generally do not need to worry about losing data (it is just not replicated across multiple hard drives).
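
To avoid accidentally writing results to a folder that disappears with the pod, a script can check its output path before starting. The sketch below is one possible guard, assuming the storage is mounted at /mloscratch as in this guide; the function name is_persistent is made up for this example. It resolves symlinks first, so a path under ~/ that is symlinked to scratch counts as persistent.

```python
from pathlib import Path


def is_persistent(path, scratch_root="/mloscratch"):
    """Return True if `path`, after resolving symlinks, lies under `scratch_root`."""
    root = Path(scratch_root).resolve()
    resolved = Path(path).expanduser().resolve()
    try:
        resolved.relative_to(root)  # raises ValueError if not under root
    except ValueError:
        return False
    return True
```

A training script could call is_persistent on its output directory at startup and abort with a clear message if the check fails, rather than losing hours of results when the pod is killed.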

For very secure long-term storage, we have:

mlodata1
- This is long term storage, backed up carefully with replication (i.e. stored on multiple hard drives). This is meant to contain artifacts that you want to keep for an undetermined amount of time (e.g. things for a publication).
mloraw1
- Not clear right now how this will be used in the future (status: 15.12.2023).

Caution

You cannot mount mlodata1 or mloraw1 on pods. Use the HaaS machine (see below) to access them.

Moving data onto/between storage

Since mloscratch is not replicated, whenever you need things to become permanent, move them to mlodata1. This could be the case for paper artifacts, certain results or checkpoints, and so on.

Currently, if you need to move things between mlodata1 and scratch, you need to do this manually via a machine provided by IT:

  # For basic file movement, folder creation, or
  # copying from/to mlodata1 to/from scratch:
  ssh <gaspar_username>@haas001.rcp.epfl.ch

The volumes are mounted inside the folders /mnt/mlo/mlodata1, /mnt/mlo/mloraw1, /mnt/mlo/scratch. You can copy files between them using cp or rsync.
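
If you archive results regularly, you may want to script the copy. The following sketch only builds an rsync command line from the mount points above; it does not run anything by itself. The function name build_rsync_command and the example subfolders (homes/nicolas/results, nicolas/paper-artifacts) are hypothetical, so adapt them to your own layout. It defaults to a dry run so that you can inspect what would be copied first.

```python
from typing import List


def build_rsync_command(source: str, destination: str, dry_run: bool = True) -> List[str]:
    """Build an rsync argument list that copies `source` into `destination`."""
    command = ["rsync", "-a"]  # -a: archive mode (recursive, keeps times and permissions)
    if dry_run:
        command.append("--dry-run")  # only report what would be transferred
    command += [source, destination]
    return command


# Hypothetical example paths on the HaaS machine; replace with your own folders.
example = build_rsync_command(
    "/mnt/mlo/scratch/homes/nicolas/results/",
    "/mnt/mlo/mlodata1/nicolas/paper-artifacts/",
)
```

On the HaaS machine you could then run the list with subprocess.run(example, check=True), and call the function again with dry_run=False once the dry run looks right. Note that with rsync a trailing slash on the source means "the contents of this folder" rather than the folder itself.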

TODO: Update with permanent machine for MLO once we have it.

More background on the csub script

The python script csub.py is a wrapper around the run:ai CLI that makes it easier to launch jobs. It is meant to be used for both interactive jobs (e.g. notebooks) and training jobs. General usage:

python csub.py -n <job_name> -g <number of GPUs> -t <time> -i ic-registry.epfl.ch/mlo/mlo:v1 --command <cmd> [--train]

Check the arguments for the script to see what they do.

What this repository does on first run:

We provide a default MLO docker image mlo/mlo:v1 that should work for most use cases. If you use this default image, the first time you run csub.py, it will create a working directory with your username inside /mloscratch/homes. Moreover, for each symlink listed in the user.yaml file, the script will create the respective file/folder inside mloscratch and link it to the home folder of the pod. This ensures that these files and folders are persistent across different pods.
- Small caveat: csub.py expects your image to have zsh installed.
The entrypoint.sh script also installs conda in your scratch home folder. This means that you can manage your packages via conda (as you're probably used to), and the environment is shared across pods.
- In other words: you can have an environment (e.g. conda activate env) and this environment stays persistent.
Alternatively, the bash script utils/conda.sh, which you can find in your pod under docker/conda.sh, installs some packages from utils/extra_packages.txt in the default environment and creates an additional torch environment with pytorch and the packages in utils/extra_packages.txt. It's up to you to run this or manually customize your environment installation and configuration.

Alternative workflow: using the run:ai CLI and base docker images with pre-installed packages

The setup in this repository is just one way of running and managing the cluster. You can also use the run:ai CLI directly, or use the scripts in this repository as a starting point for your own setup. For more details, see the dedicated readme.

Creating a custom docker image

In case you want to customize it and create your own docker image, follow these steps:

Request registry access: This step is needed to push your own docker images to the registry. Try logging in at https://ic-registry.epfl.ch/ and check whether you see members inside the MLO project. The runai groups are already added, so it should work already. If not, reach out to Alex or a colleague.
Install Docker: brew install --cask docker (or any other preferred way according to the docker website). If you execute commands via the terminal and see the error “Cannot connect to the Docker daemon”, try running docker via the GUI the first time; the commands should work afterwards.
Log in to the registry: docker login ic-registry.epfl.ch and use your GASPAR credentials. Same for the RCP cluster: docker login registry.rcp.epfl.ch (but we're currently not using it).

Modify Dockerfile:

The repo contains a template Dockerfile that you can modify in case you need a custom image
Push the new docker using the script publish.sh
Remember to rename the image (mlo/username:tag) such that you do not overwrite the default one

Additional example: Alternatively, Matteo also wrote a custom one and summarized the steps here: https://gist.github.com/mpagli/6d0667654bf8342eb4923fedf731660e

He created an image that runs by default under his Gaspar user ID and group ID. You can find those IDs in e.g. https://people.epfl.ch/matteo.pagliardini under 'donnees administratives'.
Upload your image to EPFL's registry

docker build . -t <your-tag>
docker login ic-registry.epfl.ch -u <your-epfl-username> # prompts for your EPFL password (avoids putting it in your shell history)
docker tag <your-tag> ic-registry.epfl.ch/mlo/<your-tag>
docker push ic-registry.epfl.ch/mlo/<your-tag>

Port forwarding

If you want to access a port on your pod from your local machine, you can use port forwarding. For example, if you want to access a jupyter notebook running on your pod, you can do the following:

kubectl get pods
kubectl port-forward <pod_name> 8888:8888
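
Once the port-forward is running, you can quickly verify from your local machine that something is listening on the forwarded port before opening the browser. This is a small sketch using only the Python standard library; the function name port_is_open is made up for this example, and 8888 is simply the port used in the command above.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8888, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

If port_is_open() returns False while kubectl port-forward is running, check that the process inside the pod (e.g. the notebook server) is actually listening on the port you forwarded.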

Distributed training

Newer versions of runai support distributed training, meaning the ability to run across multiple compute nodes, even beyond the several GPUs available on one node. This is currently set up on the new RCP Prod cluster (rcp-caas). A nice documentation to get started with distributed jobs is available here.

File overview of this repository

├── utils
│   ├── entrypoint.sh             # Sets up credentials and symlinks
│   ├── conda.sh                  # Conda installation
│   └── extra_packages.txt        # Extra python packages you want to install
├── csub.py                       # Creates a pod through run:ai; you can specify the number of GPUS, CPUS, docker image and time
├── templates
│   └── user_template.yaml        # Template for your user file: COPY IT, DO NOT CHANGE IT
├── Dockerfile                    # Dockerfile example
├── publish.sh                    # Script to push the docker image in the registry
├── kubeconfig.yaml               # Kubeconfig that you should store in ~/.kube/config
├── docs
│   ├── faq.md                    # FAQ
│   └── runai_cli.md              # Run:ai CLI guide
└── README.md                     # This file

Quick links

RCP main page: https://www.epfl.ch/research/facilities/rcp/
Docs: https://wiki.rcp.epfl.ch
Dashboard: https://rcpepfl.run.ai
Docker registry: https://registry.rcp.epfl.ch/
Getting started guide: https://wiki.rcp.epfl.ch/en/home/CaaS/Quick_Start

run:ai docs: https://docs.run.ai

If you want to read up more on the cluster, you can check out a great in-depth guide by our colleagues at CLAIRE. They have a similar setup of compute and storage:

Compute and Storage @ CLAIRE

Other cluster-related code repositories

These repositories are mostly by previous PhDs. They used these repositories to manage shared compute infrastructure. If you want to contribute, please ask Martin to add you as an editor.

epfml/epfml-utils
- Python package (pip install epfml-utils) for shared tooling.
epfml/mlocluster-setup
- Base docker images, and setup code for semi-permanent shared machines (less recommended).
