Commit d57624c ("Update README.md", linusseelinger, Feb 17, 2024), modifying hpc/README.md:

# README

This load balancer allows any UM-Bridge client to request model evaluations from many parallel instances of any numerical model running on an HPC system.

## Installation

1. **Building the load balancer**

Clone the UM-Bridge repository.

```
git clone https://github.com/UM-Bridge/umbridge.git
```

Then navigate to the `hpc` directory.

```
cd umbridge/hpc
```

Finally, compile the load balancer. Depending on your HPC system, you likely have to load a module providing a recent C++ compiler.

```
make
```

2. **Downloading HyperQueue**

Download HyperQueue from the most recent release at https://github.com/It4innovations/hyperqueue/releases and place the `hq` binary in the `hpc` directory next to the load balancer.
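The download might look like the following; the release tag and asset name below are placeholders, so check the releases page for the current version:

```shell
# Placeholders only: substitute the actual release tag and asset name from
# https://github.com/It4innovations/hyperqueue/releases
wget https://github.com/It4innovations/hyperqueue/releases/download/<version>/hq-<version>-linux-x64.tar.gz
tar -xf hq-<version>-linux-x64.tar.gz   # unpacks the `hq` binary into the current directory
```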

## Usage

The load balancer is primarily intended to run on a login node.

1. **Configure resource allocation**

The load balancer instructs HyperQueue to allocate batches of resources on the HPC system, depending on demand for model evaluations. HyperQueue will submit SLURM or PBS jobs on the HPC system when needed, scheduling requested model runs within those jobs. When demand decreases, HyperQueue will cancel some of those jobs.

Adapt the configuration in ``hpc/hq_scripts/allocation_queue.sh`` to your needs.

For example, when running a very fast UM-Bridge model on an HPC cluster, it is still advisable to choose medium-sized jobs for resource allocation. That will avoid submitting large numbers of jobs to the HPC system's scheduler, while HyperQueue itself will handle large numbers of small model runs within those jobs.
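A minimal sketch of what such a configuration might look like, assuming SLURM as the scheduler (the script shipped in the repository may differ; the partition name, time limit, and worker counts are illustrative):

```shell
#!/bin/bash
# Hypothetical allocation queue setup; adjust to your cluster.
#   --time-limit:        walltime of each SLURM job HyperQueue submits
#   --workers-per-alloc: HyperQueue workers started per SLURM job
#   --backlog:           how many allocations HyperQueue may keep queued
# Everything after "--" is passed through to the SLURM scheduler.
./hq alloc add slurm --time-limit 1h --workers-per-alloc 1 --backlog 4 \
    -- --partition=compute
```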

2. **Configure model job**

Adapt the configuration in ``hpc/hq_scripts/job.sh`` to your needs:
* Specify what UM-Bridge model server to run,
* set `#HQ` variables at the top to specify what resources each instance should receive,
* and set the directory of your load balancer binary in `load_balancer_dir`.

Importantly, the UM-Bridge model server must serve its models at the port specified by the environment variable `PORT`. The value of `PORT` is automatically determined by `job.sh`, avoiding potential conflicts if multiple servers run on the same compute node.
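Under these assumptions, a much-simplified sketch of such a job script could look like this (the `#HQ` directive names, the model server path, and the load balancer directory below are illustrative, not the repository's actual contents):

```shell
#!/bin/bash
# Hypothetical sketch of hq_scripts/job.sh; the shipped script differs.
# #HQ directives tell HyperQueue what resources each model instance needs.
#HQ --cpus=4
#HQ --time-request=10m

# Illustrative path to the directory containing the compiled load balancer.
load_balancer_dir="/path/to/umbridge/hpc"

# Start the UM-Bridge model server for this job. It must serve its models
# on the port given by the PORT environment variable, which the real
# job.sh determines automatically.
/path/to/my-model-server
```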

3. **Run load balancer**

Navigate to the `hpc` directory and execute the load balancer.

```
./load-balancer
```

4. **Connect from client**

Once running, you can connect to the load balancer from any UM-Bridge client on the login node via `http://localhost:4242`. To the client, it will appear like any other UM-Bridge server, except that it can process concurrent evaluation requests.
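As a quick smoke test, and assuming a model named `forward` and the UM-Bridge HTTP protocol's `/Evaluate` endpoint (the endpoint name and payload shape here are recalled assumptions; consult the UM-Bridge protocol documentation for the authoritative format), a raw request might look like:

```shell
# Hypothetical evaluation request; model name "forward" and the input
# vector are placeholders for your actual model's interface.
curl -X POST http://localhost:4242/Evaluate \
    -d '{"name": "forward", "input": [[1.0, 2.0]]}'
```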



## (Optional) Running clients on your own machine while offloading runs to HPC

Alternatively, a client may run on your own device. In order to connect UM-Bridge clients on your machine to the login node, you can create an SSH tunnel to the HPC system.

```
ssh <username>@hpc.cluster.address -N -f -L 4242:<server hostname>:4242
# -N : do not execute a remote command; only forward the port
# -f : request ssh to go to the background once the ssh connection has been established
```

While the SSH tunnel is running, you can run the client on your own device, and connect it to the load balancer via `http://localhost:4242`.
