# Perlmutter End-to-end LLM Workflows Guide with Ray

Modified from https://github.com/anyscale/e2e-llm-workflows

## Contents:
* [Setup](#setup)
* [Starting Ray Cluster](#ray-cluster-setup)
    * [Ray Head](#ray-cluster-head)
    * [Ray Workers](#ray-cluster-workers)
    * [Connect to Ray Cluster](#ray-cluster-connect)
* [Dataset](#dataset)

# Setup <a class="anchor" id="setup"></a>

Run the kernel setup script, then reload JupyterHub:
```bash
./setup_kernel.sh
```

Open `Perlmutter_Ray_LLM.ipynb` and select the `vllm_0.5.0` kernel.
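If the kernel does not show up in the launcher, one way to check whether it was registered is to look for its kernelspec on disk. The sketch below assumes the setup script installs into the per-user Jupyter location (`~/.local/share/jupyter/kernels`), which this guide does not state; `kernel_is_installed` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def kernel_is_installed(name, kernels_dir=None):
    """Return True if a kernelspec called `name` exists under `kernels_dir`.

    Assumption: when `kernels_dir` is None, the per-user Jupyter kernels
    directory on Linux is used. The setup script may install elsewhere.
    """
    if kernels_dir is None:
        base = Path.home() / ".local/share/jupyter/kernels"
    else:
        base = Path(kernels_dir)
    # A kernelspec is a directory containing a kernel.json file.
    return (base / name / "kernel.json").is_file()
```

Calling `kernel_is_installed("vllm_0.5.0")` from any Python session on the login node returns `False` if the kernelspec is missing from that location.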


# Starting Ray Cluster <a class="anchor" id="ray-cluster-setup"></a>

Start the Ray head on the login node first; the start script also launches Prometheus and Grafana for metrics. The head only coordinates communication between the workers and does not perform any ML training. Once the head is running, start Ray workers that connect to it and perform the work.

## Ray Head <a class="anchor" id="ray-cluster-head"></a>

Open a terminal in JupyterHub and run the Ray head start script (`--hf_token` and `--no_metrics` are optional):
```bash
./start_ray_head.sh [--hf_token <token>] [--no_metrics]
```


## Ray Workers <a class="anchor" id="ray-cluster-workers"></a>

Either run the workers script from within a Slurm job:
```bash
./start_ray_workers.sh <ray-head-node-address:port>
```

Or submit it via `sbatch`:
```bash
sbatch -A <account> [other Slurm args] start_ray_workers.sh <ray-head-node-address:port>
```
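A malformed head address is an easy mistake when launching workers. As a minimal sketch, a hypothetical helper `parse_head_address` (not part of the provided scripts) could validate the `<ray-head-node-address:port>` value before it is passed to `start_ray_workers.sh`:

```python
def parse_head_address(address: str) -> tuple[str, int]:
    """Split a 'host:port' string into (host, port).

    Raises ValueError if the host is missing or the port is not an
    integer in the range 1-65535.
    """
    # Split on the last colon so the port is always the final component.
    host, sep, port_text = address.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected '<host>:<port>', got {address!r}")
    if not port_text.isdigit():
        raise ValueError(f"port must be a number, got {port_text!r}")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port
```

For example, `parse_head_address("login35:6379")` yields the host name and port that appear in the connection log in the next section.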

## Connect to Ray Cluster <a class="anchor" id="ray-cluster-connect"></a>

```python
import ray

ray.init()
```

Output:

```
2024-07-06 00:52:06,746 INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2024-07-06 00:52:07,048 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: login35:6379...
2024-07-06 00:52:07,101 INFO worker.py:1744 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
```

| Field | Value |
| --- | --- |
| Python version | 3.10.12 |
| Ray version | 2.24.0 |
| Dashboard | http://127.0.0.1:8265 |
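When no running cluster can be found, a bare `ray.init()` starts a new local Ray instance instead, which can hide a head that failed to start. One hedged alternative is to pass an explicit address: with `address="auto"`, Ray connects only to an existing cluster and raises an error otherwise. The sketch below is an illustration, not the notebook's method; `resolve_ray_address` and `connect_to_cluster` are hypothetical names.

```python
import os


def resolve_ray_address(environ=None):
    """Return the address to hand to ray.init().

    Uses the RAY_ADDRESS environment variable when it is set and non-empty,
    otherwise "auto" (connect to an already running cluster).
    """
    env = os.environ if environ is None else environ
    return env.get("RAY_ADDRESS") or "auto"


def connect_to_cluster():
    """Connect to the existing Ray cluster instead of starting a local one."""
    import ray  # imported lazily so the helper above works without Ray installed

    # ignore_reinit_error lets the notebook cell be re-run without raising.
    return ray.init(address=resolve_ray_address(), ignore_reinit_error=True)
```

In the notebook, calling `connect_to_cluster()` in place of `ray.init()` should then fail with an error if the head is not reachable.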


```python
import os
import pprint

# Base URL of the JupyterHub proxy for the current user's login-node server
proxy_base = f"https://jupyter.nersc.gov/user/{os.getlogin()}/perlmutter-login-node-base/proxy"

# Merge the dashboard links into the cluster resource summary for display
pprint.pprint(
    ray.cluster_resources()
    | {
        "ray_dashboard": f"{proxy_base}/localhost:8265/",
        "grafana_dashboard": f"{proxy_base}/3000/d/rayDefaultDashboard",
    }
)
```

Output:

```
{'CPU': 128.0,
 'GPU': 4.0,
 'accelerator_type:A100': 1.0,
 'grafana_dashboard': 'https://jupyter.nersc.gov/user/asnaylor/perlmutter-login-node-base/proxy/3000/d/rayDefaultDashboard',
 'memory': 649518375937.0,
 'node:128.55.84.141': 1.0,
 'node:__internal_head__': 1.0,
 'node:login35': 1.0,
 'object_store_memory': 282650732543.0,
 'ray_dashboard': 'https://jupyter.nersc.gov/user/asnaylor/perlmutter-login-node-base/proxy/localhost:8265/'}
```
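The raw resource dictionary is easier to read once the memory figures are converted to GiB, and it can also be used to check that enough GPUs have joined before work is submitted. The sketch below assumes `memory` and `object_store_memory` are reported in bytes; `summarize_resources` and `has_enough_gpus` are hypothetical helpers, and the sample values are copied from the output above.

```python
GIB = 1024 ** 3  # bytes per GiB


def summarize_resources(resources):
    """Condense a ray.cluster_resources()-style dict into readable totals."""
    return {
        "cpus": int(resources.get("CPU", 0)),
        "gpus": int(resources.get("GPU", 0)),
        # Assumption: both memory fields are byte counts.
        "memory_gib": round(resources.get("memory", 0) / GIB, 1),
        "object_store_gib": round(resources.get("object_store_memory", 0) / GIB, 1),
    }


def has_enough_gpus(resources, required):
    """Return True once the cluster reports at least `required` GPUs."""
    return resources.get("GPU", 0) >= required


# Sample values taken from the cluster output shown in this guide.
sample = {
    "CPU": 128.0,
    "GPU": 4.0,
    "memory": 649518375937.0,
    "object_store_memory": 282650732543.0,
}
summary = summarize_resources(sample)
```

In a live session, `summarize_resources(ray.cluster_resources())` gives the same summary for the running cluster, and `has_enough_gpus` can be polled until the workers have connected.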


# Dataset <a class="anchor" id="dataset"></a>