# Running vLLM Inference at ALCF from Remote Notebooks

This notebook demonstrates how to serve large language models (LLMs) with vLLM on Polaris at ALCF using [Globus Compute](https://funcx.readthedocs.io/en/latest/endpoints.html).  In this example, we will authenticate using Globus Auth, set up a compute endpoint on Polaris, register a function that serves models using vLLM with Globus Compute (formerly funcX), and then launch that function remotely in batch mode so that it runs on Polaris and returns the results to the user.

This notebook can be run from anywhere.  It only requires a local installation of the Globus software (described below) and access to a Globus Compute endpoint that the user has set up on Polaris with access to vLLM (also described below).

This demo uses Globus Compute (Globus Flows can also be used if needed). Globus Compute is a remote executor for tasks expressed as Python functions, which are sent to remote machines following a fire-and-forget model.
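
The submit-then-collect pattern behind this model can be sketched with Python's standard `concurrent.futures` module, whose `submit`/`result` interface resembles the one used by the Globus Compute `Executor` later in this notebook. This is only an illustration: the thread pool is a local stand-in for the remote executor, and `slow_square` is a hypothetical task.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x: int) -> int:
    """Hypothetical stand-in for a task that would run remotely."""
    return x * x


# ThreadPoolExecutor is a local stand-in for the remote executor (assumption
# made for illustration; nothing is sent to Polaris here).
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_square, 7)  # returns a future without waiting
    result = future.result(timeout=30)    # blocks until the task has finished

print(result)
```

The caller hands off the function and its arguments, keeps only the future, and asks for the result when it is needed.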

In this notebook, we first describe the setup tasks required in the local environment and on Polaris. We then describe how to create and test a Globus Compute function that remotely launches vLLM on Polaris compute nodes.

# Prerequisites
1. Allocation on [Polaris](https://accounts.alcf.anl.gov/#/home)
2. Access to [Globus](https://www.globus.org/)

## Local Setup

This notebook can be run from anywhere.  The only requirement is a local environment, such as a conda environment or a Python virtual environment, that has Python 3.10 installed along with the Globus packages `globus_compute_sdk` and `globus_cli`.  For example:

```bash
python3.10 -m venv vllm-globus-env
source vllm-globus-env/bin/activate
pip install notebook globus_compute_sdk globus_cli
python -m ipykernel install --user --name vllm-env --display-name "Python3.10-vllm-env"
jupyter notebook
```
> **__Note:__** <br>
> In your notebook, change the kernel to the one registered above (`Python3.10-vllm-env`). <br/>
> The vLLM environment on Polaris also uses Python 3.10, so the Python version in your local environment should match it as closely as possible.
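
A quick way to confirm the local interpreter is the check below, a minimal sketch that compares only the major and minor version numbers. The helper name `same_minor_version` is hypothetical, and the `(3, 10)` default simply mirrors the version stated above.

```python
import sys


def same_minor_version(local: tuple, remote: tuple = (3, 10)) -> bool:
    """Return True when the major.minor parts of both versions agree."""
    return tuple(local[:2]) == tuple(remote[:2])


# Version of the interpreter running this notebook kernel.
local_version = tuple(sys.version_info[:3])
print(local_version, same_minor_version(local_version))
```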

## Create a Globus Compute Endpoint on Polaris

The first step for a user to execute applications on Polaris through the Globus service is to create a Globus compute endpoint on Polaris.  <b> This requires the user to do a one-time setup task to configure the endpoint </b>.

In a shell separate from this notebook, log into Polaris.  Copy the files included with this notebook, `vllm_template_config.yaml` and `requirements.txt`, to the Polaris filesystem (the location doesn't matter).  Inside `vllm_template_config.yaml` you should see options that set your `<project name>`, your queue (default `debug`), and the commands that activate a conda environment on Polaris.

In your shell on Polaris, execute the following commands:

```bash
module load conda
conda create -p /eagle/<project_name>/env/vllm_env python=3.10 -y
conda activate /eagle/<project_name>/env/vllm_env
pip install -r requirements.txt
globus-compute-endpoint configure --endpoint-config vllm_template_config.yaml vllm_endpoint
globus-compute-endpoint start vllm_endpoint
globus-compute-endpoint list
```
These commands create the endpoint, start it, and display its status, which should be listed as `running`.  The listing also shows a unique Endpoint ID in the form of a UUID.  Copy that ID and paste it below as a string assigned to `POLARIS_ENDPOINT_FOR_VLLM`.

In [1]:
POLARIS_ENDPOINT_FOR_VLLM = "1debb802-53d2-4ccc-ad7c-378b101bcd6c"
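
Because the ID is pasted by hand, it can be worth checking that it is a well-formed UUID before submitting any work. The sketch below uses only the standard `uuid` module; the helper name `is_valid_endpoint_id` is hypothetical, and the check says nothing about whether the endpoint actually exists or is running.

```python
import uuid


def is_valid_endpoint_id(value: str) -> bool:
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


# The example ID from the cell above; replace it with your own endpoint ID.
example_id = "1debb802-53d2-4ccc-ad7c-378b101bcd6c"
print(is_valid_endpoint_id(example_id))
```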

Your endpoint is now active as a daemon process running on the Polaris login node.  It is communicating with the Globus service and waiting for work.  If you ever want to stop the process you can run:
```bash
globus-compute-endpoint stop vllm_endpoint
```
Your process may need to be restarted periodically, for example after Polaris comes back from a maintenance period.

If you ever need to make changes to your endpoint configuration, you can find the settings file in `~/.globus_compute/vllm_endpoint/config.yaml`.  Edit this file and then restart the endpoint with `globus-compute-endpoint restart vllm_endpoint` to make the changes active.

Because this demo is for learning purposes, the endpoint submits work to the `debug` queue (or any other queue you have access to).  In production, LLM users will be able to submit work to the [demand queue](https://docs.alcf.anl.gov/polaris/running-jobs/#queues), which will give immediate access to Polaris compute nodes.

## Create a Function

We first need to create a Python function that wraps the application call.  We will call it `inference_vllm_polaris`. Inside the function, make sure you change the cache directories to a project folder you have access to:
```python
os.environ['HF_DATASETS_CACHE'] = '/eagle/datascience/atanikanti/vllm/.cache'
os.environ['TRANSFORMERS_CACHE'] = '/eagle/datascience/atanikanti/vllm/.cache'
os.environ['HF_HOME'] = "/eagle/datascience/atanikanti/vllm/.cache"
```
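
Since all three variables point to the same directory, they can be set from a single path. The helper below is a minimal sketch with a hypothetical name (`set_hf_cache`) and a placeholder path. Because the function body is what runs on Polaris, such a helper would presumably need to be defined inside `inference_vllm_polaris`, before `vllm` is imported.

```python
import os


def set_hf_cache(cache_dir: str) -> dict:
    """Point the three Hugging Face cache variables at one directory."""
    names = ("HF_DATASETS_CACHE", "TRANSFORMERS_CACHE", "HF_HOME")
    for name in names:
        os.environ[name] = cache_dir
    # Return what was set so the caller can log or inspect it.
    return {name: os.environ[name] for name in names}


# Placeholder path for illustration; use a project folder you can write to.
settings = set_hf_cache("/tmp/example_hf_cache")
print(settings)
```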

In [12]:
# Define Globus Compute function
def inference_vllm_polaris(
    max_tokens: int = 1024,
    temperature: float = 0.8,
    model_name: str = 'meta-llama/Llama-2-70b-chat-hf',
    tokenizer: str = 'hf-internal-testing/llama-tokenizer',
    prompt: str = None,
    tensor_parallel_size: int = 4
) -> str:
    """
    Run vLLM inference on the Polaris supercomputer and report the generated outputs with timing statistics.

    Arguments
    ---------
        max_tokens (int): Maximum number of tokens to generate
        temperature (float): Sampling temperature
        model_name (str): Name of the model
        tokenizer (str): Name of the tokenizer
        prompt (str): Prompt to generate from; if omitted, a set of default prompts is used
        tensor_parallel_size (int): Tensor parallel size, i.e. the number of GPUs used for inference

    Returns
    -------
        str: JSON-encoded string with the latency, tokens per second, and per-prompt statistics
    """

    # Import packages
    import os
    os.environ['HF_DATASETS_CACHE'] = '/eagle/datascience/atanikanti/vllm/.cache' #CHANGE THIS
    os.environ['TRANSFORMERS_CACHE'] = '/eagle/datascience/atanikanti/vllm/.cache' #CHANGE THIS
    os.environ['HF_HOME'] = "/eagle/datascience/atanikanti/vllm/.cache" #CHANGE THIS
    from vllm import LLM, SamplingParams
    import ray
    ray.shutdown()
    ray.init(_temp_dir='/tmp')
    import time
    import json

    # Log the run parameters and start the timer (the timing includes model loading)
    print(f"max_tokens: {max_tokens}, temperature: {temperature}, model_name: {model_name}, tokenizer: {tokenizer}, prompt: {prompt}, tensor_parallel_size: {tensor_parallel_size}")
    start_time = time.time()

    params: SamplingParams = SamplingParams(max_tokens=max_tokens, temperature=temperature)
    lm: LLM = LLM(
        model=model_name, tokenizer=tokenizer, tensor_parallel_size=tensor_parallel_size
    )
    if not prompt:
        prompts = [
            "Hello, my name is",
            "The president of the United States is",
            "The capital of France is",
            "The future of AI is",
        ]
        outputs = lm.generate(prompts, params)
    else:
        outputs = lm.generate([prompt], params)
    stats_for_each = []
    total_num_of_tokens = 0
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        num_tokens = len(output.outputs[0].token_ids)
        total_num_of_tokens = total_num_of_tokens + num_tokens
        stats_for_each.append({
            "num_tokens": num_tokens,
            "prompt": prompt,
            "generated_text": generated_text})
    latency = time.time() - start_time
    tokens_per_second = total_num_of_tokens / latency
    stats = {
        "latency": f"{latency:.2f} sec",
        "tokens_per_second": f"{tokens_per_second:.2f} tokens/sec",
        "stats": stats_for_each
    }
    result = json.dumps(stats)
    print(result)
    return result
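
The function returns a JSON string, so the caller has to decode it before working with the statistics. The sketch below shows one way to do that. The `summarize` helper is hypothetical, and `sample_result` is made-up data that only mirrors the structure built in the function above; it is not real output from Polaris.

```python
import json

# Made-up sample that follows the structure of the function's return value.
sample_result = json.dumps({
    "latency": "12.50 sec",
    "stats": [
        {"num_tokens": 1024, "prompt": "Hello, my name is", "generated_text": "..."},
    ],
})


def summarize(result_json: str) -> dict:
    """Decode the JSON string and pull out a few aggregate numbers."""
    stats = json.loads(result_json)
    latency_s = float(stats["latency"].split()[0])  # "12.50 sec" -> 12.5
    total_tokens = sum(item["num_tokens"] for item in stats["stats"])
    return {
        "latency_s": latency_s,
        "total_tokens": total_tokens,
        "num_prompts": len(stats["stats"]),
    }


summary = summarize(sample_result)
print(summary)
```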

## Authenticate Client and Test Function

We will now instantiate a Globus Compute client to test the function.  Globus will prompt the user for their credentials if running for the first time.  The user should have a Globus account through their ALCF account and should validate with their ALCF credentials.

In [13]:
# Creating Globus Compute client
# Import packages
from globus_compute_sdk import Client, Executor
import time
gc = Client()
polaris_endpoint_id = POLARIS_ENDPOINT_FOR_VLLM
gce = Executor(endpoint_id=polaris_endpoint_id)

In [14]:
future = gce.submit(inference_vllm_polaris, temperature=0.9, model_name='openlm-research/open_llama_13b')
print(future.result())



## Register Function (Optional)

Now that the function has been tested and works, register the function with the Globus service.  This will allow the user to call the function from within a flow.

In [5]:
inference_func_id = gc.register_function(inference_vllm_polaris)

print(inference_func_id)

1624aa05-0290-4028-b7da-c16b5a687fd2


In [6]:
future = gce.submit_to_registered_function(
    function_id=inference_func_id,
    kwargs={"temperature": 0.9, "model_name": "openlm-research/open_llama_13b"},
)
future.result()

