# llama.cpp Inference Runs at ALCF from Remote Notebooks

This notebook demonstrates how to run llama.cpp to serve LLMs on Polaris at ALCF using [Globus Compute](https://funcx.readthedocs.io/en/latest/endpoints.html). In this example, we will authenticate using Globus Auth, set up a compute endpoint on Polaris, register a function with Globus Compute (formerly funcX) that uses a model served by llama.cpp, and then launch that function remotely in batch mode so that it runs on Polaris and returns the results to the user.

This notebook can be run from anywhere. It only requires a local installation of the Globus software (described below) and access to a Globus Compute endpoint, set up by the user on Polaris, that has access to llama.cpp (also described below).

This demo uses Globus Compute (Globus Flows can also be used if needed). Globus Compute is a remote executor for tasks expressed as Python functions, which are sent to remote machines following a fire-and-forget model.
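
The fire-and-forget model can be pictured with Python's standard `concurrent.futures` interface, which the Globus Compute `Executor` used later in this notebook is modeled on. The sketch below is a local stand-in only: it uses a `ThreadPoolExecutor` instead of a remote endpoint, and `slow_square` is a hypothetical task introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_square(x):
    """Hypothetical task standing in for remote work."""
    time.sleep(0.1)
    return x * x


# Local stand-in for a remote executor: submit() returns immediately
# with a future, and the work proceeds in the background.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_square, 7)
    # result() blocks until the task finishes (or the timeout expires).
    value = future.result(timeout=5)

print(value)
```

The caller holds only the future while the task runs elsewhere. With a remote endpoint the same pattern applies, except that the wait may include queue time on the cluster.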

In this notebook we will first describe the necessary setup tasks for the local environment and on Polaris. Second, we will describe how to create and test a Globus Compute function that can remotely run llama.cpp inference on Polaris compute nodes.

# Prerequisites
1. Allocation on [Polaris](https://accounts.alcf.anl.gov/#/home)
2. An account on [Globus](https://www.globus.org/)

## Local Setup

This notebook can be run from anywhere. The only requirement is a local environment, such as a conda environment, with Python 3.10 installed along with the Globus packages `globus_compute_sdk` and `globus_cli`. For example:

```bash
conda create -n llama-cpp-env python==3.10.12 -y
conda activate llama-cpp-env  
conda install jupyter
conda install chardet
pip install notebook globus_compute_sdk globus_cli
python -m ipykernel install --user --name llama-cpp-env --display-name "Python3.10-llama-cpp-env"
jupyter notebook
```
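
Once the kernel is running, a quick check of the interpreter version can catch a mismatched environment early. This is a minimal sketch: `python_versions_compatible` is a hypothetical helper, and treating "compatible" as "same major and minor version" is an assumption rather than a documented requirement.

```python
import sys


def python_versions_compatible(local, remote=(3, 10)):
    """Return True when major and minor versions match.

    `remote` defaults to the Python 3.10 used in this notebook's setup.
    """
    return tuple(local[:2]) == tuple(remote[:2])


# Check the interpreter running this notebook.
if not python_versions_compatible(sys.version_info):
    print("Warning: local Python is", sys.version.split()[0],
          "but the setup in this notebook assumes 3.10")
```
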
> **__Note:__** <br>
> Change the kernel in your notebook to point to the `llama-cpp-env` environment. <br/>
> The llama.cpp environment on Polaris uses Python 3.10, so the environment on your local machine needs a Python version close to it.

## Create a Globus Compute Endpoint on Polaris

The first step for a user to execute applications on Polaris through the Globus service is to create a Globus compute endpoint on Polaris.  <b> This requires the user to do a one-time setup task to configure the endpoint </b>.

In a shell separate from this notebook, log into Polaris.  Copy the files included with this notebook, [`vllm_template_config.yaml`](../vllm_template_config.yaml) and [`requirements.txt`](../requirements.txt), to the Polaris filesystem (the location does not matter).  Note that the commands below refer to the template as `llama_cpp_template_config.yaml`, so use whichever name matches your copy.  Inside the template you should see options setting your `project name`, your queue (default `debug`), and commands that activate a `conda environment` (as done below) on Polaris.

In your shell on Polaris, execute the following commands:

```bash
module use /soft/modulefiles; module load conda
conda create -p /home/openinference_svc/envs/llama-cpp-cuda-env python==3.10.12 -y
conda activate /home/openinference_svc/envs/llama-cpp-cuda-env
module load cudatoolkit-standalone/12.2.2
conda install cmake -y
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python3 -m pip install -r requirements.txt
pip install globus-compute-endpoint
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release -t server
globus-compute-endpoint configure --endpoint-config llama_cpp_template_config.yaml llama_cpp_endpoint
globus-compute-endpoint start llama_cpp_endpoint
globus-compute-endpoint list
```
This will create an endpoint and display its status, which should be listed as `running`.  A unique Endpoint ID in the form of a UUID will also be displayed.  Copy that ID and paste it below as a string assigned to `POLARIS_ENDPOINT_FOR_LLAMACPP_Llama38B`.

In [70]:
POLARIS_ENDPOINT_FOR_LLAMACPP_Llama38B = "77ee899c-3b8e-4c13-8ce6-3cbb7ebd80b5"
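
A pasted endpoint ID can be sanity-checked before it is used. The sketch below only confirms that the string parses as a UUID; it does not confirm that the endpoint exists or is running. `is_valid_endpoint_id` is a hypothetical helper, not part of the Globus SDK.

```python
import uuid


def is_valid_endpoint_id(value):
    """Return True if `value` parses as a UUID string."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


print(is_valid_endpoint_id("77ee899c-3b8e-4c13-8ce6-3cbb7ebd80b5"))
```

A truncated or mistyped ID fails this check immediately, which is cheaper than discovering the problem when a task is submitted.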

Your endpoint is now active as a daemon process running on the Polaris login node.  It is communicating with the Globus service and waiting for work.  If you ever want to stop the process you can run:
```bash
globus-compute-endpoint stop llama_cpp_endpoint
```
Your process may need to be restarted periodically, for example after Polaris comes back from a maintenance period.

If you ever need to make changes to your endpoint configuration, you can find the settings file in `~/.globus_compute/llama_cpp_endpoint/config.yaml`.  Edit this file and then restart the endpoint with `globus-compute-endpoint restart llama_cpp_endpoint` to make the changes active.

Since this demo is for learning purposes, this endpoint will submit work to the `debug` queue or any other queue you have access to.  In production, LLM users will be able to submit work to the [demand queue](https://docs.alcf.anl.gov/polaris/running-jobs/#queues), which will give immediate access to Polaris compute nodes.

## Create a Function

In [88]:
def llamacpp(prompt):
    import openai
    
    import socket
    import json
    import os
    
    # Determine the hostname
    hostname = socket.gethostname()
    os.environ['no_proxy'] = hostname
    # Construct the base_url
    base_url = f"http://{hostname}:8000/v1"
    
    # Initialize the OpenAI client with the base URL and API key
    client = openai.OpenAI(base_url=base_url, api_key="cxvff_xxxx")
    
    # Send a request to the chat completions endpoint
    response = client.chat.completions.create(
        model="Meta-Llama-3-8B-Instruct-Q8_0.gguf",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
        logprobs=True
    )
    
    
    # Convert the response object to a plain dictionary
    response_dict = response.to_dict()

    # Serialize to a JSON-formatted string so it can be returned to the caller
    json_response = json.dumps(response_dict, indent=4)
    print(json_response)

    # Access the content of the assistant's message
    assistant_content = response_dict['choices'][0]['message']['content']

    print("Assistant's Response Content:")
    print(assistant_content)
    return json_response
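
Because `llamacpp` returns a JSON string rather than a dictionary, the caller will usually want to parse it before use. The sketch below shows one way to pull out the assistant's reply. `extract_reply` is a hypothetical helper, and the sample payload is an abbreviated, made-up example with the same shape as a chat-completion response.

```python
import json


def extract_reply(json_response):
    """Return the assistant's text from a chat-completion JSON string."""
    data = json.loads(json_response)
    return data["choices"][0]["message"]["content"]


# Hypothetical, abbreviated payload with the same shape the function returns.
sample = json.dumps({
    "id": "chatcmpl-example",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": None,
            "message": {"role": "assistant", "content": "Hello from Polaris."},
        }
    ],
})

print(extract_reply(sample))
```
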

## Authenticate Client and Test Function

We will now instantiate a Globus Compute client to test the function.  Globus will prompt the user for their credentials if running for the first time.  The user should have a Globus account through their ALCF account and should validate with their ALCF credentials.

In [89]:
# Creating Globus Compute client
# Import packages
from globus_compute_sdk import Client, Executor
import time
gc = Client()
polaris_endpoint_id = POLARIS_ENDPOINT_FOR_LLAMACPP_Llama38B
gce = Executor(endpoint_id=polaris_endpoint_id)

In [90]:
future = gce.submit(llamacpp, prompt="What are the proteins that interact with RAD51?")
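
To send several prompts at once, one future can be kept per prompt and the results collected as they complete. The sketch below uses a local `ThreadPoolExecutor` and a hypothetical `fake_llm` function as stand-ins so that it runs without an endpoint. The same pattern should carry over to `gce.submit(llamacpp, prompt=...)`, though that is an assumption worth verifying against your endpoint.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_llm(prompt):
    """Hypothetical stand-in for the remote `llamacpp` function."""
    if not prompt:
        raise ValueError("empty prompt")
    return f"echo: {prompt}"


prompts = ["What is RAD51?", "", "What is BRCA2?"]
results = {}

# ThreadPoolExecutor is a local stand-in for the Globus Compute Executor.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(fake_llm, p): p for p in prompts}
    for fut in as_completed(futures):
        prompt = futures[fut]
        try:
            results[prompt] = fut.result()
        except Exception as exc:  # a failed task re-raises its error here
            results[prompt] = f"failed: {exc}"

for prompt in prompts:
    print(repr(prompt), "->", results[prompt])
```

Catching exceptions per future keeps one failed prompt from discarding the results of the others.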

In [91]:
import pprint
pprint.pprint(future.result())

('{\n'
 '    "id": "chatcmpl-65a0be86-53cc-458b-a3fb-bf16144c241c",\n'
 '    "choices": [\n'
 '        {\n'
 '            "finish_reason": "stop",\n'
 '            "index": 0,\n'
 '            "logprobs": null,\n'
 '            "message": {\n'
 '                "content": "RAD51 is a key protein involved in homologous '
 'recombination, a mechanism of DNA repair and genetic recombination. It '
 'interacts with several other proteins to facilitate its function. Some of '
 'the proteins that interact with RAD51 include:\\n\\n1. BRCA2: A tumor '
 'suppressor protein that binds to RAD51 and helps to stabilize it at the site '
 'of DNA damage.\\n2. PALB2: A protein that interacts with BRCA2 and RAD51, '
 'facilitating their interaction and promoting homologous recombination.\\n3. '
 'NBS1: A protein involved in the MRE11-RAD50-NBS1 complex, which is '
 'responsible for detecting DNA double-strand breaks and recruiting RAD51 to '
 'the damage site.\\n4. MRE11: A nuclease that interacts with 