# Deploy, configure, and serve LLMs 

This guide uses [RayLLM](https://docs.anyscale.com/llms/serving/intro), an Anyscale library for serving LLMs on Anyscale.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 1:</b> Overview of RayLLM</li>
    <li><b>Part 2:</b> Generating a RayLLM Configuration</li>
    <li><b>Part 3:</b> Running a RayLLM application</li>
    <li><b>Part 4:</b> Querying our RayLLM application</li>
</ul>
</div>


## Imports

In [None]:
import os
from typing import Optional

import anyscale
import openai
import ray
from ray import serve

In [None]:
ctx = ray.data.DataContext.get_current()
ctx.enable_operator_progress_bars = False
ctx.enable_progress_bars = False

## 1. Overview of RayLLM
RayLLM provides a number of features that simplify LLM development, including:
- An extensive suite of pre-configured open source LLMs.
- An OpenAI-compatible REST API.

It also provides operational features for scaling LLM apps efficiently:
- Optimizations such as continuous batching, quantization and streaming.
- Production-grade autoscaling support, including scale-to-zero.
- Native multi-GPU & multi-node model deployments.

To learn more about RayLLM, check out [the docs](https://docs.anyscale.com/llms/serving/intro).

For a full guide on how to deploy LLMs, check out this [workspace template](https://docs.anyscale.com/examples/deploy-llms/).

## 2. Generating a RayLLM Configuration

The first step is to generate a YAML-based configuration for our RayLLM application. Before doing so, let's copy our Hugging Face token to a local file.

In [None]:
!aws s3 cp s3://anyscale-ray-summit-training-2024/.HF_TOKEN ~/default/.HF_TOKEN --region us-west-2

To generate the configuration, we run the following interactive command in a terminal window. The prompts below are similar to what you will see:

```bash
(base) ray@ip-10-0-4-24:~/default/ray-summit-2024-training/End_to_End_LLMs/bonus$ rayllm gen-config
We have provided the defaults for the following models:
meta-llama/Llama-2-7b-chat-hf
meta-llama/Llama-2-13b-chat-hf
meta-llama/Llama-2-70b-chat-hf
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-70B-Instruct
meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Meta-Llama-3.1-70B-Instruct
mistralai/Mistral-7B-Instruct-v0.1
mistralai/Mixtral-8x7B-Instruct-v0.1
mistralai/Mixtral-8x22B-Instruct-v0.1
google/gemma-7b-it
llava-hf/llava-v1.6-mistral-7b-hf
Please enter the model ID you would like to serve, or enter your own custom model ID: mistralai/Mistral-7B-Instruct-v0.1
GPU type [L4/A10/A100_40G/A100_80G/H100]: A10
Tensor parallelism (1): 1
Enable LoRA serving [y/n] (n): y
LoRA weights storage URI. If not provided, the default will be used. 
(s3://anyscale-production-data-cld-91sl4yby42b2ivfp1inig5suuy/org_uhhav3lw5hg4risfz57ct1tg9s/cld_91sl4yby42b2ivfp1inig5suuy/artifact_storage/lora_fine_tuning): 
Maximum number of LoRA models per replica (16): 
Further customize the auto-scaling config [y/n] (n): n
Enable token authentication?
Note: Auth-enabled services require manual addition to playground. [y/n] (n): y

Your serve configuration file is successfully written to ./serve_20240907010212.yaml

Do you want to start up the server locally? [y/n] (y): y
Run the serving command in the background: [y/n] (y): y
Running: serve run ./serve_20240907010212.yaml --non-blocking
```
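
Each run of `rayllm gen-config` writes a new timestamped file, so it can be handy to locate the most recent one in code. The following is a minimal sketch, assuming the generated files follow the `serve_<timestamp>.yaml` naming seen in the output above. `latest_serve_config` is a hypothetical helper, not part of RayLLM.

```python
from pathlib import Path
from typing import Optional


def latest_serve_config(directory: str = ".") -> Optional[Path]:
    """Return the newest serve_*.yaml in `directory`, or None if there is none.

    Assumption: the timestamp in the file name is fixed-width
    (e.g. serve_20240907010212.yaml), so sorting by name sorts by time.
    """
    candidates = sorted(Path(directory).glob("serve_*.yaml"))
    return candidates[-1] if candidates else None
```

Calling `latest_serve_config()` in the directory where the command ran would return the path to pass to `serve run`.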


## 3. Running a RayLLM application

In the final step of the interactive command above, we chose to start the server locally, which executed:

```bash
serve run ./serve_20240907010212.yaml --non-blocking
```

We can validate that our application is indeed running by checking the Ray Serve dashboard.

It should now look like this:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/e2e-llms/deploy_llm_v2.jpg" width=800>
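
The dashboard check can be complemented by a small readiness loop in code. The sketch below is illustrative only: `wait_until_ready` is a hypothetical helper that retries any probe callable you supply until it stops raising or a timeout elapses. It does not call Ray or RayLLM itself.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], object],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` repeatedly until it succeeds or `timeout_s` elapses.

    `probe` is any zero-argument callable that raises while the
    application is not reachable yet (hypothetical, supplied by you).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            probe()
            return True
        except Exception:
            # Give up once the deadline has passed, otherwise retry.
            if time.monotonic() >= deadline:
                return False
            sleep(interval_s)
```

A probe could be, for instance, a function that sends one cheap request to the local endpoint at `http://localhost:8000`.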

## 4. Querying our LLM application

Let's first build a client to query our LLM.

In [None]:
def build_client(base_url: str, api_key: str) -> openai.OpenAI:
    return openai.OpenAI(
        base_url=base_url.rstrip("/") + "/v1",
        api_key=api_key,
    )

In [None]:
client = build_client("http://localhost:8000", "NOT A REAL KEY")

Next, we build a query function to send requests to our LLM application.

In [None]:
def query(
    client: openai.OpenAI,
    llm_model: str,
    system_message: dict[str, str],
    user_message: dict[str, str],
    temperature: float = 0,
    timeout: float = 3 * 60,
) -> Optional[str]:
    model_response = client.chat.completions.create(
        model=llm_model,
        messages=[system_message, user_message],
        temperature=temperature,
        timeout=timeout,
    )
    model_output = model_response.choices[0].message.content
    return model_output
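
RayLLM lists streaming among its features, and the OpenAI Python client exposes it through the `stream=True` argument of `chat.completions.create`. Below is a minimal sketch of a streaming counterpart to `query`. `stream_query` is a hypothetical helper, and it assumes the server returns standard OpenAI-style chunks.

```python
from typing import Any, Iterator


def stream_query(
    client: Any,  # an openai.OpenAI client, or any object with the same interface
    llm_model: str,
    messages: list[dict[str, str]],
    temperature: float = 0.0,
) -> Iterator[str]:
    """Yield the response text piece by piece as the server produces it."""
    stream = client.chat.completions.create(
        model=llm_model,
        messages=messages,
        temperature=temperature,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (e.g. role-only or final chunks).
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece
```

Iterating over `stream_query(client, llm_model, [system_message, user_message])` and printing each piece with `end=""` shows the answer as it is generated.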

<b style="background-color: yellow;">&nbsp;🔄 REPLACE&nbsp;</b>: Use the job ID of your fine-tuning run

In [None]:
model_info = anyscale.llm.model.get(job_id="prodjob_123") # REPLACE with the job ID for your fine-tuning run

Let's extract the base model ID and the model ID from the model info.

In [None]:
base_model = model_info.base_model_id
finetuned_model_id = model_info.id
finetuned_model_id

<div class="alert alert-block alert-info">

<b>Backup:</b> In case you don't have access to a successful fine-tuning job, you can copy pre-built artifacts using the following code:

```python
base_model = "mistralai/Mistral-7B-Instruct-v0.1"
finetuned_model_id = "mistralai/Mistral-7B-Instruct-v0.1:aitra:qzoyg"
# Destination for the LoRA weights inside this cloud's artifact storage
s3_lora_path = (
    f"{os.environ['ANYSCALE_ARTIFACT_STORAGE']}"
    f"/lora_fine_tuning/{finetuned_model_id}"
)
!aws s3 sync s3://anyscale-public-materials/llm-finetuning/lora_fine_tuning/{finetuned_model_id} {s3_lora_path}
```

</div>
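
In the backup example, the fine-tuned model ID is the base model ID followed by a colon-separated suffix. The sketch below splits such an ID into its two parts. It assumes that pattern holds in general, which this notebook does not confirm, and `split_finetuned_model_id` is a hypothetical helper rather than an Anyscale API.

```python
def split_finetuned_model_id(model_id: str) -> tuple[str, str]:
    """Split '<base model ID>:<adapter suffix>' at the first colon.

    Assumption: the base model ID itself contains no colon, as in
    'mistralai/Mistral-7B-Instruct-v0.1:aitra:qzoyg'.
    """
    base, separator, adapter = model_id.partition(":")
    if not separator or not adapter:
        raise ValueError(f"no adapter suffix found in {model_id!r}")
    return base, adapter
```

When a real fine-tuning job is available, prefer `model_info.base_model_id` over string parsing.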

Let's first test our base model.

In [None]:
query(
    client=client,
    llm_model=base_model,
    system_message={"content": "you are a helpful assistant", "role": "system"},
    user_message={"content": "Hello there", "role": "user"},
)

Let's now query our finetuned LLM using the generated model ID.

In [None]:
query(
    client=client,
    llm_model=finetuned_model_id,
    system_message={"content": "you are a helpful assistant", "role": "system"},
    user_message={"content": "Hello there", "role": "user"},
)

<b style="background-color: orange;">&nbsp;💡 INSIGHT&nbsp;</b>: Ray Serve and Anyscale support [serving multiple LoRA adapters](https://github.com/anyscale/templates/blob/main/templates/endpoints_v2/examples/lora/DeployLora.ipynb) with a common base model in the same request batch, which lets you serve a wide variety of use cases without increasing hardware spend. In addition, we use Serve multiplexing to reduce the number of LoRA adapter swaps. Serving a LoRA model adds a slight latency overhead compared to the base model, typically 10-20%.
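
To see what the LoRA overhead looks like on your own deployment, you could time a few requests against each model. This is a rough sketch rather than a rigorous benchmark: `median_latency` and `overhead_percent` are hypothetical helpers, and the numbers will vary with prompt length, load, and adapter caching.

```python
import statistics
import time
from typing import Callable


def median_latency(
    call: Callable[[], object],
    repeats: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `call` `repeats` times and return the median wall-clock duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = clock()
        call()
        samples.append(clock() - start)
    return statistics.median(samples)


def overhead_percent(base_seconds: float, lora_seconds: float) -> float:
    """Relative slowdown of the LoRA model versus the base model, in percent."""
    return (lora_seconds - base_seconds) / base_seconds * 100.0
```

Here `call` would wrap a `query(...)` invocation for either `base_model` or `finetuned_model_id`, for example with a `lambda`.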


Let's test this on our VIGGO dataset by reading in a sample conversation.

In [None]:
test_sample = (
    ray.data.read_json(
        "s3://anyscale-public-materials/llm-finetuning/viggo_inverted/test/data.jsonl"
    )
    .to_pandas()["messages"]
    .tolist()
)
test_conversation = test_sample[0]
test_conversation

We can check the response from our base model.

In [None]:
response_base_model = query(
    client=client,
    llm_model=base_model,
    system_message=test_conversation[0],
    user_message=test_conversation[1]
)
print(response_base_model)

Let's check if our finetuned model will provide a response with the format that we expect.

In [None]:
response_finetuned_model = query(
    client=client,
    llm_model=finetuned_model_id,
    system_message=test_conversation[0],
    user_message=test_conversation[1]
)

print(response_finetuned_model)

As expected, the finetuned model provides a more accurate and relevant response. We can compare it with the expected response from the dataset:

In [None]:
expected_response = test_conversation[-1]
expected_response["content"]
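
Comparing responses by eye gets tedious, so a structural comparison can help. The sketch below parses a response into a function name and a dictionary of attributes, assuming the `function(attribute[value], ...)` shape shown in this notebook's few-shot examples. `parse_viggo` is a hypothetical helper and makes no claim to cover every case in the dataset.

```python
import re

# Outer call: function_name( ... ), matched up to the last closing parenthesis.
_CALL = re.compile(r"\s*(\w+)\((.*)\)\s*$", re.DOTALL)
# Inner slots: attribute[value], where the value may contain commas or parentheses.
_SLOT = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_viggo(text: str) -> tuple[str, dict[str, str]]:
    """Parse 'inform(name[X], release_year[Y])' into ('inform', {...}).

    Raises ValueError for free-form text that is not in this shape.
    If an attribute appears twice, the last occurrence wins.
    """
    match = _CALL.match(text)
    if match is None:
        raise ValueError(f"not a function-style representation: {text!r}")
    function_name, body = match.groups()
    slots = {name: value.strip() for name, value in _SLOT.findall(body)}
    return function_name, slots
```

With this, `parse_viggo(response_finetuned_model) == parse_viggo(expected_response["content"])` checks for an exact structural match, while a free-text answer from the base model would raise `ValueError`.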

<div class="alert alert-block alert-info">

### Activity: Query the model with few-shot learning

Confirm that few-shot learning does indeed help our base model by augmenting the prompt with examples.

```python
system_message = test_conversation[0]
user_message = test_conversation[1]

examples = """
Here is the target sentence:
Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.
Output: inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])

Here is the target sentence:
Dirt: Showdown is a sport racing game that was released in 2012. The game is available on PlayStation, Xbox, and PC, and it has an ESRB Rating of E 10+ (for Everyone 10 and Older). However, it is not yet available as a Steam, Linux, or Mac release.
Output: inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])
"""

user_message = {
    "role": "user",
    "content": ...  # Hint: update the user message content to include the examples
}

# Run the query
query(
    client=client,
    llm_model=base_model,
    system_message=system_message,
    user_message=user_message
)
```


</div>

In [None]:
# Write your solution here


<div class="alert alert-block alert-info">

<details>

<summary> Click here to see the solution </summary>

```python
system_message = test_conversation[0]
user_message = test_conversation[1]

examples = """
Here is the target sentence:
Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.
Output: inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])

Here is the target sentence:
Dirt: Showdown is a sport racing game that was released in 2012. The game is available on PlayStation, Xbox, and PC, and it has an ESRB Rating of E 10+ (for Everyone 10 and Older). However, it is not yet available as a Steam, Linux, or Mac release.
Output: inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])
"""

user_message_with_examples = {
    "role": "user",
    "content": (
f"""
Here are examples of the target output:
{examples}

Now please provide the output for:
Here is the target sentence:
{user_message["content"]}
Output: 
"""
)
}


# Run the query
query(
    client=client,
    llm_model=base_model,
    system_message=system_message,
    user_message=user_message_with_examples
)
```

</details>
<br/>

</div>

Let's clean up and shut down our RayLLM application.

In [None]:
serve.shutdown()

## Bonus: Deploying as an Anyscale Service

In case you want to productionize your LLM app, you can deploy it as an Anyscale Service. 

To do so, you can use the Anyscale CLI to deploy your application.

```bash
anyscale service deploy -f ./serve_20240907010212.yaml
```

You can then query your application using the same `query` function we defined earlier. This time, however, the client points to the Anyscale endpoint and the API key is the generated authentication token.

```python
client = build_client("https://<your-endpoint>.serve.anyscale.com/", "<your-auth-token>")
```
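
To keep the authentication token out of the notebook source, one option is to read the endpoint and token from environment variables. This is a minimal sketch: the variable names `ANYSCALE_SERVICE_URL` and `ANYSCALE_SERVICE_TOKEN` are made up for illustration and are not set by Anyscale, and `service_credentials` is a hypothetical helper.

```python
import os
from typing import Mapping

# Hypothetical variable names; export them yourself before running.
URL_VAR = "ANYSCALE_SERVICE_URL"
TOKEN_VAR = "ANYSCALE_SERVICE_TOKEN"


def service_credentials(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (endpoint URL, auth token), failing loudly if either is unset."""
    missing = [name for name in (URL_VAR, TOKEN_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env[URL_VAR], env[TOKEN_VAR]
```

The returned pair can be passed straight to the notebook's helper, as in `build_client(*service_credentials())`.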