# Batch LLM inference using vLLM and Ray Data

This notebook walks through an example of using Ray Data's `map_batches` to run batch inference with an LLM.

All dependencies are pre-installed on this cluster. Let's move on to some imports.

In [1]:
from vllm import LLM, SamplingParams
from typing import Dict
import numpy as np
import ray

Set up the defaults used throughout the notebook, including:
* Your [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens). The token is used to download the model and is passed to the cluster as the `HF_TOKEN` environment variable. This environment variable can also be set using a [Cluster Environment](https://docs.anyscale.com/configure/dependency-management/cluster-environments).
* Which model to load
* The sampling params object used by vLLM.

In [2]:
# Set the Hugging Face token. Replace the following with your token.
hf_token = "<REPLACE_WITH_YOUR_HUGGING_FACE_USER_TOKEN>"
# Set the model that you wish to use. Note that the Llama models require a Hugging Face token to be set.
hf_model = "meta-llama/Llama-2-7b-chat-hf"
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0, max_tokens=4096)

In [None]:
# Start up Ray, using the hf token as an environment variable so that it's available to all nodes in the cluster
if ray.is_initialized():
    ray.shutdown()
ray.init(runtime_env={"env_vars":{"HF_TOKEN": hf_token}})
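
Hardcoding the token in a notebook makes it easy to leak. As a minimal sketch, assuming the token has already been exported as `HF_TOKEN` in the environment where the notebook runs, the `runtime_env` could be built from that variable instead. The helper name `build_runtime_env` is hypothetical and not part of Ray.

```python
import os
from typing import Dict, Mapping, Optional


def build_runtime_env(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Build a runtime_env dict that forwards HF_TOKEN to the cluster.

    Hypothetical helper for illustration only; not part of Ray.
    """
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN")
    if not token:
        # Fail early rather than starting workers that cannot download the model.
        raise RuntimeError("HF_TOKEN is not set; export it before starting Ray.")
    return {"env_vars": {"HF_TOKEN": token}}


# Usage, in place of the hardcoded token:
# ray.init(runtime_env=build_runtime_env())
```

Accepting an optional mapping keeps the helper easy to check without touching the real environment.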

Create some sample prompts, and use Ray Data to create a dataset from them.

In [11]:
prompts = [
"""
I always wanted to be a ...
""",
"""
The best way to learn a new language is ...
""",
"""
The biggest challenge facing our society today is ...
""",
"""
One thing I would change about my past is ...
""",
"""
The key to a happy life is ...
""",
]
ds = ray.data.from_items(prompts)

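NOTE: This notebook uses the local text data defined above, but it can easily be extended to read input data stored in cloud storage. To read text files from cloud storage (e.g., AWS S3), create the dataset as shown here. Note that `read_text` stores each line in a column named `text` rather than `item`, so the predictor would need to read `batch["text"]` instead of `batch["item"]`: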

```python
ds = ray.data.read_text("s3://anonymous@air-example-data/prompts.txt")
```

Create a class that defines the batch inference logic.

In [5]:
class LLMPredictor:
    def __init__(self):
        # Create an LLM.
        self.llm = LLM(model=hf_model)

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # Generate texts from the prompts.
        # The output is a list of RequestOutput objects that contain the prompt,
        # generated text, and other information.
        outputs = self.llm.generate(batch["item"].tolist(), sampling_params)
        prompt = []
        generated_text = []
        for output in outputs:
            prompt.append(output.prompt)
            generated_text.append(' '.join([o.text for o in output.outputs]))
        return {
            "prompt": prompt,
            "generated_text": generated_text,
        }
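
The class follows a load-once, call-per-batch pattern: the model is created in `__init__`, and `__call__` handles one batch at a time. The sketch below shows the same batch-in, batch-out contract without a GPU by swapping the vLLM model for a hypothetical `EchoModel` stub. Plain lists stand in for the NumPy arrays that Ray Data passes, and `StubPredictor` is an illustrative name, not part of the notebook.

```python
from typing import Dict, List


class EchoModel:
    """Hypothetical stand-in for vllm.LLM; it upper-cases each prompt."""

    def generate(self, prompts: List[str]) -> List[str]:
        return [p.strip().upper() for p in prompts]


class StubPredictor:
    def __init__(self) -> None:
        # Expensive setup (model loading) happens once, not once per batch.
        self.model = EchoModel()
        self.batches_seen = 0

    def __call__(self, batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
        # A batch is a dict mapping column names to column values.
        self.batches_seen += 1
        prompts = list(batch["item"])
        generated = self.model.generate(prompts)
        # Each output column has one entry per input prompt.
        return {"prompt": prompts, "generated_text": generated}


predictor = StubPredictor()
result = predictor({"item": ["The key to a happy life is ..."]})
print(result["generated_text"])
```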

Apply batch inference to all input data using `map_batches`.

In [12]:
ds = ds.map_batches(
    LLMPredictor,
    # Set the concurrency to the number of LLM instances.
    concurrency=1,
    # Specify the number of GPUs required per LLM instance.
    num_gpus=1,
    # Specify the batch size for inference.
    batch_size=5,
)
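
To build intuition for `batch_size`, the sketch below mimics in plain Python how rows are grouped into batches and handed to a callable. It is an illustration only and not Ray Data's implementation; `chunk_rows`, `apply_in_batches`, and `shout` are hypothetical names.

```python
from typing import Callable, Dict, Iterator, List

Batch = Dict[str, List[str]]


def chunk_rows(rows: List[str], batch_size: int) -> Iterator[Batch]:
    """Yield column-oriented batches of at most `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield {"item": rows[start:start + batch_size]}


def apply_in_batches(
    rows: List[str], fn: Callable[[Batch], Batch], batch_size: int
) -> List[Dict[str, str]]:
    """Apply `fn` to each batch and flatten the results back into rows."""
    out: List[Dict[str, str]] = []
    for batch in chunk_rows(rows, batch_size):
        result = fn(batch)
        columns = list(result)
        # Transpose the output columns back into one dict per row.
        for values in zip(*(result[c] for c in columns)):
            out.append(dict(zip(columns, values)))
    return out


def shout(batch: Batch) -> Batch:
    # Toy transform standing in for the LLM call.
    return {
        "prompt": batch["item"],
        "generated_text": [p.upper() for p in batch["item"]],
    }


rows = ["a", "b", "c", "d", "e", "f", "g"]
batch_sizes = [len(b["item"]) for b in chunk_rows(rows, 5)]
outputs = apply_in_batches(rows, shout, batch_size=5)
```

In this sketch, seven rows with a batch size of 5 produce one batch of five rows and one of two, and the flattened output still contains one row per input.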

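Time to execute and view the results! Ray Data evaluates lazily, so inference only runs once the results are consumed, here by `take_all()`.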

In [None]:
ds.take_all()
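
`take_all()` returns a list of row dictionaries, here with the `prompt` and `generated_text` keys produced by `LLMPredictor`. A minimal sketch of serializing such rows as JSON Lines with the standard library follows. The sample row is a placeholder rather than real model output, and `rows_to_jsonl` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable


def rows_to_jsonl(rows: Iterable[Dict[str, str]]) -> str:
    """Serialize rows to JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Placeholder rows with the same keys as the predictor's output.
sample_rows = [
    {"prompt": "The key to a happy life is ...", "generated_text": "<model output>"},
]
jsonl = rows_to_jsonl(sample_rows)

# Round-trip check: parsing each line recovers the original rows.
restored = [json.loads(line) for line in jsonl.splitlines()]
```

For large datasets, collecting everything on the driver with `take_all()` can exhaust memory. Writing results out from the workers, for example with Ray Data's `write_parquet`, is usually a better fit there.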

## Convert to an Anyscale Job

To convert this notebook to an Anyscale Job, you can use `batch_llm.py` as a starting point. It contains the same code, but reads the prompts from an S3 bucket.