<a href="https://colab.research.google.com/github/bentoml/workshops/blob/main/openllm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
    <p style="text-align:center">
        <h1>OpenLLM</h1>
        <img alt="BentoML logo" src="https://raw.githubusercontent.com/bentoml/BentoML/main/docs/source/_static/img/bentoml-logo-black.png" width="200"/>
        <br/>
        <a href="https://github.com/bentoml/OpenLLM">GitHub</a>
        |
        <a href="https://l.bentoml.com/join-openllm-discord">Community</a>
    </p>
</center>
<h1 align="center">Serving an open-source LLM with OpenLLM</h1>

Open-source LLMs are powerful and flexible, but setting them up for serving can be difficult. Because the landscape of LLMs is so large, you may have to repeat the same setup process for several models in order to evaluate them. With OpenLLM, you can adapt and serve many models (including Llama 2 and Falcon) with ease.

In this tutorial, you will learn how to:

- Set up your environment to work with OpenLLM.
- Serve LLMs like Llama 2 with just a single command.
- Explore different ways to interact with the OpenLLM server.
- Use advanced features of OpenLLM, like quantization.
- Integrate OpenLLM with LangChain.
- Port existing OpenAI applications to use OpenLLM with minimal code changes.
- Create a multi-modal application with OpenLLM and BentoML.

## Setup

Before diving into OpenLLM, let's ensure our environment has everything in place.

In [None]:
RUN_IN_COLAB = False #@param {type: 'boolean'}
SERVER_URL = "http://llama-13b-org-ss-org-1--aws-us-east-1.mt2.bentoml.ai"

print("Installing OpenLLM...")
!pip install --upgrade -q --progress-bar off openllm bentoml langchain openai
print("Done!")

if RUN_IN_COLAB:
  SERVER_URL = "http://localhost:8001"
  print("Installing serving dependencies...")
  !pip install --upgrade -q --progress-bar off openllm[llama,vllm,gptq] tensorrt accelerate bitsandbytes
  !apt install tensorrt
  print("Done!")

  print("Downloading Llama and starting OpenLLM server...")
  !nohup openllm start llama --model-id "NousResearch/llama-2-7b-chat-hf" --backend vllm --port 8001 > openllm.log 2> openllm.err &

import os
os.environ["SERVER_URL"] = SERVER_URL


Installing OpenLLM...
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for optimum (pyproject.toml) ... done
Done!


## [Optional] Check GPU and memory resources

Use the following script to check the GPU and memory resources in your environment.

In [None]:
import psutil
import torch

ram = psutil.virtual_memory()
ram_total = ram.total / (1024 ** 3)
print("MemTotal: %.2f GB" % ram_total)

print("=============GPU INFO=============")
if torch.cuda.is_available():
    !/opt/bin/nvidia-smi || true
else:
    print("GPU NOT available")
    print("Run `openllm models` to find models which are executable on CPU")

MemTotal: 12.68 GB
Thu Oct 12 10:04:18 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8    10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+--------------------------------------------------------------------

## Serve Llama 2!

Unless you set `RUN_IN_COLAB` in the setup cell, this notebook talks to a hosted server rather than serving Llama 2 from inside the Colab environment. Serving it yourself is straightforward with OpenLLM, though. A single command is all it takes:

```bash
openllm start llama --model-id "NousResearch/llama-2-7b-chat-hf" --backend vllm --port 8001
```

If you have access to a GPU environment, try it out!

Here, we're serving the smallest Llama 2 model, the 7 billion parameter version. To serve a larger model, replace `7b` in the model ID with `13b` or `70b`.

## View OpenLLM model options

OpenLLM offers a wide range of models and backends. To see the available options, simply run `openllm models`:

In [None]:
!openllm models

╒═══════════╤══════════════════════╤════════════════════════════════╤══════════════════════╤══════════════════════════════╕
│ LLM       │ Architecture         │ Models Id                      │ Installation         │ Runtime                      │
╞═══════════╪══════════════════════╪════════════════════════════════╪══════════════════════╪══════════════════════════════╡
│ chatglm   │ ChatGLMForConditiona │ ['thudm/chatglm-6b',           │ "openllm[chatglm]"   │ ('pt',)                      │
│           │ lGeneration          │ 'thudm/chatglm-6b-int8',       │                      │                              │
│           │                      │ 'thudm/chatglm-6b-int4',       │                      │                              │
│           │                      │ 'thudm/chatglm2-6b',           │                      │                              │
│           │                      │ 'thudm/chatglm2-6b-int4']      │                      │                              │
├──

## Serving with quantization

While you can serve smaller models, like the 7 billion parameter Llama model in the example, larger models may not fit on your GPU, because the VRAM needed to hold their weights grows with the parameter count.

In [None]:
#@title [Optional] Try to serve Llama 13b
!openllm start llama --model-id "NousResearch/llama-2-13b-chat-hf" --port 8001
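A rough back-of-the-envelope estimate shows why the 13b model is a problem on a small GPU. The sketch below counts only the memory for the weights and assumes nominal parameter counts (7e9 and 13e9) and 16-bit weights by default. It ignores activations, the KV cache, and framework overhead, so real usage is higher.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


# 15360 MiB, the Tesla T4 capacity reported by nvidia-smi earlier in this notebook.
T4_VRAM_GIB = 15360 / 1024

# Nominal parameter counts; the real checkpoints differ slightly.
for name, n_params in [("7b", 7e9), ("13b", 13e9)]:
    for bits in (16, 8, 4):
        need = weight_memory_gib(n_params, bits)
        verdict = "fits" if need < T4_VRAM_GIB else "does not fit"
        print(f"llama-2-{name} at {bits}-bit: {need:5.1f} GiB -> weights {verdict} in {T4_VRAM_GIB:.0f} GiB")
```

Under these assumptions, the 16-bit 13b weights alone exceed the T4's memory, while the same model at 8 or 4 bits per weight leaves headroom.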

To serve 13b on a GPU with limited VRAM, we can take advantage of [quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c). In essence, quantization stores each weight as a rounded, lower-precision value, which significantly lowers the VRAM required to hold the model in memory. With OpenLLM, enabling quantization only requires passing the `--quantize` argument. There are several options, including `int4` and `int8`, but for this workshop, we will be using the state-of-the-art [`gptq` quantization](https://arxiv.org/abs/2210.17323).
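To make the idea of "rounding weight values" concrete, here is a minimal sketch of the simplest scheme, round-to-nearest int8 quantization with a single shared scale. This is not what GPTQ does (GPTQ chooses the rounded values more carefully to limit the error in each layer's output), and the weight values are made up for illustration.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumes at least one weight is non-zero.
    """
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in q_weights]


weights = [0.5, -1.27, 0.013, 1.0]  # made-up example values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("stored integers:", q)
print(f"max rounding error: {max_error:.4f}")
```

Each weight now needs 8 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale per weight.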

In [None]:
!openllm start llama --model-id "TheBloke/Llama-2-13b-Chat-GPTQ" --quantize gptq

Note: You can always fallback to '--serialisation legacy' when running quantisation.
Make sure to check out 'TheBloke/Llama-2-13b-Chat-GPTQ' repository to see if the weights is in 'safetensors' format if unsure.
Make sure to have the following dependencies available: ['fairscale']
'quantize="gptq"' requires 'auto-gptq' and 'optimum>=0.12' to be installed (not available with local environment). Make sure to have 'auto-gptq' available locally: 'pip install "openllm[gptq]"'. OpenLLM will fallback to int8 with bitsandbytes.
Downloading (…)lve/main/config.json: 100% 837/837 [00:00<00:00, 5.01MB/s]
'quantize="gptq"' requires 'auto-gptq' and 'optimum>=0.12' to be installed (not available with local environment). Make sure to have 'auto-gptq' available locally: 'pip install "openllm[gptq]"'. OpenLLM will fallback to int8 with bitsandbytes.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/openllm/serialisation/transformers/__init__.py", li

# Use the OpenLLM server

### Check server status

Before you interact with the OpenLLM server, it's crucial to ensure that it is up and running. The output of the `curl` command should start with `HTTP/1.1 200 OK`, meaning everything is in order.

If it says `curl: (6) Could not resolve host: SERVER_URL`, ensure you have run the setup step.

If it says `curl: (7) Failed to connect to localhost...`, then check `./openllm.log` and `./openllm.err`; likely the server has failed to start or is still in the process of starting.

If it says `HTTP/1.1 503 Service Unavailable`, the server is still starting and you should wait a bit and retry.
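If you would rather wait for the server from Python than rerun `curl` by hand, the same check can be scripted. The following is a minimal sketch using only the standard library. The helper names `probe_readyz` and `wait_until_ready`, and the default retry count and delay, are choices made for this example and are not part of OpenLLM.

```python
import time
import urllib.error
import urllib.request


def probe_readyz(server_url, timeout=5.0):
    """Return the HTTP status of GET {server_url}/readyz, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{server_url}/readyz", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        return None


def wait_until_ready(server_url, attempts=30, delay=10.0, probe=probe_readyz, sleep=time.sleep):
    """Poll the readiness endpoint until it returns 200 or the attempts run out."""
    for _ in range(attempts):
        if probe(server_url) == 200:
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready(SERVER_URL)` returns `True` once the server answers with `200`, and `False` if it never does within the allotted attempts.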

In [None]:
!curl -i {SERVER_URL}/readyz

HTTP/1.1 200 OK
Date: Thu, 12 Oct 2023 09:55:29 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 1
Connection: keep-alive
X-Powered-By: Yatai
X-Yatai-Org-Name: unknown
X-Yatai-Bento: meta-llama--llama-2-13b-chat-hf-service:0ba94ac9b9e1d5a0037780667e8b219adde1908c




### Raw HTTP

There are several ways to interact with an OpenLLM server. Since it is a standard HTTP server that accepts JSON as input, you can use `curl` (or any HTTP client of your choice):

In [None]:
!curl -k -X 'POST' \
  "$SERVER_URL/v1/generate_stream" \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"write a tagline for an ice cream shop\n", "llm_config": {"max_new_tokens": 8192}}'
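The same request can be made from Python with only the standard library. This sketch mirrors the `curl` call above (same endpoint, headers, and JSON shape). The helper names and the smaller default `max_new_tokens` are choices made for this example, and the code simply yields the raw lines of the event stream without assuming anything about their contents.

```python
import json
import urllib.request


def build_generate_request(server_url: str, prompt: str, max_new_tokens: int) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {"prompt": prompt, "llm_config": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{server_url}/v1/generate_stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "text/event-stream", "Content-Type": "application/json"},
        method="POST",
    )


def stream_generate(server_url: str, prompt: str, max_new_tokens: int = 128):
    """Send the request and yield each non-empty line of the response stream."""
    request = build_generate_request(server_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            line = raw_line.decode("utf-8").strip()
            if line:
                yield line
```

For example, `for line in stream_generate(SERVER_URL, "write a tagline for an ice cream shop\n"): print(line)` prints the stream as it arrives.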

### OpenLLM client

Of course, there's a lot of boilerplate in a standard HTTP request. OpenLLM includes both a CLI and Python client for making requests to a server.

In [None]:
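import openllm

# HTTPClient is the synchronous client, so its methods are called without `await`
# (AsyncHTTPClient is the awaitable counterpart).
client = openllm.client.HTTPClient(SERVER_URL)

res = client.query("what is the weight of the earth?", return_response='raw')
print(res)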

In [None]:
!openllm query --endpoint {SERVER_URL} --timeout 120 "What is the weight of the earth?"

### LangChain integration

OpenLLM also integrates with LangChain, so you can use it through the LangChain APIs you may already know and combine it with any other LangChain-supported LLMs and APIs.

In [None]:
from langchain.prompts import PromptTemplate
from langchain.llms import OpenLLM
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["country"],
    template="What are some phrases that would be useful to know when visiting {country}?",
)

llm = OpenLLM(server_url=SERVER_URL)

llmchain = LLMChain(llm=llm, prompt=prompt)

country = input("Country: ")
llmchain.run(country)

### OpenAI compatibility

Finally, OpenLLM exposes OpenAI-compatible endpoints, so you can port an existing OpenAI application to OpenLLM by setting `openai.api_base`:

In [None]:
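import openai

# Point the OpenAI client at the OpenLLM server instead of api.openai.com.
# Note: `openai.api_base` and `openai.ChatCompletion` belong to the pre-1.0 `openai` package.
openai.api_base = SERVER_URL + "/v1"
openai.api_key = ""

messages = [
    {"role": "system", "content": "You are an intelligent assistant."},
    {"role": "user", "content": "Write me a haiku\n"},
]
chat = openai.ChatCompletion.create(model="llama2", messages=messages)

print(chat.choices[0].message.content)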

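"OpenAI-compatible" means the server accepts the same JSON the OpenAI client sends and replies in the same shape. As a rough sketch of what travels over the wire: with `api_base` set to `SERVER_URL + "/v1"`, the client posts the messages to the `/chat/completions` path under that base and reads the reply from `choices[0].message.content`. The helper names and the sample response below are made up for illustration, and the response is not real model output.

```python
import json


def chat_completions_url(server_url: str) -> str:
    """URL the OpenAI client targets when api_base is SERVER_URL + "/v1"."""
    return f"{server_url}/v1/chat/completions"


def build_chat_payload(messages, model="llama2") -> str:
    """JSON body of a chat completion request."""
    return json.dumps({"model": model, "messages": messages})


def first_reply(response: dict) -> str:
    """Extract the assistant text, mirroring chat.choices[0].message.content."""
    return response["choices"][0]["message"]["content"]


messages = [
    {"role": "system", "content": "You are an intelligent assistant."},
    {"role": "user", "content": "Write me a haiku\n"},
]

# Illustrative placeholder only: the shape of a reply, not actual server output.
sample_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "(model output goes here)"}}
    ]
}

print(chat_completions_url("http://localhost:8001"))
print(build_chat_payload(messages))
print(first_reply(sample_response))
```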

# BentoML

You can integrate OpenLLM into a multi-modal application using BentoML. Here's a short example of what that might look like:

In [None]:
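from __future__ import annotations

import asyncio

import bentoml
import openllm

model = "llama"

llm_config = openllm.AutoConfig.for_model(model)
llm_runner = openllm.Runner(model, llm_config=llm_config)

# A second, non-LLM model served alongside the LLM.
other_runner = bentoml.sklearn.get("my_cool_model:latest").to_runner()

svc = bentoml.Service(name="llm-service", runners=[llm_runner, other_runner])


def tokenize(text: str) -> list[str]:
    # Placeholder preprocessing: replace with whatever input `my_cool_model` expects.
    return [text]


@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    # Run both models concurrently on the same input.
    # Assumes `my_cool_model` was saved with a `classify` signature.
    llm_answer, other_answer = await asyncio.gather(
        llm_runner.generate.async_run(input_text),
        other_runner.classify.async_run(tokenize(input_text)),
    )
    return llm_answer[0]["generated_text"] + f"\nclassified as: {other_answer}"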
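The key pattern in this service is `asyncio.gather`, which starts both model calls at once and waits for both results. The sketch below isolates that pattern with two stand-in coroutines, so it runs without BentoML, OpenLLM, or a GPU. `fake_llm` and `fake_classifier` are hypothetical placeholders for the two runners, not real APIs.

```python
import asyncio


async def fake_llm(prompt: str) -> list[dict]:
    """Stand-in for the LLM runner: returns a list of generations."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return [{"generated_text": f"echo: {prompt}"}]


async def fake_classifier(text: str) -> str:
    """Stand-in for the second model: labels the input text."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return "question" if text.endswith("?") else "statement"


async def handle(input_text: str) -> str:
    # Both coroutines run concurrently; gather returns results in argument order.
    llm_answer, label = await asyncio.gather(
        fake_llm(input_text),
        fake_classifier(input_text),
    )
    return llm_answer[0]["generated_text"] + f"\nclassified as: {label}"


if __name__ == "__main__":
    print(asyncio.run(handle("What is OpenLLM?")))
```

Inside a notebook, an event loop is already running, so call `await handle("What is OpenLLM?")` in a cell instead of using `asyncio.run`.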

You can then build this service into a Bento, push it to BentoCloud, and deploy it at scale:

```bash
bentoml build
bentoml push example_bento
```