<a href="https://colab.research.google.com/github/bentoml/workshops/blob/main/openllm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
    <p style="text-align:center">
        <h1>OpenLLM</h1>
        <img alt="BentoML logo" src="https://raw.githubusercontent.com/bentoml/BentoML/main/docs/source/_static/img/bentoml-logo-black.png" width="200"/>
        <br/>
        <a href="https://github.com/bentoml/OpenLLM">GitHub</a>
        |
        <a href="https://l.bentoml.com/join-openllm-discord">Community</a>
    </p>
</center>
<h1 align="center">Serving an open-source LLM with OpenLLM</h1>

Open-source LLMs are powerful and flexible, but setting them up for serving can be difficult. Because there are so many of them, you may have to repeat the same setup for several models just to compare their performance. With OpenLLM, you can adapt and serve many models (including Llama 2 and Falcon) with ease.

In this tutorial, you will learn how to:

- Set up your environment to work with OpenLLM.
- Serve LLMs like Llama 2 with just a single command.
- Explore different ways to interact with the OpenLLM server.
- Use advanced features of OpenLLM like quantization.
- Integrate OpenLLM with LangChain.
- Port existing OpenAI applications to use OpenLLM with minimal code changes.
- Create a multi-modal application with OpenLLM and BentoML.

## Setup

Before diving into OpenLLM, let's ensure our environment has everything in place.

For the workshop, we will be using a remote OpenLLM server hosted on BentoCloud, because the resource limits of free Colab are too tight for most LLMs.

If you are running this after the workshop, you can provide the URL of your own server, or tick `RUN_IN_COLAB` to attempt to run a server in the Colab runtime.

In [None]:
RUN_IN_COLAB = False #@param {type: 'boolean'}
SERVER_URL = "https://llm.b2.vc" # @param {type: 'string'}

print("Installing OpenLLM...")
!pip install --upgrade -q --progress-bar off openllm bentoml langchain openai
print("Done!")

if RUN_IN_COLAB:
  SERVER_URL = "http://localhost:8001"
  print("Installing serving dependencies...")
  !pip install --upgrade -q --progress-bar off "openllm[llama,vllm,gptq]" tensorrt accelerate bitsandbytes
  !apt install -y tensorrt
  print("Done!")

  print("Downloading Llama and starting OpenLLM server...")
  !nohup openllm start llama --model-id "NousResearch/llama-2-7b-chat-hf" --backend vllm --port 8001 > openllm.log 2> openllm.err &

import os
os.environ["SERVER_URL"] = SERVER_URL


Installing OpenLLM...
Done!


## Serve Llama 2!

While we won't be serving Llama 2 from inside the Colab environment, serving it is straightforward with OpenLLM. With just a single command, you're good to go:

```bash
openllm start llama --model-id "NousResearch/llama-2-7b-chat-hf" --backend vllm --port 8001
```

If you have access to a GPU environment, try it out!

Here, we're serving the smallest Llama 2 model, the 7-billion-parameter version. To serve a larger one, replace `7b` in the model ID with `13b` or `70b`.

## View OpenLLM model options

OpenLLM offers a wide range of models and backends. To see the available options, simply run `openllm models`:

In [None]:
!openllm models

╒═══════════╤══════════════════════╤════════════════════════════════╤══════════════════════╤══════════════════════════════╕
│ LLM       │ Architecture         │ Models Id                      │ Installation         │ Runtime                      │
╞═══════════╪══════════════════════╪════════════════════════════════╪══════════════════════╪══════════════════════════════╡
│ chatglm   │ ChatGLMForConditiona │ ['thudm/chatglm-6b',           │ "openllm[chatglm]"   │ ('pt',)                      │
│           │ lGeneration          │ 'thudm/chatglm-6b-int8',       │                      │                              │
│           │                      │ 'thudm/chatglm-6b-int4',       │                      │                              │
│           │                      │ 'thudm/chatglm2-6b',           │                      │                              │
│           │                      │ 'thudm/chatglm2-6b-int4']      │                      │                              │
├──

## Serving with quantization

Unless you have a GPU with a significant amount of VRAM, you can't serve the larger models, because their weights alone take up a lot of memory.

In [None]:
#@title [Optional] Try to serve Llama 13b
!openllm start llama --model-id "NousResearch/llama-2-13b-chat-hf" --port 8001

To serve the 13b model on a GPU with limited VRAM, we can take advantage of [quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c). In essence, quantization stores the weights at lower precision, which significantly lowers the VRAM required to hold the model in memory. OpenLLM enables it through the `--quantize` argument. There are several options, including `int4` and `int8`, but for this workshop, we will be using [`gptq` quantization](https://arxiv.org/abs/2210.17323), a method that quantizes a model after it has been trained.

In [None]:
!openllm start llama --model-id "TheBloke/Llama-2-13b-Chat-GPTQ" --quantize gptq --serialization legacy
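
To get a feel for why quantization matters, the arithmetic can be sketched as follows. This is a rough estimate that counts only the bytes needed to store the weights; it ignores activations, the KV cache, and framework overhead, and it uses the nominal 7, 13, and 70 billion parameter counts. The helper names `weight_memory_gib` and `memory_table` are made up for this illustration.

```python
GIB = 2**30  # bytes per GiB

# Nominal parameter counts for the three Llama 2 sizes (assumed round numbers).
MODEL_SIZES = {"7b": 7e9, "13b": 13e9, "70b": 70e9}


def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def memory_table() -> dict:
    """Map each model size to its weight memory at 16, 8, and 4 bits per weight."""
    return {
        name: {bits: round(weight_memory_gib(n, bits), 1) for bits in (16, 8, 4)}
        for name, n in MODEL_SIZES.items()
    }


if __name__ == "__main__":
    for name, row in memory_table().items():
        print(name, row)
```

Under these assumptions, the 13b weights shrink from roughly 24 GiB at 16 bits to about 6 GiB at 4 bits, which is why a quantized 13b model can fit on a GPU that cannot hold the unquantized one.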

## Use the OpenLLM server

### Check server status

Before you interact with the OpenLLM server, make sure it is up and running. The output of the `curl` command should start with a `200` status line, such as `HTTP/1.1 200 OK` or `HTTP/2 200`, meaning everything is in order.

If it says `curl: (6) Could not resolve host: SERVER_URL`, ensure you have run the setup step.

If it says `curl: (7) Failed to connect to localhost...`, then check `./openllm.log` and `./openllm.err`; likely the server has failed to start or is still in the process of starting.

If it says `HTTP/1.1 503 Service Unavailable`, the server is still starting and you should wait a bit and retry.
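
If you would rather wait for the server from Python than re-run the `curl` cell by hand, the retry logic described above can be sketched like this. It is a minimal sketch using only the standard library; the function names, the default of 30 attempts, and the 5-second delay are assumptions chosen for illustration, not part of OpenLLM.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for a GET request, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        return None


def wait_until_ready(server_url, attempts=30, delay=5.0, get_status=http_status):
    """Poll <server_url>/readyz until it returns 200; give up after `attempts` tries."""
    url = server_url.rstrip("/") + "/readyz"
    for _ in range(attempts):
        if get_status(url) == 200:
            return True
        time.sleep(delay)
    return False


# Usage (needs a reachable server):
# print(wait_until_ready(SERVER_URL))
```

The `get_status` parameter exists only so the polling loop can be exercised without a network connection.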

In [5]:
!curl -i {SERVER_URL}/readyz

HTTP/2 200 
date: Thu, 12 Oct 2023 23:54:59 GMT
content-type: text/plain; charset=utf-8
content-length: 1
strict-transport-security: max-age=15724800; includeSubDomains
x-powered-by: Yatai
x-yatai-org-name: unknown
x-yatai-bento: meta-llama--llama-2-13b-chat-hf-service:v4-new-prompt
cf-cache-status: DYNAMIC
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=br84OpSy120f0MHCC6lUYKJTTf%2Ffl3bp5tiycOXgFL2b4NijUersHA7jkkdYq48rTsnYgRNPYdkmzycgWKUkOUSJfYk4CGWm0%2FeQ9aPSqHb3b7j0ceDsvFaLWn8%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 81534564bf6eb0e5-ATL
alt-svc: h3=":443"; ma=86400




### Raw HTTP

There are several ways you can interact with an OpenLLM server. Since it is a standard HTTP server that accepts JSON as input, you can simply use `cURL` (or any HTTP client of your choice):

In [None]:
!curl -k -X 'POST' \
  "$SERVER_URL/v1/generate_stream" \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"write a tagline for an ice cream shop\n", "llm_config": {"max_new_tokens": 8192}}'

Sure! Here are some tagline options for an ice cream shop:

1. "Savor the Sweet Life"
2. "Cool Treats for a Hot World"
3. "Where Every Day is a Sundae"
4. "Chill Out with Us"
5. "The Creamery of Your Dreams"
6. "A Scoop Above the Rest"
7. "Taste the Joy in Every Bite"
8. "Satisfy Your Cravings, Indulge in Our Delights"
9. "The Perfect Treat for Any Mood"
10. "Experience the Magic of Ice Cream"

I hope these tagline options inspire you and help you come up with the perfect one for your ice cream shop! 
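
The same request can be made from Python with nothing but the standard library. The sketch below mirrors the URL, headers, and JSON body of the `curl` command above; the function names and the default of 256 new tokens are assumptions for illustration. Unlike `curl -k`, it verifies TLS certificates, and it makes no assumption about the format of the streamed lines: it simply yields them as text.

```python
import json
import urllib.request


def build_generate_request(server_url, prompt, max_new_tokens=256):
    """Build the same POST request as the curl command, without sending it."""
    payload = {"prompt": prompt, "llm_config": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        server_url.rstrip("/") + "/v1/generate_stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "text/event-stream", "Content-Type": "application/json"},
        method="POST",
    )


def stream_lines(request, timeout=120.0):
    """Send the request and yield each non-empty response line as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw_line in response:
            line = raw_line.decode("utf-8").rstrip("\r\n")
            if line:
                yield line


# Usage (needs a running server):
# request = build_generate_request(SERVER_URL, "write a tagline for an ice cream shop\n")
# for line in stream_lines(request):
#     print(line)
```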

### OpenLLM client

Of course, there's a lot of boilerplate in a standard HTTP request. OpenLLM includes both a CLI and Python client for making requests to a server.

In [None]:
!openllm query --endpoint {SERVER_URL} --timeout 120 "What is the weight of the earth?"
# os.environ["OPENLLM_ENDPOINT"] = SERVER_URL
# Alternatively, set the endpoint through the environment and omit --endpoint:
# !openllm query --timeout 120 "What is the weight of the earth?" --sample-params max-new-tokens=8192

Hello! I'm happy to help you with your question. However, I want to point out that the question "What is the weight of the earth?" is not factually coherent. The Earth is a planet, and it does not have a weight as it is not an object that can be weighed. The Earth's mass is approximately 5.972 x 10^24 kilograms, but this is a measure of its total mass, not its

In [None]:
import openllm

# sync API
# client = openllm.client.HTTPClient(SERVER_URL)
# res = client.query("what is the weight of the earth?", max_new_tokens=8192)

async_client = openllm.client.AsyncHTTPClient(SERVER_URL, timeout=120)
res = await async_client.query("what is the weight of the earth?", max_new_tokens=8192)
print(res.responses[0])

 The weight of the Earth is approximately 5.972 x 10^24 kilograms. This is based on the Earth's mass, which is estimated to be around 5.972 x 10^24 kilograms, and its density, which is approximately 5.514 g/cm^3. However, it's important to note that the Earth's weight is not a fixed value, as it can vary slightly due to the planet's slightly oblate shape and the effects of gravitational forces on its mass. Additionally, it's important to note that this value is an estimate based on scientific measurements and calculations, and it may not be exact or up-to-date.


Here's an example of creating a back-and-forth chat using Llama 2's chat format:

In [None]:
messages = []

while True:
  message = input("User: ")
  if not message:  # an empty line ends the chat
    break
  if messages:
    prompt = "</s><s>[INST] ".join(messages) + "</s><s>[INST] " + message
  else:
    prompt = message
  resp = await async_client.query(prompt, max_new_tokens=8192)
  print(f"Llama2: {resp.responses[0]}")
  messages.append(f"{message} [/INST] {resp.responses[0]}")
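
The string handling in that loop can be pulled out into a small pure function, which makes it easier to see exactly what is sent on each turn. This sketch reproduces the concatenation used above and nothing more; it says nothing about any template the server may apply around the prompt. The name `build_prompt` and the `(user, assistant)` tuple layout are choices made for this illustration.

```python
# Separator placed between a finished turn and the next user message,
# exactly as in the loop above.
TURN_SEPARATOR = "</s><s>[INST] "


def build_prompt(history, message):
    """Join earlier (user, assistant) turns and the new user message into one prompt."""
    turns = [f"{user} [/INST] {assistant}" for user, assistant in history]
    return TURN_SEPARATOR.join(turns + [message])


if __name__ == "__main__":
    history = [("Hi", "Hello! How can I help?")]
    print(build_prompt(history, "Tell me a joke"))
```

With an empty history the prompt is just the new message, which matches the `else` branch of the loop.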

### OpenAI compatibility

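Finally, OpenLLM recently added OpenAI-compatible endpoints, so you can port an OpenAI application to OpenLLM by setting `openai.api_base`. The cell below uses the pre-1.0 interface of the `openai` Python package (`openai.api_base` and `openai.ChatCompletion`); releases from 1.0 onward replaced it with a client object.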

In [None]:
import openai
openai.api_base = SERVER_URL + "/v1"
openai.api_key = ""

messages = [
    {"role": "system", "content": "You are an intelligent assistant."},
    {"role": "user", "content": "Write me a haiku\n"}
  ]
chat = openai.ChatCompletion.create(model="llama2", messages=messages)

print(chat.choices[0].message.content)
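
Under the hood, a chat completion is a POST to `<api_base>/chat/completions`, so it can help to see the exchange without the `openai` package in the way. The sketch below builds that request by hand and extracts the reply, assuming the server follows the OpenAI chat completions request and response shape. The helper names and the sample response are illustrative and were not produced by an OpenLLM server.

```python
import json
import urllib.request


def chat_completion_request(api_base, model, messages):
    """Build a POST request to the OpenAI-style chat completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        api_base.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Return the text of the first choice in a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


# Hand-written sample in the OpenAI response shape, for illustration only.
SAMPLE_RESPONSE = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "A haiku goes here"}}
    ]
}

if __name__ == "__main__":
    request = chat_completion_request(
        "http://localhost:8001/v1",
        "llama2",
        [{"role": "user", "content": "Write me a haiku\n"}],
    )
    print(request.full_url)
    print(first_reply(SAMPLE_RESPONSE))
```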

## BentoML

OpenLLM can be integrated into a multi-modal application using BentoML. Here's a short example of what that might look like:

In [None]:
from __future__ import annotations

import asyncio

import bentoml
import openllm

model = "llama"

llm_config = openllm.AutoConfig.for_model(model)
llm_runner = openllm.Runner(model, llm_config=llm_config)

other_runner = bentoml.sklearn.get("my_cool_model:latest").to_runner()

svc = bentoml.Service(name="llm-service", runners=[llm_runner, other_runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
  llm_answer, other_answer = await asyncio.gather(
      llm_runner.generate.async_run(input_text),
      other_runner.classify.async_run(input_text),
  )
  return llm_answer[0]["generated_text"] + f"\nclassified as: {other_answer}"
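
The key idea in that endpoint is `asyncio.gather`, which runs both model calls concurrently and returns their results in argument order. The following self-contained sketch shows the same pattern with two stand-in coroutines instead of real runners; `fake_llm`, `fake_classifier`, and `handle` are hypothetical names, and the short sleeps only simulate inference latency.

```python
import asyncio


async def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM runner call."""
    await asyncio.sleep(0.01)  # simulated inference time
    return f"answer to: {prompt}"


async def fake_classifier(prompt: str) -> str:
    """Stand-in for the second model: labels the prompt as a question or a statement."""
    await asyncio.sleep(0.01)  # simulated inference time
    return "question" if prompt.rstrip().endswith("?") else "statement"


async def handle(prompt: str) -> str:
    # gather() schedules both coroutines at once and returns results in argument order.
    llm_answer, label = await asyncio.gather(fake_llm(prompt), fake_classifier(prompt))
    return f"{llm_answer}\nclassified as: {label}"


if __name__ == "__main__":
    print(asyncio.run(handle("What is OpenLLM?")))
```

Because the two calls overlap, the total latency is close to that of the slower model rather than the sum of both.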

You can then build this service into a Bento, push it to BentoCloud, and deploy it at scale!

```bash
bentoml build
bentoml push example_bento
```