<center>
    <p style="text-align:center">
        <img alt="BentoML logo" src="https://raw.githubusercontent.com/bentoml/BentoML/main/docs/source/_static/img/bentoml-logo-black.png" width="200"/>
        <br/>
        <a href="https://github.com/bentoml/">GitHub</a>
        |
        <a href="https://l.bentoml.com/join-openllm-discord">Community</a>
    </p>
</center>
<h1 align="center">Sentence Embedding with BentoML</h1>

BentoML is a framework for building reliable, scalable, and cost-efficient AI applications. It comes with everything you need for model serving, application packaging, and production deployment.

This is a live demo for [sentence-embedding-bento](https://github.com/bentoml/sentence-embedding-bento). In this tutorial, you will learn the following:
- Set up your environment to work with BentoML
- Serve a REST API for generating text embeddings with a single command
- Explore different ways to interact with the server
- Build a Bento for deployment
- Deploy to production

## Set up

Before diving into this project, let's ensure our environment has everything in place.

In [2]:
import os

UPDATE_REPO = True  #@param {type:"boolean"}
PROJECT_NAME = 'sentence-embedding-bento'

if PROJECT_NAME not in os.getcwd():
  ![ ! -d $PROJECT_NAME ] && echo -= Initial setup =- && git clone https://github.com/bentoml/sentence-embedding-bento.git
  %cd $PROJECT_NAME

if UPDATE_REPO:
  !echo -= Updating repo =- && git pull

print("Installing dependencies...")
!pip install --upgrade -q --progress-bar off -r requirements.txt
print("Done!")

-= Updating repo =-
Already up to date.
Installing dependencies...
Done!


## Serve sentence-embedding!


There are two ways to get the server ready:

- Run the server locally in the Colab runtime.
- Launch it outside the Colab environment, since the resource limits of free Colab are quite tight for model serving.

### Run it in Colab

First, let's download a sentence embedding model using the BentoML SDK.
This will save `all-MiniLM-L6-v2` in your local BentoML model store.

💁 You can try any other embedding model from https://huggingface.co/models?library=sentence-transformers by changing the `model_id` variable in the code block below.

In [None]:
import bentoml
from transformers import AutoTokenizer, AutoModel

model_id="sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

bentoml.transformers.save_model("all-MiniLM-L6-v2", model)
bentoml.transformers.save_model("all-MiniLM-L6-v2-tokenizer", tokenizer)



Model(tag="all-minilm-l6-v2-tokenizer:ofib35tmto5keasc", path="/root/bentoml/models/all-minilm-l6-v2-tokenizer/ofib35tmto5keasc/")

Serving it is straightforward with BentoML. A single command is all you need:

> 👀 By default, `bentoml serve` starts the server defined in `service.py` and listens on port 3000. To avoid blocking the following steps, we run it in the background via `nohup`.

In [None]:
!nohup bentoml serve > bentoml.log 2>&1 &
SERVER_URL = "http://localhost:3000"
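Because the server runs in the background, its output goes to `bentoml.log` rather than to the notebook. A minimal sketch for peeking at the most recent log lines from Python; the helper name `tail_log` and its parameters are hypothetical and not part of BentoML:

```python
from collections import deque
from pathlib import Path


def tail_log(path="bentoml.log", max_lines=20):
    """Return the last `max_lines` lines of a log file, or [] if it does not exist yet."""
    log_file = Path(path)
    if not log_file.exists():
        return []
    with log_file.open("r", encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the most recent lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=max_lines)]


# Print whatever the server has logged so far (nothing if the file is missing).
for line in tail_log():
    print(line)
```

Re-running the cell after a few seconds should show the startup messages as they arrive.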

### Run it outside

First, ensure you have Docker installed and running on your server.

Launch the embedding service from the prebuilt image with the following command:

```
docker run --rm -p 3000:3000 ghcr.io/bentoml/sentence-embedding-bento:latest
```

To run model inference with GPU, install the NVIDIA Container Toolkit and use the GPU-enabled docker image instead:

```
docker run --gpus all --rm -p 3000:3000 ghcr.io/bentoml/sentence-embedding-bento-gpu:latest
```

Then provide the URL of your own server:

In [None]:
SERVER_URL = "http://xxx.xx" # @param {type: 'string'}

## Try sentence-embedding server

### Check server status

Before you interact with the sentence-embedding server, make sure that it is up and running. The output of the `curl` command should start with `HTTP/1.1 200 OK`, meaning everything is in order.

If it says `curl: (6) Could not resolve host: SERVER_URL`, ensure you have run the setup step.

If it says `curl: (7) Failed to connect to localhost...`, then check `./bentoml.log`; the server has likely failed to start or is still starting.

If it says `HTTP/1.1 503 Service Unavailable`, the server is still starting and you should wait a bit and retry.

In [None]:
!curl -i {SERVER_URL}/readyz

HTTP/1.1 200 OK
date: Tue, 17 Oct 2023 03:58:05 GMT
server: uvicorn
content-length: 1
content-type: text/plain; charset=utf-8
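
The status check above can also be scripted so that later cells only run once the server answers. A minimal sketch, assuming only the `/readyz` endpoint used by the `curl` call; the helper name `wait_until_ready` and its `timeout` and `interval` parameters are hypothetical and not part of BentoML:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(server_url, timeout=120.0, interval=2.0):
    """Poll `<server_url>/readyz` until it returns HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    url = server_url.rstrip("/") + "/readyz"
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5.0) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Covers a refused connection (server not up yet) and
            # error replies such as 503 Service Unavailable.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready(SERVER_URL)` returns `True` as soon as the endpoint responds with 200 and `False` if the timeout expires first.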




### Web UI

In [None]:
import sys
if 'google.colab' in sys.modules:
    # Use the Colab proxy URL to reach port 3000 from the browser
    from google.colab.output import eval_js
    print("you are in colab runtime. try it out in %s" % eval_js("google.colab.kernel.proxyPort(3000)"))

you are in colab runtime. try it out in https://1c7imreyyom-496ff2e9c6d22116-3000-colab.googleusercontent.com/


### Raw HTTP

In [None]:
!curl -X POST {SERVER_URL}/encode \
   -H 'Content-Type: application/json' \
   -d '["hello world, how are you?", "I love fried chicken sandwiches!"]'

[[0.17826364934444427, 0.10717688500881195, 0.43587279319763184, 0.20607613027095795, 0.0022702317219227552, -0.6842638254165649, 0.28179970383644104, 0.004625807050615549, -0.5464844703674316, 0.12071722000837326, 0.008337091654539108, 0.05009095370769501, -0.009813488461077213, -0.09429076313972473, 0.11632076650857925, -0.311547189950943, 0.31155329942703247, -0.5299899578094482, -0.6968725323677063, 0.11798126250505447, -0.13786333799362183, 0.2133282721042633, 0.3239438831806183, 0.33048930764198303, -0.3040446937084198, -0.14464575052261353, 0.059592656791210175, 0.191707581281662, -0.014646864496171474, -0.41965341567993164, -0.27630409598350525, 0.2672696113586426, 0.1408645063638687, -0.12229768186807632, -0.130007803440094, 0.24715524911880493, -0.05008126050233841, -0.730891764163971, -0.20369082689285278, 0.015536967664957047, 0.35080134868621826, -0.15579760074615479, -0.08321177959442139, -0.15525977313518524, 0.30729877948760986, -0.26019930839538574, 0.03653553128242493
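
The same request can be sent from Python without any extra dependencies. A minimal sketch using only the standard library, assuming the `/encode` endpoint accepts a JSON list of strings and returns a JSON list of vectors, as the `curl` call and its output suggest; the function names `build_encode_request` and `encode` are hypothetical:

```python
import json
import urllib.request


def build_encode_request(server_url, sentences):
    """Build a POST request for the `/encode` endpoint with a JSON list of sentences."""
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/encode",
        data=json.dumps(sentences).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def encode(server_url, sentences, timeout=30.0):
    """Send the sentences to the server and return the parsed JSON response."""
    request = build_encode_request(server_url, sentences)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `encode(SERVER_URL, ["hello world, how are you?"])` should return one embedding vector per input sentence.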

### BentoML client

In [None]:
from bentoml.client import Client
import nest_asyncio
nest_asyncio.apply()

client = Client.from_url(SERVER_URL)

samples = [
  "The dinner was great!",
  "The weather is great today!",
  "I love fried chicken sandwiches!"
]
print(client.encode(samples))



[[-0.20782252  0.57161289 -0.12791678 ...  0.51041508 -0.54378605
  -0.05389049]
 [-0.11160703  0.43717682  0.80714953 ... -0.07692268 -0.70888156
   0.53045732]
 [-0.6013819  -0.44266695 -0.03185841 ...  0.45817858 -0.44026256
  -0.00560874]]
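
Embeddings like the ones printed above are typically compared with cosine similarity, where sentences with similar meaning score closer to 1. A minimal pure-Python sketch; the helper names `cosine_similarity` and `similarity_matrix` are hypothetical and not part of the BentoML client:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities for a sequence of embedding vectors."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]
```

For example, `similarity_matrix(client.encode(samples))` would give a 3x3 matrix whose diagonal entries are approximately 1.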


## Build Bento

A Bento is the standardized distribution format, supported by an array of downstream deployment tools in the BentoML ecosystem. It captures your service code, models, and configurations in one place, versions them automatically, and ensures reproducibility across your development and production environments. Learn more in the [BentoML Documentation](https://docs.bentoml.com/en/latest/concepts/bento.html).

In [None]:
!bentoml build -f bentofile.yaml

Converting 'all-MiniLM-L6-v2' to lowercase: 'all-minilm-l6-v2'.
Converting 'all-MiniLM-L6-v2-tokenizer' to lowercase: 'all-minilm-l6-v2-tokenizer'.
'labels' should be a dict[str, str] and enforced by BentoML. Converting all values to string.

██████╗ ███████╗███╗   ██╗████████╗ ██████╗ ███╗   ███╗██╗
██╔══██╗██╔════╝████╗  ██║╚══██╔══╝██╔═══██╗████╗ ████║██║
██████╔╝█████╗  ██╔██╗ ██║   ██║   ██║   ██║██╔████╔██║██║
██╔══██╗██╔══╝  ██║╚██╗██║   ██║   ██║   ██║██║╚██╔╝██║██║
██████╔╝███████╗██║ ╚████║   ██║   ╚██████╔╝██║ ╚═╝ ██║███████╗
╚═════╝ ╚══════╝╚═╝  ╚═══╝   ╚═╝    ╚═════╝ ╚═╝     ╚═╝╚══════╝

Successfully built Bento(tag="sentence-embedding-svc:wt2ymutmukn2iasc").

Possible next steps:

 * Containerize your Bento with `bentoml containerize`:
    $ bentoml containerize sentence-embedding-svc:wt2ymutmukn2iasc  [or bentoml build --containerize]

 * Push to BentoCloud with `bentoml push`:
    $ bentoml push sentence-embedding-svc:wt2ymutmukn2iasc [or bentoml 

## Production Deployment

BentoML provides a number of deployment options. The easiest way to set up a production-ready endpoint of your text embedding service is via BentoCloud, the serverless cloud platform built for BentoML, by the BentoML team.

Next steps:

1. Sign up for a BentoCloud account [here](https://www.bentoml.com/).
2. Get an API token; see the instructions [here](https://docs.bentoml.com/en/latest/bentocloud/getting-started/ship.html#acquiring-an-api-token).
3. Push your Bento to BentoCloud: `bentoml push sentence-embedding-svc:latest`
4. Deploy via the web UI; see [Deploying on BentoCloud](https://docs.bentoml.com/en/latest/bentocloud/getting-started/ship.html#deploying-your-bento).