## Install dependencies

In [None]:
!pip install git+https://github.com/huggingface/peft
!pip install huggingface_hub
!pip install "ray[serve]"
!pip install evaluate
!pip install vllm

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
folder = '/content/drive/My Drive/conference/llama_7b_class'
model_checkpoint = f'{folder}/assets'

## Ray

The first deployment option we will try is Ray Serve, a scalable model serving library for building online inference APIs.

### Define the app

In [5]:
import requests
from starlette.requests import Request
from typing import Dict
import torch
from peft import AutoPeftModelForCausalLM, PeftConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
from ray.serve import Application
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1})
class TextClassificationDeployment:
    def __init__(self, lora_weights, model_type="llama", task="classification"):
        if model_type == "flan":
            config = PeftConfig.from_pretrained(lora_weights)
            self._model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,)
            self._model = PeftModel.from_pretrained(self._model, lora_weights)
            self.tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
        else:
            self._model = AutoPeftModelForCausalLM.from_pretrained(lora_weights,
                                    low_cpu_mem_usage=True,
                                    torch_dtype=torch.float16,
                                    device_map='auto')
            self.tokenizer = AutoTokenizer.from_pretrained(lora_weights)
        self.max_number_of_tokens = 20 if task == "classification" else 100

    def generate(self, text):
        input_ids = self.tokenizer(
            text, return_tensors="pt", truncation=True
        ).input_ids.cuda()
        with torch.inference_mode():
            outputs = self._model.generate(
                    input_ids=input_ids,
                    max_new_tokens=self.max_number_of_tokens,
                    do_sample=True,
                    top_p=0.95,
                    temperature=1e-3,
                )
            result = self.tokenizer.batch_decode(
                    outputs.detach().cpu().numpy(), skip_special_tokens=True
                )[0]
            return [result]

    async def __call__(self, http_request: Request):
        json_request = await http_request.json()
        return self.generate(json_request['text'])

app = TextClassificationDeployment.bind(model_checkpoint)

### Run the server

In [None]:
serve.run(app)

### Send request

Because the Ray Serve server runs in the background, we can send the request from this same notebook.

In [10]:
prompt_text = "Classify the following sentence that is delimited with triple backticks. ### Sentence:I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. ### Class:"

In [17]:
print(requests.post("http://localhost:8000/", json={"text": prompt_text}).json()[0][len(prompt_text):])

rec.autos
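The slice `[len(prompt_text):]` in the request above works because a causal LM's decoded output starts with the prompt. Below is a slightly more defensive sketch of the same idea. `extract_label` is a hypothetical helper, not part of the original notebook, and it assumes the label is the first non-empty line of the continuation. If decoding does not reproduce the prompt exactly, it falls back to scanning the full text.

```python
def extract_label(generated: str, prompt: str) -> str:
    """Return the predicted class from a causal LM's decoded output (hypothetical helper)."""
    # Causal LMs echo the prompt, so drop it when it is present as a prefix.
    continuation = generated[len(prompt):] if generated.startswith(prompt) else generated
    # Keep only the first non-empty line: the model may keep generating after the label.
    for line in continuation.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Illustrative strings only; in the notebook, `generated` would be the server response.
prompt = "### Sentence: some text ### Class:"
generated = prompt + " rec.autos\n### Sentence: ..."
print(extract_label(generated, prompt))  # rec.autos
```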


(ServeReplica:default:TextClassificationDeployment pid=2348) INFO 2023-11-06 19:34:48,396 TextClassificationDeployment default#TextClassificationDeployment#tRiGLe 0457912f-0ee4-4602-a2fe-e4899a762e25 / default replica.py:726 - __CALL__ OK 646.3ms


## vLLM

The second option is vLLM, a library for LLM inference and serving that uses the PagedAttention technique to manage the attention key-value cache efficiently.

To serve our fine-tuned model with vLLM, we need to:

1. Merge the LoRA layers of the fine-tuned model into the base model
2. Upload the merged model files to HuggingFace
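To make step 1 concrete, here is a toy sketch of what merging a LoRA layer means numerically. It assumes the standard LoRA formulation, in which the adapter adds `(lora_alpha / r) * B @ A` to a frozen weight `W`. The matrices and helper functions are illustrative only and are not taken from the fine-tuned model. After merging, a single matrix produces the same output as the base weight plus the adapter path, so the adapter no longer needs to be loaded at serving time.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy values (assumptions for illustration only).
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight, shape d_out x d_in
A = [[0.5, -1.0]]             # LoRA down-projection, shape r x d_in
B = [[2.0], [0.0]]            # LoRA up-projection, shape d_out x r
lora_alpha, r = 2, 1
scale = lora_alpha / r

x = [[1.0], [1.0]]  # input column vector

# Unmerged: base path plus the scaled low-rank adapter path.
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the adapter into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)  # both are [[1.0], [7.0]]
```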


### Creating model repository on HuggingFace

For this step you need a HuggingFace account and an access token with write permission, which you can create here: https://huggingface.co/settings/tokens

Define the following variables below:

*   username: your HuggingFace username
*   model_folder: name of the local folder where the merged model files will be saved

In [14]:
import os

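username = "mariiaponom"  # your HuggingFace username
model_folder = "llama_7b_classification"  # name of the local folder where the merged model files will be saved
model_repo = f'{username}/{model_folder}'  # repository id on the Hub: <username>/<model_folder>

# Create the local folder that the merged model is saved to.
os.makedirs(model_folder, exist_ok=True)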

In [15]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the '

In [None]:
!huggingface-cli repo create $model_folder --type model

git version 2.34.1
git-lfs/3.0.2 (GitHub; linux amd64; go 1.18.1)

You are about to create mariiaponom/llama_7b_classification
Proceed? [Y/n] Y

Your repo now lives at:
  https://huggingface.co/mariiaponom/llama_7b_classification

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/mariiaponom/llama_7b_classification



### Merging fine-tuned model with LoRA layers into one model

In [None]:
import torch
from peft import AutoPeftModelForCausalLM, PeftConfig, PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_type = "llama"

if model_type == "seq2seq":
    # Seq2seq models: load the base model, then attach the LoRA adapter to it.
    config = PeftConfig.from_pretrained(model_checkpoint)

    model = AutoModelForSeq2SeqLM.from_pretrained(
        config.base_model_name_or_path,
    )
    model = PeftModel.from_pretrained(model, model_checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(
        config.base_model_name_or_path
    )
else:
    # Causal LMs: load the base model and the LoRA adapter in one step.
    model = AutoPeftModelForCausalLM.from_pretrained(
        model_checkpoint,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Fold the LoRA weights into the base weights and drop the adapter wrappers.
model = model.merge_and_unload()

# Save the merged model and tokenizer locally and push them to the Hub repository.
model.save_pretrained(model_folder, push_to_hub=True, repo_id=model_repo)
tokenizer.save_pretrained(model_folder, push_to_hub=True, repo_id=model_repo)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.


Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/3.59G [00:00<?, ?B/s]

('/content/drive/My Drive/conference/llama_7b_class/merged_model/tokenizer_config.json',
 '/content/drive/My Drive/conference/llama_7b_class/merged_model/special_tokens_map.json',
 '/content/drive/My Drive/conference/llama_7b_class/merged_model/tokenizer.json')

### Start the server

Unlike the Ray Serve deployment above, which runs in the background, the vLLM server runs in the foreground: the cell that starts it won't complete until the server process is terminated. We therefore have to send requests from outside this Colab notebook, and for that we need an external address for the server.

#### Install ngrok

In [19]:
!wget -q -c -nc https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -n ngrok-stable-linux-amd64.zip

Archive:  ngrok-stable-linux-amd64.zip


#### Run ngrok to tunnel vLLM server port 8000 to the outside world

In [20]:
get_ipython().system_raw('./ngrok http 8000 &')

#### Get the public URL where you can access the vLLM server

In [21]:
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://e803-34-143-212-190.ngrok.io
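The `curl` one-liner above takes the first tunnel that the local ngrok API reports. The same lookup can be sketched in Python with a small guard for the case where ngrok has not finished starting. `pick_public_url` is a hypothetical helper, the sample JSON is made up for illustration, and the only field it relies on is `public_url`, which the one-liner above already uses.

```python
import json


def pick_public_url(tunnels_json: str) -> str:
    """Pick a public URL from the JSON returned by ngrok's local /api/tunnels endpoint."""
    tunnels = json.loads(tunnels_json)["tunnels"]
    if not tunnels:
        raise RuntimeError("ngrok reports no active tunnels; it may still be starting up")
    urls = [tunnel["public_url"] for tunnel in tunnels]
    # Prefer an https tunnel when one is listed, otherwise fall back to the first entry.
    for url in urls:
        if url.startswith("https://"):
            return url
    return urls[0]


# Made-up sample payload. In the notebook, the JSON could be fetched with
# urllib.request.urlopen("http://localhost:4040/api/tunnels").read().decode()
sample = (
    '{"tunnels": [{"public_url": "http://example.ngrok.io"},'
    ' {"public_url": "https://example.ngrok.io"}]}'
)
print(pick_public_url(sample))  # https://example.ngrok.io
```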


#### Run the server

In [None]:
!python -m vllm.entrypoints.openai.api_server --model $model_repo

INFO 11-04 19:55:20 llm_engine.py:72] Initializing an LLM engine with config: model='mariiaponom/test_colab', tokenizer='mariiaponom/test_colab', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 11-04 19:56:29 llm_engine.py:207] # GPU blocks: 26, # CPU blocks: 512
INFO:     Started server process [27217]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     54.86.50.139:0 - "POST / HTTP/1.1" 404 Not Found
INFO 11-04 19:56:57 async_llm_engine.py:371] Received request cmpl-736721c10bce458aa1b5962874254dc3: prompt: 'San Francisco is a', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperat

### Send request

Copy the URL printed by one of the previous cells (under the heading "Get the public URL where you can access the vLLM server").

Open the following notebook: https://colab.research.google.com/drive/11Sy2j0GnAi0rAxHuBnYhfOepmxaUkSfo?usp=drive_link
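For reference, a request to the server could look roughly like the sketch below. It assumes the OpenAI-compatible completions route `/v1/completions` that the vLLM API server exposes (a `POST` to `/` returns 404, as the server log above shows). The helper names, the placeholder URL, and the sampling values are hypothetical; `model` should match the value passed to `--model`.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=20, temperature=0.0):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Placeholder values: replace with your own ngrok URL and model repository.
request = build_completion_request(
    "https://example.ngrok.io", "username/llama_7b_classification", "San Francisco is a"
)
print(request.full_url)
# complete(request) would send the request; it needs the vLLM server to be running.
```

Unlike the Ray deployment, the completions endpoint returns only the generated continuation by default, so the prompt does not need to be sliced off the response.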

