<a href="https://colab.research.google.com/github/behzad-amini/AI-INTEGRATED-JUPYTER/blob/main/Publish_LLM_With_Ngrok.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we load a model, wrap it in an OpenAI-compatible API using Flask, and then publish that API through Ngrok's free service.

In [None]:
!pip install -U transformers accelerate optimum auto-gptq
!pip install flask pyngrok

## Load Your Model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

### Generate Response

In [None]:
def gen_response(messages):
  # Render the chat messages into the prompt format the model expects.
  text = tokenizer.apply_chat_template(
      messages,
      tokenize=False,
      add_generation_prompt=True
  )
  model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

  generated_ids = model.generate(
      model_inputs.input_ids,
      attention_mask=model_inputs.attention_mask,
      max_new_tokens=512
  )
  # generate() returns prompt + completion; keep only the newly generated tokens.
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  return response
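
The slicing step in `gen_response` is needed because, for a decoder-only model, `model.generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of that step, using plain Python lists in place of tensors (the token IDs and the helper name `strip_prompt` are made up for illustration):

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]

# Made-up token IDs: each output starts with its prompt, then the new tokens.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]

new_tokens = strip_prompt(prompts, outputs)
print(new_tokens)  # [[42, 43], [55]]
```

Without this step, the decoded response would start with the system and user messages instead of only the assistant's reply.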

## Publish The Model

### Set Up Ngrok

Get your Ngrok auth token from the [Ngrok dashboard](https://dashboard.ngrok.com/tunnels/authtokens), then set it in the environment variable `NGROK_TOKEN`.
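
One way to set the variable from inside the notebook, without hardcoding the token in a cell, is to prompt for it. A minimal sketch, assuming an interactive session; the helper name `ensure_env` is hypothetical and not part of pyngrok:

```python
import os
from getpass import getpass

def ensure_env(name, prompt=getpass):
    """Return os.environ[name], asking for a value first if it is unset or empty."""
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]

# Usage in the notebook (prompts only when NGROK_TOKEN is missing):
# ensure_env("NGROK_TOKEN")
```

Because `getpass` does not echo the input, the token should not end up in the saved cell output.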

In [None]:
import os
from pyngrok import ngrok, conf

port = 8000

conf.get_default().auth_token = os.environ["NGROK_TOKEN"]
public_url = ngrok.connect(port).public_url

print(f"Ngrok Tunnel '{public_url}' -> 'http://127.0.0.1:{port}'")

### Set Up API

In [None]:
import time
import random

from flask import Flask, request, jsonify

app = Flask(__name__)
app.config["BASE_URL"] = public_url

@app.route("/")
def index():
  return "hello"

@app.route("/chat/completions", methods=["POST"])
def chat_completions():
    # Note: Better to validate the request data
    data = request.json
    resp_content = gen_response(data["messages"])

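    response = {
        "id": str(random.randint(111111, 999999)),
        "object": "chat.completion",
        "created": int(time.time()),  # Unix timestamp in whole seconds
        "model": data.get("model", model_name),  # fall back to the loaded model's name
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": resp_content},
            # Simplification: hitting max_new_tokens is not reported as "length".
            "finish_reason": "stop",
        }]
    }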
    return jsonify(response)

# Blocks this cell while the server is running; interrupt the cell to stop it.
app.run(port=port)
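
The note about validating the request data could be addressed with a small check that runs before `gen_response` is called. A minimal sketch in plain Python; the helper name `validate_chat_request`, the accepted roles, and the error messages are assumptions, not part of Flask or of the OpenAI API:

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: roles used in this notebook

def validate_chat_request(data):
    """Return an error message for a malformed request body, or None if it looks usable."""
    if not isinstance(data, dict):
        return "request body must be a JSON object"
    messages = data.get("messages")
    if not isinstance(messages, list) or not messages:
        return "'messages' must be a non-empty list"
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            return f"messages[{i}] must be an object"
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            return f"messages[{i}].role must be one of {sorted(ALLOWED_ROLES)}"
        if not isinstance(message.get("content"), str):
            return f"messages[{i}].content must be a string"
    return None

print(validate_chat_request({"messages": [{"role": "user", "content": "Hi"}]}))  # None
print(validate_chat_request({"messages": []}))  # 'messages' must be a non-empty list
```

Inside `chat_completions`, a non-`None` result could be returned as `jsonify({"error": {"message": error}}), 400` so that a bad request never reaches the model.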

## Test With OpenAI Client

In [None]:
# Run this from a separate notebook or machine: the server cell above blocks while it is serving.
#!pip install openai

from openai import OpenAI

client = OpenAI(
    base_url="<PUBLIC-URL>",  # Copy the printed public url.
    api_key="fake",  # It will be ignored.
)

response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": "What is Ngrok",
        }
    ],
    model="qwen2",
)
print(response.choices[0].message.content)
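
For debugging without the `openai` package, the same call can be expressed as a plain HTTP request. A minimal sketch using only the standard library; the helper name `build_chat_request` is hypothetical, and `https://example.ngrok-free.app` is a placeholder for the printed public URL. The request is only constructed here, not sent:

```python
import json
import urllib.request

def build_chat_request(base_url, messages, model="qwen2"):
    """Build the POST request that targets the notebook's /chat/completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "https://example.ngrok-free.app",  # placeholder, replace with your public URL
    [{"role": "user", "content": "What is Ngrok?"}],
)
print(req.get_method(), req.full_url)

# To actually send it while the server cell is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Seeing the raw request makes it easier to tell whether a failure comes from the tunnel, the Flask route, or the client library.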