# How to Serve MistralLite on TGI

## Start TGI Server

In [1]:
!mkdir -p models

In [3]:
!bash ./docker_build.sh

Sending build context to Docker daemon  79.87kB
Step 1/28 : FROM debian:bullseye-slim as pytorch-install
 ---> 1411dd05edfc
Step 2/28 : ARG PYTORCH_VERSION=2.0.1
 ---> Using cache
 ---> ce0b1c250647
Step 3/28 : ARG PYTHON_VERSION=3.9
 ---> Using cache
 ---> 31613d3e3dae
Step 4/28 : ARG CUDA_VERSION=11.8
 ---> Using cache
 ---> a6c754f560da
Step 5/28 : ARG MAMBA_VERSION=23.1.0-1
 ---> Using cache
 ---> 595317c72a6c
Step 6/28 : ARG CUDA_CHANNEL=nvidia
 ---> Using cache
 ---> 4f0012d80a92
Step 7/28 : ARG INSTALL_CHANNEL=pytorch
 ---> Using cache
 ---> 424e31a0e8cb
Step 8/28 : ARG TARGETPLATFORM
 ---> Using cache
 ---> 23a82f9ff2ee
Step 9/28 : ENV PATH /opt/conda/bin:$PATH
 ---> Using cache
 ---> 951b4363dcfa
Step 10/28 : RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends         build-essential         ca-certificates         ccache         curl         git &&         rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 858c5d97d790
Step 11/28 :

In [4]:
!bash start_mistrallite.sh

a6f065db42eddb6e27879abc8e5a11845fe0658295853fda17efd4c4af3846e1


> Warning: The first time the Docker container starts, it may take 10 minutes or more to become ready.
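Rather than waiting a fixed time, the notebook can poll the server until it answers. The following is a minimal sketch, assuming the container exposes TGI's `/health` route on the same host and port used by the client later in this notebook. The helper names `tgi_is_healthy` and `wait_until_ready`, and the timeout values, are illustrative and not part of the original scripts.

```python
import time
import urllib.error
import urllib.request


def tgi_is_healthy(base_url):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status while the model loads.
        return False


def wait_until_ready(base_url, probe=tgi_is_healthy, timeout_s=900,
                     interval_s=10, sleep=time.sleep):
    """Poll probe(base_url) until it succeeds or roughly timeout_s has elapsed."""
    waited = 0
    while True:
        if probe(base_url):
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example (assumed URL, matching the client cell below):
# wait_until_ready("http://localhost:443")
```

The `probe` and `sleep` arguments are injectable only so the loop can be exercised without a running server.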

## Perform Inference

In [7]:
!pip install text_generation==0.6.1

Collecting text_generation
  Obtaining dependency information for text_generation from https://files.pythonhosted.org/packages/14/f7/cadf3a0fc619a72d7c667d16e96ef0a5b4c557e6e2b4788a0360dfba4fee/text_generation-0.6.1-py3-none-any.whl.metadata
  Downloading text_generation-0.6.1-py3-none-any.whl.metadata (7.8 kB)
Collecting aiohttp<4.0,>=3.8 (from text_generation)
  Obtaining dependency information for aiohttp<4.0,>=3.8 from https://files.pythonhosted.org/packages/41/8e/4c48881316bbced3d13089c4d0df4be321ce79a0c695d82dee9996aaf56b/aiohttp-3.8.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading aiohttp-3.8.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting multidict<7.0,>=4.5 (from aiohttp<4.0,>=3.8->text_generation)
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)

In [5]:
from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

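def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    """Stream a greedy completion from the TGI server and return the generated text."""
    # Wrap the raw question in the prompter/assistant template expected by the model.
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        # Sampling options are left disabled, so random_seed is currently unused.
        #temperature=None,
        #truncate=None,
        #seed=random_seed,
        #typical_p=0.2,
    ):
        # Skip special tokens such as </s> so that only visible text is collected.
        if hasattr(response, "token") and not response.token.special:
            snippet = response.token.text
            output += snippet
            if print_stream:
                print(snippet, end='', flush=True)
    return output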
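
Because `invoke_tgi` applies the prompt template whenever `assist_role=True`, passing text that is already wrapped would apply the template twice. A small helper can make the wrapping idempotent. This is a sketch only: `build_prompt` and the constant names are hypothetical, while the template strings are taken from the cell above.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END_OF_TURN = "</s>"


def build_prompt(user_text):
    """Wrap user_text in the prompter/assistant template exactly once."""
    # Text that already carries the template is returned unchanged.
    if user_text.startswith(PROMPTER) and user_text.endswith(ASSISTANT):
        return user_text
    return f"{PROMPTER}{user_text}{END_OF_TURN}{ASSISTANT}"


print(build_prompt("What is TGI?"))
# prints: <|prompter|>What is TGI?</s><|assistant|>
```

With such a helper, a caller could build the prompt once and call `invoke_tgi(prompt, assist_role=False)`.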




In [6]:
prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)

 There are several challenges to supporting a long context for LLMs, including:

1. Memory Limitations: LLMs are limited in the amount of data they can store in memory, which can limit the length of the context they can support.

2. Computational Complexity: Generating a response for a long context can be computationally expensive, especially for larger models.

3. Training Data: Training an LLM to support a long context requires a large amount of training data, which can be difficult and time-consuming to collect.

4. Bias and Unanswerable Questions: As the context gets longer, there is a greater risk of bias and unanswerable questions, which can be difficult for the model to handle.

5. Human Evaluation: Evaluating the quality of responses for a long context can be challenging, as it may require human evaluators to read and understand a large amount of text.

6. Scalability: As the context gets longer, it becomes more difficult to scale the model to handle the increased computational
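
The same greedy request can also be sent without the `text_generation` client, which is handy for debugging. The sketch below assumes the server exposes TGI's non-streaming `POST /generate` route, which takes an `inputs` string plus a `parameters` object and returns a JSON body with a `generated_text` field. The function names are illustrative.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=400):
    """Build a POST request mirroring the arguments used in invoke_tgi."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "do_sample": False,
            "max_new_tokens": max_new_tokens,
            "return_full_text": False,
        },
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, max_new_tokens=400, timeout=60):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["generated_text"]


# Example (assumed URL, matching the client cell above):
# generate("http://localhost:443", "<|prompter|>Hello</s><|assistant|>", max_new_tokens=50)
```

Unlike `invoke_tgi`, this returns the whole completion at once instead of streaming tokens.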

Next, try a long context of more than 13,400 tokens, copied from the [Amazon Aurora FAQs](https://aws.amazon.com/rds/aurora/faqs/).
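
Before sending a prompt this long, it helps to check that it fits the server's limits. TGI's launcher takes `--max-input-length` and `--max-total-tokens`; the values below are assumptions for illustration, so read the real ones from `start_mistrallite.sh`. The sketch also treats 13,400 as an approximate prompt length rather than a measured token count.

```python
def fits_token_budget(prompt_tokens, max_new_tokens,
                      max_input_length, max_total_tokens):
    """Return True if the request satisfies both server-side limits."""
    if prompt_tokens > max_input_length:
        return False
    # Prompt and generated tokens together must stay within the total budget.
    return prompt_tokens + max_new_tokens <= max_total_tokens


# Hypothetical limits; the actual values depend on how the server was launched.
MAX_INPUT_LENGTH = 16000
MAX_TOTAL_TOKENS = 16384

print(fits_token_budget(13400, 400, MAX_INPUT_LENGTH, MAX_TOTAL_TOKENS))
# prints: True  (13400 <= 16000 and 13400 + 400 = 13800 <= 16384)
```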

In [7]:
with open("../example_long_ctx.txt", "r") as fin:
    task_instruction = fin.read()
    task_instruction = task_instruction.format(
        my_question="please tell me how does pgvector help with Generative AI and give me some examples."
    )
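prompt = f"<|prompter|>{task_instruction}</s><|assistant|>"
# The prompt is already wrapped in the template, so disable the wrapping inside invoke_tgi.
result = invoke_tgi(prompt, assist_role=False)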

 pgvector is an open-source extension for PostgreSQL supported by Amazon Aurora PostgreSQL-Compatible Edition.

You can use pgvector to store, search, index, and query billions of embeddings that are generated from machine learning (ML) and artificial intelligence (AI) models in your database, such as those from Amazon Bedrock (limited preview) or Amazon SageMaker. A vector embedding is a numerical representation that represents the semantic meaning of content such as text, images, and video.

With pgvector, you can query embeddings in your Aurora PostgreSQL database to perform efficient semantic similarity searches of these data types, represented as vectors, combined with other tabular data in Aurora. This enables the use of generative AI and other AI/ML systems for new types of applications such as personalized recommendations based on similar text descriptions or images, candidate match based on interview notes, customer service next best action recommendations based on successful 