# Serving MistralLite on HuggingFace Text Generation Inference Container

This notebook walks through deploying the open-source MistralLite model for natural language generation with HuggingFace's [Text Generation Inference Container](https://github.com/huggingface/text-generation-inference). We deploy the container for LLM inference, then invoke the deployed endpoint with example prompts.

## Start TGI Server

This repository includes bash scripts that download the model, then build and deploy the Docker container that serves the LLM. Run the following cells to execute those scripts and launch the model.

In [None]:
!mkdir -p models

In [None]:
!bash ./docker_build.sh

In [None]:
!bash start_mistrallite.sh

> Warning: The first launch can take 10+ minutes before the Docker container is ready.
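
Rather than waiting a fixed amount of time, one option is to poll the server until it answers. The sketch below assumes the container publishes TGI's `/health` route on the host and port used later in this notebook (`localhost:443`). `http_probe`, `wait_until_ready`, and the `probe` parameter are hypothetical names introduced here for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, or timed out: the server is not ready yet.
        return False


def wait_until_ready(url, timeout_s=900, interval_s=10, probe=http_probe):
    """Poll `probe(url)` until it succeeds or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until_ready("http://localhost:443/health")` would then return `True` once the server responds, or `False` if the assumed 15-minute limit passes first.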

## Perform Inference

We can now invoke the model. First install the `text-generation` library, then define an invocation function that prompts the deployed model. Example prompts, including a long-context prompt, follow.

In [None]:
!pip install text_generation==0.6.1

In [None]:
from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

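def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    """Stream a completion from the TGI server and return the generated text."""
    # Wrap the raw prompt in the prompter/assistant template when requested.
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"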
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        #temperature=None,
        #truncate=None,
        #seed=random_seed,
        #typical_p=0.2,
    ):
        # Accumulate visible tokens only; special tokens (such as end-of-sequence) are skipped.
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if print_stream:
                    print(snippet, end='', flush=True)
    return output
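
The prompt template applied inside `invoke_tgi` can also be factored into a small helper. The sketch below is one way to do that. `format_prompt` and the three constants are hypothetical names, and the check for an already-wrapped prompt is an added convenience, not part of the original function.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END_OF_TURN = "</s>"


def format_prompt(text):
    """Wrap `text` in the prompter/assistant template, unless it is already wrapped."""
    if text.startswith(PROMPTER) and text.endswith(ASSISTANT):
        # Already in template form: return unchanged to avoid wrapping twice.
        return text
    return f"{PROMPTER}{text}{END_OF_TURN}{ASSISTANT}"
```

A caller could pass `format_prompt(prompt)` to `invoke_tgi` together with `assist_role=False`, so that a prompt wrapped earlier is not wrapped a second time.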

In [None]:
prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)
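
When streamed output is not needed, the `text_generation` client also offers a blocking `generate` call that returns the whole completion at once. This is a minimal sketch using the same template and decoding settings as `invoke_tgi`. `invoke_tgi_blocking` is a hypothetical helper name, and the client is passed in as a parameter so the function does not depend on a global.

```python
def invoke_tgi_blocking(client, prompt, max_new_tokens=400):
    """Return the full completion for `prompt` in a single, non-streaming request."""
    wrapped = f"<|prompter|>{prompt}</s><|assistant|>"
    response = client.generate(
        wrapped,
        do_sample=False,  # greedy decoding, as in the streaming version
        max_new_tokens=max_new_tokens,
        return_full_text=False,
    )
    return response.generated_text
```

With the client defined earlier, this would be called as `result = invoke_tgi_blocking(tgi_client, prompt)`.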

Next, try a long-context prompt of over 13,400 tokens, copied from the [Amazon Aurora FAQs](https://aws.amazon.com/rds/aurora/faqs/).

In [None]:
with open("../example_long_ctx.txt", "r") as fin:
    task_instruction = fin.read()
    task_instruction = task_instruction.format(
        my_question="please tell me how does pgvector help with Generative AI and give me some examples."
    )
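prompt = f"<|prompter|>{task_instruction}</s><|assistant|>"
# The prompt is already wrapped in the template, so disable the wrapping inside invoke_tgi.
result = invoke_tgi(prompt, assist_role=False)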
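
The `format` call above implies that the text file carries a `{my_question}` placeholder. The sketch below illustrates that substitution on a short inline template (a stand-in, not the contents of `example_long_ctx.txt`). It also shows that literal braces in such a template must be doubled so that `str.format` leaves them in place.

```python
# Stand-in template; the real file is assumed to contain a {my_question} placeholder.
template = (
    "Context: some reference text with a literal JSON snippet {{\"dims\": 3}}\n"
    "Question: {my_question}"
)

# Only the named placeholder is substituted; doubled braces become single braces.
filled = template.format(my_question="How does pgvector help?")
print(filled)
```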