# Serving MistralLite on Custom Text Generation Inference Container

This notebook walks through customizing HuggingFace's [Text Generation Inference (TGI) container](https://github.com/huggingface/text-generation-inference) to serve the open-source MistralLite model for natural language generation. We deploy the customized container for LLM inference and then invoke the deployed endpoint with example prompts.

## Start TGI Server

Run the following cells to start a TGI server configured for long contexts (up to 12,000 input tokens and 12,288 total tokens per request). The Docker container can take several minutes to initialize, and longer on the first run.

In [1]:
!mkdir -p models

In [2]:
!docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
      --model-id amazon/MistralLite \
      --dtype bfloat16 \
      --max-input-length 12000 \
      --max-total-tokens 12288 \
      --max-batch-prefill-tokens 12288 \
      --trust-remote-code

da50da01935dc1671c7f3f09cff8494cd48f2d2e9d5b213eb9a6a017d9047bb3


> **Warning:** On the first run, the Docker container may need 10 minutes or more before it is ready to serve requests.
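
Because startup can take a while, it helps to poll the server before sending prompts; requests sent too early can fail with connection errors. The sketch below is one way to do that using only the standard library. It assumes the TGI server exposes a `GET /health` route that returns HTTP 200 once the model is loaded; the helper names `tgi_is_healthy` and `wait_until_ready`, and the default wait times, are illustrative choices rather than part of TGI or the original notebook.

```python
import http.client
import time
import urllib.request


def tgi_is_healthy(base_url, request_timeout_s=5):
    """Return True if GET {base_url}/health answers with HTTP 200 (assumed TGI route)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=request_timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused or dropped connections, timeouts, and non-2xx responses
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False


def wait_until_ready(base_url, max_wait_s=1200, poll_interval_s=10, probe=tgi_is_healthy):
    """Call probe(base_url) repeatedly until it returns True or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

With the port mapping used above, a call such as `wait_until_ready("http://localhost:443")` blocks until the server reports healthy or 20 minutes have passed, and returns `True` or `False` accordingly. The `probe` argument exists only so the polling loop can be exercised without a running container.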

## Perform Inference

We can now invoke the model. First, install the `text-generation` library; then define an invocation function that prompts the deployed model. Example prompts, including a long-context prompt, follow.

In [3]:
!pip install text_generation==0.6.1

Collecting text_generation==0.6.1
  Obtaining dependency information for text_generation==0.6.1 from https://files.pythonhosted.org/packages/14/f7/cadf3a0fc619a72d7c667d16e96ef0a5b4c557e6e2b4788a0360dfba4fee/text_generation-0.6.1-py3-none-any.whl.metadata
  Downloading text_generation-0.6.1-py3-none-any.whl.metadata (7.8 kB)
Collecting aiohttp<4.0,>=3.8 (from text_generation==0.6.1)
  Obtaining dependency information for aiohttp<4.0,>=3.8 from https://files.pythonhosted.org/packages/41/8e/4c48881316bbced3d13089c4d0df4be321ce79a0c695d82dee9996aaf56b/aiohttp-3.8.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading aiohttp-3.8.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting huggingface-hub<1.0,>=0.12 (from text_generation==0.6.1)
  Obtaining dependency information for huggingface-hub<1.0,>=0.12 from https://files.pythonhosted.org/packages/ef/b5/b6107bd65fa4c96fdf00e4733e2fe5729bb9e5e09997f63074bb43d3ab28/huggingface_

In [4]:
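from text_generation import Client

# The container's port 80 is published on host port 443 (`-p 443:80` above),
# so the endpoint is plain HTTP despite the port number.
SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)


def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    """Stream a completion from the TGI server and return the generated text."""
    # Wrap the raw prompt in the prompter/assistant template.
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,  # greedy decoding; random_seed is unused unless seed= is enabled
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        # temperature=None,
        # truncate=None,
        # seed=random_seed,
        # typical_p=0.2,
    ):
        # Accumulate only non-special tokens.
        if hasattr(response, "token") and not response.token.special:
            snippet = response.token.text
            output += snippet
            if print_stream:
                print(snippet, end="", flush=True)
    return output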
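
The default `max_new_tokens=400` interacts with the limits passed to the container at startup. The sketch below works out the resulting prompt budget, assuming TGI rejects a request when the prompt exceeds `--max-input-length` or when prompt tokens plus `max_new_tokens` exceed `--max-total-tokens`. The function name `max_prompt_tokens` is hypothetical and introduced only for illustration.

```python
MAX_INPUT_LENGTH = 12000  # --max-input-length from the docker run command
MAX_TOTAL_TOKENS = 12288  # --max-total-tokens from the docker run command


def max_prompt_tokens(max_new_tokens,
                      max_input_length=MAX_INPUT_LENGTH,
                      max_total_tokens=MAX_TOTAL_TOKENS):
    """Largest prompt (in tokens) that still leaves room for max_new_tokens."""
    if max_new_tokens <= 0 or max_new_tokens >= max_total_tokens:
        raise ValueError("max_new_tokens must be between 1 and max_total_tokens - 1")
    # The prompt is capped both by the input limit and by what remains
    # of the total budget after reserving the generation tokens.
    return min(max_input_length, max_total_tokens - max_new_tokens)


print(max_prompt_tokens(400))  # budget with the default used by invoke_tgi
print(max_prompt_tokens(288))  # budget when generation fits in the slack
```

Under these assumptions, the full 12,000-token input is available only when `max_new_tokens` is 288 or less; with the default of 400, the prompt (including the `<|prompter|>` and `<|assistant|>` template tokens) has to fit in 11,888 tokens.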




In [7]:
prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))