# How to Serve MistralFlite on TGI

## Start TGI Server

In [3]:
!mkdir -p models

In [112]:
!docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
      --model-id amazon/MistralLite \
      --max-input-length 8192 \
      --max-total-tokens 16384 \
      --max-batch-prefill-tokens 16384 \
      --trust-remote-code

5cb854d3699438d50ab9ee41cd49e67da1366455a4c8a3ae4e47e41ffa440468


> Warning: You may need to wait for 10+ minutes for the docker container to be ready for the first time.

## Perform Inference

In [7]:
!pip install text_generation==0.6.1

Collecting text_generation
  Obtaining dependency information for text_generation from https://files.pythonhosted.org/packages/14/f7/cadf3a0fc619a72d7c667d16e96ef0a5b4c557e6e2b4788a0360dfba4fee/text_generation-0.6.1-py3-none-any.whl.metadata
  Downloading text_generation-0.6.1-py3-none-any.whl.metadata (7.8 kB)
Collecting aiohttp<4.0,>=3.8 (from text_generation)
  Obtaining dependency information for aiohttp<4.0,>=3.8 from https://files.pythonhosted.org/packages/41/8e/4c48881316bbced3d13089c4d0df4be321ce79a0c695d82dee9996aaf56b/aiohttp-3.8.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading aiohttp-3.8.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting multidict<7.0,>=4.5 (from aiohttp<4.0,>=3.8->text_generation)
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.5/114.5 kB[0m [31m3.4 MB/s[0m eta [36

In [117]:
from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_tgi(prompt, 
                      random_seed=1, 
                      max_new_tokens=400, 
                      print_stream=True,
                      assist_role=True):
    if (assist_role):
        prompt = f"<|prompter|>{prompt}<|/s|><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        temperature=None,
        truncate=None,
        seed=random_seed,
        typical_p=0.2,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if (print_stream):
                    print(snippet, end='', flush=True)
    return output

In [118]:
prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)

 1.  Access to sufficient training data
    To support a long context for LLM, the model needs to be trained on a large and diverse set of data that captures a wide range of language phenomena and concepts.  However, acquiring and annotating this data can be time-consuming and costly, and may require specialized knowledge or resources.
    2.  Data imbalance and bias
    The data used to train the model may not accurately reflect the real-world distribution of language use, and may be biased towards certain demographic groups or topics.  This can result in the model producing biased or inaccurate outputs.
    3.  Memory limitations
    The amount of memory required to store and process a long context can be prohibitive, especially for larger models.  This can limit the effectiveness of the model in handling complex or nuanced language phenomena.
    4.  Training time and computational resources
    Training a model that can handle a long context can take a significant amount of time an