# Using MistralLite with vLLM Server

This notebook walks through using the open-source MistralLite model for natural language generation with [vLLM](https://vllm.readthedocs.io/en/latest/index.html). We start the model serving container with a bash script and run inference through vLLM's [OpenAI-compatible server](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html#openai-compatible-server). The notebook includes code to start the model container, test it on short prompts, and demonstrate its capabilities on longer prompts loaded from files.

In [None]:
!bash ./start_vllm_container.sh

In [None]:
!pip install openai==0.28.1
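
The model can take a while to load after the container starts, so it can help to wait until the server answers before sending prompts. The following is a minimal sketch, not part of the original notebook: `wait_for_server` is a hypothetical helper, and the URL assumes the server listens on port 8000 and exposes the `/v1/models` route that `openai.Model.list()` queries. Whether `start_vllm_container.sh` already blocks until the model is ready is not known here.

```python
import time
import urllib.request


def wait_for_server(url, timeout=300.0, interval=5.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse.

    Returns True once the server responds, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error responses
            # (urllib's URLError and HTTPError are OSError subclasses).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Assumed endpoint; adjust the port if your container maps a different one.
# ready = wait_for_server("http://localhost:8000/v1/models")
```

If the call returns `False`, the container logs are the first place to look.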

# Invoking the vLLM Container

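With the container running, the cells below query the deployed MistralLite model through vLLM's OpenAI-compatible API. The client is pointed at the local server on port 8000 and the API key is set to the placeholder `"EMPTY"`; replace it with your own key if your server is configured to require one. The example prompts cover a short question without additional context and a longer prompt whose context is loaded from a file.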

In [None]:
import openai

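def invoke_vllm(prompt, assist_role=True, stream=True, max_tokens=400):
    """Send `prompt` to the local vLLM server, print the completion, and return it.

    If `assist_role` is True, the prompt is wrapped in the
    <|prompter|>...</s><|assistant|> template before it is sent.
    """
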
    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai.api_key = "EMPTY"
    vllm_container_port = 8000
    openai.api_base = f"http://localhost:{vllm_container_port}/v1"

    # List models API
    models = openai.Model.list()
    #print("Models:", models)

    model = models["data"][0]["id"]

    # Wrap the prompt in the prompter/assistant template.
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"

    # Completion API
    completion = openai.Completion.create(
        model=model,
        prompt=prompt,
        echo=False,
        n=1,
        stream=stream,
        max_tokens=max_tokens,
        #ignore_eos=False,
        #temperature=0.1,
        temperature=0,
        #top_p=0.95,
        #stop=["</s>"],
    )
    if stream:
        output = ""
        for c in completion:
            # Each streamed chunk carries the next piece of generated text.
            d = c["choices"][0]["text"]
            output += d
            print(d, end='', flush=True)
    else:
        output = completion["choices"][0]["text"]
        print(output)
    return output
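
The prompt template used by `invoke_vllm` can also be factored into a small helper that refuses to wrap an already wrapped prompt. This is a sketch only: `build_prompt` and the constant names are hypothetical and not part of the notebook; the template strings are copied from the function above.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
EOS = "</s>"


def build_prompt(text):
    """Wrap `text` in the prompter/assistant template unless it is already wrapped."""
    if text.startswith(PROMPTER) and text.endswith(ASSISTANT):
        return text
    return f"{PROMPTER}{text}{EOS}{ASSISTANT}"


print(build_prompt("What is vLLM?"))
```

A prompt built this way would be passed with `assist_role=False`, since the template has already been applied.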

In [None]:
prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_vllm(prompt)

In [None]:
with open("../example_long_ctx.txt", "r") as fin:
    task_instruction = fin.read()
    task_instruction = task_instruction.format(
        my_question="please tell me how does pgvector help with Generative AI and give me some examples."
    )
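# The template is applied here, so disable assist_role to avoid wrapping the prompt twice.
prompt = f"<|prompter|>{task_instruction}</s><|assistant|>"
result = invoke_vllm(prompt, assist_role=False)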
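
One caveat with `str.format` on a long context file: any other curly braces in the text (JSON snippets, code samples) are parsed as format fields and raise an error. Whether `example_long_ctx.txt` contains such braces is not known here, so the following is only a sketch of a brace-safe alternative. `fill_template` is a hypothetical helper and the sample text is made up for illustration.

```python
def fill_template(template, **values):
    """Replace each `{name}` placeholder literally, leaving other braces untouched."""
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template


# Made-up sample text containing a JSON-like snippet next to the placeholder.
doc = 'Config: {"dim": 1536}\nQuestion: {my_question}'
print(fill_template(doc, my_question="What is pgvector?"))
```

Only the named placeholders are replaced, so the rest of the document passes through unchanged.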