# LISA v2 Demo

In this notebook, we'll dive into using LISA for your own LLM applications. LISA supports the [OpenAI specification](https://platform.openai.com/docs/api-reference), so you can use it as a drop-in replacement for any other OpenAI-compatible models that your application already uses.

This demo uses the [openai-python](https://github.com/openai/openai-python) library, so you may need to install the dependency with `pip install openai` if you have not done so already.

In [None]:
# Import libraries that will be used in this demo:

## Just for printing things nicely
from IPython.lib.pretty import pretty

## OpenAI library for communicating with LLMs
from openai import OpenAI
from openai.types.chat import ChatCompletionMessage

## Configuration and Validation
LISA handles API tokens a little differently from OpenAI: the token is sent in an `Api-Key` header, and the client's `api_key` field is ignored. Setting up the OpenAI client therefore takes a little extra configuration, but after that you can use it like any other OpenAI client.

**Notice**: This demo depends on the models that are deployed in your account. Besides the required info in the next cell, you will need to replace two more placeholder values:
- YOUR-TEXTGEN-MODEL-HERE - in the "Chatting With Your Model" cell, replace this text with a text generation model ID
- YOUR-EMBEDDING-MODEL-HERE - in the "Embeddings" cell, replace this text with an embedding model ID

In [None]:
#### REQUIRED INFO, FILL THIS OUT FIRST

# If you still need an API token, follow the instructions here to get one: https://github.com/awslabs/LISA?tab=readme-ov-file#programmatic-api-tokens
api_token = "YOUR-TOKEN-HERE"

# The base URL is the LISA Serve REST API load balancer, plus the "/v2/serve" path, as that is where LISA starts handling the OpenAI spec.
# If you set up a custom domain with a certificate, use that domain here, otherwise, you may use the LISA Serve REST API ALB name directly.
lisa_serve_base_url = "https://YOUR-ALB-HERE/v2/serve"

#### END REQUIRED INFO

# initialize OpenAI client
client = OpenAI(
    api_key="ignored", # LISA ignores this field, but it must be defined # pragma: allowlist-secret not a real key
    base_url=lisa_serve_base_url,
    default_headers={"Api-Key": api_token},
)

# As an example, let's list models. If this succeeds, then we configured our client correctly.
model_list = client.models.list().data
print(pretty(model_list))
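
Since the rest of this demo depends on model IDs that appear in that listing, it can help to check your chosen IDs before going further. The following is a minimal sketch, not part of LISA or the OpenAI library: `find_missing_models` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the model entries returned by `client.models.list().data`, each of which exposes an `id` attribute.

```python
from types import SimpleNamespace


def find_missing_models(available_models, wanted_ids):
    """Return the wanted model IDs that are absent from the listing, in order."""
    available_ids = {m.id for m in available_models}
    return [model_id for model_id in wanted_ids if model_id not in available_ids]


# Stand-in listing for illustration only; in the notebook, pass `model_list` instead.
example_models = [SimpleNamespace(id="textgen-a"), SimpleNamespace(id="embed-b")]
missing = find_missing_models(example_models, ["textgen-a", "not-deployed"])
print(f"Missing models: {missing}")
```

In the notebook itself, you would call `find_missing_models(model_list, [...])` with the text generation and embedding model IDs you plan to use, and fix any that come back as missing.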

## Chatting With Your Model

Let's query our model by using the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create). Because we've already configured our client and have confirmed that we see a list of models, our next steps are:
1. Identify a model we want to use
2. Set up initial messages context for talking with the model
3. Record context if we want to ask followup questions

The following cell does all three of these.

In [None]:
# Let's continue using one of the models found in that list. Edit the following to match one in your response.
model = "YOUR-TEXTGEN-MODEL-HERE"

# Let's start a conversation!
# Not all models support the "system" role, so for a more general purpose demo, this example uses the "user" role for the first message.
messages = [
    {"role": "user", "content": "You are a helpful and friendly AI assistant who answers questions succinctly and accurately. If you do not know the answer, you truthfully admit it instead of making up an answer. All following messages are a conversation between you, the AI assistant, and a user. Acknowledge that you understand."},
    {"role": "assistant", "content": "Understood."},
    {"role": "user", "content": "How do I translate the following into Dutch? 'I have no bananas today.'"},
]
chat_response = client.chat.completions.create(
    model=model,
    messages=messages
)
assistant_message = chat_response.choices[0].message
print(assistant_message.content.strip()) # Print how our model responded

# Let's append that message to the context, and keep asking questions after
messages.append(assistant_message)

### Chatting With Context

Now that we've made a call to the model and recorded its response, we can continue the conversation as if we were talking to another human. The following cell asks a highly context-sensitive question that would not make sense without the conversation before it. Because the model's response and our next query are both added to the same list of messages, each request carries the entire conversation history, which gives the model what it needs to answer. In this case, we expect the model to replace the word "banana" with "orange" and translate the new sentence into Dutch, which is only achievable with the context from the previous messages.

In [None]:
# Given we asked about bananas, a fruit, and the messages contain context for translation with a fruit, we should still expect a translation response.
messages.append(
    {"role": "user", "content": "What about oranges instead?"}
)
chat_response = client.chat.completions.create(
    model=model,
    messages=messages
)
assistant_message = chat_response.choices[0].message
print(assistant_message.content.strip())
# And add to context
messages.append(assistant_message)

In [None]:
# The most recent message was already added to the context in the previous cell, so let's print out what we have so far:

print(pretty(messages))
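
The append, call, append pattern used in the cells above can be wrapped in a small function so that every follow-up question goes through the same steps. This is a rough sketch under stated assumptions: `ask` is a hypothetical helper, not part of the OpenAI library, and `_EchoCompletions` and `fake_client` are stand-ins so the example runs without a deployed model. With LISA, you would pass the real `client` and `model` from this notebook instead.

```python
from types import SimpleNamespace


def ask(client, model, messages, question):
    """Append a user question, call the Chat Completions API, and record the reply.

    `messages` is mutated in place so the next call sees the whole conversation.
    """
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message
    # Store the reply as a plain dict so the history stays uniform.
    messages.append({"role": reply.role, "content": reply.content})
    return reply.content


class _EchoCompletions:
    """Stand-in for `client.chat.completions`; echoes the last user message."""

    def create(self, model, messages):
        text = f"echo: {messages[-1]['content']}"
        message = SimpleNamespace(role="assistant", content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=_EchoCompletions()))
history = []
first = ask(fake_client, "demo-model", history, "Hello")
second = ask(fake_client, "demo-model", history, "What about oranges instead?")
print(f"{len(history)} messages recorded")
```

Keeping the history as plain dictionaries is a design choice for this sketch; the cells above append the message object returned by the library directly, which also works with the same client.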

## Completions

If your application still needs the [legacy Completions API](https://platform.openai.com/docs/api-reference/completions/create), LISA supports that as well. Using the same client, we can call it to generate text, like so:

In [None]:
completions_response = client.completions.create(
    model=model,
    prompt="Generate a fully commented Python code block that can print the phrase 'Hello, World!' 4 times. This code block should be embedded in a markdown block so that it can be rendered in a Jupyter notebook"
)
print(completions_response.choices[0].text.strip())

## Streaming

Because querying an LLM does not produce instantaneous output, you may want to stream the response so that you get tokens as they become available. If your model supports streaming, the LISA endpoint can handle it too.
Using the same client, let's ask for a lot of text and stream it instead of waiting for the entire response. Streaming keeps the connection open, allowing us to retrieve and process tokens as soon as the model makes them available.

In [None]:
chat_streaming_response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": "In as many words and as much detail as possible, describe how to create a peanut butter and jelly sandwich. Assume that the user is starting with fresh peanuts and fresh grapes."},
    ],
    stream=True
)
for chunk in chat_streaming_response:
    print(chunk.choices[0].delta.content or "", end="")
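
If you also need the full text after streaming has finished, for example to add it to a conversation history, the chunks can be accumulated while they are printed. The following is a sketch under assumptions: `collect_stream` is a hypothetical helper, the fake chunks are stand-ins shaped like the streamed objects used above, and the guard for chunks without choices assumes that some servers may send such chunks.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Print streamed text as it arrives and return the concatenated reply."""
    parts = []
    for chunk in chunks:
        # Assumption: a chunk may carry no choices at all, so skip those.
        if not chunk.choices:
            continue
        # `content` can be None on some chunks, so fall back to an empty string.
        text = chunk.choices[0].delta.content or ""
        print(text, end="", flush=True)
        parts.append(text)
    print()  # finish the line once the stream ends
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in chunk for illustration only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk("Hello"),
    _fake_chunk(", "),
    _fake_chunk(None),
    SimpleNamespace(choices=[]),
    _fake_chunk("world"),
]
full_text = collect_stream(fake_stream)
```

In the notebook, you would pass `chat_streaming_response` to `collect_stream` in place of `fake_stream`.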

## Embeddings

If your application integrates with a vector datastore, then you are almost certainly going to need some form of vector generation. If LISA is serving an embedding model for you, then the [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create) is what you need.

The following will show how to call an embedding model so that you can handle the vectors directly in your application.

In [None]:
# Change this model so that it matches your embedding model as listed in the models at the beginning of this demo
embedding_model = "YOUR-EMBEDDING-MODEL-HERE"

embeddings_response = client.embeddings.create(
    model=embedding_model,
    input="Hello, world!",
)
vector = embeddings_response.data[0].embedding

# Print out some vector stats instead of showing a huge number of numbers
print(f"Vector length: {len(vector)}, Vector min: {min(vector)}, Vector max: {max(vector)}")

Once you've retrieved the vector from your model, you can use it in a variety of ways, most commonly in Retrieval Augmented Generation (RAG) applications. For document ingestion, you would run your documents through the embedding model and store the resulting vectors in a vector-optimized datastore, such as OpenSearch or PGVector within PostgreSQL. For new queries, you generate a vector from the prompt and use it for a document similarity search in your datastores. If the vector database returns a relevant hit, that document can be added to the user prompt to provide additional context to the LLM. In this way, the text generation model can ground its answers in the documents you have uploaded to your datastores.
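
The similarity search step can be illustrated without a vector database. The following sketch ranks a few documents against a query using cosine similarity in plain Python. It makes several assumptions: the three-dimensional vectors are made-up toy values rather than real embeddings, `cosine_similarity` and `top_matches` are hypothetical helpers, and a real datastore such as OpenSearch or PGVector would do this search for you at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vector, documents, k=1):
    """Return the texts of the k documents most similar to the query.

    `documents` is a list of (text, vector) pairs.
    """
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors for illustration only; real ones come from the Embeddings API.
toy_documents = [
    ("Bananas are yellow.", [1.0, 0.0, 0.0]),
    ("Oranges are orange.", [0.0, 1.0, 0.0]),
    ("Dutch is spoken in the Netherlands.", [0.0, 0.0, 1.0]),
]
toy_query = [0.1, 0.9, 0.0]
best = top_matches(toy_query, toy_documents, k=1)
print(best)
```

In a RAG application, the texts returned by a search like this would be added to the user prompt before calling the text generation model.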

## Conclusion

In this demo, we used the OpenAI client to perform the following tasks:
- Chat Completions
- Chat Completions with context
- Chat Completions with streaming
- Legacy Completions
- Embeddings

All of these operations are natively supported through the OpenAI library, and we have performed all of these operations by making a single client that accesses the LISA Serve endpoint. This demo heavily relied on the `openai-python` library, but LISA will work *anywhere* that an OpenAI client can be instantiated.