<a href="https://colab.research.google.com/github/badlogic/genai-workshop/blob/main/08_retrieval_augemented_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-augmented generation
LLMs are limited by the data they were trained on, which can lead to outdated, generic, or contextually shallow outputs. Adding information to an LLM through full pre-training or fine-tuning is costly and cannot be done frequently. For individuals and companies that want to get their domain-specific knowledge into an LLM, pre-training is out of the question, and fine-tuning is ill-suited to teaching the LLM new information. Even if this were possible, we'd still face the problem that model parameters encode information lossily.

**Retrieval-augmented generation (RAG)** addresses these issues by employing **grounding** (discussed in the last section) to provide the LLM with up-to-date information that is specific to the domain and the user query as part of the prompt.

The basic architecture of a RAG system is deceptively simple:

<center><img src="https://marioslab.io/uploads/genai/rag.png"></center>

Given a user query, the goal of a RAG system is to find relevant information in **domain-specific content** stored in documents, databases, or accessible via APIs, and pass that information along with the user query to an LLM. The LLM then references the information in the prompt to answer the user query.

From a bird's-eye view, a RAG system consists of two separate services running in parallel:

1. **Index**: the domain-specific content is ingested from various data sources, like databases, files, or APIs, pre-processed, and stored in one or more sparse and/or dense retrieval systems. This is a recurring process that keeps the indices up-to-date. The indexing service provides functionality to retrieve relevant information based on a query.

2. **Query answering**: A user submits a query to the system. The following steps are executed:
  1. **Retrieval of relevant information**: the query is used to retrieve relevant information from the indexing service.    
  2. **Prompt construction**: The retrieved information, the conversation history, and the query are combined into a prompt, with instructions for the LLM on how to answer the query based on the retrieved information.
  3. **Response**: the prompt is input into an LLM and the response is returned to the user and recorded for the next conversation turn.


Let's have a look at the components and processes in a RAG system.

> **Note:** please execute the next code cells before you continue. They contain helper functions used by the code examples below.


In [4]:
!pip -q install openai tiktoken

> **Note:** Enter your own OpenAI API key below!

In [6]:
from openai import OpenAI
import tiktoken

# Use your own OpenAI API key here.
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

messages = []
model_name="gpt-3.5-turbo"
max_tokens = 12000
temperature=0

# Uncomment to use a model served locally via `ollama serve`
# client = OpenAI(
#    base_url = 'http://localhost:11434/v1',
#    api_key='ollama', # required, but unused
# )
# model_name="mixtral:latest"

enc = tiktoken.get_encoding("cl100k_base")
def num_tokens(message):
    return len(enc.encode(message))

def truncate_messages(messages, max_tokens):
    total_tokens = sum(num_tokens(message["content"]) for message in messages)
    if total_tokens <= max_tokens:
        return messages

    truncated_messages = messages[:1]
    remaining_tokens = max_tokens - num_tokens(truncated_messages[0]["content"])
    for message in reversed(messages[1:]):
        tokens = num_tokens(message["content"])
        if remaining_tokens >= tokens:
            truncated_messages.insert(1, message)
            remaining_tokens -= tokens
        else:
            break
    return truncated_messages

def complete(message, max_response_tokens=2048):
    global messages
    messages.append({"role": "user", "content": message})
    truncated_messages = truncate_messages(messages, max_tokens=max_tokens)
    stream = client.chat.completions.create(
        model=model_name,
        messages=truncated_messages,
        stream=True,
        temperature=temperature,
        max_tokens=max_response_tokens
    )
    reply = ""
    for response in stream:
        token = response.choices[0].delta.content
        if (token is None):
            break
        reply += token
        print(token, end='')

    reply = {"role": "assistant", "content": reply}
    messages.append(reply)
    total_tokens = sum(num_tokens(message["content"]) for message in truncated_messages)
    print(f'\nTokens: {total_tokens}')

def clear_history():
  global messages
  messages = []

def print_history():
  global messages
  for message in messages:
    print("<" + message["role"] + ">")
    print(message["content"])
    print()

def system_prompt(message):
  global messages
  prompt = { "role": "system", "content": message }
  if (len(messages) == 0):
    messages.append(prompt)
  else:
    messages[0] = prompt

## Index
The index stores domain-specific content, usually just referred to as **documents**. It allows **retrieval of the most relevant documents** for a given user query. A full RAG system can have more than one index.

The process of storing documents in the index is called **ingestion**, or **indexing**. Indexing should happen frequently, or whenever information is added or updated, so we can provide up-to-date information to the LLM.

Since we will stuff domain specific information into the prompt passed to the LLM, we need to ensure that this information fits into the context window. We thus don't just index and retrieve full documents, but document **chunks**.

A chunk is a part of a document that is big enough to contribute information for question answering by the LLM, but not so big that it can't fit into the LLM's token window alongside the conversation history and user question.

Cutting a document up into these chunks is called **chunking**, for which various strategies exist, e.g.

* **Character or token based chunking**: the content is split into equally sized chunks of `n` characters or tokens each. Chunks may also overlap. A basic evaluation of the effect of different chunk sizes can be found [here](https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5).
* **Semantic chunking**: the content is split into sentences. Each sentence is embedded as a vector. The semantic chunker then accumulates sentences until either a maximum chunk size is reached, or the similarity between the current set of sentences and the next sentence is smaller than a threshold. See [semantic chunking](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking.html) in the LlamaIndex documentation for more information.
* **Structured chunking**: the content is often structured into sections and paragraphs, e.g. if encoded as HTML or markdown, which imply semantic relatedness. Structured chunking splits content into chunks based on this structure.
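To make the first strategy concrete, here is a minimal sketch of character based chunking with overlap. The function name `chunk_text` and its parameters are made up for this illustration. A token based variant would follow the same pattern, but slice a list of token ids instead of a string.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears in full in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

document = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(document, chunk_size=10, overlap=3):
    print(chunk)
```

With a chunk size of 10 and an overlap of 3, each chunk starts 7 characters after the previous one, so the last 3 characters of a chunk are repeated at the start of the next.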

Each document chunk is stored in the index separately, usually including metadata that stores information like where the chunk came from. When we retrieve relevant content for a query, we actually retrieve chunks, not the full document, since we will stuff chunks into the LLM prompt.

Indices are usually **sparse or dense retrieval systems**, which we've already investigated in the section on embeddings. [Elasticsearch](https://www.elastic.co/elasticsearch) is a popular sparse retrieval system. Popular dense retrieval systems include [Chroma](https://www.trychroma.com/), [Pinecone](https://www.pinecone.io/), and [Milvus](https://milvus.io/).

While sparse retrieval systems can usually be fed each chunk's text directly, dense retrieval systems require an additional pre-processing step: embedding the chunk texts into a latent vector space. We've already seen how to embed text into a latent vector space using an embedding model. Another popular choice is to use [OpenAI's embedding API](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings). The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on Hugging Face gives a good overview of embedding models and their relative performance.

When we retrieve relevant documents for a user query from a dense retrieval system, we also need to **embed the user query using the same embedding model**!
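The sketch below illustrates why the same embedding function has to be used on both sides. It is a toy in-memory dense index, not a real vector database. `toy_embed` is a hypothetical stand-in for an actual embedding model (it just hashes words into a fixed-size vector), and `DenseIndex` is a name made up for this example.

```python
import math
import zlib

def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model: hashes words into `dim` buckets."""
    vector = [0.0] * dim
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vector

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class DenseIndex:
    def __init__(self, embed):
        # The index owns the embedding function, so chunks and queries
        # are always embedded into the same vector space.
        self.embed = embed
        self.entries = []  # list of (chunk text, chunk vector)

    def add(self, chunk):
        self.entries.append((chunk, self.embed(chunk)))

    def search(self, query, k=3):
        query_vector = self.embed(query)
        scored = [(cosine_similarity(query_vector, vector), chunk)
                  for chunk, vector in self.entries]
        # Most similar chunks first; ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

index = DenseIndex(toy_embed)
index.add("dpSet assigns values to data point attributes")
index.add("dpGet reads values of data point attributes")
index.add("dpConnect calls a callback when values change")
for score, chunk in index.search("dpSet assigns values to data point attributes", k=2):
    print(f"{score:.2f} {chunk}")
```

Storing the embedding function inside the index is one simple way to rule out the mistake of embedding queries with a different model than the chunks.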

## Retrieval of relevant information
This step is usually encapsulated in a system component called **the retriever**, which sits on top of the indices. It is responsible for preprocessing the user query, using it to retrieve relevant chunks from each index, and ranking and selecting the top-k chunks to be passed along in the LLM prompt.

One crucial pre-processing step found in most retrievers is **query expansion**. This takes the user query, and augments it with additional information, so it is more likely we can find relevant information. This is especially helpful in turn-by-turn conversations. Consider this example:

```
<user>
Who is the CEO of the company ETM? Also tell me what products they provide.

<assistant>
The CEO of ETM is Dipl.-Ing. Dr.techn. Bernhard Reichl. ETM develops the SCADA system SIMATIC WinCC Open Architecture, which is designed for applications requiring a high degree of client-specific adaptability, large and/or complex applications, and projects with specific system requirements and functions.

<user>
What is his age?
```

The first user query contains enough information to retrieve meaningful information from the index.

However, the second user question is problematic. We will not get any relevant chunks for the query `What is his age?`, as it lacks contextual information, such as who "he" is.

We can use an LLM to expand this query based on the conversation history.





In [7]:
clear_history()
complete("""You are given a conversation and a new message, both delimited by triple backticks.
Expand the new message by resolving any references to persons, entities, or locations in the
conversation with their full names.

Conversation:
```
<user>
Who is the CEO of the company ETM? Also tell me what products they provide.

<assistant>
The CEO of ETM is Dipl.-Ing. Dr.techn. Bernhard Reichl. ETM develops the SCADA system SIMATIC WinCC Open Architecture, which is designed for applications requiring a high degree of client-specific adaptability, large and/or complex applications, and projects with specific system requirements and functions.
```

New Message:
```
What is his age?
```
""")

What is Dipl.-Ing. Dr.techn. Bernhard Reichl's age?
Tokens: 163


This expanded query is much more likely to return relevant chunks from the indices! The expanded query can be used to improve retrieval performance for both sparse and dense indices.
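In a running system, the expansion prompt would be assembled from the stored conversation history rather than typed by hand. A minimal sketch, assuming the history is a list of `{"role": ..., "content": ...}` dictionaries as used by the helper functions above; `build_expansion_prompt` is a hypothetical helper, not part of any library.

```python
FENCE = "`" * 3  # triple backticks, used as delimiters inside the prompt

def build_expansion_prompt(history, new_message):
    """Build a query expansion prompt from prior turns and the latest user message."""
    conversation = "\n\n".join(
        f"<{turn['role']}>\n{turn['content']}" for turn in history
    )
    return (
        "You are given a conversation and a new message, both delimited by triple backticks.\n"
        "Expand the new message by resolving any references to persons, entities, or locations "
        "in the conversation with their full names.\n\n"
        f"Conversation:\n{FENCE}\n{conversation}\n{FENCE}\n\n"
        f"New message:\n{FENCE}\n{new_message}\n{FENCE}\n"
    )

history = [
    {"role": "user", "content": "Who is the CEO of the company ETM?"},
    {"role": "assistant", "content": "The CEO of ETM is Dipl.-Ing. Dr.techn. Bernhard Reichl."},
]
prompt = build_expansion_prompt(history, "What is his age?")
print(prompt)
```

The resulting string could then be sent to the LLM, for example through the `complete` helper defined earlier, and the reply used as the query for the indices.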

Once we've retrieved all relevant chunks from the indices, we need to select the ones we will pass along in the prompt to the LLM.

Usually, sparse and dense retrieval systems will already sort their results such that the most similar chunks come first. A simple approach is thus to just select the top-k chunks that fit into the LLM token window, along with the new user query, the conversation history, and the system prompt that instructs the LLM how to answer the query.
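The simple selection strategy can be sketched as a greedy loop over the ranked chunks. `select_chunks` and `approx_num_tokens` are hypothetical names. The word count is only a rough stand-in for a real tokenizer, such as the `num_tokens` helper defined earlier.

```python
def approx_num_tokens(text):
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def select_chunks(ranked_chunks, token_budget, count_tokens=approx_num_tokens):
    """Take chunks in ranked order until the next one no longer fits the budget.

    `token_budget` is what remains of the context window after reserving room
    for the system prompt, conversation history, user query, and response.
    """
    selected = []
    remaining = token_budget
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if cost > remaining:
            break
        selected.append(chunk)
        remaining -= cost
    return selected

ranked = ["a b c", "d e", "f g h i", "j"]
print(select_chunks(ranked, token_budget=6))
```

This version stops at the first chunk that does not fit, which preserves the ranking. Skipping that chunk and continuing with smaller, lower-ranked ones would fill the window more completely, at the cost of including less relevant content.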

However, a problem called **[Lost in the middle](https://arxiv.org/abs/2307.03172)** can make this a poor choice. Empirical analysis has shown that many LLMs focus on the beginning and end of a prompt more than on the mid-section. This means that relevant chunks in the middle of the prompt, which may perfectly answer the user query, may not get enough attention. A common mitigation is to reorder the chunks so that the most relevant ones are placed at the beginning and the end of the prompt rather than in the middle.
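One possible way to keep the most relevant chunks out of the middle is sketched below. It assumes the input list is sorted from most to least relevant. The function name and the alternating scheme are illustrative choices, not a prescribed algorithm.

```python
def reorder_for_long_context(ranked_chunks):
    """Place the most relevant chunks at both ends, the least relevant in the middle.

    Chunks at even positions fill the front, chunks at odd positions fill the
    back in reverse, so relevance decreases towards the middle of the result.
    """
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant chunk
print(reorder_for_long_context(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here the two best chunks, `c1` and `c2`, end up first and last, while the weakest chunk `c5` sits in the middle.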

Not all ranking problems are this simple. If we **retrieve relevant chunks from multiple indices**, we get one list of chunks per index. Each list is sorted according to the similarity measure used by the respective index, so similarities between lists are not comparable.

In order to pick the top-k relevant chunks across multiple, separately scored lists of chunks, we need to employ **model-based reranking**: the chunk lists are combined into a single list. This combined list and the user query are then presented to a reranking model, which sorts the chunks in the list according to their relevance to the user query. We can then pick the top-k chunks from this reranked list. Services like [Cohere Rerank](https://txt.cohere.com/rerank/) can provide this functionality in a single line of code.
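The combine-then-rerank flow can be sketched as follows. The reranking model is represented by a pluggable `score_fn(query, chunk)` callable. The `word_overlap_score` function used here is only a toy placeholder for a real reranking model or service, and all names are made up for this example.

```python
def rerank(query, chunk_lists, score_fn, k=3):
    """Merge per-index result lists, score every chunk against the query, keep the top-k."""
    combined = []
    seen = set()
    for chunks in chunk_lists:
        for chunk in chunks:
            # The same chunk may be returned by several indices; keep it once.
            if chunk not in seen:
                seen.add(chunk)
                combined.append(chunk)
    combined.sort(key=lambda chunk: score_fn(query, chunk), reverse=True)
    return combined[:k]

def word_overlap_score(query, chunk):
    """Toy relevance score: number of words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

sparse_hits = ["ETM builds SCADA software", "The CEO of ETM is Bernhard Reichl"]
dense_hits = ["The CEO of ETM is Bernhard Reichl", "WinCC OA is a product"]
print(rerank("ceo of etm", [sparse_hits, dense_hits], word_overlap_score, k=2))
```

Because every chunk is scored by the same function, the scores are comparable across indices, which is exactly what the per-index similarity values could not offer.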





## Prompt construction
At this stage, we have the list of top-k relevant chunks, the user query, and the conversation history. These need to be combined into a prompt along with instructions for the LLM on how to answer the user query.

We can use **grounding** prompt engineering techniques to construct the prompt from these pieces of information. Here is an example prompt:

```
You are provided with a conversation history, a set of relevant information, and a user query.

The conversation history:
"""
<user>
...
<assistant>
...
<user>
...
<assistant>
...
"""

The relevant information:
"""
[
  {
    "url": "file://documents/do...",
    "information": "..."
  },
  {
    "url": "http://domain.com/xyz.html",
    "information": "..."
  }
]
"""

The query:
"""

"""

Answer the query based on the relevant information and conversation history.
Output your answer in Markdown.
Cite the relevant information by adding Markdown links where appropriate.
```

Ideally, the LLM will follow the instructions and produce an answer in Markdown format, including links to the documents that informed its answer where appropriate.

This technique thus not only allows us to ground the LLM in our own data, but also gives the user citations to explore the information in more depth.
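A prompt of this shape can be assembled with a small helper. The sketch below assumes that each chunk is a dictionary with `url` and `information` keys and that the history is a list of role/content dictionaries. `build_rag_prompt` is a hypothetical name.

```python
import json

QUOTES = '"""'  # delimiter placed around each prompt section

def build_rag_prompt(history, chunks, query):
    """Combine conversation history, retrieved chunks, and the query into one prompt."""
    history_text = "\n".join(f"<{turn['role']}>\n{turn['content']}" for turn in history)
    # Serialize the chunks as JSON so each piece of information keeps its source URL.
    information = json.dumps(chunks, indent=2)
    sections = [
        "You are provided with a conversation history, a set of relevant information, and a user query.",
        "",
        "The conversation history:",
        QUOTES, history_text, QUOTES,
        "",
        "The relevant information:",
        QUOTES, information, QUOTES,
        "",
        "The query:",
        QUOTES, query, QUOTES,
        "",
        "Answer the query based on the relevant information and conversation history.",
        "Output your answer in Markdown.",
        "Cite the relevant information by adding Markdown links where appropriate.",
    ]
    return "\n".join(sections)

history = [{"role": "user", "content": "Who is the CEO of the company ETM?"}]
chunks = [{"url": "http://domain.com/xyz.html",
           "information": "The CEO of ETM is Dipl.-Ing. Dr.techn. Bernhard Reichl."}]
print(build_rag_prompt(history, chunks, "What is his age?"))
```

Keeping the URL next to each piece of information is what allows the LLM to emit Markdown links as citations.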

Of course, we need to ensure that the prompt fits into the context window of the LLM. We can use a truncation strategy like we've implemented in the last section, to truncate both the conversation history, as well as the contextual information, until the prompt is small enough.

Ideally, we provide the LLM with as much information as we can, so in general, we'll use most of the token window. However, this comes at a price: both financially (if you use paid LLM services) as well as in terms of response times.

## Response
Now that we have the prompt, we can simply input it into the LLM to get a hopefully correct and grounded response, including citations.

The user query as well as the LLM's answer are added to the conversation history and the system is ready for the next conversation turn.

## Evaluation
RAG systems can be very sensitive to the smallest changes. For example, small changes to the instructions in the prompt, or to its format, can have a huge impact on answer quality. Similarly, changing the chunking strategy or using a different embedding strategy will change which chunks are retrieved for a query, and will also influence the order of the retrieved chunks.

As such, it is advisable to set up a continuous evaluation pipeline, with which we can monitor the impact of changes to the RAG system along two dimensions:

* **Retriever evaluation**: given a query, a query expansion mechanism, and a reranker, does the retriever return
  * the most **relevant** information?
* **Response evaluation**: given a query, a set of relevant information, and the instructions, does the LLM respond
  * **faithfully**, based only on the provided information, or did it hallucinate?
  * in a way that is **relevant** to the query?
  * **correctly**?
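As an illustration of the retriever side, the sketch below computes two commonly used retrieval metrics, hit rate and mean reciprocal rank (MRR). These specific metrics are an assumption made for this example, as is the test data, which pairs each query's retrieved chunk ids with the id of the one chunk known to answer it.

```python
def hit_rate(results, relevant):
    """Fraction of queries whose relevant chunk appears anywhere in the retrieved list."""
    hits = sum(1 for retrieved, target in zip(results, relevant) if target in retrieved)
    return hits / len(relevant)

def mean_reciprocal_rank(results, relevant):
    """Average of 1 / rank of the relevant chunk, counting 0 when it was not retrieved."""
    total = 0.0
    for retrieved, target in zip(results, relevant):
        if target in retrieved:
            total += 1.0 / (retrieved.index(target) + 1)
    return total / len(relevant)

# Hypothetical evaluation data: one retrieved list and one relevant chunk id per query.
results = [["c1", "c2", "c3"], ["c4", "c5", "c6"], ["c7", "c8", "c9"]]
relevant = ["c1", "c5", "c0"]

print(f"hit rate: {hit_rate(results, relevant):.2f}")
print(f"MRR: {mean_reciprocal_rank(results, relevant):.2f}")
```

Tracking numbers like these before and after a change to chunking, embedding, or reranking shows whether the retriever got better or worse, independently of the LLM's answers.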

[LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/root.html), a framework for building production RAG systems, has a comprehensive integrated evaluation suite that can measure the metrics described above.

Notably, RAG evaluation systems as implemented in LlamaIndex often do not require human labels, such as query/response pairs, but rely on "gold" LLMs to generate test data, and judge metrics like faithfulness or correctness.

## WinCC OA JS Coding Buddy - RAG without retrieval
A full-blown RAG system is only needed if the data we want the LLM to use as grounding does not fit into the token window.

Before we build a simple RAG system, let us see how far we can get if we can indeed fit our grounding data entirely into the prompt. We are still augmenting the answer generation, but we are not retrieving anything.

ETM provides a simple JavaScript API for their WinCC OA system. We want to create a coding assistant that knows this API and can help users write applications that use this API.


### Data
We'll use the [WinCC OA JS API documentation](https://www.winccoa.com/documentation/WinCCOA/3.18/en_US/oaJsApi/oaJsApi.html) as our data source. It is a single HTML file, which contains the full source code of the API, including JSDoc strings.

We are not interested in the actual implementation, but only in the JSDoc strings, which document the entire API surface and include examples for each function. Perfect!

Let's write a function that preprocesses this HTML file by extracting the JSDoc strings.

In [10]:
import requests
import os
import re
from bs4 import BeautifulSoup

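def download_api_docs():
    # Fetch the HTML page that embeds the oaJsApi source code.
    url = "https://www.winccoa.com/documentation/WinCCOA/3.18/en_US/oaJsApi/oaJsApi.class.js.html"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text)
    code_element = soup.find('code')
    if code_element is None:
        # Fail loudly instead of returning an error string that would end up in the prompt.
        raise ValueError("No <code> tag found.")

    # Extract every JSDoc block (/** ... */); DOTALL lets `.` match line breaks.
    matches = re.findall(r'/\*\*.*?\*/', code_element.text, flags=re.DOTALL)
    # Concatenate the blocks, each followed by a newline.
    return "".join(match + "\n" for match in matches)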

api_docs = download_api_docs()
print(api_docs[:1000])

/**
 *
 * @class oaJsApi
 */
/**
   *
   *  This Callback will be fired in case of an exception. If no errorCallback is registered, oaJsApi will console.error to console.
   *  Arguments can be different, depends who calls the error handler. Could be catch block of javascript or WinCCOA.
   * @callback errorCallback

   * @example
   *
   * oaJsApi.registerListeners({
   *  error: function()
   *  {
   *    console.error(arguments);
   *  }
   * });
   */
/**
   * Opens a WebSock Object to the given baseUrl.
   * @function oaJsApi.connect

   * @param {String} baseUrl
   * @param {Object=} options Configuration Options
   * @param {Object=} options.webViewId WebView ID
   * @param {Object=} options.listeners register global listeners
   * @param {requestCallback=} options.listeners.success success Callback of connect
   * @private
   * @example
   *
   *  oaJsApi.connect("wss://localhost:12345", {
   *    baseParams: {
   *       webViewId: 1
   *    },
   *    listeners: {
   *       

Looking good! We can use the `num_tokens` function to count the tokens we will use up when passing the entire API documentation as part of the prompt:

In [11]:
num_tokens(api_docs)

7293

We will use OpenAI's `gpt-3.5-turbo` model to answer coding related questions. Each request will have at least 7293 tokens. What will that cost us?

Looking at the [pricing page](https://openai.com/pricing), 1000 input tokens passed to `gpt-3.5-turbo` cost $0.0005 at the time of writing.

In [12]:
print(f'Minimum cost per query: ${0.0005 * num_tokens(api_docs) / 1000:.6f}')

Minimum cost per query: $0.003647


This cost doesn't include the additional tokens we need to send the conversation history, instructions, and latest query to the model. It also doesn't include the cost incurred by the response. But it gives us a good ballpark figure to work with.

If we used a full RAG system instead, we would also try to fill the token window, so the cost would likely be similar.

### Prompt construction
Let's create our system prompt.

In [13]:
system_prompt(f"""
You are a helpful assistant. You know about the WinCC OA JS API and can answer questions related to it.

Here is the API documentation:

```
{api_docs}
```

Only answer questions related to the API above and its usage.
""")

We just smack the entire API documentation right into the middle of the system prompt. Let's give it a try.

### Trying it out

In [14]:
complete("Who are you?")

I am a helpful assistant knowledgeable about the WinCC OA JS API. I can provide information and answer questions related to the API and its usage.
Tokens: 7389


In [15]:
complete("How can I enumerate all datapoints?")

To enumerate all data points, you can use the `dpNames` function provided by the WinCC OA JS API. Here is an example of how you can use it:

```javascript
oaJsApi.dpNames('*', null, {
   success: function(data) {
      console.log(data);
   },
   error: function() {
      console.error(arguments);
   }
});
```

In this example, the `dpNames` function is called with the pattern `'*'` to match all data points. The `null` parameter is used to retrieve all data point types. The success callback function will log the data points to the console.
Tokens: 7526


In [16]:
complete("I want to list all datapoints, their description, and the historic values for the last 7 days")

To achieve this, you can use multiple functions provided by the WinCC OA JS API. You can first use the `dpNames` function to get a list of all data points, then for each data point, you can use the `dpGetDescription` function to get its description, and finally, you can use the `dpGetPeriod` function to retrieve historic values for the last 7 days.

Here is an example of how you can accomplish this:

```javascript
// Step 1: Get a list of all data points
oaJsApi.dpNames('*', null, {
   success: function(dataPoints) {
      // Step 2: For each data point, get its description and historic values for the last 7 days
      dataPoints.forEach(function(dp) {
         // Get description of the data point
         oaJsApi.dpGetDescription(dp, null, {
            success: function(description) {
               console.log('Data Point: ' + dp);
               console.log('Description: ' + description);

               // Get historic values for the last 7 days
               let endTime = new Da

In [17]:
complete("Can you rewrite this in async/await style?")

Certainly! Here is the rewritten code using async/await syntax:

```javascript
async function getDataPointsInfo() {
   try {
      // Step 1: Get a list of all data points
      const dataPoints = await new Promise((resolve, reject) => {
         oaJsApi.dpNames('*', null, {
            success: resolve,
            error: reject
         });
      });

      // Step 2: For each data point, get its description and historic values for the last 7 days
      for (const dp of dataPoints) {
         // Get description of the data point
         const description = await new Promise((resolve, reject) => {
            oaJsApi.dpGetDescription(dp, null, {
               success: resolve,
               error: reject
            });
         });

         console.log('Data Point: ' + dp);
         console.log('Description: ' + description);

         // Get historic values for the last 7 days
         const endTime = new Date(); // Current time
         const startTime = new Date(endTime.getTim

In [18]:
complete("hm shouldn't you wrap the existing api, then use async/Await?")

Yes, you are correct. To use async/await with the existing API functions that follow a callback-based pattern, you can wrap those functions in Promises. Here is the updated code with the API functions wrapped in Promises for use with async/await:

```javascript
function dpNamesAsync() {
   return new Promise((resolve, reject) => {
      oaJsApi.dpNames('*', null, {
         success: resolve,
         error: reject
      });
   });
}

function dpGetDescriptionAsync(dp) {
   return new Promise((resolve, reject) => {
      oaJsApi.dpGetDescription(dp, null, {
         success: resolve,
         error: reject
      });
   });
}

function dpGetPeriodAsync(startTime, endTime, dp) {
   return new Promise((resolve, reject) => {
      oaJsApi.dpGetPeriod(startTime, endTime, 7, dp, {
         success: resolve,
         error: reject
      });
   });
}

async function getDataPointsInfo() {
   try {
      const dataPoints = await dpNamesAsync();

      for (const dp of dataPoints) {
         const

In [19]:
complete("Can you rewrite this in TypeScript?")

Certainly! Here is the rewritten code using TypeScript:

```typescript
function dpNamesAsync(): Promise<string[]> {
   return new Promise((resolve, reject) => {
      oaJsApi.dpNames('*', null, {
         success: (data: string[]) => resolve(data),
         error: reject
      });
   });
}

function dpGetDescriptionAsync(dp: string): Promise<string> {
   return new Promise((resolve, reject) => {
      oaJsApi.dpGetDescription(dp, null, {
         success: (description: string) => resolve(description),
         error: reject
      });
   });
}

function dpGetPeriodAsync(startTime: Date, endTime: Date, dp: string): Promise<number[]> {
   return new Promise((resolve, reject) => {
      oaJsApi.dpGetPeriod(startTime, endTime, 7, dp, {
         success: (historicValues: number[]) => resolve(historicValues),
         error: reject
      });
   });
}

async function getDataPointsInfo() {
   try {
      const dataPoints: string[] = await dpNamesAsync();

      for (const dp of dataPoints) {
  

In [20]:
complete("Please remove all comments")

Here is the TypeScript code without comments:

```typescript
function dpNamesAsync(): Promise<string[]> {
   return new Promise((resolve, reject) => {
      oaJsApi.dpNames('*', null, {
         success: (data: string[]) => resolve(data),
         error: reject
      });
   });
}

function dpGetDescriptionAsync(dp: string): Promise<string> {
   return new Promise((resolve, reject) => {
      oaJsApi.dpGetDescription(dp, null, {
         success: (description: string) => resolve(description),
         error: reject
      });
   });
}

function dpGetPeriodAsync(startTime: Date, endTime: Date, dp: string): Promise<number[]> {
   return new Promise((resolve, reject) => {
      oaJsApi.dpGetPeriod(startTime, endTime, 7, dp, {
         success: (historicValues: number[]) => resolve(historicValues),
         error: reject
      });
   });
}

async function getDataPointsInfo() {
   try {
      const dataPoints: string[] = await dpNamesAsync();

      for (const dp of dataPoints) {
         con

In [21]:
complete("What other APIs exist that you haven't shown me yet?")

Here are some additional APIs from the WinCC OA JS API documentation that have not been covered yet:

1. `toCtrl`: Triggers the messageReceived event on the WebView Ewo.
2. `toCtrlFn`: Triggers an existing Control-Function.
3. `msgToWebViewEwo`: Forwards the given params to another WebView Ewo.
4. `dpConnect`: Calls a callback function whenever the passed data point values/attributes change.
5. `dpDisconnect`: Disconnects the hotlink.
6. `dpSet`: Assigns values to data point attributes.
7. `dpGet`: Reads values of data point attributes.
8. `setValue`: Sets any number of graphics attributes of a graphics object.
9. `getValue`: Reads graphics attribute values for a graphics object in variables.
10. `dpGetAsynch`: Returns the historic values that were valid at a specific time.
11. `customFunction`: Allows calling a self-implemented method in the WssUserRequestHandler.
12. `cnsGetCompletions`: Looks up a pattern in CNS identifiers and IDs.
13. `dpQuery`: Retrieves attribute values with the

This looks pretty good!