# Examples from the Retrieval Augmented Generation Video

Here you will find all code snippets from the Retrieval Augmented Generation video, in the order in which they appear in the video.

<div class="alert alert-block alert-info">

**Info:**
The Streamlit example is not included in this notebook because a Streamlit app cannot be run inside a Jupyter notebook. You can find this example in the `../chatbot-rag` directory.
</div>

## Install necessary libraries

To run the examples, please install the following libraries first.

In [None]:
!pip install requests tqdm openai tiktoken langchain-text-splitters langchain-chroma langchain-huggingface

## Part 1: Download your knowledge base

Our goal is to build an expert chatbot for prompt engineering. For this we use the contents of the site [promptingguide.ai](https://www.promptingguide.ai).
A little research shows that this site is generated from the following repository on [GitHub](https://github.com/dair-ai/Prompt-Engineering-Guide).
That is great, because we can download the content of the repository directly and use it as our knowledge base.

In [None]:
import requests
import zipfile
import io

url = 'https://github.com/dair-ai/Prompt-Engineering-Guide/archive/refs/heads/main.zip'
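response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Unpack the archive from memory into the current directory
with zipfile.ZipFile(io.BytesIO(response.content)) as the_zip_file:
    the_zip_file.extractall('./')
print("File unzipped successfully!")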

## Part 2: Import the relevant parts and do a little preprocessing

If we look at the [GitHub repository](https://github.com/dair-ai/Prompt-Engineering-Guide), we see that the relevant and English language parts are in the [ar-pages](https://github.com/dair-ai/Prompt-Engineering-Guide/tree/main/ar-pages) directory and have the `.mdx` extension. This will be the starting point of our knowledge base.

MDX is a format that allows you to write JSX (JavaScript XML) embedded within Markdown content. This enables you to use React components directly in your Markdown files. MDX is commonly used in documentation sites and other React-based web applications to combine the simplicity of Markdown with the power of React components.

That is good news again, because language models are very good at understanding and generating Markdown. We are not interested in the JavaScript parts, but most of the file content is formatted in Markdown, so this should work.
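
The notebook passes the raw `.mdx` text to the splitter unchanged. If the JavaScript parts ever turn out to add noise, a light cleanup could look like the following sketch. The helper name `strip_mdx_js` and both regular expressions are assumptions for illustration and are not part of the original notebook; they only cover `import`/`export` lines and simple component tags.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Whole lines that start with an ES module import/export statement.
_IMPORT_EXPORT = re.compile(r"^(?:import|export)\s.*$\n?", flags=re.MULTILINE)
# JSX tags whose name starts with an uppercase letter, e.g. <Callout ...> or </Callout>.
_JSX_TAG = re.compile(r"</?[A-Z][A-Za-z0-9.]*(?:\s[^<>]*)?/?>")


def strip_mdx_js(text: str) -> str:
    """Drop import/export lines and component tags, keep the Markdown in between."""
    text = _IMPORT_EXPORT.sub("", text)
    return _JSX_TAG.sub("", text)


sample = (
    "import {Callout} from 'nextra-theme-docs'\n"
    "# Prompt Chaining\n"
    "<Callout emoji=\"i\">Chains split a task into subtasks.</Callout>\n"
)
cleaned = strip_mdx_js(sample)
print(cleaned)
```

The sketch is deliberately crude: a Markdown sentence that happens to start with "import " would also be removed, and tags whose attributes contain `>` are not handled.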

The following code splits the content of all `.ar.mdx` files into chunks and counts them. It uses `glob` to find the files, `tqdm` to show a progress bar, and `RecursiveCharacterTextSplitter` to split the text. For each file, the script reads the content, splits it into chunks, and appends the chunks to a list. The last line displays the total number of chunks.

To ensure that our texts fit into the context window of our embedding model (i.e. do not become too large), we use a `RecursiveCharacterTextSplitter`. It divides large texts into smaller chunks of at most a specified size, here 10,000 characters. To keep each chunk contextually meaningful, it tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces) and only cuts inside a word when no coarser separator produces a small enough piece. This recursive approach keeps large documents manageable while preserving readability as far as possible.
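
The idea behind recursive splitting can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the LangChain implementation: `recursive_split` is a hypothetical function, it ignores chunk overlap, and the sample text and the tiny `chunk_size` of 40 are made up so that the effect is visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_text = (
    "First paragraph about prompts.\n\n"
    "Second paragraph, which is a bit longer than the first one.\n\n"
    "Third."
)
demo_chunks = recursive_split(demo_text, chunk_size=40)
for c in demo_chunks:
    print(len(c), repr(c))
```

The first and third paragraphs fit and stay whole, while the second one is too long and is therefore split again at a space rather than in the middle of a word.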

In [None]:
from tqdm.notebook import tqdm
from langchain_text_splitters import RecursiveCharacterTextSplitter
from glob import glob

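# chunk_size is measured in characters by default
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)

chunks = []

# Note: without recursive=True, "**" behaves like "*", so this pattern
# matches files exactly one directory level below ar-pages.
for doc in tqdm(glob("Prompt-Engineering-Guide-main/ar-pages/**/*.ar.mdx")):
    try:
        with open(doc, encoding="utf-8") as f:
            chunks.extend(text_splitter.split_text(f.read()))
    except Exception as ex:
        # Report files that cannot be read or split and continue with the rest
        print(doc, "not processable", str(ex))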

len(chunks)

So we have about 80 articles about prompt engineering in our Prompt Engineering Guide, which we have broken down into just over 100 content chunks for our knowledge base.

## Part 3: Build your knowledge base

We now only need to convert these chunks into embedding vectors and save them in a vector store. We use Chroma as the vector store and work with a [BGE M3 embedding](https://arxiv.org/abs/2402.03216). BGE M3 is characterised by its versatility in multi-linguality, multi-functionality and multi-granularity. It supports more than 100 working languages and is suitable for multilingual and cross-language retrieval tasks. It can process inputs of varying granularity, ranging from short sentences to long documents with up to 8192 tokens, and performs similarly to the commercial OpenAI embeddings, as the following comparison shows.

![width:250px](https://huggingface.co/BAAI/bge-m3/resolve/main/imgs/others.webp)
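
Conceptually, a retriever compares the embedding of the query with the stored chunk embeddings and returns the `k` closest ones. The sketch below illustrates this with cosine similarity on made-up three-dimensional vectors. Real BGE M3 vectors have far more dimensions, and Chroma's internal index and distance metric may differ from this brute-force version; `cosine_similarity`, `top_k` and `toy_store` are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (text, embedding) pairs with invented numbers.
toy_store = [
    ("chunk about prompt chaining", [0.9, 0.1, 0.0]),
    ("chunk about RAG", [0.1, 0.9, 0.1]),
    ("chunk about few-shot prompting", [0.7, 0.2, 0.3]),
    ("chunk about model settings", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the user question
hits = top_k(query_vec, toy_store, k=2)
print(hits)
```

This corresponds to the `search_kwargs={'k': 3}` setting of the retriever in the notebook, which asks for the three best-matching chunks per query.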

In [None]:
from langchain_chroma import Chroma
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

bge_m3_embeddings = HuggingFaceEndpointEmbeddings(model="https://bge-m3-embedding.llm.mylab.th-luebeck.dev")
bge_m3 = Chroma.from_texts(chunks, bge_m3_embeddings, collection_name="bge_m3")
knowledge_base = bge_m3.as_retriever(search_kwargs={'k': 3})

## Part 4: Query your knowledge base

And now we come to the fun part. We ask our knowledge base a question (here: what prompt chaining is) and get back the indexed content chunks from the Prompt Engineering Guide that deal with it.

To estimate how large the context will be that we pass to our language model, we use `tiktoken` to count the tokens that the GPT-3.5-turbo tokenizer would need. The assumption here is that this count is roughly right for our Llama3 models as well.

<div class="alert alert-block alert-warning">

**Transfer:**

Try to adapt the code and provide an interactive query using `ipywidgets`.
</div>

In [None]:
import tiktoken
tokens = tiktoken.encoding_for_model("gpt-3.5-turbo")

docs = knowledge_base.invoke("What is prompt chaining?")
for doc in docs:
    print("---")
    print(len(doc.page_content))
    print(doc.page_content)

ctx = "\n".join(d.page_content for d in docs)
f"{len(tokens.encode(ctx))} tokens"

OK, we see that for different examples the token count mostly stays below the 5000-token limit of our Llama3 70B model (and practically always well below the 7500 input tokens of our Llama3 8B model). So we should be able to build an interactive Prompt Engineering Guide with this.
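
If a query does exceed such a limit, one option is to drop the lowest-ranked chunks until the context fits. The following is a minimal sketch of that idea, not something the original notebook does. `fit_to_budget` and the whitespace-based `rough_count` stand-in are hypothetical; in the notebook you would pass `lambda s: len(tokens.encode(s))` as `count_tokens` to count real tokens.

```python
def fit_to_budget(ranked_chunks, budget, count_tokens):
    """Keep chunks in ranking order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept, used


def rough_count(text):
    """Stand-in token counter for this demo: counts whitespace-separated words."""
    return len(text.split())


# Three fake retrieval hits with 6, 10 and 8 "tokens", best hit first.
ranked = ["best hit " * 3, "second hit " * 5, "third hit " * 4]
kept, used = fit_to_budget(ranked, budget=20, count_tokens=rough_count)
print(len(kept), "chunks kept,", used, "tokens used")
```

Stopping at the first chunk that does not fit keeps the best-ranked hits together; skipping it and trying smaller, lower-ranked chunks would be an alternative design.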

## Part 5: Connect your Knowledge Base with your LLM using a Prompt Template

<div class="alert alert-block alert-warning">

**Transfer:**

Try to adapt the code and provide an interactive query and answer generation using `ipywidgets`.
</div>

In [None]:
from openai import OpenAI

query = "Show me examples of RAG."

client = OpenAI(base_url="https://chat-large.llm.mylab.th-luebeck.dev/v1", api_key="-")

docs = knowledge_base.invoke(query)
context = "\n".join(d.page_content for d in docs)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are KIRA, a prompt engineering expert. You answer questions based on the context you have retrieved from your knowledge base."},
        {"role": "system", "content": f"Context: {context}"},
        {"role": "user", "content": query }
    ],
    model="", stream=True, max_tokens=3000
)

print(f"Generated output based on {len(tokens.encode(context))} tokens:")
for chunk in chat_completion:
    # Some chunks (e.g. the final one) carry no text, so skip empty deltas
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
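
The prompt template in the cell above is the fixed list of messages into which the retrieved context and the user query are inserted. Pulling it into a small function makes it easier to reuse, for instance from an `ipywidgets` callback. `build_messages` is a hypothetical helper name; the system prompt text is taken from the cell above, and the two demo chunks are placeholders.

```python
SYSTEM_PROMPT = (
    "You are KIRA, a prompt engineering expert. You answer questions based on "
    "the context you have retrieved from your knowledge base."
)


def build_messages(query, context_chunks):
    """Fill the prompt template with the retrieved chunks and the user query."""
    context = "\n".join(context_chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": f"Context: {context}"},
        {"role": "user", "content": query},
    ]


demo_messages = build_messages("Show me examples of RAG.", ["chunk one", "chunk two"])
for m in demo_messages:
    print(m["role"], "->", m["content"][:60])
```

In the notebook, `context_chunks` would be `[d.page_content for d in knowledge_base.invoke(query)]`, and the returned list would be passed as `messages=` to `client.chat.completions.create`.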

Great. We hope this notebook has helped you understand how trusted knowledge stores can guide the answer generation of large language models. This should reduce hallucination effects.

If you have any questions, please do not hesitate to ask. Our staff will do their best to help.

<img src="https://mylab.th-luebeck.de/images/mylab-logo-without.png" width=200px>