# Examples from the Retrieval Augmented Generation Video (Extended Version using the LangChain Expression Language, LCEL)

**Author:** **Keno Teppris** (ask for support, even if you are not in his team, chair of Team #7, Baltic Perspectives, Room: 25-2.16)

Here you will find all code parts from the Retrieval Augmented Generation video in the order in which they occur in the video.

<div class="alert alert-block alert-info">

**Info:**
The Streamlit example is not included in this notebook because it is not executable in Jupyter notebooks. You can find this example in the `../chatbot-rag` directory.
</div>

## Install necessary libraries

To run the examples, please install the following libraries first.

In [None]:
!pip install tiktoken langchain langchain-community langchain-text-splitters langchain-chroma langchain-huggingface langchain-openai

# Introduction to RAG Indexing

Welcome to the RAG (Retrieval-Augmented Generation) Indexing section! In this section, we will explore the innovative approach of integrating retrieval mechanisms with generative models to enhance the capabilities of AI systems. The RAG methodology empowers models to dynamically retrieve relevant information from a knowledge base (or "document store") during the generation process.

<img src="https://js.langchain.com/v0.2/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png" width="800"/>

## Workflow Overview

1. **Load:** The first step in the RAG process involves loading the necessary data. Data can be in various formats such as JSON files or URLs pointing to data sources. This is crucial for building a diverse and comprehensive document store.

2. **Split:** Once data is loaded, it's split into manageable parts or chunks. Splitting is essential for both processing efficiency and effective data management.

3. **Embed:** Each chunk of data is then embedded into a vector space. These embeddings represent the semantic content of the data in a format that machines can understand and compare.

4. **Store:** Finally, the embeddings are stored in a structured format. This "store" acts as the retrieval database during the generation phase, allowing the model to access and utilize the information efficiently.

5. **Test:** Test the retriever by querying documents.

By the end of this tutorial, you'll gain hands-on experience with each of these steps, understand how they interconnect, and appreciate their importance in building powerful AI models that can leverage external knowledge bases effectively.
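To make the five steps tangible before we use the real libraries, here is a minimal, library-free sketch. Everything in it is a toy assumption: the two sample texts, the fixed-width `split`, and the word-set "embedding" only stand in for the real loader, text splitter, embedding model, and vector store used later in this notebook.

```python
# Toy versions of Load, Split, Embed, Store and Test (illustration only).

def load():
    # Load: two hard-coded sample texts instead of real files
    return [
        "Prompt chaining splits a task into subtasks. Each subtask gets its own prompt.",
        "Few-shot prompting provides examples in the prompt to steer the model.",
    ]

def split(texts, chunk_size=60):
    # Split: naive fixed-width chunks
    return [t[i:i + chunk_size] for t in texts for i in range(0, len(t), chunk_size)]

def embed(text):
    # Embed: a set of lowercase words as a stand-in for a dense vector
    return set(text.lower().split())

def build_store(chunks):
    # Store: keep each chunk together with its "embedding"
    return [(chunk, embed(chunk)) for chunk in chunks]

def query(store, question, k=1):
    # Test: rank chunks by the number of words shared with the question
    q = embed(question)
    ranked = sorted(store, key=lambda item: len(q & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

toy_store = build_store(split(load()))
print(query(toy_store, "few-shot prompting"))
```

The rest of the notebook replaces each toy function with a proper component but keeps the same order of steps.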


## 1. Load: Download your knowledge base

Our intention is to develop an expert chatbot about prompt engineering. For this we use the contents of the page [promptingguide.ai](https://www.promptingguide.ai).
With a little research, we find out that this page is generated from the following repository on [GitHub](https://github.com/dair-ai/Prompt-Engineering-Guide).
And that's great, because we can then download the content of the repository directly and use it for our knowledge base.

In [None]:
import requests
import zipfile
import io

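url = 'https://github.com/dair-ai/Prompt-Engineering-Guide/archive/refs/heads/main.zip'
response = requests.get(url, timeout=60)
response.raise_for_status()  # stop early if the download failed

# Unpack the archive from memory into the current directory
with zipfile.ZipFile(io.BytesIO(response.content)) as the_zip_file:
    the_zip_file.extractall('./')
print("File unzipped successfully!")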

## 2. Split: Import the relevant parts and do a little preprocessing

If we look at the [GitHub repository](https://github.com/dair-ai/Prompt-Engineering-Guide), we see that the relevant and English language parts are in the [ar-pages](https://github.com/dair-ai/Prompt-Engineering-Guide/tree/main/ar-pages) directory and end with `.mdx` extensions. This will be the starting point of our knowledge base.

MDX is a format that allows you to write JSX (JavaScript XML) embedded within Markdown content. This enables you to use React components directly in your Markdown files. MDX is commonly used in documentation sites and other React-based web applications to combine the simplicity of Markdown with the power of React components.

And that's great again. Because language models are very good at understanding and generating Markdown. We are not interested in the JavaScript parts, but most of the file content is formatted in Markdown. So that should work.
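The notebook feeds the `.mdx` files to the splitter unchanged. If the JavaScript parts turn out to be noisy, they could be stripped beforehand. The following is only a rough sketch of such an optional step: `strip_mdx` is a hypothetical helper, the sample text is made up, and the two regular expressions are simple heuristics rather than a real MDX parser.

```python
import re

def strip_mdx(text: str) -> str:
    """Drop import/export lines and JSX-style tags, keep the Markdown."""
    kept = []
    for line in text.splitlines():
        if re.match(r"\s*(import|export)\s", line):
            continue  # ES module statements used by MDX
        # Remove tags such as <Callout ...>, </Callout> or <br/>, keep their inner text
        kept.append(re.sub(r"</?[A-Za-z][^>]*>", "", line))
    return "\n".join(kept)

sample = """import {Callout} from 'nextra/components'

# Prompt Chaining

<Callout type="info">Chaining splits a task into subtasks.</Callout>
"""

cleaned = strip_mdx(sample)
print(cleaned)
```

Note that the heuristic would also drop a prose line that happens to start with the word "import", so check the result on a few real files before relying on it.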

The following code splits the content of multiple `.ar.mdx` files into chunks and counts the number of chunks. It uses `tqdm` for a progress bar, `glob` to find files, and `RecursiveCharacterTextSplitter` to split text. This script iterates through all `.ar.mdx` files, reads their content, splits it into chunks, and appends each chunk to a list. Finally, it counts and returns the total number of chunks.

To ensure that our texts fit into the context window of our embedding model (i.e. do not become too large), we use a `RecursiveCharacterTextSplitter`. It divides large texts into chunks of at most a specified size, here 10,000 characters. To keep each chunk contextually meaningful, it tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces) and only falls back to a finer separator when a piece is still too large. Breaks in the middle of paragraphs, sentences, or words are therefore avoided where possible.
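The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not the LangChain implementation: `recursive_split` is a hypothetical function, and it leaves out features of the real splitter such as chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first and recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # Piece alone is too large: retry with the next finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

demo_text = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first one.\n\n"
    "Three."
)
demo_chunks = recursive_split(demo_text, chunk_size=30)
for c in demo_chunks:
    print(len(c), repr(c))
```

Short paragraphs stay whole, while the long middle paragraph is split at spaces because it contains no finer paragraph or line breaks.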


In [None]:
from tqdm.notebook import tqdm
from langchain_text_splitters import RecursiveCharacterTextSplitter
from glob import glob

text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)

chunks = []

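# recursive=True lets "**" also match files in nested subdirectories
for doc in tqdm(glob("Prompt-Engineering-Guide-main/ar-pages/**/*.ar.mdx", recursive=True)):
    try:
        with open(doc, encoding="utf-8") as f:
            chunks.extend(text_splitter.split_text(f.read()))
    except Exception as ex:
        # Report unreadable files and continue with the rest
        print(doc, "not processable", str(ex))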

len(chunks)

So we have about 80 articles about prompt engineering in our Prompt Engineering Guide, which we have broken down into just over 100 content chunks for our knowledge base.

## 3.-4. Embed and Store: Build your knowledge base

We now only need to convert these chunks into embedding vectors and save them in a vector store. We use Chroma as the vector store and work with a [BGE M3 Embedding](https://arxiv.org/abs/2402.03216). BGE M3 Embedding is characterised by its versatility in multi-linguality, multi-functionality and multi-granularity. It supports more than 100 working languages and is suitable for multilingual and cross-language retrieval tasks. It can process inputs of varying granularity, ranging from short sentences to long documents with up to 8192 tokens, and demonstrates similar performance to the commercial OpenAI embeddings, as the following comparison shows.

![width:250px](https://huggingface.co/BAAI/bge-m3/resolve/main/imgs/others.webp)
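What a vector store does at query time can be sketched without any library. The example below is a toy under stated assumptions: `TinyVectorStore` is a hypothetical class, the three-dimensional vectors are made up by hand, and cosine similarity is used as one common choice of similarity measure. It is not meant to describe how Chroma works internally.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them on demand."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=3):
        # Highest similarity first, then cut to the top k
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

tiny_store = TinyVectorStore()
tiny_store.add("chunk about prompt chaining", [0.9, 0.1, 0.0])
tiny_store.add("chunk about few-shot prompting", [0.1, 0.9, 0.0])
tiny_store.add("chunk about RAG", [0.0, 0.2, 0.9])

print(tiny_store.search([1.0, 0.0, 0.1], k=2))
```

In the real setup, the embedding model produces the vectors and the `k` in `search_kwargs={'k': 3}` plays the same role as `k` here.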

In [None]:
from langchain_chroma import Chroma
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

bge_m3_embeddings = HuggingFaceEndpointEmbeddings(model="https://bge-m3-embedding.llm.mylab.th-luebeck.dev")
bge_m3 = Chroma.from_texts(chunks, bge_m3_embeddings, collection_name="bge_m3")
retriever = bge_m3.as_retriever(search_kwargs={'k': 3})

## TASK: Test retriever by querying your knowledge base

And now we come to the fun part. We ask our knowledge store about prompt chaining and get back the indexed content chunks from the Prompt Engineering Guide that deal with this topic.

To estimate how large the generated context will be that we will put into our language model, we use `tiktoken` and estimate the number of tokens that would be required for the GPT-3.5-turbo model (the assumption here is that this number of tokens should be about right for our Llama3 models as well).

<div class="alert alert-block alert-warning">

**Transfer:**

Try to adapt the code and provide an interactive query using `ipywidgets`.
</div>

In [None]:
import tiktoken
tokens = tiktoken.encoding_for_model("gpt-3.5-turbo")

docs = retriever.invoke("What is prompt chaining?")
for doc in docs:
    print("---")
    print(len(doc.page_content))
    print(doc.page_content)

ctx = "\n".join(d.page_content for d in docs)
f"{len(tokens.encode(ctx))} tokens"

OK, we see that for different examples, the token count is usually below the 5000-token limit of our Llama3 70B model (and always well below the 7500 input tokens of our Llama3 8B model). This should allow us to create an interactive prompt engineering guide.
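If a query ever returns more context than the model accepts, one simple option is to keep only as many chunks as fit into a token budget. The sketch below shows the idea. `fit_to_budget` and `rough_count` are hypothetical helpers, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def fit_to_budget(chunks, max_tokens, count_tokens):
    """Keep chunks in order until adding the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected, used

def rough_count(text):
    # Crude assumption: one token per whitespace-separated word
    return len(text.split())

picked, used = fit_to_budget(
    ["a b c", "d e", "f g h i"], max_tokens=5, count_tokens=rough_count
)
print(picked, used)
```

In this notebook you could pass `lambda text: len(tokens.encode(text))` as `count_tokens` to reuse the `tiktoken` encoder from the cell above.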

## TASK: Connect your Knowledge Base with your LLM using a Prompt Template

<div class="alert alert-block alert-warning">

**Transfer:**

Try to adapt the code and provide an interactive query and answer generation using `ipywidgets`.
</div>

# Example usage of LangChain Expression Language (LCEL)

LangChain Expression Language (LCEL) is a declarative language for chaining LangChain components. It supports streaming, async execution, and optimized parallel execution. LCEL allows prototypes to be put into production without code changes, providing flexibility and reliability for complex chains.

## Runnable Interface

The Runnable interface standardizes how to define and invoke custom chains. It includes methods like `stream`, `invoke`, and `batch` for synchronous and asynchronous execution. Many LangChain components, such as chat models and output parsers, implement this interface, facilitating easy customization and integration. 

For more information, visit the [LangChain documentation](https://python.langchain.com/v0.2/docs/concepts/#langchain-expression-language).
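The `|` operator may look like magic at first. The following library-free sketch shows how such chaining can be built with Python's `__or__` method. `Step` is a hypothetical stand-in written for this illustration. It is not the LangChain `Runnable` class, which offers much more (for example streaming and async variants).

```python
class Step:
    """Minimal stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def batch(self, values):
        return [self.invoke(v) for v in values]

    def __or__(self, other):
        # a | b returns a new Step that runs a first and feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

# Three toy steps mirroring prompt | llm | parser
fill_prompt = Step(lambda d: f"Tell me a joke about {d['topic']}")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})
to_text = Step(lambda message: message["content"])

toy_chain = fill_prompt | fake_llm | to_text
print(toy_chain.invoke({"topic": "llamas"}))
```

Because every step exposes the same `invoke` method, steps can be swapped or extended without touching the rest of the chain.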

## Retrieval and Answering

In this section, we will leverage our knowledge base to retrieve relevant document chunks. Using both the context and the query, we'll construct prompts that guide our model's response process. The model will then generate answers based on the provided context, demonstrating how dynamic retrieval enhances the quality and relevance of model outputs.

<img src="https://python.langchain.com/v0.2/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png" width="800"/>

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
llm = ChatOpenAI(
    base_url="https://chat-large.llm.mylab.th-luebeck.dev/v1",
    api_key="-",
    streaming=True,
    max_tokens=3000
)

chain = llm | parser
output = chain.invoke("Hi, how are you?")
print(output)

The code provided sets up a language model chain using LangChain. Here's what happens:

1. **Parser Setup**: A `StrOutputParser` is initialized to process the output into a string format.
2. **Language Model Configuration**: `ChatOpenAI` is instantiated with a base URL, an API key, streaming enabled, and a maximum token limit of 3000.
3. **Chain Creation**: The language model and parser are combined into a chain using the `|` operator.
4. **Invocation**: The chain is invoked with the input "Hi, how are you?", and the output is printed.

This setup processes and returns a response from the language model.

## Streaming
The following code sets up an asynchronous chain to process input using LangChain. Here's what happens:

1. **Asynchronous Invocation**: Uses `astream` to asynchronously process the input "Hi, how are you?" and print each chunk of the response in real-time.
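To see what "each chunk in real-time" means without a model endpoint, here is a small library-free sketch. `fake_stream` is a hypothetical async generator that simply yields one word at a time. It only imitates the consumption pattern of `astream`.

```python
import asyncio

async def fake_stream(text, delay=0.0):
    # Yield the text word by word, as a model would yield tokens
    for word in text.split(" "):
        await asyncio.sleep(delay)
        yield word + " "

async def collect(prompt):
    pieces = []
    async for chunk in fake_stream(f"Echo: {prompt}"):
        print(chunk, end="", flush=True)  # show each piece as soon as it arrives
        pieces.append(chunk)
    return "".join(pieces)

streamed = asyncio.run(collect("Hi, how are you?"))
```

In a plain Python script `asyncio.run` starts the event loop. Inside a Jupyter notebook an event loop is already running, so you would write `await collect("Hi, how are you?")` instead, which is also why the notebook cell below can use `async for` at the top level.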

In [None]:
async for chunk in chain.astream("Hi, how are you?"):
    print(chunk, end="", flush=True)

### Introduction to Prompts

This code demonstrates how to create and use a prompt with LangChain for generating responses:

1. **Prompt Template**: A `ChatPromptTemplate` is created with a template string, "Tell me a joke about {topic}".
2. **Chain Creation**: The prompt template is combined with a language model (`llm`) and a parser (`parser`) to form a chain.
3. **Asynchronous Invocation**: The chain is invoked asynchronously with the topic "Large Language Models", and each chunk of the response is printed in real-time.

This setup dynamically generates and processes prompts for flexible interactions with the language model.

In [None]:
from langchain_core.prompts import ChatPromptTemplate


prompt = ChatPromptTemplate.from_template("Tell me a joke about {topic}")

chain = prompt | llm | parser

async for chunk in chain.astream({"topic": "Large Language Models"}):
    print(chunk, end="", flush=True)

# Introduction to Question-Answering Chains using LangChain

In this section, we will explore how to create a question-answering chain using LangChain, a powerful tool for integrating language models with retrieval-based systems. We build a system that can fetch relevant context from a set of documents and use this context to provide concise and accurate answers to questions.

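The key components of our system are:
1. **Retriever**: This fetches the chunks from our knowledge base that are most relevant to the question.
2. **Question-Answering Chain**: This uses the retrieved context to generate an answer to the question.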

We will start by incorporating the retriever into a question-answering chain using a specific prompt format. This prompt instructs the system on how to use the retrieved context to answer the question. The prompt emphasizes providing concise answers within three sentences.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

# 2. Incorporate the retriever into a question-answering chain.
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}" # the context variable will be filled with the retrieved context and is expected from any predefined rag chain
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In this code:
- `system_prompt` defines how the assistant should respond using the retrieved context.
- `prompt` structures the interaction between the system and the user.
- `question_answer_chain` combines the language model with the prompt to create answers. The word "stuff" means that all retrieved context chunks are simply combined ("stuffed") into the prompt. For other chain types, check out the LangChain documentation.
- `rag_chain` ties everything together, allowing the system to retrieve relevant documents and use them to answer questions.

This setup enables us to build an efficient question-answering system that leverages the strengths of both retrieval-based and generative models.
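The "stuff" strategy itself is simple enough to sketch by hand. The example below is a plain-Python illustration under our own assumptions: `stuff_documents`, `build_messages` and the shortened `SYSTEM_TEMPLATE` are hypothetical names, and the two chunks are placeholders.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)

def stuff_documents(docs, separator="\n\n"):
    # "Stuffing": concatenate all retrieved chunks into one context string
    return separator.join(docs)

def build_messages(question, docs):
    # Fill the context into the system message and append the user question
    return [
        ("system", SYSTEM_TEMPLATE.format(context=stuff_documents(docs))),
        ("human", question),
    ]

messages = build_messages("What is prompt chaining?", ["Chunk A", "Chunk B"])
for role, content in messages:
    print(f"{role}: {content}")
```

The approach works as long as all chunks together fit into the context window of the model.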

## Question-Answering Chain to Retrieve and Answer Questions

In this section, we will see how to use the question-answering chain we created to answer a specific question. Let's take a look at the code:

In [None]:
question = "What are context transformers and why should I use them?"

result = rag_chain.invoke({"input": question})
print(result["answer"])

In [None]:
print(result.keys())  # Output: dict_keys(['input', 'context', 'answer'])

print("Question:", result["input"])
print("-"*10)
print("First context:", result["context"][0].page_content[:500] + "\n[...]")
print("-"*10)
print("Answer:", result["answer"])


In this code:
- We define a question that we want the system to answer.
- We use the `rag_chain` we created earlier to process the question. The `invoke` method of `rag_chain` is called with the question as input.
- The result is stored in the `result` variable, and we print the answer using `result["answer"]`.

When we call `result.keys()`, we get `dict_keys(['input', 'context', 'answer'])`. This means that the result is a dictionary containing three keys:
1. **input**: This is the original question we provided to the system.
2. **context**: This contains the pieces of retrieved context that the system used to generate the answer.
3. **answer**: This is the concise answer generated by the system based on the provided context.

### Why does the result dictionary contain these keys?

- **input**: Keeping the original input question helps in debugging and understanding what question was asked, especially when dealing with multiple questions.
- **context**: This key provides transparency and traceability, allowing us to see the exact pieces of information the system used to derive its answer. This is crucial for understanding the reasoning process of the model and ensuring that the context used is relevant and accurate.
- **answer**: The final answer provided by the system, which is what we are primarily interested in.

In your application, you can use this information to show metadata, references, or links to the source files alongside the answer.
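Turning the `context` entries into a short source list could look like the sketch below. It rests on several assumptions: `Doc` is a stand-in class with the same `page_content` and `metadata` attributes as a retrieved document, `format_sources` is a hypothetical helper, and the `source` metadata entry and its file path are invented for this example. The knowledge base built above stores plain texts, so such metadata would have to be added during indexing.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_sources(result, snippet_length=60):
    lines = []
    for i, doc in enumerate(result["context"], start=1):
        source = doc.metadata.get("source", "unknown source")
        snippet = doc.page_content[:snippet_length].replace("\n", " ")
        lines.append(f"[{i}] {source}: {snippet}")
    return "\n".join(lines)

example_result = {
    "input": "What is prompt chaining?",
    "context": [
        # Hypothetical path, only for illustration
        Doc("Prompt chaining splits a task into subtasks.", {"source": "techniques/prompt_chaining.mdx"}),
        Doc("Few-shot prompting provides examples."),
    ],
    "answer": "It splits a task into subtasks.",
}

print(example_result["answer"])
print(format_sources(example_result))
```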

## Creating a History-Aware Question-Answering Chain

In this lesson, we will extend our question-answering system to be aware of the conversation history. This enhancement allows the system to handle questions that reference previous interactions, providing more contextually accurate answers.

We'll go through the steps to create a history-aware retriever and integrate it with our existing retrieval chain.

### Key Components

1. **History-Aware Retriever**: Reformulates questions considering the conversation history.
2. **Conversational Chain**: Maintains the conversation history across multiple interactions.

Here is the code to set up these components:

In [None]:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

In [None]:
# We reuse our llm and retriever
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

In [None]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

### Explanation:

1. **contextualize_q_system_prompt**: This prompt ensures that the system reformulates the user's question to be self-contained, removing dependencies on the previous chat history.
   
2. **contextualize_q_prompt**: This is a `ChatPromptTemplate` that uses the system prompt and includes placeholders for the chat history and the user's input.

3. **history_aware_retriever**: This retriever uses the `contextualize_q_prompt` to reformulate questions considering the chat history.

4. **rag_chain**: We recreate our retrieval chain using the history-aware retriever.

5. **ChatMessageHistory**: This class manages the chat history for different sessions.

6. **get_session_history**: This function retrieves the chat history for a given session ID, creating a new history if one doesn't exist.

7. **RunnableWithMessageHistory**: This class integrates the `rag_chain` with the session-based message history, allowing the system to maintain context across interactions.

### How It Works

- The system now keeps track of the conversation history for each session.
- When a new question is asked, the `history_aware_retriever` reformulates it using the chat history to make it self-contained.
- The reformulated question is then processed by the retrieval chain to provide a contextually accurate answer.

This setup enhances the system's ability to handle follow-up questions and references to previous interactions, making it more effective in conversational settings.
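The flow described above can be traced with a small library-free sketch. All names in it are hypothetical, and `reformulate` is a deliberately naive stub: the real history-aware retriever asks the language model to rewrite the question, whereas this toy only appends the previous user question when it spots the word "it".

```python
sessions = {}

def get_history(session_id):
    # One message list per session, created on first use
    return sessions.setdefault(session_id, [])

def reformulate(question, history):
    # Stub for the LLM-based rewriting step
    previous = [text for role, text in history if role == "user"]
    words = question.lower().replace("?", " ").split()
    if previous and "it" in words:
        return f"{question} (referring to: {previous[-1]})"
    return question

def ask(session_id, question, answer_fn):
    history = get_history(session_id)
    standalone = reformulate(question, history)  # 1. make the question self-contained
    answer = answer_fn(standalone)               # 2. retrieve and answer
    history.append(("user", question))           # 3. remember both turns
    history.append(("ai", answer))
    return standalone, answer

def echo(question):
    return f"Answer to: {question}"

first, _ = ask("demo", "What is prompt chaining?", echo)
second, _ = ask("demo", "Why should I use it?", echo)
print(first)
print(second)
```

A second session ID starts with an empty history, so the same follow-up question would be passed on unchanged there.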


In [None]:
output = conversational_rag_chain.invoke(
    {"input": "How to add citations to a RAG chain?"},
    config={
        "configurable": {"session_id": "abc12"}
    },  # constructs a key "abc12" in `store`.
)

print(output["answer"])

In [None]:
output = conversational_rag_chain.invoke(
    {"input": "What could a citation prompt look like?"},
    config={
        "configurable": {"session_id": "abc12"}
    }
)

print(output["answer"])

### Access message history

In [None]:
from langchain_core.messages import AIMessage

markdown_string = ""

for message in store["abc12"].messages:
    prefix = "AI" if isinstance(message, AIMessage) else "User"

    markdown_string += f"\n**{prefix}:** {message.content}\n"

In [None]:
from IPython.display import Markdown, display

display(Markdown(markdown_string))

## TASK Advanced: Add citations to the answer.

<div class="alert alert-block alert-warning">

**Transfer:**

Check out the LangChain tutorial to add citations to your answer output: [QA Citations](https://python.langchain.com/v0.2/docs/how_to/qa_citations/)
</div>

# Enhancing the Question-Answering Chain with Contextual Compression

Congratulations on building a history-aware question-answering system! By now, you've seen how integrating chat history into the retrieval and question-answering process can significantly improve the system's ability to provide contextually accurate answers. But the journey doesn't have to end here. There are always ways to refine and enhance your system further.

## Next Steps: Contextual Compression

One powerful technique you can explore to improve your question-answering chain is **contextual compression**. Contextual compression involves reducing the amount of context while retaining the most relevant information. This can help in scenarios where the retrieved context is too large to process efficiently or when you want to focus on the most critical parts of the context.

To implement contextual compression, you can follow the guide provided in the LangChain documentation.

### Steps to Implement Contextual Compression

1. **Understand Contextual Compression**: Read through the [LangChain documentation on contextual compression](https://python.langchain.com/v0.2/docs/how_to/contextual_compression/) to get a detailed understanding of the concept and its benefits.

2. **Integrate Compression Techniques**: Use the techniques described in the documentation to compress the context retrieved by your system. This could involve summarizing documents, extracting key phrases, or using machine learning models to identify the most relevant pieces of information.

3. **Update Your Chain**: Modify your existing retrieval and question-answering chain to include a step for contextual compression. This could be done by creating a new component in the chain that processes the retrieved context before it's used to generate the answer.

4. **Experiment and Evaluate**: Test your improved system with various questions and chat histories. Evaluate the quality of the answers and the efficiency of the system. Compare it with the previous version to see the improvements.


By incorporating contextual compression, you can make your question-answering system even more robust and efficient, providing high-quality answers while handling larger and more complex contexts.
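To get a feeling for the idea before reading the guide, here is a deliberately simple sketch of a compression step. It is an assumption-laden toy: `compress` is a hypothetical function that keeps only sentences sharing at least one word with the query. The approaches in the linked documentation are more capable. This keyword filter only illustrates where such a step sits between retrieval and answer generation.

```python
import re

def compress(chunk, query, min_overlap=1):
    """Keep only sentences that share at least `min_overlap` words with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        sentence
        for sentence in sentences
        if len(query_words & set(re.findall(r"\w+", sentence.lower()))) >= min_overlap
    ]
    return " ".join(kept)

long_chunk = (
    "Prompt chaining splits a task into subtasks. "
    "The weather was nice that day. "
    "Each subtask in the chain gets its own prompt."
)

compressed = compress(long_chunk, "prompt chaining")
print(compressed)
```

Common words in the query (such as "the") would match almost every sentence, so a real filter needs a better relevance signal than plain word overlap.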

## TASK Advanced RAG: Add Pre- or Post-Retrieval Components

<div class="alert alert-block alert-warning">

**Transfer:**

Try out different predefined RAG components and add them to your chain to see if it improves the results: [Contextual Compression](https://python.langchain.com/v0.2/docs/how_to/contextual_compression/)
</div>

Great. We hope this notebook has helped you to understand how the answer generation of large language models can be guided using trusted knowledge stores. This should reduce hallucination effects.

If you have any questions, please do not hesitate to ask Keno (Track chair of Baltic Perspectives), even if you are not a member of his team. He is in room 25-2.16.

<img src="https://mylab.th-luebeck.de/images/mylab-logo-without.png" width=200px>