![Header image](./header.png)

# Exercise 4.B Retrieval Augmented Generation

The goal of this exercise is to build a chatbot demo that lets you ask questions about the content of documents. The underlying method is called retrieval-augmented generation (RAG).
The detailed tasks in this exercise are:
- install a local large language model using the application Ollama
- set up a new environment with the required packages
- implement a simple chatbot using LangChain[2]
- test the chatbot on a specific technical document
- test the impact of the embedding model and the augmentation on the quality of the results

We use Ollama[1] to run the LLM locally and the LangChain[2] framework to access the model.

- [1] https://ollama.com/
- [2] https://www.langchain.com/

# Considerations

- Install Ollama on your computer
- Install the additional software packages into the environment by uncommenting the pip install commands once
- Select a model that fits the memory size of your laptop
- This is less a coding exercise than an exercise in integrating a local LLM

# Requirements

- R0: Install the required packages using the pip commands
- R1: Install the Ollama software
- R2: Find a model which is running on your machine
- R3: Start the server for the model
- R4: Connect the server to the notebook
- R5: Run the code parts until the first query
- R6: Test the RAG function with and without supporting context for the HuggingFace embedding
- R7: Test the RAG function with and without supporting context for the Ollama embedding
- R8: Test the RAG function with one easy and one difficult query

# Install Packages

In [None]:
#%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph

In [None]:
#%pip install -qU langchain-ollama

In [None]:
#%pip install -qU langchain-core

In [None]:
#%pip install -qU pypdf

In [None]:
#%pip install -qU langchain-huggingface

In [None]:
#%pip install -qU ipywidgets

# Import Basic Modules

In [None]:
import os
import pprint

# Prepare Embedding Model

In [None]:
#
# Load embeddings model (we will test two different models R6, R7)
#
from langchain_ollama import OllamaEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# R7: Ollama embedding
embeddings = OllamaEmbeddings(model="llama3")

# R6: HuggingFace embedding (uncomment to switch)
#embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [None]:
#
# Test embedding model
#

In [None]:
single_embedding = embeddings.embed_query("Woman")
print(str(single_embedding)[:100])

In [None]:
single_embedding = embeddings.embed_query("Man")
print(str(single_embedding)[:100])

In [None]:
print(len(single_embedding))
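
The cells above print only the first characters of each vector, which makes it hard to judge how close two embeddings are. A minimal sketch for comparing two embeddings with cosine similarity follows. The helper name `cosine_similarity` and the short example vectors are illustrative assumptions and not part of the exercise code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
vec_a = [0.2, 0.7, 0.1]
vec_b = [0.25, 0.65, 0.05]
vec_c = [-0.6, 0.1, 0.9]

print(cosine_similarity(vec_a, vec_b))  # vectors pointing in a similar direction
print(cosine_similarity(vec_a, vec_c))  # vectors pointing in different directions
```

In the notebook you would keep the two results of `embeddings.embed_query(...)` in separate variables and pass them to the helper.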

## Load PDF from File

In [None]:
#
# Define PDF file to read
#
file_path = "./documents/bitcoin.pdf"

In [None]:
#
# Load content of PDF file into memory
#
from langchain_community.document_loaders import PyPDFLoader

In [None]:
loader = PyPDFLoader(file_path, mode="single")

In [None]:
pages = loader.load()

In [None]:
#
# Check content of pages
#

In [None]:
pprint.pp(pages[0].metadata)

In [None]:
print(pages[0].page_content[:1000])

## Text Splitter

A text splitter breaks a text into smaller pieces. These pieces are then loaded into the vector database together with their embeddings. There are many text splitting strategies; the best ones take the content into account (semantic splitters).
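
To make the parameters `chunk_size` and `chunk_overlap` used in this section concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break at separators such as paragraph and line breaks, which this sketch ignores. The function name `split_with_overlap` is an assumption made for illustration.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so a sentence cut at
    a chunk border still appears (partly) in the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = split_with_overlap("abcdefghij", chunk_size=5, chunk_overlap=2)
print(chunks)
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, which is why a larger overlap produces more chunks for the same text.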

In [None]:
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

In [None]:
documents = text_splitter.split_documents(pages)

In [None]:
print(len(documents))

In [None]:
pprint.pp(documents[0])

## Create Vectorstore and fill it

A vector store is a database that stores information together with its embedding. In our case it is an in-memory store.

In [None]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore.from_documents(documents, embeddings)

In [None]:
#
# This function lets us search for similar text pieces in the vector store.
# We give it a query; it computes the query embedding, searches the stored embeddings for similar entries, and returns the matching documents.
#
docs = vector_store.similarity_search("What is Proof-of-work?", k=3)

In [None]:
#
# Show all found similar text parts
#
for doc in docs:
    pprint.pp(doc)
    print()
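
Conceptually, the search embeds the query and ranks the stored entries by similarity. The sketch below shows that idea in plain Python. Everything in it is a stand-in: `toy_embed` is a word-count embedding invented for this example (a real embedding model captures meaning, not just shared words), `TinyVectorStore` is a hypothetical class and not how `InMemoryVectorStore` is implemented, and the three sentences are made-up sample texts.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy embedding: a sparse word-count vector (assumption for illustration)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def sparse_cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Keeps (text, embedding) pairs in a list and searches them linearly."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((text, self.embed(text)))

    def similarity_search(self, query, k=3):
        query_vec = self.embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: sparse_cosine(query_vec, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore(toy_embed)
store.add_texts([
    "Proof-of-work makes changing past blocks expensive.",
    "Transactions are hashed into a Merkle tree.",
    "Nodes broadcast new transactions to the network.",
])
print(store.similarity_search("What is proof-of-work?", k=1))
```

The linear scan is fine for the few hundred chunks of a single PDF; larger collections usually need an index.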

## Setup LLM

In [None]:
from langchain_ollama import ChatOllama
from langchain_core.messages import SystemMessage

In [None]:
llm = ChatOllama(model="llama3.2")

## Setup Simple Sequence Solution

Here we simulate a RAG setup by executing just the required steps one by one.

In [None]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.prompts import PromptTemplate

In [None]:
#
# User query (R8)
#
query = "What is proof-of-work in the context of bitcoin?" # simple query
#query = "What is the title of the work of W. Feller?" # complex query

In [None]:
#
# Collect supporting information from database
#
retrieved_docs = vector_store.similarity_search(query, k=3)

In [None]:
#
# Join the text of the retrieved documents into a single context string
#
context = "\n\n".join(doc.page_content for doc in retrieved_docs)

In [None]:
context

In [None]:
system_prompt = """ 
You are a helpful assistant answering questions from users.

You have the following context information for answering the user query:
-------------------------------------
{context}
-------------------------------------

Now, please answer the following user question based on the context information. 
Do not use your own knowledge.
If the context information is not sufficient, do not answer the query.

User query:
{query}
"""
prompt = PromptTemplate.from_template(system_prompt)


In [None]:
print(system_prompt)

In [None]:
#
# Setup prompt
#
messages = prompt.invoke({"query": query, "context": context})

In [None]:
#
# Call LLM
#
output = llm.invoke(messages)

In [None]:
print(output.content)

In [None]:
#
# Check if the context really helped. Make the same query to the LLM, but remove the context.
#

In [None]:
messages = prompt.invoke({"query": query, "context": '' })

In [None]:
output = llm.invoke(messages)

In [None]:
print(output.content)
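
For R6 to R8 the same with/without-context comparison has to be repeated for several queries and embeddings. One way to avoid re-running the cells by hand is to wrap the steps in a small helper. The sketch below is an assumption about how that could look: `answer`, `compare`, and the shortened `PROMPT` are hypothetical, and `retrieve` and `generate` are plain callables standing in for `vector_store.similarity_search` and `llm.invoke`, replaced here by stubs so the control flow can be run without a model.

```python
PROMPT = (
    "Context:\n{context}\n\n"
    "Answer the question using only the context above.\n"
    "Question: {query}"
)


def answer(query, retrieve, generate, use_context=True, k=3):
    """Run retrieve -> build prompt -> generate; optionally skip retrieval."""
    docs = retrieve(query, k) if use_context else []
    context = "\n\n".join(docs)
    return generate(PROMPT.format(context=context, query=query))


def compare(query, retrieve, generate):
    """Answer the same query once with and once without retrieved context."""
    return {
        "with_context": answer(query, retrieve, generate, use_context=True),
        "without_context": answer(query, retrieve, generate, use_context=False),
    }


# Stubs so the sketch runs without Ollama; the sample sentence is made up.
def fake_retrieve(query, k):
    return ["Proof-of-work requires finding a hash with leading zero bits."][:k]


def fake_generate(prompt):
    # Echo the prompt so we can inspect what the model would have received.
    return prompt


result = compare("What is proof-of-work?", fake_retrieve, fake_generate)
print(result["with_context"])
print("---")
print(result["without_context"])
```

In the notebook, `retrieve` could be `lambda q, k: [d.page_content for d in vector_store.similarity_search(q, k=k)]` and `generate` could be `lambda p: llm.invoke(p).content`.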

## Setup Agentic Solution

We now use LangGraph, the agent framework of LangChain, to set up an agent that can do RAG.

In [None]:
from langchain_core.documents import Document

In [None]:
# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"query": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

In [None]:
response = graph.invoke({"question": "What is Proof-of-work?"})
print(response["answer"])

# A better solution

RAG suffers from two problems: text parts often lack their surrounding context, and a question does not always have an embedding similar to that of its answer. There are ways to mitigate these problems.

- Add context to each text part (e.g. a summary of the complete chapter)
- Use better text splitting concepts (e.g. semantic chunking)
- Rewrite the user query if no suitable text part is found (ReAct)
- Combine retrieval strategies (e.g. add keyword-based retrievers)
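
As an illustration of the last point, the rankings of an embedding-based and a keyword-based retriever can be merged with reciprocal rank fusion (RRF). This is one common option and not something the exercise prescribes. The function name, the chunk identifiers, and the two example rankings are assumptions made for this sketch; `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by several retrievers come out on top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results of two retrievers for the same query.
embedding_ranking = ["chunk_a", "chunk_b", "chunk_c"]
keyword_ranking = ["chunk_b", "chunk_c", "chunk_a"]

fused = reciprocal_rank_fusion([embedding_ranking, keyword_ranking])
print(fused)
```

RRF only needs the rank positions, so the similarity scores of the two retrievers do not have to be on a comparable scale.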

Let's try the ReAct (reasoning and acting) approach. This agent includes a reflection step that checks the intermediate results and rewrites the query if the retrieval quality is poor.
A detailed solution can be found at https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_agentic_rag.ipynb.

The following adaptations are required:

- change the LLMs (one LLM init is sufficient)
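
Before wiring this into LangGraph, it can help to see the control flow in isolation. The sketch below is not the implementation from the linked notebook; it only shows a retrieve, grade, rewrite loop in plain Python. All names are hypothetical, and the three callables are stubs: in the real agent, grading and rewriting would each be an LLM call and `retrieve` would query the vector store.

```python
def retrieve_with_rewrite(question, retrieve, is_relevant, rewrite, max_rounds=3):
    """Retrieve documents; if none is graded relevant, rewrite the query and retry."""
    query = question
    for _ in range(max_rounds):
        relevant = [doc for doc in retrieve(query) if is_relevant(question, doc)]
        if relevant:
            return query, relevant
        query = rewrite(query)
    return query, []  # give up after max_rounds


# Made-up sample chunks standing in for the vector store content.
CHUNKS = {
    "pow": "The proof-of-work involves scanning for a value whose hash begins with zero bits.",
    "other": "Nodes broadcast new transactions to the network.",
}


def stub_retrieve(query):
    # Like a similarity search, this always returns something, relevant or not.
    key = "pow" if "proof-of-work" in query.lower() else "other"
    return [CHUNKS[key]]


def stub_is_relevant(question, doc):
    # Stand-in for an LLM grader.
    return "proof-of-work" in doc.lower()


def stub_rewrite(query):
    # Stand-in for an LLM that rephrases the query.
    return query.replace("PoW", "proof-of-work")


final_query, docs = retrieve_with_rewrite(
    "What is PoW?", stub_retrieve, stub_is_relevant, stub_rewrite
)
print(final_query)
print(docs)
```

The `max_rounds` limit keeps the loop from running forever when the document simply does not contain the answer.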

In [None]:
# your code here