# RAG Application - Document Q&A

In this notebook, we build a document Q&A application step by step using a simple RAG (retrieval-augmented generation) pipeline.

We use **Gemini AI models** for embedding and answer generation, **ChromaDB** as the vector database, and **LangChain** to manage the retrieval process.

Based on: https://python.plainenglish.io/building-a-rag-application-with-gemini-ai-step-by-step-guide-24636dd21f5b 

## Getting Started

* Install `langchain-google-genai`, the LangChain integration package for the `Gemini API`
* Install `langchain_community` (this package contains third-party integrations, including `PyPDFLoader`) together with `pypdf`
* Install the `langchain-chroma` integration package to access `ChromaDB`

In [2]:
%pip install -qU langchain-google-genai
%pip install -qU langchain_community pypdf
%pip install -qU "langchain-chroma>=0.1.2"

Note: you may need to restart the kernel to use updated packages.


## Libraries

In [3]:
import os

from dotenv import load_dotenv  # to load environment variables
from pathlib import Path  

from IPython.display import Markdown  # to get output in Markdown style

from langchain_community.document_loaders import PyPDFLoader  # to load PDFs
from langchain.text_splitter import RecursiveCharacterTextSplitter  # LangChain text splitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings  # LangChain access to Google GenAI embedding models
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_chroma import Chroma  # LangChain access to ChromaDB
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain


## Setup Google API key

https://ai.google.dev/gemini-api/docs/api-key 

* Store your API key in an environment variable file (`.env`) and load it using `load_dotenv()`
* Add the `.env` file to `.gitignore` so the key is never committed

In [4]:
dotenv_path = Path('.env')  # environment file that holds GOOGLE_API_KEY
load_dotenv(dotenv_path=dotenv_path)

GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
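
Because `os.getenv` returns `None` when a variable is missing, a small guard can surface a misconfigured `.env` file before any API call is made. The sketch below is one possible way to do that; the helper name `require_env` is hypothetical and is not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value
```

It could replace the plain lookup above, for example `GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")`.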

## Q&A System - Step by step

### 1 - Load documents from PDF
The first step is to load a PDF document into the system. We use `PyPDFLoader` from the `langchain_community` library to achieve this.

In [5]:
loader = PyPDFLoader("./Data/Newwhitepaper_Foundational Large Language models & text generation.pdf")  # Load your PDF file
data = loader.load()
print(data)

[Document(metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 20.0 (Macintosh)', 'creationdate': '2024-11-12T11:43:11-07:00', 'moddate': '2024-11-12T11:43:17-07:00', 'trapped': '/False', 'source': './Data/Newwhitepaper_Foundational Large Language models & text generation.pdf', 'total_pages': 75, 'page': 0, 'page_label': '1'}, page_content='Foundational \nLarge Language \nModels & \nText Generation\nAuthors: Mohammadamin Barektain,  \nAnant Nawalgaria, Daniel J. Mankowitz,  \nMajd Al Merey, Yaniv Leviathan, Massimo Mascaro,  \nMatan Kalman, Elena Buchatskaya,                                     \nAliaksei Severyn, and Antonio Gulli'), Document(metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 20.0 (Macintosh)', 'creationdate': '2024-11-12T11:43:11-07:00', 'moddate': '2024-11-12T11:43:17-07:00', 'trapped': '/False', 'source': './Data/Newwhitepaper_Foundational Large Language models & text generation.pdf', 'total_pages': 75, 'page': 1, 'page

### 2 - Split the Document into Chunks

To handle large documents efficiently, we split the PDF into smaller chunks using the `RecursiveCharacterTextSplitter` class.

In [6]:
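# chunk_size: maximum number of characters per chunk
# chunk_overlap: number of characters shared by consecutive chunks (usually much smaller than chunk_size)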
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=1000)
docs = text_splitter.split_documents(data)

print("Total number of Chunks: ", len(docs))  # Check how many chunks we have
# for chunk in docs:
#     print(chunk.page_content)

Total number of Chunks:  616
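
To make the roles of chunk size and overlap concrete, here is a deliberately simplified sketch that slides a fixed-size character window over a string. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first tries to split on separators such as paragraph and line breaks; the function name `chunk_text` is hypothetical, and this toy version requires the overlap to be strictly smaller than the chunk size.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.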


### 3 - Generate embeddings with Gemini AI

Next, we embed these chunks with one of the Gemini embedding models available through Google's generative AI API. Embeddings are vector representations of text, and they let us retrieve chunks by their similarity to a query.

In [7]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
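
As an illustration of what "similarity" between embeddings can mean, the sketch below computes cosine similarity between toy vectors. The three-dimensional vectors are made up for the example (real embeddings have far more dimensions), and cosine similarity is only one common choice; the metric actually used depends on how the vector store is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]  # same direction as the query
far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, close))  # close to 1: very similar
print(cosine_similarity(query, far))    # 0.0: unrelated
```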

### 4 - Create a Vector Store for Document Retrieval

We now store the document chunks and their embeddings in a vector database, which will allow us to retrieve similar documents based on user queries.

In this example, we are using Chroma as our vector database. Chroma is one of the many options available for storing and retrieving embeddings efficiently. 

`from_documents` creates a vector store from a list of documents (`docs`) using the specified embedding function (the `embedding` argument defaults to `None`).

`as_retriever` returns a retriever backed by the vector store; `k` is the number of documents to return.

In [8]:
vectorstoredb = Chroma.from_documents(documents=docs, embedding=embeddings)
retriever = vectorstoredb.as_retriever(search_type="similarity", search_kwargs={"k": 5})
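
Conceptually, a similarity retriever with `k` set returns the `k` stored items whose vectors are nearest to the query vector. The sketch below shows that idea with a brute-force scan over an in-memory list. `TinyVectorStore` is a hypothetical stand-in, not Chroma's API, and real vector stores typically rely on approximate indexes rather than comparing against every vector.

```python
import heapq


class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector store (not Chroma's API)."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self._items.append((text, vector))

    def search(self, query: list[float], k: int) -> list[str]:
        """Return the k texts whose vectors are closest to the query (squared L2 distance)."""

        def distance(item: tuple[str, list[float]]) -> float:
            _, vector = item
            return sum((q - v) ** 2 for q, v in zip(query, vector))

        nearest = heapq.nsmallest(k, self._items, key=distance)
        return [text for text, _ in nearest]


store = TinyVectorStore()
store.add("transformers", [0.9, 0.1])
store.add("n-gram tables", [0.7, 0.3])
store.add("cooking recipes", [0.0, 1.0])

print(store.search([1.0, 0.0], k=2))  # ['transformers', 'n-gram tables']
```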

### 5 - Retrieve Documents Based on a Query

To test the retrieval system, we ask "What is a large language model?" as an example and retrieve the most relevant document chunks (the first one has the highest similarity score).

In [9]:
retrieved_docs = retriever.invoke("What is a large language model")
print(len(retrieved_docs))
print(retrieved_docs[0].page_content)  # Print the first retrieved document

5
Foundational Large Language Models & Text Generation
8
September 2024
Large language models
A language model predicts the probability of a sequence of words. Commonly, when given 
a prefix of text, a language model assigns probabilities to subsequent words. For example, 
given the prefix “The most famous city in the US is…”, a language model might predict high 
probabilities to the words “New York” and “Los Angeles” and low probabilities to the words 
“laptop” or “apple”. You can create a basic language model by storing an n-gram table,2 while 
modern language models are often based on neural models, such as transformers.
Before the invention of transformers1, recurrent neural networks (RNNSs) were the popular 
approach for modeling sequences. In particular, “long short-term memory” (LSTM) and 
“gated recurrent unit” (GRU) were common architectures.3 This area includes language 
problems such as machine translation, text classification, text summarization, and question-


### 6 - Build a Question-Answering (Q&A) System

Now we move to the core of the RAG pipeline by building a question-answering chain with the `ChatGoogleGenerativeAI` class, which gives LangChain access to the Gemini chat models.

This step initializes the Gemini model that generates responses from the retrieved context.

In [10]:
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.3)

### 7 - Create the RAG Chain

We combine document retrieval with question answering using a custom prompt. The system will retrieve relevant documents and generate concise answers.

The prompt instructs the model to answer concisely and to rely only on the provided context.

NOTES:
* `create_stuff_documents_chain`: creates a chain that passes a list of documents to a model
    * the prompt must contain a `context` variable (see "{context}" in the system prompt)
* `create_retrieval_chain`: the `retriever` argument fetches the documents that are passed on as the context
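
The "stuff" strategy mentioned in the notes can be pictured as joining the text of all retrieved chunks into a single string and substituting it for `{context}`. The following plain-Python sketch approximates that idea; it is not LangChain's implementation, and the function name `stuff_documents` and the sample strings are hypothetical.

```python
def stuff_documents(
    page_contents: list[str], system_template: str, question: str
) -> list[tuple[str, str]]:
    """Join retrieved chunks into one context string and fill the prompt."""
    context = "\n\n".join(page_contents)  # every chunk goes into a single prompt
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


template = "Answer using only this context:\n\n{context}"
messages = stuff_documents(
    ["Chunk about transformers.", "Chunk about RNNs."],
    template,
    "What came before transformers?",
)
for role, content in messages:
    print(f"[{role}] {content}")
```

Since every retrieved chunk ends up in one prompt, the number of retrieved documents and the chunk size together determine how much of the model's context window is consumed.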

In [11]:
# Define a system prompt 
system_prompt = (
    "You are an AI expert. Provide clear, concise answers based on the provided context. "
    "If the information is not found in the context, state that the answer is unavailable. "
    "Use a maximum of three sentences."
    "\n\n"
    "{context}"
)

# Set up the prompt for the QA chain (see langchain_core.prompts.ChatPromptTemplate for the template format)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}")  # input = message passed through invoke method later on
    ]
)

# Create the RAG chain
chain = create_stuff_documents_chain(llm, prompt)  
rag_chain = create_retrieval_chain(retriever, chain)

### 8 - Ask a question

In [14]:
#response = rag_chain.invoke({"input": "What is a LLM"})
response = rag_chain.invoke({"input": "Give me examples of prompt engineering and let me know in which document was found"})
Markdown(response['answer'])

Examples of prompt engineering include providing clear instructions, examples, keywords, formatting, and background details.  This information is found in the provided text.

## Next steps

Some ideas to improve this exercise:

1. Chat history: the LLM keeps no conversation history, so it cannot handle follow-up questions

2. References/citations: because the metadata is limited, the chunks do not carry a usable reference to where they came from

3. Create an agent with the following flow:
    1. Add metadata
    2. Prompt -> Agent: give the LLM a summary of each PDF so it can decide which one to split and run the vector search on
    3. LLM -> RAG: once the right PDF is split into chunks, use the RAG chain to get the proper chunks
    4. RAG -> LLM -> answer
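
For the references idea, one possible starting point is the metadata printed in step 1, where each page carries `source`, `page`, and `page_label` fields. Assuming those fields are carried over to the chunks, the sketch below turns a list of retrieved chunks into de-duplicated citation strings. `Doc` is a hypothetical stand-in with LangChain-style `page_content` and `metadata` attributes, and `format_citations` and the sample file name are likewise invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved chunk with LangChain-style fields."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citations(docs: list[Doc]) -> list[str]:
    """Build de-duplicated 'source, page N' strings, preserving retrieval order."""
    seen: set[tuple[str, str]] = set()
    citations = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        page = str(doc.metadata.get("page_label", "?"))
        if (source, page) not in seen:
            seen.add((source, page))
            citations.append(f"{source}, page {page}")
    return citations


retrieved = [
    Doc("...", {"source": "paper.pdf", "page_label": "8"}),
    Doc("...", {"source": "paper.pdf", "page_label": "8"}),
    Doc("...", {"source": "paper.pdf", "page_label": "12"}),
]
print(format_citations(retrieved))  # ['paper.pdf, page 8', 'paper.pdf, page 12']
```

With the chain built in step 7, the retrieved documents should be available under `response["context"]`, so a similar function could be applied there and its result appended to the answer.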