# RAG Application - Document Q&A

In this notebook, we build a document Q&A application step by step using a simple RAG (retrieval-augmented generation) pipeline.

**Gemini AI models** are used for embedding and answer generation, and **ChromaDB** serves as the vector database. The RAG pipeline is assembled manually.

## Getting Started

* Install `langchain-google-genai` to use the `Gemini API` from Python
* Install `langchain_community` (this package contains third-party integrations, e.g. the PyPDF loaders)

In [2]:
%pip install -qU langchain-google-genai
%pip install -qU langchain_community

Note: you may need to restart the kernel to use updated packages.


## Libraries

In [3]:
# %pip install ipython
# %pip install python-dotenv
# %pip install langchain

In [4]:
import os
import chromadb

from dotenv import load_dotenv  # to load environment variables (for API key variable)
from pathlib import Path  

from IPython.display import Markdown  # to get output in Markdown style

from langchain_community.document_loaders import PyPDFDirectoryLoader # to load PDFs from a folder
from langchain.text_splitter import RecursiveCharacterTextSplitter  # langChain text splitter
import google.generativeai as genai
from langchain_google_genai import GoogleGenerativeAIEmbeddings  # langChain access to google GenAI embedding models
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry




## Setup Google API key

https://ai.google.dev/gemini-api/docs/api-key 

* Store your API key in an environment variable file (`.env`) and load it using `load_dotenv()`
* Add the `.env` file to `.gitignore` so the key is never committed

In [5]:
dotenv_path = Path('.env')  # path to the environment file that holds the API key
load_dotenv(dotenv_path=dotenv_path)

GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
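
A missing key usually only surfaces later as an authentication error, so it can help to fail early. The sketch below shows one way to do that. `require_env` is a hypothetical helper introduced here for illustration; it is not part of `python-dotenv` or the Gemini SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of python-dotenv or the Gemini SDK.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value
```

With this helper, the assignment above could be written as `GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")`.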

## Q&A System - Step by step

### 1 - Load documents
The first step is to load PDF documents into the system. We use `PyPDFDirectoryLoader` from the `langchain_community` library to achieve this.

In [6]:
loader = PyPDFDirectoryLoader("../Data/")  # to load multiple files from a folder
docs = loader.load()
print(docs)



### 2 - Split the Documents into Chunks

To handle large documents efficiently, we split the documents into smaller chunks using the `RecursiveCharacterTextSplitter` class.

In [7]:
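# chunk_size: maximum number of characters per chunk
# chunk_overlap: number of characters shared between consecutive chunks (usually much smaller than chunk_size)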
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=1000)
chunks = text_splitter.split_documents(docs)

print("Total number of Chunks: ", len(chunks))  # Check how many chunks we have

Total number of Chunks:  2072
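
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter also tries to break on separators such as paragraphs and sentences). `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into fixed-size character windows that share `chunk_overlap` characters.

    Simplified illustration only: it ignores the separator-aware logic of
    RecursiveCharacterTextSplitter.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each printed piece starts with the last three characters of the previous one, which is what the overlap is for: a sentence cut at a chunk boundary still appears whole in at least one chunk.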


Every chunk has a `metadata` attribute (a dictionary) whose `source` key holds the path of the PDF it came from.

In [8]:
chunks[0].metadata['source']

'../Data/Newwhitepaper_Prompt Engineering_v4.pdf'

### 3 - Generate embeddings with Gemini AI

Next, we embed these chunks with one of the embedding models available through the Gemini API. Embeddings are vector representations of text, and they allow us to perform similarity-based retrieval.

In [10]:
genai.configure(api_key=GOOGLE_API_KEY)

for m in genai.list_models():
    if "embedContent" in m.supported_generation_methods:
        print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


In [11]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

# example
vector = embeddings.embed_query("Hello, world")
print("The vector's first values: ", vector[:5])
print("The length of the output vector (vector's dimensionality): ",len(vector))

The vector's first values:  [0.0065857404842972755, -0.009072314016520977, -0.052920885384082794, 0.007686096243560314, -0.026785606518387794]
The length of the output vector (vector's dimensionality):  768
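
To illustrate how embeddings support similarity-based retrieval, the sketch below compares toy vectors with cosine similarity. Cosine similarity is only one common measure, and the vector store may be configured with a different metric. The three-dimensional vectors are made up and stand in for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (values are made up).
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.1]
far_vec = [-1.0, 1.0, 0.0]

# The vector pointing in nearly the same direction as the query scores higher.
print(cosine_similarity(query_vec, close_vec) > cosine_similarity(query_vec, far_vec))
```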


The limitation of this embedding wrapper is that it works on plain strings rather than `Document` objects: `embed_query` takes a single string and `embed_documents` takes a list of strings.

In [12]:
# example: we can only send the page content (as a list of strings) -> the metadata is lost
vectors = embeddings.embed_documents([chunks[0].page_content])

The solution, as we will see in the next step, is to create a custom embedding function.

### 4 - Create a Vector Store for Document Retrieval

We now store the document chunks and their embeddings in a vector database, which will allow us to retrieve similar documents based on user queries.

In this example, we are using Chroma as our vector database. Chroma is one of the many options available for storing and retrieving embeddings efficiently. 

1. Create a Chroma client: `chroma_client = chromadb.Client()`

2. Create a collection: where you'll store your embeddings, documents, and any additional metadata. Collections index your embeddings and documents, and enable efficient retrieval and filtering
    * By default, Chroma uses the **Sentence Transformers** `all-MiniLM-L6-v2` model to create embeddings.
    * To use a different embedding model, we just need to implement the `EmbeddingFunction` protocol.

3. Add documents to the collection: Chroma will store your text and handle embedding and indexing automatically. You can also customize the embedding model. You must provide unique string IDs for your documents.
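
The three steps above can be pictured with a toy in-memory store. This is a sketch under stated assumptions, not Chroma's implementation or API: `ToyCollection` and `letter_counts` are hypothetical names, the "embedding" is a simple letter count, and ranking uses squared Euclidean distance.

```python
class ToyCollection:
    """Tiny in-memory stand-in for a vector-store collection (hypothetical, not Chroma's API)."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function
        self.records = {}  # id -> (document text, embedding)

    def add(self, documents: list[str], ids: list[str]) -> None:
        """Embed each document and store it under its unique string id."""
        if len(documents) != len(ids):
            raise ValueError("documents and ids must have the same length")
        for doc_id, doc, vec in zip(ids, documents, self.embedding_function(documents)):
            if doc_id in self.records:
                raise ValueError(f"duplicate id: {doc_id}")
            self.records[doc_id] = (doc, vec)

    def query(self, text: str, n_results: int = 1) -> list[str]:
        """Return the stored documents closest to `text`, nearest first."""
        [query_vec] = self.embedding_function([text])

        def squared_distance(item):
            _, (_, vec) = item
            return sum((q - v) ** 2 for q, v in zip(query_vec, vec))

        ranked = sorted(self.records.items(), key=squared_distance)
        return [doc for _, (doc, _) in ranked[:n_results]]


def letter_counts(texts: list[str]) -> list[list[float]]:
    """Toy embedding: how often 'a', 'e' and 'o' occur in each text."""
    return [[float(t.count(ch)) for ch in "aeo"] for t in texts]


store = ToyCollection(letter_counts)                            # steps 1 and 2
store.add(["banana", "teepee", "voodoo"], ids=["0", "1", "2"])  # step 3
print(store.query("papaya", n_results=1))
```

The query for "papaya" returns "banana" because both contain three occurrences of "a" and no "e" or "o", so their toy embeddings coincide.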

In [13]:
DB_NAME = "my_rag_db"

# 1. Create a Chroma client
chroma_client = chromadb.Client()

In [14]:
# 2. Collection: custom embedding function
# Define a class that inherits from Chroma's "EmbeddingFunction" and implements the embedding call
class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents or for queries (class attribute: document_mode)
    document_mode = True

    # Define __call__, the method that makes a class instance callable like a function
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        retry_policy = {"retry": retry.Retry(predicate=retry.if_transient_error)}

        response = genai.embed_content(
            model="models/text-embedding-004",
            content=input,
            task_type=embedding_task,
            request_options=retry_policy,
        )
        # Response will be a dictionary with metadata and key "embedding" that we are interested in
        return response["embedding"]
    

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)
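
For readers unfamiliar with callable classes, the following sketch isolates the two ideas used above: `__call__` lets an instance be invoked like a function, and the `document_mode` attribute switches behavior between calls. `ToyEmbeddingFunction` is a hypothetical stand-in; it does not call Gemini and does not subclass Chroma's `EmbeddingFunction`.

```python
class ToyEmbeddingFunction:
    """Stand-in that mimics the shape of a callable embedding function.

    Hypothetical: it does not call Gemini or subclass Chroma's EmbeddingFunction.
    """

    document_mode = True  # class attribute; can be overridden on an instance

    def __call__(self, input: list[str]) -> list[list[float]]:
        task = "retrieval_document" if self.document_mode else "retrieval_query"
        # Fake 2-dimensional "embedding": text length and a 0/1 flag for the task type.
        flag = 1.0 if task == "retrieval_document" else 0.0
        return [[float(len(text)), flag] for text in input]


toy_fn = ToyEmbeddingFunction()
print(toy_fn(["hello", "hi"]))  # the instance is called like a function
toy_fn.document_mode = False    # the same kind of switch is flipped before querying
print(toy_fn(["hello"]))
```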

In [15]:
# 3. Add documents to the collection
db.add(
    documents=[chunk.page_content for chunk in chunks],
    metadatas=[chunk.metadata for chunk in chunks],
    ids=[str(i) for i in range(len(chunks))],  # unique string IDs
)

### 5 - Retrieve Documents Based on a Query

To test the retrieval system with the custom embedding function created above, we run an example query and retrieve the most relevant document chunks (results are ordered from most to least similar).

In [21]:
# Switch to query mode when generating embeddings.
embed_fn.document_mode = False

# Search the Chroma DB using the specified query.
query = "what is an agent and tell me where did you find the information?"

result = db.query(query_texts=[query], n_results=1)
[[passage]] = result["documents"]
[[context]] = result['metadatas']

print(context["source"])
#print(passage)

../Data/Newwhitepaper_Agents2.pdf


In [17]:
result["documents"][0]

['5\nSeptember 2024\nWhat is an agent?\nIn its most fundamental form, a Generative AI agent can be defined as an application that \nattempts to achieve a goal by observing the world and acting upon it using the tools that it \nhas at its disposal. Agents are autonomous and can act independently of human intervention, \nespecially when provided with proper goals or objectives they are meant to achieve. Agents \ncan also be proactive in their approach to reaching their goals. Even in the absence of \nexplicit instruction sets from a human, an agent can reason about what it should do next to \nachieve its ultimate goal. While the notion of agents in AI is quite general and powerful, this \nwhitepaper focuses on the specific types of agents that Generative AI models are capable of \nbuilding at the time of publication.\nIn order to understand the inner workings of an agent, let’s first introduce the foundational \ncomponents that drive the agent’s behavior, actions, and decision making. Th

In [35]:
print(result.keys())

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'data', 'metadatas', 'distances', 'included'])
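
The query result is nested: each key holds one inner list per query text, with one entry per returned chunk. The `[[passage]] = ...` unpacking used earlier therefore only works when exactly one chunk is returned for one query. The sketch below shows one way to flatten several hits into records. `sample_result` is a hand-written dictionary with made-up values that only mirrors this shape, and `unpack_first_query` is a hypothetical helper.

```python
# Hand-written sample with the same nested shape as a query result:
# one inner list per query text, one entry per returned chunk.
# The values are made up for illustration.
sample_result = {
    "ids": [["12", "40"]],
    "documents": [["first passage text", "second passage text"]],
    "metadatas": [[{"source": "a.pdf"}, {"source": "b.pdf"}]],
    "distances": [[0.21, 0.35]],
}


def unpack_first_query(result: dict) -> list[dict]:
    """Flatten the hits for the first query text into a list of records."""
    return [
        {"id": id_, "text": doc, "source": meta["source"], "distance": dist}
        for id_, doc, meta, dist in zip(
            result["ids"][0],
            result["documents"][0],
            result["metadatas"][0],
            result["distances"][0],
        )
    ]


for hit in unpack_first_query(sample_result):
    print(hit["source"], hit["distance"])
```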


### 6 - Augmented Generation: build a Question-Answering (Q&A) System

Now that the retrieval step has found a relevant passage in the document set, the next step is augmented generation. For this, we use a Gemini generative model, `gemini-1.5-flash`.

In addition, we define a prompt to send to the LLM together with the input query and the retrieved context.

In [58]:
model = genai.GenerativeModel("gemini-1.5-flash-latest")

In [88]:
prompt = f"""
You are an AI expert. Provide clear, concise answers based on the provided context. 
If the information is not found in the context, state that the answer is unavailable. 
Use a maximum of three sentences.

QUESTION: {query}
PASSAGE: {passage}
CONTEXT: {context["source"]}
"""

In [87]:
answer = model.generate_content(prompt)
print(answer.text)

The provided text mentions various prompting techniques but doesn't give specific examples.  The document, "../Data/Newwhitepaper_Prompt Engineering_v4.pdf," discusses prompt engineering as an iterative process of crafting, testing, analyzing, and refining prompts.  More detail on specific techniques is unavailable in this excerpt.



## Next steps

1. Improve the retrieval step: it currently returns a single result. Increasing the number of results would provide more references and, in turn, a more precise answer. For example, the retrieved passage lacks concrete examples of prompting techniques. This could be addressed by building the `passage` variable from several retrieved chunks.

2. Improve the prompt:
    * Specify answer layout to get a clear statement and the reference of the document (citation)
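
As a starting point for both items, the sketch below assembles a prompt from several retrieved passages and asks the model to cite them by number. It is one possible layout, not a tested recipe: `build_prompt` is a hypothetical helper, and the `hits` records (with `source` and `text` keys) are an assumed format with made-up example values.

```python
def build_prompt(query: str, hits: list[dict]) -> str:
    """Assemble a prompt from several retrieved passages, each tagged with its source."""
    passages = "\n\n".join(
        f"[{i}] (source: {hit['source']})\n{hit['text']}"
        for i, hit in enumerate(hits, start=1)
    )
    return (
        "You are an AI expert. Answer using only the passages below.\n"
        "If the answer is not in the passages, say that it is unavailable.\n"
        "Use at most three sentences and cite passages by number, e.g. [1].\n\n"
        f"QUESTION: {query}\n\n"
        f"PASSAGES:\n{passages}\n"
    )


# Made-up records standing in for retrieved chunks.
example_hits = [
    {"source": "a.pdf", "text": "first passage text"},
    {"source": "b.pdf", "text": "second passage text"},
]
print(build_prompt("What is an agent?", example_hits))
```

Numbering the passages gives the model a compact way to reference its sources, and the source path next to each number lets the reader map a citation back to a PDF.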