# **Retrieval-Augmented Generation (RAG) Model for QA Bot**



---


**Problem Statement:**
Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)
bot for a business. Use a vector database like Pinecone DB and a generative model like
Cohere API (or any other available alternative). The QA bot should be able to retrieve
relevant information from a dataset and generate coherent answers.

**Task Requirements:**
1. Implement a RAG-based model that can handle questions related to a provided
document or dataset.
2. Use a vector database (such as Pinecone) to store and retrieve document
embeddings efficiently.
3. Test the model with several queries and show how well it retrieves and generates
accurate answers from the document.

---



Install package requirements

In [1]:
# Setup
!pip install pinecone-client sentence-transformers PyMuPDF cohere



Import required libraries

In [2]:
# Import Libraries
import os
import fitz  # PyMuPDF for PDF processing
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
import numpy as np
import cohere
from typing import List, Tuple

Initialize Pinecone and Cohere

In [5]:
# Initialize Pinecone and Cohere
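# Keys are read from the environment when set; the variable names are a suggested
# convention (an assumption), otherwise the placeholders below are used.
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "##############################")  # Or replace the placeholder with your Pinecone API key
cohere_api_key = os.environ.get("COHERE_API_KEY", "##############################")  # Or replace the placeholder with your Cohere API key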

# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)
INDEX_NAME = 'qa-bot-index'

# Delete existing index if it exists
if INDEX_NAME in pc.list_indexes().names():
    pc.delete_index(INDEX_NAME)

# Create a new index
pc.create_index(
    name=INDEX_NAME,
    dimension=384,  # Output size of all-MiniLM-L6-v2
    metric='cosine',  # Matches the cosine similarity computed in answer_query
    spec=ServerlessSpec(cloud='aws', region='us-east-1')
)

# Connect to the index
index = pc.Index(INDEX_NAME)

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Initialize Cohere client
co = cohere.Client(cohere_api_key)
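
The `metric` argument of `create_index` decides how Pinecone scores matches. As a quick sanity check, and assuming the embeddings are unit-length (an assumption worth verifying for the model in use), squared Euclidean distance and cosine similarity are related by ‖a − b‖² = 2 − 2·cos(a, b), so both metrics rank neighbours identically. The helper names below are illustrative and not part of the notebook.

```python
import math
from typing import List

def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy 3-dimensional vectors standing in for 384-dimensional embeddings
query = normalize([1.0, 2.0, 0.5])
candidates = [normalize(v) for v in ([1.0, 1.9, 0.4], [0.2, -1.0, 3.0], [2.0, 0.1, 0.1])]

# Rank candidates by descending similarity and by ascending distance
by_cosine = sorted(range(len(candidates)), key=lambda i: -cosine(query, candidates[i]))
by_euclidean = sorted(range(len(candidates)), key=lambda i: squared_euclidean(query, candidates[i]))
```

Both orderings agree on these toy vectors; with unnormalized vectors they can differ, which is when the choice of metric starts to matter.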


Functions for PDF Processing and Querying

In [23]:
# Define Functions for PDF Processing and Querying
def load_pdf(file_path: str) -> str:
    """Load text content from a PDF file."""
    document = fitz.open(file_path)
    text = ""
    for page in document:
        text += page.get_text()
    return text

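def preprocess_text(text: str) -> List[str]:
    """Split the extracted text into non-empty, whitespace-trimmed lines."""
    # Empty lines carry no content and would only add noise vectors to the index.
    return [line.strip() for line in text.split('\n') if line.strip()]  # Adjust splitting logic as necessary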

def store_embeddings(documents: List[str]) -> None:
    """Generate and store embeddings for document segments in Pinecone."""
    embeddings = embedding_model.encode(documents)  # NumPy array, one row per segment
    for i, (doc, emb) in enumerate(zip(documents, embeddings)):
        # Keep the segment text as metadata so a Pinecone query can return it directly
        index.upsert(vectors=[(str(i), emb.tolist(), {"text": doc})])

def answer_query(query: str, documents: List[str]) -> Tuple[str, float]:
    """Return the document segment most similar to the query, with its cosine similarity."""
    # encode() returns NumPy arrays by default, which avoids mixing torch tensors with NumPy ops
    query_embedding = embedding_model.encode(query)
    doc_embeddings = embedding_model.encode(documents)
    similarities = doc_embeddings @ query_embedding / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    best_idx = int(np.argmax(similarities))
    return documents[best_idx], float(similarities[best_idx])

def answer_query_with_cohere(query: str, documents: List[str]) -> str:
    """Generate an answer with Cohere, using all document segments as context."""
    # Note: the full document is passed as context; answer_query() is not used to narrow it down.
    context = "\n".join(documents)  # Plain text rather than the Python list repr
    prompt = f"Given the following context from the document:\n{context}\n\nPlease provide a detailed answer to the question: {query}\nA:"

    # Generate a response using Cohere
    response = co.generate(
        model='command-r-plus',  # Choose the appropriate model
        prompt=prompt,
        max_tokens=150
    )

    return response.generations[0].text.strip()
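
`answer_query` scores every segment locally, so the Pinecone index is written to but never queried. Below is a minimal sketch of retrieving from the index instead, assuming vectors were upserted with ids `str(i)` as in `store_embeddings`. `retrieve_top_k` and `FakeIndex` are hypothetical names; `FakeIndex` only mimics the shape of a Pinecone query response so the example runs without credentials.

```python
from typing import Dict, List, Tuple

def retrieve_top_k(index, query_vector: List[float], documents: List[str], k: int = 3) -> List[Tuple[str, float]]:
    """Ask the index for the k nearest vectors and map their ids back to text segments."""
    result = index.query(vector=query_vector, top_k=k)
    return [(documents[int(m["id"])], m["score"]) for m in result["matches"]]

class FakeIndex:
    """Hypothetical in-memory stand-in that ranks stored vectors by dot product."""

    def __init__(self, vectors: Dict[str, List[float]]):
        self.vectors = vectors

    def query(self, vector: List[float], top_k: int) -> dict:
        scored = [
            {"id": vid, "score": sum(a * b for a, b in zip(vector, values))}
            for vid, values in self.vectors.items()
        ]
        scored.sort(key=lambda m: m["score"], reverse=True)
        return {"matches": scored[:top_k]}

# Toy data: three segments with 2-dimensional stand-in embeddings
segments = ["Task Requirements:", "Use a vector database", "Deliverables:"]
fake_index = FakeIndex({"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [0.7, 0.7]})
top = retrieve_top_k(fake_index, [0.1, 0.9], segments, k=2)
```

With the real client, the same function could be called with `index` from the initialization cell and `embedding_model.encode(query).tolist()` as the query vector.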

Load PDF and Process

In [19]:
# Load PDF and Process
pdf_file_path = '/content/Gen AI Engineer _ Machine Learning Engineer Assignment.pdf'  # Replace with your actual PDF path
pdf_text = load_pdf(pdf_file_path)  # Load text from PDF
documents = preprocess_text(pdf_text)  # Preprocess text into segments
store_embeddings(documents)  # Store embeddings in Pinecone
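
`store_embeddings` issues one upsert request per segment, which becomes slow for longer documents. A sketch of grouping vectors into batches follows; the `batched` helper and the batch size of 100 are illustrative assumptions, not values taken from the notebook.

```python
from typing import Iterator, List, Sequence, Tuple

Vector = Tuple[str, List[float]]  # (id, values), the tuple form passed to index.upsert

def batched(items: Sequence[Vector], batch_size: int = 100) -> Iterator[List[Vector]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

# Toy vectors standing in for real embeddings
vectors = [(str(i), [float(i), 0.0]) for i in range(250)]
batch_sizes = [len(batch) for batch in batched(vectors, batch_size=100)]

# With the notebook's index, each batch would be sent in one request:
# for batch in batched(vectors):
#     index.upsert(vectors=batch)
```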

In [20]:
pdf_text

"Gen AI Engineer / Machine Learning Engineer Assignment\nPart 1: Retrieval-Augmented Generation (RAG) Model for QA Bot\nProblem Statement:\nDevelop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)\nbot for a business. Use a vector database like Pinecone DB and a generative model like\nCohere API (or any other available alternative). The QA bot should be able to retrieve\nrelevant information from a dataset and generate coherent answers.\nTask Requirements:\n1.\nImplement a RAG-based model that can handle questions related to a provided\ndocument or dataset.\n2.\nUse a vector database (such as Pinecone) to store and retrieve document\nembeddings efficiently.\n3.\nTest the model with several queries and show how well it retrieves and generates\naccurate answers from the document.\nDeliverables:\n●\nA Colab notebook demonstrating the entire pipeline, from data loading to question\nanswering.\n●\nDocumentation explaining the model architecture, approach to retriev

In [21]:
documents

['Gen AI Engineer / Machine Learning Engineer Assignment',
 'Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot',
 'Problem Statement:',
 'Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)',
 'bot for a business. Use a vector database like Pinecone DB and a generative model like',
 'Cohere API (or any other available alternative). The QA bot should be able to retrieve',
 'relevant information from a dataset and generate coherent answers.',
 'Task Requirements:',
 '1.',
 'Implement a RAG-based model that can handle questions related to a provided',
 'document or dataset.',
 '2.',
 'Use a vector database (such as Pinecone) to store and retrieve document',
 'embeddings efficiently.',
 '3.',
 'Test the model with several queries and show how well it retrieves and generates',
 'accurate answers from the document.',
 'Deliverables:',
 '●',
 'A Colab notebook demonstrating the entire pipeline, from data loading to question',
 'answering.',
 '●',
 'D
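
The segments above include standalone list markers such as `1.` and `●`, which carry no meaning on their own. One possible cleanup step, sketched here with a hypothetical `clean_segments` helper, drops blank lines and marker-only lines before embedding. The marker pattern is an assumption based on the output shown and may need adjusting for other PDFs.

```python
import re
from typing import List

# Lines consisting only of a number followed by "." or ")", or of a single bullet character
MARKER_ONLY = re.compile(r"^(\d+[.)]|[●•▪◦\-*])$")

def clean_segments(lines: List[str]) -> List[str]:
    """Drop blank lines and lines that contain nothing but a list marker."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if stripped and not MARKER_ONLY.match(stripped):
            cleaned.append(stripped)
    return cleaned

raw = ["Task Requirements:", "1.", "Implement a RAG-based model", "", "●", "A Colab notebook"]
cleaned = clean_segments(raw)
```

If segments are cleaned, the same cleaned list should be used both for storing embeddings and for looking up results, so that vector ids keep pointing at the right text.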

Example Queries

In [25]:
# Example Queries with Cohere
test_queries = [
    "what's the documents requirement",
    "What are the key points discussed in the document?"
]

for query in test_queries:
    response = answer_query_with_cohere(query, documents)
    print(f"Query: {query}\nResponse: {response}\n")

Query: what's the documents requirement
Response: The document requirements for the assignment are to provide a Colab notebook demonstrating the entire pipeline, from data loading to question answering, and to include documentation explaining the model architecture, approach to retrieval, and how generative responses are created. Additionally, several example queries and their corresponding outputs should be provided.

Query: What are the key points discussed in the document?
Response: The document outlines a two-part assignment for a Gen AI Engineer or Machine Learning Engineer. 

Part 1 involves developing a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA) bot. The bot should be able to retrieve relevant information from a dataset and generate coherent answers. The specific tasks include implementing the RAG model, using a vector database for efficient document embedding storage and retrieval, and testing the model with various queries to assess its accuracy. 
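
The calls above hand every segment to the model. Below is a sketch of a prompt builder that uses only the top-ranked segments, which keeps the prompt short for longer documents. `build_prompt` and the character budget are hypothetical; the wording mirrors the prompt in `answer_query_with_cohere`.

```python
from typing import List

def build_prompt(query: str, segments: List[str], max_context_chars: int = 2000) -> str:
    """Assemble a prompt from ranked segments, stopping before the context budget is exceeded."""
    context_lines: List[str] = []
    used = 0
    for segment in segments:  # Assumed to be ordered from most to least relevant
        if used + len(segment) > max_context_chars:
            break
        context_lines.append(segment)
        used += len(segment)
    context = "\n".join(context_lines)
    return (
        f"Given the following context from the document:\n{context}\n\n"
        f"Please provide a detailed answer to the question: {query}\nA:"
    )

# The oversized second segment exceeds the budget, so only the first one is used
prompt = build_prompt(
    "What are the deliverables?",
    ["A Colab notebook demonstrating the entire pipeline", "x" * 5000, "answering."],
)
```

The resulting string can be passed as `prompt` to the same Cohere call used in the notebook.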

**README Section**

# PDF Question Answering Bot

## Overview
This notebook implements a Retrieval-Augmented Generation (RAG) pipeline for a Question Answering (QA) bot, using Cohere for generative responses and Pinecone for vector storage. The bot extracts text from a PDF document, embeds it with the `all-MiniLM-L6-v2` sentence-transformer model, and answers user questions about its content.

## Objectives
- Develop a QA bot that answers questions based on a provided PDF document.
- Utilize Pinecone as a vector database to store and retrieve document embeddings.
- Use Cohere as the generative model to formulate answers grounded in the document.

---

## Instructions
1. **Setup**: Run the first cell to install the required packages:
    ```python
    !pip install pinecone-client sentence-transformers PyMuPDF cohere
    ```
   
2. **Initialize Pinecone**: Set your Pinecone API key in the initialization cell and run it to create the Pinecone client and the `qa-bot-index` index. Note that the cell deletes any existing index with that name before creating a new one.

3. **Set Up Cohere**: Enter your Cohere API key in the same cell to enable the generative model for answering queries.

4. **Define Functions**: Run the functions cell, which defines helpers for:
   - Loading and processing PDF documents.
   - Storing document embeddings in Pinecone.
   - Answering user queries based on the retrieved context.

5. **Load PDF**: Provide the path to your PDF file in the designated cell and run it to process the document and store embeddings.

6. **Ask Questions**: Use the example queries provided in the last cell or modify them to test the QA bot with your own questions.

---

## Example Queries
- "What are the key points discussed in the document?"
- "Can you explain the main arguments presented?"
- "What conclusions can be drawn from the findings?"

---

## Requirements
- Python 3.x
- Pinecone API key
- Cohere API key
- PDF document for testing

---

## Important Notes
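- Text is segmented by splitting on newlines, so PDFs with hard-wrapped lines or standalone list markers (for example `1.` or `●`) produce short fragments that retrieve poorly. Prefer documents with clean paragraph structure, or adjust `preprocess_text`.
- If responses are poor, refine the prompt in `answer_query_with_cohere`, or inspect `pdf_text` and `documents` to verify the extracted content.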
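
When line-based splitting fragments the text too much, one alternative is to chunk by word count with some overlap, so that a sentence spanning a boundary appears in both neighbouring chunks. A minimal sketch follows; `chunk_words`, the chunk size, and the overlap are illustrative choices and not part of the notebook.

```python
from typing import List

def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # The last window already reached the end of the text
    return chunks

# Twelve toy words, chunked five at a time with two words of overlap
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
```

The returned list could replace the output of `preprocess_text` before calling `store_embeddings`.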

---