# DocQuery: Intelligent Document Retrieval System
## A RAG-powered Question Answering Pipeline with Pinecone Database

### Overview
This notebook implements a Retrieval Augmented Generation (RAG) pipeline for intelligent document querying. The system processes PDF documents, converts them into searchable vectors, and answers questions using the context from these documents.

In [6]:
pwd

'G:\\RAG'

In [7]:
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
import time

In [8]:
import openai


In [5]:
OPENAI_API_KEY = "Enter api key"
PINECONE_API_KEY = "Enter api key"
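
Hard-coded keys are easy to leak when a notebook is shared. As a minimal sketch, the keys could instead be read from environment variables. The helper name `load_api_key` and the variable names `OPENAI_API_KEY` and `PINECONE_API_KEY` in the environment are assumptions for illustration, not part of the original notebook.

```python
import os


def load_api_key(name, default=None):
    """Return the value of environment variable `name`.

    Falls back to `default` when the variable is unset, and raises a
    clear error when neither is available.
    """
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage (assumes the variables were exported beforehand):
# OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")
# PINECONE_API_KEY = load_api_key("PINECONE_API_KEY")
```

Failing early with a named variable makes a missing key easier to diagnose than an authentication error raised later by the API client.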

## PDF Processing
The system uses PyPDF2 to extract text from PDF documents. The text is then split into meaningful chunks while preserving context and document structure.

In [1]:
import PyPDF2
import re
import uuid

def convert_pdf_to_docs(pdf_path):
    docs = []
    
    try:
        # Open the PDF file
        with open(pdf_path, 'rb') as file:
            # Create a PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)
            
            # Iterate through all pages
            for page_num in range(len(pdf_reader.pages)):
                # Get the page object
                page = pdf_reader.pages[page_num]
                
                # Extract text from the page
                text = page.extract_text()
                
                # Split text into paragraphs (split by double newlines)
                paragraphs = re.split(r'\n\s*\n', text)
                
                # Clean paragraphs and create documents
                for paragraph in paragraphs:
                    # Clean and normalize the text
                    cleaned_text = ' '.join(paragraph.split())
                    
                    # Skip empty paragraphs
                    if cleaned_text.strip():
                        # Create document dictionary
                        doc = {
                            "id": str(uuid.uuid4())[:8],  # Short ID: first 8 characters of a random UUID
                            "text": cleaned_text
                        }
                        docs.append(doc)
    
    except Exception as e:
        print(f"Error processing PDF: {str(e)}")
        return None
    
    return docs
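
To see what the splitting and cleaning steps in `convert_pdf_to_docs` do without opening a PDF, the sketch below applies the same regular expression and whitespace normalization to a small string. The helper name `split_paragraphs` and the text in `sample_page` are made up for illustration.

```python
import re


def split_paragraphs(text):
    """Split raw page text on blank lines and collapse internal whitespace."""
    # Same pattern as in convert_pdf_to_docs: a newline, optional
    # whitespace, then another newline marks a paragraph boundary.
    paragraphs = re.split(r'\n\s*\n', text)
    # Collapse runs of spaces and single newlines into one space.
    cleaned = [' '.join(p.split()) for p in paragraphs]
    # Drop paragraphs that were empty or whitespace only.
    return [p for p in cleaned if p]


# Made-up sample text standing in for page.extract_text() output.
sample_page = (
    "1. What is SWAYAM?\nAn online learning\nplatform.\n\n \n"
    "2. Is it free?\n  Yes."
)
print(split_paragraphs(sample_page))
```

Single newlines inside a paragraph are treated as line wraps and merged, so only blank lines create new documents.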

## I am assuming that the business files we need to retrieve data from are in PDF format. As a sample, I have used a Q&A PDF taken from the SWAYAM portal website

In [24]:

pdf_path = "business_faq.pdf"
    
# Convert PDF to documents
result_docs = convert_pdf_to_docs(pdf_path)
    
if result_docs:
    # Print the results
    print(f"Successfully converted PDF into {len(result_docs)} documents:")
    for doc in result_docs:
        print(f"\nID: {doc['id']}")
        print(f"Text: {doc['text'][:100]}...")  # Print first 100 characters
else:
    print("Failed to convert PDF")


Successfully converted PDF into 25 documents:

ID: 3036de5f
Text: ALL INDIA COUNCIL FOR TECHNICAL EDUCATION (AICTE ), NEW DELHI SWAYAM Cell FREQUENT LY ASKED QUESTION...

ID: 82fd2ef3
Text: 1. What is SWAYAM? SWAYAM (Study Webs of Active -learning for Young Aspiring Minds); India Chapter o...

ID: 0e978994
Text: can be downloaded/printed (3) self - assessment tests through tests and quizzes and (4) an online di...

ID: b7a7cd4a
Text: 12. Will the Courses launched on SWAYAM address the issues concerning shortage & quality of Teachers...

ID: c22f3acc
Text: institutes are engaged in development of e - content. 15. Is SWAYAM a part of Digital India Programm...

ID: ae1cfd51
Text: 17. Targets to be achieved through SWAYAM? The specific target proposed is addressed to the needs of...

ID: efa7e9fd
Text: Provide robust Internet Cloud (with CDN) and sufficient bandwidth for concurrent viewings of 1 Milli...

ID: f2717025
Text: 22. Has the Ministry issued any Guidelines on the preparation of M

In [3]:
result_docs

In [9]:
pc = Pinecone(api_key=PINECONE_API_KEY )

## Vector Database Integration
### - Generate embeddings using multilingual-e5-large
### - Store vectors in Pinecone serverless index
### - Enable efficient similarity search
### - This follows the method shown in the official Pinecone quickstart documentation: https://docs.pinecone.io/guides/get-started/quickstart

## Pinecone index is configured with:

### - Dimension: 1024
### - Metric: Cosine similarity
### - Infrastructure: AWS Serverless
### - Region: us-east-1
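
The cosine metric in the configuration above scores two vectors by the angle between them, not by their length. The following is a plain-Python sketch of that formula for intuition only; Pinecone computes the score server-side, and the function name `cosine_similarity` is an assumption for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat the similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same direction scores 1, orthogonal scores 0, opposite scores -1.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```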


In [16]:
def initialize_pinecone():
    pc = Pinecone(api_key=PINECONE_API_KEY)
    
    
    embeddings = pc.inference.embed(
        model="multilingual-e5-large",
        inputs=[d['text'] for d in result_docs],
        parameters={"input_type": "passage", "truncate": "END"}
    )
    
    
    index_name = "example-index1"
    try:
        pc.create_index(
            name=index_name,
            dimension=1024,
            metric="cosine",
            spec=ServerlessSpec(
                cloud='aws', 
                region='us-east-1'
            ) 
        )
    except Exception as e:
        print(f"Index might already exist: {e}")
    
    
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)
    

    index = pc.Index(index_name)
    
    
    records = [
        {
            "id": d['id'],
            "values": e['values'],
            "metadata": {'text': d['text']}
        }
        for d, e in zip(result_docs, embeddings)
    ]
    
    index.upsert(
        vectors=records,
        namespace="example-namespace"
    )
    
    return pc, index
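
With 25 documents, a single embed call and a single upsert are enough. For larger PDFs, the requests may need to be split into smaller batches, since hosted APIs generally cap request size. The sketch below shows one way to do that; the helper name `batched` and the default batch size of 100 are assumptions, not values taken from the notebook or confirmed limits.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage inside initialize_pinecone:
# for batch in batched(records):
#     index.upsert(vectors=batch, namespace="example-namespace")
print(list(batched(list(range(5)), batch_size=2)))
```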

## Query Processing
### The system follows these steps for each query:

### - Convert question to vector embedding
### - Find relevant documents in Pinecone
### - Generate context-aware response using LLM
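
The three steps above can be sketched as one function that receives each step as a callable. This is only an outline of the control flow under stated assumptions: `answer_with_rag` and the `fake_*` stand-ins are hypothetical names, and the stand-ins return fixed values instead of calling Pinecone or an LLM.

```python
def answer_with_rag(question, embed, search, generate):
    """Run the embed -> search -> generate steps for one question."""
    query_vector = embed(question)       # step 1: question to vector
    passages = search(query_vector)      # step 2: nearest documents
    if not passages:
        return "Sorry, I couldn't find any relevant information."
    return generate(question, passages)  # step 3: answer from context


# Stand-ins so the flow can be run without any API keys.
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector):
    return ["You would need an active internet connection to consume course contents."]


def fake_generate(question, passages):
    return passages[0]


print(answer_with_rag(
    "Do I need a dedicated internet connection?",
    fake_embed, fake_search, fake_generate,
))
```

Passing the steps in as arguments makes it possible to swap the LLM backend without touching the retrieval code.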

In [11]:

def get_embedding(pc, text):
    query_embedding = pc.inference.embed(
        model="multilingual-e5-large",
        inputs=[text],
        parameters={"input_type": "query"}
    )
    return query_embedding[0].values


def find_relevant_docs(index, query_vector):
    results = index.query(
        namespace="example-namespace",
        vector=query_vector,
        top_k=3,
        include_values=False,
        include_metadata=True
    )
    return [match.metadata['text'] for match in results.matches]


In [29]:
from openai import OpenAI

def generate_answer(question, context):
    
    client = OpenAI(api_key=OPENAI_API_KEY)
    
    context_text = "\n".join(context)
    prompt = f"""Using the following context, answer the question. If the answer isn't in the context, say "I don't have enough information to answer that."
    
    Context:
    {context_text}
    
    Question: {question}
    Answer:"""
    
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating answer: {e}")
        return "Sorry, I couldn't generate an answer."
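
One detail of `generate_answer` is that the triple-quoted f-string sits inside the function body, so its indentation becomes part of the prompt. A possible alternative, sketched below, builds the same prompt from separate pieces. The helper name `build_prompt` and the `max_chars` cap on the context are assumptions added for illustration.

```python
def build_prompt(question, context, max_chars=6000):
    """Assemble the RAG prompt without carrying over code indentation."""
    # Join the retrieved passages and cap their total length
    # (max_chars is a hypothetical safety limit, not a model limit).
    context_text = "\n".join(context)[:max_chars]
    return (
        "Using the following context, answer the question. "
        "If the answer isn't in the context, say "
        "\"I don't have enough information to answer that.\"\n\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_prompt("Is it free?", ["Passage one.", "Passage two."]))
```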

In [26]:
def answer_question(pc, index, question):
    query_vector = get_embedding(pc, question)
    relevant_docs = find_relevant_docs(index, query_vector)
    
    if not relevant_docs:
        return "Sorry, I couldn't find any relevant information."
    
    return generate_answer(question, relevant_docs)

In [44]:
pc, index = initialize_pinecone()

Index might already exist: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'x-pinecone-api-version': '2024-07', 'X-Cloud-Trace-Context': '656f1772bc7924a89ed020b581f428a2', 'Date': 'Mon, 30 Dec 2024 09:01:04 GMT', 'Server': 'Google Frontend', 'Content-Length': '85', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"ALREADY_EXISTS","message":"Resource  already exists"},"status":409}
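
The 409 response above is expected when the cell is re-run, because `create_index` is called for an index that already exists. A hedged alternative is to check for the index first. The sketch assumes the client exposes `list_indexes().names()`, as in recent Pinecone Python client versions; the helper name `ensure_index` and the `poll_seconds` parameter are hypothetical.

```python
import time


def ensure_index(pc, name, dimension, metric, spec, poll_seconds=1):
    """Create the index only if it is missing, then wait until it is ready."""
    # Assumption: list_indexes().names() returns the existing index names.
    if name not in pc.list_indexes().names():
        pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    # Same readiness check as in initialize_pinecone.
    while not pc.describe_index(name).status['ready']:
        time.sleep(poll_seconds)
    return pc.Index(name)


# Hypothetical usage with the objects defined earlier in the notebook:
# index = ensure_index(pc, "example-index1", 1024, "cosine",
#                      ServerlessSpec(cloud='aws', region='us-east-1'))
```

This keeps genuine failures, such as an invalid API key, from being reported as "Index might already exist".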



In [53]:
question = "Do I need a dedicated internet connection?"
print(find_relevant_docs(index,get_embedding(pc,question)))

['1. Do I need a dedicated internet connection? or I can consume course contents offline once logged in? You would need an active internet connection to consume course contents. 2. What are my system requirements to login in to the swayam portal? Following are the system requirements: 1) Laptop/Desktop with stable internet connection. 2) Latest flash player should be installed in your computer. 3) Operating system: Mac iOS, Microsoft Windows, Android, Linux. 4) Browser: Inte rnet Explorer, Chrome*, Safari, Firefox, Opera. 5) Port 1935 RTMP or Port 80 should be open. 3. Can I also download my content for offline access? Yes provided faculty has given the download access to the students. [G] COURSE & ITS SETTINGS', '1. Once you get enrolled into the course you need to go to the course schedule to access all the course material assigned by the faculty. Once you get enrolled into the course you need to go to the course schedule to access all the course material assigned by the faculty. 2. 

In [46]:
answer = answer_question(pc, index, question)
print(answer)

API Response: {'status': 'OK', 'request_id': '7d3a8ad4-239c-464e-9699-2e5e44da24ef', 'data': {'message': "I don't have enough information to answer that.", 'images': [], 'web_searches': [], 'sources': [], 'conversation_id': '8954353d526cd8c03ed1a92d', 'conversation_expiration': '2024-12-30T15:01:15.0572128Z', 'conversation_ended': False, 'is_user_message_offensive': False, 'user_messages_limit': 4, 'user_messages_remaining': 3}}
Sorry, I couldn't generate an answer.


# The error occurs only because I don't have enough balance in my OpenAI account

## I am completing the demonstration using a Copilot API (via RapidAPI) because it is free

In [3]:
import requests

def generate_answer_copilot(question, context):
    url = "https://copilot5.p.rapidapi.com/copilot"
    headers = {
        "Content-Type": "application/json",
        "x-rapidapi-host": "copilot5.p.rapidapi.com",
        "x-rapidapi-key": " "  # Your API key goes here
    }
    
    context_text = "\n".join(context)
    payload = {
        "message": f"Using the following context, answer the question. If the answer isn't in the context, say 'I don't have enough information to answer that.'\n\nContext:\n{context_text}\n\nQuestion: {question}\nAnswer:",
        "conversation_id": None,
        "tone": "BALANCED",
        "markdown": False,
        "photo_url": None
    }
    
    try:
        response = requests.post(url, headers=headers, json=payload, timeout=30)
        response.raise_for_status()
        data = response.json()
        #print("API Response:", data)
        return data['data']['message']
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return "Sorry, I couldn't generate an answer."
    except Exception as e:
        print(f"Error generating answer: {e}")
        return "Sorry, I couldn't generate an answer."
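
`generate_answer_copilot` reads `data['data']['message']` directly, which raises a `KeyError` if the response has a different shape. The sketch below reads the same field defensively. It assumes the response layout seen in this notebook's output (`status`, then `data.message`); the helper name `extract_message` is hypothetical, and other response shapes from this API are not known here.

```python
def extract_message(data):
    """Return the answer text from a response dict, or None if it is absent."""
    if not isinstance(data, dict) or data.get("status") != "OK":
        return None
    inner = data.get("data")
    if not isinstance(inner, dict):
        return None
    message = inner.get("message")
    # Only accept a non-empty string as a valid answer.
    return message if isinstance(message, str) and message else None


# Response shaped like the one printed earlier in this notebook.
sample_response = {
    "status": "OK",
    "data": {"message": "I don't have enough information to answer that."},
}
print(extract_message(sample_response))
```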


In [74]:
def answer_question_copilot(pc, index, question):
    query_vector = get_embedding(pc, question)
    relevant_docs = find_relevant_docs(index, query_vector)
    
    if not relevant_docs:
        return "Sorry, I couldn't find any relevant information."
    #print(question)
    #print(relevant_docs)
    return generate_answer_copilot(question, relevant_docs)

In [76]:
question = "Do I need a dedicated internet connection?"

In [78]:
answer = answer_question_copilot(pc, index, question)
print(answer)

You would need an active internet connection to consume course contents.
