## Ananya Agrawal (ananyaa2)

## Assignment3 - Retrieval Augmented Generation

In this asignment, you will develop a RAG solution to answer questions about a repository of research papers. The assignment requires you to parse the paper PDF files, chunk and index the data, and then design and execute an evaluation of the retriever results. In Naïve RAG, the query is compared to documents in the vector database for retrieval of the top N documents that match the query. The language model is then used to summarize the retrieved documents into an answer to the user query. Research papers are highly structured documents with technically deep content, in contrast to blogs, which contain more general and introductory content. This means that queries may be unlikely to match relevant chunks of the paper without additional processing, such as information extraction or summarization.

One approach to address this problem is to use the language model to generate answerable questions from chunks of each paper. The generated questions can then be indexed as "documents" in a vector database, and the user query can be matched against the most similar questions. By maintaining a mapping between the indexed, generated question and the paper chunk, the retrieval process can then produce the most relevant chunks for use in summarizing an answer to the user query.

## Setup the functions for prompting

In [2]:
from openai import OpenAI
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def prompt_model(prompt):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
            {"role": "user", 'content': prompt}
        ]
    )
    return completion.choices[0].message.content

## Parse data from source

In [3]:
import os, bibtexparser, pypdf, logging

# silence non-critical errors while parsing PDF files
logging.getLogger("pypdf").setLevel(logging.CRITICAL)

data_path = 'data/'
data = {}

files = os.listdir(data_path)
print('Reading %i files:' % len(files))
for f in files:
    path = os.path.join(data_path, f)

    # each datum will have at least these attributes
    d = {'filepath': None, 'title': None, 'text': None}

    # parse bibtex file, if exists
    if path.endswith('.bib'):
        if path[:-4] in data:
            d = data[path[:-4]]

        bib = bibtexparser.load(open(path, 'r'))
        if 'title' in bib.entries[0]:
            d['title'] = bib.entries[0]['title']
            data[path[:-4]] = d

    # parse pdf text, if exists
    if path.endswith('.pdf'):
        if path[:-4] in data:
            d = data[path[:-4]]

        print('  File: %s' % f)
        text = ''
        reader = pypdf.PdfReader(path)
        for page in reader.pages:
            text += page.extract_text()
        d['filepath'] = path
        d['text'] = text
        data[path[:-4]] = d

data = [d for d in data.values()]

Reading 53 files:
  File: 2023.findings-emnlp.620.pdf
  File: 29728-Article Text-33782-1-2-20240324-3.pdf
  File: 2024.acl-long.642.pdf
  File: 2021.findings-emnlp.320.pdf
  File: 2020.coling-main.207.pdf
  File: 2202.01110v2.pdf
  File: 2212.14024v2.pdf
  File: 2024.emnlp-industry.66.pdf
  File: 8917_Retrieval_meets_Long_Cont.pdf
  File: NeurIPS-2023-lift-yourself-up-retrieval-augmented-text-generation-with-self-memory.pdf
  File: NeurIPS-2023-leandojo-theorem-proving-with-retrieval-augmented-language-models.pdf
  File: NeurIPS-2020-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks.pdf
  File: 2023.acl-long.557.pdf
  File: tacl_a_00605.pdf
  File: 3637870.pdf
  File: 2023.emnlp-main.495.pdf
  File: 3626772.3657834.pdf
  File: 2402.19473v6.pdf
  File: 3626772.3657957.pdf
  File: 2024.eacl-demo.16.pdf
  File: 967_generate_rather_than_retrieve_.pdf
  File: 23-0037.pdf
  File: 2022.naacl-main.191.pdf
  File: 2312.10997v5.pdf
  File: 947_Augmented_Language_Models_.pdf


## Pre-process the data

Prior to indexing and chunking the data, the data may need to be pre-processed. This can be done to remove portions of the data irrelevant to queries to reduce mismatches between the user query and the index. This is not required for this assignment.

In [4]:
import re

def clean_text(text):
    """Enhanced cleaning of text: removes noise, normalizes content, and filters irrelevant sections."""

    # lower-case everything for normalization
    text = text.lower()
    
    # Remove URLs and emails
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Links
    text = re.sub(r'\S+@\S+', '', text)  # Emails

    # Remove common unwanted sections
    text = re.sub(r'(References|Bibliography|Table of Contents|Acknowledgements).*$', '', text, flags=re.IGNORECASE)
    text = re.sub(r'Figure \d+:.*?$', '', text, flags=re.IGNORECASE)

    # Remove author names (assumption: authors are mentioned at the top in "Firstname Lastname, Firstname Lastname")
    text = re.sub(r'([A-Z][a-z]+ [A-Z][a-z]+(,? [A-Z][a-z]+)*)', '', text)

    # Remove author-like names (e.g., "John Doe, Jane Smith")
    text = re.sub(r'[A-Z][a-z]+ [A-Z][a-z]+(,? [A-Z][a-z]+)?', '', text)

    # Remove standalone numbers, section headings like "3.1 Introduction"
    text = re.sub(r'^\d{1,2}\.\d{1,2}\s+|\b\d{1,2}\b', '', text)

    # Remove unwanted characters and multiple spaces
    text = re.sub(r'[^\w\s.,!?-]', '', text)  # Keeps words, spaces, and basic punctuation

    # Remove reference markers like [12] or (2020)
    text = re.sub(r'\[\d+\]|\(\d{4}\)', '', text)

    # Remove non-ASCII characters (emojis, symbols)
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # Remove dates
    text = re.sub(r'\d{1,2}\s?[A-Z][a-z]{2,}\s?\d{4}', '', text) 


    # Remove common academic noise (like section numbers)
    text = re.sub(r'\d{1,2}\.\d{1,2}|\d{1,2}', '', text)  # Matches section numbers like "3.2" or "5"

    # Strip out any leftover reference markers
    text = re.sub(r'\[\d+\]', '', text)

    # Expand common contractions for consistency
    contractions = {
        "don't": "do not", "can't": "cannot", "won't": "will not",
        "it's": "it is", "i'm": "i am", "they're": "they are",
        "isn't": "is not", "aren't": "are not", "wasn't": "was not"
    }
    for key, value in contractions.items():
        text = text.replace(key, value)

    # Remove excessive whitespace, punctuation issues
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'[^\w\s.,!?-]', '', text)  # Keeps only words, spaces, punctuation

    # Remove very short meaningless sentences
    text = '. '.join([sentence for sentence in text.split('. ') if len(sentence.split()) > 5])

    return text


# Apply cleaning to each document
for d in data:
    if d['text']:
        d['text'] = clean_text(d['text'])


## Chunk data and generate indices

User queries will be matched to indexes that best approximate the text chunks used to summarize an answer. For this assignment, you may chunk the text and then prompt the model to generate questions that are answerable by the text. The generated questions can then be used as the "documents" stored in the vector database.

In [5]:
def chunk_text(text, chunk_size=768, overlap=200):
    """Chunk text into overlapping windows to preserve context."""
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks


def generate_questions(chunks):
    """Generate specific, high-quality questions for each chunk."""
    chunk_questions = []

    for chunk in chunks:
        prompt = (
            "You are a research assistant tasked with generating precise, fact-based questions. "
            "Given the following text, generate 3 questions that are detailed, highly specific, and relevant to the content. "
            "Ensure the questions focus on entities, concepts, definitions, or critical points made in the text. "
            "Avoid overly broad or vague questions.\n\n"
            f"Text:\n{chunk}\n\n"
            "Return the questions in bullet points. Example format:\n"
            "- What is the primary contribution of the FLARE model?\n"
            "- How does the RAG model address hallucinations in LLMs?\n"
            "- What evaluation metrics are commonly used for RAG systems?"
        )

        questions = prompt_model(prompt)
        chunk_questions.append({'chunk': chunk, 'questions': questions})
        
    return chunk_questions



# Process each document
for d in data:
    if d['text']:
        chunks = chunk_text(d['text'])
        d['chunks'] = generate_questions(chunks)


## Build the vector database

When building the vector database, be sure to maintain a mapping between the generated questions and the chunks that can be used later to retrieve the chunks from the most similar indices to the user query provided.

You may also add the function to query the vector database that you will use later.

In [6]:
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize ChromaDB client and collection
client = chromadb.Client()
collection = client.create_collection("research_papers")
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Initialize the embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

def build_vector_database(data):
    """Generate embeddings and store question-chunk pairs in the ChromaDB collection."""
    for doc in data:
        if 'chunks' in doc:
            for entry in doc['chunks']:
                questions = entry['questions'].split('\n')  # Split questions by newlines
                for i, question in enumerate(questions):
                    embedding = embedder.encode(question).tolist()
                    collection.add(
                        embeddings=[embedding],
                        metadatas=[{'chunk': entry['chunk']}],
                        ids=[f"{doc['filepath']}_{i}"]
                    )
build_vector_database(data)

def query_vector_database(query, top_k=5):
    """Retrieve the most relevant chunks based on query similarity."""
    query_embedding = embedder.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    return [result['chunk'] for result in results['metadatas'][0]]


def generate_answer_with_feedback(query, retrieved_chunks):
    """Generate an answer, handling cases where relevant info is missing."""
    context = "\n\n".join(retrieved_chunks) if retrieved_chunks else "No relevant chunks found."

    prompt = (
        "You are an expert research assistant. Use the provided context to answer the question. "
        "If the context does not contain enough information, reply with 'IDK' instead of guessing.\n\n"
        f"Question: {query}\n\n"
        f"Context:\n{context}\n\n"
        "Answer:"
    )

    answer = prompt_model(prompt)
    return answer



  from .autonotebook import tqdm as notebook_tqdm
Insert of existing embedding ID: data/2023.findings-emnlp.620.pdf_0
Add of existing embedding ID: data/2023.findings-emnlp.620.pdf_0
Insert of existing embedding ID: data/2023.findings-emnlp.620.pdf_1
Add of existing embedding ID: data/2023.findings-emnlp.620.pdf_1
Insert of existing embedding ID: data/2023.findings-emnlp.620.pdf_2
Add of existing embedding ID: data/2023.findings-emnlp.620.pdf_2
Insert of existing embedding ID: data/2023.findings-emnlp.620.pdf_0
Add of existing embedding ID: data/2023.findings-emnlp.620.pdf_0
Insert of existing embedding ID: data/2023.findings-emnlp.620.pdf_1
Add of existing embedding ID: data/2023.findings-emnlp.620.pdf_1
Insert of existing embedding ID: data/2023.findings-emnlp.620.pdf_2
Add of existing embedding ID: data/2023.findings-emnlp.620.pdf_2
Insert of existing embedding ID: data/2023.findings-emnlp.620.pdf_0
Add of existing embedding ID: data/2023.findings-emnlp.620.pdf_0
Insert of existing 

## Conduct experiments to evaluate user queries

Report your average precision, recall and F1 score. You are welcome to sample the model multiple times for each query when computing your average, or you may sample once per query.

In [7]:
import json
from bert_score import BERTScorer

scorer = BERTScorer(model_type='bert-base-uncased')

def rerank_chunks(query, retrieved_chunks):
    """Rerank retrieved chunks using BERTScore."""
    scores = []

    for chunk in retrieved_chunks:
        P, R, F1 = scorer.score([chunk], [query])
        scores.append((F1.item(), chunk))

    scores.sort(reverse=True, key=lambda x: x[0])
    return [chunk for _, chunk in scores[:3]]  # Keep the top 3 reranked chunks

def expand_query(query):
    """Expand the query to improve retrieval."""
    prompt = (
        "Expand the following query to improve document retrieval. Generate 3 alternate phrasings of the query:\n\n"
        f"{query}\n\n"
        "Return the expanded queries in bullet points."
    )
    expansions = prompt_model(prompt)
    return expansions.split("\n")

def multi_query_retrieve(query, top_k=5):
    """Run multi-query retrieval and merge results."""
    expanded_queries = expand_query(query)
    all_chunks = []

    for q in expanded_queries:
        retrieved_chunks = query_vector_database(q, top_k=top_k)
        all_chunks.extend(retrieved_chunks)

    # Deduplicate and rerank
    unique_chunks = list(set(all_chunks))
    reranked_chunks = rerank_chunks(query, unique_chunks)

    return reranked_chunks


def load_queries(file_path):
    """Load queries and answers from a JSON file."""
    with open(file_path, 'r') as file:
        return json.load(file)

def run_evaluation(query_file, top_k=3):
    """Evaluate the RAG system against provided queries using BERTScore."""
    queries = load_queries(query_file)
    results = []

    for entry in queries:
        query = entry["query"]
        ground_truth_answer = entry["answer"]

        # Retrieve and rerank
        retrieved_chunks = query_vector_database(query, top_k=top_k)
        reranked_chunks = rerank_chunks(query, retrieved_chunks)

        # Handle IDK for unanswerable queries

        predicted_answer = "IDK" if not reranked_chunks else " ".join(reranked_chunks[:3])


        # Compute BERTScore
        P, R, F1 = scorer.score([predicted_answer], [ground_truth_answer])

        results.append({
            "query": query,
            "predicted_answer": predicted_answer,
            "ground_truth": ground_truth_answer,
            "precision": P.mean().item(),
            "recall": R.mean().item(),
            "f1": F1.mean().item()
        })

    return results
    

def report_metrics(results, dataset_name):
    """Aggregate and report average precision, recall, and F1 using BERTScore."""
    avg_precision = sum(r['precision'] for r in results) / len(results)
    avg_recall = sum(r['recall'] for r in results) / len(results)
    avg_f1 = sum(r['f1'] for r in results) / len(results)

    print(f"\nEvaluation Results for {dataset_name}:")
    print(f"Average Precision (BERTScore): {avg_precision:.2f}")
    print(f"Average Recall (BERTScore): {avg_recall:.2f}")
    print(f"Average F1 Score (BERTScore): {avg_f1:.2f}")


# Evaluate on dev and test queries
dev_results = run_evaluation('dev-questions.json')
test_results = run_evaluation('test-queries.json')

# Report metrics
report_metrics(dev_results, "Development Questions")
report_metrics(test_results, "Test Questions")






Evaluation Results for Development Questions:
Average Precision (BERTScore): 0.31
Average Recall (BERTScore): 0.45
Average F1 Score (BERTScore): 0.36

Evaluation Results for Test Questions:
Average Precision (BERTScore): 0.30
Average Recall (BERTScore): 0.52
Average F1 Score (BERTScore): 0.38
