# Retrieval-Augmented Generation (RAG) for Textual Data Analysis

## Introduction

This project demonstrates the use of **Retrieval-Augmented Generation (RAG)** to enhance information retrieval and text generation. RAG combines **retrieval-based** methods with **generative models** to improve the relevance and accuracy of responses by incorporating external knowledge during generation. The dataset used is a public domain book from **Project Gutenberg**, which serves as a knowledge base for the retrieval and generation tasks.

## How It Works

1. **Dataset**:
   The dataset consists of a public domain book, processed into smaller text chunks (e.g., sentences or paragraphs) for easier embedding and indexing.

2. **Text Extraction and Preprocessing**:
   The book's text is extracted, cleaned, and split into manageable chunks for embedding and indexing.

3. **Embedding and Indexing**:
   Sentences are embedded using **Sentence-Transformers** into high-dimensional vectors, then indexed with **FAISS** for fast similarity search.

4. **Query and Retrieval**:
   A query is embedded, and the most relevant text segments are retrieved from the indexed dataset using FAISS.

5. **Text Generation**:
   The retrieved sentences are passed to a **GPT-2** model, which generates a detailed response based on the context.

6. **Output**:
   The final output is a coherent, contextually accurate response synthesized from the retrieved information.
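
The six steps above can be sketched end to end with a toy stand-in. This is only an illustration under stated assumptions: word counts replace Sentence-Transformers embeddings, a linear scan replaces FAISS, and the sketch stops at prompt construction instead of calling GPT-2. The names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical and do not appear in the notebook.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_chunks):
    """Place the retrieved chunks ahead of the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

chunks = [
    "Sherlock Holmes is a consulting detective.",
    "The train left Paddington at noon.",
    "Holmes lives at Baker Street with Watson.",
]
question = "Who is Sherlock Holmes?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

In the notebook itself, each of these stand-ins is replaced by the real component described in the corresponding step.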

## Dataset Used

The dataset is a text extracted from a public domain book available on **Project Gutenberg**, used as the knowledge base for retrieval and generation.

## Applications

- **Question Answering**: Generate accurate answers to queries by combining retrieval and generation.
- **Document Summarization**: Summarize large texts by retrieving and synthesizing key information.
- **Chatbots**: Improve chatbot accuracy by integrating real-time retrieval with generation.

## Key Libraries and Tools

- **FAISS**: For efficient search and indexing of text embeddings.
- **Sentence-Transformers**: To generate high-quality sentence embeddings.
- **GPT-2**: For text generation based on retrieved information.
- **PyMuPDF (fitz)**: For extracting text from PDFs.

## Installing required tools

In [None]:
!pip install faiss-cpu sentence-transformers transformers PyMuPDF

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting PyMuPDF
  Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF, faiss-cpu
Successfully installed PyMuPDF-1.25.1 faiss-cpu-1.9.0.post1


## Importing necessary libraries

In [None]:
import fitz  # for PyMuPDF
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer

## Function to extract text from PDF

In [None]:
def extract_text_from_pdf(pdf_path):
    """Return the concatenated plain text of every page in the PDF."""
    # The context manager closes the document once extraction is finished
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

pdf_path = "sherlock_holmes.pdf"
pdf_text = extract_text_from_pdf(pdf_path)
print(pdf_text[:500])

The Project Gutenberg eBook of The Adventures of Sherlock Holmes
This ebook is for the use of anyone anywhere in the United States and most other
parts of the world at no cost and with almost no restrictions whatsoever. You
may copy it, give it away or re-use it under the terms of the Project Gutenberg
License included with this ebook or online at www.gutenberg.org. If you are
not located in the United States, you will have to check the laws of the country
where you are located before using this
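
The output above shows that the extracted text begins with the Project Gutenberg license header, which would otherwise be embedded and retrieved alongside the story. Below is a minimal sketch of one way to trim it, assuming the file uses the usual `*** START OF ... ***` and `*** END OF ... ***` marker lines. The helper name `strip_gutenberg_boilerplate` is hypothetical, and the exact marker wording can vary between files, so the patterns may need adjusting.

```python
import re

# The markers may wrap across lines in PDF-extracted text, hence DOTALL.
START_RE = re.compile(r"\*\*\*\s*START OF.*?\*\*\*", re.IGNORECASE | re.DOTALL)
END_RE = re.compile(r"\*\*\*\s*END OF.*?\*\*\*", re.IGNORECASE | re.DOTALL)

def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END markers, if present."""
    start = START_RE.search(text)
    begin = start.end() if start else 0
    # Look for the END marker only after the START marker
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()

sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter one text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text."
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If neither marker is found, the function returns the input unchanged apart from surrounding whitespace, so it is safe to apply to texts from other sources.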


# Embedding Text to FAISS Database

The text extracted from the PDF is split into lines, and each line is embedded into a high-dimensional vector using **Sentence-Transformers** to capture its semantic meaning (the code refers to these lines as sentences). The embeddings are then indexed with **FAISS** so that the lines most relevant to a given query can be retrieved efficiently.

In [None]:
# Load pre-trained model for embeddings (sentence-transformers)
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Function to embed text
def embed_text(text):
    sentences = text.split('\n')  # Split by newlines (or you can split by punctuation)
    embeddings = embedder.encode(sentences, convert_to_numpy=True)
    return sentences, embeddings

# Embed the PDF text
sentences, embeddings = embed_text(pdf_text)

# Create a FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # Flat (brute-force) index

# Add embeddings to FAISS index
index.add(np.array(embeddings))

# Check if the embeddings are correctly added
print("Number of embeddings in index:", index.ntotal)



Number of embeddings in index: 8912
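
`IndexFlatL2` performs an exhaustive search: the query is compared against every stored vector, and the `k` smallest squared Euclidean distances are returned. A rough pure-Python sketch of that idea on toy 2-D vectors follows. `flat_l2_search` is a hypothetical helper for illustration, not a FAISS function, and FAISS itself runs this comparison in optimized native code.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to query.

    Distances are squared Euclidean distances, sorted ascending.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
distances, indices = flat_l2_search(vectors, query=[1.0, 1.0], k=2)
print(distances, indices)
```

Because every vector is scanned, search time grows linearly with the number of entries, which is acceptable for the 8912 embeddings indexed here.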


# Querying FAISS Database

The query is embedded into a vector using the same model used for the dataset, and the **FAISS** index is then searched for the top `k` closest entries. Because the index is an `IndexFlatL2`, closeness is measured by L2 (Euclidean) distance rather than cosine similarity.

The returned indices are validated, and results of 10 characters or fewer are dropped so that empty or very short lines are not returned. The remaining text can then be used as context for generating a response.
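
A note on the distance metric: an `IndexFlatL2` ranks by squared Euclidean distance. For unit-length vectors this is a monotone function of cosine similarity, `||a - b||^2 = 2 - 2 * cos(a, b)`, so both metrics produce the same ranking. Whether the embeddings used here are unit-length is an assumption to verify; passing `normalize_embeddings=True` to `encode` is one way to guarantee it. The sketch below checks the identity numerically on two toy vectors, with hypothetical helper names.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
lhs = squared_l2(a, b)
rhs = 2 - 2 * dot(a, b)
print(lhs, rhs)
```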

In [None]:
def query_faiss(query, k=5):
    """
    Search the FAISS index for the top `k` most relevant sentences based on the query.
    """
    # Embed the query
    query_embedding = embedder.encode([query], convert_to_numpy=True)

    # Search FAISS index
    D, I = index.search(query_embedding, k)  # D: distances, I: indices of top k results

    # Validate and retrieve the sentences corresponding to the top k results
    results = []
    for idx in I[0]:
        if 0 <= idx < len(sentences):  # Ensure valid index
            results.append(sentences[idx])

    # Filter out overly short results (e.g., empty or very brief strings)
    filtered_results = [res for res in results if len(res.strip()) > 10]

    return filtered_results

# Example usage
query = "Provide a detailed description of Sherlock Holmes, including his profession, personality traits, notable cases, and the significance of his character in literature."
results = query_faiss(query, k=20)  # Adjust `k` as needed

# Display the results
print("\nTop matching sentences:\n")
for i, result in enumerate(results, 1):
    print(f"{i}. {result}")




Top matching sentences:

1. interested in Mr. Sherlock Holmes’ cases.”
2. The Adventures of Sherlock Holmes
3. OF SHERLOCK HOLMES ***
4. OF SHERLOCK HOLMES ***
5. Sherlock Holmes was a man, however, who, when he had an unsolved problem
6. Title: The Adventures of Sherlock Holmes
7. The Project Gutenberg eBook of The Adventures of Sherlock Holmes
8. really an object of interest to the celebrated Mr. Sherlock Holmes.
9. at the dénouement of the little mystery. I found Sherlock Holmes alone, however,
10. from Reading to the little Berkshire village. There were Sherlock Holmes, the
11. last eight years studied the methods of my friend Sherlock Holmes, I find many
12. stricken man. To Holmes, as I could see by his eager face and peering eyes, very
13. and how the best plans of Mr. Sherlock Holmes were beaten by a woman’s wit.
14. I could see that Holmes was favourably impressed by the manner and speech of
15. “I am endeavouring to tell you everything, Mr. Holmes, which may have any
16. adv

# Using Generation Models

Using the **GPT-2 model**, the code generates text from a given prompt. Two sampling controls are set: **temperature**, which scales how random the token choices are, and **top-p (nucleus) sampling**, which restricts choices to the most probable tokens whose cumulative probability reaches `top_p`. `no_repeat_ngram_size` prevents any 3-gram from being repeated, and the `eos_token` is reused as the padding token because GPT-2 does not define one.
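
To make the two sampling controls concrete, here is a small pure-Python sketch of temperature scaling followed by nucleus filtering on a toy four-token distribution. The function names and logits are made up for illustration; this is not how the `transformers` library implements them internally. Note that in Hugging Face `generate`, both settings only take effect when sampling is enabled with `do_sample=True`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative probability
    reaches top_p, then renormalize over the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
probs = softmax_with_temperature(logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)
```

With these toy values, only the two most probable tokens survive the filter, and the next token would be sampled from that renormalized pair.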

In [None]:
# Load GPT-2 model and tokenizer
model_name = 'gpt2-large'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Ensuring the tokenizer has a pad token for proper attention masking
tokenizer.pad_token = tokenizer.eos_token

# Function to generate text from a prompt
def generate_text(prompt, max_length=600):
    # Tokenize prompt with attention mask
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=3,  # Block any 3-gram from repeating
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
        do_sample=True,  # Required for temperature and top_p to take effect
        temperature=0.7,  # Temperature: lower values give more focused output
        top_p=0.9         # Nucleus sampling
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)



## Generating results

The results from the FAISS search are joined into a single context string and combined with the question to form the prompt.
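
GPT-2 accepts at most 1,024 tokens for the prompt and generated text combined, so joining every retrieved line can leave little room for an answer. A minimal sketch of assembling the prompt under a size budget follows. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption (the real count would come from the tokenizer), and `build_budgeted_prompt` and `max_context_words` are hypothetical names.

```python
def build_budgeted_prompt(question, passages, max_context_words=300):
    """Add passages in order until the word budget would be exceeded."""
    selected, used = [], 0
    for passage in passages:  # passages are assumed to be ordered best-first
        n_words = len(passage.split())
        if used + n_words > max_context_words:
            break
        selected.append(passage.strip())
        used += n_words
    context = "\n".join(selected)
    return (
        "Based on the following context, answer the question in detail:\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

passages = [
    "Holmes was a consulting detective.",
    "He lived in Baker Street.",
    "Watson narrated the cases.",
]
prompt = build_budgeted_prompt("Who was Holmes?", passages, max_context_words=9)
print(prompt)
```

Ending the prompt with `Answer:` gives a plain language model such as GPT-2 a clear place to continue from.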

In [None]:
context = " ".join(results)  # Join the retrieved FAISS results into one context string
prompt = f"Based on the following context, answer the question in detail:\n{context}\n\nQuestion: Describe Sherlock Holmes, his profession, personality, notable cases, and his importance in literature."

generated_answer = generate_text(prompt)

print("\nGenerated answer:\n\n")
print(generated_answer)




Generated answer:


Based on the following context, answer the question in detail:
interested in Mr. Sherlock Holmes’ cases.” The Adventures of Sherlock Holmes OF SHERLOCK HOLMES *** OF SHERLOCK HOLMES *** Sherlock Holmes was a man, however, who, when he had an unsolved problem Title: The Adventures of Sherlock Holmes The Project Gutenberg eBook of The Adventures of Sherlock Holmes really an object of interest to the celebrated Mr. Sherlock Holmes. at the dénouement of the little mystery. I found Sherlock Holmes alone, however, from Reading to the little Berkshire village. There were Sherlock Holmes, the last eight years studied the methods of my friend Sherlock Holmes, I find many stricken man. To Holmes, as I could see by his eager face and peering eyes, very and how the best plans of Mr. Sherlock Holmes were beaten by a woman’s wit. I could see that Holmes was favourably impressed by the manner and speech of “I am endeavouring to tell you everything, Mr. Holmes, which may have any 