# **🔍 Retrieval-Augmented Generation (RAG) Pipeline for Wikipedia QA**

This notebook implements a lightweight **RAG (Retrieval-Augmented Generation)** pipeline using:

- Wikipedia API for retrieving documents
- Sentence Transformers (`all-mpnet-base-v2`) for embeddings
- FAISS for semantic search
- Hugging Face QA model (`roberta-base-squad2`) for extractive question answering

The system takes a **topic** and a **user question**, retrieves the most relevant Wikipedia passages, and returns an answer extracted from that context.

## 📦 Step 1: Install Required Libraries

We begin by installing all required Python libraries:
- `wikipedia` for fetching Wikipedia content
- `gradio` for building a simple UI (optional)
- `transformers`, `sentence-transformers` for model loading
- `faiss-cpu` for fast vector search


In [None]:
# !pip install wikipedia
# !pip install sentence-transformers
# !pip install faiss-cpu
# !pip install transformers


## 📘 Step 2: Load Wikipedia Article

Given a topic (e.g., "Artificial Intelligence"), we fetch the full content of the corresponding Wikipedia page. This forms the base of our retrieval corpus. The same cell then splits the article into overlapping chunks of up to 256 words.


In [None]:
import wikipedia

def get_wikipedia_content(topic):
    try:
        page = wikipedia.page(topic)
        return page.content
    except wikipedia.exceptions.PageError:
        return None
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Ambiguous topic. Please be more specific. Options: {e.options}")
        return None

topic = input("Enter a topic to learn about: ")
document = get_wikipedia_content(topic)

if not document:
    # Stop this cell without shutting down the notebook kernel.
    raise SystemExit("Could not retrieve information.")


def chunk_text(text, max_length=256, overlap=20):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_length, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += max_length - overlap
    return chunks

chunks = chunk_text(document)
print(f"{len(chunks)} chunks created")

Enter a topic to learn about: Cloud Computing
24 chunks created
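
The sliding-window chunking above can be checked on a toy input. The sketch below is a hypothetical variant (`chunk_words` is not part of the notebook) that uses the same window and overlap logic, but stops once a window reaches the end of the text and rejects an overlap that would stall the loop. Because of the early stop, it may produce one chunk fewer than `chunk_text` on the same article.

```python
def chunk_words(text, max_length=256, overlap=20):
    """Split text into word windows of max_length that share `overlap` words."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    words = text.split()
    step = max_length - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_length]))
        if start + max_length >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: ten words, windows of four words sharing one word.
toy = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_words(toy, max_length=4, overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each printed window starts with the last word of the previous one, which is the overlap that keeps sentences spanning a boundary retrievable from either side.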


In [None]:
from transformers import AutoTokenizer

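token = "HF_TOKEN"  # placeholder: replace with your own Hugging Face access token

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-mpnet-base-v2",
    token=token  # `token` replaces the deprecated `use_auth_token` argument
)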
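
The `"HF_TOKEN"` string above is a placeholder rather than a working credential. One common way to avoid pasting a real token into a notebook is to read it from an environment variable. The sketch below assumes the variable is named `HF_TOKEN`, and `resolve_hf_token` is a hypothetical helper, not part of any library.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or None when it is unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or None


# None means the download is attempted anonymously.
token = resolve_hf_token()
```

Both models used in this notebook are publicly hosted, so a `None` token should be enough to download them.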



## 🧠 Step 3: Text Chunking + Embedding

We split the article into overlapping chunks of up to 256 words (with a 20-word overlap) so that each chunk stays semantically coherent. Each chunk is then converted into a 768-dimensional dense embedding using `all-mpnet-base-v2` from Sentence Transformers.


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load the same model used for tokenizer
embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", token=token)

# Encode your text chunks
embeddings = embedding_model.encode(chunks, convert_to_numpy=True)

# Confirm embedding shape
print("Embedding shape:", embeddings.shape)



Embedding shape: (24, 768)


## 🔍 Step 4: Build FAISS Index for Similarity Search

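FAISS enables fast similarity search over large collections of embedding vectors. We add all chunk embeddings to an `IndexFlatL2` index, which performs an exact search and ranks chunks by L2 (Euclidean) distance to the embedding of the user’s query. When the embeddings are unit length, this ranking is the same as ranking by cosine similarity.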


In [None]:
# Create a flat (exact) FAISS index that ranks by L2 distance
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add the chunk embeddings to the index
index.add(embeddings)
print("FAISS index created with", index.ntotal, "documents.")


FAISS index created with 24 documents.
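
`IndexFlatL2` ranks by squared Euclidean distance. For unit-length vectors this gives the same ordering as cosine similarity, because ‖a − b‖² = 2 − 2·(a·b). The NumPy sketch below illustrates that identity with random vectors (not the notebook's embeddings) and a brute-force search in place of FAISS. The shapes mirror the 24 × 768 embedding matrix only for familiarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for chunk embeddings, scaled to unit length.
vectors = rng.normal(size=(24, 768))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Random stand-in for a query embedding, also unit length.
q = rng.normal(size=768)
q /= np.linalg.norm(q)

sq_l2 = ((vectors - q) ** 2).sum(axis=1)  # smaller is closer
cosine = vectors @ q                      # larger is closer

top_l2 = np.argsort(sq_l2)[:3]
top_cos = np.argsort(-cosine)[:3]
print("Top-3 by L2:    ", top_l2)
print("Top-3 by cosine:", top_cos)
```

If the embeddings are not unit length, passing `normalize_embeddings=True` to `encode` and switching to `faiss.IndexFlatIP` is one way to search by cosine similarity directly.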


## 🤖 Step 5: Question Answering with Transformers

Using the top-k retrieved text chunks as context, we run a question-answering pipeline with Hugging Face’s `roberta-base-squad2` model. The model is extractive: it selects the span of the context most likely to answer the question rather than writing new text.


In [None]:
query = input("Ask a question about the topic: ")
query_embedding = embedding_model.encode([query])

k = 3  # top 3 most relevant chunks
distances, indices = index.search(np.array(query_embedding), k)

retrieved_chunks = [chunks[i] for i in indices[0]]

print("\nTop Retrieved Chunks:")
for i, chunk in enumerate(retrieved_chunks):
    print(f"\nChunk {i+1}:\n{chunk}")


Ask a question about the topic: What is Cloud Computing is used for?

Top Retrieved Chunks:

Chunk 1:
allows users to deploy and operate MongoDB databases on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud. OVHcloud – A France-based cloud provider known for its emphasis on data sovereignty, offering private cloud, dedicated servers, and European-hosted cloud solutions. Lambda Labs – Provides GPU cloud computing tailored for AI research, deep learning, and machine learning development. Paperspace – Specializes in cloud computing solutions for AI and machine learning, with a focus on scalable GPU access. RunPod – Offers cloud computing infrastructure optimized for AI applications and model deployment. == Similar concepts == The goal of cloud computing is to allow users to take benefit from all of these technologies, without the need for deep knowledge about or expertise with each one of them. The cloud aims to cut costs and helps the users focus on their core business instea

In [None]:
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    tokenizer="deepset/roberta-base-squad2",
    token=token  # ✅ pass your Hugging Face token here
)



Device set to use cpu


## ✅ Final Output

We show:
- 📚 The retrieved context chunks from Wikipedia
- 🧠 The answer the QA model extracts from that context

Showing both lets users see which passages the answer was drawn from.


In [None]:
context = " ".join(retrieved_chunks)

# Ask the question the user typed earlier, not a hard-coded one
answer = qa_pipeline(
    question=query,
    context=context
)

print("\n🧠 Answer:")
print(answer["answer"])



🧠 Answer:
server time and network storage
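
The dictionary returned by the question-answering pipeline also carries a `score` field alongside `answer`, which can be used to flag weak answers. The sketch below shows one possible way to do that. `format_answer` is a hypothetical helper, the 0.1 threshold is an arbitrary assumption, and the example dictionary is hand-written for illustration (its score was not produced by the model).

```python
def format_answer(result, min_score=0.1):
    """Return the answer text, or a fallback when the span is empty or low-confidence."""
    text = result.get("answer", "").strip()
    score = float(result.get("score", 0.0))
    if not text or score < min_score:
        return f"No confident answer found (score={score:.2f})."
    return f"{text} (score={score:.2f})"


# Hand-written stand-in for a pipeline result; the score is illustrative only.
example = {"answer": "server time and network storage", "score": 0.42}
print(format_answer(example))
```

Since `roberta-base-squad2` was trained on SQuAD 2.0, which includes unanswerable questions, calling the pipeline with `handle_impossible_answer=True` should allow it to return an empty answer, which this helper would then report as not found.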


## 👨‍💻 Author

**Achyuth Kumar Miryala**  
Master’s in Data Science | University of North Texas  
📍 Denton, TX  
📫 [achyuthkumar286@gmail.com](mailto:achyuthkumar286@gmail.com)  
🔗 [LinkedIn](https://www.linkedin.com/in/achyuthkumarmiryala/) | [GitHub](https://github.com/achyuthkumarmiryala)
