# Building a Real RAG Pipeline: Wikipedia Knowledge Retrieval and Question Answering with FAISS, Sentence Transformers, and Gemini

## A Practical End-to-End Example with LangChain and Google's LLM

Welcome to this hands-on Jupyter Notebook, where we'll build a complete Retrieval-Augmented Generation (RAG) pipeline! We'll go step by step through:

1.  **Retrieving Knowledge:** Fetching content from Wikipedia.
2.  **Embedding Sentences:** Using Sentence Transformers to create vector representations of our knowledge.
3.  **Indexing with FAISS:** Building an efficient index for fast semantic search.
4.  **Question Generation (Optional):**  Using Gemini to create example questions (for demonstration).
5.  **Question Answering with RAG:**  Using Gemini, LangChain, and our FAISS index to answer questions based on retrieved Wikipedia knowledge.

Let's get started! 🚀

## Setting up the Environment and Installing Libraries

In [None]:
!pip install faiss-cpu sentence-transformers wikipedia-api langchain-google-genai python-dotenv

In [1]:
import os
import random
import time

import wikipediaapi
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from sentence_transformers import SentenceTransformer

import faiss

load_dotenv()

True

## Retrieving Knowledge from Wikipedia

Wikipedia is an incredible source of information, and for our RAG system, it will serve as our knowledge base. We'll use the `wikipedia-api` library to retrieve the content of a specific article.  Let's choose the "Artificial Intelligence" Wikipedia page as our starting point.



In [2]:
# Initialize Wikipedia API
wiki = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI,
    user_agent='RAG Knowledge Graph Demo/1.0'
)

# Get Wikipedia article (e.g., "Artificial Intelligence")
article_title = "Artificial Intelligence"
page = wiki.page(article_title)

if page.exists():
    text_content = page.text
    print(f"Successfully retrieved Wikipedia article: {article_title}")
    
    # Split the text on blank lines and newlines. Each non-empty line becomes one
    # "sentence", so an entry may actually hold a whole paragraph.
    sentences = []
    for paragraph in text_content.split("\n\n"):
        for sentence in paragraph.split("\n"):
            sentence = sentence.strip()
            if sentence:
                sentences.append(sentence)
    
    print(f"Extracted {len(sentences)} sentences from Wikipedia article")
else:
    print(f"Article '{article_title}' not found")
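    # Stop this cell without shutting down the notebook kernel
    raise SystemExit("Stopping: no article text to process")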

Successfully retrieved Wikipedia article: Artificial Intelligence
Extracted 278 sentences from Wikipedia article
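
The cell above splits only on line breaks, so each extracted "sentence" can span several real sentences. If finer granularity is wanted, one option is a small regex-based splitter. The following is a minimal sketch, not part of the original notebook: `split_into_sentences` is a hypothetical helper, and its rule (break after `.`, `!` or `?` when the next token starts with an uppercase letter, digit, or quote) is an assumption that will mis-split abbreviations such as "U.S. Army".

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Roughly split text into sentences (hypothetical helper, heuristic only)."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between paragraphs
        # Break after ., ! or ? when followed by whitespace and a
        # capital letter, digit, or opening quote.
        parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])", line)
        sentences.extend(p.strip() for p in parts if p.strip())
    return sentences


sample = "AI is broad. It has many subfields.\n\nSome are old."
print(split_into_sentences(sample))
```

Smaller chunks tend to make each retrieved hit more focused, at the cost of losing some surrounding context.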


## Generating Sentence Embeddings

Sentence embeddings are the key to semantic search. They represent the *meaning* of text as dense vectors in a high-dimensional space.  Sentences with similar meanings will have vectors that are "close" to each other in this space. We'll use the `sentence-transformers` library and the `all-MiniLM-L6-v2` model, which provides a good balance of speed and semantic quality.



In [3]:
model_name = "all-MiniLM-L6-v2"
embedder = SentenceTransformer(model_name)

# Generate embeddings for all sentences
sentence_embeddings = embedder.encode(sentences, convert_to_numpy=True)

print(f"Embeddings shape: {sentence_embeddings.shape}")

Embeddings shape: (278, 384)
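
To make "close in vector space" concrete, here is a minimal sketch of cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors are made-up values for illustration, not output of `all-MiniLM-L6-v2`, and `cosine_similarity` is a hypothetical helper. Note that the FAISS index built later in this notebook uses L2 distance by default, which is a different but related measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings" (real ones here have 384 dimensions)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(f"cat vs kitten: {cosine_similarity(cat, kitten):.3f}")
print(f"cat vs car:    {cosine_similarity(cat, car):.3f}")
```

The related pair scores much higher than the unrelated pair, which is the property semantic search relies on.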


## Building a FAISS Index

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over large collections of vectors. We'll build an **IVF (Inverted File) index with Product Quantization (PQ)**. As covered in our article, IVF narrows the search space by clustering vectors into cells, and PQ compresses each vector to save memory and speed up distance computations. With only 278 vectors, an exact flat index would work just as well; we use IVF-PQ here to demonstrate the technique.
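
The IVF idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how FAISS is implemented: the centroids are fixed by hand (FAISS learns them with k-means during training), and `build_inverted_lists` and `ivf_search` are hypothetical helpers.

```python
import math


def build_inverted_lists(vectors, centroids):
    """Assign each vector ID to the cell of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        lists[cell].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [vid for c in cells[:nprobe] for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]


# Hand-picked toy data (assumption: two cells, 2-D vectors)
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(0.0, 1.0), (2.0, 0.0), (9.0, 10.0), (10.0, 9.0), (5.2, 5.2)]
lists = build_inverted_lists(vectors, centroids)

query = (4.5, 4.5)
print(ivf_search(query, vectors, centroids, lists, nprobe=1, k=1))
print(ivf_search(query, vectors, centroids, lists, nprobe=2, k=1))
```

With `nprobe=1` the search skips the cell that holds the true nearest neighbor, while `nprobe=2` finds it. That is the speed versus recall trade-off `nprobe` controls. In the notebook's index, `nprobe` equals `nlist`, so every cell is scanned and nothing is skipped.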


In [10]:
dimension = sentence_embeddings.shape[1]  # Embedding dimension
num_vectors = sentence_embeddings.shape[0]  # Number of sentences
nlist = 8  # Number of Voronoi cells (inverted lists)
M = 2  # Number of sub-quantizers for PQ
nbits = 8  # Bits per sub-quantizer

# Create the index factory string for IVFPQ
index_factory_string = f"IVF{nlist},PQ{M}x{nbits}"

# Instantiate the FAISS index
index = faiss.index_factory(dimension, index_factory_string)
index.nprobe = 8  # Cells to scan per query; equal to nlist here, so every cell is searched

print("Training index...")
index.train(sentence_embeddings)  # Train on the embeddings

print("Adding vectors to index...")
index.add(sentence_embeddings)  # Add the embeddings to the index

print(f"Index is trained: {index.is_trained}")
print(f"Number of vectors in index: {index.ntotal}")

Training index...




Adding vectors to index...
Index is trained: True
Number of vectors in index: 278
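
The effect of the `M` and `nbits` settings can be worked out with simple arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions: `pq_footprint` is a hypothetical helper, raw vectors are assumed to be float32, and the estimate ignores the codebooks and the IVF bookkeeping that a real index also stores.

```python
def pq_footprint(dimension, m, nbits, bytes_per_float=4):
    """Estimate per-vector storage before and after product quantization."""
    if dimension % m != 0:
        raise ValueError("dimension must be divisible by the number of sub-quantizers")
    raw_bytes = dimension * bytes_per_float   # uncompressed vector
    code_bytes = (m * nbits + 7) // 8         # one nbits-wide code per sub-vector
    return {
        "sub_dim": dimension // m,             # dimensions covered by each sub-quantizer
        "centroids_per_subquantizer": 2 ** nbits,
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression_ratio": raw_bytes / code_bytes,
    }


stats = pq_footprint(dimension=384, m=2, nbits=8)
for name, value in stats.items():
    print(f"{name}: {value}")
```

With these settings each 384-dimensional vector is reduced to 2 bytes, and each sub-quantizer must summarize 192 dimensions with 256 centroids learned from only 278 training vectors. That is a very coarse approximation, so retrieval quality on a larger corpus would likely benefit from a larger `M` and more training data.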


## Generating Example Questions with Gemini

Let's use Google's Gemini model to generate questions based on some randomly selected sentences from our Wikipedia article.  This will give us realistic questions that are grounded in our knowledge base.

**Important:** To run this section, you need to have your Google API key set up in a `.env` file as `GOOGLE_API_KEY=YOUR_API_KEY`.
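
A missing key usually surfaces only as an authentication error on the first API call. A small check up front can give a clearer message. This is a minimal sketch: `require_env` is a hypothetical helper, not part of `python-dotenv` or LangChain, and `DEMO_API_KEY` is a placeholder name used only for the demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Demonstration with a placeholder variable (assumption: not a real key)
os.environ["DEMO_API_KEY"] = "abc123"
print(require_env("DEMO_API_KEY"))
```

In this notebook, the same check could be applied to `GOOGLE_API_KEY` after `load_dotenv()` has run and before the model client is created.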


In [11]:
# os.environ["GOOGLE_API_KEY"] = "your_google_api_key"

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", api_key=os.getenv("GOOGLE_API_KEY"))

# Select up to 4 distinct random sentences with more than 50 characters.
# Sampling from a filtered list avoids looping forever if too few sentences qualify.
candidate_indices = [i for i, s in enumerate(sentences) if len(s) > 50]
chosen = random.sample(candidate_indices, min(4, len(candidate_indices)))
selected_sentences = {idx: sentences[idx] for idx in chosen}


# Generate questions using Gemini
question_prompt = """
For the following sentence, create one specific question that can be answered using only the information in the sentence. 
Make the question clear and focused.

Sentence: {sentence}

Generate only the question, without any additional text or explanation.
"""

query_texts = []
print("Selected sentences and generated questions:\n")

for i, sentence in selected_sentences.items():
    response = llm.invoke(question_prompt.format(sentence=sentence))
    query_texts.append(response.content.strip())
    
    print(f"Sentence {i}: {sentence}")
    print(f"Question {i}: {response.content.strip()}\n")
    
    time.sleep(2) # Sleep for 2 seconds to avoid rate limiting

Selected sentences and generated questions:

Sentence 150: Artificial intelligence provides a number of tools that are useful to bad actors, such as authoritarian governments, terrorists, criminals or rogue states.
Question 150: According to the sentence, what types of bad actors find artificial intelligence tools useful?

Sentence 125: In 2024, the Wall Street Journal reported that big AI companies have begun negotiations with the US nuclear power providers to provide electricity to the data centers. In March 2024 Amazon purchased a Pennsylvania nuclear-powered data center for $650 Million (US). Nvidia CEO Jen-Hsun Huang said nuclear power is a good option for the data centers.
Question 125: According to the Wall Street Journal, in what year did big AI companies begin negotiations with US nuclear power providers?

Sentence 7: Early researchers developed algorithms that imitated step-by-step reasoning that humans use when they solve puzzles or make logical deductions. By the late 1980s

## Performing Semantic Search and Question Answering with RAG

Here's where the magic happens! We'll take the questions (either the ones generated by Gemini or your own), run a semantic search against our FAISS index to retrieve relevant sentences from Wikipedia, and then use Gemini again to answer each question, *grounded in the retrieved context*. This is Retrieval-Augmented Generation in action!
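
The step that joins retrieval to generation is just string assembly: collect the retrieved sentences and place them in the prompt ahead of the question. The following is a minimal sketch of that step with no model or index involved. `build_rag_prompt` is a hypothetical helper, and the sample sentences and indices are made up. It also skips `-1` entries, which FAISS returns when it finds fewer than `k` neighbors.

```python
def build_rag_prompt(sentences, indices, query):
    """Combine all retrieved sentences and the question into one prompt string."""
    # Keep only valid positions; FAISS pads missing results with -1
    retrieved = [sentences[i] for i in indices if 0 <= i < len(sentences)]
    context = "\n".join(retrieved)
    return (
        "Consider the following context from Wikipedia:\n"
        f"{context}\n\n"
        "Now answer the following question using only the context provided: "
        f"{query}"
    )


# Made-up knowledge base and search result for illustration
kb = ["AI is a field of computer science.", "FAISS is a similarity search library."]
print(build_rag_prompt(kb, [1, -1], "What is FAISS?"))
```

Passing every retrieved sentence, rather than only the last one, matters as soon as `k` is larger than 1.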


In [13]:
prompt = """Consider the following context from Wikipedia: {sentences}
Now answer the following question using only the context provided: {query}"""

k = 1  # number of nearest neighbors to retrieve

for query_text in query_texts:
    # Embed the query text
    query_embedding = embedder.encode(query_text, convert_to_numpy=True).reshape(
        1, -1
    )  # reshape to 2D array

    print(f"\nSearching index for query: '{query_text}'...")
    distances, indices = index.search(query_embedding, k)  # perform the search

    context = ""

    print("\nQuery:", query_text)
    print("\nRetrieved sentences:")
    for i, idx in enumerate(indices[0]):
        print(f"\nRank {i+1}: {sentences[idx]} (Distance: {distances[0][i]:.4f})")
        context += sentences[idx] + "\n"

    # Pass the full retrieved context, not just the last sentence, so k > 1 works
    response = llm.invoke(prompt.format(sentences=context, query=query_text))
    print(f"Answer: {response.content.strip()}")
        
    # separator between queries for better readability
    print("\n" + "="*80 + "\n")
    time.sleep(2)  # 2 seconds wait between call to avoid rate limiting


Searching index for query: 'According to the sentence, what types of bad actors find artificial intelligence tools useful?'...

Query: According to the sentence, what types of bad actors find artificial intelligence tools useful?

Retrieved sentences:

Rank 1: Artificial intelligence provides a number of tools that are useful to bad actors, such as authoritarian governments, terrorists, criminals or rogue states. (Distance: 0.4878)
Answer: According to the sentence, authoritarian governments, terrorists, criminals, or rogue states find artificial intelligence tools useful.



Searching index for query: 'According to the Wall Street Journal, in what year did big AI companies begin negotiations with US nuclear power providers?'...

Query: According to the Wall Street Journal, in what year did big AI companies begin negotiations with US nuclear power providers?

Retrieved sentences:

Rank 1: In 2024, the Wall Street Journal reported that big AI companies have begun negotiations with the 