# Phase 1 Final Project – Naive RAG Chatbot (War and Peace)

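This notebook downloads *War and Peace* from Project Gutenberg, splits it into overlapping chunks, builds a FAISS index from HuggingFace sentence embeddings, and demonstrates dense (FAISS), sparse (BM25), and hybrid retrieval. It returns the retrieved passages rather than generated answers, so no OpenAI API key is required.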

In [4]:
# Install dependencies (run once)
!pip install -q langchain langchain-community langchain-text-splitters faiss-cpu sentence-transformers rank-bm25 requests

print('Install step finished (or already installed).')

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.
Install step finished (or already installed).


In [5]:
# Download War and Peace from Project Gutenberg
import requests
from pathlib import Path

Path('data').mkdir(exist_ok=True)
url = 'https://www.gutenberg.org/files/2600/2600-0.txt'
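res = requests.get(url, timeout=60)  # fail instead of hanging if the server stalls
res.raise_for_status()
# write_text opens, writes, and closes the file in one call
Path('data/war_and_peace.txt').write_text(res.text, encoding='utf-8')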
print('Downloaded War and Peace —', len(res.text), 'characters')

Downloaded War and Peace — 3293552 characters


In [6]:
# Ingest and chunk the document
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader('data/war_and_peace.txt', encoding='utf-8')
docs_raw = loader.load()
print('Loaded raw documents:', len(docs_raw))

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(docs_raw)
print('Created chunks:', len(docs))

print('\n=== Sample chunk ===\n')
print(docs[0].page_content[:800])

Loaded raw documents: 1
Created chunks: 4500

=== Sample chunk ===

﻿The Project Gutenberg eBook of War and Peace, by Leo Tolstoy

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: War and Peace

Author: Leo Tolstoy

Translators: Louise and Aylmer Maude

Release Date: April, 2001 [eBook #2600]
[Most recently updated: June 14, 2022]

Language: English

Character set encoding: UTF-8

Produced by: An Anonymous Volunteer and David Widger

*** START OF THE PROJECT GUT
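
The sample chunk shows that the Project Gutenberg licence header is chunked and indexed along with the novel. A minimal sketch of trimming it before splitting is given below. The helper name `strip_gutenberg_boilerplate` is hypothetical, and the exact marker wording is an assumption based on the usual Gutenberg `*** START OF ... ***` / `*** END OF ... ***` lines, so check it against the downloaded file.

```python
import re

# Assumed marker patterns; verify against data/war_and_peace.txt before relying on them.
START_RE = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE)

def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END marker lines, if they are present."""
    text = text.lstrip("\ufeff")          # drop a leading byte-order mark
    start = START_RE.search(text)
    if start:
        text = text[start.end():]         # keep everything after the START line
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]         # keep everything before the END line
    return text.strip()

sample = (
    "\ufeffThe Project Gutenberg eBook of War and Peace\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***\n\n"
    "Body text.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***\n"
    "Licence text"
)
print(strip_gutenberg_boilerplate(sample))
```

In the notebook this could be applied to `docs_raw[0].page_content` before calling `splitter.split_documents(docs_raw)`; text without the markers is returned unchanged apart from whitespace trimming.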


In [7]:
# Build embeddings (HuggingFace) and FAISS vectorstore
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from pathlib import Path

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

vs = FAISS.from_documents(docs, embeddings)
Path('indexes').mkdir(exist_ok=True)
vs.save_local('indexes/faiss_index')
print('FAISS index saved to indexes/faiss_index')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


FAISS index saved to indexes/faiss_index
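
To make the dense step less opaque, here is a toy sketch of what a flat index does at query time: compare the query vector with every stored vector and keep the closest ones. It assumes Euclidean (L2) distance, which is the default for LangChain's FAISS wrapper; the two-dimensional vectors and the helper name `l2_top_k` are made up for illustration, whereas the real index holds one 384-dimensional vector per chunk.

```python
import math
from typing import List, Sequence, Tuple

def l2_top_k(query_vec: Sequence[float],
             doc_vecs: Sequence[Sequence[float]],
             k: int = 3) -> List[Tuple[int, float]]:
    """Brute-force nearest neighbours: return (index, distance) pairs, closest first."""
    dists = [(i, math.dist(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(dists, key=lambda pair: pair[1])[:k]

# Hypothetical 2-D "embeddings" standing in for chunk vectors.
toy_docs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(l2_top_k([1.0, 0.0], toy_docs, k=2))
```

`vs.similarity_search(query, k=k)` does the same kind of ranking, after first embedding the query with the model used to build the index.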


In [8]:
# Build BM25 (sparse) index
from rank_bm25 import BM25Okapi
texts = [d.page_content for d in docs]
tokenized = [t.split() for t in texts]
bm25 = BM25Okapi(tokenized)
print('BM25 index ready — number of documents:', len(texts))

BM25 index ready — number of documents: 4500
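
Tokenizing with `str.split()` keeps case, punctuation, and accents, so a query token such as `Bezukhov?` can never match the `Bezúkhov` that appears in the text. The sketch below shows one possible normalizing tokenizer; the function name `normalize_tokens` is hypothetical, and stripping accents is an assumption that suits this English translation but may not suit other corpora.

```python
import re
import unicodedata
from typing import List

def normalize_tokens(text: str) -> List[str]:
    """Lowercase, strip accents, and split on anything that is not a letter or digit."""
    decomposed = unicodedata.normalize('NFKD', text)   # 'ú' becomes 'u' + combining accent
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r'[a-z0-9]+', no_accents.lower())

print(normalize_tokens('Who is Pierre Bezukhov?'))  # ['who', 'is', 'pierre', 'bezukhov']
print(normalize_tokens('Count Bezúkhov'))           # ['count', 'bezukhov']
```

To use it, the same function would have to be applied on both sides: `BM25Okapi([normalize_tokens(t) for t in texts])` when building the index and `bm25.get_scores(normalize_tokens(query))` when querying.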


In [9]:
# Retrieval and simple RAG query function
from typing import List

def dense_retrieve(query: str, k: int = 3) -> List[str]:
    res = vs.similarity_search(query, k=k)
    return [r.page_content for r in res]

def sparse_retrieve(query: str, k: int = 3) -> List[str]:
    scores = bm25.get_scores(query.split())
    idxs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [texts[i] for i in idxs]

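def hybrid_retrieve(query: str, k: int = 3) -> List[str]:
    dense = dense_retrieve(query, k=k)
    sparse = sparse_retrieve(query, k=k)
    # Interleave the two rankings. Appending sparse after dense and then
    # truncating to k would return the dense results only.
    merged = []
    for i in range(k):
        for ranked in (dense, sparse):
            if i < len(ranked) and ranked[i] not in merged:
                merged.append(ranked[i])
    return merged[:k]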

def rag_query(query: str, method: str = 'dense', k: int = 3) -> str:
    if method == 'dense':
        ctx = dense_retrieve(query, k=k)
    elif method == 'sparse':
        ctx = sparse_retrieve(query, k=k)
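    else:
        # Any other value, including a typo, falls through to hybrid retrieval.
        ctx = hybrid_retrieve(query, k=k)
    # Separate passages with a rule on its own line.
    return '\n\n---\n\n'.join(ctx)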

print('Dense sample snippet:')
print(dense_retrieve('Who is Pierre Bezukhov?', k=2)[0][:600])
print('\nSparse sample snippet:')
print(sparse_retrieve('Who is Pierre Bezukhov?', k=2)[0][:600])

Dense sample snippet:
Pierre, on unexpectedly becoming Count Bezúkhov and a rich man, felt
himself after his recent loneliness and freedom from cares so beset and
preoccupied that only in bed was he able to be by himself. He had to
sign papers, to present himself at government offices, the purpose of
which was not clear to him, to question his chief steward, to visit his
estate near Moscow, and to receive many people who formerly did not
even wish to know of his existence but would now have been offended
and grieved had he chosen not to see them. These different
people—businessmen, relations, and acquaintances alik

Sparse sample snippet:
“What, teasing again? Go to the devil! Eh?” said Anatole, making a
grimace. “Really it’s no time for your stupid jokes,” and he left
the room.

Dólokhov smiled contemptuously and condescendingly when Anatole had
gone out.

“You wait a bit,” he called after him. “I’m not joking, I’m
talking sense. Come here, come here!”

Anatole returned and looked at 
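
The dense and sparse samples above disagree, which is the case a hybrid retriever has to handle. One common way to combine two rankings is reciprocal rank fusion (RRF), sketched below as an alternative to simple merging. This is not what the notebook's `hybrid_retrieve` does; the function name `rrf_merge` is hypothetical and the constant `c=60` is only the conventional default.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 3, c: int = 60) -> List[str]:
    """Score each document by the sum of 1 / (c + rank) over all rankings it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (c + rank)
    # Highest fused score first; documents found by both retrievers rise to the top.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Placeholder strings standing in for retrieved chunks.
dense_hits = ['a', 'b', 'c']
sparse_hits = ['c', 'd', 'a']
print(rrf_merge([dense_hits, sparse_hits], k=3))
```

In the notebook this could be called as `rrf_merge([dense_retrieve(q, k), sparse_retrieve(q, k)], k=k)`. Because it uses ranks rather than raw scores, the BM25 and FAISS scales never need to be reconciled.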

In [10]:
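# Interactive prompt loop: run the cell and type questions (type "exit" to stop)
VALID_METHODS = ('dense', 'sparse', 'hybrid')
while True:
    q = input('\nAsk a question (or type "exit"): ').strip()
    if not q:
        print('Please type a question or "exit".')
        continue
    if q.lower() in ('exit', 'quit'):
        print('Exiting.')
        break
    method = input('Retriever (dense/sparse/hybrid) [dense]: ').strip().lower() or 'dense'
    if method not in VALID_METHODS:
        # rag_query would otherwise treat a typo as "hybrid" without saying so.
        print(f'Unknown retriever "{method}"; using dense.')
        method = 'dense'
    k = input('Top-k [3]: ').strip()
    try:
        k = int(k) if k else 3
    except ValueError:
        k = 3
    k = max(k, 1)  # guard against zero or negative values
    ans = rag_query(q, method=method, k=k)
    print('\n--- Retrieved answer (truncated 2000 chars) ---\n')
    print(ans[:2000])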



Ask a question (or type "exit"): Who is Pierre Bezukhov?
Retriever (dense/sparse/hybrid) [dense]: densw
Top-k [3]: 4

--- Retrieved answer (truncated 2000 chars) ---

Pierre, on unexpectedly becoming Count Bezúkhov and a rich man, felt
himself after his recent loneliness and freedom from cares so beset and
preoccupied that only in bed was he able to be by himself. He had to
sign papers, to present himself at government offices, the purpose of
which was not clear to him, to question his chief steward, to visit his
estate near Moscow, and to receive many people who formerly did not
even wish to know of his existence but would now have been offended
and grieved had he chosen not to see them. These different
people—businessmen, relations, and acquaintances alike—were all
disposed to treat the young heir in the most friendly and flattering
manner: they were all evidently firmly convinced of Pierre’s noble
qualities. He was always hearing such words as: “With your remarkable
kindness,” or, 

KeyboardInterrupt: Interrupted by user