# Embeddings and Vector Store (AskMyDocs RAG)

## Goal
Convert the chunked corpus into embeddings and build a vector index for semantic retrieval.

### Inputs
- data/processed/arxiv_chunks_*.csv

### Outputs
- Vector index (FAISS)
- Metadata mapping

In [4]:
# Run once per environment (then comment this out)
!pip -q install torch sentence-transformers faiss-cpu pandas pyarrow

In [6]:
# Import modules
import os
import logging
from pathlib import Path

import numpy as np
import pandas as pd
import faiss
import torch
from sentence_transformers import SentenceTransformer

# Keep notebook output clean
logging.getLogger("huggingface_hub").setLevel(logging.ERROR)
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("sentence_transformers").setLevel(logging.ERROR)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

PROJECT_ROOT = Path("..").resolve()
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"

print("Project root:", PROJECT_ROOT)
print("Processed dir:", PROCESSED_DIR)
print("Torch:", torch.__version__)

Project root: C:\Users\vidus\Projects\RAG-LLM-Projects\AskMyDocs-RAG-Chatbot
Processed dir: C:\Users\vidus\Projects\RAG-LLM-Projects\AskMyDocs-RAG-Chatbot\data\processed
Torch: 2.8.0+cpu


### Load latest file

In [9]:
chunk_files = sorted(PROCESSED_DIR.glob("arxiv_chunks_*.csv"))
assert chunk_files, f"No chunk files found in {PROCESSED_DIR}"

CHUNKS_FILE = chunk_files[-1]  # latest by filename (the timestamp in the name sorts chronologically)
df_chunks = pd.read_csv(CHUNKS_FILE)

# Required column check
assert "chunk_text" in df_chunks.columns, "Expected column 'chunk_text' not found"

texts = df_chunks["chunk_text"].astype(str).tolist()

print("Using:", CHUNKS_FILE.name)
print("Rows:", len(df_chunks))
print("Chunks:", len(texts))

Using: arxiv_chunks_20260223_1021.csv
Rows: 136
Chunks: 136
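
Picking `chunk_files[-1]` after `sorted(...)` only returns the newest file if the timestamp in the filename is zero-padded and ordered year first. A minimal sketch of a more explicit alternative follows, assuming the stamp has the form `YYYYMMDD_HHMM` (inferred from the one filename shown above). The helper `chunk_timestamp` and the two older filenames are hypothetical.

```python
from datetime import datetime
from pathlib import Path

PREFIX = "arxiv_chunks_"


def chunk_timestamp(path: Path) -> datetime:
    """Parse the timestamp out of a name like arxiv_chunks_20260223_1021.csv.

    Assumes the stamp format is YYYYMMDD_HHMM (an assumption, not confirmed).
    """
    stamp = path.stem[len(PREFIX):]
    return datetime.strptime(stamp, "%Y%m%d_%H%M")


# Hypothetical file list; only the first name appears in the notebook output.
candidates = [
    Path("arxiv_chunks_20260223_1021.csv"),
    Path("arxiv_chunks_20251130_0915.csv"),
    Path("arxiv_chunks_20260101_2359.csv"),
]

latest = max(candidates, key=chunk_timestamp)
print("Latest by parsed timestamp:", latest.name)
```

With this naming scheme the parsed-timestamp choice and the plain lexicographic sort agree, so the original one-liner is fine as long as the filenames keep that format.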


### Load embedding model

In [12]:
model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded ✅")

Model loaded ✅


### Create embeddings

In [15]:
embeddings = model.encode(
    texts,
    convert_to_numpy=True,
    show_progress_bar=False
).astype("float32")

print("Embeddings shape:", embeddings.shape)  # (n_chunks, 384)

Embeddings shape: (136, 384)


### Build FAISS index (cosine similarity)

In [18]:
# Cosine similarity via normalized vectors + inner product
faiss.normalize_L2(embeddings)

dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(embeddings)

print("FAISS index size:", index.ntotal)
print("FAISS dim:", dim)

FAISS index size: 136
FAISS dim: 384
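
To make the "normalize, then inner product" step concrete, here is a small NumPy-only sketch of the same idea: once every vector has unit length, the dot product equals cosine similarity, and an exhaustive scan over all rows yields the top-k. It illustrates the principle rather than how FAISS is implemented. The helper names `normalize_rows` and `cosine_top_k` and the random data are assumptions made for the example.

```python
import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Return a copy of x with each row scaled to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors


def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int):
    """Exhaustive cosine search: score every row, then keep the k best."""
    q = normalize_rows(query.reshape(1, -1))
    c = normalize_rows(corpus)
    scores = (c @ q.T).ravel()       # inner product of unit vectors = cosine
    k = min(k, len(scores))          # never ask for more rows than exist
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return scores[order], order


rng = np.random.default_rng(0)
corpus = rng.normal(size=(10, 384)).astype("float32")
query = corpus[3] * 2.5  # a rescaled copy of row 3: same direction, different length

scores, idxs = cosine_top_k(query, corpus, k=3)
print(idxs, scores)
```

Because cosine similarity ignores vector length, the rescaled query still matches row 3 with a score close to 1.0, while unrelated random rows score near 0.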


### Save FAISS + metadata mapping

In [21]:
OUT_DIR = PROCESSED_DIR / "faiss"
OUT_DIR.mkdir(parents=True, exist_ok=True)

INDEX_PATH = OUT_DIR / "arxiv_faiss.index"
META_PATH  = OUT_DIR / "arxiv_chunks_meta.parquet"

faiss.write_index(index, str(INDEX_PATH))

# Keep metadata lightweight + useful for retrieval
meta_cols = [c for c in ["chunk_id", "doc_id", "chunk_index", "title", "categories", "update_date", "chunk_text"] if c in df_chunks.columns]
df_meta = df_chunks[meta_cols].copy()
df_meta.to_parquet(META_PATH, index=False)

print("Saved index:", INDEX_PATH)
print("Saved metadata:", META_PATH)
print("Metadata columns:", meta_cols)

Saved index: C:\Users\vidus\Projects\RAG-LLM-Projects\AskMyDocs-RAG-Chatbot\data\processed\faiss\arxiv_faiss.index
Saved metadata: C:\Users\vidus\Projects\RAG-LLM-Projects\AskMyDocs-RAG-Chatbot\data\processed\faiss\arxiv_chunks_meta.parquet
Metadata columns: ['chunk_id', 'doc_id', 'chunk_index', 'title', 'categories', 'update_date', 'chunk_text']


## Quick retrieval test (sanity check)

In [24]:
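def search(query: str, k: int = 5) -> pd.DataFrame:
    # Cap k at the index size: FAISS pads missing neighbours with -1,
    # which iloc would silently read as the last row.
    k = min(k, index.ntotal)
    q = model.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)  # in-place, so the inner product equals cosine similarity
    scores, idxs = index.search(q, k)

    results = df_meta.iloc[idxs[0]].copy()
    results["score"] = scores[0]

    cols = ["score"] + [c for c in ["chunk_id", "doc_id", "title", "categories", "update_date", "chunk_text"] if c in results.columns]
    return results[cols]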

search("neural network for text classification", k=5)

| | score | chunk_id | doc_id | title | categories | update_date | chunk_text |
|---|---|---|---|---|---|---|---|
| 16 | 0.516908 | 704.1028__0 | 704.1028 | A neural network approach to ordinal regression | cs.LG cs.AI cs.NE | 2007-05-23 | Title: A neural network approach to ordinal re... |
| 83 | 0.456485 | 705.1209__1 | 705.1209 | Artificial Intelligence for Conflict Management | cs.AI | 2007-05-23 | neural networks. The results show that SVMs pr... |
| 82 | 0.410357 | 705.1209__0 | 705.1209 | Artificial Intelligence for Conflict Management | cs.AI | 2007-05-23 | Title: Artificial Intelligence for Conflict Ma... |
| 94 | 0.381806 | 705.2235__0 | 705.2235 | Response Prediction of Structural System Subje... | cs.AI | 2007-05-23 | Title: Response Prediction of Structural Syste... |
| 17 | 0.373325 | 704.1028__1 | 704.1028 | A neural network approach to ordinal regression | cs.LG cs.AI cs.NE | 2007-05-23 | ges of traditional neural networks: learning i... |


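## Retrieval Validation Successful

The FAISS vector index was created and tested with one sample semantic query.  
The top-5 results are all neural-network or AI papers, which is consistent with:

- Text embeddings generated by `all-MiniLM-L6-v2` (384 dimensions)
- Cosine similarity search working as intended (normalized vectors with an inner-product index)
- Metadata rows linking vectors back to their source documents
- Retrieved results matching the general topic of the query

This is a single-query sanity check, not a retrieval-quality evaluation. The best score is about 0.52 and none of the returned titles is specifically about text classification, which likely reflects the small 136-chunk corpus.

The semantic retrieval layer is ready for use in the next stage.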

---

## Generated Artifacts

The following reusable artifacts were saved:

- `data/processed/faiss/arxiv_faiss.index` — Vector index  
- `data/processed/faiss/arxiv_chunks_meta.parquet` — Metadata mapping  

These files allow semantic search without recomputing embeddings.

---

## Next Step: RAG Inference Pipeline

The next stage will implement the Retrieval-Augmented Generation workflow:

1. Load the saved FAISS index  
2. Load metadata mapping  
3. Retrieve top-k relevant chunks for a user query  
4. Construct a context-aware prompt  
5. Generate a grounded answer using an LLM  
6. Return the answer with source citations  
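
Steps 4 and 6 can be sketched without committing to any particular LLM. The example below is one possible way to turn retrieved chunks into a numbered-source prompt so the answer can cite `[n]` markers. The function `build_prompt`, the prompt wording, the `max_chars` limit, and the sample chunks are all hypothetical and not part of this notebook.

```python
from typing import Dict, List


def build_prompt(query: str, chunks: List[Dict[str, str]], max_chars: int = 1500) -> str:
    """Assemble a grounded prompt with numbered sources for citation.

    Each chunk dict is assumed to carry 'doc_id', 'title', and 'chunk_text',
    mirroring the metadata columns saved alongside the index.
    """
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        text = chunk["chunk_text"][:max_chars]  # crude per-chunk length cap
        sources.append(f"[{i}] {chunk['title']} (doc_id={chunk['doc_id']})\n{text}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Made-up chunks standing in for rows returned by the retrieval step.
example_chunks = [
    {"doc_id": "doc-a", "title": "Example title A", "chunk_text": "Example chunk text A."},
    {"doc_id": "doc-b", "title": "Example title B", "chunk_text": "Example chunk text B."},
]

prompt = build_prompt("What is X?", example_chunks)
print(prompt)
```

Numbering the sources in the prompt keeps citation simple: the `[n]` markers in the generated answer can be mapped back to `doc_id` values by position.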

---

This notebook completes the **Vector Store Construction Layer** of the AskMyDocs RAG system.

## Why the Vector Store Construction Layer Is Important

The Vector Store Construction Layer is the foundation of the Retrieval-Augmented Generation (RAG) pipeline. It transforms text chunks into searchable semantic representations, so that retrieval works on meaning and stays fast.

### Key Reasons

- **Enables Semantic Search**  
  Converts text into embeddings, allowing the system to retrieve results based on meaning rather than exact keyword matches.

- **Reduces Hallucinations**  
  Retrieved context grounds the LLM’s responses in real documents, which reduces (but does not eliminate) incorrect or fabricated answers.

- **Improves Accuracy and Relevance**  
  Cosine similarity search ranks chunks by how close their embeddings are to the query, so the chunks passed to the LLM are the ones the embedding model judges most related.

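- **Supports Scalability**  
  FAISS provides fast similarity search. The `IndexFlatIP` used here compares the query against every vector exactly, which is ample for 136 chunks; for millions of vectors, FAISS's approximate indexes (for example IVF or HNSW) trade a little recall for much lower latency.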

- **Promotes Modular Architecture**  
  Separates retrieval from generation, allowing independent upgrades to embedding models, vector databases, or LLMs.

- **Enables Reusability**  
  Once built, the saved FAISS index and metadata mapping can be reused without recomputing embeddings.

---

In summary, the Vector Store Construction Layer converts static documents into a dynamic, searchable knowledge base that powers the entire RAG system.