# Invest RAG — Vector Index Construction

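Financial QA is high-stakes: an answer that is not grounded in the source documents can mislead a decision.
Reliable retrieval reduces hallucination by supplying the model with the passages it should answer from.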

This notebook builds the semantic vector index used in the RAG pipeline.

In [None]:
from pathlib import Path

PROJECT_ROOT = Path.cwd()
assert (PROJECT_ROOT / "src").exists(), (
    f"Run this notebook from the project root (invest-rag/). Current cwd={PROJECT_ROOT}"
)
print("Project root:", PROJECT_ROOT)


Project root: c:\Users\CG\Desktop\invest-rag


## Step Overview

chunks.jsonl → embeddings → FAISS index → artifacts

All heavy logic is implemented inside `src/`.

## Concept: Embeddings

Embeddings map text into dense vectors.

Similarity between query and chunks is computed
in vector space rather than by keyword matching.

Implementation: `src/llm/embedding.py`

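## Concept: Cosine Similarity via FAISS

We use L2-normalized vectors with `IndexFlatIP`, FAISS's exact (brute-force) inner-product index.

For normalized vectors, the inner product equals cosine similarity:

cosine(u, v) = (u · v) / (‖u‖ ‖v‖) = u · v, since ‖u‖ = ‖v‖ = 1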

Implementation: `src/retrieval/build_vector_index.py`

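A flat index compares the query against every stored vector, so results are exact and reproducible. Search cost grows linearly with the number of vectors.

In large-scale production systems,
approximate indexes (IVF, HNSW) may be used instead, trading some recall for speed.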
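
To make the identity above concrete, here is a minimal pure-Python sketch of what an exhaustive inner-product search computes. It does not use FAISS or the project's `src/` code; the helper names (`to_unit`, `flat_ip_search`) and the 3-dimensional toy vectors are assumptions made up for illustration.

```python
import math

def to_unit(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Textbook cosine similarity on raw (unnormalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def flat_ip_search(query, vectors, k):
    """Exhaustive search: score every vector by inner product, keep the top k."""
    scored = [(dot(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return scored[:k]

# Toy 3-dimensional "embeddings" (illustrative values only).
raw_chunks = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
raw_query = [6.0, 8.0, 0.0]

chunks = [to_unit(v) for v in raw_chunks]
query = to_unit(raw_query)

hits = flat_ip_search(query, chunks, k=2)
for score, i in hits:
    # After normalization, the inner product matches cosine on the raw vectors.
    print(i, round(score, 4), round(cosine(raw_query, raw_chunks[i]), 4))
```

The printed pairs agree because normalization removes vector length from the comparison, leaving only direction.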

## Modular Architecture

This notebook does NOT:

- Create an OpenAI client
- Implement retry logic
- Perform embedding loops
- Normalize vectors manually
- Build FAISS index directly

All implementation details are encapsulated inside:

- `src/llm/embedding.py`
- `src/retrieval/build_vector_index.py`
- `src/retrieval/vector_store.py`

The notebook simply triggers the build.

In [2]:
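# Environment check: load the API key and embedding model name from .env.
# Note: `client` is not passed to any later cell; the build step below takes only file paths.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv(PROJECT_ROOT / ".env")  # explicit path to the project's .env file

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY not found. Create a .env file in the project root.")

# Fallback used only when EMBED_MODEL is unset. text-embedding-3-large returns
# 3072-dimensional vectors by default, while the build output below reports dim=1536,
# so the model or dimension setting actually used inside src/ presumably differs.
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-large")

client = OpenAI(api_key=OPENAI_API_KEY)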

## Build Vector Index

The function `build_vector_index()` performs:

- Deterministic chunk ordering
- Empty text filtering
- Batched embedding with retry
- Optional L2 normalization (needed for inner-product scores to equal cosine similarity)
- FAISS `IndexFlatIP` construction
- Embedding dimension integrity check
- Artifact persistence
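
The first few steps in this list can be sketched in plain Python. This is not the code in `src/retrieval/build_vector_index.py`: the field names (`chunk_id`, `text`), the helper names, the batch size, and the stand-in `fake_embed` function are all assumptions chosen for illustration.

```python
import json
import time

def load_chunks(lines):
    """Parse JSONL lines, drop empty texts, and sort by id for a stable order."""
    chunks = [json.loads(line) for line in lines if line.strip()]
    chunks = [c for c in chunks if c.get("text", "").strip()]
    return sorted(chunks, key=lambda c: c["chunk_id"])

def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_with_retry(embed_fn, texts, max_attempts=3, base_delay=0.0):
    """Call embed_fn, retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # use a delay > 0 in practice

def embed_all(chunks, embed_fn, batch_size=2):
    """Embed all chunks in batches and verify a single, consistent dimension."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_with_retry(embed_fn, [c["text"] for c in batch]))
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"expected exactly one embedding dimension, got {sorted(dims)}")
    return vectors, dims.pop()

calls = {"n": 0}

def fake_embed(texts):
    """Hypothetical stand-in for an embedding API: fails once, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated transient error")
    return [[float(len(t)), 1.0] for t in texts]

lines = [
    '{"chunk_id": "b", "text": "Revenue grew."}',
    '{"chunk_id": "a", "text": "Risk factors."}',
    '{"chunk_id": "c", "text": "   "}',  # whitespace-only text is filtered out
]
chunks = load_chunks(lines)
vectors, dim = embed_all(chunks, fake_embed)
print([c["chunk_id"] for c in chunks], dim, calls["n"])
```

Sorting before embedding means row `i` of the index always corresponds to the same chunk across rebuilds, which is what makes the metadata file line up with the vectors.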

In [3]:
CHUNKS_PATH = PROJECT_ROOT / "data" / "processed" / "chunks.jsonl"
INDEX_DIR   = PROJECT_ROOT / "indexes" / "faiss"
INDEX_PATH  = INDEX_DIR / "index.bin"
META_PATH   = INDEX_DIR / "meta.jsonl"

In [4]:
from src.retrieval.build_vector_index import build_vector_index

result = build_vector_index(
    chunks_path=CHUNKS_PATH,
    index_path=INDEX_PATH,
    meta_path=META_PATH,
)

print(result)

Embedding: 100%|██████████| 68/68 [00:22<00:00,  2.96it/s]


BuildIndexResult(index_path='c:\\Users\\CG\\Desktop\\invest-rag\\indexes\\faiss\\index.bin', meta_path='c:\\Users\\CG\\Desktop\\invest-rag\\indexes\\faiss\\meta.jsonl', n_vectors=4297, dim=1536)


## Retrieval Test

We run a sample query against the saved index to check
that semantically relevant chunks are retrieved.

In [5]:
from src.llm.embedding import embed_query
from src.retrieval.vector_store import VectorStore, l2_normalize

# Load the persisted index and its per-vector metadata
vs = VectorStore.load(index_path=INDEX_PATH, meta_path=META_PATH)
# Normalize the query vector so inner-product scores are cosine similarities
qv = l2_normalize(embed_query("..."))
results = vs.search_by_vector(qv, k=5)
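
The cell above runs the search but does not inspect what came back. A minimal sketch of a sanity check follows. The structure returned by `search_by_vector` is not shown in this notebook, so the hit format here (dicts with `score`, `chunk_id`, and `text`), the helper names, and the sample values are all assumptions.

```python
def check_hits(hits, k, tol=1e-5):
    """Return a list of problems found in a ranked hit list (empty means OK)."""
    problems = []
    if len(hits) > k:
        problems.append(f"expected at most {k} hits, got {len(hits)}")
    scores = [h["score"] for h in hits]
    if any(s < -1 - tol or s > 1 + tol for s in scores):
        problems.append("score outside [-1, 1]; vectors may not be normalized")
    if any(a < b for a, b in zip(scores, scores[1:])):
        problems.append("scores are not sorted in descending order")
    return problems

def format_hits(hits, max_chars=40):
    """One line per hit: rank, score, chunk id, and a text preview."""
    return [
        f"{rank}. {h['score']:.3f}  {h['chunk_id']}  {h['text'][:max_chars]}"
        for rank, h in enumerate(hits, start=1)
    ]

# Made-up hits for illustration only; not output from the real index.
sample_hits = [
    {"score": 0.812, "chunk_id": "doc1-0007", "text": "Example passage about supply constraints."},
    {"score": 0.644, "chunk_id": "doc1-0012", "text": "Example passage about manufacturing partners."},
]
bad_hits = [
    {"score": 1.7, "chunk_id": "x", "text": "unnormalized"},
    {"score": 2.3, "chunk_id": "y", "text": "unnormalized"},
]

print(check_hits(sample_hits, k=5))
print("\n".join(format_hits(sample_hits)))
print(check_hits(bad_hits, k=5))
```

With cosine scores, any value outside [-1, 1] is a quick signal that either the stored vectors or the query vector skipped normalization.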

## Persisted Artifacts

The build step automatically saves:

- `index.bin`
- `meta.jsonl`
- `build_config.json`
- `chunks_manifest.json`

These ensure reproducibility and traceability.

Implementation: `src/retrieval/build_vector_index.py`
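
The notebook does not show what the two JSON artifacts contain. As a sketch under assumptions, a build config and a chunk manifest might record the settings and an input fingerprint as below. Every field name, the `write_build_records` helper, and the demo values are hypothetical, not the project's actual schema.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of_file(path):
    """Hash a file in blocks so large inputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def write_build_records(out_dir, chunks_path, embed_model, dim, chunk_ids):
    """Write a config and a manifest describing one index build (assumed schema)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    config = {
        "embed_model": embed_model,
        "dim": dim,
        "normalized": True,
        "index_type": "IndexFlatIP",
    }
    manifest = {
        "chunks_file": Path(chunks_path).name,
        "chunks_sha256": sha256_of_file(chunks_path),
        "n_chunks": len(chunk_ids),
        "chunk_ids": list(chunk_ids),
    }
    (out_dir / "build_config.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
    (out_dir / "chunks_manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return config, manifest

# Demo in a temporary directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    chunks_file = Path(tmp) / "chunks.jsonl"
    chunks_file.write_text('{"chunk_id": "a", "text": "Risk factors."}\n', encoding="utf-8")
    config, manifest = write_build_records(Path(tmp) / "faiss", chunks_file, "demo-model", 2, ["a"])
    saved = json.loads((Path(tmp) / "faiss" / "chunks_manifest.json").read_text(encoding="utf-8"))
    rebuilt_hash = sha256_of_file(chunks_file)

print(saved["n_chunks"], saved["chunks_sha256"] == rebuilt_hash)
```

Storing a hash of the input file lets a later run detect that `chunks.jsonl` changed after the index was built, which is one way such artifacts can support traceability.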

### Save Retrieval Debug Logs

Debug logs help analyze:

- retrieval failures
- false positives
- ranking quality

Production RAG systems typically monitor the same signals.

In [6]:
from src.eval.retrieval_logging import log_retrieval

LOG_PATH = PROJECT_ROOT / "logs" / "retrieval_debug.jsonl"

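# NOTE: `results` was computed in the retrieval test above; keep `query` identical to the
# text passed to embed_query() there so the log pairs each query with its own hits.
query = "What are the main supply chain risks mentioned by NVIDIA?"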

log_retrieval(
    log_path=LOG_PATH,
    query=query,
    results=results,
    comment="Initial retrieval check"
)

print(f"✅ Retrieval log saved to: {LOG_PATH}")

✅ Retrieval log saved to: c:\Users\CG\Desktop\invest-rag\logs\retrieval_debug.jsonl
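
The implementation of `log_retrieval` lives in `src/eval/retrieval_logging.py` and is not shown here. As a rough sketch of the append-only JSONL pattern such a logger could follow, the example below writes one record per query. The record fields, the hit format, and the helper names are assumptions, not the project's API.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def append_retrieval_record(log_path, query, hits, comment=""):
    """Append one JSON object per line describing a query and its ranked hits."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "comment": comment,
        "hits": [
            {"rank": rank, "chunk_id": h["chunk_id"], "score": h["score"]}
            for rank, h in enumerate(hits, start=1)
        ],
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

def read_records(log_path):
    """Load every record back, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Demo with made-up hits, written to a temporary directory.
demo_hits = [
    {"chunk_id": "doc1-0007", "score": 0.812},
    {"chunk_id": "doc1-0012", "score": 0.644},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "logs" / "retrieval_debug.jsonl"
    append_retrieval_record(demo_log, "first query", demo_hits, comment="demo")
    append_retrieval_record(demo_log, "second query", demo_hits[:1])
    records = read_records(demo_log)

print(len(records), [r["query"] for r in records])
```

Appending one line per query keeps earlier runs intact, so rankings for the same query can be compared across index rebuilds.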
