# Invest-RAG

## Motivation

Financial questions are high-stakes:
LLM hallucinations can mislead investment decisions.

This project builds a Retrieval-Augmented Generation system
that grounds answers in financial documents.

All notebooks must be executed from the project root (invest-rag/).
The project uses editable install mode (`pip install -e .`) to resolve imports.

In [None]:
from pathlib import Path

PROJECT_ROOT = Path.cwd()
assert (PROJECT_ROOT / "src").exists(), (
    f"Run this notebook from the project root (invest-rag/). Current cwd={PROJECT_ROOT}"
)

print("Project root:", PROJECT_ROOT)

Project root: c:\Users\CG\Desktop\invest-rag


# 00. Setup & Ingest

## Goal
Build a reproducible data pipeline that prepares documents for RAG:
raw docs → cleaned text → sentence split → chunks (+ metadata)

## Why this step matters
Most RAG failures come from data issues:
- inconsistent schema
- missing metadata
- noisy or duplicated text

A clean ingestion pipeline improves downstream retrieval and evaluation.

## Pipeline Overview
1) Load raw documents  
2) Validate schema (data contract)  
3) Clean + normalize text  
4) Sentence split + chunking  
5) Save chunk dataset + manifests for reproducibility

## Outputs (Artifacts)
- `data/processed/chunks.jsonl` : chunk records with metadata
- `data/processed/chunks_manifest.json` : summary stats + provenance
- `data/processed/build_config.json` : pipeline configuration used in this run

## Checkpoints
- #docs ingested
- #chunks generated
- avg chunk length
- 1–2 sample chunks preview

In [8]:
from scripts.init_project import make_project

make_project(PROJECT_ROOT)

✅ Project initialized at: c:\Users\CG\Desktop\invest-rag


## 1. Define a Data Contract  
Consistency of data format is critical for a RAG pipeline.  
This cell documents the schema for input documents (`news_summary.jsonl`, `report_excerpt.jsonl`, `disclosure_note.jsonl`) so that chunking, embedding, and evaluation follow the same contract.

A clear data contract:
- prevents silent bugs
- makes evaluation fair
- allows scaling to larger datasets
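
A contract like this can also be enforced in code before chunking. The following is a minimal sketch, not the project's actual validator: the required field names and types are assumptions inferred from the keys the chunking step reads (`doc_id`, `content`, `company`, `year`, `section`, `source`), and `validate_doc` is a hypothetical helper.

```python
# Assumed contract: field names mirror the keys used by the chunking code;
# the types (for example, year as int) are a guess and should be adjusted.
REQUIRED_FIELDS = {
    "doc_id": str,
    "content": str,
    "company": str,
    "year": int,
    "section": str,
    "source": str,
}

def validate_doc(doc: dict) -> list:
    """Return a list of contract violations (an empty list means the doc is valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, got {type(doc[field]).__name__}"
            )
    # A document with whitespace-only content would produce no useful chunks.
    if isinstance(doc.get("content"), str) and not doc["content"].strip():
        errors.append("content is empty")
    return errors

# Made-up records for illustration only.
good = {
    "doc_id": "acme_2024_item_1", "content": "Acme sells widgets.",
    "company": "Acme", "year": 2024, "section": "item_1", "source": "10-K",
}
bad = {"doc_id": "acme_2024_item_1a", "content": "  ", "year": "2024"}

print(validate_doc(good))
print(validate_doc(bad))
```

Collecting all violations per document, instead of raising on the first one, makes it easier to report every schema problem in a dataset in a single pass.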

# Ingest

## 2. Cleaning + chunking
Chunking is one of the most impactful factors in RAG quality.

Poor chunking → poor retrieval → hallucinated answers.

Embedding long documents directly can hurt retrieval quality, so we split them into manageable chunks.  
This cell applies light cleaning (whitespace/header removal), chunks by sentence grouping, and generates reproducible `chunk_id`s.
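
To make the sentence-grouping and overlap behavior concrete, here is a simplified standalone sketch on a toy input. `group_sentences` is a hypothetical helper written for illustration, not the notebook's `chunk_sentences`, and the sample text and `max_chars=35` are made up so the effect is visible on four short sentences.

```python
import re

def group_sentences(sentences, max_chars, overlap=1):
    """Greedily pack sentences into chunks of roughly max_chars,
    carrying the last `overlap` sentences into the next chunk."""
    chunks, current = [], []
    for sent in sentences:
        candidate = current + [sent]
        if current and len(" ".join(candidate)) > max_chars:
            # Close the current chunk, then seed the next one with the overlap.
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
            candidate = current + [sent]
        current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "Revenue rose 10%. Margins fell. Guidance was cut. Shares dropped."
sentences = re.split(r"(?<=[.!?])\s+", text)
chunks = group_sentences(sentences, max_chars=35, overlap=1)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts with the last sentence of the previous one, so a fact that straddles a boundary still appears intact in at least one chunk. In this sketch the carried-over sentence plus the next sentence is not re-checked against `max_chars`, so a chunk can exceed the limit when sentences are long.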

In [None]:
import hashlib
import json
import re

HEADER_PATTERNS = [
    r"^\s*요약\s*[:：]\s*",     # Korean "Summary" header
    r"^\s*Summary\s*[:：]\s*",
    r"^\s*핵심\s*[:：]\s*",     # Korean "Key points" header
]

def clean_text(text: str) -> str:
    if not text:
        return ""
    t = text.strip()
    t = t.replace("\u00a0", " ")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\n{3,}", "\n\n", t)
    for pat in HEADER_PATTERNS:
        t = re.sub(pat, "", t)
    return t.strip()

def split_sentences(text: str):
    '''
    Sentence splitting (prototype baseline)

    Sentence splitting uses a simple regex-based splitter as a lightweight baseline.
    This is sufficient for the synthetic dataset, and can be swapped later with a more robust tokenizer/sentence segmenter if needed.
    '''
    # Simple sentence splitter for mixed Korean/English text 
    t = re.sub(r"\s*\n\s*", " ", text.strip())
    if not t:
        return []
    # Sentence boundary candidates: ".", "!", "?", and the Korean sentence ending "다."
    parts = re.split(r"(?<=[\.\!\?])\s+|(?<=다\.)\s+", t)
    return [p.strip() for p in parts if p.strip()]

def chunk_sentences(sents, max_chars=450, overlap_sents=1):
    chunks = []
    cur = []
    cur_len = 0

    def flush():
        nonlocal cur, cur_len
        if cur:
            chunks.append(" ".join(cur).strip())
        cur = []
        cur_len = 0

    for s in sents:
        if not cur:
            cur = [s]
            cur_len = len(s)
            continue

        if cur_len + 1 + len(s) <= max_chars:
            cur.append(s)
            cur_len += 1 + len(s)
        else:
            flush()
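            # Overlap strategy: start the next chunk with the last `overlap_sents` sentences of the previous chunk.
            # Note: the carried-over prefix plus `s` is not re-checked against max_chars.
            if overlap_sents > 0 and chunks:
                prev_sents = split_sentences(chunks[-1])
                # Slicing already returns the whole list when it is shorter than overlap_sents.
                prefix = prev_sents[-overlap_sents:]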
                cur = prefix + [s]
                cur_len = sum(len(x) for x in cur) + (len(cur)-1)
            else:
                cur = [s]
                cur_len = len(s)

    flush()
    return chunks

def make_chunk_id(doc_id: str, chunk_index: int, chunk_text: str) -> str:
    '''
    Deterministic chunk IDs

    Each `chunk_id` has the form `{doc_id}_c{chunk_index}_{hash}`, where `hash` is the first
    12 hex characters of `sha1(doc_id | chunk_index | chunk_text)`.
    The same document and chunk text always yield the same ID, which keeps evaluation and debugging reproducible across runs.
    '''
    h = hashlib.sha1(f"{doc_id}|{chunk_index}|{chunk_text}".encode("utf-8")).hexdigest()[:12]
    return f"{doc_id}_c{chunk_index:02d}_{h}"

def doc_to_chunks(doc, max_chars=450, overlap_sents=1):
    '''
    Chunk size choice (max_chars=450, overlap_sents=1)

    We use ~450 characters per chunk to balance retrieval granularity and context density:
    smaller chunks improve precision but may lose key evidence; larger chunks improve recall but dilute similarity signals.
    A 1-sentence overlap reduces boundary effects (facts split across chunks) with minimal duplication cost.
    '''
    content = clean_text(doc.get("content", ""))
    sents = split_sentences(content)

    if len(content) <= max_chars:
        chunk_texts = [content] if content else []
    else:
        chunk_texts = chunk_sentences(sents, max_chars=max_chars, overlap_sents=overlap_sents)

    out = []
    meta = {k: doc.get(k) for k in ["doc_id", "company", "year", "section", "source"]}
    for i, ct in enumerate(chunk_texts):
        out.append({
            "chunk_id": make_chunk_id(doc["doc_id"], i, ct),
            "chunk_index": i,
            "text": ct,
            "metadata": meta
        })
    return out

all_chunks = []

def load_jsonl(path: str):
    docs = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                docs.append(json.loads(line))
    return docs

docs = load_jsonl("data/samples/sec_docs.jsonl")

for d in docs:
    all_chunks.extend(doc_to_chunks(d, max_chars=450, overlap_sents=1))

print("Total chunks:", len(all_chunks))
print("Example chunk keys:", all_chunks[0].keys())
print("Example chunk_id:", all_chunks[0]["chunk_id"])

Total chunks: 4297
Example chunk keys: dict_keys(['chunk_id', 'chunk_index', 'text', 'metadata'])
Example chunk_id: nvidia_2024_item_1_business_c00_937b0a34d837
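
The determinism of the IDs can be checked directly. The snippet below restates `make_chunk_id` so that it runs on its own; the `demo_doc` ID and the sample sentences are made up for illustration.

```python
import hashlib

def make_chunk_id(doc_id: str, chunk_index: int, chunk_text: str) -> str:
    # Same logic as the notebook cell, repeated so this snippet is standalone.
    h = hashlib.sha1(f"{doc_id}|{chunk_index}|{chunk_text}".encode("utf-8")).hexdigest()[:12]
    return f"{doc_id}_c{chunk_index:02d}_{h}"

a = make_chunk_id("demo_doc", 0, "Revenue grew in 2024.")
b = make_chunk_id("demo_doc", 0, "Revenue grew in 2024.")
c = make_chunk_id("demo_doc", 0, "Revenue grew in 2024!")

print("same input, same ID:", a == b)
print("changed text, same ID:", a == c)
print("hash suffix length:", len(a.rsplit("_", 1)[1]))
```

Because the chunk text is part of the hash input, any change to cleaning or chunking parameters that alters a chunk's text also changes its ID, so stale evaluation labels show up as missing IDs instead of silently pointing at different text.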


## 3. Sanity check & statistics
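We check the chunk count, the length distribution, and the per-document chunk counts to catch extreme cases such as very short fragments or chunks far longer than `max_chars`.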


In [6]:
import json
from collections import Counter

CHUNKS_PATH = PROJECT_ROOT / "data/processed/chunks.jsonl"
CHUNKS_PATH.parent.mkdir(parents=True, exist_ok=True)

lengths = [len(c["text"]) for c in all_chunks]

print("📊 Chunk stats")
print("Chunks count:", len(lengths))
print(
    "Min/Median/Max length:",
    min(lengths),
    sorted(lengths)[len(lengths)//2],
    max(lengths)
)

cnt = Counter([c["metadata"]["doc_id"] for c in all_chunks])
print("\n📊 Chunks per doc (top 5):")
print(cnt.most_common(5))

📊 Chunk stats
Chunks count: 4297
Min/Median/Max length: 7 412 5228

📊 Chunks per doc (top 5):
[('meta_2024_item_1a_risk_factors', 821), ('amd_2024_item_1a_risk_factors', 557), ('nvidia_2024_item_1a_risk_factors', 448), ('microsoft_2024_item_1a_risk_factors', 368), ('apple_2024_item_1a_risk_factors', 322)]
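
The maximum length above (5228) is far over `max_chars=450`. One plausible cause, given that the chunker never splits inside a sentence, is text with no sentence boundary that the regex recognizes. The following sketch shows one way to flag such chunks and fall back to a whitespace-based split. `find_oversized` and `hard_split` are hypothetical helpers, the demo records are made up, and `textwrap.wrap` is used here as a stand-in fallback, not something the pipeline currently does.

```python
import textwrap

def find_oversized(chunks, max_chars=450):
    """Return (chunk_id, length) pairs for chunks whose text exceeds max_chars."""
    return [(c["chunk_id"], len(c["text"])) for c in chunks if len(c["text"]) > max_chars]

def hard_split(text, max_chars=450):
    """Fallback splitter: break on whitespace so every piece fits in max_chars.

    Note: textwrap.wrap also normalizes whitespace (newlines become spaces).
    """
    return textwrap.wrap(text, width=max_chars)

demo_chunks = [
    {"chunk_id": "demo_c00", "text": "Short chunk."},
    {"chunk_id": "demo_c01", "text": "word " * 200},  # 1000 chars, no sentence boundary
]

oversized = find_oversized(demo_chunks, max_chars=450)
pieces = hard_split(demo_chunks[1]["text"], max_chars=450)

print("Oversized:", oversized)
print("Piece lengths after fallback split:", [len(p) for p in pieces])
```

A whitespace split loses sentence-level coherence, so it is best treated as a last resort for chunks that the sentence splitter cannot break down.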


## 4. Save & Summary
We save the chunks to `chunks.jsonl`, then write a manifest and the build configuration alongside it.

Future work: evaluate chunk size and overlap hyperparameters using retrieval metrics (Recall@k, MRR).
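
As a reference for that evaluation, the two metrics can be computed as in the sketch below. It assumes an evaluation set that pairs each query's ranked retrieval results with gold relevant chunk IDs; no such set exists in this notebook, so the `runs` data and the function names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID (1-based), or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical evaluation set: ranked results and gold chunk IDs per query.
runs = [
    {"retrieved": ["c3", "c1", "c7"], "relevant": ["c1"]},
    {"retrieved": ["c2", "c9", "c4"], "relevant": ["c4", "c5"]},
]

mean_recall = sum(recall_at_k(r["retrieved"], r["relevant"], k=3) for r in runs) / len(runs)
mrr = sum(reciprocal_rank(r["retrieved"], r["relevant"]) for r in runs) / len(runs)

print(f"Recall@3={mean_recall:.3f}  MRR={mrr:.3f}")
```

Running the same evaluation set against chunk datasets built with different `max_chars` and `overlap_sents` values would give a like-for-like comparison, provided the gold labels are mapped to the chunk IDs of each build.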

In [7]:
# Save chunks.jsonl
with CHUNKS_PATH.open("w", encoding="utf-8") as f:
    for c in all_chunks:
        f.write(json.dumps(c, ensure_ascii=False) + "\n")

print("✅ Saved chunks:", CHUNKS_PATH)

✅ Saved chunks: c:\Users\CG\Desktop\invest-rag\data\processed\chunks.jsonl


In [9]:
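from datetime import datetime, timezone

MANIFEST_PATH = PROJECT_ROOT / "data/processed/chunks_manifest.json"
CONFIG_PATH   = PROJECT_ROOT / "data/processed/build_config.json"

manifest = {
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),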
    "n_docs": len(docs),
    "n_chunks": len(all_chunks),
    "chunk_max_chars": 450,
    "overlap_sents": 1,
    "inputs": "sec_docs.jsonl",
}

build_config = {
    "cleaning": {"header_patterns": HEADER_PATTERNS},
    "sentence_split": "regex-based (prototype)",
    "chunking": {"max_chars": 450, "overlap_sents": 1},
    "outputs": {
        "chunks_jsonl": str(CHUNKS_PATH),
        "manifest": str(MANIFEST_PATH),
        "build_config": str(CONFIG_PATH),
    },
}

MANIFEST_PATH.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
CONFIG_PATH.write_text(json.dumps(build_config, ensure_ascii=False, indent=2), encoding="utf-8")

print("✅ Saved manifest:", MANIFEST_PATH)
print("✅ Saved build config:", CONFIG_PATH)

✅ Saved manifest: c:\Users\CG\Desktop\invest-rag\data\processed\chunks_manifest.json
✅ Saved build config: c:\Users\CG\Desktop\invest-rag\data\processed\build_config.json



In [None]:
from pathlib import Path
import json

PROCESSED_DIR = PROJECT_ROOT / "data/processed"

ARTIFACTS = {
    "chunks_jsonl": PROCESSED_DIR / "chunks.jsonl",
    "manifest": PROCESSED_DIR / "chunks_manifest.json",
    "build_config": PROCESSED_DIR / "build_config.json",
}

print("📦 Final Artifacts")
print("-" * 40)

for name, path in ARTIFACTS.items():
    exists = path.exists()
    size = path.stat().st_size if exists else 0
    print(f"{name:15} | exists={exists} | size={size} bytes | {path}")

# Optional: preview manifest
if ARTIFACTS["manifest"].exists():
    manifest = json.loads(ARTIFACTS["manifest"].read_text(encoding="utf-8"))
    print("\n🧾 Manifest Summary:")
    for k, v in manifest.items():
        print(f"  {k}: {v}")

📦 Final Artifacts
----------------------------------------
chunks_jsonl    | exists=True | size=466302254 bytes | c:\Users\CG\Desktop\invest-rag\data\processed\chunks.jsonl
manifest        | exists=True | size=170 bytes | c:\Users\CG\Desktop\invest-rag\data\processed\chunks_manifest.json
build_config    | exists=True | size=585 bytes | c:\Users\CG\Desktop\invest-rag\data\processed\build_config.json

🧾 Manifest Summary:
  created_at: 2026-03-01T09:41:07.709914Z
  n_docs: 15
  n_chunks: 4297
  chunk_max_chars: 450
  overlap_sents: 1
  inputs: sec_docs.jsonl
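
Beyond checking that the files exist, a later notebook could cross-check the artifacts against each other. The sketch below is one possible consistency check, not part of the current pipeline: `verify_artifacts` is a hypothetical helper, and the demo runs on throwaway files with made-up records so it does not touch `data/processed/`.

```python
import json
import tempfile
from pathlib import Path

def verify_artifacts(chunks_path: Path, manifest_path: Path) -> dict:
    """Compare chunks.jsonl against its manifest and check chunk_id uniqueness."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    ids = []
    with chunks_path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                ids.append(json.loads(line)["chunk_id"])
    return {
        "count_matches": len(ids) == manifest["n_chunks"],
        "ids_unique": len(ids) == len(set(ids)),
    }

# Demo on temporary files with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    chunks_path = Path(tmp) / "chunks.jsonl"
    manifest_path = Path(tmp) / "chunks_manifest.json"
    rows = [{"chunk_id": "demo_c00", "text": "a"}, {"chunk_id": "demo_c01", "text": "b"}]
    chunks_path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    manifest_path.write_text(json.dumps({"n_chunks": 2}), encoding="utf-8")
    report = verify_artifacts(chunks_path, manifest_path)

print(report)
```

Reading the file line by line keeps memory use flat, which matters when `chunks.jsonl` is large.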
