# Data Preparation for RAG (AskMyDocs)

## Goal
Convert arXiv metadata into a clean, filtered, chunked corpus suitable for Retrieval-Augmented Generation (RAG).

### Outputs
1. Clean document table (1 row per paper)
2. Chunked corpus table (many rows per paper)

### What this notebook does
- Loads raw arXiv metadata
- Cleans text fields (title, abstract)
- Parses categories into list form
- Filters dataset to target categories
- Builds a canonical `doc_text` field for embedding
- Chunks text into retrieval-friendly segments
- Saves processed datasets to `data/processed/`

In [2]:
#import modules
import os
import re
import json
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd

pd.set_option("display.max_colwidth", 120)
pd.set_option("display.max_columns", 200)

In [8]:
# --- Project paths (adjust if your repo differs) ---
PROJECT_ROOT = Path("..")  # if notebook is in /notebooks
RAW_DIR = PROJECT_ROOT / "data" / "raw"
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"

PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# --- Input file ---
# Update this to your actual raw file name
RAW_FILE = RAW_DIR / "arxiv-metadata-oai-snapshot.json"

# --- Target categories (from your EDA) ---
TARGET_CATS = {"cs.CL", "cs.LG", "cs.AI"}

# --- Basic filtering thresholds ---
MIN_ABSTRACT_CHARS = 200  # tighten/loosen as needed

# --- Chunking config ---
CHUNK_SIZE_CHARS = 1200   # simple, fast baseline
CHUNK_OVERLAP_CHARS = 200

# --- Output names ---
RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M")
DOCS_OUT = PROCESSED_DIR / f"arxiv_docs_clean_{RUN_TAG}.csv"
CHUNKS_OUT = PROCESSED_DIR / f"arxiv_chunks_{RUN_TAG}.csv"
SCHEMA_OUT = PROCESSED_DIR / f"schema_{RUN_TAG}.json"

In [10]:
print("Current working directory:", Path.cwd())
print("RAW_DIR exists?", RAW_DIR.exists())
print("RAW_FILE exists?", RAW_FILE.exists())

if RAW_DIR.exists():
    print("\nFiles inside raw folder:")
    for f in RAW_DIR.iterdir():
        print(" -", f.name)

Current working directory: C:\Users\vidus\Projects\RAG-LLM-Projects\AskMyDocs-RAG-Chatbot\notebooks
RAW_DIR exists? True
RAW_FILE exists? True

Files inside raw folder:
 - .gitkeep
 - arxiv-metadata-oai-snapshot.json


In [12]:
import pandas as pd

# Load a sample first (full file is huge)
df = pd.read_json(RAW_FILE, lines=True, nrows=10000)

print("Shape:", df.shape)
display(df.head(2))
print("\nColumns:", df.columns.tolist())

Shape: (10000, 14)


Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",Calculation of prompt diphoton production cross sections at Tevatron and\n LHC energies,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massiv...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007 19:18:42 GMT'}, {'version': 'v2', 'created': 'Tue, 24 Jul 2007 20:10:...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky, P. M., ], [Yuan, C. -P., ]]"
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib/1.0/,"We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use\nit obtain a characterization of the ...","[{'version': 'v1', 'created': 'Sat, 31 Mar 2007 02:26:18 GMT'}, {'version': 'v2', 'created': 'Sat, 13 Dec 2008 17:26...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"



Columns: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed']
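
The preview above shows `id` values such as `704.0001`, which suggests the `id` column was inferred as a number and lost its leading zero (arXiv identifiers of that era are strings of the form `0704.0001`). A minimal sketch of one way to keep identifiers as strings is to parse the first lines with the standard `json` module. The helper name `read_jsonl_head` and the in-memory sample records are illustrative assumptions, not part of the notebook.

```python
import io
import json

def read_jsonl_head(handle, n_rows):
    """Parse up to n_rows non-empty lines of a JSON Lines stream into dicts."""
    records = []
    for line in handle:
        if len(records) >= n_rows:
            break
        line = line.strip()
        if line:
            # json.loads keeps JSON strings as Python str, so "0704.0001" stays intact
            records.append(json.loads(line))
    return records

# Tiny in-memory stand-in for the snapshot file (illustrative records only)
sample = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph"}\n'
    '{"id": "0704.0050", "categories": "cs.NE cs.AI"}\n'
    '{"id": "0704.0304", "categories": "cs.IT cs.AI math.IT q-bio.PE"}\n'
)
records = read_jsonl_head(sample, n_rows=2)
ids = [r["id"] for r in records]
print(ids)
```

The resulting list of dicts can be handed to `pd.DataFrame(records)`. `pd.read_json` also accepts a `dtype` mapping (for example `dtype={"id": str}`), which should have the same effect, though that is not verified here.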


## Clean and Select Only What Is Needed

In [15]:
import re

KEEP_COLS = [
    "id", "title", "abstract", "categories",
    "update_date", "authors", "authors_parsed"
]

df_base = df[KEEP_COLS].copy()

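# Collapse every run of whitespace (spaces, tabs, real newlines) into one space.
# Only actual whitespace is matched: a literal backslash followed by "n" is left
# alone so that LaTeX commands such as \nu or \nabla in abstracts stay intact.
_ws = re.compile(r"\s+")

def clean_text(x):
    if pd.isna(x):
        return ""
    return _ws.sub(" ", str(x)).strip()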

df_base["title_clean"] = df_base["title"].apply(clean_text)
df_base["abstract_clean"] = df_base["abstract"].apply(clean_text)
df_base["abstract_len"] = df_base["abstract_clean"].str.len()

df_base.head(2)

Unnamed: 0,id,title,abstract,categories,update_date,authors,authors_parsed,title_clean,abstract_clean,abstract_len
0,704.0001,Calculation of prompt diphoton production cross sections at Tevatron and\n LHC energies,A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massiv...,hep-ph,2008-11-26,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan","[[Balázs, C., ], [Berger, E. L., ], [Nadolsky, P. M., ], [Yuan, C. -P., ]]",Calculation of prompt diphoton production cross sections at Tevatron and LHC energies,A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive p...,980
1,704.0002,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use\nit obtain a characterization of the ...",math.CO cs.CG,2008-12-13,Ileana Streinu and Louis Theran,"[[Streinu, Ileana, ], [Theran, Louis, ]]",Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use it obtain a characterization of the fam...",795
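
The cleaning step replaces line breaks and repeated spaces with single spaces. As a quick self-check, the sketch below applies the same idea to the first title from the sample and to a made-up LaTeX fragment, to confirm that commands starting with `\n` (such as `\nu`) are not damaged when only real whitespace is matched. The function name `normalize_whitespace` is a hypothetical stand-in that avoids the pandas dependency.

```python
import re

_WHITESPACE = re.compile(r"\s+")

def normalize_whitespace(value):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    # NaN is the only value that is not equal to itself
    if value is None or value != value:
        return ""
    return _WHITESPACE.sub(" ", str(value)).strip()

raw_title = (
    "Calculation of prompt diphoton production cross sections at Tevatron and\n"
    "  LHC energies"
)
cleaned_title = normalize_whitespace(raw_title)

# Made-up fragment: "\\n" is a literal backslash + n (LaTeX), "\n" is a real newline
latex_snippet = normalize_whitespace("decay rate of $\\nu_e$\nand  $\\nabla f$")

print(cleaned_title)
print(latex_snippet)
```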


### Filter to Target CS Categories

In [18]:
TARGET_CATS = {"cs.CL", "cs.LG", "cs.AI"}
MIN_ABSTRACT_CHARS = 200

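def parse_categories(cat_str):
    # arXiv stores categories as one whitespace-separated string, e.g. "cs.NE cs.AI"
    if pd.isna(cat_str):
        return []
    return cat_str.split()  # split() also tolerates repeated or stray whitespace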

df_base["categories_list"] = df_base["categories"].apply(parse_categories)

def has_target_cat(cat_list):
    return any(c in TARGET_CATS for c in cat_list)

df_base["has_target_cat"] = df_base["categories_list"].apply(has_target_cat)

print("Total rows:", len(df_base))
print("Rows with target categories:", df_base["has_target_cat"].sum())
print(f"Rows with abstract >= {MIN_ABSTRACT_CHARS} chars:", (df_base["abstract_len"] >= MIN_ABSTRACT_CHARS).sum())

df_rag = df_base[
    (df_base["has_target_cat"]) &
    (df_base["abstract_len"] >= MIN_ABSTRACT_CHARS)
].copy()

df_rag = df_rag.drop_duplicates(subset=["id"]).reset_index(drop=True)

print("Filtered df_rag shape:", df_rag.shape)
df_rag.head(2)

Total rows: 10000
Rows with target categories: 76
Rows with abstract >= 200 chars: 9659
Filtered df_rag shape: (76, 12)


Unnamed: 0,id,title,abstract,categories,update_date,authors,authors_parsed,title_clean,abstract_clean,abstract_len,categories_list,has_target_cat
0,704.0047,Intelligent location of simultaneously active acoustic emission sources:\n Part I,"The intelligent acoustic emission locator is described in Part I, while Part\nII discusses blind source separation...",cs.NE cs.AI,2009-09-29,T. Kosel and I. Grabec,"[[Kosel, T., ], [Grabec, I., ]]",Intelligent location of simultaneously active acoustic emission sources: Part I,"The intelligent acoustic emission locator is described in Part I, while Part II discusses blind source separation, t...",1170,"[cs.NE, cs.AI]",True
1,704.005,Intelligent location of simultaneously active acoustic emission sources:\n Part II,"Part I describes an intelligent acoustic emission locator, while Part II\ndiscusses blind source separation, time ...",cs.NE cs.AI,2007-05-23,T. Kosel and I. Grabec,"[[Kosel, T., ], [Grabec, I., ]]",Intelligent location of simultaneously active acoustic emission sources: Part II,"Part I describes an intelligent acoustic emission locator, while Part II discusses blind source separation, time del...",948,"[cs.NE, cs.AI]",True


## Category filtering results (RAG corpus)

- Loaded the first **10,000** arXiv records from the raw JSON snapshot (a sample, not the full file).
- Applied two filters to create a RAG-ready subset:
  1. **Topic filter:** keep papers whose `categories` include at least one of  
     `{"cs.CL", "cs.LG", "cs.AI"}`
  2. **Quality filter:** keep papers with `abstract_len >= 200` characters
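
The two filters can be checked on a few hand-made records without pandas. This is a minimal sketch: the record contents are invented for the check, and `passes_filters` is a hypothetical helper name. Categories are compared as whole tokens against the target set, so a paper counts only when one of its listed categories matches exactly.

```python
TARGET_CATS = {"cs.CL", "cs.LG", "cs.AI"}
MIN_ABSTRACT_CHARS = 200

def passes_filters(record, target_cats=TARGET_CATS, min_chars=MIN_ABSTRACT_CHARS):
    """Return True when a record has a target category and a long enough abstract."""
    categories = (record.get("categories") or "").split()
    on_topic = any(cat in target_cats for cat in categories)
    long_enough = len(record.get("abstract_clean") or "") >= min_chars
    return on_topic and long_enough

# Invented records: one passes, one is off-topic, one has a short abstract
records = [
    {"id": "a", "categories": "cs.NE cs.AI", "abstract_clean": "x" * 250},
    {"id": "b", "categories": "hep-ph", "abstract_clean": "x" * 980},
    {"id": "c", "categories": "cs.LG", "abstract_clean": "x" * 50},
]
kept_ids = [r["id"] for r in records if passes_filters(r)]
print(kept_ids)  # ['a']
```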

### Summary
- **Total rows:** 10,000  
- **Rows with target categories:** 76  
- **Rows with abstract ≥ 200 chars:** 9,659  
- **Final filtered RAG dataset (`df_rag`) shape:** **(76, 12)**

### Output dataset
`df_rag` now contains the cleaned and filtered fields needed for RAG, including:
- `id`, `title_clean`, `abstract_clean`
- `categories`, `categories_list`
- `update_date`, `authors_parsed`
- `abstract_len`, `has_target_cat`

This dataset will be used next to build the canonical document text (`doc_text`) and create retrieval chunks for embedding.

### Build doc_text (one document per paper)

In [23]:
df_rag["update_date_parsed"] = pd.to_datetime(df_rag["update_date"], errors="coerce")

def build_doc_text(row):
    cats = " ".join(row["categories_list"]) if isinstance(row["categories_list"], list) else ""
    date_part = ""
    if pd.notna(row["update_date_parsed"]):
        date_part = f"\nLast updated: {row['update_date_parsed'].date()}"
    return (
        f"Title: {row['title_clean']}\n"
        f"Abstract: {row['abstract_clean']}\n"
        f"Categories: {cats}"
        f"{date_part}"
    )

df_rag["doc_text"] = df_rag.apply(build_doc_text, axis=1)
df_rag[["id", "title_clean", "categories", "abstract_len"]].head(3)


Unnamed: 0,id,title_clean,categories,abstract_len
0,704.0047,Intelligent location of simultaneously active acoustic emission sources: Part I,cs.NE cs.AI,1170
1,704.005,Intelligent location of simultaneously active acoustic emission sources: Part II,cs.NE cs.AI,948
2,704.0304,The World as Evolving Information,cs.IT cs.AI math.IT q-bio.PE,788


## Build Canonical Document Text (`doc_text`)

To prepare the dataset for Retrieval-Augmented Generation (RAG), we constructed a canonical text representation for each paper.

Each `doc_text` includes:

- **Title**
- **Abstract**
- **Categories**
- **Last updated date (if available)**

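This is intended to provide:
- A consistent structure across documents
- Title and category context embedded alongside the abstract
- Date metadata kept with the text it describes

These are design goals; retrieval quality is not measured in this notebook.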

Example structure:

Title: <paper title>  
Abstract: <paper abstract>  
Categories: <arXiv categories>  
Last updated: <YYYY-MM-DD>

The resulting dataset (`df_rag`) now contains one cleaned, structured document per paper, ready for chunking and embedding.

### Chunk for Retrieval

In [31]:
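# Override the baseline chunking config (1200 / 200) set in the paths cell;
# these are the values actually used below and recorded in the schema JSON.
CHUNK_SIZE_CHARS = 800
CHUNK_OVERLAP_CHARS = 150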

def chunk_text(text, chunk_size=CHUNK_SIZE_CHARS, overlap=CHUNK_OVERLAP_CHARS):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not text:
        return []
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be > overlap")

    chunks = []
    start = 0
    n = len(text)

    while start < n:
        end = min(start + chunk_size, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == n:
            break
        start = end - overlap

    return chunks


rows = []
for _, row in df_rag.iterrows():
    doc_id = row["id"]
    chunks = chunk_text(row["doc_text"])

    for i, ch in enumerate(chunks):
        rows.append({
            "chunk_id": f"{doc_id}__{i}",
            "doc_id": doc_id,
            "chunk_index": i,
            "chunk_text": ch,
            "title": row["title_clean"],
            "categories": row["categories"],
            "update_date": row["update_date"]
        })

df_chunks = pd.DataFrame(rows)

print("Documents:", len(df_rag))
print("Chunks:", len(df_chunks))
df_chunks.head(3)

Documents: 76
Chunks: 136


Unnamed: 0,chunk_id,doc_id,chunk_index,chunk_text,title,categories,update_date
0,704.0047__0,704.0047,0,Title: Intelligent location of simultaneously active acoustic emission sources: Part I\nAbstract: The intelligent ac...,Intelligent location of simultaneously active acoustic emission sources: Part I,cs.NE cs.AI,2009-09-29
1,704.0047__1,704.0047,1,rning from examples. Locator performance was tested on different test specimens. Tests have shown that the accuracy ...,Intelligent location of simultaneously active acoustic emission sources: Part I,cs.NE cs.AI,2009-09-29
2,704.005__0,704.005,0,Title: Intelligent location of simultaneously active acoustic emission sources: Part II\nAbstract: Part I describes ...,Intelligent location of simultaneously active acoustic emission sources: Part II,cs.NE cs.AI,2007-05-23


## Chunking Results

The filtered RAG dataset contains:

- **76 documents** (papers tagged with at least one of cs.CL, cs.LG, cs.AI, out of the 10,000-record sample)
- **136 retrieval chunks**

Each document was split using:
- Chunk size: 800 characters
- Overlap: 150 characters
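
With these settings each new chunk starts 800 − 150 = 650 characters after the previous one, so a text of `n` characters yields one chunk when `n <= 800` and otherwise `1 + ceil((n - 800) / 650)` chunks. That is consistent with the roughly 1.8 chunks per document observed here (136 / 76). The sketch below checks the closed form against a replay of the sliding window; `expected_chunk_count` and `sliding_window_spans` are hypothetical helper names, and the formula ignores the rare case where a window is whitespace-only and gets dropped.

```python
def expected_chunk_count(n_chars, chunk_size=800, overlap=150):
    """Closed-form number of sliding-window chunks for a text of n_chars characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be > overlap")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # Integer ceiling of (n_chars - chunk_size) / stride
    return 1 + (n_chars - chunk_size + stride - 1) // stride

def sliding_window_spans(n_chars, chunk_size=800, overlap=150):
    """Replay the (start, end) offsets produced by a fixed-size sliding window."""
    spans, start = [], 0
    while start < n_chars:
        end = min(start + chunk_size, n_chars)
        spans.append((start, end))
        if end == n_chars:
            break
        start = end - overlap
    return spans

lengths = (500, 800, 801, 1450, 1451, 2100)
counts = {n: expected_chunk_count(n) for n in lengths}
replayed = {n: len(sliding_window_spans(n)) for n in lengths}
print(counts)
print(counts == replayed)
```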

Chunking is intended to support:
- Retrieval of focused passages rather than whole documents
- Answers grounded in a specific span of source text
- A pipeline that carries over unchanged to larger corpora

We now have:

- `df_rag` → one row per paper  
- `df_chunks` → one row per retrieval chunk  

The chunked dataset is ready for embedding and vector indexing.
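
One visible side effect of cutting at fixed character offsets is that a chunk can open mid-word: chunk `704.0047__1` in the preview starts with "rning from examples". A possible refinement, sketched below under the assumption that words are separated by plain spaces, snaps both the end and the start of each window to a nearby space. `chunk_text_on_whitespace` is a hypothetical variant, not the function used in this notebook, and the demo text is synthetic.

```python
def chunk_text_on_whitespace(text, chunk_size=800, overlap=150):
    """Sliding-window chunking that prefers to cut at spaces instead of mid-word."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be > overlap")
    chunks = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Pull the cut back to the last space in the window, but keep it
            # beyond the overlap region so the next start always moves forward.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == n:
            break
        next_start = end - overlap
        # If the overlap would begin inside a word, skip ahead to the next space.
        if not text[next_start - 1].isspace() and not text[next_start].isspace():
            boundary = text.find(" ", next_start, end)
            if boundary != -1:
                next_start = boundary + 1
        start = next_start
    return chunks

# Synthetic text: 400 distinct 8-character words separated by single spaces
demo_text = " ".join(f"word{i:04d}" for i in range(400))
demo_chunks = chunk_text_on_whitespace(demo_text, chunk_size=200, overlap=50)
print(len(demo_chunks), repr(demo_chunks[1][:17]))
```

The trade-off is that chunk lengths and overlaps become slightly variable, so chunk counts no longer follow a fixed stride exactly.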

## Save Clean Outputs

In [35]:
import json

# Save cleaned documents
df_rag.to_csv(DOCS_OUT, index=False)

# Save chunked corpus
df_chunks.to_csv(CHUNKS_OUT, index=False)

schema = {
    "documents": len(df_rag),
    "chunks": len(df_chunks),
    "target_categories": sorted(list(TARGET_CATS)),
    "min_abstract_chars": MIN_ABSTRACT_CHARS,
    "chunk_size_chars": CHUNK_SIZE_CHARS,
    "chunk_overlap_chars": CHUNK_OVERLAP_CHARS,
    "docs_path": str(DOCS_OUT),
    "chunks_path": str(CHUNKS_OUT)
}

with open(SCHEMA_OUT, "w") as f:
    json.dump(schema, f, indent=2)

print("Saved:")
print(" -", DOCS_OUT)
print(" -", CHUNKS_OUT)
print(" -", SCHEMA_OUT)

Saved:
 - ..\data\processed\arxiv_docs_clean_20260223_1021.csv
 - ..\data\processed\arxiv_chunks_20260223_1021.csv
 - ..\data\processed\schema_20260223_1021.json
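
One caveat when reading these CSV files back: list-valued columns such as `categories_list` are written as their string representation, so they come back as plain strings. The sketch below reproduces this with the standard `csv` module and restores the list with `ast.literal_eval`. The single row is illustrative and the round trip goes through an in-memory buffer rather than the notebook's files.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "categories_list"])
writer.writeheader()
writer.writerow({"id": "0704.0047", "categories_list": ["cs.NE", "cs.AI"]})

# Read it back: the list is now a string such as "['cs.NE', 'cs.AI']"
buffer.seek(0)
row = next(csv.DictReader(buffer))
as_written = row["categories_list"]

# literal_eval safely parses Python literals (lists, strings, numbers) only
restored = ast.literal_eval(as_written)
print(type(as_written).__name__, restored)
```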


## Data Preparation Complete

The RAG corpus has been successfully prepared and saved.

### Outputs Generated
- Cleaned documents file  
- Chunked retrieval corpus file  
- Schema metadata JSON (for reproducibility)
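
Because the schema JSON records the chunk count and the path of the chunks file, a later notebook can use it as a sanity check before embedding. The sketch below assumes the `chunks` and `chunks_path` keys written by the save cell; `verify_manifest` is a hypothetical helper, and the demo runs against throwaway files rather than the real outputs.

```python
import csv
import json
import tempfile
from pathlib import Path

def verify_manifest(schema_path):
    """Compare the chunk count recorded in the schema JSON with the rows in the chunks CSV."""
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    with open(schema["chunks_path"], newline="", encoding="utf-8") as handle:
        # DictReader counts logical rows, so quoted newlines inside chunk_text are fine
        n_rows = sum(1 for _ in csv.DictReader(handle))
    return n_rows == schema["chunks"], n_rows

# Self-contained demo with temporary files (not the notebook's real outputs)
with tempfile.TemporaryDirectory() as tmp:
    chunks_csv = Path(tmp) / "chunks.csv"
    with open(chunks_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["chunk_id", "chunk_text"])
        writer.writerow(["a__0", "first chunk,\nwith a newline"])
        writer.writerow(["a__1", "second chunk"])
    schema_json = Path(tmp) / "schema.json"
    schema_json.write_text(
        json.dumps({"chunks": 2, "chunks_path": str(chunks_csv)}), encoding="utf-8"
    )
    ok, n_rows = verify_manifest(schema_json)

print(ok, n_rows)
```

The paths stored by this notebook are relative (`..\data\processed\...`), so a check like this only works when run from the same working directory.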

### Corpus Statistics
- Documents: 76
- Retrieval chunks: 136
- Target domains: cs.CL, cs.LG, cs.AI

This dataset is now ready for:

1. Embedding generation
2. Vector index construction
3. Semantic retrieval testing
4. RAG answer generation with citations

The next step is to build the vector store.