# Sample Outputs – Generative AI RAG (Part 1)
**Name:** Wenshu (Demi) Diao

This notebook shows sample outputs from my Part 1 pipeline:
- chunking stats
- embedding dimension check
- ChromaDB retrieval examples (top-k neighbors + distances)

The dataset is not included in the repo. See the README for download instructions and the expected file path.

**Note:**
This notebook expects an OpenAI API key to be provided either via the
`OPENAI_API_KEY` environment variable or via a local `openai.txt` file
(not included in the repository).

### 1) Environment check

In [1]:
import sys, chromadb, pandas as pd
import openai

print("Python:", sys.version)
print("chromadb:", chromadb.__version__)
print("pandas:", pd.__version__)
print("openai:", openai.__version__)

Python: 3.13.7 (v3.13.7:bcee1c32211, Aug 14 2025, 19:10:51) [Clang 16.0.0 (clang-1600.0.26.6)]
chromadb: 1.4.1
pandas: 3.0.0
openai: 2.16.0


### 2) Load dataset + show basic info

In [2]:
import pandas as pd
from pathlib import Path

DATA_PATH = Path("data/ai_agents_jobs/AI_Agents_Ecosystem_2026.csv")
df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
print("Columns:", list(df.columns))
df.head(2)

Shape: (1206, 5)
Columns: ['Title', 'Source', 'Date', 'Description', 'Link']


| | Title | Source | Date | Description | Link |
|---|---|---|---|---|---|
| 0 | Client Support Specialist at Clipboard Health | RemoteJob | 2026-01-16 | About the Role\n \nClipboard Health is looking... | https://remotive.com/remote-jobs/customer-serv... |
| 1 | Senior Independent AI Engineer / Architect at ... | RemoteJob | 2026-01-16 | Location: Americas, Europe, or Israel\nThe Opp... | https://remotive.com/remote-jobs/software-deve... |


### 3) Chunking experiment output
Sample code from `chunk_smoketest.py`.

Reproduce the key stats:
* total rows
* total chunks
* avg chunks/row for 2–3 configs
* one example chunk + metadata

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def row_to_doc(row) -> str:
    return "\n".join([
        f"TITLE: {row.get('Title','')}",
        f"SOURCE: {row.get('Source','')}",
        f"DATE: {row.get('Date','')}",
        f"DESCRIPTION: {row.get('Description','')}",
    ])

df2 = df.copy()
df2["Description"] = df2["Description"].fillna("").astype(str)
df2["doc_text"] = df2.apply(row_to_doc, axis=1)

texts = df2["doc_text"].astype(str).tolist()

configs = [
    {"chunk_size": 350, "chunk_overlap": 50},
    {"chunk_size": 700, "chunk_overlap": 100},
    {"chunk_size": 1000, "chunk_overlap": 150},
]

for cfg in configs:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=cfg["chunk_size"],
        chunk_overlap=cfg["chunk_overlap"],
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    total_chunks = sum(len(splitter.split_text(t)) for t in texts)
    print(cfg, "avg_chunks/row:", round(total_chunks/len(texts), 3), "total_chunks:", total_chunks)


{'chunk_size': 350, 'chunk_overlap': 50} avg_chunks/row: 2.109 total_chunks: 2544
{'chunk_size': 700, 'chunk_overlap': 100} avg_chunks/row: 1.002 total_chunks: 1209
{'chunk_size': 1000, 'chunk_overlap': 150} avg_chunks/row: 1.0 total_chunks: 1206
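
The stats listed for this section include one example chunk plus metadata, which the cell above does not print. A minimal sketch of how that could be produced follows. The `build_chunk_records` helper, the metadata keys (`row_id`, `chunk_index`, `title`, `source`, `date`), and the `FixedWidthSplitter` stand-in are assumptions for illustration and are not taken from `chunk_smoketest.py`. Any object with a `split_text` method, such as the `RecursiveCharacterTextSplitter` configured above, could be passed in place of the stand-in.

```python
from typing import Protocol


class Splitter(Protocol):
    def split_text(self, text: str) -> list[str]: ...


class FixedWidthSplitter:
    """Hypothetical stand-in splitter: fixed-size character windows with overlap."""

    def __init__(self, chunk_size: int, chunk_overlap: int) -> None:
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        self.chunk_size = chunk_size
        self.step = chunk_size - chunk_overlap

    def split_text(self, text: str) -> list[str]:
        chunks = []
        for start in range(0, len(text), self.step):
            chunks.append(text[start:start + self.chunk_size])
            if start + self.chunk_size >= len(text):
                break  # the current window already reaches the end of the text
        return chunks


def build_chunk_records(rows: list[dict], splitter: Splitter) -> list[dict]:
    """Split each row's text and attach row-level metadata to every chunk."""
    records = []
    for row_id, row in enumerate(rows):
        for chunk_index, chunk in enumerate(splitter.split_text(row["doc_text"])):
            records.append({
                "text": chunk,
                "metadata": {
                    "row_id": row_id,
                    "chunk_index": chunk_index,
                    "title": row.get("Title", ""),
                    "source": row.get("Source", ""),
                    "date": row.get("Date", ""),
                },
            })
    return records


# Toy row for illustration only (not taken from the dataset).
toy_rows = [{
    "Title": "Example role",
    "Source": "ExampleSource",
    "Date": "2026-01-01",
    "doc_text": "TITLE: Example role\nDESCRIPTION: " + "x" * 100,
}]
records = build_chunk_records(toy_rows, FixedWidthSplitter(chunk_size=50, chunk_overlap=10))
print("num chunks:", len(records))
print("first chunk metadata:", records[0]["metadata"])
```

In the notebook itself, the rows could come from `df2.to_dict("records")`, since `df2` already carries the `doc_text`, `Title`, `Source`, and `Date` columns.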


### 4) Embedding dimension sanity check
Show:
- model name
- embedding dimension
- one similarity example

In [7]:
from pathlib import Path
import os
from openai import OpenAI

PROJECT_ROOT = Path.cwd()

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    api_key = (PROJECT_ROOT / "openai.txt").read_text().strip()

client = OpenAI(api_key=api_key)

EMBED_MODEL = "text-embedding-3-small"

texts = [
    "reinforcement learning agent roles",
    "multi-agent orchestration with tool calling",
    "customer support specialist"
]
resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
vecs = [d.embedding for d in resp.data]

print("Model:", EMBED_MODEL)
print("Num vectors:", len(vecs))
print("Dim:", len(vecs[0]))

Model: text-embedding-3-small
Num vectors: 3
Dim: 1536
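
The checklist for this section also mentions one similarity example, which the cell above does not compute. A minimal sketch of cosine similarity between embedding vectors follows. The `cosine_similarity` helper and the toy 3-dimensional vectors are assumptions for illustration; no values here come from the actual API response.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only (the real embeddings have 1536 dimensions).
toy_vecs = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print("sim(0, 1):", round(cosine_similarity(toy_vecs[0], toy_vecs[1]), 4))
print("sim(0, 2):", round(cosine_similarity(toy_vecs[0], toy_vecs[2]), 4))
```

In the notebook, the same function could be applied to pairs from `vecs`, for example `cosine_similarity(vecs[0], vecs[1])`, to compare the two agent-related texts against the customer-support one.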


### 5) Connect to ChromaDB + show collection stats
Show:
- collection exists
- count

In [5]:
import chromadb
from pathlib import Path

DB_DIR = Path("chroma_db")
client_chroma = chromadb.PersistentClient(path=str(DB_DIR))
print("Collections:", [c.name for c in client_chroma.list_collections()])

col = client_chroma.get_collection("ai_agents_jobs_2026")
print("Collection count:", col.count())

Collections: ['ai_agents_jobs_2026']
Collection count: 652


### 6) Retrieval examples
- 3–5 queries
- top-5 results each
- title/source/date/link + distance
- short snippet
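
How to read the `distance` column depends on the distance metric the collection was created with, which this notebook does not show. As a hedged sketch: if the collection uses squared L2 distance and the embeddings are unit-length, then distance and cosine similarity are related by `d = 2 - 2*cos`; if it uses cosine distance, then `d = 1 - cos`. The helper names below are hypothetical, and the input value is a placeholder rather than a result from this notebook.

```python
def cosine_from_squared_l2(distance: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d/2."""
    return 1.0 - distance / 2.0


def cosine_from_cosine_distance(distance: float) -> float:
    """Cosine distance is defined as 1 - cos(a, b)."""
    return 1.0 - distance


d = 1.0  # hypothetical distance value, for illustration only
print("if squared L2:", cosine_from_squared_l2(d))
print("if cosine distance:", cosine_from_cosine_distance(d))
```

Under either assumption a smaller distance means a closer match, so the ranking order is unaffected; only the absolute interpretation of the numbers changes.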

In [8]:
queries = [
    "reinforcement learning agent roles in finance",
    "multi-agent orchestration using LangGraph",
    "tool-calling agents in production systems",
]

q_resp = client.embeddings.create(model=EMBED_MODEL, input=queries)
q_vecs = [d.embedding for d in q_resp.data]

results = col.query(
    query_embeddings=q_vecs,
    n_results=5,
    include=["documents", "metadatas", "distances"]
)

for qi, q in enumerate(queries):
    print("\n" + "="*80)
    print("Query:", q)
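    # Iterate over what was actually returned; a query can yield fewer than n_results.
    for rank in range(len(results["documents"][qi])):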
        md = results["metadatas"][qi][rank]
        dist = results["distances"][qi][rank]
        doc = results["documents"][qi][rank]
        print(f"\n  Rank {rank+1} | dist={dist:.4f}")
        print("  Title:", md.get("title",""))
        print("  Source/Date:", md.get("source",""), "|", md.get("date",""))
        print("  Link:", md.get("link",""))
        print("  Snippet:", doc[:160].replace("\n"," "), "...")


Query: reinforcement learning agent roles in finance

  Rank 1 | dist=1.0595
  Title: UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning
  Source/Date: ArXiv | 2026-01-14
  Link: http://arxiv.org/abs/2601.09215v1
  Snippet: TITLE: UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning SOURCE: ArXiv DATE: 2026-01-14 DESCRIPTION: User sim ...

  Rank 2 | dist=1.0642
  Title: Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
  Source/Date: ArXiv | 2026-01-14
  Link: http://arxiv.org/abs/2601.09667v2
  Snippet: TITLE: Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning SOURCE: ArXiv DATE: 2026-01-14 DESCRIPTION: Multi-agent systems have evolved int ...

  Rank 3 | dist=1.0814
  Title: When Personas Override Payoffs: Role Identity Bias in Multi-Agent LLM Decision-Making
  Source/Date: ArXiv | 2026-01-15
  Link: http://arxiv.org/abs/2601.10102v1
  Snipp