# Sample Outputs – Generative AI RAG (Part 1)

This section shows sample outputs from my Part 1 pipeline:
- chunking stats
- embedding dimension check
- ChromaDB retrieval examples (top-k neighbors + distances)

The dataset is not included in the repo; see the README for the download link and the expected file path.

**Note:**
This notebook expects an OpenAI API key to be provided either via the
`OPENAI_API_KEY` environment variable or via a local `openai.txt` file
(not included in the repository).

### 1) Environment check

In [5]:
import sys, chromadb, pandas as pd
import openai

print("Python:", sys.version)
print("chromadb:", chromadb.__version__)
print("pandas:", pd.__version__)
print("openai:", openai.__version__)

Python: 3.13.7 (v3.13.7:bcee1c32211, Aug 14 2025, 19:10:51) [Clang 16.0.0 (clang-1600.0.26.6)]
chromadb: 1.4.1
pandas: 3.0.0
openai: 2.16.0


### 2) Load dataset + show basic info

In [6]:
import pandas as pd
from pathlib import Path

DATA_PATH = Path("../data/ai_agents_jobs/AI_Agents_Ecosystem_2026.csv")
df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
print("Columns:", list(df.columns))
df.head(2)

Shape: (1206, 5)
Columns: ['Title', 'Source', 'Date', 'Description', 'Link']


| | Title | Source | Date | Description | Link |
|---|---|---|---|---|---|
| 0 | Client Support Specialist at Clipboard Health | RemoteJob | 2026-01-16 | About the Role\n \nClipboard Health is looking... | https://remotive.com/remote-jobs/customer-serv... |
| 1 | Senior Independent AI Engineer / Architect at ... | RemoteJob | 2026-01-16 | Location: Americas, Europe, or Israel\nThe Opp... | https://remotive.com/remote-jobs/software-deve... |


### 3) Out-of-scope filter evidence (post–Oct 2023)

In [7]:
df2 = df.copy()
df2["Date"] = pd.to_datetime(df2["Date"], errors="coerce")

cutoff = pd.Timestamp("2023-10-01")
post_cutoff = df2[df2["Date"] >= cutoff]

print("Full date range:", df2["Date"].min(), "→", df2["Date"].max())
print("Rows post-cutoff (>= 2023-10-01):", post_cutoff.shape[0], "out of", df2.shape[0])
post_cutoff[["Title", "Source", "Date", "Link"]].head(5)

Full date range: 2009-07-01 00:00:00 → 2026-01-16 00:00:00
Rows post-cutoff (>= 2023-10-01): 885 out of 1206


| | Title | Source | Date | Link |
|---|---|---|---|---|
| 0 | Client Support Specialist at Clipboard Health | RemoteJob | 2026-01-16 | https://remotive.com/remote-jobs/customer-serv... |
| 1 | Senior Independent AI Engineer / Architect at ... | RemoteJob | 2026-01-16 | https://remotive.com/remote-jobs/software-deve... |
| 2 | Senior Independent Software Developer at A.Team | RemoteJob | 2026-01-16 | https://remotive.com/remote-jobs/software-deve... |
| 3 | Show HN: Gambit, an open-source agent harness ... | HackerNews | 2026-01-16 | https://github.com/bolt-foundry/gambit |
| 4 | Show HN: Use-AI: trivially add AI automation t... | HackerNews | 2026-01-16 | https://github.com/meetsmore/use-ai |


### 4) Chunking experiment output (Markdown + code cell)
Sample code from `chunk_smoketest.py`.

Reproduce the key stats:
* total rows
* total chunks
* avg chunks/row for 2–3 configs
* one example chunk + metadata

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

df3 = df2.copy()
df3["Description"] = df3["Description"].fillna("").astype(str)

def row_to_doc(row) -> str:
    return "\n".join([
        f"TITLE: {row.get('Title','')}",
        f"SOURCE: {row.get('Source','')}",
        f"DATE: {row.get('Date','')}",
        f"DESCRIPTION: {row.get('Description','')}",
    ])

df3["doc_text"] = df3.apply(row_to_doc, axis=1)
texts = df3["doc_text"].astype(str).tolist()

configs = [
    {"chunk_size": 350, "chunk_overlap": 50},
    {"chunk_size": 700, "chunk_overlap": 100},
    {"chunk_size": 1000, "chunk_overlap": 150},
]

for cfg in configs:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=cfg["chunk_size"],
        chunk_overlap=cfg["chunk_overlap"],
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    total_chunks = sum(len(splitter.split_text(t)) for t in texts)
    print(cfg, "avg_chunks/row:", round(total_chunks/len(texts), 3), "total_chunks:", total_chunks)

{'chunk_size': 350, 'chunk_overlap': 50} avg_chunks/row: 2.11 total_chunks: 2545
{'chunk_size': 700, 'chunk_overlap': 100} avg_chunks/row: 1.004 total_chunks: 1211
{'chunk_size': 1000, 'chunk_overlap': 150} avg_chunks/row: 1.0 total_chunks: 1206
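
The averages above suggest that most rows are shorter than 700 characters, so only the 350-character setting splits them. As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of a fixed-width sliding-window chunker. It is not what `RecursiveCharacterTextSplitter` does (that splitter prefers separator boundaries), and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive chunker: each window starts (chunk_size - chunk_overlap) characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Stand-in text with the same length (610 chars) as the example document in section 5.
sample = "x" * 610
for size, overlap in [(350, 50), (700, 100), (1000, 150)]:
    print(size, overlap, len(sliding_window_chunks(sample, size, overlap)))
```

For a 610-character text this yields 2 chunks at size 350 and 1 chunk at sizes 700 and 1000, which matches the pattern in the stats above.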


### 5) Show one example row → chunks (human sanity check)

In [9]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
)

example_text = texts[0]
chunks = splitter.split_text(example_text)

print("Example doc length:", len(example_text))
print("Num chunks:", len(chunks))
print("\n--- Chunk #1 ---\n", chunks[0][:800])
if len(chunks) > 1:
    print("\n--- Chunk #2 ---\n", chunks[1][:800])

Example doc length: 610
Num chunks: 1

--- Chunk #1 ---
 TITLE: Client Support Specialist at Clipboard Health
SOURCE: RemoteJob
DATE: 2026-01-16 00:00:00
DESCRIPTION: About the Role
 
Clipboard Health is looking for highly motivated, customer-focused individuals to join our team as B2B Support Specialists (Workplace Support Agents). This is not a traditional call center role—you will be the frontline specialist for our most valuable business clients, our workplace customers. Your job is to proactively solve client issues, prevent churn, and ensure a seamless experience for our customers. 
This is primarily a voice-based role, with additional responsibilities


### 6) Embedding sanity check (OpenAI, prints dim)

In [11]:
import os
from openai import OpenAI

BASE_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()
PROJECT_ROOT = BASE_DIR.parent

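api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    # Fall back to a local key file; fail with a clear message if it is absent.
    key_path = PROJECT_ROOT / "openai.txt"
    if not key_path.exists():
        raise RuntimeError("Missing OPENAI_API_KEY and no openai.txt found")
    api_key = key_path.read_text().strip()
client = OpenAI(api_key=api_key)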

EMBED_MODEL = "text-embedding-3-small"
probe_texts = [
    "reinforcement learning agent roles",
    "multi-agent orchestration with tool calling",
    "customer support specialist",
]

resp = client.embeddings.create(model=EMBED_MODEL, input=probe_texts)
vecs = [d.embedding for d in resp.data]

print("Model:", EMBED_MODEL)
print("Num vectors:", len(vecs))
print("Dim:", len(vecs[0]))

Model: text-embedding-3-small
Num vectors: 3
Dim: 1536
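
The cell above inspects only `vecs[0]`. A slightly stricter check, sketched below with placeholder vectors instead of real API responses, confirms that every vector has the expected length. The helper name `check_embedding_dims` is hypothetical, and 1536 is simply the dimension printed above.

```python
def check_embedding_dims(vectors, expected_dim):
    """Return the indices of vectors whose length differs from expected_dim."""
    return [i for i, v in enumerate(vectors) if len(v) != expected_dim]


# Placeholder vectors standing in for the embeddings returned by the API.
fake_vecs = [[0.0] * 1536, [0.0] * 1536, [0.0] * 1536]
mismatched = check_embedding_dims(fake_vecs, expected_dim=1536)
print("Mismatched vectors:", mismatched)
```

An empty list means all vectors agree with the expected dimension, which matters because a Chroma collection cannot mix embedding sizes.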


### 7) Retrieval example from the persisted ChromaDB

In [18]:
import os
from pathlib import Path
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# The notebook runs one level below the project root
PROJECT_ROOT = Path.cwd().parent
DB_DIR = PROJECT_ROOT / "chroma_db"
COLLECTION_NAME = "ai_agents_jobs_2026"
TOP_K = 5

assert DB_DIR.exists(), f"Missing {DB_DIR}. Run ingestion first."
print("Using DB_DIR:", DB_DIR)

# Load key (env var OR openai.txt)
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    key_path = PROJECT_ROOT / "openai.txt"
    if key_path.exists():
        api_key = key_path.read_text().strip()
if not api_key:
    raise RuntimeError("Missing OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = api_key

# Load vectorstore (uses same embedding model as ingestion)
emb = OpenAIEmbeddings(model="text-embedding-3-small")
vs = Chroma(
    collection_name=COLLECTION_NAME,
    persist_directory=str(DB_DIR),
    embedding_function=emb,
)

print("Count:", vs._collection.count())

queries = [
    "multi-agent orchestration using LangGraph",
    "tool-calling agents in production systems",
    "reinforcement learning agent roles in finance",
]

for q in queries:
    print("\n" + "=" * 80)
    print("Query:", q)
    docs = vs.similarity_search(q, k=TOP_K)
    for i, d in enumerate(docs, start=1):
        md = d.metadata or {}
        snippet = (d.page_content or "").replace("\n", " ")[:200]
        print(f"\n  Rank {i} | {md.get('title','')}")
        print(f"  {md.get('source','')} | {md.get('date','')}")
        print("  Link:", md.get("link",""))
        print("  Snippet:", snippet, "...")

Using DB_DIR: /Users/demidiao/PycharmProjects/generative-ai-rag/chroma_db
Count: 652

Query: multi-agent orchestration using LangGraph

  Rank 1 | Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems
  ArXiv | 2026-01-15
  Link: http://arxiv.org/abs/2601.10560v1
  Snippet: TITLE: Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems SOURCE: ArXiv DATE: 2026-01-15 DESCRIPTION: Multi-agent systems (MAS) enable complex reasoning by coordinating multiple age ...

  Rank 2 | SC-MAS: Constructing Cost-Efficient Multi-Agent Systems with Edge-Level Heterogeneous Collaboration
  ArXiv | 2026-01-14
  Link: http://arxiv.org/abs/2601.09434v1
  Snippet: TITLE: SC-MAS: Constructing Cost-Efficient Multi-Agent Systems with Edge-Level Heterogeneous Collaboration SOURCE: ArXiv DATE: 2026-01-14 DESCRIPTION: Large Language Model (LLM)-based Multi-Agent Syst ...

  Rank 3 | Beyond Rule-Based Workflows: An Information-Flow-Orchestrated Multi-Agents Paradigm via Agent-to-Agen
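
The listing above prints ranks but no distances. LangChain's Chroma wrapper exposes `similarity_search_with_score`, which returns (document, score) pairs. Whether that score is a cosine or L2 distance depends on how the collection was created, which is not shown here. The sketch below illustrates the underlying idea on toy 2-D vectors, assuming cosine distance. The helper names and the toy data are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return (doc_id, distance) pairs for the k nearest vectors, closest first."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy vectors; real embeddings here have 1536 dimensions.
toy_docs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
for doc_id, dist in top_k([1.0, 0.1], toy_docs, k=2):
    print(f"{doc_id}  distance={dist:.3f}")
```

Printing the distance next to each rank makes it easier to see whether the lower-ranked hits are still close to the query or only fill out the top-k.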

# Sample Outputs – Generative AI RAG (Part 2)

This section records outputs for:
- Original LLM (no RAG)
- Simple RAG
- RAG + HyDE
- RAG + Reranking

Evaluated on one question/prompt.

In [19]:
import os
from pathlib import Path
from typing import List, Optional

from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

In [20]:
# -----------------------
# Config
# -----------------------
BASE_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()
PROJECT_ROOT = BASE_DIR.parent
DB_DIR = PROJECT_ROOT / "chroma_db"

COLLECTION_NAME = "ai_agents_jobs_2026"
EMBED_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4o-mini"

TOP_K = 5
CANDIDATES_K = 40
RERANK_SNIPPET_CHARS = 350

In [21]:
# -----------------------
# OpenAI key
# -----------------------
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    # optional convenience for course environment; ensure openai.txt is gitignored
    key_path = PROJECT_ROOT / "openai.txt"
    if key_path.exists():
        api_key = key_path.read_text().strip()

if not api_key:
    raise RuntimeError("Missing OPENAI_API_KEY. Please export it in your environment.")

os.environ["OPENAI_API_KEY"] = api_key

In [22]:
# -----------------------
# Vector store
# -----------------------
if not DB_DIR.exists():
    raise RuntimeError(f"Missing {DB_DIR}. Run Part 1 ingestion to build the Chroma DB first.")

embeddings = OpenAIEmbeddings(model=EMBED_MODEL)
vectorstore = Chroma(
    collection_name=COLLECTION_NAME,
    persist_directory=str(DB_DIR),
    embedding_function=embeddings,
)

# Note: _collection is a private attribute and may change between versions; used here only for debugging
try:
    count = vectorstore._collection.count()
except Exception:
    count = "unknown"

print("Persist dir:", DB_DIR)
print("Collection:", COLLECTION_NAME)
print("Count:", count)

Persist dir: /Users/demidiao/PycharmProjects/generative-ai-rag/chroma_db
Collection: ai_agents_jobs_2026
Count: 652


In [23]:
# -----------------------
# Retrieval helpers
# -----------------------
def retrieve_simple(query: str, k: int = TOP_K):
    """Simple RAG baseline: plain similarity search."""
    return vectorstore.similarity_search(query, k=k)

def retrieve_candidates(query: str, k: int = CANDIDATES_K):
    """Candidate pool for HyDE / rerank, retrieved with MMR for diversity."""
    return vectorstore.max_marginal_relevance_search(
        query,
        k=k,
        fetch_k=max(120, k * 3),
        lambda_mult=0.6
    )

def format_docs(docs: List) -> str:
    parts = []
    for i, d in enumerate(docs, start=1):
        md = d.metadata or {}
        header = (
            f"[{i}] title={md.get('title','')} | source={md.get('source','')} | "
            f"date={md.get('date','')} | link={md.get('link','')}"
        )
        parts.append(header + "\n" + (d.page_content or ""))
    return "\n\n".join(parts)

def format_docs_for_rerank(docs: List, snippet_chars: int = RERANK_SNIPPET_CHARS) -> str:
    parts = []
    for i, d in enumerate(docs, start=1):
        md = d.metadata or {}
        text = (d.page_content or "").replace("\n", " ").strip()[:snippet_chars]
        header = (
            f"[{i}] title={md.get('title','')} | source={md.get('source','')} | "
            f"date={md.get('date','')} | link={md.get('link','')}"
        )
        parts.append(header + "\n" + text)
    return "\n\n".join(parts)
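
`retrieve_candidates` relies on maximal marginal relevance (MMR), where `lambda_mult` trades relevance against redundancy and `fetch_k` sets how many nearest neighbours are considered before selection. The following is a minimal sketch of the usual greedy MMR formulation over precomputed similarities. It is an illustration under that assumption, not the library's actual implementation, and `mmr_select` and the toy similarity values are hypothetical.

```python
def mmr_select(query_sims, pairwise_sims, k, lambda_mult=0.6):
    """Greedy MMR over precomputed similarities.

    query_sims[i]       : similarity of candidate i to the query
    pairwise_sims[i][j] : similarity between candidates i and j
    Returns the indices of the selected candidates, in selection order.
    """
    selected = []
    remaining = list(range(len(query_sims)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalise candidates that resemble something already selected.
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
query_sims = [0.90, 0.88, 0.60]
pairwise_sims = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
print(mmr_select(query_sims, pairwise_sims, k=2, lambda_mult=0.6))  # [0, 2]
print(mmr_select(query_sims, pairwise_sims, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=0.6` the near-duplicate is skipped in favour of the more distinct candidate, while `lambda_mult=1.0` reduces to plain relevance ranking.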

In [24]:
# -----------------------
# LLMs
# -----------------------
parser = StrOutputParser()
llm = ChatOpenAI(model=LLM_MODEL, temperature=0.2, max_tokens=650)
hyde_llm = ChatOpenAI(model=LLM_MODEL, temperature=0.0, max_tokens=250)
rerank_llm = ChatOpenAI(model=LLM_MODEL, temperature=0.0, max_tokens=120)

In [25]:
# -----------------------
# Prompts
# -----------------------
base_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user's question as best you can. If you are unsure, say you don't know."),
    ("user", "{question}")
])

rag_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer using ONLY the provided context. "
     "When you reference a passage, cite the SAME bracket number as in the context header (e.g., [1]). "
     "If the context does not contain the answer, reply exactly: "
     "'I don't know based on the provided context.'"),
    ("user", "Question: {question}\n\nContext:\n{context}\n\nAnswer:")
])

hyde_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Write a short hypothetical answer (5-8 sentences max) that would likely appear in the target documents. "
     "Focus on key terms that should match relevant documents. "
     "Do not cite sources. Do not invent specific tool names unless implied by the question."),
    ("user", "Question: {question}\nHypothetical answer:")
])

rerank_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are reranking retrieved passages for relevance to the question. "
     "Return ONLY a comma-separated list of the best passage numbers, in order, like: 3,1,5,2,4. "
     "Return exactly {top_k} numbers."),
    ("user", "Question: {question}\n\nPassages:\n{context}\n\nBest {top_k} passage numbers:")
])

In [26]:
# -----------------------
# Methods
# -----------------------
def answer_no_rag(question: str) -> str:
    return (base_prompt | llm | parser).invoke({"question": question})

def answer_simple_rag(question: str) -> str:
    docs = retrieve_simple(question, k=TOP_K)
    context = format_docs(docs)
    return (rag_prompt | llm | parser).invoke({"question": question, "context": context})

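def answer_hyde_rag(question: str) -> str:
    # HyDE: retrieve with a generated hypothetical answer instead of the raw question.
    hypothetical = (hyde_prompt | hyde_llm | parser).invoke({"question": question})
    candidates = retrieve_candidates(hypothetical, k=CANDIDATES_K)
    # Keep the first TOP_K candidates in MMR order; this variant does no reranking.
    docs = candidates[:TOP_K]
    context = format_docs(docs)
    # The final answer is still generated from the original question.
    return (rag_prompt | llm | parser).invoke({"question": question, "context": context})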

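def parse_rerank_order(order: str, max_n: int, top_k: int) -> List[int]:
    """Parse '3,1,5' into 0-based indices; any malformed token falls back to the first top_k candidates."""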
    try:
        nums = [int(x.strip()) for x in order.split(",")]
        nums = [n for n in nums if 1 <= n <= max_n]
        seen, cleaned = set(), []
        for n in nums:
            if n not in seen:
                cleaned.append(n)
                seen.add(n)
        cleaned = cleaned[:top_k]
        if len(cleaned) < top_k:
            cleaned.extend([n for n in range(1, max_n + 1) if n not in seen][: (top_k - len(cleaned))])
        return [n - 1 for n in cleaned]
    except Exception:
        return list(range(min(top_k, max_n)))

def answer_rerank_rag(question: str) -> str:
    candidates = retrieve_candidates(question, k=CANDIDATES_K)
    if not candidates:
        return "I don't know based on the provided context."

    context_all = format_docs_for_rerank(candidates, snippet_chars=RERANK_SNIPPET_CHARS)
    order = (rerank_prompt | rerank_llm | parser).invoke({
        "question": question,
        "context": context_all,
        "top_k": TOP_K
    })

    idxs = parse_rerank_order(order, max_n=len(candidates), top_k=TOP_K)
    chosen = [candidates[i] for i in idxs]
    context = format_docs(chosen)
    return (rag_prompt | llm | parser).invoke({"question": question, "context": context})
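
As written, `parse_rerank_order` discards the whole ranking if a single token fails `int()`, for example a trailing period in `"3, 1, 5."`. One possible alternative, sketched below, extracts digit runs with a regular expression so that stray punctuation or brackets do not trigger the fallback. The function name `parse_rerank_order_tolerant` is hypothetical and not part of the original pipeline.

```python
import re


def parse_rerank_order_tolerant(order, max_n, top_k):
    """Extract 1-based passage numbers from free-form text and return 0-based indices."""
    picked = []
    for token in re.findall(r"\d+", order):
        n = int(token)
        # Keep only in-range numbers, first occurrence wins.
        if 1 <= n <= max_n and n not in picked:
            picked.append(n)
    # Pad with unused passages in retrieval order if the model returned too few.
    for n in range(1, max_n + 1):
        if len(picked) >= top_k:
            break
        if n not in picked:
            picked.append(n)
    return [n - 1 for n in picked[:top_k]]


print(parse_rerank_order_tolerant("3, 1, 5.", max_n=5, top_k=3))             # [2, 0, 4]
print(parse_rerank_order_tolerant("[2] and [2], then 9", max_n=4, top_k=3))  # [1, 0, 2]
```

The trade-off is that any digits in the reply are treated as passage numbers, so a decimal such as `1.5` would be read as passages 1 and 5. That seems acceptable here because the prompt asks for a bare comma-separated list.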

In [27]:
if __name__ == "__main__":
    questions = [
        "From the dataset, list 3 frameworks/tools used for multi-agent orchestration. For each, provide one source + date and a one-sentence use case.",
        "Find 3 sources about deploying LLM agents in production. What constraints/requirements do they mention (latency, safety, monitoring, access control, tool calling)? Provide title + date + link.",
        "Identify two sources that discuss tool-integrated reasoning / tool-calling. What tasks do they target, and what is the core idea? Include source + date + link.",
        "In 2026-dated sources, what themes appear around evaluation, reliability, monitoring, or safety of agents? Give 2–3 themes and cite at least 2 sources.",
        "Which sources mention LoRA (parameter-efficient fine-tuning), and what are they using it for? Provide date + link for each.",
    ]

    for i, q in enumerate(questions, start=1):
        print("\n" + "=" * 100)
        print(f"Q{i}: {q}")

        print("\n--- Original LLM (no RAG) ---")
        print(answer_no_rag(q))

        print("\n--- Simple RAG ---")
        print(answer_simple_rag(q))

        print("\n--- RAG + HyDE ---")
        print(answer_hyde_rag(q))

        print("\n--- RAG + Reranking ---")
        print(answer_rerank_rag(q))


Q1: From the dataset, list 3 frameworks/tools used for multi-agent orchestration. For each, provide one source + date and a one-sentence use case.

--- Original LLM (no RAG) ---
Here are three frameworks/tools used for multi-agent orchestration, along with a source and a brief use case for each:

1. **JADE (Java Agent Development Framework)**
   - **Source**: Bellifemine, F., Caire, G., & Greenwood, D. (2007). "Developing Multi-Agent Systems with JADE." *Wiley*.
   - **Use Case**: JADE is used to develop complex multi-agent systems where agents can communicate and collaborate to solve problems, such as in automated trading systems.

2. **ROS (Robot Operating System)**
   - **Source**: Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., & Leibs, J. (2009). "ROS: An Open-Source Robot Operating System." *ICRA Workshop on Open Source Software*.
   - **Use Case**: ROS facilitates multi-agent orchestration in robotic systems, enabling multiple robots to coordinate their actions for t