finekiwi/catchup

# CatchUp v2

You study. CatchUp connects.



## Problem Statement

Dropping a single document into GPT works fine for summarization. The problem is the forest: when you have a PDF, a notebook, and a few screenshots all covering related concepts, no tool connects what lives across them. You get isolated answers, not a unified understanding.

CatchUp ingests unstructured study materials — PDFs, Jupyter notebooks, images — parses them through a multi-format pipeline, generates structured study notes, and automatically links shared concepts across documents. The result is searchable, queryable knowledge that grows as you add more material.


## Architecture

```mermaid
flowchart TD
    A[File Upload\nPDF · ipynb · Image] --> B{Format Router}

    B -->|PDF| C[DoclingLoader\ntext + figure extraction]
    B -->|ipynb| D[nbformat\ncode · markdown · output cells]
    B -->|Image| E[VLM Pipeline\nclassify → analyze]

    C --> F[Unified Schema\nDocument → Blocks + metadata]
    D --> F
    E --> F

    F --> G[Note Generator\nLLM · prompts v1.x]
    F --> H[Concept Extractor\ncanonical name · backlink]

    G --> I[(SQLite\nmetadata)]
    H --> I
    G --> J[(ChromaDB\nvectors)]
    H --> J

    I --> K[Streamlit UI]
    J --> K

    K --> L[Study Notes\nmarkdown viewer]
    K --> M[Concept Map\npyvis graph]
    K --> N[RAG Q&A\nsource citation]
```

Per-stage technology:

| Stage | Technology |
| --- | --- |
| PDF parsing | DoclingLoader (Docling) |
| Notebook parsing | nbformat |
| Image classification + analysis | VLM API (OpenAI / Google / Anthropic) |
| Note generation | LLM + versioned prompts (`prompts/note_generation.py`) |
| Concept extraction | LLM + canonical name normalization |
| Vector search | LangChain RetrievalChain + ChromaDB |
| Metadata storage | SQLite |
| Observability | JSONL logging → Langfuse (planned) |
| UI | Streamlit + pyvis |
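All three parsers feed the same unified schema before anything downstream runs. As a rough mental model — the field names below are illustrative assumptions, not the repository's actual classes:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Block:
    """One parsed unit of a source document."""
    block_id: str
    kind: str               # e.g. "text" | "figure" | "code" | "markdown" | "output"
    content: str
    page: int | None = None  # source page, when the format has pages

@dataclass
class Document:
    """A parsed source file: ordered blocks plus free-form metadata."""
    doc_id: str
    source_format: str       # "pdf" | "ipynb" | "image"
    blocks: list[Block] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
```

Because every format reduces to `Document → Blocks`, the note generator and concept extractor never need to know whether input came from a PDF page or a notebook cell.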

## Key Features

### Parsing & Input

- ✅ PDF parser — DoclingLoader, text + figure block extraction
- ✅ ipynb parser — nbformat, separates code / markdown / output cells
- ✅ Storage layer — SQLite (metadata) + ChromaDB (vectors) + JSONL API logging
- ✅ VLM client wrapper — 10 models across OpenAI, Google, and Anthropic; unified interface with per-call cost tracking
- ✅ VLM prompts v1.1 — type-specific prompts: vlm_code, vlm_diagram, vlm_text; structured JSON output with confidence + error fields
- ✅ Image parser — VLM-based 5-class classification (code / diagram / text / equation / other) + type-specific routing
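The ipynb path relies on the fact that a notebook file is plain JSON. The cell-splitting step the parser performs can be sketched without the nbformat dependency (a stdlib-only illustration, not the project's actual parser; real code should prefer `nbformat.read` for version handling):

```python
import json

def split_cells(ipynb_json: str) -> dict[str, list[str]]:
    """Split a notebook's cells into code / markdown / output groups."""
    nb = json.loads(ipynb_json)
    groups: dict[str, list[str]] = {"code": [], "markdown": [], "output": []}
    for cell in nb.get("cells", []):
        src = "".join(cell.get("source", []))
        if cell["cell_type"] == "markdown":
            groups["markdown"].append(src)
        elif cell["cell_type"] == "code":
            groups["code"].append(src)
            # Execution outputs live on the code cell itself.
            for out in cell.get("outputs", []):
                text = out.get("text") or out.get("data", {}).get("text/plain", [])
                groups["output"].append("".join(text))
    return groups
```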

### LLM Pipeline

- ✅ Note generation prompts v1.4 — study note prompt with versioned iteration history (v1.0 → v1.4.1); per-version quality delta recorded in prompts/VERSION_LOG.md
- ⬜ Note generation pipeline — end-to-end document → markdown study note
- ⬜ Concept extraction + cross-document backlink
- ⬜ RAG Q&A with source citation (block id / page number)
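The planned canonical-name step is what makes cross-document backlinks possible: "KNN" in one document and "k-NN" in another must normalize to the same key. A rough sketch — the alias table and function names here are hypothetical, and in the real pipeline the normalization is LLM-assisted rather than rule-only:

```python
import re
from collections import defaultdict

# Hypothetical alias table; the real system would derive this with LLM help.
ALIASES = {"knn": "k-nearest neighbors", "k-nn": "k-nearest neighbors"}

def canonical(name: str) -> str:
    """Normalize an extracted concept name to its canonical form."""
    key = re.sub(r"\s+", " ", name.strip().lower())
    return ALIASES.get(key, key)

def backlinks(doc_concepts: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each canonical concept to the documents mentioning it (2+ docs only)."""
    links: dict[str, set[str]] = defaultdict(set)
    for doc_id, names in doc_concepts.items():
        for name in names:
            links[canonical(name)].add(doc_id)
    return {c: sorted(ids) for c, ids in links.items() if len(ids) > 1}
```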

### Evaluation

- ⬜ Evaluation framework — golden set (15–25 docs) + Before/After comparison (raw doc → LLM vs CatchUp → LLM)
- ⬜ VLM comparison experiment — 12 models, 8-axis analysis (details in VLM Comparison)
- ⬜ Edge case handling — taxonomy + fallback strategy (empty images, mixed-type, prompt injection, low-res)

### Operations & UI

- ⬜ Streamlit UI — upload, note viewer, concept map, RAG Q&A, keyword + vector search
- ⬜ Langfuse observability — stage-level latency / token / cost dashboard
- ⬜ Deployment — Streamlit Cloud or HuggingFace Spaces

## Tech Stack

| Category | Technology |
| --- | --- |
| Frontend | Streamlit |
| VLM — OpenAI (bootcamp-funded) | GPT-4o-mini, GPT-4.1 nano, GPT-4.1 mini, GPT-5 nano, GPT-5 mini |
| VLM — Google | Gemini 3.0 Flash, Gemini 3.1 Pro |
| VLM — Anthropic | Claude Haiku 4.5, Claude Sonnet 4.6 |
| VLM — Open-source | Qwen2-VL 7B INT4 (must), LLaVA-1.6 7B (should), PaliGemma (stretch) via RunPod GPU |
| LLM | Same model pool as VLM — note generation, concept extraction, evaluation |
| Document parsing | DoclingLoader (PDF), nbformat (ipynb) |
| RAG | LangChain RetrievalChain + ChromaDB |
| Database | SQLite (metadata) + ChromaDB (vectors) |
| Visualization | pyvis / streamlit-agraph |
| Observability | Langfuse |
| Deployment | Streamlit Cloud or HuggingFace Spaces |
| GPU (open-source VLM) | RunPod |
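In production the RAG row is handled by LangChain + ChromaDB, but the retrieve-then-cite shape underneath reduces to a top-k similarity search over block embeddings. A minimal stdlib illustration of that shape (function and tuple layout are assumptions for the sketch, not the project's API):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    """index: list of (block_id, page, embedding). Returns top-k citations,
    which is what lets answers carry block id / page number sources."""
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[2]), reverse=True)
    return [(block_id, page) for block_id, page, _ in ranked[:k]]
```

ChromaDB does this (plus persistence and ANN indexing) for real embedding dimensions; the point here is only that every retrieved chunk keeps its `(block_id, page)` provenance for citation.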

## VLM Comparison

A systematic comparison of 12 VLM models on IT study material parsing is planned for Week 3.

Models: 9 commercial APIs (OpenAI × 5, Google × 2, Anthropic × 2) + open-source Qwen2-VL 7B (required), LLaVA-1.6 7B and PaliGemma if time permits.

8-axis analysis:

  1. Cross-provider comparison — same price tier across OpenAI / Google / Anthropic
  2. Price tier comparison — cost range spans ~150×; quality delta measured
  3. Commercial vs. open-source — cost / privacy / latency trade-offs
  4. Reasoning vs. non-reasoning — GPT-5 (CoT) vs. GPT-4.1 on document layout parsing (no existing benchmark covers this)
  5. Generational jump — GPT-4o-mini → GPT-4.1 → GPT-5 quality delta
  6. Nano vs. mini gap — within the same generation, 5–8× cost difference vs. quality difference
  7. Router classification accuracy — image-type precision/recall including edge cases
  8. Resolution optimization — original / 1600px / 1024px / 512px × grayscale; cost and quality impact quantified
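For the cost half of axis 8, per-image token cost can be estimated up front using the tile-based formula OpenAI documents for GPT-4o-class high-detail images (a sketch under that one assumption — other providers and models price images differently):

```python
import math

def vision_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image token cost under the published GPT-4o-style tile formula."""
    if detail == "low":
        return 85  # flat cost regardless of resolution
    # 1) Scale to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2) Scale so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) Count 512x512 tiles: 170 tokens per tile + 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

Under this formula, downscaling from 1024px to 512px on the shortest side cuts the image cost from 765 to 255 tokens — exactly the kind of delta axis 8 measures against quality loss.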

Evaluation metrics: NED + AST parse success rate (code), LLM-as-judge 3-axis rubric (diagrams), ANLS + reading order consistency (text), Kruskal-Wallis test for cross-model significance.
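NED here means normalized edit distance: Levenshtein distance divided by the longer string's length, giving a score in [0, 1] where 0 is an exact match. A minimal reference implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, ref: str) -> float:
    """Normalized edit distance in [0, 1]; 0 = exact match."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```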

Gap vs. existing benchmarks: DocVQA and OCRBench measure short-answer extraction. This experiment targets markdown structure preservation, diagram-to-text transformation, and Korean/English mixed IT domain — areas no public benchmark currently covers.

Results coming soon.


## Project Context

Portfolio purpose: Demonstrating an LLM Engineer pivot from a Computer Vision background.

  • VLM as leverage: CV expertise in visual data understanding directly applies to the input layer — image classification, document layout interpretation, multi-modal parsing.
  • LLM pipeline as proof: The core competency being demonstrated is pipeline design, prompt engineering (versioned, evaluated), RAG architecture, evaluation framework, and operational observability — not just calling an API.
  • Why this problem: The gap between "GPT can summarize one document" and "a system that connects knowledge across documents" is an engineering problem, not just a model problem. CatchUp is the engineering argument.

## Development Status

| Period | Focus | Weight | Status |
| --- | --- | --- | --- |
| Week 1 (first half) | Multi-format parsing · unified schema · DB setup | 50% | ✅ Done |
| Week 1 (second half) | VLM/LLM note generation · image-type prompts | 50% | 🔧 In Progress |
| Week 2 (first half) | RAG Q&A · evaluation framework · error handling | 30% | ⬜ Upcoming |
| Week 2 (second half) | Concept extraction · backlink · Streamlit UI | 20% | ⬜ Upcoming |
| Week 3 | VLM comparison experiment · observability · deployment · README | Wrap-up | ⬜ Upcoming |

Current position: Week 1 second half — note generation pipeline and demo UI.

About

VLM/LLM pipeline for structuring unstructured learning materials with concept linking and RAG-based QA
