
feat: PDF support + local file ingestion + price-table drift check#24

Merged
askalf merged 3 commits into master from claude/plan-deep-dive-next-7b8Lx
May 6, 2026

@askalf askalf commented May 6, 2026

Summary

Closes the two biggest content-coverage gaps in deepdive in one PR.

  • PDF source support — both web-fetched (https://x/paper.pdf, or anything served as application/pdf) and local (--include=./paper.pdf). Routed through pdfjs-dist via a dedicated extractor; the headless browser short-circuits to a plain HTTP GET so we get the bytes (Chromium's PDF viewer doesn't expose useful text via the DOM).
  • Local file ingestion — new --include=<path>[,<path>] flag accepts files or directories. Supports .pdf (needs pdfjs-dist), .md, .txt, .html. Local sources sit at the head of the kept list, get the lowest [N] citation IDs, and render as file:///abs/path URLs the user can click.
  • Price-table drift check in deepdive doctor — closes the v0.6.0 commitment that "drift is intentional, audit happens at PR time" by warning if PRICE_TABLE_VERIFIED_AT is more than 90 days stale.
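The PDF routing rule in the first bullet (a URL ending in .pdf, or a response served as application/pdf) can be sketched roughly as below. This is a minimal illustration, not the code in src/pdf.ts: the name looksLikePdfSketch and its signature are assumptions, and the real looksLikePdf may take different arguments.

```typescript
// Hypothetical stand-in for the routing check; the real looksLikePdf in
// src/pdf.ts may take different arguments.
export function looksLikePdfSketch(url: string, contentType?: string): boolean {
  // Content-Type may carry parameters, e.g. "application/pdf; charset=binary".
  if (/^\s*application\/pdf\s*(;|$)/i.test(contentType ?? "")) return true;

  let path = url;
  try {
    // Use only the path so "?download=1" or "#page=2" do not hide the extension.
    path = new URL(url).pathname;
  } catch {
    // Not an absolute URL (for example a local path): test the raw string.
  }
  return path.toLowerCase().endsWith(".pdf");
}
```

Checking the Content-Type first covers download endpoints whose URL carries no extension at all.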

How pdfjs-dist is shipped

Per the design conversation: as an optional, lazy-imported dependency, not a runtime dep. This preserves the "one runtime dependency" headline guarantee on default installs. Users who want PDF support run npm install -g pdfjs-dist once; deepdive doctor reports the state.

If pdfjs-dist isn't installed:

  • Web PDFs are skipped with a fetch.skipped event whose reason is "pdf-no-extractor"
  • Local PDFs are recorded in LocalIngestResult.skipped[] with reason "pdfjs-dist not installed"

pdfjs-dist is added as a devDependency so CI tests exercise the real extractor (round-trip a minimal in-memory PDF through extractPdfText).
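For readers unfamiliar with the optional, lazy-import pattern described above, here is a minimal sketch. The helper names tryImport and pdfSkipReason are hypothetical and are not exports of this PR; only the package name pdfjs-dist and the skip reason "pdf-no-extractor" come from the description.

```typescript
// Hypothetical helpers illustrating an optional dependency that is resolved
// only when first needed. Nothing here is the PR's actual code.
export type OptionalModule = Record<string, unknown>;

export async function tryImport(specifier: string): Promise<OptionalModule | null> {
  try {
    // Dynamic import: the package is looked up at call time, so it does not
    // have to be listed as a runtime dependency.
    return (await import(specifier)) as OptionalModule;
  } catch {
    // Assumption: any load failure is treated as "not installed". A stricter
    // version could inspect the error code before deciding.
    return null;
  }
}

// Maps "extractor missing" onto the skip reason named in the PR description.
export async function pdfSkipReason(specifier = "pdfjs-dist"): Promise<string | null> {
  return (await tryImport(specifier)) === null ? "pdf-no-extractor" : null;
}
```

A caller would check pdfSkipReason() before attempting extraction and emit the skip event when it returns a non-null reason.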

Headline DX

# Mix your project notes with web research:
deepdive "what's our policy on retroactive billing?" \
  --include=~/notes/billing,./CONTRIBUTING.md \
  --search=brave --deep

Hosted research tools (Perplexity, OpenAI DR, Gemini DR) cannot do this — your notes don't leave your machine, and the cited answer points back at file:// URLs the user can click open.
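As an illustration of how local sources could end up with the lowest [N] citation IDs and clickable file:// URLs, consider the sketch below. The CitedSource shape and the numberSources function are assumptions made for this example; only pathToFileURL and resolve are real Node.js APIs.

```typescript
import { resolve } from "node:path";
import { pathToFileURL } from "node:url";

// Hypothetical shape; the real kept-list entries carry more fields.
export interface CitedSource {
  id: number;
  url: string;
  local: boolean;
}

export function numberSources(localPaths: string[], webUrls: string[]): CitedSource[] {
  // Local files become absolute file:// URLs. Note that resolve() does not
  // expand "~"; that is left to the shell or the caller.
  const local = localPaths.map((p) => ({ url: pathToFileURL(resolve(p)).href, local: true }));
  const web = webUrls.map((url) => ({ url, local: false }));
  // Local sources go first, so they receive the lowest citation IDs.
  return local.concat(web).map((s, i) => ({ id: i + 1, url: s.url, local: s.local }));
}
```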

What's added

  • src/pdf.ts (~180 lines, no new runtime deps): extractPdfText, isPdfExtractorAvailable, looksLikePdf, joinTextItems, dedupeRunningHeadersFooters, PdfExtractorMissingError
  • src/local.ts (~140 lines): ingestLocalPaths, expandPaths, stripTags
  • BrowserSession PDF short-circuit + new mimeType / bytes fields on FetchedPage
  • New CLI flags: --include, --pdf-max-pages (default 50)
  • New env vars: DEEPDIVE_INCLUDE, DEEPDIVE_PDF_MAX_PAGES
  • New agent events: include.done, fetch.skipped reason "pdf-no-extractor"
  • AgentConfig.include, AgentConfig.pdfMaxPages for library consumers
  • deepdive doctor checks: pdf.extractor, pricing.table
  • New PRICE_TABLE_VERIFIED_AT, PRICE_TABLE_STALE_AFTER_DAYS, and daysAgo() exports in src/pricing.ts
  • README "PDFs and local files" section; CHANGELOG entry under v0.7.0; package.json bumped
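One plausible way to resolve the page cap from the new flag and env var is sketched below. The precedence (flag over env var over the default of 50), the --flag=value form for --pdf-max-pages, and the function name resolvePdfMaxPages are all assumptions; the PR description states only the flag name, the env var name, and the default.

```typescript
// Hypothetical resolver; precedence and validation rules are assumptions.
export function resolvePdfMaxPages(
  argv: string[],
  env: Record<string, string | undefined>,
): number {
  const DEFAULT_MAX_PAGES = 50; // default stated in the PR description
  const prefix = "--pdf-max-pages=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  const raw = flag !== undefined ? flag.slice(prefix.length) : env["DEEPDIVE_PDF_MAX_PAGES"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_MAX_PAGES;

  const pages = Number(raw);
  if (!Number.isInteger(pages) || pages < 1) {
    throw new RangeError(`pdf max pages must be a positive integer, got "${raw}"`);
  }
  return pages;
}
```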

What's explicitly out of scope (v1)

  • Recursive directory walking (--include=$HOME shouldn't ingest a thousand files; one level deep is defensive)
  • Glob patterns (defer until there's a real ask — the current "file or dir" surface covers the common cases)
  • OCR for image-only PDFs (no text layer = nothing to extract; that's a Tesseract-class problem, not pdfjs's)
  • A --ocr flag

Test plan

  • 11 unit tests in test/pdf.test.mjs (pure helpers + an in-memory minimal-PDF round-trip via pdfjs-dist)
  • 9 unit tests in test/local.test.mjs (stripTags, expandPaths dedup + dir expansion + missing path, ingestLocalPaths MD/TXT/HTML/word-cap/skipped)
  • 2 agent-loop integration tests: --include ingestion alongside web sources; PDF byte→synth end-to-end with a hand-rolled minimal PDF
  • 5 new pricing tests covering daysAgo math and drift-constant coherence
  • 3 new doctor tests (pdf.extractor ok, pricing.table ok-when-fresh, pricing.table warn-when-stale)
  • 4 CLI/config flag-plumbing tests
  • npm run typecheck clean
  • npm run build clean
  • npm test — 314/314 passing (up from 275)
  • --help smoke test shows --include and --pdf-max-pages
  • deepdive doctor --json smoke test shows the two new categories

claude added 2 commits May 6, 2026 15:41
…heck

Closes the two biggest content-coverage gaps in deepdive: real research
hits PDFs constantly (academic papers, RFCs, standards docs), and the
most useful sources are often already on the user's laptop (notes,
internal docs). Both now work.

PDF extraction (src/pdf.ts, ~180 lines):
- Detected by URL extension or Content-Type: application/pdf
- Routed through pdfjs-dist instead of the headless browser DOM
  (Chromium's PDF viewer doesn't expose useful text)
- Page cap default 50, configurable via --pdf-max-pages / DEEPDIVE_PDF_MAX_PAGES
- Frequency-based dedup of running headers/footers (60% threshold across pages)
- BrowserSession.fetch short-circuits PDF URLs to a plain HTTP GET via
  Playwright's request context (page.goto can hang on PDFs at networkidle)

To preserve the "one runtime dependency" headline: pdfjs-dist is NOT a
runtime dep. It's dynamically imported on first use; missing → source
skipped with fetch.skipped event reason "pdf-no-extractor" and a one-line
install hint. Added as devDependency for tests. Users opt in via
`npm install -g pdfjs-dist`. deepdive doctor reports the install state.

Local file ingestion (src/local.ts, ~140 lines):
- New flag --include=<paths> (comma-separated files / dirs)
- Supports .pdf (needs pdfjs-dist), .md, .txt, .html
- Pre-fetched sources land at the head of the kept list (lowest [N] ids)
- file:///abs/path URLs in the citation footer
- Dir expansion is one level deep (defensive — pointing at $HOME shouldn't
  ingest a thousand files)
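
A rough sketch of the one-level directory expansion follows. It is not the expandPaths from src/local.ts: the return shape, the decision to filter explicitly named files by extension, and the name expandPathsSketch are assumptions.

```typescript
import { existsSync, readdirSync, statSync } from "node:fs";
import { extname, join, resolve } from "node:path";

// Extensions listed in the PR description.
const SUPPORTED_EXTENSIONS = new Set([".pdf", ".md", ".txt", ".html"]);

export function expandPathsSketch(spec: string): { files: string[]; missing: string[] } {
  const files = new Set<string>(); // a Set dedupes a file named both directly and via its directory
  const missing: string[] = [];

  for (const raw of spec.split(",").map((s) => s.trim()).filter((s) => s !== "")) {
    const abs = resolve(raw);
    if (!existsSync(abs)) {
      missing.push(raw);
      continue;
    }
    if (statSync(abs).isDirectory()) {
      // One level only: subdirectories are ignored rather than walked.
      for (const entry of readdirSync(abs, { withFileTypes: true })) {
        if (entry.isFile() && SUPPORTED_EXTENSIONS.has(extname(entry.name).toLowerCase())) {
          files.add(join(abs, entry.name));
        }
      }
    } else if (SUPPORTED_EXTENSIONS.has(extname(abs).toLowerCase())) {
      files.add(abs);
    }
  }
  return { files: Array.from(files), missing };
}
```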

Doctor: pdf + pricing.table checks
- pdf.extractor: ok / info depending on pdfjs-dist resolution
- pricing.table: warns if PRICE_TABLE_VERIFIED_AT (new constant) is more
  than 90 days old. Closes v0.6.0's "drift is intentional, audit happens
  at PR time" loop — undeclared drift now produces a visible warning.
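
The drift check reduces to a date difference compared against a threshold. The sketch below assumes PRICE_TABLE_VERIFIED_AT is an ISO date string and that daysAgo floors to whole days; the names daysAgoSketch and pricingTableCheck, the example date, and the message text are illustrative only.

```typescript
// Hypothetical values and helpers; only the 90-day threshold comes from the PR.
export const STALE_AFTER_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

export function daysAgoSketch(isoDate: string, now: Date = new Date()): number {
  // Date-only ISO strings such as "2026-05-06" parse as UTC midnight.
  const then = Date.parse(isoDate);
  if (Number.isNaN(then)) throw new RangeError(`invalid date: ${isoDate}`);
  return Math.floor((now.getTime() - then) / MS_PER_DAY);
}

export function pricingTableCheck(
  verifiedAt: string,
  now: Date = new Date(),
): { status: "ok" | "warn"; message: string } {
  const age = daysAgoSketch(verifiedAt, now);
  // "More than 90 days" is read strictly: exactly 90 days is still ok.
  if (age > STALE_AFTER_DAYS) {
    return { status: "warn", message: `price table last verified ${age} days ago` };
  }
  return { status: "ok", message: `price table verified ${age} days ago` };
}
```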

35 new tests (11 pdf, 9 local, 2 agent integration including end-to-end
PDF byte→synth, 5 pricing drift, 3 doctor, 4 CLI/config plumbing).
Suite goes from 275 → 310. Typecheck clean, build clean.

v0.7.0.
- src/local.ts:152 — bad HTML filtering regexp. The </script>/</style>
  patterns required the exact byte sequence; </script > (with whitespace
  before the close bracket) would slip through. Switched to lazy
  `[\s\S]*?<\/script\s*>` which tolerates whitespace and is bounded
  (no nested quantifier risk).
- src/local.ts:157 — double unescape. Sequential .replace() calls
  decoded "&amp;lt;" to "<" instead of "&lt;". Replaced the chain with
  a single-pass decode via a callback so each entity in the original
  string is decoded exactly once.
- src/pdf.ts:211 — polynomial regex. ` *\n *\/g` against an input with
  many spaces and no newlines is O(n²) — engine consumes spaces at
  each starting position then fails on the missing \n. Rewrote
  collapseWhitespace as a per-line walk (split → trim → filter blanks).
  All operations are linear.
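
The split, trim, filter rewrite described in the last item might look like the sketch below. The function name collapseWhitespaceSketch and the choice to rejoin with a single newline are assumptions; the regex in the comment shows the general shape of the replaced pattern, not necessarily its exact text.

```typescript
// A pattern shaped like / *\n */g can backtrack quadratically on long runs of
// spaces with no newline. Walking the text line by line avoids regex
// backtracking entirely.
export function collapseWhitespaceSketch(text: string): string {
  return text
    .split("\n")                        // one pass to cut lines
    .map((line) => line.trim())         // strip leading/trailing whitespace per line
    .filter((line) => line.length > 0)  // drop blank lines
    .join("\n");
}
```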

Added 2 regression tests covering </script > whitespace and the
&amp;lt; double-unescape case. 312/312 passing.
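
To make the double-unescape bug concrete, the sketch below contrasts a chained decode with a single-pass decode. The entity table is deliberately tiny and the function names are hypothetical; the actual fix in src/local.ts may cover more entities.

```typescript
const ENTITY_TABLE: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
};

// Buggy shape: the output of the first replace feeds the second, so
// "&amp;lt;" becomes "&lt;" and is then decoded again to "<".
export function decodeEntitiesChained(input: string): string {
  return input.replace(/&amp;/g, "&").replace(/&lt;/g, "<").replace(/&gt;/g, ">");
}

// Single pass: each entity in the original string is matched and replaced
// exactly once, and replacement text is never rescanned.
export function decodeEntitiesOnce(input: string): string {
  return input.replace(/&(?:amp|lt|gt|quot|#39);/g, (entity) => ENTITY_TABLE[entity] ?? entity);
}
```

With these definitions, decodeEntitiesChained("&amp;lt;") yields "<" while decodeEntitiesOnce("&amp;lt;") yields "&lt;".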
Comment thread src/local.ts Fixed
CodeQL flagged the </script\s*> form (from b4c4b04) as still insufficient
because closing tags like </script\t\n bar> or </style attr=x> have
non-whitespace content between the tag name and the >. Browsers accept
those, so the sanitizer must too.

Switched to </script\b[^>]*> which:
- requires a word-boundary after the tag name (no </scriptbar>)
- accepts any non-> character until the close bracket
- is bounded ([^>]* with no nested quantifier — no polynomial risk)

Added 2 regression tests covering </script\t\n bar> and </style attr=x>.
314/314 passing.

Same offline-trusted-input scope caveat as before — this is for
converting saved HTML files to text, not sanitizing untrusted input.
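
For reference, a sketch of script/style stripping built around the closing-tag pattern from this commit. Only the </script\b[^>]*> closing form is taken from the message above; the opening-tag half of each regex and the function name stripScriptAndStyle are assumptions, and the same trusted-input caveat applies.

```typescript
// Hypothetical sketch for turning saved HTML into text; not a sanitizer for
// untrusted input.
export function stripScriptAndStyle(html: string): string {
  return html
    // \b rejects names like </scriptbar>; [^>]* tolerates whitespace or
    // attribute-like junk before the closing bracket.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\b[^>]*>/gi, " ")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\b[^>]*>/gi, " ");
}
```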
@askalf askalf merged commit d3ed6d0 into master May 6, 2026
5 checks passed
@askalf askalf deleted the claude/plan-deep-dive-next-7b8Lx branch May 6, 2026 20:19