feat: PDF support + local file ingestion + price-table drift check by askalf · Pull Request #24 · askalf/deepdive

askalf · 2026-05-06T15:42:25Z

Summary

Closes the two biggest content-coverage gaps in deepdive in one PR.

PDF source support — both web-fetched (https://x/paper.pdf, or anything served as application/pdf) and local (--include=./paper.pdf). Routed through pdfjs-dist via a dedicated extractor; the headless browser short-circuits to a plain HTTP GET so we get the bytes (Chromium's PDF viewer doesn't expose useful text via the DOM).
Local file ingestion — new --include=<path>[,<path>] flag accepts files or directories. Supports .pdf (needs pdfjs-dist), .md, .txt, .html. Local sources sit at the head of the kept list, get the lowest [N] citation IDs, and render as file:///abs/path URLs the user can click.
Price-table drift check in deepdive doctor — closes the v0.6.0 commitment that "drift is intentional, audit happens at PR time" by warning if PRICE_TABLE_VERIFIED_AT is more than 90 days stale.
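As a rough illustration of the PDF detection described above, here is a minimal sketch. All names and signatures are hypothetical and not taken from the PR's `looksLikePdf`. The URL-extension and Content-Type checks follow the description; the `%PDF-` magic-byte check is an extra assumption added for illustration.

```typescript
// Hypothetical sketch: names and signatures are assumptions, not the PR's code.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // the bytes of "%PDF-"

function hasPdfExtension(url: string): boolean {
  try {
    // Use the parsed pathname so query strings such as "?dl=1" are ignored.
    return new URL(url).pathname.toLowerCase().endsWith(".pdf");
  } catch {
    // Not an absolute URL (for example a local path): plain suffix check.
    return url.toLowerCase().endsWith(".pdf");
  }
}

function isPdfContentType(contentType?: string | null): boolean {
  if (!contentType) return false;
  // Drop parameters such as "; charset=binary" before comparing.
  const mediaType = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return mediaType === "application/pdf";
}

// Assumption: sniffing the leading bytes is not mentioned in the PR text.
function startsWithPdfMagic(bytes: Uint8Array): boolean {
  return bytes.length >= PDF_MAGIC.length && PDF_MAGIC.every((b, i) => bytes[i] === b);
}

function looksLikePdfSketch(
  url: string,
  contentType?: string | null,
  bytes?: Uint8Array,
): boolean {
  return (
    hasPdfExtension(url) ||
    isPdfContentType(contentType) ||
    (bytes !== undefined && startsWithPdfMagic(bytes))
  );
}
```

Checking the pathname rather than the raw URL keeps download links with query parameters from being missed.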

How `pdfjs-dist` is shipped

Per the design conversation: as an optional, lazy-imported dependency, not a runtime dep. This preserves the "one runtime dependency" headline guarantee on default installs. Users who want PDF support run npm install -g pdfjs-dist once; deepdive doctor reports the state.

If pdfjs-dist isn't installed:

Web PDFs are skipped with a fetch.skipped event whose reason is "pdf-no-extractor"
Local PDFs are recorded in LocalIngestResult.skipped[] with reason "pdfjs-dist not installed"

pdfjs-dist is added as a devDependency so CI tests exercise the real extractor (round-trip a minimal in-memory PDF through extractPdfText).
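The optional, lazy-import pattern could look roughly like the sketch below. It is not the PR's implementation: `ExtractorMissingError`, `loadOptionalModule`, and `isModuleAvailable` are hypothetical stand-ins for `PdfExtractorMissingError` and `isPdfExtractorAvailable`, and the sketch treats any import failure as "not installed", which is a simplification.

```typescript
// Hypothetical sketch of an optional, lazily imported dependency.
class ExtractorMissingError extends Error {
  constructor(moduleName: string) {
    super(`${moduleName} is not installed; run: npm install -g ${moduleName}`);
    this.name = "ExtractorMissingError";
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, ExtractorMissingError.prototype);
  }
}

const moduleCache = new Map<string, Promise<unknown>>();

function loadOptionalModule(moduleName: string): Promise<unknown> {
  let pending = moduleCache.get(moduleName);
  if (!pending) {
    // A non-literal specifier means the package is only resolved at call time,
    // so a default install without it still starts normally.
    pending = import(moduleName).catch(() => {
      moduleCache.delete(moduleName); // allow a retry after the user installs it
      throw new ExtractorMissingError(moduleName);
    });
    moduleCache.set(moduleName, pending);
  }
  return pending;
}

async function isModuleAvailable(moduleName: string): Promise<boolean> {
  try {
    await loadOptionalModule(moduleName);
    return true;
  } catch (err) {
    if (err instanceof ExtractorMissingError) return false;
    throw err;
  }
}
```

A caller would catch `ExtractorMissingError` and turn it into the skip outcome listed above (a `fetch.skipped` event or a `skipped[]` entry) instead of failing the whole run.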

Headline DX

# Mix your project notes with web research:
deepdive "what's our policy on retroactive billing?" \
  --include=~/notes/billing,./CONTRIBUTING.md \
  --search=brave --deep

Hosted research tools (Perplexity, OpenAI DR, Gemini DR) cannot do this — your notes don't leave your machine, and the cited answer points back at file:// URLs the user can click open.
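To make the citation behaviour concrete, here is a small sketch of how local sources could be placed ahead of web sources so they receive the lowest [N] IDs. The `Source` shape, the function names, and the footer format are assumptions for illustration; the PR's actual data structures may differ.

```typescript
// Hypothetical sketch: types, names, and footer format are illustrative only.
interface Source {
  url: string; // file:///abs/path for local files, https://... for web pages
  title: string;
}

interface CitedSource extends Source {
  id: number; // 1-based citation number rendered as [N]
}

function assignCitationIds(local: Source[], web: Source[]): CitedSource[] {
  // Local sources come first, so they always get the lowest IDs.
  return [...local, ...web].map((source, index) => ({ ...source, id: index + 1 }));
}

function renderCitationFooter(sources: CitedSource[]): string {
  return sources.map((s) => `[${s.id}] ${s.title} ${s.url}`).join("\n");
}
```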

What's added

src/pdf.ts (~180 lines, no new runtime deps): extractPdfText, isPdfExtractorAvailable, looksLikePdf, joinTextItems, dedupeRunningHeadersFooters, PdfExtractorMissingError
src/local.ts (~140 lines): ingestLocalPaths, expandPaths, stripTags
BrowserSession PDF short-circuit + new mimeType / bytes fields on FetchedPage
New CLI flags: --include, --pdf-max-pages (default 50)
New env vars: DEEPDIVE_INCLUDE, DEEPDIVE_PDF_MAX_PAGES
New agent events: include.done, fetch.skipped reason "pdf-no-extractor"
AgentConfig.include, AgentConfig.pdfMaxPages for library consumers
deepdive doctor checks: pdf.extractor, pricing.table
New PRICE_TABLE_VERIFIED_AT, PRICE_TABLE_STALE_AFTER_DAYS, and daysAgo() exports in src/pricing.ts
README "PDFs and local files" section; CHANGELOG entry under v0.7.0; package.json bumped
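The pricing exports suggest a drift check along the following lines. This is a sketch under assumptions: the value of `PRICE_TABLE_VERIFIED_AT` shown here is illustrative, and `DoctorCheck` and `checkPricingTable` are hypothetical names, not the PR's doctor API.

```typescript
// Hypothetical sketch of the pricing.table doctor check.
const PRICE_TABLE_VERIFIED_AT = "2026-05-06"; // illustrative value, not from the PR
const PRICE_TABLE_STALE_AFTER_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between an ISO date and `now`.
function daysAgo(isoDate: string, now: Date = new Date()): number {
  const then = Date.parse(isoDate);
  if (Number.isNaN(then)) throw new Error(`Invalid date: ${isoDate}`);
  return Math.floor((now.getTime() - then) / MS_PER_DAY);
}

interface DoctorCheck {
  name: string;
  status: "ok" | "warn";
  detail: string;
}

function checkPricingTable(now: Date = new Date()): DoctorCheck {
  const age = daysAgo(PRICE_TABLE_VERIFIED_AT, now);
  // "More than 90 days" means day 90 itself is still ok.
  const stale = age > PRICE_TABLE_STALE_AFTER_DAYS;
  return {
    name: "pricing.table",
    status: stale ? "warn" : "ok",
    detail: stale
      ? `price table last verified ${age} days ago; re-audit and bump PRICE_TABLE_VERIFIED_AT`
      : `price table verified ${age} days ago`,
  };
}
```

Passing `now` as a parameter keeps the check deterministic in tests.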

What's explicitly out of scope (v1)

Recursive directory walking (--include=$HOME shouldn't ingest a thousand files; one level deep is defensive)
Glob patterns (defer until there's a real ask — the current "file or dir" surface covers the common cases)
OCR for image-only PDFs (no text layer = nothing to extract; that's a Tesseract-class problem, not pdfjs's)
A --ocr flag
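The one-level-deep directory rule can be sketched as follows. The filesystem is hidden behind a small hypothetical `FsLike` interface so the example runs without touching disk; real code would presumably use `node:fs`. `expandOneLevel` is an illustrative name, not the PR's `expandPaths`.

```typescript
// Hypothetical sketch of comma-separated --include expansion, one level deep.
interface DirEntry {
  name: string;
  isFile: boolean;
}

interface FsLike {
  isDirectory(path: string): boolean;
  list(path: string): DirEntry[];
}

const SUPPORTED_EXTENSIONS = [".pdf", ".md", ".txt", ".html"];

function hasSupportedExtension(name: string): boolean {
  const lower = name.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

function expandOneLevel(includeArg: string, fs: FsLike): string[] {
  const files: string[] = [];
  for (const raw of includeArg.split(",")) {
    const path = raw.trim();
    if (path === "") continue;
    if (!fs.isDirectory(path)) {
      files.push(path); // explicit files pass through; later stages may skip them
      continue;
    }
    const base = path.endsWith("/") ? path.slice(0, -1) : path;
    for (const entry of fs.list(path)) {
      // Subdirectories are skipped on purpose: no recursion.
      if (entry.isFile && hasSupportedExtension(entry.name)) {
        files.push(`${base}/${entry.name}`);
      }
    }
  }
  return files;
}
```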

Test plan

…heck

Closes the two biggest content-coverage gaps in deepdive: real research hits PDFs constantly (academic papers, RFCs, standards docs), and the most useful sources are often already on the user's laptop (notes, internal docs). Both now work.

PDF extraction (src/pdf.ts, ~180 lines):
- Detected by URL extension or Content-Type: application/pdf
- Routed through pdfjs-dist instead of the headless browser DOM (Chromium's PDF viewer doesn't expose useful text)
- Page cap default 50, configurable via --pdf-max-pages / DEEPDIVE_PDF_MAX_PAGES
- Frequency-based dedup of running headers/footers (60% threshold across pages)
- BrowserSession.fetch short-circuits PDF URLs to a plain HTTP GET via Playwright's request context (page.goto can hang on PDFs at networkidle)

To preserve the "one runtime dependency" headline: pdfjs-dist is NOT a runtime dep. It's dynamically imported on first use; missing → source skipped with fetch.skipped event reason "pdf-no-extractor" and a one-line install hint. Added as devDependency for tests. Users opt in via `npm install -g pdfjs-dist`. deepdive doctor reports the install state.

Local file ingestion (src/local.ts, ~140 lines):
- New flag --include=<paths> (comma-separated files / dirs)
- Supports .pdf (needs pdfjs-dist), .md, .txt, .html
- Pre-fetched sources land at the head of the kept list (lowest [N] ids)
- file:///abs/path URLs in the citation footer
- Dir expansion is one level deep (defensive: pointing at $HOME shouldn't ingest a thousand files)

Doctor: pdf + pricing.table checks
- pdf.extractor: ok / info depending on pdfjs-dist resolution
- pricing.table: warns if PRICE_TABLE_VERIFIED_AT (new constant) is more than 90 days old. Closes v0.6.0's "drift is intentional, audit happens at PR time" loop: undeclared drift now produces a visible warning.

35 new tests (11 pdf, 9 local, 2 agent integration including end-to-end PDF byte→synth, 5 pricing drift, 3 doctor, 4 CLI/config plumbing). Suite goes from 275 → 310.
Typecheck clean, build clean. v0.7.0.
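The frequency-based header/footer dedup mentioned in this commit might work roughly like the sketch below. Only the 60% threshold comes from the commit message. Counting every line (not just the first and last of each page), normalizing digit runs so "Page 1" and "Page 2" match, and skipping documents under three pages are all assumptions, and `dedupeRunningLines` is a hypothetical name.

```typescript
// Hypothetical sketch of frequency-based removal of running headers/footers.
function dedupeRunningLines(pages: string[], threshold: number = 0.6): string[] {
  // Assumption: with very few pages, repetition cannot be told from content.
  if (pages.length < 3) return pages;

  // Assumption: digit runs are normalized so page numbers compare equal.
  const normalize = (line: string): string => line.trim().replace(/\d+/g, "#");

  // Count, for each normalized line, the number of pages it appears on.
  const pageCounts = new Map<string, number>();
  for (const page of pages) {
    const seenOnPage = new Set<string>();
    for (const line of page.split("\n")) {
      const key = normalize(line);
      if (key !== "") seenOnPage.add(key);
    }
    seenOnPage.forEach((key) => {
      pageCounts.set(key, (pageCounts.get(key) ?? 0) + 1);
    });
  }

  // Drop lines that occur on at least `threshold` of all pages.
  const cutoff = threshold * pages.length;
  return pages.map((page) =>
    page
      .split("\n")
      .filter((line) => (pageCounts.get(normalize(line)) ?? 0) < cutoff)
      .join("\n"),
  );
}
```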

- src/local.ts:152: bad HTML filtering regexp. The </script>/</style> patterns required the exact byte sequence; </script > (with whitespace before the close bracket) would slip through. Switched to lazy `[\s\S]*?<\/script\s*>`, which tolerates whitespace and is bounded (no nested quantifier risk).
- src/local.ts:157: double unescape. Sequential .replace() calls decoded "&amp;lt;" to "<" instead of "&lt;". Replaced the chain with a single-pass decode via a callback so each entity in the original string is decoded exactly once.
- src/pdf.ts:211: polynomial regex. `/ *\n */g` against an input with many spaces and no newlines is O(n²): the engine consumes spaces at each starting position, then fails on the missing \n. Rewrote collapseWhitespace as a per-line walk (split → trim → filter blanks). All operations are linear.

Added 2 regression tests covering </script > whitespace and the &amp;lt; double-unescape case. 312/312 passing.
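The two non-regex-tag fixes in this commit can be illustrated with a short sketch. It is not the code from src/local.ts or src/pdf.ts: the entity table, the function names, and the choice to rejoin lines with "\n" are assumptions.

```typescript
// Hypothetical sketch of a single-pass entity decode and a linear whitespace collapse.
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  nbsp: " ",
};

function decodeEntitiesOnce(text: string): string {
  // One pass with a callback: output is never rescanned, so "&amp;lt;"
  // becomes "&lt;" and is not decoded a second time into "<".
  return text.replace(/&([a-z]+|#\d+);/gi, (match: string, body: string) => {
    if (body.charAt(0) === "#") {
      const codePoint = Number(body.slice(1));
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    return NAMED_ENTITIES[body.toLowerCase()] ?? match; // unknown names stay as-is
  });
}

function collapseWhitespaceLinear(text: string): string {
  // split -> trim -> filter blanks: each step touches every character once.
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .join("\n");
}
```

A chained `.replace("&amp;", "&").replace("&lt;", "<")` would rescan its own output, which is how the double unescape arises.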

CodeQL flagged the </script\s*> form (from b4c4b04) as still insufficient because closing tags like </script\t\n bar> or </style attr=x> have non-whitespace content between the tag name and the >. Browsers accept those, so the sanitizer must too. Switched to </script\b[^>]*> which:
- requires a word-boundary after the tag name (no </scriptbar>)
- accepts any non-> character until the close bracket
- is bounded ([^>]* with no nested quantifier, so no polynomial risk)

Added 2 regression tests covering </script\t\n bar> and </style attr=x>. 314/314 passing. Same offline-trusted-input scope caveat as before: this is for converting saved HTML files to text, not sanitizing untrusted input.
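A sketch of how the closing-tag pattern from this commit could be used is shown below. The surrounding function names and the opening-tag half of each regex are assumptions; only the `</script\b[^>]*>` closing form comes from the commit message. As the commit notes, this is for converting trusted saved HTML to text, not for sanitizing untrusted input.

```typescript
// Hypothetical sketch built around the closing-tag pattern described above.
function stripScriptAndStyle(html: string): string {
  return html
    // \b stops "</scriptbar>" from counting as a close; [^>]* tolerates
    // whitespace or stray attributes before the closing bracket.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\b[^>]*>/gi, " ")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\b[^>]*>/gi, " ");
}

function stripTagsSketch(html: string): string {
  // Remove script/style blocks with their contents, then any remaining tags.
  return stripScriptAndStyle(html).replace(/<[^>]*>/g, " ");
}
```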

claude added 2 commits May 6, 2026 15:41

github-advanced-security AI found potential problems May 6, 2026

Comment thread src/local.ts Fixed

askalf merged commit d3ed6d0 into master May 6, 2026
5 checks passed

askalf deleted the claude/plan-deep-dive-next-7b8Lx branch May 6, 2026 20:19

askalf merged 3 commits into master from claude/plan-deep-dive-next-7b8Lx
