feat: PDF support + local file ingestion + price-table drift check#24
Merged
Conversation
…heck Closes the two biggest content-coverage gaps in deepdive: real research hits PDFs constantly (academic papers, RFCs, standards docs), and the most useful sources are often already on the user's laptop (notes, internal docs). Both now work. PDF extraction (src/pdf.ts, ~180 lines): - Detected by URL extension or Content-Type: application/pdf - Routed through pdfjs-dist instead of the headless browser DOM (Chromium's PDF viewer doesn't expose useful text) - Page cap default 50, configurable via --pdf-max-pages / DEEPDIVE_PDF_MAX_PAGES - Frequency-based dedup of running headers/footers (60% threshold across pages) - BrowserSession.fetch short-circuits PDF URLs to a plain HTTP GET via Playwright's request context (page.goto can hang on PDFs at networkidle) To preserve the "one runtime dependency" headline: pdfjs-dist is NOT a runtime dep. It's dynamically imported on first use; missing → source skipped with fetch.skipped event reason "pdf-no-extractor" and a one-line install hint. Added as devDependency for tests. Users opt in via `npm install -g pdfjs-dist`. deepdive doctor reports the install state. Local file ingestion (src/local.ts, ~140 lines): - New flag --include=<paths> (comma-separated files / dirs) - Supports .pdf (needs pdfjs-dist), .md, .txt, .html - Pre-fetched sources land at the head of the kept list (lowest [N] ids) - file:///abs/path URLs in the citation footer - Dir expansion is one level deep (defensive — pointing at $HOME shouldn't ingest a thousand files) Doctor: pdf + pricing.table checks - pdf.extractor: ok / info depending on pdfjs-dist resolution - pricing.table: warns if PRICE_TABLE_VERIFIED_AT (new constant) is more than 90 days old. Closes v0.6.0's "drift is intentional, audit happens at PR time" loop — undeclared drift now produces a visible warning. 35 new tests (11 pdf, 9 local, 2 agent integration including end-to-end PDF byte→synth, 5 pricing drift, 3 doctor, 4 CLI/config plumbing). Suite goes from 275 → 310. Typecheck clean, build clean. v0.7.0.
- src/local.ts:152 — bad HTML filtering regexp. The </script>/</style> patterns required the exact byte sequence; </script > (with whitespace before the close bracket) would slip through. Switched to lazy `[\s\S]*?<\/script\s*>` which tolerates whitespace and is bounded (no nested quantifier risk). - src/local.ts:157 — double unescape. Sequential .replace() calls decoded "&lt;" to "<" instead of "<". Replaced the chain with a single-pass decode via a callback so each entity in the original string is decoded exactly once. - src/pdf.ts:211 — polynomial regex. ` *\n *\/g` against an input with many spaces and no newlines is O(n²) — engine consumes spaces at each starting position then fails on the missing \n. Rewrote collapseWhitespace as a per-line walk (split → trim → filter blanks). All operations are linear. Added 2 regression tests covering </script > whitespace and the &lt; double-unescape case. 312/312 passing.
CodeQL flagged the </script\s*> form (from b4c4b04) as still insufficient because closing tags like </script\t\n bar> or </style attr=x> have non-whitespace content between the tag name and the >. Browsers accept those, so the sanitizer must too. Switched to </script\b[^>]*> which: - requires a word-boundary after the tag name (no </scriptbar>) - accepts any non-> character until the close bracket - is bounded ([^>]* with no nested quantifier — no polynomial risk) Added 2 regression tests covering </script\t\n bar> and </style attr=x>. 314/314 passing. Same offline-trusted-input scope caveat as before — this is for converting saved HTML files to text, not sanitizing untrusted input.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the two biggest content-coverage gaps in deepdive in one PR.
https://x/paper.pdf, or anything served asapplication/pdf) and local (--include=./paper.pdf). Routed throughpdfjs-distvia a dedicated extractor; the headless browser short-circuits to a plain HTTP GET so we get the bytes (Chromium's PDF viewer doesn't expose useful text via the DOM).--include=<path>[,<path>]flag accepts files or directories. Supports.pdf(needspdfjs-dist),.md,.txt,.html. Local sources sit at the head of the kept list, get the lowest[N]citation IDs, and render asfile:///abs/pathURLs the user can click.deepdive doctor— closes the v0.6.0 commitment that "drift is intentional, audit happens at PR time" by warning ifPRICE_TABLE_VERIFIED_ATis more than 90 days stale.How
pdfjs-distis shippedPer the design conversation: as an optional, lazy-imported dependency, not a runtime dep. This preserves the "one runtime dependency" headline guarantee on default installs. Users who want PDF support run
npm install -g pdfjs-distonce;deepdive doctorreports the state.If
pdfjs-distisn't installed:fetch.skippedevent whose reason is"pdf-no-extractor"LocalIngestResult.skipped[]with reason"pdfjs-dist not installed"pdfjs-distis added as adevDependencyso CI tests exercise the real extractor (round-trip a minimal in-memory PDF throughextractPdfText).Headline DX
Hosted research tools (Perplexity, OpenAI DR, Gemini DR) cannot do this — your notes don't leave your machine, and the cited answer points back at
file://URLs the user can click open.What's added
src/pdf.ts(~180 lines, no new runtime deps):extractPdfText,isPdfExtractorAvailable,looksLikePdf,joinTextItems,dedupeRunningHeadersFooters,PdfExtractorMissingErrorsrc/local.ts(~140 lines):ingestLocalPaths,expandPaths,stripTagsBrowserSessionPDF short-circuit + newmimeType/bytesfields onFetchedPage--include,--pdf-max-pages(default 50)DEEPDIVE_INCLUDE,DEEPDIVE_PDF_MAX_PAGESinclude.done,fetch.skippedreason"pdf-no-extractor"AgentConfig.include,AgentConfig.pdfMaxPagesfor library consumersdeepdive doctorchecks:pdf.extractor,pricing.tablePRICE_TABLE_VERIFIED_AT,PRICE_TABLE_STALE_AFTER_DAYS, anddaysAgo()exports insrc/pricing.tspackage.jsonbumpedWhat's explicitly out of scope (v1)
--include=$HOMEshouldn't ingest a thousand files; one level deep is defensive)--ocrflagTest plan
test/pdf.test.mjs(pure helpers + an in-memory minimal-PDF round-trip viapdfjs-dist)test/local.test.mjs(stripTags,expandPathsdedup + dir expansion + missing path,ingestLocalPathsMD/TXT/HTML/word-cap/skipped)--includeingestion alongside web sources; PDF byte→synth end-to-end with a hand-rolled minimal PDFdaysAgomath and drift-constant coherencepdf.extractorok,pricing.tableok-when-fresh,pricing.tablewarn-when-stale)npm run typecheckcleannpm run buildcleannpm test— 314/314 passing (up from 275)--helpsmoke test shows--includeand--pdf-max-pagesdeepdive doctor --jsonsmoke test shows the two new categories