AI-first, modular pipeline for turning scanned books into structured JSON with full traceability.
The pipeline follows a 5-stage model:
- Intake → IR (generic): PDF/images → structured elements (Unstructured library provides rich IR with text, types, coordinates, tables)
- Verify IR (generic): QA checks on completeness, page coverage, element quality
- Portionize (domain-specific): Identify logical portions (CYOA sections, genealogy chapters, textbook problems) and reference IR elements
- Augment (domain-specific): Enrich portions with domain data (choices/combat for CYOA, relationships for genealogy)
- Export (format-specific): Output to target format (FF Engine JSON, HTML, Markdown) using IR + augmentations
Steps 1-2 are universal across all document types. Steps 3-4 vary by domain (gamebooks vs genealogies vs textbooks). Step 5 is tied to output requirements (precise layout for PDF, simplified for Markdown).
Reusability goal: Keep upstream intake/OCR modules as generic as possible. Prefer pushing booktype-specific heuristics/normalization (e.g., gamebook navigation phrase canonicalization, FF conventions) downstream into booktype-aware portionize/extract/enrich/export modules or recipe-scoped adapters so the OCR stack can be reused across book types.
The Intermediate Representation (IR) stays unchanged throughout; portionization and augmentation annotate/reference it rather than transforming it.
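A minimal sketch of that contract in Python (field and type names here are illustrative, not the pipeline's actual schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class IRElement:
    """One structured element from intake; never mutated downstream."""
    element_id: str
    page: int
    etype: str              # e.g. "Section-header", "Text"
    text: str
    bbox: tuple | None = None

@dataclass
class Portion:
    """A logical portion (e.g., one gamebook section) that only
    references IR elements instead of copying or transforming them."""
    portion_id: str
    element_ids: list[str] = field(default_factory=list)
    annotations: dict = field(default_factory=dict)  # augment stage adds here

def portion_text(portion: Portion, ir: dict[str, IRElement]) -> str:
    # Text is always re-derived from the immutable IR, preserving traceability.
    return "\n".join(ir[eid].text for eid in portion.element_ids)
```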
- Ingest PDF or page images → structured element IR (Unstructured or OCR-based)
- Multimodal LLM cleaning → per-page clean text + confidence
- Sliding-window portionization (LLM, optional priors, multimodal) → portions reference IR elements
- Consensus/dedupe/normalize, resolve overlaps, guarantee coverage
- Assemble per-portion JSON (page spans, source images, raw_text from IR)
- Run outputs are stored under `output/runs/<run_id>/` with manifests and state
Use this when you want a targeted, auditable pass over extracted gameplay logic without baking book-specific hacks into core modules.
High-level flow:
- Extract `turn_to_links` early (anchor-derived) during portionization.
- Downstream extractors claim links (combat/luck/stat checks/choices) via `turn_to_claims`.
- Reconcile claimed vs. total links → unclaimed targets are high-confidence edge cases.
- Scan the gamebook for edge-case patterns and emit a structured report.
- AI-verify only flagged sections → emit patch JSONL (empty when correct).
- Apply patches deterministically (opt-in via recipe) to produce a patched gamebook.
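Conceptually, the reconcile step is a set difference over (section, target) link pairs; a small sketch, assuming hypothetical `section`/`target` fields in the JSONL artifacts:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def unclaimed_links(links_path: str, claims_path: str) -> list[dict]:
    """Links no extractor claimed; these become the edge-case candidates."""
    links = load_jsonl(links_path)
    claimed = {(c["section"], c["target"]) for c in load_jsonl(claims_path)}
    return [l for l in links if (l["section"], l["target"]) not in claimed]
```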
Recommended run (reuse an existing full run; do not re-run OCR):
```
python driver.py \
  --recipe configs/recipes/recipe-ff-ai-ocr-gpt51-resume-edgecase-scan.yaml \
  --run-id edgecase-scan-<run_id> \
  --output-dir output/runs/edgecase-scan-<run_id>
```
Artifacts to inspect:
- `output/runs/<edgecase-run>/04_turn_to_link_reconciler_v1/turn_to_unclaimed.jsonl`
- `output/runs/<edgecase-run>/05_edgecase_scanner_v1/edgecase_scan.jsonl`
- `output/runs/<edgecase-run>/06_edgecase_ai_patch_v1/edgecase_patches.jsonl`
- `output/runs/<edgecase-run>/07_apply_edgecase_patches_v1/gamebook_patched.json`
Running the pipeline via CLI flags can be error-prone. Use the simplified workflow with run configuration files.
```
python tools/run_manager.py create-run my-new-run
```
This generates `output/runs/my-new-run/config.yaml`.
Customize output/runs/my-new-run/config.yaml with your recipe, input PDF, and options.
Key concept: the recipe defines the logic (stages), while this `config.yaml` defines the context (input PDF, output dir, run ID).
```
python tools/run_manager.py execute-run my-new-run
```
You can still pass additional CLI overrides if needed:
```
python tools/run_manager.py execute-run my-new-run --dry-run
```
- CLI modules/scripts: `pages_dump.py`, `clean_pages.py`, `portionize.py`, `consensus.py`, `dedupe_portions.py`, `normalize_portions.py`, `resolve_overlaps.py`, `build_portion_text.py`, etc.
- `docs/requirements.md`: system requirements
- `snapshot.md`: current status and pipeline notes
- `output/`: git-ignored; run artifacts live at `output/runs/<run_id>/`
- Artifact organization: each module has its own folder `{ordinal:02d}_{module_id}/` (e.g., `01_extract_ocr_ensemble_v1/`) containing its artifacts
- Final outputs: `gamebook.json` stays in root for easy access
- Game-ready package: `output/runs/<run_id>/output/` (contains `gamebook.json`, `validator/`, and README)
- Pipeline metadata: `pipeline_state.json`, `pipeline_events.jsonl`, `snapshots/` in root
- `settings.example.yaml`: sample config
- Driver snapshots: each run writes `snapshots/` (recipe.yaml, plan.json, registry.json, optional settings/pricing/instrumentation configs) and records paths in `output/run_manifest.jsonl` for reproducibility.
- Shared helpers for module entrypoints live in `modules/common/` (utils, OCR helpers).
- Modules live under `modules/<stage>/<module_id>/`; recipes live in `configs/recipes/`.
- The driver orchestrates stages, stamps artifacts with schema/module/run IDs, and tracks state in `pipeline_state.json`.
- Swap modules by changing the recipe, e.g. OCR vs. text ingest.
Running Headers (Section Ranges):
- Fighting Fantasy gamebooks use running headers in the upper corners of gameplay pages
- Left page (L): Shows section range (e.g., "9-10", "18-21") indicating which sections are on that page
- Right page (R): Shows a single section number (e.g., "22") or a range indicating sections on that page
- These are NOT page numbers - they indicate which gameplay sections (1-400) appear on the page
- Format: Either ranges like "X-Y" (sections X through Y) or single numbers like "Z" (section Z only)
- Position: Upper outside corners (top-left for left pages, top-right for right pages)
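A sketch of parsing such a header string into a section range (regex and bounds are straightforward; the module's actual logic may differ):

```python
import re

def parse_running_header(text: str) -> range | None:
    """Parse 'X-Y' or 'Z' running headers into a section range.

    Returns range(X, Y + 1) for 'X-Y', range(Z, Z + 1) for 'Z',
    or None if the text is not a valid 1-400 section header.
    """
    m = re.fullmatch(r"(\d{1,3})(?:\s*-\s*(\d{1,3}))?", text.strip())
    if not m:
        return None
    lo = int(m.group(1))
    hi = int(m.group(2)) if m.group(2) else lo
    if not (1 <= lo <= hi <= 400):
        return None
    return range(lo, hi + 1)

assert list(parse_running_header("18-21")) == [18, 19, 20, 21]
assert list(parse_running_header("22")) == [22]
```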
Coordinate System Note:
- OCR engines may use different coordinate systems (standard: y=0=top, inverted: y=0=bottom)
- Running headers at top corners may have high y values (0.9+) if coordinate system is inverted
- Pattern detection must account for this when identifying top vs bottom positions
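A minimal normalization sketch, assuming 0–1 normalized coordinates and a per-engine flag for the origin:

```python
def is_top_of_page(y: float, origin_bottom: bool, threshold: float = 0.15) -> bool:
    """True if a normalized y coordinate lies in the top band of the page.

    origin_bottom=True means the engine puts y=0 at the bottom (inverted),
    so a running header near the top will report y close to 1.0.
    """
    y_from_top = (1.0 - y) if origin_bottom else y
    return y_from_top <= threshold
```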
Current canonical recipe: configs/recipes/recipe-ff-ai-ocr-gpt51.yaml (GPT-5.1 AI-first OCR, HTML-first output).
The legacy OCR-ensemble recipe is archived at configs/recipes/legacy/recipe-ff-canonical.yaml; the module list below is preserved for historical reference.
01. extract_ocr_ensemble_v1 (Code + AI escalation)
- What it does: Runs multiple OCR engines (Tesseract, EasyOCR, Apple Vision, PDF text) in parallel and combines results with voting/consensus
- Why: Different engines excel at different fonts/layouts; ensemble improves accuracy
- Try: Code (multi-engine OCR)
- Validate: Code (disagreement scoring)
- Escalate: AI (GPT-4V vision transcription for high-disagreement pages)
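As a toy illustration of the voting idea (the real module's consensus logic is richer), per-line majority voting with an agreement score might look like:

```python
from collections import Counter

def consensus_line(candidates: list[str]) -> tuple[str, float]:
    """Pick the most common transcription for one line across engines.

    Returns (text, agreement) where agreement in [0, 1] doubles as a
    simple disagreement signal for deciding when to escalate to a VLM.
    """
    counts = Counter(c.strip() for c in candidates if c.strip())
    if not counts:
        return "", 0.0
    text, votes = counts.most_common(1)[0]
    return text, votes / len(candidates)

text, agreement = consensus_line(["Turn to 163", "Turn to 163", "Tum to 168"])
# agreement < 1.0 -> candidate page for escalation if it stays below a threshold
```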
02. easyocr_guard_v1 (Code)
- What it does: Validates that EasyOCR produced text for sufficient pages
- Why: EasyOCR is primary engine; missing output indicates critical failure
- Type: Code-only validation guard
03. pick_best_engine_v1 (Code)
- What it does: Selects the best OCR engine output per page based on quality metrics and preserves standalone numeric headers from all engines
- Why: Reduces noise while preserving critical section headers that might only appear in one engine
- Type: Code-only selection
04. inject_missing_headers_v1 (Code)
- What it does: Scans raw OCR engine outputs for numeric headers (1-400) missing from picked output and injects them
- Why: Critical for 100% section coverage; headers can be lost during engine selection
- Type: Code-only injection
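In essence: union the numeric headers any engine saw, subtract those already present in the picked output. A sketch with hypothetical data shapes:

```python
import re

HEADER_RE = re.compile(r"^\s*(\d{1,3})\s*$")

def find_numeric_headers(lines: list[str]) -> set[int]:
    """Standalone numeric lines in the 1-400 gamebook section range."""
    found = set()
    for line in lines:
        m = HEADER_RE.match(line)
        if m and 1 <= int(m.group(1)) <= 400:
            found.add(int(m.group(1)))
    return found

def missing_headers(picked: list[str], raw_by_engine: dict[str, list[str]]) -> set[int]:
    """Headers seen by at least one raw engine but absent from the picked output."""
    seen_anywhere = set().union(*(find_numeric_headers(v) for v in raw_by_engine.values()))
    return seen_anywhere - find_numeric_headers(picked)
```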
05. ocr_escalate_gpt4v_v1 (AI)
- What it does: Re-transcribes high-disagreement or low-quality pages using GPT-4V vision model
- Why: Vision models can read corrupted/scanned text that OCR engines miss
- Type: AI escalation (targeted, budget-capped)
06. merge_ocr_escalated_v1 (Code)
- What it does: Merges original OCR pages with escalated GPT-4V pages into unified final OCR output
- Why: Creates single authoritative OCR artifact for downstream stages
- Type: Code-only merge
07. reconstruct_text_v1 (Code)
- What it does: Merges fragmented OCR lines into coherent paragraphs while preserving section boundaries
- Why: Cleaner text improves downstream AI accuracy and human readability
- Type: Code-only reconstruction
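A simplified sketch of the merging idea (the real module preserves section boundaries with richer heuristics):

```python
def merge_lines(lines: list[str]) -> list[str]:
    """Join fragmented OCR lines into paragraphs; blank lines split paragraphs."""
    paragraphs, buf = [], []
    for line in lines:
        if not line.strip():          # blank line ends the current paragraph
            if buf:
                paragraphs.append(" ".join(buf))
                buf = []
        elif line.strip().isdigit():  # keep standalone section headers intact
            if buf:
                paragraphs.append(" ".join(buf))
                buf = []
            paragraphs.append(line.strip())
        else:
            buf.append(line.strip())
    if buf:
        paragraphs.append(" ".join(buf))
    return paragraphs
```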
08. pagelines_to_elements_v1 (Code)
- What it does: Converts pagelines IR (OCR output) into elements_core.jsonl (structured element IR)
- Why: Standardizes format for downstream portionization stages
- Type: Code-only transformation
09. elements_content_type_v1 (Code + optional AI)
- What it does: Classifies elements into DocLayNet types (Section-header, Text, Page-footer, etc.) using text-first heuristics
- Why: Content type tags enable code-first boundary detection (filters for Section-header)
- Try: Code (heuristic classification)
- Escalate: Optional AI (LLM classification for low-confidence items, disabled by default)
10. coarse_segment_v1 (AI)
- What it does: Single LLM call to classify entire book into frontmatter/gameplay/endmatter page ranges
- Why: Establishes macro boundaries before fine-grained section detection
- Type: AI classification (one call for entire book)
11. fine_segment_frontmatter_v1 (AI)
- What it does: Divides frontmatter section into logical portions (title, copyright, TOC, rules, etc.)
- Why: Structures non-gameplay content for completeness
- Type: AI segmentation
12. classify_headers_v1 (AI)
- What it does: Batched AI calls to classify elements as macro headers, game section headers, or neither
- Why: Provides header candidates for global structure analysis
- Type: AI classification (batched, forward/backward redundancy)
13. structure_globally_v1 (AI, currently stubbed)
- What it does: Single AI call to create coherent global document structure from header candidates
- Why: Creates ordered section structure with macro sections and game sections
- Type: AI structuring (currently skipped via stub)
14. detect_boundaries_code_first_v1 (Code + AI escalation)
- What it does: Code-first section boundary detection with targeted AI escalation for missing sections
- Why: Replaces expensive batched AI with free code filter + 0-30 targeted AI calls; achieves 95%+ coverage
- Try: Code (filters elements_core_typed for Section-header with valid numbers, applies multi-stage validation)
- Validate: Code (coverage check vs target)
- Escalate: AI (targeted re-scan of pages with missing sections using GPT-5)
- Type: Code-first with AI escalation
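The code-first filter reduces to selecting typed Section-header elements with in-range numbers and then measuring coverage; a simplified sketch with illustrative field names:

```python
def candidate_boundaries(elements: list[dict], max_section: int = 400) -> list[dict]:
    """Filter typed elements down to plausible section-header boundaries."""
    out = []
    for el in elements:  # elements_core_typed records (field names assumed)
        if el.get("content_type") != "Section-header":
            continue
        text = el.get("text", "").strip()
        if text.isdigit() and 1 <= int(text) <= max_section:
            out.append({"section_id": int(text), "element_id": el["element_id"]})
    return out

def coverage(boundaries: list[dict], max_section: int = 400) -> float:
    found = {b["section_id"] for b in boundaries}
    return len(found) / max_section  # below target -> targeted AI escalation
```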
15. portionize_ai_scan_v1 (AI, fallback)
- What it does: Full-document AI scan for section boundaries (fallback if code-first fails)
- Why: Backup method if code-first detection misses too many sections
- Type: AI fallback
16. macro_locate_ff_v1 (AI)
- What it does: Identifies frontmatter/main_content/endmatter pages from minimal OCR text
- Why: Provides macro section hints for structure analysis
- Type: AI location
17. merge_boundaries_pref_v1 (Code)
- What it does: Merges primary boundary set with fallback, preferring primary and filling gaps
- Why: Combines code-first results with AI fallback for maximum coverage
- Type: Code-only merge
18. verify_boundaries_v1 (Code + optional AI)
- What it does: Validates section boundaries with deterministic checks (ordering, duplicates) and optional AI spot-checks
- Why: Catches boundary errors before expensive extraction stage
- Try: Code (deterministic validation)
- Escalate: Optional AI (spot-checks sampled boundaries for mid-sentence starts)
- Type: Code validation with optional AI sampling
19. validate_boundary_coverage_v1 (Code)
- What it does: Ensures boundary set covers expected section IDs and meets minimum count
- Why: Fails fast if coverage is too low
- Type: Code-only validation
20. validate_boundaries_gate_v1 (Code)
- What it does: Final gate check before extraction (count, ordering, gaps)
- Why: Prevents proceeding with invalid boundary set
- Type: Code-only gate
21. portionize_ai_extract_v1 (AI)
- What it does: Extracts section text from elements and parses gameplay data (choices, combat, luck tests, items) using AI
- Why: AI understands context and can extract structured gameplay data from narrative text
- Type: AI extraction (per-section calls)
22. repair_candidates_v1 (Code)
- What it does: Detects sections needing repair (garbled text, low alpha ratio, high digit ratio) using heuristics
- Why: Identifies problematic sections before expensive repair stage
- Type: Code-only detection
23. repair_portions_v1 (AI)
- What it does: Re-reads flagged sections with multimodal LLM (GPT-5) to repair garbled text
- Why: Vision models can transcribe corrupted text that OCR missed
- Type: AI repair (targeted, budget-capped)
24. strip_section_numbers_v1 (Code)
- What it does: Removes section/page number artifacts from section text while preserving paragraph structure
- Why: Clean text for final gamebook output
- Type: Code-only cleaning
25. extract_choices_v1 (Code + optional AI)
- What it does: Extracts choices from section text using deterministic pattern matching ("turn to X", "go to Y")
- Why: Code-first approach is faster, cheaper, and more reliable than pure AI extraction
- Try: Code (pattern matching)
- Escalate: Optional AI (validation for ambiguous cases, disabled by default)
- Type: Code-first with optional AI validation
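A sketch of the deterministic first pass (patterns here are illustrative; the module's actual pattern set is broader):

```python
import re

# Common Fighting Fantasy phrasings; only a subset of the real patterns.
CHOICE_RE = re.compile(r"\b(?:turn|go)\s+to\s+(\d{1,3})\b", re.IGNORECASE)

def extract_choices(section_text: str) -> list[dict]:
    """Deterministic first pass: every 'turn to X' / 'go to X' becomes a choice."""
    return [
        {"target": int(m.group(1)), "span": m.span()}
        for m in CHOICE_RE.finditer(section_text)
    ]

choices = extract_choices("If you open the door, turn to 163. Otherwise go to 28.")
# -> targets 163 and 28; ambiguous phrasings could be handed to optional AI
```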
26. build_ff_engine_v1 (Code)
- What it does: Assembles final gamebook.json from portions with choices, combat, items, etc.
- Why: Creates final output format for game engine consumption
- Type: Code-only assembly
- Output note: Gameplay flow is encoded in ordered `sequence` events (replaces legacy `navigation`).
- Combat requires outcomes: every `combat` event must include `outcomes.win`.
- Outcome refs: `outcomes.{win,lose,escape}` are `OutcomeRef` objects with either `targetSection` or `terminal`.
- Continue in-section: when a combat win immediately continues within the same section (e.g., "Test your Luck"), set `outcomes.win = { terminal: { kind: "continue" } }` and add a `player_round_win` trigger to indicate the round count if stated.
- Triggers: use `triggers` for mid-combat conditions (e.g., `enemy_attack_strength_total`, `enemy_round_win`, `player_round_win`).
- Split-target fights: multi-part enemies (e.g., pincers/heads) are represented as multiple enemies with `mode: "split-target"` and structured rules; avoid single-enemy split-target output.
Example (basic win/lose):
```json
{
  "kind": "combat",
  "mode": "single",
  "enemies": [{"enemy": "CAVE BEAST", "skill": 7, "stamina": 8}],
  "outcomes": {
    "win": {"targetSection": "163"},
    "lose": {"terminal": {"kind": "death"}}
  }
}
```
Example (win continues in-section with Test Your Luck):
```json
{
  "kind": "combat",
  "mode": "single",
  "enemies": [{"enemy": "BLOODBEAST", "skill": 12, "stamina": 10}],
  "triggers": [{
    "kind": "player_round_win",
    "count": 1,
    "outcome": {"terminal": {"kind": "continue"}}
  }],
  "outcomes": {"win": {"terminal": {"kind": "continue"}}}
}
```
27. validate_ff_engine_node_v1 (Node/AJV)
- What it does: Canonical schema validator shared with the game engine (Node + Ajv)
- Why: Ensures pipeline/game engine use identical validation logic
- Type: Node validator (bundled, portable)
- Scope: Generic across Fighting Fantasy books (not tuned to a specific title)
- Ship: Include `modules/validate/validate_ff_engine_node_v1/validator` alongside `gamebook.json` in the game engine build.
- How to ship: Copy `gamebook.json` + `modules/validate/validate_ff_engine_node_v1/validator/gamebook-validator.bundle.js` into the game engine bundle, then run `node gamebook-validator.bundle.js gamebook.json --json` before loading.
- Validation notes:
  - Combat events must include `outcomes.win` (required by schema).
  - Missing-section checks use `metadata.sectionCount` when present; otherwise fall back to `provenance.expected_range`, then default `1–400`.
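The documented fallback order for the expected section range could be sketched like this (key paths follow the note above; exact shapes are assumptions):

```python
def expected_sections(gamebook: dict) -> range:
    """Resolve which section IDs must exist, in the documented fallback order."""
    count = gamebook.get("metadata", {}).get("sectionCount")
    if count:
        return range(1, int(count) + 1)
    rng = gamebook.get("provenance", {}).get("expected_range")
    if rng:  # assumed shape: [lo, hi]
        return range(int(rng[0]), int(rng[1]) + 1)
    return range(1, 401)  # default 1-400
```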
28. forensics_gamebook / validate_ff_engine_v2 (Code)
- What it does: Forensic validation (missing sections, duplicates, empty sections, structural issues)
- Why: Provides detailed traces for debugging and repair; not the canonical schema validator
- Type: Code-only validation
29. validate_choice_completeness_v1 (Code)
- What it does: Compares "turn to X" references in section text with extracted choices to find missing choices
- Why: Critical for 100% game engine accuracy; missing choices break gameplay
- Type: Code-only validation (pattern matching + comparison)
1. Regular Production Runs (output in output/runs/)
- Purpose: Real pipeline runs that should be preserved and tracked
- Location: Artifacts go to `output/runs/<run_id>/` (default or from recipe)
- When to use: Actual book processing, production runs, runs you want to keep
- Manifest: Automatically registered in `output/run_manifest.jsonl` for tracking
- Example:
```
# Full canonical FF recipe run (GPT-5.1 OCR; no ARM64/MPS requirement)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id deathtrap-dungeon-20251225

# With instrumentation
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id deathtrap-dungeon-20251225 --instrument
```
2. Temporary Test Runs (output in /tmp or /private/tmp)
- Purpose: Quick testing, development, debugging, AI agent experimentation
- Location: Artifacts go to `/tmp` or `/private/tmp` (via `--output-dir` override)
- When to use:
- Testing new modules or recipe changes
- Debugging pipeline issues
- AI agents doing temporary test runs during development
- Quick smoke tests on subsets
- Not tracked: These runs are NOT registered in `output/run_manifest.jsonl` (they're temporary)
- Example:
```
# Temporary test run (AI agents use this for development/testing)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml \
  --run-id cf-ff-ai-ocr-gpt51-test \
  --output-dir /private/tmp/cf-ff-ai-ocr-gpt51-test \
  --force

# Smoke test with subset (GPT-5.1 OCR; no ARM64/MPS requirement)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml \
  --settings configs/settings.ff-ai-ocr-gpt51-smoke-20.yaml \
  --run-id ff-ai-ocr-gpt51-smoke-20 \
  --output-dir /tmp/cf-ff-ai-ocr-gpt51-smoke-20 \
  --force
```
Key Differences:
- Regular runs: Use the default `output/runs/<run_id>/` (or recipe `output_dir`), registered in the manifest
- Temporary runs: Use `--output-dir` to override to `/tmp` or `/private/tmp`; NOT registered in the manifest
- AI agents: Should use temporary runs (`--output-dir /private/tmp/...`) for testing/development, and only use regular runs for actual production work
- Canonical smoke (current pipeline): `configs/recipes/recipe-ff-ai-ocr-gpt51.yaml` + `configs/settings.ff-ai-ocr-gpt51-smoke-20.yaml`
- Offline fixture smoke (no external calls): `configs/recipes/recipe-ff-smoke.yaml` (uses `testdata/smoke/ff/`)
- Legacy/archived smoke: `configs/recipes/legacy/recipe-ocr-coarse-fine-smoke.yaml` and `configs/settings.ff-canonical-smoke*.yaml` (legacy OCR pipeline)
# Dry-run legacy OCR recipe (archived)
python driver.py --recipe configs/recipes/legacy/recipe-ocr.yaml --dry-run
# Text ingest with mock LLM stages (for tests without API calls)
python driver.py --recipe configs/recipes/recipe-text.yaml --mock --skip-done
# OCR pages 1–20 real run (auto-generated run_id/output_dir by default)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --force
# Reuse a specific run_id/output_dir (opt-in)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id myrun --allow-run-id-reuse
# Resume legacy OCR run from portionize onward (reuses cached clean pages)
python driver.py --recipe configs/recipes/legacy/recipe-ocr.yaml --skip-done --start-from portionize_fine
# Swap modules: edit configs/recipes/*.yaml to choose a different module per stage
# (e.g., set stage: extract -> module: extract_text_v1 instead of extract_ocr_v1)

Runtime note: full non-mock OCR on the 113-page sample typically takes ~35–40 minutes for the portionize/LLM window stage (gpt-4.1-mini + boost gpt-5). Use `--skip-done` with `--start-from`/`--end-at` to resume or scope reruns without re-cleaning pages.
Each run emits a lightweight `timing_summary.json` in the run directory with wall seconds per stage (and pages/min for intake/extract when available).
- Canonical GPT-5.1 OCR runs on any arch; no MPS requirement.
- Prefer the ARM64 Python env on Apple Silicon for legacy Unstructured `hi_res` intake: `~/miniforge3/envs/codex-arm/bin/python` (reports `platform.machine() == "arm64"`). Unstructured `hi_res` runs successfully there and yields far better header/section recall.
- On x86_64 (Rosetta) the TensorFlow build expects AVX and forces legacy `hi_res` to fall back to `strategy: fast`, which markedly reduces header detection and downstream section coverage.
- Legacy OCR ensemble recipes (archived under `configs/recipes/legacy/`) defaulted to `strategy: hi_res` and rely on EasyOCR; these notes apply only to legacy recipes.
- EasyOCR auto-uses the GPU when Metal/MPS is available (Apple Silicon) and falls back to CPU otherwise; no flags needed. Use `--allow-run-id-reuse` only if you explicitly want to reuse an existing run directory; defaults now auto-generate a fresh run_id/output_dir per run.
- Metal-friendly env recipe (legacy EasyOCR; pins torch 2.9.1 / torchvision 0.24.1 / Pillow<13):
```
conda create -n codex-arm-mps python=3.11
conda activate codex-arm-mps
pip install --no-cache-dir -r requirements-legacy-easyocr.txt -c constraints/metal.txt
python - <<'PY'
import torch; print(torch.__version__, torch.backends.mps.is_available())
PY
```
If `mps.is_available()` is false, you are on the wrong arch or missing the Metal wheel.
- After a GPU smoke run, sanity-check that EasyOCR used MPS: `python scripts/regression/check_easyocr_gpu.py --debug-file /tmp/cf-easyocr-mps-5/ocr_ensemble/easyocr_debug.jsonl`
- One-shot local smoke + check: `./scripts/smoke_easyocr_gpu.sh /tmp/cf-easyocr-mps-5`
- MPS troubleshooting: ensure `platform.machine() == "arm64"`, Xcode CLTs are installed, and you're using the arm64 Python from the `codex-arm-mps` env. Reinstall with the Metal constraints if torch shows `mps.is_available() == False`.
- Keep the "hi_res first, fast fallback" knob: run ARM hi_res by default, and only flip to `settings.fast-intake.yaml` when the environment lacks ARM/AVX. Prior runs showed a large coverage drop when forced to fast, so treat fast as a compatibility fallback, not a peer mode.
- Recommended full run on ARM: `~/miniforge3/envs/codex-arm/bin/python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id <run> --output-dir <dir> --force`
- macOS-only Vision OCR: a new module `extract_ocr_apple_v1` (and optional `apple` engine in `extract_ocr_ensemble_v1`) uses `VNRecognizeTextRequest`. It compiles a Swift helper at runtime; only available on macOS with Xcode CLTs installed.
  - Sandbox caveat: In restricted/sandboxed execution, Apple Vision can fail with errors like `sysctlbyname for kern.hv_vmm_present failed` (and emit empty/no `apple` text). If you hit this, run the OCR stage outside the sandbox / with full host permissions, or disable `apple` for that run.
# Dry-run canonical OCR (GPT-5.1)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --dry-run
# Text ingest DAG with mock LLM stages (fast, no API calls)
python driver.py --recipe configs/recipes/recipe-text-dag.yaml --mock --skip-done
# Quick smoke: coarse+fine+continuation on first 10 pages (legacy, archived)
python driver.py --recipe configs/recipes/legacy/recipe-ocr-coarse-fine-smoke.yaml --force
# Continuation regression check (after a run)
python scripts/regression/check_continuation_propagation.py \
--hypotheses output/runs/deathtrap-ocr-dag/adapter_out.jsonl \
--locked output/runs/deathtrap-ocr-dag/portions_locked_merged.jsonl \
--resolved output/runs/deathtrap-ocr-dag/portions_resolved.jsonl

Key points:
- Stages have ids and `needs`; the driver topo-sorts and validates schemas.
- Override per-stage outputs via either a stage-level `out:` key (highest precedence) or the recipe-level `outputs:` map.
- Removed (Story 025): image_crop_cv_v1, portionize_page_v1, portionize_numbered_v1, merge_portion_hyp_v1, consensus_spanfill_v1, enrich_struct_v1, build_appdata_v1; demo/alt recipes using them were deleted.
- Each module can declare `param_schema` (JSON-Schema-lite) in its `module.yaml` to type-check params before the run. Supported fields per param: `type` (string|number|integer|boolean), `enum`, `minimum`/`maximum`, `pattern`, `default`; mark required via a top-level `required` list or `required: true` on the property.
- The driver merges `default_params` + recipe `params`, applies schema defaults, and fails fast on missing/unknown/invalid params with a message that includes the stage id and module id (see the sketch after this list).
- Example: `Param 'min_conf' on stage 'clean_pages' (module clean_llm_v1) expected type number, got str.`
- Set custom filenames per stage with `out:` inside the stage config; this overrides recipe `outputs:` and the built-in defaults, and the resolved name is used for resume/skip-done and downstream inputs.
- Example snippet with stage-level `out:`:
```
stages:
  - id: clean_pages
    stage: clean
    module: clean_llm_v1
    needs: [extract_text]
    out: pages_clean_custom.jsonl
```
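A minimal sketch of the merge-and-check behavior described above (a JSON-Schema-lite subset; not the driver's actual code):

```python
def check_params(schema: dict, params: dict, stage_id: str, module_id: str) -> dict:
    """Apply schema defaults, then fail fast on missing or mistyped params."""
    types = {"string": str, "number": (int, float), "integer": int, "boolean": bool}
    merged = dict(params)
    for name, spec in schema.get("properties", {}).items():
        if name not in merged and "default" in spec:
            merged[name] = spec["default"]
    for name in schema.get("required", []):
        if name not in merged:
            raise ValueError(f"Missing param '{name}' on stage '{stage_id}' (module {module_id})")
    for name, spec in schema.get("properties", {}).items():
        if name in merged and not isinstance(merged[name], types[spec["type"]]):
            raise TypeError(
                f"Param '{name}' on stage '{stage_id}' (module {module_id}) "
                f"expected type {spec['type']}, got {type(merged[name]).__name__}"
            )
    return merged
```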
Artifacts appear under output/runs/<run_id>/ as listed in the recipe; use --skip-done to resume and --force to rerun stages.
`output/runs/<run_id>/` contains all artifacts: `images/`, `ocr/`, `pages_raw`/`clean`, hypotheses, locked/normalized/resolved portions, final JSON, `pipeline_state.json`. `output/run_manifest.jsonl` lists runs (id, path, date, notes).
- Enable per-stage timing and LLM cost reporting with `--instrument` (off by default).
- Optional price sheet override via `--price-table configs/pricing.default.yaml` or recipe `instrumentation.price_table`.
- Outputs land beside artifacts: `instrumentation.json` (machine-readable), `instrumentation.md` (summary tables), and raw `instrumentation_calls.jsonl` when present. Manifest entries link to the reports.
- Modules can emit call-level usage via `modules.common.utils.log_llm_usage(...)`; the driver aggregates tokens/costs per stage and per model.
- Preferred: `scripts/run_driver_monitored.sh` (spawns the driver, writes `driver.pid`, tails `pipeline_events.jsonl`).
  - Example: `scripts/run_driver_monitored.sh --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id <run_id> --output-dir output/runs -- --instrument`
- Important: `run_driver_monitored.sh` expects `--output-dir` to be the parent (e.g., `output/runs`) and passes the full run dir to `driver.py`. Do not pass a run-specific path.
- If you pass `--force`, the script pre-deletes the run dir, strips `--force`, and adds `--allow-run-id-reuse` so the driver accepts the created run dir without wiping the log/pidfile mid-run.
- Attach to an existing run: `scripts/monitor_run.sh output/runs/<run_id> output/runs/<run_id>/driver.pid 5`
- Foreground one-liner (useful if background terminal support interferes): `while true; do date; tail -n 1 output/runs/<run_id>/pipeline_events.jsonl; sleep 60; done`
- Crash visibility: prefer `scripts/run_driver_monitored.sh` so stderr is captured in `driver.log`. `scripts/monitor_run.sh` now tails `driver.log` when the PID disappears to surface hard failures (e.g., OpenMP SHM errors). It also appends a synthetic `run_monitor` failure event to `pipeline_events.jsonl` when the driver PID disappears, so tailing events shows the crash. `scripts/run_driver_monitored.sh` runs `scripts/postmortem_run.sh` on exit to append a `run_postmortem` failure event when the PID is gone.
- Preset settings live in `configs/presets/`:
  - `speed.text.yaml` (text recipe, gpt-4.1-mini, ~8s/page, ~$0.00013/page)
  - `cost.ocr.yaml` (OCR, gpt-4.1-mini, ~13–18s/page, ~$0.0011/page)
  - `balanced.ocr.yaml` (OCR, gpt-4.1, ~16–34s/page, ~$0.014–0.026/page)
  - `quality.ocr.yaml` (OCR, gpt-5, ~70–100s/page, ~$0.015–0.020/page)
- Use with the driver by passing `--settings`, e.g.:
```
python driver.py --recipe configs/recipes/recipe-text.yaml --settings configs/presets/speed.text.yaml --instrument
python driver.py --recipe configs/recipes/legacy/recipe-ocr.yaml --settings configs/presets/cost.ocr.yaml --instrument
```
- Bench sessions write metrics to `output/runs/bench-*/bench_metrics.csv` and `metadata.json` (slices, models, price table, runs). Example sessions:
  - `output/runs/bench-cost-perf-ocr-20251124c/bench_metrics.csv`
  - `output/runs/bench-cost-perf-text-20251124e/bench_metrics.csv`
- Serve from repo root: `python -m http.server 8000`, then open `http://localhost:8000/docs/pipeline-visibility.html`.
- The page polls `output/run_manifest.jsonl` for run ids, then reads `output/runs/<run_id>/pipeline_state.json` and `pipeline_events.jsonl` for live progress, artifacts, and confidence stats.
- A ready-to-use fixture run lives at `output/runs/dashboard-fixture` (listed in the manifest) so you can smoke the dashboard without running the pipeline.
- Enrichment (choices, cross-refs, combat/items/endings)
- Turn-to validator (CYOA), layout-preserving extractor, image cropper/mapper
- Coarse+fine portionizer; continuation merge
- AI planner to pick modules/configs based on user goals
Legacy Unstructured intake only. The canonical GPT-5.1 OCR pipeline does not use hi_res/ocr_only strategies.
Legacy recommendation: hi_res on ARM64, ocr_only on x86_64.
Check your architecture first (`python -c "import platform; print(platform.machine())"`). On Apple Silicon Macs, verify whether an ARM64 environment exists even if your current shell is using x86_64.
After comprehensive testing comparing old Tesseract-based OCR with Unstructured strategies (ocr_only vs hi_res):
- `hi_res` on ARM64: ~15% faster (88s/page vs 105s/page), extracts ~35% more granular elements (better layout boundaries), same text quality as `ocr_only`. Use when an ARM64 environment is available (Story 033 complete).
- `ocr_only`: More compatible (works on x86_64/Rosetta without JAX), similar text quality, fewer elements. Use as a fallback or when maximum compatibility is needed.
Note: OCR text quality is source-limited (scanned PDF quality determines accuracy), so strategy choice primarily affects performance and element granularity, not character recognition accuracy.
The canonical GPT-5.1 pipeline installs from `requirements.txt` on any arch and does not require ARM64/MPS or JAX.
Check Your Environment First
Before assuming x86_64/Rosetta, check if you have an ARM64 environment available:
# Check if ARM64 environment exists
ls -la ~/miniforge3/envs/codex-arm/bin/python 2>/dev/null && echo "ARM64 environment available"
# Check current Python architecture
python -c "import platform; print(f'Machine: {platform.machine()}')"
# ARM64 native: "Machine: arm64"
# x86_64/Rosetta: "Machine: x86_64"
# Check ARM64 environment architecture
~/miniforge3/envs/codex-arm/bin/python -c "import platform; print(f'Machine: {platform.machine()}')" 2>/dev/null
# Should show: "Machine: arm64"

On Apple Silicon (M-series) Macs: You likely have both environments. Always check for ARM64 first and use it for better performance unless you have a specific reason to use x86_64.
The default setup uses x86_64 Python running under Rosetta 2 on Apple Silicon. This is the most stable and compatible option.
Setup:
- Install Miniconda (x86_64): Download from https://docs.conda.io/en/latest/miniconda.html (choose macOS Intel 64-bit)
- Create environment: `conda create -n codex python=3.11`
- Install dependencies: `pip install -r requirements.txt`
When to use:
- Quick starts and one-off runs
- When you need maximum compatibility
- When the `ocr_only` OCR strategy is sufficient
OCR Strategy:
- Uses `ocr_only` (JAX is unavailable under Rosetta, so `hi_res` is not possible)
- Note: If you're on Apple Silicon but using x86_64 Python, check whether an ARM64 environment exists and use that instead
Limitations:
- Cannot use the `hi_res` OCR strategy (requires JAX, which has AVX incompatibilities under Rosetta)
- Slower performance (~3–5 minutes/page for OCR)
- No GPU acceleration
For repeated processing or when you need hi_res OCR with table structure inference, use native ARM64 with JAX/Metal GPU acceleration.
Setup:
- Install Miniforge (ARM64):
```
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh -b -p ~/miniforge3
```
- Create ARM64 environment:
```
~/miniforge3/bin/conda create -n codex-arm python=3.11 -y
~/miniforge3/envs/codex-arm/bin/pip install -r requirements.txt
```
- Install JAX with Metal support: `~/miniforge3/envs/codex-arm/bin/pip install jax-metal`
- Fix pdfminer compatibility (required for unstructured): `~/miniforge3/envs/codex-arm/bin/pip install "pdfminer.six==20240706"`
- Verify JAX/Metal: `~/miniforge3/envs/codex-arm/bin/python -c "import jax; print(jax.devices())"` (should show `[METAL(id=0)]`)
Activation:
```
source ~/miniforge3/bin/activate
conda activate codex-arm
```
When to use:
- Processing many PDFs regularly
- Books with complex tables/layouts where `hi_res` helps
- When you want GPU acceleration (2–5× faster than x86_64/Rosetta)
- New machine/environment setup from scratch
OCR Strategy:
- Recommended: `hi_res` (~15% faster, better element boundaries)
- Fallback: `ocr_only` if needed
Performance:
- `hi_res` OCR: ~88s/page (tested on M4 Pro, pages 16–18)
- `ocr_only` OCR: ~105s/page (ARM64 native, no JAX)
- Expected 2–5× speedup over x86_64/Rosetta for `hi_res` workloads
Known issues:
- numpy version conflict: jax-metal requires numpy>=2.0, but unstructured requires numpy<2 (works despite warning)
- pdfminer.six must be pinned to 20240706 for unstructured 0.16.9 compatibility
Rollback: Simply use your existing x86_64 environment. Miniforge and Miniconda can coexist.
- Requires Tesseract installed/on PATH.
- Models are configurable; defaults use `gpt-4.1-mini` with `--boost_model gpt-5`.
- Artifacts are JSON/JSONL; runs are append-only and reproducible via configs.
- Driver unit tests run in CI via `tests.yml`. Run locally with: `python -m unittest discover -s tests -p "driver_*test.py"`