# Datasheet Extractor Walkthrough (Notebook)

This notebook is a hands-on tutorial for engineers building **IC device drivers**.

Instead of jumping straight to the final CLI output, we walk the same path your app takes:

1. Start with plain Docling conversion (baseline behavior)
2. Add project-specific processing (chunking, figure artifacts, local LLM triage)
3. Produce organized, stable outputs for downstream driver workflows

By the end, you should clearly see what each layer contributes and why the final pipeline is more useful than raw PDF parsing alone.

In [None]:
from __future__ import annotations

import json
import os
import shutil
from pathlib import Path
from pprint import pprint

# Resolve repo root whether notebook is run from repo root or notebooks/.
cwd = Path.cwd().resolve()
REPO_ROOT = cwd.parent if cwd.name == "notebooks" else cwd

# Optional local-only env overrides (not committed): notebooks/.local_env.json
# Example:
# {
#   "OLLAMA_HOST": "http://172.26.0.1:11434",
#   "ANTHROPIC_BASE_URL": "http://172.26.0.1:11434",
#   "ANTHROPIC_AUTH_TOKEN": "ollama",
#   "ANTHROPIC_API_KEY": ""
# }
local_env_path = REPO_ROOT / "notebooks" / ".local_env.json"
if local_env_path.exists():
    local_env = json.loads(local_env_path.read_text(encoding="utf-8"))
    for key, value in local_env.items():
        os.environ.setdefault(key, str(value))

# Safe defaults for local usage; override via shell env or .local_env.json.
os.environ.setdefault("OLLAMA_HOST", "http://127.0.0.1:11434")
print(f"OLLAMA_HOST: {os.environ.get('OLLAMA_HOST')}")

PDF_PATH = REPO_ROOT / "examples" / "dac7578" / "dac5578.pdf"
WORK_ROOT = REPO_ROOT / "out_notebook"

# Config knobs for runtime: default to one page for fast demos.
PAGES = "1"
MAX_TOKENS = 256

if not PDF_PATH.exists():
    raise FileNotFoundError(f"Missing sample PDF: {PDF_PATH}")

print(f"Repo root: {REPO_ROOT}")
print(f"PDF:       {PDF_PATH}")
print(f"Work dir:  {WORK_ROOT}")
print(f"PAGES:     {PAGES}")

# Local LLM demo controls.
STRICT_LOCAL_LLM = True  # Enforce local-model-only execution for this walkthrough.

from src.local_processor import _detect_ollama_model
LOCAL_VISION_MODEL = _detect_ollama_model()
print(f"Local vision model detected: {LOCAL_VISION_MODEL}")
if not LOCAL_VISION_MODEL:
    raise RuntimeError("No Ollama vision model detected. Start Ollama, run `ollama pull moondream`, and re-run the notebook.")


## Stage 1: Docling Out of the Box

We begin with **default Docling** behavior and no project-specific logic.

### Why this stage matters
- It establishes the baseline quality and structure you get from a strong general-purpose parser.
- It shows the raw signal available before schema normalization and figure triage.

### What to look for
- Total pages, tables, and image-like items (`PictureItem`)
- A markdown preview of raw document export

This tells us what Docling can do alone, and what gaps we still need to fill for driver-oriented extraction.

In [None]:
from docling.document_converter import DocumentConverter
from docling_core.types.doc import PictureItem, TableItem

converter = DocumentConverter()
conv_res = converter.convert(str(PDF_PATH))
doc = conv_res.document

num_pages = getattr(doc, "num_pages", None)
num_pages = num_pages() if callable(num_pages) else num_pages
if num_pages is None:
    num_pages = len(getattr(doc, "pages", {}))

picture_count = 0
table_item_count = 0
for item, _ in doc.iterate_items():
    if isinstance(item, PictureItem):
        picture_count += 1
    elif isinstance(item, TableItem):
        table_item_count += 1

print("Docling baseline summary")
print("- num_pages:", num_pages)
print("- doc.tables:", len(getattr(doc, "tables", [])))
print("- PictureItem count:", picture_count)
print("- TableItem count:", table_item_count)

if hasattr(doc, "export_to_markdown"):
    md = doc.export_to_markdown()
    print("\nDocling markdown preview (first 1000 chars):\n")
    print(md[:1000])



### Interpreting Stage 1

If Docling finds many `PictureItem`s, that usually means it is extracting every image-like object in the PDF, not just semantic diagrams.

That is useful raw coverage, but it also explains why downstream filtering/organization is needed before these assets are practical for driver-focused use.
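
As a rough illustration of that kind of filtering, the sketch below drops image-like items that are too small in either dimension to be a meaningful diagram. The `filter_small_images` helper, the `min_side` threshold, and the dict shape are all assumptions made for this example; they are not taken from this repository or from Docling.

```python
from typing import Dict, List


def filter_small_images(items: List[Dict], min_side: int = 64) -> List[Dict]:
    """Keep items whose width and height both reach min_side pixels.

    Hypothetical helper: the threshold and the item shape are illustrative only.
    """
    return [
        item
        for item in items
        if item["width"] >= min_side and item["height"] >= min_side
    ]


# Made-up sample items standing in for extracted image-like objects.
candidates = [
    {"id": "logo", "width": 40, "height": 40},
    {"id": "block_diagram", "width": 900, "height": 600},
    {"id": "horizontal_rule", "width": 1200, "height": 3},
]

kept = filter_small_images(candidates)
print([item["id"] for item in kept])
```

A size check like this is only a first pass; it says nothing about whether a large image is actually relevant to a driver.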

## Stage 2: Project Enhancements on Top of Docling

Now we run the same core extraction logic used in this repository.

### Enhancements introduced here
- **Hybrid chunking** with heading context and `enriched_text`
- **Figure export** into stable IDs (`fig_0001`, etc.)
- Optional **local LLM triage** to classify figures and route complex ones

### Why this matters for driver development
These steps make it easier to isolate register-relevant text/tables and identify figures that need higher-quality interpretation before they can be trusted in implementation.
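
To make two of these ideas concrete, here is a minimal sketch of how stable figure IDs and heading-enriched text could be produced. Both helpers are hypothetical: the 1-based numbering and the `" > "` heading separator are assumptions, and the actual format produced by `extract_document` may differ.

```python
from typing import Sequence


def make_figure_id(index: int) -> str:
    """Format a 1-based figure index as a zero-padded, sortable ID."""
    if index < 1:
        raise ValueError("figure index must be >= 1")
    return f"fig_{index:04d}"


def build_enriched_text(headings: Sequence[str], text: str) -> str:
    """Prefix chunk text with its heading trail (assumed format)."""
    trail = " > ".join(h.strip() for h in headings if h.strip())
    return f"{trail}\n{text}" if trail else text


# Made-up sample content, not taken from the datasheet.
print(make_figure_id(1))
print(build_enriched_text(["Register Map", "Control Register"], "Bit 7 enables the output."))
```

Zero-padded IDs sort correctly as plain strings, and carrying the heading trail with each chunk keeps a snippet interpretable when it is read out of context.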

In [None]:
from src.extract_docling import extract_document, to_blocks

stage2_dir = WORK_ROOT / "stage2_docling_plus"
shutil.rmtree(stage2_dir, ignore_errors=True)
stage2_dir.mkdir(parents=True, exist_ok=True)

raw = extract_document(PDF_PATH, out_dir=stage2_dir, max_tokens=MAX_TOKENS)
blocks = to_blocks(raw.get("blocks", []))

if PAGES:
    # Parse a page spec such as "1", "1,3", or "2-4" into a set of page numbers.
    keep = set()
    for token in PAGES.split(","):
        token = token.strip()
        if "-" in token:
            # Inclusive range: "2-4" keeps pages 2, 3, and 4.
            a, b = token.split("-", 1)
            keep.update(range(int(a), int(b) + 1))
        elif token:
            keep.add(int(token))
    # Keep only blocks, tables, and figures that fall on the selected pages.
    blocks = [b for b in blocks if b.page in keep]
    raw["tables"] = [t for t in raw.get("tables", []) if int(t.get("page", 1)) in keep]
    raw["figures"] = [f for f in raw.get("figures", []) if int(f.get("page", 1)) in keep]

print("Project extraction summary")
print("- page_count:", raw.get("page_count"))
print("- blocks:", len(blocks))
print("- tables:", len(raw.get("tables", [])))
print("- figures metadata:", len(raw.get("figures", [])))
print("- figure images on disk:", len(list((stage2_dir / "figures").glob("fig_*.png"))))

if blocks:
    sample = blocks[0]
    print("\nSample block:")
    print("- id:", sample.id)
    print("- page:", sample.page)
    print("- headings:", sample.headings)
    print("- text preview:", sample.text[:250])
    print("- enriched preview:", sample.enriched_text[:250])

if raw.get("tables"):
    print("\nSample table (header row):")
    print(raw["tables"][0].get("grid", [[]])[0])

if raw.get("figures"):
    print("\nSample figure metadata:")
    pprint(raw["figures"][0])


### Local LLM Triage (Optional but Useful)

This next cell runs local figure processing to classify and summarize extracted images.

Use it as a **routing signal**, not ground truth: simple artifacts can be resolved locally, while complex technical figures should be escalated to a stronger external model or manual review.
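
One way to act on such a signal is sketched below. It assumes each status is a dict with a boolean `needs_external` flag and a numeric `confidence` between 0 and 1; the `route_figure` helper and the `0.7` cutoff are hypothetical and not part of `src.local_processor`.

```python
from typing import Dict


def route_figure(status: Dict, min_confidence: float = 0.7) -> str:
    """Return "escalate" or "local" for one figure status (illustrative rule)."""
    # An explicit flag from the local pass always wins.
    if status.get("needs_external"):
        return "escalate"
    # Assumed: confidence is a float in [0, 1]; missing values count as 0.
    if float(status.get("confidence", 0.0)) < min_confidence:
        return "escalate"
    return "local"


# Made-up statuses for demonstration.
samples = [
    {"figure_id": "fig_0001", "needs_external": False, "confidence": 0.92},
    {"figure_id": "fig_0002", "needs_external": True, "confidence": 0.95},
    {"figure_id": "fig_0003", "needs_external": False, "confidence": 0.40},
]

routes = {s["figure_id"]: route_figure(s) for s in samples}
print(routes)
```

Treating low confidence the same as an explicit escalation flag errs on the side of review, which suits figures that feed register-level implementation decisions.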

In [None]:
from src.local_processor import build_rollup, process_all_figures

figures_dir = stage2_dir / "figures"
processing_dir = stage2_dir / "processing"
processing_dir.mkdir(parents=True, exist_ok=True)

fig_paths = sorted(figures_dir.glob("fig_*.png"))
demo_n = min(12, len(fig_paths))
demo_ids = {p.stem for p in fig_paths[:demo_n]}

print(f"Running local figure processing on {demo_n} figure(s)...")
statuses = process_all_figures(
    figures_dir=figures_dir,
    processing_dir=processing_dir,
    ollama_model=LOCAL_VISION_MODEL,
    force=True,
    figure_ids=demo_ids,
)

rollup = build_rollup(statuses)
print("\nLocal processing rollup (subset):")
pprint(rollup["summary"])

ran_local_llm = sum(1 for s in statuses if s.get("stage") == "local_llm")
print(f"Local LLM-processed figures in subset: {ran_local_llm}/{len(statuses)}")

print("\nFirst 3 per-figure statuses:")
for status in statuses[:3]:
    pprint({
        "figure_id": status["figure_id"],
        "stage": status.get("stage", ""),
        "status": status["status"],
        "classification": status.get("local_llm_classification", ""),
        "needs_external": status["needs_external"],
        "confidence": status["confidence"],
    })


## Stage 3: Full App Pipeline and Organized Outputs

In this stage we run `process_pdf(...)`, the same end-to-end flow used by the CLI.

### What this adds beyond Stage 2
- Canonical `document.json`
- Per-format table exports (`json/csv/md`)
- Per-figure processing status files
- Manual-followup reports and rollups

This is the transition from a technical demo to a reproducible workflow you can hand to teammates or downstream automation.
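
For intuition about the per-format table exports, the sketch below renders a table grid (a list of rows, header first) as CSV and as a Markdown table. The `grid_to_csv` and `grid_to_markdown` helpers are assumptions for illustration, and the sample rows are made up; the pipeline's own exporters may format things differently.

```python
import csv
import io
from typing import List


def grid_to_csv(grid: List[List[str]]) -> str:
    """Serialize rows to CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


def grid_to_markdown(grid: List[List[str]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not grid:
        return ""
    header, *rows = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


# Made-up sample grid; cells are assumed to be strings.
sample_grid = [["Register", "Address"], ["CTRL", "0x00"]]
print(grid_to_csv(sample_grid))
print(grid_to_markdown(sample_grid))
```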

In [None]:
from src.pipeline import process_pdf

stage3_root = WORK_ROOT / "stage3_app"
shutil.rmtree(stage3_root, ignore_errors=True)
stage3_root.mkdir(parents=True, exist_ok=True)

result = process_pdf(
    pdf_path=PDF_PATH,
    out_root=stage3_root,
    pages=PAGES,
    force=True,
    no_images=False,
    no_tables=False,
    max_figures=None,
    ollama_model=LOCAL_VISION_MODEL,
    max_tokens=MAX_TOKENS,
)

pdf_out = Path(result["out_dir"])
doc_json = pdf_out / "document.json"
index_json = pdf_out / "index.json"
manual_json = pdf_out / "manual_processing_report.json"
rollup_json = pdf_out / "processing_rollup.json"

print("App output directory:", pdf_out)
print("document.json exists:", doc_json.exists())
print("index.json exists:", index_json.exists())
print("manual report exists:", manual_json.exists())
print("processing rollup exists:", rollup_json.exists())

# Use a distinct name so the Docling `doc` object from Stage 1 is not overwritten.
doc_payload = json.loads(doc_json.read_text(encoding="utf-8"))
print("\ndoc_stats:")
pprint(doc_payload["doc_stats"])


In [None]:
print("Key files under out_notebook/stage3_app/<pdf_stem>/")
for rel in [
    "document.json",
    "index.json",
    "manual_processing_report.json",
    "manual_processing_report.md",
    "processing_rollup.json",
    "processing_rollup.md",
    "tables",
    "figures",
    "processing",
    "derived",
]:
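    p = pdf_out / rel
    # Flag missing paths explicitly so absent artifacts are not reported as files.
    marker = "dir" if p.is_dir() else "file" if p.is_file() else "missing"
    print(f"- [{marker}] {p.relative_to(stage3_root)}")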

table_files = sorted((pdf_out / "tables").glob("*"))
figure_files = sorted((pdf_out / "figures").glob("fig_*.png"))
status_files = sorted((pdf_out / "processing").glob("fig_*.json"))

print("\nCounts")
print("- tables artifacts:", len(table_files))
print("- figure images:", len(figure_files))
print("- processing status files:", len(status_files))

if rollup_json.exists():
    rollup = json.loads(rollup_json.read_text(encoding="utf-8"))
    print("\nprocessing_rollup.summary:")
    pprint(rollup["summary"])


### What You Should Have at This Point

You now have a complete per-document package with normalized content and explicit processing state.

This notebook is local-model-only: figure triage runs with your local Ollama vision model (for example `moondream`).
If no local model is available, the notebook fails fast so results are never mixed with fallback behavior.

That package is the bridge from a static datasheet PDF to driver-development tasks like register abstraction, validation checklists, and implementation traceability.