add small models rock blog #36
Conversation
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
ajbozarth
left a comment
first pass with technical review, will follow with a review of the blog content later
```
@@ -0,0 +1,522 @@
---
title: "Making Small Models Rock with Mellea"
date: "2026-06-05"
```
did you mean
```diff
-date: "2026-06-05"
+date: "2026-05-06"
```
no, I'm targeting early June for release
oh, ok, that's a ways out, is there a motivation for holding off publish for over a month given the near readiness of the blog?
```
tags: ["mellea", "granite", "rag", "intrinsics", "small-models", "docling"]
---

![Making Small Models Rock with Mellea](/images/small-models-rock/main.png)
```
this sounds like something that should be fixed at the site level (?)
I'm not sure what could be fixed; the issue is that the image assumes a white (or light) background in its transparency layer
add a `--prose-img-bg` in `src/app/globals.css`?
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
```python
def _bom_entry_is_well_formed(entry: BOMEntry) -> bool:
    """Quantity is either an integer or the string 'allowance'."""
    try:
        int(entry.quantity)
        return True
    except ValueError:
        return entry.quantity.lower() == "allowance"
```
Two bugs here. `simple_validate` passes the model's raw output string to the validator, but `_bom_entry_is_well_formed` expects a `BOMEntry`. Also, line 188 calls `_bom_entries_are_well_formed` (plural), which is undefined. Rewrite the validator to accept and parse the string output:
```diff
-def _bom_entry_is_well_formed(entry: BOMEntry) -> bool:
-    """Quantity is either an integer or the string 'allowance'."""
-    try:
-        int(entry.quantity)
-        return True
-    except ValueError:
-        return entry.quantity.lower() == "allowance"
+def _bom_is_valid(output: str) -> bool:
+    bom = BOM.model_validate_json(output)
+    return all(
+        e.quantity.lower() == "allowance" or str(e.quantity).isdigit()
+        for e in bom.items
+    )
```
```python
requirements=[
    req(
        "Quantity should only contain an integer or Allowance",
        validation_fn=simple_validate(_bom_entries_are_well_formed),
```
Update to match the renamed function above.
```diff
-        validation_fn=simple_validate(_bom_entries_are_well_formed),
+        validation_fn=simple_validate(_bom_is_valid),
```
```python
m.instruct(
    "Reformat this table to have four columns: item, quantity, type, and notes.",
```
`BOMEntry` defines this field as `category`, not `type`. Mismatched column names cause validation failures.
```diff
-    "Reformat this table to have four columns: item, quantity, type, and notes.",
+    "Reformat this table to have four columns: item, quantity, category, and notes.",
```
```python
if is_material_list(m, table_markdown=table.to_markdown()) == "yes":
    bom_routines.append(m.ainstruct(..., format=BOM))

bom_thunks: list[ModelOutputThunk] = [await r for r in bom_routines]
```
This awaits each coroutine in series. The "wall-clock scales with the slowest table" claim on line 224 only holds with `asyncio.gather`.
```diff
-bom_thunks: list[ModelOutputThunk] = [await r for r in bom_routines]
+bom_thunks: list[ModelOutputThunk] = await asyncio.gather(*bom_routines)
```
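For reviewers who want to see the difference concretely, here's a minimal timing sketch, with `asyncio.sleep` standing in for the `m.ainstruct(...)` calls and assuming the routines are plain coroutines (not already-scheduled tasks):

```python
import asyncio
import time


async def fake_table_job(delay: float) -> str:
    # Stand-in for m.ainstruct(...): pretend each table takes `delay` seconds.
    await asyncio.sleep(delay)
    return "bom"


async def serial(jobs):
    return [await j for j in jobs]      # awaits one coroutine at a time


async def concurrent(jobs):
    return await asyncio.gather(*jobs)  # schedules all coroutines at once


async def main():
    start = time.perf_counter()
    await serial([fake_table_job(0.1) for _ in range(5)])
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    await concurrent([fake_table_job(0.1) for _ in range(5)])
    concurrent_s = time.perf_counter() - start
    return serial_s, concurrent_s


serial_s, concurrent_s = asyncio.run(main())
print(f"serial ~{serial_s:.2f}s, gather ~{concurrent_s:.2f}s")  # ~0.50s vs ~0.10s
```

If `ainstruct` instead returns tasks that start running on creation, the list comprehension overlaps too, but `gather` is the form that makes the concurrency guarantee explicit either way.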
> question, and context relevance drops to around 0.5 (a pricing document
> about construction, but not the right one) while answerability correctly
> collapses to "unanswerable." Frontier model logits don't give you this.
`check_context_relevance` returns a categorical string (`"relevant"`, `"irrelevant"`, or `"partially relevant"`), not a float.
```diff
-question, and context relevance drops to around 0.5 (a pricing document
-about construction, but not the right one) while answerability correctly
-collapses to "unanswerable." Frontier model logits don't give you this.
+question, and context relevance comes back `"partially relevant"` (a pricing document
+about construction, but not the right one) while answerability correctly
+collapses to `"unanswerable"`. Frontier model logits don't give you this.
```
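Downstream code then gates on labels rather than thresholds. A hypothetical sketch (the function and set names are mine, not the post's; the label strings mirror the ones `check_context_relevance` / `check_answerability` return):

```python
# Hypothetical gate combining the two categorical verdicts.
RELEVANT_ENOUGH = {"relevant", "partially relevant"}


def should_answer(relevance: str, answerability: str) -> bool:
    """Answer only when context is at least partially relevant AND answerable."""
    return relevance in RELEVANT_ENOUGH and answerability == "answerable"


print(should_answer("relevant", "answerable"))              # True
print(should_answer("partially relevant", "unanswerable"))  # False
```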
```python
citations = find_citations(
    response=price_response.value,
    documents=[doors_doc],
    context=ctx,
```
`ctx` is undefined in this snippet; every other snippet in this post passes `ChatContext()` directly.
```diff
-    context=ctx,
+    context=ChatContext(),
```
```
tags: ["granite", "rag", "intrinsics", "small-models", "docling"]
---

![Making Small Models Rock with Mellea](/images/small-models-rock/main.png)
```
`main.png` has a transparent background that renders poorly in dark mode. Once #39 is merged, rehype-raw support will be available and this can be fixed inline:
|  | |
| <img src="/images/small-models-rock/main.png" alt="Making Small Models Rock with Mellea" style="background-color: white;" /> |
> Step back from the construction example. What just happened is the general
> shape of the trade.
>
> ![A small model, harnessed](/images/small-models-rock/harnessed.png)
The `max-width: 60%` rule is currently in `globals.css` as a filename-specific selector (`.prose img[src$="harnessed.png"]`), which doesn't belong in a global stylesheet. Once #39 is merged, move it inline:
|  | |
| <img src="/images/small-models-rock/harnessed.png" alt="A small model, harnessed" style="max-width: 60%;" /> |
> and holds those pieces together with ordinary code rather than an
> ever-growing English prompt.
>
> Three things fall out of that approach. The first is predictable cost:
Mechanical enumeration labels ("The first is... The second is... The third is...") read as LLM-generated prose. Consider leading with the claims directly: "Cost is predictable: local inference has a fixed, knowable cost per run... Data stays local: your documents never leave the machine... The inference backend is yours to choose: Mellea talks to Ollama, vLLM..."
```python
    return prices
```

> Two things matter about this loop. First, `verdict == "answerable"` is a
"Two things matter about this loop. First, ... Second, ..." — the meta-preamble before the claims is a common LLM tell. Consider opening directly with the first claim: "verdict == \"answerable\" is a gate: items the intrinsic can't confidently answer get total_price=None..."
```python
from mellea.stdlib.components.intrinsic.rag import find_citations

citations = find_citations(
    response=price_response.value,
```
BLOCKER: `price_response` is never defined in this post (also `ctx` on L411, already flagged elsewhere). The pricing loop stores its results in `unit` and `total`, not `price_response`. Copy-pasting this block verbatim raises `NameError: name 'price_response' is not defined` (verified by running it).
If the intent was to cite the final price response, wiring it to `total` gives a runnable example:
```diff
-    response=price_response.value,
+citations = find_citations(
+    response=total.value,
+    documents=[doors_doc],
+    context=ChatContext(),
+    backend=m_hf.backend,
+)
```
> requirement with an explicit validator:

```python
m.instruct(
```
WARNING: Steps 1–3 all use `m.instruct(...)` / `m.ainstruct(...)`, but the first `m = mellea.start_session(...)` is in Step 4 (L432). A reader copy-pasting in order hits `NameError: name 'm' is not defined` on this block (confirmed by running it).
Consider adding a setup block just before Step 1 so the tutorial is runnable top-to-bottom:

```python
import mellea
from mellea.backends.model_ids import IBM_GRANITE_4_MICRO_3B

m = mellea.start_session(backend_name="ollama", model_id=IBM_GRANITE_4_MICRO_3B)
```

(Or equivalent; the session variable is load-bearing from here onward.)
> step is to get a clean, typed `BOM` object.
>
> `RichDocument` wraps [docling](https://github.com/DS4SD/docling) and
> exposes tables as markdown, which small models handle much better than raw
WARNING: A fresh `pip install mellea` raises `ImportError: RichDocument requires extra dependencies. Please install them with: pip install "mellea[docling]"` on this import. `LocalHFBackend` in Step 3 similarly needs `mellea[hf]`. The blog never mentions installing either; it's the first copy-paste failure for a reader.
One install line before Step 1 covers it:

```shell
pip install 'mellea[docling,hf]'
```

```python
from mellea.stdlib.components.docs.richdocument import RichDocument
```
```python
construction_plans = RichDocument.from_document_file(
    "construction_docs/construction_plans.pdf"
```
WARNING: The tutorial references `construction_docs/construction_plans.pdf` (and `product_catalogs/*.{pdf,docx,xlsx}` in Step 2) but doesn't tell readers where to get them. Running the block raises `FileNotFoundError`. A pointer to the linked tutorial notebook's asset directory would let readers actually follow along:
Sample input files are in the tutorial repo under `construction_docs/` and `product_catalogs/`.
```
| Small open-weight (<3B) | Doesn't understand the task. Returns a generic "cost breakdown" with no prices. |
| Open-weight reasoning (~20B) | Finds categories and subtotals. No pie chart. Numbers often wrong. |
| Gemini Fast | Mostly reasonable. No chart. Some prices off. |
| GPT-5.4 Pro, extended thinking | Gets most items right. Cites sources. No chart on first shot. ~$1/run. |
```
WARNING — "GPT-5.4 Pro" appears here and on L15, L93. I can't find this as an actual OpenAI product name. For a cost-vs-capability post anchored on a frontier-model baseline, the specific model matters; an invented name undermines the table. If the intent was "whatever the top-tier frontier reasoning model is at publish time," a generic phrasing avoids ageing:
```diff
-| GPT-5.4 Pro, extended thinking | Gets most items right. Cites sources. No chart on first shot. ~$1/run. |
+| Frontier reasoning (GPT-5 Pro / o-series / Claude Opus) | Gets most items right. Cites sources. No chart on first shot. ~$1/run. |
```
> The construction case isn't a one-off. The same three-pattern approach
> generalizes. On agent benchmarks the Mellea team has run (a DB2 database
> agent and a compliance agent), rewriting large prompt-based systems as
> Mellea programs moves a Llama 70B setup from ~80% task completion to ~90%,
WARNING — "Mellea programs moves a Llama 70B setup from ~80% task completion to ~90%, and lets a Granite 8B model match or beat a Llama 70B baseline" — these are strong quantitative claims, and the argument rests on them. A link to the DB2-agent / compliance-agent results (or a benchmark table in the mellea repo) would turn this from marketing into evidence. If those numbers aren't public yet, consider softening to "in internal evaluations" and committing to publish.
```python
windows_doc = Document(text=rd_windows.to_markdown())

rd_lumber = RichDocument.from_document_file("product_catalogs/cone_mountain_lumber_catalog.xlsx")
lumber_doc = Document(text=rd_lumber.to_markdown())
```
SUGGESTION — lumber_doc is loaded here but never keyed into the pricing catalog ({"windows": windows_doc, "doors": doors_doc} on L350). The comment at L358 explains lumber is skipped for Colab T4 runtime, but the load is still wasted work and confuses the shape of the pipeline. Either drop the three lumber-loading lines, or include lumber in the dict and let check_answerability return "unanswerable" for items the catalog can't price — both make the "skipped lumber" story explicit in the code rather than buried in a comment.
```python
    "line-item material list with prices. At the top include the /tmp/chart.png image.",
    grounding_context=report_grounding_context,
)
open("/tmp/report.html", "w").write(report.value)
```
NIT — Tutorial code gets copied verbatim; open(...).write(...) teaches a bad habit. A context manager is the one-line fix:
```diff
-open("/tmp/report.html", "w").write(report.value)
+with open("/tmp/report.html", "w") as f:
+    f.write(report.value)
```
A broader editorial observation, on top of the inline notes above.

Long prose stretches with no visual breaks: "The Bet", the ALoRA explanation, and the "Why intrinsics are cheap to compose" sections each run 3–5 paragraphs of uninterrupted running text. A reader skimming in from the link-sharing site of your choice has nothing to hook onto: no bullet list, no pull-quote, no callout. Consider breaking the longer argumentative sections with either (a) a short bulleted summary of the three differentiators, (b) a margin callout or blockquote for the one-line claim that matters, or (c) an extra H3 that lets the reader resume after an interrupt. The Dijkstra passage is strong enough that it could earn a standalone pull-quote.

Long code blocks without internal narration: a quick scan of the 12 fenced Python blocks shows the 57-line pricing loop is the main offender; a reader has to hold the whole thing in their head to reach the "
Inline comments in tutorial code are an anti-pattern in production but are the right call in a blog post: readers copy the block into a notebook, and the comments are their only in-line teacher.

Neither of these is a blocker, just things that would turn this from "good if you read carefully" into "easy to follow on first scroll."
One more pass, purely editorial: positioning and discoverability asks, not correctness. All are optional polish.

No fast hook for skimmers. The post is ~2,800 words / 14-min read, and the first hands-on code appears at L123. A reader linking in from HN or a social share needs a 30-second on-ramp. Consider a short callout between the excerpt and "The Bet" along these lines:

"Harness" is load-bearing but undefined. The word carries most of the argument (L19, L25, L28, L30, L383, L508) but never gets a definition. A reader who hasn't already absorbed Mellea's framing has to infer it. One sentence near first use, e.g. "By 'harness' we mean the software scaffolding around the model call: decomposition, validation, retries, tool dispatch — the part that isn't the forward pass.", makes the rest of the post land harder.

Pain points skew finance-y; the dev concerns are missing. The three differentiators (cost, data sovereignty, vendor-agnostic) hit procurement and regulated-industries buyers well. The pains that devs themselves feel are under-represented. These are one paragraph each; the "Trade-offs" section at the bottom is a natural home if you don't want to expand the opening argument.

Cost comparison is one-sided. The $1/run vs "no per-token billing" framing is accurate but omits the local side of the ledger: GPU/laptop amortisation, electricity, and the engineering time to build the decomposed pipeline. The "Trade-offs" section admits "decomposition takes engineering effort" but doesn't put a number on it. Even a rough "a senior engineer can port a prompt pipeline in a day or two" would neutralise the "you're hiding the real cost" objection that readers will raise in the comments either way.

Terminology: "intrinsics" vs "adapters". The post uses both terms interchangeably: ten uses of intrinsic(s) (L64, L246, L255, L259, L301, L312, L329, L393, L406, L521) and six uses of adapter(s) (L255, L316, L319, L322, L396, L398), including the mixed phrasing on L319 "Granite intrinsics ship as ALoRA adapters". My understanding is the Mellea/Granite framing has shifted toward adapters as the external-facing term (with intrinsic still in the module path for now). If that's right, it's worth a sweep to standardise: probably adapters everywhere in prose, with one parenthetical acknowledgement that the Python import path still uses `intrinsic`. The frontmatter tags would follow:

```
tags: ["granite", "rag", "adapters", "small-models", "docling", "local-llm"]
```

(I'll defer to you on whether adapters or intrinsics is the preferred term; the ask is consistency, not a specific choice.)

None of the above is a blocker; the core argument is strong.
planetf1
left a comment
as per comments (need evaluation - but your interpretation about what should be changed is fine)
