add small models rock blog #36
Conversation
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
ajbozarth
left a comment
first pass with technical review, will follow with a review of the blog content later
```
@@ -0,0 +1,522 @@
---
title: "Making Small Models Rock with Mellea"
date: "2026-06-05"
```
did you mean
```diff
-date: "2026-06-05"
+date: "2026-05-06"
```
no, I'm targeting early June for release
oh, ok, that's a ways out, is there a motivation for holding off publish for over a month given the near readiness of the blog?
```
tags: ["mellea", "granite", "rag", "intrinsics", "small-models", "docling"]
---

![Making Small Models Rock with Mellea](/images/small-models-rock/main.png)
```
this sounds like something that should be fixed at the site level (?)
I'm not sure what could be fixed; the issue is that the image assumes a white (or light) background in its transparency layer
add a `--prose-img-bg` in `src/app/globals.css`?
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
```python
def _bom_entry_is_well_formed(entry: BOMEntry) -> bool:
    """Quantity is either an integer or the string 'allowance'."""
    try:
        int(entry.quantity)
        return True
    except ValueError:
        return entry.quantity.lower() == "allowance"
```
Two bugs here. `simple_validate` passes the model's raw output string to the validator, but `_bom_entry_is_well_formed` expects a `BOMEntry`. Also, line 188 calls `_bom_entries_are_well_formed` (plural), which is undefined. Rewrite the validator to accept and parse the string output:
```diff
-def _bom_entry_is_well_formed(entry: BOMEntry) -> bool:
-    """Quantity is either an integer or the string 'allowance'."""
-    try:
-        int(entry.quantity)
-        return True
-    except ValueError:
-        return entry.quantity.lower() == "allowance"
+def _bom_is_valid(output: str) -> bool:
+    bom = BOM.model_validate_json(output)
+    return all(
+        e.quantity.lower() == "allowance" or str(e.quantity).isdigit()
+        for e in bom.items
+    )
```
```python
requirements=[
    req(
        "Quantity should only contain an integer or Allowance",
        validation_fn=simple_validate(_bom_entries_are_well_formed),
```
Update to match the renamed function above.
```diff
-        validation_fn=simple_validate(_bom_entries_are_well_formed),
+        validation_fn=simple_validate(_bom_is_valid),
```
```python
m.instruct(
    "Reformat this table to have four columns: item, quantity, type, and notes.",
```
`BOMEntry` defines this field as `category`, not `type`. Mismatched column names cause validation failures.
```diff
-    "Reformat this table to have four columns: item, quantity, type, and notes.",
+    "Reformat this table to have four columns: item, quantity, category, and notes.",
```
```python
if is_material_list(m, table_markdown=table.to_markdown()) == "yes":
    bom_routines.append(m.ainstruct(..., format=BOM))

bom_thunks: list[ModelOutputThunk] = [await r for r in bom_routines]
```
This awaits each coroutine in series. The "wall-clock scales with the slowest table" claim on line 224 only holds with `asyncio.gather`.
```diff
-bom_thunks: list[ModelOutputThunk] = [await r for r in bom_routines]
+bom_thunks: list[ModelOutputThunk] = await asyncio.gather(*bom_routines)
```
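For reviewers who want to see the difference concretely, here's a minimal timing sketch, with `asyncio.sleep` standing in for the `m.ainstruct(...)` calls and assuming the routines are plain coroutines (not already-scheduled tasks):

```python
import asyncio
import time


async def fake_table_job(delay: float) -> str:
    # Stand-in for m.ainstruct(...): pretend each table takes `delay` seconds.
    await asyncio.sleep(delay)
    return "bom"


async def serial(jobs):
    return [await j for j in jobs]      # awaits one coroutine at a time


async def concurrent(jobs):
    return await asyncio.gather(*jobs)  # schedules all coroutines at once


async def main():
    start = time.perf_counter()
    await serial([fake_table_job(0.1) for _ in range(5)])
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    await concurrent([fake_table_job(0.1) for _ in range(5)])
    concurrent_s = time.perf_counter() - start
    return serial_s, concurrent_s


serial_s, concurrent_s = asyncio.run(main())
print(f"serial ~{serial_s:.2f}s, gather ~{concurrent_s:.2f}s")  # ~0.50s vs ~0.10s
```

If `ainstruct` instead returns tasks that start running on creation, the list comprehension overlaps too, but `gather` is the form that makes the concurrency guarantee explicit either way.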
> question, and context relevance drops to around 0.5 (a pricing document
> about construction, but not the right one) while answerability correctly
> collapses to "unanswerable." Frontier model logits don't give you this.
`check_context_relevance` returns a categorical string (`"relevant"`, `"irrelevant"`, or `"partially relevant"`), not a float.
```diff
-question, and context relevance drops to around 0.5 (a pricing document
-about construction, but not the right one) while answerability correctly
-collapses to "unanswerable." Frontier model logits don't give you this.
+question, and context relevance comes back `"partially relevant"` (a pricing document
+about construction, but not the right one) while answerability correctly
+collapses to `"unanswerable"`. Frontier model logits don't give you this.
```
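Downstream code then gates on labels rather than thresholds. A hypothetical sketch (the function and set names are mine, not the post's; the label strings mirror the ones `check_context_relevance` / `check_answerability` return):

```python
# Hypothetical gate combining the two categorical verdicts.
RELEVANT_ENOUGH = {"relevant", "partially relevant"}


def should_answer(relevance: str, answerability: str) -> bool:
    """Answer only when context is at least partially relevant AND answerable."""
    return relevance in RELEVANT_ENOUGH and answerability == "answerable"


print(should_answer("relevant", "answerable"))              # True
print(should_answer("partially relevant", "unanswerable"))  # False
```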
```python
citations = find_citations(
    response=price_response.value,
    documents=[doors_doc],
    context=ctx,
```
`ctx` is undefined in this snippet; every other snippet in this post passes `ChatContext()` directly.
```diff
-    context=ctx,
+    context=ChatContext(),
```
```
tags: ["granite", "rag", "intrinsics", "small-models", "docling"]
---

![Making Small Models Rock with Mellea](/images/small-models-rock/main.png)
```
`main.png` has a transparent background that renders poorly in dark mode. Once #39 is merged, rehype-raw support will be available and this can be fixed inline:
|  | |
| <img src="/images/small-models-rock/main.png" alt="Making Small Models Rock with Mellea" style="background-color: white;" /> |
> Step back from the construction example. What just happened is the general
> shape of the trade.
>
> ![A small model, harnessed](/images/small-models-rock/harnessed.png)
The `max-width: 60%` rule is currently in `globals.css` as a filename-specific selector (`.prose img[src$="harnessed.png"]`), which doesn't belong in a global stylesheet. Once #39 is merged, move it inline:
|  | |
| <img src="/images/small-models-rock/harnessed.png" alt="A small model, harnessed" style="max-width: 60%;" /> |
> and holds those pieces together with ordinary code rather than an
> ever-growing English prompt.
>
> Three things fall out of that approach. The first is predictable cost:
Mechanical enumeration labels ("The first is... The second is... The third is...") read as LLM-generated prose. Consider leading with the claims directly: "Cost is predictable: local inference has a fixed, knowable cost per run... Data stays local: your documents never leave the machine... The inference backend is yours to choose: Mellea talks to Ollama, vLLM..."
```python
    return prices
```

> Two things matter about this loop. First, `verdict == "answerable"` is a
"Two things matter about this loop. First, ... Second, ..." — the meta-preamble before the claims is a common LLM tell. Consider opening directly with the first claim: "verdict == \"answerable\" is a gate: items the intrinsic can't confidently answer get total_price=None..."
```python
from mellea.stdlib.components.intrinsic.rag import find_citations

citations = find_citations(
    response=price_response.value,
```
BLOCKER: `price_response` is never defined in this post (also `ctx` on L411, already flagged elsewhere). The pricing loop stores its results in `unit` and `total`, not `price_response`. Copy-pasting this block verbatim raises `NameError: name 'price_response' is not defined` (verified by running it).
If the intent was to cite the final price response, wiring it to `total` gives a runnable example:
```diff
-    response=price_response.value,
+citations = find_citations(
+    response=total.value,
+    documents=[doors_doc],
+    context=ChatContext(),
+    backend=m_hf.backend,
+)
```
> requirement with an explicit validator:

```python
m.instruct(
```
WARNING: Steps 1–3 all use `m.instruct(...)` / `m.ainstruct(...)`, but the first `m = mellea.start_session(...)` is in Step 4 (L432). A reader copy-pasting in order hits `NameError: name 'm' is not defined` on this block (confirmed by running it).
Consider adding a setup block just before Step 1 so the tutorial is runnable top-to-bottom:

```python
import mellea
from mellea.backends.model_ids import IBM_GRANITE_4_MICRO_3B

m = mellea.start_session(backend_name="ollama", model_id=IBM_GRANITE_4_MICRO_3B)
```

(Or equivalent; the session variable is load-bearing from here onward.)
> step is to get a clean, typed `BOM` object.
>
> `RichDocument` wraps [docling](https://github.com/DS4SD/docling) and
> exposes tables as markdown, which small models handle much better than raw
WARNING: A fresh `pip install mellea` raises `ImportError: RichDocument requires extra dependencies. Please install them with: pip install "mellea[docling]"` on this import. `LocalHFBackend` in Step 3 similarly needs `mellea[hf]`. The blog never mentions installing either; it's the first copy-paste failure for a reader.
One install line before Step 1 covers it:

```shell
pip install 'mellea[docling,hf]'
```

```python
from mellea.stdlib.components.docs.richdocument import RichDocument
```
```python
construction_plans = RichDocument.from_document_file(
    "construction_docs/construction_plans.pdf"
```
WARNING: The tutorial references `construction_docs/construction_plans.pdf` (and `product_catalogs/*.{pdf,docx,xlsx}` in Step 2) but doesn't tell readers where to get them. Running the block raises `FileNotFoundError`. A pointer to the linked tutorial notebook's asset directory would let readers actually follow along:
Sample input files are in the tutorial repo under `construction_docs/` and `product_catalogs/`.
```
| Small open-weight (<3B) | Doesn't understand the task. Returns a generic "cost breakdown" with no prices. |
| Open-weight reasoning (~20B) | Finds categories and subtotals. No pie chart. Numbers often wrong. |
| Gemini Fast | Mostly reasonable. No chart. Some prices off. |
| GPT-5.4 Pro, extended thinking | Gets most items right. Cites sources. No chart on first shot. ~$1/run. |
```
WARNING — "GPT-5.4 Pro" appears here and on L15, L93. I can't find this as an actual OpenAI product name. For a cost-vs-capability post anchored on a frontier-model baseline, the specific model matters; an invented name undermines the table. If the intent was "whatever the top-tier frontier reasoning model is at publish time," a generic phrasing avoids ageing:
```diff
-| GPT-5.4 Pro, extended thinking | Gets most items right. Cites sources. No chart on first shot. ~$1/run. |
+| Frontier reasoning (GPT-5 Pro / o-series / Claude Opus) | Gets most items right. Cites sources. No chart on first shot. ~$1/run. |
```
> The construction case isn't a one-off. The same three-pattern approach
> generalizes. On agent benchmarks the Mellea team has run (a DB2 database
> agent and a compliance agent), rewriting large prompt-based systems as
> Mellea programs moves a Llama 70B setup from ~80% task completion to ~90%,
WARNING — "Mellea programs moves a Llama 70B setup from ~80% task completion to ~90%, and lets a Granite 8B model match or beat a Llama 70B baseline" — these are strong quantitative claims, and the argument rests on them. A link to the DB2-agent / compliance-agent results (or a benchmark table in the mellea repo) would turn this from marketing into evidence. If those numbers aren't public yet, consider softening to "in internal evaluations" and committing to publish.
```python
windows_doc = Document(text=rd_windows.to_markdown())

rd_lumber = RichDocument.from_document_file("product_catalogs/cone_mountain_lumber_catalog.xlsx")
lumber_doc = Document(text=rd_lumber.to_markdown())
```
SUGGESTION — lumber_doc is loaded here but never keyed into the pricing catalog ({"windows": windows_doc, "doors": doors_doc} on L350). The comment at L358 explains lumber is skipped for Colab T4 runtime, but the load is still wasted work and confuses the shape of the pipeline. Either drop the three lumber-loading lines, or include lumber in the dict and let check_answerability return "unanswerable" for items the catalog can't price — both make the "skipped lumber" story explicit in the code rather than buried in a comment.
```python
    "line-item material list with prices. At the top include the /tmp/chart.png image.",
    grounding_context=report_grounding_context,
)
open("/tmp/report.html", "w").write(report.value)
```
NIT — Tutorial code gets copied verbatim; open(...).write(...) teaches a bad habit. A context manager is the one-line fix:
```diff
-open("/tmp/report.html", "w").write(report.value)
+with open("/tmp/report.html", "w") as f:
+    f.write(report.value)
```
A broader editorial observation, on top of the inline notes above.

Long prose stretches with no visual breaks: "The Bet", the ALoRA explanation, and the "Why intrinsics are cheap to compose" sections each run 3–5 paragraphs of uninterrupted running text. A reader skimming in from the link-sharing site of your choice has nothing to hook onto: no bullet list, no pull-quote, no callout. Consider breaking the longer argumentative sections with either (a) a short bulleted summary of the three differentiators, (b) a margin callout or blockquote for the one-line claim that matters, or (c) an extra H3 that lets the reader resume after an interrupt. The Dijkstra passage is strong enough that it could earn a standalone pull-quote.

Long code blocks without internal narration: a quick scan of the 12 fenced Python blocks shows the 57-line pricing loop is the main offender; a reader has to hold the whole thing in their head to reach the "
Inline comments in tutorial code are an anti-pattern in production but are the right call in a blog post: readers copy the block into a notebook, and the comments are their only in-line teacher.

Neither of these is a blocker, just things that would turn this from "good if you read carefully" into "easy to follow on first scroll."
One more pass, purely editorial: positioning and discoverability asks, not correctness. All are optional polish.

No fast hook for skimmers. The post is ~2,800 words / 14-min read, and the first hands-on code appears at L123. A reader linking in from HN or a social share needs a 30-second on-ramp. Consider a short callout between the excerpt and "The Bet" along these lines:

"Harness" is load-bearing but undefined. The word carries most of the argument (L19, L25, L28, L30, L383, L508) but never gets a definition. A reader who hasn't already absorbed Mellea's framing has to infer it. One sentence near first use, e.g. "By 'harness' we mean the software scaffolding around the model call: decomposition, validation, retries, tool dispatch — the part that isn't the forward pass.", makes the rest of the post land harder.

Pain points skew finance-y; the dev concerns are missing. The three differentiators (cost, data sovereignty, vendor-agnostic) hit procurement and regulated-industries buyers well. The pains that devs themselves feel are under-represented. These are one paragraph each; the "Trade-offs" section at the bottom is a natural home if you don't want to expand the opening argument.

Cost comparison is one-sided. The $1/run vs "no per-token billing" framing is accurate but omits the local side of the ledger: GPU/laptop amortisation, electricity, and the engineering time to build the decomposed pipeline. The "Trade-offs" section admits "decomposition takes engineering effort" but doesn't put a number on it. Even a rough "a senior engineer can port a prompt pipeline in a day or two" would neutralise the "you're hiding the real cost" objection that readers will raise in the comments either way.

Terminology: "intrinsics" vs "adapters". The post uses both terms interchangeably: ten uses of intrinsic(s) (L64, L246, L255, L259, L301, L312, L329, L393, L406, L521) and six uses of adapter(s) (L255, L316, L319, L322, L396, L398), including the mixed phrasing on L319 "Granite intrinsics ship as ALoRA adapters". My understanding is the Mellea/Granite framing has shifted toward adapters as the external-facing term (with intrinsic still in the module path for now). If that's right, it's worth a sweep to standardise: probably adapters everywhere in prose, with one parenthetical acknowledgement that the Python import path still uses `intrinsic`. The frontmatter tags would follow:

```
tags: ["granite", "rag", "adapters", "small-models", "docling", "local-llm"]
```

(I'll defer to you on whether adapters or intrinsics is the preferred term; the ask is consistency, not a specific choice.)

None of the above is a blocker; the core argument is strong.
planetf1
left a comment
as per comments (need evaluation - but your interpretation about what should be changed is fine)
