Hamilton 2.0: Feasibility-Ranked Spec #1645

Dev-iL · 2026-06-20T11:06:12Z

Context: https://lists.apache.org/thread/z00h5xqj9tj0n4kzfsgzjklbjq7d9twk

Readers are invited to comment on open questions, suggest ideas, create issues/PRs, etc.

Hamilton 2.0 — Feasibility-Ranked Spec

Restructured from Hamilton 2.0 for roadmap triage. Phase 1 reshaped every actionable idea into its own section; Phase 2 ranked each one against the current codebase. Comparison/notes (the Burr/Hamilton discussion, References) are preserved under Context, not ranked.

How to read the ranks

Scales are consistent across all ideas so the table sorts meaningfully.

Definition — 1 (vague one-liner) … 5 (precise, actionable).
Feasibility — 1 (fundamental blockers in today's architecture) … 5 (clear path within it).
Effort — S / M / L / XL rough implementation cost.
Benefit — 1 (niche) … 5 (broad, high user impact).
BC risk — low / med / high against the doc's hard "vanilla mode is 100% backwards compatible" requirement.

Effort/feasibility figures are grounded in code investigation (see each idea's Codebase touchpoints); benefit is a judgement call and the softest axis.

Guiding constraints (not ranked)

The doc's Requirements (backwards compatibility) section is a set of invariants every idea below must respect, not features to rank:

Vanilla mode is 100% backwards compatible — the function-def API and the Driver API must keep working unchanged. This is the lens for every BC-risk score below.
Auxiliary functions may change — more latitude here.
Hooks/adapters may be deprecated for more natural concepts but should still largely work.
Kill ugly auxiliary functions.

Grounding note: the codebase already runs a deprecation framework (e.g. raw_execute() is deprecated with fail_starting=(2,0,0) in driver.py), the public surface is the Builder/Driver pattern and the "blessed" decorators in function_modifiers/__init__.py, and FunctionGraph/Node are already marked internal. So "vanilla BC" is well-defined and enforceable, and ideas that stay additive (new builder methods, new adapters, new plugins) are inherently low-risk.

Summary (ordered: strongest near-term candidates first)

#	Idea	Def	Feas	Effort	Benefit	BC risk
TP12	Flexible visualization / rendering	4	5	M	4	low
TP3	Separate graph structure from driver (`with_graph`)	4	5	S–M	4	low
TP10	Rework materializer definition	4	5	S–M	3	low–med
TP6	Cache-native (basics)	3	5	M	4	low
TP2	Alternate execution mode (graceful fallback)	3	4	M	4	low
TP7	Dynamic parallelism-native	3	4	M–L	4	low–med
TP13	Remove module assumptions	3	4	M	3	low
TP5	Tracking / persistence first-class	3	4	L	5	low
TP4	Driver chaining	2	4	M	3	low
W1	Alternative procedural API	2	4	M	3	low
TP1	Async native	3	3	XL	5	high
TP11	More flexible dependencies	2	3	M–L	3	med
TP9	Zero lib dependencies	4	3	L	3	high†
TP8	Fewer, more powerful nodes + metadata	2	3	L	3	med
W4	IDE kernel	1	4	L	3	low
W7	Makefile integration	2	4	M	2	low
W2	Compile-mode (decorators emit code)	2	3	L	2	low
W3	Auto-optimizing parallelism	1	3	L	2	low
W6	Graph compilation / lift-shift	2	2	XL	3	med
W5	Rust integration	1	1	XL	2	high

Bands: TP12, TP3, TP10, TP6 are quick, low-risk wins. TP2, TP7, TP13, TP5 are high-value mid-size bets. TP1 (async) is the highest-cost item, ties with TP5 for the top benefit score (5), and, apart from TP9's end-state risk (†), is the only top-priority idea rated high on BC risk. W5/W6 are research-grade.

† TP9 carries high end-state BC risk (2.0 removal), but apache/hamilton#918 breaks it into a low-risk, shippable-now 1.9x prep rung + a deletion-only 2.0 rung — see the idea section. Its prep rung belongs with the near-term wins.

Dependencies & work-sharing groups

The ideas are not independent — several share a foundation or unlock each other. Grouping them this way changes sequencing: do the foundation once, then the dependents get cheaper. A → B means B depends on (or is much cheaper after) A; A ⇄ B means they share substantial implementation.

graph LR
  subgraph G1[Graph decoupling]
    TP3[TP3 with_graph] --- TP13[TP13 no module assumptions]
    TP3 --> W1[W1 procedural API]
    TP13 --> W1
  end
  subgraph G2[Execution engine]
    TP7[TP7 dynamic parallelism] --> TP1[TP1 async native]
    TP2[TP2 exec modes] --> TP1
    TP7 --> W3[W3 auto-opt parallelism]
  end
  subgraph G3[State, caching, tracking]
    ES[(execution-state persistence)] --> TP6c[TP6 checkpointing]
    ES --> TP2
    TP5[TP5 tracking] --> TP4[TP4 driver chaining]
  end
  subgraph G4[Failure handling]
    TP2 --- TP11b[TP11 fallback deps]
  end
  subgraph G5[Rendering]
    TP12[TP12 pluggable viz] --> TP6v[TP6 cache-aware viz]
  end
  subgraph G6[Code generation]
    W2[W2 compile-mode] --> W6[W6 lift/shift]
  end
  TP10[TP10 materializer rework] --> TP9[TP9 zero deps]
  TP10 --> W7[W7 makefile/CLI]

G1 — Graph decoupling (foundation). TP3 (graph as a passable object) and TP13 (drop the module assumption) are two halves of one change — both target find_functions/create_function_graph/Driver.__init__ and the node layer is already module-agnostic. Do them together. W1 (procedural API) is a thin imperative front-end that only makes sense once the graph is decoupled. Sequence: TP3+TP13 → W1.

G2 — Execution engine. TP1 (async native) is the deep one: it reworks the lifecycle adapter set and the task executor. TP7 (dynamic-parallelism hardening) and TP2 (execution modes) both touch the same executor/traversal code, and TP1's goal of "async + dynamic parallelism in one model" effectively wants TP7 done first. W3 (auto-optimizing) needs a runtime-metrics layer that doesn't exist yet and rides on the executor abstraction. Sequence: TP7 → TP2 → TP1; W3 last and only if metrics land.

G3 — State, caching, tracking. A single missing primitive — execution-state persistence — underlies both TP6's checkpointing/resume half and TP2's "keep trying". Build it once. Separately, TP5 (tracking) is the substrate for TP4's "auto-tracked as groups / recursive tracking" — TP4's tracking half is largely a TP5 feature. Sequence: TP5 → TP4 (tracking half); execution-state primitive → TP6-checkpointing + TP2.

G4 — Failure handling (shared mechanism). TP2's graceful-fallback mode and TP11's fallback-on-failure dependencies are the same underlying capability (let execution continue past a node failure with a substitute value). Building TP11(b) independently of TP2 would duplicate it — unify them.

G5 — Rendering. TP12 (pluggable renderer over the public HamiltonGraph) is the clean boundary; TP6's "visualize cache state" is then just another thing a renderer annotates. Do TP12 first so cache-aware viz plugs in rather than hard-coding against graphviz.

G6 — Code generation. W2 (decorators emit code) and W6 (lift/shift to other frameworks) share the same hard problem — lowering the decorator graph to target code. W6's per-target compilers could consume W2's per-decorator code-gen. Both are research-grade; if either is pursued, W2 is the more general primitive.

Cross-cutting: materializers. TP10 (materializer rework) cleans the saver/loader base that TP9 (zero-deps, dispatch-based materializers) and W7 (CLI materializer runner) both build on. Doing TP10 first means the other two extend a clean contract instead of the dataclass one.

Cross-cutting: deprecation mechanism (reusable BC infra). The 1.9x prep rung of TP9 (apache/hamilton#918) introduces a Django-style RemovedInHamilton2Warning + version-named warn_deprecated(...) helper + a grep-driven CI "expiry gate" (a test that fails the suite once VERSION >= (2,0) while any flagged usage remains), plus the PEP 562 __getattr__/__dir__ shim pattern for relocating public names without breaking imports. This is roadmap-wide infrastructure, not TP9-specific: every idea here that changes a public shape behind a deprecation — TP1 (async/execute signature), TP6 (caching-on-by-default or DictResult default flip), TP8 (validators-as-hooks replacing @check_output nodes), TP11 (OPTIONAL semantics change), TP12 (graphviz-specific kwargs) — should route its deprecation through this single mechanism rather than hand-rolling warnings.warn. Build it once (it's small, and TP9's prep rung is its first consumer); it converts "high BC risk" items into pre-warned, CI-enforced, grep-removable changes at 2.0.
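
The warning class, helper and expiry gate described above could look roughly like the following. This is a minimal sketch, not the #918 implementation: the names RemovedInHamilton2Warning and warn_deprecated come from the text, but the signature of warn_deprecated, the FLAG pattern and the expired_usages helper (which takes an in-memory path-to-source mapping instead of reading files) are assumptions made for illustration.

```python
import re
import warnings


class RemovedInHamilton2Warning(DeprecationWarning):
    """Marks usage that is scheduled for removal in Hamilton 2.0."""


def warn_deprecated(name, replacement=None, stacklevel=2):
    """Emit the 2.0-removal warning, attributed to the caller of the deprecated API."""
    message = f"{name} is deprecated and will be removed in Hamilton 2.0."
    if replacement:
        message += f" Use {replacement} instead."
    # +1 skips this helper's own frame, so the warning points at user code.
    warnings.warn(message, RemovedInHamilton2Warning, stacklevel=stacklevel + 1)


# Hypothetical pattern the grep-driven gate searches for.
FLAG = re.compile(r"\bwarn_deprecated\(|RemovedInHamilton2Warning")


def expired_usages(version, sources):
    """Return the paths that still carry flagged usage once version >= (2, 0).

    `sources` maps a path to its source text; a real gate would read the files.
    """
    if version < (2, 0):
        return []
    return sorted(path for path, text in sources.items() if FLAG.search(text))


def old_api():
    """Example deprecated function (hypothetical)."""
    warn_deprecated("old_api()", replacement="new_api()")
    return 1
```

A CI test would then assert that expired_usages(VERSION, ...) is empty, which passes trivially on the 1.x line and starts failing at 2.0 until every flagged call site is deleted.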

Standalone (no significant dependencies): TP12 (modulo G5), TP8 (node metadata), W4 (IDE kernel), W5 (Rust). These can be scheduled independently.

A natural first wave that respects these groups: TP10 + TP12 + (TP3+TP13) + the TP9 1.9x prep rung. All are low-risk, and each is either standalone or a foundation others build on — TP9's prep rung in particular ships the reusable deprecation mechanism, has an active work plan (#918), and lands entirely non-breaking on the 1.x line.
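
The sequencing above can be sanity-checked mechanically. The sketch below feeds the directed A → B edges from the diagram into Python's standard graphlib; the undirected "shares implementation" links are left out, the labels ES, TP6c and TP6v mirror the diagram, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# "A -> B" edges from the diagram: B depends on (or is much cheaper after) A.
EDGES = [
    ("TP3", "W1"), ("TP13", "W1"),
    ("TP7", "TP1"), ("TP2", "TP1"), ("TP7", "W3"),
    ("ES", "TP6c"), ("ES", "TP2"), ("TP5", "TP4"),
    ("TP12", "TP6v"), ("W2", "W6"),
    ("TP10", "TP9"), ("TP10", "W7"),
]


def prerequisites(edges):
    """Map each idea to the set of ideas that should land before it."""
    graph = {}
    for before, after in edges:
        graph.setdefault(before, set())
        graph.setdefault(after, set()).add(before)
    return graph


def first_wave(edges):
    """Ideas with no prerequisites at all."""
    sorter = TopologicalSorter(prerequisites(edges))
    sorter.prepare()  # raises graphlib.CycleError if the edges form a cycle
    return set(sorter.get_ready())


def full_order(edges):
    """One valid end-to-end ordering of all ideas."""
    return list(TopologicalSorter(prerequisites(edges)).static_order())
```

Under these edges, first_wave(EDGES) contains TP10, TP12, TP3 and TP13, which matches the proposed first wave; the TP9 prep rung is not modelled separately from TP9 in this sketch.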

Top Priorities

TP1 — Async native

Summary. Make async a first-class execution mode, exposing both async def execute(...) and def execute(...). (Source: Top Priorities §1.)

Mental model. A user writes async def node functions and gets the same Hamilton experience as sync — one Driver, one execution model, the runtime figures out awaiting. Today async feels like a separate product (AsyncDriver); the goal is for "async" to be a property of the run, not a different class.

Assumptions. "Native" means a unified execution layer where sync and async nodes coexist under one driver, not just polishing AsyncDriver. Assuming async should also work with dynamic (task-based) parallel execution, which it currently doesn't.

Codebase touchpoints. async_driver.py — AsyncDriver/AsyncGraphAdapter are a parallel path; AsyncDriver explicitly forbids enable_dynamic_execution(). Sync path is driver.py Driver.execute() → DefaultGraphExecutor/TaskBasedGraphExecutor. Lifecycle hooks split sync/async at registration (LifecycleAdapterSet, call_all_lifecycle_hooks_sync vs _async). Traversal is blocking DFS in execution/graph_functions.py.

Ranks.

Def	Feas	Effort	Benefit	BC risk
3	3	XL	5	high

Feasibility is mid: async works today but as a bolt-on; true unification means reworking LifecycleAdapterSet to auto-detect hook async-ness and making the task executor async-aware — deep changes. BC risk is high only if the unification changes Driver.execute()'s return type; keeping a separate entry point or returning an awaitable that also runs under asyncio.run() is the lever to keep it low.

Open questions. Does driver.execute() stay sync and async live behind a flag/awaitable, or does the signature change? Must async support dynamic parallelism in v1, or is that a later phase? Is mixed sync/async-in-one-graph a requirement?
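
To make the "one execution model" idea concrete, here is a toy executor in which sync and async node functions coexist and the runtime decides what to await. It does not use Hamilton's Driver or AsyncDriver; aexecute, execute and the example nodes are hypothetical, and dependencies are simply read from parameter names.

```python
import asyncio
import inspect


async def aexecute(funcs, final_vars, inputs):
    """Resolve final_vars, awaiting async nodes and calling sync nodes directly."""
    results = dict(inputs)

    async def resolve(name):
        if name not in results:
            fn = funcs[name]
            kwargs = {}
            for param in inspect.signature(fn).parameters:
                kwargs[param] = await resolve(param)
            value = fn(**kwargs)
            if inspect.isawaitable(value):  # an async def node returned a coroutine
                value = await value
            results[name] = value
        return results[name]

    out = {}
    for name in final_vars:
        out[name] = await resolve(name)
    return out


def execute(funcs, final_vars, inputs):
    """Sync entry point over the same engine (not callable from a running loop)."""
    return asyncio.run(aexecute(funcs, final_vars, inputs))


def doubled(x):
    return x * 2


async def plus_one(doubled):
    await asyncio.sleep(0)  # stands in for real async I/O
    return doubled + 1


FUNCS = {"doubled": doubled, "plus_one": plus_one}
```

Here execute(FUNCS, ["plus_one"], {"x": 2}) and await aexecute(...) share one code path, which is the property the open questions about the execute() signature are really about.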

TP2 — Alternate execution mode

Summary. Optional execution modes for both sync and async: (a) keep-trying / graceful fallback on error (today's graceful behavior), (b) full parallelism with BFS. (Source: Top Priorities §2.)

Mental model. The user picks an execution strategy the way they pick an executor — "run everything you can even if some nodes fail, give me partial results" vs "fail fast." BFS framing: release work in dependency waves rather than greedily.

Assumptions. (a) is the higher-value, better-defined half and is scored as the primary deliverable; (b) "full parallelism with BFS" is speculative (the doc itself appends "(?)") and may not improve on today's greedy-ready scheduling. Treating them as separable.

Codebase touchpoints. GracefulErrorAdapter already exists (lifecycle/default.py, exercised by test_parallel_graceful.py). The blocker for "keep trying" is that errors raise immediately in run_graph_to_completion() (execution/executors.py). Node depth is already computed (get_node_levels() in execution/graph_functions.py) and the task queue is a plain deque (execution/state.py), so a level-based release is not blocked architecturally.

Ranks.

Def	Feas	Effort	Benefit	BC risk
3	4	M	4	low

Both are opt-in flags → low BC risk. Graceful fallback reuses existing adapter infra (the gap is just letting the adapter decide whether to continue before the raise). BFS is the more speculative, lower-confidence half.
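
A toy version of the keep-trying mode helps pin down the semantics. The sketch below is independent of GracefulErrorAdapter and of Hamilton's executors: FAILED is a hypothetical sentinel, and a node is skipped (and marked FAILED) whenever any of its inputs failed.

```python
import inspect

FAILED = object()  # hypothetical sentinel standing in for a failed node's value


def execute_gracefully(funcs, final_vars, inputs):
    """Run everything that can run; return (results, errors) instead of raising."""
    results = dict(inputs)
    errors = {}

    def resolve(name):
        if name in results:
            return results[name]
        fn = funcs[name]
        kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
        if any(value is FAILED for value in kwargs.values()):
            results[name] = FAILED  # upstream failed, so do not call this node
        else:
            try:
                results[name] = fn(**kwargs)
            except Exception as exc:  # record the error and keep going
                errors[name] = exc
                results[name] = FAILED
        return results[name]

    return {name: resolve(name) for name in final_vars}, errors


def a():
    return 1


def b(a):
    raise ValueError("boom")


def c(b):
    return b + 1


def d(a):
    return a + 1


FUNCS = {"a": a, "b": b, "c": c, "d": d}
```

Requesting ["c", "d"] yields a usable value for d while c is reported as FAILED, and only b appears in the errors mapping.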

Open questions. Is "keep trying" = inject sentinel/None for failed nodes and continue (current GracefulErrorAdapter behavior) or full retry-with-backoff? What concrete win does BFS give over the current greedy-ready scheduler — is (b) worth doing at all?
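
For the BFS question, the "release work in dependency waves" framing can be sketched as grouping nodes by depth. This is an illustration only: dependency_waves is a hypothetical helper, not Hamilton's get_node_levels(), and it takes a plain node-to-upstream mapping.

```python
def dependency_waves(deps):
    """Group nodes into waves; `deps` maps each node to its upstream nodes.

    Wave 0 has no upstream nodes; wave n holds nodes whose deepest
    upstream node sits in wave n - 1.
    """
    level = {}

    def depth(node):
        if node not in level:
            upstream = deps.get(node, ())
            level[node] = 1 + max((depth(u) for u in upstream), default=-1)
        return level[node]

    for node in deps:
        depth(node)

    waves = {}
    for node, lvl in level.items():
        waves.setdefault(lvl, []).append(node)
    return [sorted(waves[lvl]) for lvl in sorted(waves)]


DEPS = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
```

A wave-based scheduler would submit each inner list as one batch; whether that beats the greedy-ready scheduler is exactly what (b) leaves open, since a greedy scheduler can start d's siblings as soon as their own inputs finish rather than waiting for the whole wave.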

TP3 — Separate graph structure from driver

Summary. Let users specify the graph as a standalone object and pass it to the driver, instead of modules + config; a with_graph(...) builder method. (Source: Top Priorities §3.)

Mental model. The graph becomes a first-class value you can build, inspect, pass around, and reuse — the Driver becomes an executor of a graph rather than the thing that constructs it from modules.

Assumptions. The standalone object is still built from the same node-resolution pipeline (decorators etc.), just decoupled from Driver.__init__. Not assuming a brand-new graph-authoring DSL.

Codebase touchpoints. FunctionGraph.__init__ already takes a nodes dict, not modules (graph.py); create_function_graph()/from_modules() are the only module-coupled entry. Driver.__init__ builds and stores the graph today (driver.py), and Builder accumulates modules/config. Coupling is shallow — graph_modules is mainly used for serialization and validation hooks.

Ranks.

Def	Feas	Effort	Benefit	BC risk
4	5	S–M	4	low

The graph layer is already module-agnostic internally, so this is mostly surfacing a public GraphBuilder + a Builder.with_graph() path and making graph_modules optional. Additive → low BC risk. Tightly related to TP13.

Open questions. Does serialization need to support module-less graphs from day one (currently relies on module importability)? What's the public type users hold — HamiltonGraph (public, read-oriented) or a new builder output?
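
The shape of the proposal, a graph value that is built once and handed to one or more drivers, can be illustrated with toy stand-ins. ToyGraph, ToyDriver and ToyBuilder below are hypothetical and deliberately not Hamilton's FunctionGraph, Driver or Builder; only the with_graph name is taken from the text.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class ToyGraph:
    """Standalone graph value: node name -> function."""
    nodes: Mapping[str, Callable[..., Any]]

    @classmethod
    def from_functions(cls, *functions):
        return cls({fn.__name__: fn for fn in functions})

    def upstream(self, name):
        # Dependencies are the parameter names of the node's function.
        return list(inspect.signature(self.nodes[name]).parameters)


class ToyDriver:
    """Executes a graph it was given; it does not construct one."""

    def __init__(self, graph):
        self.graph = graph

    def execute(self, final_vars, inputs):
        results = dict(inputs)

        def resolve(name):
            if name not in results:
                kwargs = {dep: resolve(dep) for dep in self.graph.upstream(name)}
                results[name] = self.graph.nodes[name](**kwargs)
            return results[name]

        return {name: resolve(name) for name in final_vars}


class ToyBuilder:
    def __init__(self):
        self._graph = None

    def with_graph(self, graph):
        self._graph = graph
        return self

    def build(self):
        if self._graph is None:
            raise ValueError("call with_graph(...) first")
        return ToyDriver(self._graph)


def spend_per_signup(spend, signups):
    return spend / signups


graph = ToyGraph.from_functions(spend_per_signup)
driver_a = ToyBuilder().with_graph(graph).build()
driver_b = ToyBuilder().with_graph(graph).build()  # same graph, second driver
```

The point of the sketch is the reuse on the last two lines: the graph is inspectable (graph.upstream(...)) and shareable before any driver exists.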

TP4 — Driver chaining

Summary. Chain drivers together, possibly via graphs; auto-track chained runs as groups; recursive tracking (like Burr's subdag); materializers that generate multiple nodes (e.g. a dlt pipeline). (Source: Top Priorities §4.)

Mental model. Compose pipelines like functions — one driver's output feeds another, and the tracking UI shows the composition as nested groups rather than disconnected runs.

Assumptions. This is a cluster of loosely-related sub-ideas; the doc hedges ("Maybe with graphs?", "Ties into tracking (?)"). Scoring the core "compose drivers / sub-pipelines as tracked groups" intent. The "materializers generate multiple nodes" sub-point is really a separate capability.

Codebase touchpoints. @subdag/@parameterized_subdag (function_modifiers/recursive.py) already embed one graph into another — this is a form of chaining. Task grouping infra exists (NodeGroupPurpose in execution/grouping.py) but group→logical-subdag tracking metadata is thin. Materializers can already inject nodes (io/materialization.py).

Ranks.

Def	Feas	Effort	Benefit	BC risk
2	4	M	3	low

Feasible because much of the machinery (subdag, grouping, materializer node injection) exists; held back mostly by being under-defined. Depends on TP5 for the "tracked as groups" half to mean anything.

Open questions. What does "chaining" add over @subdag — a runtime API to wire separate Driver instances, or richer tracking of existing composition? Is recursive tracking a TP5 (tracking) feature in disguise?

TP5 — Tracking / persistence first-class

Summary. First-class local tracking: with_tracker(...) writing locally; a lightweight Burr-style UI launched from the filesystem; seamless filesystem → postgres transition; tracking across jobs; tracking decoupled from nodes. (Source: Top Priorities §5.)

Mental model. pip install, add with_tracker(), run hamilton ui, and get a local dashboard of your runs with zero infra — graduating to the full postgres-backed UI only when you outgrow local files. Tracking is a property you switch on, not a server you stand up.

Assumptions. "Lightweight UI like Burr" = a local, single-user, read-from-filesystem server, not the full Docker/Django/Postgres stack. Assuming the existing remote SDK tracker is the reference for the data captured.

Codebase touchpoints. Lifecycle hook system is mature (lifecycle/base.py, lifecycle/api.py) and multiple adapters already coexist (MLflow, OpenLineage, the UI SDK HamiltonTracker). A hamilton ui CLI command exists (launches the full UI via Docker) and a start_mini_mode.sh partial exists under ui/. The caching stores (caching/stores/ — file + sqlite) are a ready template for a local tracking store. No lightweight local filesystem tracker exists today — that's the gap.

Ranks.

Def	Feas	Effort	Benefit	BC risk
3	4	L	5	low

Tied with TP1 for the highest benefit on the board (it removes the biggest local-dev friction) and low risk (opt-in adapter + new CLI subcommand). Effort is L because the real work is the lightweight read-only UI, not the tracker adapter. The doc itself flags cross-job tracking as later (1.x).

Open questions. Build the mini-UI fresh (Flask/FastAPI over a sqlite/JSONL store) or trim the existing React/Django UI? What's the on-disk schema, and does the fs→postgres migration need to be lossless/automatic for v1?
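
As a starting point for the on-disk schema question, a JSONL-per-run layout is about the simplest thing a local tracker could write and a mini-UI could read. The class, method names and record fields below are all hypothetical; they are not the HamiltonTracker SDK or any existing lifecycle hook signature.

```python
import json
import tempfile
import time
from pathlib import Path


class LocalJsonlTracker:
    """Appends one JSON line per event to <directory>/<run_id>.jsonl."""

    def __init__(self, directory, run_id):
        self.path = Path(directory) / f"{run_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def _write(self, event, **fields):
        record = {"event": event, "ts": time.time(), **fields}
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")

    def run_started(self, final_vars):
        self._write("run_started", final_vars=list(final_vars))

    def node_finished(self, node_name, success, duration_s):
        self._write("node_finished", node=node_name, success=success,
                    duration_s=duration_s)

    def run_finished(self, success):
        self._write("run_finished", success=success)


def read_run(path):
    """What a read-only local UI would do: load one run's events in order."""
    text = Path(path).read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines()]


def demo():
    """Write a tiny run into a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        tracker = LocalJsonlTracker(tmp, "run-1")
        tracker.run_started(["b"])
        tracker.node_finished("a", True, 0.01)
        tracker.node_finished("b", True, 0.02)
        tracker.run_finished(True)
        return read_run(tracker.path)
```

An append-only file per run keeps the writer trivial and crash-tolerant; a sqlite store, as used by the caching stores, would be the natural alternative if the UI needs queries across runs.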

TP6 — Cache-native

Summary. Caching built into execution: fingerprinting + caching with multiple modes; visualize_execution shows cache state; built-in checkpointing similar to Burr's persister. (Source: Top Priorities §6.)

Mental model. Caching is on by default and visible — you see which nodes were hits/misses in the DAG render, and you can resume a failed run from where it stopped.

Assumptions. Splitting into "basics" (modes + cache-aware visualization), which is near-done, and "checkpointing/resume", which is a distinct, larger capability. The summary table scores basics.

Codebase touchpoints. Caching is mature: HamiltonCacheAdapter (caching/adapter.py), singledispatch fingerprinting on xxh3_128 (caching/fingerprinting.py — note this branch just standardized on xxhash and vectorized DataFrame hashing), behaviors enum (DEFAULT/RECOMPUTE/DISABLE/IGNORE), swappable result/metadata stores (caching/stores/), and Builder.with_cache(). Gaps: visualize_execution() (driver.py) doesn't render cache state, and no resume/checkpoint logic exists — caching is node-output level, not execution-state level.

Ranks.

Def	Feas	Effort	Benefit	BC risk
3	5	M	4	low

Basics are highly feasible because the hard part (fingerprinting + stores) is built and recently improved. Checkpointing/resume is a separate L item with medium BC risk (touches the execution engine, assumes pure functions). Don't conflate the two in planning.

Open questions. Is "cache-native" mostly the visualization + modes polish, or does it require Burr-style resume? Should caching become on-by-default (a behavior change) or stay opt-in?
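
The node-output-level caching described above boils down to a key built from the node's code and the fingerprints of its inputs. The sketch below shows that idea with stdlib hashlib swapped in for xxhash and a plain dict for the store; CachingRunner and its hit/miss log are hypothetical, and hashing co_code alone ignores constants and closures, which a real code-version check would have to cover.

```python
import hashlib
from functools import singledispatch


@singledispatch
def fingerprint(value):
    """Fallback: hash the repr (only sound for values with a stable repr)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()


@fingerprint.register
def _(value: dict):
    # Order-independent: sort by key repr, fingerprint values recursively.
    items = sorted((repr(k), fingerprint(v)) for k, v in value.items())
    return hashlib.sha256(repr(items).encode()).hexdigest()


class CachingRunner:
    def __init__(self):
        self.store = {}  # cache key -> result
        self.log = []    # (node name, "hit" | "miss"), e.g. for a cache-aware render

    def run(self, fn, **kwargs):
        code_version = hashlib.sha256(fn.__code__.co_code).hexdigest()
        inputs = tuple(sorted((k, fingerprint(v)) for k, v in kwargs.items()))
        key = (fn.__name__, code_version, inputs)
        if key in self.store:
            self.log.append((fn.__name__, "hit"))
            return self.store[key]
        result = fn(**kwargs)
        self.store[key] = result
        self.log.append((fn.__name__, "miss"))
        return result


def total(prices):
    return sum(prices.values())
```

The log is the piece that visualize_execution() would need: once hits and misses are recorded per node, colouring them in a render is a presentation concern rather than a caching one.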

TP7 — Dynamic parallelism-native

Summary. Make dynamic parallelism (Parallelizable/Collect) robust and well-tested — "no more bugs," clean traversal, "just works" for running a subdag in parallel. (Source: Top Priorities §7.)

Mental model. Users already like this feature; the ask is reliability, not new surface — fan-out/fan-in should be a represented, tested first-class construct rather than something with known edge cases.

Assumptions. This is primarily a hardening/refactor of existing behavior, not a new API. "Run a subdag in parallel" implies possibly nested parallelism, which is currently undefined.

Codebase touchpoints. Parallelizable/Collect are type hints (htypes.py); grouping is GroupByRepeatableBlocks (execution/grouping.py) with explicit known hacks: a TODO for conflicting-group error messages, one-expander-per-collector assumptions, generator→list force-conversion flagged "we will likely remove this" (execution/executors.py), and string-index parameterization with possible collisions (execution/state.py). Tests exist (test_node_grouping.py).

Ranks.

Def	Feas	Effort	Benefit	BC risk
3	4	M–L	4	low–med

Feasible and isolated (the messy code is contained in grouping.py/state.py); high benefit because it removes a known sharp edge users hit. BC risk is low for behavior but medium for tests asserting current exception text. Nested parallelism (recursive task planning) is the part that pushes effort toward L.

Open questions. Is nested parallelism in scope, or just hardening the flat case? Should generators become first-class (streaming) or is force-to-list acceptable? Which specific reported bugs define "done"?
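
For readers less familiar with the construct, the flat fan-out/fan-in case reduces to three steps: expand, run each branch, collect. The sketch below shows that shape with a thread pool; it is not Hamilton's Parallelizable/Collect machinery, and run_fan_out plus the example functions are hypothetical. Note that it also forces the generator to a list, the same compromise the current executor makes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fan_out(expand, per_item, collect, max_workers=4):
    """Expand into items, process each item in parallel, then collect."""
    items = list(expand())  # forces the generator to a list before scheduling
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        branch_results = list(pool.map(per_item, items))  # map keeps input order
    return collect(branch_results)


def urls():
    yield from ("a", "bb", "ccc")


def length(url):
    return len(url)


def total(lengths):
    return sum(lengths)
```

Nested parallelism would mean per_item itself calling run_fan_out, which is where the flat sketch stops being sufficient and recursive task planning starts.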

TP8 — Fewer, more powerful nodes with metadata

Summary. Design philosophy shift toward fewer, higher-value nodes: assets→nodes closer to 1:1 with functions; rich metadata attached to nodes (e.g. a data adapter is one tagged node carrying its metadata) accessible from hooks, with one node able to write metadata a later one reads; data validators as post-node hooks that fail and/or write metadata. (Source: Top Priorities §8.)

Mental model. A node is a meaningful asset carrying its own structured metadata, and hooks form a metadata side-channel between nodes — rather than spawning extra helper/validator nodes into the DAG.

Assumptions. This is partly philosophy (node granularity) and partly concrete mechanism (writable, propagating node metadata + validators-as-hooks). Scoring the concrete mechanism, since the philosophy alone isn't implementable.

Codebase touchpoints. Nodes carry a flat dict[str,str] _tags (node.py); @tag/matches_query exist (function_modifiers/metadata.py). @check_output validators are currently separate DAG nodes (function_modifiers/validation.py, data_quality/). Lifecycle post_node_execute hooks see the result but are read-only with no write-back/propagation channel (lifecycle/base.py); nodes are effectively immutable post-construction (copy_with).

Ranks.

Def	Feas	Effort	Benefit	BC risk
2	3	L	3	med

Held back by definition (it's a philosophy with example mechanisms) and by two real architectural frictions: node immutability and read-only hooks. Moving validators from nodes to hook-metadata is breaking for anyone whose code expects validator nodes in the DAG → medium BC risk.

Open questions. Concretely, what is "node metadata accessible/communicable between nodes" — a typed metadata bus, or richer tags? Do validators-as-hooks replace @check_output nodes (breaking) or live alongside? What's the migration path?
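
One possible reading of "hooks as a metadata side-channel" is a shared, writable metadata mapping passed to every post-node hook. The sketch below assumes exactly that; the hook signature (name, result, metadata), the runner and the example hooks are hypothetical and do not match the current post_node_execute hook, which is read-only.

```python
import inspect


def run_with_post_hooks(funcs, order, inputs, hooks):
    """Run nodes in `order`; after each node, let hooks write metadata or raise."""
    results, metadata = dict(inputs), {}
    for name in order:
        fn = funcs[name]
        kwargs = {p: results[p] for p in inspect.signature(fn).parameters}
        results[name] = fn(**kwargs)
        for hook in hooks:
            hook(name, results[name], metadata)
    return results, metadata


def row_count_hook(name, result, metadata):
    """Writes metadata for every list-valued node."""
    if isinstance(result, list):
        metadata.setdefault(name, {})["row_count"] = len(result)


def kept_fraction_hook(name, result, metadata):
    """Validator-as-hook: reads metadata written for an earlier node."""
    if name == "filtered":
        before = metadata["raw"]["row_count"]
        fraction = len(result) / before
        metadata[name]["kept_fraction"] = fraction
        if fraction == 0:
            raise ValueError("filter removed every row")


def raw():
    return [1, 2, 3, 4]


def filtered(raw):
    return [x for x in raw if x % 2 == 0]


FUNCS = {"raw": raw, "filtered": filtered}
HOOKS = [row_count_hook, kept_fraction_hook]
```

No validator node appears in the DAG here, which is the behaviour change that makes the idea medium BC risk for anyone who currently inspects validator nodes.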

TP9 — Zero lib dependencies

Summary. No hard pandas/numpy/etc. dependency; use "databackends + single dispatch" to implement type-specific features (materializers, schema tracking, SDK metadata, hashing, serialization, caching). (Source: Top Priorities §9.)

Mental model. Core Hamilton is a tiny pure-Python orchestration library; pandas/polars/etc. support arrives via optional extras and dispatches on type without the core ever importing them.

Assumptions. "Zero" means pandas/numpy move from hard deps to optional extras, not literally no dependencies. The single-dispatch + abstract-backend approach is taken as the intended mechanism (it already partly exists).

Codebase touchpoints. pandas + numpy are hard deps today (pyproject.toml). Single-dispatch already used in registry.py and fingerprinting; abstract backends that work without importing the libs already exist and are used by caching and schema (experimental/h_databackends.py, plugins/h_schema.py); narwhals integration exists (plugins/h_narwhals.py). The pattern is proven but applied unevenly; many plugins/examples assume pandas present.

Concrete plan exists — apache/hamilton#918. There is an agreed two-rung ladder, which raises this idea's definition from "direction" to "sequenced plan":

1.9x prep rung (non-breaking, shippable now). pandas/numpy stay hard deps; nothing observable breaks. Work: relocate the pandas/numpy-coupled classes to the plugins namespace (PandasDataFrameResult, StrictIndexTypePandasDataFrameResult, SimplePythonDataFrameGraphAdapter → plugins/h_pandas.py; NumpyMatrixResult → new plugins/h_numpy.py) behind PEP 562 __getattr__/__dir__ shims in base.py; re-parent SimplePythonGraphAdapter/DefaultAdapter off the pandas base; split pandas validators out of data_quality/default_validators.py (mirroring the existing pandera conditional-registration pattern); warn on the implicit pandas-DataFrame default in Driver/AsyncDriver; and add a Django-style RemovedInHamilton2Warning + version-gated CI mechanism (see cross-cutting note in Dependencies & groups).
2.0 removal rung (the actual break). Drop pandas/numpy from core deps, delete the shims, flip the result-builder default to DictResult, add a minimal-install CI job. Because the prep rung already moved everything and pre-warned every call site, this becomes a near-pure deletion changeset.
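
The PEP 562 shim used in the prep rung can be demonstrated without touching Hamilton. In the sketch below two in-memory modules stand in for base.py and plugins/h_pandas.py, and PandasDataFrameResult is an empty placeholder class; in a real module the two functions would simply be written as top-level `def __getattr__(name)` and `def __dir__()`.

```python
import types
import warnings


class RemovedInHamilton2Warning(DeprecationWarning):
    """Marks usage that is scheduled for removal in Hamilton 2.0."""


# Stand-in for the new home of the relocated class.
new_home = types.ModuleType("toy_plugins_h_pandas")


class PandasDataFrameResult:
    """Empty placeholder for the relocated class."""


new_home.PandasDataFrameResult = PandasDataFrameResult

# Stand-in for the old module that must keep old imports working.
old_home = types.ModuleType("toy_base")
_MOVED = {"PandasDataFrameResult": new_home}


def _module_getattr(name):
    """Called only when normal attribute lookup on the old module fails."""
    if name in _MOVED:
        target = _MOVED[name]
        warnings.warn(
            f"{old_home.__name__}.{name} moved to {target.__name__}.{name}",
            RemovedInHamilton2Warning,
            stacklevel=2,  # point at the code doing the old-style access
        )
        return getattr(target, name)
    raise AttributeError(f"module {old_home.__name__!r} has no attribute {name!r}")


def _module_dir():
    """Keep relocated names discoverable via dir() and tab completion."""
    return sorted(set(old_home.__dict__) | set(_MOVED))


# PEP 562: module-level __getattr__ / __dir__ live in the module's namespace.
old_home.__getattr__ = _module_getattr
old_home.__dir__ = _module_dir
```

Deleting the shim at 2.0 then amounts to removing the _MOVED entries and the two functions, which is what makes the removal rung a near-pure deletion.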

Ranks.

Def	Feas	Effort	Benefit	BC risk
4	3	L	3	high†

The mechanism is validated (abstract backends + dispatch) and now has a concrete migration plan, so definition rises to 4. Effort stays L overall (the 1.9x relocation/shim work is the bulk; 2.0 deletion is small). †BC risk is high only at the 2.0 end-state (users who never migrate imports break, and the pip install apache-hamilton → pandas expectation changes); the #918 ladder makes the 1.9x prep rung low-risk and landable today, which is where the near-term value is.

Open questions. ~~Phased vs hard-cut~~ — resolved by #918 (phased, 1.9x prep → 2.0 removal). Remaining: does pip install apache-hamilton keep a pandas-bearing default extra at 2.0, or go truly minimal? Is the lighter-install benefit worth it for a userbase that mostly uses pandas anyway (benefit may be lower than 3 if so)?
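
The "abstract backend + single dispatch" mechanism works because an abstract class can claim a third-party type by name, looking only at modules that are already imported. The sketch below is written in that spirit but is not the code in experimental/h_databackends.py; AbstractBackend, AbstractFrame, describe and the toyframes library are all hypothetical.

```python
import sys
import types
from abc import ABC
from functools import singledispatch


class AbstractBackend(ABC):
    """Matches classes from optional libraries by (module, class) name."""
    _backends = ()

    @classmethod
    def __subclasshook__(cls, subclass):
        for module_name, class_name in cls._backends:
            module = sys.modules.get(module_name)  # only look, never import
            target = getattr(module, class_name, None) if module else None
            if target is not None and issubclass(subclass, target):
                return True
        return NotImplemented


class AbstractFrame(AbstractBackend):
    _backends = (("toyframes", "Frame"),)  # hypothetical optional library


@singledispatch
def describe(value):
    return f"object:{type(value).__name__}"


@describe.register(AbstractFrame)
def _(value):
    return f"frame with {len(value.rows)} rows"


# Stand-in for an optional library that the user happens to have imported.
toyframes = types.ModuleType("toyframes")


class Frame:
    def __init__(self, rows):
        self.rows = rows


toyframes.Frame = Frame
sys.modules["toyframes"] = toyframes
```

The core module never contains an `import toyframes` statement, yet describe(Frame([...])) reaches the frame-specific implementation; if the library is not imported, no value of that type can exist and the branch is simply never taken.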

TP10 — Rework materializer definition

Summary. Replace the dataclass-based materializer definition, which forces explicitly stating all arguments, can't support **kwargs, and pollutes the test suite with deprecation warnings. (Source: Top Priorities §10.)

Mental model. Defining a saver/loader should feel like writing a normal class with a flexible constructor, not satisfying a rigid dataclass contract.

Assumptions. Goal is signature-introspection-based argument resolution (supporting defaults and **kwargs) replacing the dataclasses.fields() requirement, while keeping existing dataclass adapters working.

Codebase touchpoints. DataSaver/DataLoader inherit AdapterCommon, whose get_required_arguments()/get_optional_arguments() call dataclasses.fields() and _ensure_dataclass() raises if not a dataclass (io/data_adapters.py). Instantiation is centralized in AdapterFactory.create_loader/create_saver (function_modifiers/adapters.py) — a clean single point to swap the extraction strategy.

Ranks.

Def	Feas	Effort	Benefit	BC risk
4	5	S–M	3	low–med

Well-defined, well-isolated, quick. Benefit is moderate and partly internal (kills the deprecation-warning noise, eases writing adapters) rather than end-user-facing. Keeping dataclass adapters working keeps risk low; it rises toward medium only if the public DataSaver/DataLoader authoring contract shifts.

Open questions. Is the new contract signature-based introspection of __init__, a Protocol, or explicit get_required/optional_arguments overrides? Confirm which Python version's dataclass behavior triggers the deprecation warnings.
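
If the answer is signature-based introspection, the extraction step is small. The sketch below reads required and optional arguments from the constructor signature, which works for dataclasses and plain classes alike and can report **kwargs; constructor_arguments and the two example savers are hypothetical and are not the AdapterCommon API.

```python
import inspect
from dataclasses import dataclass


def constructor_arguments(cls):
    """Return (required, optional, accepts_kwargs) read from cls's constructor."""
    required, optional, accepts_kwargs = [], [], False
    for name, param in inspect.signature(cls).parameters.items():
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            accepts_kwargs = True
        elif param.kind is inspect.Parameter.VAR_POSITIONAL:
            continue  # *args carries no named arguments to resolve
        elif param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            optional.append(name)
    return required, optional, accepts_kwargs


@dataclass
class CsvSaver:
    """Existing style: a dataclass keeps working unchanged."""
    path: str
    sep: str = ","


class FlexibleSaver:
    """New style: a normal class with a flexible constructor."""

    def __init__(self, path, **writer_kwargs):
        self.path = path
        self.writer_kwargs = writer_kwargs
```

Because inspect.signature also understands the __init__ that @dataclass generates, existing dataclass adapters would resolve to the same required/optional split they report today.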

TP11 — More flexible dependencies

Summary. (a) Tag-based grouping — e.g. a dataframe assembled from all columns tagged @final_asset; (b) fallback/optional dependencies — a node optionally tolerating a prior node's failure. (Source: Top Priorities §11.)

Mental model. Dependencies can be declared by property ("everything tagged X") rather than by name, and a node can degrade gracefully when an upstream optionally fails.

Assumptions. Two distinct features bundled. (a) means resolving tag queries to concrete edges at graph-build time. (b) extends OPTIONAL semantics from "skip if absent" to "use a fallback if it fails."

Codebase touchpoints. Dependency types are only REQUIRED/OPTIONAL (node.py); optional currently means "skip the edge if missing" in graph.py. group(...) collection exists (GroupedListDependency/GroupedDictDependency in function_modifiers/dependencies.py) but by explicit source, not by tag. matches_query() tag-matching exists but runs at list/viz time, not during edge resolution. No fallback-on-failure mechanism exists.

Ranks.

Def	Feas	Effort	Benefit	BC risk
2	3	M–L	3	med

Tag-grouping is more feasible (the matcher exists; move resolution to build-time). Fallback-on-failure is the harder, riskier half — it changes OPTIONAL semantics and needs execution-layer failure handling (overlaps TP2's graceful mode), hence medium BC risk.

Open questions. Does tag-based resolution at build time interact badly with @config.when conditional nodes? Is "fallback" a default value, a previous run's value, or just None? Should (b) be unified with TP2 graceful fallback rather than built separately?
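
Half (a) amounts to turning a tag query into a concrete list of upstream names while the graph is being built. A minimal sketch, assuming tags are a flat name-to-string mapping per node; resolve_tag_query and assemble are hypothetical helpers and do not call matches_query().

```python
def resolve_tag_query(node_tags, query):
    """Return the names of all nodes whose tags contain every key/value in `query`."""
    return sorted(
        name
        for name, tags in node_tags.items()
        if all(tags.get(key) == value for key, value in query.items())
    )


def assemble(results, node_tags, query):
    """Collect the outputs of every node matching the query, keyed by node name."""
    return {name: results[name] for name in resolve_tag_query(node_tags, query)}


NODE_TAGS = {
    "age": {"final_asset": "true"},
    "income": {"final_asset": "true", "pii": "true"},
    "raw_rows": {},
}
RESULTS = {"age": [31, 40], "income": [10, 20], "raw_rows": [[31, 10], [40, 20]]}
```

Resolving the query at build time means the matched names become ordinary edges, so everything downstream (execution, visualization, caching) sees a normal dependency list.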

TP12 — Flexible visualization / rendering

Summary. Refactor visualization to depend on the public HamiltonGraph/HamiltonNode, and make output pluggable — Mermaid, Reactflow, Graphviz, others, possibly user-contributed. (Source: Top Priorities §12.)

Mental model. Rendering is a pluggable backend over a public graph representation: pick (or contribute) a renderer, get that format out, without touching execution.

Assumptions. Graphviz stays the default; new renderers are additive. The public HamiltonGraph/HamiltonNode API is the intended data source.

Codebase touchpoints. An excellent public API already exists — HamiltonGraph/HamiltonNode dataclasses with .as_dict() and from_graph() (graph_types.py). Current rendering is create_graphviz_graph() (~250 lines tightly coupled to DOT) in graph.py, with the graphviz import already deferred and a custom_style_function hook present. Public entry points display_all_functions()/visualize_execution() (driver.py) take graphviz-specific kwargs.

Ranks.

Def	Feas	Effort	Benefit	BC risk
4	5	M	4	low

The strongest near-term candidate: the clean public API boundary already exists, viz is off the execution path (safe to refactor), and Mermaid/Reactflow are pure data transforms. Default-to-graphviz keeps it low-risk; the only friction is graphviz-specific kwargs in the public signatures, manageable with a deprecation window.

Open questions. Is the renderer interface (HamiltonGraph) -> str|bytes|dict? Which formats ship built-in vs. as a plugin contract? How are today's graphviz-specific kwargs migrated?
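
To illustrate why Mermaid is "a pure data transform", here is a sketch of a renderer registry with one Mermaid backend. A plain dict of node name to upstream names stands in for HamiltonGraph, so none of its real attributes are assumed; register_renderer, render and RENDERERS are hypothetical names.

```python
RENDERERS = {}


def register_renderer(name):
    """Decorator that adds a renderer function to the registry under `name`."""
    def decorator(fn):
        RENDERERS[name] = fn
        return fn
    return decorator


@register_renderer("mermaid")
def render_mermaid(graph):
    """Render {node: [upstream, ...]} as a left-to-right Mermaid flowchart."""
    lines = ["graph LR"]
    for node in sorted(graph):
        lines.append(f"  {node}[{node}]")
    for node in sorted(graph):
        for upstream in sorted(graph[node]):
            lines.append(f"  {upstream} --> {node}")
    return "\n".join(lines)


def render(graph, fmt):
    """Dispatch to a registered renderer; unknown formats fail loudly."""
    try:
        renderer = RENDERERS[fmt]
    except KeyError:
        raise ValueError(f"no renderer registered for {fmt!r}") from None
    return renderer(graph)


GRAPH = {"a": [], "b": ["a"]}
```

A user-contributed format would only need another @register_renderer function, and the existing graphviz path could be registered the same way while remaining the default.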

TP13 — Remove assumptions on modules

Summary. Stop assuming graphs are built from Python modules; build from functions/nodes directly. (Source: Top Priorities §13.)

Mental model. Modules are one convenient way to supply functions, not a requirement — you can hand Hamilton functions or nodes directly.

Assumptions. Strongly overlaps TP3; this is the "input side" (drop the module requirement) while TP3 is the "output side" (graph as a passable object). Likely planned together.

Codebase touchpoints. Module assumption is concentrated in find_functions() (graph_utils.py) feeding create_function_graph()/from_modules() (graph.py) and Driver.__init__ (driver.py); serialization/validation also iterate graph_modules. But the decorator pipeline (resolve_nodes(fn, config)) and Node itself are already module-agnostic — the assumption is shallow.

Ranks.

Def	Feas	Effort	Benefit	BC risk
3	4	M	3	low

Feasible because the coupling is thin and the node layer is already module-free. Main work is making modules optional through Driver + serialization + validation. Best executed jointly with TP3.

Open questions. What happens to module-based serialization for module-less graphs? Do any public hooks rely on graph_modules being populated?
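
Making modules "one convenient way to supply functions" suggests an input step that accepts either. The sketch below normalises modules and bare functions into one name-to-function mapping; collect_functions is hypothetical and is not find_functions(), and the underscore filter is an assumption about which module members count as nodes.

```python
import inspect
import types


def collect_functions(*sources):
    """Accept modules or plain functions; return {name: function}."""
    functions = {}
    for source in sources:
        if inspect.ismodule(source):
            for name, member in inspect.getmembers(source, inspect.isfunction):
                if not name.startswith("_"):  # assumed: private helpers are skipped
                    functions[name] = member
        elif callable(source):
            functions[source.__name__] = source
        else:
            raise TypeError(f"expected a module or function, got {type(source).__name__}")
    return functions


def a():
    return 1


def b(a):
    return a + 1


def _helper():
    return None


# Stand-in for a user module containing the same functions.
toy_module = types.ModuleType("toy_module")
toy_module.a, toy_module.b, toy_module._helper = a, b, _helper
```

Both collect_functions(toy_module) and collect_functions(a, b) yield the same mapping, so everything after this step can stay unaware of where the functions came from.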

Wishlist / Wacky ideas

W1 — Alternative procedural API

Summary. A procedural graph-building API as a migration bridge from Kedro/Prefect/etc.; explicitly not the central/intended API. (Source: Wishlist §1.)

Mental model. Imperative gb.add_node(...) construction familiar to migrants, lowering the switching cost from other orchestrators, while idiomatic Hamilton stays declarative.

Assumptions. Builds on the same node objects, just an imperative front door. Depends on TP3/TP13 (graph decoupled from modules) to be clean.

Codebase touchpoints. Decorator pipeline operates on callables, not modules (resolve_nodes in function_modifiers/base.py); FunctionGraph accepts a nodes dict; graph.with_nodes() exists (graph.py). So an imperative builder emitting nodes is well-supported.

Ranks.

Def	Feas	Effort	Benefit	BC risk
2	4	M	3	low

Feasible and additive. The doc itself frames it as non-central, which caps benefit; value is strategic (adoption funnel) rather than for existing users.

Open questions. Is this a supported long-term API or a throwaway migration shim? Which frameworks' mental models must it mirror?

W2 — Compile-mode

Summary. Decorators each describe how to transform themselves into code (e.g. @parameterize generates the parameter expansion, @extract_columns generates the unpacking), so a graph can be emitted as code. (Source: Wishlist §2.)

Mental model. The decorator graph is a source you can lower to plain Python (or another target), making Hamilton a compile-time metaprogramming layer rather than only a runtime.

Assumptions. Means adding an optional code-generation path to decorators alongside their node-producing path; not replacing runtime execution.

Codebase touchpoints. Decorators today return Node objects wrapping callables, not code (function_modifiers/base.py, expanders/macros). They're well-separated (NodeCreator/NodeExpander/NodeTransformer), so an additional to_code() branch is structurally possible, but no code-gen exists anywhere today.
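
A toy version of the "additional to_code() branch" might look like the following. `ParameterizeSpec` is a hypothetical stand-in for a @parameterize-style expansion, not Hamilton's decorator class, and the emitted source format is an assumption; the sketch only shows that a decorator holding its expansion as data can lower it to plain Python.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ParameterizeSpec:
    """Hypothetical stand-in for a @parameterize-style expansion."""

    base_fn: str                         # name of the function being expanded
    variants: Dict[str, Dict[str, Any]]  # output node name -> bound keyword values

    def to_code(self) -> str:
        """Emit one plain function per variant instead of producing node objects."""
        lines = []
        for node_name, bound in self.variants.items():
            args = ", ".join(f"{key}={value!r}" for key, value in bound.items())
            lines.append(f"def {node_name}(value):")
            lines.append(f"    return {self.base_fn}(value, {args})")
            lines.append("")
        return "\n".join(lines)


def scale(value, factor):
    return value * factor


spec = ParameterizeSpec("scale", {"doubled": {"factor": 2}, "tripled": {"factor": 3}})
source = spec.to_code()

# Round-trip check: the emitted source is ordinary Python that can be executed.
namespace = {"scale": scale}
exec(source, namespace)
```

Even in this small case the emitter has to make choices (argument naming, repr-able bound values), which hints at why faithfulness across every decorator is the hard part.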

Ranks.

Def	Feas	Effort	Benefit	BC risk
2	3	L	2	low

Additive (low risk) but a large, novel surface — every decorator needs a faithful code-emitter, and correctness across the decorator zoo is hard. Benefit is niche/unproven.

Open questions. What's the actual use case — debugging, lift/shift (overlaps W6), performance? Which decorators must support it for it to be useful?

W3 — Auto-optimizing parallelism

Summary. Learn and adjust how to handle parallelism for a given executor based on a heuristic (the doc muses an ML intern could own it). (Source: Wishlist §3.)

Mental model. The runtime observes execution and tunes its own parallelism decisions over time — self-optimizing scheduling.

Assumptions. Requires runtime metrics collection that doesn't exist yet; "ML" is aspirational — a heuristic is the realistic v1.

Codebase touchpoints. Executor abstraction is pluggable (TaskExecutor, ExecutionManager.get_executor_for_task() in execution/executors.py), but max_tasks is fixed at init, grouping is committed before execution, and no runtime metrics are collected.
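
For a sense of what a heuristic v1 could be, here is a minimal hill-climbing sketch: after each batch it compares throughput with the previous batch and nudges the worker count up or down. `ParallelismTuner` and its fields are hypothetical, and the sketch assumes two things the touchpoints say are missing today: per-batch runtime metrics, and an executor whose worker count can change after init.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParallelismTuner:
    """Hypothetical hill-climbing heuristic; not part of Hamilton's executors."""

    workers: int = 2
    min_workers: int = 1
    max_workers: int = 16
    history: List[int] = field(default_factory=list)
    _direction: int = 1
    _last_throughput: float = 0.0

    def observe(self, tasks_done: int, wall_seconds: float) -> int:
        """Record one batch and return the worker count to try for the next one."""
        if wall_seconds <= 0:
            raise ValueError("wall_seconds must be positive")
        throughput = tasks_done / wall_seconds
        if throughput < self._last_throughput:
            # The last adjustment made things worse, so reverse course.
            self._direction *= -1
        self._last_throughput = throughput
        proposed = self.workers + self._direction
        self.workers = max(self.min_workers, min(self.max_workers, proposed))
        self.history.append(self.workers)
        return self.workers


tuner = ParallelismTuner()
tuner.observe(tasks_done=100, wall_seconds=10.0)  # first batch: keep increasing
tuner.observe(tasks_done=100, wall_seconds=8.0)   # faster: keep increasing
tuner.observe(tasks_done=100, wall_seconds=9.0)   # slower: back off
```

The objective here is wall-clock throughput only; memory or cost objectives would need different signals, which is the first open question below.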

Ranks.

Def	Feas	Effort	Benefit	BC risk
1	3	L	2	low

Least-defined idea in the doc. The hooks exist to route tasks, but the metrics/learning layer is greenfield and benefit is speculative. A research project, not roadmap-ready.

Open questions. What's the objective (wall-clock? memory? cost?) and what signal drives it? Is a static heuristic enough, making "learning" unnecessary?

W4 — IDE kernel

Summary. A tighter IDE integration ("already part of the way there" via Jupyter + VSCode). (Source: Wishlist §4.)

Mental model. Authoring Hamilton dataflows in an editor with live DAG validation/execution feedback inline, not as a separate run step.

Assumptions. "Kernel" implies tighter Jupyter-kernel/LSP integration with execution feedback, beyond today's loose pieces. Scope is the dominant unknown.

Codebase touchpoints. Jupyter magics exist (plugins/jupyter_magic.py); a real LSP server exists (dev_tools/language_server/) plus a VSCode extension (dev_tools/vscode_extension/). But they're loosely coupled and the LSP is read-only (no execution feedback) — "part of the way there" is accurate.

Ranks.

Def	Feas	Effort	Benefit	BC risk
1	4	L	3	low

Feasible (components exist, all additive/tooling-side) but very under-defined — effort swings wildly with scope. Needs a concrete spec before it's rankable beyond this.

Open questions. What does "kernel" concretely deliver — inline execution results, live validation, profiling? Which surface (Jupyter vs VSCode) leads?

W5 — Rust integration

Summary. Run functions in Rust, or build a full Rust implementation. (Source: Wishlist §5.)

Mental model. A high-performance native core under the Python API.

Assumptions. Treated as greenfield; "run functions" vs "full impl" are very different scopes, both far from today.

Codebase touchpoints. No Rust/native code anywhere — no .rs, no Cargo.toml, no C extensions. Hamilton's value (dynamic introspection, decorators, dynamic node generation, pickle) is deeply tied to Python's runtime model, which a static Rust core fights against.

Ranks.

Def	Feas	Effort	Benefit	BC risk
1	1	XL	2	high

Lowest feasibility on the board. It is pure greenfield, in fundamental tension with Hamilton's introspection-driven design, carries a parallel-implementation or full-migration burden, and would break serialization. Effectively a 2.0/3.0-scale bet with unclear payoff.

Open questions. What's the actual bottleneck Rust would solve — is Hamilton's overhead even on the critical path for typical workloads? Selective hot-path (PyO3) vs full rewrite?

W6 — Graph compilation / lift-shift

Summary. Compile the graph to alternative frameworks (lift/shift), translate syntax (e.g. integers→vectors), and offer a fancier with_columns. (Source: Wishlist §6.)

Mental model. Author once in Hamilton, then emit an Airflow DAG / Dask graph / Spark SQL / etc. — Hamilton as the source language, other runtimes as compile targets.

Assumptions. Per-target feasibility varies widely; scoring the general capability. with_columns is the existing precedent for compiling a subdag to a framework's native form.

Codebase touchpoints. with_columns_base (function_modifiers/recursive.py) already compiles a subdag into native Spark/Polars expressions (plugins/h_spark.py, plugins/h_polars_lazyframe.py) — proof the pattern works. FunctionGraph is explicit/queryable. But compiling arbitrary Python nodes needs AST/type inference Hamilton doesn't do; graphs are built at runtime, not compile-time.

Ranks.

Def	Feas	Effort	Benefit	BC risk
2	2	XL	3	med

A bounded target (e.g. Hamilton→Airflow, leveraging the existing Dask executor and with_columns precedent) is medium effort; a general type-aware compiler is XL and feasibility-2 because arbitrary Python can't be compiled reliably ("it just works" breaks at edge cases). Best scoped to one concrete target rather than "compilation" broadly.

Open questions. Which single target is worth a v1 (Airflow? Dask?)? Is "integers→vectors syntax translation" actually wanted, or is it just illustrative?

W7 — Makefile integration

Summary. Snakemake-inspired (plus an R framework) declarative invocation, scoped to materializers only. (Source: Wishlist §7.)

Mental model. hamilton materialize <selector> to run targeted materializers from the CLI, like make targets — declarative build-style invocation over data outputs.

Assumptions. Scoped to materializers (the doc says so); a CLI selector + scheduler over existing materializers, not a general build system.

Codebase touchpoints. Materializer registry + driver.materialize() exist (io/materialization.py, driver.py); CLI uses Typer with visualize/build/inspect commands (cli/main.py) but no materialize command and no selector/scheduler. Dependency-graph infra for ordering already exists.
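
The missing selector piece could be as small as the sketch below, which resolves a glob selector against materializer names and returns the matches plus their upstream materializers in run order. `select_targets` and the registry shape (name mapped to upstream names) are assumptions for illustration; the real registry in io/materialization.py is not modelled here, and glob-over-names is only one of the selector syntaxes raised in the open questions.

```python
from fnmatch import fnmatchcase
from graphlib import TopologicalSorter
from typing import Dict, List, Sequence


def select_targets(selector: str, materializers: Dict[str, Sequence[str]]) -> List[str]:
    """Return materializers matching a glob, plus their upstreams, in run order.

    `materializers` maps a name to the names it depends on (hypothetical shape).
    """
    matched = [name for name in materializers if fnmatchcase(name, selector)]
    needed = set()
    stack = list(matched)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(materializers[name])  # pull in upstream materializers
    subgraph = {name: materializers[name] for name in needed}
    return list(TopologicalSorter(subgraph).static_order())


# Hypothetical registry: names and dependencies are invented for the example.
registry = {
    "raw_to_parquet": [],
    "features_to_parquet": ["raw_to_parquet"],
    "report_to_html": ["features_to_parquet"],
    "model_to_pickle": ["features_to_parquet"],
}
plan = select_targets("report_*", registry)
```

Returning the plan as a list, instead of executing it, doubles as the dry-run view asked about below.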

Ranks.

Def	Feas	Effort	Benefit	BC risk
2	4	M	2	low

Feasible and additive (new CLI subcommand over existing materializers). Benefit is modest and audience-specific (CLI/ops users). Reasonably well-bounded once "materializers only" is taken at face value.

Open questions. What's the selector syntax (glob over materializer names? tags?)? Does it need a dry-run/DAG-of-materializers view, and does it run in-process or shell out?

Context

These sections of the source are notes and references, not actionable proposals, so they're preserved here rather than ranked. They are essential framing for several ideas above (especially TP1 async, TP5 tracking, TP6 caching, TP7 parallelism).

Burr and Hamilton — how they relate

A comparison of the two libraries, informing how Hamilton 2.0 might converge with Burr's model. Faithful summary of the source's ten points:

1. Graph structure — Hamilton is a DAG (full execution path known up front); Burr allows cycles (richer behaviors like chain-of-thought, but unknown termination).
2. Defining transitions — Hamilton encodes them in function signatures (parameter names); Burr specifies them via the ApplicationBuilder.
3. Defining actions — Hamilton uses function modifiers that must be compiled to resolve the graph (@config.when, @pipe_output, @parameterize); Burr resolves the graph explicitly at build time (@action + .bind()).
4. Conditional transitions — Hamilton has none (so all executed nodes are known beforehand); Burr allows conditional execution (termination point unknown; a different constraint than cycles).
5. Mental model — Hamilton: dataflow (data = nodes, functions = edges). Burr: state machine (functions = nodes, transitions = edges). Each can be transposed onto the other.
6. State — Hamilton's state is implicit, {node_name: value} (or {(node_name, task_id): value} for task-based), statically defined and decentralized across nodes. Burr defines state explicitly; fields needn't map to actions; can be centralized for validation.
7. State persistence — Hamilton uses in-memory dicts; caching added alternative read/write stores (.with_caching(result_store=...)). Burr also in-memory, with .with_persister() logging per action.
8. Parallelism — Hamilton runs subdag instances in parallel branches but doesn't internally represent them as subdags (source of the edge cases noted in TP7); users like it. Burr represents parallel subdags internally, giving a unified experience across UI/hooks/validation.
9. Caching — Hamilton caching needs: version input/output data, version transformation code, version DAG deps (implicit via signature), persist results + metadata; those feed checkpointing/caching algorithms. Burr has no caching but has result persistence; it could reuse Hamilton's versioning + algorithms with a Burr-appropriate API.
10. Async and streaming — open question in the source; Burr is "seemingly more robust" here.

Triage relevance. Point 8 directly motivates TP7 (represent parallel work as real subdags). Point 9 confirms TP6's foundations are reusable and shared with Burr. Points 6–7 inform TP5 (state/tracking decoupled from nodes). Point 10 reinforces TP1's open question about async robustness.
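
The contrast in points 1, 4 and 6 can be shown with a toy, stdlib-only sketch. It uses neither library's API: the dependency dict, the `actions` and `transitions` tables and `run_state_machine` are all invented for illustration. The dataflow side knows its full execution order before anything runs; the state-machine side picks each next step from explicit state, so a cycle's length is only known at runtime.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Dataflow view: edges are declared up front, so the order is known before running.
dataflow_deps = {"cleaned": ["raw"], "features": ["cleaned"], "model": ["features"]}
planned_order = list(TopologicalSorter(dataflow_deps).static_order())

# State-machine view: explicit state, transitions chosen at runtime from that state.
State = Dict[str, int]


def increment(state: State) -> State:
    return {**state, "count": state["count"] + 1}


def finish(state: State) -> State:
    return {**state, "done": 1}


actions: Dict[str, Callable[[State], State]] = {"increment": increment, "finish": finish}

# (from_action, condition, to_action); "increment" -> "increment" is a cycle.
transitions: List[Tuple[str, Callable[[State], bool], str]] = [
    ("increment", lambda s: s["count"] < 3, "increment"),
    ("increment", lambda s: s["count"] >= 3, "finish"),
]


def run_state_machine(entry: str, state: State, max_steps: int = 100) -> Tuple[List[str], State]:
    """Run actions until no transition applies; max_steps guards against endless cycles."""
    trace: List[str] = []
    current = entry
    for _ in range(max_steps):
        state = actions[current](state)
        trace.append(current)
        following = next(
            (to for frm, cond, to in transitions if frm == current and cond(state)), None
        )
        if following is None:
            break
        current = following
    return trace, state


trace, final_state = run_state_machine("increment", {"count": 0})
```

The `max_steps` guard is the toy counterpart of "unknown termination": the dataflow needs no such guard because its plan is finite by construction.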

References (from source)

targets — Hamilton-like library in R.
Metaflow resume feature — prior art for TP6 checkpointing/resume.
Metaflow project namespacing — git-branch-like project isolation; relevant to TP5 cross-job tracking.
SQLMesh GitHub CI/CD bot.
SQLMesh Airflow plugin — persists metadata into Airflow backend; relevant to TP5 and W6.
Kedro MLFlow plugin.
