Skip to content

hrw4u: type the AST and add a stage-inspection tool#13180

Open
juanthropic wants to merge 14 commits into
apache:masterfrom
juanthropic:hrw4u-ast-nodes-spans
Open

hrw4u: type the AST and add a stage-inspection tool#13180
juanthropic wants to merge 14 commits into
apache:masterfrom
juanthropic:hrw4u-ast-nodes-spans

Conversation

@juanthropic
Copy link
Copy Markdown
Contributor

Pulls the hrw4u AST out of "untyped strings built by something called a
visitor" into a typed, inspectable artifact, and adds a dev-only CLI to
look at it.

AST contract

  • Rename ASTVisitor to ASTBuilder. The class constructs an AST from
    an ANTLR parse tree; it does not visit one.
  • Replace raw-string fields on Assignment, Comparison, LogicalOp,
    and VarSection with four AST-local enums (AssignOp, CmpOp,
    BoolOp, VarSectionKind) that carry identity only. AST consumers
    pattern-match against semantic names rather than grammar spellings,
    so e.g. changing == to eq in the grammar would only touch the
    builder.
  • Wire the builder to actually emit the new enum values; the dataclass
    annotations had been lying.
  • Rename the shared raw value field to text / name / pattern on
    the three node types where construction strips delimiters or sigils
    (quotes, $, /); keep raw where nothing is stripped, so the
    asymmetry encodes a real distinction.
  • Promote ! to a named BANG token so the negated factor and in
    forms are discriminated via ctx.BANG() instead of fragile textual
    scans of child nodes.
  • Namespace ast_nodes imports in the builder (drop import *); tests
    keep the wildcard since they read more naturally as the module's
    public API consumer.
  • Tidy ast_nodes.py: comment the enum __repr__ alias so it doesn't
    read as dead code, and switch the remaining Union[...] aliases to
    X | Y.
  • Note that Target is a value class, not an AST node, since it's
    destructured from an Assignment's IDENT lhs.

Inspection tool

Adds an hrw4u-ast CLI that emits a chosen IR (cst or ast) from an
HRW4U source file. Stages are registered in a dict so future stages
(ast-resolved, ast-validated, ...) are one-line additions; --stage
defaults to ast since that's the common case.

The tool is dev-only: it's run from the source tree via
uv run scripts/hrw4u-ast, not shipped via pip or PyInstaller (each
--onedir bundle would add ~30–50 MB to release artifacts).

Tests

Backfills tests for builder paths that the existing suite parsed but
never asserted on, so the enum/field renames don't silently regress
qualified function names, !in with an iprange RHS, lowercase false,
empty string assignment, Target.from_dotted with and without a
namespace, and a few other gaps surfaced along the way. Adds a smoke
test for the new CLI surface.

The class constructs an AST from an ANTLR parse tree; it does not
visit one. Rename the module and class to match what they do.
Replace raw-string fields on Assignment, Comparison, LogicalOp, and
VarSection with four AST-local enums: AssignOp, CmpOp, BoolOp, and
VarSectionKind. These carry identity only (no codegen metadata), so
AST consumers pattern-match against semantic names and stay decoupled
from grammar spellings; changing '==' to 'eq' in the grammar would
only touch the builder, not downstream code.

VarSection.scope moves from a raw string ("txn" / "session") to the
thin VarSectionKind, keeping the AST free of semantic/codegen concerns.

The builder and tests still pass raw strings; this commit changes
only the contract. Builder and test updates follow.
Adds a small CLI that takes an HRW4U source file and emits a chosen
intermediate representation: the ANTLR concrete syntax tree (cst) or
the dataclass AST built by ASTBuilder (ast). Stages are registered
in a dict so adding future stages (ast-resolved, ast-validated, ...)
is a one-line change. --stage is required so the tool stays explicit
about which IR is being inspected as the pipeline grows.

Wired into pyproject.toml script-files and the Makefile build target
alongside the other scripts.
The thin AST enums (AssignOp, CmpOp, BoolOp, VarSectionKind) were
defined and referenced in the dataclass annotations but the builder
still emitted raw strings, so the type annotations were lying and the
enums were unreachable. Switch the builder to emit the enum values,
update the existing assertions, and fill in coverage gaps surfaced
along the way: IdentValue assignment RHS, paramRef as a function
argument, multiple use directives, top-level comments, both VARS and
SESSION_VARS in one program, multi-param procedures, and empty
if/elif/else bodies.
Add tests for paths that the existing suite parses but never asserts
on, so future refactors don't silently regress them:

- Qualified function-call names (procedure invocation).
- !in with an iprange RHS (the only IN/NOT_IN x set/iprange pair
  that was untested).
- if false as a top-level BoolLiteral condition.
- false (lowercase) as an assigned value.
- Empty string literal "" as an assigned value.
- Target.from_dotted with and without a namespace, including the
  no-dot branch that the grammar doesn't currently exercise.
The tool's main use is AST inspection; CST (and future resolved /
validated stages) are precursors and refinements around it, so
naming after the AST makes intent clearer than the generic "ir".
Defaulting --stage to ast removes friction from the common case,
and a smoke test covers the CLI surface.
Both ASTBuilder and the existing visitor previously discriminated the
negated forms ('!factor', '!in') by walking children and matching on
text, which is fragile and depends on knowing that no other grammar
production can produce a '!' child. Promoting '!' to a named BANG
token lets the visitors use ctx.BANG() — explicit, accessor-driven,
and impossible to misread.
The repr alias on every operator enum exists so that pprint output
from hrw4u-ast renders as "AssignOp.ASSIGN" rather than the noisy
default "<AssignOp.ASSIGN: 1>"; promote that to a comment so the
next reader doesn't take it for dead code. While here, switch the
remaining Union[...] aliases to the equivalent X | Y syntax to match
the rest of the file (and drop the now-unused typing.Union import).
Drive-by clarification — Target deliberately doesn't inherit from Node
and doesn't carry a line, because the grammar has no Target production:
it's destructured from an Assignment's IDENT lhs, so the source position
already lives on the Assignment.
Replace `from hrw4u.ast_nodes import *` with `from hrw4u import
ast_nodes as nodes`. The builder constructs ~25 distinct node types,
and a `nodes.Foo(...)` prefix at every construction site reinforces
"this is an AST node" without polluting the module namespace through a
wildcard. Tests keep the wildcard since they read more naturally as
the module's public API consumer.
Same change as the builder — replace `from hrw4u.ast_nodes import *`
with `from hrw4u import ast_nodes as nodes`. Keeps the import style
consistent across producer and consumer, and avoids the wildcard
silently shadowing any local name added later in the test file.
The shared `raw` field name was misleading on the three node types
where construction strips delimiters or sigils: LiteralStringValue
drops the surrounding quotes, ParamRef drops the leading '\$', and
RegexValue drops the surrounding '/'. Rename to `text`, `name`, and
`pattern` respectively, and document on each what is and isn't
preserved (escapes are kept as written). IdentValue and IPValue keep
`raw` because nothing is stripped — the field really is the source
text. The asymmetry now encodes a real distinction.
hrw4u-ast is a developer-only inspection tool, not something we ship
to end users. Drop it from setuptools' script-files (no `pip install`
exposure) and from the PyInstaller build target (each --onedir bundle
adds ~30-50 MB to release artifacts). Devs continue to run it from
the source tree via `uv run scripts/hrw4u-ast`.
@juanthropic juanthropic force-pushed the hrw4u-ast-nodes-spans branch from bc903e1 to e32c1ab Compare May 19, 2026 23:18
@juanthropic juanthropic marked this pull request as ready for review May 20, 2026 00:36
@ezelkow1 ezelkow1 requested a review from Copilot May 20, 2026 01:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors hrw4u’s parse-tree-to-AST pipeline to produce a typed, inspectable AST (replacing stringly-typed operator/scope fields with enums), and adds a dev-only hrw4u-ast CLI for emitting either CST or AST for debugging and inspection.

Changes:

  • Introduces typed AST operator/scope enums and updates the AST builder to emit them (plus renamed/clarified node fields like text/name/pattern).
  • Adds scripts/hrw4u-ast for emitting cst or ast stages, with tests covering basic CLI behavior.
  • Updates grammar and visitors to use a dedicated BANG token for ! (instead of relying on child-node text scans), and expands AST builder test coverage.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tools/hrw4u/tests/test_hrw4u_ast.py Adds smoke tests for the new hrw4u-ast inspection CLI stages and error handling.
tools/hrw4u/tests/test_ast_builder.py Updates tests to use ASTBuilder and the new typed enums/renamed AST node fields; adds coverage for previously unasserted paths.
tools/hrw4u/src/visitor.py Switches negation detection to ctx.BANG() for robustness after grammar tokenization changes.
tools/hrw4u/src/ast_nodes.py Defines AST-local enums and updates node/value dataclasses to use typed operators and clearer field names.
tools/hrw4u/src/ast_builder.py Renames ASTVisitorASTBuilder and wires builder output to the new enums/fields.
tools/hrw4u/scripts/hrw4u-ast Adds a dev-only CLI to emit CST or AST for a given input.
tools/hrw4u/Makefile Updates build file lists to include ast_builder.py and documents the dev-only nature of hrw4u-ast.
tools/hrw4u/grammar/hrw4u.g4 Promotes ! to a BANG token and updates grammar rules to use it.

Comment on lines 88 to 92
OR : '||';
TILDE : '~';
NOT_TILDE : '!~';
BANG : '!';
COLON : ':';
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants