hrw4u: type the AST and add a stage-inspection tool#13180
Open
juanthropic wants to merge 14 commits into
Open
Conversation
The class constructs an AST from an ANTLR parse tree; it does not visit one. Rename the module and class to match what they do.
Replace raw-string fields on Assignment, Comparison, LogicalOp, and
VarSection with four AST-local enums: AssignOp, CmpOp, BoolOp, and
VarSectionKind. These carry identity only (no codegen metadata), so
AST consumers pattern-match against semantic names and stay decoupled
from grammar spellings; changing '==' to 'eq' in the grammar would
only touch the builder, not downstream code.
VarSection.scope moves from a raw string ("txn" / "session") to the
thin VarSectionKind, keeping the AST free of semantic/codegen concerns.
The builder and tests still pass raw strings; this commit changes
only the contract. Builder and test updates follow.
Adds a small CLI that takes an HRW4U source file and emits a chosen intermediate representation: the ANTLR concrete syntax tree (cst) or the dataclass AST built by ASTBuilder (ast). Stages are registered in a dict so adding future stages (ast-resolved, ast-validated, ...) is a one-line change. --stage is required so the tool stays explicit about which IR is being inspected as the pipeline grows. Wired into pyproject.toml script-files and the Makefile build target alongside the other scripts.
The thin AST enums (AssignOp, CmpOp, BoolOp, VarSectionKind) were defined and referenced in the dataclass annotations but the builder still emitted raw strings, so the type annotations were lying and the enums were unreachable. Switch the builder to emit the enum values, update the existing assertions, and fill in coverage gaps surfaced along the way: IdentValue assignment RHS, paramRef as a function argument, multiple use directives, top-level comments, both VARS and SESSION_VARS in one program, multi-param procedures, and empty if/elif/else bodies.
Add tests for paths that the existing suite parses but never asserts on, so future refactors don't silently regress them: - Qualified function-call names (procedure invocation). - !in with an iprange RHS (the only IN/NOT_IN x set/iprange pair that was untested). - if false as a top-level BoolLiteral condition. - false (lowercase) as an assigned value. - Empty string literal "" as an assigned value. - Target.from_dotted with and without a namespace, including the no-dot branch that the grammar doesn't currently exercise.
The tool's main use is AST inspection; CST (and future resolved / validated stages) are precursors and refinements around it, so naming after the AST makes intent clearer than the generic "ir". Defaulting --stage to ast removes friction from the common case, and a smoke test covers the CLI surface.
Both ASTBuilder and the existing visitor previously discriminated the
negated forms ('!factor', '!in') by walking children and matching on
text, which is fragile and depends on knowing that no other grammar
production can produce a '!' child. Promoting '!' to a named BANG
token lets the visitors use ctx.BANG() — explicit, accessor-driven,
and impossible to misread.
The repr alias on every operator enum exists so that pprint output from hrw4u-ast renders as "AssignOp.ASSIGN" rather than the noisy default "<AssignOp.ASSIGN: 1>"; promote that to a comment so the next reader doesn't take it for dead code. While here, switch the remaining Union[...] aliases to the equivalent X | Y syntax to match the rest of the file (and drop the now-unused typing.Union import).
Drive-by clarification — Target deliberately doesn't inherit from Node and doesn't carry a line, because the grammar has no Target production: it's destructured from an Assignment's IDENT lhs, so the source position already lives on the Assignment.
Replace `from hrw4u.ast_nodes import *` with `from hrw4u import ast_nodes as nodes`. The builder constructs ~25 distinct node types, and a `nodes.Foo(...)` prefix at every construction site reinforces "this is an AST node" without polluting the module namespace through a wildcard. Tests keep the wildcard since they read more naturally as the module's public API consumer.
Same change as the builder — replace `from hrw4u.ast_nodes import *` with `from hrw4u import ast_nodes as nodes`. Keeps the import style consistent across producer and consumer, and avoids the wildcard silently shadowing any local name added later in the test file.
The shared `raw` field name was misleading on the three node types where construction strips delimiters or sigils: LiteralStringValue drops the surrounding quotes, ParamRef drops the leading '\$', and RegexValue drops the surrounding '/'. Rename to `text`, `name`, and `pattern` respectively, and document on each what is and isn't preserved (escapes are kept as written). IdentValue and IPValue keep `raw` because nothing is stripped — the field really is the source text. The asymmetry now encodes a real distinction.
hrw4u-ast is a developer-only inspection tool, not something we ship to end users. Drop it from setuptools' script-files (no `pip install` exposure) and from the PyInstaller build target (each --onedir bundle adds ~30-50 MB to release artifacts). Devs continue to run it from the source tree via `uv run scripts/hrw4u-ast`.
bc903e1 to
e32c1ab
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refactors hrw4u’s parse-tree-to-AST pipeline to produce a typed, inspectable AST (replacing stringly-typed operator/scope fields with enums), and adds a dev-only hrw4u-ast CLI for emitting either CST or AST for debugging and inspection.
Changes:
- Introduces typed AST operator/scope enums and updates the AST builder to emit them (plus renamed/clarified node fields like
text/name/pattern). - Adds
scripts/hrw4u-astfor emittingcstoraststages, with tests covering basic CLI behavior. - Updates grammar and visitors to use a dedicated
BANGtoken for!(instead of relying on child-node text scans), and expands AST builder test coverage.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tools/hrw4u/tests/test_hrw4u_ast.py | Adds smoke tests for the new hrw4u-ast inspection CLI stages and error handling. |
| tools/hrw4u/tests/test_ast_builder.py | Updates tests to use ASTBuilder and the new typed enums/renamed AST node fields; adds coverage for previously unasserted paths. |
| tools/hrw4u/src/visitor.py | Switches negation detection to ctx.BANG() for robustness after grammar tokenization changes. |
| tools/hrw4u/src/ast_nodes.py | Defines AST-local enums and updates node/value dataclasses to use typed operators and clearer field names. |
| tools/hrw4u/src/ast_builder.py | Renames ASTVisitor → ASTBuilder and wires builder output to the new enums/fields. |
| tools/hrw4u/scripts/hrw4u-ast | Adds a dev-only CLI to emit CST or AST for a given input. |
| tools/hrw4u/Makefile | Updates build file lists to include ast_builder.py and documents the dev-only nature of hrw4u-ast. |
| tools/hrw4u/grammar/hrw4u.g4 | Promotes ! to a BANG token and updates grammar rules to use it. |
Comment on lines
88
to
92
| OR : '||'; | ||
| TILDE : '~'; | ||
| NOT_TILDE : '!~'; | ||
| BANG : '!'; | ||
| COLON : ':'; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pulls the hrw4u AST out of "untyped strings built by something called a
visitor" into a typed, inspectable artifact, and adds a dev-only CLI to
look at it.
AST contract
ASTVisitortoASTBuilder. The class constructs an AST froman ANTLR parse tree; it does not visit one.
Assignment,Comparison,LogicalOp,and
VarSectionwith four AST-local enums (AssignOp,CmpOp,BoolOp,VarSectionKind) that carry identity only. AST consumerspattern-match against semantic names rather than grammar spellings,
so e.g. changing
==toeqin the grammar would only touch thebuilder.
annotations had been lying.
rawvalue field totext/name/patternonthe three node types where construction strips delimiters or sigils
(quotes,
$,/); keeprawwhere nothing is stripped, so theasymmetry encodes a real distinction.
!to a namedBANGtoken so the negatedfactorandinforms are discriminated via
ctx.BANG()instead of fragile textualscans of child nodes.
ast_nodesimports in the builder (dropimport *); testskeep the wildcard since they read more naturally as the module's
public API consumer.
ast_nodes.py: comment the enum__repr__alias so it doesn'tread as dead code, and switch the remaining
Union[...]aliases toX | Y.Targetis a value class, not an AST node, since it'sdestructured from an
Assignment's IDENT lhs.Inspection tool
Adds an
hrw4u-astCLI that emits a chosen IR (cstorast) from anHRW4U source file. Stages are registered in a dict so future stages
(
ast-resolved,ast-validated, ...) are one-line additions;--stagedefaults to
astsince that's the common case.The tool is dev-only: it's run from the source tree via
uv run scripts/hrw4u-ast, not shipped viapipor PyInstaller (each--onedirbundle would add ~30–50 MB to release artifacts).Tests
Backfills tests for builder paths that the existing suite parsed but
never asserted on, so the enum/field renames don't silently regress
qualified function names,
!inwith an iprange RHS, lowercasefalse,empty string assignment,
Target.from_dottedwith and without anamespace, and a few other gaps surfaced along the way. Adds a smoke
test for the new CLI surface.