Compiler front-end generator. Consumes a grammar specification and emits:
- A canonical JSON bundle describing the lexer DFA, parser tables, AST schema, and hook points.
- Driver skeletons in C, C++, Python, and Lua that load (or embed) those tables and run the front-end.
The JSON bundle is the contract. Backends are independent and replaceable. uplox itself never emits target-language code from grammar source directly — it always goes through the JSON.
v2.0.0 — breaking change to the .uplox grammar source format:
single-quote token literals ('(' not "("), <name> for every
non-terminal reference (LHS and RHS), and a new %keyword_prefix /
%keywords shortcut that collapses KW = "KW" boilerplate into a
whitespace-separated list. v1 grammars do not parse under the new
reader; all bundled examples and the self-host grammar were ported in
this release. See docs/grammar_format.md
for the current syntax and CHANGELOG.md for the full
1.x → 2.0.0 history.
- Replace the hand-written front-ends in the uc* compiler family (uc80, uc386, uplm80, uada80, ucow, …) with a single generator.
- Re-entrant by construction: every backend keeps all parser state in a `uplox_<grammar>_ctx` struct/object and prefixes generated symbols by grammar name. Two front-ends produced by uplox can link into the same binary without symbol collisions.
- Pluggable hooks (pre-shift / pre-reduce / post-reduce / on-error) for context-sensitive concerns like name resolution, scoped symbol tables, and error recovery, without leaking those concerns into the grammar.
- Lexer feedback for grammars that need it (the C typedef-name hack and friends) via a runtime token-filter callback paired with the `TypedefTracker` helper.
- Selectable LR construction algorithm: `%define lr.type {canonical-lr | lalr | ielr}`. Default canonical LR(1) for clean diagnostics; opt into LALR(1) when build time or bundle size start to hurt; opt into IELR(1) when you want LALR-size tables but the grammar produces a spurious LALR conflict you can't easily restructure away. Per-terminal `%shift` / `%reduce` escape hatches resolve ambiguity in any mode. (Concrete data: `examples/ada_full.uplox` builds to 58611 states under canonical LR(1) and 1594 states under LALR(1), a ~37× shrink, both conflict-free; IELR matches LALR on every example grammar in this repo, since none have spurious LALR conflicts.)
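As a quick check of the state-count figures quoted above, the shrink factor can be recomputed directly. The snippet uses only the two numbers given for `examples/ada_full.uplox`; it does not invoke uplox.

```python
# State counts quoted above for examples/ada_full.uplox.
canonical_lr1_states = 58611
lalr1_states = 1594

# Ratio of canonical LR(1) states to LALR(1) states.
shrink = canonical_lr1_states / lalr1_states
print(f"canonical LR(1) / LALR(1) = {shrink:.1f}x")
```

The result rounds to the "~37×" figure cited in the list.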
src/uplox/
    spec/       grammar source reader, IR
    lex/        regex → NFA → DFA construction, minimization
    parse/      LR(1) item-set construction, table builder
    glr/        GLR extension
    ast/        node schema, default builder actions
    hooks/      pluggable parse-time callbacks (TypedefTracker, ScopedNameTable, …)
    tables/     JSON schema, canonical serializer
    gen/
        c/      C backend (re-entrant, context-struct based)
        cpp/    C++ backend (class per grammar, namespaced)
        py/     Python backend (embeds bundle, reuses runtime)
        lua/    Lua backend (single 5.3+ module per grammar)
    cli/        `uplox` command
tests/
examples/       .uplox grammars (calc, plm_subset, plm_full, c_subset, uplox_self, …)
docs/           DSL spec, per-backend notes
All nine phases of the original plan landed in v1.0.0; the v1.1.0 polish round closed the items the v1.0.0 release notes called out as deferred at the cross-language level (token-filter ABI, post-reduce hook, GLR symmetry, self-host); v1.2.0 closed the post-1.0 deferred-grammar list for c_subset and lifted the action-body carve-out from the self-host grammar.
| Phase | Deliverable | Shipped |
|---|---|---|
| 1 | Repo skeleton, pyproject, CI, license, grammar source format frozen | 1.0.0 |
| 2 | Lexer pipeline: regex → NFA → DFA → JSON, Python driver, tests | 1.0.0 |
| 3 | LR(1) parser builder, conflict reporting, JSON, Python driver | 1.0.0 |
| 4 | AST schema + default builder + hooks framework | 1.0.0 |
| 5 | Port the smallest uc* front-end as the first real grammar (plm_subset) | 1.0.0 |
| 6 | GLR extension | 1.0.0 |
| 7 | C backend (re-entrant) — the highest-value target | 1.0.0 |
| 8 | C++ and Lua backends | 1.0.0 |
| 9 | Port remaining uc* front-ends; declare v1 | 1.0.0 |
| — | C / C++ / Lua: token-filter ABI + post-reduce hook | 1.1.0 |
| — | Self-host bootstrap (uplox_self.uplox) | 1.1.0 |
| — | LR(1) build perf (~30% on c_subset) | 1.1.0 |
| — | c_subset: variadic, multi-line #define, bit-fields | 1.2.0 |
| — | c_subset: designated initializers, compound literals | 1.2.0 |
| — | c_subset: full abstract declarators, _Generic, sizeof(type) | 1.2.0 |
| — | Lexer %balanced= for non-regular tokens; action bodies in uplox_self | 1.2.0 |
| — | %balanced= parity in JSON bundle + C / C++ / Lua / emitted-Python | 1.3.0 |
| — | Parse-error diagnostics (expected-token list, EOI rendering) parity | 1.4.0 |
| — | DSL rework: <name> non-terminals, '…' literals, %keywords | 2.0.0 |
pip install uplox # PyPI distribution name
The PyPI distribution is uplox because the bare uplox name on PyPI
is already taken by an unrelated matplotlib helper; the GitHub repo
(avwohl/uplox) was renamed to
match. The Python module, the CLI binary, and the .uplox grammar
format all keep the original uplox name — pip install uplox
installs a top-level import uplox and a uplox command.
uplox version # uplox 2.0.0 (schema 1)
uplox build examples/calc.uplox -o calc.json # grammar -> bundle
uplox check examples/calc.uplox # build + report conflicts
uplox parse calc.json input.txt # parse a file
uplox emit calc.json --target=c --out=gen # emit C driver
uplox emit calc.json --target=cpp --out=gen # emit C++ driver
uplox emit calc.json --target=lua --out=gen # emit Lua driver
uplox emit calc.json --target=py --out=gen # emit Python driver
uplox build accepts --lex-only to omit the parse table when a host
only needs the lexer; uplox parse accepts --glr to use the GLR
runtime when the bundle preserves conflicts.
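For build scripts that emit every backend from one bundle, the four `uplox emit` invocations listed above can be generated rather than typed out. This is a minimal sketch: `emit_commands` is a hypothetical helper name, and only the `emit` subcommand with the `--target` and `--out` flags shown above is assumed.

```python
import shlex

# Backend targets listed in the CLI examples above.
TARGETS = ("c", "cpp", "lua", "py")

def emit_commands(bundle, out_dir):
    """Return one `uplox emit` argv list per backend target."""
    return [
        ["uplox", "emit", bundle, f"--target={t}", f"--out={out_dir}"]
        for t in TARGETS
    ]

commands = emit_commands("calc.json", "gen")
for argv in commands:
    # Print the command line; pass argv to subprocess.run(argv, check=True)
    # to execute it when uplox is installed and on PATH.
    print(shlex.join(argv))
```

Building argv lists instead of shell strings avoids quoting problems when a bundle path contains spaces.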
Some grammars need the parser to influence what the lexer returns —
the canonical example is C's typedef foo; followed by foo x;,
where foo lexes as a different terminal in the second position than
the first. uplox handles this with a token filter callback the host
installs alongside its post-reduce hook:
- The post-reduce hook fires after every reduction; the host updates whatever state it needs (a typedef set, a scope stack, …).
- The token filter receives every freshly fetched lookahead (and re-runs after every reduce) and returns the (possibly rewritten) terminal kind.
Both callbacks are exposed in all four backends. The Python runtime
also ships uplox.hooks.TypedefTracker, a turnkey helper that
implements the full classical hack on top of these primitives.
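The interaction between the two callbacks can be sketched in a few lines of Python. This is a toy model, not the `uplox.hooks.TypedefTracker` API: the class name `TypedefState`, the rule name `typedef_decl`, the terminal kinds `IDENT` and `TYPE_NAME`, and the callback signatures are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TypedefState:
    """Hypothetical host-side state shared by the two callbacks."""
    typedef_names: set = field(default_factory=set)

    def post_reduce(self, rule, children):
        # Runs after each reduction. When a typedef declaration has just
        # been reduced, remember the declared name (assumed to be last).
        if rule == "typedef_decl":
            self.typedef_names.add(children[-1])

    def token_filter(self, kind, text):
        # Runs on each lookahead. Rewrite IDENT to TYPE_NAME when the
        # spelling is a known typedef; pass every other token through.
        if kind == "IDENT" and text in self.typedef_names:
            return "TYPE_NAME"
        return kind

state = TypedefState()
before = state.token_filter("IDENT", "foo")   # foo not yet declared
state.post_reduce("typedef_decl", ["typedef", "int", "foo"])
after = state.token_filter("IDENT", "foo")    # foo is now a type name
other = state.token_filter("IDENT", "x")      # unrelated identifier
print(before, after, other)
```

The same spelling lexes as a different terminal before and after the reduction, which is the behaviour the paragraph above describes. A real implementation would also need to handle scopes, which this sketch leaves out.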
- `calc`: arithmetic expressions, the smoke test.
- `ambig_expr`: classically ambiguous, exercises GLR.
- `scoped`: block-scoped IDs, exercises hooks.
- `plm_subset`: the first real grammar (Phase 5). Calc-style PL/M with declarations, IF/THEN/ELSE, DO blocks.
- `plm_full`: extends `plm_subset` to the constructs Digital Research's BDOS uses (LITERALLY, INITIAL, AT, BASED, iterative DO TO/BY, DO CASE, structures). Parses the first 162 lines of the real `bdos.plm` source end-to-end.
- `c_subset`: a meaningful subset of C: function definitions, full statement repertoire (if/else/while/for/do-while/switch/break/continue/goto/return), expressions with C precedence, struct/union/enum, casts, the typedef-name lexer hack, and a preprocessor skip. Parses 21 of uc80's example C programs (vendored as fixtures).
- `uplox_self`: the .uplox DSL described in itself. Parses every other example grammar in this list, including its own definition, action bodies and all (the lexer's `%balanced=` extension matches `{ ... }` runs by counting nested braces).
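The brace-counting idea behind `%balanced=` can be illustrated with a small standalone scanner. This is a sketch of the general technique only, not uplox's lexer code; `match_balanced` is a hypothetical name, and the sketch ignores braces inside string literals or comments, which a real lexer may have to account for.

```python
def match_balanced(text, start, open_ch="{", close_ch="}"):
    """Return the index just past the delimiter that closes text[start].

    Raises ValueError if text[start] is not open_ch or the run never closes.
    """
    if start >= len(text) or text[start] != open_ch:
        raise ValueError("run must start at an opening delimiter")
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == open_ch:
            depth += 1          # entering a nested run
        elif ch == close_ch:
            depth -= 1          # leaving a run
            if depth == 0:
                return i + 1    # outermost run is closed
    raise ValueError("unbalanced run: missing closing delimiter")

source = "rule { if (x) { y(); } } rest"
begin = source.index("{")
end = match_balanced(source, begin)
print(source[begin:end])
```

A regular expression cannot match arbitrarily nested delimiters, which is why a counter is needed for this kind of non-regular token.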
- `docs/grammar_format.md`: the `.uplox` DSL spec.
- `docs/c_backend.md`: generated C API.
- `docs/cpp_backend.md`: generated C++ API.
- `docs/lua_backend.md`: generated Lua module.
- `docs/py_backend.md`: generated Python module.
- `CHANGELOG.md`: versioned release notes.
GPL-3.0-or-later. See LICENSE.