Plan C addendum: Char primitive + codepoint string ops#105
Conversation
Adds TAG_CHAR=0x0C and a complete `runtime/src/char.rs` module covering boxing, equality/ordering, conversion, ASCII classifiers, ASCII case folding, and the load-bearing UTF-8 walkers (`string_chars`, `string_char_at`, `string_from_chars`). `string_chars` and `string_from_chars` accept Cons / Nil header words and discriminants from the codegen call site so the runtime stays free of compile-time `List[Char]` layout knowledge while still emitting well-typed `List[Char]` values. Lossy UTF-8 decode emits U+FFFD per invalid byte and resyncs at the next valid leading byte. Adds `CharAllocCount` / `CharAllocBytes` counters; 32 unit tests cover box/unbox round-trips for ASCII / 2-byte / 3-byte / 4-byte codepoints, every classifier, ASCII case passthrough on non-ASCII, `int_to_char_validate` boundary cases (surrogates, > 0x10FFFF, negative), `string_chars` on valid + invalid UTF-8 input, codepoint- indexed `string_char_at`, and `string_from_chars` round-trip through `string_chars`.
Extends the existing `'x'` Char literal lexer with the full plan-spec
escape and validity surface:
- New escape sequences: `\"`, `\0`, `\u{HEX}` (1–6 hex digits).
- Bare multi-byte codepoints (`'é'`, `'中'`, `'😀'`) decoded via a
new UTF-8 peek/advance helper pair on the byte-based cursor; the
prior `self.src[pos] as char` shortcut would otherwise see N
source bytes and reject as multi-codepoint.
- Compile-time rejection of `\u{...}` values > 0x10FFFF and any
surrogate `0xD800..=0xDFFF` (E0010 with descriptive message).
- Compile-time rejection of empty `\u{}` and >6-hex-digit escapes.
- Multi-codepoint literal bodies (`'ab'`, `'ab\u{41}'`) now produce
the targeted "Char literal must be a single codepoint; got N"
diagnostic rather than the prior "expected closing `'`" surface.
Token type stays `CharLit(char)` — Rust's `char` already enforces
the post-validation invariant (no surrogates, ≤ 0x10FFFF). Existing
parser-side mapping (`Token::CharLit` → `Expr::CharLit`) is unchanged.
15 new lexer tests cover every escape, each Unicode boundary
(0xD7FF / 0xE000 / 0x10FFFF), each rejection case (empty braces,
too many digits, out-of-range, low/high surrogates), and the three
multi-byte bare-codepoint widths.
Registers the user-facing Char primitives via a new `register_builtin_char_schemes(tc)` mirroring `register_builtin_float_schemes`: - 5 equality / ordering: char_eq, char_lt, char_le, char_gt, char_ge — `(Char, Char) -> Bool` - 3 conversion: char_to_int `(Char) -> Int`, int_to_char `(Int) -> Option[Char]`, char_to_string `(Char) -> String` - 5 ASCII classifiers: is_ascii, is_ascii_digit, is_ascii_alpha, is_ascii_alphanumeric, is_ascii_whitespace — `(Char) -> Bool` - 2 ASCII case: to_lower_ascii, to_upper_ascii — `(Char) -> Char` - 4 string codepoint ops: string_chars `(String) -> List[Char]`, string_char_at `(String, Int) -> Option[Char]`, string_from_chars `(List[Char]) -> String` Char itself is already a Ty::Char primitive (pre-existing) and literal type inference (`Expr::CharLit -> Ty::Char`) is unchanged. The validator helpers `int_to_char_validate` / `string_char_at_validate` are codegen-internal — not registered as user-callable schemes; codegen will lower the user-facing `int_to_char` / `string_char_at` to the validate-then-construct pattern in CH4. 13 typecheck unit tests cover each user-facing scheme: round-trip type inference, Option[Char] / List[Char] return types, E0044 on wrong-argument-type calls, E0045 on Char-vs-Int annotation mismatch, and Char literal default inference.
Converts Char from an I32 codepoint immediate to a `TAG_CHAR`-headed
heap pointer (mirroring Float / Int64 / String) and wires the 19
user-facing Char primitives through `lower_call`.
Type-mapping changes:
- `cranelift_ty_for_type_expr("Char")` and `cranelift_ty_of_ty(Ty::Char)`
return `pointer_ty` instead of `types::I32`.
- `Expr::CharLit(c, _)` lowers to `iconst(I64, codepoint) →
sigil_char_box → pointer_ty` (with stackmap placeholder), no longer
bare `iconst(I32, c)`.
- `type_of_expr(Expr::CharLit)` returns `pointer_ty`.
- `Pattern::CharLit(c)` loads the u32 codepoint at offset 8 from the
boxed Char scrutinee, zero-extends to i64, and tests for equality
against the literal codepoint.
Primitive dispatch:
- 17 simple FFI dispatch arms in `lower_call` for char_eq / lt / le /
gt / ge / char_to_int / char_to_string / is_ascii(_digit / _alpha /
_alphanumeric / _whitespace) / to_lower_ascii / to_upper_ascii.
- `int_to_char` and `string_char_at` lower to a validate-then-construct
pattern: validator → `brif(ok==0)` → Some/None branches → merge
block with a `pointer_ty` block-param. Some-branch builds via
`lower_ctor_alloc(Option$$Char, some_idx, [char_ptr])`; None via
`lower_ctor_alloc(..., none_idx, [])`.
- `string_chars` and `string_from_chars` thread the codegen-computed
`List$$Char` Cons / Nil header words and discriminants to the
runtime as i64 immediates; runtime stamps them through `sigil_alloc`
to build well-typed `List[Char]` cells.
`option_layout_for(Ty)` and `list_char_layout_immediates()` are the
two private helpers that resolve the monomorphized layout via
`mangle_type` / `mangle_ctor` and the `ctor_index` / `type_layouts`
side-tables.
`type_of_expr` predictions extended with the 19 Char op return types
(I8 for boolean classifiers / comparators, I64 for `char_to_int`,
`pointer_ty` for the rest). Globals set extended with the 17 user-
callable surface names (the validator helpers
`int_to_char_validate` / `string_char_at_validate` are codegen-
internal only).
Boxed Char (TAG_CHAR) is a heap pointer; the closure-record / sum-ctor / tuple-element slot-widening logic pre-Plan-C-addendum treated `EnvSlotKind::Char` and `Ty::Char` as a sub-word I32 primitive (uextend on store, ireduce(I32) on load). Both directions must drop for boxed Char — the slot already holds a `pointer_ty` value, and narrowing a pointer to I32 corrupts it. Updates: - `EnvSlotKind::is_pointer()` now matches `Char` alongside `String / Closure / User`, so closure-record bitmap bits are set correctly for boxed-Char captures. - Every match site that branched on `EnvSlotKind::Char` (closure store / load, synth-cont capture pack / unpack, post-arm-k capture pack, tail-prefix-let widen, narrow_for_kind helpers, `type_of_expr` for `Expr::ClosureEnvLoad`) routes Char through the pointer-typed branch. - `Ty::Char` in `load_field_value` and the tuple-pattern element loader drops the I32 narrow. - `is_gc_pointer_ty(Ty::Char)` returns true so sum-type field bitmaps mark Char fields for GC tracing (cosmetic on Boehm's conservative scan, load-bearing for any future precise GC). Existing e2e tests that exercised Char-as-I32 semantics: - `perform_side_narrow_to_char_value_checked`: rewrites `c == 'Y'` to `char_eq(c, 'Y')` (pointer equality wouldn't match codepoint equality for boxed Chars). - `cps_abi_captures_bearing_with_char_capture_exercises_widen_- narrow_symmetry`: same `==` → `char_eq` rewrite; the test still pins capture flow through synth-cont closure records, but with no width narrowing. - `arm_reads_char_arg_branches_on_codepoint`: same `==` → `char_eq` rewrite; the I32 → I8 split with the Bool test collapses (both are now pointer-store paths). - `task_78_5_g4_approach6_b_neq_r_pointer_return_arm_through_- char_body`: B != R width discrepancy collapses (B = Char = pointer_ty, R = String = pointer_ty); test docstring updated to reflect post-addendum reality. Test still pins the wrapper composition end-to-end.
- 14 e2e tests cover the full Char surface: char_literal_round_trips_via_to_string, char_codepoint_round_trip, int_to_char_rejects_out_of_range, int_to_char_rejects_surrogate, int_to_char_accepts_valid, is_ascii_classifiers_basic, to_lower_upper_ascii_passthrough, string_chars_ascii, string_chars_multibyte, string_char_at_codepoint_index, string_from_chars_round_trip, char_pattern_match_against_literal, char_eq_distinguishes_different_codepoints, char_doc_only_import. - std/char.sigil ships as a documentation-only file (added to BUILTIN_INJECTED skip-list). Covers literal syntax, the 19 user-facing primitives, the ASCII-only v1 scope + v2 closure path, the byte-vs-codepoint cross-reference to std/string.sigil, and a worked count_digits example. - std/string.sigil doc preamble rewritten — the "Codepoint-aware variants deferred" note becomes a positive byte-vs-codepoint surface description; the new codepoint ops surface (string_chars, string_char_at, string_from_chars) is added to the builtin operations table with a cross-reference to std/char.sigil. - PLAN_C_PROGRESS.md gets a Task CH section documenting the addendum's runtime + compiler + stdlib + test surface and the pre-existing Char-as-I32 → boxed-Char representation switch. - PLAN_C_DEVIATIONS.md Task 68 entry's "deferred class 1" (codepoint-aware ops) and "deferred class 5" (char_to_int / int_to_char) marked CLOSED by this addendum. Class 3 (Float helpers) marked CLOSED by PR #101 (already shipped). Class 2 (codepoint-aware string_split / string_replace) remains deferred — depends on stdlib namespace qualification, not Char.
Updates `spec/language.md` for the Plan C addendum:
- §1 (Lexical structure / literals): expanded `Char` literal entry
covers the boxed representation (TAG_CHAR=0x0C, 16 bytes), the
closed codepoint range (0x000000..=0x10FFFF excluding surrogates),
the full escape table (`\n`, `\t`, `\r`, `\\`, `\'`, `\"`, `\0`,
`\u{HEX}` 1–6 hex digits), bare-codepoint UTF-8 literals, and
parse-time rejection of multi-codepoint / out-of-range / surrogate
inputs. Calls out the deliberate absence of `==` / `<` operator
overloading on Char.
- §3.1 (Built-in types): `Char` row updated to "Boxed Unicode
codepoint (TAG_CHAR=0x0C, 21-bit codepoint payload)" — replaces
the pre-addendum "1-byte codepoint" placeholder.
- §3.1.1 (new subsection): full `Char` reference covering literal
syntax, the 19 user-facing operations grouped by purpose
(equality / ordering, conversion, ASCII classifiers, ASCII case),
the codepoint-aware string operations (`string_chars`,
`string_char_at`, `string_from_chars`), the byte-vs-codepoint
surface coexistence, and a worked `count_digits` example.
- §13 (Stdlib reference): adds a `std.char` row; extends the
`std.string` row with the codepoint-indexed surface.
- §14 (v1 limits): new §14.1 "Deferred to follow-up plans" table
documenting closure paths for codepoint-aware `string_split` /
`string_replace`, Unicode-aware classifiers / case / normalization,
and a hypothetical strict-UTF-8 `string_chars_strict`.
boldfield
left a comment
There was a problem hiding this comment.
Review: Plan C addendum — Char primitive + codepoint string ops
Solid work overall. The I32-immediate → boxed-pointer representation change is handled thoroughly across every code path (closure-record widening, pattern-match lowering, GC bitmap, type_of_expr predictions). Test coverage is good — 32 runtime unit tests, 15 lexer tests, 13 typecheck tests, 14 e2e tests, plus the 4 pre-existing tests updated for char_eq. The UTF-8 decoder in the runtime is correct (overlong rejection, surrogate rejection, lossy resync). The lexer extension handles all the edge cases well.
Three issues, ranked by severity:
1. Bug (performance): string_char_at double-decodes the string — runtime/src/char.rs
Both sigil_string_char_at_validate and sigil_string_char_at independently call decode_codepoints_lossy(slice), which allocates a Vec<u32> and walks the entire UTF-8 string. On the success path, every string_char_at(s, i) call decodes the full string twice — once to check bounds, once to fetch.
This is lifted from the int_to_char validate-then-construct pattern, where the validate call is a cheap integer comparison. Here it's an O(n) allocation + decode pass, and the codegen shape (validate in one branch, fetch in another) makes it impossible to share the result.
Options:
- Combined function: single
sigil_string_char_at_combined(s, idx) -> (*mut u8, i64)returning(char_ptr_or_null, ok_flag)— codegen branches on the flag, uses the pointer on success. One decode pass. - Accept the cost for v1: document in the spec that
string_char_atis O(n) with constant 2 and optimize in v2. The current approach is correct, just slow.
Either way, this should be a conscious decision, not an accident.
2. Minor bug: multi-codepoint char literal error message counts bytes, not codepoints — compiler/src/lexer.rs
The error recovery for multi-codepoint literals ('ab', 'abé') uses self.advance() in a byte-by-byte loop to count "extra" codepoints:
let mut extra = 1;
while !self.at_eof() && self.peek() != '\'' && self.peek() != '\n' {
self.advance();
extra += 1;
}self.advance() moves one byte. For 'abé' (after consuming 'a' as the first codepoint), the loop counts b (1 byte) + é (2 bytes) = 3, producing "got 4" instead of the correct "got 3". For ASCII-only cases this is correct; for non-ASCII multi-codepoint bodies the count is inflated.
Fix: use peek_utf8() to advance by codepoints in the error path, or just say "multiple codepoints" without a count.
3. Style: "Plan C addendum" comment prefix on nearly every changed line
Almost every code change has a // Plan C addendum (Char) — ... comment. These reference a task/plan name that will lose context as the codebase evolves. The commit messages and PR description already provide this provenance. Examples:
// Plan C addendum (Char) — boxed Char (TAG_CHAR) represented as a heap pointer, mirroring Float / Int64 / String.→ the code"Char" => pointer_tyis self-explanatory with the match arm context// Plan C addendum (Char) — boxed Char is pointer-typed.→ appears ~15 times across the slot-widening match arms
I see this is consistent with the existing codebase style (Plan C Task 67, Plan D Task 117), so this may be a deliberate convention. If so, carry on. If not, these could be pruned to just the ones where the why is non-obvious (the slot-widening changes, is_gc_pointer_ty, and advance_utf8 have useful comments; the 15 identical "boxed Char is pointer-typed" annotations don't add information).
Everything else looks correct. The codegen lowering for int_to_char / string_char_at (validate-then-construct via lower_ctor_alloc) follows the established string_to_int pattern cleanly. The List[Char] header-threading approach (passing Cons/Nil headers + discriminants as i64 immediates to the runtime) is clever and keeps the runtime free of monomorphization knowledge. The from_u32_unchecked uses in the runtime are all justified by upstream validation.
CI on PR #105 surfaced two regressions from the boxed-Char representation switch: 1. `is_gc_pointer_ty_matches_expected_types` (layout.rs unit test) pinned `!is_gc_pointer_ty(Ty::Char)` — pre-addendum Char was an I32 immediate, so it correctly wasn't a GC pointer. Updated to assert `is_gc_pointer_ty(Ty::Char)` instead, with a Plan-C- addendum justification comment. 2. Three CH5 e2e tests embedded `\u{HEX}` escapes inside `"..."` string literals. Sigil's *string* lexer accepts only `\n / \t / \r / \\\\ / \\\"`; the `\u{...}` escape lives in *char* literals only. Rewrote the tests to use bare UTF-8 bytes in source (`"héllo"`, `"héllo 😀"`) — the bytes are valid UTF-8 that Rust's source-tree handling preserves into the embedded test string verbatim, and Sigil's runtime treats `String` as a raw UTF-8 byte buffer.
Review — PR #105 (Char primitive + codepoint string ops)Plan compliance is largely tight. UTF-8 decode (with overlong + surrogate rejection), parse-time codepoint validation, GC bitmap + slot widening, stackmap placeholders, and the validate-then-construct lowering for MUST-FIX1. The PR description acknowledges this: "Pre-existing tests using This is actually a class bug, not a Char-specific regression. The same hole exists today for
Recommended fix: tighten the FOLLOW-UP2. 3. 4. Decoder overlong-rejection has no explicit test. 5. Multi-codepoint count error message under-counts on multi-byte input. The "got N" count loop in the lexer's DEFERRED6. 7. WHATWG "maximal subpart" U+FFFD merging. Decoder emits one U+FFFD per invalid byte; WHATWG spec says one per maximal subpart (so an invalid lead + invalid continuation should produce one U+FFFD, not two). Matches What was done well
|
Two more CI failures fixed: 1. Sigil's string lexer pre-Plan-C-addendum read source bytes as Latin-1 chars (`self.src[pos] as char`) and pushed them to a Rust `String` via `String::push(char)`, which UTF-8 re-encodes. Multi-byte source sequences double-encoded — `é` (0xC3 0xA9) → 4 bytes (0xC3 0x82 0xC2 0xA9) inside the resulting StringLit. This regressed nothing pre-addendum because `string_chars` / `string_char_at` didn't exist; the addendum surfaces the bug immediately. `take_string_lit` now uses the `peek_utf8` / `advance_utf8` helpers (added in CH2) to decode multi-byte source bytes as a single codepoint and push that codepoint verbatim. Two new lexer tests pin the round-trip for 2-byte `é` and 4-byte `😀`. 2. `string_from_chars_round_trip` had no explicit `List[Char]` annotation in its source, so monomorphize never saw the type and codegen panicked with "ctor `Cons$$Char` not registered". Added a `let xs: List[Char] = string_chars(s)` binding to trigger mono via the type annotation's Apply node, mirroring the working `string_chars_multibyte` test's shape.
PR #105 review (boldfield, 2026-05-07): one MUST-FIX, four follow- ups, two deferred. All five non-deferred items addressed plus a small note for item 7. **1 (MUST-FIX) — Reject `==` / `!=` on `Char` at typecheck.** Pre- fix, `'a' == 'a'` typechecked and lowered to `icmp Equal l r` on boxed Char pointers — pointer identity, not codepoint equality — silently returning `false` at runtime. New typecheck arm in `check_binop`'s `BinOp::Eq | BinOp::NotEq` rejects `Ty::Char` operands with E0060 pointing at `char_eq(a, b)` (or `!char_eq(a, b)` for `!=`). Two new typecheck unit tests pin the rejection. Float / Int64 (also boxed) inherit the same trap; generalizing the rejection across all heap-boxed primitives is queued as a follow-up since the existing `float_eq` / `int64_eq` discipline currently hides the bug. **2 — Refactor `string_char_at` to early-exit shared helper.** Pre- refactor `sigil_string_char_at_validate` and `sigil_string_char_at` each called `decode_codepoints_lossy(slice)` (full-pass + Vec allocation) — making `for i in 0..n: char_at(s, i)` O(n²) with O(n) transient allocations per call. New shared helper `find_nth_codepoint(bytes, idx) -> Option<u32>` walks bytes front-to-back with early-exit; both entry points use it. Each call is now O(idx + decode_cost) and allocates nothing on the hot path. **3 — `lower_ctor_alloc` comment fix.** Dropped `Char` from the "sub-word primitives (Bool, Byte, Char, Unit)" widen-on-store comment; boxed Char takes the pass-through branch. **4 — Decoder overlong-rejection tests.** Two new runtime tests: `string_chars_overlong_2byte_replaces` (`[0xC0, 0x80]` 2-byte overlong of U+0000) and `string_chars_overlong_3byte_surrogate_- replaces` (`[0xED, 0xA0, 0x80]` 3-byte UTF-8 form of surrogate U+D800). Each pins a distinct decoder branch. **5 — Lexer multi-codepoint count uses codepoint-step.** The "Char literal must be a single codepoint; got N" diagnostic's count loop now uses `peek_utf8` / `advance_utf8`, parity with the literal-body decoder above. New lexer test pins `'aé'` reports "got 2", not "got 3". **7 (DEFERRED note) — U+FFFD merging.** Sigil v1's per-byte replacement (matching `String::from_utf8_lossy`) is now documented in `std/char.sigil` alongside a forward reference to the v2 `string_chars_strict` follow-up. Item 6 (Char in `is_gc_pointer_ty` — note for v2 precise GC) is deferred per reviewer.
|
Thanks for the review — all five non-deferred items addressed in 1 (MUST-FIX) — Char 2 — 3 — 4 — Decoder overlong-rejection tests. Added 5 — Lexer multi-codepoint count loop uses 7 (note) — U+FFFD merging. Documented Sigil v1's per-byte replacement (matching 6 deferred per reviewer (note for any v2 precise-GC work that bitmap representation of types containing Char fields changed). Note: separately from the review, this push also includes earlier fixes for two CI-surfaced regressions from the boxed-Char rep switch: the |
Re-review — PR #105All 7 prior review items resolved correctly. CI green on both hosts. Two new items surfaced (one pre-existing class bug, one e2e coverage gap, both follow-ups), neither blocking. Confirmed resolved1 (MUST-FIX) — 2 — 3 — 4 — Decoder overlong-rejection tests. 5 — Lexer multi-codepoint count. Count loop in 7 (DEFERRED note) — U+FFFD per-byte vs WHATWG maximal-subpart. Documented in 6 (DEFERRED) — New items surfaced during fixes8 (LATENT BUG, FIXED) — String lexer was UTF-8 double-encoding. Pre-PR New FOLLOW-UPs9 (FOLLOW-UP) — Float / Int64 if let Some(a) = lt.as_ref() {
if matches!(a, Ty::Char | Ty::Float | Ty::Int64) {
let suggestion = match a {
Ty::Char => "char_eq",
Ty::Float => "float_eq",
Ty::Int64 => "int64_eq",
_ => unreachable!(),
};
...
}
}Plus one typecheck test per type. ~15 lines total. The discount is real: shipping this now closes the class bug uniformly; deferring leaves two known soundness holes documented in the commit message, which is worse than not knowing. This is a follow-up, not a blocker — Char-specific behavior is correct as shipped. But pin the trivial extension into either this PR or an immediate follow-up before the next agent-driven PR. Don't let it slip into "queued v2." 10 (FOLLOW-UP) — Missing e2e test for lossy UTF-8 decode behavior. Plan §"Test plan" listed Notes (no action)11 (NOTE) — Char vs String literal escape asymmetry. Char literals support 12 (NOTE) — Bottom lineMergeable. Ship Char as the first delivery of the Char + codepoint plan. Open issue / follow-up PR for item 9 (Float/Int64 generalization) before the next agent PR. Item 10 can land as part of either the existing Plan C completion work or a tiny follow-up commit. |
boldfield
left a comment
There was a problem hiding this comment.
Re-review: PR #105 after review-fix commits (d1691af + 2 CI fixes)
The review-fix commit addresses the prior round's findings well: find_nth_codepoint eliminates the Vec allocation, the multi-codepoint error count uses codepoint-stepping, and E0060 rejects ==/!= on Char. Four new issues surfaced in this pass.
1. Bug: E0060 Char rejection only checks the left operand — compiler/src/typecheck.rs
The new E0060 check guards on lt.as_ref() only:
if let Some(a) = lt.as_ref() {
if matches!(a, Ty::Char) { /* emit E0060 */ }
}If the right operand is Char but the left isn't (or is None from a prior error), the Char-specific guidance message doesn't fire. For 42 == 'a', the generic type-mismatch arm catches it (different types), so the user still gets an error — just without the "use char_eq" hint. The only real gap: lt = None (prior type error) + rt = Char → the check is skipped entirely, but since lt = None means there's already a compile error, this isn't a soundness hole.
Low severity. Fix: check both sides.
let is_char = lt.as_ref().map_or(false, |t| matches!(t, Ty::Char))
|| rt.as_ref().map_or(false, |t| matches!(t, Ty::Char));
if is_char { /* emit E0060 */ }2. Design: duplicated UTF-8 decoder — runtime/src/char.rs
decode_codepoints_lossy (~line 296) and find_nth_codepoint (~line 422) are independent hand-rolled UTF-8 decoders with nearly identical logic (leading-byte dispatch, continuation validation, overlong/surrogate rejection, U+FFFD resync). They must agree on codepoint boundaries for string_chars and string_char_at to be consistent on the same input. Today they do agree, but there's no shared code enforcing that invariant.
A decode_next_codepoint(bytes: &[u8], offset: usize) -> (u32, usize) stepping helper used by both would eliminate the duplication. decode_codepoints_lossy calls it in a loop pushing to a Vec; find_nth_codepoint calls it counting to idx with early-exit.
Also: the string_char_at codegen path still calls find_nth_codepoint twice (once in validate, once in fetch). The Vec allocation is gone, which was the big win, but each call still walks from byte 0 to codepoint idx. A combined sigil_string_char_at_try(s, idx) -> *mut u8 | null returning null for out-of-bounds would halve the work. Not blocking — the current shape is correct — but worth noting since it's the same structural issue from last round, just with lower constant factor.
3. Bug: spec cross-reference §3.4 should be §3.1.1 — spec/language.md
In the Char literal description:
use the named functions
char_eq/char_ltetc. (§3.4)
Section 3.4 is "Inference rules (overview)." The correct reference is §3.1.1 ("Char and codepoint string operations"), which is the new subsection added by this same PR.
4. Bug: string_byte_at listed twice — std/string.sigil
The updated preamble reads:
Byte-indexed ops (
string_byte_at,string_substring,string_index_of,string_byte_at) take and return byte offsets
The second string_byte_at should probably be string_length or another op — or just deleted.
Non-blocking observations
Test gap: No test exercises find_nth_codepoint on overlong or surrogate inputs. string_chars_overlong_2byte_replaces tests decode_codepoints_lossy via sigil_string_chars, but the string_char_at path through find_nth_codepoint is only tested on valid multi-byte strings. A test like string_char_at on [0xC0, 0x80, b'a'] would pin that the two decoders agree on the replacement-codepoint count for the same invalid input.
Design note: \0 and \u{HEX} escapes are now supported in char literals but not in string literals. "\u{1F600}" and "\0" produce E0010 from the string lexer. This is likely to surprise users. Not blocking for this PR, but worth tracking as a follow-up.
Codegen layer is clean — all EnvSlotKind::Char sites route through the pointer path, FFI signatures match, stackmap placeholders are on all allocating calls, block sealing is correct. The find_nth_codepoint refactor is a clear improvement over the prior double-Vec-decode. Items 3 and 4 are trivial fixes; items 1 and 2 are low-priority design improvements.
Combined fixes for:
- **Review B item 3** (PR review 4246507154, 17:59) — comment style
- **Review C items 9 + 10** (issue comment 4399992055, 18:30) —
Float/Int64 == extension + e2e lossy-decode test
- **Review D items 1–4** (PR review 4246835141, 18:46) — E0060 both
operands, decoder dedup, spec §3.4 typo, std/string duplicate
### B item 3 — Plan-C-addendum comment spam pruned
Deleted 13 redundant single-line `// Plan C addendum (Char) — boxed
Char is pointer-typed.` comments adjacent to `EnvSlotKind::Char | ...`
match arms across `compiler/src/codegen.rs`. The match-arm context
alone makes the change self-evident; the load-bearing ones (literal
lowering, type-mapping fns, dispatch arms, helpers, FuncIds struct,
runtime counters, ast.rs `is_pointer()`, layout.rs
`is_gc_pointer_ty`) stay.
### C item 9 — Heap-boxed-primitive == rejection extended to Float / Int64
The earlier Char-only E0060 rejection now generalizes to all three
heap-boxed primitives. New `BoxedPrim` enum + `boxed_primitive_-
eq_name` helper drive the typecheck arm; per-type suggestion +
ordering hint string. Float adds the NaN-aware caveat in the error
message. Four new typecheck unit tests pin Float / Int64 `==` and
`!=` rejection.
### C item 10 — e2e test for user-visible lossy UTF-8 decode
`string_chars_invalid_utf8_replaces` constructs a `ByteArray` with
a known-invalid byte (`0xFF`) via `byte_array_alloc` +
`byte_array_concat`, bypasses validation via
`string_from_bytes_alloc` (the post-validation primitive copies
bytes verbatim), and verifies `string_chars` emits U+FFFD (65533)
for the invalid byte. Closes the runtime → user-program coverage
gap.
### D item 1 — E0060 char check now guards both operands
The earlier check only inspected `lt.as_ref()`. Now both `lt` and
`rt` are checked via `lt_boxed.or(rt_boxed)`, so `42 == 'a'` and
similar shapes still fire the named-function suggestion even when
LHS is non-Char or `None`.
### D item 2 — UTF-8 decoder deduplicated
Extracted `decode_next_codepoint(bytes, offset) -> (cp, len)` as
the single source of truth for Sigil's lossy UTF-8 decode.
`decode_codepoints_lossy` (drives `string_chars`) and
`find_nth_codepoint` (drives `string_char_at`) both step through
it, so codepoint-boundary agreement is now identical-by-
construction rather than agree-by-coincidence. New runtime test
`string_char_at_overlong_replaces` exercises `find_nth_codepoint`
on `[0xC0, 0x80, b'a']` (overlong + ASCII) to pin that the two
entry points produce the same codepoint count for invalid input.
### D item 3 — spec §3.4 → §3.1.1
The Char literal entry's "use the named functions (§3.4)" pointer
referenced "Inference rules (overview)"; corrected to §3.1.1
("Char and codepoint string operations") which is the new
subsection.
### D item 4 — `std/string.sigil` duplicate removed
The byte-indexed surface preamble listed `string_byte_at` twice;
replaced the second occurrence with `string_length`.
### Out of scope (acknowledged in PR reply)
- D non-blocking observation #1 (test gap on find_nth_codepoint with
invalid input): closed by `string_char_at_overlong_replaces` above.
- D non-blocking observation #2 (`\\u{HEX}` / `\\0` not supported in
string literals): tracked as a separate follow-up since the
current behavior produces a clear E0010, not a silent miss.
|
You're right — I'd missed two of the four reviews on the first pass. Going back through them all, the breakdown was:
B item 3 — pruned 13 redundant C item 9 — extended E0060 C item 10 — new e2e D item 1 — E0060 char check now guards both operands via D item 2 — extracted D item 3 — spec cross-reference §3.4 → §3.1.1 fixed. D item 4 — duplicate D's second non-blocking observation ( D's note on the |
Summary
Adds a first-class
Chartype (boxed Unicode codepoint) and the codepoint-aware string ops gated behind it. Closes Plan C Task 68 class-1 deferrals (string_chars,string_char_at, ASCII classifier set) and theCharuser-surface widening listed inPLAN_C_DEVIATIONS.md.Implements the design in `docs/plans/2026-05-07-sigil-char-codepoint-design.md` (boldfield/designs).
Tasks
TAG_CHAR=0x0C+runtime/src/char.rs(16 boxing/unboxing/comparison/conversion/classifier/case-folding/UTF-8-walker primitives + 32 unit tests)\\\",\\0,\\u{HEX}escapes; bare multi-byte codepoint decoding; surrogate / out-of-range / multi-codepoint rejection at parse time (15 new lexer tests)register_builtin_char_schemes(tc)registering 19 user-facing primitives (13 new typecheck unit tests)Surface
Architecture notes
Char representation switch. Pre-addendum the codebase already had a partial `Ty::Char` wired through the type system, but with an I32-immediate codepoint representation (lexer / pattern / codegen). The plan moves Char to a boxed `TAG_CHAR` heap pointer, mirroring `Float` / `Int64`. This required updating: closure-record slot widening (sub-word uextend / ireduce no longer apply), match-pattern lowering (load codepoint from offset 8 instead of comparing immediate), `is_gc_pointer_ty`, `EnvSlotKind::is_pointer()`, and four existing e2e tests that compared Chars with `==` (pointer equality wouldn't match codepoint equality; test bodies switched to `char_eq`).
`List[Char]` in the runtime. `sigil_string_chars` / `sigil_string_from_chars` accept Cons / Nil header words and discriminants from the codegen call site as i64 immediates. The runtime uses these to allocate cells via the existing `sigil_alloc` plumbing without needing compile-time `List` layout knowledge — the type-tag is monomorphization-dependent (`List$$Char` vs `List$$Bool` etc.) and varies per program.
Test plan