
Plan C addendum: Char primitive + codepoint string ops#105

Merged
boldfield merged 11 commits into main from plan-c-char-codepoint on May 7, 2026
Conversation

@boldfield
Owner

Summary

Adds a first-class Char type (boxed Unicode codepoint) and the codepoint-aware string ops gated behind it. Closes Plan C Task 68 class-1 deferrals (string_chars, string_char_at, ASCII classifier set) and the Char user-surface widening listed in PLAN_C_DEVIATIONS.md.

Implements the design in `docs/plans/2026-05-07-sigil-char-codepoint-design.md` (boldfield/designs).

Tasks

  • CH1 — Header constant TAG_CHAR=0x0C + runtime/src/char.rs (16 boxing/unboxing/comparison/conversion/classifier/case-folding/UTF-8-walker primitives + 32 unit tests)
  • CH2 — Lexer + parser: `\"`, `\0`, `\u{HEX}` escapes; bare multi-byte codepoint decoding; surrogate / out-of-range / multi-codepoint rejection at parse time (15 new lexer tests)
  • CH3 — Typecheck: register_builtin_char_schemes(tc) registering 19 user-facing primitives (13 new typecheck unit tests)
  • CH4 — Codegen: `Expr::CharLit` → boxed pointer via `sigil_char_box`; `Pattern::CharLit` loads codepoint at offset 8 + icmp; 17 simple FFI dispatch arms; `int_to_char` + `string_char_at` validate-then-construct lowering with Some/None via `lower_ctor_alloc`; `string_chars` / `string_from_chars` thread `List[Char]` Cons/Nil headers + discriminants; slot widening + GC bitmap updated for pointer-typed Char
  • CH5 — e2e tests + `std/char.sigil` + `std/string.sigil` doc update + `PLAN_C_PROGRESS.md` / `PLAN_C_DEVIATIONS.md`
  • CH6 — `spec/language.md` update

Surface

// Char literals
'a'  '\n'  '\u{1F600}'  '中'  '😀'

// 19 user-facing primitives
char_eq, char_lt, char_le, char_gt, char_ge       // (Char, Char) -> Bool
char_to_int                                        // (Char) -> Int
int_to_char                                        // (Int) -> Option[Char]
char_to_string                                     // (Char) -> String
is_ascii, is_ascii_digit, is_ascii_alpha,
  is_ascii_alphanumeric, is_ascii_whitespace       // (Char) -> Bool
to_lower_ascii, to_upper_ascii                     // (Char) -> Char
string_chars                                       // (String) -> List[Char]
string_char_at                                     // (String, Int) -> Option[Char]
string_from_chars                                  // (List[Char]) -> String

Architecture notes

Char representation switch. Pre-addendum the codebase already had a partial `Ty::Char` wired through the type system, but with an I32-immediate codepoint representation (lexer / pattern / codegen). The plan moves Char to a boxed `TAG_CHAR` heap pointer, mirroring `Float` / `Int64`. This required updating: closure-record slot widening (sub-word uextend / ireduce no longer apply), match-pattern lowering (load codepoint from offset 8 instead of comparing immediate), `is_gc_pointer_ty`, `EnvSlotKind::is_pointer()`, and four existing e2e tests that compared Chars with `==` (pointer equality wouldn't match codepoint equality; test bodies switched to `char_eq`).
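The boxed representation described above can be sketched in plain Rust. This is a minimal sketch of the shape the note implies (16-byte cell, tag header, codepoint at offset 8, equality by codepoint); the field names and header packing are assumptions, not the actual `runtime/src/char.rs` layout.

```rust
// TAG_CHAR constant from the PR; exact header packing is an assumption.
const TAG_CHAR: u64 = 0x0C;

#[repr(C)]
#[allow(dead_code)]
struct BoxedChar {
    header: u64,    // tag word
    codepoint: u32, // offset 8 — where Pattern::CharLit's lowering loads from
    _pad: u32,
}

fn char_box(c: char) -> Box<BoxedChar> {
    Box::new(BoxedChar { header: TAG_CHAR, codepoint: c as u32, _pad: 0 })
}

fn char_eq(a: &BoxedChar, b: &BoxedChar) -> bool {
    // Codepoint equality, not pointer identity; this gap is why `==` on
    // boxed Char is later rejected at typecheck in this PR.
    a.codepoint == b.codepoint
}
```

Two independently boxed `'é'` values compare equal through `char_eq` even though their pointers differ, which is exactly the distinction the four rewritten e2e tests pin.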

`List[Char]` in the runtime. `sigil_string_chars` / `sigil_string_from_chars` accept Cons / Nil header words and discriminants from the codegen call site as i64 immediates. The runtime uses these to allocate cells via the existing `sigil_alloc` plumbing without needing compile-time `List` layout knowledge — the type-tag is monomorphization-dependent (`List$$Char` vs `List$$Bool` etc.) and varies per program.
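The header-threading idea can be illustrated with a toy cons-cell builder. This is a sketch only: the cell layout, `Box`-based allocation, and sentinel `nil` value are stand-ins for the real `sigil_alloc` plumbing, but the key property is the same — the builder stamps whatever header word and discriminant the caller supplies, so it needs no layout table of its own.

```rust
#[repr(C)]
#[allow(dead_code)]
struct ConsCell {
    header: u64, // caller-supplied Cons header word (monomorphization-dependent)
    discr: i64,  // caller-supplied Cons discriminant
    head: u64,   // element (a boxed Char pointer in the real runtime)
    tail: u64,   // next cell, or the Nil sentinel
}

// Build the list back-to-front, stamping the immediates the call site passed.
fn build_list(elems: &[u64], cons_header: u64, cons_discr: i64, nil: u64) -> u64 {
    let mut tail = nil;
    for &e in elems.iter().rev() {
        let cell = Box::new(ConsCell { header: cons_header, discr: cons_discr, head: e, tail });
        tail = Box::into_raw(cell) as u64; // leaked on purpose: the GC owns cells in the runtime
    }
    tail
}

fn list_heads(mut p: u64, nil: u64) -> Vec<u64> {
    let mut out = Vec::new();
    while p != nil {
        let cell = unsafe { &*(p as *const ConsCell) };
        out.push(cell.head);
        p = cell.tail;
    }
    out
}
```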

Test plan

  • `cargo check -p sigil-runtime` and `cargo check -p sigil-compiler` (pod-safe)
  • `cargo test -p sigil-runtime --lib` (212 / 212 incl. 32 new char tests)
  • `cargo clippy -p sigil-runtime --all-targets` and `-p sigil-compiler --all-targets` (no warnings)
  • Lexer + parser + typecheck unit tests pass (15 + 1 + 13 new tests)
  • `./scripts/pod-verify.sh`
  • CI: full e2e + reproducibility + smoke (in progress; pod can't run them)
  • CH5 e2e tests for the Char surface (pending in this PR)

boldfield added 7 commits May 7, 2026 17:05
Adds TAG_CHAR=0x0C and a complete `runtime/src/char.rs` module
covering boxing, equality/ordering, conversion, ASCII classifiers,
ASCII case folding, and the load-bearing UTF-8 walkers
(`string_chars`, `string_char_at`, `string_from_chars`).

`string_chars` and `string_from_chars` accept Cons / Nil header
words and discriminants from the codegen call site so the runtime
stays free of compile-time `List[Char]` layout knowledge while
still emitting well-typed `List[Char]` values. Lossy UTF-8 decode
emits U+FFFD per invalid byte and resyncs at the next valid leading
byte.

Adds `CharAllocCount` / `CharAllocBytes` counters; 32 unit tests
cover box/unbox round-trips for ASCII / 2-byte / 3-byte / 4-byte
codepoints, every classifier, ASCII case passthrough on non-ASCII,
`int_to_char_validate` boundary cases (surrogates, > 0x10FFFF,
negative), `string_chars` on valid + invalid UTF-8 input, codepoint-
indexed `string_char_at`, and `string_from_chars` round-trip through
`string_chars`.
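The per-byte replacement policy described above (one U+FFFD per invalid byte, resync at the next position) can be sketched like so. The real decoder is hand-rolled; this sketch leans on std's validator for brevity and is not the runtime code.

```rust
// Sketch of the per-byte U+FFFD policy: emit one replacement per invalid
// byte, then retry from the very next byte.
fn decode_codepoints_lossy(bytes: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match std::str::from_utf8(&bytes[i..]) {
            Ok(s) => {
                out.extend(s.chars().map(|c| c as u32));
                break;
            }
            Err(e) => {
                let valid = e.valid_up_to();
                let s = std::str::from_utf8(&bytes[i..i + valid]).unwrap();
                out.extend(s.chars().map(|c| c as u32));
                out.push(0xFFFD); // one replacement for this single invalid byte
                i += valid + 1;   // skip exactly one byte, then resync
            }
        }
    }
    out
}
```

On `[0xC0, 0x80]` (the 2-byte overlong of U+0000) this yields two U+FFFDs, one per byte, matching the policy rather than WHATWG's maximal-subpart merging.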
Extends the existing `'x'` Char literal lexer with the full plan-spec
escape and validity surface:

- New escape sequences: `\"`, `\0`, `\u{HEX}` (1–6 hex digits).
- Bare multi-byte codepoints (`'é'`, `'中'`, `'😀'`) decoded via a
  new UTF-8 peek/advance helper pair on the byte-based cursor; the
  prior `self.src[pos] as char` shortcut would otherwise see N
  source bytes and reject as multi-codepoint.
- Compile-time rejection of `\u{...}` values > 0x10FFFF and any
  surrogate `0xD800..=0xDFFF` (E0010 with descriptive message).
- Compile-time rejection of empty `\u{}` and >6-hex-digit escapes.
- Multi-codepoint literal bodies (`'ab'`, `'ab\u{41}'`) now produce
  the targeted "Char literal must be a single codepoint; got N"
  diagnostic rather than the prior "expected closing `'`" surface.

Token type stays `CharLit(char)` — Rust's `char` already enforces
the post-validation invariant (no surrogates, ≤ 0x10FFFF). Existing
parser-side mapping (`Token::CharLit` → `Expr::CharLit`) is unchanged.

15 new lexer tests cover every escape, each Unicode boundary
(0xD7FF / 0xE000 / 0x10FFFF), each rejection case (empty braces,
too many digits, out-of-range, low/high surrogates), and the three
multi-byte bare-codepoint widths.
Registers the user-facing Char primitives via a new
`register_builtin_char_schemes(tc)` mirroring
`register_builtin_float_schemes`:

- 5 equality / ordering: char_eq, char_lt, char_le, char_gt, char_ge
  — `(Char, Char) -> Bool`
- 3 conversion: char_to_int `(Char) -> Int`, int_to_char
  `(Int) -> Option[Char]`, char_to_string `(Char) -> String`
- 5 ASCII classifiers: is_ascii, is_ascii_digit, is_ascii_alpha,
  is_ascii_alphanumeric, is_ascii_whitespace — `(Char) -> Bool`
- 2 ASCII case: to_lower_ascii, to_upper_ascii — `(Char) -> Char`
- 4 string codepoint ops: string_chars `(String) -> List[Char]`,
  string_char_at `(String, Int) -> Option[Char]`, string_from_chars
  `(List[Char]) -> String`

Char itself is already a Ty::Char primitive (pre-existing) and
literal type inference (`Expr::CharLit -> Ty::Char`) is unchanged.
The validator helpers `int_to_char_validate` /
`string_char_at_validate` are codegen-internal — not registered as
user-callable schemes; codegen will lower the user-facing
`int_to_char` / `string_char_at` to the validate-then-construct
pattern in CH4.

13 typecheck unit tests cover each user-facing scheme: round-trip
type inference, Option[Char] / List[Char] return types, E0044 on
wrong-argument-type calls, E0045 on Char-vs-Int annotation
mismatch, and Char literal default inference.
Converts Char from an I32 codepoint immediate to a `TAG_CHAR`-headed
heap pointer (mirroring Float / Int64 / String) and wires the 19
user-facing Char primitives through `lower_call`.

Type-mapping changes:

- `cranelift_ty_for_type_expr("Char")` and `cranelift_ty_of_ty(Ty::Char)`
  return `pointer_ty` instead of `types::I32`.
- `Expr::CharLit(c, _)` lowers to `iconst(I64, codepoint) →
  sigil_char_box → pointer_ty` (with stackmap placeholder), no longer
  bare `iconst(I32, c)`.
- `type_of_expr(Expr::CharLit)` returns `pointer_ty`.
- `Pattern::CharLit(c)` loads the u32 codepoint at offset 8 from the
  boxed Char scrutinee, zero-extends to i64, and tests for equality
  against the literal codepoint.

Primitive dispatch:

- 17 simple FFI dispatch arms in `lower_call` for char_eq / lt / le /
  gt / ge / char_to_int / char_to_string / is_ascii(_digit / _alpha /
  _alphanumeric / _whitespace) / to_lower_ascii / to_upper_ascii.
- `int_to_char` and `string_char_at` lower to a validate-then-construct
  pattern: validator → `brif(ok==0)` → Some/None branches → merge
  block with a `pointer_ty` block-param. Some-branch builds via
  `lower_ctor_alloc(Option$$Char, some_idx, [char_ptr])`; None via
  `lower_ctor_alloc(..., none_idx, [])`.
- `string_chars` and `string_from_chars` thread the codegen-computed
  `List$$Char` Cons / Nil header words and discriminants to the
  runtime as i64 immediates; runtime stamps them through `sigil_alloc`
  to build well-typed `List[Char]` cells.

`option_layout_for(Ty)` and `list_char_layout_immediates()` are the
two private helpers that resolve the monomorphized layout via
`mangle_type` / `mangle_ctor` and the `ctor_index` / `type_layouts`
side-tables.

`type_of_expr` predictions extended with the 19 Char op return types
(I8 for boolean classifiers / comparators, I64 for `char_to_int`,
`pointer_ty` for the rest). Globals set extended with the 17 user-
callable surface names (the validator helpers
`int_to_char_validate` / `string_char_at_validate` are codegen-
internal only).
Boxed Char (TAG_CHAR) is a heap pointer; the closure-record /
sum-ctor / tuple-element slot-widening logic pre-Plan-C-addendum
treated `EnvSlotKind::Char` and `Ty::Char` as a sub-word I32
primitive (uextend on store, ireduce(I32) on load). Both directions
must drop for boxed Char — the slot already holds a `pointer_ty`
value, and narrowing a pointer to I32 corrupts it.

Updates:

- `EnvSlotKind::is_pointer()` now matches `Char` alongside
  `String / Closure / User`, so closure-record bitmap bits are
  set correctly for boxed-Char captures.
- Every match site that branched on `EnvSlotKind::Char` (closure
  store / load, synth-cont capture pack / unpack, post-arm-k
  capture pack, tail-prefix-let widen, narrow_for_kind helpers,
  `type_of_expr` for `Expr::ClosureEnvLoad`) routes Char through
  the pointer-typed branch.
- `Ty::Char` in `load_field_value` and the tuple-pattern element
  loader drops the I32 narrow.
- `is_gc_pointer_ty(Ty::Char)` returns true so sum-type field
  bitmaps mark Char fields for GC tracing (cosmetic on Boehm's
  conservative scan, load-bearing for any future precise GC).

Existing e2e tests that exercised Char-as-I32 semantics:

- `perform_side_narrow_to_char_value_checked`: rewrites
  `c == 'Y'` to `char_eq(c, 'Y')` (pointer equality wouldn't
  match codepoint equality for boxed Chars).
- `cps_abi_captures_bearing_with_char_capture_exercises_widen_narrow_symmetry`:
  same `==` → `char_eq` rewrite; the test still pins capture flow through
  synth-cont closure records, but with no width narrowing.
- `arm_reads_char_arg_branches_on_codepoint`: same `==` →
  `char_eq` rewrite; the I32 → I8 split with the Bool test
  collapses (both are now pointer-store paths).
- `task_78_5_g4_approach6_b_neq_r_pointer_return_arm_through_char_body`:
  B != R width discrepancy collapses (B = Char = pointer_ty, R = String =
  pointer_ty); test docstring updated to reflect post-addendum reality.
  Test still pins the wrapper composition end-to-end.
- 14 e2e tests cover the full Char surface: char_literal_round_trips_via_to_string,
  char_codepoint_round_trip, int_to_char_rejects_out_of_range,
  int_to_char_rejects_surrogate, int_to_char_accepts_valid,
  is_ascii_classifiers_basic, to_lower_upper_ascii_passthrough,
  string_chars_ascii, string_chars_multibyte,
  string_char_at_codepoint_index, string_from_chars_round_trip,
  char_pattern_match_against_literal, char_eq_distinguishes_different_codepoints,
  char_doc_only_import.

- std/char.sigil ships as a documentation-only file (added to
  BUILTIN_INJECTED skip-list). Covers literal syntax, the 19
  user-facing primitives, the ASCII-only v1 scope + v2 closure path,
  the byte-vs-codepoint cross-reference to std/string.sigil, and a
  worked count_digits example.

- std/string.sigil doc preamble rewritten — the "Codepoint-aware
  variants deferred" note becomes a positive byte-vs-codepoint
  surface description; the new codepoint ops surface
  (string_chars, string_char_at, string_from_chars) is added to the
  builtin operations table with a cross-reference to std/char.sigil.

- PLAN_C_PROGRESS.md gets a Task CH section documenting the
  addendum's runtime + compiler + stdlib + test surface and the
  pre-existing Char-as-I32 → boxed-Char representation switch.

- PLAN_C_DEVIATIONS.md Task 68 entry's "deferred class 1"
  (codepoint-aware ops) and "deferred class 5"
  (char_to_int / int_to_char) marked CLOSED by this addendum.
  Class 3 (Float helpers) marked CLOSED by PR #101 (already shipped).
  Class 2 (codepoint-aware string_split / string_replace) remains
  deferred — depends on stdlib namespace qualification, not Char.
Updates `spec/language.md` for the Plan C addendum:

- §1 (Lexical structure / literals): expanded `Char` literal entry
  covers the boxed representation (TAG_CHAR=0x0C, 16 bytes), the
  closed codepoint range (0x000000..=0x10FFFF excluding surrogates),
  the full escape table (`\n`, `\t`, `\r`, `\\`, `\'`, `\"`, `\0`,
  `\u{HEX}` 1–6 hex digits), bare-codepoint UTF-8 literals, and
  parse-time rejection of multi-codepoint / out-of-range / surrogate
  inputs. Calls out the deliberate absence of `==` / `<` operator
  overloading on Char.

- §3.1 (Built-in types): `Char` row updated to "Boxed Unicode
  codepoint (TAG_CHAR=0x0C, 21-bit codepoint payload)" — replaces
  the pre-addendum "1-byte codepoint" placeholder.

- §3.1.1 (new subsection): full `Char` reference covering literal
  syntax, the 19 user-facing operations grouped by purpose
  (equality / ordering, conversion, ASCII classifiers, ASCII case),
  the codepoint-aware string operations (`string_chars`,
  `string_char_at`, `string_from_chars`), the byte-vs-codepoint
  surface coexistence, and a worked `count_digits` example.

- §13 (Stdlib reference): adds a `std.char` row; extends the
  `std.string` row with the codepoint-indexed surface.

- §14 (v1 limits): new §14.1 "Deferred to follow-up plans" table
  documenting closure paths for codepoint-aware `string_split` /
  `string_replace`, Unicode-aware classifiers / case / normalization,
  and a hypothetical strict-UTF-8 `string_chars_strict`.
Owner Author

@boldfield boldfield left a comment


Review: Plan C addendum — Char primitive + codepoint string ops

Solid work overall. The I32-immediate → boxed-pointer representation change is handled thoroughly across every code path (closure-record widening, pattern-match lowering, GC bitmap, type_of_expr predictions). Test coverage is good — 32 runtime unit tests, 15 lexer tests, 13 typecheck tests, 14 e2e tests, plus the 4 pre-existing tests updated for char_eq. The UTF-8 decoder in the runtime is correct (overlong rejection, surrogate rejection, lossy resync). The lexer extension handles all the edge cases well.

Three issues, ranked by severity:


1. Bug (performance): string_char_at double-decodes the string — runtime/src/char.rs

Both sigil_string_char_at_validate and sigil_string_char_at independently call decode_codepoints_lossy(slice), which allocates a Vec<u32> and walks the entire UTF-8 string. On the success path, every string_char_at(s, i) call decodes the full string twice — once to check bounds, once to fetch.

This is lifted from the int_to_char validate-then-construct pattern, where the validate call is a cheap integer comparison. Here it's an O(n) allocation + decode pass, and the codegen shape (validate in one branch, fetch in another) makes it impossible to share the result.

Options:

  • Combined function: single sigil_string_char_at_combined(s, idx) -> (*mut u8, i64) returning (char_ptr_or_null, ok_flag) — codegen branches on the flag, uses the pointer on success. One decode pass.
  • Accept the cost for v1: document in the spec that string_char_at is O(n) with constant 2 and optimize in v2. The current approach is correct, just slow.

Either way, this should be a conscious decision, not an accident.


2. Minor bug: multi-codepoint char literal error message counts bytes, not codepoints — compiler/src/lexer.rs

The error recovery for multi-codepoint literals ('ab', 'abé') uses self.advance() in a byte-by-byte loop to count "extra" codepoints:

let mut extra = 1;
while !self.at_eof() && self.peek() != '\'' && self.peek() != '\n' {
    self.advance();
    extra += 1;
}

self.advance() moves one byte. For 'abé' (after consuming 'a' as the first codepoint), the loop counts b (1 byte) + é (2 bytes) = 3, producing "got 4" instead of the correct "got 3". For ASCII-only cases this is correct; for non-ASCII multi-codepoint bodies the count is inflated.

Fix: use peek_utf8() to advance by codepoints in the error path, or just say "multiple codepoints" without a count.
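The discrepancy is just byte steps versus codepoint steps over the same literal body; a tiny Rust illustration (hypothetical helper, not lexer code):

```rust
// Returns (byte steps the buggy loop takes, codepoint steps the fix takes).
fn count_steps(body: &str) -> (usize, usize) {
    (body.len(), body.chars().count())
}
```

For ASCII-only bodies the two counts agree, which is why the bug only surfaces on multi-byte input.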


3. Style: "Plan C addendum" comment prefix on nearly every changed line

Almost every code change has a // Plan C addendum (Char) — ... comment. These reference a task/plan name that will lose context as the codebase evolves. The commit messages and PR description already provide this provenance. Examples:

  • // Plan C addendum (Char) — boxed Char (TAG_CHAR) represented as a heap pointer, mirroring Float / Int64 / String. → the code "Char" => pointer_ty is self-explanatory with the match arm context
  • // Plan C addendum (Char) — boxed Char is pointer-typed. → appears ~15 times across the slot-widening match arms

I see this is consistent with the existing codebase style (Plan C Task 67, Plan D Task 117), so this may be a deliberate convention. If so, carry on. If not, these could be pruned to just the ones where the why is non-obvious (the slot-widening changes, is_gc_pointer_ty, and advance_utf8 have useful comments; the 15 identical "boxed Char is pointer-typed" annotations don't add information).


Everything else looks correct. The codegen lowering for int_to_char / string_char_at (validate-then-construct via lower_ctor_alloc) follows the established string_to_int pattern cleanly. The List[Char] header-threading approach (passing Cons/Nil headers + discriminants as i64 immediates to the runtime) is clever and keeps the runtime free of monomorphization knowledge. The from_u32_unchecked uses in the runtime are all justified by upstream validation.

CI on PR #105 surfaced two regressions from the boxed-Char
representation switch:

1. `is_gc_pointer_ty_matches_expected_types` (layout.rs unit test)
   pinned `!is_gc_pointer_ty(Ty::Char)` — pre-addendum Char was an
   I32 immediate, so it correctly wasn't a GC pointer. Updated to
   assert `is_gc_pointer_ty(Ty::Char)` instead, with a Plan-C-
   addendum justification comment.

2. Three CH5 e2e tests embedded `\u{HEX}` escapes inside `"..."`
   string literals. Sigil's *string* lexer accepts only
   `\n`, `\t`, `\r`, `\\`, `\"`; the `\u{...}` escape lives in
   *char* literals only. Rewrote the tests to use bare UTF-8 bytes
   in source (`"héllo"`, `"héllo 😀"`) — the bytes are valid UTF-8
   that Rust's source-tree handling preserves into the embedded
   test string verbatim, and Sigil's runtime treats `String` as a
   raw UTF-8 byte buffer.
@boldfield
Owner Author

Review — PR #105 (Char primitive + codepoint string ops)

Plan compliance is largely tight. UTF-8 decode (with overlong + surrogate rejection), parse-time codepoint validation, GC bitmap + slot widening, stackmap placeholders, and the validate-then-construct lowering for int_to_char / string_char_at all check out. One MUST-FIX, three FOLLOW-UPs, two notes for DEFERRED.

MUST-FIX

1. == and != on Char silently return wrong answers. compiler/src/typecheck.rs::check_binop's BinOp::Eq | BinOp::NotEq arm only requires "same primitive type" — it does not restrict to Int/Bool. Codegen lowers both to icmp Equal l r, which on a boxed Char is pointer identity, not codepoint equality. So 'a' == 'a' returns false at runtime with no diagnostic.

The PR description acknowledges this: "Pre-existing tests using c == 'Y' (4 sites) updated to char_eq(c, 'Y') since boxed Char == would compare pointer identity, not codepoint." The agent worked around the failing tests but did not close the source: any user-written c == 'A' still typechecks and lowers to broken code. That's worse than not knowing — there is no compile-time signal.

This is actually a class bug, not a Char-specific regression. The same hole exists today for Float and Int64 — both heap-boxed, both subject to identical pointer-compare lowering. The existing stdlib follows the convention float_eq / int64_eq and tests all use the named ops, so it's hidden by discipline. Char inherits the pattern.

<, <=, >, >= on Char are not affected — check_binop's BinOp::Lt | Gt | LtEq | GtEq arm restricts operands to Ty::Int (E0060 fires).

Recommended fix: tighten the BinOp::Eq | BinOp::NotEq arm to reject heap-boxed primitives (Ty::Char, Ty::Float, Ty::Int64, plus String and any other tag-with-payload type) with an E0060 message pointing at the named *_eq function. One typecheck change closes it for all three types simultaneously and prevents the next boxed primitive from inheriting the same trap. Defensible to also block before merge for Char-only and queue the Float/Int64 generalization as a separate fix; either is fine, but the silently-wrong path cannot ship.
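The recommended arm is a small, closed check. A hypothetical sketch of its shape — `Ty` and the message text are stand-ins, not the actual `compiler/src/typecheck.rs` types:

```rust
#[derive(PartialEq)]
#[allow(dead_code)]
enum Ty { Int, Bool, Char, Float, Int64, String }

// Tightened Eq/NotEq operand check: same type required, and heap-boxed
// primitives (which would lower to a pointer compare) are rejected with
// a pointer to the named equality op.
fn check_eq_operands(lhs: &Ty, rhs: &Ty) -> Result<(), String> {
    if lhs != rhs {
        return Err("E0060: `==` operands must have the same type".into());
    }
    match lhs {
        Ty::Char => Err("E0060: use char_eq(a, b) for Char equality".into()),
        Ty::Float => Err("E0060: use float_eq(a, b) for Float equality".into()),
        Ty::Int64 => Err("E0060: use int64_eq(a, b) for Int64 equality".into()),
        Ty::String => Err("E0060: `==` on String compares pointers; use the named helper".into()),
        _ => Ok(()),
    }
}
```

One match arm closes the hole for every current boxed primitive and any future one added to the rejected set.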

FOLLOW-UP

2. string_char_at is O(n²) for any iteration pattern. runtime/src/char.rs::sigil_string_char_at_validate calls decode_codepoints_lossy(slice) and counts; sigil_string_char_at then re-decodes from scratch and indexes. Each random-access call performs two full decode passes and one transient Vec<u32> allocation. Plan says "O(n) decode walk per call" — single calls honor that. But for i in 0..n: string_char_at(s, i) is O(n²) total work plus O(n) transient allocations. Real LLM-tokenizer code will trip into this. Either (a) refactor to a shared find_nth_codepoint(bytes, idx) -> Option<u32> that early-exits, used by both entry points, or (b) document the quadratic-on-loop pattern in std/char.sigil and tell users to materialize via string_chars once. Not blocking — works correctly — but worth fixing before someone writes a parser against it.

3. lower_ctor_alloc comment is stale on Char width. The "sub-word primitives (Bool, Byte, Char, Unit) are zero-extended on store" comment in compiler/src/codegen.rs::lower_ctor_alloc lists Char as sub-word. Char is now pointer-typed. Drop "Char" from the comment to avoid misleading the next reader who reaches for the wrong invariant.

4. Decoder overlong-rejection has no explicit test. decode_codepoints_lossy correctly rejects 2/3/4-byte overlongs (the post-decode overlong match at the bottom of the function), surrogates, and >0x10FFFF. Runtime tests cover "bare 0xFF" and "truncated 0xC3" (generic invalid-byte paths) but not the overlong branches. Two specific cases are highest value: [0xC0, 0x80] (2-byte overlong of U+0000) and [0xED, 0xA0, 0x80] (3-byte UTF-8 form of surrogate U+D800). Each pins a distinct branch that's load-bearing for security-relevant input handling — exactly the kind of corner this language targets.
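For reference, std's UTF-8 validator agrees that both proposed test inputs are ill-formed; a trivial wrapper (not runtime code) demonstrates the rejection:

```rust
// Both sequences decode to in-range scalar values but use a longer form
// than allowed (overlong) or encode a surrogate — well-formed UTF-8
// forbids both, so validation must fail.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```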

5. Multi-codepoint count error message over-counts on multi-byte input. The "got N" count loop in the lexer's take_char_lit uses self.advance() (byte-step) rather than advance_utf8. For 'aé', the loop reports "got 3 codepoints" instead of "got 2" because é is 2 UTF-8 bytes. Cosmetic — the error itself fires correctly; only the diagnostic integer is wrong. Use peek_utf8 + advance_utf8 in the count loop for parity with the literal-body decoder above.

DEFERRED

6. Ty::Char now in is_gc_pointer_ty. Cosmetic on Boehm's conservative scan, called out in PR description. Note for any v2 precise-GC work that bitmap representation of types containing Char fields changed.

7. WHATWG "maximal subpart" U+FFFD merging. Decoder emits one U+FFFD per invalid byte; WHATWG spec says one per maximal subpart (so an invalid lead + invalid continuation should produce one U+FFFD, not two). Matches String::from_utf8_lossy's simpler behavior, which is the precedent the design doc cites. Ship as-is; mention in std/char.sigil if user-visible.

What was done well

  • Plan-spec'd Unicode escape \u{HEX} (1–6 hex digits), surrogate rejection, >0x10FFFF rejection — all in the lexer, all at parse time.
  • peek_utf8 / advance_utf8 + from_utf8 validation correctly rejects malformed multi-byte input at parse time.
  • Validate-then-construct lowering for int_to_char and string_char_at mirrors the string_to_int_validate / string_to_int_parse precedent exactly.
  • Stackmap placeholders applied at every allocating Char site.
  • is_gc_pointer_ty(Ty::Char) = true; EnvSlotKind::is_pointer() updated; sub-word slot widening removed at all 9 known sites (closure record store/load, synth-cont capture pack/unpack, post-arm-k, tail-prefix-let, narrow_for_kind, type_of_expr, sum-ctor field load, tuple element load).
  • Pattern::CharLit codegen loads u32 at offset 8 + zero-extend + icmp — correct shape for boxed Char.
  • Decoder correctness on the load-bearing branches (overlong, surrogate, range) is right; only the test coverage gap (item 4) is open.

boldfield added 2 commits May 7, 2026 18:11
Two more CI failures fixed:

1. Sigil's string lexer pre-Plan-C-addendum read source bytes as
   Latin-1 chars (`self.src[pos] as char`) and pushed them to a
   Rust `String` via `String::push(char)`, which UTF-8 re-encodes.
   Multi-byte source sequences double-encoded — `é` (0xC3 0xA9) →
   4 bytes (0xC3 0x83 0xC2 0xA9) inside the resulting StringLit.
   This regressed nothing pre-addendum because `string_chars` /
   `string_char_at` didn't exist; the addendum surfaces the bug
   immediately. `take_string_lit` now uses the `peek_utf8` /
   `advance_utf8` helpers (added in CH2) to decode multi-byte
   source bytes as a single codepoint and push that codepoint
   verbatim. Two new lexer tests pin the round-trip for 2-byte
   `é` and 4-byte `😀`.

2. `string_from_chars_round_trip` had no explicit `List[Char]`
   annotation in its source, so monomorphize never saw the type
   and codegen panicked with "ctor `Cons$$Char` not registered".
   Added a `let xs: List[Char] = string_chars(s)` binding to
   trigger mono via the type annotation's Apply node, mirroring
   the working `string_chars_multibyte` test's shape.
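The double-encoding in fix 1 reproduces in a few lines of plain Rust: reinterpreting each source byte as a Latin-1 `char` and pushing it re-encodes every byte >= 0x80 as two bytes.

```rust
// The pre-fix lexer shape: `self.src[pos] as char` + String::push(char).
fn push_bytes_as_latin1(src: &[u8]) -> String {
    let mut s = String::new();
    for &b in src {
        s.push(b as char); // u8 -> char maps 0xC3 to U+00C3; push then UTF-8-encodes it
    }
    s
}
```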
PR #105 review (boldfield, 2026-05-07): one MUST-FIX, four follow-
ups, two deferred. All five non-deferred items addressed plus a
small note for item 7.

**1 (MUST-FIX) — Reject `==` / `!=` on `Char` at typecheck.** Pre-
fix, `'a' == 'a'` typechecked and lowered to `icmp Equal l r` on
boxed Char pointers — pointer identity, not codepoint equality —
silently returning `false` at runtime. New typecheck arm in
`check_binop`'s `BinOp::Eq | BinOp::NotEq` rejects `Ty::Char`
operands with E0060 pointing at `char_eq(a, b)` (or
`!char_eq(a, b)` for `!=`). Two new typecheck unit tests pin the
rejection. Float / Int64 (also boxed) inherit the same trap;
generalizing the rejection across all heap-boxed primitives is
queued as a follow-up since the existing `float_eq` / `int64_eq`
discipline currently hides the bug.

**2 — Refactor `string_char_at` to early-exit shared helper.** Pre-
refactor `sigil_string_char_at_validate` and `sigil_string_char_at`
each called `decode_codepoints_lossy(slice)` (full-pass + Vec
allocation) — making `for i in 0..n: char_at(s, i)` O(n²) with O(n)
transient allocations per call. New shared helper
`find_nth_codepoint(bytes, idx) -> Option<u32>` walks bytes
front-to-back with early-exit; both entry points use it. Each call
is now O(idx + decode_cost) and allocates nothing on the hot path.
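A minimal sketch of the shared helper's shape. It is an assumption of this sketch, not the runtime code: std's `from_utf8` performs a full validation pass up front (which the real helper avoids by handling invalid bytes inline with U+FFFD), but the decode walk itself early-exits after `idx` codepoints.

```rust
// Return the idx-th codepoint of a UTF-8 byte slice, or None if out of range.
fn find_nth_codepoint(bytes: &[u8], idx: usize) -> Option<u32> {
    let s = std::str::from_utf8(bytes).ok()?; // sketch-only up-front validation
    s.chars().nth(idx).map(|c| c as u32)      // decode walk stops at idx
}
```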

**3 — `lower_ctor_alloc` comment fix.** Dropped `Char` from the
"sub-word primitives (Bool, Byte, Char, Unit)" widen-on-store
comment; boxed Char takes the pass-through branch.

**4 — Decoder overlong-rejection tests.** Two new runtime tests:
`string_chars_overlong_2byte_replaces` (`[0xC0, 0x80]` 2-byte
overlong of U+0000) and `string_chars_overlong_3byte_surrogate_replaces`
(`[0xED, 0xA0, 0x80]` 3-byte UTF-8 form of surrogate U+D800).
Each pins a distinct decoder branch.

**5 — Lexer multi-codepoint count uses codepoint-step.** The
"Char literal must be a single codepoint; got N" diagnostic's
count loop now uses `peek_utf8` / `advance_utf8`, parity with the
literal-body decoder above. New lexer test pins
`'aé'` reports "got 2", not "got 3".

**7 (DEFERRED note) — U+FFFD merging.** Sigil v1's per-byte
replacement (matching `String::from_utf8_lossy`) is now
documented in `std/char.sigil` alongside a forward reference to
the v2 `string_chars_strict` follow-up.

Item 6 (Char in `is_gc_pointer_ty` — note for v2 precise GC) is
deferred per reviewer.
@boldfield
Owner Author

Thanks for the review — all five non-deferred items addressed in d1691af.

1 (MUST-FIX) — Char == / != rejected at typecheck. New arm in check_binop's BinOp::Eq | NotEq rejects Ty::Char operands with E0060 pointing at char_eq(a, b) (or !char_eq(a, b) for !=). Two new typecheck tests pin the rejection. Generalizing to Float / Int64 / String is queued as a follow-up (PLAN_C_DEVIATIONS.md will pick it up) — the existing float_eq / int64_eq discipline hides the latent bug, so it's not blocking on this PR.

2 — string_char_at O(n²) → O(n + idx) per call. New shared helper find_nth_codepoint(bytes, idx) -> Option<u32> walks bytes front-to-back with early-exit; both sigil_string_char_at_validate and sigil_string_char_at use it. Zero transient allocations on the hot path.

3 — lower_ctor_alloc comment. Dropped Char from the "sub-word primitives (Bool, Byte, Char, Unit)" widen-on-store comment.

4 — Decoder overlong-rejection tests. Added string_chars_overlong_2byte_replaces ([0xC0, 0x80], 2-byte overlong of U+0000) and string_chars_overlong_3byte_surrogate_replaces ([0xED, 0xA0, 0x80], 3-byte UTF-8 form of surrogate U+D800). Each pins a distinct decoder branch.

5 — Lexer multi-codepoint count loop uses advance_utf8. 'aé' now reports "got 2" instead of "got 3"; new lexer test pins this.

7 (note) — U+FFFD merging. Documented Sigil v1's per-byte replacement (matching String::from_utf8_lossy) in std/char.sigil with a forward reference to a future string_chars_strict follow-up that would use the WHATWG maximal-subpart shape.

6 deferred per reviewer (note for any v2 precise-GC work that bitmap representation of types containing Char fields changed).

Note: separately from the review, this push also includes earlier fixes for two CI-surfaced regressions from the boxed-Char rep switch: the is_gc_pointer_ty_matches_expected_types test pin (now asserts is_gc_pointer_ty(Ty::Char) == true), and a string-lexer UTF-8-preservation fix (pre-fix bare multi-byte source bytes double-encoded inside StringLit).

@boldfield boldfield marked this pull request as ready for review May 7, 2026 18:29
@boldfield
Owner Author

Re-review — PR #105

All 7 prior review items resolved correctly. CI green on both hosts. Two new items surfaced (one pre-existing class bug, one e2e coverage gap, both follow-ups), neither blocking.

Confirmed resolved

1 (MUST-FIX) — == / != on Char. compiler/src/typecheck.rs::check_binop's BinOp::Eq | BinOp::NotEq arm now rejects Ty::Char operands with E0060 pointing at char_eq(a, b) (or !char_eq(a, b) for !=). Diagnostic correctly explains why (boxed → pointer compare → silent wrong answer). Two new typecheck unit tests pin the rejection. Hole closed for Char.

2 — string_char_at quadratic-on-loop. Replaced the double decode_codepoints_lossy calls with a shared find_nth_codepoint(bytes, idx) -> Option<u32> that walks bytes front-to-back with early-exit. Both entry points consume it; no transient allocations on the hot path. Each call is now O(idx + decode_cost). The loop pattern is still O(n²) over bytes because of independent re-walks across calls — the same shape as calling s.chars().nth(i) in a loop in Rust, acceptable. The defensive abort in sigil_string_char_at if the validate-first contract is violated is a nice safety touch.

3 — lower_ctor_alloc comment. Char removed from the "sub-word primitives" widen-on-store list.

4 — Decoder overlong-rejection tests. string_chars_overlong_2byte_replaces ([0xC0, 0x80]) and string_chars_overlong_3byte_surrogate_replaces ([0xED, 0xA0, 0x80]) shipped. I traced each through the decoder: 2-byte branch matches → decoded 0x00, overlong check rejects → U+FFFD; 3-byte branch matches → decoded 0xD800, surrogate check rejects → U+FFFD. Both correct.

5 — Lexer multi-codepoint count. Count loop in take_char_lit's "got N codepoints" diagnostic now uses peek_utf8 / advance_utf8. Test pins 'aé' reports "got 2".

7 (DEFERRED note) — U+FFFD per-byte vs WHATWG maximal-subpart. Documented in std/char.sigil with forward reference to a v2 strict variant. Reasonable.

6 (DEFERRED) — is_gc_pointer_ty(Ty::Char). Acknowledged. Note for v2 precise GC.

New items surfaced during fixes

8 (LATENT BUG, FIXED) — String lexer was UTF-8 double-encoding. Pre-PR take_string_lit read source bytes as Latin-1 (self.src[pos] as char) and pushed them to a Rust String via push(char), which UTF-8-encodes — so multi-byte source like "é" (0xC3 0xA9 = 2 bytes) stored 4 bytes in the resulting StringLit. Latent because pre-PR no string op decoded UTF-8; the addendum's string_chars immediately surfaces it. Fixed in 11d6f6c to use peek_utf8 / advance_utf8. Two new tests pin the round-trip. Good catch — would've been a footgun for any user writing non-ASCII string literals.
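The failure mode is reproducible in plain Rust. This is an illustrative sketch, not the compiler's code — the function names below only echo `take_string_lit`, and std's validator stands in for the lexer's `peek_utf8` / `advance_utf8`:

```rust
/// Buggy shape: read each source byte as Latin-1 and `push(char)`.
/// `push` UTF-8-encodes, so every byte >= 0x80 becomes its own
/// 2-byte sequence — the double-encoding described above.
fn take_string_lit_latin1(src: &[u8]) -> String {
    let mut lit = String::new();
    for &b in src {
        lit.push(b as char); // for "é" (0xC3 0xA9): pushes U+00C3, U+00A9
    }
    lit
}

/// Fixed shape: step by codepoint instead of by byte.
fn take_string_lit_utf8(src: &[u8]) -> String {
    std::str::from_utf8(src)
        .expect("lexer input is valid UTF-8")
        .to_string()
}
```

The buggy path turns the 2-byte source literal "é" into the 4-byte string "Ã©"; the fixed path preserves the original bytes.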

New FOLLOW-UPs

9 (FOLLOW-UP) — Float / Int64 == still silently broken; should ship the trivial extension now, not later. The fix in commit d1691af only rejects Ty::Char; Ty::Float and Ty::Int64 (both heap-boxed, both lower to pointer-identity icmp) inherit the same trap. The agent's commit message acknowledges this and defers on the grounds that "float_eq / int64_eq discipline currently hides the bug." That argument applies to stdlib code which follows the convention. It does not apply to LLM-authored programs against the spec — exactly the use case the language is designed for. An LLM writing if x == 3.14 { ... } from the prompt bank gets a silent wrong answer. The fix is one match-arm change at the same site:

if let Some(a) = lt.as_ref() {
    if matches!(a, Ty::Char | Ty::Float | Ty::Int64) {
        let suggestion = match a {
            Ty::Char => "char_eq",
            Ty::Float => "float_eq",
            Ty::Int64 => "int64_eq",
            _ => unreachable!(),
        };
        ...
    }
}

Plus one typecheck test per type. ~15 lines total. The cost is small: shipping this now closes the class bug uniformly; deferring leaves two known soundness holes documented in the commit message, which is worse than not knowing.

This is a follow-up, not a blocker — Char-specific behavior is correct as shipped. But pin the trivial extension into either this PR or an immediate follow-up before the next agent-driven PR. Don't let it slip into "queued v2."

10 (FOLLOW-UP) — Missing e2e test for lossy UTF-8 decode behavior. Plan §"Test plan" listed string_chars_invalid_replaces — input string with known-invalid bytes produces U+FFFD. Runtime unit tests cover the decoder thoroughly (bare 0xFF, truncated 0xC3, both new overlongs). But there's no end-to-end test that pins the user-visible behavior: a Sigil program reading invalid UTF-8 and getting the replacement codepoint. Minor coverage gap. Add an e2e test that constructs a String with one invalid byte (via byte_array_to_string or similar runtime path), calls string_chars, and verifies the resulting list contains U+FFFD (= int_to_char(65533) then unwrapped). Confirms the runtime → user-program path end-to-end.

Notes (no action)

11 (NOTE) — Char vs String literal escape asymmetry. Char literals support \u{HEX}; string literals only support \n / \t / \r / \\ / \". A user writing "hello \u{1F600}" gets a clear E0010 "unknown string escape \u" — that's the right side of the silent-vs-loud tradeoff. Worth a follow-up plan if the friction surfaces in spec validation. Not in this PR's scope.

12 (NOTE) — string_char_at defensive abort. If sigil_string_char_at is invoked without prior validate, it now eprintln! + process::abort(). Matches the runtime's existing internal-invariant abort pattern. Good.

Bottom line

Mergeable. Ship Char as the first delivery of the Char + codepoint plan. Open issue / follow-up PR for item 9 (Float/Int64 generalization) before the next agent PR. Item 10 can land as part of either the existing Plan C completion work or a tiny follow-up commit.


@boldfield boldfield left a comment


Re-review: PR #105 after review-fix commits (d1691af + 2 CI fixes)

The review-fix commit addresses the prior round's findings well: find_nth_codepoint eliminates the Vec allocation, the multi-codepoint error count uses codepoint-stepping, and E0060 rejects ==/!= on Char. Four new issues surfaced in this pass.


1. Bug: E0060 Char rejection only checks the left operand — compiler/src/typecheck.rs

The new E0060 check guards on lt.as_ref() only:

if let Some(a) = lt.as_ref() {
    if matches!(a, Ty::Char) { /* emit E0060 */ }
}

If the right operand is Char but the left isn't (or is None from a prior error), the Char-specific guidance message doesn't fire. For 42 == 'a', the generic type-mismatch arm catches it (different types), so the user still gets an error — just without the "use char_eq" hint. The only real gap: lt = None (prior type error) + rt = Char → the check is skipped entirely, but since lt = None means there's already a compile error, this isn't a soundness hole.

Low severity. Fix: check both sides.

let is_char = lt.as_ref().map_or(false, |t| matches!(t, Ty::Char))
           || rt.as_ref().map_or(false, |t| matches!(t, Ty::Char));
if is_char { /* emit E0060 */ }

2. Design: duplicated UTF-8 decoder — runtime/src/char.rs

decode_codepoints_lossy (~line 296) and find_nth_codepoint (~line 422) are independent hand-rolled UTF-8 decoders with nearly identical logic (leading-byte dispatch, continuation validation, overlong/surrogate rejection, U+FFFD resync). They must agree on codepoint boundaries for string_chars and string_char_at to be consistent on the same input. Today they do agree, but there's no shared code enforcing that invariant.

A decode_next_codepoint(bytes: &[u8], offset: usize) -> (u32, usize) stepping helper used by both would eliminate the duplication. decode_codepoints_lossy calls it in a loop pushing to a Vec; find_nth_codepoint calls it counting to idx with early-exit.

Also: the string_char_at codegen path still calls find_nth_codepoint twice (once in validate, once in fetch). The Vec allocation is gone, which was the big win, but each call still walks from byte 0 to codepoint idx. A combined sigil_string_char_at_try(s, idx) -> *mut u8 | null returning null for out-of-bounds would halve the work. Not blocking — the current shape is correct — but worth noting since it's the same structural issue from last round, just with lower constant factor.
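A hedged sketch of the proposed dedup shape (function names are from this review; the real runtime/src/char.rs decoder may differ in its branch ranges and resync policy — consuming one byte per rejected unit is an assumption here):

```rust
/// Single stepping decoder: returns (codepoint, bytes consumed).
/// Invalid input yields (0xFFFD, 1) — per-byte resync, hand-rolled
/// rather than std-backed, matching the review's description.
fn decode_next_codepoint(bytes: &[u8], offset: usize) -> (u32, usize) {
    let b0 = bytes[offset];
    let (len, min, init) = match b0 {
        0x00..=0x7F => return (b0 as u32, 1),             // ASCII fast path
        0xC0..=0xDF => (2, 0x80u32, (b0 & 0x1F) as u32),  // 2-byte lead
        0xE0..=0xEF => (3, 0x800, (b0 & 0x0F) as u32),    // 3-byte lead
        0xF0..=0xF7 => (4, 0x1_0000, (b0 & 0x07) as u32), // 4-byte lead
        _ => return (0xFFFD, 1),                          // stray continuation
    };
    let mut cp = init;
    for i in 1..len {
        match bytes.get(offset + i) {
            Some(&b) if b & 0xC0 == 0x80 => cp = (cp << 6) | (b & 0x3F) as u32,
            _ => return (0xFFFD, 1), // truncated or bad continuation
        }
    }
    // Overlong / surrogate / out-of-range rejection lives in one place.
    if cp < min || (0xD800..=0xDFFF).contains(&cp) || cp > 0x10FFFF {
        return (0xFFFD, 1);
    }
    (cp, len)
}

/// Drives string_chars: one loop over the stepper.
fn decode_codepoints_lossy(bytes: &[u8]) -> Vec<u32> {
    let (mut out, mut offset) = (Vec::new(), 0);
    while offset < bytes.len() {
        let (cp, len) = decode_next_codepoint(bytes, offset);
        out.push(cp);
        offset += len;
    }
    out
}

/// Drives string_char_at: same stepper, early exit at codepoint idx.
fn find_nth_codepoint(bytes: &[u8], idx: usize) -> Option<u32> {
    let (mut offset, mut count) = (0, 0);
    while offset < bytes.len() {
        let (cp, len) = decode_next_codepoint(bytes, offset);
        if count == idx {
            return Some(cp);
        }
        count += 1;
        offset += len;
    }
    None
}
```

With this shape, codepoint-boundary agreement between string_chars and string_char_at holds by construction: both consumers only ever see boundaries the shared stepper produces.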


3. Bug: spec cross-reference §3.4 should be §3.1.1spec/language.md

In the Char literal description:

use the named functions char_eq / char_lt etc. (§3.4)

Section 3.4 is "Inference rules (overview)." The correct reference is §3.1.1 ("Char and codepoint string operations"), which is the new subsection added by this same PR.


4. Bug: string_byte_at listed twice — std/string.sigil

The updated preamble reads:

Byte-indexed ops (string_byte_at, string_substring, string_index_of, string_byte_at) take and return byte offsets

The second string_byte_at should probably be string_length or another op — or just deleted.


Non-blocking observations

Test gap: No test exercises find_nth_codepoint on overlong or surrogate inputs. string_chars_overlong_2byte_replaces tests decode_codepoints_lossy via sigil_string_chars, but the string_char_at path through find_nth_codepoint is only tested on valid multi-byte strings. A test like string_char_at on [0xC0, 0x80, b'a'] would pin that the two decoders agree on the replacement-codepoint count for the same invalid input.
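A minimal version of that agreement check, using std's lossy decode purely as an oracle (std performs maximal-subpart substitution, which happens to coincide with per-byte replacement on this particular input):

```rust
/// Reference decode via std — oracle only; the runtime decoder is hand-rolled.
fn lossy_chars(bytes: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(bytes).chars().collect()
}
```

For `[0xC0, 0x80, b'a']` this yields two replacement characters followed by `'a'`, so the matching runtime test would pin `find_nth_codepoint(&bytes, 2)` to the `'a'` codepoint.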

Design note: \0 and \u{HEX} escapes are now supported in char literals but not in string literals. "\u{1F600}" and "\0" produce E0010 from the string lexer. This is likely to surprise users. Not blocking for this PR, but worth tracking as a follow-up.


Codegen layer is clean — all EnvSlotKind::Char sites route through the pointer path, FFI signatures match, stackmap placeholders are on all allocating calls, block sealing is correct. The find_nth_codepoint refactor is a clear improvement over the prior double-Vec-decode. Items 3 and 4 are trivial fixes; items 1 and 2 are low-priority design improvements.

Combined fixes for:
- **Review B item 3** (PR review 4246507154, 17:59) — comment style
- **Review C items 9 + 10** (issue comment 4399992055, 18:30) —
  Float/Int64 == extension + e2e lossy-decode test
- **Review D items 1–4** (PR review 4246835141, 18:46) — E0060 both
  operands, decoder dedup, spec §3.4 typo, std/string duplicate

### B item 3 — Plan-C-addendum comment spam pruned

Deleted 13 redundant single-line `// Plan C addendum (Char) — boxed
Char is pointer-typed.` comments adjacent to `EnvSlotKind::Char | ...`
match arms across `compiler/src/codegen.rs`. The match-arm context
alone makes the change self-evident; the load-bearing ones (literal
lowering, type-mapping fns, dispatch arms, helpers, FuncIds struct,
runtime counters, ast.rs `is_pointer()`, layout.rs
`is_gc_pointer_ty`) stay.

### C item 9 — Heap-boxed-primitive == rejection extended to Float / Int64

The earlier Char-only E0060 rejection now generalizes to all three
heap-boxed primitives. New `BoxedPrim` enum + `boxed_primitive_eq_name`
helper drive the typecheck arm; per-type suggestion +
ordering hint string. Float adds the NaN-aware caveat in the error
message. Four new typecheck unit tests pin Float / Int64 `==` and
`!=` rejection.

### C item 10 — e2e test for user-visible lossy UTF-8 decode

`string_chars_invalid_utf8_replaces` constructs a `ByteArray` with
a known-invalid byte (`0xFF`) via `byte_array_alloc` +
`byte_array_concat`, bypasses validation via
`string_from_bytes_alloc` (the post-validation primitive copies
bytes verbatim), and verifies `string_chars` emits U+FFFD (65533)
for the invalid byte. Closes the runtime → user-program coverage
gap.

### D item 1 — E0060 char check now guards both operands

The earlier check only inspected `lt.as_ref()`. Now both `lt` and
`rt` are checked via `lt_boxed.or(rt_boxed)`, so `42 == 'a'` and
similar shapes still fire the named-function suggestion even when
LHS is non-Char or `None`.

### D item 2 — UTF-8 decoder deduplicated

Extracted `decode_next_codepoint(bytes, offset) -> (cp, len)` as
the single source of truth for Sigil's lossy UTF-8 decode.
`decode_codepoints_lossy` (drives `string_chars`) and
`find_nth_codepoint` (drives `string_char_at`) both step through
it, so codepoint-boundary agreement is now identical-by-
construction rather than agree-by-coincidence. New runtime test
`string_char_at_overlong_replaces` exercises `find_nth_codepoint`
on `[0xC0, 0x80, b'a']` (overlong + ASCII) to pin that the two
entry points produce the same codepoint count for invalid input.

### D item 3 — spec §3.4 → §3.1.1

The Char literal entry's "use the named functions (§3.4)" pointer
referenced "Inference rules (overview)"; corrected to §3.1.1
("Char and codepoint string operations") which is the new
subsection.

### D item 4 — `std/string.sigil` duplicate removed

The byte-indexed surface preamble listed `string_byte_at` twice;
replaced the second occurrence with `string_length`.

### Out of scope (acknowledged in PR reply)

- D non-blocking observation #1 (test gap on find_nth_codepoint with
  invalid input): closed by `string_char_at_overlong_replaces` above.
- D non-blocking observation #2 (`\u{HEX}` / `\0` not supported in
  string literals): tracked as a separate follow-up since the
  current behavior produces a clear E0010, not a silent miss.
@boldfield

You're right — I'd missed two of the four reviews on the first pass. Going back through them all, the breakdown was:

  • Review A (issue comment, 18:06) — addressed in d1691af
  • Review B (PR review, 17:59) — items 1+2 addressed in d1691af; item 3 (style) just addressed in fb264b2
  • Review C (issue comment, 18:30) — items 9+10 addressed in fb264b2
  • Review D (PR review, 18:46) — all 4 items + non-blocking test gap addressed in fb264b2

fb264b2 ships:

B item 3 — pruned 13 redundant // Plan C addendum (Char) — boxed Char is pointer-typed. comments adjacent to EnvSlotKind::Char | ... match arms; kept the load-bearing ones.

C item 9 — extended E0060 == / != rejection from Char to all three heap-boxed primitives (Char / Float / Int64). New BoxedPrim enum + boxed_primitive_eq_name helper drive the rejection; per-type suggestion (char_eq / float_eq / int64_eq) + ordering hint in the error message. Float gets a NaN-aware caveat. Four new typecheck unit tests pin Float / Int64 rejection.
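A hedged sketch of that helper's shape — the enum and function names come from the commit description, but the surrounding typecheck wiring (E0060 emission, ordering hints, the NaN caveat) is omitted:

```rust
/// The three heap-boxed primitives whose == / != lower to pointer-identity
/// icmp and must therefore be rejected in favor of named functions.
#[derive(Clone, Copy)]
enum BoxedPrim {
    Char,
    Float,
    Int64,
}

/// Suggested replacement function named in the E0060 diagnostic.
fn boxed_primitive_eq_name(p: BoxedPrim) -> &'static str {
    match p {
        BoxedPrim::Char => "char_eq",
        BoxedPrim::Float => "float_eq",
        BoxedPrim::Int64 => "int64_eq",
    }
}
```

Driving all three rejections through one helper keeps the class bug closed uniformly: adding a fourth boxed primitive later means one new enum variant, not a fourth hand-written match arm.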

C item 10 — new e2e string_chars_invalid_utf8_replaces constructs an invalid-UTF-8 String via byte_array_alloc + byte_array_concat + string_from_bytes_alloc (which copies bytes verbatim per the validate-then-construct contract) and verifies string_chars emits U+FFFD (65533).

D item 1 — E0060 char check now guards both operands via lt_boxed.or(rt_boxed), so 42 == 'a' shapes still surface the named-function hint.

D item 2 — extracted decode_next_codepoint(bytes, offset) -> (cp, len) as the single source of truth; decode_codepoints_lossy and find_nth_codepoint both step through it. Codepoint-boundary agreement is now identical-by-construction. The closely-related test gap from D's non-blocking observations is closed by a new string_char_at_overlong_replaces runtime test that exercises find_nth_codepoint on [0xC0, 0x80, b'a'].

D item 3 — spec cross-reference §3.4 → §3.1.1 fixed.

D item 4 — duplicate string_byte_at in std/string.sigil preamble replaced with string_length.

D's second non-blocking observation (\u{HEX} / \0 escapes not supported in string literals) is tracked as a follow-up; current behavior produces a clear E0010, not a silent miss.

D's note on the string_char_at_try two-walk reduction acknowledged: each call now early-exits at codepoint idx, so the structural cost is O(idx + decode_cost) per call. A combined try returning *mut u8 | null would halve that further; queued as a separate optimization.

@boldfield boldfield merged commit e17c02f into main May 7, 2026
4 checks passed
@boldfield boldfield deleted the plan-c-char-codepoint branch May 7, 2026 19:06