Boxed Float (IEEE 754 f64) type #101

Merged
boldfield merged 9 commits into main from plan-c-float-type on May 6, 2026
@boldfield
Owner

Summary

  • Add Float as a boxed IEEE 754 f64 type following the Int64 boxed-scalar pattern
  • Header constant TAG_FLOAT = 0x0B, 21 runtime FFI primitives (arithmetic, comparison, math, conversion)
  • Float literal syntax in lexer/parser (3.14, 1e10, 2.5e-3), type system integration, full codegen wiring
  • float_to_string always includes .0 for whole numbers (e.g., 4.0 not 4); inf/NaN unchanged
  • 10 e2e tests + runtime unit tests covering the full surface
  • spec/language.md updated: Float in literal syntax, type table, expression forms, stdlib reference, runtime model; removed from v1 limits
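The float_to_string rule from the summary can be sketched in a few lines of Rust. This is an illustration only, not the runtime's code: the name `float_to_string_sketch` is hypothetical, and it assumes the formatter starts from Rust's default `Display` output for `f64`.

```rust
/// Hypothetical sketch of the ".0" rule described in the summary.
/// Rust's `Display` for f64 prints whole numbers without a fractional
/// part ("4") and prints infinities and NaN as "inf" / "NaN".
fn float_to_string_sketch(x: f64) -> String {
    let s = format!("{}", x);
    if x.is_finite() && !s.contains('.') {
        // Whole number: append ".0" so it cannot be mistaken for an Int.
        format!("{}.0", s)
    } else {
        // Already has a fractional part, or is inf / NaN: leave unchanged.
        s
    }
}
```

Under these assumptions, `float_to_string_sketch(4.0)` yields `4.0`, while `3.14`, `inf`, and `NaN` pass through untouched.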

Key design decisions

  • ABI: sigil_float_box accepts an i64 bit pattern (not f64) to stay in Cranelift's integer register class; codegen uses f64::to_bits() as i64 → iconst(I64) → call
  • Division: IEEE 754 semantics (div-by-zero → ±Inf, not abort)
  • float_to_string formatting: appends .0 to whole numbers so Float values are always visually distinct from Int
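The bit-pattern ABI and the IEEE 754 division rule can be checked with plain Rust. The helper names below are hypothetical stand-ins for the two sides of the call (the constant codegen would hand to `iconst`, and what the runtime would recover); only `f64::to_bits` and `f64::from_bits` are real standard-library calls.

```rust
/// Hypothetical: the integer that codegen would pass for a float literal.
fn float_bits_for_iconst(x: f64) -> i64 {
    x.to_bits() as i64
}

/// Hypothetical: how the callee can recover the f64 from that integer.
fn float_from_boxed_bits(bits: i64) -> f64 {
    f64::from_bits(bits as u64)
}

/// IEEE 754 division: a zero divisor yields an infinity or NaN, never a trap.
fn ieee_div(a: f64, b: f64) -> f64 {
    a / b
}
```

The round trip is lossless because both casts reinterpret the same 64 bits. The sign bit of `-0.0` survives (its pattern is `i64::MIN`), and NaN stays NaN.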

Test plan

  • 10 e2e tests: literals, arithmetic, negation, div-by-zero→inf, from_int round-trip, comparisons, floor/ceil, string parse/validate, NaN≠NaN, doc-only import
  • Runtime unit tests: boxing, arithmetic, comparison, math, conversion, to_string formatting (3.14, 4.0, inf, NaN)
  • Full workspace test suite passes (330/330; 3 pre-existing perf-floor flakes excluded)

Implements plan: 2026-05-05-sigil-float.md

🤖 Generated with Claude Code

boldfield and others added 6 commits May 5, 2026 21:38
Add TAG_FLOAT = 0x0B, FloatAllocCount/FloatAllocBytes counters, and
runtime/src/float.rs with 21 FFI primitives: box/unbox, 5 arithmetic,
5 comparison, 4 math, and 5 conversion ops. All follow the Int64
boxed-scalar pattern (atomic alloc, count=1, bitmap=0).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add FloatLit(f64) token and Expr::FloatLit(f64, Span) AST variant.
The lexer recognizes `3.14`, `1e10`, `3.14e-2` as float literals
(requires digits before and after the dot). Parser maps the token
and constant-folds unary negation on float literals. All match arms
across resolve, monomorphize, closure_convert, color, elaborate,
typecheck, and codegen updated for exhaustiveness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Register opaque Float type in builtin_types(). Add
register_builtin_float_schemes() with 16 primitives: 4 arithmetic,
5 comparison, 4 math, 5 conversion/string ops. Float literal
inference (Expr::FloatLit → Ty::User("Float")) already added in
Task F2's exhaustiveness pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 20 FFI declarations for float runtime primitives to
BuiltinFuncIds/BuiltinFuncRefs. Float literals lower via
f64::to_bits() → iconst → sigil_float_box (bit-pattern calling
convention). All 19 callable primitives dispatch through lower_call
with correct stackmap placeholders for allocating ops.

Fix sigil_float_box to accept i64 bit pattern (integer register
class) instead of f64, matching codegen's iconst path.

Add std/float.sigil (doc-only) and imports.rs BUILTIN_INJECTED entry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10 e2e tests covering float literals, arithmetic, negation,
div-by-zero→inf, from_int round-trip, comparisons, floor/ceil,
string parse/validate, NaN≠NaN, doc-only import.

float_to_string now appends ".0" to whole-number results so Float
values are always visually distinguishable from Int (inf/NaN
unchanged). Runtime unit tests extended to cover the formatting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Float to: literal syntax (§1), type table (§3), expression
forms (§4.1), stdlib reference (§13), runtime model (§12). Remove
"no Float type" from v1 limits (§14).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner Author

@boldfield boldfield left a comment

Code Review: Boxed Float Type

Clean implementation that mirrors the established Int64 boxed-scalar pattern. Correctness looks good across all layers. A few issues worth addressing:

Issues

1. Lexer: greedy exponent consumption produces misleading errors

The exponent branch (line ~164 of lexer.rs) unconditionally consumes e/E and an optional sign without verifying that exponent digits follow. Input like 1e or 1e+ gets consumed and then fails with "float literal 1e is out of range" — but it's not out of range, it's malformed.

A simple fix: peek ahead for at least one digit before committing to the exponent parse. Otherwise, leave the e unconsumed and treat the preceding part as an integer literal. This matters because .sigil source with a typo (1e) currently eats characters that belong to the next token and gives a confusing diagnostic.
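A minimal sketch of the suggested lookahead, assuming a byte-based cursor. `NumKind` and `scan_number` are hypothetical names rather than the actual lexer.rs API, and the sketch only classifies the literal and reports where it ends.

```rust
#[derive(Debug, PartialEq)]
enum NumKind {
    Int,
    Float,
}

/// Scans a numeric literal starting at `start`; returns its kind and the
/// exclusive end offset. The exponent is consumed only if a digit follows.
fn scan_number(src: &[u8], start: usize) -> (NumKind, usize) {
    let mut pos = start;
    let mut kind = NumKind::Int;
    while pos < src.len() && src[pos].is_ascii_digit() {
        pos += 1;
    }
    // Fraction: the dot counts only when a digit comes right after it.
    if pos + 1 < src.len() && src[pos] == b'.' && src[pos + 1].is_ascii_digit() {
        kind = NumKind::Float;
        pos += 1;
        while pos < src.len() && src[pos].is_ascii_digit() {
            pos += 1;
        }
    }
    // Exponent: peek past `e`/`E` and an optional sign before committing.
    if pos < src.len() && (src[pos] == b'e' || src[pos] == b'E') {
        let mut look = pos + 1;
        if look < src.len() && (src[look] == b'+' || src[look] == b'-') {
            look += 1;
        }
        if look < src.len() && src[look].is_ascii_digit() {
            kind = NumKind::Float;
            pos = look;
            while pos < src.len() && src[pos].is_ascii_digit() {
                pos += 1;
            }
        }
        // Otherwise `pos` still points at `e`, left for the next token.
    }
    (kind, pos)
}
```

With this shape, `1e` and `1e+` stop after the `1` and stay integer literals, so the trailing characters remain available to the next token.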

2. Dead code in codegen: make_float_binop takes unused name parameter

let make_float_binop = |name: &str, sig_holder: &mut Signature| {
    // ...
    let _ = name;
};

The name parameter is immediately discarded. The closure doesn't need it — drop the parameter.

Observations (not blocking)

  • float_to_int clamping is redundant with Rust's saturating as semantics (f64 as i64 already returns 0 for NaN, i64::MAX for overflow, i64::MIN for underflow since Rust 1.45). The explicit checks make the contract more visible to readers though, so not wrong — just worth knowing that removing them wouldn't change behavior.

  • "Plan D" comments appear throughout the codegen additions. The PR summary and branch name suggest this is a Plan C continuation (Float type for the Sigil language). If "Plan D" is intentional naming, ignore this — just flagging potential naming drift.
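The float_to_int observation above can be verified directly. The function below is a hypothetical reconstruction of "explicit clamping" (the PR's actual checks are not shown on this page); the point is that it agrees with a bare `as` cast on every input.

```rust
/// Hypothetical explicit-clamp version of a float-to-int conversion.
fn float_to_int_explicit(x: f64) -> i64 {
    if x.is_nan() {
        0
    } else if x >= i64::MAX as f64 {
        i64::MAX
    } else if x <= i64::MIN as f64 {
        i64::MIN
    } else {
        x as i64
    }
}

/// The bare cast: saturating since Rust 1.45 (NaN maps to 0).
fn float_to_int_cast(x: f64) -> i64 {
    x as i64
}
```

Both functions truncate toward zero for in-range values and saturate at the `i64` bounds, so the explicit checks document the contract without changing results.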

Verdict

Solid work. The lexer exponent greediness is the only thing I'd want fixed before merge — it'll bite someone debugging a typo in float-heavy code and getting a confusing diagnostic. The dead parameter is a one-line cleanup.

boldfield and others added 3 commits May 5, 2026 22:36
1. Lexer: exponent branch now peeks ahead for at least one digit
   before committing to consume `e`/`E` and optional sign. `1e`
   lexes as IntLit(1) + Ident("e"), not a malformed float.
   Three new lexer unit tests pin the behavior.

2. Codegen: remove unused `name` parameter from `make_float_binop`
   closure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change test values from 3.14/2.718 (approx PI/E) to 3.125/2.75
  to avoid clippy::approx_constant lint
- Add SAFETY comments on .as_ptr() calls for interior-pointer check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@boldfield boldfield merged commit 1fd9e0a into main May 6, 2026
4 checks passed
@boldfield boldfield deleted the plan-c-float-type branch May 6, 2026 05:52
boldfield added a commit that referenced this pull request May 7, 2026
* [Task CH1] Char header tag + runtime primitives

Adds TAG_CHAR=0x0C and a complete `runtime/src/char.rs` module
covering boxing, equality/ordering, conversion, ASCII classifiers,
ASCII case folding, and the load-bearing UTF-8 walkers
(`string_chars`, `string_char_at`, `string_from_chars`).

`string_chars` and `string_from_chars` accept Cons / Nil header
words and discriminants from the codegen call site so the runtime
stays free of compile-time `List[Char]` layout knowledge while
still emitting well-typed `List[Char]` values. Lossy UTF-8 decode
emits U+FFFD per invalid byte and resyncs at the next valid leading
byte.

Adds `CharAllocCount` / `CharAllocBytes` counters; 32 unit tests
cover box/unbox round-trips for ASCII / 2-byte / 3-byte / 4-byte
codepoints, every classifier, ASCII case passthrough on non-ASCII,
`int_to_char_validate` boundary cases (surrogates, > 0x10FFFF,
negative), `string_chars` on valid + invalid UTF-8 input, codepoint-
indexed `string_char_at`, and `string_from_chars` round-trip through
`string_chars`.

* [Task CH2] Lexer + parser — Char literal extensions

Extends the existing `'x'` Char literal lexer with the full plan-spec
escape and validity surface:

- New escape sequences: `\"`, `\0`, `\u{HEX}` (1–6 hex digits).
- Bare multi-byte codepoints (`'é'`, `'中'`, `'😀'`) decoded via a
  new UTF-8 peek/advance helper pair on the byte-based cursor; the
  prior `self.src[pos] as char` shortcut would otherwise see N
  source bytes and reject as multi-codepoint.
- Compile-time rejection of `\u{...}` values > 0x10FFFF and any
  surrogate `0xD800..=0xDFFF` (E0010 with descriptive message).
- Compile-time rejection of empty `\u{}` and >6-hex-digit escapes.
- Multi-codepoint literal bodies (`'ab'`, `'ab\u{41}'`) now produce
  the targeted "Char literal must be a single codepoint; got N"
  diagnostic rather than the prior "expected closing `'`" surface.
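The `\u{HEX}` rules listed above (1 to 6 hex digits, no surrogates, nothing above 0x10FFFF) map closely onto `char::from_u32`. The sketch below is an assumption about how such a check could look; `parse_unicode_escape` and its error strings are hypothetical, and the real lexer reports E0010 diagnostics instead.

```rust
/// Hypothetical validator for the text between the braces of `\u{...}`.
fn parse_unicode_escape(hex: &str) -> Result<char, String> {
    if hex.is_empty() || hex.len() > 6 {
        return Err(format!("expected 1 to 6 hex digits, got {}", hex.len()));
    }
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("non-hex digit in escape `{}`", hex));
    }
    let value = u32::from_str_radix(hex, 16).map_err(|e| e.to_string())?;
    // `char::from_u32` returns None for surrogates and values > 0x10FFFF.
    char::from_u32(value)
        .ok_or_else(|| format!("U+{:04X} is not a Unicode scalar value", value))
}
```

Returning a Rust `char` means the token payload carries the post-validation invariant by construction, which matches the note below about `CharLit(char)`.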

Token type stays `CharLit(char)` — Rust's `char` already enforces
the post-validation invariant (no surrogates, ≤ 0x10FFFF). Existing
parser-side mapping (`Token::CharLit` → `Expr::CharLit`) is unchanged.

15 new lexer tests cover every escape, each Unicode boundary
(0xD7FF / 0xE000 / 0x10FFFF), each rejection case (empty braces,
too many digits, out-of-range, low/high surrogates), and the three
multi-byte bare-codepoint widths.

* [Task CH3] Char builtin schemes — 19 user-facing primitives

Registers the user-facing Char primitives via a new
`register_builtin_char_schemes(tc)` mirroring
`register_builtin_float_schemes`:

- 5 equality / ordering: char_eq, char_lt, char_le, char_gt, char_ge
  — `(Char, Char) -> Bool`
- 3 conversion: char_to_int `(Char) -> Int`, int_to_char
  `(Int) -> Option[Char]`, char_to_string `(Char) -> String`
- 5 ASCII classifiers: is_ascii, is_ascii_digit, is_ascii_alpha,
  is_ascii_alphanumeric, is_ascii_whitespace — `(Char) -> Bool`
- 2 ASCII case: to_lower_ascii, to_upper_ascii — `(Char) -> Char`
- 4 string codepoint ops: string_chars `(String) -> List[Char]`,
  string_char_at `(String, Int) -> Option[Char]`, string_from_chars
  `(List[Char]) -> String`

Char itself is already a Ty::Char primitive (pre-existing) and
literal type inference (`Expr::CharLit -> Ty::Char`) is unchanged.
The validator helpers `int_to_char_validate` /
`string_char_at_validate` are codegen-internal — not registered as
user-callable schemes; codegen will lower the user-facing
`int_to_char` / `string_char_at` to the validate-then-construct
pattern in CH4.
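The validate-then-construct shape of `int_to_char` can be mirrored in Rust, where `Option<char>` plays the role of `Option[Char]`. This is a sketch under that analogy; `int_to_char_sketch` is a hypothetical name and says nothing about how the runtime validator is written.

```rust
use std::convert::TryFrom;

/// Hypothetical analogue of `int_to_char : (Int) -> Option[Char]`.
/// Negative values fail the u32 conversion; surrogates and values above
/// 0x10FFFF are rejected by `char::from_u32`.
fn int_to_char_sketch(n: i64) -> Option<char> {
    u32::try_from(n).ok().and_then(char::from_u32)
}
```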

13 typecheck unit tests cover each user-facing scheme: round-trip
type inference, Option[Char] / List[Char] return types, E0044 on
wrong-argument-type calls, E0045 on Char-vs-Int annotation
mismatch, and Char literal default inference.

* [Task CH4] Codegen — boxed Char + 19-primitive dispatch

Converts Char from an I32 codepoint immediate to a `TAG_CHAR`-headed
heap pointer (mirroring Float / Int64 / String) and wires the 19
user-facing Char primitives through `lower_call`.

Type-mapping changes:

- `cranelift_ty_for_type_expr("Char")` and `cranelift_ty_of_ty(Ty::Char)`
  return `pointer_ty` instead of `types::I32`.
- `Expr::CharLit(c, _)` lowers to `iconst(I64, codepoint) →
  sigil_char_box → pointer_ty` (with stackmap placeholder), no longer
  bare `iconst(I32, c)`.
- `type_of_expr(Expr::CharLit)` returns `pointer_ty`.
- `Pattern::CharLit(c)` loads the u32 codepoint at offset 8 from the
  boxed Char scrutinee, zero-extends to i64, and tests for equality
  against the literal codepoint.

Primitive dispatch:

- 17 simple FFI dispatch arms in `lower_call` for char_eq / lt / le /
  gt / ge / char_to_int / char_to_string / is_ascii(_digit / _alpha /
  _alphanumeric / _whitespace) / to_lower_ascii / to_upper_ascii.
- `int_to_char` and `string_char_at` lower to a validate-then-construct
  pattern: validator → `brif(ok==0)` → Some/None branches → merge
  block with a `pointer_ty` block-param. Some-branch builds via
  `lower_ctor_alloc(Option$$Char, some_idx, [char_ptr])`; None via
  `lower_ctor_alloc(..., none_idx, [])`.
- `string_chars` and `string_from_chars` thread the codegen-computed
  `List$$Char` Cons / Nil header words and discriminants to the
  runtime as i64 immediates; runtime stamps them through `sigil_alloc`
  to build well-typed `List[Char]` cells.

`option_layout_for(Ty)` and `list_char_layout_immediates()` are the
two private helpers that resolve the monomorphized layout via
`mangle_type` / `mangle_ctor` and the `ctor_index` / `type_layouts`
side-tables.

`type_of_expr` predictions extended with the 19 Char op return types
(I8 for boolean classifiers / comparators, I64 for `char_to_int`,
`pointer_ty` for the rest). Globals set extended with the 17 user-
callable surface names (the validator helpers
`int_to_char_validate` / `string_char_at_validate` are codegen-
internal only).

* [Task CH4] Slot widening + GC bitmap for boxed Char

Boxed Char (TAG_CHAR) is a heap pointer; the closure-record /
sum-ctor / tuple-element slot-widening logic pre-Plan-C-addendum
treated `EnvSlotKind::Char` and `Ty::Char` as a sub-word I32
primitive (uextend on store, ireduce(I32) on load). Both directions
must drop for boxed Char — the slot already holds a `pointer_ty`
value, and narrowing a pointer to I32 corrupts it.

Updates:

- `EnvSlotKind::is_pointer()` now matches `Char` alongside
  `String / Closure / User`, so closure-record bitmap bits are
  set correctly for boxed-Char captures.
- Every match site that branched on `EnvSlotKind::Char` (closure
  store / load, synth-cont capture pack / unpack, post-arm-k
  capture pack, tail-prefix-let widen, narrow_for_kind helpers,
  `type_of_expr` for `Expr::ClosureEnvLoad`) routes Char through
  the pointer-typed branch.
- `Ty::Char` in `load_field_value` and the tuple-pattern element
  loader drops the I32 narrow.
- `is_gc_pointer_ty(Ty::Char)` returns true so sum-type field
  bitmaps mark Char fields for GC tracing (cosmetic on Boehm's
  conservative scan, load-bearing for any future precise GC).

Existing e2e tests that exercised Char-as-I32 semantics:

- `perform_side_narrow_to_char_value_checked`: rewrites
  `c == 'Y'` to `char_eq(c, 'Y')` (pointer equality wouldn't
  match codepoint equality for boxed Chars).
- `cps_abi_captures_bearing_with_char_capture_exercises_widen_narrow_symmetry`:
  same `==` → `char_eq` rewrite; the test
  still pins capture flow through synth-cont closure records,
  but with no width narrowing.
- `arm_reads_char_arg_branches_on_codepoint`: same `==` →
  `char_eq` rewrite; the I32 → I8 split with the Bool test
  collapses (both are now pointer-store paths).
- `task_78_5_g4_approach6_b_neq_r_pointer_return_arm_through_char_body`:
  B != R width discrepancy collapses (B = Char =
  pointer_ty, R = String = pointer_ty); test docstring updated
  to reflect post-addendum reality. Test still pins the wrapper
  composition end-to-end.

* [Task CH5] e2e tests + std/char.sigil + std/string.sigil + PLAN_C docs

- 14 e2e tests cover the full Char surface: char_literal_round_trips_via_to_string,
  char_codepoint_round_trip, int_to_char_rejects_out_of_range,
  int_to_char_rejects_surrogate, int_to_char_accepts_valid,
  is_ascii_classifiers_basic, to_lower_upper_ascii_passthrough,
  string_chars_ascii, string_chars_multibyte,
  string_char_at_codepoint_index, string_from_chars_round_trip,
  char_pattern_match_against_literal, char_eq_distinguishes_different_codepoints,
  char_doc_only_import.

- std/char.sigil ships as a documentation-only file (added to
  BUILTIN_INJECTED skip-list). Covers literal syntax, the 19
  user-facing primitives, the ASCII-only v1 scope + v2 closure path,
  the byte-vs-codepoint cross-reference to std/string.sigil, and a
  worked count_digits example.

- std/string.sigil doc preamble rewritten — the "Codepoint-aware
  variants deferred" note becomes a positive byte-vs-codepoint
  surface description; the new codepoint ops surface
  (string_chars, string_char_at, string_from_chars) is added to the
  builtin operations table with a cross-reference to std/char.sigil.

- PLAN_C_PROGRESS.md gets a Task CH section documenting the
  addendum's runtime + compiler + stdlib + test surface and the
  pre-existing Char-as-I32 → boxed-Char representation switch.

- PLAN_C_DEVIATIONS.md Task 68 entry's "deferred class 1"
  (codepoint-aware ops) and "deferred class 5"
  (char_to_int / int_to_char) marked CLOSED by this addendum.
  Class 3 (Float helpers) marked CLOSED by PR #101 (already shipped).
  Class 2 (codepoint-aware string_split / string_replace) remains
  deferred — depends on stdlib namespace qualification, not Char.

* [Task CH6] Spec update — Char primitive + codepoint string ops

Updates `spec/language.md` for the Plan C addendum:

- §1 (Lexical structure / literals): expanded `Char` literal entry
  covers the boxed representation (TAG_CHAR=0x0C, 16 bytes), the
  closed codepoint range (0x000000..=0x10FFFF excluding surrogates),
  the full escape table (`\n`, `\t`, `\r`, `\\`, `\'`, `\"`, `\0`,
  `\u{HEX}` 1–6 hex digits), bare-codepoint UTF-8 literals, and
  parse-time rejection of multi-codepoint / out-of-range / surrogate
  inputs. Calls out the deliberate absence of `==` / `<` operator
  overloading on Char.

- §3.1 (Built-in types): `Char` row updated to "Boxed Unicode
  codepoint (TAG_CHAR=0x0C, 21-bit codepoint payload)" — replaces
  the pre-addendum "1-byte codepoint" placeholder.

- §3.1.1 (new subsection): full `Char` reference covering literal
  syntax, the 19 user-facing operations grouped by purpose
  (equality / ordering, conversion, ASCII classifiers, ASCII case),
  the codepoint-aware string operations (`string_chars`,
  `string_char_at`, `string_from_chars`), the byte-vs-codepoint
  surface coexistence, and a worked `count_digits` example.

- §13 (Stdlib reference): adds a `std.char` row; extends the
  `std.string` row with the codepoint-indexed surface.

- §14 (v1 limits): new §14.1 "Deferred to follow-up plans" table
  documenting closure paths for codepoint-aware `string_split` /
  `string_replace`, Unicode-aware classifiers / case / normalization,
  and a hypothetical strict-UTF-8 `string_chars_strict`.

* [Task CH4 fix] is_gc_pointer_ty test + e2e UTF-8 source bytes

CI on PR #105 surfaced two regressions from the boxed-Char
representation switch:

1. `is_gc_pointer_ty_matches_expected_types` (layout.rs unit test)
   pinned `!is_gc_pointer_ty(Ty::Char)` — pre-addendum Char was an
   I32 immediate, so it correctly wasn't a GC pointer. Updated to
   assert `is_gc_pointer_ty(Ty::Char)` instead, with a Plan-C-
   addendum justification comment.

2. Three CH5 e2e tests embedded `\u{HEX}` escapes inside `"..."`
   string literals. Sigil's *string* lexer accepts only
   `\n / \t / \r / \\\\ / \\\"`; the `\u{...}` escape lives in
   *char* literals only. Rewrote the tests to use bare UTF-8 bytes
   in source (`"héllo"`, `"héllo 😀"`) — the bytes are valid UTF-8
   that Rust's source-tree handling preserves into the embedded
   test string verbatim, and Sigil's runtime treats `String` as a
   raw UTF-8 byte buffer.

* [Task CH4 fix 2] String lexer UTF-8 preservation + e2e mono trigger

Two more CI failures fixed:

1. Sigil's string lexer pre-Plan-C-addendum read source bytes as
   Latin-1 chars (`self.src[pos] as char`) and pushed them to a
   Rust `String` via `String::push(char)`, which UTF-8 re-encodes.
   Multi-byte source sequences double-encoded: `é` (0xC3 0xA9) →
   4 bytes (0xC3 0x83 0xC2 0xA9) inside the resulting StringLit.
   This regressed nothing pre-addendum because `string_chars` /
   `string_char_at` didn't exist; the addendum surfaces the bug
   immediately. `take_string_lit` now uses the `peek_utf8` /
   `advance_utf8` helpers (added in CH2) to decode multi-byte
   source bytes as a single codepoint and push that codepoint
   verbatim. Two new lexer tests pin the round-trip for 2-byte
   `é` and 4-byte `😀`.

2. `string_from_chars_round_trip` had no explicit `List[Char]`
   annotation in its source, so monomorphize never saw the type
   and codegen panicked with "ctor `Cons$$Char` not registered".
   Added a `let xs: List[Char] = string_chars(s)` binding to
   trigger mono via the type annotation's Apply node, mirroring
   the working `string_chars_multibyte` test's shape.
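The double-encoding described in item 1 is easy to reproduce outside the lexer. The two functions below are hypothetical stand-ins: the first treats each source byte as a Latin-1 character, the second decodes the bytes as UTF-8 before pushing.

```rust
/// Buggy shape: each byte becomes its own char, then gets UTF-8 re-encoded.
fn push_bytes_as_latin1(src: &[u8]) -> String {
    let mut out = String::new();
    for &b in src {
        out.push(b as char);
    }
    out
}

/// Fixed shape: decode whole codepoints first, then push each one.
/// Invalid UTF-8 input yields an empty string in this sketch.
fn push_decoded_codepoints(src: &[u8]) -> String {
    let mut out = String::new();
    if let Ok(text) = std::str::from_utf8(src) {
        for c in text.chars() {
            out.push(c);
        }
    }
    out
}
```

For `é` (bytes 0xC3 0xA9), the first function produces the four bytes 0xC3 0x83 0xC2 0xA9, while the second preserves the original two.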

* [Review] Address PR #105 review items 1–5 + 7

PR #105 review (boldfield, 2026-05-07): one MUST-FIX, four follow-
ups, two deferred. All five non-deferred items addressed plus a
small note for item 7.

**1 (MUST-FIX) — Reject `==` / `!=` on `Char` at typecheck.** Pre-
fix, `'a' == 'a'` typechecked and lowered to `icmp Equal l r` on
boxed Char pointers — pointer identity, not codepoint equality —
silently returning `false` at runtime. New typecheck arm in
`check_binop`'s `BinOp::Eq | BinOp::NotEq` rejects `Ty::Char`
operands with E0060 pointing at `char_eq(a, b)` (or
`!char_eq(a, b)` for `!=`). Two new typecheck unit tests pin the
rejection. Float / Int64 (also boxed) inherit the same trap;
generalizing the rejection across all heap-boxed primitives is
queued as a follow-up since the existing `float_eq` / `int64_eq`
discipline currently hides the bug.
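The trap in item 1 is the usual difference between pointer identity and value equality. The sketch below uses `Box<u32>` as a stand-in for a boxed Char (an assumption for illustration; Sigil's boxes are allocated by its own runtime) to show why comparing addresses returns `false` for equal codepoints.

```rust
/// Pointer identity: what an `icmp` on two boxed pointers would compute.
fn same_allocation(a: &Box<u32>, b: &Box<u32>) -> bool {
    std::ptr::eq(&**a, &**b)
}

/// Codepoint equality: what a `char_eq`-style primitive computes instead.
fn same_codepoint(a: &Box<u32>, b: &Box<u32>) -> bool {
    **a == **b
}
```

Two separately boxed `'a'` values live at different addresses, so `same_allocation` is false while `same_codepoint` is true.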

**2 — Refactor `string_char_at` to early-exit shared helper.** Pre-
refactor `sigil_string_char_at_validate` and `sigil_string_char_at`
each called `decode_codepoints_lossy(slice)` (full-pass + Vec
allocation) — making `for i in 0..n: char_at(s, i)` O(n²) with O(n)
transient allocations per call. New shared helper
`find_nth_codepoint(bytes, idx) -> Option<u32>` walks bytes
front-to-back with early-exit; both entry points use it. Each call
is now O(idx + decode_cost) and allocates nothing on the hot path.

**3 — `lower_ctor_alloc` comment fix.** Dropped `Char` from the
"sub-word primitives (Bool, Byte, Char, Unit)" widen-on-store
comment; boxed Char takes the pass-through branch.

**4 — Decoder overlong-rejection tests.** Two new runtime tests:
`string_chars_overlong_2byte_replaces` (`[0xC0, 0x80]` 2-byte
overlong of U+0000) and
`string_chars_overlong_3byte_surrogate_replaces`
(`[0xED, 0xA0, 0x80]` 3-byte UTF-8 form of surrogate
U+D800). Each pins a distinct decoder branch.

**5 — Lexer multi-codepoint count uses codepoint-step.** The
"Char literal must be a single codepoint; got N" diagnostic's
count loop now uses `peek_utf8` / `advance_utf8`, parity with the
literal-body decoder above. New lexer test pins
`'aé'` reports "got 2", not "got 3".
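The "got 2" versus "got 3" difference in item 5 comes down to counting codepoints instead of bytes. A small Rust illustration, with hypothetical function names:

```rust
/// Byte-step count: over-reports for any multi-byte codepoint.
fn count_by_bytes(body: &str) -> usize {
    body.len()
}

/// Codepoint-step count: one per Unicode scalar value.
fn count_by_codepoints(body: &str) -> usize {
    body.chars().count()
}
```

For the literal body `aé`, the byte count is 3 (one byte for `a`, two for `é`) and the codepoint count is 2.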

**7 (DEFERRED note) — U+FFFD merging.** Sigil v1's per-byte
replacement (matching `String::from_utf8_lossy`) is now
documented in `std/char.sigil` alongside a forward reference to
the v2 `string_chars_strict` follow-up.

Item 6 (Char in `is_gc_pointer_ty` — note for v2 precise GC) is
deferred per reviewer.

* [Review 2/3/4] Address all 4 outstanding boldfield reviews

Combined fixes for:
- **Review B item 3** (PR review 4246507154, 17:59) — comment style
- **Review C items 9 + 10** (issue comment 4399992055, 18:30) —
  Float/Int64 == extension + e2e lossy-decode test
- **Review D items 1–4** (PR review 4246835141, 18:46) — E0060 both
  operands, decoder dedup, spec §3.4 typo, std/string duplicate

### B item 3 — Plan-C-addendum comment spam pruned

Deleted 13 redundant single-line `// Plan C addendum (Char) — boxed
Char is pointer-typed.` comments adjacent to `EnvSlotKind::Char | ...`
match arms across `compiler/src/codegen.rs`. The match-arm context
alone makes the change self-evident; the load-bearing ones (literal
lowering, type-mapping fns, dispatch arms, helpers, FuncIds struct,
runtime counters, ast.rs `is_pointer()`, layout.rs
`is_gc_pointer_ty`) stay.

### C item 9 — Heap-boxed-primitive == rejection extended to Float / Int64

The earlier Char-only E0060 rejection now generalizes to all three
heap-boxed primitives. New `BoxedPrim` enum +
`boxed_primitive_eq_name` helper drive the typecheck arm; per-type suggestion +
ordering hint string. Float adds the NaN-aware caveat in the error
message. Four new typecheck unit tests pin Float / Int64 `==` and
`!=` rejection.

### C item 10 — e2e test for user-visible lossy UTF-8 decode

`string_chars_invalid_utf8_replaces` constructs a `ByteArray` with
a known-invalid byte (`0xFF`) via `byte_array_alloc` +
`byte_array_concat`, bypasses validation via
`string_from_bytes_alloc` (the post-validation primitive copies
bytes verbatim), and verifies `string_chars` emits U+FFFD (65533)
for the invalid byte. Closes the runtime → user-program coverage
gap.

### D item 1 — E0060 char check now guards both operands

The earlier check only inspected `lt.as_ref()`. Now both `lt` and
`rt` are checked via `lt_boxed.or(rt_boxed)`, so `42 == 'a'` and
similar shapes still fire the named-function suggestion even when
LHS is non-Char or `None`.

### D item 2 — UTF-8 decoder deduplicated

Extracted `decode_next_codepoint(bytes, offset) -> (cp, len)` as
the single source of truth for Sigil's lossy UTF-8 decode.
`decode_codepoints_lossy` (drives `string_chars`) and
`find_nth_codepoint` (drives `string_char_at`) both step through
it, so codepoint-boundary agreement is now identical-by-
construction rather than agree-by-coincidence. New runtime test
`string_char_at_overlong_replaces` exercises `find_nth_codepoint`
on `[0xC0, 0x80, b'a']` (overlong + ASCII) to pin that the two
entry points produce the same codepoint count for invalid input.
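A sketch of the single-decoder arrangement described in D item 2. The function names follow the commit message, but the bodies are assumptions written for illustration: one U+FFFD per invalid byte, with overlong forms, surrogates, and truncated sequences all treated as invalid.

```rust
const REPLACEMENT: u32 = 0xFFFD;

/// Decodes one codepoint at `offset`; returns (codepoint, bytes consumed).
/// Any invalid sequence consumes exactly one byte and yields U+FFFD.
fn decode_next_codepoint(bytes: &[u8], offset: usize) -> (u32, usize) {
    let b0 = bytes[offset];
    let (len, init, min) = match b0 {
        0x00..=0x7F => return (b0 as u32, 1),
        0xC0..=0xDF => (2, (b0 & 0x1F) as u32, 0x80),
        0xE0..=0xEF => (3, (b0 & 0x0F) as u32, 0x800),
        0xF0..=0xF7 => (4, (b0 & 0x07) as u32, 0x1_0000),
        _ => return (REPLACEMENT, 1), // stray continuation or invalid lead
    };
    if offset + len > bytes.len() {
        return (REPLACEMENT, 1); // truncated sequence
    }
    let mut cp = init;
    for &b in &bytes[offset + 1..offset + len] {
        if b & 0xC0 != 0x80 {
            return (REPLACEMENT, 1);
        }
        cp = (cp << 6) | (b & 0x3F) as u32;
    }
    let surrogate = (0xD800..=0xDFFF).contains(&cp);
    if cp < min || cp > 0x10FFFF || surrogate {
        return (REPLACEMENT, 1); // overlong, out of range, or surrogate
    }
    (cp, len)
}

/// Full pass, as `string_chars` would need.
fn decode_codepoints_lossy(bytes: &[u8]) -> Vec<u32> {
    let (mut out, mut off) = (Vec::new(), 0);
    while off < bytes.len() {
        let (cp, len) = decode_next_codepoint(bytes, off);
        out.push(cp);
        off += len;
    }
    out
}

/// Early-exit walk, as `string_char_at` would need; allocates nothing.
fn find_nth_codepoint(bytes: &[u8], idx: usize) -> Option<u32> {
    let (mut off, mut seen) = (0, 0);
    while off < bytes.len() {
        let (cp, len) = decode_next_codepoint(bytes, off);
        if seen == idx {
            return Some(cp);
        }
        seen += 1;
        off += len;
    }
    None
}
```

Because both walkers advance by whatever `decode_next_codepoint` reports, index `i` in one always names the same codepoint as element `i` of the other, including on invalid input such as `[0xC0, 0x80, b'a']`.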

### D item 3 — spec §3.4 → §3.1.1

The Char literal entry's "use the named functions (§3.4)" pointer
referenced "Inference rules (overview)"; corrected to §3.1.1
("Char and codepoint string operations") which is the new
subsection.

### D item 4 — `std/string.sigil` duplicate removed

The byte-indexed surface preamble listed `string_byte_at` twice;
replaced the second occurrence with `string_length`.

### Out of scope (acknowledged in PR reply)

- D non-blocking observation #1 (test gap on find_nth_codepoint with
  invalid input): closed by `string_char_at_overlong_replaces` above.
- D non-blocking observation #2 (`\\u{HEX}` / `\\0` not supported in
  string literals): tracked as a separate follow-up since the
  current behavior produces a clear E0010, not a silent miss.