Add TextEncoder and TextDecoder built-ins#272
Conversation
Implements the WHATWG Encoding §8 TextEncoder and TextDecoder APIs.
TextEncoder (always UTF-8):
- encode(string) → Uint8Array
- encodeInto(string, Uint8Array) → {read, written}
- encoding getter → 'utf-8'
TextDecoder (UTF-8 only; RangeError for other labels):
- new TextDecoder([label, {fatal, ignoreBOM}])
- decode([BufferSource]) → string
- encoding, fatal, ignoreBOM getters
- UTF-8 BOM stripping (EF BB BF) unless ignoreBOM is set
- Fatal mode: validates byte sequences and throws TypeError on invalid UTF-8
Both are in DefaultGlobals and registered via the standard
RegisterBuiltinConstructors/RegisterTypeDefinition pipeline.
69 new tests across 9 files covering constructor, encode, encodeInto,
decode, encoding/fatal/ignoreBOM getters.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds WHATWG-style TextEncoder and TextDecoder: constructors, prototypes (encode/encodeInto/decode), accessor properties (encoding, fatal, ignoreBOM), engine/bootstrap registration, class/value implementations, and comprehensive Jest tests for behavior and error cases. Changes
Sequence Diagram(s)sequenceDiagram
participant JS as JS caller
participant Global as GlobalConstructor (TextDecoder)
participant Engine as Engine/BuiltinRegistrar
participant Instance as TextDecoderInstance
participant Validator as UTF8Validator
JS->>Global: new TextDecoder(label?, options?)
Global->>Engine: request create instance
Engine->>Instance: InitializeNativeFromArguments(label, options)
JS->>Instance: instance.decode(buffer)
Instance->>Validator: (if fatal) validate UTF-8 / strip BOM
Validator-->>Instance: validated or throws
Instance-->>JS: return decoded JS string
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
units/Goccia.Builtins.GlobalTextDecoder.pas (1)
35-37: Replace hardcoded constructor name with shared constant.Line 36 hardcodes
'TextDecoder'; useCONSTRUCTOR_TEXT_DECODERfor consistency and drift prevention.♻️ Proposed refactor
uses Goccia.Arguments.Collection, Goccia.Builtins.Base, + Goccia.Constants.ConstructorNames, Goccia.Error.ThrowErrorCallback, Goccia.Scope, Goccia.Values.NativeFunction, Goccia.Values.Primitives, Goccia.Values.TextDecoderValue; @@ FTextDecoderConstructor := TGocciaNativeFunctionValue.Create( - TextDecoderConstructorFn, 'TextDecoder', 0); + TextDecoderConstructorFn, CONSTRUCTOR_TEXT_DECODER, 0);As per coding guidelines, use runtime constant units instead of hardcoded string literals for constructor names (
Goccia.Constants.ConstructorNames).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@units/Goccia.Builtins.GlobalTextDecoder.pas` around lines 35 - 37, Replace the hardcoded constructor name string in the TGocciaNativeFunctionValue.Create call with the shared constant; change the second argument passed in the FTextDecoderConstructor initialization (where TGocciaNativeFunctionValue.Create(TextDecoderConstructorFn, 'TextDecoder', 0) is used) to use CONSTRUCTOR_TEXT_DECODER from Goccia.Constants.ConstructorNames, and keep the subsequent TGocciaTextDecoderValue.ExposePrototype(FTextDecoderConstructor) call unchanged so the prototype exposure still references the same FTextDecoderConstructor.units/Goccia.Values.TextDecoderValue.pas (1)
209-210: Replace hardcoded prototype method name with property-name constantLine 209 uses
'decode'as a literal inAddNamedMethod. Please source this fromGoccia.Constants.PropertyNames(e.g.,PROP_DECODE) to keep runtime names centralized.🔧 Suggested change
- Members.AddNamedMethod('decode', Decode, 1, gmkPrototypeMethod, + Members.AddNamedMethod(PROP_DECODE, Decode, 1, gmkPrototypeMethod, [gmfNoFunctionPrototype]);As per coding guidelines, use runtime constant units instead of hardcoded string literals:
Goccia.Constants.PropertyNamesfor property names.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@units/Goccia.Values.TextDecoderValue.pas` around lines 209 - 210, Replace the hardcoded 'decode' literal passed to Members.AddNamedMethod with the centralized property-name constant: use Goccia.Constants.PropertyNames.PROP_DECODE (or the equivalent exported constant) instead of the string; update the call where Members.AddNamedMethod('decode', Decode, 1, gmkPrototypeMethod, [gmfNoFunctionPrototype]) to reference PROP_DECODE so the runtime name for the Decode method is sourced from the Goccia.Constants.PropertyNames unit and stays consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@units/Goccia.Builtins.GlobalTextEncoder.pas`:
- Around line 35-37: Replace the hardcoded 'TextEncoder' literal passed to
TGocciaNativeFunctionValue.Create with the shared constant
CONSTRUCTOR_TEXT_ENCODER from Goccia.Constants.ConstructorNames: locate where
FTextEncoderConstructor is initialized (the call to
TGocciaNativeFunctionValue.Create with TextEncoderConstructorFn) and change the
second argument to use the constant, then keep the subsequent
TGocciaTextEncoderValue.ExposePrototype(FTextEncoderConstructor) intact so the
prototype registration still uses the same constructor symbol.
In `@units/Goccia.Values.TextDecoderValue.pas`:
- Around line 154-193: Replace the manual argument checks in
TGocciaTextDecoderValue.InitializeNativeFromArguments with
TGocciaArgumentValidator usage: use the validator to read/validate the optional
label as a string and the optional options as an object (or undefined), then if
the validated label is present call NormalizeEncodingLabel and assign FEncoding
(throw on empty), and if the validated options object is present read PROP_FATAL
and PROP_IGNORE_BOM and assign FFatal and FIgnoreBOM accordingly; remove the
explicit instance/type checks and use TGocciaArgumentValidator methods to
centralize validation while keeping PROP_FATAL, PROP_IGNORE_BOM, FEncoding,
FFatal and FIgnoreBOM as the target fields.
- Around line 110-136: The current UTF-8 validation loop only checks
continuation-byte form and thus allows overlong, surrogate and out-of-range
sequences; update the decoder routine that contains variables I, SeqLen, B,
AData, AOffset, ALength and Result to enforce stricter initial-byte dependent
checks: after computing SeqLen and before accepting the sequence, read the
second continuation byte (AData[I+1]) and add these guards—if SeqLen=3 then if
B=$E0 ensure AData[I+1] >= $A0; if B=$ED ensure AData[I+1] <= $9F; if SeqLen=4
then if B=$F0 ensure AData[I+1] >= $90; if B=$F4 ensure AData[I+1] <= $8F; keep
the existing continuation-byte loop and still fail (Result:=False; Exit) when
any of these guards are violated so overlong, surrogate and >U+10FFFF sequences
are rejected.
In `@units/Goccia.Values.TextEncoderValue.pas`:
- Around line 56-65: The StringToUTF8Bytes helper (used by Encode and EncodeInto
when reading ToStringLiteral.Value) does raw byte copying and does not perform
USVString normalization, so lone surrogates are encoded incorrectly; update the
flow to normalize to a well-formed scalar-value string before UTF-8 conversion
either by adding a normalization step to StringToUTF8Bytes (iterating code units
and replacing lone surrogates with U+FFFD, similar to
StringObjectValue.toWellFormed) or by calling that normalization from Encode and
EncodeInto prior to calling StringToUTF8Bytes; ensure you reference the existing
toWellFormed logic in StringObjectValue.pas to implement the surrogate-to-FFFD
replacement so inputs like "\uD800" produce the UTF-8 for U+FFFD.
---
Nitpick comments:
In `@units/Goccia.Builtins.GlobalTextDecoder.pas`:
- Around line 35-37: Replace the hardcoded constructor name string in the
TGocciaNativeFunctionValue.Create call with the shared constant; change the
second argument passed in the FTextDecoderConstructor initialization (where
TGocciaNativeFunctionValue.Create(TextDecoderConstructorFn, 'TextDecoder', 0) is
used) to use CONSTRUCTOR_TEXT_DECODER from Goccia.Constants.ConstructorNames,
and keep the subsequent
TGocciaTextDecoderValue.ExposePrototype(FTextDecoderConstructor) call unchanged
so the prototype exposure still references the same FTextDecoderConstructor.
In `@units/Goccia.Values.TextDecoderValue.pas`:
- Around line 209-210: Replace the hardcoded 'decode' literal passed to
Members.AddNamedMethod with the centralized property-name constant: use
Goccia.Constants.PropertyNames.PROP_DECODE (or the equivalent exported constant)
instead of the string; update the call where Members.AddNamedMethod('decode',
Decode, 1, gmkPrototypeMethod, [gmfNoFunctionPrototype]) to reference
PROP_DECODE so the runtime name for the Decode method is sourced from the
Goccia.Constants.PropertyNames unit and stays consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 83a8c19c-709d-4d2c-b4f3-b0194b3c3545
📒 Files selected for processing (17)
tests/built-ins/TextDecoder/constructor.jstests/built-ins/TextDecoder/prototype/decode.jstests/built-ins/TextDecoder/prototype/encoding.jstests/built-ins/TextDecoder/prototype/fatal.jstests/built-ins/TextDecoder/prototype/ignoreBOM.jstests/built-ins/TextEncoder/constructor.jstests/built-ins/TextEncoder/prototype/encode.jstests/built-ins/TextEncoder/prototype/encodeInto.jstests/built-ins/TextEncoder/prototype/encoding.jsunits/Goccia.Builtins.GlobalTextDecoder.pasunits/Goccia.Builtins.GlobalTextEncoder.pasunits/Goccia.Constants.ConstructorNames.pasunits/Goccia.Constants.PropertyNames.pasunits/Goccia.Engine.pasunits/Goccia.Values.ClassValue.pasunits/Goccia.Values.TextDecoderValue.pasunits/Goccia.Values.TextEncoderValue.pas
Suite Timing
Measured on ubuntu-latest x64. |
Benchmark Results364 benchmarks Interpreted: 🟢 243 improved · 🔴 54 regressed · 67 unchanged · avg +3.9% arraybuffer.js — Interp: 🟢 13, 🔴 1 · avg +11.5% · Bytecode: 🟢 4, 🔴 4, 6 unch. · avg -0.3%
arrays.js — Interp: 🟢 18, 1 unch. · avg +8.4% · Bytecode: 🟢 8, 🔴 4, 7 unch. · avg +0.2%
async-await.js — Interp: 🟢 6 · avg +12.4% · Bytecode: 🔴 1, 5 unch. · avg +0.1%
base64.js — Interp: 🟢 8, 2 unch. · avg +6.7% · Bytecode: 🟢 1, 🔴 9 · avg -3.4%
classes.js — Interp: 🟢 24, 🔴 3, 4 unch. · avg +6.1% · Bytecode: 🟢 3, 🔴 3, 25 unch. · avg -0.1%
closures.js — Interp: 🟢 11 · avg +5.8% · Bytecode: 🟢 2, 9 unch. · avg +0.6%
collections.js — Interp: 🟢 8, 🔴 2, 2 unch. · avg +4.5% · Bytecode: 🟢 3, 🔴 3, 6 unch. · avg -0.3%
destructuring.js — Interp: 🟢 9, 🔴 9, 4 unch. · avg +0.0% · Bytecode: 🟢 4, 🔴 7, 11 unch. · avg -2.1%
fibonacci.js — Interp: 🟢 6, 2 unch. · avg +5.8% · Bytecode: 🟢 2, 🔴 4, 2 unch. · avg -3.6%
float16array.js — Interp: 🟢 10, 🔴 12, 10 unch. · avg -1.7% · Bytecode: 🟢 8, 🔴 12, 12 unch. · avg +0.3%
for-of.js — Interp: 🟢 3, 4 unch. · avg +1.9% · Bytecode: 🔴 3, 4 unch. · avg -1.6%
helpers/bench-module.js — Interp: 0 · Bytecode: 0
iterators.js — Interp: 🟢 17, 🔴 12, 13 unch. · avg +0.8% · Bytecode: 🟢 2, 🔴 22, 18 unch. · avg -2.1%
json.js — Interp: 🟢 18, 2 unch. · avg +5.3% · Bytecode: 🔴 11, 9 unch. · avg -2.0%
jsx.jsx — Interp: 🟢 12, 🔴 1, 8 unch. · avg +2.2% · Bytecode: 🟢 18, 🔴 1, 2 unch. · avg +3.7%
modules.js — Interp: 🟢 9 · avg +7.2% · Bytecode: 🟢 1, 🔴 7, 1 unch. · avg -1.4%
numbers.js — Interp: 🟢 10, 1 unch. · avg +7.5% · Bytecode: 🟢 4, 🔴 5, 2 unch. · avg -2.6%
objects.js — Interp: 🟢 3, 🔴 2, 2 unch. · avg -0.4% · Bytecode: 🟢 4, 🔴 1, 2 unch. · avg +2.0%
promises.js — Interp: 🟢 11, 1 unch. · avg +4.7% · Bytecode: 🟢 7, 5 unch. · avg +2.1%
regexp.js — Interp: 🟢 8, 3 unch. · avg +5.7% · Bytecode: 🔴 5, 6 unch. · avg -1.6%
strings.js — Interp: 🟢 18, 1 unch. · avg +6.9% · Bytecode: 🟢 2, 🔴 1, 16 unch. · avg -0.1%
typed-arrays.js — Interp: 🟢 11, 🔴 6, 5 unch. · avg +2.8% · Bytecode: 🟢 5, 🔴 9, 8 unch. · avg -0.8%
uint8array-encoding.js — Interp: 🟢 10, 🔴 6, 2 unch. · avg +0.6% · Bytecode: 🟢 2, 🔴 11, 5 unch. · avg -2.3%
Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context. |
The WHATWG Encoding built-ins were only registered in TGocciaEngine (interpreter path). The bytecode backend uses TGocciaRuntimeBootstrap as its shared initialization layer, so TextEncoder/TextDecoder were undefined in --mode=bytecode. Mirror all six registration steps into Goccia.Runtime.Bootstrap.pas: interface uses, field declarations, destructor cleanup, RegisterBuiltIns creation, ExposePrototype helpers, and RegisterBuiltinConstructors TypeDef registrations. All 69 TextEncoder/TextDecoder tests now pass in both interpreted and bytecode execution modes. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…dUTF8 Four actionable issues from CodeRabbit PR #272: 1. Replace hardcoded 'TextEncoder'/'TextDecoder' string literals in GlobalTextEncoder/GlobalTextDecoder constructors with CONSTRUCTOR_TEXT_ENCODER / CONSTRUCTOR_TEXT_DECODER from Goccia.Constants.ConstructorNames — prevents drift across registration paths. 2. Add PROP_ENCODE, PROP_ENCODE_INTO, PROP_DECODE to Goccia.Constants.PropertyNames and use them in AddNamedMethod calls in TextEncoderValue and TextDecoderValue. 3. Fix IsValidUTF8: add missing second-byte range checks required by RFC 3629 to reject overlong sequences (E0 < A0, F0 < 90), UTF-16 surrogates (ED > 9F), and code points above U+10FFFF (F4 > 8F). Fatal mode now correctly rejects E0 80 80, ED A0 80, F0 80 80 80, F4 90 80 80 and similar sequences. Six new tests verify this. 4. Fix StringToUTF8Bytes: normalise lone surrogates (U+D800-U+DFFF, encoded as ED A0-BF xx in WTF-8/CESU-8) to U+FFFD (EF BF BD) per WHATWG Encoding §8.3.2 USVString requirement. Replacement is in-place since both encodings are exactly 3 bytes. Two new tests verify this. Skipped: TGocciaArgumentValidator refactor — that class only validates argument counts; the manual checks in InitializeNativeFromArguments perform type/value validation on optional arguments where count validation alone would not help. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@units/Goccia.Values.TextDecoderValue.pas`:
- Around line 355-364: The current decode path bypasses validation when
Obj.FFatal is false and casts raw bytes to UTF8String, which does not implement
U+FFFD replacement for malformed sequences; replace that branch with a proper
UTF‑8 decoding routine that produces replacement characters for ill-formed
sequences (e.g., call or implement a helper like DecodeUTF8WithReplacement(Data,
Offset, DataLen) instead of raw string(U8)), keep the fatal branch using
IsValidUTF8/ThrowTypeError, and return the decoded string via
TGocciaStringLiteralValue.Create; also add explicit assertions in
tests/built-ins/TextDecoder/prototype/decode.js to verify replacement behavior
(e.g., new TextDecoder().decode(new Uint8Array([0xFF])) === "\uFFFD").
In `@units/Goccia.Values.TextEncoderValue.pas`:
- Around line 181-276: The prototype methods EncodingGetter, Encode and
EncodeInto must reject non-TextEncoder receivers: at the start of
TGocciaTextEncoderValue.EncodingGetter, TGocciaTextEncoderValue.Encode and
TGocciaTextEncoderValue.EncodeInto check that AThisValue is a
TGocciaTextEncoderValue (or use the same illegal-invocation guard used in
TextDecoder) and call ThrowTypeError with an "Illegal invocation" message when
it is not; add this guard before any use of AThisValue so detached calls like
TextEncoder.prototype.encode.call({}) throw instead of proceeding.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 64b8cab8-7f99-4f48-83ca-dbaca88991bb
📒 Files selected for processing (8)
tests/built-ins/TextDecoder/prototype/decode.jstests/built-ins/TextEncoder/prototype/encode.jsunits/Goccia.Builtins.GlobalTextDecoder.pasunits/Goccia.Builtins.GlobalTextEncoder.pasunits/Goccia.Constants.PropertyNames.pasunits/Goccia.Runtime.Bootstrap.pasunits/Goccia.Values.TextDecoderValue.pasunits/Goccia.Values.TextEncoderValue.pas
✅ Files skipped from review due to trivial changes (3)
- units/Goccia.Builtins.GlobalTextDecoder.pas
- tests/built-ins/TextEncoder/prototype/encode.js
- units/Goccia.Constants.PropertyNames.pas
… receiver guards Two new actionable comments from CodeRabbit on PR #272: 1. Non-fatal decode mode now implements WHATWG U+FFFD replacement semantics (TextDecoderValue.pas line 364). The previous implementation did a raw byte copy in non-fatal mode, which passes invalid bytes through unchanged instead of replacing them with U+FFFD as the WHATWG Encoding §4.1 spec requires. Added DecodeUTF8WithReplacement: a byte-level state machine that mirrors the WHATWG UTF-8 decoder algorithm — each ill-formed byte (invalid leading byte, truncated sequence, overlong, surrogate, or >U+10FFFF) emits U+FFFD and advances the cursor by one byte. The fatal path continues to use the existing IsValidUTF8 + raw copy approach. New tests verify: 0xFF → U+FFFD, truncated 0xC3 → U+FFFD, surrogate ED A0 80 → U+FFFD×3 (per WHATWG spec: the leading byte fails at its second byte, then each orphaned continuation byte also fails individually), and valid bytes preserved between invalid ones. 2. TextEncoder prototype methods now reject non-TextEncoder receivers (TextEncoderValue.pas line 276). EncodingGetter, Encode, and EncodeInto previously ignored AThisValue, so detached calls like TextEncoder.prototype.encode.call({}, "x") would succeed. Added illegal-invocation guards matching the TextDecoder pattern. Tests cover encode and the encoding getter. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Incorporates 5 commits from main: - TextEncoder and TextDecoder built-ins (#272) - Make import.meta tests account for Windows drive letters (#274) - Fixes tagged template object identity (#275) - Add Goccia.build platform metadata (#276) - Add ToObject coercion for primitives across all Object.* static methods (#271) Conflict resolution in 3 files (all "both sides added" — keep both): Goccia.Engine.pas / Goccia.Runtime.Bootstrap.pas: - Added ggTextEncoder and ggTextDecoder to TGocciaGlobalBuiltin enum - Added both to DefaultGlobals alongside ggURL - Added FBuiltinTextEncoder/FBuiltinTextDecoder fields, Free calls, registration blocks, Expose* helpers, and constructor TypeDef blocks alongside the existing URL equivalents Goccia.Values.ClassValue.pas: - Added TGocciaTextEncoderClassValue and TGocciaTextDecoderClassValue declarations, impl uses, and CreateNativeInstance implementations alongside the existing URL equivalents Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Summary
TextEncoderandTextDecoderglobals with native constructors, prototype exposure, andtoStringTagsupport.TextEncoder.prototype.encode,TextEncoder.prototype.encodeInto, andTextDecoder.prototype.decodewith UTF-8 handling and BOM/fatal option support.Testing
tests/built-ins/TextEncoder/**andtests/built-ins/TextDecoder/**.encodeIntoboundary handling, BOM stripping, and fatal-mode errors.