Implement TC39 Template Literal Revision for tagged templates #281
Conversation
Tagged template literals now tolerate malformed escape sequences (\u{,
\xG1, incomplete \u escapes, code points > U+10FFFF) by setting the
cooked segment to undefined while preserving the raw source text.
Untagged template literals continue to throw SyntaxError for the same
invalid escapes, preserving existing behavior.
Closes #269
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
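The behavior described above is observable in any ES2018-compliant engine; a minimal sketch (the `tag` helper is illustrative, not part of the PR):

```javascript
// Sketch of the revised ES2018 semantics; `tag` is a throwaway helper.
function tag(strings, ...subs) {
  return { cooked: strings[0], raw: strings.raw[0] };
}

// Malformed hex escape: the cooked segment becomes undefined, raw survives.
const r = tag`\xG1`;
console.log(r.cooked); // undefined
console.log(r.raw);    // the raw source text "\xG1"

// Untagged templates still reject the same escape at parse time.
try {
  eval("`\\xG1`");
} catch (e) {
  console.log(e instanceof SyntaxError); // true
}
```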
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Lexer, parser, AST, bytecode, compiler, evaluator, and VM updated to implement ES2018 Template Literal Revision: tagged templates now carry per-segment cooked-valid flags so malformed escape sequences cook to undefined while the raw source text is preserved.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Lexer as Lexer
    participant Parser as Parser
    participant Compiler as Compiler
    participant Bytecode as Bytecode
    participant VM as VM
    participant Evaluator as Evaluator
    participant Tag as TagFunction
    Lexer->>Parser: Tokenize template (mark invalid escapes, keep raw)
    Parser->>Compiler: Build TGocciaTaggedTemplateExpression (CookedStrings, RawStrings, CookedValid)
    Compiler->>Bytecode: AddConstantTemplateObject(..., CookedValid)
    Bytecode->>VM: Load template constant (contains CookedValid)
    VM->>Evaluator: Build template object (use CookedValid → cooked or undefined)
    Evaluator->>Tag: Call tag with template object and interpolations
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
units/Goccia.Bytecode.Binary.pas (1)
164-167: Consider validating the CookedValid length invariant to prevent silent stream corruption.

The writer serializes `Length(Constant.CookedValid)` booleans without writing the count explicitly, while the reader (lines 373-375) reads `Length(CookedStrings)` booleans. This relies on the invariant `Length(CookedValid) = Length(CookedStrings)`. However, `AddConstantTemplateObject` in `Goccia.Bytecode.Chunk.pas` (lines 330-332) only validates `Length(ACookedStrings) <> Length(ARawStrings)` and does not validate the `CookedValid` length. If the invariant is ever violated (e.g., by a parser bug), the `.gbc` stream will be silently corrupted. Consider adding validation in `AddConstantTemplateObject`:

```pascal
if Length(ACookedValid) <> Length(ACookedStrings) then
  raise Exception.Create(
    'Template payload length mismatch: cooked validity vs cooked strings');
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@units/Goccia.Bytecode.Binary.pas` around lines 164 - 167, Add a length-check in AddConstantTemplateObject to ensure the CookedValid array matches the CookedStrings array length and raise an exception if not; specifically verify Length(ACookedValid) = Length(ACookedStrings) (or Length(ACookedValid) = Length(ARawStrings) depending on existing invariants) and raise a descriptive exception like "Template payload length mismatch: cooked validity vs cooked strings" to prevent silent .gbc stream corruption when the writer (which writes Length(Constant.CookedValid) booleans) and reader (which reads Length(CookedStrings) booleans) diverge.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/language-restrictions.md`:
- Line 163: Line 163's inline code uses nested backticks (the tagged template
example `tag`...``) which breaks markdown rendering; update that inline span to
avoid nested backticks by using an alternate delimiter (e.g., wrap the entire
example in double backticks like ``tag`...`` or use an HTML <code> element) so
the tagged template example (`tag`...`) and String.raw remain correctly shown;
change the text containing "tag`...`" and "String.raw" to use the non-nested
inline code formatting.
In `@units/Goccia.Lexer.pas`:
- Around line 435-465: The surrogate-pair probe can advance the lexer even when
the trailing \uXXXX is incomplete or not a valid hex, desynchronizing
FCurrent/FColumn (see CodePoint check, Peek/PeekNext, SavedCurrent/SavedColumn,
Advance, HexStart, HexStr, IsValidHexString, LowSurrogate). Modify the lookahead
so that any path that does not result in a valid low surrogate (IsAtEnd during
the 4-char read, HexStr length ≠ 4, IsValidHexString = false, or low-surrogate
out of $DC00..$DFFF) restores FCurrent and FColumn from SavedCurrent/SavedColumn
before exiting the block; only keep the advanced state when CodePoint is
successfully combined into the paired code point.
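The observable contract this lookahead protects can be sketched in JavaScript (the `cook` helper is illustrative): a `\uXXXX\uXXXX` surrogate pair cooks to a single code point, while a high surrogate not followed by a valid low-surrogate escape must be left alone rather than half-consumed.

```javascript
// Illustrative tag helper (not from the PR): returns the first cooked segment.
const cook = (strings) => strings[0];

// A valid surrogate pair combines into one code point:
console.log(cook`\uD83D\uDE00` === "\u{1F600}"); // true

// A high surrogate followed by ordinary text is NOT a pair; the lexer must
// restore its cursor so the "x" is kept as literal content after the escape:
console.log(cook`\uD83Dx`.length); // 2 (lone surrogate code unit + "x")
```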
---
Nitpick comments:
In `@units/Goccia.Bytecode.Binary.pas`:
- Around line 164-167: Add a length-check in AddConstantTemplateObject to ensure
the CookedValid array matches the CookedStrings array length and raise an
exception if not; specifically verify Length(ACookedValid) =
Length(ACookedStrings) (or Length(ACookedValid) = Length(ARawStrings) depending
on existing invariants) and raise a descriptive exception like "Template payload
length mismatch: cooked validity vs cooked strings" to prevent silent .gbc
stream corruption when the writer (which writes Length(Constant.CookedValid)
booleans) and reader (which reads Length(CookedStrings) booleans) diverge.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9c280fb9-fb66-4a8f-a559-a0654dd7b556
📒 Files selected for processing (11)
- docs/language-restrictions.md
- tests/language/expressions/string/tagged-templates.js
- tests/language/expressions/string/template-string.js
- units/Goccia.AST.Expressions.pas
- units/Goccia.Bytecode.Binary.pas
- units/Goccia.Bytecode.Chunk.pas
- units/Goccia.Compiler.Expressions.pas
- units/Goccia.Evaluator.pas
- units/Goccia.Lexer.pas
- units/Goccia.Parser.pas
- units/Goccia.VM.pas
Suite Timing
Measured on ubuntu-latest x64.
Benchmark Results — 364 benchmarks
Interpreted: 🟢 215 improved · 🔴 68 regressed · 81 unchanged · avg +2.5%
arraybuffer.js — Interp: 🟢 12, 🔴 2 · avg +6.0% · Bytecode: 🔴 8, 6 unch. · avg -1.7%
arrays.js — Interp: 🟢 17, 2 unch. · avg +6.1% · Bytecode: 🟢 3, 🔴 15, 1 unch. · avg -2.9%
async-await.js — Interp: 🟢 6 · avg +6.4% · Bytecode: 🟢 4, 🔴 1, 1 unch. · avg +1.1%
base64.js — Interp: 🟢 9, 1 unch. · avg +6.1% · Bytecode: 🟢 8, 2 unch. · avg +2.2%
classes.js — Interp: 🟢 22, 🔴 4, 5 unch. · avg +4.5% · Bytecode: 🟢 2, 🔴 8, 21 unch. · avg -0.3%
closures.js — Interp: 🟢 6, 5 unch. · avg +2.3% · Bytecode: 🟢 1, 🔴 4, 6 unch. · avg -2.5%
collections.js — Interp: 🟢 9, 🔴 2, 1 unch. · avg +4.7% · Bytecode: 🟢 3, 🔴 6, 3 unch. · avg -1.8%
destructuring.js — Interp: 🟢 8, 🔴 9, 5 unch. · avg +0.3% · Bytecode: 🟢 1, 🔴 15, 6 unch. · avg -2.3%
fibonacci.js — Interp: 🟢 6, 2 unch. · avg +5.9% · Bytecode: 🔴 6, 2 unch. · avg -2.7%
float16array.js — Interp: 🟢 8, 🔴 12, 12 unch. · avg -2.1% · Bytecode: 🟢 14, 🔴 9, 9 unch. · avg +0.3%
for-of.js — Interp: 🟢 1, 6 unch. · avg -0.1% · Bytecode: 🟢 2, 🔴 2, 3 unch. · avg +0.4%
helpers/bench-module.js — Interp: 0 · Bytecode: 0
iterators.js — Interp: 🟢 17, 🔴 17, 8 unch. · avg -0.7% · Bytecode: 🟢 8, 🔴 25, 9 unch. · avg -1.7%
json.js — Interp: 🟢 18, 2 unch. · avg +6.7% · Bytecode: 🔴 14, 6 unch. · avg -3.3%
jsx.jsx — Interp: 🟢 12, 9 unch. · avg +2.1% · Bytecode: 🔴 17, 4 unch. · avg -2.3%
modules.js — Interp: 🟢 9 · avg +6.9% · Bytecode: 🟢 1, 🔴 7, 1 unch. · avg -1.0%
numbers.js — Interp: 🟢 10, 🔴 1 · avg +5.6% · Bytecode: 🟢 2, 🔴 9 · avg -1.0%
objects.js — Interp: 🟢 2, 🔴 3, 2 unch. · avg -0.5% · Bytecode: 🔴 6, 1 unch. · avg -4.2%
promises.js — Interp: 🟢 6, 🔴 1, 5 unch. · avg +0.9% · Bytecode: 🔴 11, 1 unch. · avg -2.1%
regexp.js — Interp: 🟢 9, 🔴 1, 1 unch. · avg +4.1% · Bytecode: 🔴 11 · avg -4.5%
strings.js — Interp: 🟢 18, 1 unch. · avg +4.6% · Bytecode: 🔴 16, 3 unch. · avg -5.3%
typed-arrays.js — Interp: 🟢 7, 🔴 9, 6 unch. · avg -0.2% · Bytecode: 🟢 2, 🔴 14, 6 unch. · avg -3.9%
uint8array-encoding.js — Interp: 🟢 3, 🔴 7, 8 unch. · avg -0.5% · Bytecode: 🟢 1, 🔴 9, 8 unch. · avg -1.8%
Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context.
Address CodeRabbit review feedback:
- Lexer: guard surrogate pair lookahead against scanning past the
closing backtick, and restore the cursor on every failure path
(incomplete hex, invalid hex, non-surrogate low value)
- Parser: scope the untagged template SyntaxError to static segments
only — split into segments first via SplitRawAtBoundaries so that
malformed-looking sequences inside ${...} expression bodies do not
cause false-positive rejections
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
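The false positive the parser change guards against is easy to reproduce: the raw text of a template includes everything between `${` and `}`, so a naive scan over the whole raw text would flag escape-like sequences that live inside the expression body. A hedged illustration:

```javascript
// The "\xG1" here lives inside a comment in the expression hole, so it is
// ordinary expression-body text, not a template escape sequence. Scoping the
// escape check to static segments keeps this template valid.
const s = `sum: ${ 1 + 1 /* looks like \xG1 but is harmless */ }`;
console.log(s); // "sum: 2"
```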
Actionable comments posted: 1
🧹 Nitpick comments (1)
units/Goccia.Parser.pas (1)
1230-1250: Prefer the shared code-point→UTF-8 helper here.

This hand-rolled UTF-8 encoding path now has to stay bit-for-bit aligned with the lexer and other decoders. Reusing the existing helper would reduce drift in exactly the area this PR is tightening up.

Based on learnings: when decoding Unicode escape sequences and converting them to UTF-8, prefer the existing helper implemented via `CodePointToUTF8` that correctly combines UTF-16 surrogate pairs before UTF-8 encoding.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@units/Goccia.Parser.pas` around lines 1230 - 1250, Replace the manual UTF‑8 encoding block that appends bytes via ASB.AppendChar for variable CodePoint with the shared helper CodePointToUTF8 so surrogate pairs are combined and encoding logic stays consistent; locate the block that checks CodePoint ranges and remove it, calling CodePointToUTF8 (the existing helper used by the lexer/decoders) to produce/apply the UTF‑8 bytes (or append its returned string) instead of the hand-rolled byte assembly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@units/Goccia.Parser.pas`:
- Around line 1166-1190: The CookUnicodeEscape function currently calls
StrToInt('$'+HexStr) which can raise on out-of-range hex payloads; replace this
with a safe conversion using TryStrToQWord (e.g., parse into a QWord temp),
return False when TryStrToQWord fails, then validate the numeric value (<=
$10FFFF) before casting into CodePoint (Cardinal) and continuing; update the
logic around the StrToInt call so malformed/overflowing braced escapes cause
CookUnicodeEscape to return False instead of raising (refer to
CookUnicodeEscape, CodePoint and HexStr identifiers).
---
Nitpick comments:
In `@units/Goccia.Parser.pas`:
- Around line 1230-1250: Replace the manual UTF‑8 encoding block that appends
bytes via ASB.AppendChar for variable CodePoint with the shared helper
CodePointToUTF8 so surrogate pairs are combined and encoding logic stays
consistent; locate the block that checks CodePoint ranges and remove it, calling
CodePointToUTF8 (the existing helper used by the lexer/decoders) to
produce/apply the UTF‑8 bytes (or append its returned string) instead of the
hand-rolled byte assembly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0027333f-a2a2-4105-9149-cab5f555964c
📒 Files selected for processing (2)
units/Goccia.Lexer.pasunits/Goccia.Parser.pas
🚧 Files skipped from review as they are similar to previous changes (1)
- units/Goccia.Lexer.pas
Replace StrToInt('$' + HexStr) with TryStrToQWord in both the lexer's
ScanUnicodeEscapeForTemplate and the parser's CookUnicodeEscape for
the \u{...} braced path. Arbitrarily long hex payloads like
\u{FFFFFFFF} now return False instead of raising an unhandled overflow
exception, so tagged templates correctly set cooked=undefined and
untagged templates emit SyntaxError.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
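The overflow case fixed here is observable from JavaScript: a braced escape whose payload exceeds U+10FFFF must invalidate the cooked segment in a tagged template (and remain a SyntaxError when untagged) rather than crash the host. A sketch, with an illustrative `probe` helper:

```javascript
// Illustrative tag helper (not from the PR).
const probe = (strings) => ({ cooked: strings[0], raw: strings.raw[0] });

// Payload overflows U+10FFFF: cooked is undefined, raw keeps the source text.
const seg = probe`\u{FFFFFFFF}`;
console.log(seg.cooked); // undefined
console.log(seg.raw);    // "\u{FFFFFFFF}"
```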
Replace the hand-rolled UTF-8 byte assembly in CookUnicodeEscape with a local CodePointToUTF8 helper, keeping the encoding logic in one place and consistent with the same pattern in JSONParser and Goccia.YAML.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
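For reference, the encoding such a helper centralizes is small; a hedged JavaScript sketch of the standard code-point→UTF-8 mapping (the Pascal helper's actual name and signature may differ):

```javascript
// Standard UTF-8 encoding of a single Unicode code point (<= U+10FFFF),
// returned as an array of byte values.
function codePointToUTF8(cp) {
  if (cp < 0x80)    return [cp];                                   // 1 byte
  if (cp < 0x800)   return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 2 bytes
  if (cp < 0x10000) return [0xE0 | (cp >> 12),                     // 3 bytes
                            0x80 | ((cp >> 6) & 0x3F),
                            0x80 | (cp & 0x3F)];
  return [0xF0 | (cp >> 18),                                       // 4 bytes
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

// U+1F600 (😀) encodes as F0 9F 98 80.
console.log(codePointToUTF8(0x1F600).map(b => b.toString(16)).join(" "));
```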
Summary
- Tagged template literals now tolerate malformed escape sequences (`\u{`, `\xG1`, incomplete `\u` escapes, code points > U+10FFFF) by setting the cooked segment to `undefined` while preserving the raw source text in `strings.raw`.
- Untagged template literals continue to throw `SyntaxError` for the same invalid escapes, preserving existing behavior.
- Implemented across the lexer, parser, AST (`CookedValid` flags), evaluator, compiler, bytecode serialization (`.gbc`), and VM.

Testing
- `tagged-templates.js` tests covering `\xG1`, `\u{`, `\u{ZZZZ}`, `\u00`, `\xQQ`, mixed valid/invalid segments, frozen objects, expression evaluation, call-site identity, code point > U+10FFFF, and valid escapes still working
- `.gbc` serialization round-trip verified for templates with invalid escapes
- `docs/language-restrictions.md` updated to document TC39 Template Literal Revision support

🤖 Generated with Claude Code