Recognize LS/PS as line terminators in lexer (ES2026 §12.3)#285
Conversation
The lexer only recognized LF and CR as line terminators, missing the two Unicode line terminators required by ES2026 §12.3. Since FPC operates on AnsiString (UTF-8 bytes), LS/PS arrive as 3-byte sequences (E2 80 A8 / E2 80 A9) that a single-byte CharInSet check cannot detect. Add IsLineTerminator / IsUnicodeLineTerminator helpers and update SkipHashbang (§12.5), SkipComment (§12.4), SkipWhitespace, and SkipBlockComment to handle all four spec-defined line terminators. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
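The byte-level problem the commit describes can be sketched outside Pascal. This illustrative Python snippet (the function name `is_line_terminator` is hypothetical and this is not the FPC code) shows why a single-byte check misses LS/PS:

```python
# Illustrative sketch (Python, not the FPC implementation). U+2028/U+2029
# encode to the 3-byte UTF-8 sequences named in the commit, so a
# byte-oriented lexer must match three bytes, not one.
LS = "\u2028".encode("utf-8")  # LINE SEPARATOR      -> b'\xe2\x80\xa8'
PS = "\u2029".encode("utf-8")  # PARAGRAPH SEPARATOR -> b'\xe2\x80\xa9'

def is_line_terminator(src: bytes, i: int) -> bool:
    """True if one of the four spec terminators starts at byte offset i."""
    if src[i:i + 1] in (b"\n", b"\r"):  # LF / CR: single byte
        return True
    return src[i:i + 3] in (LS, PS)     # LS / PS: three bytes
```

A single-byte membership test (the `CharInSet` shape) can only ever see the lead byte `E2`, which is why the helper has to look at a 3-byte slice.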
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough
Adds ES2026 Unicode line-terminator support (U+2028/U+2029) to the lexer: recognizes UTF‑8 LS/PS and CR/CRLF as line terminators, updates line/column tracking, adjusts whitespace/hashbang/comment/template/regex scanning, and adds tests for CR, CRLF, and Unicode terminators.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
units/Goccia.Lexer.pas (1)
228-235: ⚠️ Potential issue | 🟡 Minor
Handle standalone CR as a line terminator in line/column tracking.
Line 228 treats `#13` (CR) as plain whitespace and only calls `Advance`, unlike `#10` (LF), which increments `FLine`. This means a source with standalone CR line breaks advances position without updating `FLine`, causing incorrect line numbering for subsequent tokens. The same issue occurs in `SkipBlockComment` at line 293.
The `IsLineTerminator` function (line 199) correctly identifies CR as a line terminator, but the whitespace-skipping and comment-handling paths do not treat it as such. Implement CR handling consistently with `ScanTemplate`'s pattern (lines 724–733): consume the CR (and optional following LF as a single break), then increment `FLine`.
Add regression coverage with explicit `#13` inputs (both standalone and CRLF sequences) to ensure line tracking is correct.
Proposed fix
```diff
 procedure TGocciaLexer.SkipWhitespace;
 begin
   while not IsAtEnd do
   begin
     case Peek of
-      ' ', #13, #9:
+      ' ', #9:
         Advance;
+      #13:
+        begin
+          Advance;
+          if Peek = #10 then
+            Advance;
+          Inc(FLine);
+          FColumn := 0;
+        end;
       #10:
         begin
           Inc(FLine);
           FColumn := 0;
           Advance;
         end;
```

Also in `SkipBlockComment`:

```diff
-    else if Peek = #10 then
+    else if Peek = #13 then
+    begin
+      Advance;
+      if Peek = #10 then
+        Advance;
+      Inc(FLine);
+      FColumn := 0;
+    end
+    else if Peek = #10 then
     begin
       Inc(FLine);
       FColumn := 0;
       Advance;
     end
```

Also applies to: 293–301
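The control flow of the proposed fix can be sketched in Python for illustration (the name `skip_whitespace` and the tuple return are hypothetical; the real change is in Pascal): consume the CR, fold an immediately following LF into the same break, then bump the line counter and reset the column.

```python
def skip_whitespace(src: bytes, i: int, line: int, col: int):
    """Advance past whitespace; CR, LF, and CRLF each count as one break."""
    while i < len(src):
        b = src[i:i + 1]
        if b in (b" ", b"\t"):
            i += 1
            col += 1
        elif b == b"\r":                 # CR: consume, fold a following LF
            i += 1
            if src[i:i + 1] == b"\n":
                i += 1                   # CRLF is a single line break
            line += 1
            col = 0
        elif b == b"\n":                 # bare LF
            i += 1
            line += 1
            col = 0
        else:
            break
    return i, line, col
```

The key property is that `b"\r\nx"` and `b"\rx"` both advance the line counter exactly once, which is what the review asks the Pascal branches to guarantee.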
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@units/Goccia.Lexer.pas` around lines 228 - 235, Whitespace and comment handling treats CR (`#13`) as plain whitespace and fails to increment FLine, causing incorrect line numbers; update the whitespace-skipping branch (the case handling ' ', `#13`, `#9` and the `#10` block) to treat CR like LF: consume CR, increment FLine, reset FColumn, and if the next char is LF consume it as part of the same break (mirror ScanTemplate logic from ScanTemplate around lines 724–733), and apply the same change inside SkipBlockComment so CR and CRLF both advance the line counter correctly; ensure you use Advance, Inc(FLine) and FColumn := 0 in the updated branches and add regression tests feeding standalone `#13` and CRLF sequences to validate line/column tracking.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9efce9d8-01d6-4287-ac3c-fb530592b4e2
📒 Files selected for processing (2)
units/Goccia.Lexer.Test.pas
units/Goccia.Lexer.pas
SkipWhitespace treated CR (#13) as plain whitespace without incrementing FLine, causing incorrect line numbers after standalone CR line endings. SkipBlockComment had the same gap — CR inside a block comment did not advance the line counter. Split CR out of the whitespace branch and mirror the ScanTemplate pattern: consume CR, optionally consume a following LF (so CRLF counts as a single line break), then increment FLine and reset FColumn. Add regression tests feeding standalone CR and CRLF sequences to both whitespace and block-comment contexts and verifying line/column tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
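The block-comment half of this commit follows the same pattern. A minimal Python sketch (the name `skip_block_comment` is hypothetical; this is not the Pascal implementation) showing CR, LF, and CRLF each counting as exactly one break inside a `/* ... */` comment:

```python
def skip_block_comment(src: bytes, i: int):
    """Scan past '/* ... */' at offset i, counting line breaks inside.
    CR, LF, and CRLF each count as exactly one break."""
    assert src[i:i + 2] == b"/*"
    i += 2
    lines = 0
    while i < len(src) and src[i:i + 2] != b"*/":
        if src[i:i + 1] == b"\r":
            i += 1
            if src[i:i + 1] == b"\n":  # fold CRLF into one break
                i += 1
            lines += 1
        elif src[i:i + 1] == b"\n":
            i += 1
            lines += 1
        else:
            i += 1
    return i + 2, lines  # offset past '*/', breaks seen
```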
Benchmark Results
364 benchmarks · Interpreted: 🟢 76 improved · 🔴 202 regressed · 86 unchanged · avg -2.1%
arraybuffer.js — Interp: 🔴 13, 1 unch. · avg -7.4% · Bytecode: 🟢 1, 🔴 1, 12 unch. · avg -0.0%
arrays.js — Interp: 🔴 18, 1 unch. · avg -5.8% · Bytecode: 🟢 3, 🔴 13, 3 unch. · avg -1.5%
async-await.js — Interp: 🔴 6 · avg -9.1% · Bytecode: 🔴 6 · avg -3.4%
base64.js — Interp: 🟢 1, 🔴 7, 2 unch. · avg -7.0% · Bytecode: 🟢 5, 🔴 2, 3 unch. · avg +4.1%
classes.js — Interp: 🟢 6, 🔴 19, 6 unch. · avg -2.7% · Bytecode: 🔴 12, 19 unch. · avg -3.0%
closures.js — Interp: 🔴 10, 1 unch. · avg -5.0% · Bytecode: 🔴 6, 5 unch. · avg -3.7%
collections.js — Interp: 🟢 2, 🔴 9, 1 unch. · avg -5.3% · Bytecode: 🟢 1, 🔴 9, 2 unch. · avg -1.5%
destructuring.js — Interp: 🟢 9, 🔴 9, 4 unch. · avg +0.2% · Bytecode: 🔴 12, 10 unch. · avg -3.2%
fibonacci.js — Interp: 🟢 2, 🔴 4, 2 unch. · avg -3.7% · Bytecode: 🔴 5, 3 unch. · avg -1.4%
float16array.js — Interp: 🟢 11, 🔴 10, 11 unch. · avg +3.4% · Bytecode: 🟢 11, 🔴 12, 9 unch. · avg +2.0%
for-of.js — Interp: 🟢 2, 🔴 1, 4 unch. · avg +0.7% · Bytecode: 🟢 1, 🔴 6 · avg +0.1%
helpers/bench-module.js — Interp: 0 · Bytecode: 0
iterators.js — Interp: 🟢 13, 🔴 14, 15 unch. · avg +0.2% · Bytecode: 🔴 31, 11 unch. · avg -3.8%
json.js — Interp: 🔴 18, 2 unch. · avg -5.2% · Bytecode: 🟢 1, 🔴 8, 11 unch. · avg -2.1%
jsx.jsx — Interp: 🟢 12, 🔴 3, 6 unch. · avg +1.7% · Bytecode: 🔴 19, 2 unch. · avg -2.5%
modules.js — Interp: 🔴 8, 1 unch. · avg -3.7% · Bytecode: 🔴 2, 7 unch. · avg +0.1%
numbers.js — Interp: 🔴 10, 1 unch. · avg -6.9% · Bytecode: 🟢 4, 🔴 1, 6 unch. · avg +5.5%
objects.js — Interp: 🟢 4, 🔴 1, 2 unch. · avg +1.4% · Bytecode: 🟢 1, 🔴 3, 3 unch. · avg -1.2%
promises.js — Interp: 🔴 6, 6 unch. · avg -2.1% · Bytecode: 🔴 5, 7 unch. · avg -0.8%
regexp.js — Interp: 🔴 5, 6 unch. · avg -4.1% · Bytecode: 🟢 1, 🔴 5, 5 unch. · avg -1.8%
strings.js — Interp: 🟢 2, 🔴 17 · avg -4.5% · Bytecode: 🟢 1, 🔴 4, 14 unch. · avg -1.1%
typed-arrays.js — Interp: 🟢 5, 🔴 8, 9 unch. · avg -0.6% · Bytecode: 🟢 2, 🔴 15, 5 unch. · avg -1.8%
uint8array-encoding.js — Interp: 🟢 7, 🔴 6, 5 unch. · avg +0.7% · Bytecode: 🟢 2, 🔴 6, 10 unch. · avg -1.2%
Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context.
Suite Timing
Measured on ubuntu-latest x64.
Actionable comments posted: 2
🧹 Nitpick comments (1)
units/Goccia.Lexer.pas (1)
216-220: Name the UTF-8 sequence length instead of inlining `3`.
`Inc(FCurrent, 3)` is part of the LS/PS encoding contract too. A `UTF8_LINE_TERMINATOR_BYTE_COUNT` constant would keep this helper consistent with the byte constants above. As per coding guidelines, "Extract bare numeric literals in implementation sections into named constants; only trivial self-explanatory literals (0, 1, -1, '', True, False) may be left as-is."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@units/Goccia.Lexer.pas` around lines 216 - 220, Replace the magic literal in TGocciaLexer.ConsumeUnicodeLineTerminator with a named constant: define UTF8_LINE_TERMINATOR_BYTE_COUNT (value 3) in the implementation section (near the other UTF-8/LS/PS byte constants) and use Inc(FCurrent, UTF8_LINE_TERMINATOR_BYTE_COUNT) so the method (TGocciaLexer.ConsumeUnicodeLineTerminator) and the LS/PS encoding contract consistently reference the same symbolic byte-count constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@units/Goccia.Lexer.pas`:
- Around line 193-221: The regex and template scanners (ScanRegexLiteral and
ScanTemplate) still only treat `#10`/`#13` as line terminators; update them to
use the new helpers so they recognize U+2028/U+2029 too: replace direct checks
for `#10`/`#13` with calls to IsLineTerminator, and when a terminator is found,
increment position/line using the existing ConsumeUnicodeLineTerminator for
multi-byte Unicode terminators or the existing single-byte logic for LF/CR
(update FLine and FColumn accordingly). Ensure ScanRegexLiteral raises an error
when IsLineTerminator is true inside the regex body, and ensure ScanTemplate
updates FLine/FColumn via the same branching so raw U+2028/U+2029 are treated
consistently.
- Around line 207-213: Add an explicit short-circuit evaluation directive and
replace the magic UTF‑8 byte-count literal with a named constant: add {$B-} to
the unit header or Goccia.inc to ensure boolean short-circuiting for safety in
TGocciaLexer.IsUnicodeLineTerminator, and declare a constant (e.g.
UTF8_LS_PS_BYTE_ADVANCE = 3) and use that constant wherever Inc(FCurrent, 3) is
used so the UTF‑8 byte-advance semantics are clear and maintainable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1693dc09-d483-41f2-a152-f0712fd39471
📒 Files selected for processing (2)
units/Goccia.Lexer.Test.pas
units/Goccia.Lexer.pas
✅ Files skipped from review due to trivial changes (1)
- units/Goccia.Lexer.Test.pas
ScanRegexLiteral only rejected LF inside regex bodies; now checks IsLineTerminator before each character so CR, LS, and PS also raise "Unterminated regular expression literal" per ES2026 §12.9.5. ScanTemplate only tracked CR and LF for line counting; add a LS/PS branch that preserves the original code point in both TV and TRV (ES2026 §12.9.6) and increments the line counter. Extract UTF8_LINE_TERMINATOR_BYTE_LENGTH constant for the magic 3 in ConsumeUnicodeLineTerminator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
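The ScanRegexLiteral behavior this commit describes can be illustrated with a Python sketch (the name `scan_regex_body` is hypothetical and this is not the Pascal code): any of the four line terminators inside the body aborts the literal with the same error message.

```python
LS = "\u2028".encode("utf-8")  # LINE SEPARATOR
PS = "\u2029".encode("utf-8")  # PARAGRAPH SEPARATOR

def scan_regex_body(src: bytes, i: int) -> int:
    """Scan a regex body starting just after the opening '/'.
    Any of the four line terminators aborts the literal."""
    while i < len(src):
        b = src[i:i + 1]
        if b in (b"\n", b"\r") or src[i:i + 3] in (LS, PS):
            raise SyntaxError("Unterminated regular expression literal")
        if b == b"/":
            return i        # offset of the closing delimiter
        if b == b"\\":
            i += 1          # skip the escaped character
            # (simplified: an escaped line terminator is not re-checked)
        i += 1
    raise SyntaxError("Unterminated regular expression literal")
```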
Summary
The lexer only recognized LF (U+000A) and CR (U+000D) as line terminators, missing LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029) required by ES2026 §12.3. Since FPC operates on `AnsiString` (UTF-8 bytes), LS/PS arrive as 3-byte sequences (E2 80 A8 / E2 80 A9) that a single-byte `CharInSet` check cannot detect.
- Add `IsLineTerminator` / `IsUnicodeLineTerminator` / `ConsumeUnicodeLineTerminator` helpers with named UTF-8 byte constants
- Update `SkipHashbang` (§12.5), `SkipComment` (§12.4), `SkipWhitespace`, and `SkipBlockComment` to handle all four spec-defined line terminators
- `SkipHashbang` and `SkipComment`
Testing
`TestCommentTerminatedByUnicodeLineTerminators` test added
🤖 Generated with Claude Code
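The comment-termination rule (§12.4) that the new test exercises can be sketched in Python as well (the name `line_comment_end` is hypothetical; the actual implementation is Pascal): a `//` comment ends at, but does not consume, any of the four terminators.

```python
LS = "\u2028".encode("utf-8")  # LINE SEPARATOR
PS = "\u2029".encode("utf-8")  # PARAGRAPH SEPARATOR

def line_comment_end(src: bytes, i: int) -> int:
    """Return the offset of the terminator ending a '//' comment at i
    (or len(src) if none); the terminator itself is not consumed."""
    i += 2  # skip the leading '//'
    while i < len(src):
        if src[i:i + 1] in (b"\n", b"\r") or src[i:i + 3] in (LS, PS):
            return i
        i += 1
    return i
```

Leaving the terminator unconsumed matters: the caller's line-tracking code is what counts the break, so a comment ending in LS/PS still produces correct line numbers.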