Extract shared text semantics utilities#439
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a new TextSemantics unit centralizing UTF‑8 byte‑accurate utilities, switches file readers to read raw UTF‑8 bytes then retag/normalize via TextSemantics, updates many parsing, string, iterator, module-resolution, and test paths to use these primitives, and trims Goccia.TextFiles API. Changes
Sequence Diagram(s)sequenceDiagram
participant FS as FileSystem (ReadUTF8FileText)
participant TS as TextSemantics
participant Parser as Parser/Loader
participant Runtime as Runtime/Consumer
FS->>TS: return raw UTF8 bytes (UTF8String)
TS->>TS: RetagUTF8Text / NormalizeUTF8NewlinesToLF / CreateUTF8FileTextLines
TS->>Parser: provide retagged UTF‑8 string or TStringList
Parser->>Runtime: parsed AST / values with preserved UTF‑8 strings
Runtime->>TS: call UnicodeLower/Upper / ExpandReplacementPattern / AdvanceUTF8StringIndex as needed
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Benchmark Results386 benchmarks Interpreted: 🟢 46 improved · 🔴 96 regressed · 244 unchanged · avg -1.2% arraybuffer.js — Interp: 14 unch. · avg -0.4% · Bytecode: 🔴 11, 3 unch. · avg -8.1%
arrays.js — Interp: 🔴 2, 17 unch. · avg -1.1% · Bytecode: 🔴 17, 2 unch. · avg -11.6%
async-await.js — Interp: 🔴 1, 5 unch. · avg -1.1% · Bytecode: 🔴 5, 1 unch. · avg -10.1%
base64.js — Interp: 🟢 1, 🔴 5, 4 unch. · avg -3.0% · Bytecode: 🔴 9, 1 unch. · avg -12.9%
classes.js — Interp: 🔴 6, 25 unch. · avg -1.8% · Bytecode: 🔴 22, 9 unch. · avg -6.5%
closures.js — Interp: 🔴 3, 8 unch. · avg -1.2% · Bytecode: 🔴 10, 1 unch. · avg -10.4%
collections.js — Interp: 🔴 1, 11 unch. · avg -0.1% · Bytecode: 🔴 12 · avg -12.0%
csv.js — Interp: 🔴 4, 9 unch. · avg -2.0% · Bytecode: 🔴 13 · avg -11.2%
destructuring.js — Interp: 🔴 2, 20 unch. · avg -2.2% · Bytecode: 🔴 22 · avg -9.3%
fibonacci.js — Interp: 8 unch. · avg -0.0% · Bytecode: 🔴 8 · avg -11.4%
float16array.js — Interp: 🟢 1, 🔴 9, 22 unch. · avg -1.7% · Bytecode: 🟢 4, 🔴 25, 3 unch. · avg -3.6%
for-of.js — Interp: 🟢 5, 2 unch. · avg +4.1% · Bytecode: 🔴 7 · avg -13.0%
helpers/bench-module.js — Interp: 0 · Bytecode: 0
iterators.js — Interp: 🔴 37, 5 unch. · avg -6.1% · Bytecode: 🔴 41, 1 unch. · avg -8.9%
json.js — Interp: 🟢 9, 🔴 1, 10 unch. · avg +2.1% · Bytecode: 🔴 19, 1 unch. · avg -8.0%
jsx.jsx — Interp: 🟢 7, 🔴 2, 12 unch. · avg +1.8% · Bytecode: 🔴 21 · avg -9.9%
modules.js — Interp: 9 unch. · avg -0.4% · Bytecode: 🔴 9 · avg -13.3%
numbers.js — Interp: 🟢 1, 10 unch. · avg +2.0% · Bytecode: 🔴 11 · avg -9.6%
objects.js — Interp: 🟢 3, 4 unch. · avg +2.9% · Bytecode: 🔴 7 · avg -7.4%
promises.js — Interp: 🟢 1, 🔴 1, 10 unch. · avg -0.8% · Bytecode: 🔴 12 · avg -12.7%
regexp.js — Interp: 🟢 5, 6 unch. · avg +4.4% · Bytecode: 🔴 11 · avg -6.3%
strings.js — Interp: 🟢 5, 🔴 2, 12 unch. · avg +0.1% · Bytecode: 🔴 19 · avg -11.3%
tsv.js — Interp: 🔴 1, 8 unch. · avg -1.5% · Bytecode: 🔴 6, 3 unch. · avg -5.3%
typed-arrays.js — Interp: 🟢 1, 🔴 16, 5 unch. · avg -3.1% · Bytecode: 🟢 3, 🔴 15, 4 unch. · avg -3.8%
uint8array-encoding.js — Interp: 🟢 7, 🔴 3, 8 unch. · avg -3.0% · Bytecode: 🟢 7, 🔴 8, 3 unch. · avg -4.9%
Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context. |
Suite Timing
Measured on ubuntu-latest x64. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
source/units/Goccia.Values.StringObjectValue.pas (1)
1843-1864:⚠️ Potential issue | 🟠 Major
codePointAtnow rejects lone-surrogate strings.This path only succeeds when the selected index starts a well-formed UTF-8 sequence. After this PR, strings can still contain lone surrogates/ill-formed UTF-8 bytes (
source/shared/JSONParser.pasnow routes\uXXXXthroughCodePointToUTF8, whilesource/shared/TextSemantics.pasrejects those bytes inTryReadUTF8CodePoint), so'\uD800'.codePointAt(0)will fall into the failure branch and returnundefinedinstead of the surrogate code unit. It also makes any index that lands inside a multibyte character returnundefined, which is inconsistent with the runtime's current byte-indexed string operations.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/units/Goccia.Values.StringObjectValue.pas` around lines 1843 - 1864, The current code in codePointAt (using ExtractStringValue, ToIntegerFromArgs and TryReadUTF8CodePoint) returns undefined when TryReadUTF8CodePoint fails (which rejects lone surrogates or when the index falls inside a multibyte sequence); change the failure path to preserve the runtime's previous byte-indexed behavior by falling back to returning the raw byte/code-unit at the requested position instead of undefined: after TryReadUTF8CodePoint fails, read the single byte at StringValue[Index+1] and return that numeric value as the code point (using the same Result/TGocciaUndefinedLiteralValue flow), so lone-surrogate inputs and indices inside multibyte characters yield the original byte/code-unit value.
🧹 Nitpick comments (2)
tests/built-ins/String/prototype/toLocaleLowerCase.js (1)
12-20: Add nullish-receiver error assertions for completeness.This suite should also verify
String.prototype.toLocaleLowerCase.call(null/undefined)throwsTypeErrorto lock inRequireObjectCoerciblebehavior.Suggested additional test
describe("String.prototype.toLocaleLowerCase", () => { @@ test("coerces non-string receivers", () => { expect(String.prototype.toLocaleLowerCase.call(true)).toBe("true"); }); + + test("throws for nullish receivers", () => { + expect(() => String.prototype.toLocaleLowerCase.call(null)).toThrow(TypeError); + expect(() => String.prototype.toLocaleLowerCase.call(undefined)).toThrow(TypeError); + });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/built-ins/String/prototype/toLocaleLowerCase.js` around lines 12 - 20, Add assertions that calling String.prototype.toLocaleLowerCase with nullish receivers throws a TypeError: update the "coerces non-string receivers"/coercion tests in the String.prototype.toLocaleLowerCase suite to include expect(() => String.prototype.toLocaleLowerCase.call(null)).toThrow(TypeError) and expect(() => String.prototype.toLocaleLowerCase.call(undefined)).toThrow(TypeError) so the test locks in RequireObjectCoercible behavior for String.prototype.toLocaleLowerCase.tests/built-ins/String/prototype/trimEnd.js (1)
16-21: Expand this Unicode whitespace vector to the full ECMAScript set.Line 17 currently checks only part of the required set; missing members (like
\u2001-\u200A, tab/CR/LF) could regress without failing this test.Suggested test vector expansion
- const whitespace = "\u00A0\u1680\u2000\u2028\u2029\u202F\u205F\u3000\uFEFF"; + const whitespace = + "\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680" + + "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A" + + "\u2028\u2029\u202F\u205F\u3000\uFEFF";🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/built-ins/String/prototype/trimEnd.js` around lines 16 - 21, The test "String.prototype.trimEnd removes ECMAScript Unicode whitespace" currently uses an incomplete whitespace vector; update the local variable whitespace in that test to include the full ECMAScript whitespace set (e.g. U+0009 TAB, U+000A LF, U+000B VT, U+000C FF, U+000D CR, U+0020 SPACE, U+00A0 NO-BREAK SPACE, U+1680 OGHAM SPACE MARK, U+2000..U+200A EN QUAD..HAIR SPACE, U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, U+202F NARROW NO-BREAK, U+205F MEDIUM MATHEMATICAL SPACE, U+3000 IDEOGRAPHIC SPACE, U+FEFF ZERO WIDTH NO-BREAK) so the test variable whitespace covers every required codepoint and the assertion using (whitespace + "hello" + whitespace).trimEnd() remains valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/shared/TextSemantics.pas`:
- Around line 305-317: The loop in ToWellFormedUTF8 calls TryReadUTF8CodePoint
and on failure skips the entire probed ByteLength, which can discard valid
bytes; instead compute the actual invalid subpart length and advance by that. On
failure, set n := 1 and then while (n < ByteLength) and (Index + n <=
Length(AText)) and (AText[Index + n] is a UTF-8 continuation byte, i.e. in
$80..$BF) do Inc(n); ensure n is at least 1 and does not exceed remaining
length, append UTF8_REPLACEMENT_CHARACTER, then Inc(Index, n) (use the same
Buffer.Append/Index/ByteLength symbols already in the diff). This preserves any
subsequent valid start bytes rather than skipping the optimistic probe width.
In `@tests/built-ins/String/prototype/replaceAll.js`:
- Around line 64-75: Update the test "replaceAll calls function replacer for
empty string search values" to assert the exact offsets array contents instead
of only its length: after collecting offsets in the offsets array inside the
replacer callback, replace expect(offsets.length).toBe(3) with an assertion that
offsets equals the sequence [0, 1, 2] (e.g., expect(offsets).toEqual([0, 1,
2])), ensuring the exact values and order produced by 'ab'.replaceAll('', ...)
are validated.
---
Outside diff comments:
In `@source/units/Goccia.Values.StringObjectValue.pas`:
- Around line 1843-1864: The current code in codePointAt (using
ExtractStringValue, ToIntegerFromArgs and TryReadUTF8CodePoint) returns
undefined when TryReadUTF8CodePoint fails (which rejects lone surrogates or when
the index falls inside a multibyte sequence); change the failure path to
preserve the runtime's previous byte-indexed behavior by falling back to
returning the raw byte/code-unit at the requested position instead of undefined:
after TryReadUTF8CodePoint fails, read the single byte at StringValue[Index+1]
and return that numeric value as the code point (using the same
Result/TGocciaUndefinedLiteralValue flow), so lone-surrogate inputs and indices
inside multibyte characters yield the original byte/code-unit value.
---
Nitpick comments:
In `@tests/built-ins/String/prototype/toLocaleLowerCase.js`:
- Around line 12-20: Add assertions that calling
String.prototype.toLocaleLowerCase with nullish receivers throws a TypeError:
update the "coerces non-string receivers"/coercion tests in the
String.prototype.toLocaleLowerCase suite to include expect(() =>
String.prototype.toLocaleLowerCase.call(null)).toThrow(TypeError) and expect(()
=> String.prototype.toLocaleLowerCase.call(undefined)).toThrow(TypeError) so the
test locks in RequireObjectCoercible behavior for
String.prototype.toLocaleLowerCase.
In `@tests/built-ins/String/prototype/trimEnd.js`:
- Around line 16-21: The test "String.prototype.trimEnd removes ECMAScript
Unicode whitespace" currently uses an incomplete whitespace vector; update the
local variable whitespace in that test to include the full ECMAScript whitespace
set (e.g. U+0009 TAB, U+000A LF, U+000B VT, U+000C FF, U+000D CR, U+0020 SPACE,
U+00A0 NO-BREAK SPACE, U+1680 OGHAM SPACE MARK, U+2000..U+200A EN QUAD..HAIR
SPACE, U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, U+202F NARROW
NO-BREAK, U+205F MEDIUM MATHEMATICAL SPACE, U+3000 IDEOGRAPHIC SPACE, U+FEFF
ZERO WIDTH NO-BREAK) so the test variable whitespace covers every required
codepoint and the assertion using (whitespace + "hello" + whitespace).trimEnd()
remains valid.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f0a2c5c4-87e1-4f4d-b098-069b3c0f2675
⛔ Files ignored due to path filters (2)
tests/language/modules/helpers/unicode-table.csvis excluded by!**/*.csvtests/language/modules/helpers/unicode-table.tsvis excluded by!**/*.tsv
📒 Files selected for processing (54)
docs/built-ins-data-formats.mddocs/built-ins.mddocs/testing.mdsource/app/GocciaBenchmarkRunner.dprsource/app/GocciaBundler.dprsource/app/GocciaREPL.dprsource/app/GocciaScriptLoader.dprsource/app/GocciaTestRunner.dprsource/shared/CLI.ConfigFile.Test.passource/shared/CLI.ConfigFile.passource/shared/JSONParser.passource/shared/TextSemantics.Test.passource/shared/TextSemantics.passource/units/Goccia.Builtins.JSON5.passource/units/Goccia.Bytecode.Binary.passource/units/Goccia.CSV.passource/units/Goccia.Compiler.Test.passource/units/Goccia.Coverage.Report.passource/units/Goccia.Engine.passource/units/Goccia.Error.Messages.passource/units/Goccia.JSON.passource/units/Goccia.JSONL.passource/units/Goccia.JSX.Transformer.passource/units/Goccia.Lexer.passource/units/Goccia.Modules.Configuration.Test.passource/units/Goccia.Modules.ContentProvider.Test.passource/units/Goccia.Modules.ContentProvider.passource/units/Goccia.Modules.Loader.passource/units/Goccia.Modules.Resolver.passource/units/Goccia.Parser.passource/units/Goccia.RegExp.Engine.passource/units/Goccia.SourceMap.Consumer.passource/units/Goccia.SourceMap.Test.passource/units/Goccia.TSV.passource/units/Goccia.TextFiles.Test.passource/units/Goccia.TextFiles.passource/units/Goccia.Utils.passource/units/Goccia.Values.Primitives.passource/units/Goccia.Values.StringObjectValue.passource/units/Goccia.YAML.pastests/built-ins/JSONL/parse.jstests/built-ins/String/prototype/charAt.jstests/built-ins/String/prototype/charCodeAt.jstests/built-ins/String/prototype/replace.jstests/built-ins/String/prototype/replaceAll.jstests/built-ins/String/prototype/substring.jstests/built-ins/String/prototype/toLocaleLowerCase.jstests/built-ins/String/prototype/toLocaleUpperCase.jstests/built-ins/String/prototype/toLowerCase.jstests/built-ins/String/prototype/toUpperCase.jstests/built-ins/String/prototype/trim.jstests/built-ins/String/prototype/trimEnd.jstests/built-ins/String/prototype/trimStart.jstests/language/modules/tabular-import.js
90ffa3a to
6ba93ef
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
tests/built-ins/String/prototype/replaceAll.js (1)
64-75:⚠️ Potential issue | 🟡 MinorAssert exact callback offsets, not just count.
Line 74 only checks the number of calls. That can pass even if offset values/order are wrong. Assert the exact sequence from
'ab'.replaceAll('', ...):[0, 1, 2].Suggested fix
expect(result).toBe('0a1b2'); - expect(offsets.length).toBe(3); + expect(offsets).toEqual([0, 1, 2]);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/built-ins/String/prototype/replaceAll.js` around lines 64 - 75, The test "replaceAll calls function replacer for empty string search values" currently only asserts offsets.length; change it to assert the exact sequence of callback offsets by checking offsets equals [0,1,2] (i.e., after collecting offsets in the offsets array, replace the expect(offsets.length).toBe(3) with an equality assertion that offsets is exactly [0, 1, 2]) so the test verifies both order and values from 'ab'.replaceAll('', ...).source/shared/TextSemantics.pas (1)
305-317:⚠️ Potential issue | 🟠 MajorDon't skip valid bytes after a malformed UTF-8 starter.
When
TryReadUTF8CodePointfails after settingByteLengthto2/3/4, this branch still advances by that full probed width. That drops valid bytes that happen to follow the bad prefix, e.g.#$E2 + '(' + #$A1gets collapsed into a single replacement instead of preserving the(. Advance only across the invalid subpart actually consumed by the malformed sequence, and add a regression for that shape.💡 Suggested fix
function ToWellFormedUTF8(const AText: string): string; var Buffer: TStringBuffer; ByteLength: Integer; CodePoint: Cardinal; + InvalidLength: Integer; Index: Integer; begin Buffer := TStringBuffer.Create(Length(AText)); Index := 1; while Index <= Length(AText) do @@ end else begin Buffer.Append(UTF8_REPLACEMENT_CHARACTER); - if ByteLength < 1 then - ByteLength := 1; - if Index + ByteLength - 1 > Length(AText) then - ByteLength := Length(AText) - Index + 1; - Inc(Index, ByteLength); + if ByteLength < 1 then + ByteLength := 1; + InvalidLength := 1; + while (InvalidLength < ByteLength) and + (Index + InvalidLength <= Length(AText)) and + ((Ord(AText[Index + InvalidLength]) and $C0) = $80) do + Inc(InvalidLength); + Inc(Index, InvalidLength); end; end; Result := Buffer.ToString; end;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/shared/TextSemantics.pas` around lines 305 - 317, When TryReadUTF8CodePoint fails in the decoding loop (the branch that appends UTF8_REPLACEMENT_CHARACTER), do not unconditionally advance Index by the probed ByteLength; instead compute the actual number of bytes consumed by the malformed sequence by scanning from Index+1 up to Index+ByteLength-1 and only including bytes that are valid UTF‑8 continuation bytes (0x80..0xBF), then advance Index by 1 + that count (or at least 1 if none). Update the failing branch around TryReadUTF8CodePoint/ByteLength/Index so Buffer.Append(UTF8_REPLACEMENT_CHARACTER) is followed by this limited-scan increment logic and add a regression test that verifies sequences like #$E2 + '(' + #$A1 preserve '(' (e.g., expecting replacement + '(' + replacement).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@source/shared/TextSemantics.pas`:
- Around line 305-317: When TryReadUTF8CodePoint fails in the decoding loop (the
branch that appends UTF8_REPLACEMENT_CHARACTER), do not unconditionally advance
Index by the probed ByteLength; instead compute the actual number of bytes
consumed by the malformed sequence by scanning from Index+1 up to
Index+ByteLength-1 and only including bytes that are valid UTF‑8 continuation
bytes (0x80..0xBF), then advance Index by 1 + that count (or at least 1 if
none). Update the failing branch around TryReadUTF8CodePoint/ByteLength/Index so
Buffer.Append(UTF8_REPLACEMENT_CHARACTER) is followed by this limited-scan
increment logic and add a regression test that verifies sequences like #$E2 +
'(' + #$A1 preserve '(' (e.g., expecting replacement + '(' + replacement).
In `@tests/built-ins/String/prototype/replaceAll.js`:
- Around line 64-75: The test "replaceAll calls function replacer for empty
string search values" currently only asserts offsets.length; change it to assert
the exact sequence of callback offsets by checking offsets equals [0,1,2] (i.e.,
after collecting offsets in the offsets array, replace the
expect(offsets.length).toBe(3) with an equality assertion that offsets is
exactly [0, 1, 2]) so the test verifies both order and values from
'ab'.replaceAll('', ...).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c73d41bd-1296-4ac0-aca6-4474c7deeecb
⛔ Files ignored due to path filters (2)
tests/language/modules/helpers/unicode-table.csvis excluded by!**/*.csvtests/language/modules/helpers/unicode-table.tsvis excluded by!**/*.tsv
📒 Files selected for processing (55)
docs/built-ins-data-formats.mddocs/built-ins.mddocs/testing.mdsource/app/GocciaBenchmarkRunner.dprsource/app/GocciaBundler.dprsource/app/GocciaREPL.dprsource/app/GocciaScriptLoader.dprsource/app/GocciaTestRunner.dprsource/shared/CLI.ConfigFile.Test.passource/shared/CLI.ConfigFile.passource/shared/JSONParser.passource/shared/TextSemantics.Test.passource/shared/TextSemantics.passource/units/Goccia.Builtins.JSON5.passource/units/Goccia.Bytecode.Binary.passource/units/Goccia.CSV.passource/units/Goccia.Compiler.Test.passource/units/Goccia.Coverage.Report.passource/units/Goccia.Engine.passource/units/Goccia.Error.Messages.passource/units/Goccia.JSON.passource/units/Goccia.JSONL.passource/units/Goccia.JSX.Transformer.passource/units/Goccia.Lexer.passource/units/Goccia.Modules.Configuration.Test.passource/units/Goccia.Modules.ContentProvider.Test.passource/units/Goccia.Modules.ContentProvider.passource/units/Goccia.Modules.Loader.passource/units/Goccia.Modules.Resolver.passource/units/Goccia.Parser.passource/units/Goccia.RegExp.Engine.passource/units/Goccia.SourceMap.Consumer.passource/units/Goccia.SourceMap.Test.passource/units/Goccia.TSV.passource/units/Goccia.TextFiles.Test.passource/units/Goccia.TextFiles.passource/units/Goccia.Utils.passource/units/Goccia.Values.ObjectValue.Test.passource/units/Goccia.Values.Primitives.passource/units/Goccia.Values.StringObjectValue.passource/units/Goccia.YAML.pastests/built-ins/JSONL/parse.jstests/built-ins/String/prototype/charAt.jstests/built-ins/String/prototype/charCodeAt.jstests/built-ins/String/prototype/replace.jstests/built-ins/String/prototype/replaceAll.jstests/built-ins/String/prototype/substring.jstests/built-ins/String/prototype/toLocaleLowerCase.jstests/built-ins/String/prototype/toLocaleUpperCase.jstests/built-ins/String/prototype/toLowerCase.jstests/built-ins/String/prototype/toUpperCase.jstests/built-ins/String/prototype/trim.jstests/built-ins/String/prototype/trimEnd.jstests/built-ins/String/prototype/trimStart.jstests/language/modules/tabular-import.js
✅ Files skipped from review due to trivial changes (15)
- source/app/GocciaREPL.dpr
- source/units/Goccia.Modules.ContentProvider.pas
- source/units/Goccia.Modules.ContentProvider.Test.pas
- tests/built-ins/String/prototype/toLowerCase.js
- source/units/Goccia.Modules.Loader.pas
- tests/built-ins/String/prototype/toLocaleUpperCase.js
- source/units/Goccia.Values.Primitives.pas
- source/units/Goccia.Compiler.Test.pas
- tests/built-ins/String/prototype/toLocaleLowerCase.js
- source/units/Goccia.JSX.Transformer.pas
- source/units/Goccia.Builtins.JSON5.pas
- source/units/Goccia.Lexer.pas
- source/units/Goccia.Engine.pas
- docs/built-ins.md
- source/units/Goccia.Coverage.Report.pas
🚧 Files skipped from review as they are similar to previous changes (24)
- source/units/Goccia.Error.Messages.pas
- tests/built-ins/String/prototype/toUpperCase.js
- tests/built-ins/String/prototype/substring.js
- tests/built-ins/String/prototype/trimStart.js
- tests/built-ins/JSONL/parse.js
- tests/built-ins/String/prototype/trim.js
- tests/built-ins/String/prototype/trimEnd.js
- source/units/Goccia.JSON.pas
- source/app/GocciaBenchmarkRunner.dpr
- source/units/Goccia.Bytecode.Binary.pas
- source/units/Goccia.Modules.Resolver.pas
- source/units/Goccia.JSONL.pas
- source/app/GocciaBundler.dpr
- tests/built-ins/String/prototype/charCodeAt.js
- docs/built-ins-data-formats.md
- source/units/Goccia.RegExp.Engine.pas
- tests/built-ins/String/prototype/replace.js
- tests/language/modules/tabular-import.js
- source/units/Goccia.Utils.pas
- source/units/Goccia.TextFiles.Test.pas
- source/units/Goccia.Modules.Configuration.Test.pas
- source/units/Goccia.SourceMap.Test.pas
- source/units/Goccia.TextFiles.pas
- source/units/Goccia.Values.StringObjectValue.pas
There was a problem hiding this comment.
🧹 Nitpick comments (1)
source/shared/TextSemantics.pas (1)
294-327: TheToWellFormedUTF8fix addresses the prior concern but still has an edge case.The updated logic at lines 314-323 now computes
InvalidLengthby counting continuation bytes, which is an improvement over the prior approach. However, the condition at line 321 (if Index + InvalidLength - 1 > Length(AText)) can never be true after thewhileloop guard at line 318 already ensuresIndex + InvalidLength <= Length(AText). This check is dead code.More importantly, for a lead byte like
$E2(3-byte sequence) followed by a non-continuation byte (e.g.,$E2 + 'A'),ByteLengthis set to 3, but thewhileloop exits immediately whenIsContinuationBytereturns false for'A'. This correctly advances by 1 and preserves'A'. The test at line 76-77 (ToWellFormedUTF8(#$E2 + 'AB')→UTF8_REPLACEMENT_CHARACTER + 'AB') confirms this works.However, consider a truncated sequence at end of string:
#$E2#$80(incomplete 3-byte). The loop setsInvalidLength := 1, then increments to 2 (since#$80is a continuation). It advances by 2, which is correct. The dead code at 321-322 can be safely removed.🧹 Remove dead code
while (InvalidLength < ByteLength) and (Index + InvalidLength <= Length(AText)) and IsContinuationByte(AText, Index + InvalidLength) do Inc(InvalidLength); - if Index + InvalidLength - 1 > Length(AText) then - InvalidLength := Length(AText) - Index + 1; Inc(Index, InvalidLength);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/shared/TextSemantics.pas` around lines 294 - 327, In ToWellFormedUTF8, remove the dead conditional that checks "if Index + InvalidLength - 1 > Length(AText) then InvalidLength := Length(AText) - Index + 1" because the preceding while loop already enforces Index + InvalidLength <= Length(AText); locate the block using the symbols ToWellFormedUTF8, InvalidLength, ByteLength and Index and delete that if/then branch so the function simply advances Index by InvalidLength after counting continuation bytes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@source/shared/TextSemantics.pas`:
- Around line 294-327: In ToWellFormedUTF8, remove the dead conditional that
checks "if Index + InvalidLength - 1 > Length(AText) then InvalidLength :=
Length(AText) - Index + 1" because the preceding while loop already enforces
Index + InvalidLength <= Length(AText); locate the block using the symbols
ToWellFormedUTF8, InvalidLength, ByteLength and Index and delete that if/then
branch so the function simply advances Index by InvalidLength after counting
continuation bytes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2af0d32b-b081-43f8-ac16-e1f15b7de563
📒 Files selected for processing (7)
source/shared/TextSemantics.Test.passource/shared/TextSemantics.passource/units/Goccia.Values.StringObjectValue.pastests/built-ins/String/prototype/codePointAt.jstests/built-ins/String/prototype/replaceAll.jstests/built-ins/String/prototype/toLocaleLowerCase.jstests/built-ins/String/prototype/trimEnd.js
✅ Files skipped from review due to trivial changes (1)
- tests/built-ins/String/prototype/toLocaleLowerCase.js
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/built-ins/String/prototype/trimEnd.js
- tests/built-ins/String/prototype/replaceAll.js
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
source/units/Goccia.Shims.pas (1)
73-74: Inconsistent indentation inside the embedded JavaScript.The
const b0declaration at line 74 has 4-space indentation but should be at 6 spaces since it's inside theelseblock. This doesn't affect runtime behavior but harms readability when debugging the shim source.📝 Suggested indentation fix
- ' } else {'#10 + - ' const b0: number = str.charCodeAt(i);'#10 + + ' } else {'#10 + + ' const b0: number = str.charCodeAt(i);'#10 +🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/units/Goccia.Shims.pas` around lines 73 - 74, Adjust the embedded JavaScript indentation so declarations inside the else block are aligned with the block body: move the "const b0: number = str.charCodeAt(i);" line to use 6 spaces (same indentation level as other lines inside the else) rather than 4 spaces so it visually sits inside the "else" block in the embedded JS string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/units/Goccia.Shims.pas`:
- Around line 68-102: The btoa shim's UTF-8 decoding logic (the block that uses
skip, b0, cp, width and pushes decoded values into codePoints) is incorrect;
remove that whole multi-byte detection section (the code referencing skip, b0,
cp, width and the if/else chains) and replace it with a simple iteration that
collects UTF-16 code units by calling str.charCodeAt(i) for each index and
pushing into codePoints (keep the existing codePoints variable), then ensure
downstream logic that converts codePoints to bytes enforces the WHATWG rule that
each code point must be ≤ 0xFF before base64 encoding.
---
Nitpick comments:
In `@source/units/Goccia.Shims.pas`:
- Around line 73-74: Adjust the embedded JavaScript indentation so declarations
inside the else block are aligned with the block body: move the "const b0:
number = str.charCodeAt(i);" line to use 6 spaces (same indentation level as
other lines inside the else) rather than 4 spaces so it visually sits inside the
"else" block in the embedded JS string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 950aaf83-94ed-42e7-99c3-85edd9af2c4d
📒 Files selected for processing (1)
source/units/Goccia.Shims.pas
8987205 to
17442dd
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
source/units/Goccia.Values.Iterator.Concrete.pas (1)
249-272: Optional: extract shared “read current code point + advance” helper.
AdvanceNextandDirectNextnow duplicate the sameByteLength/Copy/Incblock. A private helper would reduce drift risk in future iterator changes.Also applies to: 274-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/units/Goccia.Values.Iterator.Concrete.pas` around lines 249 - 272, AdvanceNext and DirectNext duplicate the logic that computes ByteLength, extracts the UTF‑8 code point (Copy) and increments FIndex; create a private helper method (e.g., ReadAndAdvanceCodePoint or GetNextCodePoint) on TGocciaStringIteratorValue that encapsulates: obtain StrVal from FSource, check FIndex vs Length(StrVal), compute TextSemantics.UTF8SequenceLengthAt, return the extracted substring and advance FIndex (and set FDone when at end) so both AdvanceNext and DirectNext call this helper and preserve existing return semantics (CreateIteratorResult with the same values).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@source/units/Goccia.Values.Iterator.Concrete.pas`:
- Around line 249-272: AdvanceNext and DirectNext duplicate the logic that
computes ByteLength, extracts the UTF‑8 code point (Copy) and increments FIndex;
create a private helper method (e.g., ReadAndAdvanceCodePoint or
GetNextCodePoint) on TGocciaStringIteratorValue that encapsulates: obtain StrVal
from FSource, check FIndex vs Length(StrVal), compute
TextSemantics.UTF8SequenceLengthAt, return the extracted substring and advance
FIndex (and set FDone when at end) so both AdvanceNext and DirectNext call this
helper and preserve existing return semantics (CreateIteratorResult with the
same values).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 16519d89-470e-4e81-9ea3-3f4ec96e7fe3
📒 Files selected for processing (3)
source/units/Goccia.Shims.passource/units/Goccia.Values.Iterator.Concrete.pastests/built-ins/String/prototype/symbol-iterator.js
✅ Files skipped from review due to trivial changes (1)
- tests/built-ins/String/prototype/symbol-iterator.js
Summary
TextSemanticsas the shared home for UTF-8 retagging, newline normalization, ECMAScript whitespace, Unicode case mapping, replacement expansion, UTF-8 iteration, and well-formed string handling.Goccia.TextFilesback to file I/O only and remove pass-through wrappers so callers useTextSemanticsdirectly.Testing