Preserve UTF-8 in the TOML compliance harness on Windows#187
Conversation
📝 WalkthroughWalkthroughThe changes fix UTF-8 character handling in TOML parsing by modifying the file loader to explicitly preserve raw UTF-8 bytes through code page assignment, updating the function signature accordingly, adding documentation of Windows-specific behavior, and introducing a test case to verify Unicode key-value preservation. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
units/Goccia.TOML.Test.pas (1)
156-194: Consider adding a paired invalid-UTF-8 rejection regression.This test covers valid UTF-8 preservation well. Given the same incident included an invalid-encoding false-accept, adding one negative case here would make the protection more complete.
➕ Suggested companion test sketch
+procedure TTOMLParserTests.TestParseDocumentRejectsInvalidUTF8Input; +const + UTF8_CODE_PAGE = 65001; +var + Parser: TGocciaTOMLParser; + RawSourceText: RawByteString; + SourceText: string; +begin + // Invalid UTF-8 sequence: continuation byte without a valid leading byte + RawSourceText := 'key = "' + #$80 + '"' + LineEnding; + SetCodePage(RawSourceText, UTF8_CODE_PAGE, False); + SourceText := RawSourceText; + + Parser := TGocciaTOMLParser.Create; + try + ExpectException(procedure begin + Parser.ParseDocument(SourceText); + end); + finally + Parser.Free; + end; +end;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@units/Goccia.TOML.Test.pas` around lines 156 - 194, Add a companion negative regression test to TTOMLParserTests.TestParseDocumentPreservesUTF8BytesForUnicodeKeysAndValues that constructs a RawSourceText containing intentionally invalid UTF-8 byte sequences (e.g. truncated/malformed multi-byte sequences), sets the code page like the positive test, then calls TGocciaTOMLParser.ParseDocument and asserts the parser rejects the input (for example by raising the expected exception or returning a failed parse) instead of accepting a garbled key/value; place the check next to the existing positive test and reference the same helpers (RawSourceText, SetCodePage, ParseDocument, GetChildOrFail) to locate where to add the new negative test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@units/Goccia.TOML.Test.pas`:
- Around line 156-194: Add a companion negative regression test to
TTOMLParserTests.TestParseDocumentPreservesUTF8BytesForUnicodeKeysAndValues that
constructs a RawSourceText containing intentionally invalid UTF-8 byte sequences
(e.g. truncated/malformed multi-byte sequences), sets the code page like the
positive test, then calls TGocciaTOMLParser.ParseDocument and asserts the parser
rejects the input (for example by raising the expected exception or returning a
failed parse) instead of accepting a garbled key/value; place the check next to
the existing positive test and reference the same helpers (RawSourceText,
SetCodePage, ParseDocument, GetChildOrFail) to locate where to add the new
negative test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f80d76e2-8606-44fc-aa2c-24d37a76450d
📒 Files selected for processing (3)
docs/testing.mdscripts/GocciaTOMLCheck.dprunits/Goccia.TOML.Test.pas
Benchmark Results274 benchmarks Interpreted: 🟢 5 improved · 🔴 188 regressed · 81 unchanged · avg -2.3% arraybuffer.js — Interp: 🔴 8, 6 unch. · avg -2.2% · Bytecode: 🟢 4, 🔴 8, 2 unch. · avg -1.1%
arrays.js — Interp: 🔴 14, 5 unch. · avg -1.6% · Bytecode: 🟢 3, 🔴 7, 9 unch. · avg -0.7%
async-await.js — Interp: 🔴 5, 1 unch. · avg -3.6% · Bytecode: 🔴 5, 1 unch. · avg -2.4%
classes.js — Interp: 🟢 1, 🔴 19, 11 unch. · avg -1.4% · Bytecode: 🟢 4, 🔴 1, 26 unch. · avg +0.4%
closures.js — Interp: 🔴 10, 1 unch. · avg -3.7% · Bytecode: 🟢 2, 🔴 1, 8 unch. · avg +0.9%
collections.js — Interp: 🔴 9, 3 unch. · avg -2.2% · Bytecode: 🟢 2, 🔴 3, 7 unch. · avg -0.8%
destructuring.js — Interp: 🔴 16, 6 unch. · avg -2.2% · Bytecode: 🟢 3, 🔴 9, 10 unch. · avg -1.3%
fibonacci.js — Interp: 🔴 6, 2 unch. · avg -3.7% · Bytecode: 🟢 3, 🔴 3, 2 unch. · avg +0.2%
for-of.js — Interp: 🔴 6, 1 unch. · avg -3.5% · Bytecode: 🟢 1, 🔴 4, 2 unch. · avg -1.2%
helpers/bench-module.js — Interp: 0 · Bytecode: 0
iterators.js — Interp: 🟢 1, 🔴 14, 5 unch. · avg -2.5% · Bytecode: 🟢 5, 🔴 4, 11 unch. · avg +0.3%
json.js — Interp: 🟢 1, 🔴 7, 12 unch. · avg -1.5% · Bytecode: 🟢 8, 🔴 1, 11 unch. · avg +1.2%
jsx.jsx — Interp: 🔴 16, 5 unch. · avg -2.1% · Bytecode: 🟢 7, 🔴 4, 10 unch. · avg +0.7%
modules.js — Interp: 🔴 6, 3 unch. · avg -2.5% · Bytecode: 🟢 2, 🔴 5, 2 unch. · avg +0.5%
numbers.js — Interp: 🔴 9, 2 unch. · avg -3.8% · Bytecode: 🔴 10, 1 unch. · avg -7.0%
objects.js — Interp: 🔴 4, 3 unch. · avg -2.6% · Bytecode: 🟢 2, 🔴 1, 4 unch. · avg +0.5%
promises.js — Interp: 🔴 12 · avg -3.7% · Bytecode: 🔴 9, 3 unch. · avg -2.2%
regexp.js — Interp: 🟢 1, 🔴 5, 5 unch. · avg -1.6% · Bytecode: 🔴 4, 7 unch. · avg -0.8%
strings.js — Interp: 🟢 1, 🔴 5, 5 unch. · avg -1.0% · Bytecode: 🔴 11 · avg -5.8%
typed-arrays.js — Interp: 🔴 17, 5 unch. · avg -2.4% · Bytecode: 🟢 3, 🔴 8, 11 unch. · avg +1.1%
Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context. |
Suite Timing
Measured on ubuntu-latest x64. |
|
Superseded by #188. Closing the earlier Windows TOML follow-up so there is a single active fix PR. |
Summary
scripts/GocciaTOMLCheck.dprinstead of letting Windows convert them through the local ANSI code pageWhy
The post-merge CI run for
mainstill failed intoml-complianceonx86_64-win64andi386-win32after the multiline-LF fix. The remaining 16 failures were all Unicode-related suite cases (quoted-unicode,unicode-escape,multibyte, etc.), plus one invalid encoding case that became a false accept after the bytes were mangled.The root cause was the standalone compliance harness reading TOML files as UTF-8 bytes and then converting them into a plain
stringin a way that lost the UTF-8 code page on Windows. That produced mojibake likeJoséand key mismatches in the compliance JSON.Verification
/tmp/goccia-toml-test-bin/Goccia.TOML.Testpython3 scripts/run_toml_test_suite.py --harness=/tmp/goccia-toml-harness-bin/GocciaTOMLCheck./build/TestRunner tests/built-ins/TOML/parse.js./format.pas --checkNotes
A full local
./build.pas testscurrently hits an unrelated linker failure inGoccia.Builtins.TestAssertions.Test.pason this machine, so verification here was kept to the directly affected TOML targets.Summary by CodeRabbit
Bug Fixes
Documentation
Tests