Update CaseInsensitive table to Unicode 9.0 #2367

dilijev · 2017-01-12T07:30:51Z

ECMA-262 specifies that implementations should use the latest version of the Unicode Standard.
As of this writing, that is Unicode 9.0.

@tcare

…nicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Merge pull request #2356 from dilijev:unicase

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

## CaseInsensitive mapping generator tool

PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs.

Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

# Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

* 3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

* abb5d91
* 4894d24
* 25049de

Added new codepoints:

* f197902 - GREEK LETTER YOT
* af2d083 - Cyrillic
* cba5439 - Cherokee
* 6c25a51 - Latin extensions

Other tests and fixes:

* cb736ab - Add test cases from #517 to ensure those issues are fixed.
* fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
* dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
* d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

# Test Coverage

* **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures`
* Internal and slow tests pass.
* Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass.

# Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
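For readers unfamiliar with how a case-insensitive table can be derived from UnicodeData.txt, here is a minimal Python sketch of the general idea. It is not the generator tool linked above, and it does not reproduce ChakraCore's table format. The function names (`parse_simple_mappings`, `equivalence_classes`, `sorted_deltas`) are hypothetical, the four sample lines are only illustrative input in the UnicodeData.txt field layout, and "deltas" is assumed here to mean the offsets from a code point to the other members of its case-equivalence class.

```python
from collections import defaultdict

# Illustrative input in UnicodeData.txt layout (15 semicolon-separated fields).
# Fields 12 and 13 hold the simple uppercase and simple lowercase mappings.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
037F;GREEK CAPITAL LETTER YOT;Lu;0;L;;;;;N;;;;03F3;
03F3;GREEK LETTER YOT;Ll;0;L;;;;;N;;;037F;;037F
"""


def parse_simple_mappings(text):
    """Yield (code_point, mapped_code_point) pairs from fields 12 and 13."""
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 15:
            continue  # skip blank or malformed lines
        code_point = int(fields[0], 16)
        for index in (12, 13):
            if fields[index]:
                yield code_point, int(fields[index], 16)


def equivalence_classes(pairs):
    """Group code points linked by any mapping, using a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for code_point in list(parent):
        groups[find(code_point)].append(code_point)
    return [sorted(group) for group in groups.values()]


def sorted_deltas(classes):
    """Map each code point to the ascending offsets of its other class members."""
    deltas = {}
    for group in classes:
        for code_point in group:
            deltas[code_point] = sorted(
                other - code_point for other in group if other != code_point
            )
    return deltas


if __name__ == "__main__":
    table = sorted_deltas(equivalence_classes(parse_simple_mappings(SAMPLE)))
    for code_point in sorted(table):
        print(f"U+{code_point:04X}: {table[code_point]}")
```

A real generator would also need to account for the differences between matching with and without the /u flag that the commits above mention; this sketch ignores them.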


dilijev · 2017-09-06T01:20:19Z

Removing "Bug" as this is not a regression, but is still a work item to follow up.

dilijev added the Task label Jan 12, 2017

dilijev added this to the Backlog milestone Jan 12, 2017

dilijev self-assigned this Jan 12, 2017

dilijev mentioned this issue Jan 12, 2017

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 #2356

Merged

dilijev modified the milestones: 1.6, Backlog May 17, 2017

dilijev added the Bug label May 17, 2017

dilijev mentioned this issue May 17, 2017

Update CaseInsensitive table to Unicode 10.0 #2984

Open

dilijev modified the milestones: 1.6, 1.6 candidates May 17, 2017

curtisman modified the milestones: vNext, 1.6 candidates May 18, 2017

dilijev removed the Bug label Sep 6, 2017

MikeHolman unassigned dilijev Jun 7, 2019
