Updating Unicode files to use 14.0.0#66362
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-globalization Issue Detailsfixes #44423 cc: @GrabYourPitchforks @tarekgh @stephentoub As part #61048 I needed to generate a casing table using the latest Unicode data (14.0) which wasn't yet being consumed by dotnet/runtime. This change will make the required changes to update to 14.0 so that in a follow-up I can add the required changes that will generate the casing table for Regex that I require. I basically followed the steps outlined both in the GenUnicodeProp readme as well as the STEW tools project to make these changes and then tested locally that both the Globalization and the STEW tests passed after the changes. Please let me know if I'm missing anything else as part of this change.
|
stephentoub
left a comment
There was a problem hiding this comment.
LGTM, but others should review as well. I'm intrigued that there are no tests being edited; does that mean we don't have any that depend on the data that's changing?
|
I'm not super familiar with the tests so can't tell for sure, but looks like System.Text.Encodings.Web.Tests project imports the modified source file It also seems like we are embedding the source UnicodeData.txt file into that test assembly, and it is loading it in a specific test and doing some validation: |
|
We also seem to be doing something similar in the Globalization tests: |
why including these tables? I think you need to remove Refers to: src/libraries/System.Private.CoreLib/src/System/Globalization/CharUnicodeInfoData.cs:1416 in 2db5f07. [](commit_id = 2db5f07, deletion_comment = False) |
|
@joperezr the test projects include the Unicode text file from the package path . Ensure this path is pointing at the correct version. |
I was only following the instructions over at https://github.com/dotnet/runtime/blob/2db5f079aa06a7bf453c1b8898dcb134214f9449/src%2Fcoreclr%2FSystem.Private.CoreLib%2FTools%2FGenUnicodeProp%2FReadme.md It isn't clear in the document weather or not CasingData should be includded by default in CharUnicodeInfoData.cs so I assumed it was. If it is not supposed to be, I can run it again without that parameter. |
Yes, after making my changes I built the relevant test projects and ensured that the resources were still embedded (meaning that the packagepath had not changed) |
|
If there's any changes to upper/lower casing, we should also update src\coreclr\pal\src\locale\unicodedata.cpp. IIRC, this file is generated by running src\coreclr\pal\src\locale\unicodedata.cs, giving it a path to UnicodeData.txt. unicodedata.cpp is used by e.g. the reflection stack to implement BindingFlags.IgnoreCase. |
I hope one day we consolidate this table with the table we generate inside corelib. The one in corelib is more optimized in size and access code speed too. |
Until they're consolidated, what prevents us from using the same approach on the managed side? |
nothing. It needs someone to spend some time doing it. I recall I had some offline discussion with @jkotas before when I was optimizing the ordinal operations, but it was not a priority to do anything in the runtime at that time. In addition, the current perf for this runtime table is acceptable as no one complained about it. |
The north start is to move all case-insensitive comparisons to C# and delete the casing table in the unmanaged runtime. |
|
@GrabYourPitchforks @tarekgh @jkotas @MichalStrehovsky I added as part of the PR in the last commit one markdown file that is making note of all of the instructions that I followed when updating to the 14.0 version, so that there is a simple guide to follow for next update. Could you PTAL to that comment to make sure I didn't miss anything that needs to be checked/updated and that I covered all steps correctly? |
src/libraries/System.Private.CoreLib/src/System/Globalization/Updating-Unicode-Versions.md
Show resolved
Hide resolved
tarekgh
left a comment
There was a problem hiding this comment.
I passed offline comment to Jose. Otherwise, LGTM.
|
Could you please also move the tool under libraries while you are on it? The tool is not coreclr specific, and so it should not live under coreclr. The tool is updating the tables in the shared CoreLib part under libraries and so it should live there too. |
fixes #44423
cc: @GrabYourPitchforks @tarekgh @stephentoub
As part #61048 I needed to generate a casing table using the latest Unicode data (14.0) which wasn't yet being consumed by dotnet/runtime. This change will make the required changes to update to 14.0 so that in a follow-up I can add the required changes that will generate the casing table for Regex that I require. I basically followed the steps outlined both in the GenUnicodeProp readme as well as the STEW tools project to make these changes and then tested locally that both the Globalization and the STEW tests passed after the changes. Please let me know if I'm missing anything else as part of this change.