Codepoint sets in unicode.py must be updated #19

Closed
byllyfish opened this issue Jan 11, 2022 · 3 comments · Fixed by #20

Comments

@byllyfish
Owner

The code point sets in unicode.py are hard-wired and must be cross-checked against the current Unicode data files, including https://www.unicode.org/Public/UNIDATA/extracted/DerivedJoiningType.txt among others.
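As a rough illustration of what such a cross-check needs as input, here is a minimal sketch that reads text in the general Unicode Character Database layout (`code point or range ; value # comment`) into sets of inclusive `(start, end)` pairs. The function name `parse_ucd_ranges` and the sample text are hypothetical; the sample lines are made up for illustration and are not copied from DerivedJoiningType.txt.

```python
import re

# Matches "XXXX ; value" or "XXXX..YYYY ; value"; a trailing "# comment" is ignored.
_LINE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*([^#\s]+)")

def parse_ucd_ranges(text):
    """Return {value: set of inclusive (start, end) pairs} from UCD-style text."""
    tables = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if not m:
            continue  # blank line or comment-only line
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start
        tables.setdefault(m.group(3), set()).add((start, end))
    return tables

# Illustrative input only (an assumption, not real file contents).
SAMPLE = """\
# sample data in UCD style
10D00         ; L # single code point
1820..1878    ; D # range of code points
"""

tables = parse_ucd_ranges(SAMPLE)
print(tables)
```

The resulting sets could then be compared with the hard-wired tables; how unicode.py actually stores its tables is not shown in this issue.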

byllyfish added a commit that referenced this issue Jan 13, 2022
Update builtin CodepointSet tables based on Unicode 14.0 (#19)
@byllyfish
Owner Author

Here are the differences in the built-in tables from one Unicode version to the next, generated by a modified version of the check_codepoints.py program.

For example, the line 11.0.0 Join_Type=L {(68864, 68864)} <<< set() indicates that Unicode 11.0 assigned Join_Type=L to code point 68864. Each pair is an inclusive (start, end) range, and all code points are in decimal.

If there are ranges to the right of the <<<, they are the previous code point ranges that were extended or overlapped.
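The notation looks like Python set output, so one plausible way to produce such a line is a pair of set differences over `(start, end)` tuples. This is a guess at the approach, not the actual modified check_codepoints.py; the helper name `diff_line` is hypothetical.

```python
def diff_line(version, name, old, new):
    """Format one line: ranges only in `new` <<< ranges only in `old`."""
    return f"{version} {name} {new - old} <<< {old - new}"

# Assumed representation: each table is a set of inclusive (start, end) tuples.
old_hebrew = {(1520, 1524)}
new_hebrew = {(1519, 1524)}

print(diff_line("11.0.0", "Hebrew", old_hebrew, new_hebrew))
# 11.0.0 Hebrew {(1519, 1524)} <<< {(1520, 1524)}

print(diff_line("11.0.0", "Join_Type=L", set(), {(68864, 68864)}))
# 11.0.0 Join_Type=L {(68864, 68864)} <<< set()
```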

14.0.0 Default_Ignorable_Code_Point {(6155, 6159)} <<< {(6155, 6158)}
11.0.0 Hebrew {(1519, 1524)} <<< {(1520, 1524)}
11.0.0 Join_Type=D {(68865, 68897), (69457, 69459), (69424, 69426), (68899, 68899), (69428, 69444), (6176, 6264)} <<< {(6176, 6263)}
11.0.0 Join_Type=R {(69460, 69460), (68898, 68898), (69427, 69427)} <<< set()
11.0.0 Join_Type=L {(68864, 68864)} <<< set()
11.0.0 Join_Type=T {(70750, 70750), (70459, 70460), (68900, 68903), (70089, 70092), (2558, 2558), (69446, 69456), (73104, 73105), (3076, 3076), (73111, 73111), (72193, 72202), (73459, 73460), (2259, 2273), (71727, 71735), (71737, 71738), (73109, 73109), (2045, 2045), (43263, 43263)} <<< {(70090, 70092), (2260, 2273), (69821, 69821), (72201, 72202), (70460, 70460), (72193, 72198)}
12.0.0 Join_Type=T {(125252, 125259), (123184, 123190), (72148, 72151), (78896, 78904), (3764, 3772), (72160, 72160), (72154, 72155), (43452, 43453), (94031, 94031), (123628, 123631)} <<< {(125252, 125258), (3764, 3769), (3771, 3772), (43452, 43452)}
13.0.0 Join_Type=D {(2234, 2247), (69578, 69578), (69554, 69555), (69563, 69564), (69569, 69569), (69566, 69567), (69572, 69572), (69560, 69560), (69552, 69552)} <<< {(2234, 2237)}
13.0.0 Join_Type=R {(69556, 69558), (69577, 69577), (69565, 69565), (69570, 69571), (2134, 2136), (69561, 69562)} <<< set()
13.0.0 Join_Type=L {(69579, 69579)} <<< set()
13.0.0 Join_Type=T {(2901, 2902), (69291, 69292), (72003, 72003), (70095, 70095), (71995, 71996), (3457, 3457), (94180, 94180), (43052, 43052), (71998, 71998), (6832, 6848)} <<< {(6832, 6846), (2902, 2902)}
14.0.0 Join_Type=D {(69488, 69491), (69494, 69505), (2182, 2182), (2185, 2189), (2234, 2248), (2227, 2232)} <<< {(2227, 2228), (2234, 2247), (2230, 2232)}
14.0.0 Join_Type=R {(69492, 69493), (2160, 2178), (2190, 2190)} <<< set()
14.0.0 Join_Type=T {(69744, 69744), (7616, 7679), (6159, 6159), (2200, 2207), (69747, 69748), (118528, 118573), (69506, 69509), (123566, 123566), (3132, 3132), (6832, 6862), (5938, 5939), (69826, 69826), (2250, 2273), (118576, 118598)} <<< {(7675, 7679), (7616, 7673), (5938, 5940), (2259, 2273), (6832, 6848)}
11.0.0 Han {(19968, 40943)} <<< {(19968, 40938)}
12.0.0 Hiragana {(110928, 110930)} <<< set()
12.0.0 Katakana {(110948, 110951)} <<< set()
13.0.0 Han {(196608, 201546), (19968, 40956), (131072, 173789), (13312, 19903), (94192, 94193)} <<< {(131072, 173782), (19968, 40943), (13312, 19893)}
14.0.0 Hiragana {(110593, 110879)} <<< {(110593, 110878)}
14.0.0 Katakana {(110576, 110579), (110589, 110590), (110880, 110882), (110581, 110587)} <<< set()
14.0.0 Han {(173824, 177976), (19968, 40959), (131072, 173791), (94178, 94179)} <<< {(19968, 40956), (131072, 173789), (173824, 177972)}
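Because the listing is in decimal while the Unicode data files use hexadecimal, it can help to convert a line before comparing by eye. The following is a small sketch, assuming the line layout shown above; `parse_diff_line`, `_ranges`, and `to_hex` are hypothetical helper names.

```python
import ast

def _ranges(text):
    # Before Python 3.9, ast.literal_eval rejects "set()", so handle it explicitly.
    text = text.strip()
    return set() if text == "set()" else set(ast.literal_eval(text))

def parse_diff_line(line):
    """Split '<version> <name> <new ranges> <<< <previous ranges>' into parts."""
    left, right = line.split(" <<< ")
    version, name, added = left.split(" ", 2)
    return version, name, _ranges(added), _ranges(right)

def to_hex(ranges):
    """Render inclusive (start, end) pairs in U+XXXX notation, sorted by start."""
    return [
        f"U+{a:04X}" if a == b else f"U+{a:04X}..U+{b:04X}"
        for a, b in sorted(ranges)
    ]

version, name, added, previous = parse_diff_line(
    "11.0.0 Hebrew {(1519, 1524)} <<< {(1520, 1524)}"
)
print(version, name, to_hex(added), "<<<", to_hex(previous))
# 11.0.0 Hebrew ['U+05EF..U+05F4'] <<< ['U+05F0..U+05F4']
```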

byllyfish reopened this Jan 15, 2022
@byllyfish
Owner Author

Still left to do:

  1. Describe the future procedure for updating and checking the Unicode tables in the README.

@byllyfish
Owner Author

Fixed in 1.0.4.
