Update unicode data to unicode 10.0.0 #13

neil-lindquist · 2017-06-14T02:03:54Z

The current Unicode data is from 6.2.0, so this would add the characters introduced over the past 5 years, including over 350 new emoji.

neil-lindquist · 2018-01-01T22:00:37Z

This pull request originally updated the data to Unicode 9, but since Unicode 10 is available, I updated it to use those files instead.

stassats · 2018-01-02T00:17:23Z

A lot of tests are now failing.

neil-lindquist · 2018-01-02T01:16:19Z

I don't think updating the Unicode data is causing the failures. The failing tests also fail when I run the tests from the current master branch (commit 45d3ff1). I tried on sbcl 1.4.2 and clisp 2.48, both on Windows 10.

I ran the tests using sbcl --noprint --eval "(ql:quickload :cl-unicode/test)" --eval "(asdf:operate 'asdf:test-op :cl-unicode)" --eval "(quit)" and clisp -x "(ql:quickload :cl-unicode/test) (asdf:operate 'asdf:test-op :cl-unicode) (quit)", with my cl-unicode repository in the local-projects directory of quicklisp. The sbcl output did not differ between the commits. The clisp output differed only in the memory address printed in the return value of asdf:operate (i.e. #<ASDF/PLAN:SEQUENTIAL-PLAN #x1C6B93B1>).
sbcl_unicode5.txt
sbcl_unicode10.txt
clisp_unicode5.txt
clisp_unicode10.txt

neil-lindquist · 2018-01-02T01:46:12Z

I realized that I didn't run clean.cmd between test runs (and thus the derived properties tests weren't refreshed). After rerunning the tests with clean.cmd between each run, there were differences between the runs. For the most part, they are just changes in the numbering and the addition of more passing tests (which makes sense, given that characters with derived properties were added). However, there is one new failure: (HAS-BINARY-PROPERTY (CHARACTER-NAMED "CHAM PUNCTUATION DOUBLE DANDA" :WANT-CODE-POINT-P T) "STerm") returned NIL
I'll start looking into this failure.

neil-lindquist · 2018-01-02T02:24:46Z

In Unicode 10, the long name alias of STerm was renamed to Sentence_Terminal (see http://unicode.org/reports/tr44/ under PropertyAliases.txt). The short name remained STerm, so the change is effectively the same as adding a new alias for the property. However, cl-unicode doesn't load aliases from PropertyAliases.txt, so only the long name is used. I suspect adding support for PropertyAliases.txt would be preferred to breaking backwards compatibility.

I think this will entail adding another lookup table built from PropertyAliases.txt and running property names through that before looking them up in the current tables. Should that be a part of this pull request or a separate one?
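
A minimal sketch of that idea, assuming each data line of PropertyAliases.txt has the form "short ; long [; other aliases]" with # comments. The function names (canonicalize-name, split-fields, read-property-aliases, resolve-property-name) are hypothetical and are not taken from cl-unicode's code:

```lisp
;; Hypothetical sketch; names and structure are illustrative only.

(defun canonicalize-name (name)
  "Loose matching: drop spaces, hyphens and underscores, and ignore case."
  (string-downcase
   (remove-if (lambda (c) (member c '(#\Space #\- #\_))) name)))

(defun split-fields (line)
  "Strip any # comment from LINE, split at semicolons, trim whitespace."
  (let ((line (subseq line 0 (position #\# line))))
    (loop with start = 0
          for end = (position #\; line :start start)
          collect (string-trim '(#\Space #\Tab) (subseq line start end))
          while end
          do (setf start (1+ end)))))

(defun read-property-aliases (stream)
  "Return a hash table mapping every canonicalized alias to the long name."
  (let ((table (make-hash-table :test 'equal)))
    (loop for line = (read-line stream nil)
          while line
          do (let ((fields (split-fields line)))
               ;; Data lines have at least a short and a long name.
               (when (>= (length fields) 2)
                 (let ((long-name (second fields)))
                   (dolist (alias fields)
                     (setf (gethash (canonicalize-name alias) table)
                           long-name))))))
    table))

(defun resolve-property-name (name table)
  "Return the long name for NAME, or NAME unchanged if no alias is known."
  (or (gethash (canonicalize-name name) table) name))
```

With a table built this way, a lookup for "STerm" would first resolve to "Sentence_Terminal" and then go through the existing tables, so code written against the old name keeps working.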

neil-lindquist · 2018-01-03T04:58:46Z

I've added property aliasing, which fixed the test failure caused by the renamed property.
I ended up fixing a few more failing tests (like (STRING= "Basic Latin" (CODE-BLOCK 1)) returned NIL in the simple tests) because I started thinking that they were also new. That was a simple regex tweak to the line splitting done when reading data.

The long name alias (effectively the main name) changed for STerm so Property Aliasing is needed to seamlessly support the change.

stassats · 2018-01-06T18:57:19Z

I'm still getting
got an unexpected error: There is no property called "Changes_When_Casemapped".

neil-lindquist · 2018-01-06T19:06:43Z

I get that failure when running the current master branch. It's caused by the fields starting on line 5183 in DerivedCoreProperties.txt: they refer to the derived property Changes_When_Casemapped, which isn't defined (the similar failures come from similar undefined properties).

I've fixed the failures for Cased and Cased_Insensitive in a branch built off this one (https://github.com/neil-lindquist/cl-unicode/tree/fix-derived-tests), but the others require NFD normalization to be implemented, which CL-Unicode currently doesn't do (https://github.com/Ferada/cl-unicode/tree/decomposition-mapping does start implementing normalization).

Update unicode data to unicode 10.0.0

2ec989c

neil-lindquist changed the title ~~Update unicode data to unicode 9.0.0~~ Update unicode data to unicode 10.0.0 Jan 1, 2018

Support Property Aliasing

3aa8a32

The long name alias (effectively the main name) changed for STerm so Property Aliasing is needed to seamlessly support the change.

neil-lindquist mentioned this pull request Jan 5, 2018

Update docs #14

Open

stassats merged commit 85ee54a into edicl:master Jan 6, 2018
