Update unicode data to unicode 10.0.0 #13

Merged
stassats merged 2 commits into edicl:master
Jan 6, 2018

neil-lindquist
Contributor

@neil-lindquist neil-lindquist commented Jun 14, 2017

The current Unicode data is from 6.2.0, so this update would add the characters introduced over the past 5 years, including over 350 new emoji.

@neil-lindquist neil-lindquist changed the title from "Update unicode data to unicode 9.0.0" to "Update unicode data to unicode 10.0.0" on Jan 1, 2018
@neil-lindquist
Contributor Author

This pull request originally updated the data to Unicode 9, but since Unicode 10 is now available, I updated it to use those files instead.

@stassats
Member

stassats commented Jan 2, 2018

A lot of tests are now failing.

@neil-lindquist
Contributor Author

neil-lindquist commented Jan 2, 2018

I don't think updating the Unicode data is causing the failures. The same tests also fail when I run the test suite on the current master branch (commit 45d3ff1). I tried sbcl 1.4.2 and clisp 2.48, both on Windows 10.

I ran the tests using

sbcl --noprint --eval "(ql:quickload :cl-unicode/test)" --eval "(asdf:operate 'asdf:test-op :cl-unicode)" --eval "(quit)"

and

clisp -x "(ql:quickload :cl-unicode/test) (asdf:operate 'asdf:test-op :cl-unicode) (quit)"

with my checkout of cl-unicode in the local-projects directory of Quicklisp. The sbcl output was identical between the two commits. The clisp output differed only in the memory address printed in the return value of asdf:operate (e.g. #<ASDF/PLAN:SEQUENTIAL-PLAN #x1C6B93B1>).
sbcl_unicode5.txt
sbcl_unicode10.txt
clisp_unicode5.txt
clisp_unicode10.txt

@neil-lindquist
Contributor Author

I realized that I didn't run clean.cmd between test runs, so the derived properties tests weren't regenerated. After rerunning the tests with clean.cmd between each run, the outputs did differ. For the most part, the differences are just changes in test numbering and additional passing tests, which makes sense given that characters with derived properties were added. However, there is one new failure: (HAS-BINARY-PROPERTY (CHARACTER-NAMED "CHAM PUNCTUATION DOUBLE DANDA" :WANT-CODE-POINT-P T) "STerm") returned NIL
I'll start looking into this failure.

@neil-lindquist
Contributor Author

In Unicode 10, the long name alias of STerm was renamed to Sentence_Terminal (see http://unicode.org/reports/tr44/ under PropertyAliases.txt). The short name remained STerm, so the change amounts to adding a new alias for the property. However, cl-unicode doesn't load aliases from PropertyAliases.txt, so only the long name is recognized. I suspect adding support for PropertyAliases.txt would be preferred to breaking backwards compatibility.

I think this will entail adding another lookup table built from PropertyAliases.txt and running property names through it before looking them up in the current tables. Should that be part of this pull request or a separate one?
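
The lookup-table idea described above could be sketched roughly as follows. This is a minimal Common Lisp illustration under stated assumptions, not the code from this pull request: every function and variable name here is hypothetical, the sample lines only imitate the semicolon-separated layout of PropertyAliases.txt, and the name normalization is a simplified form of loose matching (case, spaces, hyphens and underscores are ignored).

```lisp
;;; Hypothetical sketch; none of these names come from cl-unicode.

(defun trim-field (string)
  "Strip surrounding spaces and tabs from STRING."
  (string-trim '(#\Space #\Tab) string))

(defun split-fields (line)
  "Drop any trailing # comment from LINE and split the rest on semicolons."
  (let ((data (subseq line 0 (position #\# line))))
    (loop with start = 0
          for end = (position #\; data :start start)
          collect (trim-field (subseq data start end))
          while end
          do (setf start (1+ end)))))

(defun loose-key (name)
  "Lower-case NAME and drop spaces, hyphens and underscores (simplified)."
  (string-downcase (remove-if (lambda (c) (find c " -_")) name)))

(defun build-alias-table (lines)
  "Map every alias on each line to that line's long name (the second field)."
  (let ((table (make-hash-table :test 'equal)))
    (dolist (line lines table)
      (let ((fields (split-fields line)))
        ;; Comment-only and blank lines yield fewer than two fields; skip them.
        (when (>= (length fields) 2)
          (let ((long-name (second fields)))
            (dolist (alias fields)
              (setf (gethash (loose-key alias) table) long-name))))))))

(defun resolve-property-name (name table)
  "Return the long name registered for NAME, or NAME itself if it is unknown."
  (values (gethash (loose-key name) table name)))

;; Sample input in the PropertyAliases.txt layout (short ; long ; other aliases).
(defparameter *sample-lines*
  '("# sample lines only"
    "STerm                    ; Sentence_Terminal"
    "scf                      ; Simple_Case_Folding         ; sfc"
    ""))

(defparameter *aliases* (build-alias-table *sample-lines*))
```

With such a table, a call like (resolve-property-name "STerm" *aliases*) would give the long name, which could then be passed to the existing property tables unchanged, so callers using the old name keep working.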

@neil-lindquist
Contributor Author

I've added property aliasing, which fixed the test failure caused by the renamed property.
I also ended up fixing a few more failing tests (like (STRING= "Basic Latin" (CODE-BLOCK 1)) returned NIL in the simple tests) because I initially thought they were new failures as well. The fix was a simple regex tweak to how lines are split when reading the data files.

The long name alias (effectively the main name) changed for STerm
so Property Aliasing is needed to seamlessly support the change.
@neil-lindquist neil-lindquist mentioned this pull request Jan 5, 2018
@stassats
Member

stassats commented Jan 6, 2018

I'm still getting
got an unexpected error: There is no property called "Changes_When_Casemapped".

@neil-lindquist
Contributor Author

I get that failure when running the current master branch as well. It's caused by the entries starting on line 5183 of DerivedCoreProperties.txt: they use the derived property Changes_When_Casemapped, which isn't defined (the similar failures come from similar undefined properties).

I've fixed the failures for Cased and Cased_Insensitive in a branch built off this one (https://github.com/neil-lindquist/cl-unicode/tree/fix-derived-tests), but the others require NFD normalization to be implemented, which CL-Unicode currently doesn't do (https://github.com/Ferada/cl-unicode/tree/decomposition-mapping does start implementing normalization).

@stassats stassats merged commit 85ee54a into edicl:master Jan 6, 2018