Add support for Name and NameAlias #70

Merged: 2 commits merged into master from wip/names on Jun 23, 2022

Conversation

@wismill (Collaborator) commented May 13, 2022

Fixes #67

@wismill (Collaborator, Author) commented May 13, 2022

Note: I could not get GHC to compile the names with reasonable resources, whether using a naive case expression, a shorter one for ranges (names containing *), or Map.fromList. So I used Data.Binary.

@harendra-kumar (Member) left a comment

Can we use Map.fromList to avoid a dependency on binary?

I would prefer to not depend on text. The goal of unicode-data is to be sufficiently low level so that other packages can build upon it including text and streamly. For example, streamly does not use text at all, it uses arrays, so it should be able to use unicode-data without having to depend on text.

A stream may be a good representation, but in the absence of a standard stream representation we could either use an Addr# pointing to UTF-8 encoded static strings or a UTF-8 encoded bytearray. I think GHC stores string literals UTF-8 encoded, so I guess we do not need to do much: we can just define unboxed string literals (e.g. "hello world!"#) and hand out pointers to those. High-level packages can create any other types from this, e.g. streamly has this:

{-# INLINE fromCString# #-}
fromCString# :: Addr# -> Array Word8
fromCString# addr# = do
    let cstr = Ptr addr#
        len = unsafeInlineIO $ c_strlen cstr
    fromPtr (fromIntegral len) (castPtr cstr)
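
For instance, a minimal sketch of handing out such a pointer (the names here are purely illustrative, not part of the actual unicode-data API):

    {-# LANGUAGE MagicHash #-}

    import Foreign.C.String (CString, peekCString)
    import GHC.Ptr (Ptr (..))

    -- A statically allocated, UTF-8 encoded, NUL-terminated literal,
    -- handed out as a plain pointer; higher-level packages can build
    -- text, bytestrings, arrays, etc. on top of it.
    hello :: CString
    hello = Ptr "hello world!\0"#

    main :: IO ()
    main = peekCString hello >>= putStrLn  -- prints "hello world!"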

I wonder if we can break the pattern matching cases into different segments residing in different files to avoid the GHC compilation issue; we did that for the decomposition cases. I have not checked how many cases we have here, and whether it would be feasible to do something like that.
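
For example, a rough sketch of that kind of split (the module layout and function names here are hypothetical, just to show the shape used for decomposition):

    -- Dispatch on the code point range and keep the large case expressions
    -- in separate modules, so that each one stays manageable for GHC.
    name :: Char -> String
    name c
        | c < '\x10000' = namePlane0 c  -- would live in e.g. DerivedName.Plane0
        | otherwise     = nameOther c   -- would live in e.g. DerivedName.Other

    -- Trivial stand-ins so the sketch is self-contained.
    namePlane0, nameOther :: Char -> String
    namePlane0 _ = ""
    nameOther _ = ""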

We can add a simple benchmark to gauge/compare performance while we try different representations. That would be handy in getting an idea of what works better.

@wismill (Collaborator, Author) commented May 16, 2022

Can we use Map.fromList to avoid a dependency on binary?

Map.fromList is not an option. As I wrote before:

Note: I could not get GHC to compile the names with reasonable resources, whether using a naive case expression, a shorter one for ranges (names containing *), or Map.fromList. So I used Data.Binary.


I would prefer to not depend on text.

No problem.

  • Switched to ShortByteString for the internal representation and String for the public API.
  • Switched from Map to IntMap for names.
  • Added benchmark
  • Added Python test

@wismill force-pushed the wip/names branch 2 times, most recently from 7de579e to 86c2f33 on May 16, 2022 11:53
@harendra-kumar (Member)

I will have to look at it carefully, give me some time.

@wismill (Collaborator, Author) commented May 28, 2022

@harendra-kumar kind reminder for a review

@harendra-kumar (Member)

@wismill I suggest the following.

Assuming names do not contain the NUL character, they can be represented as NUL-terminated strings, e.g. "letter A\0". A sequence of names can be concatenated into a single string, e.g. "letter A\0letter B\0". Let's write all the names as a single string literal:

names = Ptr "letter A\0letter B\0"#

Along with this, create a Map storing the mapping from the character to the offset of its name in the above string, e.g.:

offsets :: IntMap Int
offsets = IntMap.fromList [(0,0),(1,9)]

If GHC cannot handle fromList then we can use an Addr# literal storing the (char, offset) pairs as 32-bit ints in little-endian format and then walk through this offsets "array" to create the offset Map.

Then we can return a CString from the names array e.g.:

getName n = names `plusPtr` (fromJust (IntMap.lookup n offsets)) :: CString

> peekCString (getName 1)
"letter B"

If we want to allow the NUL character as part of a string, we can store the length along with the offset and use CStringLen instead of CString.

Our low-level interface would return a CString and the high-level libraries can convert this to text, bytestring, a stream or whatever they like. The advantage of this is that we depend only on base and containers. We can even implement a simple binary search into the static offsets array to avoid depending on containers as well. Another advantage of using just the static array is that we do not have to do any dynamic allocations, therefore there won't be any GC overhead (esp. copying the Map across generations).

Let me know if you want me to create a prototype for this.

@wismill (Collaborator, Author) commented May 30, 2022

Let me know if you want me to create a prototype for this.

@harendra-kumar I gave it a try, using two Addr#: one for the names and one for the offsets. I implemented a binary search to avoid using IntMap, and I switched from ShortByteString to CString. I should probably use Data.List.lookup for the aliases, in order to remove the dependency on containers.

Unfortunately, it is 30% slower for Unicode.Internal.Char.UnicodeData.DerivedName.name.
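
For reference, a rough, self-contained sketch of the lookup scheme described above (toy data, and a plain Haskell list for the offsets table; in the actual code both tables are static Addr# literals emitted by the generator):

    {-# LANGUAGE MagicHash #-}

    import Foreign.C.String (CString, peekCString)
    import Foreign.Ptr (plusPtr)
    import GHC.Ptr (Ptr (..))

    -- All names concatenated into one static, NUL-separated literal.
    names :: CString
    names = Ptr "LATIN SMALL LETTER A\0LATIN SMALL LETTER B\0"#

    -- Sorted (code point, offset) pairs; a list here only for illustration.
    offsets :: [(Int, Int)]
    offsets = [(0x61, 0), (0x62, 21)]

    -- Binary search over the sorted offsets table.
    lookupOffset :: Int -> Maybe Int
    lookupOffset c = go 0 (length offsets - 1)
      where
        go lo hi
            | lo > hi   = Nothing
            | otherwise =
                let mid = (lo + hi) `div` 2
                    (cp, off) = offsets !! mid
                in case compare c cp of
                    LT -> go lo (mid - 1)
                    GT -> go (mid + 1) hi
                    EQ -> Just off

    -- The low-level API hands out a pointer into the static names block.
    name :: Char -> IO (Maybe String)
    name c = case lookupOffset (fromEnum c) of
        Nothing  -> pure Nothing
        Just off -> Just <$> peekCString (names `plusPtr` off)

    main :: IO ()
    main = name 'b' >>= print  -- Just "LATIN SMALL LETTER B"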

@harendra-kumar (Member)

Unfortunately, it is 30% slower for Unicode.Internal.Char.UnicodeData.DerivedName.name.

Even though performance probably does not matter much when looking up char names, I think we are much better off with the new code for two reasons:

  1. It seems we are measuring the worst case of the binary search; 30% slower seems pretty good. The average should be much better.
    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)
  2. We are not considering the cost of GC in this benchmark, which could be pretty bad in the real world, because we need to carry around all the allocations for the Map and copy them over GC generations for the lifetime of the program. In the binary search case we have no dynamic allocations, just a single statically allocated memory chunk in the read-only memory of the process. This is the most significant difference between the two approaches.

@harendra-kumar (Member)

I should probably use Data.List.lookup for the aliases, in order to remove the dependency on containers.

This should be simple. Just use a separate lookup table function for each alias type. Use a case statement for the alias type and pattern match to switch to the right function.
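
For instance, something along these lines (the type and function names are hypothetical, only to show the shape):

    -- One lookup function per alias type; the dispatch is a single case.
    data NameAliasType = Correction | Control | Alternate | Figment | Abbreviation

    nameAliases :: NameAliasType -> Char -> [String]
    nameAliases t = case t of
        Correction   -> correctionAliases
        Control      -> controlAliases
        Alternate    -> alternateAliases
        Figment      -> figmentAliases
        Abbreviation -> abbreviationAliases

    -- Each of these would be a generated table; trivial stubs keep the
    -- sketch self-contained (U+0000 really has these two aliases).
    correctionAliases, alternateAliases, figmentAliases :: Char -> [String]
    correctionAliases _ = []
    alternateAliases  _ = []
    figmentAliases    _ = []

    controlAliases, abbreviationAliases :: Char -> [String]
    controlAliases '\0' = ["NULL"]
    controlAliases _    = []
    abbreviationAliases '\0' = ["NUL"]
    abbreviationAliases _    = []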

@wismill (Collaborator, Author) commented May 31, 2022

  1. It seems we are measuring the worst case of the binary search; 30% slower seems pretty good. The average should be much better.
    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)

@harendra-kumar Don’t we get the mean over the range of characters when compared to the reference?

This should be simple. Just use a separate lookup table function for each alias type. Use a case statement for the alias type and pattern match to switch to the right function.

I went for a simpler approach. What do you think?

README.md (outdated review comments, resolved)
@harendra-kumar (Member)

Mostly looks good. I have a few minor comments/suggestions above.

@harendra-kumar (Member) commented Jun 13, 2022

@harendra-kumar Don’t we get the mean over the range of characters when compared to the reference?

I don't think it works like that, the code is:

    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)

    fold_ :: forall a. (NFData a) => (Char -> a) -> (Char, Char) -> ()
    fold_ f = foldr (deepseq . f) () . range

Let's fold a tuple using foldr as we are doing in the code above:

Prelude> foldr (:) [] (1,10)
[10]

We are essentially doing all our tests only on a single character i.e. maxBound.

Did you mean to use [minBound..maxBound] instead? In that case it would give you the aggregate timing for all the characters, not the average, but that should be fine for comparison. However, I think the tests will take too long in that case, because there will be too many characters to test.

@wismill (Collaborator, Author) commented Jun 13, 2022

@harendra-kumar Don’t we get the mean over the range of characters when compared to the reference?

I don't think it works like that, the code is:

    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)

    fold_ :: forall a. (NFData a) => (Char -> a) -> (Char, Char) -> ()
    fold_ f = foldr (deepseq . f) () . range

Let's fold a tuple using foldr as we are doing in the code above:

Prelude> foldr (:) [] (1,10)
[10]

We are essentially doing all our tests only on a single character i.e. maxBound.

Did you mean to use [minBound..maxBound] instead? In that case it would give you the aggregate timing for all the characters, not the average, but that should be fine for comparison. However, I think the tests will take too long in that case, because there will be too many characters to test.

I think you missed range. We fold over range (minBound, maxBound) i.e. all the chars.
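
For context, range here is presumably Data.Ix.range (the tuple argument suggests it), which enumerates the whole interval:

    import Data.Ix (range)

    -- All 1114112 code points are visited by the fold.
    main :: IO ()
    main = print (length (range (minBound :: Char, maxBound)))  -- 1114112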

@wismill (Collaborator, Author) commented Jun 13, 2022

Mostly looks good. I have a few minor comments/suggestions above.

I will check this carefully tomorrow.

@harendra-kumar (Member)

I think you missed range. We fold over range (minBound, maxBound) i.e. all the chars.

Yes, I missed that. It should work as intended then.

@harendra-kumar (Member)

The range of characters is 1114112 code points long, i.e. more than a million, whereas there are fewer than 150,000 valid characters. Our benchmarks are overwhelmed by non-existent characters, which skews the results. I am seeing a lot of benchmarks giving the same timing of around 417 us, which is suspicious; I need to look into it.

I propose that we parse the unicode blocks file (https://www.unicode.org/Public/14.0.0/ucd/Blocks.txt) and have APIs to:

  1. return a list of unicode blocks (range of chars in each block)
  2. given a char return the unicode block it belongs to.

We need only the first one for our purpose, that should be enough for now. Using the first API we can run benchmarks only in valid ranges. It may also be possible to run benchmarks for different blocks to see how it performs block wise but that is not so important.
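
A possible shape for such an API (the type and function names below are hypothetical, and the block list is truncated to two entries for illustration):

    -- A block is a named, inclusive range of code points.
    data BlockDefinition = BlockDefinition
        { blockRange :: (Char, Char)
        , blockName  :: String
        } deriving (Show)

    -- 1. The list of blocks; in practice this would be generated from Blocks.txt.
    allBlocks :: [BlockDefinition]
    allBlocks =
        [ BlockDefinition ('\x0000', '\x007F') "Basic Latin"
        , BlockDefinition ('\x0080', '\x00FF') "Latin-1 Supplement"
        ]

    -- 2. The block a given character belongs to (linear scan for simplicity;
    --    a real implementation would binary-search the generated ranges).
    blockOf :: Char -> Maybe BlockDefinition
    blockOf c = case filter inBlock allBlocks of
        (b : _) -> Just b
        []      -> Nothing
      where
        inBlock b = let (lo, hi) = blockRange b in lo <= c && c <= hi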

@wismill (Collaborator, Author) commented Jun 14, 2022

The range of characters is 1114112 code points long, i.e. more than a million, whereas there are fewer than 150,000 valid characters. Our benchmarks are overwhelmed by non-existent characters, which skews the results. I am seeing a lot of benchmarks giving the same timing of around 417 us, which is suspicious; I need to look into it.

I agree we should probably only test defined characters. Are there relevant use cases for PUA?

I propose that we parse the unicode blocks file (https://www.unicode.org/Public/14.0.0/ucd/Blocks.txt) and have APIs to:

1. return a list of unicode blocks (range of chars in each block)

2. given a char return the unicode block it belongs to.

I thought about this for a personal project. I will have a look.

We need only the first one for our purpose, that should be enough for now. Using the first API we can run benchmarks only in valid ranges. It may also be possible to run benchmarks for different blocks to see how it performs block wise but that is not so important.

I think we should also parse the scripts. It will be more relevant for realistic inputs.

@wismill (Collaborator, Author) commented Jun 14, 2022

Regarding blocks & scripts: I would prefer for this MR to be merged to avoid rebasing.

@harendra-kumar (Member)

Regarding blocks & scripts: I would prefer for this MR to be merged to avoid rebasing.

Of course. I did not mean doing that in this change.

@wismill (Collaborator, Author) commented Jun 15, 2022

@harendra-kumar I would say that if the tests are OK, this MR is complete. I will publish the new package when merged.

@wismill changed the title from "WIP: Add support for Name and NameAlias" to "Add support for Name and NameAlias" on Jun 15, 2022
@harendra-kumar (Member)

Looks good to me. Due to lack of time, I only reviewed important aspects like exposed API signatures/naming, dependencies, etc. Other things are easier to correct later even if we find some issues.

We can wait a few days before merging in case @Bodigrim and @adithyaov have anything to say.

@wismill (Collaborator, Author) commented Jun 15, 2022

While working on #75, I noted there was a lack of documentation for name that could make the review difficult. So I improved it and took the opportunity to add tests and a new function, correctedName. I will now wait for the last round of review.

@harendra-kumar (Member)

Looks good. You may want to squash/cleanup some commits before merge.

@wismill merged commit 9bf74ed into master on Jun 23, 2022
@wismill deleted the wip/names branch on September 26, 2022