Add support for Name and NameAlias #70

Merged: 2 commits merged into master from wip/names on Jun 23, 2022

Conversation

@wismill (Collaborator) commented May 13, 2022

Fixes #67

@wismill (Collaborator, Author) commented May 13, 2022

Note: I could not get GHC to compile the names with reasonable resources, whether using a naive case expression, a shorter one for ranges (names containing *), or Map.fromList. So I used Data.Binary.

@harendra-kumar (Member) left a comment

Can we use Map.fromList to avoid a dependency on binary?

I would prefer to not depend on text. The goal of unicode-data is to be sufficiently low level so that other packages can build upon it including text and streamly. For example, streamly does not use text at all, it uses arrays, so it should be able to use unicode-data without having to depend on text.

A stream may be a good representation, but in the absence of a standard stream representation we could either use an Addr# pointing to UTF-8 encoded static strings or a UTF-8 encoded bytearray. I think GHC stores string literals UTF-8 encoded, so I guess we do not need to do much: we can just define unboxed string literals (e.g. "hello world!"#) and hand out pointers to those. High-level packages can create any other types from this, e.g. streamly has this:

{-# INLINE fromCString# #-}
fromCString# :: Addr# -> Array Word8
fromCString# addr# = do
    let cstr = Ptr addr#
        len = unsafeInlineIO $ c_strlen cstr
    fromPtr (fromIntegral len) (castPtr cstr)
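
For instance, a minimal sketch of handing out such a pointer (the names here are purely illustrative, not part of the actual unicode-data API):

    {-# LANGUAGE MagicHash #-}

    import Foreign.C.String (CString, peekCString)
    import GHC.Ptr (Ptr (..))

    -- A statically allocated, UTF-8 encoded, NUL-terminated literal,
    -- handed out as a plain pointer; higher-level packages can build
    -- text, bytestrings, arrays, etc. on top of it.
    hello :: CString
    hello = Ptr "hello world!\0"#

    main :: IO ()
    main = peekCString hello >>= putStrLn  -- prints "hello world!"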

I wonder if we can break the pattern matching cases into different segments residing in different files to avoid the GHC compilation issue; we did that for the decomposition cases. I have not checked how many cases we have here, and whether it would be feasible to do something like that.
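
For example, a rough sketch of that kind of split (the module layout and function names here are hypothetical, just to show the shape used for decomposition):

    -- Dispatch on the code point range and keep the large case expressions
    -- in separate modules, so that each one stays manageable for GHC.
    name :: Char -> String
    name c
        | c < '\x10000' = namePlane0 c  -- would live in e.g. DerivedName.Plane0
        | otherwise     = nameOther c   -- would live in e.g. DerivedName.Other

    -- Trivial stand-ins so the sketch is self-contained.
    namePlane0, nameOther :: Char -> String
    namePlane0 _ = ""
    nameOther _ = ""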

We can add a simple benchmark to gauge/compare performance while we try different representations. That would be handy in getting an idea of what works better.

@wismill (Collaborator, Author) commented May 16, 2022

Can we use Map.fromList to avoid a dependency on binary?

Map.fromList is not an option. As I wrote before:

Note: I could not get GHC to compile the names with reasonable resources, whether using a naive case expression, a shorter one for ranges (names containing *), or Map.fromList. So I used Data.Binary.


I would prefer to not depend on text.

No problem.

  • Switched to ShortByteString for the internal representation and String for the public API.
  • Switched from Map to IntMap for names.
  • Added benchmark
  • Added Python test

@wismill force-pushed the wip/names branch 2 times, most recently from 7de579e to 86c2f33 on May 16, 2022 11:53
@harendra-kumar (Member)

I will have to look at it carefully, give me some time.

@wismill (Collaborator, Author) commented May 28, 2022

@harendra-kumar kind reminder for a review

@harendra-kumar (Member)

@wismill I suggest the following.

Assuming names do not contain the NUL character, they can be represented as NUL-terminated strings, e.g. "letter A\0". A sequence of names can be concatenated into a single string, e.g. "letter A\0letter B\0". Let's write all the names as a single string literal:

names = Ptr "letter A\0letter B\0"#

Along with this, create a Map storing the mapping from the character to the offset of its name in the above string, e.g.:

offsets :: IntMap Int
offsets = IntMap.fromList [(0,0),(1,9)]

If GHC cannot handle fromList then we can use an Addr# literal storing the (char, offset) pairs as 32-bit ints in little-endian format and then walk through this offsets "array" to create the offset Map.

Then we can return a CString from the names array e.g.:

getName n = names `plusPtr` (fromJust (IntMap.lookup n offsets)) :: CString

> peekCString (getName 1)
"letter B"

If we want to allow the NUL character as part of a string, we can store the length along with the offset and use CStringLen instead of CString.

Our low-level interface would return a CString and the high-level libraries can convert this to text, bytestring, a stream or whatever they like. The advantage of this is that we depend only on base and containers. We can even implement a simple binary search into the static offsets array to avoid depending on containers as well. Another advantage of using just the static array is that we do not have to do any dynamic allocations, therefore there won't be any GC overhead (esp. copying the Map across generations).

Let me know if you want me to create a prototype for this.

@wismill (Collaborator, Author) commented May 30, 2022

Let me know if you want me to create a prototype for this.

@harendra-kumar I gave it a try, using two Addr#: one for the names and one for the offsets. I implemented a binary search to avoid using IntMap, and I switched from ShortByteString to CString. I should probably use Data.List.lookup for the aliases, in order to remove the dependency on containers.

Unfortunately, it is 30% slower for Unicode.Internal.Char.UnicodeData.DerivedName.name.
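
For reference, a rough, self-contained sketch of the lookup scheme described above (toy data, and a plain Haskell list for the offsets table; in the actual code both tables are static Addr# literals emitted by the generator):

    {-# LANGUAGE MagicHash #-}

    import Foreign.C.String (CString, peekCString)
    import Foreign.Ptr (plusPtr)
    import GHC.Ptr (Ptr (..))

    -- All names concatenated into one static, NUL-separated literal.
    names :: CString
    names = Ptr "LATIN SMALL LETTER A\0LATIN SMALL LETTER B\0"#

    -- Sorted (code point, offset) pairs; a list here only for illustration.
    offsets :: [(Int, Int)]
    offsets = [(0x61, 0), (0x62, 21)]

    -- Binary search over the sorted offsets table.
    lookupOffset :: Int -> Maybe Int
    lookupOffset c = go 0 (length offsets - 1)
      where
        go lo hi
            | lo > hi   = Nothing
            | otherwise =
                let mid = (lo + hi) `div` 2
                    (cp, off) = offsets !! mid
                in case compare c cp of
                    LT -> go lo (mid - 1)
                    GT -> go (mid + 1) hi
                    EQ -> Just off

    -- The low-level API hands out a pointer into the static names block.
    name :: Char -> IO (Maybe String)
    name c = case lookupOffset (fromEnum c) of
        Nothing  -> pure Nothing
        Just off -> Just <$> peekCString (names `plusPtr` off)

    main :: IO ()
    main = name 'b' >>= print  -- Just "LATIN SMALL LETTER B"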

@harendra-kumar (Member)

Unfortunately, it is 30% slower for Unicode.Internal.Char.UnicodeData.DerivedName.name.

Even though performance probably does not matter much when looking up char names, I think we are much better off with the new code for two reasons:

  1. It seems we are measuring the worst case of the binary search; 30% slower seems pretty good. The average should be much better.
    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)
  2. We are not considering the cost of GC in this benchmark, which could be pretty bad in the real world, because we need to carry around all the allocations for the Map and copy them over GC generations for the lifetime of the program. In the binary search case we have no dynamic allocations, just a single statically allocated memory chunk in the read-only memory of the process. This is the most significant difference between the two approaches.

@harendra-kumar (Member)

I should probably use Data.List.lookup for the aliases, in order to remove the dependency on containers.

This should be simple. Just use a separate lookup table function for each alias type. Use a case statement for the alias type and pattern match to switch to the right function.
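
For instance, something along these lines (the type and function names are hypothetical, only to show the shape):

    -- One lookup function per alias type; the dispatch is a single case.
    data NameAliasType = Correction | Control | Alternate | Figment | Abbreviation

    nameAliases :: NameAliasType -> Char -> [String]
    nameAliases t = case t of
        Correction   -> correctionAliases
        Control      -> controlAliases
        Alternate    -> alternateAliases
        Figment      -> figmentAliases
        Abbreviation -> abbreviationAliases

    -- Each of these would be a generated table; trivial stubs keep the
    -- sketch self-contained (U+0000 really has these two aliases).
    correctionAliases, alternateAliases, figmentAliases :: Char -> [String]
    correctionAliases _ = []
    alternateAliases  _ = []
    figmentAliases    _ = []

    controlAliases, abbreviationAliases :: Char -> [String]
    controlAliases '\0' = ["NULL"]
    controlAliases _    = []
    abbreviationAliases '\0' = ["NUL"]
    abbreviationAliases _    = []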

@wismill (Collaborator, Author) commented May 31, 2022

  1. It seems we are measuring the worst case of the binary search; 30% slower seems pretty good. The average should be much better.
    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)

@harendra-kumar Don’t we get the mean over the range of characters when compared to the reference?

This should be simple. Just use a separate lookup table function for each alias type. Use a case statement for the alias type and pattern match to switch to the right function.

I went for a simpler approach. What do you think?

README.md (outdated review comments, resolved)
@harendra-kumar (Member)

Mostly looks good. I have a few minor comments/suggestions above.

@harendra-kumar (Member) commented Jun 13, 2022

@harendra-kumar Don’t we get the mean over the range of characters when compared to the reference?

I don't think it works like that, the code is:

    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)

    fold_ :: forall a. (NFData a) => (Char -> a) -> (Char, Char) -> ()
    fold_ f = foldr (deepseq . f) () . range

Let's fold a tuple using foldr as we are doing in the code above:

Prelude> foldr (:) [] (1,10)
[10]

We are essentially doing all our tests only on a single character i.e. maxBound.

Did you mean to use [minBound..maxBound] instead? In that case it would give you the aggregate timing for all the characters, not the average, but that should be fine for comparison. However, I think the tests will take too long in that case, because there will be too many characters to test.

@wismill (Collaborator, Author) commented Jun 13, 2022

@harendra-kumar Don’t we get the mean over the range of characters when compared to the reference?

I don't think it works like that, the code is:

    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)

    fold_ :: forall a. (NFData a) => (Char -> a) -> (Char, Char) -> ()
    fold_ f = foldr (deepseq . f) () . range

Let's fold a tuple using foldr as we are doing in the code above:

Prelude> foldr (:) [] (1,10)
[10]

We are essentially doing all our tests only on a single character i.e. maxBound.

Did you mean to use [minBound..maxBound] instead? In that case it would give you the aggregate timing for all the characters, not the average, but that should be fine for comparison. However, I think the tests will take too long in that case, because there will be too many characters to test.

I think you missed range. We fold over range (minBound, maxBound) i.e. all the chars.
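
For context, range here is presumably Data.Ix.range (the tuple argument suggests it), which enumerates the whole interval:

    import Data.Ix (range)

    -- All 1114112 code points are visited by the fold.
    main :: IO ()
    main = print (length (range (minBound :: Char, maxBound)))  -- 1114112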

@wismill (Collaborator, Author) commented Jun 13, 2022

Mostly looks good. I have a few minor comments/suggestions above.

I will check this carefully tomorrow.

@harendra-kumar (Member)

I think you missed range. We fold over range (minBound, maxBound) i.e. all the chars.

Yes, I missed that. It should work as intended then.

@harendra-kumar (Member)

The range of characters is 1114112 code points long, i.e. more than a million, whereas there are fewer than 150,000 valid characters. Our benchmarks are overwhelmed by non-existent characters, which skews the results. I am seeing a lot of benchmarks giving the same timing of around 417 us, which is suspicious; I need to look into it.

I propose that we parse the unicode blocks file (https://www.unicode.org/Public/14.0.0/ucd/Blocks.txt) and have APIs to:

  1. return a list of unicode blocks (range of chars in each block)
  2. given a char return the unicode block it belongs to.

We need only the first one for our purpose, that should be enough for now. Using the first API we can run benchmarks only in valid ranges. It may also be possible to run benchmarks for different blocks to see how it performs block wise but that is not so important.
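
A possible shape for such an API (the type and function names below are hypothetical, and the block list is truncated to two entries for illustration):

    -- A block is a named, inclusive range of code points.
    data BlockDefinition = BlockDefinition
        { blockRange :: (Char, Char)
        , blockName  :: String
        } deriving (Show)

    -- 1. The list of blocks; in practice this would be generated from Blocks.txt.
    allBlocks :: [BlockDefinition]
    allBlocks =
        [ BlockDefinition ('\x0000', '\x007F') "Basic Latin"
        , BlockDefinition ('\x0080', '\x00FF') "Latin-1 Supplement"
        ]

    -- 2. The block a given character belongs to (linear scan for simplicity;
    --    a real implementation would binary-search the generated ranges).
    blockOf :: Char -> Maybe BlockDefinition
    blockOf c = case filter inBlock allBlocks of
        (b : _) -> Just b
        []      -> Nothing
      where
        inBlock b = let (lo, hi) = blockRange b in lo <= c && c <= hi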

@wismill (Collaborator, Author) commented Jun 14, 2022

The range of characters is 1114112 code points long, i.e. more than a million, whereas there are fewer than 150,000 valid characters. Our benchmarks are overwhelmed by non-existent characters, which skews the results. I am seeing a lot of benchmarks giving the same timing of around 417 us, which is suspicious; I need to look into it.

I agree we should probably only test defined characters. Are there relevant use cases for PUA?

I propose that we parse the unicode blocks file (https://www.unicode.org/Public/14.0.0/ucd/Blocks.txt) and have APIs to:

1. return a list of unicode blocks (range of chars in each block)

2. given a char return the unicode block it belongs to.

I thought about this for a personal project. I will have a look.

We need only the first one for our purpose, that should be enough for now. Using the first API we can run benchmarks only in valid ranges. It may also be possible to run benchmarks for different blocks to see how it performs block wise but that is not so important.

I think we should also parse the scripts. It will be more relevant for realistic inputs.

@wismill (Collaborator, Author) commented Jun 14, 2022

Regarding blocks & scripts: I would prefer for this MR to be merged to avoid rebasing.

@harendra-kumar (Member)

Regarding blocks & scripts: I would prefer for this MR to be merged to avoid rebasing.

Of course. I did not mean doing that in this change.

@wismill (Collaborator, Author) commented Jun 15, 2022

@harendra-kumar I would say that if the tests are OK, this MR is complete. I will publish the new package when merged.

@wismill changed the title from "WIP: Add support for Name and NameAlias" to "Add support for Name and NameAlias" on Jun 15, 2022
@harendra-kumar (Member)

Looks good to me. Due to lack of time, I only reviewed important aspects like exposed API signatures/naming, dependencies, etc. Other things are easier to correct later even if we find some issues.

We can wait a few days before merging in case @Bodigrim and @adithyaov have anything to say.

@wismill (Collaborator, Author) commented Jun 15, 2022

While working on #75, I noted there was a lack of documentation for name that could make the review difficult. So I improved it and took the opportunity to add tests and a new function, correctedName. I will now wait for the last round of review.

@harendra-kumar (Member)

Looks good. You may want to squash/cleanup some commits before merge.

@wismill merged commit 9bf74ed into master on Jun 23, 2022
@wismill deleted the wip/names branch on September 26, 2022