Add support for Name and NameAlias #70
Conversation
Note: I could not get GHC to compile this.
Can we use `Map.fromList` to avoid a dependency on `binary`?
I would prefer to not depend on `text`. The goal of unicode-data is to be sufficiently low level so that other packages can build upon it, including `text` and `streamly`. For example, `streamly` does not use `text` at all; it uses arrays, so it should be able to use unicode-data without having to depend on `text`.
A stream may be a good representation, but in the absence of a standard stream representation we could either use an `Addr#` pointing to UTF-8 encoded static strings, or a UTF-8 encoded bytearray. I think GHC stores string literals UTF-8 encoded, so I guess we do not need to do much; we can just define unboxed string literals (e.g. `"hello world!"#`) and hand out pointers to those. High-level packages can create any other types from this, e.g. streamly has this:
```haskell
{-# INLINE fromCString# #-}
fromCString# :: Addr# -> Array Word8
fromCString# addr# = do
    let cstr = Ptr addr#
        len = unsafeInlineIO $ c_strlen cstr
    fromPtr (fromIntegral len) (castPtr cstr)
```
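On the producing side, the generated table could then simply hand out such pointers. A minimal sketch of that idea (my illustration, not code from this PR; `charName` is a made-up name):

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Addr#)

-- Return a pointer to a static, UTF-8 encoded string literal. GHC
-- emits Addr# literals NUL-terminated in the read-only data segment,
-- so a lookup allocates nothing.
charName :: Char -> Addr#
charName c = case c of
    'A' -> "LATIN CAPITAL LETTER A"#
    'B' -> "LATIN CAPITAL LETTER B"#
    _   -> ""#
```

Consumers can then wrap the result with `Ptr` and decode it however they like, e.g. with `fromCString#` above.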
I wonder if we can break the pattern-matching cases into different segments residing in different files to avoid the GHC compilation issue; we did that for the decomposition cases. I have not checked how many cases we have here, and whether it would be feasible to do something like that.
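A rough shape of that segmentation (my sketch, assuming a split by code-point range; the segment functions are inlined here for brevity, but in the real generator each would live in its own module, and could return `Addr#` as above rather than `String`):

```haskell
-- Top-level dispatcher: branch on the code-point range so that no
-- single generated function has an unmanageable number of cases.
name :: Char -> String
name c
    | c < '\x10000' = nameSegment0 c  -- Basic Multilingual Plane
    | otherwise     = nameSegment1 c  -- supplementary planes

nameSegment0 :: Char -> String  -- would be its own module
nameSegment0 c = case c of
    'A' -> "LATIN CAPITAL LETTER A"
    _   -> ""

nameSegment1 :: Char -> String  -- would be its own module
nameSegment1 c = case c of
    '\x1F600' -> "GRINNING FACE"
    _         -> ""
```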
We can add a simple benchmark to gauge/compare performance while we try different representations. That would be handy in getting an idea of what works better.
No problem.
(Force-pushed from 7de579e to 86c2f33.)
I will have to look at it carefully; give me some time.
@harendra-kumar Kind reminder for a review.
@wismill I suggest the following. Assuming names do not contain the NUL character, they can be represented as a single NUL-terminated string.

Along with this, create a `Map` storing the mapping from each character to the offset of its name in the above string.

If GHC cannot handle that, then we can return a `CString` from the names array.

If we want to allow the NUL character as part of a name, we can store the length along with the offset and use `CStringLen` instead of `CString`. Our low-level interface would return such a pointer (see the sketch below). Let me know if you want me to create a prototype for this.
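A sketch of that layout (my reconstruction; the inline examples from the original comment were lost, so all names and data here are made up):

```haskell
{-# LANGUAGE MagicHash #-}

import qualified Data.Map.Strict as Map
import Foreign.C.String (CString, peekCString)
import Foreign.Ptr (plusPtr)
import GHC.Ptr (Ptr (..))

-- All names concatenated into one static blob, each name followed
-- by a NUL terminator (embedded \0 bytes in an Addr# literal).
names :: CString
names = Ptr "LATIN CAPITAL LETTER A\0LATIN CAPITAL LETTER B\0"#

-- Map from character to the byte offset of its name in the blob.
offsets :: Map.Map Char Int
offsets = Map.fromList [('A', 0), ('B', 23)]

-- Low-level lookup: a pointer into the static blob, no copying.
lookupName :: Char -> Maybe CString
lookupName c = (names `plusPtr`) <$> Map.lookup c offsets

main :: IO ()
main = mapM_ (\c -> traverse peekCString (lookupName c) >>= print) "AB?"
-- Just "LATIN CAPITAL LETTER A", Just "LATIN CAPITAL LETTER B", Nothing
```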
@harendra-kumar I gave it a try: using two …, I unfortunately get 30% slower performance for ….
Even though performance probably does not matter much when looking up char names, I think we are much better off with the new code for two reasons: …
```haskell
benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)
```
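Note that `benchNF` relies on a `fold_` helper defined elsewhere in the benchmark module (it comes up again below). A plausible shape for it, as an assumption rather than the actual definition:

```haskell
import Control.DeepSeq (NFData, deepseq)

-- Plausible shape of the helper (an assumption, not the PR's actual
-- code): force f on every character in the inclusive range, so the
-- whole table is exercised rather than a single tuple component.
fold_ :: NFData a => (Char -> a) -> (Char, Char) -> ()
fold_ f (lo, hi) = foldr (\c acc -> f c `deepseq` acc) () [lo .. hi]
```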
This should be simple. Just use a separate lookup table function for each alias type. Use a case statement for the alias type and pattern match to switch to the right function.
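A sketch of that dispatch (my illustration; the alias categories match NameAliases.txt, but all function names and the stub data are placeholders):

```haskell
-- The alias types defined in NameAliases.txt.
data AliasType = Correction | Control | Alternate | Figment | Abbreviation

-- Case statement on the alias type selects the per-type table.
aliasesByType :: AliasType -> Char -> [String]
aliasesByType t = case t of
    Correction   -> correctionAliases
    Control      -> controlAliases
    Alternate    -> alternateAliases
    Figment      -> figmentAliases
    Abbreviation -> abbreviationAliases

-- Each table would be its own generated pattern match; tiny stubs here.
correctionAliases, controlAliases, alternateAliases,
    figmentAliases, abbreviationAliases :: Char -> [String]
correctionAliases '\x01A2'   = ["LATIN CAPITAL LETTER GHA"]
correctionAliases _          = []
controlAliases '\x0000'      = ["NULL"]
controlAliases _             = []
alternateAliases _           = []
figmentAliases _             = []
abbreviationAliases '\x0000' = ["NUL"]
abbreviationAliases _        = []
```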
@harendra-kumar Don’t we get the mean over the range of characters when compared to the reference?

I went for a simpler approach. What do you think?
Mostly looks good. I have a few minor comments/suggestions above.
I don't think it works like that; the code is:

…

Let's fold a tuple using foldr, as we are doing in the code above:

…

We are essentially doing all our tests only on a single character, i.e. maxBound. Did you mean to use …?
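For the record, the pitfall under discussion is that the `Foldable` instance of `((,) a)` only visits the second component of a pair. A quick demonstration (my example, not from the thread):

```haskell
-- Folding a pair directly sees only the second component, i.e. a
-- single character (maxBound), not the whole range.
main :: IO ()
main = print (foldr (:) [] ((minBound, maxBound) :: (Char, Char)))
-- output: "\1114111"
```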
I think you missed the definition of `fold_`.
I will check this carefully tomorrow. |
Yes, I missed that. It should work as intended then. |
The range of characters is 1,114,112 long, i.e. more than a million, whereas there are fewer than 150,000 valid characters. Our benchmarks are overwhelmed by non-existing characters, which skews the results. I am seeing a lot of benchmarks giving the same timing of around 417 µs, which is suspicious; I need to look into it. I propose that we parse the Unicode blocks file (https://www.unicode.org/Public/14.0.0/ucd/Blocks.txt) and have APIs to: …

We need only the first one for our purpose; that should be enough for now. Using the first API we can run benchmarks only in valid ranges. It may also be possible to run benchmarks for different blocks to see how it performs block-wise, but that is not so important.
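The shape such a Blocks API could take (my sketch; the names, types, and the choice to expose ranges as `Int` pairs are assumptions, not the eventual unicode-data interface):

```haskell
import Data.Char (ord)
import Data.List (find)

-- Hypothetical block definition parsed from Blocks.txt.
data BlockDefinition = BlockDefinition
    { blockRange :: (Int, Int)  -- first and last code point
    , blockName  :: String
    } deriving (Show)

-- All defined block ranges; the generator would fill this in.
allBlocks :: [BlockDefinition]
allBlocks =
    [ BlockDefinition (0x0000, 0x007F) "Basic Latin"
    , BlockDefinition (0x0080, 0x00FF) "Latin-1 Supplement"
    -- ... remaining blocks from Blocks.txt
    ]

-- Block of a character, if it falls in a defined block.
blockOf :: Char -> Maybe BlockDefinition
blockOf c = find inRange allBlocks
  where
    inRange (BlockDefinition (lo, hi) _) = lo <= ord c && ord c <= hi
```

Benchmarks could then draw inputs only from the `allBlocks` ranges instead of the full `[minBound .. maxBound]` span.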
I agree we should probably only test defined characters. Are there relevant use cases for the PUA (Private Use Areas)?
I thought about this for a personal project. I will have a look.
I think we should also parse the scripts. It will be more relevant for realistic inputs.
Regarding blocks & scripts: I would prefer this MR to be merged first, to avoid rebasing.
Of course. I did not mean doing that in this change. |
@harendra-kumar I would say that if the tests are OK, this MR is complete. I will publish the new package when merged.
Looks good to me. Due to lack of time, I only reviewed important aspects like exposed API signatures/naming, dependencies etc. Other things are easier to correct later even if we find some issues. We can wait a few days before merging in case @Bodigrim and @adithyaov have anything to say.
While working on #75, I noted there was a lack of documentation for ….
Looks good. You may want to squash/clean up some commits before merging.
Adds support for character names and name aliases from DerivedName.txt and NameAliases.txt, in a new package unicode-data-names. Fixes #67.