Proper Unicode ToLower/ToUpper #1

dhasenan opened this issue Apr 13, 2017 · 4 comments

Comments

@dhasenan
Owner

char.ToLower and char.ToUpper do not handle all characters. Find the raw Unicode tables and turn them into proper implementations.
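As a rough illustration of what "turning the raw tables into an implementation" could start from, here is a minimal sketch that pulls the simple case mappings out of records in the `UnicodeData.txt` format. It is written in Python for brevity and is not the code from this project; the function name `parse_case_mappings` and the three-line `SAMPLE` are made up for the example. The field positions (12 for simple uppercase, 13 for simple lowercase) follow the published layout of that file.

```python
# Field positions in a semicolon-separated UnicodeData.txt record.
CODE, UPPER, LOWER = 0, 12, 13


def parse_case_mappings(lines):
    """Return (to_upper, to_lower) dicts keyed by code point.

    Hypothetical helper: only the simple one-to-one mappings are read;
    special casing (one code point to several) is ignored here.
    """
    to_upper, to_lower = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[CODE], 16)
        if fields[UPPER]:
            to_upper[cp] = int(fields[UPPER], 16)
        if fields[LOWER]:
            to_lower[cp] = int(fields[LOWER], 16)
    return to_upper, to_lower


# Three records in the UnicodeData.txt layout, used as sample input.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
]

if __name__ == "__main__":
    upper, lower = parse_case_mappings(SAMPLE)
    print(upper, lower)
```

Most records have neither mapping, which is why the dedicated upper/lower tables end up far smaller than the full character table.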

@dhasenan
Owner Author

So...

I've come up with a program that builds my own CharInfo stuff. (System.CharInfo is inaccessible due to protection level. Faugh.)

Two problems:

  1. It brings my IDE to its knees.
  2. It crashes as soon as it tries to load the static data.

It also crashes on startup if I try to run something with a switch statement containing 35,000 cases. I wonder why... It actually crashes so badly that I have to kill the terminal window.

I can resort to native code if I must: explicitly control struct layout and enum member values, then compile the Unicode data text file into binary and load it from native code.

I might be able to do something effectively identical without native code. The thing I really want, though, is O(1) lookups. How do I make that happen fast? The answer is to waste space and to shift names around.

Two embedded resources: one contains just the names, the other contains serialized CharInfo structs. I calculate how long each of CharInfo's members needs to be (specifically the decomposition) and pad as necessary. Two of the members are the start offset and the length within the names resource, which is how the name is found.

Now I have a fixed size for the struct. I can find where in the resource the struct for a given codepoint is located in O(1) time (assuming no gaps, and I can ensure that when I compile).

A little awkward, but it should be manageable.
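The fixed-size-record idea described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the project's code: the record layout (`upper`, `lower`, name offset, name length) and the names `RECORD`, `build`, and `lookup` are hypothetical, and the padded decomposition field is left out to keep the example short.

```python
import struct

# Hypothetical fixed-size record: upper (u32), lower (u32),
# name start offset (u32), name length (u16). "<" means no padding.
RECORD = struct.Struct("<IIIH")


def build(entries):
    """Serialize entries into (records, names) byte blobs.

    entries is a list of (name, upper, lower) where the list index
    is the code point, so there are no gaps by construction.
    """
    names = bytearray()
    records = bytearray()
    for name, upper, lower in entries:
        encoded = name.encode("ascii")
        records += RECORD.pack(upper, lower, len(names), len(encoded))
        names += encoded
    return bytes(records), bytes(names)


def lookup(records, names, cp):
    """O(1) lookup: the record for cp sits at cp * RECORD.size."""
    upper, lower, start, length = RECORD.unpack_from(records, cp * RECORD.size)
    return names[start:start + length].decode("ascii"), upper, lower


if __name__ == "__main__":
    recs, nms = build([("ZERO", 0, 0), ("ONE", 1, 1), ("TWO", 2, 2)])
    print(lookup(recs, nms, 1))
```

The space cost is what the comment above calls "wasting space": every code point gets a full-width record whether or not it has interesting data.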

Now, if I just want the upper/lower mapping, that's easier. I can binary search the relevant entries -- it's about 1300 each. I can compare that to having the full array.

I can also hard-code the common Latin subset.
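A minimal sketch of that combination, a hard-coded fast path for ASCII letters plus a binary search over a sorted mapping table, might look like this. It is Python for illustration only; the three-entry table stands in for the real one of roughly 1300 entries, and the names `LOWER_KEYS`, `LOWER_VALUES`, and `to_lower` are assumptions made for the example.

```python
from bisect import bisect_left

# Sorted parallel arrays: code point -> simple lowercase mapping.
# Tiny illustrative subset (Greek Alpha, Cyrillic A, Armenian Ayb).
LOWER_KEYS = [0x0391, 0x0410, 0x0531]
LOWER_VALUES = [0x03B1, 0x0430, 0x0561]


def to_lower(cp):
    """Return the simple lowercase mapping of cp, or cp if none."""
    # Fast path: ASCII needs no table at all.
    if cp < 0x80:
        return cp + 32 if 0x41 <= cp <= 0x5A else cp
    # Slow path: binary search in the sorted key array.
    i = bisect_left(LOWER_KEYS, cp)
    if i < len(LOWER_KEYS) and LOWER_KEYS[i] == cp:
        return LOWER_VALUES[i]
    return cp


if __name__ == "__main__":
    print(hex(to_lower(ord("A"))), hex(to_lower(0x0391)))
```

With about 1300 entries, the binary search needs at most 11 comparisons, so the gap to a full array lookup may be small enough not to matter.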

@dhasenan
Owner Author

Hrm, System.CharInfo is specific to Mono and doesn't do anything like what I want. Looking at other options in the CLR.

@dhasenan
Owner Author

dhasenan commented Apr 14, 2017

Importing Unicode data as serialized resource files seems to work.

dhasenan added a commit that referenced this issue Apr 14, 2017
@dhasenan
Owner Author

So, we've got the serialized Unicode data. (ASCII -> binary provided approximately zero compression.) I can look up Unicode codepoint info. This takes a binary search over the records. If I added a distance heuristic, that would probably make it faster... Also, there are probably a lot of contiguous ranges, so I could binary/linear search through a couple dozen ranges and then use a fixed offset within the matching range.
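The contiguous-range idea could be sketched like this: search a short list of ranges, then compute the record index with a fixed offset inside the matching range. This is a Python illustration, not the committed code; the ranges in the usage example are illustrative, and `build_index` and `record_index` are hypothetical names.

```python
from bisect import bisect_right


def build_index(ranges):
    """ranges: sorted, non-overlapping (first, last) code point pairs.

    Returns (starts, bases) where bases[i] is the index of the first
    record of range i in the densely packed record array.
    """
    starts, bases = [], []
    total = 0
    for first, last in ranges:
        starts.append(first)
        bases.append(total)
        total += last - first + 1
    return starts, bases


def record_index(ranges, starts, bases, cp):
    """Return the record index for cp, or None if cp is in a gap."""
    i = bisect_right(starts, cp) - 1  # last range starting at or before cp
    if i < 0 or cp > ranges[i][1]:
        return None
    return bases[i] + (cp - ranges[i][0])


if __name__ == "__main__":
    ranges = [(0x0000, 0x0377), (0x037A, 0x037F)]
    starts, bases = build_index(ranges)
    print(record_index(ranges, starts, bases, 0x037A))
```

With only a couple dozen ranges, the search touches a handful of entries instead of the whole record table, and gaps cost no storage.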

This isn't just an upper/lower mapping. I can build that on its own; it'll be 1300-ish entries, looked up mostly by binary search.

dhasenan added a commit that referenced this issue Apr 15, 2017
Updates to #1

Need testing before I call it done