Proper Unicode ToLower/ToUpper #1

dhasenan · 2017-04-13T02:32:54Z

char.ToLower and char.ToUpper do not handle all characters. Find the raw Unicode data tables and turn them into proper implementations.

dhasenan · 2017-04-13T05:19:54Z

So...

I've come up with a program that builds my own CharInfo stuff. (System.CharInfo is inaccessible due to protection level. Faugh.)

Two problems:

It brings my IDE to its knees.
It crashes as soon as it tries to load the static data.

It also crashes on startup if I try to run something with a switch statement containing 35,000 cases. I wonder why... It actually crashes so badly that I have to kill the terminal window.

I can resort to native code if I must. Explicitly control struct layout and enum member values, then compile the Unicode data text file into binary. Load it in native code, etc.

I might be able to do something effectively identical without native code. The thing I really want, though, is O(1) lookups. How do I make that happen fast? The answer is to waste space and to shift names around.

Two embedded resources. One contains only the names. The other contains serialized CharInfo structs. I calculate how long the variable-length CharInfo members need to be (specifically the decomposition) and pad as necessary. Two of the members are the start offset and length of the character's name within the names resource.

Now I have a fixed size for the struct. I can find where in the resource the struct for a given codepoint is located in O(1) time (assuming no gaps, and I can ensure that when I compile).

A little awkward, but it should be manageable.
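
A minimal sketch of the fixed-size record idea, written in Python purely for illustration (the project itself appears to target .NET). The record layout, the `build` and `lookup` helpers, and the sample entries are all assumptions, and the padded decomposition field is left out to keep it short. It only shows how a fixed record size turns a lookup into a single offset calculation:

```python
import struct

# Hypothetical record layout (not the project's actual format):
# uppercase code point, lowercase code point, name offset, name length.
# "<" means little-endian with no padding, so every record is 14 bytes.
RECORD = struct.Struct("<IIIH")


def build(entries, first, last):
    """Pack one record per code point in [first, last].

    Code points missing from `entries` get a blank record so that
    there are no gaps and offsets stay computable.
    """
    names = bytearray()
    records = bytearray()
    for cp in range(first, last + 1):
        upper, lower, name = entries.get(cp, (cp, cp, ""))
        encoded = name.encode("ascii")
        records += RECORD.pack(upper, lower, len(names), len(encoded))
        names += encoded
    return bytes(records), bytes(names)


def lookup(records, names, first, cp):
    """O(1) lookup: the byte offset follows directly from the code point."""
    offset = (cp - first) * RECORD.size
    upper, lower, start, length = RECORD.unpack_from(records, offset)
    return upper, lower, names[start:start + length].decode("ascii")
```

The cost is the wasted space: every unassigned code point in the covered range still occupies a full record.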

Now, if I just want the upper/lower mapping, that's easier. I can binary search the relevant entries -- it's about 1300 each. I can compare that to having the full array.

I can also hard-code the common Latin subset.
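
Roughly what the upper-case half could look like, again as a Python sketch rather than the real implementation. The four-entry table is a hand-picked sample standing in for the ~1300 generated pairs, and `to_upper` is a hypothetical name. ASCII takes the hard-coded fast path; everything else falls through to a binary search over the sorted keys:

```python
from bisect import bisect_left

# Hypothetical sample of a sorted lowercase -> uppercase table. The real
# table would be generated from the Unicode data and hold ~1300 pairs.
LOWER_KEYS = [0x00E9, 0x00FF, 0x03B1, 0x0450]
UPPER_VALUES = [0x00C9, 0x0178, 0x0391, 0x0400]


def to_upper(cp):
    """Map a code point to its simple uppercase form, or return it unchanged."""
    # Fast path: hard-coded ASCII, no table access needed.
    if cp < 0x80:
        return cp - 0x20 if 0x61 <= cp <= 0x7A else cp
    # Slow path: binary search in the sorted key array.
    i = bisect_left(LOWER_KEYS, cp)
    if i < len(LOWER_KEYS) and LOWER_KEYS[i] == cp:
        return UPPER_VALUES[i]
    return cp
```

With about 1300 entries, the binary search needs at most 11 comparisons per lookup.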

dhasenan · 2017-04-13T05:22:52Z

Hrm, System.CharInfo is specific to Mono and doesn't do anything like what I want. Looking at other options in the CLR.

dhasenan · 2017-04-14T15:29:24Z

Importing Unicode data as serialized resource files seems to work.

Helps with #1.

dhasenan · 2017-04-14T15:59:29Z

So, we've got the serialized Unicode data. (ASCII -> binary provided approximately zero compression.) I can look up Unicode codepoint info. This takes a binary search. If I added a distance heuristic, that would probably make it faster... Also, there are probably a lot of contiguous ranges, so I could binary/linear search through a couple dozen ranges and then go with fixed offsets.
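
The contiguous-range idea might be sketched like this (Python for illustration only; the three ranges, their base indexes, and the `record_index` name are made up for the example). A short table of ranges is searched first, and the position inside the matching range is then a fixed offset:

```python
from bisect import bisect_right

# Hypothetical ranges: (first code point, last code point, index of the
# range's first record in the serialized data). Sorted by first code point.
RANGES = [
    (0x0041, 0x005A, 0),    # 26 records
    (0x0061, 0x007A, 26),   # 26 records
    (0x0391, 0x03A1, 52),   # 17 records
]
STARTS = [first for first, _, _ in RANGES]


def record_index(cp):
    """Return the record index for cp, or None if cp falls in a gap."""
    # Find the last range whose first code point is <= cp.
    i = bisect_right(STARTS, cp) - 1
    if i < 0:
        return None
    first, last, base = RANGES[i]
    if cp > last:
        return None
    # Inside a contiguous range the record sits at a fixed offset.
    return base + (cp - first)
```

With only a couple dozen ranges, the search over `STARTS` is a handful of comparisons, and a linear scan would likely do just as well.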

This isn't just an upper/lower mapping. I can make that alone; it'll be 1300-ish entries, mostly binary search.

Updates to #1. Need testing before I call it done.

dhasenan added a commit that referenced this issue Apr 14, 2017

Use resources to get unicode data.

795e20f

Helps with #1.

dhasenan added a commit that referenced this issue Apr 15, 2017

My own ToUpper implementation!

ecff7e6

Updates to #1. Need testing before I call it done.
