Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect implementation of unicode case folding #168

Closed
rlidwka opened this issue Jul 9, 2019 · 3 comments
Closed

Incorrect implementation of unicode case folding #168

rlidwka opened this issue Jul 9, 2019 · 3 comments

Comments

@rlidwka
Copy link
Contributor

rlidwka commented Jul 9, 2019

Abstract: this code isn't up to unicode spec, partially incorrect (and can probably be replaced by a simple combination of lowercase/uppercase in modern js engines).


So I was looking at this test the other day: commonmark/commonmark-spec#582

And I was trying to figure out what's the difference between "naive" lowercasing and actual Unicode case folding as implemented in commonmark.

So I checked it with the spec published here:
http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

And surely enough, I found a difference:

AB73; C; 13A3; # CHEROKEE SMALL LETTER O

They are corresponding lower and upper case letter, even javascript knows that:

$ node -e 'console.log("\uab73".toUpperCase().codePointAt(0).toString(16));'
13a3
$ node -e 'console.log("\u13a3".toLowerCase().codePointAt(0).toString(16));'
ab73

But commonmark.js doesn't treat them as the same character:

$ echo -e '[\u13a3]\n\n[\uab73]: /test' | ./bin/commonmark
<p>[Ꭳ]</p>

Funnily enough, original implementation (https://github.com/mathiasbynens/swapcase) supports this:

$ node -e "console.log(require('swapcase')('\uab73').codePointAt(0).toString(16))"
13a3

So it looks like what we're having here is not simply outdated, but was never correct to begin with.

It is my estimation that around 330 other characters in current implementation are processed incorrectly, as opposed to 170 in original swapcase, which are most likely due to changes in unicode between now and 5 years ago when it was published.


I have a better solution. Use simple and naive str.toLowerCase(). It only gets 125 codepoints wrong, which is better.

Or use str.toUpperCase(). It is actually almost perfect, as it only gets 6 characters incorrectly (list of exceptions: İ, ϴ, ẞ, Ω, K, Å - those are already uppercased, but have differently uppercased versions).

If we combine those approaches, str.toLowerCase().toUpperCase() is observed to normalize ALL characters in the same way as case fold spec (as tested on node 12, earlier v8 versions have had some issues).

So... why not use that, and save on 30kb of javascript code and its future maintenance time?

@jgm
Copy link
Member

jgm commented Jul 11, 2019

The proposal sounds great to me. Can you measure any signficant performance difference?

@rlidwka
Copy link
Contributor Author

rlidwka commented Jul 16, 2019

Can you measure any signficant performance difference?

Yes. It is either 30 times faster or 3 times slower depending on how you measure.

Here's ASCII range:

function test(fn, name) {
  let str = ''

  for (let i=0; i<=0x7F; i++) {
    str += String.fromCodePoint(i)
  }

  let time = Date.now()
  for (let i=0; i<200000; i++) {
    fn(str)
  }
  console.log(name, Date.now() - time)
}

test(require('fold-case'), 'fold-case')
test(require('swapcase'), 'swapcase')
test(s => s.toLowerCase().toUpperCase(), 'built-in')

// Result:
// fold-case 715
// swapcase 1340
// built-in 45

Here's full unicode range:

function test(fn, name) {
  let str = ''

  for (let i=0; i<0x10FFFF; i++) {
    str += String.fromCodePoint(i)
  }

  let time = Date.now()
  for (let i=0; i<20; i++) {
    fn(str)
  }
  console.log(name, Date.now() - time)
}

test(require('fold-case'), 'fold-case')
test(require('swapcase'), 'swapcase')
test(s => s.toLowerCase().toUpperCase(), 'built-in')

// Result:
// fold-case 503
// swapcase 490
// built-in 1539

I assume most identifiers will be ascii, so first test applies.

@jgm
Copy link
Member

jgm commented Jul 16, 2019

Looks good. Do you want to do a PR for this change?

@jgm jgm closed this as completed in b8fc0c7 Jan 10, 2020
taku0 added a commit to taku0/cmark-el that referenced this issue Jan 11, 2020
We now use the built in str.toLowerCase().toUpperCase(), which
@rlidwka has shown does an accurate unicode case fold.
This allows us to remove a huge lookup table and should
both decrease the size of the library and speed things up.

Closes commonmark/commonmark.js#168.

commonmark/commonmark.js@b8fc0c7
Author: John MacFarlane <jgm@berkeley.edu>
Date:   Fri Jan 10 09:19:35 2020 -0800
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants