Incorrect implementation of unicode case folding #168

rlidwka · 2019-07-09T11:58:17Z

Abstract: this code isn't up to unicode spec, partially incorrect (and can probably be replaced by a simple combination of lowercase/uppercase in modern js engines).

So I was looking at this test the other day: commonmark/commonmark-spec#582

And I was trying to figure out what's the difference between "naive" lowercasing and actual Unicode case folding as implemented in commonmark.

So I checked it with the spec published here:
http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

And surely enough, I found a difference:

AB73; C; 13A3; # CHEROKEE SMALL LETTER O

They are corresponding lower and upper case letter, even javascript knows that:

$ node -e 'console.log("\uab73".toUpperCase().codePointAt(0).toString(16));'
13a3
$ node -e 'console.log("\u13a3".toLowerCase().codePointAt(0).toString(16));'
ab73

But commonmark.js doesn't treat them as the same character:

$ echo -e '[\u13a3]\n\n[\uab73]: /test' | ./bin/commonmark
<p>[Ꭳ]</p>

Funnily enough, original implementation (https://github.com/mathiasbynens/swapcase) supports this:

$ node -e "console.log(require('swapcase')('\uab73').codePointAt(0).toString(16))"
13a3

So it looks like what we're having here is not simply outdated, but was never correct to begin with.

It is my estimation that around 330 other characters in current implementation are processed incorrectly, as opposed to 170 in original swapcase, which are most likely due to changes in unicode between now and 5 years ago when it was published.

I have a better solution. Use simple and naive str.toLowerCase(). It only gets 125 codepoints wrong, which is better.

Or use str.toUpperCase(). It is actually almost perfect, as it only gets 6 characters incorrectly (list of exceptions: İ, ϴ, ẞ, Ω, K, Å - those are already uppercased, but have differently uppercased versions).

If we combine those approaches, str.toLowerCase().toUpperCase() is observed to normalize ALL characters in the same way as case fold spec (as tested on node 12, earlier v8 versions have had some issues).

So... why not use that, and save on 30kb of javascript code and its future maintenance time?

The text was updated successfully, but these errors were encountered:

jgm · 2019-07-11T04:26:56Z

The proposal sounds great to me. Can you measure any signficant performance difference?

rlidwka · 2019-07-16T13:02:22Z

Can you measure any signficant performance difference?

Yes. It is either 30 times faster or 3 times slower depending on how you measure.

Here's ASCII range:

function test(fn, name) {
  let str = ''

  for (let i=0; i<=0x7F; i++) {
    str += String.fromCodePoint(i)
  }

  let time = Date.now()
  for (let i=0; i<200000; i++) {
    fn(str)
  }
  console.log(name, Date.now() - time)
}

test(require('fold-case'), 'fold-case')
test(require('swapcase'), 'swapcase')
test(s => s.toLowerCase().toUpperCase(), 'built-in')

// Result:
// fold-case 715
// swapcase 1340
// built-in 45

Here's full unicode range:

function test(fn, name) {
  let str = ''

  for (let i=0; i<0x10FFFF; i++) {
    str += String.fromCodePoint(i)
  }

  let time = Date.now()
  for (let i=0; i<20; i++) {
    fn(str)
  }
  console.log(name, Date.now() - time)
}

test(require('fold-case'), 'fold-case')
test(require('swapcase'), 'swapcase')
test(s => s.toLowerCase().toUpperCase(), 'built-in')

// Result:
// fold-case 503
// swapcase 490
// built-in 1539

I assume most identifiers will be ascii, so first test applies.

jgm · 2019-07-16T14:58:22Z

Looks good. Do you want to do a PR for this change?

@rlidwka

We now use the built in str.toLowerCase().toUpperCase(), which @rlidwka has shown does an accurate unicode case fold. This allows us to remove a huge lookup table and should both decrease the size of the library and speed things up. Closes commonmark/commonmark.js#168. commonmark/commonmark.js@b8fc0c7 Author: John MacFarlane <jgm@berkeley.edu> Date: Fri Jan 10 09:19:35 2020 -0800

jgm closed this as completed in b8fc0c7 Jan 10, 2020

dongwoo-kim mentioned this issue Mar 26, 2020

fix: apply bug fixes from commonmark.js since 0.29.0 nhn/tui.editor#855

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect implementation of unicode case folding #168

Incorrect implementation of unicode case folding #168

rlidwka commented Jul 9, 2019 •

edited

jgm commented Jul 11, 2019 •

edited

rlidwka commented Jul 16, 2019

jgm commented Jul 16, 2019

Incorrect implementation of unicode case folding #168

Incorrect implementation of unicode case folding #168

Comments

rlidwka commented Jul 9, 2019 • edited

jgm commented Jul 11, 2019 • edited

rlidwka commented Jul 16, 2019

jgm commented Jul 16, 2019

rlidwka commented Jul 9, 2019 •

edited

jgm commented Jul 11, 2019 •

edited