Fix UTF-8 prober fullLen calculation, ignores basic ASCII characters #59
This PR fixes an incorrect fullLen calculation.
In addition, the confidence function now ignores all basic ASCII characters (code <= 127), because most encodings encode them identically, while extended codes can behave differently (for example, Ā is encoded differently in UTF-8 than the same bytes decode in Windows-1252). Encoding reference: https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
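As a rough illustration of the idea (not the actual prober code, whose language and structure I'm not reproducing here), a confidence function along these lines would skip bytes <= 127 entirely and base fullLen only on validated multi-byte sequences; the function name and confidence formula below are assumptions for the sketch:

```python
def utf8_confidence(data: bytes) -> float:
    """Hypothetical sketch: confidence that `data` is UTF-8,
    counting only multi-byte sequences (basic ASCII is ignored)."""
    full_len = 0  # number of validated multi-byte characters
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            i += 1          # basic ASCII: identical in most encodings, skip
            continue
        # Determine expected sequence length from the lead byte.
        if 0xC0 <= b <= 0xDF:
            seq = 2
        elif 0xE0 <= b <= 0xEF:
            seq = 3
        elif 0xF0 <= b <= 0xF4:
            seq = 4
        else:
            return 0.0      # stray continuation/invalid lead byte: not UTF-8
        # All continuation bytes must be in 0x80..0xBF.
        if i + seq > len(data) or any(
            not (0x80 <= c <= 0xBF) for c in data[i + 1 : i + seq]
        ):
            return 0.0
        full_len += 1
        i += seq
    if full_len == 0:
        return 0.0          # pure ASCII carries no evidence either way
    # Assumed formula: confidence grows with each multi-byte character seen.
    return 0.99 * (1.0 - 0.5 ** full_len)
```

With this shape, a Windows-1252 byte like 0xE9 ("é") followed by ASCII fails continuation validation and scores 0.0, while genuine UTF-8 text gains confidence with every multi-byte character, regardless of how much ASCII surrounds it.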
Ideally, this functionality should ensure:
Ignoring basic ASCII characters MAY increase confidence for a multi-byte-character document. But the current problem is a low UTF-8 true-positive rate rather than a high UTF-8 false-positive rate. I wrote some short tests locally and found that text encoded by other methods, such as Windows-1252, is still detected correctly, the UTF-8 prober is never triggered, and the existing tests also pass as-is. So I think this trade-off is worth it.