Fix UTF-8 prober fullLen calculation, ignores basic ASCII characters #59
This PR fixes an incorrect fullLen calculation.
In addition, the confidence function now ignores all basic ASCII characters (code <= 127), because most encodings encode them identically, while extended codes can behave differently (for example, Ā is encoded differently in UTF-8 than the same bytes decode in Windows-1252). Encoding reference: https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
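As a rough illustration of the idea (not the actual prober code, whose language and structure I'm not reproducing here), a confidence function along these lines would skip bytes <= 127 entirely and base fullLen only on validated multi-byte sequences; the function name and confidence formula below are assumptions for the sketch:

```python
def utf8_confidence(data: bytes) -> float:
    """Hypothetical sketch: confidence that `data` is UTF-8,
    counting only multi-byte sequences (basic ASCII is ignored)."""
    full_len = 0  # number of validated multi-byte characters
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            i += 1          # basic ASCII: identical in most encodings, skip
            continue
        # Determine expected sequence length from the lead byte.
        if 0xC0 <= b <= 0xDF:
            seq = 2
        elif 0xE0 <= b <= 0xEF:
            seq = 3
        elif 0xF0 <= b <= 0xF4:
            seq = 4
        else:
            return 0.0      # stray continuation/invalid lead byte: not UTF-8
        # All continuation bytes must be in 0x80..0xBF.
        if i + seq > len(data) or any(
            not (0x80 <= c <= 0xBF) for c in data[i + 1 : i + seq]
        ):
            return 0.0
        full_len += 1
        i += seq
    if full_len == 0:
        return 0.0          # pure ASCII carries no evidence either way
    # Assumed formula: confidence grows with each multi-byte character seen.
    return 0.99 * (1.0 - 0.5 ** full_len)
```

With this shape, a Windows-1252 byte like 0xE9 ("é") followed by ASCII fails continuation validation and scores 0.0, while genuine UTF-8 text gains confidence with every multi-byte character, regardless of how much ASCII surrounds it.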
Ideally, this functionality should ensure:
Ignoring basic ASCII characters MAY increase confidence for a multi-byte-character document. But the current problem is a low UTF-8 true-positive rate rather than a high UTF-8 false-positive rate. I wrote some short tests locally and found that text encoded by other methods, such as Windows-1252, is still detected correctly, the UTF-8 prober is never triggered, and the existing tests also pass as-is. So I think this trade-off is worth it.