Windows-1252 encoding for HTML numeric entities #191

ifly6 · 2020-12-13T20:13:04Z

Given text like this:

"arma virumque cano&#133;"
"&#147;bread and circuses&#148;"

StringEscapeUtils currently returns the Unicode characters at code points such as 128, 133, 147, and 148, which are obscure, essentially unused C1 control characters that display as spaces. In Windows-1252, however, those same code points are common characters: 133 is the ellipsis (…), 147 and 148 are the curly double quotes (“ and ”), and others in the range include € (128) and ™ (153).

I've changed NumericEntityUnescaper to treat HTML numeric entities whose values are valid CP-1252 code points between 128 and 159 (inclusive) as CP-1252 characters, decoding them to the corresponding punctuation marks and symbols instead of the obscure Unicode control characters.

For that range, translate to Windows-1252 encoding; re-throw the illegal argument exception with the input if the restrictions are violated.

Add a test covering numeric entities that are improperly encoded in CP-1252, including code points just before and after the range [128, 159].

The numeric-entity algorithm applies only under very restrictive conditions: the value must be in the range where ISO 8859-1 and Windows-1252 diverge, it must be a non-hex numeric entity (to avoid tripping one of the StringEscapeUtilsTests), and it must not be an invalid Windows-1252 point in that range.

invalid points shouldn't be modifiable

stopped creating a new decoder every time it is invoked; use static decoder instead. renamed to cp-1252 to save characters; fixed documentation in the constructor. also reformatted it to match what i think was intended by original author
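
For readers skimming the thread, the rule described above can be sketched roughly as follows. This is not the code in this PR: the class name `Cp1252EntitySketch` and the `unescape` method are hypothetical, it uses a fixed lookup table rather than a shared charset decoder, and it assumes "invalid points shouldn't be modifiable" means the five undefined CP-1252 values (129, 141, 143, 144, 157) are left as the original entity text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the NumericEntityUnescaper change itself.
final class Cp1252EntitySketch {

    // Decimal entities only; hex entities such as &#x85; are deliberately not matched.
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // Entry i is the character for value 128 + i; 0 marks an undefined CP-1252 point.
    private static final char[] CP1252 = {
        '\u20AC', 0, '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
        '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', 0, '\u017D', 0,
        0, '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
        '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', 0, '\u017E', '\u0178'
    };

    static String unescape(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            String replacement;
            if (value >= 128 && value <= 159) {
                // Range where ISO 8859-1 and Windows-1252 differ.
                char mapped = CP1252[value - 128];
                // Assumption: undefined points keep their original entity text.
                replacement = mapped != 0 ? String.valueOf(mapped) : m.group();
            } else if (Character.isValidCodePoint(value)) {
                // Outside the range: plain Unicode code point.
                replacement = new String(Character.toChars(value));
            } else {
                // Not a Unicode code point at all: leave the text alone.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("arma virumque cano&#133;"));
        System.out.println(unescape("&#147;bread and circuses&#148;"));
    }
}
```

A static table built once avoids allocating a decoder per call, and unlike a shared `CharsetDecoder` it carries no mutable state between invocations.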

garydgregory · 2020-12-13T20:18:56Z

There is a test failing... :-(

coveralls · 2020-12-13T20:42:02Z

Coverage decreased (-0.03%) to 98.654% when pulling f5c12c9 on ifly6:cp1252 into fa366c8 on apache:master.

elharo · 2023-08-20T17:07:24Z

What does the HTML spec say? That's what should be followed.

ifly6 added 5 commits December 13, 2020 14:31

clarify the test added

7d222b4

Update NumericEntityUnescaper.java

eaad9b1

Update NumericEntityUnescaper.java

f473b6f

fix javadoc

fc1b4f2

ifly6 added 3 commits December 13, 2020 15:20

fix javadoc again

b5244d0

fix check style violations

ab6a790

fix check style violation

c66e327

kinow reviewed Dec 13, 2020

src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java

code style change

033cff7

capitalise the unicode escapes in the test because that seems to be the prevailing code style

remove accidentally left serr call

f5c12c9
