Windows-1252 encoding for HTML numeric entities #191

Open: ifly6 wants to merge 10 commits into apache:master from ifly6:cp1252
@ifly6 ifly6 commented Dec 13, 2020

Currently, if given text like so:

"arma virumque cano&#133;"
"&#147;bread and circuses&#148;"

StringEscapeUtils returns the Unicode characters at the corresponding code points (128, 133, 147, and 148, for example), which are obscure, almost never used C1 control characters that display as spaces. In Windows-1252, however, those same values are common printable characters: 128 is € and 153 is ™.

I've changed NumericEntityUnescaper so that decimal HTML numeric entities whose value is a valid CP-1252 code point between 128 and 159 (inclusive) are treated as CP-1252 characters and decoded to the corresponding punctuation marks and symbols rather than to the obscure Unicode control characters.
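To make the intended mapping concrete, here is a minimal sketch. It is not the code in this PR: the class `Cp1252EntitySketch` and its methods are hypothetical, it handles only decimal entities with a trailing semicolon, and it uses a hard-coded table for 128 to 159 instead of a charset decoder. The five values that Windows-1252 leaves undefined (129, 141, 143, 144, 157) pass through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the NumericEntityUnescaper change itself. */
final class Cp1252EntitySketch {

    /** Windows-1252 characters for 128..159; U+FFFD marks the five undefined slots. */
    private static final String CP1252_HIGH =
            "\u20AC\uFFFD\u201A\u0192\u201E\u2026\u2020\u2021"
          + "\u02C6\u2030\u0160\u2039\u0152\uFFFD\u017D\uFFFD"
          + "\uFFFD\u2018\u2019\u201C\u201D\u2022\u2013\u2014"
          + "\u02DC\u2122\u0161\u203A\u0153\uFFFD\u017E\u0178";

    /** Decimal entities only; hex entities are deliberately left alone in this sketch. */
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#([0-9]{1,7});");

    /** Maps 128..159 to the Windows-1252 character; everything else is returned as-is. */
    static int translate(int codePoint) {
        if (codePoint < 128 || codePoint > 159) {
            return codePoint;
        }
        char mapped = CP1252_HIGH.charAt(codePoint - 128);
        return mapped == '\uFFFD' ? codePoint : mapped;
    }

    /** Replaces each decimal numeric entity in the input with its translated character. */
    static String unescape(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int codePoint = Integer.parseInt(m.group(1));
            if (codePoint > Character.MAX_CODE_POINT) {
                out.append(m.group()); // not a valid code point: keep the entity text
            } else {
                out.appendCodePoint(translate(codePoint));
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("arma virumque cano&#133;"));
        System.out.println(unescape("&#147;bread and circuses&#148;"));
    }
}
```

With this sketch, `&#133;` becomes the horizontal ellipsis (U+2026) and `&#147;`/`&#148;` become the curly double quotes (U+201C, U+201D), while `&#129;` stays the control character U+0081.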

for that range, translate to Windows 1252 encoding; re-throw illegal argument exception with input if restrictions are violated

add a test covering numeric entities that were improperly encoded as cp-1252, including code points just before and after the range [128, 159].

the numeric entity algorithm applies only under very restrictive conditions: the value must be in the range where ISO 8859-1 and Windows-1252 differ, it must be a non-hex numeric entity (this is to avoid tripping one of StringEscapeUtilsTests), and it must not be an invalid Windows-1252 point in that range.
invalid points are left unmodified
stopped creating a new decoder on every invocation; use a static decoder instead.
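For readers following the commit notes above, a rough sketch of what a shared, statically held decoder might look like. The class name `StaticCp1252Decoder` and its `decode` method are hypothetical and not taken from this PR. It assumes the JDK provides a `windows-1252` charset and that undefined bytes are reported as coding errors, which the sketch turns into `IllegalArgumentException`. Because `CharsetDecoder` instances are not safe for concurrent use, the method is synchronized.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/** Hypothetical illustration of reusing one decoder; not the code in this PR. */
final class StaticCp1252Decoder {

    /** Created once and reused; configured to report errors instead of substituting. */
    private static final CharsetDecoder DECODER = Charset.forName("windows-1252")
            .newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    /**
     * Decodes a value in [128, 159] as a single Windows-1252 byte.
     * Throws IllegalArgumentException if the value is outside that range
     * or has no Windows-1252 character assigned.
     */
    static synchronized char decode(int codePoint) {
        if (codePoint < 128 || codePoint > 159) {
            throw new IllegalArgumentException("outside [128, 159]: " + codePoint);
        }
        try {
            // decode(ByteBuffer) resets the decoder first, so reuse is safe here.
            char c = DECODER.decode(ByteBuffer.wrap(new byte[] {(byte) codePoint})).get(0);
            if (c == '\uFFFD') {
                throw new IllegalArgumentException("undefined in windows-1252: " + codePoint);
            }
            return c;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("undefined in windows-1252: " + codePoint, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(128)); // euro sign
        System.out.println(decode(153)); // trade mark sign
    }
}
```

A caller that wants invalid points left unmodified could catch the exception and fall back to the original code point.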

renamed to cp-1252 to save characters; fixed documentation in the constructor. also reformatted it to match what i think was intended by original author
@garydgregory
Member

There is a test failing... :-(

capitalise the unicode escapes in the test because that seems to be the prevailing code style
@coveralls

coveralls commented Dec 13, 2020

Coverage Status

Coverage decreased (-0.03%) to 98.654% when pulling f5c12c9 on ifly6:cp1252 into fa366c8 on apache:master.

@elharo

elharo commented Aug 20, 2023

What does the HTML spec say? That's what should be followed.
