Skip to content

Comments

added cp1252 escapes#190

Closed
ifly6 wants to merge 6 commits intoapache:masterfrom
ifly6:master
Closed

added cp1252 escapes#190
ifly6 wants to merge 6 commits intoapache:masterfrom
ifly6:master

Conversation

@ifly6
Copy link

@ifly6 ifly6 commented Dec 12, 2020

In EntityArrays, added escape map for CP-1252 encoding and unescape mapping. Added to StringEscapeUtils methods to escape them, preserving existing ESCAPE_HTML4 functionality.

Functionality added in response to bug report https://issues.apache.org/jira/browse/TEXT-192.

In EntityArrays, added escape map for CP-1252 encoding and unescape mapping. Added to StringEscapeUtils methods to employ them, preserving existing ESCAPE_HTML4 functionality.
@ifly6 ifly6 closed this Dec 12, 2020
missed a semicolon
@ifly6 ifly6 reopened this Dec 12, 2020
@ifly6
Copy link
Author

ifly6 commented Dec 12, 2020

I've never worked on this project before, could someone tell me what is going on?

@garydgregory
Copy link
Member

This PR is missing unit tests.

@ifly6
Copy link
Author

ifly6 commented Dec 12, 2020

This PR is missing unit tests.

I don't know anything about how to build the project locally or how to fix the unit tests. Looking at the logs for the build, it seems that there's some counter which tallies new HashMap<>() calls and initialMap.put calls. I don't know why the test is set up this way, but I think it implies that the way the CP1252 map variable is set up, which uses new HashMap<>(ISO8859_1_ESCAPE) will always raise an error.

added smart quotes test
separation needed to avoid entitiyarrays test error
missed a comma ugh
seems the source is in iso 8859-1
@coveralls
Copy link

Coverage Status

Coverage increased (+0.007%) to 98.694% when pulling 5fb8fc7 on ifly6:master into 193c6c4 on apache:master.

{"null", null, null},
{"ampersand", "bread &amp; butter", "bread & butter"},
{"quotes", "&quot;bread&quot; &amp; butter", "\"bread\" & butter"},
{"smart quotes", "&ldquo;bread and circuses&rdquo;", "\u201Cbread and circuses\u201d"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a new map with dozens of new entries but you are only testing a single value? Am I reading this right or are all the other map entries somehow also tested?

Copy link
Author

@ifly6 ifly6 Dec 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested a single value because the tests seem to test only a few values; adding an exhaustive number of tests seems mostly just to reproduce the existing map.

That said, thinking about how the library handles numeric escapes, I don't think it solves the initial problem that led me towards creating this pull request – translating something like &#147; to or &#137; to – so I've closed it.

@ifly6 ifly6 closed this Dec 12, 2020
@kinow
Copy link
Member

kinow commented Dec 13, 2020

Hi @ifly6 only had time to look at your PR now, sorry. Feel free to open a new one if you have another solution for the Windows 1252 charset issue 👍 Thanks!

@ifly6
Copy link
Author

ifly6 commented Dec 13, 2020

Hi @ifly6 only had time to look at your PR now, sorry. Feel free to open a new one if you have another solution for the Windows 1252 charset issue 👍 Thanks!

Thanks for looking at it; I think the actual way forward would be something to do with changing the numeric unescaper in Commons text to decode certain ranges as Windows-1252 instead treating them as Unicode. I'm not sure right now how to implement it in a manner consistent with the existing code base.

@ifly6
Copy link
Author

ifly6 commented Dec 13, 2020

Hi @ifly6 only had time to look at your PR now, sorry. Feel free to open a new one if you have another solution for the Windows 1252 charset issue 👍 Thanks!

I think I found a working way to get the NumericUnescaper to work with the CP-1252 section; should I create a new pull request or wait for someone to get back to me?

@kinow
Copy link
Member

kinow commented Dec 13, 2020

Feel free to create a new PR, or re-open thos one if you will build up on the work you've already done hhhere

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants