WAT: unescape XML/HTML character entities #14

Closed
sebastian-nagel opened this issue Jun 13, 2019 · 1 comment

commented Jun 13, 2019

The Common Crawl WAT files contain a lot of XML/HTML character entities that should be unescaped. For links/URLs, more than 10% of the values contain entities. Examples (HTML snippet + WAT extract):

<img src="https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001" alt="Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;">

{
  "path": "IMG@/src",
  "alt": "Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;",
  "url": "https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001"
},
  • note that the problem applies to all kinds of XML/HTML character entities:
<a href="https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI">
  EU Customer Service
</a>

{
  "path": "A@/href",
  "text": "EU Customer Service",
  "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI"
},
  • in text
<a class="pdb-meta-link" href="http://www.madsack.de/"
   target="_blank" rel="nofollow"
   >© Verlagsgesellschaft Madsack GmbH &amp; Co. KG</a>

{
  "path": "A@/href",
  "rel": "nofollow",
  "text": "© Verlagsgesellschaft Madsack GmbH &amp; Co. KG",
  "url": "http://www.madsack.de/",
  "target": "_blank"
},
  • and attribute values
<meta property="og:description" content="As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp&amp;#39;s defence" >

{
  "property": "og:description",
  "content": "As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp&amp;#39;s defence"
},

The WAT extractor should replace the character entities with the corresponding characters to simplify processing of the WAT files.
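
For illustration, the expected result can be sketched in Python. This is not the extractor's code: it uses the standard library's `html.unescape`, which follows HTML5 rules and may differ in detail from the Java decoder used by the project, and `unescape_values` is a hypothetical helper name.

```python
import html


def unescape_values(record):
    """Return a copy of a flat WAT-style record with entities decoded once."""
    return {
        key: html.unescape(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


# Link record taken from the second example above.
link = {
    "path": "A@/href",
    "text": "EU Customer Service",
    "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI",
}

print(unescape_values(link)["url"])
# https://secure.customersvc.com/wes/servlet/Show?WESPAGE=iam/pages/home.jsp&MSRSMAG=FI
```

Values without entities, such as "path" here, pass through unchanged.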

sebastian-nagel added a commit that referenced this issue Jun 19, 2019

WAT: unescape XML/HTML character entities (#14)
- call org.htmlparser.util.Translate.decode(String) on attribute
  values and text content of elements (<a> anchor text, <title>)
  to decode character entities
- add unit test

commented Jul 2, 2019

The changes in e0d23b8 have been used for the June 2019 crawl (CC-MAIN-2019-26). A comparison of two randomly selected WAT files, one from May and one from June, shows that the number of entities in JSON string values has dropped by a factor of more than 100:

  • from 1,019,102 in CC-MAIN-20190526105248-20190526131248-00063.warc.wat.gz
  • to 8,791 in CC-MAIN-20190619204313-20190619230313-00114.warc.wat.gz

The counts are based on a simple regex pattern which should give an acceptable approximation:

% zgrep '^{' CC-MAIN-20190526*.wat.gz | jq . | grep -E '&.{2,8};' | wc -l
1019102
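
A rough Python counterpart of this pipeline, for cases where `jq` is not at hand. It is only a sketch: the function names are hypothetical, and it counts JSON string values containing a match rather than pretty-printed lines, so the totals can differ slightly from the shell version.

```python
import gzip
import json
import re

# Same loose pattern as the grep above: '&', 2 to 8 arbitrary characters, ';'.
ENTITY_LIKE = re.compile(r"&.{2,8};")


def iter_strings(value):
    """Yield every string value nested anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def count_entity_strings(lines):
    """Count JSON string values that contain at least one entity-like match."""
    count = 0
    for line in lines:
        if not line.startswith("{"):
            continue  # skip WARC headers; JSON payloads start with '{'
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that merely look like JSON
        count += sum(1 for s in iter_strings(record) if ENTITY_LIKE.search(s))
    return count


def count_in_wat(path):
    """Apply the count to a gzipped WAT file (path supplied by the caller)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as wat:
        return count_entity_strings(wat)
```

Calling `count_in_wat` on one of the WAT files named above should give a number of the same order as the `wc -l` result, though not necessarily the identical value.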

A quick check of the remaining roughly 9,000 entities showed the following reasons why some entities are still not unescaped:

  • double-escaped entities (&amp;amp;): in most cases these are probably errors on the web pages. But since one might legitimately want to write "In HTML a literal ampersand must be written as &amp;", recursively decoding entities isn't good practice, and it doesn't conform to the standard in any case.
  • entities not supported by htmlparser.org:
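
To illustrate the double-escaping point in the first bullet, here is a small Python sketch comparing a single decoding pass with repeated decoding. It uses the standard library's `html.unescape` as a stand-in for the Java decoder, and both helper names are hypothetical.

```python
import html


def decode_once(text):
    """Decode character entities in a single pass."""
    return html.unescape(text)


def decode_until_stable(text):
    """Decode repeatedly until nothing changes (shown only to expose the problem)."""
    previous = None
    while previous != text:
        previous, text = text, html.unescape(text)
    return text


# Page source of a sentence that is meant to display a literal "&amp;".
source = "In HTML a literal ampersand must be written as &amp;amp;"

print(decode_once(source))
# In HTML a literal ampersand must be written as &amp;
print(decode_until_stable(source))
# In HTML a literal ampersand must be written as &
```

The single pass reproduces what a browser would display, while repeated decoding silently changes the meaning of the sentence.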

sebastian-nagel added a commit that referenced this issue Jul 8, 2019

WAT: unescape XML/HTML character entities (#14)
- call org.htmlparser.util.Translate.decode(String) on attribute
  values and text content of elements (<a> anchor text, <title>)
  to decode character entities
- add unit test