WAT: unescape XML/HTML character entities #14
The Common Crawl WAT files contain a lot of XML/HTML character entities which should be unescaped. For links/URLs, the share of affected values exceeds 10%. Examples (HTML snippet + WAT extract):
The WAT extractor should replace the character entities with the corresponding character values to simplify processing of the WAT files.
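To illustrate the intended behaviour, here is a minimal sketch in Python. It is not the extractor's actual code: the helper name `unescape_value` and the sample URL are made up for illustration, and the standard library's `html.unescape` stands in for whatever the extractor uses.

```python
import html


def unescape_value(value: str) -> str:
    """Replace character entities (named, decimal, hex) with their characters.

    Hypothetical helper for illustration only; it simply delegates to the
    standard library's html.unescape.
    """
    return html.unescape(value)


# Made-up example of a link as it might appear in an HTML attribute.
escaped_link = "http://example.com/search?q=wat&amp;page=2&amp;id=7"
print(unescape_value(escaped_link))
```

Unescaping is done in a single pass, so a doubly escaped value such as `&amp;amp;` becomes `&amp;` rather than `&`.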
The changes in e0d23b8 have been used for the June 2019 crawl (CC-MAIN-2019-26). A comparison of two randomly selected WAT files from May and June shows that the number of entities in JSON string values has dropped by a factor of 100:
The counts are based on a simple regex pattern which should give an acceptable approximation:
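The pattern actually used for these counts is not reproduced here. As a rough sketch of how such a count could be made, the following Python snippet walks the string values of one parsed JSON record and counts matches of an assumed entity pattern. `ENTITY_RE`, `iter_strings`, `count_entities`, and the sample record are all hypothetical.

```python
import json
import re

# Assumed pattern (not the one used for the counts above): a named,
# decimal, or hexadecimal character reference terminated by a semicolon.
ENTITY_RE = re.compile(
    r"&(?:[A-Za-z][A-Za-z0-9]{1,31}|#[0-9]{1,7}|#[xX][0-9A-Fa-f]{1,6});"
)


def iter_strings(node):
    """Yield every string value found in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def count_entities(json_text: str) -> int:
    """Count entity-like substrings in the string values of one JSON record."""
    record = json.loads(json_text)
    return sum(len(ENTITY_RE.findall(s)) for s in iter_strings(record))


# Made-up record: two escaped values and one already-clean URL.
sample = (
    '{"Links": [{"url": "a?x=1&amp;y=2", "text": "Tom &#38; Jerry"},'
    ' {"url": "b?x=1&y=2"}]}'
)
print(count_entities(sample))
```

A pattern like this only approximates the true number: it also matches text that merely looks like an entity, and it misses entities written without the trailing semicolon.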
A quick check of the remaining 9,000 entities revealed the following reasons why some entities are still unescaped: