WAT: only unescape complete XML/HTML character entities #19
Note that only malformed URLs are mangled: the ampersand should also be escaped in HTML element attributes. Of course, this is a frequent error in HTML, and invalid URLs/links can be cumbersome when they are used to feed a crawler, construct a webgraph, etc.
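To illustrate the idea in the title, here is a minimal sketch in Python of unescaping only complete entities, that is, entities terminated by a semicolon. It is not the extractor's actual code: the function name `unescape_complete_entities` and the pattern `COMPLETE_ENTITY` are hypothetical, and the example URLs are made up.

```python
import html
import re

# Hypothetical pattern: matches only entities closed by a semicolon,
# either named (&amp;), decimal (&#38;) or hexadecimal (&#x26;).
COMPLETE_ENTITY = re.compile(
    r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
)


def unescape_complete_entities(text: str) -> str:
    """Unescape complete character entities and leave everything else alone.

    A bare ampersand followed by something that merely looks like an
    entity name (e.g. the query parameter in "?a=1&copy=2") is not
    matched by the pattern, so it is passed through unchanged.
    """
    return COMPLETE_ENTITY.sub(lambda m: html.unescape(m.group(0)), text)


if __name__ == "__main__":
    # Correctly escaped attribute value: the entity is resolved.
    print(unescape_complete_entities("http://example.com/?a=1&amp;b=2"))
    # Malformed (unescaped) URL: kept as-is instead of being mangled.
    print(unescape_complete_entities("http://example.com/?a=1&copy=2"))
```

The design choice is to let the regular expression decide what counts as an entity and to delegate only the lookup to `html.unescape`, so unknown names such as `&foo;` are also left untouched.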
When extracting text a lazy replacement (without a closing