[WET] Missing spaces in parsed content #13
For example, in CC-MAIN-20170629154125-20170629174125-00719.warc.wet, the parsed text for the URL http://awaywithwords.co/category/general/ contains the line:
The main problem for me is that multiple words get fused into one, e.g. "Heath9", but the missing newline before "One thing" is also strange.
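To reproduce this, it may help to pull the extracted text for that URL straight out of the WET file and inspect it. The following is only a minimal sketch, not a full WARC parser: it assumes each record starts with a `WARC/` version line, carries `WARC-Target-URI` and `Content-Length` headers, and has a UTF-8 body. The function names `iter_wet_records` and `find_record_text` are hypothetical helpers made up for this example, and the local file path in the usage comment is an assumption.

```python
import gzip
import io


def iter_wet_records(stream):
    """Yield (headers, text) for each record in a binary WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly Content-Length bytes so the body is never parsed as headers.
        length = int(headers.get("content-length", "0"))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def find_record_text(stream, target_uri):
    """Return the extracted text of the first record for target_uri, or None."""
    for headers, text in iter_wet_records(stream):
        if headers.get("warc-target-uri") == target_uri:
            return text
    return None


# Usage (assumes the WET file was downloaded locally and is gzip-compressed):
# with gzip.open("CC-MAIN-20170629154125-20170629174125-00719.warc.wet.gz", "rb") as f:
#     text = find_record_text(f, "http://awaywithwords.co/category/general/")
#     print(text)
```

If the file has already been decompressed, opening it with the built-in `open(path, "rb")` instead of `gzip.open` should work the same way with this sketch.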
Original Google Groups discussion: https://groups.google.com/forum/#!topic/common-crawl/heyZMsBT4YY