[WET] Missing spaces in parsed content #13

pipldev opened this issue Aug 9, 2017 · 1 comment

pipldev commented Aug 9, 2017

For example, in CC-MAIN-20170629154125-20170629174125-00719.warc.wet, the parsed text for the URL contains the line:
February 25, 2017by Catherine Heath9 min readAdd Comment One thing I’m surprised by in my career (in less than a year at professional blogging) is the haters.
This line is parsed from the attached HTML fragment (I could not find a good way to embed the HTML here).

The main problem for me is that multiple words are fused into one, e.g. "Heath9". The missing newline before "One thing" is also strange.

Original Google Groups discussion:!topic/common-crawl/heyZMsBT4YY

sebastian-nagel changed the title from "Missing spaces in parsed content" to "[WET] Missing spaces in parsed content" on Aug 23, 2017

sebastian-nagel added a commit that referenced this issue Aug 23, 2017
WET extractor: improve spacing of textual content, fixes #13
- rely on HTML elements (block vs. inline) for spacing and line breaks
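
The idea named in the commit message (treating block and inline elements differently when emitting text) can be illustrated with a minimal Python sketch. This is not the code from the referenced commit: the `BLOCK_TAGS` set, the `TextExtractor` class, the `extract_text` helper, and the sample HTML are all hypothetical and chosen only to reproduce the symptom reported above, since the attached HTML fragment is not available here. The sketch assumes that block-level elements start and end a line, while inline elements add no separator.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of block-level elements, not an
# authoritative list and not the list used by the actual WET extractor.
BLOCK_TAGS = {
    "address", "article", "blockquote", "br", "div", "footer", "h1", "h2",
    "h3", "h4", "h5", "h6", "header", "li", "ol", "p", "section", "table",
    "tr", "ul",
}


class TextExtractor(HTMLParser):
    """Collects text; optionally breaks lines at block-level elements."""

    def __init__(self, block_aware=True):
        super().__init__(convert_charrefs=True)
        self.block_aware = block_aware
        self.parts = []

    def _boundary(self, tag):
        # Block elements delimit lines; inline elements add nothing, so
        # "One <em>thing</em>" keeps only the spacing present in the source.
        if self.block_aware and tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        raw = "".join(self.parts)
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in raw.split("\n"))
        return "\n".join(line for line in lines if line)


def extract_text(html, block_aware=True):
    parser = TextExtractor(block_aware=block_aware)
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical markup, loosely modelled on the reported line.
SAMPLE_HTML = (
    '<div class="meta"><span>February 25, 2017</span></div>'
    "<div>by Catherine Heath</div><div>9 min read</div>"
    "<p>One <em>thing</em> I am surprised by</p>"
)

if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML, block_aware=False))
    print("---")
    print(extract_text(SAMPLE_HTML))
```

With `block_aware=False` the sketch simply concatenates text nodes, which fuses adjacent blocks into tokens such as "Heath9". With the default setting each block ends up on its own line. Inline elements deliberately add no whitespace, so markup such as `<b>H</b>ello` is not split; the trade-off is that two adjacent inline elements with no whitespace between them would still be joined.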

sebastian-nagel commented Aug 24, 2017

Hi @pipldev, this is fixed for the August crawl (CC-MAIN-2017-34). Thanks!
