[WET] Missing spaces in parsed content #13

Closed
pipldev opened this Issue Aug 9, 2017 · 1 comment


pipldev commented Aug 9, 2017

For example, in CC-MAIN-20170629154125-20170629174125-00719.warc.wet, the parsed text for the URL http://awaywithwords.co/category/general/ contains the line:
February 25, 2017by Catherine Heath9 min readAdd Comment One thing I’m surprised by in my career (in less than a year at professional blogging) is the haters.
It is parsed from the attached HTML fragment (could not find a good way to embed the HTML here).
fragment.html.txt

The main problem for me is that multiple words get fused into one, e.g. "Heath9"; the missing newline before "One thing" is also strange.

Original Google Groups discussion: https://groups.google.com/forum/#!topic/common-crawl/heyZMsBT4YY

@sebastian-nagel sebastian-nagel changed the title from Missing spaces in parsed content to [WET] Missing spaces in parsed content Aug 23, 2017

sebastian-nagel added a commit that referenced this issue Aug 23, 2017

WET extractor: improve spacing of textual content, fixes #13
- rely on HTML elements (block vs. inline) for spacing and line breaks
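
To illustrate the idea in the commit message, here is a minimal Python sketch of block-vs-inline spacing. It is not the actual WET extractor code: the `BLOCK_TAGS` set, the `SpacingTextExtractor` class, the `extract_text` helper and the `sample` markup are all hypothetical, and the sample is not the attached fragment. It assumes that a block element boundary becomes a line break and any other tag boundary becomes a single space.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete set of block-level elements.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "table", "tr", "blockquote",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "article", "section", "header", "footer",
}


class SpacingTextExtractor(HTMLParser):
    """Collects text, separating it at tag boundaries (assumed rule)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _boundary(self, tag):
        # Block elements break the line; every other tag yields a space.
        self.parts.append("\n" if tag in BLOCK_TAGS else " ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line, drop empty lines.
        lines = []
        for line in "".join(self.parts).split("\n"):
            collapsed = " ".join(line.split())
            if collapsed:
                lines.append(collapsed)
        return "\n".join(lines)


def extract_text(html):
    parser = SpacingTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up markup loosely modelled on the reported line.
sample = (
    "<div><span>February 25, 2017</span><span>by Catherine Heath</span>"
    "<span>9 min read</span><a href='#'>Add Comment</a></div>"
    "<p>One thing I'm surprised by is the haters.</p>"
)
print(extract_text(sample))
```

With this rule, "Heath" and "9 min read" stay separate words and the paragraph starts on its own line. The sketch inserts a space at every inline tag boundary, so it would also split a word that is only partly wrapped in an inline element (e.g. `<b>H</b>ello`); a real extractor would have to handle that case more carefully.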


sebastian-nagel commented Aug 24, 2017

Hi @pipldev, fixed for August crawl (CC-MAIN-2017-34). Thanks!
