boilerpipe crash #29

Closed
GoogleCodeExporter opened this issue May 6, 2015 · 1 comment

Comments

@GoogleCodeExporter

What steps will reproduce the problem?
1. Try to extract this URL with ArticleExtractor:
http://sourceforge.net/projects/xampp/files/XAMPP%20Windows/1.7.4/xampp-win32-1.7.4-VC6-installer.exe/download
It prints the following warning several times:
Warning: SAX input contains nested A elements -- You have probably hit a bug in 
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML 
externally and feed it to boilerpipe again. Trying to recover somehow...
and then crashes with an OutOfMemoryError.

I'm using version 1.2.0. I have tested on Windows and on Ubuntu as well.

Original issue reported on code.google.com by fzr...@gmail.com on 29 Jul 2011 at 1:27

@GoogleCodeExporter

The input was not HTML (its content type was application/x-msdos-program);
boilerpipe nevertheless accepted it, and NekoHTML choked on it.

In the meantime, checks were added in boilerpipe trunk to fetch only text/html
content and to throw an exception otherwise. boilerpipe-web has additional checks
(e.g., content length).

In both cases, the NekoHTML bug warning will no longer appear for this kind of input.
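
For anyone stuck on 1.2.0, a guard of the kind described above can be applied before handing a response to boilerpipe. This is only a minimal sketch, not the code that went into trunk or boilerpipe-web: the class name HtmlGuard, its method names, and the 5 MiB size cap are all assumptions made up for illustration.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of boilerpipe. */
final class HtmlGuard {
    /** Assumed size cap; the limit boilerpipe-web actually uses is not stated in this issue. */
    static final long MAX_CONTENT_LENGTH = 5L * 1024 * 1024;

    private HtmlGuard() {}

    /** Returns true if the Content-Type header value denotes text/html. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and compare only the media type.
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    /** Throws if the response should not be handed to the HTML parser. */
    static void requireParsable(String contentType, long contentLength) {
        if (!isHtml(contentType)) {
            throw new IllegalArgumentException("Unsupported content type: " + contentType);
        }
        // A negative length means the server did not report one; this sketch lets it through.
        if (contentLength > MAX_CONTENT_LENGTH) {
            throw new IllegalArgumentException("Content too large: " + contentLength + " bytes");
        }
    }
}
```

With a java.net.URLConnection, the two arguments would come from getContentType() and getContentLengthLong(), checked before the stream is read. The installer URL from the report would be rejected at the content-type check, so it never reaches NekoHTML.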

Original comment by ckkohl79 on 22 Jan 2012 at 11:03

  • Changed state: Fixed
