New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements of HTML parser. #58

merged 1 commit into from Oct 3, 2014


None yet
3 participants

eiotec commented Aug 26, 2014

I've done some improvements to the HTML parser - it's now able to convert tables into ASCII equivalents and produces nicer output (it can distinguish inline elements and uses newlines more carefully).

Piotr Sliwa

This comment has been minimized.


deanmalmgren commented Aug 26, 2014

Thanks for putting this together. The whitespace of the current html parser admittedly makes the textract output rather...uhhhh...gross.

If I had to be completely honest with myself, I'm a little torn on whether to merge this in though. The beauty of the current (and extremely simple) solution is that, for most (all?) downstream natural language processing tasks, the current solution and its gross whitespace is completely fine because whitespace is usually ignored for text analysis tasks provided the words are in the correct order. Can you help me understand how this improves the quality of downstream natural language processing?

From all the great work you put into this, I gather that it is really valuable. I just want to make sure the benefits outweigh the costs of adding significantly more complex code that we'll need to support going forward.

Assuming we can get some clarity on the question above, we'll also need to address a few other things to get this merged in:

  • As of v1.0.0, I ported all of the tests to using the python unittest framework (and specifically nosetests to run them)— definitely got caught in the switch! Can you rebase your contribution off of the current master and update the tests to reflect your changes? You'll probably need to overwrite the tests/html/raw_text.txt file with the new output from tests/html/raw_text.html
  • It would also be great to add your tables.html test to the testing framework to make sure we don't unintentionally muck it up in the future.

This comment has been minimized.


arvindch commented Aug 26, 2014

This could actually be a new feature.
Maybe have a cmdline switch (-p, --prettify), which prints nicer output (if supported), else silently pass, and print normal output.

What do you think?


This comment has been minimized.

eiotec commented Aug 28, 2014

I'm currently working on a project of web search that has to display contents of parsed web documents in pure-text format and besides the text processing it's quite important for me to make those contents more user friendly, that's why I needed this improvement.
I think that's a good idea to make it a -p/--prettify feature.
Of course, if it's ok, I will corrent those problems to make it pass the Travis build.


This comment has been minimized.


deanmalmgren commented Aug 28, 2014

That's an interesting use case that I hadn't envisioned when I started this project, but I can see how that could be useful in some contexts. I think we should make it clear that this is a strong secondary use case in the documentation (note to self) so that others see the value there too.

Let's try and get this merged in with all the tests passing, etc. I think we can do it without the -p/--prettify flag to keep things as simple as possible.

Thanks again and let me know if you need any help.

@deanmalmgren deanmalmgren merged commit 1e07b34 into deanmalmgren:master Oct 3, 2014

1 check failed

continuous-integration/travis-ci The Travis CI build failed

This comment has been minimized.


deanmalmgren commented Oct 3, 2014

The more I thought about this the more I appreciated the importance of having human readable output. @eiotec's use case is certainly one example. But even from a development perspective it is nice to be able to read the output from textract to have confidence that it is working rather than just seeing a stream of whitespace punctuated by text.

Thanks for bearing with me while we got this merged in, @eiotec. This past month was rather busy personally and professionally and I finally had some time to sit down and merge it in this morning. Thanks again for the contribution!


This comment has been minimized.


deanmalmgren commented Oct 3, 2014

Hit comment too quickly. I meant to add that I fixed the tests to use the new output and added another one with the tables.html that @eiotec added. There are still some failing tests on travis-ci but that is related to another known issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment