Extract table contents from docx #92
Many Microsoft Word documents use tables for page organization and formatting. The previous implementation of docx_parser.py ignored such tables even though the python-docx module provides a useful interface to these containers. Documents built from many of the templates provided by Microsoft Word hold almost all of their content within a hidden table, so many seemingly full-length documents were parsed as empty strings.
The proposed revision recursively explores the table structure of the document and extracts all paragraphs contained therein, resulting in a far more thorough and flexible parser.
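For readers who want a picture of the recursive walk described above, here is a minimal sketch. It is not the code from this pull request. The names `iter_paragraph_text` and `extract_docx_text` are made up for illustration, and the sketch only assumes that python-docx documents and table cells both expose `.paragraphs` and `.tables`.

```python
def iter_paragraph_text(container):
    """Yield paragraph text from a container, then from every table nested in it.

    `container` is anything exposing `.paragraphs` and `.tables`, which is the
    case for python-docx document and table cell objects.
    """
    for paragraph in container.paragraphs:
        yield paragraph.text
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell can hold paragraphs and further tables, so recurse.
                yield from iter_paragraph_text(cell)


def extract_docx_text(path):
    """Hypothetical entry point; needs python-docx installed."""
    # Imported lazily so the walker above carries no hard dependency.
    from docx import Document
    return "\n\n".join(iter_paragraph_text(Document(path)))
```

Note that this visits all body paragraphs first and all tables afterwards, so interleaved tables lose their position. Merged cells may also be reported more than once by `row.cells`, which would duplicate their text.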
Thanks for putting this together @jsmith-mploir! This looks fantastic. So just to make sure I understand, the way that this currently works is that it will first extract all of the paragraphs and then extract all of the tables. Is that correct?
Ideally, we could preserve the order of the text in tables and paragraphs so that, if there are tables interspersed within the body of a document, the tables will appear in the correct order. This can be important for a lot of natural language processing applications. Is it possible to preserve the order of paragraphs and tables with python-docx?
Don't worry about the failing tests. They don't have anything to do with your addition. It looks like BeautifulSoup got an upgrade which is wreaking havoc on the HTML parsers, and we're still having some f^#(ing problems with the audio parser testing, but I'll take care of that nonsense.
I can take care of this, but I'd give you a solid
    def test_tables(self):
        """make sure table output is correct"""
        d = self.get_extension_directory()
        self.compare_cli_output(os.path.join(d, "paragraphs_and_tables.docx"))
Hi Dean, this issue discusses how this has been handled in python-docx: python-openxml/python-docx#40. This is one of the pain points that arose from modeling the MS API for Word. They actually don't have an object that contains all the so-called "block-level" items (para + table) in document order.
There were some folks working on it, I think, but as I recall I never got a complete pull request for it. The workaround uses internals, which of course are subject to change, so it has needed a couple of fix-ups as the package evolved. But you should be able to piece a working version together from the thread. If you do, you can post it for others. The question comes up from time to time.
If you have more specific questions after looking at that code let me know :)
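To make the "document order" idea concrete, here is a rough sketch that avoids python-docx internals entirely by reading the WordprocessingML directly with the standard library. This is not the workaround from the linked thread, just an illustration of the same idea. The function names are hypothetical, and it assumes the main part lives at `word/document.xml`, which is conventional but not guaranteed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}tag" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_text(p):
    """Concatenate the text runs (w:t) of a single w:p element."""
    return "".join(t.text or "" for t in p.iter(W + "t"))


def iter_block_text(parent):
    """Yield paragraph text under `parent` in document order, descending into tables."""
    for child in parent:
        if child.tag == W + "p":
            yield paragraph_text(child)
        elif child.tag == W + "tbl":
            for row in child.findall(W + "tr"):
                for cell in row.findall(W + "tc"):
                    # Cells contain block-level items too, so recurse.
                    yield from iter_block_text(cell)


def docx_text_in_order(path):
    """Hypothetical entry point: assumes the body is in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return "\n\n".join(iter_block_text(root.find(W + "body")))
```

Because paragraphs and tables are visited as siblings in the order they appear in the body, a table between two paragraphs stays between them. The sketch ignores content controls, tabs, line breaks, headers, and footers.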
Thanks for the helpful link, @scanny! We'll be sure to keep you in the loop if/when we incorporate a solution on our end. I could imagine sending a PR to your repository, perhaps in the form of python-openxml/python-docx#72. From poking around on python-docx's issues, it looks like this comes up pretty frequently.
Greetings @deanmalmgren and @scanny. I just want to say textract is an amazing tool. I use it as a Swiss Army knife of document parsing for a lot of data-driven projects. When I came across this issue, it looked like a good opportunity to give something back. I've never contributed to an open source project before and am also not super familiar with GitHub. The continuous integration stuff is also new to me, but I'd love to learn. If you still need me to add the test case, I'm willing to take a shot at it, but I may need some guidance.
Oct 9, 2015
Thanks for the feedback @scanny and the contribution @jsmith-mploir. I added a few tests and cleaned this up a bit so it could be included. I didn't get around to dealing with the tricky issue of properly ordering the tables and paragraphs, unfortunately. For the purposes of this package, however, I think it's most important to get all of the text out of a document, secondarily to make sure all of the contents are in the correct order, and as a tertiary concern to make sure the text output is human-readable.
Thanks again everybody. I'm going to make another release for textract very soon. I'm just trying to button up a few lingering issues that are causing the tests to fail.