Extract table contents from docx #92

Merged
merged 6 commits into deanmalmgren:master on Oct 9, 2015

3 participants
@jsmith-mploir
Contributor

jsmith-mploir commented Jul 23, 2015

Many Microsoft Word documents use tables for page organization and formatting. The previous implementation of docx_parser.py ignored such tables, even though the python-docx module provides a useful interface to these containers. Documents built from many of the templates that ship with Microsoft Word hold almost all of their content inside a hidden table. As a result, many seemingly full-length documents were parsed as empty strings.

The proposed revision recursively explores the table structure of the document and extracts all paragraphs contained therein, resulting in a far more thorough and flexible parser.
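A minimal sketch of the recursive idea described above, not the code in this pull request. The function names `iter_paragraph_text` and `extract_text` are hypothetical. The sketch assumes only that a container exposes `.paragraphs` and `.tables`, that a table exposes `.rows`, and that a row exposes `.cells`, which is how python-docx models documents and table cells.

```python
def iter_paragraph_text(container):
    """Yield the text of every paragraph in `container`, descending into tables.

    `container` is any object exposing `.paragraphs` and `.tables`, such as a
    python-docx document or table cell (assumed interface).
    """
    # Paragraphs that sit directly in this container come first.
    for paragraph in container.paragraphs:
        yield paragraph.text
    # Then every table in this container, cell by cell.
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell is itself a container, so tables nested inside
                # cells are handled by the same function.
                yield from iter_paragraph_text(cell)


def extract_text(container):
    """Join all non-blank paragraph texts with newlines."""
    return "\n".join(t for t in iter_paragraph_text(container) if t.strip())
```

With python-docx installed, this could be called as `extract_text(docx.Document("report.docx"))`, where the file name is a placeholder. Note that this sketch emits all top-level paragraphs before any table content, so it does not preserve document order.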

jsmith-mploir added some commits Jul 23, 2015

jsmith-mploir
Extract table contents from docx
@deanmalmgren

Owner

deanmalmgren commented Aug 29, 2015

Thanks for putting this together @jsmith-mploir! This looks fantastic. So just to make sure I understand, the way that this currently works is that it will first extract all of the paragraphs and then extract all of the tables. Is that correct?

Ideally, we could preserve the order of the text in tables and paragraphs so that, if there are tables interspersed within the body of a document, the tables will appear in the correct order. This can be important for a lot of natural language processing applications. Is it possible to preserve the order of paragraphs and tables with python-docx?

Don't worry about the failing tests. That doesn't have anything to do with your addition. It looks like BeautifulSoup got an upgrade which is wreaking havoc on the html parsers and we're still having some f^#(ing problems with the audio parser testing, but I'll take care of that nonsense.

I can take care of this, but I'd give you a solid 👍 👍 if you could also add a test case for this new functionality. To do this, add a Word document to the tests/docx/ directory, use the textract command line tool to extract its text with textract tests/docx/paragraphs_and_tables.docx > tests/docx/paragraphs_and_text.txt, and then add something like this to tests/test_docx.py:

def test_tables(self):
    """make sure table output is correct"""
    d = self.get_extension_directory()
    self.compare_cli_output(os.path.join(d, "paragraphs_and_tables.docx"))
@deanmalmgren

Owner

deanmalmgren commented Aug 31, 2015

@scanny At the risk of looping you into this conversation, do you have any suggestions for how to extract all paragraphs and tables from a docx file in the order in which they appear?

Feel free to ignore this; I figured I might as well ask though :)

@scanny

scanny commented Aug 31, 2015

Hi Dean, this issue discusses how this has been handled in python-docx: python-openxml/python-docx#40. This is one of the pain points that arose from modeling the MS API for Word. They actually don't have an object that contains all the so-called "block-level" items (para + table) in document order.

There were some folks working on it, I think, but as I recall I never got a complete pull request for it. The workaround uses internals, which of course are subject to change, so it needed a couple of fix-ups as the package evolved. But you should be able to piece a working version together from the thread. If you do, you can post it for others. The question comes up from time to time.

If you have more specific questions after looking at that code let me know :)
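For illustration only, here is one way to read paragraphs and table cells in document order. This is not the python-docx workaround from the linked thread; it is a standard-library sketch that reads word/document.xml directly, where paragraphs (`w:p`) and tables (`w:tbl`) are sibling elements of the body. The function names are hypothetical, and the sketch assumes a simple document: content wrapped in other elements, such as content controls, is skipped.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _paragraph_text(paragraph):
    # Concatenate every text run (w:t) inside one paragraph element.
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))


def _iter_blocks(parent):
    """Yield paragraph texts under `parent` in the order they appear."""
    for child in parent:
        if child.tag == W + "p":
            yield _paragraph_text(child)
        elif child.tag == W + "tbl":
            # Walk direct rows and cells only; nested tables are reached
            # through the recursive call on each cell.
            for row in child.findall(W + "tr"):
                for cell in row.findall(W + "tc"):
                    yield from _iter_blocks(cell)


def iter_text_in_order(docx_file):
    """Yield paragraph and table-cell text from a .docx in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    if body is None:
        return
    yield from _iter_blocks(body)
```

`docx_file` may be a path or a binary file object, since `zipfile.ZipFile` accepts both.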

@deanmalmgren

Owner

deanmalmgren commented Aug 31, 2015

Thanks for the helpful link, @scanny! We'll be sure to keep you in the loop if/when we incorporate a solution on our end. I could imagine sending a PR to your repository, perhaps in the form of python-openxml/python-docx#72. From poking around on python-docx's issues, it looks like this comes up pretty frequently.

@jsmith-mploir

Contributor

jsmith-mploir commented Sep 11, 2015

Greetings @deanmalmgren and @scanny. I just want to say textract is an amazing tool. I use it as a Swiss Army knife of document parsing for a lot of data-driven projects. When I came across this issue, it looked like a good opportunity to give something back. I've never contributed to an open source project before and am also not super familiar with GitHub. The continuous integration stuff is also new to me, but I'd love to learn. If you still need me to add the test case, I'm willing to take a shot at it, but I may need some guidance.

@deanmalmgren deanmalmgren merged commit a2a2aa4 into deanmalmgren:master Oct 9, 2015

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed

deanmalmgren added a commit that referenced this pull request Oct 9, 2015

@deanmalmgren

Owner

deanmalmgren commented Oct 9, 2015

Thanks for the feedback @scanny and the contribution @jsmith-mploir. I added a few tests and cleaned this up a bit so it could be included. I didn't get around to dealing with the tricky issue of properly ordering the tables and paragraphs, unfortunately. For the purposes of this package, however, I think it's more important to get all of the text out of a document, secondarily to worry about making sure all of the contents are in the correct order, and as a tertiary concern to worry about making sure the text output is human readable.

Thanks again everybody. I'm going to make another release for textract very soon. I'm just trying to button up a few lingering issues that are causing the tests to fail.
