switch from python-docx to python-docx2txt #100

Merged
merged 1 commit into from Nov 15, 2016

3 participants
@ankushshah89
Contributor

ankushshah89 commented Nov 9, 2015

Switching to another module for extracting text from .docx files, to solve the issues discussed here.
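For context, a .docx file is a zip archive whose main body text lives in `word/document.xml`. The snippet below is not the code of either module, just a rough stdlib-only sketch of what this kind of text extraction involves; the function name `docx_to_text` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the plain text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; `path` may be a filename or a
    binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of text within the paragraph.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This only covers the main document body; headers, footers, footnotes, and embedded images live in other parts of the archive and would need separate handling.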

@RogerTavares1

RogerTavares1 commented Feb 22, 2016

Hey guys, I've been working on a project where I have to extract all the text as well as the images from a .docx file; however, python-docx only seems to extract the text. I managed to achieve the desired effect by combining python-docx with docx2txt. I essentially create a temp folder, extract the images with docx2txt, process them using textract (via python-docx), and then delete the temp folder. My code is posted below should anyone require this functionality:

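import glob
import os
import shutil

import docx2txt
import textract

temp_images = "temp_images"  # scratch folder for extracted images (name assumed)

def parse_docx_file(path):
    # Extract the document's body text (textract returns bytes).
    text = textract.process(path)
    os.makedirs(temp_images, exist_ok=True)
    try:
        # docx2txt writes any embedded images into temp_images.
        docx2txt.process(path, temp_images)
        # Run each extracted image through textract and append the result.
        for image in glob.glob(os.path.join(temp_images, "*")):
            text += textract.process(image)
    finally:
        shutil.rmtree(temp_images)
    print(text.decode())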

Final point: the code is written for Python 3.4.
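As a possible variation on the temp-folder step above, the standard library's `tempfile.TemporaryDirectory` removes the folder automatically, even if processing raises. The sketch below uses only the stdlib and assumes embedded images sit under `word/media/` inside the .docx archive, which is the usual convention; `extract_images`, `with_extracted_images`, and `handle_image` are hypothetical names, with `handle_image` standing in for something like `textract.process`.

```python
import os
import tempfile
import zipfile


def extract_images(docx_path, out_dir):
    """Copy embedded media out of a .docx archive; return the written paths."""
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Assumption: embedded images are stored under word/media/.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as handle:
                    handle.write(archive.read(name))
                written.append(target)
    return written


def with_extracted_images(docx_path, handle_image):
    """Call handle_image(path) on each embedded image, then clean up."""
    with tempfile.TemporaryDirectory() as tmp:
        # The directory and its contents are deleted when this block exits.
        return [handle_image(p) for p in extract_images(docx_path, tmp)]
```

Because the folder is gone once `with_extracted_images` returns, `handle_image` needs to return whatever it wants to keep (for example the extracted text) rather than the file path.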

@deanmalmgren

Owner

deanmalmgren commented Feb 22, 2016

Thanks for sharing, @RogerTavares1. That's pretty interesting. With #104 it looks like we'll have Python 3 support very soon. I'm still trying to work out the best way to handle this issue and #95 from @ankushshah89.

It's interesting to me that you want to process the text out of the images as well. If this is a common use case, then we should probably consider parsing text from images in other rich-media formats, like images and movies in web pages or PowerPoint presentations. I have a rather big deadline next week, but I hope to get to this soon. If you have any thoughts or comments in the meantime, please don't hesitate to weigh in.

@deanmalmgren deanmalmgren merged commit 6ecc86d into deanmalmgren:master Nov 15, 2016

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed

deanmalmgren added a commit that referenced this pull request Nov 15, 2016

@deanmalmgren

Owner

deanmalmgren commented Nov 15, 2016

Finally got around to merging this in, @ankushshah89. Thanks again for the PR!!
