switch from python-docx to python-docx2txt #100
referenced this pull request
Nov 9, 2015
Hey guys, I've been working on a project where I have to extract all the text as well as the images from a .docx file; however, python-docx only seems to extract the text. I managed to achieve the desired effect by combining python-docx with docx2txt. I essentially create a temp folder, extract the images with docx2txt, process them using textract via python-docx, and then delete the temp folder. My code is posted underneath should anyone require this functionality:
Final point: the code is written in Python 3.4.
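For anyone wanting to try the idea, here is a minimal sketch of the workflow described above. It is not the original poster's code, and it makes a few assumptions: it uses only the standard library instead of docx2txt, relying on the fact that a .docx file is a ZIP archive whose embedded images live under `word/media/`, and it takes a hypothetical `image_handler` callback in place of the textract step. The function name `extract_text_and_images` is also made up for illustration.

```python
import os
import tempfile
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text_and_images(docx_path, image_handler):
    """Return (body_text, results), where results holds one
    image_handler(path) return value per embedded image.

    image_handler is a hypothetical callback (an assumption of this
    sketch); it receives the path of an image in a temporary folder.
    """
    with zipfile.ZipFile(docx_path) as archive:
        # Body text: join the w:t runs inside each w:p paragraph.
        root = ET.fromstring(archive.read("word/document.xml"))
        paragraphs = [
            "".join(node.text or "" for node in para.iter(W_NS + "t"))
            for para in root.iter(W_NS + "p")
        ]

        results = []
        # The temporary folder is removed when the with-block exits.
        with tempfile.TemporaryDirectory() as tmp_dir:
            for name in archive.namelist():
                if name.startswith("word/media/") and not name.endswith("/"):
                    target = os.path.join(tmp_dir, os.path.basename(name))
                    with open(target, "wb") as out:
                        out.write(archive.read(name))
                    # Process the image while the temp file still exists.
                    results.append(image_handler(target))

    return "\n".join(paragraphs), results
```

If textract is installed, passing `textract.process` as `image_handler` should run each extracted image through it, though that depends on textract's image parsers and their external dependencies being available. Since the handler is called before the temp folder is cleaned up, it has to return whatever it needs rather than keep the path.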
Thanks for sharing, @RogerTavares1. That's pretty interesting. With #104 it looks like we'll have Python 3 support very soon. I'm still trying to work out the best way to handle this issue and #95 from @ankushshah89.
It's interesting to me that you want to process the text out of the images as well. If this is a common use case, then we should probably consider parsing text from images in other rich-media formats, like images and movies in web pages or PowerPoint presentations. I have a rather big deadline next week, but I hope to get to this soon. If you have any thoughts or comments in the meantime, please don't hesitate to weigh in.