This notebook goes into detail on extracting information from MS Word documents locally. Since many companies and roles are inseparable from the Microsoft Office suite, this is a useful reference for anyone faced with data transferred as Word files. The zip-based technique shown here applies to the .docx format; legacy .doc files use a different binary format and cannot be opened this way.

This is a sister-blog to my entry about Thomas Edison State University's (TESU) open source materials accessibility initiative. <a href = "https://medium.com/@NatalieOlivo/preserving-web-content-of-links-provided-in-a-word-doc-using-aws-services-ec2-and-s3-2c4f0cee0a26">Medium post</a> / <a href = "https://github.com/nmolivo/tesu_scraper">Github repository</a><br><br>
The initial post I wrote for this project was very specific and bordered on unwieldy, so this post is the first in a series of deep dives into separate aspects of the TESU project, meant to make the material more accessible.

In [None]:
#specific to extracting information from word documents
import os
import zipfile

#useful tool for extracting information from XML
import re

#to pretty print our xml:
import xml.dom.minidom

A sample word document can be found in this github repository. Let's find it using the packages we've imported above.

In [None]:
#to check files in the current directory, use a single period
os.listdir('.')

In [None]:
#to check files in the directory above the current directory, use double periods
os.listdir('..')

In [None]:
#the sample word document is in the folder entitled "docs"
#os.listdir('../docs')

Great! We can see where the Word document is!<br><br>
We will now use the <a href = "https://docs.python.org/3/library/zipfile.html">zipfile</a> library to help us read our document. The defaults are listed below, and they're all good for our purposes of reading the word document.<br><br>
```python
class zipfile.ZipFile(file, mode='r', compression=ZIP_STORED, allowZip64=True, compresslevel=None)
```
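
Because a .docx file is an ordinary zip archive, `ZipFile` can also be used as a context manager so the file is closed automatically. The sketch below is only an illustration and not part of the original workflow: it builds a tiny stand-in archive in memory (the hypothetical `fake_docx`) so that it runs without any file on disk. With a real document you would pass the path instead.

```python
import io
import zipfile

# Build a tiny stand-in for a .docx in memory (hypothetical content,
# only so the example runs without a real file on disk).
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, mode='w') as archive:
    archive.writestr('word/document.xml', '<w:document/>')

# A .docx is a regular zip archive, so zipfile recognises it as one.
is_zip = zipfile.is_zipfile(fake_docx)

# Open in the default read mode; the context manager closes the archive for us.
fake_docx.seek(0)
with zipfile.ZipFile(fake_docx) as document:
    members = document.namelist()

print(is_zip, members)
```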

In [None]:
document = zipfile.ZipFile('sow_example.docx')

In [None]:
document

OK, now to pull the XML out of the archive using
```python
ZipFile.read(name, pwd=None)
```

As you can see, we need a name. This is not the filename of the Word document; it is the name of a member file inside the zip archive, and it tells ZipFile which part of the document to read. For our purposes, we'll want to see the XML that makes up the document.

In [None]:
document.namelist()
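
To make the difference between a member name and the filename concrete, here is a small sketch. The two-member archive is a hypothetical stand-in built in memory, not the sample document, and the member contents are made up.

```python
import io
import zipfile

# Hypothetical two-member archive standing in for a real .docx.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('word/document.xml', '<body>Hello</body>')
    archive.writestr('word/fontTable.xml', '<fonts/>')

with zipfile.ZipFile(buffer) as document:
    names = document.namelist()
    # read() takes a member name from namelist(), not the .docx filename,
    # and returns that member's contents as bytes.
    body = document.read('word/document.xml')
    # Asking for a member that is not in the archive raises KeyError.
    try:
        document.read('word/people.xml')
        missing = False
    except KeyError:
        missing = True

print(names)
print(body)
print(missing)
```

Checking `namelist()` first is a cheap way to avoid the `KeyError`, since not every Word document contains the same set of parts.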

Just to see what kind of information is stored in our document, let's test a few.
I'm going to pretty print it, so it's a little more legible. This approach is courtesy of an answer from user <a href = "https://stackoverflow.com/users/47775/nick-bolton">Nick Bolton</a> to the StackOverflow question <a href = "https://stackoverflow.com/questions/749796/pretty-printing-xml-in-python">Pretty Printing XML in Python</a>.

In [None]:
#member to read; it must appear in document.namelist()
name = 'word/people.xml'
#we can see who the document author is: Stephen C. Phillips
uglyXml = xml.dom.minidom.parseString(document.read(name)).toprettyxml(indent='  ')

#raw strings keep the regex escapes (\s, \g) intact
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

In [None]:
#member to read; it must appear in document.namelist()
name = 'word/fontTable.xml'
#This looks like the fonts used in the document style
uglyXml = xml.dom.minidom.parseString(document.read(name)).toprettyxml(indent='  ')

#raw strings keep the regex escapes (\s, \g) intact
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

In [None]:
#ok cool let's get the xml that has the text contained in the document
name = 'word/document.xml'
uglyXml = xml.dom.minidom.parseString(document.read(name)).toprettyxml(indent='  ')

#raw strings keep the regex escapes (\s, \g) intact
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)
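
Since the same three lines are repeated for every member, they could be wrapped in a small helper. This is just a sketch: the names `pretty_xml` and `TEXT_RE` and the sample input are my own and not from the original cells. Recent Python versions already keep short text nodes on one line in `toprettyxml()`, so the regex often changes nothing and acts mainly as a safeguard.

```python
import re
import xml.dom.minidom

# Collapse text nodes that toprettyxml() may spread over three lines
# back onto a single line.
TEXT_RE = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)

def pretty_xml(xml_bytes):
    """Return an indented, readable version of an XML byte string."""
    ugly = xml.dom.minidom.parseString(xml_bytes).toprettyxml(indent='  ')
    return TEXT_RE.sub(r'>\g<1></', ugly)

# Hypothetical input; with a real document this would be document.read(name).
sample = b'<root><item>some text</item></root>'
pretty = pretty_xml(sample)
print(pretty)
```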

Per the scope of this contract, we'll need to find all hyperlinks stored in this document and add them to a list. Let's take a look at what hyperlinks look like in the xml:
```
      <w:hyperlink r:id="rId27">
        <w:r w:rsidRPr="0005387D">
          <w:rPr>
            <w:rFonts w:ascii="Arial" w:cs="Arial" w:eastAsia="Arial" w:hAnsi="Arial"/>
            <w:color w:val="0000FF"/>
            <w:sz w:val="20"/>
            <w:szCs w:val="20"/>
            <w:u w:val="single"/>
          </w:rPr>
          <w:t>https://cyber.harvard.edu/getinvolved/jobs/communicationsmanager</w:t>
        </w:r>
      </w:hyperlink>
```

A simple pattern I notice is that they all start with characters `>http` and end with characters `</`<br> Now we can convert our xml to a string and use regex to collect all text between those characters.

To build the regex for collecting all text between those characters, I relied on the following StackOverflow question, whose original ask already contains the pattern I need: <a href = "https://stackoverflow.com/questions/1454913/regular-expression-to-find-a-string-included-between-two-characters-while-exclud">Regular Expression to find a string included between two characters while EXCLUDING the delimiters.</a>

While I do want to keep the `http`, I do not want to keep the `<` or `>`. I will make these modifications to my list items using a list comprehension.
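
As an alternative to trimming the delimiters afterwards, a capturing group can leave them out of the result directly. The sketch below uses a hypothetical `sample_xml` fragment shaped like the hyperlink run shown earlier. It illustrates the idea rather than reproducing the exact pattern used in the next cells.

```python
import re

# Hypothetical fragment shaped like the hyperlink runs shown above.
sample_xml = (
    '<w:hyperlink r:id="rId27"><w:r>'
    '<w:t>https://cyber.harvard.edu/getinvolved/jobs/communicationsmanager</w:t>'
    '</w:r></w:hyperlink>'
    '<w:p><w:r><w:t>plain text, not a link</w:t></w:r></w:p>'
)

# '>' and '<' anchor the match but sit outside the parentheses,
# so only the URL itself is captured.
links = re.findall(r'>(http[^<]*)<', sample_xml)
print(links)
```

With `findall`, a pattern containing one group returns just the group, so no list comprehension is needed to strip the `<`.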

In [None]:
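#first to turn the xml content into a string:
xml_content = document.read('word/document.xml')
document.close()
#read() returns bytes; decode them (Word normally writes UTF-8) rather than
#calling str(), which would keep the b'...' wrapper and escape sequences
xml_str = xml_content.decode('utf-8')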

In [None]:
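#collect text from each 'http' up to the next '<'; the first match is skipped
#because it typically begins in the namespace declarations at the top of the xml
link_list = re.findall(r'http.*?<', xml_str)[1:]
link_list = [x[:-1] for x in link_list]  #drop the trailing '<'
#the same approach finds any keyword, e.g. every "shall" in the document
shall_list = re.findall('shall', xml_str)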

In [None]:
shall_list

In [None]:
len(shall_list)

Great! We've now collected all the URLs from this Word document! Check out <a href= "https://github.com/nmolivo/tesu_scraper/blob/master/01_scraper.ipynb">the Notebook</a> uploaded to <a href = "https://github.com/nmolivo/tesu_scraper">the TESU Scraper Repo</a>, where we use the technique covered in this repository to collect a link list for each Word document contained in a folder stored on an AWS S3 bucket.
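
One caveat about the `>http` pattern: it only finds links whose visible text is itself a URL. In the .docx format, the `r:id` on a `<w:hyperlink>` element points to a relationship entry, normally stored in `word/_rels/document.xml.rels`, whose `Target` attribute holds the actual address. The sketch below parses a hand-written stand-in for that file. The ids and targets are hypothetical and `hyperlink_targets` is my own helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for word/_rels/document.xml.rels (hypothetical ids and targets).
rels_xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
  <Relationship Id="rId27" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://cyber.harvard.edu/getinvolved/jobs/communicationsmanager" TargetMode="External"/>
</Relationships>'''

REL_NS = '{http://schemas.openxmlformats.org/package/2006/relationships}'

def hyperlink_targets(rels_bytes):
    """Map relationship ids to URLs, keeping hyperlink relationships only."""
    root = ET.fromstring(rels_bytes)
    return {
        rel.get('Id'): rel.get('Target')
        for rel in root.iter(REL_NS + 'Relationship')
        if rel.get('Type', '').endswith('/hyperlink')
    }

targets = hyperlink_targets(rels_xml)
print(targets)
```

With a real file, the bytes would come from `document.read('word/_rels/document.xml.rels')`, read before `document.close()` is called.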

Created by Natalie Olivo
<a href = "https://www.linkedin.com/in/natalie-olivo-82548951/">LinkedIn</a> | <a href = "https://github.com/nmolivo">GitHub</a> | <a href = "https://medium.com/@NatalieOlivo">Blog</a>

In [43]:
import sys
#the package is named docx2txt; the original request for 'doc2txt' could not be resolved
!mamba install --yes --prefix {sys.prefix} -c conda-forge docx2txt

In [42]:
import docx2txt
#docx2txt.process expects the path to a .docx file, not the raw xml bytes
test_doc = docx2txt.process('sow_example.docx')
#find phone numbers formatted like 555-123-4567
docu_Regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = docu_Regex.findall(test_doc)
print(mo)
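
If docx2txt is not available in the environment, the visible text can also be approximated with the standard library alone by joining the `<w:t>` elements of `word/document.xml`. This is a rough sketch, not a replacement for docx2txt: `docx_text` is a hypothetical helper, the in-memory document is made up so the example runs without a file on disk, and nested structures such as text boxes are not handled specially.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_text(path_or_file):
    """Join the text of every <w:t> element, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as docx:
        root = ET.fromstring(docx.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        paragraphs.append(''.join(t.text or '' for t in para.iter(W_NS + 't')))
    return '\n'.join(paragraphs)

# Hypothetical in-memory document so the sketch runs without a file on disk.
body = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>Call 555-123-4567 </w:t></w:r><w:r><w:t>today.</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>No number here.</w:t></w:r></w:p></w:body></w:document>'
)
fake = io.BytesIO()
with zipfile.ZipFile(fake, 'w') as archive:
    archive.writestr('word/document.xml', body)

text = docx_text(fake)
# Same phone-number pattern as above, written with repetition counts.
phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')
print(phone_re.findall(text))
```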