This notebook goes into detail on extracting information from Microsoft Word documents locally. Because many companies and roles depend on the Microsoft Office suite, it is a useful reference for anyone who receives data in .doc or .docx format. The zipfile approach shown here applies to .docx files, which are ZIP archives of XML files.

This is a sister blog to my entry about Thomas Edison State University's (TESU) open source materials accessibility initiative. <a href = "https://medium.com/@NatalieOlivo/preserving-web-content-of-links-provided-in-a-word-doc-using-aws-services-ec2-and-s3-2c4f0cee0a26">Medium post</a> / <a href = "https://github.com/nmolivo/tesu_scraper">GitHub repository</a><br><br>
The initial post I wrote for this project was specific and bordered on unwieldy, so this post is the first in a series that takes deep dives into separate aspects of the TESU project and makes the material more accessible.

In [1]:
#specific to extracting information from word documents
import os
import zipfile

#useful tool for extracting information from XML
import re

#to pretty print our xml:
import xml.dom.minidom

A sample Word document can be found in this GitHub repository. Let's find it using the packages we've imported above.

In [2]:
#to check files in the current directory, use a single period
os.listdir('.')

['[Content_Types].xml',
 'docProps',
 '01_extract_from_MSWord.ipynb',
 'word',
 'my_word_file.docx',
 '_rels',
 '.ipynb_checkpoints']

In [3]:
#to check files in the directory above the current directory, use double periods
os.listdir('..')

['OPM_descarga FB con formato 20 oct (2)_02.rtf',
 'OPM_descarga FB con formato 20 oct (2)_01.txt',
 'OPM_descarga FB con formato 20 oct (2).txt',
 'OPM_descarga FB con formato 20 oct (2).docx',
 'word',
 'out',
 'my_word_file-out.docx',
 'face.py',
 'read.py',
 'xmlre.py',
 'my_word_file.docx',
 'docxfrompy.py',
 'ext.py',
 'xml',
 'TABLA.docx',
 'OPM_descarga redes_oct 2020.xlsx',
 'OPM_descarga FB con formato 20 oct (2)_03.txt']

In [7]:
#the sample word document (my_word_file.docx) sits in the parent directory; '../' points to the same place as '..'
os.listdir('../')

['OPM_descarga FB con formato 20 oct (2)_02.rtf',
 'OPM_descarga FB con formato 20 oct (2)_01.txt',
 'OPM_descarga FB con formato 20 oct (2).txt',
 'OPM_descarga FB con formato 20 oct (2).docx',
 'word',
 'out',
 'my_word_file-out.docx',
 'face.py',
 'read.py',
 'xmlre.py',
 'my_word_file.docx',
 'docxfrompy.py',
 'ext.py',
 'xml',
 'TABLA.docx',
 'OPM_descarga redes_oct 2020.xlsx',
 'OPM_descarga FB con formato 20 oct (2)_03.txt']
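
When a folder holds many kinds of files, as the listing above does, it can help to narrow it down to Word documents only. The following is a minimal sketch; the helper name `list_docx` is my own and not part of the original notebook, and it assumes that files starting with `~$` are Word's temporary lock files and should be skipped.

```python
import os

def list_docx(folder):
    """Return the sorted .docx file names found directly inside `folder`."""
    return sorted(
        name for name in os.listdir(folder)
        # case-insensitive match on the extension; skip "~$" lock files
        if name.lower().endswith('.docx') and not name.startswith('~$')
    )

# List the Word documents in the current directory.
print(list_docx('.'))
```

Calling `list_docx('..')` would run the same filter on the parent directory.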

Great! We can find where the Word document is!<br><br>
We will now use the <a href = "https://docs.python.org/3/library/zipfile.html">zipfile</a> library to help us read our document. The constructor's defaults are listed below, and they all suit our purpose of reading the Word document.<br><br>
```python
class zipfile.ZipFile(file, mode='r', compression=ZIP_STORED, allowZip64=True, compresslevel=None)
```
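
`ZipFile` can also be used as a context manager, which closes the archive automatically once reading is done. The sketch below is only an illustration: it builds a tiny stand-in archive in memory (not a real Word file) so that it runs without any document on disk.

```python
import io
import zipfile

# Build a small stand-in archive in memory; a real .docx has many more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('word/document.xml', '<w:document/>')

buffer.seek(0)
# mode='r' is the default; the with-block closes the archive on exit.
with zipfile.ZipFile(buffer) as document:
    xml_bytes = document.read('word/document.xml')

print(xml_bytes)
```

With a file on disk, the same pattern would be `with zipfile.ZipFile('../my_word_file.docx') as document:`.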

In [8]:
document = zipfile.ZipFile('../my_word_file.docx')

In [9]:
document

<zipfile.ZipFile filename='../my_word_file.docx' mode='r'>

OK, now to get at the XML using
```python
ZipFile.read(name, pwd=None)
```

As you can see, we need a name. This is not the filename of the .docx itself; it is the name of a member inside the archive, and it tells ZipFile which part of the document to read. For our purposes, we'll want to see the XML that makes up the file.
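
To make the difference between the file name and the member names concrete, here is a small sketch. The helper `xml_parts` is hypothetical, and the archive is an in-memory stand-in with made-up contents rather than the sample document.

```python
import io
import zipfile

def xml_parts(archive):
    """Return the member names of an open ZipFile that look like XML parts."""
    return [n for n in archive.namelist() if n.endswith(('.xml', '.rels'))]

# Stand-in archive: two XML parts and one image, mimicking a .docx layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('[Content_Types].xml', '<Types/>')
    archive.writestr('word/document.xml', '<w:document/>')
    archive.writestr('word/media/image1.jpeg', b'\xff\xd8')

buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    parts = xml_parts(archive)

print(parts)
```

Any name returned this way can be passed to `ZipFile.read`, whereas the .docx file name itself cannot.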

In [10]:
document.namelist()

['[Content_Types].xml',
 '_rels/.rels',
 'word/_rels/document.xml.rels',
 'word/document.xml',
 'word/media/image174.jpeg',
 'word/media/image117.jpeg',
 'word/media/image116.jpeg',
 'word/media/image115.jpeg',
 'word/media/image114.jpeg',
 'word/media/image113.jpeg',
 'word/media/image112.jpeg',
 'word/media/image111.jpeg',
 'word/media/image110.jpeg',
 'word/media/image118.jpeg',
 'word/media/image119.jpeg',
 'word/media/image120.jpeg',
 'word/media/image128.jpeg',
 'word/media/image127.jpeg',
 'word/media/image126.jpeg',
 'word/media/image125.jpeg',
 'word/media/image124.jpeg',
 'word/media/image123.jpeg',
 'word/media/image122.jpeg',
 'word/media/image121.jpeg',
 'word/media/image109.jpeg',
 'word/media/image108.jpeg',
 'word/media/image107.jpeg',
 'word/media/image95.jpeg',
 'word/media/image94.jpeg',
 'word/media/image93.jpeg',
 'word/media/image92.jpeg',
 'word/media/image91.jpeg',
 'word/media/image90.jpeg',
 'word/media/image89.jpeg',
 'word/media/image88.jpeg',
 'word/media/i

Just to see what kind of information is stored in our document, let's inspect a few of its parts.
I'm going to pretty-print the XML so it's a little more legible. This approach is courtesy of an answer from user <a href = "https://stackoverflow.com/users/47775/nick-bolton">Nick Bolton</a> on the StackOverflow question <a href = "https://stackoverflow.com/questions/749796/pretty-printing-xml-in-python">Pretty Printing XML in Python</a>.

In [13]:
#name = 'word/document.xml'
#this is the main document part; it is large, so printing all of it can exceed the notebook's output limit
uglyXml = xml.dom.minidom.parseString(document.read('word/document.xml')).toprettyxml(indent='  ')

#raw strings keep \s and \g from being read as (invalid) string escapes
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
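
One way to stay under the output limit is to print only the first few lines of the pretty-printed XML. This is a rough sketch: `pretty_xml_head` is a hypothetical helper of mine, and the sample bytes are made up for illustration.

```python
import xml.dom.minidom

def pretty_xml_head(xml_bytes, max_lines=40):
    """Pretty-print XML and return only the first `max_lines` lines."""
    pretty = xml.dom.minidom.parseString(xml_bytes).toprettyxml(indent='  ')
    return '\n'.join(pretty.splitlines()[:max_lines])

# Made-up sample; in the notebook this would be document.read('word/document.xml').
sample = b'<root><a>1</a><b>2</b></root>'
print(pretty_xml_head(sample, max_lines=3))
```

The whole document is still parsed, so this limits only what is sent to the notebook, not the work done.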



In [14]:
#name = 'word/fontTable.xml'
#This looks like the fonts used in the document style
uglyXml = xml.dom.minidom.parseString(document.read('word/fontTable.xml')).toprettyxml(indent='  ')

text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

<?xml version="1.0" ?>
<w:fonts xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" mc:Ignorable="w14">
  <w:font w:name="Symbol">
    <w:panose1 w:val="05050102010706020507"/>
    <w:charset w:val="02"/>
    <w:family w:val="roman"/>
    <w:pitch w:val="variable"/>
    <w:sig w:usb0="00000000" w:usb1="10000000" w:usb2="00000000" w:usb3="00000000" w:csb0="80000000" w:csb1="00000000"/>
  </w:font>
  <w:font w:name="Times New Roman">
    <w:panose1 w:val="02020603050405020304"/>
    <w:charset w:val="00"/>
    <w:family w:val="roman"/>
    <w:pitch w:val="variable"/>
    <w:sig w:usb0="E0002AFF" w:usb1="C0007841" w:usb2="00000009" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/>
  </w:font>
  <w:font w:name="Courier New">
    <w:panose1 w:val="02070309

In [15]:
#ok cool let's get the xml that has the text contained in the document
#name = 'word/document.xml'
uglyXml = xml.dom.minidom.parseString(document.read('word/document.xml')).toprettyxml(indent='  ')

text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Per the scope of this contract, we'll need to find all hyperlinks stored in this document and add them to a list. Let's take a look at what hyperlinks look like in the xml:
```
      <w:hyperlink r:id="rId27">
        <w:r w:rsidRPr="0005387D">
          <w:rPr>
            <w:rFonts w:ascii="Arial" w:cs="Arial" w:eastAsia="Arial" w:hAnsi="Arial"/>
            <w:color w:val="0000FF"/>
            <w:sz w:val="20"/>
            <w:szCs w:val="20"/>
            <w:u w:val="single"/>
          </w:rPr>
          <w:t>https://cyber.harvard.edu/getinvolved/jobs/communicationsmanager</w:t>
        </w:r>
      </w:hyperlink>
```

A simple pattern I notice is that the link text always starts with the characters `>http` and ends with the characters `</`.<br> Now we can convert our XML to a string and use regex to collect all text between those characters.

To build the regex that collects all text between those characters, I used the following StackOverflow question, whose initial ask describes exactly what I am looking for: <a href = "https://stackoverflow.com/questions/1454913/regular-expression-to-find-a-string-included-between-two-characters-while-exclud">Regular Expression to find a string included between two characters while EXCLUDING the delimiters.</a>

While I do want to keep the `http`, I do not want to keep the `<` or `>`. I will make these modifications to my list items using list comprehension.
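
As a minimal sketch of that idea, the pattern below uses a lookbehind and a lookahead so that the delimiters are checked but not captured, which is one alternative to trimming them afterwards. The sample string is a shortened version of the hyperlink XML shown above, and the names `sample_xml` and `url_re` are my own.

```python
import re

# Shortened version of the hyperlink element shown earlier.
sample_xml = (
    '<w:hyperlink r:id="rId27"><w:r><w:t>'
    'https://cyber.harvard.edu/getinvolved/jobs/communicationsmanager'
    '</w:t></w:r></w:hyperlink>'
)

# (?<=>) requires a ">" just before "http" without capturing it;
# [^<]* keeps the match inside one text node; (?=</) stops before "</".
url_re = re.compile(r'(?<=>)http[^<]*(?=</)')
links = url_re.findall(sample_xml)
print(links)
```

Because the match must be preceded by `>`, namespace URLs that sit inside attribute values such as `xmlns:w="http://..."` are not picked up.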

In [45]:
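#first, turn the xml content into a string.
#reading from an archive that was already closed raises
#"ValueError: Attempt to use ZIP archive that was already closed",
#so reopen it here; the with-block closes it again afterwards
with zipfile.ZipFile('../my_word_file.docx') as document:
    xml_content = document.read('word/document.xml')
#decode the bytes (document.xml is normally UTF-8); str() would keep the b'...' wrapper
xml_str = xml_content.decode('utf-8')
#print only the beginning to stay under the notebook's output limit
print(xml_str[:1000])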

In [53]:
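#capture everything between "<w:hyperlink" and "w:hyperlink>"; [1:] skips the first match
link_list = re.findall(r'(?<=<w:hyperlink)(.*?)(?=w:hyperlink>)', xml_str)[1:]
#each match ends in "</"; drop the trailing "/"
link_list = [x[:-1] for x in link_list]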

In [54]:
link_list

[' r:id="rId7" w:history="1"><w:r w:rsidR="006A38C6" w:rsidRPr="006A38C6"><w:rPr><w:rFonts w:ascii="inherit" w:eastAsia="Times New Roman" w:hAnsi="inherit" w:cs="Times New Roman"/><w:color w:val="365899"/><w:sz w:val="21"/><w:szCs w:val="21"/><w:lang w:eastAsia="es-ES"/></w:rPr><w:t>#</w:t></w:r><w:r w:rsidR="006A38C6" w:rsidRPr="006A38C6"><w:rPr><w:rFonts w:ascii="inherit" w:eastAsia="Times New Roman" w:hAnsi="inherit" w:cs="Times New Roman"/><w:color w:val="385898"/><w:sz w:val="21"/><w:szCs w:val="21"/><w:lang w:eastAsia="es-ES"/></w:rPr><w:t>multipliquemosesperanza</w:t></w:r><',
 ' r:id="rId8" w:history="1"><w:r w:rsidR="006A38C6" w:rsidRPr="006A38C6"><w:rPr><w:rFonts w:ascii="inherit" w:eastAsia="Times New Roman" w:hAnsi="inherit" w:cs="Times New Roman"/><w:color w:val="365899"/><w:sz w:val="21"/><w:szCs w:val="21"/><w:lang w:eastAsia="es-ES"/></w:rPr><w:t>#</w:t></w:r><w:r w:rsidR="006A38C6" w:rsidRPr="006A38C6"><w:rPr><w:rFonts w:ascii="inherit" w:eastAsia="Times New Roman" w:h

In [44]:
len(link_list)

0
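
The hyperlink fragments above carry an `r:id` such as `rId7` rather than the address itself. In the Office Open XML format, the target of each relationship id is stored in a relationships part, which for this document would be the `word/_rels/document.xml.rels` member listed earlier. The sketch below shows one way to build an id-to-URL lookup. It is not the method used in the original project, `hyperlink_targets` is a hypothetical helper, and the sample relationships XML (including the example.com target) is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the relationships part and the relationship type used for hyperlinks.
REL_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'
HYPERLINK_TYPE = (
    'http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink'
)

def hyperlink_targets(rels_xml):
    """Map relationship ids (e.g. 'rId7') to their hyperlink targets."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get('Id'): rel.get('Target')
        for rel in root.iter('{%s}Relationship' % REL_NS)
        if rel.get('Type') == HYPERLINK_TYPE
    }

# Made-up relationships part with one hyperlink and one non-hyperlink entry.
sample_rels = (
    '<Relationships xmlns="' + REL_NS + '">'
    '<Relationship Id="rId7" Type="' + HYPERLINK_TYPE + '" '
    'Target="https://example.com/page" TargetMode="External"/>'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" '
    'Target="styles.xml"/>'
    '</Relationships>'
)

targets = hyperlink_targets(sample_rels)
print(targets)
```

With the archive open, `hyperlink_targets(document.read('word/_rels/document.xml.rels'))` should give the lookup for the sample document, since `ET.fromstring` accepts the bytes returned by `ZipFile.read`.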

Great! We've now collected all the hyperlinks from this Word document! Check out <a href= "https://github.com/nmolivo/tesu_scraper/blob/master/01_scraper.ipynb">the Notebook</a> uploaded to <a href = "https://github.com/nmolivo/tesu_scraper">the TESU Scraper Repo</a>, where we use the technique covered in this notebook to collect a link list for each Word document contained in a folder stored in an AWS S3 bucket.

Created by Natalie Olivo
<a href = "https://www.linkedin.com/in/natalie-olivo-82548951/">LinkedIn</a> | <a href = "https://github.com/nmolivo">GitHub</a> | <a href = "https://medium.com/@NatalieOlivo">Blog</a>