# Parsing Microsoft Word Documents with Python
This notebook shows how to extract text from a Microsoft Word (.docx) document using only Python's standard library.

In [2]:
import zipfile
import xml.etree.ElementTree as ET

A .docx file is a ZIP archive, and the main document body is stored in the member word/document.xml. Read that member and parse the XML into an ElementTree element:

In [3]:
# nice reference: https://towardsdatascience.com/how-to-extract-data-from-ms-word-documents-using-python-ed3fbb48c122

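# open the archive in a with-block so the file handle is closed once the XML has been read
with zipfile.ZipFile('./data/test.docx') as docx:
    doc = docx.read('word/document.xml')  # raw XML bytes of the main document part
root = ET.fromstring(doc)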
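
If you don't have a Word file handy, the same two steps can be tried against a stand-in built in memory. This is only a sketch: the archive below contains nothing but word/document.xml, so Word itself would not open it, and the names W_NS, document_xml, buffer, and sample_root are made up for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical stand-in for a real .docx: a ZIP archive holding only word/document.xml.
# It is not a valid Word file; it exists purely to exercise the read-and-parse steps.
document_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', document_xml)

# Read it back the same way we read the real file above
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()
    sample_root = ET.fromstring(archive.read('word/document.xml'))

print(member_names)      # the members stored in the archive
print(sample_root.tag)   # the root tag, with its namespace URI in braces
```

Calling namelist() on a real .docx is a quick way to see which other parts (styles, headers, footnotes, and so on) the file contains.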

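You can inspect the raw XML by serializing the tree back to bytes. Note that ElementTree replaces the document's original namespace prefixes with generated ones such as ns0: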

In [16]:
ET.tostring(root)

b'<ns0:document xmlns:ns0="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:ns1="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:ns2="http://schemas.microsoft.com/office/word/2010/wordml" ns1:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh wp14"><ns0:body><ns0:sdt><ns0:sdtPr><ns0:id ns0:val="658278343" /><ns0:docPartObj><ns0:docPartGallery ns0:val="Table of Contents" /><ns0:docPartUnique /></ns0:docPartObj></ns0:sdtPr><ns0:sdtEndPr><ns0:rPr><ns0:rFonts ns0:asciiTheme="minorHAnsi" ns0:cstheme="minorBidi" ns0:eastAsiaTheme="minorHAnsi" ns0:hAnsiTheme="minorHAnsi" /><ns0:b /><ns0:bCs /><ns0:noProof /><ns0:color ns0:val="auto" /><ns0:sz ns0:val="22" /><ns0:szCs ns0:val="22" /></ns0:rPr></ns0:sdtEndPr><ns0:sdtContent><ns0:p ns2:paraId="45D59C77" ns2:textId="79FA5DD6" ns0:rsidR="00B608E9" ns0:rsidRDefault="00B608E9"><ns0:pPr><ns0:pStyle ns0:val="TOCHeading" /></ns0:pPr><ns0:r><ns0:t>Contents</ns0:t></ns0:r></ns0:p><ns0:p ns2:paraId="21A637BF" ns2:tex
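
A raw dump like that is hard to read. One way to get oriented, sketched here on a small made-up document rather than on test.docx, is to count how often each element type appears. The helper local_name and the sample XML are assumptions for illustration only.

```python
import collections
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Made-up sample: two paragraphs, each with one run of text
sample = ET.fromstring(
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>First</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def local_name(tag):
    """Strip the '{namespace-uri}' part that ElementTree puts in front of tag names."""
    return tag.rsplit('}', 1)[-1]


# iter() walks the root and every descendant element
tag_counts = collections.Counter(local_name(el.tag) for el in sample.iter())
print(tag_counts.most_common())
```

Run against the real root, the same counter gives a rough picture of the document: how many paragraphs (p), runs (r), and text elements (t) it holds.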

In [18]:
# Word's XML elements are namespaced, so we map the 'w' prefix to its URI and pass that map to find/findall
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
body = root.find('w:body', ns)  # find the XML "body" tag
p_sections = body.findall('w:p', ns)  # under the body tag, find all the paragraph sections
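
To see what the namespace map is doing, here is a small sketch on a made-up document. The prefix form ('w:body' plus the map) and the fully qualified '{uri}body' form find the same element, while a bare 'body' finds nothing. The names ns_map and sample are illustrative only.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
ns_map = {'w': W_NS}

# Made-up sample: a body with two empty paragraphs
sample = ET.fromstring(
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p/><w:p/></w:body></w:document>'
)

via_prefix = sample.find('w:body', ns_map)   # prefix resolved through the map
via_clark = sample.find(f'{{{W_NS}}}body')   # fully qualified '{uri}body' form
no_namespace = sample.find('body')           # no namespace given, so no match

print(via_prefix is via_clark)
print(no_namespace)
paragraph_count = len(via_prefix.findall('w:p', ns_map))
print(paragraph_count)
```

The prefix in the map is just a local alias: any key would work as long as it maps to the same URI and matches the prefix used in the search path.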

If we loop through all the paragraph sections and pull out just the text from each section, we'll see this:

In [24]:
for p in p_sections:
    text_elems = p.findall('.//w:t', ns)
    print(''.join(t.text or '' for t in text_elems))  # t.text is None for an empty text element
    print()



Overview

Some sort of document overview.  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer id tellus facilisis, bibendum tellus quis, luctus elit. Phasellus erat quam, tincidunt sed sem at, fermentum egestas est.





Main Section

Sub-Section 1

Some text in Sub-Section 1.  Per my previous email staff engagement, so reach out, yet low hanging fruit. Tread it daily first-order optimal strategies translating our vision of having a market leading platfrom nor action item. Reach out this proposal is a win-win situation which will cause a stellar paradigm shift, and produce a multi-fold increase in deliverables nor poop, and personal development.



Sub-Section 2

Some text in Sub-Section 2.  Bacon ipsum dolor amet landjaeger capicola boudin frankfurter meatloaf short loin jowl, turducken rump shankle hamburger salami drumstick shoulder pastrami. Pig tri-tip shoulder, t-bone corned beef ham hock picanha burgdoggen tail flank meatball. Biltong ground round rump salami sho

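But suppose we only want the text from the three Sub-Sections.  In the Word document, we know that the headers of these sections are styled as "Heading 2".  In the XML, that style shows up as a w:pStyle element whose w:val attribute is the style ID Heading2 (no space).  Can we then search for just the "Heading 2" paragraphs and get the associated text?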

In [25]:
def is_heading2_section(p):
    """Returns True if the given paragraph section has been styled as a Heading2"""
    # find() returns None when no matching w:pStyle element exists under the paragraph
    return p.find(".//w:pStyle[@w:val='Heading2']", ns) is not None


def get_section_text(p):
    """Returns the joined text of the text elements under the given paragraph tag"""
    # findall() always returns a list (possibly empty), so no None check is needed;
    # t.text is None for an empty text element, hence the "or ''" fallback
    return ''.join(t.text or '' for t in p.findall('.//w:t', ns))


section_labels = [get_section_text(s) if is_heading2_section(s) else '' for s in p_sections]
section_labels

['',
 '',
 '',
 '',
 '',
 '',
 'Sub-Section 1',
 '',
 '',
 'Sub-Section 2',
 '',
 '',
 'Sub-Section 3',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']

So, now that we know which paragraphs are the Heading 2 titles, we can grab the text of the paragraph that immediately follows each one:

In [29]:
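# Pair each Heading 2 title with the paragraph right after it; a heading that is the last paragraph is skipped
section_text = [{'title': t, 'text': get_section_text(p_sections[i+1])} for i, t in enumerate(section_labels) if t and i + 1 < len(p_sections)]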
section_text

[{'title': 'Sub-Section 1',
  'text': 'Some text in Sub-Section 1.  Per my previous email staff engagement, so reach out, yet low hanging fruit. Tread it daily first-order optimal strategies translating our vision of having a market leading platfrom nor action item. Reach out this proposal is a win-win situation which will cause a stellar paradigm shift, and produce a multi-fold increase in deliverables nor poop, and personal development.'},
 {'title': 'Sub-Section 2',
  'text': 'Some text in Sub-Section 2.  Bacon ipsum dolor amet landjaeger capicola boudin frankfurter meatloaf short loin jowl, turducken rump shankle hamburger salami drumstick shoulder pastrami. Pig tri-tip shoulder, t-bone corned beef ham hock picanha burgdoggen tail flank meatball. Biltong ground round rump salami short ribs bacon pastrami brisket chicken chuck.'},
 {'title': 'Sub-Section 3',
  'text': 'Some text in Sub-Section 3.  Leverage agile frameworks to provide a robust synopsis for high level overviews. Itera
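
The approach above only picks up the single paragraph after each heading. If a sub-section runs to several paragraphs, one possible extension is to keep collecting until the next heading appears. The sketch below is run on a made-up document and rests on an assumption: that every heading style ID starts with "Heading" (as Heading2 does here), which may not hold for custom or localized styles. The helpers paragraph_style, paragraph_text, collect_sections, and make_paragraph are hypothetical names, not part of any library.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}


def paragraph_style(p):
    """Return the paragraph's style ID (e.g. 'Heading2'), or None if it has none."""
    style = p.find('w:pPr/w:pStyle', NS)
    return style.get(f'{{{W_NS}}}val') if style is not None else None


def paragraph_text(p):
    """Join the text of every text element under the paragraph."""
    return ''.join(t.text or '' for t in p.findall('.//w:t', NS))


def collect_sections(paragraphs, heading_style='Heading2'):
    """Group the non-empty paragraphs that follow each heading_style paragraph."""
    sections = []
    current = None
    for p in paragraphs:
        style = paragraph_style(p)
        if style == heading_style:
            current = {'title': paragraph_text(p), 'text': []}
            sections.append(current)
        elif style is not None and style.startswith('Heading'):
            current = None  # assumption: any other 'Heading*' style closes the open section
        elif current is not None:
            text = paragraph_text(p)
            if text:
                current['text'].append(text)
    return sections


def make_paragraph(text, style=None):
    """Build the XML for one made-up paragraph (illustration only)."""
    ppr = f'<w:pPr><w:pStyle w:val="{style}"/></w:pPr>' if style else ''
    run = f'<w:r><w:t>{text}</w:t></w:r>' if text else ''
    return f'<w:p>{ppr}{run}</w:p>'


body_xml = ''.join([
    make_paragraph('Main', 'Heading1'),
    make_paragraph('A', 'Heading2'),
    make_paragraph('a1'),
    make_paragraph('a2'),
    make_paragraph(''),
    make_paragraph('B', 'Heading2'),
])
sample = ET.fromstring(
    f'<w:document xmlns:w="{W_NS}"><w:body>{body_xml}</w:body></w:document>'
)
grouped = collect_sections(sample.find('w:body', NS).findall('w:p', NS))
print(grouped)
```

Keeping 'text' as a list of paragraphs leaves the choice of separator to the caller; ' '.join(section['text']) gives a single string per section.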