# Pull parser with string output

## Examine the input XML

In [1]:
with open('flattened.xml') as infile:  # avoid shadowing the built-in input()
    print(infile.read())

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse">
    <p th:sID="d1e3"/>This is a <word th:sID="d1e5"/>paragraph<word th:eID="d1e5"/> that contains
    some <nonTrojan type="test"/> stuff.<p th:eID="d1e3"/>
    <p th:sID="d1e9"/>This is <emphasis role="bold">another</emphasis> paragraph <phrase
        th:sID="d1e11"/><word th:sID="d1e12"/>that<word th:eID="d1e12"/>
    <word th:sID="d1e15"/>contains<word th:eID="d1e15"/>
    <word th:sID="d1e18"/>more<word th:eID="d1e18"/><phrase th:eID="d1e11"/> stuff.<p th:eID="d1e9"
    />
</root>



## Transform it

In [18]:
from xml.dom.pulldom import CHARACTERS, START_ELEMENT, parseString, END_ELEMENT
from xml.dom.minidom import Document


class Stack(list):
    def push(self, item):
        self.append(item)

    def peek(self):
        return self[-1]


open_elements = Stack()
d = Document()
open_elements.push(d)

with open('flattened.xml') as infile:  # avoid shadowing the built-in input()
    for event, node in parseString(infile.read()):
        if event == START_ELEMENT and not node.hasAttribute('th:eID'): # process pseudo-end-tags on END_ELEMENT event
            # Can’t remove attributes from the original, so work with a clone
            clone = node.cloneNode(deep=False)
            if clone.hasAttribute('th:sID'):
                clone.removeAttribute('th:sID')
            if clone.hasAttribute('xmlns:th'):
                clone.removeAttribute('xmlns:th')
            open_elements.peek().appendChild(clone)
            open_elements.push(clone)
        elif event == END_ELEMENT and not node.hasAttribute('th:sID'): # process pseudo-start-tags on START_ELEMENT event
            open_elements.pop()
        elif event == CHARACTERS:
            t = d.createTextNode(node.data)
            open_elements.peek().appendChild(t)
        else:
            continue

result = open_elements[0].toxml()
print(result)

<?xml version="1.0" ?><root>
    <p>This is a <word>paragraph</word> that contains
    some <nonTrojan type="test"/> stuff.</p>
    <p>This is <emphasis role="bold">another</emphasis> paragraph <phrase><word>that</word>
    <word>contains</word>
    <word>more</word></phrase> stuff.</p>
</root>


## What happens if the document cannot be raised entirely?

Here’s an alternative flattened document. It is well formed in its flattened version, but it cannot be raised without creating overlap:

In [3]:
with open('overlap.xml') as infile:  # avoid shadowing the built-in input()
    print(infile.read())

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse">
    <page th:sID="page1"/>
    <para th:sID="para1"/>Content on page 1 in paragraph 1 
    <page th:eID="page1"/>
    <page th:sID="page2"/>Content on page 2 in para 1 
    <para th:eID="para1"/>
    <para th:sID="para2"/>Content on page 2 in para 2
    <para th:eID="para2"/>
    <page th:eID="page2"/>
</root>



Let’s try raising it. There are four possible results:

1. It errors out.
1. It writes ill-formed output. This shouldn’t be possible, since the output is constructed with XML-aware methods.
1. It raises only as much as it can and leaves the rest flattened. This is the desired output, and it’s what happens with inside-out recursion.
1. It writes well-formed output that raises all elements. Such output cannot represent the structure encoded in the flattened document, because that structure overlaps, and overlap cannot be represented in well-formed XML that uses only container elements and no Trojan milestones.

In [24]:
from xml.dom.pulldom import CHARACTERS, START_ELEMENT, parseString, END_ELEMENT
from xml.dom.minidom import Document


class Stack(list):
    def push(self, item):
        self.append(item)

    def peek(self):
        return self[-1]


open_elements = Stack()
d = Document()
open_elements.push(d)

with open('overlap.xml') as infile:  # avoid shadowing the built-in input()
    for event, node in parseString(infile.read()):
        if event == START_ELEMENT and not node.hasAttribute('th:eID'): # process pseudo-end-tags on END_ELEMENT event
            # Can’t remove attributes from the original, so work with a clone
            clone = node.cloneNode(deep=False)
            if clone.hasAttribute('th:sID'):
                clone.removeAttribute('th:sID')
            if clone.hasAttribute('xmlns:th'):
                clone.removeAttribute('xmlns:th')
            open_elements.peek().appendChild(clone)
            open_elements.push(clone)
        elif event == END_ELEMENT and not node.hasAttribute('th:sID'): # process pseudo-start-tags on START_ELEMENT event
            open_elements.pop()
        elif event == CHARACTERS:
            t = d.createTextNode(node.data)
            open_elements.peek().appendChild(t)
        else:
            continue

result = open_elements[0].toxml()
print(result)

<?xml version="1.0" ?><root>
    <page>
    <para>Content on page 1 in paragraph 1 
    </para>
    <page>Content on page 2 in para 1 
    </page>
    <para>Content on page 2 in para 2
    </para>
    </page>
</root>


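We get result #4: well-formed XML that includes all elements from the original, but imposes a (possibly counterintuitive or undesired) hierarchy. The reason is that each pseudo-end-tag simply pops whatever element is currently on top of the stack, without checking that its `th:eID` matches the `th:sID` of that element. Here the end marker for `page1` closes the open `<para>`, and the second `<page>` ends up nested inside the first.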
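One way to notice this situation before raising is to compare IDs while reading the markers. The following is a minimal sketch, not part of the original notebook: the function name `find_overlaps` and the inline `SAMPLE` string (a shortened copy of `overlap.xml`) are illustrative assumptions. It keeps a stack of open `th:sID` values and reports every end marker whose `th:eID` differs from the innermost open ID.

```python
from xml.dom.pulldom import START_ELEMENT, parseString

# Hypothetical sample: a shortened version of overlap.xml, inlined for the demo
SAMPLE = """<root xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse">
    <page th:sID="page1"/>
    <para th:sID="para1"/>Content on page 1 in paragraph 1
    <page th:eID="page1"/>
    <page th:sID="page2"/>Content on page 2 in para 1
    <para th:eID="para1"/>
    <para th:sID="para2"/>Content on page 2 in para 2
    <para th:eID="para2"/>
    <page th:eID="page2"/>
</root>"""


def find_overlaps(xml_text):
    """Return (innermost_open_id, end_id) pairs for mismatched end markers."""
    open_ids = []    # th:sID values of markers that are still open
    mismatches = []
    for event, node in parseString(xml_text):
        if event != START_ELEMENT:
            continue  # each marker is an empty element; its start event suffices
        if node.hasAttribute('th:sID'):
            open_ids.append(node.getAttribute('th:sID'))
        elif node.hasAttribute('th:eID'):
            end_id = node.getAttribute('th:eID')
            innermost = open_ids[-1] if open_ids else None
            if end_id != innermost:
                mismatches.append((innermost, end_id))
            if end_id in open_ids:
                open_ids.remove(end_id)  # close it wherever it sits in the stack
    return mismatches


for innermost, end_id in find_overlaps(SAMPLE):
    print(f'end marker {end_id!r} arrives while {innermost!r} is innermost')
```

An empty result would suggest that the markers nest properly and the stack-based raise reflects the intended hierarchy. A non-empty result signals overlap, so the raised output should not be trusted as is.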