# ongoing implementation of ExportXML importer (discoursegraphs branch: exportxml)

* initially, I was working on Tueba-D/Z 5.0, but now I have version 8.0 available
* the whole corpus is available as a single XML file, which would result in a graph  
  that is way too large for networkx (ca. 1.7 million edges)
  
## Tueba-D/Z 5.0

* I assumed that the corpus could not be split into documents and therefore wrote  
  a parser using igraph instead of networkx
* it turns out that each sentence has an ``origin`` attribute, e.g. ``T990507.2``,  
  which translates into (collection ID: T990507, document ID: 2)
* all documents within a collection (NB: I will use those terms, Tueba doesn't)  
  have consecutively numbered token node IDs, i.e. if document 1 contains sentences  
  1 to 12, document 2 might contain sentences 13 to 43
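
The origin convention described above can be sketched as a small helper. The name `split_origin` is hypothetical (it is not part of discoursegraphs), and the sketch assumes that every origin has the form `<collection ID>.<document ID>`:

```python
def split_origin(origin):
    """Split an origin string like 'T990507.2' into (collection ID, document ID).

    Hypothetical helper; assumes that the document ID follows the last dot.
    """
    collection_id, _sep, document_id = origin.rpartition('.')
    if not collection_id:
        raise ValueError("origin without a '.': {!r}".format(origin))
    return collection_id, document_id


print(split_origin('T990507.2'))
```

Both parts are returned as strings, so the document ID would need an `int()` call if it is to be sorted numerically.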
  
## Tueba-D/Z 8.0

* ``tuebadz-8.0-mit-NE+Anaphern+Diskurs.exml.xml`` contains bad XML

```
XMLSyntaxError: ID text_145 already defined, line 4663422, column 20
```

* two text IDs occur twice: ``text_3160`` and ``text_145``
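
One way to find such repeated IDs might be to count the `xml:id` values while streaming through the file. The following is only a sketch: it uses the standard library's ElementTree as a stand-in for lxml (assuming it does not reject repeated IDs), the function name `find_duplicate_ids` is hypothetical, and the sample input is made up rather than taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:id attribute under this namespaced key
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'


def find_duplicate_ids(source, tag='text'):
    """Return the xml:id values that occur more than once on <tag> elements."""
    id_counts = Counter()
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            id_counts[elem.get(XML_ID)] += 1
            elem.clear()  # drop the children of elements we are done with
    return sorted(id_ for id_, count in id_counts.items() if count > 1)


# made-up miniature input, not taken from the corpus
SAMPLE = b"""<export>
  <text xml:id="text_1"/>
  <text xml:id="text_2"/>
  <text xml:id="text_1"/>
</export>"""

print(find_duplicate_ids(io.BytesIO(SAMPLE)))
```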

In [2]:
import os
import sys
from lxml import etree, html
import igraph as ig

import discoursegraphs as dg

TUEBADZ5_FILE = os.path.expanduser(
    '~/corpora/tueba/tuebadz-5.0/data/XML/tuebadz-5.0.anaphora.export.xml')

TUEBADZ8_FILE = os.path.expanduser(
    '~/corpora/tueba/TuebaDZ8.0/tuebadz-8.0-mit-NE+Anaphern+Diskurs.exml.xml')

HTML_PARSER = html.HTMLParser()

Couldn't import dot_parser, loading of dot files will not be possible.


# ExportXML format

In [3]:
from collections import Counter
texts = Counter()

# recover: try hard to parse through broken input
context = etree.iterparse(TUEBADZ8_FILE, events=('end',), recover=True)

In [4]:
def add_ns(key, ns='http://www.w3.org/XML/1998/namespace'):
    """
    adds a namespace prefix to a string, e.g. turns 'foo' into
    '{http://www.w3.org/XML/1998/namespace}foo'
    """
    return '{{{namespace}}}{key}'.format(namespace=ns, key=key)

In [26]:
def fast_iter(context, func):
    """
    memory-saving iterator for lxml's iterparse.
    cf. http://www.ibm.com/developerworks/library/x-hiperfparse/
    
    Parameters
    ----------
    context : lxml.etree.iterparse
        an iterparse iterator over an XML file
    func : function
        a function that will be called for each element of the context
    """
    for event, elem in context:
        func(elem)
        # removes element (and references to it) from memory after processing it
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def conditional_fast_iter(context, condition, func):
    """
    memory-saving iterator for lxml's iterparse, which runs
    a function on every element satisfying the given condition.
    
    Parameters
    ----------
    context : lxml.etree.iterparse
        an iterparse iterator over an XML file
    condition : str
        a string that represents a condition and which must
        return True or False when evaluated with eval(condition)
    func : function
        a function that will be called for each element of the context
    """
    for event, elem in context:
        if eval(condition):
            yield func(elem)
        # removes element (and references to it) from memory after processing it
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
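
As an alternative to passing the condition as a string for `eval()`, the condition could be passed as a callable. The sketch below is not the implementation used above: it relies on the standard library's ElementTree (which has no `getprevious()`/`getparent()`), the names `iter_matching` and `words_of` are hypothetical, and the sample XML is made up.

```python
import io
import xml.etree.ElementTree as ET


def iter_matching(source, predicate, func):
    """Yield func(elem) for each completed element for which predicate(elem) is true."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if predicate(elem):
            yield func(elem)
            # only matched elements are cleared, so their children
            # are still intact at the time func is called
            elem.clear()


# made-up miniature input, not taken from the corpus
SAMPLE = b"""<export>
  <text origin="A.1"><sentence><word form="Hallo"/><word form="Welt"/></sentence></text>
  <text origin="A.2"><sentence><word form="Noch"/><word form="eins"/></sentence></text>
</export>"""


def words_of(text_elem):
    """Collect the word forms below a <text> element."""
    return [word.get('form') for word in text_elem.iter('word')]


for words in iter_matching(io.BytesIO(SAMPLE), lambda e: e.tag == 'text', words_of):
    print(words)
```

Note that this variant leaves the emptied elements attached to the root, so it probably saves less memory than the sibling cleanup in `fast_iter`.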

In [215]:
def get_text(text_element):
    text_str = 'ID: {}\nOrigin: {}\n\n'.format(text_element.attrib[add_ns('id')], text_element.attrib['origin'])
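    # NB: '//sentence' is an absolute XPath, i.e. it matches <sentence> elements
    # anywhere in the tree, not only those below text_element ('.//sentence' would)
    for sentence in text_element.xpath('//sentence'):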
        sent_str = u' '.join(word.attrib['form'] for word in sentence.xpath('.//word'))
        text_str += u'{}\n'.format(sent_str)
    return text_str

context = etree.iterparse(TUEBADZ8_FILE, events=('end',), recover=True)
texts = conditional_fast_iter(context, 'elem.tag == "text"', get_text)
# texts = conditional_fast_iter(context, 'elem.tag == "text"', lambda x: x)

In [216]:
print texts.next()

ID: text_0
Origin: T990507.2


SPD / CDU / AfB für Daewoo-Millionen
Aber Bremerhavens AfB fordert jetzt Untersuchungsausschuß
Die Wirtschaftsförderausschüsse haben gestern ein Finanzierungsvolumen von insgesamt 86 Millionen Mark beschlossen , mit denen das Vulkan-Gelände für die Ansiedlung des Autoimporteurs Egerland frei gemacht werden kann .
Egerland soll die 10.000 bis 20.000 koreanischen Daewoo-Pkw , die derzeit von der Firma Mosolf in



In [194]:
tree = etree.parse('exportxml/text0_exportxml.xml')

In [195]:
len(tree.xpath('//sentence'))

35

In [196]:
len(tree.xpath('//node'))

837

In [197]:
len(tree.xpath('//word'))

675

In [198]:
for sentence in tree.xpath('//sentence'):
    print len(sentence.xpath('.//word')),

5 17 6 26 15 39 40 12 26 24 10 8 22 18 26 21 19 25 19 16 32 14 28 20 19 17 21 16 22 19 13 14 30 14 2


In [183]:
tree.iter?

In [200]:
for text in tree.iter('text'):
    print 'ID: {}\nOrigin: {}\n'.format(text.attrib[add_ns('id')], text.attrib['origin'])
    for sent in text.iter('sentence'):
        print sent.attrib[add_ns('id')], len(list(sent.xpath('.//word')))
        print ' '.join(word.attrib['form'] for word in sent.xpath('.//word'))

ID: text_0
Origin: T990507.2

s1 5
Veruntreute die AWO Spendengeld ?
s2 17
Staatsanwaltschaft muß AWO-Konten prüfen / Flossen 165.000 Mark Sammelgelder für Flutopfer in ein Altenheim in Danzig ?
s3 6
Landesvorsitzende Ute Wedemeier : Ein Buchungsfehler
s4 26
Im Januar hat die Arbeiterwohlfahrt Bremen ihren langjährigen Geschäftsführer Hans Taake fristlos entlassen , nun wird auch der Vorstand der Wohlfahrtsorganisation in den Fall hineingezogen .
s5 15
In einer anonymen Anzeige werden der Bremer Staatsanwaltschaft Details über dubiose finanzielle Transaktionen mitgeteilt .
s6 39
Verantwortlich , so das Schreiben einer Mitarbeiterin der AWO , sei die Landesvorsitzende Uter Wedemeier , die sich jetzt als " Sauberfrau " gebe , " wo doch alle wissen , wie eng sie mit Taake zusammenhing " .
s7 40
Vorwurf Nummer 1 : 165.000 Mark aus der bundesweiten Geldsammlung für die Flutopfer in Südpolen seien über das Konto des Bremer Landesverbandes der AWO an die Caritas in Danzig geflossen , " damit 

## example sentences

In [2]:
from collections import Counter

tree = etree.parse(TUEBADZ8_FILE)
sentence_origins = Counter()

sentences_iter = tree.iterfind('sentence')

XMLSyntaxError: ID text_145 already defined, line 4663422, column 20

In [None]:
s0 = sentences_iter.next()

In [7]:
print etree.tostring(s0)

<sentence origin="T990507.2" comment="%% HEADLINE" date="2008020714:44:17" editor="hschulz">
    <node cat="SIMPX" parent="0" comment="" func="--" id="s_1_n_506">
      <node cat="LK" comment="" func="-" id="s_1_n_503">
        <node cat="VXFIN" comment="" func="HD" id="s_1_n_500">
          <word comment="" form="Veruntreute" func="HD" pos="VVFIN" morph="3sit" id="s_1_n_0"/>
        </node>
      </node>
      <node cat="MF" comment="" func="-" id="s_1_n_505">
        <node cat="NX" comment="" func="ON" id="s_1_n_504">
          <word comment="" form="die" func="-" pos="ART" morph="nsf" id="s_1_n_1"/>
          <node cat="EN-ADD" comment="" func="HD" id="s_1_n_501">
            <word comment="" form="AWO" func="-" pos="NN" morph="nsf" id="s_1_n_2"/>
          </node>
        </node>
        <node cat="NX" comment="" func="OA" id="s_1_n_502">
          <word comment="" form="Spendengeld" func="HD" pos="NN" morph="asn" id="s_1_n_3"/>
        </node>
      </node>
    </node>
    <word p

## ExportXML ``<element>`` counts

In [8]:
from collections import Counter

element_count = Counter()

for element in tree.iter():
    element_count[element.tag] += 1

element_count.most_common()

[('node', 947810),
 ('word', 794079),
 ('relation', 56273),
 ('anaphora', 56273),
 ('sentence', 45200),
 ('secedge', 4647),
 ('originDef', 2213),
 ('morphDef', 437),
 ('posDef', 56),
 ('edgeDef', 50),
 ('nodeDef', 29),
 ('editorDef', 26),
 ('secedgeDef', 7),
 ('comment', 6),
 ('posList', 1),
 ('secedgeList', 1),
 ('editorList', 1),
 ('edgeList', 1),
 ('originList', 1),
 ('format', 1),
 ('morphList', 1),
 ('nodeList', 1),
 ('export', 1)]

### What about `<relation>` and `<anaphora>`?

In [9]:
from collections import Counter, defaultdict

def count_element_positions(tree, element):
    element_positions = defaultdict(Counter)

    for elem_instance in tree.iter(element):
        element_positions['parent'][elem_instance.getparent().tag] += 1
        for child in elem_instance.getchildren():
            element_positions['children'][child.tag] += 1
    return element_positions

In [10]:
count_element_positions(tree, 'relation')

defaultdict(<class 'collections.Counter'>, {'parent': Counter({'anaphora': 56273})})

In [11]:
count_element_positions(tree, 'anaphora')

defaultdict(<class 'collections.Counter'>, {'children': Counter({'relation': 56273}), 'parent': Counter({'node': 49713, 'word': 6560})})

In [12]:
anaphora_iter = tree.iter('anaphora')

for i in range(3):
    print etree.tostring(anaphora_iter.next())

<anaphora>
              <relation type="coreferential" antecedent="s_1_n_504"/>
            </anaphora>
            
<anaphora>
                  <relation type="anaphoric" antecedent="s_4_n_527"/>
                </anaphora>
              
<anaphora>
                <relation type="coreferential" antecedent="s_4_n_527"/>
              </anaphora>
              


In [13]:
anaphora_iter = tree.iter('anaphora')
word_anaphora = [a for a in anaphora_iter if a.getparent().tag == 'word']
for word_ana in word_anaphora[:3]:
    print etree.tostring(word_ana)
        

<anaphora>
                  <relation type="anaphoric" antecedent="s_4_n_527"/>
                </anaphora>
              
<anaphora>
                    <relation type="anaphoric" antecedent="s_19_n_518"/>
                  </anaphora>
                
<anaphora>
                  <relation type="anaphoric" antecedent="s_25_n_505"/>
                </anaphora>
              


## unusual examples

### expletive

```
~/corpora/tueba/tuebadz-5.0/data/XML $ ack-grep -A 5 s_11429_n_506 tuebadz-5.0.anaphora.export.xml
        <node cat="NX" comment="" func="ON" id="s_11429_n_506">
          <anaphora>
            <relation type="expletive" antecedent=""/>
          </anaphora>
          <word comment="" form="Es" func="HD" pos="PPER" morph="nsn3" id="s_11429_n_10"/>
        </node>
```

### split antecedent

```
27434-          <node cat="MF" comment="" func="-" id="s_382_n_511">
27435-            <node cat="NX" comment="" func="ON" id="s_382_n_501">
27436-              <anaphora>
27437:                <relation type="split_antecedent" antecedent="s_381_n_9,s_378_n_510"/>
27438-              </anaphora>
27439-              <word comment="" form="die" func="-" pos="ART" morph="npm" id="s_382_n_1"/>
27440-              <word comment="" form="Partner" func="HD" pos="NN" morph="npm" id="s_382_n_2"/>
27441-            </node>
27442-            <node cat="ADVX" comment="" func="OADVP" id="s_382_n_502">
27443-              <word comment="" form="miteinander" func="HD" pos="ADV" morph="--" id="s_382_n_3"/>
27444-            </node>
27445-          </node>
```

In [14]:
anaphora_iter = tree.iter('anaphora')
for anaphora in anaphora_iter:
    # there's only one <relation> child element
    antecedent = anaphora.getchildren()[0].attrib['antecedent']

## ExportXML ``<element>`` meanings

* `<node>`: a node in a syntax tree
* `<word>`: a token in a sentence / syntax tree
* `<relation>`: a child of an `<anaphora>` element; it's always a leaf node.  
   It has a `type` attrib (relation type) and an `antecedent` attrib (the antecedent's node ID,  
   or a comma-separated list of node IDs in case of a `split_antecedent` relation).  
   NB: if an anaphora has no antecedent (e.g. if it's an `expletive` relation), the `antecedent` attrib  
   is an empty string!
* `<anaphora>`: a child of a `<node>` or `<word>` element; always has one `<relation>` child;  
  the element itself contains no information
* `<sentence>`: a sentence / syntax tree
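
The three shapes of the `antecedent` attribute seen so far (empty, one node ID, comma-separated node IDs) could be normalized into a list. This is a minimal sketch; the function name `split_antecedents` is hypothetical:

```python
def split_antecedents(antecedent_str):
    """Turn the value of an ``antecedent`` attribute into a list of node IDs.

    An empty string (e.g. for an expletive relation) yields an empty list.
    """
    return [node_id.strip()
            for node_id in antecedent_str.split(',')
            if node_id.strip()]


print(split_antecedents(''))                       # expletive: no antecedent
print(split_antecedents('s_4_n_527'))              # single antecedent
print(split_antecedents('s_381_n_9,s_378_n_510'))  # split antecedent
```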

Elements whose meaning is not described yet (with their counts):

```
 ('secedge', 4647),
 ('originDef', 2213),
 ('morphDef', 437),
 ('posDef', 56),
 ('edgeDef', 50),
 ('nodeDef', 29),
 ('editorDef', 26),
 ('secedgeDef', 7),
 ('comment', 6),
 ('posList', 1),
 ('secedgeList', 1),
 ('editorList', 1),
 ('edgeList', 1),
 ('originList', 1),
 ('format', 1),
 ('morphList', 1),
 ('nodeList', 1),
 ('export', 1)
```

In [15]:
def parse_anaphora(anaphora, source_id):
    """
    Parameters
    ----------
    anaphora : etree.Element
        an <anaphora> element
    source_id : str
        the node ID of the anaphora (points either to a <node> or a <word>)
    
    Returns
    -------
    antecedent : str
        node ID of the antecedent, e.g. ``s_4_n_527``
    relation_type : str
        anaphoric relation type, e.g. ``anaphoric`` or ``coreferential``
    """
    # there's only one <relation> child element
    relation = anaphora.getchildren()[0]
    return relation.attrib['antecedent'], relation.attrib['type']
    

def exportxml2igraph(exportxml_file):
    """
    TODO: add <node> and <word> attributes
    """
    # in igraph, adding a single edge is prohibitively slow,
    # as the whole index of the graph has to be rebuilt!
    # http://stackoverflow.com/questions/13974279/igraph-why-is-add-edge-function-so-slow-ompared-to-add-edges
    # to speed this up, store the edges in a list & call add_edges() once!
    edges = []
    relations = {}
    idocgraph = ig.Graph(directed=True)
    
    # parse the file given as an argument instead of a global constant
    treeiter = etree.iterparse(exportxml_file, tag='sentence')
    for _action, sentence in treeiter:
        sent_root_id = sentence.attrib['origin']
        idocgraph.add_vertex(sent_root_id, label=sent_root_id)
        
        for element in sentence.iter('node', 'word', 'anaphora'):
            parent_element = element.getparent()
            # some <anaphora> are children of <word> elements
            if parent_element.tag in ('node', 'word'):
                parent_id = parent_element.attrib['id']
            elif parent_element.tag == 'sentence':
                parent_id = parent_element.attrib['origin']
            else:
                sys.stderr.write("Unexpected parent '{}' of element '{}'\n".format(parent_element.tag, element.tag))
                continue  # no usable parent ID, so skip this element
            element_id = element.attrib.get('id') # <anaphora> doesn't have an ID

            if element.tag == 'node':
                idocgraph.add_vertex(element_id, label=element.attrib['cat'])
                edges.append((parent_id, element_id))
            elif element.tag == 'word':
                idocgraph.add_vertex(element_id, label=element.attrib['form'])
                edges.append((parent_id, element_id))

            else: # element.tag == 'anaphora'
                # <anaphora> doesn't have an ID, but it's tied to its parent element
                antecedent_str, relation_type = parse_anaphora(element, parent_id)
                if antecedent_str:
                    # there might be more than one antecedent
                    for antecedent_id in antecedent_str.split(','):
                        edge = (parent_id, antecedent_id)
                        edges.append(edge)
                        relations[edge] = relation_type
                else:
                    # there's no antecedent in case of an expletive anaphoric relation
                    relations[(parent_id, None)] = relation_type
      
    idocgraph.add_edges(edges)

    # igraph doesn't store node/edge names in a dict, so a lookup would be O(n)
    node_name2id = {node['name']: node.index for node in idocgraph.vs}
    edge_endpoints2id = {(edge.source, edge.target): edge.index
                         for edge in idocgraph.es}

    for (source, target) in relations:
        relation_type = relations[(source, target)]
        if target:
            edge_endpoints = (node_name2id[source], node_name2id[target])
            idocgraph.es[edge_endpoints2id[edge_endpoints]]['exportxml:relation_type'] = relation_type
#             idocgraph.es[idocgraph.get_eid(source, target)]['exportxml:relation_type'] = relation_type

        else:
            # there's no antecedent in case of an expletive anaphoric relation
            
            idocgraph.vs[node_name2id[source]]['exportxml:anaphora_type'] = relation_type
#             idocgraph.vs.select(name=source)['exportxml:anaphora_type'] = relation_type
    return idocgraph
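
The name-to-index lookup used above can be illustrated without igraph. The sketch below assumes that vertices receive consecutive indices in insertion order; the helper name `edges_by_index` is hypothetical, and the node names are taken from the example sentence shown earlier.

```python
def edges_by_index(node_names, named_edges):
    """Translate (source name, target name) pairs into (source index, target index) pairs.

    Building the dict once makes each lookup O(1) instead of scanning all nodes.
    """
    name2index = {name: index for index, name in enumerate(node_names)}
    return [(name2index[source], name2index[target])
            for source, target in named_edges]


node_names = ['T990507.2', 's_1_n_506', 's_1_n_503']
named_edges = [('T990507.2', 's_1_n_506'), ('s_1_n_506', 's_1_n_503')]
print(edges_by_index(node_names, named_edges))
```

If a name occurs more than once in `node_names`, the dict keeps only the last index, which is worth keeping in mind when the sentence root is named after its `origin`.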

            

In [16]:
%%time
tuebagraph = exportxml2igraph(TUEBADZ_FILE)

CPU times: user 17 s, sys: 856 ms, total: 17.9 s
Wall time: 18 s


In [17]:
tuebagraph.ecount()

1794461

In [18]:
tuebagraph.vcount()

1787089

In [19]:
import igraph as ig

foo = ig.Graph(directed=True)
foo.add_vertices(['1','2','3','4'])
foo.add_edges([('1', '1'), ('1', '2'), ('3', '4')])

In [20]:
# for e in foo.es:
#     print e, e.index, e.source, e.target
foo.es[0]

igraph.Edge(<igraph.Graph object at 0x5975af8>, 0, {})

In [21]:
for v in foo.vs:
    print v, v.index, v['name']

igraph.Vertex(<igraph.Graph object at 0x5975af8>,0,{'name': '1'}) 0 1
igraph.Vertex(<igraph.Graph object at 0x5975af8>,1,{'name': '2'}) 1 2
igraph.Vertex(<igraph.Graph object at 0x5975af8>,2,{'name': '3'}) 2 3
igraph.Vertex(<igraph.Graph object at 0x5975af8>,3,{'name': '4'}) 3 4


In [22]:
foo.vs.select(name='2')

<igraph.VertexSeq at 0xab9a83c0>

In [23]:
foo.vs[0]

igraph.Vertex(<igraph.Graph object at 0x5975af8>,0,{'name': '1'})

In [24]:
list(foo.es.select(_source='3', _target='4'))

[]

In [25]:
from collections import defaultdict

def exportxml2dict(exportxml_file):
    nodes = []
    edges = []
    
    itree = etree.iterparse(exportxml_file, tag='sentence')
    for _action, sentence in itree:
        sent_root_id = sentence.attrib['origin']
        nodes.append( (sent_root_id, sent_root_id) )
        
        for element in sentence.iter('node', 'word'):
            parent_element = element.getparent()
            if parent_element.tag == 'node':
                parent_id = parent_element.attrib['id']
            elif parent_element.tag == 'sentence':
                parent_id = parent_element.attrib['origin']
            else:
                sys.stderr.write("Unexpected parent '{}' of element '{}'\n".format(parent_element.tag, element.tag))
                continue  # no usable parent ID, so skip this element
            element_id = element.attrib.get('id') # <anaphora> doesn't have an ID

            if element.tag == 'node':
                element_label = element.attrib['cat']
            elif element.tag == 'word':
                element_label = element.attrib['form']
            else:
                continue # for now, ignore other elements (e.g. <anaphora>)

            nodes.append( (element_id, element_label) )
            edges.append( (parent_id, element_id) )
    return nodes, edges

# Speed comparison

In [26]:
%%time
nodes, edges = exportxml2dict(TUEBADZ_FILE) # 6.63s

CPU times: user 5.99 s, sys: 620 ms, total: 6.61 s
Wall time: 6.63 s


## igraph: preprocessed node/edge lists, edges batch insert (total: 12.81 s)

- create node/edge lists
- add nodes one by one
- add edges in one go

```python
%%time

idocgraph = ig.Graph()
for node_id, node_label in nodes:
    idocgraph.add_vertex(node_id, label=node_label)
idocgraph.add_edges(edges)
```

CPU times: user 6.06 s, sys: 112 ms, total: 6.18 s
Wall time: 6.18 s


## igraph: preprocessed node/edge lists, nodes & edges batch insert (total: 7.85 s)

- create node/edge lists
- add nodes in one go (without labels)
- add edges in one go

```python
%%time

idocgraph = ig.Graph()
node_ids = (node_id for (node_id, node_label) in nodes)
idocgraph.add_vertices(node_ids)
idocgraph.add_edges(edges)
```

CPU times: user 1.16 s, sys: 60 ms, total: 1.22 s
Wall time: 1.22 s

## igraph: preprocessed node/edge lists, nodes/edges inserted iteratively (intractable)

- added each node and edge one by one
- took too long (killed after 10 minutes)

```python
%%time

idocgraph = ig.Graph()
for node_id, node_label in nodes:
    idocgraph.add_vertex(node_id, label=node_label)
for source, target in edges:
    # for each added edge, the whole index of the graph has to be rebuilt!
    # never do this for large graphs!
    # http://stackoverflow.com/questions/13974279/igraph-why-is-add-edge-function-so-slow-ompared-to-add-edges
    idocgraph.add_edge(source, target)
```

## igraph: combined preprocessing/inserting of nodes, edge caching/batch insert (total: 20.9 s)

```python
%%time
tueba_igraph = exportxml2igraph(TUEBADZ_FILE)
```

CPU times: user 20.7 s, sys: 156 ms, total: 20.9 s
Wall time: 20.9 s

In [27]:
def exportxml2docgraph(exportxml_file):
    edges = []
    docgraph = dg.DiscourseDocumentGraph()
    edge_attribs = {'layers': {docgraph.ns}} # default edge attributes
    
    treeiter = etree.iterparse(exportxml_file, tag='sentence')
    for _action, sentence in treeiter:
        sent_root_id = sentence.attrib['origin']
        docgraph.add_node(sent_root_id, label=sent_root_id)
        
        for element in sentence.iter('node', 'word'):
            parent_element = element.getparent()
            if parent_element.tag == 'node':
                parent_id = parent_element.attrib['id']
            elif parent_element.tag == 'sentence':
                parent_id = parent_element.attrib['origin']
            else:
                sys.stderr.write("Unexpected parent '{}' of element '{}'\n".format(parent_element.tag, element.tag))
                continue  # no usable parent ID, so skip this element
            element_id = element.attrib.get('id') # <anaphora> doesn't have an ID

            if element.tag == 'node':
                element_label = element.attrib['cat']
            elif element.tag == 'word':
                element_label = element.attrib['form']
            else:
                continue # for now, ignore other elements (e.g. <anaphora>)

            docgraph.add_node(element_id, label=element_label)
            edges.append((parent_id, element_id, edge_attribs))
    docgraph.add_edges_from(edges)
    return docgraph

## networkx: combined preprocessing/inserting of nodes, edge caching/batch insert (intractable)

- killed process after 12 hours

# use multiple cores by splitting exportxml file into 'documents'

In [28]:
from lxml import etree

treeiter = etree.iterparse(TUEBADZ_FILE, tag='sentence')

In [29]:
from collections import defaultdict

documents = defaultdict(list)

In [30]:
%%time

for _action, sentence in treeiter:
    documents[sentence.attrib['origin']].append(sentence)

CPU times: user 6.23 s, sys: 188 ms, total: 6.42 s
Wall time: 6.44 s


In [31]:
len(documents.keys())

2213
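
To hand the documents to several worker processes, the sentences of each origin could be serialized first, since plain byte strings can be pickled. The following is a sketch under assumptions: it uses the standard library's ElementTree instead of lxml, the function name `group_sentences_by_origin` is hypothetical, and the sample input is made up (only the origin `T990507.2` occurs in the corpus output above).

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_sentences_by_origin(source):
    """Map each origin to the list of its sentences, serialized as XML byte strings."""
    documents = defaultdict(list)
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'sentence':
            documents[elem.get('origin')].append(ET.tostring(elem))
            elem.clear()  # the serialized copy is all we keep
    return documents


# made-up miniature input, not taken from the corpus
SAMPLE = b"""<export>
  <sentence origin="T990507.2"><word form="Veruntreute"/></sentence>
  <sentence origin="T990507.2"><word form="Landesvorsitzende"/></sentence>
  <sentence origin="T990507.3"><word form="Beispiel"/></sentence>
</export>"""

documents = group_sentences_by_origin(io.BytesIO(SAMPLE))
for origin in sorted(documents):
    print('{}: {} sentence(s)'.format(origin, len(documents[origin])))
```

Each value of `documents` could then be passed to a worker, e.g. via `multiprocessing.Pool.map`, which would re-parse the strings and build one graph per document.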