In [1]:
# settings needed to make the output look pretty,
# but not to be included in the slideshow
%doctest_mode

Exception reporting mode: Plain
Doctest mode is: ON


# Python and XML

# What is XML?

- XML (eXtensible Markup Language) is a markup language for documents containing structured information
- A recommendation of the World Wide Web Consortium (W3C), published in 1998
- XML is just plain text: Software that can handle plain text can also handle XML
- XML does not DO anything: it was created to structure, store and send information
- XML is not a programming language

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<job id="evaluation A1" progress="500/500" status="FINISHED">
  <unit id="1" status="FINISHED" type="pe">
    <S producer="FILES1and2">What is the most essential information in Business Analyst CV?</S>
    <MT producer="SMT">Was ist die wichtigsten Informationen in Business Analyst Lebenslauf?</MT>
    <annotations revisions="1">
      <annotation r="1">
        <PE producer="Translator_1.SMT">Was sind die wichtigsten Informationen im Business Analyst Lebenslauf?</PE>
        <assessment id="fluency">
          <score>3. Near native </score>
        </assessment>
        ...
      </annotation>
    </annotations>
  </unit>
</job>
```

- this XML document is just information wrapped in XML tags. Someone needs to write a piece of software to send, receive or display it.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<job id="evaluation A1" progress="500/500" status="FINISHED">
  <unit id="1" status="FINISHED" type="pe">
      <S producer="FILES1and2">What is the most essential information in Business Analyst CV?</S>
  ...
```

- ``<?xml version="1.0" encoding="UTF-8" standalone="no"?>`` is the XML declaration. Depending on the parser it may not be necessary
- ``<job id="evaluation A1" progress="500/500" status="FINISHED">`` is the root node of the document. Every XML document needs to have exactly one root node
   - **job** is called the tag name
   - **id="evaluation A1"** indicates that the attribute **id** has value **evaluation A1**
- ``<unit id="1" status="FINISHED" type="pe">`` is a child element of tag **job**
- **What is the most essential information in Business Analyst CV?** is a text node and the child node of tag **S**

- XML documents have a tree structure
- we use the terms:
    - root node
    - parent, child and sibling to describe the relationships between elements

# XML Elements

- an XML element is a container for its content

```xml
<assessment id="fluency">
  <score>3. Near native </score>
</assessment>
```

- there are two XML elements here: **assessment** and **score**
- XML elements can contain text or other XML elements
- An XML element must be well-formed: every opening tag needs a matching closing tag, and elements must be properly nested. The example below is not well-formed, a fact also highlighted in this case by the colours used for rendering

```xml
<assessment id="fluency">
  <score>3. Near native</assessment>
  </score>
```
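A parser rejects input that is not well-formed. The following is a minimal sketch of such a check, assuming Python's standard `xml.dom.minidom` module; the helper name `is_well_formed` is hypothetical and only used for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

well_formed = '<assessment id="fluency"><score>3. Near native</score></assessment>'
ill_formed = '<assessment id="fluency"><score>3. Near native</assessment></score>'


def is_well_formed(xml_text):
    """Return True if xml_text can be parsed without a syntax error."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        # minidom relies on the expat parser, which raises ExpatError
        # for syntax problems such as mismatched tags
        return False


print(is_well_formed(well_formed))
print(is_well_formed(ill_formed))
```

The first call should report that the document is well-formed, the second that it is not, because the closing tags are in the wrong order.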

# XML elements

- the name of an element
    - is case sensitive
    - must start with a letter or _
    - can contain letters, digits, _, - or . (but not ~ or spaces)
- elements can have attributes (name-value pairs) `` <assessment id="fluency">``
- attribute values must be surrounded by single or double quotes

# XML comments

- Start with ``<!--`` and end with ``-->``
- comments cannot be nested (comments inside comments)
- comments are not part of the document's data, but the XML parser can still make them available to the application
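As an illustration of how a parser exposes comments, the sketch below uses Python's standard `xml.dom.minidom` module on a made-up one-line document; the comment text is an assumption for the example.

```python
from xml.dom import minidom

# hypothetical document containing one comment and one piece of text
doc = minidom.parseString(
    '<score><!-- rated by translator 1 -->3. Near native</score>'
)

comments = []
for node in doc.documentElement.childNodes:
    if node.nodeType == node.COMMENT_NODE:
        # the text of a comment is stored in its data attribute
        comments.append(node.data)
        print("comment:", node.data)
    elif node.nodeType == node.TEXT_NODE:
        print("text:", node.data)
```

The comment appears as a node of its own, next to the text node, so it can be read or skipped as needed.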

# Entity references

- some characters have special meanings in XML (e.g. <, ")
- these characters can be represented using entities
- Entities are also used to refer to often repeated or varying text and to include the content of external files
- standard XML entities: 
    - ``&lt;`` for ``<`` 
    - ``&gt;`` for ``>``
    - ``&amp;`` for ``&``
    - ``&apos;`` for ``'``
    - ``&quot;`` for ``"``
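In Python, the standard entities rarely need to be written by hand. The sketch below assumes the helpers `escape` and `unescape` from `xml.sax.saxutils` and a made-up input string. By default these helpers handle only `&`, `<` and `>`, so the quote character is passed as an extra mapping.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, unescape

raw = 'score < 3 & status = "open"'

# escape() replaces &, < and > by default; other characters are
# replaced only if they are listed in the optional dictionary
escaped = escape(raw, {'"': "&quot;"})
print(escaped)

# unescape() reverses the replacement
restored = unescape(escaped, {"&quot;": '"'})
print(restored == raw)

# a parser resolves the standard entities on its own
doc = minidom.parseString("<S>a &lt; b</S>")
parsed_text = doc.documentElement.firstChild.data
print(parsed_text)
```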

# Document Type Definition (DTD)

- a DTD is not compulsory
- it defines rules for the content of elements, the attributes an element can have, their values, etc.

- Well-formed vs. valid XML document:
    - well-formed – matches all the character encoding and syntax rules defined in the XML 1.0 recommendation
    - valid – matches the definitions in the DTD and is well-formed

# XML and regular expressions

```xml
<job id="evaluation A1" progress="500/500" status="FINISHED">
  <unit id="1" status="FINISHED" type="pe">
    <S producer="FILES1and2">What is the most essential information in Business Analyst CV?</S>
  </unit>
</job>
```

Exercise: write regular expressions to process the XML document above. 
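One possible starting point for the exercise is sketched below, using Python's `re` module. The patterns are assumptions that work for this small document; regular expressions are brittle for XML in general (nested elements, comments and entities can break them), which is why a parser is normally the better tool.

```python
import re

xml_text = '''<job id="evaluation A1" progress="500/500" status="FINISHED">
  <unit id="1" status="FINISHED" type="pe">
    <S producer="FILES1and2">What is the most essential information in Business Analyst CV?</S>
  </unit>
</job>'''

# name="value" pairs inside a tag
attribute_pattern = re.compile(r'(\w+)="([^"]*)"')

# the opening tag of the job element and its attributes
job_tag = re.search(r'<job\b[^>]*>', xml_text).group(0)
job_attributes = dict(attribute_pattern.findall(job_tag))
print(job_attributes)

# the text between <S ...> and </S>
source_text = re.search(r'<S\b[^>]*>(.*?)</S>', xml_text, re.DOTALL).group(1)
print(source_text)
```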

# Python and XML

- there are two basic ways of working with XML in Python:
    - SAX (Simple API for XML): reads small bits of documents at a time
    - DOM (Document Object Model): reads the whole document into memory and creates a representation which can be manipulated
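To make the difference concrete, here is a minimal sketch that counts `unit` elements in both styles, assuming the standard `xml.sax` and `xml.dom.minidom` modules. The handler class `UnitCounter` and the sample document are hypothetical.

```python
import xml.sax
from xml.dom import minidom

sample = b'<job id="evaluation A1"><unit id="1"/><unit id="2"/></job>'


class UnitCounter(xml.sax.ContentHandler):
    """SAX handler that counts the unit elements as they are read."""

    def __init__(self):
        super().__init__()
        self.units = 0

    def startElement(self, name, attrs):
        # called by the parser every time an opening tag is read
        if name == "unit":
            self.units += 1


# SAX: the document is processed piece by piece through callbacks
handler = UnitCounter()
xml.sax.parseString(sample, handler)
print("SAX found", handler.units, "units")

# DOM: the whole document is loaded first, then queried
doc = minidom.parseString(sample)
dom_units = len(doc.getElementsByTagName("unit"))
print("DOM found", dom_units, "units")
```

SAX keeps memory usage low for large files, while DOM is usually more convenient when the document has to be navigated or modified.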

# Parsing files with DOM

- we are going to use minidom, a lightweight DOM implementation
- it has limited functionality, but it is a very good starting point
- we need to import the relevant module:
```python
import xml.dom.minidom
```
or 
```python
from xml.dom import minidom
```

- open and parse the file. The object returned is a ``Document`` node that represents the whole XML document; the root element is available as its ``documentElement`` attribute
```python
xmldoc = xml.dom.minidom.parse(<file name>)
```
or
```python
xmldoc = minidom.parse(<file name>)
```



In [2]:
import xml.dom.minidom
xmldoc = xml.dom.minidom.parse('data/evaluationDE_A1.per')
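When the XML is already available as a string, `minidom.parseString` can be used instead of `parse`. The sketch below uses a shortened, made-up document and shows what kind of object the parser returns.

```python
from xml.dom import minidom

# a small stand-in for the data file
sample = '<job id="evaluation A1"><unit id="1"><S>Hello</S></unit></job>'

doc = minidom.parseString(sample)

# the parser returns a Document node ...
print(doc.nodeType == doc.DOCUMENT_NODE)

# ... and the root element is reached through documentElement
root = doc.documentElement
print(root.tagName)
```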

# Accessing information in XML documents

- ``toxml()`` can be used to print a node (either the whole document or a specific node in the XML tree). No formatting is applied to the output. 

In [3]:
print(xmldoc.toxml()[:400])

<?xml version="1.0" ?><job id="evaluation A1" progress="500/500" status="FINISHED"><unit id="1" status="FINISHED" type="pe"><S producer="FILES1and2">What is the most essential information in Business Analyst CV?</S><MT producer="SMT">Was ist die wichtigsten Informationen in Business Analyst Lebenslauf?</MT>
    <annotations revisions="1">
      <annotation r="1">
        <PE producer="Translator_1


**Note**: the indentation and newlines were present in the file. For the first few lines they were removed manually to show the difference between ``toxml()`` and ``toprettyxml()``

- ``toprettyxml(indent="\t", newl="\n", encoding=None)`` returns the output nicely formatted, using ``indent`` to indent child nodes and ``newl`` to separate tags (the values shown are the defaults)

In [4]:
print(xmldoc.toprettyxml()[:400])

<?xml version="1.0" ?>
<job id="evaluation A1" progress="500/500" status="FINISHED">
	<unit id="1" status="FINISHED" type="pe">
		<S producer="FILES1and2">What is the most essential information in Business Analyst CV?</S>
		<MT producer="SMT">Was ist die wichtigsten Informationen in Business Analyst Lebenslauf?</MT>
		
    
		<annotations revisions="1">
			
      
			<annotation r="1">
				
      


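- the output looks a bit messy because the document already contains newlines and indentation; these are kept as text nodes and printed in addition to the formatting added by ``toprettyxml()``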
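One way to get cleaner output is to remove the text nodes that contain only whitespace before pretty-printing. This is a sketch under the assumption that such whitespace carries no meaning in the document; the helper `strip_whitespace_nodes` and the sample document are hypothetical.

```python
from xml.dom import minidom

sample = """<job id="evaluation A1">
  <unit id="1">
    <S>Hello</S>
  </unit>
</job>"""


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # iterate over a copy, because the list changes while nodes are removed
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


doc = minidom.parseString(sample)
strip_whitespace_nodes(doc)
pretty = doc.toprettyxml(indent="  ")
print(pretty)
```

Note that this changes the document in memory, so it should only be applied when the whitespace is known to be insignificant.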

# Accessing the child nodes

- ``xmldoc.childNodes`` returns a list of the child nodes of the document node (here its only child is the root element **job**)
- more generally, ``node.childNodes`` returns the children of ``node``
- the items in the list are ``Node`` objects

In [5]:
PE = xmldoc.childNodes[0].childNodes[0].childNodes[0]
print(PE.toprettyxml())

<S producer="FILES1and2">What is the most essential information in Business Analyst CV?</S>



In [6]:
PEtext = PE.childNodes[0]
print(PEtext.toprettyxml())

What is the most essential information in Business Analyst CV?



In [7]:
child_PEtext = PEtext.childNodes
print(child_PEtext)

()


# Nodes

- nodes can have several types: 
   - elements
   - text
   - comments, entities, …
- it is possible to determine the type using ``node.nodeType`` which can be ``node.ELEMENT_NODE``, ``node.TEXT_NODE``, ``node.COMMENT_NODE`` 
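- it is possible to access the name of the node with ``node.nodeName`` and its value with ``node.nodeValue``, but what they contain depends on the type of the node: for an element, ``nodeName`` is the tag name and ``nodeValue`` is ``None``; for a text node, ``nodeName`` is ``'#text'`` and ``nodeValue`` is the text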

In [8]:
PE

<DOM Element: S at 0x42e4800>

In [9]:
PE.nodeType == xml.dom.Node.ELEMENT_NODE

True

In [10]:
PE.nodeName

'S'

In [11]:
print(PE.nodeValue)

None


In [12]:
PEtext

<DOM Text node "'What is th'...">

In [13]:
PEtext.nodeType == xml.dom.Node.TEXT_NODE

True

In [14]:
PEtext.nodeName

'#text'

In [15]:
PEtext.nodeValue

'What is the most essential information in Business Analyst CV?'
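The same checks can be applied to all children of an element at once. The sketch below uses a made-up element that mixes a comment, an element and text, and collects the type, name and value of each child; the dictionary `type_names` is only an illustration and covers just three node types.

```python
from xml.dom import minidom, Node

doc = minidom.parseString('<unit id="1"><!-- source --><S>Hello</S>tail</unit>')

# readable labels for the node types used in this example
type_names = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}

summary = []
for child in doc.documentElement.childNodes:
    label = type_names.get(child.nodeType, "other")
    summary.append((label, child.nodeName, child.nodeValue))
    print(label, child.nodeName, child.nodeValue)
```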

# Node

- ``Node`` objects provide a variety of **attributes** to get 
  - the parent (``node.parentNode``), 
  - next sibling (``node.nextSibling``), 
  - previous sibling (``node.previousSibling``), 
  - children nodes (``node.childNodes``), 
  - first child (``node.firstChild``)
  
See more at <a href="https://docs.python.org/3.6/library/xml.dom.html#dom-node-objects" target="_blank">https://docs.python.org/3.6/library/xml.dom.html#dom-node-objects</a>
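A short sketch of how these attributes fit together, using a made-up document without whitespace between the tags (with whitespace, the siblings of an element would often be text nodes):

```python
from xml.dom import minidom

doc = minidom.parseString('<unit><S>source</S><MT>translation</MT></unit>')
unit = doc.documentElement

s = unit.firstChild        # the S element
mt = s.nextSibling         # the MT element follows S

print(s.parentNode.tagName)
print(mt.tagName)
print(mt.previousSibling is s)

# there is no node after MT, so nextSibling is None
print(mt.nextSibling)
```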

# Searching for nodes in the document

- ``doc.getElementsByTagName(<tag>)`` returns a list of all the elements named <tag> found below ``doc``, at any depth
- ``getElementsByTagName`` can be called on the document or on any element in it, not only on the root

In [16]:
units = xmldoc.getElementsByTagName("unit")
print("There are", len(units), "units")

There are 500 units


In [17]:
# check whether every unit has exactly one annotation
for unit in units:
    annotations = unit.getElementsByTagName("annotation")
    if len(annotations) != 1: 
        print("I found a unit with several annotations")

I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
I found a unit with several annotations
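The difference between searching from the document and searching from a single element can be seen on a small made-up document:

```python
from xml.dom import minidom

sample = (
    '<job>'
    '<unit id="1"><annotation r="1"/></unit>'
    '<unit id="2"><annotation r="1"/><annotation r="2"/></unit>'
    '</job>'
)
doc = minidom.parseString(sample)

# searching from the document finds every annotation
all_annotations = doc.getElementsByTagName("annotation")
print(len(all_annotations))

# searching from one unit finds only the annotations inside it
second_unit = doc.getElementsByTagName("unit")[1]
unit_annotations = second_unit.getElementsByTagName("annotation")
print(len(unit_annotations))
```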


# Accessing attributes of elements

- ``attrs = node.attributes`` returns a structure similar to a dictionary which contains the attributes of a node
- it supports ``attrs.keys()`` and ``attrs.values()`` 
- but iterating over it directly with ``for attr in attrs`` doesn't work
- an individual attribute can be accessed using indexing ``attribute = attrs["<name>"]``
- an attribute has a name (``attribute.name``) and a value (``attribute.value``)

In [18]:
print(unit.toxml()[:100])

<unit id="500" status="FINISHED" type="pe">
    <S producer="FILES1and2">But this results of this ex


In [19]:
attrs = unit.attributes
for key in attrs.keys():
    print("attribute", key, "has value", attrs[key].value)

attribute id has value 500
attribute status has value FINISHED
attribute type has value pe


In [20]:
# note that attrs[key] does not return the value directly
print(attrs["id"])

<xml.dom.minidom.Attr object at 0x10CA3F70>
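Besides ``keys()``, the attribute map can be read in a few other ways. The sketch below assumes minidom's ``items()`` method and the DOM-style ``length`` and ``item()`` accessors, applied to a made-up element.

```python
from xml.dom import minidom

doc = minidom.parseString('<unit id="500" status="FINISHED" type="pe"/>')
attrs = doc.documentElement.attributes

# items() returns (name, value) pairs, which convert easily to a dict
pairs = dict(attrs.items())
print(pairs)

# positional access: item(i) returns the i-th Attr object
names = [attrs.item(i).name for i in range(attrs.length)]
print(names)
```

Converting to a plain dictionary is often the simplest option when only the names and values are needed.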


# Attributes

- It is also possible to:
   - check whether an element has an attribute ``element.hasAttribute(<attr>)``
   - retrieve the value of an attribute with ``element.getAttribute(<attr>)``; note that it returns an empty string if the attribute does not exist

In [21]:
# report the id and number of annotations of every unit that does not have exactly one annotation
for unit in units:
    annotations = unit.getElementsByTagName("annotation")
    if len(annotations) != 1: 
        print("Unit", unit.getAttribute("id"), "has", len(annotations), "annotations")

Unit 59 has 2 annotations
Unit 227 has 2 annotations
Unit 230 has 3 annotations
Unit 247 has 2 annotations
Unit 248 has 2 annotations
Unit 261 has 2 annotations
Unit 283 has 2 annotations
Unit 299 has 2 annotations
Unit 308 has 2 annotations
Unit 309 has 2 annotations
Unit 366 has 2 annotations
Unit 423 has 2 annotations
Unit 497 has 2 annotations
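Because ``getAttribute`` returns an empty string for a missing attribute, it cannot distinguish a missing attribute from an empty one. A small sketch of how ``hasAttribute`` can be combined with it; the helper `get_attribute_or_default` is hypothetical:

```python
from xml.dom import minidom

doc = minidom.parseString('<unit id="59" status="FINISHED"/>')
unit = doc.documentElement


def get_attribute_or_default(element, name, default=None):
    """Return the attribute value, or default if the attribute is missing."""
    if element.hasAttribute(name):
        return element.getAttribute(name)
    return default


print(repr(unit.getAttribute("type")))              # missing attribute
print(get_attribute_or_default(unit, "type"))       # falls back to the default
print(get_attribute_or_default(unit, "id"))         # existing attribute
```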


# Further reading

- Minidom: <a href="https://docs.python.org/3.6/library/xml.dom.minidom.html" target="_blank">https://docs.python.org/3.6/library/xml.dom.minidom.html</a>
- More technical details about DOM: <a href="https://docs.python.org/3.6/library/xml.dom.html" target="_blank">https://docs.python.org/3.6/library/xml.dom.html</a>