# Chapter 16 - XML
So far, we have seen several data formats (CSV, TSV, JSON, etc.). This week, we will learn how to work with one of the most popular structured data formats: [XML](http://www.w3schools.com/xml/). XML is used a lot in NLP, so it is important that you know how to work with it.

### Preparation 
Please run the following cells to check whether the relevant files are in the relevant places on your computer.

In [None]:
import os

In [None]:
github_link = 'https://github.com/cltl/python-for-text-analysis#planning'
path_course_xml = '../Data/xml_data/course.xml'
path_naf_xml = '../Data/xml_data/naf.xml'

for path in [path_course_xml, path_naf_xml]:
    assert os.path.exists(path), f'{path} does not exist. Please download it from {github_link}'

### At the end of this chapter, you will be able to
* read an XML file using **<span style="background-color:yellow">etree.parse</span>**
* read XML from string using **<span style="background-color:yellow">etree.fromstring</span>**
* convert an XML element to a string using **<span style="background-color:yellow">etree.tostring</span>**
* use the following methods and attributes of an XML element (of type **lxml.etree._Element**):
    * **to access elements**: **<span style="background-color:yellow">methods 'find', 'findall', and 'getchildren'</span>**
    * **to access attributes**: **<span style="background-color:yellow">method 'get'</span>**
    * **to access element information**: **<span style="background-color:yellow">attributes 'tag' and 'text'</span>**

* [not needed for assignment] create your own XML and write it to a file

### If you want to learn more about this chapter, you might find the following links useful:
* [XML](http://www.w3schools.com/xml/)
* [detailed XML introduction](http://www.dfki.de/~uschaefer/esslli09/xmlquerylang.pdf)
* [NAF XML](http://www.newsreader-project.eu/files/2013/01/techreport.pdf)
* [Xpath](http://www.w3schools.com/xml/xpath_syntax.asp)
* Other structured data formats: [JSON-LD](http://json-ld.org/), [MicroData](https://www.w3.org/TR/microdata/), [RDF](https://www.w3.org/RDF/)

## 1. Introduction
NLP is all about data. More specifically, we usually want to annotate (manually or automatically) textual data with information about:
* [part of speech](https://en.wikipedia.org/wiki/Part_of_speech)
* [word senses](https://en.wikipedia.org/wiki/Word_sense)
* [entities](https://en.wikipedia.org/wiki/Entity_linking)
* [semantic role labelling](https://en.wikipedia.org/wiki/Semantic_role_labeling)
* many many many more.....

What would data look like that contains all this information? Let's look at a simple example:

In [None]:
import nltk

In [None]:
text = nltk.word_tokenize("Tom Cruise is an actor.")
print(nltk.pos_tag(text))

In this example, we see that the format is a list of [tuples](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences). The first element of each tuple is the word and the second element is the part of speech tag. Great, so far this works. However, we also want to indicate that **Tom Cruise** is an entity. Now we start to run into trouble, because some annotations apply to single words and some apply to combinations of words. In addition, we sometimes have more than one annotation per token. Flat formats such as CSV and TSV are not great at **representing** this kind of layered linguistic information. So is there a format that is better at it? The answer is yes and the format is XML.

## 2. Terminology
Let's look at an example (the line numbers are there for explanation purposes). On purpose, we start with a non-linguistic, hopefully intuitive example.

```xml
1.  <Course>
2.      <person role="coordinator">Van der Vliet</person>
3.      <person role="instructor">Van Miltenburg</person>
4.      <person role="instructor">Van Son</person>
5.      <person role="instructor">Postma</person>
6.      <person role="student">Baloche</person>
7.      <person role="student">De Boer</person>
8.      <animal role="student">Rubber duck</animal>
9.      <person role="student">Van Doorn</person>
10.     <person role="student">De Jager</person>
11.     <person role="student">King</person>
12.     <person role="student">Kingham</person>
13.     <person role="student">Mózes</person>
14.     <person role="student">Rübsaam</person>
15.     <person role="student">Torsi</person>
16.     <person role="student">Witteman</person>
17.     <person role="student">Wouterse</person>
18.     <person/>
19. </Course>
```

### 2.1 Elements
Lines 1 to 19 all show examples of [XML elements](http://www.w3schools.com/xml/xml_elements.asp). Each XML element contains a **starting tag** (e.g. ```<person>```) and an **end tag** (e.g. ```</person>```). An element can contain:
* **text**: *Van der Vliet* on line 2
* **attributes**: *role* attribute in lines 2 to 17
* **elements**: elements can contain other elements, e.g. *person* elements inside the *Course* element. The terminology to talk about this is as follows. In this example, we call `person` the `child` of `Course` and `Course` the `parent` of `person`.

Please note that on line 18 the **starting tag** and **end tag** are combined. This is possible when an element has neither children nor text. The syntax for such an element is **```<START_TAG/>```**.

### 2.2 Root element
A special element is the **root** element. In our example, **Course** is our [root element](https://en.wikipedia.org/wiki/Root_element). The element starts at line 1 (```<Course>```) and ends at line 19 (```</Course>```). Notice the difference between the start tag (no '/') and the end tag (with '/'). The root element is special in that it is the only element without a parent: every other element in the document is nested inside it.

### 2.3 Attributes
Elements can contain [attributes](http://www.w3schools.com/xml/xml_attributes.asp), which contain information about the element. In this case, this information is the role a person has in the course. All attributes are located in the start tag of an xml element.

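As a preview of what Section 3 covers in detail, the terminology above can be checked in a few lines of Python. This is only a minimal sketch: `course_string` is a shortened, hypothetical version of the course example, and the import falls back to the standard library's `xml.etree.ElementTree` (which offers the same calls used here) in case **lxml** is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# shortened, hypothetical version of the course example
course_string = """
<Course>
    <person role="coordinator">Van der Vliet</person>
    <person role="student">Baloche</person>
    <person/>
</Course>
"""

course = etree.fromstring(course_string)  # 'Course' is the root element
children = list(course)                   # its child elements

print(course.tag)                 # tag of the root element
print(len(children))              # number of children
print(children[0].text)           # text of the first person element
print(dict(children[0].attrib))   # attributes of the first person element
print(children[2].text)           # the empty element has no text ...
print(len(children[2]))           # ... and no children
```
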
## 3. Working with XML in Python
Now that we know the basics of XML, we want to be able to access it in Python. In order to work with XML, we will use the [**lxml**](http://lxml.de/) library.

In [None]:
from lxml import etree

As a next step, we load an XML file from our computer. Please first manually inspect the file **course.xml** in the folder **Data/xml_data** using a text editor (e.g. [Atom](https://atom.io/)) to get an idea of what to expect. The **etree.parse** function is used to load XML files.

In [None]:
tree = etree.parse('../Data/xml_data/course.xml')
print(type(tree))

### 3.1 Accessing root element
In order to access the root element of an XML file, we can use the **getroot** method. Note that printing the root does not show the XML element itself, but only a reference to it. In order to show the element itself, we can use the **etree.dump** function.

In [None]:
root = tree.getroot()
print('root', root)
print()
print('etree.dump example')
etree.dump(root, pretty_print=True)

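**etree.dump** only prints an element. If you need the XML as a value you can store or process, **etree.tostring** (listed in the chapter goals) is the usual choice. The sketch below uses a small hypothetical element; the fallback import is only there in case **lxml** is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# small hypothetical element, created from a string
person = etree.fromstring('<person role="student">Lee</person>')

as_bytes = etree.tostring(person)                     # bytes by default
as_text = etree.tostring(person, encoding='unicode')  # str when asking for 'unicode'

print(type(as_bytes), as_bytes)
print(type(as_text), as_text)
```
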
As with any Python object, we can use the built-in function **dir** to list all attributes and methods of an element (which has the type **lxml.etree._Element**).

In [None]:
print(type(root))
dir(root)

We will focus on the following methods/attributes:
* **to access elements**: **<span style="background-color:yellow">methods 'find', 'findall', and 'getchildren'</span>**
* **to access attributes**: **<span style="background-color:yellow">method 'get'</span>**
* **to access element information**: **<span style="background-color:yellow">attributes 'tag' and 'text'</span>**

### 3.2 Accessing elements
There are several ways of accessing XML elements. The **find** method returns the **first** matching child.

In [None]:
first_person_el = root.find('person')
etree.dump(first_person_el, pretty_print=True)

In order to get a list of all person children, we can use the **findall** method.
Notice that this does not return the **animal** since we are looking for **person** elements.

In [None]:
all_person_els = root.findall('person')
all_person_els

Sometimes, we simply want all the children, regardless of their tags. This can be achieved using the **getchildren** method, which returns a list of all children.
Now we do get the **animal** element again.

In [None]:
all_child_els = root.getchildren()
all_child_els

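The lxml documentation marks **getchildren** as deprecated and suggests `list(element)` or plain iteration instead. A minimal sketch of both alternatives, using a shortened, hypothetical course fragment:

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# shortened, hypothetical course fragment
small_course = etree.fromstring(
    '<Course>'
    '<person role="instructor">Postma</person>'
    '<animal role="student">Rubber duck</animal>'
    '</Course>'
)

# list(element) gives all children, whatever their tag
all_child_els = list(small_course)

# iterating over an element visits its children in document order
child_tags = [child_el.tag for child_el in small_course]

print(len(all_child_els))
print(child_tags)
```
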
### 3.3 Accessing element information
We will now show how to access the attributes, text, and tag of an element.

The **get** method is used to access the attribute of an element.
If an attribute does not exist, **get** returns None instead of raising an error.

In [None]:
first_person_el = root.find('person')
role_first_person_el = first_person_el.get('role')
attribute_not_found = first_person_el.get('blabla')
print('role first person element:', role_first_person_el)
print('value if not found:', attribute_not_found)

The **text** of an element is found in the attribute **text**

In [None]:
print(first_person_el.text)

The **tag** of an element is found in the attribute **tag**

In [None]:
print(first_person_el.tag)

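The three pieces of information can be combined in one loop. The sketch below uses a hypothetical `Library` example (not one of the course files) and also shows that **get** accepts a second argument, which is returned when the attribute is missing.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# hypothetical example data
library_string = """
<Library>
    <book genre="novel">Title A</book>
    <book genre="poetry">Title B</book>
    <magazine>Title C</magazine>
</Library>
"""

library = etree.fromstring(library_string)

rows = []
for child_el in library:
    genre = child_el.get('genre', 'unknown')  # 'unknown' is used if there is no genre attribute
    rows.append((child_el.tag, genre, child_el.text))

for row in rows:
    print(row)
```
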
### Exercises
* Can you print the names of all students?
* Can you print the names of all instructors whose name starts with 'Van'?

## 4. How to deal with more than one layer
In Section 3, our example had only one nested layer (**person**). However, XML can be nested much more deeply. Let's look at such an example and think about how you would access the first **target** element, i.e. 
```xml

<target id="t1" />
```

```xml

<NAF xml:lang="en" version="v3">
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
        <term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
        <term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
        <term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
    </terms>
    <entities>
        <entity id="e3" type="PERSON">
              <references>
                  <span>
                      <target id="t1" />
                      <target id="t2" />
                  </span>
              </references>
        </entity>
    </entities>
</NAF>
```

Just like we can load an XML file from disk using **etree.parse**, we can use **etree.fromstring** to load XML from a string.

In [None]:
naf_string = """
<NAF xml:lang="en" version="v3">
    <text>
        <wf id="w1" offset="0" length="3" sent="1" para="1">tom</wf>
        <wf id="w2" offset="4" length="6" sent="1" para="1">cruise</wf>
        <wf id="w3" offset="11" length="2" sent="1" para="1">is</wf>
        <wf id="w4" offset="14" length="2" sent="1" para="1">an</wf>
        <wf id="w5" offset="17" length="5" sent="1" para="1">actor</wf>
    </text>
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
        <term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
        <term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
        <term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
    </terms> 
    <entities>
        <entity id="e3" type="PERSON">
              <references>
                  <span>
                      <target id="t1" />
                      <target id="t2" />
                  </span>
              </references>
        </entity>
    </entities>
</NAF>
"""

naf = etree.fromstring(naf_string)
print(type(naf))
etree.dump(naf, pretty_print=True)

Please note that the structure is as follows:
* the **NAF** element is the parent of the elements **text**, **terms**, and **entities**
* the **wf** elements are children of the **text** element, which provides us information about the position of words in the text, e.g. that *tom* is the first word in the text (**id="w1"**) and is part of the first sentence (**sent="1"**)
* the **term** elements are children of the **terms** element, which provides us information about lemmatization and part of speech
* the **entity** element is a child of the **entities** element. We learn from the **entity** element that the terms **t1** and **t2** (i.e. Tom Cruise) form an entity of type **PERSON**.

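The structure described above can be verified with the methods from Section 3. This is a minimal sketch on a shortened copy of the NAF example (`naf_small` is a name introduced here for illustration); it lists the layers under the root and reads the attributes of the **wf** elements.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# shortened copy of the NAF example
naf_small = etree.fromstring("""
<NAF xml:lang="en" version="v3">
    <text>
        <wf id="w1" offset="0" length="3" sent="1" para="1">tom</wf>
        <wf id="w2" offset="4" length="6" sent="1" para="1">cruise</wf>
    </text>
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
    </terms>
</NAF>
""")

# the children of the root are the annotation layers
layer_tags = [layer_el.tag for layer_el in naf_small]
print(layer_tags)

# each wf element carries its position information as attributes
words = []
for wf_el in naf_small.findall('text/wf'):
    words.append((wf_el.get('id'), wf_el.get('sent'), wf_el.text))
print(words)
```
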
One way of accessing the first **target** element is by going one level at a time:

In [None]:
entities_el = naf.find('entities')
entity_el = entities_el.find('entity')
references_el = entity_el.find('references')
span_el = references_el.find('span')
target_el = span_el.find('target')
etree.dump(target_el, pretty_print=True)

Is there a better way? The answer is yes! The **find** method also accepts a path of tags separated by '/', which lets us reach our **target** element in a single call:

In [None]:
target_el = naf.find('entities/entity/references/span/target')
etree.dump(target_el, pretty_print=True)

You can also use **findall** to find all **target** elements

In [None]:
for target_el in naf.findall('entities/entity/references/span/target'):
    etree.dump(target_el, pretty_print=True)

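Two things are often useful at this point: searching for a tag at any depth, and linking the **target** ids back to the **term** elements. The sketch below assumes the path prefix `.//`, which both lxml and the standard library interpret as "all descendants"; `naf_example`, `lemma_of`, and `entity_text` are names introduced here for illustration.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# shortened copy of the NAF example
naf_example = etree.fromstring("""
<NAF xml:lang="en" version="v3">
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
        <term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
    </terms>
    <entities>
        <entity id="e3" type="PERSON">
              <references>
                  <span>
                      <target id="t1" />
                      <target id="t2" />
                  </span>
              </references>
        </entity>
    </entities>
</NAF>
""")

# './/target' searches all descendants, so the full path is not needed
target_ids = [target_el.get('id') for target_el in naf_example.findall('.//target')]

# map each term id to its lemma
lemma_of = {}
for term_el in naf_example.findall('terms/term'):
    lemma_of[term_el.get('id')] = term_el.get('lemma')

# look up the lemma for every target of the entity
entity_text = ' '.join(lemma_of[target_id] for target_id in target_ids)

print(target_ids)
print(entity_text)
```

The full path is more explicit about where the element lives, while `.//target` is shorter but matches **target** elements anywhere in the document.
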
## 5. Creating your own XML

**Please note that this section is optional, meaning that you don't need to understand this section in order to complete the assignment.**

I will now show you how to create your own XML. There are three main steps:
* Step a: create an xml object
* Step b: Creating elements and adding them
* Step c: writing to a file

#### Step a: create an xml object
You create a new XML object by:
* creating the **root** element -> using **etree.Element** (every element is created using  **etree.Element**, not only the root)
* creating the main XML object -> using **etree.ElementTree**

You do not have to fully understand how this works. Please make sure you can reuse this code snippet when you create your own XML.

In [None]:
our_root = etree.Element('Course')
our_tree = etree.ElementTree(our_root)
our_root = our_tree.getroot()

We can inspect what we have created by using the **etree.dump** method. As you can see, we only have the root node **Course** currently in our document.

In [None]:
etree.dump(our_root, pretty_print=True)

As you see, we created an XML object, containing only the root element **Course**.

#### Step b: Creating elements and adding them
We can also create our own XML elements by using the **etree.Element** method:

In [None]:
tag = 'person' # what the start and end tag will be 
attributes = {'role': 'student'} # dictionary of attributes, can be more than one
name_student = 'Lee' # the text of the elements

new_person_element = etree.Element(tag, attrib=attributes)
new_person_element.text = name_student

etree.dump(new_person_element, pretty_print=True)

In the cell above, I showed an example of how we can create an XML element. Following common practice, it is good to check the **type** of the XML element we created:

In [None]:
print(type(new_person_element))

We learn that we created an instance of the class **lxml.etree.\_Element**. This is not different from creating an instance of a **string** or a **list**. We just instantiated an instance of a class.

We can add children to an XML element using **append**

In [None]:
tag = 'pet'
attributes = {'role': 'joy'}
name_pet = 'Romeo'

# please note that 'tag' is a positional argument
# please note that 'attrib' is a keyword argument
new_pet_element = etree.Element(tag, attrib=attributes) 
new_pet_element.text = name_pet

print()
print('our new pet element')
etree.dump(new_pet_element, pretty_print=True)

# now we will make this element a child of new_person_element
new_person_element.append(new_pet_element)

print()
print('person element with pet element as child')
etree.dump(new_person_element, pretty_print=True) # please note the pet element as a child of the person element 

In [None]:
pet_child = new_person_element.find('pet')
etree.dump(pet_child, pretty_print=True)

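Creating an element and appending it can also be done in one step with **etree.SubElement**, which takes the parent as its first argument. A minimal sketch with hypothetical variable names (`owner_el`, `pet_el`), kept separate from the elements created above:

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# the parent is created with etree.Element, as before
owner_el = etree.Element('person', attrib={'role': 'student'})
owner_el.text = 'Lee'

# SubElement creates the child and appends it to owner_el in one call
pet_el = etree.SubElement(owner_el, 'pet', attrib={'role': 'joy'})
pet_el.text = 'Romeo'

owner_as_text = etree.tostring(owner_el, encoding='unicode')
print(owner_as_text)
```
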
Finally, we add our **new_person_element** to our **root**

In [None]:
our_root.append(new_person_element)

#### Exercise
Please add three new elements to the **root** element.

#### Step c: writing to a file
This is how we can write our self-made XML to a file. Please inspect **Data/xml_data/selfmade_xml.xml** using a text editor to check if it worked.

In [None]:
with open('../Data/xml_data/selfmade_xml.xml', 'wb') as outfile:
    our_tree.write(outfile,
                   pretty_print=True,
                   xml_declaration=True,
                   encoding='utf-8')
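
Besides opening the file in a text editor, you can also check the result by reading it back with **etree.parse**. The sketch below is a hypothetical round trip that writes to a temporary folder, so it does not touch the course data; the `pretty_print` argument is left out so that the fallback import also works when **lxml** is not installed.

```python
import os
import tempfile

try:
    from lxml import etree  # the library used in this chapter
except ImportError:
    # assumption: the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# build a small tree: a root with one child
check_root = etree.Element('Course')
student_el = etree.Element('person', attrib={'role': 'student'})
student_el.text = 'Mózes'  # non-ASCII text, to check the utf-8 encoding
check_root.append(student_el)
check_tree = etree.ElementTree(check_root)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'roundtrip.xml')  # hypothetical file name
    check_tree.write(tmp_path, xml_declaration=True, encoding='utf-8')
    # read the file back while the temporary folder still exists
    reloaded_root = etree.parse(tmp_path).getroot()

reloaded_person = reloaded_root.find('person')
print(reloaded_root.tag, reloaded_person.get('role'), reloaded_person.text)
```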