# Topic-5: XML
So far, we have seen several data formats (CSV, TSV, JSON, etc.). This week, we will learn how to work with one of the most popular structured data formats: [XML](http://www.w3schools.com/xml/). XML is used a lot in NLP, so it is important that you know how to work with it.

### At the end of this topic, you will be able to
* read an XML file/string and extract information from it
* create your own XML and write it to a file

### This requires that you already have (some) knowledge about:
* dictionaries
* strings
* lists
* functions
* if elif else statements

### If you want to learn more about these topics, you might find the following links useful:
* [XML](http://www.w3schools.com/xml/)
* [detailed XML introduction](http://www.dfki.de/~uschaefer/esslli09/xmlquerylang.pdf)
* [NAF XML](http://www.newsreader-project.eu/files/2013/01/techreport.pdf)
* [Xpath](http://www.w3schools.com/xml/xpath_syntax.asp)
* Other structured data formats: [JSON-LD](http://json-ld.org/), [MicroData](https://www.w3.org/TR/microdata/), [RDF](https://www.w3.org/RDF/)

## Subtopic: XML
NLP is all about data. More specifically, we usually want to annotate (manually or automatically) textual data with information about:
* [part of speech](https://en.wikipedia.org/wiki/Part_of_speech)
* [word senses](https://en.wikipedia.org/wiki/Word_sense)
* [entities](https://en.wikipedia.org/wiki/Entity_linking)
* [semantic role labelling](https://en.wikipedia.org/wiki/Semantic_role_labeling)
* and many more

What would data look like that contains all this information? Let's look at a simple example:

In [None]:
import nltk

In [None]:
text = nltk.word_tokenize("Tom Cruise is an actor.")
print(nltk.pos_tag(text))

In this example, the format is a list of [tuples](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences). The first element of each tuple is the word and the second element is the part-of-speech tag. So far, this works. However, we also want to indicate that **Tom Cruise** is an entity. Now we start to run into trouble, because some annotations apply to single words and others to combinations of words. In addition, a token sometimes has more than one annotation. Flat formats such as CSV and TSV are not great at **representing** this kind of linguistic information. So is there a format that is better at it? The answer is yes, and the format is XML. Let's look at an example (the line numbers are there for explanation purposes). We deliberately start with a non-linguistic, hopefully intuitive example, and will move to a linguistic example later on.

```xml
1.  <Course>
2.      <person role="coordinator">Van der Vliet</person>
3.      <person role="instructor">Van Miltenburg</person>
4.      <person role="instructor">Van Son</person>
5.      <person role="instructor">Marten Postma</person>
6.      <person role="student">Baloche</person>
7.      <person role="student">De Boer</person>
8.      <animal role="student">Rubber duck</animal>
9.      <person role="student">Van Doorn</person>
10.     <person role="student">De Jager</person>
11.     <person role="student">King</person>
12.     <person role="student">Kingham</person>
13.     <person role="student">Mózes</person>
14.     <person role="student">Rübsaam</person>
15.     <person role="student">Torsi</person>
16.     <person role="student">Witteman</person>
17.     <person role="student">Wouterse</person>
18.     <person role="student">Zuijderduin</person>
19. </Course>
```

#### Elements
Lines 1 to 19 all show examples of [XML elements](http://www.w3schools.com/xml/xml_elements.asp). Each XML element consists of a **start tag** (e.g. ```<person>```) and an **end tag** (e.g. ```</person>```). An element can contain:
* **text**: *Van der Vliet* on line 2
* **attributes**: the *role* attribute on lines 2 to 18
* **elements**: elements can contain other elements, e.g. the *person* elements inside the *Course* element. In this example, we call `person` a `child` of `Course` and `Course` the `parent` of `person`.

#### Root element
A special element is the **root** element. In our example, **Course** is our [root element](https://en.wikipedia.org/wiki/Root_element). The element starts at line 1 (```<Course>```) and ends at line 19 (```</Course>```). Notice the difference between the start tag (no '/') and the end tag (with '/'). The root element is special in that it is the only element without a parent: all other elements are nested inside it.

#### Attributes
Elements can contain [attributes](http://www.w3schools.com/xml/xml_attributes.asp), which hold information about the element. In this case, this information is the role a person has in the course. All attributes are located in the start tag of an XML element.

#### Working with XML in Python
Now that we know the basics of XML, we want to be able to access it in Python. In order to work with XML, we will use the [**lxml**](http://lxml.de/) library.

In [None]:
from lxml import etree

As a next step, we load an XML file from our computer. Please first open the file **course.xml** in the folder **data** in a text editor (e.g. [Atom](https://atom.io/)) to get an idea of what to expect. The **etree.parse** function is used to load XML files.

In [None]:
tree = etree.parse('data/course.xml')
print(type(tree))
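If you do not have **data/course.xml** at hand, the same idea can be tried on XML held in memory. The sketch below is only an illustration: the variable names and the shortened course content are made up for this example. It shows that **etree.parse** also accepts a file-like object and returns a tree, whereas **etree.fromstring** returns the root element directly.

```python
import io
from lxml import etree

# A shortened stand-in for data/course.xml (made-up content, for illustration only)
demo_xml_bytes = b"""<Course>
    <person role="coordinator">Van der Vliet</person>
    <person role="student">Baloche</person>
</Course>"""

# etree.parse accepts a filename or a file-like object and returns a tree object
demo_tree = etree.parse(io.BytesIO(demo_xml_bytes))

# etree.fromstring parses a string or bytes and returns the root element directly
demo_root = etree.fromstring(demo_xml_bytes)

print(type(demo_tree))
print(type(demo_root))
print(demo_tree.getroot().tag, demo_root.tag)
```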

#### Accessing root element
In order to access the root element of an XML file, we can use the **getroot** method. Note that printing the root does not show the XML itself, only a short description of the element object. In order to show the XML of the element, we can use the **etree.dump** function.

In [None]:
root = tree.getroot()
print('root', root)
print()
print('etree.dump example')
etree.dump(root)
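**etree.dump** only prints. If you want the XML as a Python string instead (for example to store it in a variable), **etree.tostring** can be used. This is a small sketch on a made-up one-person course; the names `demo_root` and `xml_as_text` are just for this example.

```python
from lxml import etree

demo_root = etree.fromstring('<Course><person role="student">King</person></Course>')

# etree.tostring returns bytes by default; encoding='unicode' asks for a str instead
xml_as_text = etree.tostring(demo_root, pretty_print=True, encoding='unicode')

print(type(xml_as_text))
print(xml_as_text)
```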

#### Accessing elements
There are several ways of accessing XML elements. The **find** method returns the **first** matching child.

In [None]:
first_person_el = root.find('person')
etree.dump(first_person_el)

In order to get a list of all person children, we can use the **findall** method.
Notice that this does not return the **animal** since we are looking for **person** elements.

In [None]:
all_person_els = root.findall('person')
all_person_els
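It is useful to know what **find** and **findall** return when nothing matches. The sketch below uses a made-up three-child course (not the real **course.xml**) and a tag name, `teacher`, that does not occur in it.

```python
from lxml import etree

# Made-up miniature course, for illustration only
demo_root = etree.fromstring(
    '<Course>'
    '<person role="instructor">Van Son</person>'
    '<animal role="student">Rubber duck</animal>'
    '<person role="student">King</person>'
    '</Course>'
)

first_person = demo_root.find('person')      # first matching child only
all_persons = demo_root.findall('person')    # list of all matching children
no_match = demo_root.find('teacher')         # None when nothing matches
no_matches = demo_root.findall('teacher')    # empty list when nothing matches

print(first_person.text)
print([el.text for el in all_persons])
print(no_match)
print(no_matches)
```

Because **find** can return None, it is a good habit to check the result before calling methods on it.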

Sometimes, we simply want all the children, regardless of their tags. This can be achieved using the **getchildren** method, which returns a list of all children. (The lxml documentation marks this method as deprecated; **list(root)** gives the same result.)
Now we do get the **animal** element again.

In [None]:
all_child_els = root.getchildren()
all_child_els

#### Accessing element information
We will now show how to access the attributes, text, and tag of an element.

The **get** method is used to access the attribute of an element.
If an attribute does not exist, it returns None instead of raising an error.

In [None]:
first_person_el = root.find('person')
role_first_person_el = first_person_el.get('role')
attribute_not_found = first_person_el.get('blabla')
print('role first person element:', role_first_person_el)
print('value if not found:', attribute_not_found)

The **text** of an element is stored in its **text** property (a Python attribute of the element object, not an XML attribute).

In [None]:
print(first_person_el.text)

The **tag** of an element is stored in its **tag** property.

In [None]:
print(first_person_el.tag)
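Two small additions that may come in handy: all XML attributes of an element are available at once through **attrib**, which behaves like a dictionary, and **get** accepts a second argument that is returned when the attribute is missing. The element and the attribute name `age` below are made up for this sketch.

```python
from lxml import etree

demo_el = etree.fromstring('<person role="student">King</person>')

# attrib behaves like a dictionary of all XML attributes of the element
all_attributes = dict(demo_el.attrib)

# get returns None for a missing attribute, or a default value if we provide one
missing = demo_el.get('age')
missing_with_default = demo_el.get('age', 'unknown')

print(all_attributes)
print(missing)
print(missing_with_default)
```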

### Exercise
Can you print the names of all students?

### Creating your own XML
I will now show you how to create your own XML.

#### Step a: Creating an XML object
You create a new XML object by:
* creating the **root** element -> using **etree.Element** 
* creating the main XML object -> using **etree.ElementTree**

You do not have to fully understand how this works. Please make sure you can reuse this code snippet when you create your own XML.

In [None]:
our_root = etree.Element('Course')
our_tree = etree.ElementTree(our_root)
our_root = our_tree.getroot()

We can inspect what we have created by using the **etree.dump** method. As you can see, we only have the root node **Course** currently in our document.

In [None]:
etree.dump(our_root)

As you see, we created an XML object, containing only the root element **Course**.

#### Step b: Creating elements and adding them
We can also create our own XML elements by using the **etree.Element** method:

In [None]:
tag = 'person' # what the start and end tag will be 
attributes = {'role': 'student'} # dictionary of attributes, can be more than one
name_student = 'Lee' # the text of the elements

new_person_element = etree.Element(tag, attrib=attributes)
new_person_element.text = name_student

etree.dump(new_person_element)

In the cell above, I showed an example of how we can create an XML element. It is good practice to check the **type** of the object we created:

In [None]:
print(type(new_person_element))

We learn that we created an instance of the class **lxml.etree.\_Element**. This is no different from creating a **string** or a **list**: we simply created an instance of a class.

We can add children to an XML element using **append**.

In [None]:
tag = 'pet'
attributes = {'role': 'joy'}
name_pet = 'Romeo'

new_pet_element = etree.Element(tag, attrib=attributes)
new_pet_element.text = name_pet

print()
print('our new pet element')
etree.dump(new_pet_element)

# now we will make this element a child of new_person_element
new_person_element.append(new_pet_element)

print()
print('person element with pet element as child')
etree.dump(new_person_element) # please note the pet element as a child of the person element 

In [None]:
pet_child = new_person_element.find('pet')
etree.dump(pet_child)

Finally, we add our **new_person_element** to our **root**

In [None]:
our_root.append(new_person_element)
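Creating an element and appending it can also be done in one step with **etree.SubElement**, which creates the new element directly as a child of an existing one. The sketch below rebuilds the same person-with-pet structure under made-up variable names (`demo_course`, `demo_person`, `demo_pet`), so it does not touch the objects created above.

```python
from lxml import etree

demo_course = etree.Element('Course')

# SubElement creates the element and appends it to the parent in one call
demo_person = etree.SubElement(demo_course, 'person', attrib={'role': 'student'})
demo_person.text = 'Lee'

# attributes can also be passed as keyword arguments
demo_pet = etree.SubElement(demo_person, 'pet', role='joy')
demo_pet.text = 'Romeo'

print(etree.tostring(demo_course, pretty_print=True, encoding='unicode'))
```

Whether you prefer **Element** plus **append** or **SubElement** is mostly a matter of taste; the resulting XML is the same.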

#### Exercise
Please add three new elements to the **root** element.

#### Step c: Writing to a file
This is how we can write our self-made XML to a file. Please inspect **data/selfmade_xml.xml** using a text editor to check if it worked.

In [None]:
with open('data/selfmade_xml.xml', 'wb') as outfile:
    our_tree.write(outfile,
                   pretty_print=True,
                   xml_declaration=True,
                   encoding='utf-8')
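Besides opening the file in a text editor, you can also check the result in Python by reading the file back in. The sketch below does such a round trip in a temporary folder, so it does not depend on the **data** folder; the file name `demo.xml` and the variable names are made up for this example.

```python
import os
import tempfile
from lxml import etree

# Build a tiny tree with a non-ASCII name to check that the encoding survives
demo_root = etree.Element('Course')
demo_person = etree.SubElement(demo_root, 'person', role='student')
demo_person.text = 'Mózes'
demo_tree = etree.ElementTree(demo_root)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.xml')

    # write the tree to disk, in the same way as in the cell above
    with open(demo_path, 'wb') as outfile:
        demo_tree.write(outfile,
                        pretty_print=True,
                        xml_declaration=True,
                        encoding='utf-8')

    # read it back in and look up the name again
    reloaded_root = etree.parse(demo_path).getroot()
    reloaded_name = reloaded_root.find('person').text

print(reloaded_name)
```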

## Subtopic: Extracting linguistic information from XML
### Note: this example is somewhat advanced. Please read through it and try to understand it. It is not necessary that you fully understand every step.
Last year, we organized the [CLIN26 shared task](http://wordpress.let.vupr.nl/clin26/shared-task/) as part of 
[The 26th Meeting of Computational Linguistics in the Netherlands (CLIN26)](http://wordpress.let.vupr.nl/clin26/).
Aside from organizing it, we also wanted to know how the Entity Linker inside the [NewsReader](http://www.newsreader-project.eu/) Dutch NLP pipeline would perform on the shared task.

In order to compete, the team needed to extract the [Entity Linking](https://en.wikipedia.org/wiki/Entity_linking) output from the pipeline. In this subtopic, I will show you how this can be done.

#### Example XML output Entity Linking
Please observe the following element. Try to understand which elements are children/parents of which elements.
```xml
<entity id="e4" type="ORG">
    <references>
        <span>
            <!--General Motors-->
            <target id="t_19"/>
            <target id="t_20"/>
        </span>
    </references>
    <externalReferences>
        <externalRef confidence="1.0" reference="http://nl.dbpedia.org/resource/General_Motors" resource="spotlight_v1"/>
        <externalRef confidence="6.3984197E-25" reference="http://nl.dbpedia.org/resource/General_Motors_Belgium" resource="spotlight_v1"/>
    </externalReferences>
</entity>
```

Above, you see part of a NAF XML file, which contains output from the NewsReader pipeline.
* the **entity** element is the main element.
* the **entity** element has two attributes: **id** (the entity identifier) and **type** (the entity type).
* the first child of the **entity** element is the **references** element. This element tells us that the entity is **General Motors** and that the term *General* is the 19th term in the document and *Motors* the 20th.
* the second child of the entity element is the **externalReferences** element. This shows the output from the system *spotlight_v1*, which tries to link the entity 'General Motors' to [DBpedia](http://wiki.dbpedia.org/) (structured Wikipedia). The system has a confidence of 1.0 (the highest possible value) that the entity refers to http://nl.dbpedia.org/resource/General_Motors and a confidence of 6.3984197E-25 that it refers to http://nl.dbpedia.org/resource/General_Motors_Belgium.
    
In this example, we are interested in extracting the following information:

| What        | Output | Where |
|-------------|--------|-------|
| entity type | `ORG`  | we want to know that **General Motors** is an organisation (ORG). The attribute **type** of the **entity** element provides us this information. |
| DBpedia link with highest confidence | `http://nl.dbpedia.org/resource/General_Motors` | the value of the **reference** attribute with the highest **confidence** value from the **externalRef** elements. |

In order to create the desired output, we will first load the XML element. In this example, I use the **etree.XML** function to show you how you can load an XML element from a string in Python. If you read from a file, always use **etree.parse**.

In [None]:
from lxml import etree

In [None]:
#load the element as XML element.
entity = '''
<entity id="e4" type="ORG">
      <references>
        <span>
          <!--General Motors-->
          <target id="t_19"/>
          <target id="t_20"/>
        </span>
      </references>
      <externalReferences>
        <externalRef confidence="1.0" reference="http://nl.dbpedia.org/resource/General_Motors" reftype="nl" resource="spotlight_v1"/>
        <externalRef confidence="6.3984197E-25" reference="http://nl.dbpedia.org/resource/General_Motors_Belgium" reftype="nl" resource="spotlight_v1"/>
      </externalReferences>
</entity>'''

entity_el = etree.XML(entity)
print(entity_el)
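Before we extract the entity type, here is a small sketch of how the **references** part described above could be read. It uses a shortened copy of the entity under the made-up name `demo_entity_el`. It collects the term identifiers from the **target** elements and reads the comment that contains the entity's words.

```python
from lxml import etree

# Shortened copy of the entity element, for illustration only
demo_entity_el = etree.XML('''
<entity id="e4" type="ORG">
  <references>
    <span>
      <!--General Motors-->
      <target id="t_19"/>
      <target id="t_20"/>
    </span>
  </references>
</entity>''')

# a path with '/' lets find step through several levels at once
span_el = demo_entity_el.find('references/span')

# findall with a tag name only returns matching elements, so the comment is skipped
target_ids = [target_el.get('id') for target_el in span_el.findall('target')]

# comments are nodes too; iter(etree.Comment) selects only the comments
comments = [node.text for node in span_el.iter(etree.Comment)]

print(target_ids)
print(comments)
```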

We write a function to obtain the entity type:

In [None]:
def type_of_entity(entity_el):
    '''
    given an entity element, this function returns the entity type
    (access the value of the attribute 'type')
    If the value is an empty string, or the attribute does not exist, returns '_'
    '''
    entity_type = entity_el.get('type')
    if not entity_type:  # covers both a missing attribute (None) and an empty string
        entity_type = '_'
    
    return entity_type

entity_type = type_of_entity(entity_el)
print(entity_type)

We write a function to get the DBpedia link with the highest confidence:

In [None]:
def dbpedia_link_with_highest_confidence(entity_el):
    '''
    returns the DBpedia link with the highest confidence.
    returns '_' if there is no externalReferences element
    or if it contains no DBpedia links.

    example of the confidences and links that are compared:
    1.0            http://nl.dbpedia.org/resource/General_Motors
    6.3984197e-25  http://nl.dbpedia.org/resource/General_Motors_Belgium
    '''
    max_confidence = 0.0
    max_entity = '_'
    
    # find externalReferences element (None if it does not exist)
    ext_refs_el = entity_el.find('externalReferences')
    if ext_refs_el is None:
        return max_entity
    
    # loop through children of externalReferences element
    for ext_ref_el in ext_refs_el.findall('externalRef'):
        
        #get confidence attribute as a float
        confidence = ext_ref_el.get('confidence')
        float_confidence = float(confidence)
        
        # check if confidence is higher than current maximum
        if float_confidence > max_confidence:
            entity = ext_ref_el.get('reference')
            max_entity = entity
            
            # set max_confidence to the confidence of found entity
            max_confidence = float_confidence
        
    return max_entity
    
highest_link = dbpedia_link_with_highest_confidence(entity_el)
print(highest_link)
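The same result can also be reached by first collecting all (confidence, link) pairs in a list of tuples and then asking Python for the maximum. This is only an alternative sketch, not a replacement for the function above; the names `demo_entity_el`, `links_with_confidence`, and `best_link` are made up for this example.

```python
from lxml import etree

# Shortened copy of the entity element, for illustration only
demo_entity_el = etree.XML('''
<entity id="e4" type="ORG">
  <externalReferences>
    <externalRef confidence="1.0" reference="http://nl.dbpedia.org/resource/General_Motors" resource="spotlight_v1"/>
    <externalRef confidence="6.3984197E-25" reference="http://nl.dbpedia.org/resource/General_Motors_Belgium" resource="spotlight_v1"/>
  </externalReferences>
</entity>''')

# build a list of (confidence, link) tuples
links_with_confidence = [
    (float(ext_ref_el.get('confidence')), ext_ref_el.get('reference'))
    for ext_ref_el in demo_entity_el.findall('externalReferences/externalRef')
]

# pick the tuple with the highest confidence, or '_' if the list is empty
if links_with_confidence:
    best_confidence, best_link = max(links_with_confidence, key=lambda pair: pair[0])
else:
    best_link = '_'

print(links_with_confidence)
print(best_link)
```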

## Subtopic: Xpath
### This is an even more advanced topic. You can safely ignore it. For those interested, XPath lets you search through XML with much shorter code
Is what I have just shown the nicest way to work with XML? Probably not.
In this subtopic, I'll show some examples that make it easier to work with XML. The query language [XPath](https://nl.wikipedia.org/wiki/XPath) is a big part of this.

### Search in deeper levels

#### BEFORE

In [None]:
ext_refs_el = entity_el.find('externalReferences')
if ext_refs_el is not None:
    for ext_ref_el in ext_refs_el.findall('externalRef'):
        print(ext_ref_el)

#### AFTER

Select all externalRef elements that are children of externalReferences

In [None]:
for ext_ref_el in entity_el.xpath('externalReferences/externalRef'):
    print(ext_ref_el)

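You can also use '//', which selects all matching elements anywhere in the document, no matter how deeply they are nested. Note that a query starting with '//' always searches the whole document, even when you call **xpath** on an element deeper in the tree; start the query with './/' to search only below the current element. Here **entity_el** is the root of our small document, so both give the same result.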

In [None]:
for ext_ref_el in entity_el.xpath('//externalRef'):
    print(ext_ref_el)
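The difference between '//' and './/' only becomes visible when the element you call **xpath** on is not the root. The sketch below uses a made-up document with two entities (the `entities` wrapper and the references `A` and `B` are invented for this example) and runs both queries on the second entity.

```python
from lxml import etree

# Made-up document with two entities, for illustration only
demo_doc = etree.XML('''
<entities>
  <entity id="e1">
    <externalReferences><externalRef reference="A"/></externalReferences>
  </entity>
  <entity id="e2">
    <externalReferences><externalRef reference="B"/></externalReferences>
  </entity>
</entities>''')

second_entity = demo_doc.findall('entity')[1]

# '//' starts at the top of the document, so it also finds the reference of e1
whole_document = [el.get('reference') for el in second_entity.xpath('//externalRef')]

# './/' starts at the current element, so it only finds what is inside e2
below_element = [el.get('reference') for el in second_entity.xpath('.//externalRef')]

print(whole_document)
print(below_element)
```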

### Search on attribute values
Let's say we only want the output from the **spotlight_v1** system.

#### BEFORE

In [None]:
ext_refs_el = entity_el.find('externalReferences')
if ext_refs_el is not None:
    for ext_ref_el in ext_refs_el.findall('externalRef'):
        system = ext_ref_el.get('resource')
        if system == 'spotlight_v1':
            etree.dump(ext_ref_el)

#### AFTER

In [None]:
xpath_query = 'externalReferences/externalRef[@resource="spotlight_v1"]'
for ext_ref_el in entity_el.xpath(xpath_query):
    etree.dump(ext_ref_el)
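XPath can also return attribute values directly, by ending the query with '@' followed by the attribute name. In that case **xpath** returns a list of strings instead of a list of elements. The sketch below uses a shortened copy of the entity under the made-up name `demo_entity_el`.

```python
from lxml import etree

# Shortened copy of the entity element, for illustration only
demo_entity_el = etree.XML('''
<entity id="e4" type="ORG">
  <externalReferences>
    <externalRef confidence="1.0" reference="http://nl.dbpedia.org/resource/General_Motors" resource="spotlight_v1"/>
    <externalRef confidence="6.3984197E-25" reference="http://nl.dbpedia.org/resource/General_Motors_Belgium" resource="spotlight_v1"/>
  </externalReferences>
</entity>''')

# select the reference attribute of every externalRef produced by spotlight_v1
xpath_query = 'externalReferences/externalRef[@resource="spotlight_v1"]/@reference'
references = demo_entity_el.xpath(xpath_query)

# attributes of the element itself can be selected in the same way
entity_types = demo_entity_el.xpath('@type')

print(references)
print(entity_types)
```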

Is this everything? No, XPath can do so much more. Please take a look at [this tutorial](http://www.w3schools.com/xml/xpath_syntax.asp) to learn more.