[< __INTRO MODULE 5__](./0.Introduction.ipynb)

---

# Index:
- [Introduction](#introduction)
    - [The most relevant modules of the xml library](#the-most-relevant-modules-of-the-xml-library)
    - [Segmenting an XML document](#segmenting-an-xml-document)
- [Parsing an XML file](#parsing-an-xml-file)
    - [Converting XML file to usable data in Python](#converting-xml-file-to-usable-data-in-python)
    - [Navigating the entire tree structure](#navigating-the-entire-tree-structure)
    - [Accessing elements at a lower level within XML](#accessing-elements-at-a-lower-level-within-xml)
    - [Getting the first element that meets the requirements](#getting-the-first-element-that-meets-the-requirements)
- [Working with a parsed XML file](#working-with-a-parsed-xml-file)
    - [Adding elements](#adding-elements)
    - [Editing elements](#editing-elements)
    - [Deleting elements](#deleting-elements)
    - [Saving changes done](#saving-changes-done)
- [Creating an XML from scratch](#creating-an-xml-from-scratch)
    - [Creating the root element](#creating-the-root-element)
    - [Adding SubElements](#adding-subelements)



---

## Introduction

As we saw in module 4 ([RESTful-APIs](../4.RESTful-APIs/Introduction.ipynb)), `XML` is a format for storing and exchanging data. In Python, the standard-library package for working with data organised in `XML` format is called `xml`.

---

### The most relevant modules of the xml library

- `xml.etree.ElementTree`:
    - Has a very simple API for analyzing and creating XML data.
    - It's an excellent choice for people who have never worked with the Document Object Model (DOM) before.
- `xml.dom.minidom`:
    - Is a minimal implementation of the Document Object Model (DOM).
    - Using the DOM, the approach to an XML document is slightly different, because it's parsed into a tree structure in which each node is an object.
- `xml.sax`:
    - SAX is an acronym for “Simple API for XML”.
    - SAX is an interface in Python for event-driven XML document analysis.
    - Unlike the above modules, analyzing a simple XML document using this module requires more work.
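
To make the difference between the three modules more tangible, here is a minimal sketch that reads the same author name with each of them. The `SAMPLE` document and the `AuthorHandler` class are made up for this illustration; they are not part of the library.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET
import xml.sax

# Hypothetical sample document used only for this comparison
SAMPLE = "<data><book title='Hamlet'><author>William Shakespeare</author></book></data>"

# ElementTree: the document becomes a tree of Element objects
et_author = ET.fromstring(SAMPLE).find("book/author").text

# minidom: the document becomes a DOM tree; text lives in separate text nodes
dom = xml.dom.minidom.parseString(SAMPLE)
dom_author = dom.getElementsByTagName("author")[0].firstChild.data


# SAX: no tree is built; a handler reacts to events while the parser reads
class AuthorHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.inside_author = False
        self.authors = []

    def startElement(self, name, attrs):
        if name == "author":
            self.inside_author = True
            self.authors.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated
        if self.inside_author:
            self.authors[-1] += content

    def endElement(self, name):
        if name == "author":
            self.inside_author = False


handler = AuthorHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
sax_author = handler.authors[0]

print(et_author, dom_author, sax_author, sep="\n")
```

All three variables end up holding the same text; what changes is how much code is needed to get there, which is why the rest of this notebook focuses on `xml.etree.ElementTree`.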

---

### Segmenting an XML document

XML is used by SOAP to transmit messages between systems. It is standardised by the W3C organisation.

Before breaking it down into its parts, here is an example of an XML document:
```xml
    <?xml version="1.0" encoding="ISO-8859-2"?>
    <data>
        <book title="The Little Prince">
            <author>Antoine de Saint-Exupéry</author>
            <year>1943</year>
        </book>
        <book title="Hamlet">
            <author>William Shakespeare</author>
            <year>1603</year>
        </book>
    </data>
```

The parts that make up an XML document are the following:
- __prolog__:
    - The first (optional) line of the document. In the prolog, you can specify the character encoding, e.g., `<?xml version="1.0" encoding="ISO-8859-2"?>` changes the default character encoding (UTF-8) to ISO-8859-2.
- __root element__:
    - The XML document must have exactly one root element that contains all other elements. In the example above, the root element is the `data` tag.
- __elements__: 
    - These consist of opening and closing tags. Elements can contain text, attributes, and other child elements. In the example above, we can find the `book` element with the `title` __attribute__ and two __child elements__ (`author` and `year`).
- __attributes__:
    - These are placed in the opening tags. They consist of key-value pairs, e.g., `title = "The Little Prince"`.
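
As a rough illustration of how these parts show up in Python, the sketch below parses a shortened version of the example as bytes, the way it would be stored in a file saved with the ISO-8859-2 encoding. The name `RAW_XML` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Bytes as they would be stored in a file saved with the ISO-8859-2 encoding
RAW_XML = (
    '<?xml version="1.0" encoding="ISO-8859-2"?>'
    '<data><book title="The Little Prince">'
    '<author>Antoine de Saint-Exupéry</author>'
    '</book></data>'
).encode("iso-8859-2")

# The prolog tells the parser how to decode the bytes
root = ET.fromstring(RAW_XML)
book = root[0]

print(root.tag)      # root element
print(book.tag)      # element
print(book.attrib)   # attributes, as a dictionary
print(book[0].text)  # text of a child element
```

The prolog itself is not exposed as an element; it only influences how the parser reads the rest of the document.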

---

## Parsing an XML file

The following is an example of how to parse an XML document, so that we can work with the XML data in a more Python-like way.

---

### Converting XML file to usable data in Python

The `ElementTree` module provides functions with which to load the XML data.
This can be done through two functions (depending on how we have obtained the data):
- `parse(path)`: Receives as argument the path to the XML file and returns an `ElementTree` instance.
    - Once the instance has been obtained, we must call its `getroot()` method to obtain the root `Element` with which to work with the data.
- `fromstring(str_xml)`: Receives as argument the XML content as a string and returns the root `Element` directly.

In both cases we end up with an instance of `Element`. This instance is the one that allows us to browse through the whole document. It is important to note that the instance represents the [root element](#segmenting-an-xml-document), i.e. the base of the XML tree.
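
The two loading paths can be compared side by side. This is a minimal sketch: the temporary file and its name `example.xml` are assumptions made so that the example is self-contained.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XML_TEXT = "<data><book title='Hamlet' /></data>"

# fromstring() returns the root Element directly
root_from_string = ET.fromstring(XML_TEXT)

# parse() needs a file, so a temporary one is written for the demonstration
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.xml")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as xml_file:
        xml_file.write(XML_TEXT)

    tree = ET.parse(path)            # an ElementTree instance
    root_from_file = tree.getroot()  # the root Element

print(type(tree).__name__, root_from_file.tag, root_from_string.tag)
```

Both roots describe the same document; the only difference is the extra `getroot()` step when starting from a file.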

---

### Navigating the entire tree structure

Through the `Element` instance we can access the data of each record. This allows us to access each `tag` and each `attribute`.

Below is an example of working with the XML seen [above](#segmenting-an-xml-document):

In [1]:
import xml.etree.ElementTree as ET

STRING_XML = """<?xml version="1.0" encoding="ISO-8859-2"?>
<data>
    <book title="The Little Prince">
        <author>Antoine de Saint-Exupéry</author>
        <year>1943</year>
    </book>
    <book title="Hamlet">
        <author>William Shakespeare</author>
        <year>1603</year>
    </book>
</data>
"""

root = ET.fromstring(STRING_XML)
print('The root tag is:', root.tag)
print('The root has the following children:')
for child in root:
    print(child.tag, child.attrib)

The root tag is: data
The root has the following children:
book {'title': 'The Little Prince'}
book {'title': 'Hamlet'}



---

### Accessing elements at a lower level within XML

Child elements can be accessed in the following ways:
- `indexes`: An `Element` behaves like a list of its children, so each child can be reached by position (e.g. `book[0]`).
- `iter()`: Method of `Element` that returns every element with a given tag, searching the whole subtree (children, grandchildren, and so on).
- `findall()`: Method similar to `iter()`, with the difference that, given a plain tag name, only the direct children of the element are returned.
    - __NOTE__: The `findall` method also accepts a (limited) XPath expression. We encourage you to deepen your knowledge of XPath expressions and apply it to the example shown.
- `attrib`: Attribute (not a method) of `Element` that holds a dictionary with all the XML attributes of the element.

#### Example with indexes:

In [6]:
print("My books:\n")
for book in root:
    print('Title: ', book.attrib['title'])
    print('Author:', book[0].text)
    print('Year: ', book[1].text, '\n')


My books:

Title:  The Little Prince
Author: Antoine de Saint-Exupéry
Year:  1943 

Title:  Hamlet
Author: William Shakespeare
Year:  1603 




#### Example with `iter()`:


In [7]:
for author in root.iter('author'):
    print(author.text)


Antoine de Saint-Exupéry
William Shakespeare


#### Example with `findall()`:

In [13]:
for book in root:
    for author in book.findall('author'):
        print(author.text)

Antoine de Saint-Exupéry
William Shakespeare
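
Since `findall()` also accepts path expressions, the nested loop is not strictly necessary. The sketch below rebuilds the library in a compact form (so that it runs on its own) and shows a few expressions from the XPath subset that `ElementTree` supports:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<book title='The Little Prince'><author>Antoine de Saint-Exupéry</author><year>1943</year></book>"
    "<book title='Hamlet'><author>William Shakespeare</author><year>1603</year></book>"
    "</data>"
)

direct = root.findall("author")       # only direct children of <data>: no match
nested = root.findall("book/author")  # authors one level below each book
anywhere = root.findall(".//author")  # any depth, comparable to iter('author')
by_attr = root.findall("book[@title='Hamlet']/year")  # filter on an attribute

print(len(direct), len(nested), len(anywhere))
print(by_attr[0].text)
```

The first search returns an empty list because `author` is not a direct child of the root, which is the main behavioural difference with `iter()`.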


#### Example with `attrib`:

In [21]:
for book in root:
    for key, val in book.attrib.items():
        print(key, ": ", val)    

title :  The Little Prince
title :  Hamlet



---

### Getting the first element that meets the requirements

Another useful feature of `Element` is the `find()` method. It returns the first direct child whose tag matches the requested name, or `None` if there is no match.

An example is shown below:

In [2]:
print(root.find('book').get('title'))

The Little Prince


Note, however, that `find` does not drill down when it receives a plain tag name, i.e. if the `tag` searched for does not belong to a direct child of the element but to a deeper descendant, `find` returns `None`.

The following example uses `find` to search for the author of the book, first starting from the root and then from the book itself:

In [16]:
# As we can see, find cannot find the author when starting from the root of the XML
print(root.find('author'))

# However, if the search is carried out on the specific element where the author is located...
print(root[0].find('author').text)

None
Antoine de Saint-Exupéry
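
When the element we need is not a direct child, `find()` can still reach it if we give it a path instead of a plain tag name. A minimal sketch on a reduced version of the library (rebuilt here so the example is self-contained):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data><book title='The Little Prince'>"
    "<author>Antoine de Saint-Exupéry</author></book></data>"
)

print(root.find("author"))            # None: <author> is not a direct child of <data>
print(root.find("book/author").text)  # explicit path through <book>
print(root.find(".//author").text)    # first match at any depth

# findtext() returns the text directly and accepts a fallback value
print(root.findtext("book/year", default="unknown"))
```

Because `find()` may return `None`, it is worth checking the result before reading `.text` from it, or using `findtext()` with a default as in the last line.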



---

## Working with a parsed XML file

So far we have covered how to read an XML document parsed through `xml.etree.ElementTree`. Now what remains to be understood is how we can edit this same document...

The following points are related to editing XML documents.

---

### Adding elements

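We start by adding new data to the document. The `set` method provided by the `Element` class adds an attribute to an element (or overwrites it if it already exists). New child elements are added with `append()` or `SubElement()` (see [Adding SubElements](#adding-subelements)).


As an example, we will add a `genre` attribute to every book in our XML library: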

In [12]:
from xml.etree.ElementTree import Element  # cElementTree was removed in Python 3.9

def display_book_characteristics(root: Element):
    for number, book in enumerate(root):
        print(f"{book.tag.capitalize()} number {number}".center(50, "*"))
        for key, val in book.attrib.items():
            print(key, ": ", val)
        for characteristics in book:
            print(characteristics.tag, ":", characteristics.text)
        

# Reset the XML to its original state
root = ET.fromstring(STRING_XML)
display_book_characteristics(root)

# For-loop for editing data in the XML file
for book in root:
    book.set("genre", "who knows")
    

display_book_characteristics(root)

******************Book number 0*******************
title :  The Little Prince
author : Antoine de Saint-Exupéry
year : 1943
******************Book number 1*******************
title :  Hamlet
author : William Shakespeare
year : 1603
******************Book number 0*******************
title :  The Little Prince
genre :  who knows
author : Antoine de Saint-Exupéry
year : 1943
******************Book number 1*******************
title :  Hamlet
genre :  who knows
author : William Shakespeare
year : 1603


__NOTE__: These changes exist only in memory; nothing has been saved to an XML document yet.
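
The example above stores the genre as an attribute. If we prefer the genre to be a child element, like `author` and `year`, a possible sketch is shown below. The `genre` and `language` tags and their values are illustrative assumptions, not part of the original library.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data><book title='Hamlet'><author>William Shakespeare</author></book></data>"
)
book = root.find("book")

# SubElement() creates the child and attaches it to the parent in one step
genre = ET.SubElement(book, "genre")
genre.text = "tragedy"

# append() attaches an Element that was built separately
language = ET.Element("language")
language.text = "English"
book.append(language)

print([child.tag for child in book])
print(ET.tostring(book, encoding="unicode"))
```

Both approaches produce ordinary child elements, so they can be read back with `find()`, `findall()` or `iter()` like any other.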

---

### Editing elements

The elements we obtain while iterating over a parsed XML document are editable, that is to say, every change we make to them is applied to the in-memory tree referenced by our variable (`root`).

As an example, and going back to the library document, let's say we want to make the following modifications:
- The `.tag` _book_ will become _movie_.
- The `.text` _year_ of all movies will become 2000.

To achieve this we would need to do the following:

In [15]:
# Reset the XML to its original state
root = ET.fromstring(STRING_XML)
display_book_characteristics(root)

for book in root:
    book.tag = 'movie'
    book.find('year').text = '2000'  # text must be a string to be serializable

display_book_characteristics(root)

******************Book number 0*******************
title :  The Little Prince
author : Antoine de Saint-Exupéry
year : 1943
******************Book number 1*******************
title :  Hamlet
author : William Shakespeare
year : 1603
******************Movie number 0******************
title :  The Little Prince
author : Antoine de Saint-Exupéry
year : 2000
******************Movie number 1******************
title :  Hamlet
author : William Shakespeare
year : 2000
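
One detail worth keeping in mind when editing: `text` and attribute values are expected to be strings. Assigning a number is accepted in memory, but serializing the tree fails afterwards. A small sketch of that behaviour, on a reduced document written for this example:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<data><book title='Hamlet'><year>1603</year></book></data>")
year = root.find("book/year")

year.text = 2000  # accepted in memory...
try:
    ET.tostring(root)  # ...but serialization expects strings
    serialized_ok = True
except TypeError:
    serialized_ok = False

year.text = str(2000)  # converting first keeps the tree serializable
root.find("book").set("title", "Hamlet (2000)")  # attributes are edited with set()

print(serialized_ok)
print(ET.tostring(root, encoding="unicode"))
```

Converting values with `str()` at the moment of assignment avoids discovering the problem only when the document is saved.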



---

### Deleting elements

To remove entries from an XML document we use the `.remove()` method of the `Element` class. It receives the element to delete, which must be a direct child of the element on which the method is called.

> NOTE: An easy way to obtain that argument is the `.find()` method, since it returns the first child with the `tag` we want to remove (which is exactly the element that `.remove()` expects).

As an example, we will go back to our well-known XML library and remove the `year` element from every book.

This can be achieved with the following snippet of code:

In [16]:
# Reset the XML to its original state
root = ET.fromstring(STRING_XML)
display_book_characteristics(root)

for book in root:
    book.remove(book.find('year'))

display_book_characteristics(root)

******************Book number 0*******************
title :  The Little Prince
author : Antoine de Saint-Exupéry
year : 1943
******************Book number 1*******************
title :  Hamlet
author : William Shakespeare
year : 1603
******************Book number 0*******************
title :  The Little Prince
author : Antoine de Saint-Exupéry
******************Book number 1*******************
title :  Hamlet
author : William Shakespeare
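
Two situations are not covered by the example above: a book without a `year`, and removing the books themselves. The following sketch, on a small document made up for the purpose, shows one defensive way to handle both:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<book title='The Little Prince'><year>1943</year></book>"
    "<book title='Hamlet' />"  # this book has no <year>
    "</data>"
)

# remove() raises an error if it receives None, so check what find() returned
for book in root:
    year = book.find("year")
    if year is not None:
        book.remove(year)

# To delete the books themselves, iterate over a copy of the children:
# removing from the element being iterated can skip entries
for book in list(root):
    if book.get("title") == "Hamlet":
        root.remove(book)

print(ET.tostring(root, encoding="unicode"))
```

The `list(root)` copy is the important part of the second loop: it keeps the iteration stable while the tree changes underneath.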



---

### Saving changes done

One thing that should not be forgotten is that changes made to the parsed tree are not written back to the file automatically. It is like making changes in a Word document and not saving them (they will be lost once the document is closed).

To save them we must use the `write` method of the `ElementTree` class (the class of the object returned by `parse()`).

Next we will show an example where we load our beloved library from a file, apply changes and then save the results.

The code is as follows:

In [50]:
import os
import xml.etree.ElementTree as ET

# Getting the file that will be edited
library = ET.parse('./persistance/book.xml')
root = library.getroot()
ET.dump(root)

# Changing one value
for book in root:
    book.tag = 'movie'

# Saving the results into a file
library.write("movies.xml")

# Parsing, showing and deleting the file created
print("Showing changes done".center(50, "*"))
ET.dump(ET.parse("movies.xml"))
os.remove("movies.xml")


<data>
    <book title="The Little Prince">
        <author>Antoine de Saint-Exupery</author>
        <year>1943</year>
    </book>
    <book title="Hamlet">
        <author>William Shakespeare</author>
        <year>1603</year>
    </book>
</data>
***************Showing changes done***************
<data>
    <movie title="The Little Prince">
        <author>Antoine de Saint-Exupery</author>
        <year>1943</year>
    </movie>
    <movie title="Hamlet">
        <author>William Shakespeare</author>
        <year>1603</year>
    </movie>
</data>


If the XML was parsed from a string instead of a file, we do not get an `ElementTree` instance (`fromstring()` returns the root `Element`), but we do not have to generate the file by hand either.

To obtain an `ElementTree` (and with it the `.write()` method we need) we just instantiate it with our root element.

Here is an example:

In [51]:
import os

# Reset the XML to its original state
root = ET.fromstring(STRING_XML)
ET.dump(root)

# Making changes to the XML file
for book in root:
    book.tag = 'movie'

# Saving results
ET.ElementTree(root).write("movies.xml")

# Parsing, showing and deleting the file created
print("Showing changes done".center(50, "*"))
ET.dump(ET.parse("movies.xml"))
os.remove("movies.xml")


<data>
    <book title="The Little Prince">
        <author>Antoine de Saint-Exupéry</author>
        <year>1943</year>
    </book>
    <book title="Hamlet">
        <author>William Shakespeare</author>
        <year>1603</year>
    </book>
</data>
***************Showing changes done***************
<data>
    <movie title="The Little Prince">
        <author>Antoine de Saint-Exupéry</author>
        <year>1943</year>
    </movie>
    <movie title="Hamlet">
        <author>William Shakespeare</author>
        <year>1603</year>
    </movie>
</data>
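
By default, `write()` produces US-ASCII output without a prolog, so non-ASCII characters are stored as numeric character references. If we want a UTF-8 file that starts with an XML declaration, we can request it explicitly. A minimal sketch, where the temporary directory and the file name `library.xml` are assumptions made to keep the example self-contained:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data><book><author>Antoine de Saint-Exupéry</author></book></data>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "library.xml")  # hypothetical file name

    # Ask for UTF-8 output and for the prolog to be written
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

    with open(path, encoding="utf-8") as xml_file:
        saved_text = xml_file.read()

print(saved_text)
```

The saved text begins with the XML declaration and keeps the accented character as-is instead of escaping it.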



---

## Creating an XML from scratch

Finally, we will show how to generate an XML document from scratch, that is, without starting from a file (or string) that we have parsed. This is not very complicated, since it basically consists of instantiating the `Element` class and populating it with subelements.

We build the tree as if it came from a parsed XML document and, once it is populated, save the result with the `.write()` method (accessible through the `ElementTree` class, as we have seen in the [previous section](#saving-changes-done)).

The examples below put this into practice.

---

### Creating the root element

To begin with, we must generate an `Element` where the data of our XML file will be stored.



In [52]:
# Creating a new Element
new_xml = ET.Element('data')

# Printing the results
ET.dump(new_xml)

<data />



---

### Adding SubElements

Once the root element has been created, we add the subelements we need with the `SubElement()` function.

It receives the parent element, the tag of the new subelement and, optionally, a dictionary with its attributes.

In [53]:
# Creating a new Element
new_xml = ET.Element('data')

# Populating the Element with SubElements
ET.SubElement(new_xml, 'movie', {'title': 'The Little Prince', 'rate': '5'})
ET.SubElement(new_xml, 'movie', {'title': 'Hamlet', 'rate': '5'})

# Printing the results
ET.dump(new_xml)

<data><movie title="The Little Prince" rate="5" /><movie title="Hamlet" rate="5" /></data>
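
Subelements can be nested and can carry text as well. The sketch below rebuilds the library used throughout this notebook from a list of dictionaries; the `BOOKS` structure is an assumption made for the example, and `ET.indent()` is only available from Python 3.9 onwards (hence the guard).

```python
import xml.etree.ElementTree as ET

# Hypothetical input data for the example
BOOKS = [
    {"title": "The Little Prince", "author": "Antoine de Saint-Exupéry", "year": 1943},
    {"title": "Hamlet", "author": "William Shakespeare", "year": 1603},
]

new_xml = ET.Element("data")
for entry in BOOKS:
    book = ET.SubElement(new_xml, "book", {"title": entry["title"]})
    ET.SubElement(book, "author").text = entry["author"]
    ET.SubElement(book, "year").text = str(entry["year"])  # text must be a string

# Pretty-printing helper, available from Python 3.9 onwards
if hasattr(ET, "indent"):
    ET.indent(new_xml, space="    ")

xml_text = ET.tostring(new_xml, encoding="unicode")
print(xml_text)
```

From here, `ET.ElementTree(new_xml).write(...)` saves the result to a file in the same way as in the previous section.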



---

[< __INTRO MODULE 5__](./Introduction.ipynb)