# Parsing XML Files

[XML](https://www.w3.org/XML/), Extensible Markup Language, is a markup language much like HTML.
It is a simple and flexible data format that defines a set of rules for encoding documents in a way that 
is both human and machine readable. As a self-descriptive markup language, XML plays an important role in many information systems. It stores data in plain text format, which provides a platform-independent way of storing, transporting, and sharing data. In this chapter we are going to learn how to parse and extract data from XML files with Python.


First and foremost, you will need to have some basic understanding about XML.
There are a lot of good introductory materials freely available online. 
We suggest the following two sections of Chapter 12 in "**Dive Into Python 3**":
* [12.2 A 5-Minute Crash Course in XML](http://www.diveintopython3.net/xml.html#xml-intro) 📖
* [12.3 The Structure Of An Atom Feed](http://www.diveintopython3.net/xml.html#xml-structure) 📖

If you are quite familiar with XML, you can skip the above materials and jump directly into the parsing sections.

XML files are not as easy to preview and understand as CSV or JSON files.
The data we are going to parse is the XML version of the "Melbourne bike share" dataset downloaded from
[data.gov.au](https://data.melbourne.vic.gov.au/Transport-Movement/Melbourne-bike-share/tdvh-n9dv).

Let's first open the file in your favorite editor to preview it. Note that it is always necessary to inspect the file before we parse it, as the inspection can give an idea of what the format of the file is, what information it stores, etc. If you scroll through the opened file, you will find that the data has been encompassed in XML syntax, using things called tags. The following figure shows a snippet of the data.

![](./xml_example.png)

After inspecting the file, you should find that data values can be stored in two places in an XML file, which are:
* in between two tags, for example, 
    ```html
        <featurename>Harbour Town - Docklands Dve - Docklands</featurename>
    ```
    where the value is "Harbour Town - Docklands Dve - Docklands" for the `<featurename>` tag.
* as an attribute of a tag, for example
    ```html
        <coordinates human_address="{&quot;address&quot;:&quot;&quot;,&quot;city&quot;:&quot;&quot;
        ,&quot;state&quot;:&quot;&quot;,&quot;zip&quot;:&quot;&quot;}" 
        latitude="-37.814022" longitude="144.939521" needs_recoding="false"/>
    ```
    where the value of latitude is -37.814022 and longitude is 144.939521. 
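
The two storage locations can be read back with a few lines of Python. This is a minimal sketch using a hand-written fragment modelled on the snippets above (the attribute list is shortened, and the `<row>` wrapper is added only for illustration); the parsing functions themselves are introduced in Section 1.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment modelled on the snippets above (shortened for illustration).
snippet = """<row>
    <featurename>Harbour Town - Docklands Dve - Docklands</featurename>
    <coordinates latitude="-37.814022" longitude="144.939521" needs_recoding="false"/>
</row>"""

row = ET.fromstring(snippet)

# A value stored between two tags is read from the element's .text property.
name = row.find("featurename").text

# Values stored as attributes are read from the element's .attrib dictionary.
coords = row.find("coordinates").attrib

print(name)
print(coords["latitude"], coords["longitude"])
```

Note that both kinds of values come back as strings; any conversion to numbers is up to you.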

The attributes in XML store rich information about a specific tag.
Comparing XML with JSON, you will find that the XML tags and attributes hold data in 
a similar way to the JSON keys. 
The advantage of XML is that each tag can hold more than one attribute, so
several values can be stored in one node. See the `coordinates` tag above.

Now, how can we extract data stored either in between tags or as attributes?
The goal is to parse the XML file, extract the relevant information, and store it in a Pandas DataFrame that looks like this:
![](./parsed_xml.png)

In the following sections, we will demonstrate the process of loading and exploring an XML file, extracting
data from the XML file and storing the data in a Pandas DataFrame.
* * * 

## 1. Loading and Exploring an XML file

Python can parse XML files in many ways.
You can find several Python libraries for parsing XML from 
["XML Processing Modules"](https://docs.python.org/2/library/xml.html).
Here we will show you how to use the following Python libraries
to parse our XML file.
* ElementTree
* lxml
* Beautiful Soup

There are a couple of good materials worth reading:
* The official ElementTree [API](https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree) documentation, which provides not only the API reference but also a short tutorial on using ElementTree. 📖
* [Parsing XML](http://www.diveintopython3.net/xml.html#xml-parse), Section 12.4 in Chapter 12 of "**Dive Into Python 3**", does a good job of walking through the process of parsing an example XML file with ElementTree. 📖

If you are a visual learner, we suggest the following YouTube video
* [Parsing XML files in Python](https://www.youtube.com/watch?v=c2qlCZhkwtE)

We strongly suggest that you read these materials, although we are going to reproduce some of their content
along with our own XML file.

Let's start with ElementTree. 
There are several ways to import the data, depending on how the data is stored.
Here we will read the file from disk.

In [None]:
import xml.etree.ElementTree as etree    
tree = etree.parse("./Melbourne_bike_share.xml")  

In the ElementTree API, an element object is designed to store data in a hierarchical structure according to the XML tag structure.
Each element has a number of properties associated with it, for example, a tag, a text string,
a set of attributes and a set of child elements.
The <font color="blue">parse()</font> function is one of the entry points of the ElementTree library.
It parses the entire XML document at once into an ElementTree object that contains a hierarchy of Element objects. 
See ["How ElementTree represents XML"](http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html). 📖
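
`parse()` is not the only entry point. When the XML is already held in memory as a string, `fromstring()` can be used instead. The sketch below uses a made-up one-record document (the content of `xml_text` is an assumption, not taken from our file) to show that `fromstring()` returns the top-level Element rather than an ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory document; not the real bike share data.
xml_text = "<response><row><row><id>2</id></row></row></response>"

# fromstring() returns the top-level Element directly, not an ElementTree.
root_from_text = ET.fromstring(xml_text)
print(root_from_text.tag)

# Wrap the element in an ElementTree if the document-level object is needed.
tree_from_text = ET.ElementTree(root_from_text)
print(tree_from_text.getroot() is root_from_text)
```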

The first element in every XML document is called the root element,
and an XML document can only have one root.
However, the returning ElementTree object is not the root element. 
Instead, it represents the entire document.
To get a reference to the root element, call the <font color="blue">getroot()</font> method.

In [None]:
root = tree.getroot()     
root.tag

As expected, the root element is the <font color='orange'>response</font> element. See the original XML file.
You can also check the number of children of the root element by typing
```python
    len(root)
```
It returns 1. To get the only child, you could call the <font color="blue">getchildren()</font> method,
but on older Python versions this produces a warning message
that looks like 
```python
    /Users/land/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: DeprecationWarning: This method 
    will be removed in future versions.  Use 'list(elem)' or iteration over elem instead.
    from ipykernel import kernelapp as app.
```
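This is because the method has been deprecated since Python 2.7; it was removed altogether in Python 3.9, where calling it raises an `AttributeError`.
Use `list(root)` or iterate over the element instead: an element acts like a list in the ElementTree API.
The items of the list are the element’s children.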
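
The list-like behaviour can be tried out on a small stand-in document. The XML string below is an assumption that only mimics the outer shape of our file (a `response` root, one `row` wrapper, and `row` records inside it).

```python
import xml.etree.ElementTree as ET

# Small stand-in document with the same outer shape as our file (an assumption).
demo_root = ET.fromstring(
    "<response><row><row><id>2</id></row><row><id>4</id></row></row></response>"
)

# list(elem) is the replacement suggested by the deprecation warning.
children = list(demo_root)
print(len(children), children[0].tag)

# len(), indexing and slicing work on an element as they do on a list.
records = demo_root[0]
first_two_ids = [rec.find("id").text for rec in records[:2]]
print(len(records), first_two_ids)
```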

In [None]:
for child in root:           
    print (child)

The <font color='orange'>root</font> list only contains its direct child elements. The children of each entry in the list are not included. 

Each element can also have its own set of attributes. The <font color="orange">attrib</font> property of an element is a mutable 
Python dictionary. 
Does the root have attributes? Let's check it out.

In [None]:
root.attrib

It returns an empty dictionary. 
So far, the element tree seems to be empty.
Now you need to <font color='red'>either examine the original XML to discover the structure,
or further traverse the element hierarchy by iteratively printing out all the elements and 
the data contained therein</font>.
The <font color='orange'>root</font> element has only one child.
It can be accessed by index, for example:
```python
    root[0]
```
A `for` loop can be used to print out all the children of <font color='orange'>root[0]</font>.

In [None]:
print ("the total number of rows: ", len(root[0]))

In [None]:
for child in root[0]:
    print (child)

The tag of each child is the same, called 'row', which stores information about one bike station.
Let's keep on retrieving the children of these rows. Instead of doing that for 
all the rows, we retrieve the children of <font color="orange">root[0][0]</font> and that should correspond to the first record.

In [None]:
for child in  root[0][0]:
    print (child)

Fortunately, the tags of the retrieved child elements correspond to the column names in the DataFrame.
Thus, all the tags storing the data we want have been found. 
To confirm this, you can inspect the original XML file 
or simply look at the figure shown in the introduction. 

Another way of exploring the element hierarchy is to use the iteration method of ElementTree, `iter()`.
The iterator loops over all elements in the tree in document (depth-first) order.
For each element, the code below prints its tag, its text, and its dictionary of attributes.

In [None]:
for elem in tree.iter():
    print (elem.tag, elem.text, elem.attrib)
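
To see what document order means, here is a minimal sketch on a four-element tree. The tag names `a` to `d` are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# <a> contains <b> and <d>; <b> contains <c>.
demo = ET.fromstring("<a><b><c/></b><d/></a>")

# iter() starts at the element itself and descends depth-first,
# so <c> is visited before its "uncle" <d>.
order = [elem.tag for elem in demo.iter()]
print(order)  # ['a', 'b', 'c', 'd']
```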

Besides ElementTree, there are other Python libraries that can be used to parse XML files.
Here we show two of them, which are **`lxml`** and **`BeautifulSoup`**.

### 1.1 The lxml package
[**`lxml`**](http://lxml.de) is an open source third-party library that builds on top of two C libraries, 
libxml2 and libxslt.
Its API is mostly compatible with the well-known ElementTree API, but extends it with features such as full XPath support.
To study **`lxml`** in detail, you should refer to:
* [the lxml.etree tutorial](http://lxml.de/tutorial.html), a tutorial on XML processing with lxml.etree.
* and [Going Further With lxml](http://www.diveintopython3.net/xml.html#xml-lxml), Section 12.6 in Chapter 12 of "**Dive into Python 3**". 📖 

Here we are going to briefly show you how to extract the text content of an element tree
using **XPath**.
**XPath** allows you to extract the separate text chunks into a list:

In [None]:
from lxml import etree as letree  # alias avoids shadowing the ElementTree `etree` imported earlier
ltree = letree.parse("./Melbourne_bike_share.xml")
for el in ltree.xpath('descendant-or-self::text()'):
    print (el)

In the <font color='blue'>xpath()</font> function,
the <font color='orange'>descendant-or-self::</font> is an axis selector that limits the search to the context node, its children, their children, and so on out to the leaves of the tree. The <font color = 'blue'>text()</font> function selects only text nodes, discarding any elements, comments, and other non-textual content. The return value is a list of strings.
Read [XPath processing](http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/xpath.html) 📖 for a short introduction
to `xpath` and [W3C's website on XPath](http://www.w3.org/TR/xpath/) for a detailed introduction to XPath.
Note that <font color='blue'>lxml</font> is significantly faster than the built-in <font color='blue'>ElementTree</font> library at parsing large XML documents.
If your XML files are very large, you should consider using <font color='blue'>lxml</font>.

### 1.2 The Beautiful Soup Package
[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is another Python library for pulling data out of HTML and XML files. It provides Pythonic idioms for iterating, searching, and modifying the parsed tree.
We begin by reading in our XML file and creating a Beautiful Soup object with the BeautifulSoup constructor. For the assessment, we suggest using Beautiful Soup.

In [None]:
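from bs4 import BeautifulSoup
# Use a context manager so the file handle is closed once parsing is done.
with open("./Melbourne_bike_share.xml") as xml_file:
    btree = BeautifulSoup(xml_file, "lxml-xml")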

There are two different ways of passing an XML document into the BeautifulSoup constructor.
One is to pass in a string; the other is to pass in an open filehandle. The above example follows the second approach.
The second argument is the parser to be used to parse the document.
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different, so different parsers may create different parse trees from the same document.

In [None]:
print(btree.prettify())

The soup object contains all of the XML content in the original document.
The XML tags contained in the angle brackets provide structural information (and sometimes formatting).
If you were to take a moment to print out the parsed tree, you would find Beautiful Soup did a good job.
It provides a structural representation of the original XML document. 
Now it is easy for you to eyeball the document and the tags or attributes containing the data we want. <font color="red">We will stop here and leave the extraction of the data with Beautiful Soup as a simple exercise for you.</font>
The documentation of how to use Beautiful Soup can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

* * *

## 2. Extracting XML data into DataFrame
So far we have loaded XML into an element tree and have also found all the tags that contain the data we want. 
We have worked with our XML file in a top-down fashion, starting with the root element, 
then getting its child elements, and so on. 
We have also gained a brief idea of **lxml** and **Beautiful Soup**.
This section will show you how to extract the data from all the tags and put it into a Pandas DataFrame, a common
and standard storage structure we used in the previous chapter. 
This structure will also be used in the following chapters. 
Before we walk through the extracting process, please read: 
* [Searching For Nodes Within An XML Document](http://www.diveintopython3.net/xml.html#xml-find) Section 12.5 in Chapter 12 of "**Dive into Python 3**". 📖 

Let's first just look at one tag, i.e., '*featurename*'.
Since we don't know where it is, the code should loop over all the elements in the tree.
To produce a simple list of the featurenames, the logic could be simplified using 
`findall()` to look for all the elements with tag name '*featurename*'.
Both the ElementTree and the Element classes implement a `findall(match)` method.
The one implemented by the ElementTree class finds all the matching subelements starting from the root.
The one implemented by the Element class finds the matching subelements starting from a given Element in the tree.
All the matched elements are returned in a list.
The `match` argument is either a tag name or a path to specific tags.
Try 
```python
    tree.findall('featurename')
```
and 
```python
    tree.findall('row/featurename')
```
What did you get?

The '*featurename*' tag is neither a child nor a grandchild of the root element.
In order to get all the '*featurename*' elements, 
we should first figure out the path from the root to the '*featurename*' tag.
By looking at the original file, or based on what we learnt in the previous section, we know the path is
```html
    row/row/featurename
```
Thus,

In [None]:
elements = tree.findall('row/row/featurename')
elements

The above list should contain 50 Elements corresponding to '*featurename*'.
As you may notice, the items returned by <font color="blue">findall()</font> are Element objects, each representing a node in the
XML parse tree. 
What we want is the data stored in those objects.
To pull out the data, we can access the element properties: tag, attrib and text.

In [None]:
featurename = [elem.text for elem in elements]
featurename

You might wonder whether there is another way to extract the text stored in the '*featurename*' tag.
The structure of an XML file may be quite complex (more complex than our example XML file), 
so that it is not easy to figure out the path. 
There are other ways to search for descendant elements, i.e., children, grandchildren, 
and any element at any nesting level. 
Using the same function, <font color = 'blue'>findall()</font>, we can construct an XPath argument to look for all
'*featurename*' elements.

In [None]:
tree.findall('.//featurename')

It is very similar to the previous example, except for the two forward slashes at the beginning of the query.
The two forward slashes are short for <font color='orange'>/descendant-or-self::node()/</font>. 
Here <font color='orange'>.//featurename</font> selects any 'featurename' element in the XML document. 
Similarly, we can extract the text with <font color='orange'>Element.text</font>.
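
The difference between the three kinds of `match` argument is easy to check on a small stand-in document. The sketch below assumes the same nesting as our file (`response` / `row` / `row` / `featurename`); the station names are invented.

```python
import xml.etree.ElementTree as ET

# Stand-in document with the same nesting as our file; the names are made up.
doc = ET.fromstring(
    "<response><row>"
    "<row><featurename>Station A</featurename></row>"
    "<row><featurename>Station B</featurename></row>"
    "</row></response>"
)

# A bare tag name only matches direct children of the starting element.
direct = doc.findall("featurename")

# A full path walks exactly one level per step.
by_path = doc.findall("row/row/featurename")

# './/' searches all descendants, whatever their depth.
anywhere = doc.findall(".//featurename")

print(len(direct), len(by_path), len(anywhere))
print([elem.text for elem in anywhere])
```

The descendant search is convenient when the path is unknown, while the explicit path guards against picking up elements with the same tag name elsewhere in the document.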

Remember that to visit the elements in the XML document in order, 
you can use <font color='blue'>iter()</font> to create an iterator that iterates over all the Element instances in a tree.
We have shown you how to explore the element hierarchy with this iteration function.
Here you are going to learn how to find specific elements.
[ElementTree's API](https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall)
shows that the <font color='blue'>iter()</font> method can take an argument <font color='blue'>tag</font>.
If the tag is specified, the iterator still loops over all elements in the tree but yields 
only those elements having the specified tag.

In [None]:
featurename = [] 
for elem in tree.iter(tag='featurename'):
    featurename.append(elem.text) 
featurename

The code pulls out data from all elements with a tag equal to '*featurename*', and stores the text in a list.
Similarly, you can retrieve data from elements having the following tags: 'id', 'terminalname', 'nbbikes',
'nbemptydoc', and 'uploaddate' as follows. Note that we only print out the first 10 records of the retrieved data.

In [None]:
id = [] 
for elem in tree.iter(tag='id'):
    id.append(elem.text) 
id[:10]

In [None]:
terminalname = []
for elem in tree.iter(tag='terminalname'):
    terminalname.append(elem.text) 
terminalname[:10]

In [None]:
nbbikes = []
for elem in tree.iter(tag='nbbikes'):
    nbbikes.append(elem.text)  
nbbikes[:10]

In [None]:
nbemptydoc  = []
for elem in tree.iter(tag='nbemptydoc'):
    nbemptydoc.append(elem.text) 
nbemptydoc[:10]

In [None]:
uploaddate = []
for elem in tree.iter(tag='uploaddate'):
    uploaddate.append(elem.text)  
uploaddate[:10]

As mentioned in the introduction section, latitudes and longitudes
are stored as attributes in 'coordinates' elements. 
Extracting them requires accessing the specific attributes that correspond
to latitude and longitude.
Recall that an element's attributes are held in a dictionary, <font color="orange">attrib</font>. 
To extract a specific attribute value, you can use 
square brackets along with the attribute name as the key to obtain
its value.
Let's first extract all the latitudes and longitudes and store them in two lists,
"lat" and "lon" respectively.

In [None]:
lat = []
lon = []
for elem in tree.iter(tag='coordinates'):
    lat.append(elem.attrib['latitude'])
    lon.append(elem.attrib['longitude'])
print (lat[0:10])
print (lon[0:10])
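
Square-bracket access raises a `KeyError` when an attribute is missing, and the extracted values are strings. As a hedged alternative, the sketch below uses `Element.get()`, which returns `None` for a missing attribute, and converts the values to floats. The two-element document and its second, incomplete `coordinates` element are invented to show the missing-attribute case.

```python
import xml.etree.ElementTree as ET

# Invented data: the second element deliberately lacks a longitude.
demo = ET.fromstring(
    "<rows>"
    '<coordinates latitude="-37.814022" longitude="144.939521"/>'
    '<coordinates latitude="-37.817523"/>'
    "</rows>"
)

points = []
for elem in demo.iter("coordinates"):
    lat_text = elem.get("latitude")    # None if the attribute is absent
    lon_text = elem.get("longitude")
    points.append((
        float(lat_text) if lat_text is not None else None,
        float(lon_text) if lon_text is not None else None,
    ))

print(points)
```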

The last step is to store the extracted data in a Pandas DataFrame.
There are multiple ways of constructing a DataFrame object. 
Here you are going to generate a DataFrame by passing a Python dictionary to DataFrame's constructor
and setting the index to IDs.

In [None]:
import pandas as pd 
dataDict = {}
dataDict['Featurename'] = featurename
dataDict['TerminalName'] = terminalname
dataDict['NBBikes'] = nbbikes
dataDict['NBEmptydoc'] = nbemptydoc
dataDict['UploadDate'] = uploaddate
dataDict['lat'] = lat
dataDict['lon'] = lon
df = pd.DataFrame(dataDict, index = id)
df.index.name = 'ID'
df.head()

Older versions of Pandas sorted the columns alphabetically by column name when a DataFrame was built from a dictionary; recent versions running on Python 3.6 or later keep the dictionary's insertion order instead. 
Either way, you can change the order and make the DataFrame look exactly the same as the one you got in Chapter 2. It can be easily done by specifying the value of the `columns` argument in the DataFrame constructor.
As a simple exercise, you are going to tidy up the dates, as we have done in Chapter 2.
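
Collecting one list per tag, as we did above, relies on every record containing every tag; if a tag were missing from one record, the lists would silently fall out of alignment. A possible alternative, sketched here on invented data with the same nesting as our file, is to walk the records one by one and build a dictionary per record. The tag subset in `text_tags` and the station values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Invented records; the second one deliberately has no <nbbikes> element.
demo = ET.fromstring(
    "<response><row>"
    "<row><id>2</id><featurename>Station A</featurename><nbbikes>5</nbbikes>"
    '<coordinates latitude="-37.81" longitude="144.93"/></row>'
    "<row><id>4</id><featurename>Station B</featurename>"
    '<coordinates latitude="-37.82" longitude="144.95"/></row>'
    "</row></response>"
)

text_tags = ["id", "featurename", "nbbikes"]  # illustrative subset of the tags

records = []
for row in demo.findall("row/row"):
    # findtext() returns None when the tag is absent from this record.
    record = {tag: row.findtext(tag) for tag in text_tags}
    coords = row.find("coordinates")
    record["lat"] = coords.get("latitude") if coords is not None else None
    record["lon"] = coords.get("longitude") if coords is not None else None
    records.append(record)

for record in records:
    print(record)
```

A list of dictionaries like `records` can also be passed to the DataFrame constructor, with missing values showing up as empty cells rather than shifting the rows.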
* * *

## 3. Summary
In this chapter we have shown you the process of extracting data from XML files with Python built-in libraries and briefly introduced two open source third-party libraries, i.e., lxml and Beautiful Soup. Together with the previous
chapter, you have learnt how to extract data from CSV, JSON and XML files, 
the three common machine-readable formats. 
Being able to handle these formats with Python and related libraries is one of the must-have skills for a 
data wrangler. 
* * *