## Tutorial 16: Parsing XML

Extensible Markup Language, more commonly known as XML, is a standard
format for structuring documents and information. One particular
application is XHTML, a standard used to describe the content of webpages.

We have already worked a bit with parsing the (X)HTML code returned from
the MediaWiki API using regular expressions. Regular expressions are a
great way to start, but for more extensive use a proper library that
fully parses XML offers much more control and avoids common pitfalls.

In these notes we see how to use the `xml` module to parse the text
returned from Wikipedia. 

### Creating an ElementTree object

It will be easier to understand how to parse XML code using a smaller
example than we would get from Wikipedia. Here is a very simple snippet
of code that contains a title and two paragraphs.

In [None]:
html = """<div>
<h1 class='page'><i>A title in italics</i></h1>
<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>
<p> Another paragraph! In this case I have <a href="https://github.com">a link</a> that <i id='my'>you</i> click on.</p>
</div>
"""

print(html)

We start by reading in the submodule `xml.etree.ElementTree`. By convention,
we'll save it as `ET`.

In [None]:
import xml.etree.ElementTree as ET

Next, use the `fromstring` function to parse the string. It returns an
`Element` object representing the root of the document (here, the `div`).

In [None]:
tree = ET.fromstring(html)
type(tree)

The object has three child elements, corresponding to the three elements nested
directly inside the outer `div`. Children are accessed the same way they would be
in a list: with square brackets and an index. The first child is our `h1` element:

In [None]:
print(tree[0])

The next two elements are paragraph tags:

In [None]:
print(tree[1])
print(tree[2])

Also like a list, we can cycle through the elements with a `for` loop:

In [None]:
for child in tree:
    print(child)

Finally, we can also manually convert the tree to a list:

In [None]:
list(tree)

Typically there is not much reason to manually convert an element
into a list in your final code, but it can be very useful when testing
and debugging.
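As a quick sanity check, the following sketch repeats the steps above in one self-contained block. The names `root` and `child_tags` are only for illustration. It confirms that `fromstring` hands back the root `div` element and that its length equals its number of direct children.

```python
import xml.etree.ElementTree as ET

html = """<div>
<h1 class='page'><i>A title in italics</i></h1>
<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>
<p> Another paragraph! In this case I have <a href="https://github.com">a link</a> that <i id='my'>you</i> click on.</p>
</div>
"""

# fromstring returns the root element of the parsed document
root = ET.fromstring(html)
print(root.tag)    # div
print(len(root))   # 3

# Collect the tag name of each direct child of the root
child_tags = [child.tag for child in root]
print(child_tags)  # ['h1', 'p', 'p']
```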

### Working with XML Elements

Let's take the first element of tree, the title of our document.

In [None]:
child = tree[0]

There are several useful properties given to us by the element.
The `tag` property of the element is a string giving the type
of element.

In [None]:
child.tag

The `attrib` property is a dictionary that holds the attributes (if there are any)
of the XML tag. Looking at the 'h1' element in the example, we see that there is an
attribute named 'class' that's equal to 'page'.

In [None]:
child.attrib

Finally, the property `text` contains the actual text *inside* of the element. 

In [None]:
child.text

You should notice that there is **no** text in the tag. What's going on here?
If you look at the XML input, there is an 'i' tag inside of the 'h1' tag and all
of the text is inside of *this* tag. We can see all of the elements inside of 'h1',
as above, using the `list` function:

In [None]:
list(child)

This child of the child has a tag equal to 'i' (it's the italics tag in HTML):

In [None]:
child[0].tag

But no attributes:

In [None]:
child[0].attrib

However, it **does** have a text property containing the actual text:

In [None]:
child[0].text
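As a compact recap, the sketch below rebuilds just the title element from the example and checks each property in turn. The variable names `h1` and `italic` are only for illustration. Note that `text` is `None` for the `h1` element because no text comes before its first child, and that the `get` method is a convenient way to look up a single attribute.

```python
import xml.etree.ElementTree as ET

# Just the title element from the example snippet
h1 = ET.fromstring("<h1 class='page'><i>A title in italics</i></h1>")
italic = h1[0]

print(h1.tag)           # h1
print(h1.attrib)        # {'class': 'page'}
print(h1.get("class"))  # page
print(h1.text)          # None
print(italic.attrib)    # {}
print(italic.text)      # A title in italics
```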

Let's now work with the first paragraph element:

In [None]:
child = tree[1]
child.tag

As you should expect, it has a 'b' (bold) element inside of it:

In [None]:
list(child)

What happens if we try to grab the text?

In [None]:
child.text

It only contains the text *up to* the 'b' tag, similar to what happened
with the title element. This could be very difficult to work with
if we wanted all of the text inside of a paragraph or other element.
The solution is to use the method `itertext`, which (when converted into a
list) returns all of the text inside of an element, including the text
inside of nested elements.

In [None]:
list(child.itertext())
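To see where these pieces come from, it helps to know that each element also has a `tail` property holding the text that follows its closing tag. The sketch below rebuilds the first paragraph on its own (the names `para`, `bold`, and `pieces` are only for illustration) and assembles the same pieces by hand.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring(
    "<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>"
)
bold = para[0]

# text: before the first child; bold.text: inside <b>; bold.tail: after </b>
pieces = [para.text, bold.text, bold.tail]
print(pieces)

# itertext walks the element and yields the same strings in document order
print(pieces == list(para.itertext()))  # True
```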

The individual strings returned by `itertext` can be combined with the string method `join`:

In [None]:
"".join(child.itertext())

### Loops and XPath Expressions

We now have the basic elements for working with an XML document. If we wanted,
for example, to get a list with one element for each paragraph we could use a
`for` loop and `if` statement:

In [None]:
p = []
for child in tree:
    if child.tag == "p":
        text = "".join(child.itertext())
        p.append(text)
        
p
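The same loop can also be written as a list comprehension, if you prefer a more compact form. This is only a stylistic alternative, sketched here as a self-contained block with the result stored under the illustrative name `paragraphs`.

```python
import xml.etree.ElementTree as ET

html = """<div>
<h1 class='page'><i>A title in italics</i></h1>
<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>
<p> Another paragraph! In this case I have <a href="https://github.com">a link</a> that <i id='my'>you</i> click on.</p>
</div>
"""
tree = ET.fromstring(html)

# Keep only the direct children that are paragraphs and join their text
paragraphs = ["".join(child.itertext()) for child in tree if child.tag == "p"]
print(paragraphs)
```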

For some applications, this approach (cycling through children) is ideal.
One drawback, however, is that it becomes difficult to find elements that
might be buried deeper in the XML tree, such as all of the links
in the document.

A way to address this is to use a notation called an *XPath expression*, which
describes a set of elements in an XML document by their position in the tree.
We won't go into the [full spec](https://www.w3.org/TR/xpath-31/) for XPath
expressions (and `ElementTree` supports only a subset of it), but will
show a few examples that will be most useful.

To use an XPath expression to find nodes in an `ElementTree`, we use the
`findall` method. A simple query starts with './/' (this means
that the match can occur at any depth in the tree) followed by the name
of the tag that you want to find:

In [None]:
list(tree.findall(".//i"))

If you want to find one element directly inside of another, use a `/`. For example,
this finds italics tags that are children of a paragraph:

In [None]:
list(tree.findall(".//p/i"))

Finally, we can specify attributes using square brackets:

In [None]:
list(tree.findall(".//i[@id='my']"))

These will go a long way towards letting us parse information in
the Wikipedia XML output.
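As a recap of the queries in this section, the sketch below runs them on the example snippet and pulls out the text or attribute of each match. It also uses the `find` method, which returns only the first match (or `None` when nothing matches). The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

html = """<div>
<h1 class='page'><i>A title in italics</i></h1>
<p>Here is one paragraph of text with something in a <b>bold</b> font.</p>
<p> Another paragraph! In this case I have <a href="https://github.com">a link</a> that <i id='my'>you</i> click on.</p>
</div>
"""
tree = ET.fromstring(html)

# All italics tags, at any depth
italic_text = [el.text for el in tree.findall(".//i")]
print(italic_text)         # ['A title in italics', 'you']

# Only italics tags that are children of a paragraph
nested_italic_text = [el.text for el in tree.findall(".//p/i")]
print(nested_italic_text)  # ['you']

# All links in the document, reduced to their href attribute
link_targets = [el.get("href") for el in tree.findall(".//a")]
print(link_targets)        # ['https://github.com']

# find returns the first match, or None if there is no match
first_italic = tree.find(".//i")
no_match = tree.find(".//table")
print(first_italic.text)   # A title in italics
print(no_match)            # None
```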

### Wikipedia Application

Let's try to apply what we have now seen to some actual data from Wikipedia. Load
the `wiki` module:

In [None]:
import wiki

assert wiki.__version__ >= 3

And pull up the page on *Paris* (it will be useful to also open
the [Paris page](https://en.wikipedia.org/wiki/Paris) itself).

In [None]:
data = wiki.get_wiki_json("Paris")
html = data['text']['*']
html[:1000]

Now, create an `xml.etree.ElementTree.Element` object named `tree` from the
html data.
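One thing to keep in mind when parsing text from the web: `ET.fromstring` expects well-formed XML and raises `ET.ParseError` otherwise. The sketch below uses a hypothetical helper named `try_parse` and two made-up snippets (not real Wikipedia output) to show the difference between a self-closed `<br/>` tag and an unclosed `<br>` tag.

```python
import xml.etree.ElementTree as ET

def try_parse(text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        print("Could not parse:", err)
        return None

# Well-formed: the line break is self-closed
good = try_parse("<div><p>A line break:<br/></p></div>")

# Not well-formed: <br> is never closed, so </p> is a mismatched tag
bad = try_parse("<div><p>A line break:<br></p></div>")

print(good is not None, bad is None)
```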

Using a for loop, create a list named `p` with one element for each paragraph in `tree` 
containing all of the text in the paragraph.

The element `p[0]` should contain just four newlines. Check to make sure
that `p[1]` matches the first real paragraph on the Wikipedia page.

Using an XPath expression, find all of the 'h2' elements (you do not need to save them).
These correspond to the section headings in the article. 

Inside each of these headers there is a 'span' element of class "mw-headline"
that contains the actual text of the section heading. Write an XPath expression
that grabs these elements and store them in a variable named `headings`:

Now, cycle through the headings, extract the `text` property of each, and append
these to a list named `headings_text`:

Print out the object `headings_text`:

Verify that these headings match those on the page.

Finally, there is a special Wikipedia XML span element of class 'geo'.
The page may contain many of these, but we only need the first, so
use `tree.find` in place of `tree.findall`. In the code below, find this
first element and extract the text:

You should see the string '48.8567; 2.3508'. This is the latitude and longitude of
Paris. We could automate detection of this information to add context to
any page with an associated latitude and longitude.
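As a small step in that direction, here is a minimal sketch that turns such a string into a pair of numbers. The helper name `parse_geo` is hypothetical, and the code assumes the text always has the 'latitude; longitude' form seen above.

```python
def parse_geo(geo_text):
    """Split a 'latitude; longitude' string into a pair of floats."""
    lat_str, lon_str = geo_text.split(";")
    # float() ignores the surrounding whitespace left over from the split
    return float(lat_str), float(lon_str)

latitude, longitude = parse_geo("48.8567; 2.3508")
print(latitude, longitude)  # 48.8567 2.3508
```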