# Week 14: XPath

So yes, we spent most of last week sorting our heads around how to make Python work with XPath and discussing XML.  Your assigned readings for last week included the W3Schools XPath tutorials, along with an optional refresher on XML.

Let's be clear: even if you have worked with XML before, maybe even taken the metadata class using XML, knowing the precise structure and **names** of the bits and bobs inside of XML will be necessary to wrap your head around what XPath is all about.  

## Readings for this week

I'm going to be doing some demos in this notebook, focusing more on how the Python works and leaving a lot of the XPath narrative to the W3Schools XPath lesson:  https://www.w3schools.com/xml/xpath_intro.asp.  The terminology section is one of the most important, so that might be something worth printing out or taking notes on.  You'll need to know the names of things to understand the later lessons. 

# XPath Basics

XPath statements tend to look a little like URLs, because the core tree structure behind websites and XML documents is about the same.  Philosophically speaking.  Let's take a basic XML snippet:

```XML
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
```

There are numerous ways to describe this structure.

* `<book>` is the root element with two children:  `<author>` and `<title>`.
* The `<author>` element is a child of `<book>` and sibling to `<title>`.  

These descriptions are the basis for how XPath queries are constructed.  So, you don't say "I want the author element within the book element"; it's "Find the book element anywhere in the tree, then get the child element called author."  We can express this statement as such:

`'//book/author'`

Yes, the narrative is much longer than the actual statement, but this is the basis for every advanced XPath query.  We at least think this is correct, but we haven't tested it.  So let's inject this into the Python pattern we saw last week.  This pattern will be a little different because we're working off of a string instead of a file.  There are separate functions to use when reading XML from a file.
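
For comparison, here is a minimal sketch of the file-based version.  It assumes `etree.parse` as the file-reading function, and the file name `book.xml` and the temporary folder are made up just so the example has something to read.

```python
import os
import tempfile
from lxml import etree

xml = """<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

# write the snippet to a throwaway file; "book.xml" is a made-up name for this demo
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "book.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write(xml)

doc = etree.parse(path)   # an ElementTree wrapping the whole document
root = doc.getroot()      # the <book> element, the same kind of object fromstring() returns
authors = doc.xpath('//book/author')

print(root.tag)
print(len(authors))
```

The cell that follows sticks with the string version, since that keeps the XML visible right next to the query.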

In [1]:
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('//book/author'))

[<Element author at 0x106374c08>]


Good news:  it worked!

Bad news:   WTF is this `<Element author...` crap?

# The two parts to every XPath statement

What the `xpath` statement has returned to us is a list containing an `Element` object.  This is a little bundle of processed XML and something that we can act on in smart ways.
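
As a rough illustration of what "act on" can mean, the sketch below pokes at the element with plain Python.  The `.tag` and `.text` properties come from lxml's `Element` class.  They are not what the rest of this notebook uses to extract data, they're just here to show that the object isn't empty.

```python
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
results = tree.xpath('//book/author')   # a list, even when there is only one match

author = results[0]    # pull the single Element out of the list
print(type(results))
print(author.tag)      # the element's name
print(author.text)     # the text directly inside the element
```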

What we need to ask ourselves next is: what do we want to get out of that element?

Each XPath statement should have:

1. A piece to select the right element or elements
2. A piece to extract the data that you want out of those elements

Meaning, why did we look for that element?  What was our purpose?  Did we want an attribute value or the element text?  We need to use additional XPath syntax to actually extract information out of the element.

## Getting text out of element values with /text() and //text()

So this brings us to the second part of nearly every XPath statement:  the data extraction piece.  We've got `'//book/author'`, which will find the element in question.  We need to add `'//text()'` to extract out the actual text.  The `//` part of that says "anywhere beneath this element".  I usually recommend it in case there's additional text in other elements nested inside.  We'll explore this later, just remember that when I use two `/`s that I'm doing so on purpose. 

Sometimes you want to leave these things separate, so you have a two stage query:  find all the author elements and then extract the information out of those elements.  But in many cases we can put everything together in one statement.  We can do that now.

In [2]:
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('//book/author//text()')) 

['Human, A.']


Yay!  We've gotten it.  Now let's explore why we want to do `//text()` instead of `/text()`.

In [3]:
print(tree.xpath('//book/author//text()'))
print(tree.xpath('//book/author/text()'))

['Human, A.']
['Human, A.']


It actually works just fine with our example XML because all the text that I want is under the `<author>` element.  But what if I had sub elements that I can't always expect?  Let's take a look at the title element and tweak it from there.

In [4]:
xml = """
<book>
    <author>Human, A.</author>
    <title>This is <b>not</b> a book</title>
</book>"""


I've changed the `<title>` element to now have a word bolded.  While we know that this is just display markup and the bolding has no real semantic meaning within the data structure, XML doesn't care and sees it just the same as any other element.  Meaning that when I try to get text out with just a single slash, `/text()`, it'll skip over the bolded text.

In [5]:
tree = etree.fromstring(xml)
print(tree.xpath('//book/title/text()')) 

['This is ', ' a book']


We see two things here:

* instead of having one string of text I have two inside of my list.
* the text content inside of the `<b>` element has been omitted.

This means that with `/` it only looks for text one level deep.  Literally, "just the text directly inside of the `<title>` element and do not search any deeper within the child elements."

But I can change this to `//` and let it look deeper.

In [6]:
print(tree.xpath('//book/title//text()')) 

['This is ', 'not', ' a book']


We can see that all the text is there, but now I've got three strings instead of one.  This is a good use case for our `.join` maneuver.  I actively want all this data to be a single value in one of my cells, so I can join it all back together.  The next question being: what should the delimiter be?  I can see that all the white space that belongs in the title is retained, so I can just use my empty string join.

In [7]:
title_list = tree.xpath('//book/title//text()')

title = "".join(title_list)

title

'This is not a book'
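
To see why the delimiter choice matters, here is a small sketch comparing the empty string join with a space join on the same three strings.  The list is typed in by hand to match the output above, so no XML parsing is needed.

```python
pieces = ['This is ', 'not', ' a book']   # what //text() gave us for the title

joined_empty = "".join(pieces)   # relies on the white space already in the pieces
joined_space = " ".join(pieces)  # adds a space on top of the existing ones

print(repr(joined_empty))
print(repr(joined_space))
```

The space join doubles up the white space around "not", which is why the empty string is the better fit here.  Other data may not keep its white space this neatly, so check the pieces before picking a delimiter.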

## In sum about getting element text...

To get text out of an element you need to:

* have an XPath statement that is selecting the right element(s)
* use additional XPath functions to extract the data you want out of that element.
    * You may do this as part of a single XPath query:
        * `'//book/title//text()'`
    * Or as a separate query:

        ```python
        titles_elem_list = tree.xpath('//book/title')
        for title in titles_elem_list:
            print(title.xpath('.//text()'))
        ```

        * The odd `.//` that you have to do is because you've split the queries apart and need to make your second query more specific.  In this case, `.` indicates "start from this element", then the `//` says "look into all descendants within this element".
* sometimes you'll have sub elements in text, particularly text that has HTML formatting within it, or text with additional tags.  You must decide if you want to:
    * look one level deep for text (using `/text()`)
    * or if you want to look at all descendants within that element for text (using `//text()`)
        * this also means that you may have a single string of text (that you as a human understand to be a single string of text) coming in as a list of many strings.  You can use `"".join(list_of_strings)` to concatenate all of those strings back into one.

In [8]:
titles_elem_list = tree.xpath('//book/title')
for title in titles_elem_list:
    print(title.xpath('.//text()'))
    # the leading . keeps this second query inside the current title element

['This is ', 'not', ' a book']
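
The two stage approach earns its keep once there is more than one title.  The sketch below assumes a made-up `<library>` root holding two books (neither the root nor the second book is part of our running example) and shows how the one-shot query mixes the pieces of both titles together, while the two stage version keeps them apart.

```python
from lxml import etree

# hypothetical XML: a <library> wrapper and a second book, invented for this demo
xml = """
<library>
    <book>
        <title>This is <b>not</b> a book</title>
    </book>
    <book>
        <title>Also <i>not</i> a book</title>
    </book>
</library>"""

tree = etree.fromstring(xml)

# one query: every text piece from every title lands in a single flat list
all_pieces = tree.xpath('//book/title//text()')
print(all_pieces)

# two stages: select each title element, then extract and join its own text
titles = []
for title_elem in tree.xpath('//book/title'):
    pieces = title_elem.xpath('.//text()')
    titles.append("".join(pieces))
print(titles)
```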


## Getting text from attributes

Attributes are little pieces of text inside of an element's opening tag.  Sometimes these are uninteresting little pieces of metadata, but sometimes they're the super juicy bit of information that you want.  Let's start with a new example:

```XML
<book id = "42">
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
```

Instead of having a separate element of `<book_id>`, the unique ID for this book is embedded within an attribute of the `<book>` element.  We can now use another XPath syntax tool to say that we want the text of a specific attribute's value.  We do have to know the name of the attribute and (usually) which element it lives in.  Where we might normally put `//text()` after an element's name to get the element text, we can now put `/@attribute_name` there to have it give us that value.

In [9]:
xml = """
<book id = "42">
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('@id')) 

['42']


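This example looks a little odd, and that's because the `id` attribute is within the root element.  `etree.fromstring` hands us that root `<book>` element directly, so `@id` is read straight off of it with no element name in front.  Let's add some more attribute values to play with this syntax more.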

In [10]:
xml = """
<book id = "42">
    <author id = "HumanEntity-0003">Human, A.</author>
    <title lang = "en">This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('@id')) # I'll get the book id
print(tree.xpath('//@id')) # look in the entire tree for instances of an id attribute and give them all to me
print(tree.xpath('author/@id')) # just the id attribute from the author element
print(tree.xpath('title/@lang')) # lang attribute from the title

['42']
['42', 'HumanEntity-0003']
['HumanEntity-0003']
['en']
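
The short paths above, like `@id` and `author/@id`, work because `tree` is the root `<book>` element and those paths are relative to it.  Here is a small sketch of a few equivalent spellings, plus one that looks reasonable but finds nothing.

```python
from lxml import etree

xml = """
<book id = "42">
    <author id = "HumanEntity-0003">Human, A.</author>
    <title lang = "en">This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.tag)   # the element we are standing on

relative = tree.xpath('@id')          # the attribute of the element we are on
absolute = tree.xpath('/book/@id')    # spelled out from the top of the document
anywhere = tree.xpath('//book/@id')   # any book element anywhere in the tree
child = tree.xpath('book/@id')        # a book *inside* book, which does not exist

print(relative, absolute, anywhere, child)
```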


## Using attributes to specify elements

We need a more complex example here.

```XML
<book id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
    <author id = "HumanEntity-0004" role = "other">Popsicle, Meat</author>
    <author id = "HumanEntity-0005" role = "other">Cardassian, Kim</author>
    <title lang = "en">This is not a book</title>
</book>
```

So now we have several `<author>` elements.  We could potentially want only the primary author, only the other authors, or specify the ID of the author.

In [11]:
xml = """<book id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
    <author id = "HumanEntity-0004" role = "other">Popsicle, Meat</author>
    <author id = "HumanEntity-0005" role = "other">Cardassian, Kim</author>
    <author id = "NonHuman-0006">Here, Y. Am. I.</author>
    <title lang = "en">This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('author/@id')) # so we can get all the IDs now
print(tree.xpath('author/@role')) # and all the roles.
# but we can see that not all authors have a role, maybe that has meaning to our schema?
# this is why you need to know the schema you are working with

['HumanEntity-0003', 'HumanEntity-0004', 'HumanEntity-0005', 'NonHuman-0006']
['primary', 'other', 'other']


We can select specific elements based on their attribute values with the syntax `element[@attribute_name = "value"]` or we can select specific elements that have a certain attribute with `element[@attribute_name]`.  Note how we'll need to go back to using our `//text()` tool because this is only helping us select elements, not extract data.

Now that we're getting into the meatier sections of XPath, we can look at some of the functions at our disposal.  We can see that there are two types of author ids:  one for "HumanEntity" and one for "NonHuman".  We can use `element[contains(@attribute_name, "partial match text")]` to select an element based off a partial match for that text.  We'd still need to use `/@attribute_name` or `//text()` to get the data we want out of it.

Watch out! This requires quotes, so you need to be careful: either use `'` to surround the xpath statement and `"` for the quotes inside it, or escape your inner quotes.
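
Here is a quick sketch of the quoting options side by side, on a trimmed-down version of the XML.  The last line uses lxml's XPath variables (a `$name` in the path, filled in by a keyword argument), which this notebook doesn't otherwise rely on.  The variable name `wanted` is made up for the demo.

```python
from lxml import etree

xml = """<book id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
    <author id = "HumanEntity-0004" role = "other">Popsicle, Meat</author>
</book>"""

tree = etree.fromstring(xml)

# single quotes for Python, double quotes inside the XPath
a = tree.xpath('author[@role = "primary"]/text()')
# the other way around works too
b = tree.xpath("author[@role = 'primary']/text()")
# same quote character at both levels, with the inner ones escaped
c = tree.xpath('author[@role = \'primary\']/text()')
# an XPath variable, so the value needs no inner quotes at all
d = tree.xpath('author[@role = $wanted]/text()', wanted='primary')

print(a, b, c, d)
```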

In [12]:
print(tree.xpath('author[@role = "primary"]//text()')) # all with primary role
print(tree.xpath('author[@role = "other"]//text()')) # all with other role

# now doing a partial text match for id values that contain things...

print(tree.xpath('author[contains(@id, "HumanEntity")]//text()')) #grab the element text
print(tree.xpath('author[contains(@id, "NonHuman")]/@id')) # grab the id attribute value text

['Human, Not A.']
['Popsicle, Meat', 'Cardassian, Kim']
['Human, Not A.', 'Popsicle, Meat', 'Cardassian, Kim']
['NonHuman-0006']
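
The other form mentioned earlier, `element[@attribute_name]`, selects on whether the attribute is there at all.  A minimal sketch follows.  The `not()` version is an extra that the text above doesn't cover, added here because it picks out the one author with no role.

```python
from lxml import etree

xml = """<book id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
    <author id = "HumanEntity-0004" role = "other">Popsicle, Meat</author>
    <author id = "HumanEntity-0005" role = "other">Cardassian, Kim</author>
    <author id = "NonHuman-0006">Here, Y. Am. I.</author>
    <title lang = "en">This is not a book</title>
</book>"""

tree = etree.fromstring(xml)

with_role = tree.xpath('author[@role]//text()')          # authors that have a role attribute
without_role = tree.xpath('author[not(@role)]//text()')  # authors that do not

print(with_role)
print(without_role)
```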


# Namespaces

So far we've only been working on pretty plain XML with no namespaces.  This means that we've been able to use plain element names in our xpath statements.  However, many times our metadata will be coming in with specific named schemas, perhaps even multiple.  Namespaces are a complex topic, but we can talk about this in the context of our assignment.

```XML
<book xmlns = "http://ischool.illinois.edu/faketown" id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
    <author id = "HumanEntity-0004" role = "other">Popsicle, Meat</author>
    <author id = "HumanEntity-0005" role = "other">Cardassian, Kim</author>
    <title lang = "en">This is not a book</title>
</book>
```

If you look closely, you'll see that I added a (fake) namespace to our root book element.  This means that our previous paths will fail quietly: in the next cell they come back as empty lists, which tells us that the XPath statements aren't matching anything even though the rest of the code is running just fine.  There are two parts to handling namespaces within the `xpath()` function.  

Part 1:  create a dictionary with an alias for that namespace as the key and, as the value, the URL that matches the URL in the `xmlns` attribute.  Save this dictionary to a variable.  In this case, we'd want something like `{'bk': 'http://ischool.illinois.edu/faketown'}`.  We can give the alias any name we want, but it should be something pretty readable.

Part 2:  Pass that namespace dictionary to the `xpath()` function via this syntax:  `tree.xpath('alias:element/stuff()', namespaces = namespace_dict_variable)`.  Note that the keyword is `namespaces`, plural.

You'll need to use the `alias:` thing before each and every element name in your xpath statement.  Note that I said element name, and not attribute name or xpath functions (e.g. `text()` can stay just that).

In [22]:
xml = """<book xmlns = "http://ischool.illinois.edu/faketown" id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
    <author id = "HumanEntity-0004" role = "other">Popsicle, Meat</author>
    <author id = "HumanEntity-0005" role = "other">Cardassian, Kim</author>
    <author id = "NonHuman-0006">Here, Y. Am. I.</author>
    <title lang = "en">This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('author/@id'))
print(tree.xpath('author/@role'))

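# No error is raised here, we just get empty lists back.
# The XML declares a namespace, but I'm not using it within my xpath,
# so plain names like author no longer match anything.
# (An "Undefined namespace prefix" error shows up when a prefix is used without a namespaces dictionary.)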

[]
[]
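
For contrast, here is a sketch of the case where lxml does complain: using a prefix in the path without handing over a `namespaces` dictionary.  The XML is trimmed to one author, and the `bk` prefix is just the alias chosen for this demo.

```python
from lxml import etree

xml = """<book xmlns = "http://ischool.illinois.edu/faketown" id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
</book>"""

tree = etree.fromstring(xml)

# no prefix at all: no error, just no matches
no_prefix = tree.xpath('author/@id')
print(no_prefix)

# prefix in the path, but no namespaces dictionary passed along
try:
    tree.xpath('bk:author/@id')
    raised = False
except etree.XPathEvalError as err:
    raised = True
    print("XPathEvalError:", err)
```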


In [27]:
ns = {'bk': 'http://ischool.illinois.edu/faketown'}

tree = etree.fromstring(xml)

print(tree.xpath('bk:author/@id', namespaces = ns))
print(tree.xpath('author/@role', namespaces = ns))

['HumanEntity-0003', 'HumanEntity-0004', 'HumanEntity-0005', 'NonHuman-0006']
[]


So why is my second query about author failing?  Well, I forgot the alias piece in my xpath statement.

In [28]:
ns = {'bk': 'http://ischool.illinois.edu/faketown'}

tree = etree.fromstring(xml)

print(tree.xpath('bk:author/@id', namespaces = ns))
print(tree.xpath('bk:author/@role', namespaces = ns))

['HumanEntity-0003', 'HumanEntity-0004', 'HumanEntity-0005', 'NonHuman-0006']
['primary', 'other', 'other']
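
If typing the URL out by hand feels error-prone, one option is to read it off of the parsed element.  The sketch below assumes lxml's `nsmap` property, where a default namespace is stored under the key `None`, and then builds the alias dictionary from it.  The `bk` alias is still our own choice.

```python
from lxml import etree

xml = """<book xmlns = "http://ischool.illinois.edu/faketown" id = "42">
    <author id = "HumanEntity-0003" role = "primary">Human, Not A.</author>
    <author id = "HumanEntity-0004" role = "other">Popsicle, Meat</author>
</book>"""

tree = etree.fromstring(xml)

print(tree.tag)     # the element name with its namespace URL attached
print(tree.nsmap)   # prefix-to-URL mapping; the default namespace sits under None

# build our own alias dictionary from the declared default namespace
ns = {'bk': tree.nsmap[None]}

# the element name gets the alias; the attribute and text() stay plain
primary = tree.xpath('bk:author[@role = "primary"]/text()', namespaces=ns)
ids = tree.xpath('bk:author/@id', namespaces=ns)

print(primary)
print(ids)
```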
