# Week 14: XPath

So yes, we spent most of last week sorting our heads around how to make Python work with XPath and discussing XML.  Your assigned readings for last week included the W3Schools XPath tutorials, along with an optional refresher on XML.

Let's be clear: even if you have worked with XML before, maybe even taken the metadata class using XML, knowing the precise structure and **names** of the bits and bobs inside of XML will be necessary to wrap your head around what XPath is all about.  

## Readings for this week

I'm going to be doing some demos in this notebook, focusing more on how the Python works and leaving a lot of the XPath narrative to the W3Schools XPath lesson:  https://www.w3schools.com/xml/xpath_intro.asp.  The terminology section is one of the most important, so it might be worth printing out or taking notes on.  You'll need to know the names of things to understand the later lessons.

# XPath Basics

XPath statements tend to look a little like URLs, because the core tree structure behind websites and XML documents is about the same.  Philosophically speaking.  Let's take a basic XML snippet:

```XML
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
```

There are numerous ways to describe this structure.

* `<book>` is the root element with two children:  `<author>` and `<title>`.
* The `<author>` element is a child of `<book>` and sibling to `<title>`.  
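
The same relationships can be checked from Python. This is a minimal sketch, assuming the `lxml` library from last week; the variable names (`root`, `author`) are just for illustration.

```python
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

root = etree.fromstring(xml)

# The root element and the tags of its direct children
print(root.tag)
print([child.tag for child in root])

# <author> is the first child; ask it for its parent and its next sibling
author = root[0]
print(author.getparent().tag)
print(author.getnext().tag)
```

This should print `book`, then `['author', 'title']`, then `book` and `title`, matching the two descriptions in the list.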

These descriptions are the basis for how XPath queries are constructed.  So, you don't say "I want the author element within the book element"; it's "Find the book element anywhere in the tree, then get the child element called author."  We can express this statement as such:

`'//book/author'`

Yes, the narrative is much longer than the actual statement, but this is the basis for every advanced XPath query.  We at least think this is correct, but we haven't tested it.  So let's inject this into the Python pattern we saw last week.  This pattern will be a little different because we're working off of a string instead of a file.  There are separate functions to use when reading XML from a file.

```python
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('//book/author'))
```

```
[<Element author at 0x1064af988>]
```


Good news:  it worked!

Bad news:   WTF is this `<Element author...` crap?

# The two parts to every XPath statement

What the `xpath` statement has returned to us is a list containing an `Element` object.  This is a little bundle of processed XML and something that we can act on in smart ways.
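
To get a feel for what that bundle holds, here is a small sketch that pokes at the returned object directly. It assumes the same `lxml` setup as the cell above; `matches` and `author` are illustrative names, and the properties used (`tag`, `text`, `attrib`) are standard on lxml elements.

```python
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)

# xpath() hands back a list, even when only one element matches
matches = tree.xpath('//book/author')
author = matches[0]

print(len(matches))          # how many elements matched
print(author.tag)            # the element's name
print(author.text)           # the text directly inside the element
print(dict(author.attrib))   # its attributes (none in this snippet)
```

Here the list has one item, the tag is `author`, the text is `Human, A.`, and the attribute dictionary is empty.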

What we need to ask ourselves next is: what do we want to get out of that element?

Each XPath statement should have:

1. A piece to select the right element or elements
2. A piece to extract the data that you want out of those elements

Meaning, why did we look for that element?  What was our purpose?  Did we want an attribute value or the element text?  We need to use additional XPath syntax to actually extract information out of the element.

So this brings us to the second part of nearly every XPath statement:  the data extraction piece.  We've got `'//book/author'`, which will find the element in question.  We need to add `'//text()'` to extract the actual text.  The `'//'` part of that says "anywhere beneath this element."  I usually recommend it in case there's additional text nested in other elements.  We'll explore this later; just remember that when I use two `/`s, I'm doing so on purpose.

Sometimes you want to leave these things separate, so you have a two-stage query:  find all the author elements and then extract the information out of those elements.  But in many cases we can put everything together in one statement.  We can do that now.

```python
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('//book/author//text()'))
```

```
['Human, A.']
```
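
For comparison, the two-stage version mentioned earlier might look like the sketch below. It assumes that calling `xpath` on an individual element runs the query relative to that element (hence the leading `.`); `authors` and `names` are illustrative names.

```python
from lxml import etree

xml = """
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)

# Stage 1: select the elements we care about
authors = tree.xpath('//book/author')

# Stage 2: run a relative query on each element to pull out its text
names = []
for author in authors:
    names.extend(author.xpath('.//text()'))

print(names)
```

This should give the same `['Human, A.']` list as the single combined statement. Keeping the stages apart is handy when you want to do more than one thing with each matched element.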


Yay!  We've gotten it.  Now let's explore why we want to do `//text()` instead of `/text()`.