# XPath Lesson

The object of this lesson is to introduce XPath syntax and show how it is used within the LXML Python package.  XPath is a powerful query language for XML data structures.  Many systems support XPath queries, including the Oxygen XML editor and several Python packages.  This lesson focuses on how XPath can be used within Python, but the core XPath content should be general enough for you to follow along using a package or tool of your choice.  There are two distinct sections: how to run XPath queries in Python and how XPath queries are written.  You may choose to follow along with the second section using whichever framework you are comfortable with.

The scope of this lesson is limited to getting started and being able to run queries.  This includes reading in a set of XML files, running XPath queries on them, and outputting the results.  XSLT and writing out XML files are not covered, as they belong in separate lessons.

## What you will need:

* a set of XML files
* a computer with Python or another platform that can execute XPath queries
* at a minimum, a working familiarity with Python
* Pip, to install Python packages
* LXML installed for your Python instance

## Why XPath?

As stated above, XPath is a query language for working on XML trees.  Many tutorials, usually those touching on web scraping tasks, show how a regular expression can be used to extract data from XML data structures.  This is not impossible, depending on the situation, but a regex used this way is a shortcut that breaks easily.  XPath is designed to run queries on XML and, as such, is easier to work with for complex and unpredictable data structures.

Let's dwell on this query language/data structure relationship for a moment:

* Regular expressions are designed to work at the level of individual characters
* XPath is designed to work at the level of XML elements
* SQL is designed to work at the level of cells in database tables
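
As a rough illustration of the first two levels, the sketch below looks at one string both ways. It uses the standard library's `xml.etree.ElementTree` as a stand-in parser so that it runs without installing anything; the `doc` string and variable names are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration
doc = "<book><author>Human, A.</author><title>This is not a book</title></book>"

# A regex sees a flat sequence of characters; '<' is just another character
chars = re.findall(r".", doc)[:6]
print(chars)

# A parser sees elements, each with a name, text, and a place in the tree
root = ET.fromstring(doc)
elements = [(child.tag, child.text) for child in root]
print(elements)
```

The first result is a list of single characters, while the second is a list of (element name, text) pairs. XPath queries operate on the second view.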

While it may be tempting to try to make regular expressions work on an XML file, it can be dangerous because the structure of the raw text is nearly meaningless from the perspective of XML.  Let's look at an example XML file:

``` XML
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
```

In this very limited example, a regular expression could easily capture the text between the `author` tags, but we rarely have such simple XML.  Let's look at what happens when complexity is added.


In [1]:
text = ["""<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
""" ,

"""
<book>
    <author>Human, A.</author>
    <author>Human, Another.</author>
    <title>This is not a book</title>
</book>
""" ,

"""
<book><author>Human, A.</author><title>This is not a book</title></book>
""" ]

In [2]:
import re

# Try the same pattern on each sample and print every captured author
for t in text:
    print(re.findall(r'<author>(.+)<\/author>', t))

['Human, A.']
['Human, A.', 'Human, Another.']
['Human, A.']


This regex is holding strong...

In [10]:
book = """
<book>
    <author kind = "editor">Human, A.</author>
    <author kind    =  "contributor">Human, Another.</author>
    <title>This is not a book</title>
</book>
"""

re.findall(r'<author>(.+)<\/author>', book)

[]

But a regex treats every character in the text as meaningful, which is not true of the data structure we are working with.  In XML, the extra whitespace and the attributes inside a tag do not change which element it is.

In [None]:
re.findall(r'<author[\s]>(.+)<\/author>', book)
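
For contrast with the regex attempts above, here is a minimal sketch of querying the same `book` string at the element level. It assumes the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in so the snippet runs without installation. `lxml.etree` offers the same `fromstring` and `findall` calls, plus an `xpath` method for full XPath expressions.

```python
import xml.etree.ElementTree as ET

book = """
<book>
    <author kind = "editor">Human, A.</author>
    <author kind    =  "contributor">Human, Another.</author>
    <title>This is not a book</title>
</book>
"""

# Parse the string into a tree of elements; spacing inside the tags
# is handled by the parser and never reaches the query
root = ET.fromstring(book)

# './/author' selects every author element below the root,
# whatever attributes it carries
authors = [a.text for a in root.findall(".//author")]
print(authors)

# Attributes are data, so a query can use them rather than work around them
editors = [a.text for a in root.findall(".//author[@kind='editor']")]
print(editors)
```

Here the query names the element it wants and leaves the tag syntax to the parser, so the irregular spacing around `=` does not affect the result.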