# XPath Lesson

The object of this lesson is to introduce XPath syntax and show how it is used within the LXML Python package.  XPath is a powerful query language for XML data structures.  Many systems support XPath queries, including the Oxygen XML editor and several Python packages.  This lesson will focus on how it can be used within Python, but the core XPath content should be relevant enough for you to follow along using a package or tool of your choice.  There will be two distinct sections:  how to run XPath queries in Python and how XPath queries are written.  You may choose to follow along with the second section using whichever framework you are comfortable with.

The scope of this lesson is limited to getting started and being able to run queries.  This includes reading in a set of XML files, running XPath queries on them, and outputting the results.  XSLT and writing out XML files will not be covered, as they belong in separate lessons.

## What you will need:

* a set of XML files
* a computer with Python or another platform to execute XPath queries
* at a minimum, a working comfort with Python
* the ability to use Pip
* LXML installed for your Python instance

## Why XPath and why not regex?

As stated above, XPath is a query language for working on XML trees.  Many tutorials, usually those touching on web scraping tasks, will show how a regular expression may be used to extract data out of XML data structures.  This is not an impossible task, depending on the situation, but a regex here is a tempting shortcut that can be dangerous.  XPath is designed to run queries on XML and as such is easier to work with for complex and unpredictable data structures.

Let's dwell on this query language/data structure relationship for a moment:

* Regular expressions are designed to work at the level of individual characters
* XPath is designed to work at the level of XML elements
* SQL is designed to work at the level of cells in database tables

While it may be tempting to try to make regular expressions work on an XML file, it can be dangerous because the structure of the raw text is nearly meaningless from the perspective of XML.  Let's look at an example XML file:

``` XML
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
```

In this very limited example, a regular expression could easily catch the text between the `author` tags, but rarely do we have such simple XML.  Let's look at what happens when complexity is added.


In [1]:
text = ["""<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
""" ,

"""
<book>
    <author>Human, A.</author>
    <author>Human, Another.</author>
    <title>This is not a book</title>
</book>
""" ,

"""
<book><author>Human, A.</author><title>This is not a book</title></book>
""" ]

In [2]:
import re

for t in text:
    print(re.findall(r'<author>(.+)<\/author>', t))

['Human, A.']
['Human, A.', 'Human, Another.']
['Human, A.']


This regex is holding strong...

In [3]:
book = """
<book>
    <author kind = "editor">Human, A.</author>
    <author kind    =  "contributor">Human, Another.</author>
    <title>This is not a book</title>
</book>
"""

re.findall(r'<author>(.+)<\/author>', book)

[]

But regex treats all characters in the text as meaningful, which is not true for the data structure we are working with.  Here the added attributes and the uneven spacing around `=` are enough to break the pattern, even though the XML is still perfectly well-formed.  Imagine writing the regex for multiple attributes that can appear in any order.  There are situations where regex is just fine, but XPath should be your default choice for handling XML data.
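For contrast, here is a minimal sketch of the same extraction done with XPath, assuming LXML is already installed.  The variable names `book_tree`, `all_authors`, and `editors` are only illustrative, and the XPath syntax itself is introduced later in the lesson.

```python
from lxml import etree

# Same sample as above: attributes with irregular spacing around "=".
book = """
<book>
    <author kind = "editor">Human, A.</author>
    <author kind    =  "contributor">Human, Another.</author>
    <title>This is not a book</title>
</book>
"""

# Parse the string into a tree of elements (strip() drops the leading newline).
book_tree = etree.fromstring(book.strip())

# Every author's text, regardless of attributes or spacing.
all_authors = book_tree.xpath('//author/text()')
print(all_authors)

# Only the author whose `kind` attribute is "editor".
editors = book_tree.xpath('//author[@kind="editor"]/text()')
print(editors)
```

Because the parser has already resolved the attributes and whitespace, the first query finds both authors and the second narrows the result to the editor, with no changes needed for the spacing.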

# Sample workflow

Several packages within Python have support for XPath, which is part of the beauty of this tool.  This tutorial will feature the LXML package.  [LXML](http://lxml.de/) has its own installation directions, which I will refer you to: http://lxml.de/installation.html.  It can be installed via `pip` or `conda`.

In the spirit of the Programming Historian style, let's start with a sample workflow.

Outline of this task:

* read in source XML files
* parse via LXML
* use an XPath statement to extract some information
* assemble and dump out that information

More specifically, we'll be using the 2013 Digital Humanities abstracts.  These files are formatted in TEI and available at the conference website: http://dh2013.unl.edu/abstracts/.  There are two versions: a corpus of all the abstracts and the individual files.  This is not a tutorial on reading through multiple files, so this activity will be included but not expanded on.

Specific tasks:

* read in the TEI corpus
* extract the ID, title, and type of each abstract
* write out those results to a CSV

## Setting up

In [4]:
from lxml import etree


In [5]:
source_xml = 'DH 2013 final_xml/ab-130.xml'

with open(source_xml, 'rb') as read_in:
    raw_text = read_in.read() # read in the raw bytes; under Python 3, lxml rejects a text string that has an encoding declaration
    
tree = etree.fromstring(raw_text) # parse the string of text

At this point, we've read in the raw contents of our XML file and parsed them using the `etree` module from LXML.  Let's play with this for a bit.  LXML has fairly thorough documentation on how to use `etree`, including XPath, here: http://lxml.de/tutorial.html#using-xpath-to-find-text.  We're going to focus on bare-bones navigation and text extraction.
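The reading and parsing step can also be tried on a small stand-in file.  The following is a minimal sketch, assuming LXML is installed; the `sample` document, its `ab-000` ID, and the temporary file name are made up for illustration and are not part of the DH 2013 data.  It shows two routes to the same root element: reading the bytes yourself and calling `etree.fromstring`, or handing a path to `etree.parse`.

```python
import os
import shutil
import tempfile

from lxml import etree

# Hypothetical stand-in for one abstract file (well-formed, but not full TEI).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ab-000">
    <teiHeader><titleStmt><title>Sample title</title></titleStmt></teiHeader>
</TEI>
"""

# Write the stand-in to a temporary file so there is something to read.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.xml')
with open(sample_path, 'wb') as write_out:
    write_out.write(sample)

# Option 1: read the bytes ourselves, then parse them.
with open(sample_path, 'rb') as read_in:
    root_from_bytes = etree.fromstring(read_in.read())

# Option 2: hand the path to lxml and ask for the root element.
root_from_path = etree.parse(sample_path).getroot()

# Both routes should arrive at the same root element name.
print(root_from_bytes.tag == root_from_path.tag)

# Clean up the temporary directory.
shutil.rmtree(tmp_dir)
```

Reading in binary mode leaves the decoding to the parser.  This matters under Python 3, where LXML refuses a text string that carries its own encoding declaration.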

In [6]:
print(tree)

<Element {http://www.tei-c.org/ns/1.0}TEI at 0x10531b3f8>


Once the file is parsed, each element becomes an object we can act on.  Very roughly, we can manually inspect the file contents and see that the structure is as follows (with `...` meaning that some code was snipped):

``` XML
<TEI>
    <teiHeader>
        <titleStmt>
            <title>...</title>
        </titleStmt>
    ...
    </teiHeader>
    ...
    <text type = '...'>
        <body>
        ...
        </body>
    </text>
</TEI>
```

The `{http://www.tei-c.org/ns/1.0}TEI` notation is also important because it tells you that the element is being parsed as part of a namespace.  If we tried to just access `TEI` it would fail because the parser is expecting `{http://www.tei-c.org/ns/1.0}TEI` to appear.  Let's look at a basic XPath command that tries to extract the title text out of this individual abstract.  It leaves out the TEI namespace, which, as we will see, needs to be included.

In [10]:
print(tree.xpath('//title/text()')) # returns nothing, as we expect.

[]


Note that even though this is a valid XPath statement (we haven't introduced the syntax yet, so trust me), it returns nothing.  Because the element `<TEI>` is displayed as `{http://www.tei-c.org/ns/1.0}TEI`, we know that we will need to include a namespace declaration within our XPath statements.

Ignoring the Python-specific syntax for a moment, the LXML tools that we're using for XPath statements can declare namespaces and associate them with names.

Example:  `your_tree.xpath('//namespace_name:element_name/text()', namespaces = {'namespace_name': 'schema_location_url'})`

There are two bits that you need to change.  The first is `namespace_name`, which is just what you want to call the namespace within your XPath statements.  This value needs to be unique and something you don't mind typing a lot.  The second is `schema_location_url`, which is the URL that identifies the namespace you are giving a name to.  This URL should match what you see at the top of the XML document you are trying to query.  After all, the parser won't know that an XML document has a namespace unless it is stated.  In this case we can look at the top of the document and see that it includes:  `<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ab-130">`

This is why the LXML parser is stating that `TEI` resides within the `{http://www.tei-c.org/ns/1.0}` namespace by returning: `{http://www.tei-c.org/ns/1.0}TEI`
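Rather than reading the namespace off the top of the file by eye, you can also ask the parsed element for it.  The following is a minimal sketch, assuming LXML is installed; the one-line `sample` document and its `ab-000` ID are made up for illustration.

```python
from lxml import etree

# Hypothetical stand-in for the top of a TEI file.
sample = b'<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ab-000"><teiHeader/></TEI>'
root = etree.fromstring(sample)

# The full tag, in {namespace}name form.
print(root.tag)

# QName splits that tag into its two parts.
qname = etree.QName(root)
print(qname.namespace)
print(qname.localname)

# nsmap holds the namespaces declared for this element; a default
# namespace (xmlns="...") is stored under the key None.
default_namespace = root.nsmap[None]

# Give that namespace a name of our own choosing for use in XPath.
ns = {'tei': default_namespace}
headers = root.xpath('//tei:teiHeader', namespaces=ns)
print(len(headers))
```

The name `tei` is still our own choice here; only the URL has to match what the document declares.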

Many times we'll have a long series of XPath statements to extract all the information that we want, so we can store these namespace values in a variable, so long as it works for all of them:  `{'tei': 'http://www.tei-c.org/ns/1.0'}`.  The key in this dictionary is the name declaration.  I can name it anything, so long as it is unique (which is also required by virtue of this data structure being a dictionary).  The value is the URL identifying that namespace.

While you can continue to use the syntax where you declare the namespace verbatim in every XPath function call, it can become unwieldy and clutter your code.  Putting our Python eyes back on for a second, we can see that the namespace declaration object is simply a dictionary with the namespace names as the keys and the URLs as the values.  We can save this dictionary into a separate variable that we can call on at later points in our code.

In [12]:
# create our namespace variable containing the namespace dictionary
# with the very creative variable name of `ns`
ns = {'tei': 'http://www.tei-c.org/ns/1.0'} 

# we can now replace the raw dictionary content 
# within the function call with our variable name
print(tree.xpath('//tei:title/text()', namespaces = ns))

['A Comparative Study of Astronomical Clock towers in Europe and China based on their detailed 3D modeling']


At this point, we finally have all the pieces that we need to successfully create XPath statements that work on our XML document.  You may not always be working with documents that require namespaces, so you may be able to skip this step for your projects.
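With those pieces in hand, the specific tasks listed earlier (extract the ID, title, and type, then write a CSV) could be sketched roughly as follows.  This is a minimal sketch under stated assumptions, not the lesson's final code: it assumes Python 3 and LXML, the `sample` document and its values are made up, and `first_or_empty` and `abstract_row` are hypothetical helper names.

```python
import csv
import io

from lxml import etree

ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
# xml:id lives in the reserved XML namespace, written here in {namespace}name form.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

# Hypothetical stand-in for one abstract; ID, title, and type are made up.
sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ab-000">
    <teiHeader><titleStmt><title>Sample title</title></titleStmt></teiHeader>
    <text type="paper"><body/></text>
</TEI>"""


def first_or_empty(results):
    # xpath() returns a list; take the first hit or fall back to ''.
    return results[0] if results else ''


def abstract_row(root):
    # Collect the three values the task list asks for.
    return [
        root.get(XML_ID, ''),
        first_or_empty(root.xpath('.//tei:titleStmt/tei:title/text()', namespaces=ns)),
        first_or_empty(root.xpath('.//tei:text/@type', namespaces=ns)),
    ]


row = abstract_row(etree.fromstring(sample))

# Write to an in-memory buffer; for a real file, open it with newline=''.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['id', 'title', 'type'])
writer.writerow(row)
print(buffer.getvalue())
```

The queries start with `.` so that they stay relative to the element passed in.  That should let the same function be applied to each abstract in turn, although how the abstracts are located inside the corpus file depends on its layout, which is not shown here.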

Let's turn our attention to what is happening within the statement used above: `//tei:title/text()`.  As a quick refresher for XML vocabulary, here's the naming taxonomy for a single element tag.

```XML 
<element_name attribute_name = 'attribute_value'>element_text</element_name>
```

The XPath language has ways of referring to these markup locations by name.  You'll also encounter metacharacters that indicate locations and internal function calls that perform tests and extract specific areas of text.  XPath should be considered within its own box, separate and independent from the Python that surrounds it.  As with regular expressions, this is an independent language, and we are only using Python as a hook to run it on a piece of text.
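To make the vocabulary concrete, here is a minimal sketch that runs one query per part of the sample tag above, assuming LXML is installed.  The variable names are only illustrative, and no namespace is needed because the sample does not declare one.

```python
from lxml import etree

# The vocabulary sample from above, parsed as a one-element document.
snippet = "<element_name attribute_name = 'attribute_value'>element_text</element_name>"
element = etree.fromstring(snippet)

# Elements are addressed by name.
matches = element.xpath('/element_name')
print(matches[0].tag)

# Attributes are addressed with a leading @.
attribute_values = element.xpath('/element_name/@attribute_name')
print(attribute_values)

# text() selects the text inside an element.
texts = element.xpath('/element_name/text()')
print(texts)
```

Each query returns a list, even when there is only one match, which is why the results print inside square brackets.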

In the example XPath statement we've been using we have both