# Week 13: XPath part 1

Over this week and next week we'll be going over XPath, but also discovering more about how to parse multiple files and do more advanced things with lists and nested accumulator patterns.

## Readings for this week

If you're new to XML, please start here: https://www.w3schools.com/xml/default.asp.  Read Introduction through Attributes, then stop.  Those who've worked with XML should at least skim those pages to refresh their understanding of the XML lingo.

## What is XML?

If you truly have zero knowledge of XML, I invite you to start with a good skim of the [Wikipedia page](https://en.wikipedia.org/wiki/XML) on the subject. Don't pore over it, but it'll provide some important background vocabulary and context.  Anyhow, XML is a ruleset for marking up documents in specific ways, and it has been extended into a method of storing data in a very structured way.  Instead of having a row/column structure like a CSV file, you can nest elements inside each other and thus store much more complex data.

Much library metadata is stored in XML documents, and that's the focus of the Metadata in Theory and Practice class offered at the iSchool.  Meanwhile, HTML is another markup language that works very similarly to XML.  Unless the HTML is severely malformed, techniques to extract data out of XML will also be useful for extracting data out of web pages.

## What is XPath?

XPath (https://en.wikipedia.org/wiki/XPath) is a query language (a la SQL, kind of) used both to describe locations within XML documents and to extract data from them.  This means that you can use it to locate a specific element within an XML document, and it also includes functions to pull out desired values.  Much of the time that's the text of that element, but sometimes you'll want other stuff.

XPath is platform and tool independent, so you can find support for it in the Oxygen XML editor and a few other resources.  There are many Python tools that can apply XPath queries, but we're going to explore just one of those.

## Installing lxml

For this first week we'll be using tools other than Python for exploring XPath, but it wouldn't hurt to go ahead and get this installed.

Your Anaconda installation should already include lxml.  Should you need to install it yourself, lxml is a module available from PyPI, which means you can use pip to install it.  Please follow these directions:

1. Open up your terminal or command prompt (this is the same as you did when you were testing your Anaconda installation at the beginning of the semester).
2. Type in `pip install lxml` and press enter.  You should not get an error.
3. It should begin a downloading process and not end in a "failed" statement.
4. Once you're back to the normal command prompt, type in `python` to open up the interactive Python interpreter.  Again, exactly how you did when testing out Anaconda.
5. Type in `from lxml import etree` and press enter.  You should not see anything returned.  Let me know if you get an error and what that error is.
6. Download a copy of the base script from the assignment and attempt to run it inside of PyCharm.  If you passed step 5 without an error, it should run without a problem.  
    * Remember that this script requires that you have the files in a place the script can find them.  By default, it expects them to be in a folder at the same level as the script itself.  The script will run with no data if it can't find the files, so a lack of errors doesn't tell you that it worked.  The start of the program has a statement to print out the file names it finds, so you should see something appear in your output.  If it prints an empty list, it didn't find anything.
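
If you want to check the "did it find my files?" question on its own, here is a minimal sketch using only the standard library.  It is not the assignment's base script: the function name `find_xml_files` and the folder name `data` are placeholders made up for illustration.

```python
from pathlib import Path


def find_xml_files(folder):
    """Return the sorted names of the .xml files directly inside `folder`."""
    folder_path = Path(folder)
    if not folder_path.is_dir():
        # A missing folder is treated the same as an empty one.
        return []
    return sorted(path.name for path in folder_path.glob("*.xml"))


# "data" is a placeholder folder name, not necessarily the one the assignment uses.
# A relative name like this is resolved against the current working directory.
found = find_xml_files("data")
print(found)
if not found:
    print("No XML files found; check where the folder sits relative to your script.")
```

An empty list here means the same thing it means in the base script: nothing was found, even though nothing crashed.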

# Essential vocabulary

No matter which scraping or parsing tool you use, you will not be able to navigate the documentation or create new things if you don't know the vocabulary behind it.

Let's take this one example:

```XML
<a href = "http://ischool.illinois.edu/">iSchool</a>
```

This is how you make a hyperlink in HTML.  The bit between the two tags is what shows up on the website and the bit in quotes after the href is where the link will go when you click it.  

HTML can be considered to be a specific form of XML.  Remember that XML is just a set of rules, and HTML is just one of those sets (I think there are purists who would disagree on a few points, because modern web browsers allow you to violate every known rule of XML and still render, but that's not a debate to have here).

Here are the essential names that you need to know:


* element name:  this is `a`, the name you see right after the `<`.  The element contains all the information that you want.  The `<>` mark where the different parts of the element begin and end.  Don't worry, we'll get into more of that.
* node:  roughly, this is the entire contraption that you see there.  The `a` element, everything about it, and everything in it.
* opening tag: this is the `<a href = "http://ischool.illinois.edu/">` piece, from the first `<` through the first `>`
* closing tag:  this is the `</a>` piece
* element content or value:  sometimes elements will hold just text, another element, a mix of both, or nothing at all!  The stuff that is between the > and < (so after the opening tag and before the closing tag), is the element's content.
* attributes:  these are key/value pairs that appear inside the opening tag.  You can see this is the href.  
* attribute name:  the thing on the left side of the =.  Much like dictionary keys, all attribute names must be unique inside the opening tag.
* attribute value:  the thing on the right side of the =.  This is the URL.  XML requires these to be in quotes, but in HTML you'll sometimes find them without.
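
To connect this vocabulary to what you'll see in Python, here is a small sketch that parses the hyperlink above with lxml and prints each named part.  The variable name `link` is just for illustration.

```python
from lxml import etree

# Parse the example hyperlink into a single element object.
link = etree.fromstring('<a href="http://ischool.illinois.edu/">iSchool</a>')

print(link.tag)          # element name
print(link.text)         # element content (the text between the tags)
print(dict(link.attrib)) # all attributes as name/value pairs
print(link.get("href"))  # one attribute value, looked up by attribute name
```

The attributes behave much like a dictionary, which is why attribute names have to be unique within an opening tag.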

Meanwhile, all valid XML must have a single root element that everything belongs inside.  You can see this in proper HTML, where it is the `<html>` element.  Every other element that you see on a web page is a descendant of that root.  Every element except the root has a parent element.

```XML

<root>
    <middle>
        <child>stuff</child>
    </middle>
</root>
```

Parent, child, and tree:

`root` is the parent of `middle`, and `middle` is the parent of `child`.  Together these make the tree.  
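
As a quick sketch of how these relationships show up in lxml, the snippet below walks the same three-element tree.  The variable names are only for illustration.

```python
from lxml import etree

root = etree.fromstring("""<root>
    <middle>
        <child>stuff</child>
    </middle>
</root>""")

middle = root[0]   # indexing an element gives you its child elements
child = middle[0]

print(child.getparent().tag)   # the parent of child
print(middle.getparent().tag)  # the parent of middle
print(root.getparent())        # the root has no parent, so this is None

# Walking the whole tree visits the root and then every descendant.
print([element.tag for element in root.iter()])
```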

When you are constructing XPath queries, you'll need to operationalize the patterns and locations that you see into these sorts of terms.  Once you can do that, you can string together the names of things in the XML tree and XPath punctuation to build up your query.

# XPath punctuation and syntax

The simplest XPath query is a list of element names, separated by `/`, that describes an exact location in the tree.  For example, in the previous structure, I could access the location of `child` via:

`/root/middle/child`

This should look very similar to a URL or a file path.  The `/` is used in a similar way.

However, sometimes there are multiple elements that you want, or you don't need to specify the full path to that element.  You can use `//` to have the query search at any level of the tree instead of starting at the root.

`//child`

This query would look for the `child` tag at any level in the tree.
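
Here is a minimal sketch of both queries run through lxml against the small tree from earlier, plus one path that deliberately skips a level to show what a miss looks like.

```python
from lxml import etree

tree = etree.fromstring("""<root>
    <middle>
        <child>stuff</child>
    </middle>
</root>""")

exact = tree.xpath("/root/middle/child")  # every step spelled out
anywhere = tree.xpath("//child")          # search at any depth
skipped = tree.xpath("/root/child")       # child is not directly under root

print(len(exact), exact[0].text)
print(len(anywhere), anywhere[0].text)
print(skipped)  # no match gives an empty list, not an error
```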

There are several basic syntax elements and metacharacters:

* an element name needs no other syntax to be a reference for an element, as you can see with our references to `child`
* `/` look only 1 level deep, so only for immediate children of the element preceding it
* `//` look anywhere in the descendants of the element preceding it
* `.` indicate the current node (you'll usually use this inside of functions)
* `..` indicate the parent node of the current element.  You can use this to have it "look" up to a previous element.  For example, "find this specific element, but then select its parent element"
* `@` this is used before a name to indicate that you are talking about an attribute name instead of an element name.  For example, `@href` for an `a` element.  
* `element[position number]`:  index starting at 1, allows you to indicate the "nth" instance of that element.  For example, `a[3]` would be the third `a` found at that step of the query (that is, the third `a` child of its parent).
* `element[logical check]` You can place a variety of functions and other boolean checks inside the `[]`.  There are multiple things you can put in here (https://www.w3schools.com/xml/xsl_functions.asp).
* `element[@attribute = 'something']` you can use this to select an element with an attribute that has a specific value
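
The sketch below tries several of these pieces of syntax with lxml.  The `page`, `nav`, and `body` document is made up purely for illustration; only the `a` and `href` names come from the earlier example.

```python
from lxml import etree

doc = etree.fromstring("""<page>
    <nav>
        <a href="/home" class="menu">Home</a>
        <a href="/about" class="menu">About</a>
    </nav>
    <body>
        <a href="http://ischool.illinois.edu/">iSchool</a>
    </body>
</page>""")

all_hrefs = doc.xpath("//a/@href")                       # @ selects an attribute
second_nav = doc.xpath("//nav/a[2]/text()")              # positions start at 1
menu_links = doc.xpath("//a[@class='menu']/text()")      # attribute equality check
about_parent = doc.xpath("//a[@href='/about']/..")       # .. steps up to the parent
illinois = doc.xpath("//a[contains(@href, 'illinois')]/text()")  # function check

nav = doc.xpath("//nav")[0]
nav_links = nav.xpath("./a/text()")                      # . starts from the current node

print(all_hrefs)
print(second_nav, menu_links, illinois, nav_links)
print(about_parent[0].tag)
```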

Of course, all these things are used to just select the element in question.  From there, you have to extract what you want.  This is a bit opaque when using a typical XPath tool, but it becomes much more explicit in Python, particularly with lxml.  If you don't select the content that you want, you'll only get an element object back.

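Generally speaking, there are two things you might want out of an element:

* the attribute value
    * you can get this by ending the query with `@` and the attribute name, for example `//a/@href`
* the text content
    * you can get this by ending the query with `text()`, for example `//a/text()`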
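
To make the difference between selecting an element and selecting its content concrete, here is a minimal sketch with lxml using the hyperlink from the vocabulary section.

```python
from lxml import etree

link = etree.fromstring('<a href="http://ischool.illinois.edu/">iSchool</a>')

elements = link.xpath("//a")      # stops at the element: a list of element objects
texts = link.xpath("//a/text()")  # asks for the text content: a list of strings
hrefs = link.xpath("//a/@href")   # asks for an attribute value: a list of strings

print(elements)  # something like [<Element a at 0x...>]
print(texts)
print(hrefs)
```

The first result is not wrong, it just isn't data yet.  You would still have to pull the text or attribute out of each element object yourself.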

We'll use all of these in our example below, but it can be helpful to copy these and keep them handy.

# Worked example

# How does that apply to this?

There is a notion of one-to-many relationships in databases, and they are quite a common feature of data.  For example, a single book may have many authors.  A class has many students.  A faculty member has many affiliations.  And so on.  XML is quite good at representing these relationships because it can nest things.  So let's break out some actual XML.  

``` XML
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
```

This is a pretty clear case.  There's one book and one author for that book.  In this case, something like `"//book/author/text()"` would be sufficient for tracking down that author's name.

```python
from lxml import etree

xml = """<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('//book/author/text()'))
```

```
['Human, A.']
```


And that's exactly what we get!  **But** let's take a closer look at this result.  Note that I didn't just get a string of the thing that I wanted.  I got a list with a single string within it.  This tells you that the `xpath()` method is prepared to return multiple results.  
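
As a small sketch of what "prepared for multiple results" means in practice, the snippet below asks one record for something that appears once, something that appears twice, and something that doesn't appear at all.  The `publisher` element is a made-up name used only to force a miss.

```python
from lxml import etree

tree = etree.fromstring("""<book>
    <author>Human, A.</author>
    <author>Human, Not.</author>
    <title>This is not a book</title>
</book>""")

titles = tree.xpath("//book/title/text()")          # one match
authors = tree.xpath("//book/author/text()")        # two matches
publishers = tree.xpath("//book/publisher/text()")  # no match at all

print(titles)
print(authors)
print(publishers)

# Indexing with [0] on an empty result would raise an IndexError,
# so check the length before reaching in.
publisher = publishers[0] if publishers else None
print(publisher)
```

In all three cases you get a list back, so your code has to decide what zero, one, or many results should mean.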

The fact that my data may have multiple values for these items means that I need to completely change my approach for getting this data out.  Last week we used SQL, which would print back a table of results.  We, as humans, were planning on processing that, so we didn't have to care.  The functions we wanted to apply to each column were ready to handle zero, one, or many results.  SQL just handles it.

But this is a different world, where we need to write lower level code.  So you, as the programmer, need to deal with that kind of thing.  Let's practice a primary and secondary loop pattern over this sort of returned data.

```python
xml = ["""<book>
    <book_id>42</book_id>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>""",

"""<book>
    <book_id>23</book_id>
    <author>Human, A.</author>
    <author>Human, Not.</author>
    <title>This is not a book</title>
</book>"""]

# so we've got multiple chunks of xml here
# we know that book_id will only happen once (because I'm saying so here)
# but we may have multiple authors

for data_chunk in xml:
    tree = etree.fromstring(data_chunk)
    author_list = tree.xpath('//book/author/text()')
    book_id_list = tree.xpath('//book/book_id/text()')
    # anything other than exactly one book_id (zero or several) is a problem
    if len(book_id_list) != 1:
        print("Expected exactly one book_id value, skipping")
        continue
    else:
        book_id = book_id_list[0]
    # secondary loop: one printed line per author
    for author in author_list:
        print(book_id, author)
```

```
42 Human, A.
23 Human, A.
23 Human, Not.
```


This looks pretty good because the values that I'm getting back are all strings.  All those lists are gone.  This means that the results I'm spitting out from these loops are primed and ready to be written out to a file.  No further processing needed.

Also note that it worked just fine when I had a list of one item.  
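
To show what "ready to be written out to a file" could look like, here is a minimal sketch that sends the same rows through Python's `csv` module.  It writes into an in-memory `io.StringIO` buffer rather than a real file so it has no side effects; the function name `book_author_rows` and the column headers are assumptions made for illustration.

```python
import csv
import io

from lxml import etree

xml_chunks = [
    """<book>
    <book_id>42</book_id>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>""",
    """<book>
    <book_id>23</book_id>
    <author>Human, A.</author>
    <author>Human, Not.</author>
    <title>This is not a book</title>
</book>""",
]


def book_author_rows(chunks):
    """Yield one (book_id, author) pair per author, skipping odd records."""
    for chunk in chunks:
        tree = etree.fromstring(chunk)
        book_ids = tree.xpath("//book/book_id/text()")
        if len(book_ids) != 1:
            continue  # zero or several book_id values: skip this record
        for author in tree.xpath("//book/author/text()"):
            yield book_ids[0], author


buffer = io.StringIO()  # stands in for an open file
writer = csv.writer(buffer)
writer.writerow(["book_id", "author"])
writer.writerows(book_author_rows(xml_chunks))
print(buffer.getvalue())
```

Because the author names contain commas, the `csv` module quotes them for you, which is one reason to prefer it over building the lines by hand.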

Had we not used this primary/secondary pattern, we would have ended up with this:

```python
for data_chunk in xml:
    tree = etree.fromstring(data_chunk)
    author_list = tree.xpath('//book/author/text()')
    book_id_list = tree.xpath('//book/book_id/text()')
    if len(book_id_list) != 1:
        print("Expected exactly one book_id value, skipping")
        continue
    else:
        book_id = book_id_list[0]
    # no secondary loop: the whole author list is printed as a list
    print(book_id, author_list)
```

```
42 ['Human, A.']
23 ['Human, A.', 'Human, Not.']
```


I could make this work by running a join on those lists:

```python
for data_chunk in xml:
    tree = etree.fromstring(data_chunk)
    author_list = tree.xpath('//book/author/text()')
    book_id_list = tree.xpath('//book/book_id/text()')
    if len(book_id_list) != 1:
        print("Expected exactly one book_id value, skipping")
        continue
    else:
        book_id = book_id_list[0]
    print(book_id, ";".join(author_list)) # look here for the change
```

```
42 Human, A.
23 Human, A.;Human, Not.
```


Depending on your data design you may want:

1. Each of the multiple values represented in a separate row
    * So you'd need to use the primary/secondary loop pattern
2. All of the multiple values in a single cell
    * Then you can use the `"delim".join(stuff)` pattern
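
One practical difference between the two options shows up when a record has no values at all.  The sketch below assumes a made-up book with no `author` element; the names `long_rows` and `wide_row` are just labels for the two layouts.

```python
from lxml import etree

# A hypothetical record with a book_id but no author elements.
record = etree.fromstring(
    "<book><book_id>7</book_id><title>Anonymous</title></book>"
)

book_id = record.xpath("//book/book_id/text()")[0]
authors = record.xpath("//book/author/text()")

# Option 1: one row per author. With no authors, the book produces no rows.
long_rows = [(book_id, author) for author in authors]

# Option 2: one row per book. With no authors, the cell is an empty string.
wide_row = (book_id, ";".join(authors))

print(long_rows)
print(wide_row)
```

If it matters that every book shows up in your output, that may push you toward the join pattern, or toward adding an explicit check for empty lists inside the loop pattern.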

# In Conclusion...

So now that we have a basis of strategy and tool, we can explore more about xpath itself in our next lesson.  Look next to week 14.