# Week 13: XPath part 1

Over this week and next week we'll be going over XPath, but also discovering more about how to parse over multiple files and do more advanced stuff with lists and nested accumulator patterns.

## Readings for this week

If you're new to XML, please start here: https://www.w3schools.com/xml/default.asp.  Read Introduction through Attributes, then stop.  Those who've worked with XML should at least skim those pages to refresh their understanding of the XML lingo.

## What is XML?

If you truly have zero knowledge of XML, I invite you to start with a good skim of the [Wikipedia page](https://en.wikipedia.org/wiki/XML) on the subject. Don't pore over it, but it'll provide some important background vocabulary and context.  Anyhow, XML is a ruleset for marking up documents in specific ways, and has been extended into a method of storing data in a very structured way.  Instead of having a row/column structure like a CSV file, you can have nested and thus much more complex data storage this way.

Much of library metadata is stored in XML marked-up documents, and that's the focus of the Metadata in Theory and Practice class offered at the iSchool.  Meanwhile, HTML is another markup language that works very similarly to XML.  Unless the HTML is severely malformed, techniques to extract data out of XML will also be useful for extracting data out of web pages.

## What is XPath?

XPath (https://en.wikipedia.org/wiki/XPath) is a query language (a la SQL, kind of) used both to describe the locations of items in XML documents/data and to extract data from them.  This means that you can use it to locate a specific element within an XML document, and it also includes functions to pull out desired values.  Much of the time that's the text of that element, but sometimes you'll want other stuff.

XPath is platform and tool independent, so you can find tools for it in the Oxygen XML editor and in a few other resources.  Many Python tools utilize XPath and have functions for applying XPath queries, but we're going to explore just one of those.

## Installing lxml

This will be our first foray into a third party module that you'll need to install.  lxml is a module available from PyPI, which means you can use pip to install it.  Please follow these directions:

1. Open up your terminal or command prompt (this is the same as you did when you were testing your Anaconda installation at the beginning of the semester).
2. Type in `pip install lxml` and press enter.  You should not get an error.
3. It should begin a downloading process and not end in a "failed" statement.
4. Once you're back to the normal command prompt, type in `python` to open up the Python interactive interpreter.  Again, exactly how you did when testing out Anaconda.
5. Type in `from lxml import etree` and press enter.  You should not see anything returned.  Let me know if you get an error and what that error is.
6. Download a copy of the base script from the assignment and attempt to run it inside of PyCharm.  If you passed step 5 without an error, it should run without a problem.  
    * Remember that this script requires that you have the files in a place the script can find them.  By default, it is expecting them to be in a folder at the same level as the script itself.  The script will run and execute with no data if it can't find the files, so a lack of error won't help you.  The start of the program has a statement to print out the file names it finds, so you should see something appear in your output.  If it prints an empty list, that means it didn't find anything.
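
If you'd rather check the install from inside a script than by typing the import by hand, here is a minimal sketch.  The helper name `module_is_installed` is made up for this example; it relies on `importlib.util.find_spec` from the standard library, which returns `None` when a top-level module can't be found.

```python
import importlib.util


def module_is_installed(module_name):
    """Return True when Python can locate a top-level module by name."""
    # find_spec returns None when the module is not available
    return importlib.util.find_spec(module_name) is not None


# prints True if the pip install from step 2 worked, False otherwise
print("lxml installed:", module_is_installed("lxml"))
```

This only reports whether Python can find the module, so step 5 is still the real test that the import itself succeeds.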

# Designing around juicy pieces of code

As you have discovered with other programs, there's a lot of data preparation and handling that goes into our code before we get into the interesting pieces with XPath.  Just because we're going to be using this XPath function to get out the data we want doesn't mean that the work is done for us.  That would be the case if we only needed one data point from one file.  But when we want to get multiple data points from multiple files, we need to have structures in place to collect that data and then export it to our desired structure.

So it very much is the case that you'll have one really juicy piece of code sitting in the middle that's doing the substantive data work, but you still have a ton of code before and after to prepare and output the data.  What we need to do is explore larger design questions.

You almost never know exactly how you're going to get from point A to point B in a program, but you know what point A is and what point B should be.  You may have to work a bit at the beginning, then a bit at the end, then fill things in the middle as you go.  You've seen me do this during lecture, where we fill in some of our known structure and output at the start so we can focus on the meaty bits in the center.

## So what do we know about this homework assignment?

**Starting point:** You'll have 5 XML files that contain TEI-ed entries for a Digital Humanities conference.  These contain descriptive and administrative information about the entries.

**Ending point:** You need to create 3 CSV documents with data about each of these files.  2 credit hour students will be pulling out 2 columns of data in each file and 4 credit hour students will be pulling out 3 columns of data in each.

And the middle bits? We know it'll be something like looping through the files, gathering the data that we want, and then writing out all our files.

# Starting with multiple files

We're used to reading in one file at a time, and naming them explicitly within our programs.  We could hard code all 5 file names, but what if they might change?  What if there are 5,000?  We can use some tools in Python to look up what these file names are as part of your program.  This makes your code more adaptable and flexible in the future.  You can also pick a smaller random sample of your source data files to develop your code on, and gradually increment it up to explore for outliers and check your logic.
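
The sampling idea can be sketched with the standard library's `random` module.  This is only an illustration: the file names are invented stand-ins for a large folder, and `pick_sample` is a hypothetical helper, not something from the assignment's base script.

```python
import random

# hypothetical file names standing in for a large folder of source files
all_files = ["entry_{:04d}.xml".format(number) for number in range(5000)]


def pick_sample(file_names, sample_size, seed=None):
    """Return a random subset, never asking for more files than exist."""
    chooser = random.Random(seed)  # a seed makes the sample repeatable
    return chooser.sample(file_names, min(sample_size, len(file_names)))


# develop on a handful of files first, then step the size up
small_batch = pick_sample(all_files, 10, seed=13)
bigger_batch = pick_sample(all_files, 100, seed=13)
print(len(small_batch), len(bigger_batch))
```

Passing a seed is a design choice: it means you get the same sample each run, which makes it easier to tell whether a change in your output came from your code or from the data.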

## the `glob` module

There actually are several methods and modules to get file names from your computer using Python, but for simple needs, we can use the `glob` module (https://docs.python.org/3.5/library/glob.html). It comes with Python, so there's nothing to install, but it isn't loaded by default, so you'll need to import it.

In [1]:
import glob

Then the `glob.glob()` function will be what we want to use.  You pass this function a basic matching pattern and it'll return a list of matches.

You'll make heavy use of the wildcard character `*`.  This means "anything of any length in this place." For example, `"*.txt"` will match any file that ends with .txt.  Thus, `"*.xml"` will match any XML file.  You can also put a folder name in the matching pattern, but you'll need to play with how your full paths appear on your own system. My examples will be in Mac syntax, but Windows will look different.
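
To get a feel for what `*` matches before pointing it at real folders, you can try patterns against plain strings.  Here is a minimal sketch using the standard library's `fnmatch` module, which handles shell-style wildcards of the kind `glob` uses.  The file names are made up for the example.

```python
import fnmatch

# made-up file names to test patterns against (no real files needed)
names = ["abstract_01.xml", "abstract_02.xml", "readme.txt", "notes.xml.bak"]

# keep only the names that match each pattern
xml_only = fnmatch.filter(names, "*.xml")
starts_with_abstract = fnmatch.filter(names, "abstract*")

print(xml_only)
print(starts_with_abstract)
```

Note that `notes.xml.bak` is left out of the first result because the pattern requires the name to end in `.xml`.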

We've got this folder called `drac_chaps` that we can explore.

In [2]:
print(glob.glob("drac_chaps"))

['drac_chaps']


That isn't very interesting, but it is matching exactly what we've told it.  In this case, it just matches a single folder name. But we've got something! 

In [3]:
print(glob.glob("sneks_not_here"))

[]


This comes back with an empty list, so we can tell that there were no matches.

This means that we need a wildcard to fill in some of the stuff that should come after `drac_chaps`.

In [4]:
print(glob.glob("drac_chaps*"))

['drac_chaps']


That didn't change anything, so we can tell that the `*` doesn't match the file delimiter.  In this case, I know my delimiter is `/`, so I can add that.

In [5]:
print(glob.glob("drac_chaps/*"))

['drac_chaps/Dracula-Chapter-1-Jonathan_Harkers_Journal.txt', 'drac_chaps/Dracula-Chapter-10-Mina_Murrays_Journal.txt', 'drac_chaps/Dracula-Chapter-11-Lucy_Westenras_Diary.txt', 'drac_chaps/Dracula-Chapter-12-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-13-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-14-Mina_Harkers_Journal.txt', 'drac_chaps/Dracula-Chapter-15-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-16-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-17-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-18-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-19-Jonathan_Harkers_Journal.txt', 'drac_chaps/Dracula-Chapter-2-Jonathan_Harkers_Journal.txt', 'drac_chaps/Dracula-Chapter-20-Jonathan_Harkers_Journal.txt', 'drac_chaps/Dracula-Chapter-21-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-22-Jonathan_Harkers_Journal.txt', 'drac_chaps/Dracula-Chapter-23-Dr_Sewards_Diary.txt', 'drac_chaps/Dracula-Chapter-24-Dr_Sewards_Phonograph_Diary_spoken_by_Van_Helsing.txt'

Whoo! That got something.  Let's poke around.

In [6]:
found_files = glob.glob("drac_chaps/*") # save to a variable to poke around

print(found_files[0]) # we can see that it has the relative path to that file from our current position
print(len(found_files)) # and it has 27 chapters

drac_1 = open(found_files[0], 'r') # and we can read it in to check that it indeed works.

print(drac_1.read()[:200])

drac_1.close()

drac_chaps/Dracula-Chapter-1-Jonathan_Harkers_Journal.txt
27
CHAPTER I

JONATHAN HARKER'S JOURNAL

(_Kept in shorthand._)


_3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at
Vienna early next morning; should have arrived at 6:46, but train 


`found_files` is just a list of these paths, so I can loop over it to open files, but I can also get information out of it.

In [7]:
for path in found_files:
    print(path.split("/"))

['drac_chaps', 'Dracula-Chapter-1-Jonathan_Harkers_Journal.txt']
['drac_chaps', 'Dracula-Chapter-10-Mina_Murrays_Journal.txt']
['drac_chaps', 'Dracula-Chapter-11-Lucy_Westenras_Diary.txt']
['drac_chaps', 'Dracula-Chapter-12-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracula-Chapter-13-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracula-Chapter-14-Mina_Harkers_Journal.txt']
['drac_chaps', 'Dracula-Chapter-15-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracula-Chapter-16-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracula-Chapter-17-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracula-Chapter-18-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracula-Chapter-19-Jonathan_Harkers_Journal.txt']
['drac_chaps', 'Dracula-Chapter-2-Jonathan_Harkers_Journal.txt']
['drac_chaps', 'Dracula-Chapter-20-Jonathan_Harkers_Journal.txt']
['drac_chaps', 'Dracula-Chapter-21-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracula-Chapter-22-Jonathan_Harkers_Journal.txt']
['drac_chaps', 'Dracula-Chapter-23-Dr_Sewards_Diary.txt']
['drac_chaps', 'Dracul
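
Splitting on `"/"` works with Mac-style paths, but the delimiter differs on Windows.  Here is a minimal sketch of a more portable approach using `os.path` from the standard library.  It builds a throwaway folder with made-up file names so it can run anywhere; with your own data you'd substitute your real folder name.

```python
import glob
import os
import tempfile

# build a throwaway folder with made-up files so this runs anywhere
with tempfile.TemporaryDirectory() as folder:
    for name in ["chapter-1.txt", "chapter-2.txt", "notes.md"]:
        with open(os.path.join(folder, name), "w") as outfile:
            outfile.write("placeholder text")

    # os.path.join inserts the right delimiter for the current system
    pattern = os.path.join(folder, "*.txt")
    found_paths = sorted(glob.glob(pattern))  # sorted for a predictable order

    # os.path.basename keeps only the file name, whatever the delimiter is
    file_names = [os.path.basename(path) for path in found_paths]

print(file_names)
```

Sorting the result is a deliberate choice here: the order that `glob.glob()` returns is not something to rely on.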

# Primary accumulators and secondary accumulators

You've used this pattern before, but it's always worth unpacking.  So we're going to take a diversion from XML and XPath to review this pattern.

There are a couple of classic pattern explorations that intro to programming courses work with.

In [8]:
for letter in 'abcd':
    for number in range(4):
        print(letter, number)
    print("done with sub loop!")

a 0
a 1
a 2
a 3
done with sub loop!
b 0
b 1
b 2
b 3
done with sub loop!
c 0
c 1
c 2
c 3
done with sub loop!
d 0
d 1
d 2
d 3
done with sub loop!


This is an example of a nested loop.  So on the outside we're looping over a, b, c, and d.  But for each of those letters, we're looping 4 times.  This makes for 16 total combinations between them.  We can use this pattern to loop over things we're...looping over! But mainly to process those values and collect something about them.

The problem:  write a function that takes a string of characters and a range of numbers.  For each character and number pairing, it produces a new character whose ordinal value is the original character's ordinal value multiplied by the number.  Collect a list of these new characters for each original character.  

So our primary loop is going to be the letters and our secondary loop will be the numbers.  This is because we want every number to apply to each letter. We could do it the other way around, of course, but we've been asked to collect the data in one list for each original letter.  Making the original letter the primary loop makes things easier.

In [9]:
# what do we know we should start with?  a function

def weird_char_thing(letters, range_thing):
    all_lists = [] # a primary accumulator for my primary loop
    for character in letters:
        new_letters = [] # secondary loop collector
        for number in range_thing:
            new_ordinal = ord(character) * number
            new_char = chr(new_ordinal) # making it a string
            new_letters.append(new_char) # add it to the secondary collector
        # once we've finished looping over all the numbers...
        # we can add our secondary loop data to the primary loop accumulator
        all_lists.append(new_letters)
    return all_lists # return our final result

Before we look at our results, two questions:

1. How many lists will be inside of `all_lists`?
2. How many strings will be inside of each list in `all_lists`?

In [10]:
weird_char_thing('abcd', range(3, 7))

[['ģ', 'Ƅ', 'ǥ', 'Ɇ'],
 ['Ħ', 'ƈ', 'Ǫ', 'Ɍ'],
 ['ĩ', 'ƌ', 'ǯ', 'ɒ'],
 ['Ĭ', 'Ɛ', 'Ǵ', 'ɘ']]

In [11]:
weird_char_thing('abcd', range(2, 7))

[['Â', 'ģ', 'Ƅ', 'ǥ', 'Ɇ'],
 ['Ä', 'Ħ', 'ƈ', 'Ǫ', 'Ɍ'],
 ['Æ', 'ĩ', 'ƌ', 'ǯ', 'ɒ'],
 ['È', 'Ĭ', 'Ɛ', 'Ǵ', 'ɘ']]

In [12]:
weird_char_thing('abcd', range(1, 7))

[['a', 'Â', 'ģ', 'Ƅ', 'ǥ', 'Ɇ'],
 ['b', 'Ä', 'Ħ', 'ƈ', 'Ǫ', 'Ɍ'],
 ['c', 'Æ', 'ĩ', 'ƌ', 'ǯ', 'ɒ'],
 ['d', 'È', 'Ĭ', 'Ɛ', 'Ǵ', 'ɘ']]

In [13]:
weird_char_thing('abcdefg', range(1, 8))

[['a', 'Â', 'ģ', 'Ƅ', 'ǥ', 'Ɇ', 'ʧ'],
 ['b', 'Ä', 'Ħ', 'ƈ', 'Ǫ', 'Ɍ', 'ʮ'],
 ['c', 'Æ', 'ĩ', 'ƌ', 'ǯ', 'ɒ', 'ʵ'],
 ['d', 'È', 'Ĭ', 'Ɛ', 'Ǵ', 'ɘ', 'ʼ'],
 ['e', 'Ê', 'į', 'Ɣ', 'ǹ', 'ɞ', '˃'],
 ['f', 'Ì', 'Ĳ', 'Ƙ', 'Ǿ', 'ɤ', 'ˊ'],
 ['g', 'Î', 'ĵ', 'Ɯ', 'ȃ', 'ɪ', 'ˑ']]

## Core primary/secondary loop pattern

You'll notice a few things about the primary and secondary accumulator pattern:

```python
primary_collector = []
for pri_item in primary:
    secondary_collector = []
    for sec_item in secondary:
        secondary_collector.append(sec_item)
    primary_collector.append(secondary_collector)
```

* The collector lists appear just before the for loops that will be adding things to them, and are at the same indent level as that for loop.  
* Within the primary loop, the pattern is:
    1. declare the empty accumulator for the secondary
    2. loop over whatever the secondary stuff is
    3. process the thing and append each to the secondary collector
    4. append the final secondary results to the primary
    
This is a classic and standard primary/secondary loop pattern.  You might need to add in extra steps where you strip, split, or transform things, but these elements will be in there no matter what. 

Depending on how long your transformation processes are, you might want to create a transformation function to process your internal secondary or primary items so you can keep your structure cleaner.
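
Here is a minimal sketch of what pulling the transformation into its own function might look like.  The names `clean_item` and `collect_cleaned` and the sample strings are invented for illustration; the point is only that the loop structure stays the same while the processing lives elsewhere.

```python
def clean_item(raw_text):
    """Hypothetical transformation: trim whitespace and lowercase."""
    return raw_text.strip().lower()


def collect_cleaned(lines, delimiter=","):
    primary_collector = []
    for line in lines:
        secondary_collector = []
        for piece in line.split(delimiter):
            # the transformation happens in one place, outside the loops
            secondary_collector.append(clean_item(piece))
        primary_collector.append(secondary_collector)
    return primary_collector


result = collect_cleaned(["  Apple, Pear ", "Kiwi ,FIG"])
print(result)
```

If the cleaning rules change later, only `clean_item` needs editing and the nested loops can stay untouched.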


# How does that apply to this?

There is a notion of one-to-many in databases, which is actually quite a common feature in data.  For example, a single book may have many authors.  A class has many students.  A faculty member has many affiliations.  And so on.  XML is quite good at representing these relationships because it can nest things.  So let's break out some actual XML.  

``` XML
<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>
```

This is a pretty clear case.  There's one book and one author for that book.  In this case, something like `"//book/author/text()"` would be sufficient for tracking down that author's name.

In [14]:
from lxml import etree

xml = """<book>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>"""

tree = etree.fromstring(xml)
print(tree.xpath('//book/author/text()'))


['Human, A.']


And exactly that!  **But** let's take a closer look at what just happened with this result.  Note that I didn't just get a string of the thing that I wanted.  I got a list with a single string within it.  This can tell you that the `xpath()` function is well prepared for getting multiple results.  
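
Since every result comes back as a list, it can help to decide up front what should happen with zero, one, or many items.  Here is a minimal sketch, assuming a hypothetical helper named `single_value` (it is not part of lxml) that takes the kind of list that `xpath()` hands back.

```python
def single_value(results, default=None):
    """Unwrap a list that should hold at most one item.

    `results` stands in for the list that an xpath() call returns.
    This helper is hypothetical; it is not part of lxml.
    """
    if len(results) == 0:
        return default  # nothing matched
    if len(results) > 1:
        raise ValueError("expected one value, found {}".format(len(results)))
    return results[0]  # exactly one match


print(single_value(['Human, A.']))          # one match: the bare string
print(single_value([], default='unknown'))  # no match: the fallback
```

Raising an error on multiple matches is one possible choice; depending on your data you might prefer to skip the record or keep every value.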

The fact that my data may have multiple values for these items means that I need to completely change my approach for getting this data out.  Last week we used SQL, which would print a table of results back to us.  We, as humans, were planning on processing that. We didn't have to care.  The functions we wanted to apply to each column were ready to handle instances of zero, one, or many results.  SQL just handles it.

But this is a different world, where we need to write lower level code.  So you, as the programmer, need to deal with that kind of thing.  Let's practice a primary and secondary loop pattern over this sort of returned data.

In [15]:
xml = ["""<book>
    <book_id>42</book_id>
    <author>Human, A.</author>
    <title>This is not a book</title>
</book>""",

"""<book>
    <book_id>23</book_id>
    <author>Human, A.</author>
    <author>Human, Not.</author>
    <title>This is not a book</title>
</book>"""]

# so we've got multiple chunks of xml here
# we know that book_id will only happen once (because I'm saying so here)
# but we may have multiple authors

for data_chunk in xml:
    tree = etree.fromstring(data_chunk)
    author_list = tree.xpath('//book/author/text()')
    book_id_list = tree.xpath('//book/book_id/text()')
    if len(book_id_list) != 1:
        print("Expected exactly one book_id value, skipping")
        continue
    else:
        book_id = book_id_list[0]
    for author in author_list:
        print(book_id, author)

42 Human, A.
23 Human, A.
23 Human, Not.


This looks pretty good because the values that I'm getting back are all strings.  All those lists are gone.  This means that the results I'm spitting out from these loops are primed and ready to be written out to a file.  No further processing needed.

Also note that it worked just fine when I had a list of one item.  

Had we not used this primary/secondary pattern, we would have ended up with this:

In [16]:
for data_chunk in xml:
    tree = etree.fromstring(data_chunk)
    author_list = tree.xpath('//book/author/text()')
    book_id_list = tree.xpath('//book/book_id/text()')
    if len(book_id_list) != 1:
        print("Expected exactly one book_id value, skipping")
        continue
    else:
        book_id = book_id_list[0]
    print(book_id, author_list)

42 ['Human, A.']
23 ['Human, A.', 'Human, Not.']


I could make this work by running a join on those lists:

In [18]:
for data_chunk in xml:
    tree = etree.fromstring(data_chunk)
    author_list = tree.xpath('//book/author/text()')
    book_id_list = tree.xpath('//book/book_id/text()')
    if len(book_id_list) != 1:
        print("Expected exactly one book_id value, skipping")
        continue
    else:
        book_id = book_id_list[0]
    print(book_id, ";".join(author_list)) # look here for the change

42 Human, A.
23 Human, A.;Human, Not.


Depending on your data design you may want:

1. To have any multiple values represented in separate rows
    * Then you'd need to use the primary/secondary loop pattern
2. To have any multiple values sit together in a single cell
    * Then you can use the `"delim".join(stuff)` pattern
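
To see how the two designs differ once they land in a file, here is a minimal sketch using the standard library's `csv` module.  The `records` list is hand-written to look like the book data above, the function names are invented, and it writes to an in-memory buffer rather than a real file so you can inspect the text directly.

```python
import csv
import io

# (book_id, author_list) pairs shaped like the XPath results above
records = [("42", ["Human, A."]), ("23", ["Human, A.", "Human, Not."])]


def rows_one_per_author(records):
    """Option 1: primary/secondary loops, one row per author."""
    rows = []
    for book_id, authors in records:
        for author in authors:
            rows.append([book_id, author])
    return rows


def rows_one_per_book(records, delim=";"):
    """Option 2: join multiple authors into a single cell."""
    return [[book_id, delim.join(authors)] for book_id, authors in records]


def to_csv_text(header, rows):
    """Write rows to an in-memory buffer and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


long_csv = to_csv_text(["book_id", "author"], rows_one_per_author(records))
wide_csv = to_csv_text(["book_id", "authors"], rows_one_per_book(records))
print(long_csv)
print(wide_csv)
```

Because the author names contain commas, the `csv` writer wraps those cells in quotes for you, which is one reason to use it instead of joining strings by hand.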

# In Conclusion...

So now that we have a basis of strategy and tools, we can explore more about XPath itself in our next lesson.  Week 14 is up next.