# XPath B

Now that you have gotten a bit more comfortable with XPath queries, we're going to explore how we can use a tool inside of Python to execute those queries.  There are many XPath tools out there, and there will always be differences in how individual parsers work.  You will have to test things out and adapt accordingly.

For the purposes of our lesson here, we're going to use the XPath function within the lxml module.  This module has worked well and consistently over the years, but you may find that other packages work similarly.

I'm going to set this all up for you, so you can use this pattern without thinking too much about it.  But I will try to explain some of it as we go.

# The basic pattern

The basic pattern that we will be exploring here is this:

1. Read in the document
2. Parse that IO object into a tree object.
3. Apply your desired XPath queries to that tree object.

For this class, there is only one way you'll need to do steps 1 and 2, but in general there are multiple ways of handling them.  For example, you may need to loop over multiple files rather than just one, or you may need to clean up mangled XML data before parsing it.  However, the 3 steps above will always hold true.

Our pattern for class will involve:

1. Read the XML file with `.read()` to read in the text as a big string.
2. Pass that string to the lxml tool to parse it into a tree object. (Don't forget that you'll have to import this module.)
3. Use the `.xpath()` function on that tree object.

## Step 0: import the lxml module

``` python
from lxml import etree
```

## Step 1: read in the file

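The `rb` (read bytes) mode is required because of encoding issues: lxml refuses to parse an already-decoded text string that carries an XML encoding declaration, so we hand it the raw bytes and let the parser sort out the encoding.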

``` python
infile =  open('YOURFILENAME.xml', 'rb')
xml = infile.read()
infile.close()
```
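If you'd like to see the `rb` requirement in action, here is a minimal sketch.  The sample document and the temp file are stand-ins I made up so the code runs anywhere, and it uses a `with` block as an alternative to calling `.close()` yourself.  It is one way to do step 1, not the only way.

``` python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for your XML file, written to a temp folder
# so this sketch can run without any course files.
sample = '<?xml version="1.0" encoding="UTF-8"?>\n<play><title>Hamlet</title></play>'
path = os.path.join(tempfile.mkdtemp(), 'sample.xml')
with open(path, 'w', encoding='utf-8') as outfile:
    outfile.write(sample)

# 'rb' hands lxml the raw bytes; the file is closed when the block ends.
with open(path, 'rb') as infile:
    xml = infile.read()

root = etree.fromstring(xml)
print(root.tag)

# The same content as a decoded str (with its encoding declaration) is refused.
try:
    etree.fromstring(sample)
    str_error = None
except ValueError as err:
    str_error = err
print(str_error)
```

The first print shows the root element's name, and the second shows the error message lxml gives when it is handed text instead of bytes.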

## Step 2: parse into a tree object

I could name this variable anything that I want, but we usually use `tree` as a convention.  This is using the `fromstring()` function within lxml, which parses XML that is already held in memory (here, the bytes we just read in).

``` python 
tree = etree.fromstring(xml)
```

## Step 3: use `.xpath()` on the tree object to execute an xpath query

Example when there is no namespace involved:

``` python
results = tree.xpath('//elementwhatever/text()')
```

Example when there is a namespace to handle:

``` python
results = tree.xpath('//alias:elementwhatever/text()', namespaces={'alias': "URL found in the document goes here"})
```
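To make the difference between those two patterns concrete, here is a small sketch on two toy documents.  The element names and the `http://example.org/ns` URL are invented for illustration and have nothing to do with our class data.

``` python
from lxml import etree

# Toy documents invented for illustration (not from the lesson's data).
plain = etree.fromstring(b'<doc><item>one</item><item>two</item></doc>')
spaced = etree.fromstring(
    b'<doc xmlns="http://example.org/ns"><item>one</item><item>two</item></doc>'
)
ns_demo = {'ex': 'http://example.org/ns'}

no_ns = plain.xpath('//item/text()')    # no namespace: plain names work
missed = spaced.xpath('//item/text()')  # namespace ignored: nothing matches
found = spaced.xpath('//ex:item/text()', namespaces=ns_demo)
print(no_ns, missed, found)
```

Notice that the query without the alias doesn't fail on the namespaced document.  It just quietly comes back empty, which is the most common symptom of a forgotten namespace.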

Don't worry, this entire lesson is about unpacking more about step 3.

# Our data source

We'll be using an XML document of "Hamlet" by Shakespeare, located in the hamlet-tei.xml file.  This is a proper XML file that uses the TEI schema (https://en.wikipedia.org/wiki/Text_Encoding_Initiative).  You will want to read that page now so you can understand the basics of what's going on in this file.

The data file has its own attribution, but I grabbed it from the materials for this workshop: http://tei.it.ox.ac.uk/Talks/2015-08-maynooth

Take some time exploring this file.  There are several chunks (this is an overly brief, true-to-this-file explanation, and not meant to be a primer on TEI):

* In `teiHeader`:
    * `fileDesc` node contains information about the provenance of the file and content.
    * `profileDesc/particDesc` node contains information on the characters in the play
    * `profileDesc/settingDesc` node contains setting information for the play
* In `text`:
    * this contains nodes for each act, scene, and passage.
    * each passage is in `sp` elements, with `@who` representing the standardized ID for each speaker.  The `speaker` reports out what the original text had for the speaker information, and the `l` elements have the individual lines.
    
There are other details that you will need to explore on your own.

For now, we're going to go ahead and read in our file.  You'll only need to do this once at the top of your script.

``` python
from lxml import etree

infile =  open('hamlet-tei.xml', 'rb')
xml = infile.read()
infile.close()

tree = etree.fromstring(xml)
```

# Namespaces

Most proper XML files have namespaces that you'll need to navigate.  As this is not a metadata or TEI course, I will not provide an extended discussion of what they are.

We can see this in line 4 of the document, which holds the root element:  `<TEI xmlns="http://www.tei-c.org/ns/1.0">`

That is saying that the elements found in this root node belong to the TEI schema, and the URL is the identifier for that schema's namespace.  This information is for the parsers.

You'll see this URL pop up again when we dig in.  The patterns from step 3 show you where to put this, but keep on reading.

We'll be talking about that namespace a ton, so the canonical pattern is to save that namespace dictionary as a variable.  We can save this now so we can reference it elsewhere.


``` python
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
```
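If you ever need to double check which namespace URL a document declares, lxml elements carry an `.nsmap` attribute you can inspect.  Here is a small sketch on a one-line stand-in for our root element (the `tei_snippet` and `ns_from_root` names are mine, just for illustration):

``` python
from lxml import etree

# One-line stand-in for the root element of a TEI file.
tei_snippet = b'<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'
root = etree.fromstring(tei_snippet)

print(root.nsmap)  # the default namespace is stored under the key None
print(root.tag)    # lxml writes the namespace in braces before the name

# XPath needs a real prefix, so we attach our own alias to that URL.
ns_from_root = {'tei': root.nsmap[None]}
print(len(root.xpath('//tei:text', namespaces=ns_from_root)))
```

The alias itself is arbitrary.  What has to match the document is the URL, character for character.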

# Evaluating an extraction query to get a single result

Remember that all your previous queries needed to end with an extraction step.  This was likely either `/text()` to get the text of the element out, or `@attribute` to get some attribute text out.

For example, `//a` would select all the `a` element nodes, but not yield the contents.  But `//a/text()` would give you the hyperlink text, and `//a/@href` would give you all the URLs for the hyperlinks.

As a start, we're going to run a query that will extract out a single result.

We're going to look up the standard name of Hamlet from his character data node.

The xpath that we would want to use is `//person[@xml:id = "F-ham-ham"]/persName[@type = "standard"]/text()`, but we need to adapt this to our namespace.  Look back up to our `ns`.  We're giving our TEI schema an alias of `tei`, which means we need to provide this before each element name that we are referencing.  IMPORTANT! You only need to do this for element names, not for attribute values, content, or XPath functions.  (The `xml:` in `@xml:id` is a prefix that is built into XML itself, so it stays exactly as written.)

So now our XPath query will be:

`//tei:person[@xml:id = "F-ham-ham"]/tei:persName[@type = "standard"]/text()`

Let's put this together.



``` python
print(tree.xpath('//tei:person[@xml:id = "F-ham-ham"]/tei:persName[@type = "standard"]/text()', namespaces = ns))
```

```
['Hamlet, son of the former king and nephew to the\n                            present king']
```


Things to note: 

* I am using my alias here only for the elements, and that alias name matches what I have declared in my `ns` object.
* I have `namespaces = ns`, which will need to be in **each and every xpath query you run for this assignment**.
* My xpath query is just a string.
* I've used double quotes in my xpath query, which means that I need to use single quotes to surround the string.
* My results are coming back as a list with one element.  I know and expect there to be just a single result, but the results will always come back to you as a list.
* That extra whitespace in the text is from the newline and indentation in the XML file itself.
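Since the results always come back as a list, a common follow-up is to pull the single value out and tidy its whitespace.  Here is a minimal sketch on a toy stand-in for a person entry.  The `p1` id, the shortened name text, and the `demo_tree` variable are invented so that the example runs by itself.

``` python
from lxml import etree

# Toy stand-in for one TEI person entry; the id and text are invented.
snippet = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="p1">
    <persName type="standard">Hamlet, son of the
        former king</persName>
  </person>
</TEI>'''
demo_tree = etree.fromstring(snippet)
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

matches = demo_tree.xpath(
    '//tei:person[@xml:id = "p1"]/tei:persName[@type = "standard"]/text()',
    namespaces=ns,
)

# Guard against an empty list before indexing, then collapse the whitespace.
name = " ".join(matches[0].split()) if matches else None
print(name)
```

The `if matches else None` guard is a design choice: indexing `[0]` on an empty list raises an `IndexError`, so checking first keeps a query that finds nothing from crashing your script.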

# Query to extract many results

Let's adapt our previous query to find all the standard names for these characters.  We don't need to do much, because XPath is already taking care of the hard work for us.  We need to take out the `[@xml:id = "F-ham-ham"]` predicate, which will allow it to select all the person nodes.

``` python
results = tree.xpath('//tei:person/tei:persName[@type = "standard"]/text()', namespaces = ns)
print(results)
```

```
['First Player', 'All', 'Ambassador', 'Player Prologue', 'Player Queen', 'Bernardo, sentinel', 'Norwegian Captain', 'First Clown', 'Fortinbras, Prince of ', 'Francisco, a soldier', 'Gentleman, courtier', 'Gentlemen', "Father's Ghost, Ghost of Hamlet's\n                            Father", 'Guildenstern, courtier', 'Hamlet, son of the former king and nephew to the\n                            present king', 'Horatio, friend to Hamlet', 'Claudius, King of Denmark', 'Laertes, son to Polonius', 'Lucianus', 'Marcellus, Officer', 'Messenger', 'Ophelia, daughter to Polonius', 'Osric, courtier', 'Second Clown', 'Polonius, Lord Chamberlain', 'Player King', 'Priest', 'Gertrude, Queen of Denmark and mother to\n                            Hamlet', 'Rosencrantz, courtier', 'Reynaldo, servant to Polonius', 'Sailor', 'Servant', 'Voltemand, courtier']
```


Now we have a list of results to play with!

How many characters have standard names?

``` python
print(len(results))
```

```
33
```


Loop through the names and normalize the spaces.

``` python
for name in results:
    print(" ".join(name.split()))
```

```
First Player
All
Ambassador
Player Prologue
Player Queen
Bernardo, sentinel
Norwegian Captain
First Clown
Fortinbras, Prince of
Francisco, a soldier
Gentleman, courtier
Gentlemen
Father's Ghost, Ghost of Hamlet's Father
Guildenstern, courtier
Hamlet, son of the former king and nephew to the present king
Horatio, friend to Hamlet
Claudius, King of Denmark
Laertes, son to Polonius
Lucianus
Marcellus, Officer
Messenger
Ophelia, daughter to Polonius
Osric, courtier
Second Clown
Polonius, Lord Chamberlain
Player King
Priest
Gertrude, Queen of Denmark and mother to Hamlet
Rosencrantz, courtier
Reynaldo, servant to Polonius
Sailor
Servant
Voltemand, courtier
```
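Printing is fine for a quick look, but if you want to keep the cleaned-up names around for later steps, a list comprehension is one way to do it.  This sketch uses three raw values copied from the results above so that it runs without the data file.  The `raw_names` and `clean_names` variables are my own names for illustration.

``` python
# A few raw values copied from the results printed above.
raw_names = [
    'First Player',
    'Fortinbras, Prince of ',
    "Father's Ghost, Ghost of Hamlet's\n                            Father",
]

# split() with no arguments breaks on any run of whitespace (spaces,
# newlines, tabs), and join() glues the pieces back with single spaces.
clean_names = [" ".join(name.split()) for name in raw_names]
print(clean_names)
```

This works because `.split()` also drops leading and trailing whitespace, which is why the stray space at the end of the Fortinbras entry disappears.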


# Profiling structures

You can't be an expert in all schemas, so sometimes you need to use some tools in Python to profile the data that you are working with.

We can look inside the Hamlet person node and see that there are 4 reported variations:

``` XML
<persName type="form">Ha.</persName>
<persName type="form">Ham.</persName>
<persName type="form">Hamlet.</persName>
<persName type="form">Hem.</persName>
```

But can we confirm that this really is the case?  Alternatively, what if we were the ones writing this data?  Also, we don't systematically know which of these were commonly used.

Let's write a query that finds all the speaker representations of Hamlet, and then run the results through a counter.

Here's our xpath to find all of Hamlet's passages:

`//tei:sp[@who = "#F-ham-ham"]`

Now find all the speaker elements in there.

`//tei:sp[@who = "#F-ham-ham"]/tei:speaker`

Now get all that text out!

`//tei:sp[@who = "#F-ham-ham"]/tei:speaker/text()`

``` python
results = tree.xpath('//tei:sp[@who = "#F-ham-ham"]/tei:speaker/text()', namespaces = ns)
print(results)
print(len(results))
```

['Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 'Ha.', 'Ham.', 'Ham.', 'Ham.', 'Ham.', 

There are 340 of these, which would be super annoying to count by hand.

``` python
from collections import Counter

print(Counter(results))
```

```
Counter({'Ham.': 336, 'Ha.': 2, 'Hamlet.': 1, 'Hem.': 1})
```
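A `Counter` can also do a bit of reporting for you.  As a sketch, this rebuilds the counts shown above by hand (so it runs without the data file) and prints each form with its share of the total.  The `speaker_counts` name and the output layout are my own choices.

``` python
from collections import Counter

# Rebuild the counts reported above rather than re-running the query.
speaker_counts = Counter({'Ham.': 336, 'Ha.': 2, 'Hamlet.': 1, 'Hem.': 1})

total = sum(speaker_counts.values())

# most_common() returns (value, count) pairs, largest count first.
for form, count in speaker_counts.most_common():
    print(f'{form:<8} {count:>4} {count / total:6.1%}')
```

In your own script you would skip the hand-built counter and use `Counter(results)` directly.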


So now we know more about this data!  And in just a few lines of code.

# Selecting nodes

Up to now, we've been focusing on the extraction of data.  However, this tool is much more powerful than that.  As we've discussed with other data structures in the past, sometimes it can be really valuable to isolate the specific data granularity that you want.  Once you have those chunks isolated, you can drill down into them to get out information that you want.  We can do the same thing here.

The value of being able to select just a node (instead of extracting information out of it) is that you can save that node object as a variable and apply xpath queries directly to it.  Yes, we could always include that information in our original xpath if we only wanted a single value.

However, once we have isolated a node, we can run as many xpath queries as we want on that node.  And this is why it is powerful.

Some of the examples that we will be going through below could also be done with xpath functions, but those aren't always consistently supported inside these packages.  Also, this lesson is meant to highlight bringing data into Python.

So with that said, let's explore this.

You can easily select just the nodes for your query by omitting the extraction chunk of your query.
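As a rough sketch of that idea, here is a toy document with two passages.  The speakers, ids, and lines are all invented, and `demo_tree` is a stand-in for our real `tree`.  We select the `sp` nodes first, then run several small queries against each one.

``` python
from lxml import etree

# Toy stand-in for a TEI text with two passages; all content is invented.
snippet = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#a"><speaker>A.</speaker><l>First line</l><l>Second line</l></sp>
  <sp who="#b"><speaker>B.</speaker><l>Only line</l></sp>
</TEI>'''
demo_tree = etree.fromstring(snippet)
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# No /text() or @attribute at the end, so we get element nodes back.
passages = demo_tree.xpath('//tei:sp', namespaces=ns)

summary = []
for sp in passages:
    # Relative paths (no leading //) start from this one sp node.
    who = sp.xpath('@who')[0]
    speaker = sp.xpath('tei:speaker/text()', namespaces=ns)[0]
    line_count = len(sp.xpath('tei:l', namespaces=ns))
    summary.append((who, speaker, line_count))
print(summary)
```

One thing to watch for: a query that starts with `//` searches the whole document even when you call it on a single node, so keep the queries inside the loop relative.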