# [MEDST-250] Text Analysis for Medievalists
In this Jupyter notebook, we will see how to parse and analyze text stored in an XML file, using a Python module that makes XML files easy to work through.

### Topics Covered
- Using the `xml` module
- Web scraping
- Parsing a corpus of XML documents

### Table of Contents

[The Data](#data)<br>


1 - [Section 1: Parsing XML using ElementTree](#section 1)<br>

2 - [Section 2: Web scraping crash course!](#section 2)<br>

3 - [Section 3: Parsing a corpus of XML documents](#section 3)<br>

In [None]:
import xml.etree.ElementTree as ET  # XML parser (cElementTree was removed in Python 3.9)
import glob  # navigate file system
import requests  # make requests to web servers
import time  # will help us pause python's for loop

---

## The Data <a id='data'></a>
In this notebook, you'll be working with an XML file of *Piers Plowman*. XML is a markup format that wraps content in nested tags and subtags, so each piece of text is labeled by the element that contains it. We will use this file to learn how to parse XML-encoded texts in general. The texts come from http://piers.chass.ncsu.edu/texts.html. From here on, our test file is `ppexample.xml` in the `data` folder.

---



## Section 1: Parsing XML File using `ElementTree` Module  <a id='section 1'></a>

First we need to load the XML file into an `ElementTree` instance. This builds a tree of every tag and its subtags in the XML file, which we can then analyze further.

In [None]:
xml_file = 'data/ppexample.xml'
tree = ET.ElementTree(file=xml_file)
tree
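
For a quick experiment without the data file, the same kind of tree can be built from a string. This is a minimal sketch, assuming a made-up TEI-like snippet: the contents of `sample` are invented for illustration and are not taken from `ppexample.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, invented for illustration only
sample = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><l>first line</l><l>second line</l></body></text>
</TEI>"""

# fromstring parses a string and returns the root Element directly
sample_root = ET.fromstring(sample)

# Wrap the root in an ElementTree if tree-level methods are needed
sample_tree = ET.ElementTree(sample_root)

print(sample_root.tag)
print([child.tag for child in sample_root])
```

Working on a small string like this can make it easier to test a path before trying it on the full file.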

When we talk about tree file structures, we often want to work from the root into specific nodes. With XML we might want to know the top level, or `root`.

In [None]:
root = tree.getroot()
print(tree.getroot())

Great, so now we know how to get the root, but it's just an object right now and we can't see much else.

We can then find all the child tags of the root and `print` them out to see what is there. An XML file has a root element that has children, and each of these children has a tag, a set of attributes, and possibly some associated text.

In [None]:
for child in root:
    print(child.tag)
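
To make the three parts of an element concrete, here is a small sketch on a single element. The `<note>` below is hypothetical and only stands in for the kind of element found in a TEI file.

```python
import xml.etree.ElementTree as ET

# Hypothetical element, invented for illustration
note = ET.fromstring('<note type="textual" n="4">A scribal slip.</note>')

print(note.tag)     # the element's name
print(note.attrib)  # a dict of the element's attributes
print(note.text)    # the text directly inside the element
```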

Still not super helpful, but let's start to build out our hierarchical understanding of XML by traversing through the children and building out a path:

In [None]:
teiHeader = root.find('teiHeader')
for child in teiHeader:
    print(child.tag)

In [None]:
fileDesc = root.find('teiHeader/fileDesc')
for child in fileDesc:
    print(child.tag)

In [None]:
titleStmt = root.find('teiHeader/fileDesc/titleStmt')
for child in titleStmt:
    print(child.tag)

In [None]:
# this path points at the <title> element, so name the variable accordingly
title = root.find('teiHeader/fileDesc/titleStmt/title')
for child in title:
    print(child.tag)

Looks like we may have hit the bottom! Since we've hit a terminal element there's likely text here, so we can just `find` the path and ask for the `.text` property:

In [None]:
root.find('teiHeader/fileDesc/titleStmt/title').text
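
One thing to watch for: `find` returns `None` when a path matches nothing, so asking for `.text` on the result raises an error. A possible way to guard against that is sketched below. The helper name `text_at` and the sample document are assumptions made for this example, not part of the notebook's data.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for illustration
doc = ET.fromstring(
    '<TEI><teiHeader><fileDesc><titleStmt>'
    '<title>Sample title</title>'
    '</titleStmt></fileDesc></teiHeader></TEI>'
)

def text_at(element, path, default=''):
    """Return the text at `path`, or `default` if the path or its text is missing."""
    found = element.find(path)
    if found is None or found.text is None:
        return default
    return found.text

print(text_at(doc, 'teiHeader/fileDesc/titleStmt/title'))
print(text_at(doc, 'teiHeader/fileDesc/titleStmt/author', default='(missing)'))
```

The built-in `findtext` method covers the missing-path case in a similar way, if a separate helper feels like too much.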

Looks like we have [this](http://piers.chass.ncsu.edu/texts/F/4?tags=off&view=all) file. Let's take a closer look at that page.

---

We can navigate around these tags now that we have a better visual understanding:

In [None]:
publicationStmt = root.find('teiHeader/fileDesc/publicationStmt')
for child in publicationStmt:
    print(child.tag)

The `itertext` method returns an iterator over the text of an element and all of its children, which we can turn into a `list` of `str`ings:

In [None]:
list(publicationStmt.itertext())
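
The difference between `.text` and `itertext()` matters most when an element mixes text with child tags. The sketch below uses a hypothetical verse line (the `<l>` and `<hi>` tags are assumed here purely for illustration) to show where each fragment lives.

```python
import xml.etree.ElementTree as ET

# Hypothetical line with a child element in the middle of the text
line = ET.fromstring('<l>In a <hi>somer</hi> seson</l>')

print(repr(line.text))        # only the text before the first child
print(repr(line[0].text))     # the text inside the child
print(repr(line[0].tail))     # the text that follows the child
print(list(line.itertext()))  # every fragment, in document order
```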

If we just wanted one element, we can ask with the direct path like we did above:

In [None]:
root.find('teiHeader/fileDesc/publicationStmt/publisher').text

This is where the `strip` method can come in handy. We don't need that extra whitespace:

In [None]:
root.find('teiHeader/fileDesc/publicationStmt/publisher').text.strip()

Great! But while we love metadata, we're here for the text!

## Challenge

Take some time to play with the tags that exist in the `text/` path and try to write some direct paths to pull out the information.

***WARNING***: This will be difficult because the text of the tree is highly fragmented.

In [None]:
## YOUR CODE HERE


---

Let's look at another way to iterate through content.

We can iterate through all the elements in the XML file by creating an iterator (something that goes through all parts of a file) and iterating through all possible elements/their tags.

We'll also print out the `attrib`, which sometimes gives us more information:

In [None]:
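# getiterator() was removed in Python 3.9; iter() replaces it.
# Collect the elements into a list so we can loop over them more than once.
iter_ = list(tree.iter())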
for elem in iter_:
    print(elem.tag, elem.attrib)

We can now see that many tags share the same name. We can look for specific types of tags, such as dates, and read the `text` property of each matching element. We can use conditionals to select what we're interested in:


In [None]:
for elem in iter_:
    if elem.tag == "date":
        print(elem.text)
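
As an alternative to the conditional, `iter` accepts a tag name and yields only the matching elements. A short sketch, assuming a made-up document with two `<date>` elements at different depths:

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for illustration
doc = ET.fromstring(
    '<TEI><teiHeader><date>1999</date></teiHeader>'
    '<text><note>see <date>2004</date></note></text></TEI>'
)

# iter('date') walks the whole tree but yields only <date> elements
dates = [elem.text for elem in doc.iter('date')]
print(dates)
```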

If we want to use `itertext()` but make it one string, we can `join` the resulting `list`:

In [None]:
for elem in iter_:
    if elem.tag == "foreign":
        print(elem.attrib)
        print(''.join(list(elem.itertext())))

In [None]:
for elem in iter_:
    if elem.tag == "note":
        print(elem.attrib)
        print(''.join(list(elem.itertext())))
        print()

Say we only wanted the `textual` notes and don't care about the `lexical` ones. We can run the same code as above but filter with another conditional:

In [None]:
for elem in iter_:
    # .get returns None instead of raising KeyError when a note has no type
    if elem.tag == "note" and elem.attrib.get('type') == 'textual':
        print(''.join(list(elem.itertext())))
        print(elem.attrib)
        print()
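
The same filter can also be written as a path expression, since `findall` supports a limited XPath syntax that includes attribute tests. The sketch below runs on a hypothetical set of notes invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical notes, invented for illustration
doc = ET.fromstring(
    '<text>'
    '<note type="textual">Erased letter.</note>'
    '<note type="lexical">A gloss.</note>'
    '<note>No type given.</note>'
    '</text>'
)

# .// searches at any depth; [@type='textual'] keeps only matching attributes
textual = doc.findall(".//note[@type='textual']")
textual_notes = [''.join(n.itertext()) for n in textual]
print(textual_notes)
```

Notes without a `type` attribute are simply skipped by the path expression.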

## Challenge

Write some code below that extracts all the corrections from the document:

In [None]:
## YOUR CODE HERE


---

## Section 2: Web scraping crash course! <a id='section 2'></a>

So now we know how to parse *1* XML document. But what if we had a whole corpus? Like all of the *passi* from each manuscript? Well, [the archive has them](http://piers.chass.ncsu.edu/texts.html)! Unfortunately, that's over 100 downloads. It'd be great if we could automate that downloading.

Guess what? ***WE CAN, YAY***

Web scraping, in simple terms, is when you write a script to automate the collection of data from the internet. We have a simple example here because all we need to do is change the URL.

Note the [URL for our example above](http://piers.chass.ncsu.edu/texts/F/4/xml):

`http://piers.chass.ncsu.edu/texts/F/4/xml`

You see the end of that URL? The `F` is just the manuscript name. The `4` is *passus* 4. Since we know *Piers Plowman* has 20 *passi* we can just manipulate that string to download all of the XML files.

Let's first focus on just generating all the strings for one manuscript:

In [None]:
base_url = 'http://piers.chass.ncsu.edu/texts/F/1/xml'

## YOUR CODE HERE

Now let's add in one more `for` loop to do this for each manuscript:

In [None]:
mss = ['F', 'G', 'Hm', 'L', 'M', 'O', 'R', 'W']

## YOUR CODE HERE


Before we actually start scraping, we need to know how `requests` works. The `requests` library sends a request to a web server and gives you back the response, including the raw text the server returned. In this case, these URLs return XML. Let's see how it works with one:

In [None]:
requests.get('http://piers.chass.ncsu.edu/texts/F/4/xml').text
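
A request can fail (a missing page, a server error), in which case `.text` holds an error page rather than XML. One possible safeguard is to check the status before using the text. In this sketch, `fetch_text` and its `get` parameter are hypothetical: `get` is a hook for passing in `requests.get`, so the function itself can be tried without a network connection.

```python
def fetch_text(url, get, timeout=10):
    """Fetch `url` with `get` and return the body text, raising on HTTP errors."""
    # timeout stops the call from hanging forever if the server never answers
    response = get(url, timeout=timeout)
    # raise_for_status raises an exception for 4xx and 5xx responses
    response.raise_for_status()
    return response.text

# In the notebook this could be called as:
# xml_data = fetch_text('http://piers.chass.ncsu.edu/texts/F/4/xml', requests.get)
```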

There it is! Then we can just save that to a file:

In [None]:
xml_data = requests.get('http://piers.chass.ncsu.edu/texts/F/4/xml').text

# write as UTF-8 so Middle English characters survive on every platform
with open('passus_4.xml', 'w', encoding='utf-8') as f:
    f.write(xml_data)

Done!

We can then put it all together:

In [None]:
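import os  # create the output folder

# all manuscript names
mss = ['F', 'G', 'Hm', 'L', 'M', 'O', 'R', 'W']

# make sure the output folder exists before writing into it
os.makedirs('data/passus', exist_ok=True)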

# iterate through manuscripts
for ms in mss:
    
    # iterate through passi
    for i in range(1,21):
        
        # build url
        url = 'http://piers.chass.ncsu.edu/texts/{}/{}/xml'.format(ms, str(i))
        print(url)
        
        # get the response
        res = requests.get(url).text
        
        #create a file name
        fname = 'data/passus/passus-' + ms + '-' + str(i) + '.xml'
        
        # save the file
        with open(fname, 'w', encoding='utf-8') as f:
            f.write(res)

        # pause for a second so we don't overload their servers
        time.sleep(1)
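
Before running a long scrape, it can help to build the full list of URLs and file names first and look it over. The sketch below does only that planning step and makes no requests. The function name `plan_downloads` and the `n_passus` parameter are assumptions for this example; the URL pattern and file naming follow the cell above.

```python
from pathlib import Path

def plan_downloads(manuscripts, n_passus, out_dir='data/passus'):
    """Return (url, path) pairs for every manuscript and passus number."""
    plan = []
    for ms in manuscripts:
        for i in range(1, n_passus + 1):
            url = 'http://piers.chass.ncsu.edu/texts/{}/{}/xml'.format(ms, i)
            path = Path(out_dir) / 'passus-{}-{}.xml'.format(ms, i)
            plan.append((url, path))
    return plan

# Small example: two manuscripts, two passi each
plan = plan_downloads(['F', 'G'], 2)
for url, path in plan:
    print(url, '->', path)
```

Filtering the plan with `path.exists()` before requesting would let an interrupted run resume without downloading the same file twice.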

Let's check what we've got:

In [None]:
!ls data/passus

In [None]:
!cat data/passus/passus-F-1.xml

---

## Section 3: Parsing an XML corpus <a id='section 3'></a>

Great, now we have a massive corpus! Now we could start comparing across the corpus:

In [None]:
# to hold our collected data
manu_data = {'F': [], 
             'G': [],
             'Hm': [],
             'L': [],
             'M': [],
             'O': [],
             'R': [],
             'W': []}

# iterate through XML files
for fname in glob.glob('data/passus/*'):

    # get manuscript name and passus number
    manuscript = fname.split('-')[1]
    passus = fname.split('-')[2].split('.')[0]
    
    print(fname)
    print(manuscript)
    print(passus)
    print()
    
    # parse XML file
    xml_file = fname
    tree = ET.ElementTree(file=xml_file)
    root = tree.getroot()
    
    print(list(root.find('teiHeader/fileDesc/titleStmt/title').itertext()))
    
    ## YOUR CODE HERE
    
    
    
    
    
    
    
    
    
    print('='*10)
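
`glob` returns files in no guaranteed order, and the passus number comes out of the file name as a string, so passus 10 would sort before passus 2. A possible way to handle both is sketched below. The helper `parse_passus_name` is hypothetical and assumes the `passus-<manuscript>-<number>.xml` naming used when the files were saved.

```python
from pathlib import Path

def parse_passus_name(fname):
    """Split a name like 'passus-Hm-12.xml' into ('Hm', 12)."""
    # stem drops the folder and the .xml extension
    _, manuscript, number = Path(fname).stem.split('-')
    return manuscript, int(number)

# Example file names in the same shape as the downloaded corpus
names = [
    'data/passus/passus-F-10.xml',
    'data/passus/passus-F-2.xml',
    'data/passus/passus-Hm-1.xml',
]

# Sort by manuscript, then by passus number as an integer
ordered = sorted(names, key=parse_passus_name)
for name in ordered:
    print(parse_passus_name(name))
```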

## Bibliography

All work is adapted or taken from:
- Driscoll, Mike. (2013, April). Python 101 – Intro to XML Parsing with ElementTree. Retrieved from https://www.blog.pythonlibrary.org/2013/04/30/python-101-intro-to-xml-parsing-with-elementtree/

- Driscoll, Mike. (2010, November). Python: Parsing XML with lxml. Retrieved from https://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/


---
Notebook developed by: Tejas Priyadarshan

Data Science Modules: http://data.berkeley.edu/education/modules
