# [MEDST-250] Text Analysis for Medievalists
In this Jupyter notebook, we will see how to parse and analyze text from an XML file. We will look at two Python modules that make it easy to parse XML files: the ElementTree module and the lxml module. The commands in both APIs are quite similar because both work with the tree structure of XML (discussed later), so the choice between them is up to you.

### Topics Covered
- Using the ElementTree module (Method 1)
- Using the lxml module (Method 2)

### Table of Contents

[The Data](#data)<br>


1 - [Section 1: Parsing XML using ElementTree](#section 1)<br>

2 - [Section 2: Parsing XML using lxml](#section 2)<br>




**Dependencies:**

In [None]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from lxml import etree
# No need to worry about this cell: it simply imports the ElementTree and lxml modules, which allow us to parse the XML file.

---

## The Data <a id='data'></a>
In this notebook, you'll be working with an XML file of Piers Plowman. An XML file marks up content by enclosing it in tags and subtags. We will use this file to learn, in general, how to parse XML files of medieval texts. The texts come from http://piers.chass.ncsu.edu/texts.html. From now on, we will use an example in the data folder called "ppexample.xml" as our test file for parsing.
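
To make the idea of tags, subtags, attributes, and text concrete, here is a minimal sketch that parses a tiny XML string. The snippet is made up for illustration (the tag names `poem` and `line` are assumptions, not the structure of "ppexample.xml"), and it is parsed from a string rather than from a file.

```python
import xml.etree.ElementTree as ET

# A made-up snippet for illustration only; it is NOT taken from ppexample.xml.
sample = """
<poem title="Example">
  <line n="1">first line of text</line>
  <line n="2">second line of text</line>
</poem>
"""

# fromstring() parses XML held in a string and returns the root element.
root = ET.fromstring(sample.strip())
print(root.tag, root.attrib)  # the outermost tag and its attributes

# Looping over an element visits its direct subtags (children).
for line in root:
    print(line.tag, line.attrib, line.text)
```

Each element exposes its tag name as `.tag`, its attributes as the dictionary `.attrib`, and the text that directly follows its opening tag as `.text`.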

---



## Section 1: Parsing XML using ElementTree <a id='section 1'></a>

First we need to load the XML file into an ElementTree instance. This builds a tree of every tag and its subtags so that we can analyze the file further.

In [None]:
xml_file = 'data/ppexample.xml'
tree = ET.ElementTree(file=xml_file)
root = tree.getroot()
print(tree.getroot())

We then loop over the root's children and print their tags so we can see how the file is organized. An XML file has a root element that has children, and each child has a tag and attributes, as well as some associated text.

In [None]:
for child in root:
    print(child.tag, child.attrib)
    for step_child in child:
        print("Step Child Tag: " + step_child.tag)
        # Use a new name here so the outer loop variable `child` is not overwritten
        for grandchild in step_child:
            print("Children of this: " + grandchild.tag)

Now we can see each child tag and the step-child tags that come from it. This is where the "tree" hierarchy comes in: each tag has its own children that branch off of it.
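
The nested loops above stop at a fixed depth. As a hedged sketch of how the same idea extends to any depth, the hypothetical helper `outline` below calls itself on each child and indents by one level per step. The sample string and its tag names are invented for the example.

```python
import xml.etree.ElementTree as ET

def outline(elem, depth=0):
    """Return one indented line per element, with children nested under parents."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        # Recurse so that every level of the tree is visited, however deep it goes.
        lines.extend(outline(child, depth + 1))
    return lines

# Invented sample; the real file will have different tags.
sample = "<text><body><div><head/><l/><l/></div></body></text>"
root = ET.fromstring(sample)
print("\n".join(outline(root)))
```

To try this on the notebook's file, you could pass the `root` obtained from `tree.getroot()` to `outline` instead of the sample root.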

To fully grasp this idea, we can walk through every element in the XML file by creating an iterator (something that goes through all parts of a file) and printing each element's tag.

In [None]:
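# getiterator() was removed in Python 3.9; tree.iter() is its replacement.
# Wrapping it in list() lets us loop over the elements again in later cells.
iter_ = list(tree.iter())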
for elem in iter_:
    print("Element tag: " + elem.tag)

We now see that many tags share the same name. We can then look for specific kinds of tags, such as dates, and read the "text" attribute of each matching element.


In [None]:
for elem in iter_:
    if elem.tag == "date":
        print(elem.text)

This makes it easier to search for the information we are looking for, such as dates or names. Below is a sample function I've found that parses an XML file with ElementTree and prints out the tags inside it:

In [None]:
def parseXML(xml_file):
    """
    Parse XML with ElementTree
    """
    tree = ET.ElementTree(file=xml_file)
    root = tree.getroot()
    print(root)
    print("tag=%s, attrib=%s" % (root.tag, root.attrib))

    for child in root:
        print("child tag: " + child.tag, child.attrib)
        for step_child in child:
            print("step child tag: " + step_child.tag)

    # iterate over the entire tree
    print("-" * 40)
    print("Iterating using a tree iterator")
    print("-" * 40)
    # tree.iter() replaces getiterator(), which was removed in Python 3.9
    for elem in tree.iter():
        if elem.text is not None:
            print("Elem Tag: " + elem.tag, "Elem Text: " + elem.text)


parseXML('data/ppexample.xml')

Go ahead and play around with the file "data/ppexample.xml". This will let you see how XML files can be parsed and searched to find specific information.
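
One thing to experiment with is letting ElementTree do the tag filtering for you. The sketch below uses an invented sample string (not the contents of "ppexample.xml") to compare `iter()` with a tag name, `findall()` with a bare tag name, and `findall()` with a `.//` prefix.

```python
import xml.etree.ElementTree as ET

# Invented sample with <date> elements at two different depths.
sample = """<doc>
  <head><date>1999</date></head>
  <body><p>text <date>2005</date></p></body>
</doc>"""
root = ET.fromstring(sample)

# iter("date") visits every <date> element at any depth, in document order.
dates = [elem.text for elem in root.iter("date")]
print(dates)

# findall("date") only looks at direct children of root, so it finds none here.
direct = root.findall("date")
print(len(direct))

# A ".//" prefix makes findall search the whole subtree instead.
anywhere = [elem.text for elem in root.findall(".//date")]
print(anywhere)
```

If the tags printed from a real file start with a namespace in curly braces, the name passed to `iter()` or `findall()` has to include that prefix as well.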

---
<b>Question 1:</b> Create a list of all the labels and terms and the text associated with them. Labels and terms are XML tags that can be searched for.

<b>Answer: </b> Below


In [None]:
search = ["label", "term"]
for elem in iter_:
    if elem.text is not None and elem.tag in search:
        print("Elem Tag: " + elem.tag + "||Elem Text: " + elem.text)
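
The answer above prints each match. If you would rather keep the results for later use, one possible variation is to group the text by tag in a dictionary. This is only a sketch: the helper name `texts_by_tag` and the sample string are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def texts_by_tag(root, wanted):
    """Collect the non-empty text of every element whose tag is in `wanted`."""
    found = defaultdict(list)
    for elem in root.iter():
        # Skip elements with no text or with whitespace only.
        if elem.tag in wanted and elem.text and elem.text.strip():
            found[elem.tag].append(elem.text.strip())
    return dict(found)

# Invented sample; the real file's labels and terms will differ.
sample = """<list>
  <label>A</label><term>first term</term>
  <label>B</label><term>second term</term>
  <item>ignored</item>
</list>"""
result = texts_by_tag(ET.fromstring(sample), {"label", "term"})
print(result)
```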


---

## Section 2: Parsing XML using lxml <a id='section 2'></a>

Now we're going to look at another module, lxml, that also lets us parse XML files. It follows the same tree structure and parses much like ElementTree above, but it lets us write a more succinct function.

In [None]:
def parseBookXML(xmlFile):
    """Parse XML with lxml's iterparse and print each tag with its text."""
    context = etree.iterparse(xmlFile)
    book_dict = {}
    books = []
    # Go through the elements one by one and add them to a dictionary
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + " => " + text)
        book_dict[elem.tag] = text
        # We can search for certain tags; here each "book" tag closes one record
        if elem.tag == "book":
            books.append(book_dict)
            book_dict = {}
    return books
 
parseBookXML('data/ppexample.xml')
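
The function above relies on `iterparse`, which by default reports an element only when its closing tag is reached (an "end" event), so children are seen before their parents. The sketch below asks for "start" events as well to make that order visible. It uses an invented in-memory sample, and the fallback to the standard library when lxml is not installed is an assumption added for convenience; both modules accept the `events` argument used here.

```python
import io

try:
    from lxml import etree  # third-party; used if it is installed
except ImportError:
    # Assumption for this sketch: fall back to the standard library,
    # whose iterparse accepts the same `events` argument.
    import xml.etree.ElementTree as etree

# Invented sample, held in memory instead of read from a file.
sample = b"<root><a>one</a><b>two</b></root>"

events = []
for action, elem in etree.iterparse(io.BytesIO(sample), events=("start", "end")):
    # `action` is "start" when a tag opens and "end" when it closes.
    events.append((action, elem.tag))
print(events)
```

Because an element's text is only guaranteed to be complete at its "end" event, reading `elem.text` on "end" events, as the function above does, is the safer choice.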

---

## Bibliography

All work is adapted or taken from:
- Driscoll, Mike. (2013, April). Python 101 – Intro to XML Parsing with ElementTree. Retrieved from https://www.blog.pythonlibrary.org/2013/04/30/python-101-intro-to-xml-parsing-with-elementtree/

- Driscoll, Mike. (2010, November). Python: Parsing XML with lxml. Retrieved from https://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/


---
Notebook developed by: Tejas Priyadarshan

Data Science Modules: http://data.berkeley.edu/education/modules
