# Denison DA210/CS181 Homework 3.c - Step 1

Before you turn this notebook in, make sure everything runs as expected. To check, **restart the kernel** and then **run all cells**.

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [234]:
from lxml import etree
import pandas as pd
import os.path

datadir = "publicdata"

---

## Part A

**Q1:** In the following cell, reproduce your function from the last homework
```
    getLocalXML(filename, datadir=".", parser=None)
```
that performs the common steps: it creates a path from the given `filename` and `datadir`, parses the XML file using the passed `parser` (if any), and returns the Element at the **root** of the tree.  If `parser` is not passed, the standard `XMLParser` should be used.

If the file is not found, or if the parse is unsuccessful (due to the XML not being "well formed"), the function should return `None`. Remember that if a parse is unsuccessful, the `etree` module raises an exception.  That means you should have a `try` block, with the `parse()` invocation indented inside it.  The `try` block is followed by an `except Exception as e:` line, and within that, you return `None`.  If no exception is raised, code execution proceeds beyond the `try`/`except` block, and that is where you return the root of the parsed tree.
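
The `try`/`except` shape described above can be sketched on an in-memory string. This is only an illustration: it uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml` so that it runs without any data files, and the helper name `parse_or_none` is hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def parse_or_none(xml_text):
    """Return the root Element of xml_text, or None if it is not well formed."""
    try:
        # The parse happens inside the try block, because it may raise
        root = ET.fromstring(xml_text)
    except Exception as e:
        # Not well formed: signal failure with None
        return None
    # Reached only when no exception was raised
    return root


good = parse_or_none("<menu><food/></menu>")
bad = parse_or_none("<menu><food></menu>")  # mismatched closing tag
print(good.tag, bad)
```

The same shape applies when parsing from a file path; only the call inside the `try` block changes.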

In [235]:
def getLocalXML(filename, datadir=".", parser=None):
    """Parses the input file and returns the root of the tree,
    or None if the file is missing or not well formed
    """
    file = os.path.join(datadir, filename)
    if not os.path.isfile(file):
        return None
    # Use the passed parser; fall back to the standard parser otherwise
    if parser is None:
        parser = etree.XMLParser()
    try:
        xmltree = etree.parse(file, parser)
    except Exception as e:
        return None
    return xmltree.getroot()
    

In [236]:
# Testing cell
assert getLocalXML.__doc__ is not None # don't forget the docstring!

myparser = etree.XMLParser(remove_blank_text=True)

wroot = getLocalXML("widombooks.xml", datadir, myparser)
assert len(wroot) == 8
bad = getLocalXML("foo.xml", datadir, myparser)
assert bad is None
bad2 = getLocalXML("bad.xml", datadir)
assert bad2 is None

**Q2:** Use your function to obtain the root Element from the data directory and the XML file named `breakfast.xml`, assigning to Python variable `broot`.

In [237]:
broot = getLocalXML("breakfast.xml", datadir, myparser)


In [238]:
# Testing cell
assert isinstance(broot, etree._Element)
assert len(broot) == 5

**Q3:** Using the Element `broot`, find all children with the tag `'food'` and store them in a list of Elements called `foodlist`.

In [239]:
foodlist = broot.findall("food")

In [240]:
# Testing cell
assert isinstance(foodlist, list)
assert len(foodlist) == 5
assert isinstance(foodlist[0], etree._Element)
assert foodlist[0].tag == 'food'

**Q4:** Create two parallel lists consisting of the prices and the calories for each of the food elements under menu.  You can use your solution to **Q3** or can use another method for iterating over the children of the root node.  For each, you will access the attributes of the food node and collect the values of the two desired attributes.  The final lists should be assigned to `prices` and `calories`, respectively.  Make sure you do your type conversions so that prices are real-valued (i.e., floating-point numbers) and calories are integers.
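
As a small illustration of the attribute access and type conversion described above, the sketch below builds two parallel lists from a made-up two-item menu. The XML content and the names `sample`, `sample_prices`, and `sample_calories` are hypothetical, and the standard library's `xml.etree.ElementTree` is used as a stand-in for `lxml`.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical data, not taken from breakfast.xml
sample = ET.fromstring(
    "<menu>"
    '<food price="2.50" calories="300"/>'
    '<food price="4.00" calories="450"/>'
    "</menu>"
)

# Attribute values are always strings, so convert them explicitly
sample_prices = [float(food.attrib["price"]) for food in sample]
sample_calories = [int(food.attrib["calories"]) for food in sample]

print(sample_prices)
print(sample_calories)
```

Iterating directly over an Element visits its children in document order, so the two lists stay aligned position by position.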

In [241]:
prices =[]
calories = []
for food in foodlist: 
    prices.append(float(food.attrib["price"]))
    calories.append(int(food.attrib["calories"]))

In [242]:
# Testing cell
assert isinstance(prices, list)
assert len(prices) == 5
assert isinstance(prices[0], float)
assert prices[0] == 5.95
assert prices[-1] == 6.95

assert isinstance(calories, list)
assert len(calories) == 5
assert isinstance(calories[0], int)
assert calories[0] == 650
assert calories[-1] == 950

**Q5:** Use the `iter` method to iterate over all the `description`-tagged descendant nodes starting from `broot` and accumulate a list, `dlist`, with the `description` **text** value from these Element nodes.
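
The reason for using `iter` here rather than `findall` with a bare tag is that `findall("description")` only looks at direct children, while `iter("description")` walks all descendants. A minimal sketch on made-up data (the XML and variable names are hypothetical; the standard library's `xml.etree.ElementTree` stands in for `lxml`):

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical data: description nodes are grandchildren of the root
doc = ET.fromstring(
    "<menu>"
    "<food><name>Toast</name><description>Plain</description></food>"
    "<food><name>Eggs</name><description>Fried</description></food>"
    "</menu>"
)

# A bare tag in findall matches direct children only, so nothing is found
direct = doc.findall("description")

# iter visits every descendant with the given tag, in document order
nested = [d.text for d in doc.iter("description")]

print(direct)
print(nested)
```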

In [243]:
dlist =[]
for desc in broot.iter("description"):
    dlist.append(desc.text)
# Display the list of text values
dlist

['Two of our famous Belgian Waffles with plenty of real maple syrup',
 'Light Belgian waffles covered with strawberries and whipped cream',
 'Light Belgian waffles covered with an assortment of fresh berries and whipped cream',
 'Thick slices made from our homemade sourdough bread',
 'Two eggs, bacon or sausage, toast, and our ever-popular hash browns']

In [244]:
# Testing cell
assert isinstance(dlist, list)
assert len(dlist) == 5
assert isinstance(dlist[0], str)
assert dlist[0].count('plenty') == 1
assert dlist[-1].count(',') == 3

---

## Part B

**Q6:** Assign to `wroot` the root `Element` object for the `widombooks.xml` in the data directory.

In [245]:
wroot = getLocalXML("widombooks.xml", datadir, myparser)

In [246]:
# Testing cell
assert isinstance(wroot, etree._Element)
assert len(wroot) == 8

**Q7:** Using the Element `wroot` from above, get the attributes of the first child tagged `'Book'`, and store your answer as a dictionary `myAttrib`.  (Note that you should actually get an object of type `etree._Attrib`, but it's effectively a dictionary.)
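
To illustrate what "effectively a dictionary" means, the sketch below applies the usual dictionary operations to an attribute mapping. It uses the standard library's `xml.etree.ElementTree` as a stand-in, where `attrib` is a plain `dict`; under `lxml` the object is an `etree._Attrib`, as noted above. The XML and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical element, not taken from widombooks.xml
book = ET.fromstring('<Book Price="20" Edition="1st"><Title>Sample</Title></Book>')
attrs = book.attrib

price = int(attrs["Price"])         # index by attribute name, then convert
isbn = attrs.get("ISBN", "unknown")  # .get supplies a default for a missing key
has_edition = "Edition" in attrs     # membership test on attribute names
plain = dict(attrs)                  # copy into an ordinary dict

print(price, isbn, has_edition, plain)
```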

In [247]:
first_child = wroot.find("Book")
myAttrib = first_child.attrib


# Display the attributes dictionary
myAttrib

{'ISBN': 'ISBN-0-13-713526-2', 'Price': '85', 'Edition': '3rd'}

In [248]:
# Testing cell
assert isinstance(myAttrib, etree._Attrib)
assert myAttrib['Price'] == '85'
assert len(myAttrib) == 3

**Q8:** Using the Element `wroot`, find all children with the tag `'Book'` and store them in a list of Elements called `booklist`.

In [249]:
booklist = wroot.findall("Book")

In [250]:
# Testing cell
assert isinstance(booklist, list)
assert len(booklist) == 4
assert isinstance(booklist[0], etree._Element)
assert booklist[0].tag == 'Book'

**Q9:** Using the Element `wroot`, find all descendant nodes with the tag `'Magazine'`, extract the title text from each, and store them in a list of strings called `titlelist` (one title per magazine in `widombooks.xml`).

In [251]:
titlelist = []
for magazine in wroot.iter("Magazine"):
    titlelist.append(magazine[0].text)


# Display the Magazine title text list
titlelist

['National Geographic',
 'National Geographic',
 'Newsweek',
 "Hector and Jeff's Database Hints"]

In [252]:
# Testing cell
assert len(titlelist) == 4
assert "Newsweek" in titlelist
assert "Hector and Jeff's Database Hints" in titlelist

---

## Part C

**Q10:** Write a function
```
    findValue(node, tag)
```
that, relative to `node`, finds the first child matching `tag` and returns its `.text` attribute if found, or `None` if no match was found.
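
One detail worth seeing in isolation is the difference between a child that is missing and a child that exists but has no text. The sketch below uses made-up data and the standard library's `xml.etree.ElementTree` as a stand-in for `lxml`; it also shows the built-in `findtext` method for comparison, which returns an empty string rather than `None` for an element with no text.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical element: Remark exists but is empty, Price is absent
book = ET.fromstring("<Book><Title>Sample</Title><Remark/></Book>")

missing = book.find("Price")           # no such child: find returns None
title = book.find("Title").text        # child found: its text
empty = book.find("Remark").text       # child found, but no text: None

ft_missing = book.findtext("Price")    # no such child: default, which is None
ft_empty = book.findtext("Remark")     # child found, no text: empty string

print(missing, title, empty, ft_missing, repr(ft_empty))
```

Because `find` returns `None` for a missing child, the result has to be checked before `.text` is accessed on it.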

In [253]:
def findValue(node, tag):
    """Returns the text of the first child of node matching tag,
    or None if there is no such child
    """
    value = node.find(tag)
    if value is None:
        return None
    return value.text

In [254]:
# Testing cell
assert findValue.__doc__ is not None # don't remove the docstring!

assert findValue(wroot, "Supplies") is None
booklist = wroot.findall("Book")
assert isinstance(findValue(booklist[1], "Remark"), str)
assert findValue(booklist[1], "Remark").count("Buy") == 1

**Q11:** Write a function
```
    parseBreakfast(broot)
```
that takes in the root of the `breakfast.xml` file as an `etree` `Element` object, and parses it into a `pandas` `DataFrame` with `name` as the `Index`.

_Hint_: You may find your `findValue` function helpful.
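
One common way to go from a tree to a `DataFrame` is to accumulate one dictionary per row and then set the index. The sketch below shows that pattern on a made-up two-item menu; the XML, the names `menu`, `rows`, and `sample_df`, and the use of the standard library's `xml.etree.ElementTree` in place of `lxml` are all assumptions made for illustration.

```python
import pandas as pd
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical data, not taken from breakfast.xml
menu = ET.fromstring(
    "<menu>"
    '<food price="2.50" calories="300"><name>Toast</name></food>'
    '<food price="4.00" calories="450"><name>Eggs</name></food>'
    "</menu>"
)

rows = []
for food in menu:
    # One dictionary per row; the keys become the column names
    rows.append({
        "name": food.find("name").text,
        "price": float(food.attrib["price"]),
        "calories": int(food.attrib["calories"]),
    })

# Build the frame, then promote the name column to the index
sample_df = pd.DataFrame(rows).set_index("name")
print(sample_df)
```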

In [255]:
def parseBreakfast(broot):
    """Parses the breakfast menu tree into a DataFrame indexed by name
    """
    rows = []
    for food in broot:
        row = {}
        # price and calories are attributes of each food node
        row["price"] = float(food.attrib["price"])
        row["calories"] = int(food.attrib["calories"])
        # name and description are child elements of each food node
        row["name"] = findValue(food, "name")
        row["description"] = findValue(food, "description")
        rows.append(row)
    df = pd.DataFrame(rows)
    return df.set_index("name")


In [256]:
# Debugging cell
breakfast_df = parseBreakfast(broot)
breakfast_df

| name | price | calories | description |
|---|---|---|---|
| Belgian Waffles | 5.95 | 650 | Two of our famous Belgian Waffles with plenty ... |
| Strawberry Belgian Waffles | 8.95 | 900 | Light Belgian waffles covered with strawberrie... |
| Berry-Berry Belgian Waffles | 8.95 | 900 | Light Belgian waffles covered with an assortme... |
| French Toast | 4.50 | 600 | Thick slices made from our homemade sourdough ... |
| Homestyle Breakfast | 6.95 | 950 | Two eggs, bacon or sausage, toast, and our eve... |


In [257]:
# Testing cell
breakfast_df = parseBreakfast(broot)
assert breakfast_df.shape == (5,3)

assert breakfast_df.loc["Belgian Waffles", "price"] == 5.95
assert breakfast_df.loc["French Toast", "calories"] == 600
assert breakfast_df.loc["Homestyle Breakfast", "description"].startswith("Two eggs, bacon")

---

## Part D

**Q12:** How much time (in minutes/hours) did you spend on this homework assignment?

40 mins 

**Q13:** Who was your partner for this assignment?  If you worked alone, say so instead.

Alone