Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE`/`raise NotImplementedError` or "YOUR ANSWER HERE", as well as your name and collaborators below:

## 04_HW1

Please start by running the cell below. Note that, just like in class, we create a custom parser that we will use in the cells to follow. The function `print_tree()` is there to help you check your code as you write it, following the principle that it's wise to "develop code incrementally" rather than write many lines of code at once.

In [None]:
import json
import io
from lxml import etree
import os.path
import pandas as pd

myparser = etree.XMLParser(remove_blank_text=True)
datadir = "public_data"

def print_tree(node, pretty_print=True, encoding='utf-8'):
    result = etree.tostring(node, pretty_print=pretty_print)
    if isinstance(result, bytes):
        result = result.decode(encoding)
    print(result)

**Q1** Consider the JSON file, named `breakfast.json` in `public_data`.  This JSON contains a **list** of breakfast dishes served by a restaurant.  Each breakfast food is a dictionary, mapping keys about the breakfast food to values.  The keys present for each breakfast food include:

- `name`: The string name of the dish as it appears on the menu,
- `price`: A real value (in US currency) for the amount charged for ordering the dish,
- `description`: The long string description of the dish,
- `calories`: The integer number of calories for the dish.

Your job is to create, by hand and using a text editor, a file in **the current directory** named `breakfast.xml` that contains a well-formed (parseable) representation of this same data.

Additional Specifications of the XML:

- the root Element should be named `menu`
- the children Elements of `menu` should all be named `food`
- We consider `price` and `calories` to be meta-data, and so these should be XML-attributes of a `food` Element
- The children of each `food` Element should be `name` and `description`
- The string values for `name` and `description` should be the *text* of their respective Element nodes.

Please refer to the "10 Golden Rules of Well-Formed XML" from class. You can find this in chapter 17 of the online copy of the book. You are welcome to start from one of the XML files we gave you and modify it to match the specification above. Most simple text editors can open an XML file and can save XML files. Please be clear about the difference between tags, text, and attributes, and pay attention to the assert statements below.
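To make the distinction between tags, text, and attributes concrete, here is a small sketch on a made-up `library` document (hypothetical data, unrelated to `breakfast.json`). It uses the standard library's `xml.etree.ElementTree`, which behaves like the `lxml.etree` module imported above for the calls shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 'year' is an attribute of <book>,
# while <title> is a child Element whose content is text.
doc = """<library>
  <book year="1999">
    <title>Data Systems</title>
  </book>
</library>"""

root = ET.fromstring(doc)
book = root[0]

tag = book.tag                    # the Element's name
year = book.get("year")           # attribute values are always strings
title = book.find("title").text   # the text between <title> and </title>

print(tag, year, title)
```

Note that `year` comes back as the string `"1999"`, not a number, which is why the testing cell below compares `price` against a string.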

**Remember to remove our solution `breakfast.xml` from the release version**

In [None]:
# Testing cell

path = os.path.join('.', "breakfast.xml")
assert os.path.isfile(path)

tree = etree.parse(path)
root = tree.getroot()

assert len(root) == 5
assert root.tag == 'menu'
for child in root:
    assert child.tag == 'food'
assert root[0].get('price') == "5.95"
assert root[4][0].tag == 'name'
assert root[4][0].text == 'Homestyle Breakfast'

## Basic Operations

As an aid for working with Element nodes, we summarize some of the fundamental operations:

Operation     |  Syntax Hint  |Brief Description
:-------------|:--------------|:-----------------------------------------
Get a Child   | `[index]`     |Access the node's child at index
Get tag       | `.tag`        |Obtain tag of node
Get text      | `.text`       |Obtain text of node up to child node or end tag
Access all attributes | `.attrib` | Obtain dictionary of all of node's xml attributes
Access one attribute | `.get()` | Fetch value for specified attribute, or `None` if not present
Find child node | `.find()` | Search for first child matching search specification (by tag)
Iterator child search | `.iterfind()` | Iterator for all children matching search specification (by tag)
Unconditional Child Iteration | *node* | A node itself can be used as an iterator to obtain all children in document order
Count children | `len(`*node*`)` | Find the number of children of a node
Iterator over descendants | `iter()` | Iterator over all descendants
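
As a quick illustration of the table, the sketch below applies each operation to a tiny made-up `shelf` document (the tag and attribute names are hypothetical). It uses the standard library's `xml.etree.ElementTree`, on the assumption that these particular calls behave the same way in `lxml.etree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item document, used only to exercise the operations above.
root = ET.fromstring(
    '<shelf>'
    '<item id="a1" kind="pen"><label>Blue pen</label></item>'
    '<item id="b2" kind="pad"><label>Notepad</label></item>'
    '</shelf>'
)

first = root[0]                               # get a child by index
n_children = len(root)                        # count children
attrs = dict(first.attrib)                    # all attributes as a dictionary
missing = first.get("price")                  # absent attribute gives None
label_text = first.find("label").text         # first matching child, then its text
child_tags = [child.tag for child in root]    # iterate over direct children
ids = [item.get("id") for item in root.iterfind("item")]  # children matching a tag
all_tags = [node.tag for node in root.iter()]  # document order, starting at root

print(n_children, attrs, missing, label_text)
print(child_tags, ids, all_tags)
```

One detail worth noticing: `iter()` yields the node it is called on first, and then its descendants.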


## Operations Questions

The questions below are designed to be very similar to the in-class activity. First we need to read in the data.

**Q2** Write a function:
    
    getLocalXML(filename, datadir=".", parser=None)
    
that performs the common steps of creating a path from the given `filename` and `datadir` and parses the XML file, using the passed `parser`, if any, and returns the Element at the root of the tree.  If the file is not found, or if the parse is unsuccessful (due to XML not being "well formed"), the function should return `None`. Even though the testing cell below tests this function on `widombooks.xml`, your solution needs to work in general, because we need it for `reed.xml` in the next part of the homework.
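
As a hint about the failure cases, the sketch below shows how a parse error can be caught rather than allowed to stop the program. It is not a solution: it parses strings rather than files, and it uses the standard library's `xml.etree.ElementTree`, where malformed input raises `ET.ParseError`. In `lxml` the analogous exception is `etree.XMLSyntaxError`, and a missing file typically surfaces as an `OSError`. The helper name `describe_parse` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

well_formed = "<a><b>text</b></a>"
malformed = "<a><b>text</a>"   # <b> is never closed

def describe_parse(xml_string):
    """Return the root tag if the string parses, or None if it is not well formed."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # The parser rejected the input, so report failure with None.
        return None
    return root.tag

print(describe_parse(well_formed))
print(describe_parse(malformed))
```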

In [None]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
wroot = getLocalXML("widombooks.xml", datadir, myparser)
assert len(wroot) == 8
bad = getLocalXML("foo.xml", datadir, myparser)
assert bad is None
bad2 = getLocalXML("bad.xml", datadir)
assert bad2 is None

**Q3** Using the Element `wroot` from above, get the attributes of the first child tagged `'Book'`, and store your answer as a dictionary `myAttrib`.

In [None]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()

print(myAttrib)

In [None]:
# Testing cell

assert myAttrib['Price'] == '85'
assert len(myAttrib) == 3

**Q4** Using the Element `wroot`, find all children with the tag `'Book'` and store them in a list of Elements called `booklist`.

In [None]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
print(booklist)
booklist[0].tag

In [None]:
# Testing cell
assert len(booklist) == 4
assert type(booklist) == list
assert booklist[0].tag == 'Book'

**Q5** Using the Element `wroot`, find all children with the tag `'Magazine'`, extract the title text from each, and store them in a list of strings called `titlelist` (one title per magazine in `widombooks.xml`). Hint: use loops.

In [None]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()

print(titlelist)

In [None]:
# Testing cell
assert len(titlelist) == 4
assert "Newsweek" in titlelist
assert "Hector and Jeff's Database Hints" in titlelist

## Building Data Frame from XML

The objective of the next series of questions is to open `reed.xml` from the data directory, parse it into a List of Dictionaries representation, and then build a pandas data frame.  This follows the pattern from the book, which is also covered in the in-class notebook from Friday.

Those sources have more explanation and step-by-step progression, while the questions here will only specify what is expected.  So it might behoove you to read and/or work through the in-class notebook before solving these homework problems. We begin with a testing cell that refers to the function `getLocalXML` you wrote above, followed by a cell providing you a handy function from the reading.

In [None]:
# Testing cell
root = getLocalXML("reed.xml", datadir, myparser)
assert len(root) == 703

In [None]:
# A gift for you

def child_value(node, tag):
    # Return the text of the first child of node with the given tag, or None if absent
    first_find = node.find(tag)
    if first_find is not None:
        return first_find.text
    return None

**Q6** Development Step 1: Build a List of Dictionaries of courses, with columns (i.e. keys in each of the row-based dictionaries) `'reg_num'`, `'subj'`, and `'crse'`.  Name your list of dictionaries `LoD`.
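
For reference, the general List of Dictionaries pattern can be sketched on a small made-up document. The `pets` data and the tag names here are hypothetical and are not those of `reed.xml`, and `text_or_none` is a stand-in written for this example. The sketch uses the standard library's `xml.etree.ElementTree`, assuming `find` and `iterfind` behave as they do in `lxml.etree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the second <pet> has no <species> child.
root = ET.fromstring(
    "<pets>"
    "<pet><name>Rex</name><species>dog</species></pet>"
    "<pet><name>Tom</name></pet>"
    "</pets>"
)

def text_or_none(node, tag):
    """Text of the first child with the given tag, or None if there is none."""
    found = node.find(tag)
    return found.text if found is not None else None

# One dictionary per row; every dictionary has the same keys.
rows = []
for pet in root.iterfind("pet"):
    rows.append({"name": text_or_none(pet, "name"),
                 "species": text_or_none(pet, "species")})

print(rows)
```

Keeping the same keys in every dictionary, with `None` for missing values, means each row lines up with the same columns later on.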

In [None]:
# Solution cell
LoD = []
# YOUR CODE HERE
raise NotImplementedError()

print(LoD[0])

In [None]:
# Testing cell
assert isinstance(LoD, list)
assert len(LoD) == 703
assert isinstance(LoD[0], dict)
assert 'subj' in LoD[0]
assert LoD[0]['subj'] == 'ANTH'

**Q7** Development Step 2: Repeat your code from Step 1, extending it so that the dictionaries in `LoD` include all **leaf** children of each course (i.e. column-keys for everything **other than** `time` and `place`).
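
One possible way to recognize a leaf child is that it has no children of its own, so `len()` of that child is zero. The sketch below shows the idea on a made-up `course` fragment. The structure is hypothetical and only loosely modeled on the description above. It uses the standard library's `xml.etree.ElementTree`, assuming `len()` and iteration behave as in `lxml.etree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <slot> has a child of its own, so it is not a leaf.
node = ET.fromstring(
    "<course>"
    "<code>X101</code>"
    "<slot><start>09:00</start></slot>"
    "<room>12</room>"
    "</course>"
)

# Keep only the children that have no children themselves.
leaf_values = {child.tag: child.text for child in node if len(child) == 0}

print(leaf_values)
```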

In [None]:
# Solution cell
LoD = []
# YOUR CODE HERE
raise NotImplementedError()

print(LoD[0])

In [None]:
# Testing cell
assert isinstance(LoD, list)
assert len(LoD) == 703
assert isinstance(LoD[0], dict)
assert 'subj' in LoD[0]
assert LoD[0]['subj'] == 'ANTH'
assert 'days' in LoD[0]
assert LoD[0]['days'] == 'M-W'

**Q8** Finally, use your `LoD` to create a pandas DataFrame, and set the index to the column or columns that together define a unique combination of independent variables for the data set.  Assign the data frame to the Python variable `df`.
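
To illustrate what a unique combination of columns means, here is a minimal sketch on made-up rainfall rows. The data and column names are hypothetical and unrelated to `reed.xml`. It builds a DataFrame from a list of dictionaries and checks uniqueness before and after setting a two-column index.

```python
import pandas as pd

# Hypothetical rows: neither 'city' nor 'year' is unique alone.
rows = [
    {"city": "Oslo", "year": 2020, "rain_mm": 763},
    {"city": "Oslo", "year": 2021, "rain_mm": 802},
    {"city": "Rome", "year": 2020, "rain_mm": 733},
]
table = pd.DataFrame(rows)

city_unique = table["city"].is_unique   # 'Oslo' appears twice
year_unique = table["year"].is_unique   # 2020 appears twice

# The pair (city, year) identifies each row, so use both as the index.
table = table.set_index(["city", "year"])
pair_unique = table.index.is_unique

print(city_unique, year_unique, pair_unique)
print(table)
```

Checking `index.is_unique` after `set_index` is a quick way to confirm that the chosen column or columns really do identify each row.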

In [None]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()

df.head()

In [None]:
# Testing cell
assert isinstance(df, pd.DataFrame)
assert len(df) == 703
assert df.iloc[0]['title'] == 'Introduction to Anthropology'