Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE`/`raise NotImplementedError` or "YOUR ANSWER HERE", as well as your name and collaborators below:

# 04_HW3: XPath for reading into `pandas`

The purpose of this homework is to give you more experience reading hierarchical data into a `pandas` dataframe.  We want to parse through XML or HTML to build a table of data in one of our two-dimensional formats (e.g., LoL, DoL, LoD, or DoD) and then read this into `pandas`.
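
As a reminder of how two of these formats map onto a dataframe, here is a minimal sketch using made-up toy values (the column names `letter` and `count` are purely illustrative and have nothing to do with the homework dataset). It builds the same table once from a dictionary of lists (DoL) and once from a list of lists (LoL).

```python
import pandas as pd

# Dictionary of lists (DoL): each key is a column name, each value holds that column's data.
dol = {"letter": ["a", "b", "c"], "count": [1, 2, 3]}
df_from_dol = pd.DataFrame(dol)

# List of lists (LoL): each inner list is one row; column names are passed separately.
lol = [["a", 1], ["b", 2], ["c", 3]]
df_from_lol = pd.DataFrame(lol, columns=["letter", "count"])

# Both constructions describe the same table.
print(df_from_dol.equals(df_from_lol))
```

The DoL form is natural when the data arrives column by column, and the LoL form when it arrives row by row.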

In [None]:
import pandas as pd
from lxml import etree
import json
import requests
import io

## The dataset: NYC Water Consumption


The data comes from https://catalog.data.gov/dataset/water-consumption-in-the-new-york-city-3d2f0, where the XML can be downloaded. For your convenience, the full URL for downloading this dataset, including query parameters, is:

https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.xml?accessType=DOWNLOAD

You are encouraged to look carefully at the data, e.g., in Google Chrome:

https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.xml

Also for your convenience, and to show the table we want to obtain from the XML, this assignment directory includes a CSV of the data. We read it in and display the first five observations:

In [None]:
NYCwater=pd.read_csv('Water_Consumption_In_The_New_York_City.csv')
print('length:',len(NYCwater))
NYCwater.head()

You can see that there are four columns and 38 rows. Each row is identified by a year, and the dependent variables are population, total water consumption in millions of gallons per day, and a derived column of per-capita gallons consumed per day. Please do not use this CSV file in your solutions to the problems below.

## XML Processing
Now we turn to obtaining and processing the XML version of the dataset.

**Q1:** Write a function

`getNYCWater(url)`

that uses the `requests` module to obtain the data based on the URL given above and, if successful, returns an `lxml` Element for the root of the resultant tree.  If not successful, the function should return `None`.  You are welcome to use functions you have developed previously.
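
The parsing half of this task can be sketched without any network access. The helper name `parse_xml_bytes` and the toy document below are hypothetical and only for illustration. In a real function the bytes would typically come from the `content` attribute of a `requests` response, after checking that its `status_code` is 200.

```python
from lxml import etree

def parse_xml_bytes(data):
    """Return the root Element parsed from XML bytes, or None if parsing fails."""
    try:
        return etree.fromstring(data)
    except etree.XMLSyntaxError:
        # Malformed XML: signal failure with None instead of raising.
        return None

# A toy document (hypothetical, not the NYC dataset).
toy_bytes = b"<?xml version='1.0' encoding='UTF-8'?><library><book/><book/></library>"
toy_root = parse_xml_bytes(toy_bytes)

# The root's tag and its number of direct children.
print(toy_root.tag, len(toy_root))
```

Passing bytes rather than a decoded string lets `lxml` honor the encoding declared in the XML prolog.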

In [None]:
def getNYCWater(url):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Testing cell
protocol = 'https'
location = 'data.cityofnewyork.us'
resource = '/api/views/ia2d-e54m/rows'
fmt = 'xml'
query = 'accessType=DOWNLOAD'

my_url = "{}://{}{}.{}?{}".format(protocol, location, resource, fmt, query)

root = getNYCWater(my_url)
assert type(root) is etree._Element
assert len(root) == 1
assert getNYCWater("http://httpbin.org/post") is None

**Q2:** Use `xpath` to obtain a single-entry list, `yearList`, containing the string text of the first year in the dataset.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert len(yearList) == 1
assert yearList[0] == '1979'

**Q3:** Use `xpath` to obtain a single-entry list, `yearList`, containing the string text of the last year in the dataset.
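
Selecting the first or last node of a sequence relies on XPath positional predicates. The sketch below shows the idea on a toy document whose tag names (`library`, `book`, `title`, `year`) are hypothetical and differ from those in the NYC dataset, so the expressions will need adapting.

```python
from lxml import etree

# Toy document with three sibling <book> elements (hypothetical structure).
toy = etree.fromstring(
    "<library>"
    "<book><title>A</title><year>2001</year></book>"
    "<book><title>B</title><year>2005</year></book>"
    "<book><title>C</title><year>2010</year></book>"
    "</library>"
)

# XPath positions are 1-based, so [1] selects the first <book>.
first_title = toy.xpath("/library/book[1]/title/text()")

# last() evaluates to the position of the final node, whatever the count is.
last_title = toy.xpath("/library/book[last()]/title/text()")

print(first_title, last_title)
```

Because `last()` is evaluated against the document, the expression keeps working if more sibling elements are added later.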

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert len(yearList) == 1
assert yearList[0] == '2018' # note: this will change when they release more data

**Q4:** Use `xpath` to obtain a list, `yearList`, containing the string text of ALL of the years in the dataset.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert len(yearList) == 40
assert yearList[-1] == '2018'

**Q5:** Using a single `xpath` expression and a condition on the year, produce a single-entry list, `year98`, containing the string text of the population of New York City in 1998.
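
A condition on a child's value is written as a predicate on the parent element. The following sketch uses a toy document with hypothetical tag names (`library`, `book`, `title`, `year`), not the tags of the NYC dataset, to show both a string comparison and a numeric comparison.

```python
from lxml import etree

# Toy document (hypothetical structure, for illustration only).
toy = etree.fromstring(
    "<library>"
    "<book><title>A</title><year>2001</year></book>"
    "<book><title>B</title><year>2005</year></book>"
    "<book><title>C</title><year>2010</year></book>"
    "</library>"
)

# Keep only the <book> whose <year> child has the text '2005', then step to its title.
titles_2005 = toy.xpath("/library/book[year='2005']/title/text()")

# Comparing against a number makes XPath convert the child's text to a number first.
titles_after_2003 = toy.xpath("/library/book[year > 2003]/title/text()")

print(titles_2005, titles_after_2003)
```

The predicate filters the parent, so the rest of the path can reach any sibling of the node that was tested.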

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert len(year98) == 1
assert year98[0] == '7858259'

**Q6:** Using a single `xpath` expression, produce a list, `conList`, containing the string text of the `consumption` node for every row represented in the XML file. Pay attention to type.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert len(conList) == 40
assert conList[0] == '1512'
assert conList[-1] == '1007.50'

> As the exercises above show, XPath makes it easy to obtain the data column by column.  Alternatively, if we use XPath to get a list of row Elements, we can extract the information from each row programmatically.  The next two exercises pursue the same goal by these two routes: to get a pandas dataframe whose columns are `[year, population, totalconsumption, percapita]` from the XML.  **Note that all four of these columns should be integers or floats, not strings**. Your solution should continue to work even if the dataset is updated in a future year (so do not assume you know the number of rows).
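
The two routes described above can be sketched on a toy document. Everything here is an assumption made for illustration: the tag names (`catalog`, `item`, `year`, `price`), the values, and the variable names are hypothetical and do not match the NYC dataset.

```python
from lxml import etree
import pandas as pd

# Toy document (hypothetical structure and values).
toy = etree.fromstring(
    "<catalog>"
    "<item><year>2001</year><price>9.5</price></item>"
    "<item><year>2005</year><price>12.25</price></item>"
    "</catalog>"
)

# Route 1, column-wise: one XPath query per column, converted from text to numbers,
# then assembled as a dictionary of lists.
years = [int(y) for y in toy.xpath("//item/year/text()")]
prices = [float(p) for p in toy.xpath("//item/price/text()")]
df_cols = pd.DataFrame({"year": years, "price": prices})

# Route 2, row-wise: one XPath query for the row Elements, then a relative
# query inside each row, accumulated as a list of lists.
rows = []
for item in toy.xpath("//item"):
    year = int(item.xpath("year/text()")[0])
    price = float(item.xpath("price/text()")[0])
    rows.append([year, price])
df_rows = pd.DataFrame(rows, columns=["year", "price"])

print(df_cols.equals(df_rows))
```

Neither route hard-codes the number of rows, so both adapt if elements are added to the document.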

**Q7:** Write a function

`NYCdf1(treeroot)`

that, given the Element root, uses the XPath approach above to get column data and, after converting it to integer/float data types, uses `pandas` to create and return the corresponding dataframe from a dictionary of columns. Your dataframe should keep the default integer index (do not set any column as the index).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

df=NYCdf1(root)
df.head()

In [None]:
# Testing cell
df=NYCdf1(root)

assert df.shape == (40,4)
assert 1979 in list(df.iloc[0])
assert 7102100.0 in list(df.iloc[0])
assert 1512.0 in list(df.iloc[0])
assert 213 in list(df.iloc[0])
assert 1007.5 in list(df.iloc[-1])

**Q8:** Write a function

`NYCdf2(treeroot)`

that, given the Element root, uses XPath to get a list of row Elements and then iterates over each row, creating a list of the values for that row.  Each row list should be appended to an accumulating list of lists.  The list of lists should then be turned into a `pandas` dataframe (again keeping only the default integer index) and returned. Again, make sure that all values are of integer/float data type.  Other than perhaps column order, this dataframe should match the result of our prior function.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

df=NYCdf2(root)
df.tail()

In [None]:
# Testing cell
df=NYCdf2(root)
assert df.shape == (40,4)
assert 1979 in list(df.iloc[0])
assert 7102100.0 in list(df.iloc[0])
assert 1512.0 in list(df.iloc[0])
assert 213 in list(df.iloc[0])
assert 1007.5 in list(df.iloc[-1])