Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

In [None]:
import requests
import pandas as pd
from lxml import etree
import lxml.html as lh
import io
import os.path

dataout = "_generated"
datadir = "../../data/22"

protocol = "http"
location = "personal.denison.edu"
resourcepath = "/~bressoud/datasystems/{}"

buildURL = lambda s: "{}://{}{}".format(protocol, location, resourcepath.format(s))

parser = etree.HTMLParser(remove_blank_text=True)

def print_tree(node, pretty_print=True, encoding='utf-8', limit=0):
    result = etree.tostring(node, pretty_print=pretty_print)
    if isinstance(result, bytes):
        result = result.decode(encoding)
    if limit > 0:
        print(result[:limit])
    else:
        print(result)

In [None]:
def attr_string(node):
    s = ''
    for k, v in node.attrib.items():
        nextval = " {}='{}'".format(k, v)
        s += nextval
    return s

def print_leaf_node(node, level):
    indent = level*'  '
    tag_string = "<{}{}>".format(node.tag, attr_string(node))
    nodetext = str(node.text).strip()
    if node.text is not None and nodetext != '':
        tag_string += nodetext
    end_tag = "</{}>".format(node.tag)
    print(indent, tag_string, end_tag, sep='')
    
def print_start_tag(node, level):
    indent = level*'  '
    tag_string = "<{}{}>".format(node.tag, attr_string(node))
    nodetext = str(node.text).strip()
    if node.text is not None and nodetext != '':
        tag_string += nodetext
    print(indent, tag_string, sep='')

def print_end_tag(node, level):
    indent = level*'  '
    tag_string = "</{}>".format(node.tag)
    print(indent, tag_string, sep='')


def print_levels(node, level, maxlevel, maxchildren=30):
    if len(node) == 0:
        print_leaf_node(node, level)
    else:
        print_start_tag(node, level)
        if len(node) > 0 and level < maxlevel:
            for i, child in enumerate(node):
                if i < maxchildren:
                    print_levels(child, level+1, maxlevel, maxchildren)
                else:
                    print((level+1)*'  ', '...')
                    break
        print_end_tag(node, level)

In [None]:
path = os.path.join(datadir, "ind2016table.html")
os.path.isfile(path)

In [None]:
with open(path) as f:
    tree = etree.parse(f, parser)
    root = tree.getroot()

In [None]:
print_levels(root[0][0], 0, 3, 3)

## Example 1: Simple Table

> HTML table construct: https://www.w3schools.com/html/html_tables.asp

Example URL: http://personal.denison.edu/~bressoud/datasystems/ind2016.html

- table
    - thead
        - tr
            - th
            - th
            - etc.
    - tbody
        - tr
            - td
            - td
            - etc.
        - tr
            - td
            - td
            - etc.
        - etc.
        
Notes:

- In our first example, the column names are in the `text` of the `th` Elements, and the values of the table entries are in the `text` of the `td` Elements.

- Some older tables do not use `thead`/`tbody`, so these elements may not be present.  If you know they are present, you do not have to rely on processing the first row one way (to get column headers) and the remaining rows a different way.  If the table uses `th` for its header cells, that helps as well.

    - Another point of confusion arises when the HTML parser behind a browser's Developer Tools adds `thead` and `tbody` structure that does not, in fact, exist in the source HTML.  Students see one thing in the Developer Tools, write their code to use `thead`/`tbody`, and then get no matching XPath results.
    
- More complex examples might have multiple rows of headers before the table data rows begin.
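The `thead`/`tbody` pitfall in the notes above can be reproduced offline.  The following is a minimal sketch, assuming a small made-up table (it is not the content of `ind2016.html`) that was written without the wrapper elements.  `lxml` keeps the tree as written in the source, so a path that names `thead` or `tbody` matches nothing, while a path that uses `//tr` still works.

```python
from lxml import etree

# Hypothetical table written WITHOUT thead/tbody (not taken from ind2016.html).
html = """
<table>
  <tr><th>code</th><th>pop</th></tr>
  <tr><td>AAA</td><td>1.5</td></tr>
  <tr><td>BBB</td><td>2.5</td></tr>
</table>
"""
root = etree.fromstring(html, etree.HTMLParser())

# Paths that name thead/tbody find nothing: lxml does not invent those nodes.
strict_headers = root.xpath("//table/thead/tr/th/text()")
strict_cells = root.xpath("//table/tbody/tr/td/text()")

# The descendant step (//tr) matches rows with or without the wrappers.
loose_headers = root.xpath("//table//tr/th/text()")
loose_cells = root.xpath("//table//tr/td/text()")

print(strict_headers, strict_cells)
print(loose_headers)
print(loose_cells)
```

Using `//tr` relative to the table is the more forgiving choice when you are not sure whether the source HTML contains the wrappers.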


In [None]:
url = buildURL("ind2016.html")
response = requests.get(url)
assert response.status_code == 200

In [None]:
tree1 = etree.parse(io.BytesIO(response.content), parser)
root1 = tree1.getroot()

In [None]:
column_names = root1.xpath(".//table/thead/tr/th/text()")

In [None]:
column_names = root1.xpath(".//table//tr/th/text()")

In [None]:
column_names

In [None]:
tdlist = root1.xpath(".//table//tr/td/text()")
LoL = []
fieldcount = 0
for item in tdlist:
    if fieldcount == 0:
        row = []
    row.append(item)
    if fieldcount < 5:
        fieldcount += 1
    else:
        LoL.append(row)
        fieldcount = 0
LoL

In [None]:
DoL = {}
for index, column in enumerate(column_names):
    xpath = ".//table//tr/td[{}]/text()".format(index+1)
    DoL[column] = root1.xpath(xpath)
DoL

In [None]:
df = pd.DataFrame(DoL)
df.set_index('code', inplace=True)
df

### Possible Variation: Multiple Tables in a Single Page

If the desired table for web scraping is **not** the first table in the page, the above code breaks.  The general case might have multiple preceding tables and even multiple following tables.

A general solution is to first use a position or an attribute to get the Element of the **correct** table, and then to do the table processing relative to that node, instead of from the root of the document tree.
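As a rough illustration of that idea, the sketch below uses a made-up page with two tables; the `id` values `nav` and `data` are assumptions chosen for the example.  It selects the wanted table once, by attribute or by position, and then runs every later XPath relative to that node (note the leading `.`).

```python
from lxml import etree

# Hypothetical page: a layout table followed by the table we actually want.
html = """
<html><body>
<table id="nav"><tr><td>Home</td><td>About</td></tr></table>
<table id="data">
  <tr><th>code</th><th>value</th></tr>
  <tr><td>AAA</td><td>1</td></tr>
  <tr><td>BBB</td><td>2</td></tr>
</table>
</body></html>
"""
root = etree.fromstring(html, etree.HTMLParser())

# From the document root, cells of BOTH tables are mixed together.
mixed = root.xpath("//table//tr/td[1]/text()")

# Select the correct table by attribute, or by position in the document.
by_attribute = root.xpath("//table[@id='data']")[0]
by_position = root.xpath("(//table)[2]")[0]

# Process relative to the selected node only.
headers = by_attribute.xpath(".//tr/th/text()")
first_col = by_attribute.xpath(".//tr/td[1]/text()")

print(mixed)
print(by_position.get("id"), headers, first_col)
```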

### Possible Variation: More Complicated Data Extraction from `td`

A `td` cell in an HTML table may well contain the desired data for a tabular data frame extraction, but could well be "buried", and not just be the `text` of the `td` node.  There may be a subtree at the `td` node, and the data might be in a link (`a` reference).  Or it could be part of an attribute, either of the `td` or a subelement.  Or the extracted data may contain "extra", like a footnote or icon picture, in addition to the element itself.
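The cases just listed can be sketched on a single made-up row.  Everything in the HTML below (the place name, the `data-sort-value` attribute, the footnote marker) is an assumption invented for illustration; the point is only to show which XPath form reaches which kind of "buried" data.

```python
from lxml import etree

# Hypothetical row: name inside a link, number with a footnote and an attribute.
html = """
<table>
  <tr>
    <td><span><img src="flag.png" alt=""/></span><a href="/wiki/X" title="Exampleland (country)">Exampleland</a></td>
    <td data-sort-value="1234567">1,234,567<sup>[a]</sup></td>
  </tr>
</table>
"""
root = etree.fromstring(html, etree.HTMLParser())
row = root.xpath("//tr")[0]

# Data in a link below the td: use the text of the a node.
name = row.xpath("./td[1]//a/text()")[0]

# Data in an attribute, of a subelement or of the td itself.
title = row.xpath("./td[1]//a/@title")[0]
sort_value = row.xpath("./td[2]/@data-sort-value")[0]

# Own text of the td versus all text in its subtree (which drags in the footnote).
own_text = row.xpath("./td[2]/text()")[0]
all_text = row.xpath("string(./td[2])")

print(name, title, sort_value, own_text, all_text, sep=" | ")
```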

## Example 2: Wikipedia Table

Non-API URL: https://en.m.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population

### Goal

We see in the rendered page a table of state populations.  Population data and ranks are relative to the 2010 census and population estimates for 2019.  In our case, we are interested in the most recent data, even if it is an estimate, and so we want to extract the current rank (as an integer), the string of the name of the state (we don't care about the state flag picture), and the estimate of the population as of July 1, 2019.  These are the first, third, and fourth columns in the table.

### Access

To discourage web scraping (in violation of its acceptable use and robots.txt policies), pages like the above in fact have HTML that uses scripts to generate the content.  So if you fetch the above, you will not find the underlying table.

They do provide an API where you can, through the API, request the HTML page content for many pages.  Documentation is: https://en.wikipedia.org/api/rest_v1/. See subsection on page content at https://en.wikipedia.org/api/rest_v1/#/Page%20content.  We are using the fourth version of the API, an HTTP GET going to the `/page/html/{title}` where, for this example, `{title}` is `List_of_states_and_territories_of_the_United_States_by_population`.  Note that, in accordance with the API, this endpoint is accessed in a resource path that starts `/api/rest_v1`.

0. Before processing, we retrieve the HTML.

In [None]:
"""
https://en.wikipedia.org/api/rest_v1/page/html/List_of_states_and_territories_of_the_United_States_by_population
"""

protocol = "https"
location = "en.wikipedia.org"
resourcepath = "/api/rest_v1/page/html/{}"
pop_page = "List_of_states_and_territories_of_the_United_States_by_population"

url = buildURL(pop_page)

In [None]:
response = requests.get(url)
assert response.status_code == 200

tree2 = etree.parse(io.BytesIO(response.content), parser)
root2 = tree2.getroot()

1. See the set of tables in the whole document

In [None]:
root2.xpath("//table")

2. Often a `class` attribute can distinguish tables, so we obtain the class attributes for the set of tables in the document.  If a table does not have a class attribute, it will not appear.

In [None]:
root2.xpath("""//table/@class""")

3. We discover (by inspection in Developer Tools) that the class of our state population table is `'wikitable sortable'`, so we get the node list of tables that satisfy this condition:

In [None]:
table_list = root2.xpath("""//table[@class='wikitable sortable']""")

In [None]:
print_tree(table_list[0], limit=4000)
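One caveat about selecting by `class`: the predicate `@class='wikitable sortable'` compares the whole attribute string.  The sketch below uses made-up tables (the class `plainrowheaders` and the cell texts are assumptions for the example) to show that a table with an additional class token no longer matches, and one common XPath 1.0 idiom for matching a single class token instead.

```python
from lxml import etree

# Hypothetical tables with different class attributes, and one with none.
html = """
<table class="wikitable sortable"><tr><td>exact</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>extra class</td></tr></table>
<table class="infobox"><tr><td>other</td></tr></table>
<table><tr><td>no class</td></tr></table>
"""
root = etree.fromstring(html, etree.HTMLParser())

# A table without a class attribute contributes nothing to this result.
classes = root.xpath("//table/@class")

# Whole-string comparison: only the first table matches.
exact = root.xpath("//table[@class='wikitable sortable']")

# Token match: pad with spaces so 'wikitable' matches as a whole word.
token = root.xpath(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]")

print(len(classes), len(exact), len(token))
```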

4. We find that the index 0 node in the table_list is the one we want, so we assign to a new variable and print the top three levels of the tree to see its structure:

In [None]:
population_table = table_list[0]
print("table")
limit = 3
for child1 in population_table:
    print(" ", child1.tag)
    for i, child2 in enumerate(child1):
        print("   ", child2.tag)
        for j, child3 in enumerate(child2):
            print("     ", child3.tag)
            if j > limit:
                print("      ...")
                break
        if i > limit:
            print("   ...")
            break

> Conclusion: a `tbody` *is* present, but is not being used for just the body of the table.  The header is part of the body.  We know this because of the `th` nodes.  We also observe that there are **two** rows of header information.  If we look at the rendered result, this makes sense, as we can distinguish the two header rows, along with both some horizontal and vertical spanning.

 5. Look more closely at the first data-carrying row.
    - Be careful in the `xpath`; if we use `//tr[3]` we would get the third row from all five of the tables present in the document.  So we make the xpath relative to the population table

In [None]:
tablerow_list = population_table.xpath(".//tr[3]")
tablerow_list
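The difference between `//tr[3]` and `.//tr[3]` can be checked on a small made-up document with two tables (the cell texts are assumptions for the example).  Note that an expression starting with `//` is evaluated from the document root even when it is called on a table Element.

```python
from lxml import etree

# Hypothetical document with two three-row tables.
html = """
<table><tr><td>a1</td></tr><tr><td>a2</td></tr><tr><td>a3</td></tr></table>
<table><tr><td>b1</td></tr><tr><td>b2</td></tr><tr><td>b3</td></tr></table>
"""
root = etree.fromstring(html, etree.HTMLParser())
second_table = root.xpath("//table")[1]

# Absolute path: the third row of EVERY table in the document.
all_third = root.xpath("//tr[3]/td/text()")

# Still absolute, even though it is called on the table node.
still_all = second_table.xpath("//tr[3]/td/text()")

# Relative path: only the third row below the context table.
relative = second_table.xpath(".//tr[3]/td/text()")

print(all_third, still_all, relative)
```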

In [None]:
datarow = tablerow_list[0]
print_tree(datarow)

> Conclusion: 
> 1. current state rank is in first `td` in row, and is the text of a `span` node under the `td` Element (note that some tables can use additional `td` elements in the set of rows, used for spacing, borders, and other rendering results).
> 2. state rank in 2010 is in second `td` in row; we disregard based on our goals
> 3. third `td` contains the state information, and we will explore that further below
> 4. fourth `td` contains the estimated population in 2019, and value is in the text of the `td` element.
> 5. there are `id` attributes for set of `td` nodes, but (by inspection not shown here) they are **not** consistent between rows.
>
> So we will use positioning to get the columns (as relative `td` within the row) that we are interested in.

In [None]:
DoL = {'rank': None, 'state': None, 'population': None}

The following XPath expression gets the set of rows starting with the first data row and ending with the last state, before the territories and aggregates in the table.

In [None]:
rowset = population_table.xpath(".//tr[position() > 2 and position() < 54]")
firstrow = rowset[0]

In [None]:
rank_column = [int(row[0].find('span').text) for row in rowset]
rank_column[:5]

In [None]:
rank_column = population_table.xpath(
    ".//tr[position() > 2 and position() < 54]/td[1]/span/text()")
rank_column = [int(s) for s in rank_column]
rank_column[:5]

In [None]:
convert_pop = lambda p: int(p.replace(",", ""))
pop_column = [convert_pop(row[3].text) for row in rowset]
pop_column[:5]

In [None]:
pop_column = population_table.xpath(
    ".//tr[position() > 2 and position() < 54]/td[4]/text()")
pop_column = [convert_pop(p) for p in pop_column]
pop_column[:5]

In [None]:
state_td = firstrow[2]
print_tree(state_td)

Structure:

- `span`
    - `figure-inline`
        - `span`
            - `img`
    - `span`
- `a`

So under the `td`, we have two children: a `span` and an `a` (which is a hyperlink reference).  That first `span` has embedded stuff for the structure of an image and the image itself.  The hyperlink has the information we seek, with the `text` of the `a` having the name of the state.  We want to avoid the attributes of the `a`, as the `href` and `title` attributes have the possibility of being named differently from the state itself.

While debugging, we found that, in the row containing the District of Columbia, the `a` is not an immediate child of the `td`, but is instead under another `span`.  Now, `lxml` supports a `find` that can take a subset of XPath, so the solution that works on the whole `rowset` is:

In [None]:
state_column = [row[2].find('.//a').text for row in rowset]
state_column[:5]

In [None]:
state_column = population_table.xpath(
    ".//tr[position() > 2 and position() < 54]/td[3]//a/text()")
state_column[:5]
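The District of Columbia situation can be imitated with two made-up cells, one with the `a` as a direct child and one with the `a` nested under an extra `span`.  The names `Alpha` and `Beta` are placeholders.  The sketch shows why `find('a')` is not enough and `find('.//a')` is.

```python
from lxml import etree

# Hypothetical cells: link as direct child, and link nested one level deeper.
html = (
    "<table>"
    "<tr><td><span>flag</span><a href='#'>Alpha</a></td></tr>"
    "<tr><td><span>flag</span><span><a href='#'>Beta</a></span></td></tr>"
    "</table>"
)
root = etree.fromstring(html, etree.HTMLParser())
cells = root.xpath("//td")

# find('a') only looks at immediate children, so the nested case gives None.
direct = [td.find("a") for td in cells]

# find('.//a') searches the whole subtree and returns the first match.
nested = [td.find(".//a").text for td in cells]

print([d is not None for d in direct], nested)
```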

In [None]:
DoL['rank'] = rank_column
DoL['state'] = state_column
DoL['population'] = pop_column

df = pd.DataFrame(DoL)
df.head()

## Example 3: Data Organized as (Nested) Ordered or Unordered Lists

> HTML List Constructs: https://www.w3schools.com/html/html_lists.asp

**web page with `ind0` data set**: http://personal.denison.edu/~bressoud/datasystems/ind0.html

In [None]:
protocol = "http"
location = "personal.denison.edu"
resourcepath = "/~bressoud/datasystems/{}"

url = buildURL("ind0.html")
response = requests.get(url)
assert response.status_code == 200

In [None]:
tree3 = etree.parse(io.BytesIO(response.content), parser)
root3 = tree3.getroot()

Discovery 1: Find the unordered lists

In [None]:
root3.xpath('//ul')

> There are ten, but unlike the table exploration from Example 2, these Element subtrees are not distinct.  They come from different levels based on the nesting, and so most, if not all, of the above are part of the same single dataset.

When we check the first of these, we find that it is **not** the one we are looking for.

In [None]:
print_tree(root3.xpath('//ul')[0])

Discovery 2: Sometimes, particularly for complex pages, we need to narrow in on a subtree, and sometimes an HTML heading level can help us find the subset of the document that we need.  Using Developer Tools, we find that the **ind0** header for the table is an `h2` element, and that the data is (multiple levels down) in a `div` that is sibling to the `h2`.  So we anchor our further scraping by using XPath and finding the `h2`, going back up a level, and then going down the sibling `div` to the first `ul`.  The `h2` is within a node whose `id` attribute is `"main-content"`.

In [None]:
path = """//*[@id="main-content"]/h2/../div"""
result = root3.xpath(path)
assert len(result) == 1
div_ancestor = result[0]

If we added `"//ul"` or `"//ul[1]"`, we end up finding *multiple* results because of the nesting.  The first form will find the top and all subordinate `ul` nodes, the second will find all **first** `ul` nodes, but nested `ul`'s also have a first one.

From the ancestor `div` node, we use the subset of XPath capability of the `find()` method to find the *first* descendant:

In [None]:
ul_root = div_ancestor.find(".//ul")
print_tree(ul_root)
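The behavior of `//ul`, `//ul[1]`, and `find()` on nested lists can be seen in a small made-up tree.  The `id` attributes below are assumptions added only so that the results are easy to read.

```python
from lxml import etree

# Hypothetical nested lists; ids exist only to label the results.
html = (
    '<div id="anchor"><ul id="top">'
    '<li>A<ul id="a"><li>1<ul id="a1"><li>x</li></ul></li></ul></li>'
    '<li>B<ul id="b"><li>2</li></ul></li>'
    '</ul></div>'
)
root = etree.fromstring(html, etree.HTMLParser())
div = root.xpath("//div[@id='anchor']")[0]

# Every ul at every nesting level.
all_uls = [u.get("id") for u in div.xpath(".//ul")]

# Every ul that is the first ul child of ITS parent: here that is all of them.
first_of_parent = [u.get("id") for u in div.xpath(".//ul[1]")]

# find() returns only the first match in document order.
first_found = div.find(".//ul").get("id")

# Pure XPath equivalent: parenthesize, then index the whole node set.
first_xpath = div.xpath("(.//ul)[1]")[0].get("id")

print(all_uls, first_of_parent, first_found, first_xpath)
```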

In [None]:
parse_indicator = lambda s: (s[:s.index(':')], float(s.split()[1]))
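The `parse_indicator` helper assumes each indicator string looks like `name: value`, with a single-word name.  The sketch below applies it to a made-up string and adds a hypothetical variant, `parse_indicator_by_colon` (not part of the original notebook), that splits on the colon and so would also tolerate a name containing spaces.

```python
# Same helper as in the notebook cell above.
parse_indicator = lambda s: (s[:s.index(':')], float(s.split()[1]))

def parse_indicator_by_colon(s):
    """Hypothetical variant: split at the first colon instead of on whitespace."""
    name, _, value = s.partition(":")
    return name.strip(), float(value)

# Made-up indicator strings in the assumed 'name: value' layout.
simple = parse_indicator("pop: 1.5")
also_simple = parse_indicator_by_colon("pop: 1.5")
spaced = parse_indicator_by_colon("life exp: 70.2")

print(simple, also_simple, spaced)
```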

### Procedural

This approach is procedural, at least **after** we have obtained the root of the unordered list subtree.

In [None]:
LoD = []
for country_list in ul_root:
    assert country_list.tag == 'li'
    code = country_list.text
    for time_list in country_list.find('ul'):
        assert time_list.tag == 'li'
        if time_list[0].tag == 'span':
            year = int(time_list[0].text)
        else:
            year = int(time_list.text)
        rowD = {'code': code, 'year': year}
        for indicator in time_list.find('ul'):
            ind, value = parse_indicator(indicator.text)
            rowD[ind] = value
        LoD.append(rowD)
LoD

In [None]:
df = pd.DataFrame(LoD)
df.set_index(['code', 'year'], inplace=True)
df

### XPath Alternative

??
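One possible shape for an XPath-based alternative is sketched here.  It is only a sketch under stated assumptions: the HTML is a toy stand-in for the nested list (codes, years, and values are invented, and the real `ind0.html` may differ in whitespace or structure), and the colon-splitting parser is a hypothetical helper.  The idea is to let XPath select every year-level `li` directly, and then reach up to the country code and down to the indicators with relative paths.

```python
from lxml import etree

# Toy stand-in for the nested list; codes, years, and values are invented.
html = (
    "<ul>"
    "<li>CAN<ul>"
    "<li>2005<ul><li>A: 1.0</li><li>B: 2.0</li></ul></li>"
    "<li><span>2010</span><ul><li>A: 3.0</li><li>B: 4.0</li></ul></li>"
    "</ul></li>"
    "<li>USA<ul>"
    "<li>2005<ul><li>A: 5.0</li><li>B: 6.0</li></ul></li>"
    "</ul></li>"
    "</ul>"
)
root = etree.fromstring(html, etree.HTMLParser())
ul_root = root.find(".//ul")

def parse_pair(s):
    """Hypothetical helper: split 'name: value' at the first colon."""
    name, _, value = s.partition(":")
    return name.strip(), float(value)

LoD = []
# country li -> ul -> year li, all in one path
for time_li in ul_root.xpath("./li/ul/li"):
    # Up two levels to the country li, then its first text node.
    code = time_li.xpath("normalize-space(../../text()[1])")
    # The year is either in a span child or in the li's own text.
    year = int(time_li.xpath("normalize-space(span/text() | text()[1])"))
    rowD = {"code": code, "year": year}
    for text in time_li.xpath("./ul/li/text()"):
        ind, value = parse_pair(text)
        rowD[ind] = value
    LoD.append(rowD)

print(LoD)
```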

## Example 4: Post Example

```
<form action='index_cms.php' method='post' style='margin-left:10px;'>
  <label for='year'>
    <select name='year' id='year'>
        <option value='2020'>Select Year</option>
        <option value='2020'>2020</option>
        <option value='2019'>2019</option>
        <option value='2018'>2018</option>
        <option value='2017'>2017</option>
        ...
        <option value='1999'>1999</option>
    </select>
  </label>
  <input name='newYear' type='submit' value='Get different year' />
</form>
```

In [None]:
"""
https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php
"""

protocol = "https"
location = "ww2.energy.ca.gov"
resourcepath = "/almanac/transportation_data/gasoline/margins/index_cms.php"

url = buildURL(resourcepath)

In [None]:
response = requests.get(url)
assert response.status_code == 200

tree4 = etree.parse(io.BytesIO(response.content), parser)
root4 = tree4.getroot()

In [None]:
form = root4.find(".//form[@action='index_cms.php']")
print_tree(form)

**Conclusions**

1. A GET to this resource path results in an HTML page with multiple (weekly) tables, each of which has data of interest.
2. The page has a form element, whose `method` attribute is `"post"`.  That means that, when the embedded form is "filled out" and the user submits the form, an HTTP POST is the result:
    - The `action` attribute of the form determines the resource path, relative to the current location, for the URI/resource path needed in the HTTP POST
    - The "form", in this case, just consists of a dropdown list, whose entries are given by the sequence of `option` nodes, and whose values are the possible years.  The key for this field is called `year`, as given in the `select` node.  The value will be one of the year values.
    - The `input` node determines the submission of the form.  In this case, when the user clicks the `"Get different year"` button, the form will be submitted and, in addition to the key=value items from the form items, the `name` attribute of the `input`, `newYear`, will be mapped to the `value` of "Get different year".
    
A second way of gathering this necessary information would be to use a browser's Developer Tools and observe the Network behavior when a user selects a year and submits by hitting the `Get different year` button.  This action will result in the HTTP POST request, and examination of the request will show the POST, the resource path (`index_cms.php`), and the body will show the URL-encoded key-value pairs with entries for keys `year` and `newYear` mapping to the selected year and "Get different year", respectively.

### Emulating an Interactive Form-Based POST

We use an HTTP POST to convey information from the client to the server.  The information conveyed is in the *body* of the request.  So, in contrast to most earlier examples, we need to change two things in using the `requests` module to make this request:

1. We must issue a POST request instead of a GET request.
2. The request must include a body that consists of key-value pairs.

For (1), the `requests` module has a `post` top level function.  For (2), we construct a *dictionary* with the desired mappings.  We pass that to the `post()` using named parameter `data`.  The requests module is very flexible in how it interprets an argument provided through `data`.  If it is a string, it simply puts the encoded bytes of the string in the body.  If it is a dictionary, it interprets it and generates a URL-encoded version, as we will see below:

In [None]:
year = 1998

payload = {'year': year, 'newYear': 'Get different year'}
response = requests.post(url, data=payload)
assert response.status_code == 200

In [None]:
request = response.request
request.body

In the above, we use the response to get the request object.  We then examine the body of the request and see a character sequence with key=value mappings, separated by `&`.  Forms in the body of a POST follow the same URL-encoding that we use for query parameters.  Spaces can be mapped to the `+` character (or to `%20`).  We did not have to perform this formatting ourselves; the `requests` module takes a mapping dictionary and performs this task for us.

W3Schools on URL Encoding: https://www.w3schools.com/tags/ref_urlencode.ASP
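The same form encoding can be reproduced with the standard library alone, which may help when checking what a body should look like.  This sketch uses `urllib.parse.urlencode` on the payload dictionary from above; it illustrates the encoding scheme and is not a trace of what `requests` does internally.

```python
from urllib.parse import urlencode, parse_qs, quote

payload = {'year': 1998, 'newYear': 'Get different year'}

# Default form encoding: pairs joined by '&', spaces become '+'.
body_plus = urlencode(payload)

# Alternative quoting: spaces become '%20' instead.
body_pct = urlencode(payload, quote_via=quote)

# Both forms decode back to the same key/value pairs.
decoded = parse_qs(body_plus)

print(body_plus)
print(body_pct)
print(decoded)
```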

In [None]:
request.method

In [None]:
request.path_url

In [None]:
request.headers

Note also how the `requests` module informed the server about the format of the body of the POST by setting the `'Content-Type'` header line.
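The method, body, and `Content-Type` header can also be inspected without sending anything, by preparing the request instead of posting it.  In this sketch the URL is a placeholder host (an assumption, chosen so that nothing real is contacted); `requests.Request(...).prepare()` builds the same kind of request object that we examined through `response.request`.

```python
import requests

# Placeholder URL: prepare() builds the request but never sends it.
url = "https://example.invalid/index_cms.php"
payload = {'year': 1998, 'newYear': 'Get different year'}

prepared = requests.Request("POST", url, data=payload).prepare()

method = prepared.method
body = prepared.body
content_type = prepared.headers["Content-Type"]

print(method)
print(body)
print(content_type)
```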

### Processing the Data in the HTML Tree

In the result, there is a **table per week**.

In [None]:
tree4 = etree.parse(io.BytesIO(response.content), parser)
root4 = tree4.getroot()

Another discovery process finds that each of the weekly tables is an immediate child of a `div` whose `class` attribute is `'contnr'`.  This knowledge allows us to get the set of weekly tables directly with a specific XPath, with no chance of ambiguity or of collecting other tables in the tree.

In [None]:
# Get a list of the weekly tables

table_list = root4.xpath("//div[@class='contnr']/table")
print(len(table_list))

In [None]:
print_tree(table_list[0])

Given that each table represents a single week, and that the rows in the table represent variables, then each table will give us a single row for a table representing the data of the page.  With an eye toward collecting a List of Dictionaries for construction of the table, we will develop processing of one table to result in one (row) dictionary.

We can see from the print of the tree, that the first piece of data needed, the date, is in a `caption` child of the `table`.  Let us postulate data columns:

`['distrib_cost', 'crude_cost', 'refine_cost', 'storage', 'state_local_tax', 'state_excise_tax', 'fed_excise_tax', 'retail_price']`

Assume we just want the `Branded` data.

In [None]:
data_cols = ['distrib_cost', 'crude_cost', 'refine_cost', 'storage', 
             'state_local_tax', 'state_excise_tax', 'fed_excise_tax', 'retail_price']

In [None]:
table = table_list[0]
date = table[0][0].text

Each individual row has one `th` and two `td` nodes, and for the `Branded` data, we want the first of those `td` nodes.  First row contains the Branded/Unbranded header.

In [None]:
datastrings = table.xpath("./tr[position()>1]/td[1]/text()")

In [None]:
datalist = [float(s[1:]) for s in datastrings]
datalist

In [None]:
D = {key:value for key, value in zip(data_cols, datalist)}

In [None]:
D['date'] = date

In [None]:
D

As a function to process one table:

In [None]:
def processTable(table):
    data_cols = ['distrib_cost', 'crude_cost', 'refine_cost', 'storage', 
                 'state_local_tax', 'state_excise_tax', 'fed_excise_tax', 'retail_price']
    date = table[0][0].text
    datastrings = table.xpath("./tr[position()>1]/td[1]/text()")
    datalist = [float(s[1:]) for s in datastrings]
    D = {key:value for key, value in zip(data_cols, datalist)}
    D['date'] = date
    return D
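To try the per-table logic without contacting the server, here is a sketch on a made-up weekly table.  The caption text, row labels, and dollar amounts are invented, and the table layout (a `caption`, one header row, then rows of one `th` and two `td`) is assumed from the description above.  The function `process_table_sketch` is a hypothetical variant of `processTable` that strips the `$` sign explicitly and reads the caption text with `normalize-space`.

```python
from lxml import etree

# Hypothetical weekly table; the layout is assumed, the values are invented.
html = """
<table>
  <caption><strong>Week of January 1</strong></caption>
  <tr><th></th><th>Branded</th><th>Unbranded</th></tr>
  <tr><th>Distribution Costs</th><td>$0.10</td><td>$0.20</td></tr>
  <tr><th>Crude Oil Cost</th><td>$1.00</td><td>$1.10</td></tr>
</table>
"""
root = etree.fromstring(html, etree.HTMLParser(remove_blank_text=True))
sample_table = root.find(".//table")

def process_table_sketch(table, data_cols):
    """Build one row dictionary from one weekly table (Branded column only)."""
    # All text under the caption, with surrounding whitespace collapsed.
    date = table.xpath("normalize-space(./caption)")
    # Skip the header row, take the first td (Branded) of each data row.
    cells = table.xpath(".//tr[position()>1]/td[1]/text()")
    values = [float(s.strip().lstrip("$").replace(",", "")) for s in cells]
    row = dict(zip(data_cols, values))
    row["date"] = date
    return row

row = process_table_sketch(sample_table, ["distrib_cost", "crude_cost"])
print(row)
```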

With a function, we can then easily use a list comprehension to generate our list of dictionaries over the set of tables acquired through our original XPath:

In [None]:
LoD = [processTable(table) for table in table_list]

And finally build our Data Frame:

In [None]:
df = pd.DataFrame(LoD)
df.set_index('date', inplace=True)
df.head()