Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

In [None]:
import requests
import pandas as pd
from lxml import etree
import io
import os
from IPython.display import Image

import util
from util import *
import importlib

In [None]:
importlib.reload(util)
from util import *

# Web Scraping

Web scraping is the process of taking data that resides in web pages, whose format is HTML, parsing it into a tree, and then performing operations on the tree to extract information and transform it into a usable form.

This notebook complements the treatment from Chapter 22 of our textbook, but does not replace it.  

## Warmup

### Discovery

Consider the web page at http://personal.denison.edu/~bressoud/datasystems/basic.html, shown in its rendered form below:

In [None]:
Image("figs/basic.jpg", width=300)

**Student Action** Go to the above web page using Chrome, navigate to View->Developer->Inspect Elements, and hand-draw the tree rooted at `html`.  You can omit the `/html/head/style` elements, and abbreviate text children.

### Programmatic Extraction

Run the cell below to acquire the HTML that you just examined from its location on the web.  Notice that the cell constructs an `HTMLParser` and passes it to `etree.parse()`, since parsing HTML differs in some important ways from parsing XML (with the `XMLParser`).  On completion of the cell, `root` refers to the root `html` element.

In [None]:
# Reading from the web into an XML Element, using custom parser
protocol = "http"
location = "personal.denison.edu"
resourcepath = "/~bressoud/datasystems/{}"

buildURL = lambda s: "{}://{}{}".format(protocol, location, resourcepath.format(s))

htmlparser = etree.HTMLParser(remove_blank_text=True) 

url = buildURL("basic.html")
response = requests.get(url)
assert response.status_code == 200

tree = etree.parse(io.BytesIO(response.content), htmlparser)
root = tree.getroot()
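
To see why the parser choice matters, here is a minimal sketch that needs no network access.  The `snippet` below is a made-up fragment (an assumption for illustration, not the content of `basic.html`) in which the `br` and `p` tags are never closed, as is common in real HTML.

```python
import io
from lxml import etree

# Hypothetical HTML fragment: <br> and <p> are never closed
snippet = b"<html><body><p>First<br>Second</body></html>"

# The HTML parser tolerates the unclosed tags and still builds a tree
html_tree = etree.parse(io.BytesIO(snippet), etree.HTMLParser())
html_root = html_tree.getroot()
print(html_root.tag, [child.tag for child in html_root])

# The XML parser requires well-formed input and rejects the same bytes
try:
    etree.parse(io.BytesIO(snippet), etree.XMLParser())
    xml_ok = True
except etree.XMLSyntaxError as err:
    xml_ok = False
    print("XML parser rejected the input:", err)
```

The same bytes yield a usable tree under one parser and an exception under the other, which is why we pass an `HTMLParser` explicitly when scraping.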

> The next few steps illustrate extracting information from the HTML tree using XPath (declarative) and `lxml` Element object traversal (procedural).
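
As a small illustration of the two styles, the sketch below extracts the same text both ways from a made-up page.  The HTML string and the variable names are assumptions for illustration only and are unrelated to `basic.html`.

```python
from lxml import etree

# Hypothetical page, unrelated to basic.html
html = "<html><body><h1>Title</h1><p>One</p><p>Two</p></body></html>"
doc = etree.fromstring(html, etree.HTMLParser())

# Declarative: describe the nodes wanted and let XPath find them
declarative = doc.xpath("/html/body/p/text()")

# Procedural: walk the Element objects ourselves
body = doc.find("body")
procedural = [child.text for child in body if child.tag == "p"]

print(declarative)
print(procedural)
```

Both lists hold the same two strings.  The XPath version states *what* we want, while the traversal spells out *how* to reach it.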

**Student Action**  The function `print_levels(node, level, maxlevel, maxchildren)` takes an Element `node` and a starting `level` (for our purposes, always 0) and recursively prints the XML/HTML tree, showing begin and end tags and nesting.  The recursion stops at a depth of `maxlevel`, and at most `maxchildren` children are printed under any given node.  Invoke this function below multiple times on the HTML tree, trying a different value for `maxlevel` each time.  We start you off with a `maxlevel` of 0.  If `maxchildren` is not specified, it defaults to 30, so we omit it in this example.

In [None]:
print_levels(root, 0, maxlevel=0)
# YOUR CODE HERE
raise NotImplementedError()

**Student Action** Delete the `raise NotImplementedError()` and change the assignment to `xs` to be an XPath expression that retrieves the `href` attribute from the `a` node. (We are using `xs` to name an **x**path **s**tring.)

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
result = root.xpath(xs)
print(result[0])

**Student Action** Now create, in `xs`, an XPath expression that yields the text of the three `li` items in the outer unordered list (`ul`).

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
result = root.xpath(xs)
print(result)

**Variations** How about the inner list items?  All list items?  How would you remove extraneous whitespace?
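
For the whitespace question, one possible approach (a sketch, not the only answer) is to clean the strings in Python after the XPath query returns.  The list `raw` below is made up to stand in for an XPath result that contains padding and whitespace-only strings.

```python
# Hypothetical XPath result with padding and a whitespace-only entry
raw = ["\n   Item one\n  ", "Item   two ", "\n\t", "  Item three"]

# Drop whitespace-only strings, then collapse runs of whitespace to one space
cleaned = [" ".join(s.split()) for s in raw if s.strip()]
print(cleaned)
```

Calling `split()` with no argument splits on any run of whitespace, so rejoining with a single space both trims the ends and normalizes the interior.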

## Topnames Table with GET

### Discovery

Consider the web page at http://personal.denison.edu/~bressoud/datasystems/topnames.html, shown in its rendered form below:

In [None]:
Image("figs/topnames.jpg", width=400)

**Student Action** Go to the above web page using Chrome and navigate to View->Developer->Inspect Elements.

- Find where in the HTML the "topnames" label for the table appears (not the tab title, but in the content of the page proper).  Traverse "up" to the first `div` ancestor, and draw the subtree starting at that point, down to the point where you have included the label text, but only covering the first full subtree of the `div` (the one containing the label).

- Find the `table` node within the overall tree, and then draw the subtree rooted at `table`.  You need only include the first two data-carrying rows.  You can also use abbreviations of your own choosing to make this less onerous.

In [None]:
# Reading from the web into an XML Element, using custom parser
url = buildURL("topnames.html")
response = requests.get(url)
assert response.status_code == 200

tree = etree.parse(io.BytesIO(response.content), htmlparser)
root = tree.getroot()

**Student Action** Create in `xs` an XPath expression that (uniquely) finds the `div` that is the common ancestor of both the topnames label and the `table` containing the data.  Assign the node itself (not the list containing the node) to the variable `divroot` and use `print_levels` to show enough of the tree that the printout includes the label text.

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
print_levels(divroot, 0, maxlevel=2)

**Student Action** [Optional] Suppose you knew that a table and its label existed on a web page structured this way, but you did not know the text of the label in advance, and the label itself was part of what you wanted to extract.  Create in `xs` an XPath expression that, operating on `divroot`, retrieves the label of the table.

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
result = divroot.xpath(xs)
print_results(result[:5])
assert len(result) == 1

**Student Action**  Starting either from the `root` of the HTML tree or from the `divroot` found above, assign to `xs` an XPath expression that selects the `table` element, and assign the resulting Element (the subtree rooted at `table`) to the variable `table`.  Then use `print_levels` to make sure your understanding of the table tree matches your hand-drawn tree from earlier.  If you chose to start from `root`, would your expression be **guaranteed** to get the table you are interested in?

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
print_levels(table, 0, maxlevel=3, maxchildren=3)

### Programmatic Extraction

#### List of Lists

**Student Action** Assign to `xs` an XPath expression that retrieves the names of the columns of the table, and assign the result to `col_names`.  Write your expression as an absolute one, working from the root.

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
print(col_names)

Note that if the HTML document had multiple tables, then the XPath expression above would match header cells from all `th` elements anywhere in the document, including ones in totally separate tables.  In such a case, we may have to make further assumptions about the structure of the HTML to uniquely find the desired table.  
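
The ambiguity described above can be reproduced on a small made-up document.  In this sketch, the two tables and their `id` attributes are assumptions for illustration; a real page may offer a different distinguishing feature (a `class`, an enclosing `div`, or position).

```python
from lxml import etree

# Hypothetical page with two unrelated tables
html = (
    "<html><body>"
    "<table id='first'><thead><tr><th>a</th><th>b</th></tr></thead></table>"
    "<table id='second'><thead><tr><th>x</th><th>y</th></tr></thead></table>"
    "</body></html>"
)
page = etree.fromstring(html, etree.HTMLParser())

# Matches header cells from both tables
all_headers = page.xpath("//th/text()")

# A predicate on the table narrows the match to the one we want
second_only = page.xpath("//table[@id='second']//th/text()")

print(all_headers)
print(second_only)
```

The predicate encodes an extra structural assumption about the page, which is exactly the kind of assumption the discovery step is meant to justify.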

In the following, we will assume that `table` refers to the correct table, and focus on extracting the data and transforming it into a data frame.

We show a method from the book that solves this problem by the following steps:

1. Use XPath to retrieve a single list of the text property of all `td` nodes under the `tbody`.
2. Using the knowledge that there are four fields per row, iterate over this single list and, by putting sequential sets of four elements into a row list, create a LoL representation of the data.
3. Build the dataframe based on the LoL and the column names.

In [None]:
tdlist = table.xpath("./tbody/tr/td/text()")
print(len(tdlist))
print(tdlist[0:10])

We extract this into a list of lists that we can feed to `pandas`. Each list is a row, i.e., a slice of length 4 from `tdlist`.

In [None]:
LoL = []
fieldcount = 0
for item in tdlist:
    if fieldcount == 0:
        row = []
    row.append(item)
    if fieldcount < 3:
        fieldcount += 1
    else:
        LoL.append(row)
        fieldcount = 0
LoL[:10]
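
The same grouping can also be written with slices, which may make the "sequential sets of four" idea easier to see.  This is an alternative sketch, not the book's method; the list `flat` holds placeholder strings rather than real table data, and `width` is a name introduced here.

```python
# Placeholder values standing in for the flat list of td text
flat = ["y1", "s1", "n1", "c1",
        "y2", "s2", "n2", "c2",
        "y3", "s3", "n3", "c3"]

width = 4  # number of fields per row
# Step through the list in strides of `width`, slicing out one row each time
rows = [flat[i:i + width] for i in range(0, len(flat), width)]
print(rows)
```

Both versions rely on the same assumption: every row contributes exactly four `td` text nodes, so an empty cell in the HTML would shift all later values.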

In [None]:
# Turning the LoL into a dataframe
df = pd.DataFrame(LoL,columns=col_names)
df.set_index(['year','sex'], inplace=True)
df.head(10)

#### Dictionary of Column Lists

A perhaps simpler solution involves using the regularity of the columns in a table (be it in HTML or other regular table form).  Within each `tr`, the **position** of each of the `td` elements for the four fields in this table is always the same, regardless of row.  So at position 1 within all the rows, we always have the year, at position 2, we always have the sex, and so forth.

**Student Action** Assign to `xs` an XPath expression that retrieves, relative to `table`, the text property for all `td` elements under a `tr` where the `td` is the position 1 child of the `tr`:

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
year_vector = table.xpath(xs)
print(len(year_vector))
print(year_vector[:10])

We could do this four times, with different values for the position, creating four vectors and then constructing the dictionary.  Instead, we want to use Python string formatting to dynamically create an XPath expression that retrieves the `td` at a position given by a variable, and then traverses to its text.

**Student Action** Assign to `xs_template` a Python string using `{}` in place of the position, so that the testing code shows its use by obtaining the four data columns from the table.

In [None]:
xs_template = ""
# YOUR CODE HERE
raise NotImplementedError()
years = table.xpath(xs_template.format(1))
print(years[:8])
sexes = table.xpath(xs_template.format(2))
print(sexes[:8])
names = table.xpath(xs_template.format(3))
print(names[:8])
counts = table.xpath(xs_template.format(4))
print(counts[:8])

In [None]:
DoL = {}
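for index, column in enumerate(col_names):
    # Fill in the template (xs_template, not the fixed-position xs)
    # with the 1-based position of this column
    xpath = xs_template.format(index+1)
    DoL[column] = table.xpath(xpath)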
df = pd.DataFrame(DoL)
df.set_index(['year','sex'], inplace=True)
df.head(10)

## POST Example

### Discovery

Consider the web page at https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php

We can infer from a PHP page, often with a `php` extension, that the page is a dynamic one, that, on an HTTP request, will respond to the request by dynamically generating the HTML content.  PHP is a scripting language that lets a server execute code instead of serving up static content.

On the given page, navigate toward the bottom of the page where you will find a `Select Year` drop down and a button labeled `Get different year`, and pick the year you were born, or 1999 if you were born earlier than 1999, and then click the `Get different year` button.

The two GUI elements of the drop-down and the submission button constitute what, in HTML, is called a **form** (albeit one of the simplest forms one might imagine).

A PHP or other dynamic resource at a web server commonly obtains the information it needs for processing from a POST request made by the client.  The request includes information in the **body** of the request that passes, from client to server, the information needed to do the dynamic generation.  Without such input, the page could simply be static.

In this case, by making a year choice and clicking the submission button, the web browser makes a POST request and formats the body with the form data.  We need our web scraping client applications to be able to perform the same way.

**Student Action** Go to the above web page using Chrome

- navigate to View->Developer->Developer Tools
- In the Tools window, select the `Network` tab
- In the window showing the web page itself, navigate to near the bottom and repeat your earlier action of selecting a year and then clicking the `Get different year` button.
    - This action should result in a set of entries appearing in the Network window.  The first of these, in the `Name` subwindow, should be labeled `index_cms.php`
- Click on that first entry, `index_cms.php`
    - The lower window will subdivide, and you should see sub-tabs with names `Headers`, `Preview`, `Response`, etc.
- Click on the `Headers` sub-tab
- Make sure the `Request Headers` and the `Form Data` elements under `Headers` are expanded and the others are collapsed.
- Click the `view source` next to the `Request Headers` element:
    - Find the HTTP request line.  Do you recognize the syntax?
    - What does the `Content-length` header line tell you?
    - How about the `Content-type` header?
    - Are these headers formed by the client or by the server?
- Now examine the `Form Data` element:
    - what are the key-value pairs?
    - Now click on the `view source`
    - Where, in an HTTP request, would this information be placed?
    - Click back on the `view parsed` to get back to the key-value view
    - Toggle back and forth between `view URL encoded` and `view decoded`; what is the difference?  Which of these two is used in the `view source` view?

In [None]:
protocol = "https"
location = "ww2.energy.ca.gov"
resourcepath = "/almanac/transportation_data/gasoline/margins/index_cms.php"

url = buildURL(resourcepath)
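
It may look odd that `buildURL`, defined earlier for a different site, produces the right URL here.  The lambda looks up `protocol`, `location`, and `resourcepath` when it is *called*, not when it is defined, and `str.format` ignores extra arguments when the string has no `{}` placeholder.  The sketch below demonstrates both effects with a made-up host; `build_url` and `example.org` are stand-ins, not names from this notebook.

```python
protocol = "http"
location = "example.org"        # hypothetical host
resourcepath = "/docs/{}"

build_url = lambda s: "{}://{}{}".format(protocol, location, resourcepath.format(s))
first = build_url("a.html")

# Rebinding the variables changes what the same lambda produces
protocol = "https"
resourcepath = "/fixed/path.php"   # no {} placeholder
second = build_url("ignored")      # the argument has nowhere to go

print(first)
print(second)
```

This convenience comes at a cost: the result of the function depends on variables that can change between cells, so re-running cells out of order can silently yield a different URL.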

In [None]:
response = requests.get(url)
assert response.status_code == 200

tree = etree.parse(io.BytesIO(response.content), htmlparser)
root = tree.getroot()

If we use Developer Tools (or View Source) we can find the form that contains the year drop-down and the submit button labeled `Get different year`.  We want to examine that `form` subtree within our HTML.

**Student Action** Assign to `xs` an XPath string that finds the `form` node whose `action` attribute is set to the page's PHP resource, `index_cms.php`.

In [None]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()

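# xpath() accepts full XPath, including expressions that start with //;
# take the first matching node
form = root.xpath(xs)[0]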
print_levels(form, 0, maxlevel=3, maxchildren=5)

**Conclusions**

1. A GET to this resource path results in an HTML page with multiple (weekly) tables, each of which has data of interest.
2. The page has a form element, whose `method` attribute is `"post"`.  That means that, when the embedded form is "filled out" and the user submits the form, an HTTP POST is the result:
    - The `action` attribute of the form gives the resource path, relative to the current location, to use in the HTTP POST.
    - The "form", in this case, just consists of a dropdown list, whose entries are given by the sequence of `option` nodes, and whose values are the possible years.  The key for this field is called `year`, as given in the `select` node.  The value will be one of the year values.
    - The `input` node determines the submission of the form.  In this case, when the user clicks the `"Get different year"` button, the form will be submitted and, in addition to the key=value items from the form items, the `name` attribute of the `input`, `newYear`, will be mapped to its `value` of "Get different year".
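
These conclusions can be turned into code that reads the keys and values out of a form subtree.  The sketch below works on a made-up form modeled on the description above; the exact markup (including `type="submit"` and the two sample years) is an assumption, and the real page should be inspected rather than trusted to match.

```python
from lxml import etree

# Hypothetical form, modeled on the structure described above
form_html = """
<form action="index_cms.php" method="post">
  <select name="year">
    <option value="1999">1999</option>
    <option value="2000">2000</option>
  </select>
  <input type="submit" name="newYear" value="Get different year"/>
</form>"""
demo_form = etree.fromstring(form_html, etree.HTMLParser()).find(".//form")

field = demo_form.xpath("./select/@name")[0]           # key for the dropdown
choices = demo_form.xpath("./select/option/@value")    # allowed values
submit = demo_form.xpath("./input[@type='submit']")[0]

# One key/value pair per form item, plus the submit button's name/value
demo_payload = {field: choices[0], submit.get("name"): submit.get("value")}
print(demo_form.get("method"), demo_form.get("action"))
print(demo_payload)
```

Deriving the keys from the page, rather than typing them by hand, makes it easier to notice when a site renames a field.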

### Emulating an Interactive Form-Based POST

We use an HTTP POST to convey information from the client to the server.  The information conveyed is in the *body* of the request.  So, in contrast to most earlier examples, we need to change two things in using the `requests` module to make this request:

1. We must invoke a POST request instead of a GET request.
2. The request must include a body that consists of key-value pairs.

For (1), the `requests` module has a top-level `post` function.  For (2), we construct a *dictionary* with the desired mappings and pass it to `post()` using the named parameter `data`.  The `requests` module is flexible in how it interprets an argument provided through `data`.  If it is a string, it simply puts the encoded bytes of the string in the body.  If it is a dictionary, it interprets it and generates a URL-encoded version, as we will see below:

In [None]:
year = 1999

payload = {'year': year, 'newYear': 'Get different year'}
response = requests.post(url, data=payload)
assert response.status_code == 200

In [None]:
request = response.request
request.body

In the above, we use the response to get the request object.  We then examine the body of the request and see a character sequence with key=value mappings, separated by `&`.  Form data in the body of a POST follows the same URL-encoding that we use for query parameters.  Spaces are mapped to the `+` character (or to `%20`).  We did not have to perform this formatting ourselves; the `requests` module can take a mapping dictionary and perform this task for us.

W3Schools on URL Encoding: https://www.w3schools.com/tags/ref_urlencode.ASP
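
The encoding itself can be reproduced with the standard library, which helps when checking what a client *should* send.  This sketch uses `urllib.parse` rather than `requests`; the variable names are introduced here for illustration.

```python
from urllib.parse import urlencode, quote, parse_qs

form_data = {"year": 1999, "newYear": "Get different year"}

# Form encoding: key=value pairs joined by &, spaces become +
encoded_body = urlencode(form_data)
print(encoded_body)

# Percent-encoding of the same value: spaces become %20
print(quote("Get different year"))

# Decoding recovers the original strings (each value comes back as a list)
print(parse_qs(encoded_body))

# The Content-Length header counts the bytes of the encoded body
print(len(encoded_body.encode("ascii")))
```

Note that decoding returns every value as a string, so the integer `1999` comes back as `'1999'`.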

In [None]:
request.method

In [None]:
request.path_url

In [None]:
request.headers

Note also how the `requests` module informed the server about the format of the body of the POST by setting the `'Content-Type'` header line.
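
To compare how `requests` treats a dictionary versus a string without contacting any server, we can build and prepare requests locally.  This is a sketch: the URL is a placeholder, and `from_dict` and `from_str` are names introduced here.

```python
import requests

demo_url = "https://example.org/form.php"   # hypothetical; never contacted

# prepare() builds the outgoing request but does not send it
from_dict = requests.Request(
    "POST", demo_url, data={"year": 1999, "newYear": "Get different year"}
).prepare()
from_str = requests.Request("POST", demo_url, data="year=1999").prepare()

print(from_dict.body)
print(from_dict.headers.get("Content-Type"))

print(from_str.body)
print(from_str.headers.get("Content-Type"))
```

With a dictionary, the body is URL-encoded and the `Content-Type` header is set for us.  With a string, the body is passed through as given and no `Content-Type` is added, so we would have to supply that header ourselves.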

### Processing the Data in the HTML Tree

In the result, there is a **table per week**.

In [None]:
tree1999 = etree.parse(io.BytesIO(response.content), htmlparser)
root1999 = tree1999.getroot()

Another discovery process finds that each of the weekly tables is an immediate child of a `div` whose `class` attribute is `'contnr'`.  This knowledge allows us to get the set of weekly tables directly with a specific XPath expression, without ambiguity and without collecting other tables in the tree.

In [None]:
# Get a list of the weekly tables

table_list = root1999.xpath("//div[@class='contnr']/table")
print(len(table_list))

In [None]:
first_table = table_list[0]
print_levels(first_table, 0, maxlevel=3, maxchildren=4)

**Student Activity** Experiment with `first_table` and hand-draw enough of the table to demonstrate that you understand its structure.

Given that each table represents a single week, and that the rows in the table represent variables, each table will give us a single row of the data frame representing the page's data.  With an eye toward collecting a List of Dictionaries for construction of the data frame, we will develop processing of one table to result in one (row) dictionary.

We can see from the print of the tree, that the first piece of data needed, the date, is in a `caption` child of the `table`.  Let us postulate data columns:

`['distrib_cost', 'crude_cost', 'refine_cost', 'storage', 'state_local_tax', 'state_excise_tax', 'fed_excise_tax', 'retail_price']`

Assume we just want the `Branded` data.

In [None]:
data_cols = ['distrib_cost', 'crude_cost', 'refine_cost', 'storage', 
             'state_local_tax', 'state_excise_tax', 'fed_excise_tax', 'retail_price']

In [None]:
date = first_table[0][0].text
date

Each individual row has one `th` and two `td` nodes, and for the `Branded` data, we want the first of those `td` nodes.  The first row contains the Branded/Unbranded header, so we skip it.

In [None]:
datastrings = first_table.xpath("./tr[position()>1]/td[1]/text()")

In [None]:
datalist = [float(s[1:]) for s in datastrings]
datalist
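
The slice `s[1:]` assumes every string has exactly one leading character to drop.  If we assume that character is a dollar sign, as in `$1.23` (an assumption; check the printed strings), a slightly more defensive conversion might look like the sketch below.  The helper `parse_dollars` is hypothetical and not part of `util`.

```python
def parse_dollars(text):
    """Convert a string such as '$1.23' to a float.

    Tolerates surrounding whitespace and thousands separators.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)

print(parse_dollars("$0.12"))
print(parse_dollars(" $1,234.50 "))
```

Unlike the slice, this version does not discard a digit when the symbol is missing or when the string has leading whitespace.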

In [None]:
D = {key:value for key, value in zip(data_cols, datalist)}

In [None]:
D['date'] = date

In [None]:
D

As a function to process one table:

In [None]:
def processTable(table):
    data_cols = ['distrib_cost', 'crude_cost', 'refine_cost', 'storage', 
                 'state_local_tax', 'state_excise_tax', 'fed_excise_tax', 'retail_price']
    date = table[0][0].text
    datastrings = table.xpath("./tr[position()>1]/td[1]/text()")
    datalist = [float(s[1:]) for s in datastrings]
    D = {key:value for key, value in zip(data_cols, datalist)}
    D['date'] = date
    return D

With a function, we can then easily use a list comprehension to generate our list of dictionaries over the set of tables acquired through our original XPath:

In [None]:
LoD = [processTable(table) for table in table_list]

And finally build our Data Frame:

In [None]:
df = pd.DataFrame(LoD)
df.set_index('date', inplace=True)
df.head(8)