# Denison DA210/CS181 SW Lab #9 - Step 1

Before you get your checkpoints signed off, make sure everything runs as expected by **restarting the kernel** and then **running all cells**.

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [None]:
import os
import io
import sys
import importlib
import pandas as pd
from lxml import etree
import requests
from IPython.display import Image

htmlparser = etree.HTMLParser()

module_dir = os.path.join("..", "..", "modules")
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

---

## Part A: HTML structure

Although we will typically assume that HTML is properly formatted, that is no guarantee when using real websites, as they may have been created without (or before) such assumptions.

For example, consider the following string: `"<html><head><title>test<body><h1>header title</h3>"`.  This is "bad" for several reasons:

* The `<html>`, `<head>`, `<title>`, and `<body>` tags are all missing their closing tags.
* The `<h1>` header tag is closed with a mismatched `</h3>` tag.

If we try to use an XML parser, this will fail:

In [None]:
# Broken HTML: <html>, <head>, <title>, <body> not closed, <h1> closed with </h3>
bad_html = "<html><head><title>test<body><h1>header title</h3>"

# Try and fail to parse as XML
xmlparser = etree.XMLParser()
try:
    tree = etree.parse(io.StringIO(bad_html), parser=xmlparser)
    util.print_xml(tree.getroot())
except etree.XMLSyntaxError:
    # Should end up here
    print("Failed to parse as XML")

However, we can instead use an HTML parser provided by `etree`, which can handle such messy HTML:

In [None]:
# Try again as HTML
htmlparser = etree.HTMLParser()
try:
    # This one should work
    tree = etree.parse(io.StringIO(bad_html), parser=htmlparser)
    util.print_xml(tree.getroot())
except:
    print("Failed to parse as HTML")
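
As a smaller illustration of the same contrast, the sketch below uses a made-up fragment (not taken from the lab) whose `<li>` tags are never closed. It catches `etree.XMLSyntaxError` specifically, and then looks at what the HTML parser recovers. The variable names are chosen here for illustration only.

```python
from lxml import etree

# Hypothetical fragment for illustration: neither <li> is closed
messy = "<ul><li>one<li>two</ul>"

# The XML parser rejects the fragment outright
try:
    etree.fromstring(messy, parser=etree.XMLParser())
    xml_ok = True
except etree.XMLSyntaxError as err:
    xml_ok = False
    print("XML parse failed:", err)

# The HTML parser repairs the markup and wraps it in an <html> root
html_root = etree.fromstring(messy, parser=etree.HTMLParser())
items = html_root.xpath("//li")
print(html_root.tag, [li.text for li in items])
```

Note that the HTML parser returns a full document rooted at `<html>`, even though the input was only a fragment.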

Now, let's consider well-formed HTML.  Like any XML document, it must have a single root node.  For HTML documents, this should be `<html>`.  This node should have, at most, one `<head>` child and one `<body>` child, in that order.

The `<head>` node contains meta information about the HTML document.  A common part of this is the webpage title, using the `<title>` tag.

The `<body>` node contains the content of the webpage.  This can include text nodes (e.g., using `<div>`, `<p>`, and `<span>`), headers (`<h1>` through `<h6>`), links (`<a>`), lists (`<ul>` or `<ol>`) and tables (`<table>`).

Here is a simple example (which we can parse as either XML or HTML, as it is properly formed XML):

In [None]:
# A simple HTML string
simple_html = "<html><head><title>test</title></head><body><h1>header title</h1></body></html>"
tree = etree.parse(io.StringIO(simple_html), parser=xmlparser)

# Display the HTML
util.print_xml(tree.getroot())
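
To make this structure concrete, the following sketch walks the tree for the same string without relying on `util`. The names `doc`, `children`, `title_text`, and `heading_text` are introduced here only for illustration.

```python
from lxml import etree

simple_html = "<html><head><title>test</title></head><body><h1>header title</h1></body></html>"
doc = etree.fromstring(simple_html)

# The root is <html>; iterating over it yields its children in document order
children = [child.tag for child in doc]
print(doc.tag, children)

# Each element exposes the text it directly contains through .text
title_text = doc.find("head/title").text
heading_text = doc.find("body/h1").text
print(title_text, "|", heading_text)
```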

---

## Part B: Web scraping - data acquisition

We can either work with locally saved HTML documents, or download them from the web.  We won't focus on this for now, so the code in the following cell doesn't need to make too much sense to you yet (see Chapters 18-21 for what we've skipped so far in this regard if you're curious).

#### Scraping via GET request

At a high level, we can use a _URL_ to access a document on the web, and form a _request_ to _get_ the content at that URL.  If the _response_ has _status_ `200`, then the request was successful.

In this case, we will download the HTML source of the page: [http://datasystems.denison.edu/basic.html](http://datasystems.denison.edu/basic.html).
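
A URL like this one can be read as a protocol, a location (host), and a resource path. The sketch below takes the URL apart using only the standard library. It is not the implementation of `util.buildURL`, which this lab does not show; it only illustrates the pieces that such a helper would presumably combine.

```python
from urllib.parse import urlsplit, urlunsplit

url = "http://datasystems.denison.edu/basic.html"
parts = urlsplit(url)

print(parts.scheme)   # protocol
print(parts.netloc)   # location (host)
print(parts.path)     # resource

# Reassembling the pieces (with empty query and fragment) gives back the URL
rebuilt = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
print(rebuilt)
```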

In [None]:
# Download HTML from a web URL

location = "datasystems.denison.edu"
resource = "/basic.html"

url = util.buildURL(resource, location)
response = requests.get(url)
assert response.status_code == 200

# Display the retrieved HTML text
basic_html = response.text
print(basic_html)

As you can see, this webpage is slightly more complex than our previous simple example: it has a nested list, with the outer list being unordered (bullet points), and the inner list being ordered (numbered).

It also contains two heading levels, as well as bolded text and a link inside of a paragraph node.

#### Scraping via `curl`

Alternatively, we can use the `curl` command (a command-line tool, not part of Python itself) to download the webpage content to a local HTML file.  The following command will save the HTML source of [http://datasystems.denison.edu/basic.html](http://datasystems.denison.edu/basic.html) to your computer, in a file `basic.html` in the same folder as this notebook.

In [None]:
# Download the HTML to a file -- do not modify this!
!curl -s -o basic.html http://datasystems.denison.edu/basic.html
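
Once an HTML file is saved locally, `etree.parse` can read it directly from a file path. The sketch below is self-contained: it writes a small stand-in file (`demo.html`, a hypothetical name with made-up content) to a temporary folder rather than assuming `basic.html` exists.

```python
import os
import tempfile
from lxml import etree

# Stand-in content for illustration; the real basic.html is larger
sample = "<html><head><title>demo</title></head><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # etree.parse accepts a file path as well as a file-like object
    local_tree = etree.parse(path, etree.HTMLParser())
    local_root = local_tree.getroot()

# The parsed tree stays usable after the temporary file is gone
paragraphs = local_root.xpath("//p/text()")
print(local_root.tag, paragraphs)
```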

---

## Part C: Web Scraping a Wikipedia Table

Consider the Wikipedia page of Municipalities in Ohio: https://en.wikipedia.org/wiki/List_of_municipalities_in_Ohio.  A simpler version of this page can be found using the Wikipedia API: https://en.wikipedia.org/api/rest_v1/page/html/List_of_municipalities_in_Ohio.

#### Discovery

Open the above web page in a browser and bring up the developer tools: in Chrome, choose View->Developer->Inspect Elements; in Firefox, choose Tools->Browser Tools->Web Developer Tools.

You do not have to submit the sketches mentioned below, but the following steps walk you through the discovery process.

1. Find where in the HTML the table exists.
2. Sketch the HTML subtree starting at that point (not all nodes, but at least 2 rows of data).

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: Use the HTML subtree you drew to answer this question.  Either procedurally or using XPath, how could you determine how many data rows are in this table, if you had a variable `table_node` that gives the node at the root of this subtree (e.g., with `tag` `"table"`)?
>
> Note: You don't have to write the code for this question -- that comes next.

#### Code Setup

As mentioned earlier, you do not need to understand anything from the next code cell that you haven't already seen.  We'll learn more about web requests in Unit 5.

In [None]:
# DO NOT CHANGE THIS

# Read HTML from the web
resource_path = "/api/rest_v1/page/html/List_of_municipalities_in_Ohio"

url = util.buildURL(resource_path, "en.wikipedia.org")
response = requests.get(url)
assert response.status_code == 200

# Use a custom HTML parser to parse the response content into an XML Element
tree = etree.parse(io.BytesIO(response.content), htmlparser)
root = tree.getroot()

#### Data Acquisition

**Q1:** Either procedurally or using XPath, find the node with `tag` `"table"` that represents the root of the subtree for this table.  Assign the node to the variable `tableroot`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the first 20 lines of the tree
util.print_xml(tableroot, depth=16, nlines=20)

In [None]:
# Testing cell
assert tableroot.tag == "table"
assert tableroot.get("class") == "wikitable sortable mw-collapsible"

**Q2:** Use the result of `print_xml` from Question 1 to make sure your understanding of the `table` subtree matches your hand-drawn tree from earlier.  If you chose to start from `root`, would your expression be **guaranteed** to get the table you are interested in?  Why or why not?

YOUR ANSWER HERE

**Q3:** Extract the column names from the table, and store the list in a variable `col_names`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Print the list of column names
print(col_names)

In [None]:
# Testing cell
assert type(col_names) is list
assert len(col_names) == 5
assert "Population (2020)" in col_names
assert "County" in col_names
assert "Population (2020)[2]" not in col_names

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: In general, you have two choices for scraping the data in this table: you can either acquire all data in a single big list and split it into a _list of lists_ (LoL), or you can acquire each column individually and store the data in a _dictionary of lists_ (DoL).
>
> Looking at the HTML and/or the web page, why would it be challenging to build an LoL for this table?
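
For reference, the two layouts can be contrasted on a toy table. The values below are made up, and the names `columns`, `rows_lol`, and `dol` are used only in this sketch. It assumes every row has exactly one value per column.

```python
# Toy table with made-up values, used only to contrast the two layouts
columns = ["Name", "Class"]

# List of lists: one inner list per row
rows_lol = [["Alpha", "City"], ["Beta", "Village"], ["Gamma", "Village"]]

# Dictionary of lists: one list per column, keyed by column name
dol = {col: [row[i] for row in rows_lol] for i, col in enumerate(columns)}
print(dol)

# Going back from columns to rows pairs up values by position
lol_again = [list(row) for row in zip(*(dol[col] for col in columns))]
print(lol_again)
```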

#### Programmatic Extraction: Dictionary of Column Lists

A perhaps simpler solution involves using the regularity of the columns in a table (be it in HTML or other regular table form).  Within each `tr`, the **position** of each of the `td` elements for the five fields in this table is always the same, regardless of row.  So at position 1 within all the rows, we always have the `Name`, at position 2, we always have the `Class`, and so forth.
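
The positional regularity can be seen on a toy table. The sketch below uses made-up data and a hypothetical helper, `column_values`, that walks the rows procedurally; it is one possible illustration of the idea, not the expected solution to the questions that follow.

```python
from lxml import etree

# Made-up table for illustration; the Wikipedia table has more columns and rows
toy_html = """
<table>
  <tr><th>Name</th><th>Class</th></tr>
  <tr><td>Alpha</td><td>City</td></tr>
  <tr><td>Beta</td><td>Village</td></tr>
</table>
"""
toy_table = etree.fromstring(toy_html, parser=etree.HTMLParser()).find(".//table")

def column_values(table_node, index):
    """Collect the text of the <td> at a fixed index from every data row."""
    values = []
    for row in table_node.iter("tr"):
        cells = row.findall("td")
        if cells:  # the header row contains <th>, not <td>, so it is skipped
            values.append(cells[index].text)
    return values

# Python indexes children from 0, whereas XPath positions start at 1
toy_names = column_values(toy_table, 0)
toy_classes = column_values(toy_table, 1)
print(toy_names, toy_classes)
```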

**Q4:** Scrape the municipality names from the first column of the table.  Assign the result to the variable `names_list`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Print info about the list of names
print(len(names_list))
print("Name list prefix:", names_list[:10])

In [None]:
# Testing cell
assert type(names_list) is list
assert len(names_list) == 926
assert names_list[0] == "Akron"
assert names_list[-1] == "Zoar"

**Q5:** For the `Class` and population columns, we could do this three times, with different values for the position, creating three separate lists.  Instead, we want to use Python string formatting to dynamically create an XPath query that retrieves the `td` at a position given by a variable, and then selects its text.

Assign to `xs_template_q5` a Python string using `{0}` in place of the position from your solution above, so that the testing code shows its use by obtaining data columns from the table.  (This is another approach to create a format string in Python.  You should *not* precede your string with an `f` in this case.)
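
As a reminder of how `{0}` placeholders behave, the sketch below applies `str.format` to an XPath template over a made-up list (not the lab's table). The names `toy` and `template` are hypothetical.

```python
from lxml import etree

# Made-up list for illustration only
toy = etree.fromstring("<ol><li>red</li><li>green</li><li>blue</li></ol>")

# {0} is a placeholder; nothing is substituted until .format() is called
template = "li[{0}]/text()"

second = toy.xpath(template.format(2))
third = toy.xpath(template.format(3))
print(template.format(2), second, third)
```

Because the string has no `f` prefix, the braces survive until `.format()` fills them in, so the same template can be reused with different positions.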

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Test your template, retrieving two of the column vectors
classes = tableroot.xpath(xs_template_q5.format(2))
print(classes[:8])
pop2020 = tableroot.xpath(xs_template_q5.format(3))
print(pop2020[:8])

In [None]:
# Testing cell
assert type(xs_template_q5) is str
assert "pineapple" in xs_template_q5.format("pineapple")
assert len(classes) == 926
assert len(pop2020) == 926
assert classes[-1] == "Village"
assert pop2020[3] == "19,225"

**Q6:** The format string you defined in the previous question doesn't work with the last column.  Write another format string that returns only the first `County` listed for each municipality (e.g., only `"Stark"` for `"Alliance"`).

Assign your string to `xs_q6`.  You can hard-code the column number, so you don't need `{}` in your string.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Test your XPath string, retrieving the county column
counties = tableroot.xpath(xs_q6)
print(counties[:8])

In [None]:
# Testing cell
assert type(xs_q6) is str
assert len(counties) == 926
assert counties[0] == "Summit"
assert counties[1] == "Stark"
assert counties[2] == "Lorain"

**Q7:** Using the results from the previous questions, or any other approach you choose, build a DoL representing this table.  The dictionary keys should be column names, and the value lists should contain the per-column table data.  You can keep all values as strings.  Note that you only need the first `County` listed for any given municipality.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Print the first ten rows of the resulting DoL
print(list(DoL.keys()))
for rowid in range(10):
    print([DoL[col][rowid] for col in DoL])

In [None]:
# Debugging cell -- try to make a DataFrame
pd.DataFrame(DoL)

In [None]:
# Testing cell
assert type(DoL) is dict
assert type(DoL["Name"]) is list
assert len(DoL["County"]) == 926
assert DoL["Name"][0] == "Akron"
assert DoL["Population (2020)"][1] == "21,672"
assert DoL["County"][1] == "Stark"

> You've reached the third (and final) checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 3: How could you modify your scraping of the last column to retrieve a list of `County` values for each municipality, rather than just the first listed?
>
> (Note: You don't have to actually code this, just think about it.)

---
## Part D

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE