Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE`/`raise NotImplementedError` or "YOUR ANSWER HERE", as well as your name and collaborators below:

# 05_HW1: Web-scraping basics

As you learned in the most recent in-class worksheet, web scraping entails retrieving a desired HTML document, parsing it, and extracting information from it. To get the data we want, we often need HTTP GET requests with URL query parameters, POST requests with parameters in the body, or resource paths beginning with `/api/`, which reduce noise in the resource we retrieve (more on this in the next chapter). We summarize:

## HTML as a tree

1. Get the HTML through HTTP
   - Variations of the HTTP request
     - A static HTML page, intended for web browser/human viewing
       - Usually of type .html, e.g., http://personal.denison.edu/~bressoud/datasystems/ind0.html
     - A dynamic HTML page, intended for web browser/human viewing
       - Can be of type PHP, ASP, or JSP, e.g., https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php
       - Sometimes needs a GET with URL query parameters
       - Can use a POST with a URL-encoded body
     - An API endpoint (will be discussed in chapter 23)
       - Typically dynamic
       - URL and/or POST Body parameters
       - Different formats for return
       - Most often with authentication/authorization
   - Examples for today
     - https://api.kivaws.org/v1/loans/newest
     - Even though this starts with `api`, we do not need the material from chapter 23 to web scrape it.
2. Process the result into a tree
   - If well structured (close to, or satisfying, XHTML), can use the same technique as for XML with the `lxml` package
   - If less well structured, use the HTML parser of `lxml`
   - All result in a tree structure, but can differ in some of the details of the operations to inspect/traverse/manipulate the tree
3. Understand the tree structure and navigate the tree to iterate over and build the data
   - Basic structure of HTML
     - [W3Schools Tutorial Link](https://www.w3schools.com/html/)
     - head
     - body
     - div and span
   - Lists
   - Tables
   
Please run the cell below to import all packages we will need.

In [None]:
from IPython.core.debugger import set_trace
import requests
from lxml import etree
import lxml.html as lh
import pandas as pd
import json
import re
import io

In [None]:
url = 'https://api.kivaws.org/v1/loans/newest'
response = requests.get(url)
if response.status_code != 200:
    print('Unable to retrieve url:', url, 'Status code:', response.status_code)
else:
    htmltree = lh.parse(io.BytesIO(response.content))
    htmlroot = htmltree.getroot()
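
To see why the HTML parser of `lxml` is used here rather than the XML parser, consider a small, made-up snippet whose `<p>` tags are never closed. This is only a sketch on invented markup, not the page retrieved above, and it uses `lh.document_fromstring` rather than `lh.parse` because the input is a string.

```python
from lxml import etree
import lxml.html as lh

# Made-up snippet: the <p> tags are never closed, which is common in real pages.
messy = '<html><body><p>one<p>two</body></html>'

# The strict XML parser rejects the markup.
try:
    etree.fromstring(messy)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser repairs the markup and still returns a tree.
demo_root = lh.document_fromstring(messy)
paragraphs = [p.text for p in demo_root.xpath('//p')]
print(xml_ok, paragraphs)
```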

In [None]:
print(etree.tostring(htmlroot, pretty_print=True).decode('utf-8')[:1200])

In [None]:
datarows = htmlroot.xpath('/html/body/div/table/tr/td/..')

In [None]:
datarows
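
The trailing `/..` in the XPath expression steps from each `td` back up to its parent `tr`, so only rows that contain at least one `td` are selected. The sketch below shows the effect on a small hypothetical table, assuming the header cells are `th` elements; whether the retrieved page marks its header that way is something to check in the printed HTML.

```python
import lxml.html as lh

# Hypothetical page: one header row (<th>) and two data rows (<td>).
demo_page = (
    '<html><body><div><table>'
    '<tr><th>Name</th><th>Amount</th></tr>'
    '<tr><td>Ana</td><td>500</td></tr>'
    '<tr><td>Ben</td><td>750</td></tr>'
    '</table></div></body></html>'
)
demo_root = lh.document_fromstring(demo_page)

# Every <tr>, header included.
all_rows = demo_root.xpath('/html/body/div/table/tr')
# Step down to every <td>, then back up with '..' to its parent <tr>.
data_rows = demo_root.xpath('/html/body/div/table/tr/td/..')

print(len(all_rows), len(data_rows))
print([td.text for td in data_rows[0]])
```

Each matching row appears once in the result, even though it has several `td` children, because an XPath node-set contains no duplicates.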

In [None]:
html_root_element = htmlroot
table_rows = html_root_element.xpath('/html/body/div/table/tr')

In [None]:
html_root_element

In [None]:
head_e = html_root_element[0]
body_e = html_root_element[1]
print(head_e, body_e)

In [None]:
contentdiv_e = body_e[1]
print('tag = {}, attributes = {}'.format(contentdiv_e.tag, contentdiv_e.attrib))

In [None]:
table_e = contentdiv_e[0]
print('tag = {}, attributes = {}'.format(table_e.tag, table_e.attrib))
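
The indexing used in the cells above works because an `lxml` element behaves like a list of its child elements, in document order. A minimal sketch on a hypothetical document (the `h1` and the `id="content"` attribute are invented and only loosely mimic the real page):

```python
import lxml.html as lh

# Hypothetical document; its layout is an assumption, not the real page.
demo_page = (
    '<html><head><title>Demo</title></head>'
    '<body><h1>Loans</h1><div id="content"><table></table></div></body></html>'
)
demo_root = lh.document_fromstring(demo_page)

demo_head, demo_body = demo_root[0], demo_root[1]  # children in document order
demo_div = demo_body[1]                            # second child of <body>

print(demo_head.tag, demo_body.tag, len(demo_body))
print(demo_div.tag, dict(demo_div.attrib))
```

Positional indexing is brittle: if the page gains or loses a sibling element, `body_e[1]` silently points at something else, which is why printing the tag and attributes after each step is a useful check.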

In [None]:
table_rows = []
for table_child in table_e:
    if table_child.tag == 'tr':
        table_rows.append(table_child)

In [None]:
header_row = table_rows.pop(0)

In [None]:
headers = [child.text for child in header_row]
headers

In [None]:
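# Build one list of cell values per data row.
ListOfLists = []
for row_e in table_rows:
    row = []
    for child in row_e:
        if len(child) == 0:
            # Plain cell: the text sits directly in the cell element.
            row.append(child.text)
        else:
            # Otherwise we expect a single <a> link wrapping the text.
            assert len(child) == 1
            assert child[0].tag == 'a'
            row.append(child[0].text)
    # Note: [child.text for child in row_e] alone would miss the text of link cells.
    ListOfLists.append(row)
df = pd.DataFrame(ListOfLists, columns=headers)
df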
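
As an alternative to special-casing cells that contain a link, the `text_content()` method of `lxml.html` elements gathers the text of an element together with all of its descendants. The helper `table_to_rows` and the sample table below are hypothetical, written only to illustrate the idea on a small input.

```python
import lxml.html as lh

def table_to_rows(table_e):
    """Return (headers, rows) for a <table> whose first <tr> is the header."""
    trs = [child for child in table_e if child.tag == 'tr']
    # text_content() joins the text of an element and all of its descendants,
    # so a cell holding <a>text</a> needs no special case.
    cells = [[cell.text_content().strip() for cell in tr] for tr in trs]
    return cells[0], cells[1:]

# Hypothetical table: one cell wraps its text in a link.
demo_page = (
    '<html><body><div><table>'
    '<tr><th>Name</th><th>Country</th></tr>'
    '<tr><td><a href="/loans/1">Ana</a></td><td>Peru</td></tr>'
    '<tr><td>Ben</td><td>Kenya</td></tr>'
    '</table></div></body></html>'
)
demo_table = lh.document_fromstring(demo_page).xpath('//table')[0]
demo_headers, demo_rows = table_to_rows(demo_table)
print(demo_headers)
print(demo_rows)
```

Passing the result to `pd.DataFrame(demo_rows, columns=demo_headers)` gives a frame of the same shape as `df`. One difference from the loop above: an empty cell yields an empty string here, whereas `child.text` yields `None`.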