Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE`/`raise NotImplementedError` or "YOUR ANSWER HERE", as well as your name and collaborators below:

# 05_HW1: Web-scraping basics

As you learned in the most recent in-class worksheet, web scraping entails retrieving a desired HTML document, parsing it, and extracting information from it. To get the data we want, we often need HTTP GET requests with URL query parameters, POST requests with parameters, and resource paths beginning with `/api/`, all of which reduce noise in the resource we retrieve (more on this in the next chapter). We summarize:

## HTML as a tree

1. Get the HTML through HTTP
   - Variations of the HTTP resource
     - A static html page, intended for web browser/human viewing
       - Usually of type .html, e.g., http://personal.denison.edu/~bressoud/datasystems/ind0.html
     - A dynamic html page, intended for web browser/human viewing
       - Can be of type PHP, ASP, or JSP, e.g., https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php
       - Sometimes need GET with URL-query-parameters
       - Can do POST with URL-encoded body
     - An API endpoint (will be discussed in chapter 23)
       - Typically dynamic
       - URL and/or POST Body parameters
       - Different formats for return
       - Most often with authentication/authorization
   - Examples for today
     - https://api.kivaws.org/v1/loans/newest
     - Even though this starts with `api`, we do not need the material from chapter 23 to web scrape it.
2. Process the result into a tree
   - If well structured (close to, or satisfying, XHTML), we can use the same technique as for XML with the `lxml` package
   - If less well structured
     - Use the HTML parser of `lxml`
   - Both approaches result in a tree structure, but they can differ in some details of the operations used to inspect, traverse, and manipulate the tree
3. Understand the tree structure and navigate the tree to iterate over and build the data
   - Basic structure of HTML
     - [W3Schools Tutorial Link](https://www.w3schools.com/html/)
     - head
     - body
     - div and span
   - Lists
   - Tables
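
The three steps above can be sketched on a tiny, well-formed page. This is only an illustration under stated assumptions: the page text is invented, and the standard-library `xml.etree.ElementTree` module stands in for `lxml.etree`, which offers a largely compatible interface for well-structured documents. Step 1 (the HTTP request) is replaced by a literal string so the sketch runs without a network connection.

```python
import xml.etree.ElementTree as ET

# Step 1 (stand-in): a small XHTML-style page, invented for illustration.
page = """<html>
  <head><title>Loans</title></head>
  <body>
    <div>
      <ul>
        <li>Alice</li>
        <li>Bob</li>
      </ul>
    </div>
  </body>
</html>"""

# Step 2: parse the text into a tree; `root` is the <html> element.
root = ET.fromstring(page)

# Step 3: navigate the tree and build data from it.
children = [child.tag for child in root]      # direct children of <html>
title = root.find("head/title").text          # path relative to the root
names = [li.text for li in root.iter("li")]   # every <li>, at any depth

print(children, title, names)
```

Real pages are rarely this tidy, which is why the less strict HTML parser of `lxml` is often the more practical choice.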
   
Please run the cell below to import all packages we will need.

In [None]:
from IPython.core.debugger import set_trace
import requests
from lxml import etree
import lxml.html as lh
import pandas as pd
import json
import re
import io

### Using HTTP headers and parameters

Recall from your previous homework, the function

`
buildURL(location, resource,protocol)
`

that returns a string URL based on the three component parts of `protocol`, `location`, and `resource`. Recall also 

`
simpleHTTPGet(location, resource)
`

that uses `buildURL` and the `requests` module to construct and issue an HTTP request to the URL from the location and resource using the GET HTTP method, and to obtain a string consisting of the body of the returned data resource (or `None` if the request fails to retrieve string data). These functions are provided below, in case you didn't solve them correctly on the previous homework. Because the protocol is so often `'http'`, we can specify it as an optional argument as shown below, which defaults to the value `'http'`. For more on optional arguments, please consult the following resource:

- http://docs.activestate.com/activepython/3.5/dip/power_of_introspection/optional_arguments.html
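
As a quick reminder of how optional arguments behave, here is a minimal sketch. The function `describe` is hypothetical and exists only for this illustration; it is not part of the homework.

```python
def describe(location, resource, protocol='http'):
    """Return a short description; `protocol` falls back to 'http' when omitted."""
    return "{} request for {} on {}".format(protocol, resource, location)

# The caller may omit the optional argument, pass it by position,
# or pass it by keyword.
default_case = describe('httpbin.org', '/get')
positional_case = describe('httpbin.org', '/get', 'https')
keyword_case = describe('httpbin.org', '/get', protocol='https')

print(default_case)
print(positional_case)
print(keyword_case)
```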

In [None]:
def buildURL(location, resource,protocol='http'):
    """Construct a URL, given the protocol (with or without the trailing '://'), 
       a location, and a resource.
       Return: string version of the url
    """
    fmt1 = "{}{}{}"
    fmt2 = "{}://{}{}"
    if protocol[-3:] == "://":
        url = fmt1.format(protocol, location, resource)
    else:
        url = fmt2.format(protocol, location, resource)
    return url

buildURL('httpbin.org', '/get','http://')

def simpleHTTPGet(location, resource):
    """Perform an HTTP GET to the given location and resource (but no query parameters)
       Return: string of the textual body of the response, if successful,
                None if not successful
    """
    url = buildURL(location, resource,'http')
    resp = requests.get(url)
    if resp.status_code != 200:
        return None
    return resp.text
simpleHTTPGet('personal.denison.edu', '/')

**Q1** We have seen that some `get` requests result in a response body whose format is JSON, while other requests result in a text or HTML body. When the result is JSON, we can go beyond returning a string containing the result: we can interpret the JSON and yield an actual Python data structure. Rewrite your `simpleHTTPGet()` function so that it obtains the data as before but, if the **content type** is JSON, returns the interpreted JSON result instead of a string. Because the content type sometimes includes character set information, our check must look for `application/json` at the beginning of the content type header. If that is not the content type, return the `text` as before. Hint: `requests.Response` objects have many useful attributes and methods, described in the documentation https://www.w3schools.com/python/ref_requests_response.asp
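
To make the content-type check concrete, here is a small sketch that works on plain strings rather than a live response. The helper name `looks_like_json` and the sample header and body are assumptions made for illustration. With `requests`, the header value would typically come from `resp.headers`, and `resp.json()` plays the role that `json.loads` plays here.

```python
import json

def looks_like_json(content_type):
    """True when a Content-Type header value names JSON, ignoring any charset suffix."""
    if content_type is None:
        return False
    return content_type.strip().lower().startswith('application/json')

# Sample values, invented for illustration.
header = 'application/json; charset=utf-8'
body = '{"args": {}, "url": "http://httpbin.org/get"}'

# Interpret the body only when the header says it is JSON.
data = json.loads(body) if looks_like_json(header) else body
print(type(data), data['url'])
```

Checking the start of the header, rather than testing for equality, is what lets a value such as `application/json; charset=utf-8` pass.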

In [None]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()
simpleHTTPGet('httpbin.org', '/get')

In [None]:
# Testing cell

s1 = simpleHTTPGet('personal.denison.edu', '/')
assert len(s1) == 464
assert type(s1) is str
j2 = simpleHTTPGet('httpbin.org', '/get')
assert type(j2) is dict
assert len(j2) == 4

**Q2** The second most common use case for HTTP is to use a GET method with URL parameters.  So write a function

`
HTTPGetParams(location, resource, pdict,protocol)
`

that rewrites your `simpleHTTPGet` to use the dictionary of URL parameters specified in `pdict` to specify the parameter name to parameter value mapping to be included in the GET request. Please arrange for `protocol` to be an optional argument, whose default value is `'http'`. See the assert statements below for examples.

We pass the parameters as a dictionary `pdict`, which is what the `requests` module expects. Before the HTTP GET request is sent from the client to the server, `requests` processes the dictionary by appending the `?` and the `<name>=<value>` pairs to the URL, e.g., sending URLs of the form `https://covidtracking.com/api/states/daily?state=OH&date=20200407`. You might consider referring back to the COVID problem of 03_HW3, or to the documentation for the `requests` module (including parameters to the `get` function):
https://www.w3schools.com/python/module_requests.asp

As in the previous problem, if the content type is JSON, you should return an interpreted JSON-based Python data structure. Otherwise, return a string with the text of the response, or return `None` if the request was not successful.
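
To see what a parameter dictionary looks like once it is encoded into a URL, the sketch below uses the standard-library function `urllib.parse.urlencode`. This is an illustration of the encoding only, assuming the example URL from the text above; in your solution the `requests` module does this work for you when it is given the dictionary.

```python
from urllib.parse import urlencode

# The dictionary from the COVID example becomes the query string after '?'.
pdict = {'state': 'OH', 'date': 20200407}
url = 'https://covidtracking.com/api/states/daily' + '?' + urlencode(pdict)
print(url)

# Values are converted to strings, and characters such as spaces are escaped.
spaced = urlencode({'a': 5, 'b': 'Hello world'})
print(spaced)
```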

In [None]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()
HTTPGetParams('httpbin.org', '/get', {'a': 5, 'b': 'Hello world'})
HTTPGetParams('httpbin.org', '/get', {'a': 5, 'b': 'Hello world'}, 'http')

In [None]:
# Testing cell

s1 = HTTPGetParams('personal.denison.edu', '/', {})
assert len(s1) == 464
assert type(s1) is str

j2 = HTTPGetParams('httpbin.org', '/get', {'a': 5, 'b': 'Hello world'})
assert type(j2) is dict
assert len(j2) == 4
assert len(j2['args']) == 2

# This example should be familiar from midterm3
j3 = HTTPGetParams('api.kivaws.org', '/v1/loans/newest.json', {'page':1,'per_page':5},'https')
assert type(j3) is dict
assert len(j3) == 2
assert len(j3['paging']) == 4
assert len(j3['loans']) == 5

### Web-scraping Kiva loans

**Q3** Review 03_HW3, where we employed URL parameters to customize queries for COVID data. We will now do the same for data on Kiva loans. Kiva is an organization that describes itself as "an online lending platform connecting online lenders to entrepreneurs." You can learn more here:
https://www.kiva.org/about

Later, we will study the Kiva API in detail. For now, it is enough to know that a URL of the following form yields ten results from the second page of loans, with lots of information in each result (please click to see):
http://api.kivaws.org/v1/loans/newest.json?page=2&per_page=10

Whereas a URL of the following form yields five results from the first page of loans, with only the ids of the loans (rather than, say, their amounts as well):
http://api.kivaws.org/v1/loans/newest.json?page=1&per_page=5&ids_only=true

By now, when you see these kinds of URLs you should be thinking in terms of the `pdict` parameter from the previous problem. Please write a function

`
kivaNewestLoans(pagenum, loansperpage, onlyids)
`

that allows a Python program to specify integers for the `pagenum`, the `loansperpage`, and a boolean for whether or not to only include ids in the result, returning the JSON resulting from making the HTTP GET with appropriate parameters.  Be careful, as a Python boolean used for the third parameter is not exactly what is expected as a URL parameter (see the example URL above). Please build a dictionary based on the given parameters and pass it as an argument to your `HTTPGetParams` function, then return the result.
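
The warning about booleans can be seen in a short sketch, again using the standard-library `urllib.parse.urlencode` purely to illustrate the encoding. The variable names are chosen for this example only.

```python
from urllib.parse import urlencode

flag = True

# A Python boolean is converted with str(), which capitalizes the value.
naive = urlencode({'ids_only': flag})

# Converting to a lowercase string first matches the example URL above.
lowered = urlencode({'ids_only': str(flag).lower()})

print(naive)
print(lowered)
```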

In [None]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()
kivaNewestLoans(1, 5, True)

In [None]:
# Testing cell

new2 = kivaNewestLoans(2, 2, False)
assert new2
assert 'loans' in new2
assert 'paging' in new2
assert len(new2['loans']) == 2

### HTTP with file extensions

We have seen that URLs often take us to resources with different file extensions, so it can be useful to generalize our functions above to know what file extension to expect. For example, the Kiva data is also available in XML format via the link:
http://api.kivaws.org/v1/loans/newest.xml?page=1&per_page=5&ids_only=true

Please compare this link to the previous problem to see where the file extension comes in.

**Q4** Rewrite the `buildURL` function to allow both `protocol` and `extension` to be optional. If not specified, `protocol` should be `'http'`. If no `extension` is specified, the URL resource ends as given by `resource`; otherwise, we append a period (`'.'`) and the given extension to the end of the URL resource. Furthermore, please make the function smarter, so that if `resource` does not begin with `'/'`, the function detects this and adds it. See the assert statements for examples. Hint: start with `buildURL(location, resource, protocol='http', extension=None)`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


In [None]:
# Testing cell

assert buildURL('api.kivaws.org','/v1/loans/newest','https','xml') == 'https://api.kivaws.org/v1/loans/newest.xml'
assert buildURL('api.kivaws.org','v1/loans/newest','http','json') == 'http://api.kivaws.org/v1/loans/newest.json'
assert buildURL('personal.denison.edu','~bressoud/datasystems/ind0.html') == 'http://personal.denison.edu/~bressoud/datasystems/ind0.html'
assert buildURL('personal.denison.edu','/~bressoud/datasystems/ind0.html','https') == 'https://personal.denison.edu/~bressoud/datasystems/ind0.html'

### HTTP Post with parameters

**Q5** Recall from the classroom demonstration that we can emulate the selection of field values on a webpage by issuing an HTTP POST request and using the `requests` module's ability to place data in the body of the POST request. If you study the in-class activity, you will see how this was done with a POST request to http://www.energy.ca.gov/almanac/transportation_data/gasoline/margins/index.php, featuring the year 1999. Please write a function

    HTTPPostParams(location, resource, pdict, protocol='http')

that submits a POST request using the given dictionary `pdict`. You might want to start with `HTTPGetParams` and modify it accordingly. Please get the response to your POST request, check the content type, and return either an interpreted json-based Python data structure, a string with the text of the response, or `None` if the POST request was not successful. Please see the assert statements for examples.
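
The difference between GET parameters and POST parameters is where the encoded pairs travel: in the URL for GET, in the request body for POST. The sketch below builds (but never sends) both kinds of request with the standard library, so it runs without a network connection. It is an illustration only; with `requests`, passing the dictionary as the `data` argument of `requests.post` produces a form-encoded body in a similar way.

```python
from urllib.parse import urlencode
from urllib.request import Request

pdict = {'a': 5, 'b': 'Hello world'}
encoded = urlencode(pdict)

# GET: the encoded pairs are appended to the URL after '?'.
get_req = Request('http://httpbin.org/get?' + encoded)

# POST: the URL stays clean and the encoded pairs become the body (as bytes).
post_req = Request('http://httpbin.org/post', data=encoded.encode('ascii'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```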

In [None]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
# Testing cell

s1 = HTTPPostParams('personal.denison.edu', '/', {})
assert len(s1) == 464
assert type(s1) is str

j2 = HTTPPostParams('httpbin.org', '/post', {'a': 5, 'b': 'Hello world'})
assert type(j2) is dict
assert len(j2) == 8
assert len(j2['form']) == 2

j3 = HTTPPostParams('httpbin.org', '/post', {'a': 5, 'b': 'Hello world','c':'CS181 is great!'},'https')
assert j3['url'] == 'https://httpbin.org/post'

j4 = HTTPGetParams('api.kivaws.org', '/v1/loans', {'try_this':'making a mess!'},'https')
assert j4 is None



**Q6** Please write a function `getCAGas(year)` that mimics the in-class activity: select an appropriate location, resource, dictionary, and protocol to retrieve, as a string, the HTML that contains the tables of gas margins for the state of California for the given `year`. Your solution should call `HTTPPostParams` and return the result. Hint: you will need to study the in-class activity to figure out the dictionary that is required.

In [None]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell

htmlresult = getCAGas(2000)
assert type(htmlresult) is str
assert 60000 < len(htmlresult) < 100000