# Dealing with HTTP requests

## The `urllib.request` module

This module is part of the standard library, so no third-party package is required.
It can be used to fetch data from a website or to talk to the REST endpoints of a web service. Its main limitation is that it cannot handle asynchronous requests out of the box. If speed is a concern and you would like to fetch data from many sources at the same time, have a look at the [HTTPX module](https://www.python-httpx.org/). It offers an interface almost identical to that of the Requests module, plus async options.

Note: [requests](https://docs.python-requests.org/en/latest/index.html) is a more user-friendly and more popular third-party alternative to `urllib.request`.

The simplest way to fetch the content of a given endpoint looks like this:

In [None]:
import urllib.request

with urllib.request.urlopen("https://ipecho.net/plain") as response:
    print(response.status, response.reason, response.headers["content-type"])
    print(response.read().decode('utf-8'))

The `with` block opens a connection to `ipecho.net` and automatically closes the connection when the block ends.
Within the block, `response` is a file-like object that can be passed to `json.load` or to any other function expecting an input stream. Here we simply call its `read()` method.
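
To see what "file-like" means without touching the network, the following sketch substitutes an `io.BytesIO` buffer for the response. The buffer and its payload are invented purely for illustration; the point is only that anything offering a `read()` method that returns bytes can be consumed in the same way.

```python
import io
import json

# Stand-in for a real response: an in-memory byte stream with a made-up payload.
fake_response = io.BytesIO(b'{"ip": "192.0.2.1", "city": "Example"}')

# json.load() only needs an object with a read() method, so it accepts
# the stand-in just as it would accept the object returned by urlopen().
data = json.load(fake_response)
print(data["ip"])

# Like a file, the stream is consumed once: a second read() returns nothing.
leftover = fake_response.read()
print(leftover)
```

The same holds for a real response: once `read()` has returned the body, calling it again yields an empty bytes object.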

If authentication is required, things get more tedious:

In [None]:
import urllib.request
import base64

username = "chucknorris"
password = "does-not-need-a-pw---the-pw-needs-him"

# Here, the testserver "https://httpbin.org/basic-auth/{username}/{password}"
# will check that the provided credentials match the url parameters:
request = urllib.request.Request(
    f"https://httpbin.org/basic-auth/{username}/{password}",
    headers={"Authorization": b"Basic " + base64.b64encode(f"{username}:{password}".encode())}
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.reason, response.headers["content-type"])
    print(response.read().decode())

In [None]:
# If the credentials do not match, the server answers "401 Unauthorized"
# and urlopen() raises an HTTPError:
request = urllib.request.Request(
    "https://httpbin.org/basic-auth/chucknorris/no-password",
    headers={"Authorization": b"Basic " + base64.b64encode(b"looser:wrong-password")}
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.reason, response.headers["content-type"])
    print(response.read().decode())

The above was using [basic authentication](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication#basic). Visit the link to see which other authentication methods exist.
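
The header used above can be built and inspected without sending anything. The sketch below uses a hypothetical helper `basic_auth_header()` and placeholder credentials; it shows that the header is only the Base64 encoding of `username:password`, which anyone can reverse. This is why basic authentication should only be used over HTTPS.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Hypothetical helper: build the Authorization header value for basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

header = basic_auth_header("user", "pass")  # placeholder credentials
print(header)

# Base64 is an encoding, not encryption: decoding recovers the credentials.
scheme, encoded = header.split(" ", 1)
decoded = base64.b64decode(encoded).decode("utf-8")
print(scheme, decoded)
```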

Next, we try [bearer authentication](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication#bearer).

In [None]:
# The content of the token does not matter for the test server https://httpbin.org/bearer
token = b"Y2h1Y2tub3JyaXM6ZG9lcyBub3QgbmVlZCBhIHB3IC0gdGhlIHB3IG5lZWRzIGhpbQ=="
request = urllib.request.Request(
    "https://httpbin.org/bearer",
    headers={"Authorization": b"Bearer " + token}
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.reason, response.headers["content-type"])
    print(response.read().decode())

In [None]:
# But if the header is wrong, we get "unauthorized": urlopen() raises an HTTPError
token = b"Y2h1Y2tub3JyaXM6ZG9lcyBub3QgbmVlZCBhIHB3IC0gdGhlIHB3IG5lZWRzIGhpbQ=="
request = urllib.request.Request(
    "https://httpbin.org/bearer",
    headers={"Authorization": b"Basic " + token}
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.reason, response.headers["content-type"])
    print(response.read().decode())

### GET, POST, PUT, DELETE requests

As you might have guessed, the `urllib.request` module supports not only `GET` requests but any kind of HTTP request. The HTTP method is selected with the `method` argument of `Request`; the typical CRUD (create, read, update, delete) operations look like this:

```python
urllib.request.Request(url, data=data, method="POST")     # CREATE data. Posted data = "payload"
urllib.request.Request(url)                               # READ data
urllib.request.Request(url, data=data, method="PUT")      # UPDATE data
urllib.request.Request(url, data=data, method="DELETE")   # DELETE data

# other methods: HEAD, PATCH
```
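
Creating a `Request` object does not send anything, so the method it would use can be checked safely with `get_method()`. The following sketch uses a placeholder URL and payload (both assumptions, never contacted) to show the defaults: `GET` without `data`, `POST` with `data`, and whatever is passed as `method` otherwise.

```python
import urllib.request

url = "https://example.invalid/items/1"   # placeholder URL, never contacted
payload = b'{"name": "example"}'          # placeholder body

read_request = urllib.request.Request(url)
create_request = urllib.request.Request(url, data=payload)
update_request = urllib.request.Request(url, data=payload, method="PUT")
delete_request = urllib.request.Request(url, method="DELETE")

# get_method() reports the HTTP method that urlopen() would use.
for request in (read_request, create_request, update_request, delete_request):
    print(request.get_method(), request.full_url)
```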

## JSON data

With `urllib.request`, programmers have to serialize and deserialize JSON themselves.

In [None]:
import urllib.request
import json
response = urllib.request.urlopen("https://ipinfo.io/json")
data = json.load(response) # response behaves like a file which you can read from
print(data)

When using the `POST` method to upload JSON data, the `data` argument of `urllib.request.urlopen` or `urllib.request.Request` must be a bytes object:
```py
data = {"username": "chucknorris", "fact": "Two heart attacks had Chuck Norris"}
urllib.request.urlopen(url, data=json.dumps(data).encode()) # By default, if you pass `data`, method="POST" will be used
```
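
The serialization step can be wrapped in a small function. The sketch below uses a hypothetical helper `build_json_request()` and a placeholder URL; it only builds the request and inspects it, nothing is sent. If no `Content-Type` header is set, `urllib.request` falls back to `application/x-www-form-urlencoded`, so it is worth setting the header explicitly for JSON bodies.

```python
import json
import urllib.request

def build_json_request(url: str, payload: dict) -> urllib.request.Request:
    """Hypothetical helper: wrap a dict into a POST request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")   # dict -> str -> bytes
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
    )

payload = {"username": "chucknorris"}
request = build_json_request("https://example.invalid/facts", payload)  # placeholder URL

print(request.get_method())                # POST, because data was passed
print(request.data)                        # the body as bytes
# Request stores header names via str.capitalize(), hence "Content-type".
print(request.get_header("Content-type"))
```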

In [None]:
import json
import urllib.request

data = {
    "name": "Chuck Norris",
    "age": "83",
    "email": "chucknorris@roundhouse.gov",
    "image": "chucknorris.png"
}

request = urllib.request.Request(
    "http://validate.jsontest.com/",
    headers={"content-type": "application/json"},
    data=json.dumps(data).encode()
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.reason)
    print(response.headers["content-type"])
    print(response.read())

## Exercise 1: get info, decode json

Get information about a given *Kostenstelle* (cost centre). The endpoint is `http://n-vermeul.ethz.ch/sap_info?kostl=6005`; try any *Kostenstelle* you are interested in.

- [ ] Decode the json payload and print the `"beschr"` entry of the first item in the returned list.

In [None]:
import urllib.request
import json
with urllib.request.urlopen('http://n-vermeul.ethz.ch/sap_info?kostl=6005') as response:
    ...

# now your turn!

## Exercise 2: handle errors

If the Kostenstelle does not exist, the response does not have a `2xx` status code and `urllib.request.urlopen` raises an `HTTPError`. Handle this situation by enclosing the `with` block in a `try`/`except` statement and test it with `kostl=-1`.

- [ ] Provoke an exception, e.g. by providing an *invalid* `kostl` value like `kostl=-1`.
- [ ] Enclose the `with` block in a `try`/`except` block that handles the error
  - hint: the exception class lives in the `urllib.error` module
- [ ] Provoke other exceptions (e.g. `something://` instead of `http://`)
- [ ] Try to figure out where these exceptions are raised
- [ ] Handle these exceptions, using additional `except` blocks

In [None]:
import urllib.request

# catch this!
with urllib.request.urlopen('http://n-vermeul.ethz.ch/sap_info?kostl=-1') as response:
    ...


## Exercise 3: extract and re-arrange data

You might be only interested in the following attributes:

```python
interesting_keys = {"kostl": int, "fonds": str, "beschr": str}
```

The structure should remain a list of dictionaries, but `kostl` should be converted to an `int`, e.g.

```python
[
    {"kostl": 6005, "fonds": "0-21408-18", "beschr": "PHRT-Driver-Project PRECISE"},
    {"kostl": 6005, "fonds": "2-67400-09", "beschr": "SysX_SyBIT-Rinn"},
 ...
]
```

- [ ] Decode the content of the response into a Python structure using `json.load()` and assign it to a variable `cost_centres_raw`.
- [ ] Write a function `rearrange_dict()` which takes a single raw entry (`dict`) of the `cost_centres_raw` `list` as input and returns a rearranged `dict` containing only the items whose keys are present in `interesting_keys` (each value should be converted to the corresponding type given in `interesting_keys`).
- [ ] Use a list comprehension (`[do_something_with(item) for item in items]`) to transform the `list` of raw `dict`s into a `list` of rearranged `dict`s using `rearrange_dict()`.

In [None]:
import urllib.request
import json

interesting_keys = {"kostl": int, "fonds": str, "beschr": str}

def rearrange_dict(row: dict) -> dict:
    return ... # your code goes here :)

with urllib.request.urlopen("http://n-vermeul.ethz.ch/sap_info?kostl=6005") as response:
    cost_centres_raw = json.load(response)
    # your code goes here :)

## Exercise 4: filtering

1. Familiarize yourself with [filter functions in Python](https://gitlab.ethz.ch/vermeul/python-best-practices/-/blob/master/07-Built-in_Functions.md#filtering-the-filter-function)
2. create a filter function which selects certain records, e.g. those whose `fonds` attribute starts with `2-70`
3. apply the filter to the cost centres
4. extract the attributes you're interested in

**Bonus:**

* apply the filter inside a list comprehension instead of calling `filter()`

In [None]:
import urllib.request
import json

with urllib.request.urlopen("http://n-vermeul.ethz.ch/sap_info?kostl=6005") as response:
    cost_centres_raw = json.load(response)

def fonds_filter(cost_centre):
    # define your filter here, should return True or False
    ...

# apply the filter, loop over the remaining costcentres
    
# store the matching raw entries in a list `cost_centre_infos`

cost_centre_infos = [...]
    
cost_centre_infos

## Exercise 5: sorting

1. Read about the [sort functions in Python](https://gitlab.ethz.ch/vermeul/python-best-practices/-/blob/master/07-Built-in_Functions.md#sorting-the-sorted-function)
2. start with the code below
3. create a `def my_sort(costcentre)` function which uses the regex `extract_middle_number` from that code
4. hint: `match.groupdict()['middle_number']` will access the capture group item
5. hint: `my_sort` should return `0` if there is no match, and `int(middle_number)` otherwise

**Bonus**

6. try to achieve the same thing using nested list comprehensions

In [None]:
import urllib.request
import json
import re

with urllib.request.urlopen("http://n-vermeul.ethz.ch/sap_info?kostl=6005") as response:
    cost_centres_raw = json.load(response)

extract_middle_number = re.compile(r'''
    ^\d+                       # start with at least a number
    \-                         # followed by a dash
    (?P<middle_number>\d+)     # followed by numbers, capture these
    \-                         # followed by a dash
    \d+                        # followed by the last digits
    $                          # end of string
    ''', 
    re.X
)

def fonds_filter(cost_centre):
    # True if the `fonds` attribute starts with '2-70', False otherwise
    return cost_centre.get('fonds', '').startswith('2-70')
    
def my_sort(costcentre):
    """This function first extracts the middle number, using a regular expression.
    If we have a match, return the integer representation of that number to enforce number comparison.
    If we do not have a match, return 0
    """
    ### your code here
    

interesting_cost_centres = filter(fonds_filter, cost_centres_raw)
interesting_attributes = {'kostl','fonds', 'beschr'}
reduced_cost_centres = [
    {attr: cost_centre.get(attr, '') for attr in interesting_attributes}
       for cost_centre in interesting_cost_centres
]
sorted_cost_centres = sorted(reduced_cost_centres, key=my_sort)
    
sorted_cost_centres

## Solution to Exercise 1

In [None]:
import urllib.request
import json
with urllib.request.urlopen('http://n-vermeul.ethz.ch/sap_info?kostl=6005') as response:
    cost_centres = json.load(response)

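# the "beschr" entry of the first item, as asked in the exercise
print(cost_centres[0]["beschr"])

# the same entry for every item in the list
for cost_centre in cost_centres:
    print(cost_centre["beschr"])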

## Solution to Exercise 2

**Solution to 2.1 - 2.3**

In [None]:
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen('http://n-vermeul.ethz.ch/sap_info?kostl=-1') as response:
        ...
except urllib.error.HTTPError as exc:
    print(f"💥 This url does not exist: {exc}")

**Solution to 2.4 - 2.6**

In [None]:
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen('something://n-vermeul.ethz.ch/sap_info?kostl=-1') as response:
        ...
except urllib.error.HTTPError as exc:
    print(f"💥 This url does not exist: {exc}")
except urllib.error.URLError as exc:
    print(f"💣 This endpoint does not even use a valid scheme: {exc}")
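
The order of the `except` clauses matters because `HTTPError` is a subclass of `URLError`: if `URLError` came first, it would also catch every `HTTPError`. The sketch below checks this relationship and wraps the pattern in a hypothetical helper `classify()`; the URL passed to it is a placeholder with an unknown scheme, which `urlopen()` rejects before any connection is attempted.

```python
import urllib.error
import urllib.request

# HTTPError derives from URLError, so the more specific clause has to come first.
is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)
print(is_subclass)

def classify(url: str) -> str:
    """Hypothetical helper: report which kind of failure urlopen() produced."""
    try:
        with urllib.request.urlopen(url) as response:
            return f"ok: {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code.
        return f"HTTP error {exc.code}: {exc.reason}"
    except urllib.error.URLError as exc:
        # No usable answer at all (unknown scheme, DNS failure, refused connection, ...).
        return f"URL error: {exc.reason}"

result = classify("something://example.invalid/")  # placeholder URL
print(result)
```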

## Solution to Exercise 3

**Solution to 3.1 - 3.3**

In [None]:
import urllib.request
import json

interesting_keys = {"kostl": int, "fonds": str, "beschr": str}

def rearrange_dict(raw_entry):
    return {key: interesting_keys[key](val) for key, val in raw_entry.items() if key in interesting_keys}

with urllib.request.urlopen("http://n-vermeul.ethz.ch/sap_info?kostl=6005") as response:
    cost_centres_raw = json.load(response)

cost_centres = [rearrange_dict(entry) for entry in cost_centres_raw]
cost_centres
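
To try `rearrange_dict()` without access to the endpoint, it can be run against a hand-written sample. The raw entry below is invented for illustration: it assumes that `kostl` arrives as a string and that the payload contains additional fields, which may differ from the real response.

```python
interesting_keys = {"kostl": int, "fonds": str, "beschr": str}

def rearrange_dict(raw_entry: dict) -> dict:
    """Keep only the interesting keys and cast each value to its target type."""
    return {
        key: interesting_keys[key](value)
        for key, value in raw_entry.items()
        if key in interesting_keys
    }

# Made-up raw entry: the real payload may contain different extra fields.
sample_raw = [
    {
        "kostl": "6005",
        "fonds": "0-21408-18",
        "beschr": "PHRT-Driver-Project PRECISE",
        "extra": "ignored",
    },
]

sample_result = [rearrange_dict(entry) for entry in sample_raw]
print(sample_result)
```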

## Solution to Exercise 4

**Solution to 4.1 - 4.4**

In [None]:
import urllib.request
import json

with urllib.request.urlopen("http://n-vermeul.ethz.ch/sap_info?kostl=6005") as response:
    cost_centres_raw = json.load(response)

def fonds_filter(cost_centre):
    # True if the `fonds` attribute starts with '2-70', False otherwise
    return cost_centre.get('fonds', '').startswith('2-70')

cost_centre_infos = list(filter(fonds_filter, cost_centres_raw))

cost_centre_infos

**Bonus:** the same thing, using a list comprehension

In [None]:
import urllib.request
import json

with urllib.request.urlopen("http://n-vermeul.ethz.ch/sap_info?kostl=6005") as response:
    cost_centres_raw = json.load(response)

cost_centre_infos = [entry for entry in cost_centres_raw if entry.get('fonds', '').startswith('2-70')]

cost_centre_infos

## Solution to Exercise 5

**Solutions to 5.1 - 5.5**

In [None]:
import urllib.request
import json
import re

with urllib.request.urlopen("http://n-vermeul.ethz.ch/sap_info?kostl=6005") as response:
    cost_centres_raw = json.load(response)

extract_middle_number = re.compile(r'''
    ^\d+                       # start with at least a number
    \-                         # followed by a dash
    (?P<middle_number>\d+)     # followed by numbers, capture these
    \-                         # followed by a dash
    \d+                        # followed by the last digits
    $                          # end of string
    ''', 
    re.X
)

def fonds_filter(cost_centre):
    # True if the `fonds` attribute starts with '2-70', False otherwise
    return cost_centre.get('fonds', '').startswith('2-70')
    
def my_sort(costcentre):
    """This function first extracts the middle number, using a regular expression.
    If we have a match, return the integer representation of that number to enforce number comparison.
    If we do not have a match, return 0
    """
    fonds = costcentre.get('fonds', '')
    match = extract_middle_number.search(fonds)
    if match:
        return int(match.groupdict()['middle_number'])  # return int() to enforce number comparison
    else:
        return 0
    

interesting_cost_centres = filter(fonds_filter, cost_centres_raw)
interesting_attributes = {'kostl','fonds', 'beschr'}
reduced_cost_centres = [
    {attr: cost_centre.get(attr, '') for attr in interesting_attributes}
       for cost_centre in interesting_cost_centres
]
sorted_cost_centres = sorted(reduced_cost_centres, key=my_sort)
    
sorted_cost_centres
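
Why convert the middle number to an `int`? The following sketch compares sorting by the extracted number with plain string sorting. It uses a compact one-line version of the regex and a few invented `fonds` values, so it runs without the endpoint.

```python
import re

# Compact equivalent of the verbose pattern used in the exercise.
extract_middle_number = re.compile(r"^\d+-(?P<middle_number>\d+)-\d+$")

def my_sort(costcentre: dict) -> int:
    """Return the middle number of `fonds` as int, or 0 if it does not match."""
    match = extract_middle_number.search(costcentre.get("fonds", ""))
    return int(match.group("middle_number")) if match else 0

# Invented sample records, not taken from the real endpoint.
samples = [
    {"fonds": "2-70123-09"},
    {"fonds": "2-709-10"},
    {"fonds": "no-number"},
]

by_number = [entry["fonds"] for entry in sorted(samples, key=my_sort)]
by_string = [entry["fonds"] for entry in sorted(samples, key=lambda entry: entry["fonds"])]
print(by_number)  # numeric order: 0, 709, 70123
print(by_string)  # character-by-character order
```

Sorting by string compares character by character, so `2-70123-09` ends up before `2-709-10`; sorting by the integer puts 709 before 70123.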