# Fetching URLs with `urllib`

MCS 275 Spring 2024 - Emily Dumas

## Basics

The function `urlopen` in module `urllib.request` sends a request for a URL and returns a response object that behaves like a file opened in binary mode.  In this notebook, we always pause for 0.5 seconds before calling it, so that evaluating a cell repeatedly won't overload anyone's HTTP server.

In [7]:
from urllib.request import urlopen
import time

In [8]:
time.sleep(0.5)
fp = urlopen("http://www.example.com/")  # example.com home page
print(fp.read())
fp.close()

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    

We see HTML code here, but the `b'...'` prefix shows that it is really a `bytes` object.  In general the response payload is a sequence of bytes, and to convert it to characters we need to know the encoding.  Here it is UTF-8:

In [9]:
time.sleep(0.5)
fp = urlopen("http://www.example.com/")  # example.com home page
print(fp.read().decode("UTF-8"))
fp.close()

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

The `.decode` method of a bytes object returns a string.  The argument is the **encoding**.
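
To see why the encoding argument matters, here is a small sketch that needs no network access.  The sample string is made up for illustration; it is encoded to bytes and then decoded with both the right and the wrong encoding.

```python
# A made-up sample string containing one non-ASCII character
text = "café"

# Encoding turns a str into bytes; "é" takes two bytes in UTF-8
raw = text.encode("UTF-8")
print(raw)

# Decoding with the same encoding recovers the original string
roundtrip = raw.decode("UTF-8")
print(roundtrip)

# Decoding with a different encoding raises no error here,
# but it produces the wrong characters
wrong = raw.decode("latin-1")
print(wrong)
```

The last line prints garbled text rather than failing, which is why guessing the encoding is risky.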

## Context manager

Like `open()`, `urlopen()` can be used as a context manager, which closes the response automatically at the end of the block, e.g.

In [12]:
time.sleep(0.5)
with urlopen("http://www.example.com/") as fp:
    print(fp.read().decode("UTF-8"))

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

## Status code

The HTTP response code or status code is also available as the `.status` attribute of the response object.


In [13]:
time.sleep(0.5)
with urlopen("http://www.example.com/") as fp:
    print("The request generated status:", fp.status)

The request generated status: 200


Status 200, or more generally anything from 200 to 299, means success.
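
One consequence worth knowing: for error statuses such as 404, `urlopen` raises `urllib.error.HTTPError` instead of returning a response, and the exception's `.code` attribute holds the status.  The sketch below shows one way to handle that.  The helper names `fetch_status` and `describe_status` are hypothetical, not part of `urllib`.

```python
from http import HTTPStatus
from urllib.error import HTTPError
from urllib.request import urlopen
import time


def fetch_status(url):
    """Return the HTTP status code for `url`, even for error responses."""
    time.sleep(0.5)  # same politeness pause used throughout this notebook
    try:
        with urlopen(url) as fp:
            return fp.status
    except HTTPError as e:
        # Raised for statuses like 404 or 500; the code is in e.code
        return e.code


def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    # HTTPStatus(code) raises ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase
    kind = "success" if 200 <= code <= 299 else "not success"
    return "{} {} ({})".format(code, phrase, kind)


# These two calls make no network request
print(describe_status(200))
print(describe_status(404))
```

Calling `describe_status(fetch_status("http://www.example.com/"))` would combine the two helpers, at the cost of one network request.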

## Headers

The header section of an HTTP response is a set of key-value pairs.  They are stored in the `.headers` attribute of the response object, which is an `http.client.HTTPMessage` rather than a `dict`; convert it to a `dict` to see all of them as key-value pairs.

In [14]:
time.sleep(0.5)
with urlopen("http://www.example.com/") as fp:
    h = dict(fp.headers)
    print(h)

{'Accept-Ranges': 'bytes', 'Age': '207524', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Wed, 10 Apr 2024 19:39:23 GMT', 'Etag': '"3147526947"', 'Expires': 'Wed, 17 Apr 2024 19:39:23 GMT', 'Last-Modified': 'Thu, 17 Oct 2019 07:18:26 GMT', 'Server': 'ECS (cha/8094)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '1256', 'Connection': 'close'}


If you know the key you want, you can index the `.headers` object directly.  The lookup ignores case, so `"Content-type"` matches the `Content-Type` header.

In [15]:
time.sleep(0.5)
with urlopen("http://www.example.com/") as fp:
    print("The content type returned by this request was:")
    print(fp.headers["Content-type"])

The content type returned by this request was:
text/html; charset=UTF-8
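
The `.headers` object of an HTTP response is an `http.client.HTTPMessage`, and its indexing rules differ a little from those of a `dict`.  The sketch below builds such an object by hand, so it runs without a network request, and shows how lookups behave.  The header values are copied from the example.com response above; `X-Missing` is a made-up key.

```python
from http.client import HTTPMessage

# Build a headers object by hand (urlopen normally does this for us)
h = HTTPMessage()
h["Content-Type"] = "text/html; charset=UTF-8"
h["Content-Length"] = "1256"

# Indexing ignores the case of the key
print(h["Content-type"])
print(h["content-length"])

# Indexing a missing key gives None rather than raising KeyError;
# .get lets us choose a different default
print(h["X-Missing"])
print(h.get("X-Missing", "not present"))
```

Because a missing header yields `None` instead of an exception, it is worth checking the result before using it.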


## Auto decode

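The following block is the preferred way to fetch an HTML document as a string.  It uses whatever encoding the server specifies, so it will work even for non-UTF-8 payloads.  If the server does not declare a charset, `get_content_charset()` returns `None`, and a fallback encoding must be chosen instead.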

In [16]:
time.sleep(0.5)
url = "http://www.example.com/"
with urlopen(url) as fp:
    headers = dict(fp.headers)
    encoding = fp.headers.get_content_charset()
    payload_str = fp.read().decode(encoding)

print("Here is the HTML as a string:")
print(payload_str)

Here is the HTML as a string:
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example
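
Some servers send a `Content-Type` header with no charset, in which case `get_content_charset()` returns `None` and `.decode(None)` fails.  Here is a minimal sketch of one way to handle that, assuming UTF-8 is an acceptable fallback.  The helper name `decode_payload` is hypothetical, and the headers are built by hand so the sketch runs without a network request.

```python
from http.client import HTTPMessage


def decode_payload(raw, headers, default="UTF-8"):
    """Decode `raw` bytes using the charset in `headers`, or `default`."""
    # failobj is returned when the headers declare no charset
    encoding = headers.get_content_charset(failobj=default)
    return raw.decode(encoding)


# A server that declares its charset
with_charset = HTTPMessage()
with_charset["Content-Type"] = "text/html; charset=ISO-8859-1"

# A server that does not
without_charset = HTTPMessage()
without_charset["Content-Type"] = "text/html"

print(decode_payload(b"caf\xe9", with_charset))         # bytes in ISO-8859-1
print(decode_payload(b"caf\xc3\xa9", without_charset))  # bytes in UTF-8
```

Inside a `with urlopen(url) as fp:` block, the call would be `decode_payload(fp.read(), fp.headers)`.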

## Getting non-HTML objects

### Retrieving an image and loading it with PIL

In [22]:
from urllib.request import urlopen
import time
from PIL import Image

imgurl = "https://www.dumas.io/teaching/2024/spring/mcs275/slides/images/file-schematic-solution.png"
time.sleep(0.5)
with urlopen(imgurl) as fp:
    print("Type:",fp.headers["Content-type"])
    img = Image.open(fp)

print("Got a PIL image of size {} and mode {}".format(img.size,img.mode))
img.show() # display using system viewer

Type: image/png
Got a PIL image of size (974, 554) and mode RGBA


### Retrieving a JSON document

In [23]:
from urllib.request import urlopen
import json
import time

time.sleep(0.5)
# This is the "bored API", a site which provides a random JSON document
# describing a thing you could do to eliminate boredom
with urlopen("https://www.boredapi.com/api/activity") as fp:
    data = json.load(fp)
    
print("Why don't you...",data["activity"])

Why don't you... Cook something together with someone
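
No `.decode` call was needed above because `json.load` only requires an object with a `.read()` method, and it accepts bytes encoded as UTF-8, UTF-16, or UTF-32.  The sketch below illustrates this without a network request, using `io.BytesIO` as a stand-in for the response object.  The JSON payload is made up for illustration and is not real output from the API.

```python
import io
import json

# Stand-in for an HTTP response: a binary file-like object holding
# a made-up JSON payload, so no network request is needed
fake_response = io.BytesIO(b'{"activity": "Learn to juggle", "participants": 1}')

data = json.load(fake_response)  # reads the bytes and parses them
print(type(data).__name__)
print("Why don't you...", data["activity"])
```

The result is an ordinary `dict`, so fields are accessed by key just as in the cell above.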
