# 18) The Web, Untangled <a class="tocSkip">

The World Wide Web is built on three main ideas:

    - HTTP (HyperText Transfer Protocol)
    A protocol for web clients and servers to interchange requests and responses.
    
    - HTML (HyperText Markup Language)
    A presentation format for results.
    
    - URL (Uniform Resource Locator)
    A way to uniquely represent a server and a resource on that server.
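To make the URL idea concrete, here is a small sketch that splits a URL into its server and resource parts with the standard library's `urllib.parse`. The URL itself is made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to show the parts.
url = "http://www.example.com:8080/catalog/item?id=42#reviews"
parts = urlsplit(url)

print(parts.scheme)    # how to talk to the server
print(parts.hostname)  # which server
print(parts.port)      # which port on that server
print(parts.path)      # which resource on that server
print(parts.query)     # extra parameters for the resource
print(parts.fragment)  # a position inside the returned document
```

The scheme, host and port identify the server; the path and query identify the resource.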

In its simplest usage, a web client connects to a web server with HTTP, requests a URL and receives HTML. Almost every computer language has been used to write web clients and web servers; the dynamic languages Perl, PHP and Ruby have been especially popular.

The web is a client-server system. The client makes a request to a server: it opens a TCP/IP connection, sends the URL and other information via HTTP and receives a response. The format of the response is also defined by HTTP. It includes the status of the request and the response's data and format.
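The request and response cycle can be tried without leaving your machine. The following is a minimal sketch, assuming a throwaway server on the loopback address; the `HelloHandler` class and its HTML body are invented for the example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>Hello</body></html>"
        self.send_response(200)  # status of the request
        self.send_header("Content-Type", "text/html; charset=utf-8")  # format
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # the data

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client opens a TCP/IP connection, sends a request, reads the response.
client = HTTPConnection("127.0.0.1", port)
client.request("GET", "/")
response = client.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
html = response.read().decode("utf-8")

client.close()
server.shutdown()
server.server_close()

print(status, content_type)
print(html)
```

The three pieces named above are all visible here: the status code, the format (the `Content-Type` header) and the data (the body).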

An important aspect of HTTP is that it is stateless. Each HTTP connection that you make is independent of all others. This simplifies basic web operations but complicates others, for example:

- Caching - Remote content that does not change should be saved by the web client and used to avoid downloading from the server again.

- Sessions - A shopping website should remember the contents of your shopping cart.

- Authentication - Sites that require your username and password should remember them while you are logged in.

Solutions to statelessness include cookies, in which the server sends the client enough specific information to identify it uniquely when the client sends the cookie back.
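As a rough illustration of the cookie round trip, the sketch below uses the standard library's `http.cookies` module. The cookie name `session_id` and its value are assumptions made up for the example; a real site would generate an unguessable value.

```python
from http.cookies import SimpleCookie

# Server side: build the Set-Cookie header for a new visitor.
# The name "session_id" and the value "abc123" are hypothetical.
server_cookie = SimpleCookie()
server_cookie["session_id"] = "abc123"
server_cookie["session_id"]["path"] = "/"
set_cookie_header = server_cookie.output(header="Set-Cookie:")
print(set_cookie_header)

# Client side: on later requests the client sends the value back in a
# Cookie header, and the server parses it to recognise the visitor.
cookie_header = "session_id=abc123"
received = SimpleCookie()
received.load(cookie_header)
session_id = received["session_id"].value
print(session_id)
```

The server can then use the returned value as a key into whatever it stores per visitor, such as a shopping cart.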

### Web clients

Let us use Python's standard web library to get something from a website. The URL in the following example returns information from a test website:

In [6]:
import urllib.request as ur

In [13]:
# Connection

url = 'http://www.example.com/'
conn = ur.urlopen(url)

In [8]:
# Print the connection status

print(conn.status)

200


One of the most important parts of the response is the HTTP status code. A 200 means that everything worked fine. There are dozens of HTTP status codes, grouped into five ranges by their first digit:

    1xx (information)
    The server received the request but has some extra information for the client.
    
    2xx (success)
    It worked; every success code other than 200 conveys extra details.
    
    3xx (redirection)
    The resource moved, so the response returns the new URL to the client.
    
    4xx (client error)
    Some problem from the client side, such as the well known 404 (not found).
    
    5xx (server error)
    500 is the generic whoops; you might see a 502 (bad gateway) if there is some disconnect between a web server and a backend application server.
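The grouping by first digit is easy to check in code. This sketch uses the standard library's `http.HTTPStatus` enumeration for the official phrases; the `CLASSES` table and the `describe()` helper are names invented here.

```python
from http import HTTPStatus

# The five ranges, keyed by the first digit of the status code.
CLASSES = {
    1: "information",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return the code, its standard phrase and its range."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"


for code in (200, 301, 404, 500, 502):
    print(describe(code))
```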

To get the data contents from the web page, use the read() method of the connection variable. This returns a bytes value that can be converted to a string:

In [14]:
# Get data and convert to string

data = conn.read()
str_data = data.decode('utf8')
str_data

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <
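Decoding with a hard-coded 'utf8' works for this page, but other pages may declare a different encoding in their Content-Type header. The sketch below shows one way to honour that declaration. It builds the header by hand so that it runs offline; the header value and the body are made up. The headers object of a `urlopen()` response derives from the same `email.message.Message` class, so the same method should be available as `conn.headers.get_content_charset()`.

```python
from email.message import Message

# Stand-in for the headers of a response (hypothetical values).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# Fall back to UTF-8 when the server does not declare a charset.
charset = headers.get_content_charset() or "utf-8"

# Stand-in for the bytes returned by read().
body = "café".encode("iso-8859-1")
text = body.decode(charset)

print(charset)
print(text)
```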

### Web servers

Web developers have found Python to be an excellent language for writing web servers and server-side programs. This has led to a variety of Python-based web frameworks. A web framework provides features with which you can build websites, so it does more than a simple web (HTTP) server. You will see features such as routing (URL to server function), templates (HTML with dynamic inclusions), debugging and more.

Python web development made a leap with the definition of the Web Server Gateway Interface (WSGI), a universal API between Python web applications and web servers. The following web framework uses WSGI.

Web servers handle the HTTP and WSGI details, but you use web frameworks to actually write the Python code that powers the site. There are many Python web frameworks; at a minimum, a web framework handles client requests and server responses. Most major web frameworks also include HTTP handling, authentication, authorization, establishing sessions, getting parameters, validating parameters and more.
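To show what the WSGI interface looks like underneath a framework, here is a minimal sketch of a WSGI application: a callable that receives the request environment and a `start_response` function and returns the body as an iterable of bytes. The `app` function and the fake request are assumptions for illustration. The application is called directly rather than through a real server.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A minimal WSGI application (hypothetical example)."""
    body = f"You asked for {environ['PATH_INFO']}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# Simulate what a WSGI server does, without opening a network port.
environ = {"PATH_INFO": "/hello"}
setup_testing_defaults(environ)  # fills in the other required keys

captured = {}


def start_response(status, headers):
    # A real server would send these to the client; here we just keep them.
    captured["status"] = status
    captured["headers"] = headers


result = b"".join(app(environ, start_response))
print(captured["status"])
print(result)
```

A framework such as Bottle wraps this low-level contract so that you only write functions that return the page content.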

#### Bottle

Bottle is a web framework that consists of a single Python file, so it is easy to try and deploy. The following code will run a test web server and return a line of text when your browser accesses the URL http://localhost:9999/.

In [18]:
from bottle import route, run, static_file

In [20]:
# Generating home page

@route('/')
def home():
    return "This is a home page"

run(host = 'localhost', port = 9999)

Bottle v0.12.19 server starting up (using WSGIRefServer())...
Listening on http://localhost:9999/
Hit Ctrl-C to quit.



Bottle uses the route decorator to associate a URL with the following function; in this case, / (the home page) is handled by the home() function. Bottle has many other features. In particular, you can try adding these arguments when you call run():

    - debug = True creates a debugging page if you get an HTTP error.
    - reloader = True restarts the server automatically if you change any of the Python code.
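The idea behind a route decorator can be sketched in a few lines of plain Python. This is a toy illustration of the technique, not Bottle's actual implementation; `ROUTES`, `route()` and `dispatch()` are names invented for the example.

```python
ROUTES = {}  # maps a URL path to the function that handles it


def route(path):
    """Toy decorator that registers the decorated function for `path`."""
    def register(func):
        ROUTES[path] = func
        return func
    return register


@route("/")
def home():
    return "This is a home page"


@route("/about")
def about():
    return "This is the about page"


def dispatch(path):
    """Look up the handler for `path` and call it, or report a 404."""
    handler = ROUTES.get(path)
    if handler is None:
        return "404 Not Found"
    return handler()


print(dispatch("/"))
print(dispatch("/about"))
print(dispatch("/missing"))
```

A real framework adds much more on top, such as path parameters and HTTP methods.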

### Web APIs and REST

Often, data is available only within web pages. If you want to access it, you need to open the pages in a web browser and read them. If the authors of the website have made any changes since the last time you visited, the location and style of the data might have changed.

Instead of publishing web pages, you can provide data through a web application programming interface (API). Clients access your service by making requests to URLs and getting back responses containing status and data. Instead of HTML pages, the data is in formats that are easier for programs to consume, such as JSON or XML.

Sometimes you might want a little bit of information, but what you need is available only in HTML pages, surrounded by extraneous content. We could extract the data manually, but it is more reproducible to create an automated web fetcher called a crawler. After the contents have been retrieved from the remote web servers, a scraper parses them to find the required data.
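A scraper can be sketched with nothing but the standard library. The example below uses `html.parser.HTMLParser` to pull the title and the link targets out of a page. The `TitleAndLinks` class and the sample page are assumptions made up for illustration, and the page deliberately contains an unclosed tag.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Hypothetical scraper that collects the page title and link targets."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Made-up page; note the <p> tag that is never closed.
page = """<html><head><title>Example Domain</title></head>
<body><p>Unclosed paragraph
<a href="https://www.iana.org/domains/example">More information...</a>
</body></html>"""

scraper = TitleAndLinks()
scraper.feed(page)
print(scraper.title)
print(scraper.links)
```

This event-style parser does not build a document tree, so it suits simple extractions; for anything more involved a dedicated library is usually more convenient.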

If you already have the HTML data from a website and just want to extract data from it, BeautifulSoup is a good choice. HTML parsing is harder than it sounds, because much of the HTML on public web pages is technically invalid: unclosed tags, incorrect nesting and other complications.

Let us build a full program. It searches for videos using an API at the Internet Archive, one of the few APIs that allow anonymous access. The program does the following:

- Prompts you for part of a movie or video title
- Searches for it at the Internet Archive
- Returns a list of identifiers, names and descriptions
- Lists them and asks you to select one
- Displays that video in your web browser

In [68]:
import sys
import webbrowser
import requests

In [69]:
# Searches for a title on the Internet Archive

def search(title):
    ''' Return a list of 3-item tuples (identifier, title, description) about videos. '''
    
    search_url = "https://archive.org/advancedsearch.php"
    params = {
            "q": "title:({}) AND mediatype:(movies)".format(title),
            "fl": "identifier, title, description",
            "output": "json",
            "rows": 10,
            "page": 1,
            }
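    resp = requests.get(search_url, params = params)
    resp.raise_for_status()  # stop early if the server reports an HTTP error
    data = resp.json()
    # A record may lack a description, so fall back to an empty string
    docs = [(doc["identifier"], doc["title"], doc.get("description", "")) for doc in data["response"]["docs"]]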
    
    return docs
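The list comprehension in search() assumes a JSON document with a "response" object that holds a "docs" list. The sketch below runs the same extraction on a small hand-written payload of that shape, so that it can be tried offline. The records are invented and are not real Internet Archive data. It also uses `dict.get()` for the description, a defensive choice in case a record has no such field.

```python
import json

# Made-up payload in the shape that search() expects; not real data.
sample = """{"response": {"docs": [
  {"identifier": "id-one", "title": "First Title",
   "description": "A short description."},
  {"identifier": "id-two", "title": "Second Title"}
]}}"""

data = json.loads(sample)

# Same extraction as in search(), with a default for a missing description.
docs = [
    (doc["identifier"], doc["title"], doc.get("description", ""))
    for doc in data["response"]["docs"]
]

for identifier, title, description in docs:
    print(identifier, title, repr(description))
```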

In [70]:
# Lets the user choose one of the search results

def choose(docs):
    ''' Print line number, title and truncated description for each tuple in :docs.
    Get the user to pick a line number. If it is valid, return the first item in the chosen tuple.
    Otherwise, return None. '''

    last = len(docs) - 1
    for num, doc in enumerate(docs):
        print(f"{num}: ({doc[1]}) {doc[2][:30]}...")
    
    index = input(f"Which would you like to see (0 to {last})? ")
    try:
        return docs[int(index)][0]
    except (ValueError, IndexError):  # not a number, or out of range
        return None

In [71]:
# Displays the chosen url in a browser

def display(identifier):
    ''' Display the Archive video with :identifier in the browser. '''
    details_url = "https://archive.org/details/{}".format(identifier)
    print('Loading', details_url)
    webbrowser.open(details_url)

In [72]:
# Combines the above functions

def main(title):
    ''' Find any movies that match :title. Get the user's choice and display it in the browser. '''
    
    identifiers = search(title)
    if identifiers:
        identifier = choose(identifiers)
        if identifier:
            display(identifier)
        else:
            print('Nothing selected')
    else:
        print('Nothing found for', title)