# 18) The Web, Untangled <a class="tocSkip">

The World Wide Web is designed with three main ideas:

- HTTP (HyperText Transfer Protocol) - A protocol for web clients and servers to interchange requests and responses.

- HTML (HyperText Markup Language) - A presentation format for results.

- URL (Uniform Resource Locator) - A way to uniquely represent a server and a resource on that server.
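
To make the URL idea concrete, here is a minimal sketch that splits a URL into its server and resource parts with the standard library's `urllib.parse`. The URL itself (port, path, query and fragment) is made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only to show every component.
url = 'http://www.example.com:8080/docs/index.html?lang=en#intro'
parts = urlsplit(url)

print(parts.scheme)    # the protocol used to reach the server
print(parts.hostname)  # the server
print(parts.port)      # the TCP port on that server
print(parts.path)      # the resource on the server
print(parts.query)     # extra parameters sent with the request
print(parts.fragment)  # a location inside the returned document
```

The hostname and port identify the server; the path and query identify the resource on it.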

In its simplest usage, a web client connects to a web server with HTTP, requests a URL, and receives HTML. Almost every computer language has been used to write web clients and web servers; the dynamic languages Perl, PHP, and Ruby have been especially popular.

The web is a client-server system. The client makes a request to a server: it opens a TCP/IP connection, sends the URL and other information via HTTP, and receives a response. The format of the response is also defined by HTTP. It includes the status of the request and the response's data and format.
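
The exchange described above can be sketched end to end without leaving your machine. The following example is only an illustration: it starts a tiny local server with the standard library's `http.server` (the `HelloHandler` class and the HTML it returns are made up), then plays the client by opening a TCP connection and sending the request as plain text.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        page = b'<html><body>Hello</body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, format, *args):
        pass  # silence the default request logging


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the request is just text sent over a TCP connection.
request = (
    'GET /index.html HTTP/1.1\r\n'
    f'Host: {host}:{port}\r\n'
    'Connection: close\r\n'
    '\r\n'
)
with socket.create_connection((host, port)) as sock:
    sock.sendall(request.encode('ascii'))
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # the server closed the connection
            break
        chunks.append(chunk)

server.shutdown()
server.server_close()

# A blank line separates the status line and headers from the data.
raw_response = b''.join(chunks)
head, _, body = raw_response.partition(b'\r\n\r\n')
status_line = head.split(b'\r\n')[0].decode('ascii')

print(status_line)
print(body.decode('utf-8'))
```

The first line of the response carries the status, the headers that follow describe the format of the data, and everything after the blank line is the data itself.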

An important aspect of HTTP is that it is stateless. Each HTTP connection that you make is independent of all others. This simplifies basic web operations but complicates others, for example:

- Caching - Remote content that does not change should be saved by the web client and used to avoid downloading from the server again.

- Sessions - A shopping website should remember the contents of your shopping cart.

- Authentication - Sites that require your username and password should remember them while you are logged in.

Solutions to statelessness include cookies: the server sends the client a small piece of specific information, and when the client sends that cookie back with a later request, the server can use it to identify the client uniquely.
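
The cookie round trip can be sketched with the standard library's `http.cookies` module. This is a simplified illustration, not a real web application: the cookie name `session_id`, its value and the in-memory `sessions` dictionary are all assumptions made for the example.

```python
from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header carrying a hypothetical session id.
outgoing = SimpleCookie()
outgoing['session_id'] = 'abc123'
outgoing['session_id']['path'] = '/'
set_cookie_header = outgoing.output(header='Set-Cookie:')
print(set_cookie_header)

# Client side: a later request returns the name and value in a Cookie header.
cookie_header = 'session_id=abc123'

# Server side again: parse the returned cookie to recognize the client.
incoming = SimpleCookie()
incoming.load(cookie_header)
session_id = incoming['session_id'].value

# Hypothetical in-memory session store keyed by the cookie value.
sessions = {'abc123': {'cart': ['book', 'pen']}}
print(sessions[session_id]['cart'])
```

The server keeps the state (here, a shopping cart) on its side; the cookie only carries the key that links a new, independent request to that state.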

### Web clients

Let us use Python's standard web library to get something from a website. The URL in the following example returns information from a test website:

In [6]:
import urllib.request as ur

In [13]:
# Connection

url = 'http://www.example.com/'
conn = ur.urlopen(url)

In [8]:
# Print the connection status

print(conn.status)

200


One of the most important parts of the response is the HTTP status code. A 200 means that everything worked fine. There are dozens of HTTP status codes, grouped into five ranges by their first digit:

- 1xx (information) - The server received the request but has some extra information for the client.

- 2xx (success) - It worked; every success code other than 200 conveys extra details.

- 3xx (redirection) - The resource moved, so the response returns the new URL to the client.

- 4xx (client error) - Some problem from the client side, such as the well known 404 (not found).

- 5xx (server error) - 500 is the generic whoops; you might see a 502 (bad gateway) if there is some disconnect between a web server and a backend application server.
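
As a small sketch of this grouping, the standard library's `http.HTTPStatus` enumeration knows the official phrase for each code, and integer division by 100 gives the first digit. The `CATEGORIES` dictionary and the `describe` function are helper names invented for this example.

```python
from http import HTTPStatus

# Labels for the five ranges, keyed by the first digit of the code.
CATEGORIES = {
    1: 'information',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}


def describe(code):
    """Return the code, its standard phrase, and the range it belongs to."""
    category = CATEGORIES[code // 100]
    phrase = HTTPStatus(code).phrase
    return f'{code} {phrase} ({category})'


for code in (200, 301, 404, 500, 502):
    print(describe(code))
```

For the codes mentioned above, this prints lines such as `404 Not Found (client error)` and `502 Bad Gateway (server error)`.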

To get the data contents of the web page, use the read() method of the connection object. This returns a bytes value that can be decoded into a string:

In [14]:
# Get data and convert to string

data = conn.read()
str_data = data.decode('utf8')
str_data

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <