### CS424

Prof. Götz Pfeiffer<br />
School of Mathematics, Statistics and Applied Mathematics, NUI Galway

# Lecture 11: Hypertext Transfer Protocol

The HyperText Transfer Protocol (HTTP) is an **application layer protocol** for distributed,
collaborative, hypermedia information systems.  It is the foundation of data communication for the World Wide Web.

**Hypertext** is a network of nodes containing structured text
that uses hyperlinks to refer to other text nodes.

HTTP works as a request-response protocol in a client-server model.
A client (typically a web browser) submits an **HTTP request** to the server.
The server can then perform actions on behalf of the client, and returns
an **HTTP response** to the client.

The idea of hypertext dates back to Ted Nelson's [Xanadu project](https://en.wikipedia.org/wiki/Project_Xanadu) (1965),
or even to [Vannevar Bush](https://en.wikipedia.org/wiki/Vannevar_Bush)'s memex system (conceived in the 1930s, published in 1945).  It only became practical with the
advent of widespread point-and-click interfaces on personal computers.
The first version of HTTP (0.9) grew out of [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee)'s 1989 proposal
of a World Wide Web of documents, primarily for the purpose of
organizing the ever-growing collection of technical manuals for all the equipment at the
CERN particle accelerators.  The success of the protocol
as a universal data exchange format for all aspects of human life came somewhat unexpectedly.


## Stateless

HTTP is a **stateless** protocol: the server is not required to retain information or status about each user across multiple requests. However, some web applications implement state or server-side sessions using, for instance, HTTP **cookies** or **hidden variables** within web forms.
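
To make the role of cookies concrete, here is a minimal sketch of a server-side session built on top of a stateless protocol. The names `SESSIONS`, `handle_request` and the cookie name `session_id` are assumptions made for this illustration; real frameworks differ in the details.

```python
from http.cookies import SimpleCookie
import secrets

# Hypothetical in-memory session store: session id -> per-user data.
SESSIONS = {}

def handle_request(cookie_header):
    """Return (session_data, set_cookie_value_or_None) for one request.

    cookie_header is the value of the request's Cookie field, or None.
    """
    cookie = SimpleCookie()
    if cookie_header:
        cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    if morsel is not None and morsel.value in SESSIONS:
        # Known client: the cookie lets the server find its stored state.
        session = SESSIONS[morsel.value]
        set_cookie = None
    else:
        # New client: create state and ask the client to send the id back.
        session_id = secrets.token_hex(16)
        session = SESSIONS[session_id] = {"visits": 0}
        reply = SimpleCookie()
        reply["session_id"] = session_id
        reply["session_id"]["path"] = "/"
        reply["session_id"]["httponly"] = True
        set_cookie = reply["session_id"].OutputString()
    session["visits"] += 1
    return session, set_cookie

first, set_cookie = handle_request(None)
# The client echoes the name=value part of Set-Cookie in its next request.
second, _ = handle_request(set_cookie.split(";")[0])
print(second["visits"])  # 2
```

The protocol itself remembers nothing here: each request is handled in isolation, and the state is recovered only because the client sends the session id back.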

## Layers

The framework of the internet protocol suite defines data exchange protocols as a stack of several layers:

* Link layer: ARP, PPP
* Internet layer: IP
* Transport layer: TCP, UDP
* Application layer: HTTP, SSH, FTP, Telnet, SMTP

HTTP is an application layer protocol that presumes an underlying reliable transport layer,
usually TCP.

## HTTP Requests

An HTTP **request message** consists of 4 parts:

* a **request line**;

* a sequence of **request header fields**;

* an **empty line**;

* an (optional) **message body**.
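
The four parts can be recovered from a raw message with a few lines of Python. This is only a sketch: `parse_request` is a name chosen here, and the code assumes a well-formed HTTP/1.x message whose lines end in CRLF, as they do on the wire.

```python
def parse_request(raw: bytes):
    """Split a raw HTTP/1.x request into request line, header fields, body."""
    # The first empty line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    request_line = lines[0]
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return request_line, headers, body

raw = (
    b"POST /products HTTP/1.1\r\n"
    b"Host: localhost:3001\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
request_line, headers, body = parse_request(raw)
method, target, version = request_line.split(" ")
print(method, target, version)
print(headers)
print(body)
```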

### Request line

A request line like
```http
GET /products HTTP/1.1
```
consists of an **HTTP command** (`GET`), a **resource** in the form of a path (`/products`)
and the version of the protocol used (`HTTP/1.1`).

### Header Fields

The request header fields form a sequence of key/value pairs of the form `name: value`. 
Examples are
```http
Host: localhost:3001
Connection: keep-alive
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8,de;q=0.6
Cookie: _depot_session=Qk5va3BXbnplS1NET3prcmlnNS94Q0xHSitocmxTSGZ6MnI4TWlrVnZyc3VJaHcvN2pFZytBT3kxNWR6VG5ETFRoUHczN0pGNkhNSjNSaEVMRG1LYU9ZZzVoWGwvb2tvUUJxeUVsNFlhOFgvL0VCRThpaVBpUThhMFFNUko0UldDbjkrVXNPZm9NNkwyRFhvMjZXNjdnPT0tLVRHNlErcHFUaEF2RGdZL2w4RXVqT0E9PQ%3D%3D--4f5cf13aa1795afc2aa50c552e365f46175e3a2b
```

The `Host` field is mandatory under `HTTP/1.1`.

### Message Body

The optional body of the request consists of everything between the (required) empty line
and the end of the request.  The meaning (and presence) of a message body
depends on the HTTP command.
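
In practice, a request that carries a body usually announces its length in a `Content-Length` header field, so that the server knows where the body ends. The following sketch assembles such a message by hand; the function name `build_request` and the default content type are assumptions for illustration only.

```python
def build_request(method, path, host, body=b"", content_type="text/plain"):
    """Assemble a raw HTTP/1.1 request message as bytes."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    if body:
        # Content-Length tells the server where the body ends.
        lines.append(f"Content-Type: {content_type}")
        lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines) + "\r\n\r\n"  # the empty line ends the head
    return head.encode("ascii") + body

print(build_request("GET", "/products", "localhost:3001"))
print(build_request("POST", "/products", "localhost:3001", b"name=widget"))
```

Note that the empty line is present even for the `GET` request, which has no body at all.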

## HTTP commands

The first version of HTTP had only one command: `GET`. Now there are several other commands, allowing
for a wider range of services.

### GET

Requests a **representation** of the specified resource.
A `GET` request **should** only retrieve data and should not have any other effect.

### HEAD

Identical to `GET`, except that the response will not contain the  message body.

### POST

Requests the server to accept the entity enclosed in the request (body)
as a new subordinate of the given resource, e.g., add a new item to a database.


### PUT

The `PUT` method requests that the enclosed entity be stored under the supplied URI. If the URI refers to an already existing resource, it is modified; if the URI does not point to an existing resource, then the server can create the resource with that URI.

### DELETE

The `DELETE` method deletes the specified resource.

### PATCH

The `PATCH` method applies partial modifications to a resource.

### Safe Methods

Some of the methods (for example, `HEAD` and `GET`) are, by **convention**, considered **safe**.
This means they are intended only for information retrieval and **should not change the state of the server**. In other words, they **should not have side effects**, beyond relatively harmless effects such as logging, caching, the serving of banner advertisements or incrementing a web counter. Making arbitrary `GET` requests without regard to the context of the application's state can therefore be considered safe. However, note that this is only a convention.  It is not mandated by the standard, nor can it be enforced.

By contrast, methods such as `POST`, `PUT`, `DELETE` and `PATCH` are intended for actions that may cause side effects either on the server, or external side effects such as financial transactions or transmission of email. Such methods are therefore not usually used by conforming web robots or web crawlers.



### Idempotent Methods

In algebra, an element $e$ of a ring $R$ is called an **idempotent**, if $e^2 = e$.
If $R$ is a ring of operators, this means that applying the operator $e$ several times in a row
has the same effect as applying $e$ just once.

In this sense, the methods `PUT` and `DELETE` are defined to be idempotent, meaning that multiple identical requests should have the same effect as a single request.
Safe methods (which by definition have **no effect** on the server state)
are idempotent as well.

In contrast, the `POST` method is not necessarily idempotent. Sending an identical `POST` request multiple times may further affect state or cause further side effects (such as financial transactions). 

Again, this is only a convention that is not, and cannot be, enforced.
Ignoring this convention, however, may result in undesirable consequences.
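
The difference can be illustrated with a toy model. The `Store` class below is a hypothetical in-memory stand-in for a server's resources, not a real HTTP server; it only mimics the intended semantics of `PUT`, `DELETE` and `POST` as described above.

```python
class Store:
    """Hypothetical in-memory collection of resources, keyed by path."""

    def __init__(self):
        self.items = {"/products/1": "gadget"}
        self.next_id = 2

    def put(self, path, entity):
        # Store the entity under the given path, replacing any old one.
        self.items[path] = entity

    def delete(self, path):
        # Removing a missing resource leaves the state unchanged.
        self.items.pop(path, None)

    def post(self, collection, entity):
        # Create a new subordinate resource with a fresh id each time.
        path = f"{collection}/{self.next_id}"
        self.next_id += 1
        self.items[path] = entity
        return path

def state_after(request, times):
    """Apply the same request `times` times to a fresh store."""
    store = Store()
    for _ in range(times):
        request(store)
    return store.items

def put(store):
    store.put("/products/7", "widget")

def delete(store):
    store.delete("/products/1")

def post(store):
    store.post("/products", "widget")

print(state_after(put, 1) == state_after(put, 3))        # True
print(state_after(delete, 1) == state_after(delete, 3))  # True
print(state_after(post, 1) == state_after(post, 3))      # False
```

Repeating the `PUT` or the `DELETE` leaves the store in the same state as a single request, whereas every repeated `POST` adds another resource.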


##  HTTP Responses

An HTTP **response message** consists of 4 parts:

* a **status line**;

* a sequence of **response header fields**;

* an **empty line**;

* an (optional) **message body**.

### Status Line

A status line like
```http
HTTP/1.1 200 OK 
```
consists of an HTTP version (`HTTP/1.1`), a machine-readable status code (`200`) and a human-readable reason phrase (`OK`).

Common status codes (and reason phrases) are

* `100 Continue`
* `200 OK`
* `204 No Content`
* `301 Moved Permanently`
* `302 Found`
* `400 Bad Request`
* `401 Unauthorized`
* `402 Payment Required`
* `403 Forbidden`
* `404 Not Found`
* `500 Internal Server Error`

Here, the first digit of the status code defines the **response class**:

* `1..`: **Informational** - Request received, continuing process

* `2..`: **Success** - The action was successfully received, understood, and accepted
        
* `3..`: **Redirection** - Further action must be taken in order to complete the request

* `4..`: **Client Error** - The request contains bad syntax or cannot be fulfilled

* `5..`: **Server Error** - The server failed to fulfill an apparently valid request
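
As a small sketch of how a client might interpret a status line, the following code splits it into its three parts and derives the response class from the first digit. The helper names `parse_status_line` and `response_class` are chosen for this example; `http.HTTPStatus` is part of the Python standard library.

```python
from http import HTTPStatus

RESPONSE_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def parse_status_line(line):
    """Split a status line into version, numeric code and reason phrase."""
    parts = line.strip().split(" ", 2)
    version, code = parts[0], int(parts[1])
    # The reason phrase may contain spaces, or be missing altogether.
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason

def response_class(code):
    """Name the response class given by the first digit of the code."""
    return RESPONSE_CLASSES[code // 100]

version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")
print(version, code, reason, "->", response_class(code))
# The standard library knows the usual reason phrases as well.
print(HTTPStatus(301).phrase)  # Moved Permanently
```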

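### Header Fields

Like the request header fields, the response header fields form a sequence of key/value pairs of the form `name: value`. Examples are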

```http
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Content-Type: text/html; charset=utf-8
Etag: W/"a88d71f787ee296868c8cd19906d76cc"
Cache-Control: max-age=0, private, must-revalidate
X-Request-Id: c9339d6e-f687-441b-bd64-bf4a73d261e0
X-Runtime: 0.136222
Server: WEBrick/1.3.1 (Ruby/2.3.1/2016-04-26)
Date: Sun, 08 Oct 2017 18:00:44 GMT
Content-Length: 6149
Connection: Keep-Alive
Set-Cookie: _depot_session=K1hGd0wyS3grSjIvOHFXZjVjKzVGbUhybFNjc1gvSEp3N21oMVZaTVQ0cHp5UlBKcmFDa0gzL3RHS3pjaFIxcmJoRTNrM0pjWmZ2UUR0anF6UEZ1Z1hZMmxPNktMOVg1a2tnajVQYUZmZ1VkL1F5UjVhcEdOa0p4N1dmMmJNNHk5WWp6MjYwaEg1UDVFTWJlZmx4bytnPT0tLXk1cHhFbEk2ZkxVekR0M2hCSFhTdWc9PQ%3D%3D--b9ddb7dcc84cd4baaa83ee9fade933be2667e4e6; path=/; HttpOnly
```

### Example Session

```
telnet schmidt
```
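
A session of this kind, where a request is typed by hand over a plain TCP connection, can also be sketched in Python. The example below does not contact the host named above; instead it starts a small local server, with a hypothetical handler `HelloHandler`, and then sends a hand-written `GET` request to it through a socket. All names, the loopback address and the response text are assumptions made for this illustration.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text."""

    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"Hello, HTTP!\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output free of server log lines

def fetch_raw(host, port, path="/"):
    """Send a hand-written GET request and return the raw response bytes."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)

def demo():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_raw("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()

print(demo().decode("iso-8859-1"))
```

The printed output shows the status line, the response header fields, the empty line and the message body, exactly in the order described in the previous sections.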

##  URLs and URIs

HTTP **resources** are identified and located on the network by Uniform Resource Locators (URLs), using the Uniform Resource Identifier (URI) schemes `http` and `https`.

A URI is a string of characters used to identify a resource. The most common form of URI is the Uniform Resource Locator (URL), frequently referred to informally as a **web address**.

A typical URL has the form `http://www.example.com/index.html`, which indicates a protocol (`http`), a hostname (`www.example.com`), and a path to a file (`/index.html`).

More specifically, every HTTP URL conforms to the syntax of a generic URI, which is of the form:
```
scheme:[//[user[:password]@]host[:port]][/path][?query][#fragment]
```
This comprises:

* The **scheme**, consisting of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (`+`), period (`.`), or hyphen (`-`). Although schemes are case-insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters. It is followed by a colon (`:`). Examples of popular schemes include `http` or `https`, `ftp`, `mailto` and `file`. 

* Two slashes (`//`): This is required by some schemes and not required by some others. 

* An **authority part**, comprising:
  * An optional authentication section of a **user name** and **password**, separated by a colon, followed by an at symbol (`@`)
  * A **host**, consisting of either a registered name (including but not limited to a hostname), or an IP address. 
  * An optional **port number**, separated from the hostname by a colon
  
* A **path**, which contains data, usually organized in hierarchical form, that appears as a sequence of segments separated by slashes (`/`). Such a sequence may resemble or map exactly to a file system path, but does not always imply a relation to one. 

* An optional **query**, separated from the preceding part by a question mark (`?`), containing a query string of non-hierarchical data. Its syntax is not well defined, but by convention is most often a sequence of attribute–value pairs separated by a delimiter (`&` or `;`).

* An optional **fragment**, separated from the preceding part by a hash (`#`). The fragment contains a fragment identifier providing direction to a secondary resource.  When the primary resource is an HTML document, the fragment is often an id attribute of a specific element, and web browsers will scroll this element into view.  The fragment thus is directed towards the client and will not become part of the request sent to the server.
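
The components listed above can be inspected with Python's standard `urllib.parse` module. The URL below, including the user name and password, is made up for this example, and `request_target` is a helper name chosen here to show which parts of a URL would end up in the request line.

```python
from urllib.parse import urlsplit, parse_qsl

url = "https://alice:secret@www.example.com:8443/docs/index.html?lang=en&page=2#intro"
parts = urlsplit(url)

print(parts.scheme)     # https
print(parts.username)   # alice
print(parts.password)   # secret
print(parts.hostname)   # www.example.com
print(parts.port)       # 8443
print(parts.path)       # /docs/index.html
print(parts.query)      # lang=en&page=2
print(parts.fragment)   # intro
print(parse_qsl(parts.query))  # [('lang', 'en'), ('page', '2')]

def request_target(url):
    """Path and query as they would appear in the request line."""
    p = urlsplit(url)
    target = p.path or "/"
    if p.query:
        target += "?" + p.query
    return target

# The fragment stays on the client and is not sent to the server.
print(request_target(url))  # /docs/index.html?lang=en&page=2
```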

