# Web Processing

Most of the data you process or resources you will use in data analysis will not initially
be on a local device. To access remote data and resources you will need to know how data
is transferred and processed through the web, now called the cloud.

## Foundational Concepts

### [**A**pplication **P**rogramming **I**nterface](https://en.wikipedia.org/wiki/API)

A set of methods (verbs) and data structures (nouns) exposed by a server, program, or
process for use by a client.

### [**Re**presentation **S**tate **T**ransfer](https://en.wikipedia.org/wiki/REST)

A convention for how a server structures and exposes an **API**. This is the dominant
convention used for web data processing. It is characterized by the following properties:

* All resources on the server have a unique address, such as an
[**U**niform **R**esource **L**ocator](https://en.wikipedia.org/wiki/URL).
* Each resource will point to dependent resources so that a client can explore and use
the resources without additional information. This is usually done through relative paths.
Note this is the most commonly violated property, hence the proliferation of API docs.
* The server and client both maintain a *state* independent of each other, and are blind
to the internals of each other's *state*.
* The history is independently *persisted* on each side through their internal states.
* The client interactions with the server are *stateless*. There is no shared *session*. The
server does not need to know the details of the client to process the *request*. Likewise
the client does not need to know the details of the server to process the *response*.
* Clients initiate *transactions* on the server by sending a *request* to the server
that invokes one single method (verb) exposed by the server. The request contains the
description of the *transition* (nouns), that is, the changes that have occurred on the
client side. This can include *"I need data."* In strict **ReST** verbs are not included in
the body of the request.
* The server is responsible for taking the *transition transactions* and updating its own
internal state. This can include *"No change necessary."*
* The server sends the client a *response* used to *transition* the client state. This can
include *"Here is the data you asked for."*

The gist is that the server exposes a limited set of methods (verbs) that can be applied
to a large number of locations (object nouns). The client invokes a method (verb) on a 
location (object noun) optionally supplying data (subject nouns).

> Manager (client) says "Body shop technician (server) please paint (verb) car `X47T`
> (object noun) with this green paint (subject noun)."
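
As a rough sketch of that sentence in code, the snippet below builds (but does not send) a request with the `requests` library covered later in this document. The host `bodyshop.example.com`, the `/cars/X47T` path, and the `paint` field are made up for illustration.

```python
import requests

# Hypothetical endpoint: the car is the location (object noun).
CARURL = "https://bodyshop.example.com/cars/X47T"

# PATCH is the verb; the JSON body is the attached data (subject noun).
request = requests.Request("PATCH", CARURL, json={"paint": "green"})

# prepare() assembles the message without contacting any server.
prepared = request.prepare()

print(prepared.method)  # the verb
print(prepared.url)     # the object noun
print(prepared.body)    # the subject noun
```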

### [**H**yper **T**ext **T**ransfer **P**rotocol](https://en.wikipedia.org/wiki/HTTP)

The main convention for processing information on the web. In principle it conforms to
**ReST**; in practice there are wide variations. There are a number of important **API**
methods for working with web server data. Each of these methods must be called against a
**URL**:

* `HEAD` is there a resource at that location and if so send the metadata *headers*.
* `OPTIONS` of the below methods, which ones does the location support.
* `GET` return the data at the location.
* `POST` using only the attached data create a new resource, server will return the location
of the created resource.
* `PUT` using only the attached data replace an existing resource, data not included will be
dropped.
* `PATCH` using the attached data update an existing resource, keeping any data not included
in the request.
* `DELETE` remove the resource at the location.

There are more methods; however, these are the main ones for working with web server
**API**s.
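
The difference between `PUT` and `PATCH` is easiest to see on a toy resource. The sketch below models the server's stored resource as a plain dictionary; the `put` and `patch` helper functions are illustrative stand-ins, not part of any library.

```python
def put(stored, attached):
    """Replace the resource: only the attached data survives."""
    return dict(attached)


def patch(stored, attached):
    """Update the resource: fields not attached are kept."""
    return {**stored, **attached}


car = {"id": "X47T", "color": "red", "owner": "Pat"}
change = {"color": "green"}

print(put(car, change))    # {'color': 'green'}
print(patch(car, change))  # {'id': 'X47T', 'color': 'green', 'owner': 'Pat'}
```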

#### [Headers](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields)

When working with **HTTP** every *request* and *response* is prefixed by a *header* which
contains metadata about the message. A few important headers that are edited frequently:

* Client Request Fields
    * `Authorization` this contains security information. The accepted standard is for the
    client to send a cryptographically secure pseudo-random **API** key that the server
    previously generated and distributed through a secondary channel. The key is sent as a 
    *bearer token credential*. The format is `Authorization: Bearer <<API key>>`.
    * `Accept` the content type the client will accept in the response.
    * `Content-Type` the type of media contained in the client's request. The main types for
    data processing:
        * `Content-Type: application/json` for data transactions
        * `Content-Type: text/html` for webpages
        * `Content-Type: text/csv` for CSV
        * `Content-Type: text/plain` for plain text, including Markdown
        * `Content-Type: application/xml` for XML
    * `Content-Length` the length of the body of the request in bytes.
* Server Response Fields
    * `Content-Type` same as above.
    * `Content-Length` same as above.
    * `Location` the new location of a requested resource, for a 3xx redirection response.
    
Note that if secure **HTTPS** is used, then the headers are encrypted as well. Never transmit
headers containing *credentials* over insecure **HTTP**. In addition, the use of *basic*
authorization is discouraged because the username and password are only Base64 encoded,
which is trivially reversible and therefore no better than plain text.
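
To see why *basic* authorization offers no protection by itself, the sketch below builds a Basic header by hand with the standard library and then reverses it. The username and password are placeholders.

```python
import base64

username, password = "alice", "s3cret"  # placeholder credentials

# Basic authorization is "username:password" run through Base64.
encoded = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
header = f"Basic {encoded}"
print(header)

# Anyone who can read the header can reverse the encoding without a key.
scheme, _, payload = header.partition(" ")
print(base64.b64decode(payload).decode("utf-8"))
```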

There are a couple of common violations of **ReST** in many **HTTP** **API** services:

* Using a `GET` to explicitly create or update on the server, through URL key-value
parameters. In strict **ReST** URL parameters are meant to refine the request such as
specifying search strings.
* Using a `POST` for general transactions. In this case one might see JSON fields called
`action` and `id` that specify how the resources should be modified. In strict **ReST**,
servers should implement `PUT`, `PATCH`, and `DELETE` for modifying actions.
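
The contrast can be sketched with prepared requests, which are built but never sent. The `api.example.com` host and the parameter names are hypothetical.

```python
import requests

BASE = "https://api.example.com/cars"  # hypothetical API

# Violation: a GET whose URL parameters ask the server to delete something.
loose = requests.Request("GET", BASE, params={"action": "delete", "id": "X47T"}).prepare()

# Stricter ReST: the verb carries the action, the URL names the resource.
strict = requests.Request("DELETE", f"{BASE}/X47T").prepare()

# URL parameters used as intended: refining a read-only request.
search = requests.Request("GET", BASE, params={"color": "green"}).prepare()

for prepared in (loose, strict, search):
    print(prepared.method, prepared.url)
```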

## Python Requests

The de facto standard Python library for communicating with web servers is [`requests`](https://docs.python-requests.org/en/latest/index.html), a third-party package rather than part of the standard library. For the most part the library transparently handles securing
communication, errors, and so on.

In [None]:
import requests

## First Example

Begin by opening the [Python Docs](https://docs.python.org/3/) homepage in a browser, and
viewing the **HTML** using the view source option. Now let's use the `get` method from the
`requests` library and compare the results.

Every **HTTP** method has a corresponding `requests`
[method](https://docs.python-requests.org/en/latest/api/#main-interface) and returns a 
[response](https://docs.python-requests.org/en/latest/api/#requests.Response) object.

Always check the response code, or use the `ok` Boolean. For illustration the `webtext`
function defined below prints the code, message, and final resolved URL.
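
As a small offline sketch of that check, the snippet below fills in `Response` objects by hand rather than contacting a server. Hand-building a response is only for illustration; normally the object comes back from `requests.get`. The `describe` helper is a made-up name.

```python
import requests


def describe(response):
    """Summarize a response and say whether it is safe to use."""
    status = f"{response.status_code} {response.reason}"
    # ok is False for any 4xx or 5xx status code.
    return f"{status}: usable" if response.ok else f"{status}: not usable"


# Hand-built responses standing in for real server replies.
found = requests.Response()
found.status_code, found.reason = 200, "OK"

missing = requests.Response()
missing.status_code, missing.reason = 404, "Not Found"

print(describe(found))
print(describe(missing))

# raise_for_status() turns a bad status into an exception instead.
try:
    missing.raise_for_status()
except requests.HTTPError as error:
    print(f"Caught: {error}")
```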

In [None]:
# Make one request and read two of its response headers
r = requests.get("https://docs.python.org/3/")
print(r.headers["content-type"])
print(r.headers.get("content-length"))

In [None]:
HOMEURL = "https://docs.python.org/3/"
BROKENURL = "https://docs.python.org/4/"

def webtext(source, key="No token provided"):
    """
    Assumes the URL serves plain text and tries to retrieve it. Returns an empty
    string on bad requests, otherwise:
    * `text` content of the response as text.
    """

    # If credentials were needed to access the resource we would build a headers
    # dictionary like this one and pass it to the get method as `headers=mockheader`.
    # It is built for illustration only and is not sent with the request below.
    mockheader = {
        "Authorization" : f"Bearer {key}"
    }

    # Simple GET of the source URL
    with requests.get(source) as r:
        print(f"{r.status_code} {r.reason} {r.url}")

        # On bad request exit early
        if not r.ok:
            print("Bailing. Response not usable.")
            return ""
        
        # Good request, do more processing
        print("Here is your text.")
        return r.text

#### Test the Example

Note that these are secure URLs. The `requests` library automatically negotiates the
cryptography.

In [None]:
# Ok will be true here
print(webtext(HOMEURL))

# Ok will be false here
print(webtext(BROKENURL))

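Let's look at the headers to see how the object is structured. The response's `headers`
attribute is a dictionary-like object keyed by header name, as used in the first cell of
this example.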
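
A sketch of how the headers object behaves: it uses `CaseInsensitiveDict` from `requests.structures`, the class `requests` uses for response headers, filled with made-up values so it runs without a network connection.

```python
from requests.structures import CaseInsensitiveDict

# Made-up header values standing in for a real response.
headers = CaseInsensitiveDict({
    "Content-Type": "text/html",
    "Content-Length": "1024",
})

# Lookups succeed regardless of capitalization.
print(headers["content-type"])
print(headers["CONTENT-LENGTH"])

# get() avoids a KeyError when a server omits a header.
print(headers.get("Location", "no redirect"))

# Iterating yields the header names as originally written.
for name, value in headers.items():
    print(f"{name}: {value}")
```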

### Example from `Requests`

Let's give the example from the documentation a try and grab some **JSON** from
[GitHub Events](https://api.github.com/events). For integrating and testing ReST APIs you
can use the [Postman Service](https://www.postman.com/).

In [None]:
GITURL = "https://api.github.com/events"
events = requests.get(GITURL).json()
print(type(events))
print(len(events))
print(events[0])

### Example from Hurricanes

Using `requests` we can skip manually downloading any data files and instead `get` them
straight from the NOAA website.

In [None]:
ATLANTICURL = "https://www.aoml.noaa.gov/hrd/hurdat/hurdat2.html"
PACIFICURL = "https://www.aoml.noaa.gov/hrd/hurdat/hurdat2-nepac.html"
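# The slices drop the non-data lines that wrap each listing. The separators and
# offsets are specific to each page's layout and may need adjusting if it changes.
atlanticdata = requests.get(ATLANTICURL).text.split("\r\n")[3:-6]
pacificdata = requests.get(PACIFICURL).text.split("\n")[7:-5]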


In [None]:
print(len(atlanticdata))
print(len(pacificdata))
print(atlanticdata[0])
print(atlanticdata[-1])
print(pacificdata[0])
print(pacificdata[-1])
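
The two downloads above are split on different line separators (`\r\n` and `\n`). One possible way to avoid hard-coding the separator is `str.splitlines`, sketched here on a made-up string rather than the NOAA pages; the slice offsets would still need to be checked against each page.

```python
# Made-up text mixing both line-ending styles.
sample = "header\r\nrow one\nrow two\r\nfooter"

# splitlines() recognizes \n, \r\n, and \r without being told which.
lines = sample.splitlines()
print(lines)

# Slicing then trims wrapper lines, as in the cell above.
print(lines[1:-1])
```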

## Debugging Strategy

When requests fail, use the following steps:

1. Check the authorization header
2. Check the content type header
3. Check the content length header
4. Check the accept header
5. Check the body of the request to make sure it conforms to the server expectations.
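
Steps 1 through 4 amount to reading the headers that will actually go out. A minimal sketch, using a prepared request so nothing is sent; the URL, key, and payload are placeholders and `show_headers` is a made-up helper.

```python
import requests


def show_headers(prepared):
    """Return the four headers worth checking first, masking the credential."""
    names = ("Authorization", "Content-Type", "Content-Length", "Accept")
    report = {name: prepared.headers.get(name, "<missing>") for name in names}
    if report["Authorization"] != "<missing>":
        # Keep the scheme, hide the secret.
        scheme = report["Authorization"].split(" ")[0]
        report["Authorization"] = f"{scheme} ***"
    return report


prepared = requests.Request(
    "POST",
    "https://api.example.com/cars",  # placeholder URL
    headers={"Authorization": "Bearer not-a-real-key", "Accept": "application/json"},
    json={"color": "green"},  # placeholder body
).prepare()

for name, value in show_headers(prepared).items():
    print(f"{name}: {value}")
```

Masking the credential keeps the key out of notebook output and logs while still showing whether the scheme is the expected one.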

There are no guarantees that the content type will match what was sent in the body of either
the request or the response. Content type sniffing is generally discouraged because it is
both computationally expensive and prone to false results. Instead, attempt to validate that
the body does conform to the content type. If it does not, then take the necessary remedial
action.
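
One way to validate rather than sniff is to parse the body as the declared type and handle the failure. The sketch below handles only JSON and plain text, and `parse_body` is a hypothetical helper, not a `requests` function.

```python
import json


def parse_body(content_type, text):
    """Parse text according to its declared content type.

    Returns (value, problem); problem is None when the body conforms.
    """
    # Drop parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()

    if media_type == "application/json":
        try:
            return json.loads(text), None
        except json.JSONDecodeError as error:
            return None, f"declared JSON but could not parse: {error.msg}"

    if media_type.startswith("text/"):
        return text, None

    return None, f"unsupported content type: {media_type}"


print(parse_body("application/json; charset=utf-8", '{"ok": true}'))
print(parse_body("application/json", "<html>Service unavailable</html>"))
```

With a `requests` response this could be called as `parse_body(r.headers.get("content-type", ""), r.text)`.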

In [22]:
MAPURL = "https://atlas.microsoft.com/route/directions/json?api-version=1.0&query=52.50931,13.42936:52.50274,13.43872"
CLIENT = "4a587ce1-7a97-40a4-a8a4-5ca900957626"
TOKEN = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiJjZGFmZmJlMC0xYmRmLTViN2YtNjhjZC0wMjYxYWMxZjgzMTUiLCJtYXhSYXRlUGVyU2Vjb25kIjoiNTAwIiwib2lkIjoiOWI1OGM1YjktNDE0Zi00NjY0LWI0NDctZjFlZjg3OGQ1YjMwIiwibmJmIjoxNzMyNjQ4ODIxLCJleHAiOjE3MzI2Nzc2MjEsImlhdCI6MTczMjY0ODgyNiwiaXNzIjoiNGE1ODdjZTEtN2E5Ny00MGE0LWE4YTQtNWNhOTAwOTU3NjI2IiwiYXVkIjoiL3N1YnNjcmlwdGlvbnMvNzEzYWJmOWUtNWNhMC00ZjFlLWJhNjAtNThiZDgxZDMwMGJkL3Jlc291cmNlR3JvdXBzL3JnX2J2Y19tYXBzL3Byb3ZpZGVycy9NaWNyb3NvZnQuTWFwcy9hY2NvdW50cy9tcF9idmNfbWFwcyJ9.BCRkiJwLfh4QTpfZVLlLprObqLxwANHG3xYkL4gLVB4"
headers = {
    "x-ms-client-id": CLIENT,
    "Authorization": f"Bearer {TOKEN}"
}
with requests.get(MAPURL, headers = headers) as r:
    print(f"{r.status_code} {r.reason} {r.url}")

401 Unauthorized https://atlas.microsoft.com/route/directions/json?api-version=1.0&query=52.50931,13.42936:52.50274,13.43872
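
A `401` with a bearer token usually points back to step 1: the token is missing, malformed, or expired. Assuming the token is a JSON Web Token (the `eyJ` prefix suggests so), its payload can be decoded locally to read the standard `exp` (expiry) claim. The sketch below builds a throwaway unsigned token for demonstration; `token_expiry` and `fake_token` are made-up helpers, and nothing here verifies a signature.

```python
import base64
import json
from datetime import datetime, timezone


def token_expiry(token):
    """Return the expiry time of a JWT as an aware datetime, or None."""
    payload = token.split(".")[1]
    # Base64url payloads drop their padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if "exp" not in claims:
        return None
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def fake_token(claims):
    """Build an unsigned demo token with the given claims."""
    def encode(part):
        raw = json.dumps(part).encode("utf-8")
        return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return f"{encode({'alg': 'none'})}.{encode(claims)}.signature"


demo = fake_token({"exp": 1700000000})  # an arbitrary moment in the past
expires = token_expiry(demo)
print(expires)
print("expired" if expires < datetime.now(timezone.utc) else "still valid")
```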
