# Working with APIs in Python

Making API requests in Python can be really simple. The standard library includes a lower-level module called `urllib` that can also make the kinds of web requests we want, but it's not as friendly as the third-party `requests` module, which we'll be using.

In [None]:
import requests

## Authentication

You'll have to authenticate each request to the Harvard Art Museum API with an API key. Other APIs may require different kinds of authentication (sometimes very complicated auth! Look for libraries at that point), but HAM has some pretty simple authentication, which makes things easy for us.

You can sign up for a key [here](https://www.harvardartmuseums.org/collections/api). Documentation for the entire API is hosted on GitHub and can be viewed [here](https://github.com/harvardartmuseums/api-docs).

In [None]:
APIKEY = "b0cde630-ce66-11e8-951c-b3d75228cc98" # Enter your API key here
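
If you'd rather not paste a key into a notebook you might share, one common alternative is to read it from an environment variable. This is only a sketch: `HAM_APIKEY` is a made-up variable name, not something the API requires, and `key_from_env` is a hypothetical name chosen so that it doesn't overwrite `APIKEY` above.

```python
import os

# HAM_APIKEY is a hypothetical variable name; set it in your shell before launching Jupyter
key_from_env = os.environ.get("HAM_APIKEY", "")

if key_from_env:
    print("Found a key in the environment")
else:
    print("No key in the environment; fall back to entering it by hand")
```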

## Basic request

We're going to start off with a basic request to the API. This API, like many others, has a variety of endpoints, each with its own URL, slightly modified from a base URL. We'll worry about the general case in a bit. For now, let's look at a basic API request.

In this example, we'll re-create the first example in the [Object endpoint documentation](https://github.com/harvardartmuseums/api-docs/blob/master/sections/object.md), which will give each of you the records for 10 objects that have never been viewed online in the museum's collections.

In [None]:
url = "https://api.harvardartmuseums.org/object"
parameters = {
    "q":"totalpageviews:0",
    "size":10,
    "apikey":APIKEY
}
R = requests.get(url,params=parameters)
R.json()

### Refresher on Dictionaries

Python dictionaries are sets of key / value pairs, where a value can be accessed by its key. You're essentially naming a value in a container, so you can easily call it up later.

Dictionaries have very fast lookups, so you can get a value from its key very quickly, no matter how large the dictionary is. Since Python 3.7 they also preserve insertion order when you iterate through the key / value pairs, but older versions made no such guarantee, so it's still best to look values up by key rather than by position.

We're just going to be looking up data in dictionaries, so here's a quick refresher on the syntax:

In [None]:
parameters['q']

In [None]:
parameters['apikey'] # This also works when we've set the value to another variable

In [None]:
parameters['q'] = "totalpageviews:1" # You can also set the value of a key like you would a variable
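
API responses are usually dictionaries nested inside lists nested inside dictionaries, so lookups tend to get chained. Here is a small sketch using a made-up `sample_response`; the values are invented, and only the `records`, `title`, and `totalpageviews` key names mirror what we see from the API.

```python
sample_response = {
    "records": [
        {"title": "First example", "totalpageviews": 0},
        {"title": "Second example", "totalpageviews": 3},
    ]
}

# Chain lookups: dictionary key, then list index, then dictionary key
first_title = sample_response["records"][0]["title"]

# .items() walks through the key / value pairs of one record
pairs = list(sample_response["records"][1].items())

# `in` checks whether a key exists before you index with it
has_century = "century" in sample_response["records"][0]

print(first_title, pairs, has_century)
```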

## Making a Request

The request syntax is so simple, you might have missed it. Let's query again for objects with only one pageview, and take a closer look.

In [None]:
R = requests.get(url,params=parameters)

### Formatted parameters

That call returned a response object (`requests.Response`), which contains not only the data that we get from the Harvard Art Museums, but also information on the request we sent, like the URL that it used. Notice that `requests` has turned our parameter dictionary into a query string at the end of our URL.

If you've been working with API requests or web scraping before, you might be used to seeing URLs get constructed like this:

```python
url = "https://api.harvardartmuseums.org/object?q=" + query + "&apikey=" + apikey
```

If you have, I'm sure you'll appreciate how much simpler this is, especially when dealing with more query parameters.

In [None]:
R.url
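
To see roughly what happens to the dictionary, the standard library's `urllib.parse.urlencode` performs the same kind of encoding without sending anything over the network. This is a sketch rather than a description of the internals of `requests`; `YOUR-KEY` is a placeholder, and `example_params` is a name made up for this illustration.

```python
from urllib.parse import urlencode

base_url = "https://api.harvardartmuseums.org/object"
example_params = {"q": "totalpageviews:1", "size": 10, "apikey": "YOUR-KEY"}

# urlencode joins key=value pairs with "&" and percent-encodes special characters
query_string = urlencode(example_params)
full_url = base_url + "?" + query_string
print(full_url)
```

Note that the colon in the query is percent-encoded as `%3A`, which is the sort of detail that is easy to get wrong when concatenating strings by hand.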

### Taking a look at the results

Response objects have a built-in method, `.json()`, which parses the JSON text received in the response body into native Python data structures: lists, dictionaries, numbers, and strings. We can use this method to see a dictionary representation of what we've gotten from the API request.

In [None]:
R.json()
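
The same conversion can be tried on its own with the standard library's `json` module. In this sketch `raw_text` is an invented, heavily shortened payload that only imitates the `info` / `records` layout of a real response.

```python
import json

# A made-up JSON string, shaped loosely like an API response
raw_text = '{"info": {"pages": 1}, "records": [{"title": "A"}, {"title": "B"}]}'

# json.loads turns JSON text into dictionaries, lists, numbers and strings
data = json.loads(raw_text)

print(type(data))
print(data["info"]["pages"])
titles = [record["title"] for record in data["records"]]
print(titles)
```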

## Changing our request

Let's say we're not interested in the most obscure parts of the collection (pot sherds, apparently), but rather in the most popular parts of the collection. There are a few ways we might go about doing this. One way might be to sort our search results by `totalpageviews`, and see what the top 10 are.

To do that, we can go back to the [Object API documentation](https://github.com/harvardartmuseums/api-docs/blob/master/sections/object.md) and look for hints about what we might be able to do.

In [None]:
parameters = {
    "size":10,
    "apikey":APIKEY,
    "sort": "totalpageviews",
    "sortorder": "desc"
}
R = requests.get(url,params=parameters)
R.json()

### Looking at the results

Often, you'll want to look at some specific aspect of the data you're getting. Since the API returns everything, you'll have to format the output in some friendly, readable format.

We're being pretty low-level with the text formatting here. One key to understanding this bit is that `"\t"` is an escape sequence meaning "tab": the backslash tells Python to insert a tab character rather than a literal letter t.

Feel free to play around with this cell to format it more to your liking. The string `format()` method allows you to interpolate variables or expressions into a string. You use curly braces (`{}`) in the string where you'd like to substitute variables; you can also use named arguments (`"Test {foo}".format(foo="bar")` prints "Test bar"). In this next cell, we'll iterate through the results and print them out in a nicer format.

In [None]:
records = R.json()['records']
print("views\tartwork")
print()
for record in records:
    print("{}\t{}".format(record['totalpageviews'],record['title']))
    # `.format` puts its arguments sequentially in the string calling it wherever there are {} pairs
    # It does a lot more than that, with more advanced documentation here: 
    # https://docs.python.org/3.4/library/string.html#id1
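
Here are a few variations on the same formatting idea, using an invented `sample` record so the cell runs without the API. The f-string at the end is an alternative spelling available in Python 3.6 and later, not something the rest of this notebook relies on.

```python
sample = {"totalpageviews": 1234, "title": "Example title"}

# Positional: each {} is filled in order
positional = "{}\t{}".format(sample["totalpageviews"], sample["title"])

# Named: the braces refer to keyword arguments
named = "{views}\t{title}".format(views=sample["totalpageviews"], title=sample["title"])

# Padded: {:>8} right-aligns the value in a column 8 characters wide
padded = "{:>8}  {}".format(sample["totalpageviews"], sample["title"])

# f-string: the expressions are written directly inside the braces
fstring = f"{sample['totalpageviews']}\t{sample['title']}"

print(positional)
print(padded)
```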

The top result from this query is a Van Gogh painting titled "Self-Portrait Dedicated to Paul Gauguin." You can grab just the first object by accessing the records list (which is indexed from 0):

In [None]:
topResult = R.json()['records'][0]
topResult

You can easily access properties from the object record:

In [None]:
topResult['title']

### Exercise
- Try using the `person` endpoint to search for information about Van Gogh. Get his `id` number.
- Try displaying all HAM works by Van Gogh using that `id`. Filter your results to only include records with an image associated.

In [None]:
# Write your code here

## More endpoints to love

You might notice, looking at the documentation, that we've only been accessing the `object` API endpoint, when there are many other endpoints that we could ask for information.

A note on terminology: an API endpoint is one place that you can go to ask specific questions about a certain part of a dataset or service. Many APIs, especially commercial APIs, contain many, many endpoints, to facilitate all sorts of different activity on a platform.

For example, you can take a look at the [reddit API documentation](https://www.reddit.com/dev/api/) (which we won't be using, this is just an example), to see all of the different endpoints that an application might need to serve as an alternative front end for reddit. 

Endpoints on the same API are likely to behave similarly, but they will all serve different purposes. Looking at our HAM endpoints, it looks like they all follow the same basic formulation: `https://api.harvardartmuseums.org/RESOURCE_TYPE`. We can use this to our advantage, and create a function to query any endpoint easily.

In [None]:
def ham_query(apikey, endpoint, **kwargs):
    """Sends kwargs to the specified endpoint, using apikey for authentication"""
    params = kwargs
    params['apikey'] = apikey
    url = "https://api.harvardartmuseums.org/{}".format(endpoint)
    R = requests.get(url,params=params)
    return R

In [None]:
response = ham_query(APIKEY, "gallery", floor=2)

In [None]:
response.json()

### Boy, that's convenient!

That function works because Python has this neat ability to take arbitrary arguments in functions, if you tell it to. Essentially, there are two special arguments in function definitions: `*args` and `**kwargs`. These make available `args` and `kwargs` objects, respectively, in your function. `args` is a tuple of the extra positional arguments, and `kwargs` is a dictionary of the extra keyword arguments. This means you don't have to specify every argument your function can take; you can just give it general rules for handling sequences or key / value pairs of data as input.

You might be wondering why you wouldn't just use a dictionary or list instead of those arguments. In our case, it's mostly a stylistic choice, and one that saves us a couple of keystrokes.
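
A tiny sketch makes the mechanics visible. `show_arguments` is a throwaway function invented for this illustration; it simply hands back whatever it collected.

```python
def show_arguments(*args, **kwargs):
    """Return the collected positional and keyword arguments unchanged."""
    return args, kwargs

# Extra positional arguments arrive as a tuple, keyword arguments as a dictionary
positional, keyword = show_arguments("object", 10, floor=2, size=5)
print(positional)
print(keyword)

# An existing dictionary can be unpacked into keyword arguments with **
filters = {"floor": 2, "size": 5}
same_positional, same_keyword = show_arguments("object", 10, **filters)
```

The `**filters` form is handy when you build up a set of query parameters in a dictionary first and only later pass them to a function like `ham_query`.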

#### Try out some other endpoints!

In [None]:
# Here's an example: all of the current exhibits with their begin and end dates
response = ham_query(APIKEY, "exhibition", status="current", size=100)
current_exhibits = response.json()['records']
current_exhibits
print()
for exhibit in current_exhibits:
    print("{} ({} to {})".format(exhibit['title'],exhibit['begindate'],exhibit['enddate'])) 

### Endpoint Exercises

[HAM API Documentation](https://github.com/harvardartmuseums/api-docs)

- Get a list of all the medium types in the museum
- How many levels of mediums are there?
- Choose a medium from the most specific (highest numerical value) level. Note the medium id. Create a new query to the object endpoint, using the medium id as a filter. How many objects are there created from this medium?
- Choose a level 2 medium instead. Create a new query to the object endpoint, again filtering by this medium id. Print out the medium types of the returned records. What do you notice about the types?
- BONUS: reorganize your list of medium objects into a nested structure so that child media are accessed through a list under their parents. Print out the list like so:
- Metal
    - Pb
    - potin
    - ...
    - copper alloy 
        - copper-antimony-arsenic alloy
        - copper-iron alloy
        - copper-tin-antimony-arsenic alloy
        - leaded copper-tin-antimony alloy
        - etc ...

In [None]:
# Write your code here

## Individual Objects and IIIF

The HAM object API can provide more information (such as `exhibition`, `citation`, `publication`, and `marks`) if you ask for a specific object by its objectid. For some records that have been extensively annotated (often those with `verificationlevel` == 4) the lists for these properties can contain hundreds of entries.

In [None]:
def ham_query_with_id(apikey, endpoint, ID, **kwargs):
    """Sends kwargs to the specified endpoint, using apikey for authentication. ID is a required arg (appears in the route)"""
    params = kwargs
    params['apikey'] = apikey
    url = "https://api.harvardartmuseums.org/{}/{}".format(endpoint,ID)
    R = requests.get(url,params=params)
    return R
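
Since the two query functions differ only in how the URL is built, that part could be factored out. The following is one possible sketch, not part of the museum's API: `ham_url` and `record_id` are names made up here, and `12345` is an arbitrary number rather than a real object ID.

```python
def ham_url(endpoint, record_id=None):
    """Build an endpoint URL, appending a record ID to the route when one is given."""
    base = "https://api.harvardartmuseums.org"
    if record_id is None:
        return "{}/{}".format(base, endpoint)
    return "{}/{}/{}".format(base, endpoint, record_id)

print(ham_url("gallery"))
print(ham_url("object", 12345))
```

A single query function could then call `ham_url` and pass its keyword arguments to `requests.get` exactly as before.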

In [None]:
objectid = topResult['objectid']
topResultFull = ham_query_with_id(APIKEY, "object", objectid)

print(topResultFull.url)
print("Verification Level 4: {}".format(topResultFull.json()['verificationlevel'] == 4))
print()
print(topResultFull.json())

When we printed the 10 most popular records above (under **Looking at the Results**), you may have noticed a sharp dropoff after the first few records. Our Van Gogh painting is particularly popular, with ~8000 more views than the second most popular record and more than 4x as many as the tenth most popular. This particular Art Museum record is used as the default image asset for the demo installation of [Project Mirador](http://projectmirador.org/demo/), an image viewer for [IIIF (International Image Interoperability Framework)](https://iiif.io/) media assets. 

We're not going to go deep into IIIF in this workshop, but want to mention that IIIF is both a community of developers and a collection of APIs and API-compliant tools that you can use to share, manipulate, and display visual materials. The [Image API](https://iiif.io/api/image/2.1/) and [Presentation API](https://iiif.io/api/presentation/2.1/) are the most used outputs as of now, though there are also APIs for Authentication, Search, and beta versions for other media (video and VR).

### IIIF Image API

Within our `topResultFull` object, there is an images list, which contains IIIF baseurls as well as Image Delivery Service URLs:

In [None]:
ham_images = topResultFull.json()['images']  # topResultFull is a response object, so parse it first
ham_images

This particular record has 6 images associated with it. Try copying and pasting some of the `baseimageurl`s in your browser:

In [None]:
for index, image in enumerate(ham_images, start=1):
    print("image {} baseimageurl: {}\nimage {} iiifbaseuri: {}".format(index,image['baseimageurl'],index,image['iiifbaseuri']))

You'll notice that the `baseimageurls` use Harvard's Name Resolution service, which redirects to an Image Delivery Service URL that displays the image. This is nice, but we're more interested in the `iiifbaseuris` because we can manipulate IIIF resources using the Image API. Try opening one of those. What happens?

The IIIF Image API spec requires that we pass not just a baseurl, but a well-formed IIIF-compliant URI to get an image. Let's check out that [documentation](https://iiif.io/api/image/2.1/) and see what else we need to construct one of those.

From the docs:

>The IIIF Image API URI for requesting an image must conform to the following URI Template:
>
>`{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`
>
>For example:
>
>`http://www.example.org/image-service/abcd1234/full/full/0/default.jpg`
>
>The parameters of the Image Request URI include region, size, rotation, quality and format, which define the characteristics of the returned image. These are described in detail in Image Request Parameters.
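
To make the template concrete, the example URI from the docs can be pulled apart into its parameters with plain string methods. This is just an illustration of the layout (the variable names are chosen here), not a general-purpose IIIF parser.

```python
example = "http://www.example.org/image-service/abcd1234/full/full/0/default.jpg"

# The last four path segments are region, size, rotation and quality.format
base_uri, region, size, rotation, filename = example.rsplit("/", 4)
quality, image_format = filename.rsplit(".", 1)

print(base_uri)
print(region, size, rotation, quality, image_format)

# Putting the pieces back together gives the original URI
rebuilt = "/".join([base_uri, region, size, rotation, quality + "." + image_format])
```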

The `iiifbaseuri`s include up through the `{identifier}`, but we need to include additional parameters to get the server to actually render the image for us. These parameters are passed within the URI itself, rather than in a query string appended after a delimiter (usually `?`), which is what we've been using `requests` to do. Let's write a function that can generate IIIF URIs for us. Because all of the parameters we want to insert are required, we won't use `**kwargs` - instead we'll set default params which you can override by passing in new ones.

In [None]:
def iiif_query(baseuri, region="full", size="full", rotation=0, quality="default", format="jpg", info=False):
    """Creates a valid IIIF URL, with the option to request image information"""
    # Make sure the base URI ends with a slash before appending to it
    if not baseuri.endswith("/"):
        baseuri += "/"
    if info:
        return baseuri + "info.json"
    return baseuri + "{}/{}/{}/{}.{}".format(region, size, rotation, quality, format)

Now let's try using this function to display the images and links within our notebook. We'll need to add a few more modules to do this: `display`, `Image`, and `HTML`, all from `IPython.display`.

In [None]:
from IPython.display import display, Image, HTML
for img in ham_images:
    image_url = iiif_query(img['iiifbaseuri'])
    display(HTML("<a href='{}'>{}</a>".format(image_url,image_url)))
    display(Image(url=image_url, height=200, width=200))

Now we have some valid image URLs! We've displayed the content of those URLs here directly using Jupyter's display and image libraries, but you can also open them in your browser directly!

This is nice, but the Image API lets us do a lot more just by passing in some parameters. Maybe we want to generate some square, grayscale images for a gallery:

In [None]:
for img in ham_images:
    image_url = iiif_query(img['iiifbaseuri'], quality="gray", region="square")
    display(HTML("<a href='{}'>{}</a>".format(image_url,image_url)))
    display(Image(url=image_url, height=200, width=200))

### IIIF Image API Exercise
Let's try requesting only the right half of an image (using `region`), in black and white, and getting back a PNG:

In [None]:
# Write your code here

Feel free to try to manipulate the images in other ways as well! That's it for our quick introduction to the Image API.

### IIIF Presentation API

If you're interested in the Presentation API (for presenting structured IIIF resources as part of a more fully-functional web app), check out [this documentation](https://iiif.io/api/presentation/2.1/) to learn how IIIF manifests structure sequences of canvases, which image viewers then present to end users. You can find a HAM object's manifest in the `seeAlso` field, or by appending the object ID to a baseurl:

In [None]:
print(topResult['seeAlso'])
print('https://iiif.harvardartmuseums.org/manifests/object/{}'.format(topResult['id']))

#### Mirador

You can consume these resources using [Mirador](http://projectmirador.org/), an image viewer which uses the IIIF Image and Presentation APIs. We used to use Mirador in a different version of this workshop which integrated Omeka, a content management system.

If you head to the [Project Mirador Demo](http://projectmirador.org/) page, you can add a new manifest in the top left ("four boxes icon" -> "Replace Object" -> "Add new object from URL"). Paste in your manifest URL there.

Example manifest URL for the second most viewed painting, "The Gare Saint-Lazare: Arrival of a Train": https://iiif.harvardartmuseums.org/manifests/object/228649

## More stuff!

So far, we've only been getting limited sets of object data. But what if there were a big query we wanted to make? Let's try it out on "Unidentified culture" materials in the museum.

In [None]:
unknown = ham_query(APIKEY, "object", culture="Unidentified culture", size=100).json()

Looking at our previous queries, it looks like we've got some information about our query in the "info" section. Let's take a look at that...

In [None]:
unknown['info']

### Iterating through pages

It looks like we have 7 pages of data to get, and our response gives us a "next" url for easy iteration. Nice!

However, let's look at how we would iterate even without this convenience factor.

In [None]:
unknown.keys()

It looks like we have two components to our response, info and records. Since `info` is request specific, we're just after `records`, and we'll want to combine them all. 

We could set this up in a regular loop, which would fire off requests as fast as the responses come back. In practice that is limited more by network speed than by processor speed, but it can still mean many queries per second. This can put a strain on the API, so it's good practice to build in a delay when making many requests. Sometimes an API will specify the number of requests per second that you're allowed to make, sometimes not. Putting even a fraction of a second of delay in your code will help make sure that you don't accidentally get yourself banned from the API.

In [None]:
import time

In [None]:
unknown_records = []
keepGoing = True
page = 1

while keepGoing:
    R = ham_query(APIKEY, "object", culture="Unidentified culture", size=100, page=page)
    time.sleep(0.5)
    response = R.json()
    unknown_records.extend(response['records'])
    if page >= response['info']['pages']:  # >= also stops the loop if the query reports zero pages
        keepGoing = False
    else:
        page += 1
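
For comparison, here is a sketch of the other approach: following the "next" URL that the response hands us. It assumes the last page's `info` simply has no `next` key, which you should confirm against a real response. `collect_records`, `fetch_json`, and the fake pages are all invented for this illustration, so the cell runs without touching the network.

```python
import time

def collect_records(first_url, fetch_json, delay=0.5):
    """Follow info['next'] links until none is left, pausing between requests."""
    records = []
    url = first_url
    while url:
        page = fetch_json(url)
        records.extend(page["records"])
        url = page["info"].get("next")  # None when there is no next page
        if url:
            time.sleep(delay)
    return records

# Two fake pages standing in for API responses, keyed by pretend URLs
fake_pages = {
    "page1": {"info": {"next": "page2"}, "records": [{"id": 1}, {"id": 2}]},
    "page2": {"info": {}, "records": [{"id": 3}]},
}

all_records = collect_records("page1", fake_pages.__getitem__, delay=0)
print(len(all_records))
```

Against the real API you might pass something like `lambda u: requests.get(u).json()` as `fetch_json`; keeping the fetching step separate also makes the loop easy to test offline.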

In [None]:
len(unknown_records)

In [None]:
unknown_records

# Enhancing Data

We've now used Python and the Art Museum API to create a custom dataset. We've also played with the IIIF Image API to programmatically alter images. In this next section, we're going to provide you space to experiment and enhance your dataset by using a third API to join an additional data source. We're going to provide some directions for an exercise using GeoNames and Folium, a wrapper for the JavaScript mapping library Leaflet. But you're also free to find another API such as Wikidata, the Getty Union List of Artist Names, or the Google Vision API.

## Overview
- Pick a current exhibit to examine; find fields that can be geocoded
- Use the GeoNames API to geocode those fields
- Display the points as markers on a Leaflet map

## Working with HAM Exhibits
Let's start by picking one current exhibit to examine. We've already stored this data in `current_exhibits`.
- We'll want to pick an exhibit that has a number of people associated with it, so that we have multiple locations to georeference.
- People are stored in `exhibit['records']['people']`. Take a look at these and see what we could geocode.
- Within each record, the number of associated people is stored in a `peoplecount` field (e.g. `exhibit['records']['peoplecount']`).
- You'll need to do a new query to the `object` API for each exhibit.
- Let's sum these up to get a quick count of people associated with each exhibit. Sort the exhibits by this count, and print those along with the exhibit names and IDs.

In [None]:
# Write code here!

Pick two of the top three exhibits to further examine.
- Store all the objects in those exhibits
- Store the exhibition information (endpoint `exhibition`) as well

In [None]:
username = 'cdc43339'
url = "http://api.geonames.org/searchJSON"
Q = {
    'q':'Cambridge',
    'username':username
}
R = requests.get(url,params=Q)
R.json()

## GeoNames API

Let's start by checking out the JSON version of the [GeoNames API](http://www.geonames.org/export/JSON-webservices.html).

- Write a function that hits `searchJSON` and geocodes a placename. It should return the latitude and longitude (`lat` and `lng`) of the top result. You'll need to include your GeoNames username in the API calls.

In [None]:
USERNAME = ''
def search_place(placename):
    """Searches the GeoNames searchJSON endpoint for a placename, returning a latitude and longitude"""
    # Your code here
    return lat, lng

- Write another function which takes an exhibit, gets all the people in the exhibit, and geocodes their birthplace and deathplace. The function should return a dictionary of people objects that also have `birthplace_coordinates` and `deathplace_coordinates` attributes. Be warned that not every person will have a birthplace and/or deathplace.

In [None]:
def geocode_exhibit_people_locations(exhibit):
    exhibit_people = {}
    for record in exhibit['records']:
        # your code here
        pass  # placeholder so the cell runs before you fill in the loop
    return exhibit_people
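
Because some people have no birthplace or deathplace, it helps to decide up front how to skip the gaps. The sketch below shows one way to do only that step; `places_to_geocode` and the sample people are made up, and the real person records may be shaped differently.

```python
def places_to_geocode(person):
    """Return (kind, placename) pairs for the places this person actually has."""
    places = []
    for kind in ("birthplace", "deathplace"):
        placename = person.get(kind)  # None if the key is missing
        if placename:  # also skips None values and empty strings
            places.append((kind, placename))
    return places

sample_people = [
    {"name": "Person A", "birthplace": "Cambridge", "deathplace": "Boston"},
    {"name": "Person B", "birthplace": None},
    {"name": "Person C"},
]

for person in sample_people:
    print(person["name"], places_to_geocode(person))
```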

## Folium and Leaflet

We can use [Folium](https://python-visualization.github.io/folium/index.html) to incorporate [Leaflet](https://leafletjs.com/reference-1.5.0.html), an open-source JavaScript library for interactive web maps. Here are some more Folium examples: https://nbviewer.jupyter.org/github/python-visualization/folium/tree/master/examples/.
- Check out both the Folium library documentation and the Leaflet API documentation to understand how Leaflet maps are implemented.
- Install Folium if you haven't already: `conda install folium -c conda-forge` or `pip install folium`
- Import Folium and create a map object. Folium has a number of different basemap options and other possible configurations - check out the documentation to customize the look of your map.

In [None]:
import folium
people_map = folium.Map(
    zoom_start=8,
    tiles='Stamen Watercolor'
)
people_map

- Create a function which can add people to the map as markers. The function should accept an object or list of people; a map, to which the markers will be added; and any other parameters you'd like
- Label each marker with the person's name and the location.
- Bonus points for creating different colored markers for birth and death locations and for styling the label / making it more readable.

In [None]:
def add_people_to_map(people, leaflet_map):
    """Adds people as markers to a Leaflet map"""
    for painting_id in people:
        p = people[painting_id]
        for person in p:
            # your code here
            pass  # placeholder so the cell runs before you fill in the loop

## Exporting

Now we have some cool data, but maybe we want to do something with it outside of Python. It's common to see CSV data traded around, since it's just a plain text spreadsheet file, so most things can parse it. Let's make one of those! We could use the relatively low-level `csv` library, but instead, let's use a higher-level library, `pandas`.

In [None]:
import pandas as pd # Common invocation of pandas. Gotta save those 4 keystrokes.

### "Be a dataframe!" - us

Pandas thinks of things in terms of dataframes, which will be familiar if you work in R. Basically, they're really efficient arrays of data. They also translate really well to a tabular format.

To make an iterable object into a dataframe, sometimes you can just get away with shouting "Hey you! Be a dataframe!" at it (in code). Since we have a list of dictionaries with consistent keys, there's a good chance this process will do something smart for us:

In [None]:
pd.DataFrame(unknown_records)

What do you know! It worked. But let's take a look at a more hands on approach to the same thing.

In [None]:
pd.DataFrame.from_dict(unknown_records)

`pd.DataFrame.from_dict` gives you more control over the conversion process, so you can provide more options if things don't look how you expect them to.

As a side note, we do have some data structures in here that don't make a lot of sense in a tabular format. Look at `worktypes` at the very end. That's a list, and each cell has list data in it. We won't be able to do much with that in Excel or some other tabular data processing tool, but it also won't break anything for us. It just looks weird. Within the dataframe, they still work like lists though, so you can access the data while you're still in Python if you're clever about it.
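
If you do want those nested values to survive a trip through a spreadsheet, one option is to turn them into JSON text before exporting. This is a sketch of that idea only; `flatten_record` is a made-up helper, and the sample record (including the inner `worktype` key) is invented.

```python
import json

def flatten_record(record):
    """Copy a record, replacing list and dictionary values with JSON strings."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, dict)):
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat

sample_record = {"title": "Example", "worktypes": [{"worktype": "painting"}]}
flat_record = flatten_record(sample_record)
print(flat_record)
```

Each cell is then a plain string, and `json.loads` can turn it back into a list later if needed.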

### Exporting

From here, our export process is really easy. We just say "Hey you! Be a CSV file now!", and so it shall be.

In [None]:
df = pd.DataFrame(unknown_records)
df.to_csv("unknown_ham_records.csv", index=False)  # index=False leaves the row numbers out of the file
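
For comparison, here is roughly what the lower-level `csv` route looks like. The rows are invented, and the output goes to an in-memory buffer rather than a file so that nothing is written to disk.

```python
import csv
import io

rows = [
    {"objectid": 1, "title": "First example"},
    {"objectid": 2, "title": "Second, with a comma"},
]

buffer = io.StringIO()
# DictWriter needs to be told the column names and their order up front
writer = csv.DictWriter(buffer, fieldnames=["objectid", "title"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The writer quotes the title containing a comma for us, but we still had to list the columns by hand, which is the bookkeeping that `pandas` takes care of.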

# Data collected!

Now we've got some interesting data and exported it. We could throw it at a program like Tableau or Excel to visualize it or further explore it. We could also continue to explore it in Python. Another option would be to remix it into an Omeka site. This is a good option if you're interested in exploring other RESTful methods, like `POST`, `PUT`, or `DELETE`. Check out the `Omeka` notebook for more information on how to pull off a big data heist!

*Not a real heist, we are using freely available data that the museum has generously made available. Please do not steal any physical art.*