<a href="https://colab.research.google.com/github/columbia-data-club/meetings/blob/main/2023/studio_open_data_and_rest_apis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![A blue background with a REST API logo and the Studio logo, a lion’s head and the word “Studio”](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/studio-images/rest-api.png)

# Open Data and REST APIs

March 31, 2023

by [Moacir P. de Sá Pereira](https://moacir.com) for the [Columbia University Libraries Studio](https://studio.cul.columbia.edu).

This is a low-impact Python notebook that shows researchers how to use open data that often sits behind a [REST API](https://en.wikipedia.org/wiki/Representational_state_transfer) or something similar. We’ll start with [The Metropolitan Museum of Art Collection API](https://metmuseum.github.io/). This should be a perfect introduction for digital humanists who don’t yet know how to turn JSON into a spreadsheet.

We’ll be using two Python libraries, both useful for any sort of data acquisition and analysis, [Requests](https://pypi.org/project/requests/) and [pandas](https://pandas.pydata.org/), but we hope the touch will be light, as the goal is to describe how REST (or REST-like) APIs work and how humans (and humanists) can make use of them.

## What Is an API?

“API” stands for “[Application Programming Interface](https://en.wikipedia.org/wiki/API),” and the term is often used somewhat imprecisely. Strictly speaking, an interface is the means by which something (typically a computer program) can access or use an “application.”

One way of thinking about this is to consider the lock on your apartment door. The lock exposes an interface that allows something else, typically a key, to lock or unlock it. Furthermore, that interface is typically “documented” in some way, meaning that a user can fashion their key following the documentation and thereby lock and unlock the door. The documentation can describe the pattern of teeth and grooves for a mechanical lock, or the number required for a numpad lock, or the particular encoding needed for a keycard lock.

The door lock has one input (the key), and it can then lock or unlock, so the interface is rather straightforward. Furthermore, the user of the lock does not have to understand exactly what the lock’s mechanism is. They simply need to know that if they use the key in a certain way, the lock will either lock or unlock. This lets the creator of the lock only “expose” the necessary components of the lock to the user and “obscure” the internal details, about which the user probably doesn’t care too much.

In other words, I don’t know (or need to know) how the lock to my apartment works. I just know that the key I have will unlock and lock it, though sometimes I need to jiggle it a bit since the interface isn’t perfect!

APIs for computer programs can be far more complex than a lock, but the idea is the same: there exist a finite number of means by which a user can interact with the computer program, and those means tend to be documented, so programmers can use the programs predictably; they can be confident that a certain set of inputs will generate the outputs they want. The internal details are not important to the user/programmer.
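
To make the “expose” versus “obscure” distinction concrete, here is a toy sketch of the lock as a Python class. The class name, the `turn` method, and the key format are all invented for illustration. The only documented way in is `turn`, and how the lock decides whether a key fits stays an internal detail.

```python
class DoorLock:
    """A toy lock: `turn` is the entire public interface."""

    def __init__(self, secret_cut):
        # Internal detail, hidden by convention (leading underscore).
        self._secret_cut = secret_cut
        self.locked = True

    def turn(self, key):
        """Toggle the lock if the key fits; report whether it worked."""
        if key != self._secret_cut:
            return False
        self.locked = not self.locked
        return True


lock = DoorLock(secret_cut="3-1-4-2")
print(lock.turn("0-0-0-0"), lock.locked)  # wrong key: nothing changes
print(lock.turn("3-1-4-2"), lock.locked)  # right key: the door unlocks
```

A user of `DoorLock` only needs to know that `turn` takes a key and reports success, which is the same position we will be in with a Web API.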




## What Is a REST or REST-like API?

APIs are widespread in software engineering, but often people will use the term a bit imprecisely to refer specifically to “[Web APIs](https://en.wikipedia.org/wiki/Web_API),” what I am calling “REST or REST-like APIs.” The Web API is still a documented set of inputs made available to the user/programmer, but the difference is that the user/programmer typically interacts with the application over the Web, using different forms of URLs to access different parts of the API. 

Web APIs make it possible for websites to effectively talk to each other. You could hypothetically make a personal homepage that uses [Twitter’s API](https://developer.twitter.com/en) to load your ten most recent Tweets on one part of your webpage, [Reddit’s API](https://www.reddit.com/dev/api/) to load your Reddit profile on another part of your webpage, and [Facebook’s Graph API](https://developers.facebook.com/docs/graph-api) to load information from your Facebook in another part of your webpage.

What’s more, you can often use APIs not only to `GET` information ([`GET` is a “verb” that is part of the HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/GET)), but also to [`POST`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST) new information. It’s via Twitter’s Web API that many Twitter bots are able to post “automatically.” They don’t click on a webpage and type in a message; they build a payload and `POST` that payload to the API, which creates a Tweet.
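
As a rough illustration of the difference between the two verbs, the sketch below builds (but never sends) one `GET` and one `POST` request using only Python’s standard library. The `api.example.com` address, the parameter names, and the payload are hypothetical stand-ins, not any real service’s API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration; nothing is sent.
BASE = "https://api.example.com/posts"

# GET: the inputs travel in the URL as a query string.
get_request = Request(BASE + "?" + urlencode({"user": "studio", "limit": 10}))

# POST: the inputs travel in the request body (the "payload").
payload = json.dumps({"text": "Hello from a bot"}).encode("utf-8")
post_request = Request(
    BASE,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

The `GET` request carries everything in its URL, while the `POST` request keeps the same URL and moves the data into the body.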

[REST APIs](https://en.wikipedia.org/wiki/Representational_state_transfer) are a specific type of Web API that rely on specific principles and constraints to make the API somewhat predictable; if you know a Web API is a REST API, you can save on reading some documentation! RESTful APIs can be very useful when designing your own web application, and popular web frameworks often have built-in logic for creating or interacting with RESTful APIs.

Though “REST” is in the title of this notebook and the Web API we will be using, [The Metropolitan Museum of Art Collection API](https://metmuseum.github.io/), presents itself as RESTful, there won’t be any more discussion of REST!

## The Metropolitan Museum of Art Collection API

[The Metropolitan Museum of Art Collection API](https://metmuseum.github.io/) lets us connect, programmatically (as in, with a computer program like this notebook), to The Met’s entire collection. We can feed the API search terms and get data about the 470k+ artworks in The Met’s collection. 

Choosing The Met’s API is a bit of a funny choice, since you can actually [download the entire dataset](https://github.com/metmuseum/openaccess/blob/master/MetObjects.csv?raw=true) from GitHub as a 300MB spreadsheet. The download obviates the need for the API, since we can query the downloaded data however we want to, without needing to use the API’s specific points of entry. Of course, loading a 300MB spreadsheet into Excel may not be a lot of fun.

We’ll be using two specific “endpoints” for our API access, the search endpoint, which lets us query the collection, and the object endpoint, which lets us get data about specific objects. Endpoints are what they sound like: they are the URLs you use to access those specific points of entry into the API, and you can attach information to the endpoints to get specific information.

### Search

We can see that the [search endpoint](https://metmuseum.github.io/#search) (`/public/collection/v1/search`) lets us attach several query parameters that let us fine-tune our query. The `q` parameter is for the query itself, but we can limit the query to works that have images, for example, by also making use of the `hasImages` parameter.

When we execute a search, the API returns for us a list of `objectIDs` (as well as the total number of objects) that match our query. But the `objectIDs` by themselves don’t tell us anything. We need to then use the object endpoint for each `objectID` to get the data about each individual object.
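
Attaching query parameters to an endpoint amounts to adding a query string to its URL. Here is a minimal sketch using the standard library, with no request actually sent. The helper name `build_search_url` is made up for this example; the base URL, `q`, and `hasImages` come from The Met’s documentation as described above.

```python
from urllib.parse import urlencode

API_URL = "https://collectionapi.metmuseum.org/public/collection/v1/"


def build_search_url(q, **filters):
    """Return the search endpoint URL with `q` and any extra filters attached."""
    params = {"q": q, **filters}
    return API_URL + "search?" + urlencode(params)


print(build_search_url("Rayyane Tabet"))
print(build_search_url("Rayyane Tabet", hasImages="true"))
```

Pasting either printed URL into a browser should be equivalent to what Postman and Requests do for us later on.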

### Object

The [object endpoint](https://metmuseum.github.io/#object) (`/public/collection/v1/objects/[objectID]`) has no parameters other than the `objectID`, which makes up a part of the URL itself. But where the search endpoint only returns two values (total number of hits and list of `objectIDs`), the object endpoint returns over 50 properties of the object, though of course not every object has values for all of those properties. 
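
Because many of those properties come back empty, it can help to filter a record down to the ones that actually hold a value. The sketch below does that on a hand-copied excerpt of the record for object `820847` (the full record appears later in this notebook); the helper name `filled_properties` is hypothetical.

```python
# A short excerpt of one object record, copied by hand for illustration.
sample = {
    "objectID": 820847,
    "isHighlight": False,
    "primaryImage": "",
    "additionalImages": [],
    "department": "Ancient Near Eastern Art",
    "title": "Orthostates",
    "culture": "",
    "artistDisplayName": "Rayyane Tabet",
}


def filled_properties(record):
    """Keep properties whose value is not an empty string, empty list, or None."""
    return {key: value for key, value in record.items() if value not in ("", [], None)}


print(list(filled_properties(sample)))
```

Note that `False` counts as a real value here, so `isHighlight` survives the filter.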

But let’s see this in action.

## Postman

Before accessing The Met’s API with Python, I want to first access it using [Postman](https://www.postman.com/). Postman is a webapp that lets users interact with Web APIs in a way that might be a bit friendlier than a Python notebook, at first. 

1. Go to postman.com and log in or create an account (you can use your Google credentials if you like). 
2. Get to your workspace, and click “New” and choose “HTTP Request.”
3. For the URL, beside the “GET” button, paste in `https://collectionapi.metmuseum.org/public/collection/v1/search`
4. Under “Query Params,” for “Key,” type in “q”, and for the “Value,” type in whatever artist you want to find. I’ll be searching for [Rayyane Tabet](https://en.wikipedia.org/wiki/Rayyane_Tabet), but you can type whatever you want—and it does not have to be an artist’s name.
5. Click on “Send,” and results should appear at the bottom, something like this:

![A screenshot of the Postman interface showing a search for Rayyane Tabet using The Met’s API and the server response](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/studio-images/met-api-tabet-search.png)

And the results will look something like this:

```json
{
    "total": 5,
    "objectIDs": [
        820847,
        324008,
        324010,
        324009,
        324007
    ]
}
```

This is [JSON](https://en.wikipedia.org/wiki/JSON), a means of sending data over the web (among other things), and we can see that it returned two properties: `total` and `objectIDs`. The former has a value of a single number (`5`), and the latter is a list of IDs, where each ID corresponds to a work that matched our query. 
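
In Python, JSON like this maps directly onto dictionaries and lists. As a small sketch, the standard library’s `json` module can parse the response text shown above (pasted in here as a string rather than fetched):

```python
import json

# The response body from above, pasted in as plain text.
response_text = """
{
    "total": 5,
    "objectIDs": [820847, 324008, 324010, 324009, 324007]
}
"""

result = json.loads(response_text)
print(result["total"])          # a single number
print(result["objectIDs"][0])   # the first ID in the list
print(len(result["objectIDs"]) == result["total"])
```

Once parsed, `result` is an ordinary dictionary, so the two properties are reached with ordinary key lookups.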

We can now pick one of those IDs and use the object endpoint.

1. Click on “New” again and choose “HTTP Request.”
1. For the URL, paste in `https://collectionapi.metmuseum.org/public/collection/v1/objects/:objectID`
1. Because you typed `:objectID` in the URL, a section called “Path Variables” should now appear under “Query Params.” For the `objectID` key’s value, paste in one of the `objectIDs` from above. I’ll use `324008`, because it has images.
1. When you click on “Send,” you should get something like this at the bottom:

![A screenshot of the Postman interface showing a request for a single object using The Met’s API and the server response](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/studio-images/met-api-object-hittite-fragment.png)

Now, Tabet was born in 1983, and this piece came to The Met in 1943, as we can tell from the `accessionYear` value. This means that for some reason, this piece was returned when we did our query, even though the piece is not by the artist in question. Nevertheless, we are treated to a few images of the piece:

![A photo of a lion-hunt scene from a Neo-Hittite relief](https://images.metmuseum.org/CRDImages/an/web-large/DP-16679-001.jpg)

So let’s change the `objectID` to `820847` so we can get an actual piece by Tabet, Orthostates. One of the properties we’re given is the `objectURL`, which in this case happens to be https://www.metmuseum.org/art/collection/search/820847, so we can click on that to learn more about the piece. There we can see that Orthostates is linked to the four Neo-Hittite pieces because it is “composed of thirty-two charcoal rubbings made by the artist from basalt fragments of a 10th–9th century B.C. Neo-Hittite frieze originally from the site of Tell Halaf,” and The Met happens to own four of those fragments.

But what if we wanted to get the data on all five pieces and save them to a spreadsheet? Here’s where we can start using Python.

One last note, though. There are some [sample API calls to various museums’ collections](https://www.postman.com/opamcurators/workspace/open-access-museums/documentation/1501710-22671d6d-74c0-44af-b258-3fa06f4c920c) available as a Postman collection that you can try out.

## Accessing The Met with Python

As mentioned above, we’ll be accessing The Met’s API with the [Requests](https://pypi.org/project/requests/) library, and then we will collect the information into a [pandas](https://pandas.pydata.org/) dataframe before exporting it out. The process is remarkably similar to using Postman, but we have extra tricks up our sleeve. We’ll start by importing our libraries and setting a few constants. The first constant is the URL for The Met’s API. The second constant is how long to wait between API calls. Typically, there is a limit on how many API calls one can make in a certain period of time. For The Met, it’s 80 calls per second, so we’ll tell Python to wait $\frac{1}{80}$ of a second between each call. We need to import the `time` module from Python’s standard library to tell Python to sleep.

In [22]:
import time
import requests
import pandas as pd

api_url = "https://collectionapi.metmuseum.org/public/collection/v1/"
sleep_time = 1 / 80
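
Sleeping for a fixed $\frac{1}{80}$ of a second before every call is the simplest way to stay under the limit. As an alternative sketch (not what this notebook does), the hypothetical `Throttle` class below only sleeps for whatever part of the interval has not already passed, which matters when each request itself takes a noticeable amount of time.

```python
import time


class Throttle:
    """Block just long enough to keep calls at or under `calls_per_second`."""

    def __init__(self, calls_per_second):
        self.min_interval = 1 / calls_per_second
        self._last_call = None  # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(calls_per_second=80)
start = time.monotonic()
for _ in range(5):
    throttle.wait()  # an API call would go right after this line
elapsed = time.monotonic() - start
print(f"minimum interval: {throttle.min_interval} s")
```

The first call never waits, so five calls are spaced by four intervals of at least 0.0125 seconds each.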

Next, we’ll write a pair of functions for querying The Met’s API. The first is a generic function that makes an HTTP `GET` request and handles errors. The second queries the search endpoint, and I’m going to make it a bit complex in order to make it thorough. By that I mean its comments list the entire set of query parameters one can submit.

In [25]:
def make_request(url, params=None):
  # Return the response on success; print the problem and return None otherwise.
  try:
    response = requests.get(url, params=params)
    if response.status_code == 200:
      return response
    else:
      raise Exception(f"Status code for {url} was {response.status_code}")
  except Exception as e:
    print(f"ERROR: {str(e)}")
    return None

def search_endpoint(payload):
  # q = search term
  # isHighlight = true or false
  # title = true or false
  # tags = true or false
  # departmentId = integer
  # isOnView = true or false
  # artistOrCulture = true or false
  # medium = string
  # hasImages = true or false
  # geoLocation = string
  # dateBegin = integer
  # dateEnd = integer. Need both.
  time.sleep(sleep_time)
  response = make_request(api_url + "search", payload)
  # make_request returns None on failure, so fall back to an empty result
  if response is None:
    return {"total": 0, "objectIDs": []}
  return response.json()


In [26]:
payload = {
    "q": "Rayyane Tabet",
}

response = search_endpoint(payload)
print(f'Searching for "{payload["q"]}" yielded {response["total"]} objects.')
print(f"Their objectIDs are {response['objectIDs']}")

Searching for "Rayyane Tabet" yielded 5 objects.
Their objectIDs are [820847, 324008, 324010, 324009, 324007]


We can now make a second function for accessing the object endpoint, to which we can feed `objectIDs`.

In [30]:
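def object_endpoint(objectID):
  time.sleep(sleep_time)
  response = make_request(api_url + "objects/" + str(objectID))
  # make_request returns None on failure, so fall back to an empty record
  if response is None:
    return {}
  return response.json()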


In [31]:
object_endpoint(820847)


{'objectID': 820847,
 'isHighlight': False,
 'accessionNumber': '2019.288.1–.32',
 'accessionYear': '2019',
 'isPublicDomain': False,
 'primaryImage': '',
 'primaryImageSmall': '',
 'additionalImages': [],
 'constituents': [{'constituentID': 207170,
   'role': 'Artist',
   'name': 'Rayyane Tabet',
   'constituentULAN_URL': '',
   'constituentWikidata_URL': '',
   'gender': ''}],
 'department': 'Ancient Near Eastern Art',
 'objectName': '',
 'title': 'Orthostates',
 'culture': '',
 'period': '',
 'dynasty': '',
 'reign': '',
 'portfolio': '',
 'artistRole': 'Artist',
 'artistPrefix': '',
 'artistDisplayName': 'Rayyane Tabet',
 'artistDisplayBio': 'Lebanese, born 1983',
 'artistSuffix': '',
 'artistAlphaSort': 'Tabet, Rayyane',
 'artistNationality': 'Lebanese',
 'artistBeginDate': '1983',
 'artistEndDate': '9999',
 'artistGender': '',
 'artistWikidata_URL': '',
 'artistULAN_URL': '',
 'objectDate': '2017-ongoing',
 'objectBeginDate': 2017,
 'objectEndDate': 2017,
 'medium': '32 charcoal 

Now we can combine the search and object functions and make use of the `title` property on each object to make a little report.

In [33]:
search = search_endpoint({"q": "Rayyane Tabet"})
print(f'The search yielded {search["total"]} objects.')
print("Their titles are:")
for objectID in search["objectIDs"]:
  # "obj" rather than "object", which would shadow a Python builtin
  obj = object_endpoint(objectID)
  print(f"    * {obj.get('title', '(title unavailable)')}")

The search yielded 5 objects.
Their titles are:
    * Orthostates
    * Orthostat relief: lion-hunt scene
    * Orthostat relief: winged human-headed lion
    * Orthostat relief: lion attacking a deer
    * Orthostat relief: seated figure holding a lotus flower


Finally, let’s choose some of the properties of the object and create a pandas dataframe out of our responses.

In [36]:
columns = ["objectID", "title", "artistDisplayName", "objectDate", 
           "accessionNumber", "accessionYear", "isPublicDomain",
           "department", "objectURL"]
objects = [object_endpoint(objectID) for objectID in search["objectIDs"]]
df = pd.DataFrame(objects, columns=columns)
df.head()

| | objectID | title | artistDisplayName | objectDate | accessionNumber | accessionYear | isPublicDomain | department | objectURL |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 820847 | Orthostates | Rayyane Tabet | 2017-ongoing | 2019.288.1–.32 | 2019 | False | Ancient Near Eastern Art | https://www.metmuseum.org/art/collection/searc... |
| 1 | 324008 | Orthostat relief: lion-hunt scene | | ca. 10th−9th century BCE | 43.135.2 | 1943 | True | Ancient Near Eastern Art | https://www.metmuseum.org/art/collection/searc... |
| 2 | 324010 | Orthostat relief: winged human-headed lion | | ca. 10th−9th century BCE | 43.135.4 | 1943 | True | Ancient Near Eastern Art | https://www.metmuseum.org/art/collection/searc... |
| 3 | 324009 | Orthostat relief: lion attacking a deer | | ca. 10th−9th century BCE | 43.135.3 | 1943 | True | Ancient Near Eastern Art | https://www.metmuseum.org/art/collection/searc... |
| 4 | 324007 | Orthostat relief: seated figure holding a lotu... | | ca. 10th−9th century BCE | 43.135.1 | 1943 | True | Ancient Near Eastern Art | https://www.metmuseum.org/art/collection/searc... |


Um. That really was straightforward, wasn’t it? Let’s export the dataframe as a CSV, then, and call it a day!

In [37]:
# index=False leaves the dataframe's row numbers out of the file
df.to_csv("met_objects.csv", index=False)
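
If pandas isn’t available, roughly the same “pick some columns and write a CSV” step can be sketched with the standard library’s `csv` module. This is an alternative to the approach above rather than a replacement for it. The two records are trimmed copies of values shown earlier, and the text is written to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import csv
import io

columns = ["objectID", "title", "accessionYear"]

# Trimmed-down records; each has one property we do not want to export.
objects = [
    {"objectID": 820847, "title": "Orthostates", "accessionYear": "2019",
     "department": "Ancient Near Eastern Art"},
    {"objectID": 324008, "title": "Orthostat relief: lion-hunt scene",
     "accessionYear": "1943", "department": "Ancient Near Eastern Art"},
]

buffer = io.StringIO()
# extrasaction="ignore" drops any keys that are not listed in `columns`.
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
writer.writeheader()
writer.writerows(objects)

csv_text = buffer.getvalue()
print(csv_text)
```

Swapping the buffer for `open("met_objects.csv", "w", newline="")` would write the same text to disk.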

That’s all. The only place left to go is up, probably by doing lateral moves. We can start trying different query parameters and comparing the results with what Postman returns. We can start doing some exploratory data analysis on our results. Or we can abandon The Met’s API and [use another Museum API](https://www.postman.com/opamcurators/workspace/open-access-museums/documentation/1501710-22671d6d-74c0-44af-b258-3fa06f4c920c) to investigate their collection. Or even transcend the museum milieu and practice with any of the Web APIs [Todd Motto tracks](https://github.com/toddmotto/public-apis).

The goal here has been to break down the “working with an API” problem into its constituent parts:

1. Learn about the API from the documentation
1. Query the API
1. Manage the results of the query and do additional requests as needed
1. Collect the results into a dataframe
1. Download the dataframe for further use as a spreadsheet.
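
The middle steps above can be sketched as one small function. To keep the example runnable without a network connection, the two endpoint functions are passed in as arguments and replaced here with offline stand-ins (`fake_search` and `fake_object`, both invented for this sketch) that return values seen earlier in the notebook.

```python
def collect_records(search, fetch_object, query, columns):
    """Run a search, fetch each hit, and keep only the chosen columns."""
    result = search({"q": query})
    rows = []
    # "or []" guards against an API handing back null instead of a list.
    for object_id in result["objectIDs"] or []:
        record = fetch_object(object_id)
        rows.append({column: record.get(column) for column in columns})
    return rows


# Offline stand-ins for the two endpoints, using titles shown earlier.
TITLES = {
    820847: "Orthostates",
    324008: "Orthostat relief: lion-hunt scene",
}


def fake_search(payload):
    return {"total": len(TITLES), "objectIDs": list(TITLES)}


def fake_object(object_id):
    return {"objectID": object_id, "title": TITLES[object_id],
            "department": "Ancient Near Eastern Art"}


rows = collect_records(fake_search, fake_object, "Rayyane Tabet", ["objectID", "title"])
print(rows)
```

Passing in this notebook’s `search_endpoint` and `object_endpoint` instead of the stand-ins should give the live version, and the resulting list of dictionaries can go straight into `pd.DataFrame`.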

We lucked out (well, it was intentional) in that The Met’s API did not require any authentication or requesting an API key for access. That can complicate the process, of course, but ultimately it just adds a step between the first two above. Furthermore, once the authentication is captured in code, you can reuse it as much as you need.
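
What that extra step looks like depends entirely on the API. As one hedged sketch, many services expect a key or token in an HTTP header. The `Authorization: Bearer` scheme, the `api.example.com` address, and the helper name below are all assumptions for illustration; a given API’s documentation will say what it actually wants.

```python
from urllib.request import Request


def authorized_request(url, api_key):
    """Attach a bearer token to a request (the header scheme is an assumption)."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


# Hypothetical URL and placeholder key; the request is built but never sent.
request = authorized_request("https://api.example.com/v1/items", "YOUR-API-KEY")
print(request.get_header("Authorization"))
```

Wrapping the authentication in one function like this is what makes it reusable across every later call.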

Happy data gathering!