## Querying crossref


### Step 1: load packages

* **jsonlite** for working with JSON
* **purrr** for working with vectors

If necessary, install the packages before loading them.

In [3]:
#install.packages("jsonlite")
#install.packages("purrr")
library(jsonlite)
library(purrr)

### Step 2: specify the query parameters

The crossref API is well documented (see [here](https://github.com/CrossRef/rest-api-doc)) and offers many functionalities.

For our use case, we want to retrieve *DOIs of publications by authors who are affiliated with a specific organization (Humboldt-Universität zu Berlin)*.

Crossref has not yet implemented organisation identifiers in the affiliation information (see [here](https://data.crossref.org/schemas/common4.4.2.xsd)). Therefore, we have to rely on searching for name variants and postprocess the results to remove duplicates etc.

We can query the *affiliation* field of crossref via:
* the endpoint *works* and
* the field query parameter *query.affiliation*.

For querying the name variants, we choose:
* humboldt+universität+berlin
* humboldt+university+berlin
* hu+berlin

We store the base query and name variants in two objects:

In [5]:
base_query <- "https://api.crossref.org/works?query.affiliation="
name_variants <- list("humboldt+universit%C3%A4t+berlin", "humboldt+university+berlin", "hu+berlin")
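The percent-encoded umlaut in the first name variant does not have to be typed by hand. The following is a minimal sketch of how the same strings could be produced from plain text with base R; *encode_variant* and *plain_variants* are hypothetical names introduced here for illustration, and replacing the encoded spaces with "+" is only done to match the variants listed above.

```r
# plain-text name variants ("\u00e4" is the letter a-umlaut)
plain_variants <- c("humboldt universit\u00e4t berlin",
                    "humboldt university berlin",
                    "hu berlin")

# encode_variant() is a hypothetical helper, not part of any package
encode_variant <- function(x) {
  # percent-encode everything that is not a letter, digit or . _ ~ -
  encoded <- utils::URLencode(enc2utf8(x), reserved = TRUE)
  # use "+" instead of "%20" for spaces, as in the variants above
  gsub("%20", "+", encoded, fixed = TRUE)
}

encoded_variants <- vapply(plain_variants, encode_variant, character(1),
                           USE.NAMES = FALSE)
print(encoded_variants)
```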

### Step 3: politely using APIs

APIs offer valuable services to many people and organizations. Therefore, it is important to *politely* use APIs and not burden them with too many requests.  Some services regularly make data dumps available, so you might not even have to use the service's API.

Some APIs specify polite use in their documentation, including crossref (see [here](https://github.com/CrossRef/rest-api-doc#etiquette)). To comply with the API etiquette, we will append the *mailto* parameter to the query. This allows crossref to contact us in case there are any issues with our query.

In [6]:
# change this value to your email address:
mailto <- "&mailto=janedoe@example.org"
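One simple way to avoid sending many requests in quick succession is to pause between calls. The sketch below wraps any function so that consecutive calls start at least a given number of seconds apart. *throttle* is a hypothetical helper, and the one-second default is an assumption chosen for illustration, not a limit stated by crossref.

```r
# throttle() is a hypothetical helper: it returns a version of `f` whose
# calls start at least `min_interval` seconds apart
throttle <- function(f, min_interval = 1) {
  last_call <- NULL
  function(...) {
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      # wait for the remainder of the interval, if any
      if (elapsed < min_interval) Sys.sleep(min_interval - elapsed)
    }
    last_call <<- Sys.time()
    f(...)
  }
}

# usage (not run here, to avoid sending requests):
# polite_fromJSON <- throttle(jsonlite::fromJSON, min_interval = 1)
# results <- polite_fromJSON(url)
```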

### Step 4: build the query URLs

Now we have all the components to build the query URLs. We use the function *paste0*, which concatenates strings without adding space between the components. Because *base_query* and *mailto* are single strings while *name_variants* has three elements, we use *rep* to repeat them three times, once per name variant. (*paste0* would also recycle the shorter inputs automatically; *rep* just makes this explicit.)

In [7]:
urls <- paste0(rep(base_query, 3), name_variants, rep(mailto, 3))
print(urls)

[1] "https://api.crossref.org/works?query.affiliation=humboldt+universit%C3%A4t+berlin&mailto=janedoe@example.org"
[2] "https://api.crossref.org/works?query.affiliation=humboldt+university+berlin&mailto=janedoe@example.org"      
[3] "https://api.crossref.org/works?query.affiliation=hu+berlin&mailto=janedoe@example.org"                       
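As a small check of how *paste0* treats inputs of different lengths, the sketch below builds the URLs once with *rep* and once without and compares the two results. The object names *urls_with_rep* and *urls_recycled* are introduced here only for illustration.

```r
base_query <- "https://api.crossref.org/works?query.affiliation="
name_variants <- list("humboldt+universit%C3%A4t+berlin",
                      "humboldt+university+berlin",
                      "hu+berlin")
mailto <- "&mailto=janedoe@example.org"

# explicit repetition, as in the cell above
urls_with_rep <- paste0(rep(base_query, 3), name_variants, rep(mailto, 3))

# relying on paste0 to recycle the length-one strings
urls_recycled <- paste0(base_query, name_variants, mailto)

# compare both approaches
identical(urls_with_rep, urls_recycled)
```

Using `rep(base_query, length(name_variants))` instead of a fixed 3 would keep the explicit version working if name variants are added later.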


### Step 5: exploring the results

The crossref API returns results in JSON, a common data format. We use *fromJSON*, pass it one of the query URLs to retrieve and convert results, and store them in the object *results*.

Information in JSON objects is often stored similarly to nested lists with names. Therefore, we can access a specific piece of information by subsetting the *results* object using names. Here, we want to access DOIs, publication year and type of publications matching the query. Notice that results are returned in different classes - two character vectors and one data frame.

By default, crossref returns 20 items at a time.

You can find out how many items match your query by accessing *total-results* - in this case, we have more than 80,000 matches!

In [9]:
# using fromJSON + query URL to retrieve and parse results
results <- jsonlite::fromJSON("https://api.crossref.org/works?query.affiliation=hu+berlin&mailto=janedoe@example.org")

# access the results by subsetting
results$message$items$DOI
results$message$items$published
results$message$items$type

# total number of items matching the query
results$message$`total-results`

| | date-parts `<list>` |
|---|---|
| 1 | 2007, 12, 15 |
| 2 | 2015, 1 |
| 3 | 2019, 12, 6 |
| 4 | 2000, 6 |
| 5 | 2018, 5, 25 |
| 6 | 2021, 3, 10 |
| 7 | 2019, 1 |
| 8 | 2015, 1 |
| 9 | 2019, 3 |
| 10 | 2006, 3, 1 |
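To try out this kind of subsetting without sending a request, the sketch below parses a small hand-written JSON string that imitates the shape of a crossref response. The DOIs and all other values in it are made up for illustration, and a real response contains many more fields.

```r
library(jsonlite)
library(purrr)

# made-up JSON that mimics the structure of a crossref response
sample_json <- '{
  "message": {
    "total-results": 2,
    "items": [
      {"DOI": "10.1234/example.1", "type": "journal-article",
       "published": {"date-parts": [[2007, 12, 15]]}},
      {"DOI": "10.1234/example.2", "type": "book-chapter",
       "published": {"date-parts": [[2015, 1]]}}
    ]
  }
}'

sample <- fromJSON(sample_json)

# names containing a hyphen have to be wrapped in backticks
sample$message$`total-results`

# character vectors
sample$message$items$DOI
sample$message$items$type

# "published" is a nested data frame; the first entry of each
# date-parts element is the year
years <- unlist(map(sample$message$items$published$`date-parts`, 1))
years
```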


### Step 6: retrieving all DOIs matching the query

**To reduce the load on the crossref API, we will not execute this step in the workshop - I will provide you with the data necessary to proceed.**

To retrieve the information on all matches, we have to iterate through the results. For this purpose, crossref offers *cursors*. They work like this: to your first query, you add the *cursor* parameter with the value "\*". Alongside the results, crossref returns a *next-cursor* field. You can use this value as the *cursor* parameter of your next request to access the next set of items, and so on.

We can implement this in R using a *while loop*, a useful form of iteration if you don't exactly know how long a sequence is. 

In the example below, we first add "&cursor=\*" to the first of our query URLs, pass that URL to *fromJSON*, and store the result in *results*. Next, we extract the DOIs, publication year (since this is a data frame, extraction is a little more complex), and resource type. We store the next cursor in *next_cursor*.

We then initiate a while loop that repeats as long as a condition holds. Here, the loop continues while crossref returns a next cursor and the previous response still contained items; according to the crossref documentation, a response with fewer items than requested (eventually an empty list) signals that all matches have been returned. Within the loop, we use *append* to add the new results to the objects we previously created for storing DOIs, publication year and resource type.

In [1]:
# DO NOT RUN THIS!

# first request with "cursor=*"
#results <- fromJSON(paste0(urls[1], "&cursor=*"))

# extract and store information
#DOIs <- results$message$items$DOI
#publication_year <- unlist(map(results$message$items$published$`date-parts`, 1))
#resource_type <- results$message$items$type
#next_cursor <- results$message$`next-cursor`

# keep requesting pages while a cursor is returned and the last page had items
#while (!is.null(next_cursor) && NROW(results$message$items) > 0) {
#    # the cursor can contain reserved characters, so it is URL-encoded
#    results <- fromJSON(paste0(urls[1], "&cursor=", URLencode(next_cursor, reserved = TRUE)))
#    DOIs <- append(DOIs, results$message$items$DOI)
#    publication_year <- append(publication_year, unlist(map(results$message$items$published$`date-parts`, 1)))
#    resource_type <- append(resource_type, results$message$items$type)
#    next_cursor <- results$message$`next-cursor`
#}
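The control flow of the loop can be tried out offline. The sketch below replaces the API with a small list of made-up "pages"; *pages* and *fetch_page* are hypothetical stand-ins, and the cursor values and DOIs are invented. It only illustrates the while loop pattern, under the assumption that the last page comes back empty.

```r
# made-up pages keyed by cursor; none of these values come from crossref
pages <- list(
  "*"  = list(items = c("10.1234/a", "10.1234/b"), next_cursor = "c1"),
  "c1" = list(items = c("10.1234/c", "10.1234/d"), next_cursor = "c2"),
  "c2" = list(items = character(0), next_cursor = "c2")
)

# fetch_page() is a hypothetical helper that mimics a single request
fetch_page <- function(cursor) pages[[cursor]]

# first "request" with the cursor "*"
page <- fetch_page("*")
DOIs <- page$items
n_requests <- 1

# keep going until a page comes back without items
while (length(page$items) > 0) {
  page <- fetch_page(page$next_cursor)
  DOIs <- append(DOIs, page$items)
  n_requests <- n_requests + 1
}

print(DOIs)
print(n_requests)
```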