## Querying crossref


### Step 1: load packages

* **jsonlite** for working with JSON
* **purrr** for working with vectors

If necessary, install the packages before loading them.

In [1]:
#install.packages(c("jsonlite", "purrr"))
library(jsonlite)
library(purrr)

### Step 2: specify the query parameters

For our use case, we want to retrieve *DOIs of publications published since 2015 from authors that are affiliated with a specific organization (Humboldt-Universität zu Berlin)*.

The crossref API is well documented (see [here](https://github.com/CrossRef/rest-api-doc)) and offers many functionalities.
In the query URL we build, we first specify that we are interested in *works*.

Crossref has not yet implemented organisation identifiers in the affiliation information (see the schema definition [here](https://data.crossref.org/schemas/common4.4.2.xsd)). Therefore, we have to rely on searching for name variants in the affiliation field. Here, we pass the term "humboldt+universit+berlin" to the query parameter *query.affiliation*. (For demonstration purposes, we use a very simple query. To ensure that all publications from authors affiliated with an organization are found, multiple name variants should be used; see [this example](https://github.com/tuub/oa-eval).)

We use the filter *from-pub-date* with the value 2015-01-01 to limit results to publications published since 2015.

Below, the resulting query is stored in the object *query*.

In [2]:
query <- "https://api.crossref.org/works?query.affiliation=humboldt+universit+berlin&filter=from-pub-date:2015-01-01"
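
The same URL can also be assembled from its parts, which makes it easier to swap the search term or the start date later. The following is a minimal sketch; the variable names (`base_url`, `affiliation`, `from_date`, `query_built`) are illustrative and not part of the workshop code.

```r
# Illustrative variable names; only the resulting URL string matters.
base_url    <- "https://api.crossref.org/works"
affiliation <- "humboldt+universit+berlin"
from_date   <- "2015-01-01"

# Glue the pieces together: endpoint, affiliation search, date filter
query_built <- paste0(
  base_url,
  "?query.affiliation=", affiliation,
  "&filter=from-pub-date:", from_date
)

query_built
```

The result is the same string as the one stored in *query*.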

### Step 3: politely using APIs

APIs offer valuable services to many people and organizations. Therefore, it is important to *politely* use APIs and not burden them with too many requests.  Some services regularly make data dumps available, so you might not even have to use the service's API.

Some APIs specify polite use in their documentation, including crossref (see [here](https://github.com/CrossRef/rest-api-doc#etiquette)). To comply with the API etiquette, we will append the *mailto* parameter to the query. This allows crossref to contact us in case there are any issues with our query.

In [3]:
# change this value to your email address:
mailto <- "&mailto=jdoe@example.org"

### Step 4: exploring the results

The crossref API returns results in JSON, a common data format. We concatenate the strings *query* and *mailto* and pass the new URL string to the function *fromJSON*. The function retrieves and converts results from JSON to R objects. We store the converted results in the object *results*.

Information in JSON objects is stored similarly to nested lists with names. Therefore, we can access a specific piece of information by subsetting the *results* object using names. Here, we want to access DOIs, publication year and type of publications matching the query. Notice that results are returned in different classes - two character vectors and one data frame.
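
To see how names in a JSON document map onto subsetting in R, it can help to parse a tiny example without calling the API. The JSON below is made up (including the DOIs) and only mimics the general shape of a crossref response.

```r
library(jsonlite)

# Made-up JSON that mimics the shape of a crossref response
toy_json <- '{
  "message": {
    "total-results": 2,
    "items": [
      {"DOI": "10.1234/example.1", "type": "journal-article"},
      {"DOI": "10.1234/example.2", "type": "book-chapter"}
    ]
  }
}'

toy <- fromJSON(toy_json)

# Names in the JSON become names in the R list, accessible with `$`.
# Names containing a hyphen have to be wrapped in backticks.
toy$message$items$DOI
toy$message$items$type
toy$message$`total-results`

# The array of items is converted to a data frame, one column per field
class(toy$message$items)
```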

By default, crossref returns 20 items at a time.

You can find out how many items match your query by setting the number of rows to zero (*rows=0*) - in this case, we have more than 51,000 matches!
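
A count-only request could look roughly like the sketch below. It repeats the *query* and *mailto* strings so that it is self-contained; the name `count_url` is illustrative. The request itself is commented out so that the block runs without internet access.

```r
library(jsonlite)

query  <- "https://api.crossref.org/works?query.affiliation=humboldt+universit+berlin&filter=from-pub-date:2015-01-01"
mailto <- "&mailto=jdoe@example.org"  # replace with your own address

# rows=0 asks for no items, so only summary fields such as total-results come back
count_url <- paste0(query, "&rows=0", mailto)
count_url

# Uncomment to send the request (needs internet access):
# fromJSON(count_url)$message$`total-results`
```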

In [4]:
# using fromJSON + query URL to retrieve and parse results
results <- jsonlite::fromJSON(paste0(query, mailto))

# access the results by subsetting
results$message$items$DOI
results$message$items$published
results$message$items$type

# total number of items matching the query
results$message$`total-results`

   date-parts
   <list>
1  2020, 7, 15
2  2019, 7, 24
3  2016, 10, 24
4  2021, 8, 13
5  2020, 12, 31
6  2021, 4, 6
7  2016, 6, 14
8  2019, 12, 12
9  2018, 11, 15
10 2020, 3, 12


### Step 5: retrieving all DOIs matching the query

**To reduce the load on the crossref API, we will not execute this step in the workshop - I will provide you with the data necessary to proceed.**

To retrieve the information on all matches, we have to iterate through the results, which by default are returned in sets of 20. For this purpose, crossref offers *cursors*. They work like this: to your first query, you add the *cursor* parameter with the value " \* ". Alongside the results, crossref returns a *next-cursor* field. You can use this value to access the next set of 20 items, and so on.

We can implement this in R using a *while loop*, a useful form of iteration if you don't exactly know how long a sequence is. 

In the example below, we first add "&cursor=\*" to the first of our query URLs, pass that URL to *fromJSON*, and store the result in *results*. Next, we extract the DOIs, publication year (since this is a data frame, extraction is a little more complex), and resource type and store them in a data frame. We store the next cursor in *next_cursor*.

We then initiate a while loop that will repeat itself until a condition is met. Here, the loop is repeated until the number of rows (= items already retrieved) is no longer smaller than the total number of items matching the query. Within the loop, we will use *rbind* to add the new results to the data frame we previously created for storing DOIs, publication year and resource type.
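
The publication year extraction mentioned above can be tried on a small, made-up example first. The sketch below uses base R (`vapply`) as an alternative to *map* from purrr. The JSON is invented for illustration, and the second item deliberately has only a year and a month.

```r
library(jsonlite)

# Made-up JSON that mimics how publication dates are nested
toy <- fromJSON('{
  "items": [
    {"DOI": "10.1234/example.1", "published": {"date-parts": [[2020, 7, 15]]}},
    {"DOI": "10.1234/example.2", "published": {"date-parts": [[2019, 7]]}}
  ]
}')

# `published` is a nested data frame; its `date-parts` column is a list
date_parts <- toy$items$published$`date-parts`

# Take the first value (the year) of each entry
years <- vapply(date_parts, function(p) as.integer(unlist(p)[1]), integer(1))

data.frame(DOI = toy$items$DOI, publication_year = years)
```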

In [5]:
# DO NOT RUN THIS!

# first request with "cursor=*"
#results <- fromJSON(paste0("https://api.crossref.org/works?query.affiliation=humboldt+universit+berlin&filter=from-pub-date:2022-01-01&cursor=", "*&mailto=janedoe@example.org"))
# extract and store information in a data frame
#DF <- data.frame(c(DOIs = list(results$message$items$DOI),
#             publication_year = list(unlist(map(results$message$items$published$`date-parts`, 1))),
#             resource_type = list(results$message$items$type)))
# store next cursor
#next_cursor <- results$message$`next-cursor`

# iterate through the results until the condition is met; append new results with each iteration
#while (nrow(DF) < results$message$`total-results`) {
#  results <- fromJSON(paste0("https://api.crossref.org/works?query.affiliation=humboldt+universit+berlin&filter=from-pub-date:2022-01-01&cursor=", next_cursor, "&mailto=janedoe@example.org"))
#  DF <- rbind(DF, data.frame(c(DOIs = list(results$message$items$DOI),
#                     publication_year = list(unlist(map(results$message$items$published$`date-parts`, 1))),
#                     resource_type = list(results$message$items$type))))
#  next_cursor <- results$message$`next-cursor`
#}
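
The loop above can also be wrapped in a function. The following is only a sketch under a few assumptions: the function names (`build_page_url`, `fetch_all`) and the `fetch` argument are hypothetical, the cursor is URL-encoded as a precaution in case it contains reserved characters, and *rows* is raised to 1000 (the maximum the crossref documentation allows per request) to reduce the number of requests. The loop also stops when a page comes back empty, so it cannot run forever if *total-results* changes between requests. As with the cell above, please do not run it against the API during the workshop.

```r
library(jsonlite)

# Build the URL for one page of results (illustrative helper)
build_page_url <- function(query, cursor = "*", rows = 1000, mailto = NULL) {
  url <- paste0(query,
                "&rows=", rows,
                "&cursor=", URLencode(cursor, reserved = TRUE))
  if (!is.null(mailto)) url <- paste0(url, "&mailto=", mailto)
  url
}

# Collect DOI and type from all pages. `fetch` is the function that turns
# a URL into a parsed response; it defaults to jsonlite::fromJSON.
fetch_all <- function(query, mailto = NULL, rows = 1000,
                      fetch = jsonlite::fromJSON) {
  cursor <- "*"
  pages  <- list()
  repeat {
    res   <- fetch(build_page_url(query, cursor, rows, mailto))
    items <- res$message$items
    # stop as soon as a page contains no items
    if (length(items) == 0 || NROW(items) == 0) break
    pages[[length(pages) + 1]] <- data.frame(
      DOI  = items$DOI,
      type = items$type,
      stringsAsFactors = FALSE
    )
    cursor <- res$message$`next-cursor`
    if (is.null(cursor)) break
    Sys.sleep(1)  # short pause between requests, to be polite
  }
  do.call(rbind, pages)
}

# Usage (not executed here):
# DF <- fetch_all(query, mailto = "jdoe@example.org")
```

Collecting the pages in a list and binding them once at the end avoids growing the data frame with *rbind* in every iteration.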

### EXERCISE: try the package rcrossref

*rcrossref* is an R package specifically for using the crossref API. Crossref holds a lot of metadata about various aspects of scholarly publication. For example, crossref offers citation counts for works, based on reference matching within its holdings. 

The availability of this information depends on members choosing to deposit references when they create metadata. Citation information is also less complete than in other bibliometric databases. However, opening up citation information is a major issue in bibliometric research, so coverage will hopefully grow in the future.

The function *cr_citation_count* works in a similar way to what we did above. It takes a vector of DOIs and returns a data frame with the citation counts provided by crossref.

Below, we first install and load *rcrossref*. Then, we load a sample of 50 DOIs and retrieve citation counts from crossref.

**Have a look at the result.**
* What is the average citation count of this sample?
* What publication has the highest citation count?
* Try to access that publication. Can you open and read it?

In [1]:
#install.packages("rcrossref")
library(rcrossref)


In [3]:
# import the sample DOIs
DOI_sample <- read.csv("../data/DOI_sample.csv", row.names = "X")

In [4]:
# create an empty data frame for storing the results
DF_citations <- data.frame(doi = character(),
                           count = numeric())

# iterate through the sample DOIs and retrieve citation information
for (i in seq_len(nrow(DOI_sample))) {
    tryCatch(
        {DF_citations <- rbind(DF_citations, cr_citation_count(doi = DOI_sample[i,]))},
        # any error (e.g. a failed request) is reported and the loop continues
        error = function(e) {message("Lookup failed: ", conditionMessage(e))}
    )
}

print(DF_citations)
summary(DF_citations$count)

                                                           doi count
1                                           10.3139/104.111575     2
2                                           10.3139/104.111560     1
3                                           10.3139/104.111574     0
4                                               10.3852/15-254     7
5                                           10.1130/ges02032.1     4
6                                       10.1055/s-0037-1598699     1
7                                       10.1055/s-0036-1594189     0
8                                      10.1515/cdbme-2017-0005     8
9                                        10.1515/zwf-2021-0045     0
10                                    10.1177/0269881115609072    73
11                                      10.1002/chem.201603098    14
12                                      10.1002/ange.201803136    40
13                                      10.1055/s-0038-1668731     0
14                                

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    1.00    5.84    6.50   73.00 
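
With a result like the one above, the exercise questions can be approached roughly as follows. The sketch uses a few rows copied from the printed output rather than the full sample, so the numbers differ from the summary above; the name `DF_toy` is illustrative.

```r
# A few rows copied from the output above (not the full sample)
DF_toy <- data.frame(
  doi   = c("10.3139/104.111575", "10.3852/15-254",
            "10.1177/0269881115609072", "10.1002/ange.201803136"),
  count = c(2, 7, 73, 40),
  stringsAsFactors = FALSE
)

# average citation count
mean(DF_toy$count)

# row with the highest citation count
top <- DF_toy[which.max(DF_toy$count), ]
top

# link for trying to access the publication
top_url <- paste0("https://doi.org/", top$doi)
top_url
```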