# PID Graph for datasets related to FREYA

This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to fetch all datasets funded by the FREYA grant, together with their connections to researchers, publications, other datasets, and funders.

## Prepare the R GraphQL client
Load necessary libraries and set up the API endpoint.

In [66]:
library("httr")
library("ghql")
library("jsonlite")
library("IRdisplay")
library("dplyr")
library("purrr")
library("igraph")

cli <- GraphqlClient$new(
  url = "https://api.datacite.org/graphql"
)
qry <- Query$new()

## Generate the GraphQL query
In this query, we are looking through all DataCite DOIs that are assigned to datasets and finding those that include the FREYA grant number within a funding reference. 

Then for each of those datasets, we're asking for:

1. identifiers for the creators (in this case ORCID IDs)
2. identifiers for items related to that dataset (citations, versions, etc.)
3. identifiers for any funders related to that dataset

In this example, we already know that the datasets will have the EC as a related funder, but pulling in the `funderIdentifier` will allow us to plot that information as part of our graph.

In [67]:
query <- '{
  datasets(query: "fundingReferences.awardNumber:777523") {
    totalCount
    nodes {
      id
      creators {
        id
      }
      publications {
        nodes {
          id
        }
      }
      datasets {
        nodes {
          id
        }
      }
      fundingReferences {
        funderIdentifier
      }
    }
  }
}'
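
To reuse the same kind of query for a different grant, one option is to build the query string from an award number. The sketch below is only an illustration: `build_dataset_query` is a hypothetical helper that is not part of this notebook, and it requests a reduced set of fields to keep the example short.

```r
# Hypothetical helper (not part of the original notebook): build a datasets
# query for an arbitrary award number.
build_dataset_query <- function(award_number) {
  # Allow only characters that cannot break out of the quoted search string
  stopifnot(grepl("^[A-Za-z0-9._-]+$", award_number))
  sprintf('{
  datasets(query: "fundingReferences.awardNumber:%s") {
    totalCount
    nodes {
      id
    }
  }
}', award_number)
}

freya_query <- build_dataset_query("777523")
cat(freya_query)
```

Restricting the award number to a small character set is a design choice for this sketch; it keeps user input from altering the structure of the query.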

## Run the query
We'll run the query and process the JSON response it returns.

In [68]:
qry$query('getdata', query)
data <- fromJSON(cli$exec(qry$queries$getdata))
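
A GraphQL server can answer with an `errors` array instead of (or alongside) `data`, so it can help to check the parsed response before using it. The following is a minimal sketch: `check_graphql_response` is a hypothetical helper, and both sample responses are made up for illustration rather than taken from the DataCite API.

```r
library("jsonlite")

# Hypothetical helper: stop early if a parsed GraphQL response reports errors
# or carries no data.
check_graphql_response <- function(parsed) {
  if (!is.null(parsed$errors)) {
    stop("GraphQL error: ", paste(parsed$errors$message, collapse = "; "))
  }
  if (is.null(parsed$data)) {
    stop("GraphQL response contains no data")
  }
  invisible(parsed)
}

# Made-up responses for illustration only
ok <- fromJSON('{"data": {"datasets": {"totalCount": 2}}}')
bad <- fromJSON('{"errors": [{"message": "Field does not exist"}]}')

check_graphql_response(ok)$data$datasets$totalCount
result <- tryCatch(check_graphql_response(bad), error = function(e) conditionMessage(e))
result
```

In this notebook the helper could be applied to `data` right after the call to `fromJSON()`.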

## Display the number of datasets
This step is just to check our work. We already know there should be only 2 matching datasets (as of 6 June 2019 when this notebook was first created). 

In [69]:
display_json(data$data$datasets$totalCount)

Now let's prepare the node data frame for our graph: one row per PID, labeled with its type (dataset, researcher, publication, or funder).

In [70]:
dataset_nodes <- data$data$datasets$nodes

# Collect the non-missing values of `column` from a list of per-dataset data frames
# and label them with a PID type. Returns a zero-row data frame when nothing matches.
make_nodes <- function(x, pid_type, column = 'id') {
  ids <- as.character(unlist(map(x, ~ .x[[column]])))
  ids <- ids[!is.na(ids)]
  data.frame(id=ids, pid_type=rep(pid_type, length(ids)), stringsAsFactors=FALSE)
}

datasets <- make_nodes(list(dataset_nodes), 'dataset')
researchers <- make_nodes(dataset_nodes$creators, 'researcher')
references <- make_nodes(dataset_nodes$publications$nodes, 'publication')
dataset_references <- make_nodes(dataset_nodes$datasets$nodes, 'dataset')
funders <- make_nodes(dataset_nodes$fundingReferences, 'funder', column='funderIdentifier')
nodes <- unique(rbind(datasets, researchers, references, dataset_references, funders))

# express bare DOIs (starting with "10.") as URLs
nodes <- nodes %>% filter(!is.na(id)) %>% mutate(id = ifelse(startsWith(id, '10.'), paste('https://doi.org/', id, sep=''), id))

# remove duplicates
nodes <- nodes %>% distinct(id, .keep_all = TRUE)
nodes
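
When preparing the nodes, it helps to know how `fromJSON()` nests the response: the dataset nodes become a data frame, while connections such as `publications` hold a list with one entry per dataset, and that entry can be empty. The sketch below shows this on a made-up response with the same shape as the query above; the identifiers are illustrative only.

```r
library("jsonlite")

# Made-up response with the same shape as the query above (illustrative ids only)
sample_json <- '{"data": {"datasets": {"nodes": [
  {"id": "https://doi.org/10.1234/a",
   "creators": [{"id": "https://orcid.org/0000-0000-0000-0001"}],
   "publications": {"nodes": [{"id": "https://doi.org/10.1234/p1"}]}},
  {"id": "https://doi.org/10.1234/b",
   "creators": [],
   "publications": {"nodes": []}}
]}}}'
sample <- fromJSON(sample_json)$data$datasets$nodes

class(sample)                     # the dataset nodes become a data frame
class(sample$publications$nodes)  # a list with one entry per dataset

# Pull every publication id, skipping datasets with no publications
pub_ids <- unlist(lapply(sample$publications$nodes, function(x) x[["id"]]))
pub_ids
```

Extracting the ids per list entry, rather than with `$id` on the list itself, avoids ending up with an empty result when some datasets have no publications.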


Then we define the edges for our graph. We loop through the nodes because each node could have multiple edges.

In [None]:
dataset_nodes <- data$data$datasets$nodes

# Build one edge per (dataset, connected PID) pair from a list of per-dataset data frames.
# Datasets with no connections of a given kind contribute zero rows.
make_edges <- function(from_ids, x, column = 'id') {
  bind_rows(map2(from_ids, x, function(from, item) {
    to <- as.character(unlist(item[[column]]))
    to <- to[!is.na(to)]
    data.frame(to=to, from=rep(from, length(to)), stringsAsFactors=FALSE)
  }))
}

edges <- unique(bind_rows(
  data.frame(to=character(), from=character(), stringsAsFactors=FALSE),
  make_edges(dataset_nodes$id, dataset_nodes$creators),
  make_edges(dataset_nodes$id, dataset_nodes$publications$nodes),
  make_edges(dataset_nodes$id, dataset_nodes$datasets$nodes),
  make_edges(dataset_nodes$id, dataset_nodes$fundingReferences, column='funderIdentifier')
))

## Generating the graph
We make sure that every DOI we receive is expressed as a URL. This helps with de-duplication and filtering in later steps.

In [None]:
edges <- edges %>% filter(!is.na(as.character(to))) %>% mutate(to = ifelse(startsWith(as.character(to), '10.'), paste('https://doi.org/', to, sep=''), as.character(to)))
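
The reason this normalization helps with de-duplication is that the same DOI can otherwise appear in two forms. A minimal sketch, using a hypothetical helper `normalize_pid` and made-up identifiers:

```r
# Hypothetical helper: turn bare DOIs into URLs and leave other identifiers untouched
normalize_pid <- function(x) {
  x <- as.character(x)
  ifelse(startsWith(x, "10."), paste0("https://doi.org/", x), x)
}

# Made-up identifiers for illustration only
raw_ids <- c("10.1234/a", "https://doi.org/10.1234/a", "https://orcid.org/0000-0000-0000-0001")
normalized_ids <- unique(normalize_pid(raw_ids))
normalized_ids
```

The bare DOI and its URL form collapse into a single entry, while the ORCID iD passes through unchanged.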

Next we format the graph for display. Here are the formatting choices we're making. 

1. We're going to display only the unique edges and nodes. 
2. We're coloring datasets red, researchers green, publications blue, and funders yellow.
3. We're making the nodes a nice size for viewing. 
4. This is not a directed graph (the connections returned by the query don't specify a relational direction), so we don't need any arrows.

In [None]:
g <- graph_from_data_frame(d=unique(edges), vertices=unique(nodes))

# Node colors
col = c('#e45718','#fecf59','#48b1f4','#53c48c')
V(g)[(V(g)$pid_type=="publication")]$color<-'#48b1f4'
V(g)[(V(g)$pid_type=="researcher")]$color<-'#53c48c'
V(g)[(V(g)$pid_type=="funder")]$color<-'#fecf59'
V(g)[(V(g)$pid_type=="dataset")]$color<-'#e45718'

V(g)$size <- 8
E(g)$arrow.mode <- 0
l <- layout_with_dh(g)
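
`graph_from_data_frame()` stops with an error when an edge refers to an id that is missing from the vertex table, so it can be worth filtering the edges first. The sketch below uses made-up nodes and edges; the `toy_` names are assumptions for illustration and not part of this notebook.

```r
library("igraph")

# Made-up nodes and edges for illustration only
toy_nodes <- data.frame(
  id = c("https://doi.org/10.1234/a", "https://orcid.org/0000-0000-0000-0001"),
  pid_type = c("dataset", "researcher"),
  stringsAsFactors = FALSE
)
toy_edges <- data.frame(
  to = c("https://orcid.org/0000-0000-0000-0001", "https://doi.org/10.1234/missing"),
  from = c("https://doi.org/10.1234/a", "https://doi.org/10.1234/a"),
  stringsAsFactors = FALSE
)

# Keep only edges whose endpoints both appear in the node table
known <- toy_edges$to %in% toy_nodes$id & toy_edges$from %in% toy_nodes$id
toy_graph <- graph_from_data_frame(d = toy_edges[known, ], vertices = toy_nodes, directed = FALSE)

vcount(toy_graph)
ecount(toy_graph)
```

Passing `directed = FALSE` is an alternative to hiding the arrows at plot time, since the graph is then undirected from the start.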

And finally, we plot the actual graph. 

In [None]:
plot(g, vertex.label=NA, layout=l, edge.arrow.mode=0)

# Add a legend
legend("bottomleft", legend=levels(as.factor(V(g)$pid_type)), col = col, bty = "n", pch=20 , pt.cex = 3, cex = 1, text.col=col , horiz = FALSE, inset = c(0.1, -0.1))
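
The legend above pairs labels and colors by position, which only lines up when all four PID types are present in the graph. One way to make the pairing robust, sketched here with made-up vertex types, is to key the colors by type name:

```r
# Colors keyed by PID type (same hex values as used above)
pid_colors <- c(
  dataset = "#e45718",
  funder = "#fecf59",
  publication = "#48b1f4",
  researcher = "#53c48c"
)

# Made-up vertex types for illustration: this graph has no publications
toy_types <- c("dataset", "researcher", "funder", "dataset")

legend_labels <- sort(unique(toy_types))
legend_colors <- unname(pid_colors[legend_labels])
legend_labels
legend_colors
```

In the notebook, `toy_types` would be replaced by `V(g)$pid_type`, and `legend_labels` and `legend_colors` passed to `legend()` as `legend` and `col`.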

## Generate resource lists
We can generate a list of the datasets in APA format. 

In [None]:
ids <- substring(datasets[,1], 17)
ids <- paste(ids, collapse = ',')
url <- paste('https://api.datacite.org/dois?style=apa&page[size]=250&sort=created&ids=', ids, sep = '')
response <- GET(url, accept("text/x-bibliography"))
display_markdown('## Datasets')
display_markdown(content(response, as = 'text'))
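
The URL construction can be wrapped in a small function so that both resource lists build it the same way. This is a sketch only: `build_citation_url` is a hypothetical helper, the DOIs are made up, and the query parameters are simply those used in the cell above.

```r
# Hypothetical helper: strip the resolver prefix and join DOIs for the `ids` parameter
build_citation_url <- function(doi_urls, style = "apa") {
  dois <- sub("^https://doi\\.org/", "", doi_urls)
  paste0("https://api.datacite.org/dois?style=", style,
         "&page[size]=250&sort=created&ids=", paste(dois, collapse = ","))
}

# Made-up DOIs for illustration only
citation_url <- build_citation_url(c("https://doi.org/10.1234/a", "10.1234/b"))
citation_url
```

Using `sub()` with an anchored pattern, instead of a fixed character offset, leaves identifiers that are already bare DOIs unchanged.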

We can also generate a list of all of the publications that were related to these items. 

In [None]:
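# strip the resolver prefix so the ids are bare DOIs, as in the datasets list above
ids <- sub('^https://doi.org/', '', references[,1])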
ids <- paste(ids, collapse = ',')
url <- paste('https://api.datacite.org/dois?style=apa&page[size]=250&sort=created&ids=', ids, sep = '')
response <- GET(url, accept("text/x-bibliography"))
display_markdown('## References')
display_markdown(content(response, as = 'text'))  