# PID Graph Key Performance Indicators (KPIs)

This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to fetch summary statistics about nodes and connections.


**Goal**: By the end of this notebook you should be able to visualize the PID Graph summary statistics in different ways. We retrieve the data from the PID Graph, transform it, and then create a visualisation to make sense of it.



In [3]:
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
import pandas as pd

_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)

client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)


### Define the Data we want to obtain

The first step is to define what data we want. For this example we will extract the PID Graph summary statistics.

We obtain this data using the GraphQL query language. A GraphQL query looks like a JSON snippet that lists field names without values.

In this example we request the total count of each type of scholarly item and the counts of the connections between them.

In [4]:
query = gql("""
{
  publications {
    totalCount
    publicationConnectionCount
    datasetConnectionCount
    softwareConnectionCount
    personConnectionCount
    organizationConnectionCount
    funderConnectionCount
  }
  datasets {
    totalCount
    datasetConnectionCount
    softwareConnectionCount
    personConnectionCount
    organizationConnectionCount
    funderConnectionCount
  }
  softwareSourceCodes {
    totalCount
    softwareConnectionCount
    personConnectionCount
    organizationConnectionCount
    funderConnectionCount
  }
  people {
    totalCount
    organizationConnectionCount
  }
  organizations {
    totalCount
  }

  funders {
    totalCount
  }
}
""")

### Run the query and parse the JSON response


Running the query returns the response as a nested dictionary, keyed first by resource type (for example `publications`) and then by field name (for example `totalCount`).


In [5]:
# Run the query and parse the JSON response
data = client.execute(query)
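
To make the shape of the response concrete, the sketch below walks a small hand-written dictionary with the same nesting as `data`. The name `sample` is a hypothetical stand-in so the snippet runs without a network call, and its counts are made-up placeholders, not values returned by the API.

```python
# Hypothetical stand-in for the result of client.execute(query).
# The numbers are placeholders, not real API values.
sample = {
    "publications": {"totalCount": 100, "datasetConnectionCount": 7},
    "datasets": {"totalCount": 250, "datasetConnectionCount": 12},
    "funders": {"totalCount": 5},
}

# First key: resource type. Second key: field name.
publication_total = sample["publications"]["totalCount"]

# Collect the total count of every resource type in one pass.
totals = {resource: fields["totalCount"] for resource, fields in sample.items()}

print(publication_total)
print(totals)
```

The same two-level lookup is what the transformation cell applies to the real `data` object.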

### Transform data for visualisation.

We need to transform the nested response into something consumable for visualisations. In this case we will transform it into a flat list of Nodes and Connections/Edges.


In [26]:
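# generate data frame for nodes: one row per node type, holding its id and total count
# (a dict of column name -> list is used; passing a set to pd.DataFrame fails)
publications = pd.DataFrame({'id': ['Publication'], 'count': [data['publications']['totalCount']]})
datasets = pd.DataFrame({'id': ['Dataset'], 'count': [data['datasets']['totalCount']]})
softwareSourceCodes = pd.DataFrame({'id': ['SoftwareSourceCode'], 'count': [data['softwareSourceCodes']['totalCount']]})
people = pd.DataFrame({'id': ['Person'], 'count': [data['people']['totalCount']]})
organizations = pd.DataFrame({'id': ['Organization'], 'count': [data['organizations']['totalCount']]})
funders = pd.DataFrame({'id': ['Funder'], 'count': [data['funders']['totalCount']]})

# stack the single-row frames into one node table with a fresh 0..5 index
nodes = pd.concat([publications, datasets, softwareSourceCodes, people, organizations, funders], ignore_index=True)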


# generate data frame for edges: one row per (from, to) pair with its connection count
edges = pd.DataFrame(columns=['from', 'to', 'count'])
edges.loc[0] = [ 'Publication', 'Publication', data['publications']['publicationConnectionCount']]
edges.loc[1] = [ 'Publication', 'Dataset', data['publications']['datasetConnectionCount']]
edges.loc[2] = [ 'Publication', 'SoftwareSourceCode', data['publications']['softwareConnectionCount']]
edges.loc[3] = [ 'Publication', 'Person', data['publications']['personConnectionCount']]
edges.loc[4] = [ 'Publication', 'Organization', data['publications']['organizationConnectionCount']]
edges.loc[5] = [ 'Publication', 'Funder', data['publications']['funderConnectionCount']]
edges.loc[6] = [ 'Dataset', 'Dataset', data['datasets']['datasetConnectionCount']]
edges.loc[7] = [ 'Dataset', 'SoftwareSourceCode', data['datasets']['softwareConnectionCount']]
edges.loc[8] = [ 'Dataset', 'Person', data['datasets']['personConnectionCount']]
edges.loc[9] = [ 'Dataset', 'Funder', data['datasets']['funderConnectionCount']]
edges.loc[10] = [ 'Dataset', 'Organization', data['datasets']['organizationConnectionCount']]
edges.loc[11] = [ 'SoftwareSourceCode', 'SoftwareSourceCode', data['softwareSourceCodes']['softwareConnectionCount']]
edges.loc[12] = [ 'SoftwareSourceCode', 'Person', data['softwareSourceCodes']['personConnectionCount']]
edges.loc[13] = [ 'SoftwareSourceCode', 'Organization', data['softwareSourceCodes']['organizationConnectionCount']]
edges.loc[14] = [ 'SoftwareSourceCode', 'Funder', data['softwareSourceCodes']['funderConnectionCount']]
edges.loc[15] = [ 'Person', 'Organization', data['people']['organizationConnectionCount']]




edges


|   | from | to | count |
|---|---|---|---|
| 0 | Publication | Publication | 2122966 |
| 1 | Publication | Dataset | 2876214 |
| 2 | Publication | SoftwareSourceCode | 4312 |
| 3 | Publication | Person | 179041 |
| 4 | Publication | Organization | 545 |
| 5 | Publication | Funder | 22214 |
| 6 | Dataset | Dataset | 23203323 |
| 7 | Dataset | SoftwareSourceCode | 1586 |
| 8 | Dataset | Person | 412960 |
| 9 | Dataset | Funder | 62655 |
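
The sixteen `edges.loc[...]` assignments in the cell above all follow one pattern, so the same rows could be built from a small lookup. The sketch below is one possible way to do that, not the notebook's own code: `SOURCES`, `TARGETS`, and `build_edges` are names introduced here for illustration, and `sample` is a made-up response with placeholder counts. Rows come out in lookup order, which may differ slightly from the hand-written order.

```python
# Maps each top-level query field to the node label used in the notebook.
SOURCES = {
    "publications": "Publication",
    "datasets": "Dataset",
    "softwareSourceCodes": "SoftwareSourceCode",
    "people": "Person",
}

# Maps each connection-count field to the node label it points at.
TARGETS = {
    "publicationConnectionCount": "Publication",
    "datasetConnectionCount": "Dataset",
    "softwareConnectionCount": "SoftwareSourceCode",
    "personConnectionCount": "Person",
    "organizationConnectionCount": "Organization",
    "funderConnectionCount": "Funder",
}


def build_edges(response):
    """Return one {'from', 'to', 'count'} row per connection field present."""
    rows = []
    for resource, source_label in SOURCES.items():
        fields = response.get(resource, {})
        for field, target_label in TARGETS.items():
            if field in fields:
                rows.append({"from": source_label, "to": target_label, "count": fields[field]})
    return rows


# Made-up response with placeholder counts, shaped like `data`.
sample = {
    "publications": {"totalCount": 100, "datasetConnectionCount": 7},
    "people": {"totalCount": 40, "organizationConnectionCount": 3},
}

rows = build_edges(sample)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would give a frame with the same `from`, `to`, and `count` columns as `edges`.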


In [None]:
# generate data frame for nodes

publications <- data.frame(id=c('Publication'), count=data$data$publications$totalCount)
datasets <- data.frame(id=c('Dataset'), count=data$data$datasets$totalCount)
softwareSourceCodes <- data.frame(id=c('SoftwareSourceCode'), count=data$data$softwareSourceCodes$totalCount)
people <- data.frame(id=c('Person'), count=data$data$people$totalCount)
organizations <- data.frame(id=c('Organization'), count=data$data$organizations$totalCount)
funders <- data.frame(id=c('Funder'), count=data$data$funders$totalCount)
nodes <- rbind(publications, datasets, softwareSourceCodes, people, organizations, funders)


# append the last two edge rows; this assumes `edges` already holds the earlier rows
edges <- rbind(edges, data.frame(from="SoftwareSourceCode", to="Funder", count=data$data$softwareSourceCodes$funderConnectionCount))
edges <- rbind(edges, data.frame(from="Person", to="Organization", count=data$data$people$organizationConnectionCount))

# to debug the error "Some vertex names in edge list are not listed in vertex data frame",
# uncomment these lines to list the edges whose endpoints are missing from nodes
#edges[which(! edges$from %in% nodes$id) ,]
#edges[which(! edges$to %in% nodes$id) ,]

head(edges)
head(nodes)
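
The commented-out lines in the cell above look for edges whose endpoints are missing from the node table. A rough Python equivalent is sketched below using plain lists; `find_unknown_endpoints` is a hypothetical helper, and the node and edge values are made up to trigger one mismatch.

```python
def find_unknown_endpoints(node_ids, edges):
    """Return the edges whose 'from' or 'to' label is not a known node id."""
    known = set(node_ids)
    return [e for e in edges if e["from"] not in known or e["to"] not in known]


# Made-up inputs: 'Organization' is deliberately left out of node_ids.
node_ids = ["Publication", "Dataset", "Person"]
edges = [
    {"from": "Publication", "to": "Dataset", "count": 7},
    {"from": "Person", "to": "Organization", "count": 3},
]

unknown = find_unknown_endpoints(node_ids, edges)
print(unknown)
```

An empty result means every edge refers only to labels that exist in the node table.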

### Generate the graph

The igraph library provides versatile options for descriptive network analysis and visualization.
To generate the graph we take the nodes and edges we generated and plot them together. 


In [None]:

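# load igraph for the graph functions and RColorBrewer for the colour palette
library(igraph)
library(RColorBrewer)

# generate graph
g <- graph_from_data_frame(edges, vertices=nodes)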

# add node colors
cols <- brewer.pal(12, "Set3")
V(g)[(V(g)$name=="Publication")]$color<-cols[5]
V(g)[(V(g)$name=="Person")]$color<-cols[1]
V(g)[(V(g)$name=="Dataset")]$color<-cols[4]
V(g)[(V(g)$name=="SoftwareSourceCode")]$color<-cols[10]
V(g)[(V(g)$name=="Organization")]$color<-cols[11]
V(g)[(V(g)$name=="Funder")]$color<-cols[2]

# add labels
V(g)$label <- paste(nodes$id, format(V(g)$count, big.mark=",", trim=TRUE), sep="\n")
E(g)$label <- format(E(g)$count, big.mark=",", trim=TRUE)

# calculate node size and edge width
V(g)$size <- log(V(g)$count, 1.3)
E(g)$width <- log(E(g)$count, 2)

E(g)$arrow.mode <- 0
l <- layout_nicely(g)
plot(g, layout=l, arrow.mode=0,
     vertex.label.color="black", vertex.label.family="Helvetica", vertex.label.cex=0.7,
     edge.color="gray90", edge.label.color="black", edge.label.family="Helvetica", edge.label.cex=0.7,
     edge.loop.angle=-pi/4)
title(paste("PID Graph\nNumber of nodes and connections",
format(Sys.Date(), "(%d %B %Y)"), sep=" "),cex.main=1.5)
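
The plot sizes nodes and edges by the logarithm of their counts, which keeps the largest counts from dwarfing the smallest. The sketch below reproduces that scaling in Python with the bases used in the cell above (1.3 for nodes, 2 for edges). The function names are introduced here for illustration, and the two edge counts are taken from the `edges` output shown earlier.

```python
import math


def node_size(count, base=1.3):
    """Logarithmic node size, mirroring log(V(g)$count, 1.3) in the R cell."""
    return math.log(count, base)


def edge_width(count, base=2):
    """Logarithmic edge width, mirroring log(E(g)$count, 2) in the R cell."""
    return math.log(count, base)


smallest = edge_width(545)       # Publication -> Organization
largest = edge_width(23203323)   # Dataset -> Dataset

print(round(smallest, 1), round(largest, 1))
print(round(largest / smallest, 1))
```

The two counts differ by a factor of more than 40,000, yet the resulting widths differ by less than a factor of three. A count of zero has no finite logarithm, so such rows would need separate handling.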