<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab/blob/master/2-sample-applications/Citation-analysis/Citation-Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# Citation Analysis using the Dimensions API

This notebook shows how to extract citation data using the Dimensions Analytics API.

Two approaches are considered: one that is best suited to smaller analyses, and one that is more query-efficient and hence better suited to analyses involving lots of publications.

>[Citation Analysis using the Dimensions API](#scrollTo=eV50UGicGoXz)

>>[Install the Dimensions API library and login](#scrollTo=wbGOYIr6B6lQ)

>[Method A: getting citations for one publication at a time](#scrollTo=ykHU1pbE9uXt)

>>[Comments about this method](#scrollTo=y5dPmfV6qn7Y)

>[Method B: Getting citations for multiple publications via a single query](#scrollTo=hog-cAHvDGNF)

>>[Creating a second-level citations network](#scrollTo=Eo_2nBQJJBVg)

>>[Building a Simple dataviz](#scrollTo=Id9h9iXDOVkk)

>[Final considerations](#scrollTo=Y7pg_VXhVZ2z)

>>[Querying for more than 1000 results](#scrollTo=Y7pg_VXhVZ2z)

>>[Querying for more than 50K results](#scrollTo=Y7pg_VXhVZ2z)

>>[Some publications can have MANY citations](#scrollTo=Y7pg_VXhVZ2z)

>>[Pre-checking citations counts](#scrollTo=Y7pg_VXhVZ2z)



# 1. Prerequisites: installing the Dimensions API library

In [None]:
!pip install dimcli -U --quiet 
import dimcli
import json
dimcli.login()
dsl = dimcli.Dsl()

# 2. Method A: getting citations for one publication at a time

By using the field `reference_ids` we can easily look up citations for individual publications (= incoming links). For example, here are the papers citing "pub.1053279155":

In [None]:
%dsldf search publications where reference_ids in [ "pub.1053279155" ] return publications[id+doi+title+year]

Returned Publications: 5 (total = 5)


|   | doi | id | title | year |
|---|-----|----|-------|------|
| 0 | 10.1007/s11227-018-2338-1 | pub.1103275659 | Towards ontology-based multilingual URL filter... | 2018 |
| 1 | 10.1515/iwp-2015-0057 | pub.1012651711 | Das Experteninterview als zentrale Methode der... | 2015 |
| 2 | 10.1007/978-3-319-24129-6_3 | pub.1005502446 | Challenges for Ontological Engineering in the ... | 2015 |
| 3 | 10.1007/978-3-642-24809-2_10 | pub.1008922470 | Transforming a Flat Metadata Schema to a Seman... | 2012 |
| 4 | 10.1007/978-3-642-24731-6_38 | pub.1053157726 | Practice-Based Ontologies: A New Approach to A... | 2011 |


Let's try another paper, "pub.1103275659". In this case there are 3 citations:

In [None]:
%dsldf search publications where reference_ids in [ "pub.1103275659" ] return publications[id+doi+title+year]

Returned Publications: 3 (total = 3)


|   | doi | id | title | year |
|---|-----|----|-------|------|
| 0 | 10.1016/j.future.2019.04.038 | pub.1113878770 | Perception layer security in Internet of Things | 2019 |
| 1 | 10.1109/isncc.2018.8530984 | pub.1109815383 | A Fault Tolerant Approach for Malicious URL Fi... | 2018 |
| 2 | 10.1109/access.2018.2872928 | pub.1107354292 | Social Internet of Vehicles: Complexity, Adapt... | 2018 |


Using this simple approach, if we start with a list of publications (our 'seed') we can set up a simple loop to get through all of them and launch a 'get-citations' query each time. 


TIP 
The `json.dumps` function turns a Python list into a string that can be used directly in our query, e.g.:

```
> json.dumps(seed)
'["pub.1053279155", "pub.1103275659"]'
```
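
To make the tip concrete, here is a minimal sketch that builds a full query string from a list of IDs. The helper name `build_citations_query` is made up for this example and is not part of dimcli; the query template is the one used throughout this notebook.

```python
import json

def build_citations_query(pub_ids, fields="id+doi+title+year"):
    """Return a DSL query for the publications citing any of `pub_ids`.

    Hypothetical helper: name and signature are illustrative only.
    """
    # json.dumps renders the Python list with square brackets and
    # double-quoted strings, so it can be dropped into the query as-is.
    ids = json.dumps(list(pub_ids))
    return f"search publications where reference_ids in {ids} return publications[{fields}]"

query = build_citations_query(["pub.1053279155", "pub.1103275659"])
print(query)
```

The resulting string can be passed to `dsl.query(...)` in the same way as the hand-written queries.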

In [None]:
seed = [ "pub.1053279155" , "pub.1103275659"]
q = """search publications where reference_ids in [{}] return publications[id+doi+title+year]"""
results = {}
for p in seed:
  data = dsl.query(q.format(json.dumps(p)))
  results[p] = [x['id'] for x in data.publications]

In [None]:
results

{'pub.1053279155': ['pub.1103275659',
  'pub.1012651711',
  'pub.1005502446',
  'pub.1008922470',
  'pub.1053157726'],
 'pub.1103275659': ['pub.1113878770', 'pub.1109815383', 'pub.1107354292']}

## Comments about this method

* this approach is straightforward and quick, but it's best used with small datasets
* we create one query per publication (and so on, for an N-degree network)
* if you have lots of publications, it leads to lots of queries, which may not be very efficient


# 3. Method B: Getting citations for multiple publications via a single query

We can use the same query template but instead of looking for a single publication ID, we can put multiple ones in a list. 

So if we combine the two citation lists for "pub.1053279155" and "pub.1103275659", we get 5 + 3 = 8 results in total (no paper cites both, so nothing is counted twice).

*However* then it's down to us to figure out which paper is citing which!

In [None]:
%dsldf search publications where reference_ids in [ "pub.1053279155" , "pub.1103275659"] return publications[id+doi+title+year]

Returned Publications: 8 (total = 8)


|   | doi | id | title | year |
|---|-----|----|-------|------|
| 0 | 10.1016/j.future.2019.04.038 | pub.1113878770 | Perception layer security in Internet of Things | 2019 |
| 1 | 10.1007/s11227-018-2338-1 | pub.1103275659 | Towards ontology-based multilingual URL filter... | 2018 |
| 2 | 10.1109/isncc.2018.8530984 | pub.1109815383 | A Fault Tolerant Approach for Malicious URL Fi... | 2018 |
| 3 | 10.1109/access.2018.2872928 | pub.1107354292 | Social Internet of Vehicles: Complexity, Adapt... | 2018 |
| 4 | 10.1515/iwp-2015-0057 | pub.1012651711 | Das Experteninterview als zentrale Methode der... | 2015 |
| 5 | 10.1007/978-3-319-24129-6_3 | pub.1005502446 | Challenges for Ontological Engineering in the ... | 2015 |
| 6 | 10.1007/978-3-642-24809-2_10 | pub.1008922470 | Transforming a Flat Metadata Schema to a Seman... | 2012 |
| 7 | 10.1007/978-3-642-24731-6_38 | pub.1053157726 | Practice-Based Ontologies: A New Approach to A... | 2011 |


To resolve the citation data we got above, we must also extract the full reference list of each citing paper (by including `reference_ids` in the results) and then recreate the citation graph programmatically. For example:

In [None]:
seed = [ "pub.1053279155" , "pub.1103275659"]

In [None]:
data = dsl.query(f"""search publications where reference_ids in {json.dumps(seed)} return publications[id+doi+title+year+reference_ids]""")

In [None]:
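def build_network_dict(seed, pubs_list):
  """Return {cited_id: [citing_ids]} for each publication ID in `seed`."""
  network = {x: [] for x in seed}  # one empty list of citing papers per seed ID
  for pub in pubs_list:
    for key in network:
      # a record may come back without `reference_ids`, hence the .get() guard
      if pub.get('reference_ids') and key in pub['reference_ids']:
        network[key].append(pub['id'])
  return network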

A simple way to represent the citation network is a dictionary data structure with `'cited_paper' : [citing papers]`

In [None]:
network1 = build_network_dict(seed, data.publications)
network1

{'pub.1053279155': ['pub.1103275659',
  'pub.1012651711',
  'pub.1005502446',
  'pub.1008922470',
  'pub.1053157726'],
 'pub.1103275659': ['pub.1113878770', 'pub.1109815383', 'pub.1107354292']}
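
As a variation, the sketch below builds the same `'cited_paper' : [citing papers]` dictionary without testing every seed against every record: it intersects each record's references with the set of seed IDs. The function name `build_network_indexed` and the sample records are made up for illustration; this is one possible alternative, not the notebook's own method.

```python
def build_network_indexed(seed, pubs_list):
    """Map each seed ID to the IDs of the publications that cite it."""
    seed_set = set(seed)
    network = {pub_id: [] for pub_id in seed}
    for pub in pubs_list:
        # `reference_ids` may be missing or None, so fall back to an empty list
        refs = pub.get("reference_ids") or []
        # Set intersection keeps only the seeds this record cites (each once)
        for cited in seed_set.intersection(refs):
            network[cited].append(pub["id"])
    return network

# Hand-written sample records, shaped like the API results (IDs are invented)
sample = [
    {"id": "pub.X1", "reference_ids": ["pub.A", "pub.B", "pub.Z"]},
    {"id": "pub.X2", "reference_ids": ["pub.B"]},
    {"id": "pub.X3"},  # a record without references
]
network = build_network_indexed(["pub.A", "pub.B"], sample)
print(network)
```

With a handful of seeds the difference is negligible; it may matter more when the seed list grows to hundreds of IDs.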

## Creating a second-level citations network

Let's now create a second level citations network!

This means going through all pubs citing the two seed-papers, and getting all the citing publications for them as well. 

In [None]:
all_citing_papers = []
for x in network1.values():
  all_citing_papers += x
all_citing_papers = list(set(all_citing_papers))

In [None]:
all_citing_papers

['pub.1005502446',
 'pub.1103275659',
 'pub.1012651711',
 'pub.1107354292',
 'pub.1109815383',
 'pub.1053157726',
 'pub.1113878770',
 'pub.1008922470']
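
Because a `set` has no guaranteed order, the list above can come out in a different order on each run. If a stable order is useful (for example, to compare runs), a small variation is to sort the de-duplicated IDs. The sketch below uses the first-level network shown earlier, copied into a variable named `example_network` so it does not overwrite anything in the notebook.

```python
from itertools import chain

# First-level network copied from the output above
example_network = {
    "pub.1053279155": ["pub.1103275659", "pub.1012651711", "pub.1005502446",
                       "pub.1008922470", "pub.1053157726"],
    "pub.1103275659": ["pub.1113878770", "pub.1109815383", "pub.1107354292"],
}

# Flatten all lists of citing papers, drop duplicates, then sort
citing_sorted = sorted(set(chain.from_iterable(example_network.values())))
print(citing_sorted)
```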

Now let's extract the network structure as we did before:

In [None]:
data2 = dsl.query(f"""search publications where reference_ids in {json.dumps(all_citing_papers)} return publications[id+doi+title+year+reference_ids]""")
network2 = build_network_dict(all_citing_papers, data2.publications)
network2

{'pub.1005502446': [],
 'pub.1008922470': ['pub.1089701016',
  'pub.1026187633',
  'pub.1002394460',
  'pub.1012381129',
  'pub.1046653745',
  'pub.1016243129'],
 'pub.1012651711': ['pub.1101318936'],
 'pub.1053157726': ['pub.1109914120',
  'pub.1113063906',
  'pub.1099624152',
  'pub.1104531912',
  'pub.1011868512',
  'pub.1001626874'],
 'pub.1103275659': ['pub.1113878770', 'pub.1109815383', 'pub.1107354292'],
 'pub.1107354292': ['pub.1113878770', 'pub.1113902569', 'pub.1113065837'],
 'pub.1109815383': [],
 'pub.1113878770': ['pub.1121687821', 'pub.1121692873']}

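Finally we can merge the two levels into a single dataset. Note that a key present in both dictionaries appears only once in the result, with the value from `network2` taking precedence; here the only overlapping key, 'pub.1103275659', has the same list of citing papers in both.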

In [None]:
final = dict(network1, **network2 )
final

{'pub.1005502446': [],
 'pub.1008922470': ['pub.1089701016',
  'pub.1026187633',
  'pub.1002394460',
  'pub.1012381129',
  'pub.1046653745',
  'pub.1016243129'],
 'pub.1012651711': ['pub.1101318936'],
 'pub.1053157726': ['pub.1109914120',
  'pub.1113063906',
  'pub.1099624152',
  'pub.1104531912',
  'pub.1011868512',
  'pub.1001626874'],
 'pub.1053279155': ['pub.1103275659',
  'pub.1012651711',
  'pub.1005502446',
  'pub.1008922470',
  'pub.1053157726'],
 'pub.1103275659': ['pub.1113878770', 'pub.1109815383', 'pub.1107354292'],
 'pub.1107354292': ['pub.1113878770', 'pub.1113902569', 'pub.1113065837'],
 'pub.1109815383': [],
 'pub.1113878770': ['pub.1121687821', 'pub.1121692873']}
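
One thing to keep in mind with `dict(a, **b)`: when the same key occurs in both dictionaries, the list from the second one replaces the list from the first. That is harmless here because the overlapping lists are identical, but if the levels came from different or partial queries, a merge that unions the lists may be safer. The sketch below shows the difference on invented data; `merge_networks` is a hypothetical helper, not part of dimcli.

```python
def merge_networks(*networks):
    """Merge {cited: [citing]} dicts, unioning the lists of shared keys."""
    merged = {}
    for network in networks:
        for cited, citing in network.items():
            bucket = merged.setdefault(cited, [])
            for pub_id in citing:
                if pub_id not in bucket:  # keep first-seen order, no duplicates
                    bucket.append(pub_id)
    return merged

# Invented example where the shared key has different lists
level1 = {"pub.A": ["pub.B", "pub.C"]}
level2 = {"pub.A": ["pub.C", "pub.D"], "pub.B": ["pub.E"]}

overwritten = dict(level1, **level2)    # level2's list wins for "pub.A"
unioned = merge_networks(level1, level2)
print(overwritten)
print(unioned)
```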

## Building a Simple dataviz

We can build a simple visualization using the pyvis library. 

NOTE: the 'mygraph.html' file will be saved in the local directory (in Colab, open the 'Files' left panel and download it to your computer to open it).

In [None]:
!pip install pyvis --quiet
from pyvis.network import Network

In [None]:
net = Network()

nodes = []
for x in final:
  nodes.append(x)
  nodes += final[x]
nodes = list(set(nodes))

net.add_nodes(nodes) # each publication ID is used as both node id and label

for x in final:
  for target in final[x]:
    net.add_edge(x, target)

net.show("mygraph.html")
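
If you would rather load the network into another tool, the dictionary can be flattened into a plain edge list first. The sketch below uses only the standard library and invented IDs; the helper name `to_edge_list` and the column names `citing` and `cited` are choices made for this example.

```python
import csv
import io

def to_edge_list(network):
    """Flatten {cited: [citing, ...]} into sorted (citing, cited) pairs."""
    return sorted(
        {(citing, cited) for cited, sources in network.items() for citing in sources}
    )

# Invented example network
example = {"pub.A": ["pub.B", "pub.C"], "pub.B": ["pub.C"], "pub.D": []}
edges = to_edge_list(example)

# Write the edges as CSV text (an in-memory buffer stands in for a file)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["citing", "cited"])
writer.writerows(edges)
print(buffer.getvalue())
```

Note that publications with no citations (like `pub.D` here) produce no edges, so they would need to be listed separately if isolated nodes matter.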

# Final considerations 

## Querying for more than 1000 results

Each API query can return a maximum of 1000 records, so you must use the limit/skip syntax to get more. 

See the [paginating results section in the docs](https://docs.dimensions.ai/dsl/language.html#paginating-results) for more info.
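
As a rough sketch of what pagination looks like, the helper below generates one query per page by appending `limit` and `skip` to a base query. The function name `paginated_queries` is made up, and the total of 2500 records is an assumed figure; in practice the total would come from the first response.

```python
def paginated_queries(base_query, total, page_size=1000):
    """Yield one query per page, using the DSL `limit` / `skip` keywords."""
    for skip in range(0, total, page_size):
        yield f"{base_query} limit {page_size} skip {skip}"

base = 'search publications where reference_ids in ["pub.1053279155"] return publications[id]'
pages = list(paginated_queries(base, total=2500))  # 2500 is an assumed total
for q in pages:
    print(q)
```

dimcli also provides a `query_iterative` method intended to handle this kind of loop for you; check the dimcli documentation for its exact behaviour.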

## Querying for more than 50K results

Even with limit/skip, one can only download 50k records for each single query. 

So if your list of publication IDs is getting too long (e.g. > 300), consider splitting it into chunks and adding an extra loop to go through them, so that no single query hits the upper limit.
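
A minimal sketch of the chunking idea, using invented IDs and a deliberately tiny chunk size so the output stays readable. The helper name `chunks` is illustrative.

```python
import json

def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

pub_ids = [f"pub.{n}" for n in range(1, 8)]  # 7 invented IDs

queries = [
    f"search publications where reference_ids in {json.dumps(chunk)} "
    "return publications[id+reference_ids]"
    for chunk in chunks(pub_ids, 3)
]
for q in queries:
    print(q)
```

Each query would then be sent separately and the results combined. A citing paper can show up in the results of more than one chunk, so it is worth de-duplicating by `id` before rebuilding the network.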

## Some publications can have MANY citations 

For example, a single publication can have 200K+ citations: https://app.dimensions.ai/details/publication/pub.1076750128 

That's quite an exceptional case, but there are several publications with more than 10k citations each. When you encounter such cases you will hit the 50k limit pretty quickly, so keep an eye out for them and consider 'slicing' the data in different ways, e.g. by year or journal, so that each query returns fewer results.
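
One way to slice by year is to generate a separate query for each publication year of the citing papers. The sketch below assumes the `year` filter can be combined with `reference_ids` using `and` in the `where` clause; the helper name `queries_by_year` and the year range are arbitrary choices for illustration.

```python
import json

def queries_by_year(pub_ids, first_year, last_year):
    """Return {year: query}, one query per year of the citing publications."""
    ids = json.dumps(list(pub_ids))
    return {
        year: (
            f"search publications where reference_ids in {ids} "
            f"and year = {year} return publications[id+reference_ids]"
        )
        for year in range(first_year, last_year + 1)
    }

# The highly cited publication mentioned above; the year range is arbitrary
queries = queries_by_year(["pub.1076750128"], 2017, 2019)
for year, q in queries.items():
    print(year, q)
```

Each per-year query is still subject to the same limits, so a very highly cited paper may need finer slices.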

## Pre-checking citations counts 

The `times_cited` and `recent_citations` fields of [publications](https://docs.dimensions.ai/dsl/data-sources.html#publications) can be used to check how many citations a paper has (note: `recent_citations` counts the last two years only). 

By using these aggregated figures, it is possible to get a feeling for the size of the citation data we'll have to deal with before setting up a proper data extraction pipeline. 
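
A minimal sketch of such a pre-check: build a query that returns the citation counts for the seed IDs, then add up `times_cited` to estimate how many citing records to expect. The helper names are made up, and the sample records are typed by hand (their counts simply mirror the result sizes seen earlier) rather than fetched from the API.

```python
import json

def citation_count_query(pub_ids):
    """Build a query returning citation counts for the given publication IDs."""
    ids = json.dumps(list(pub_ids))
    return f"search publications where id in {ids} return publications[id+times_cited+recent_citations]"

def estimate_citing_records(records):
    """Sum `times_cited`; an upper bound, since one paper may cite several seeds."""
    return sum(rec.get("times_cited") or 0 for rec in records)

seed = ["pub.1053279155", "pub.1103275659"]
query = citation_count_query(seed)

# Hand-written sample records standing in for the API response
sample_records = [
    {"id": "pub.1053279155", "times_cited": 5},
    {"id": "pub.1103275659", "times_cited": 3},
]
estimate = estimate_citing_records(sample_records)
needs_slicing = estimate > 50_000  # the per-query ceiling mentioned above
print(query)
print(estimate, needs_slicing)
```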


---
# Want to learn more?

Check out the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many tutorials and reusable Jupyter notebooks for scholarly data analytics. 