# Transforming API results into Pandas dataframes

DimCli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas [dataframe objects](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). 

Dataframes are easy to sort, analyse, export as CSV and use within visualisation software.

>  [pandas](https://pandas.pydata.org/pandas-docs/stable/) is a popular software library written for the Python programming language for data manipulation and analysis.

In [2]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai"

!pip install dimcli -U --quiet

# import all libraries and login
import pandas
import dimcli
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

DimCli v0.6.2.1 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


## 1. Getting started: the `as_dataframe` method

This utility method quickly turns any query result into a dataframe.

In [3]:
# we'll reuse this query later on 
query = """search publications for "graphene" 
            where year in [2013:2019] 
            return publications sort by times_cited limit 1000"""
res = dsl.query(query)

Returned Publications: 1000 (total = 405843)


In [4]:
df = res.as_dataframe()
df.head(10)

Unnamed: 0,author_affiliations,volume,pages,year,title,id,type,issue,journal.id,journal.title
0,"[[{'first_name': 'Manish', 'last_name': 'Chhow...",5.0,263-275,2013,The chemistry of two-dimensional layered trans...,pub.1050119463,article,4.0,jour.1041224,Nature Chemistry
1,"[[{'first_name': 'A. K.', 'last_name': 'Geim',...",499.0,419-425,2013,Van der Waals heterostructures,pub.1024857999,article,7459.0,jour.1018957,Nature
2,,,,2013,"Nanoenergy, Nanotechnology Applied for Energy ...",pub.1031762191,book,,,
3,"[[{'first_name': 'C.', 'last_name': 'Patrignan...",40.0,100001,2016,Review of Particle Physics,pub.1059158429,article,10.0,jour.1327822,Chinese Physics C
4,"[[{'first_name': 'John B.', 'last_name': 'Good...",135.0,1167-76,2013,The Li-ion rechargeable battery: a perspective.,pub.1019126274,article,4.0,jour.1081898,Journal of the American Chemical Society
5,"[[{'first_name': 'Andrea C.', 'last_name': 'Fe...",8.0,235-246,2013,Raman spectroscopy as a versatile tool for stu...,pub.1015305822,article,4.0,jour.1037429,Nature Nanotechnology
6,"[[{'first_name': 'Han', 'last_name': 'Liu', 'c...",8.0,4033-41,2014,Phosphorene: an unexplored 2D semiconductor wi...,pub.1009826879,article,4.0,jour.1038917,ACS Nano
7,"[[{'first_name': 'Sheneve Z.', 'last_name': 'B...",7.0,2898-926,2013,"Progress, challenges, and opportunities in two...",pub.1038090434,article,4.0,jour.1038917,ACS Nano
8,"[[{'first_name': 'Mingsheng', 'last_name': 'Xu...",113.0,3766-98,2013,Graphene-like two-dimensional materials.,pub.1022830330,article,5.0,jour.1077147,Chemical Reviews
9,"[[{'first_name': 'Oriol', 'last_name': 'Lopez-...",8.0,497-501,2013,Ultrasensitive photodetectors based on monolay...,pub.1023181230,article,7.0,jour.1037429,Nature Nanotechnology


Pandas dataframes offer a myriad of utilities for inspecting data. Check out the [official docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) or search for a [pandas tutorial](https://www.google.com/search?q=pandas+tutorial) to learn more about them.

In [5]:
# the table shape
df.shape

(1000, 10)

In [6]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()

2013    357
2014    295
2015    201
2016     92
2017     47
2018      8
Name: year, dtype: int64

In [7]:
# eg distribution of publication [type]
df['type'].value_counts()

article      995
monograph      2
book           2
chapter        1
Name: type, dtype: int64
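As a small illustration of the sorting and CSV export mentioned in the introduction, the sketch below builds a tiny dataframe by hand from three of the records shown above, so that it runs without API credentials. It relies only on the standard pandas `sort_values` and `to_csv` methods. The `sample` variable and the in-memory buffer are assumptions made for the example; with real data you would call the same methods on `df` and pass a file name to `to_csv`.

```python
import io

import pandas as pd

# Hand-copied subset of the records shown above (illustrative only)
sample = pd.DataFrame([
    {"id": "pub.1050119463", "year": 2013, "type": "article", "journal.title": "Nature Chemistry"},
    {"id": "pub.1059158429", "year": 2016, "type": "article", "journal.title": "Chinese Physics C"},
    {"id": "pub.1009826879", "year": 2014, "type": "article", "journal.title": "ACS Nano"},
])

# Sort by publication year, most recent first
by_year = sample.sort_values("year", ascending=False)

# Export to CSV; a StringIO buffer stands in for a file path here
buffer = io.StringIO()
by_year.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```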

## 2. Helper Methods for 'Publications' queries

What follows are specialized versions of the `as_dataframe` method for result sets composed of publication records.

###  Extracting authors: `as_dataframe_authors`

Publication authors are usually returned by the Dimensions API inside a nested JSON object under the `author_affiliations` key.

> Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This method quickly extracts that data and returns a dataframe with **one row per author**.
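To make the idea of "one row per author" concrete, here is a minimal pure-Python sketch of the flattening step. It is not DimCli's actual implementation. The `flatten_authors` function is a hypothetical helper, and the assumed input shape (a list of author groups, each a list of author dicts) is inferred from the `author_affiliations` column in the output above.

```python
def flatten_authors(publications):
    """Return one flat dict per author, tagged with the publication id."""
    rows = []
    for pub in publications:
        # Assumed shape: a list of author groups, each a list of author dicts
        for group in pub.get("author_affiliations") or []:
            for author in group:
                row = dict(author)  # copy so the input is not mutated
                row["pub_id"] = pub["id"]
                rows.append(row)
    return rows


publications = [
    {
        "id": "pub.1050119463",
        "author_affiliations": [[
            {"first_name": "Manish", "last_name": "Chhowalla"},
            {"first_name": "Hyeon Suk", "last_name": "Shin"},
        ]],
    },
    # Records without author data (e.g. some books) simply produce no rows
    {"id": "pub.1031762191"},
]

rows = flatten_authors(publications)
for row in rows:
    print(row)
```

A list of flat dicts like `rows` can be passed straight to `pandas.DataFrame(rows)` to obtain a dataframe.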

In [8]:
authors = res.as_dataframe_authors()
authors.head()

Unnamed: 0,first_name,last_name,corresponding,orcid,current_organization_id,researcher_id,affiliations,pub_id
0,Manish,Chhowalla,True,,grid.430387.b,ur.0633062306.03,"[{'id': 'grid.430387.b', 'name': 'Rutgers, The...",pub.1050119463
1,Hyeon Suk,Shin,,,grid.42687.3f,ur.07617630407.83,"[{'id': 'grid.42687.3f', 'name': 'Ulsan Nation...",pub.1050119463
2,Goki,Eda,,,grid.4280.e,ur.01150450507.27,"[{'id': 'grid.4280.e', 'name': 'National Unive...",pub.1050119463
3,Lain-Jong,Li,,['0000-0002-4059-7783'],grid.45672.32,ur.01313340113.13,"[{'id': 'grid.28665.3f', 'name': 'Academia Sin...",pub.1050119463
4,Kian Ping,Loh,,['0000-0002-1491-743X'],grid.4280.e,ur.0752174033.73,"[{'id': 'grid.4280.e', 'name': 'National Unive...",pub.1050119463


Using the authors dataframe, we can easily get the top ten values for `current_organization_id`. 

In [9]:
authors['current_organization_id'].value_counts()[:10]

                 188
grid.59025.3b    144
grid.12527.33    116
grid.5379.8      105
grid.59053.3a     85
grid.19006.3e     78
grid.168010.e     76
grid.13402.34     72
grid.21940.3e     69
grid.116068.8     65
Name: current_organization_id, dtype: int64

> Explanation: the most frequent non-empty value turns out to be grid.59025.3b, i.e. [Nanyang Technological University in Singapore](https://www.grid.ac/institutes/grid.59025.3b). The first result is empty, meaning that Dimensions has no `current_organization_id` for those authors.

### Extracting Affiliations: `as_dataframe_authors_affiliations`

As you can see from the results of the previous section, each author's `affiliations` value is yet another nested JSON object.

> Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The `as_dataframe_authors_affiliations` method quickly extracts that affiliation data and returns a dataframe with **one row per affiliation**.

This can be useful e.g. if one wants to count research organizations at *the time of writing* (as opposed to `current_organization_id`, which is the *most recent organization* of a researcher). 
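The difference between the two counts can be sketched in plain Python. The example below is an illustration under stated assumptions, not DimCli's code: `flatten_affiliations` is a hypothetical helper, and the two author records are trimmed copies of rows shown in the previous section. It expands each author's affiliations into separate rows with an `aff_` prefix, then counts organizations both ways.

```python
from collections import Counter


def flatten_affiliations(authors):
    """Return one row per (author, affiliation) pair, prefixing affiliation keys with 'aff_'."""
    rows = []
    for author in authors:
        for aff in author.get("affiliations") or []:
            row = {f"aff_{key}": value for key, value in aff.items()}
            row["researcher_id"] = author.get("researcher_id")
            row["pub_id"] = author.get("pub_id")
            rows.append(row)
    return rows


# Trimmed author records based on the output of the previous section
authors = [
    {"researcher_id": "ur.01313340113.13", "pub_id": "pub.1050119463",
     "current_organization_id": "grid.45672.32",
     "affiliations": [{"id": "grid.28665.3f", "name": "Academia Sinica"}]},
    {"researcher_id": "ur.01150450507.27", "pub_id": "pub.1050119463",
     "current_organization_id": "grid.4280.e",
     "affiliations": [{"id": "grid.4280.e", "name": "National University of Singapore"}]},
]

aff_rows = flatten_affiliations(authors)

# Organizations at the time of writing vs. most recent organizations
at_time_of_writing = Counter(row["aff_id"] for row in aff_rows)
current = Counter(author["current_organization_id"] for author in authors)
print(at_time_of_writing)
print(current)
```

For the first author the two counts disagree: the affiliation recorded on the publication differs from the researcher's current organization.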

In [10]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()

Unnamed: 0,aff_id,aff_name,aff_city,aff_city_id,aff_country,aff_country_code,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,grid.430387.b,"Rutgers, The State University of New Jersey",New Brunswick,5101720.0,United States,US,New Jersey,US-NJ,pub.1050119463,ur.0633062306.03,Manish,Chhowalla
1,grid.42687.3f,Ulsan National Institute of Science and Techno...,Ulsan,1833750.0,South Korea,KR,,,pub.1050119463,ur.07617630407.83,Hyeon Suk,Shin
2,grid.4280.e,National University of Singapore,Singapore,1880250.0,Singapore,SG,,,pub.1050119463,ur.01150450507.27,Goki,Eda
3,grid.28665.3f,Academia Sinica,Taipei,1668340.0,Taiwan,TW,,,pub.1050119463,ur.01313340113.13,Lain-Jong,Li
4,grid.4280.e,National University of Singapore,Singapore,1880250.0,Singapore,SG,,,pub.1050119463,ur.0752174033.73,Kian Ping,Loh


In [11]:
affiliations.describe(include="all")

Unnamed: 0,aff_id,aff_name,aff_city,aff_city_id,aff_country,aff_country_code,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
count,6999.0,6999,6999.0,6999.0,6999,6999,6999.0,6999.0,6999,6999.0,6999,6999
unique,750.0,1061,460.0,463.0,53,53,56.0,56.0,984,4159.0,3415,1891
top,,Nanyang Technological University,,,China,CN,,,pub.1019661721,,Wei,Zhang
freq,925.0,220,925.0,930.0,1972,1972,5120.0,5120.0,105,159.0,62,294


Let's get the top ten values for `aff_id`. 

In [12]:
affiliations['aff_id'].value_counts()[:10]

                 925
grid.59025.3b    220
grid.12527.33    134
grid.5379.8      115
grid.19006.3e    100
grid.168010.e    100
grid.21940.3e     96
grid.59053.3a     91
grid.21729.3f     87
grid.116068.8     86
Name: aff_id, dtype: int64

> Explanation: the most frequent non-empty value is still [grid.59025.3b](https://www.grid.ac/institutes/grid.59025.3b), which suggests that many authors' current organization is the same one they were affiliated with when they published these articles.

Another example: we can now easily analyze the data by country too. 

In [13]:
affiliations['aff_country'].value_counts()[:10]

China             1972
United States     1570
                   925
South Korea        342
Singapore          327
United Kingdom     312
Australia          171
Japan              158
Canada             138
Germany            126
Name: aff_country, dtype: int64

> Explanation: the largest share of affiliations in this dataset is in China (1972), followed by the United States (1570). The empty value (925) corresponds to affiliations with no country information.

### Extracting Concepts: `as_dataframe_concepts`

The `as_dataframe_concepts` method quickly extracts all concepts attached to publications, with **one row per concept**, which makes operations like counting or plotting the results easier.

NOTE: concepts are normalized keywords describing the main topics of a publication, which are automatically derived from the publication text using machine learning. In the JSON data, concepts are returned as an ordered list like this (the first concepts are the most relevant): 

```
{'concepts': ['electrochemical conversion',
  'conversion',
  'CO2',
  'formate',
  'formic acid',
  'acid'],
 'id': 'pub.1122072646'}
```

The `as_dataframe_concepts` method includes in the results three extra measures that are handy to carry out further analyses:

1. `position`: an integer representing the position of the concept in the list, in absolute terms, e.g. the first concept gets 1, while the fifth gets 5.
2. `score`: a float representing the relevance of the concept, normalized against the total number of concepts for that publication. So if a publication has 10 concepts in total, the first one gets 1, the second one 0.9, etc.
3. `frequency`: an integer representing how often that concept occurs in the returned dataset, i.e. how many documents have that concept.

By sorting and segmenting concepts using these three parameters, it is possible to fine-tune the concept extraction algorithm that is most suitable to the application at hand.
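The three measures can be reproduced in a few lines of plain Python. This is a sketch under assumptions, not DimCli's actual code: the `concept_rows` helper is hypothetical, and the score formula `(total - position + 1) / total` is inferred from the description above (10 concepts give 1, 0.9, and so on). The second publication record is made up for the example.

```python
from collections import Counter


def concept_rows(publications):
    """Return one dict per concept with position, score and frequency."""
    # frequency: number of publications in which each concept appears
    frequency = Counter()
    for pub in publications:
        frequency.update(set(pub.get("concepts") or []))

    rows = []
    for pub in publications:
        concepts = pub.get("concepts") or []
        total = len(concepts)
        for position, name in enumerate(concepts, start=1):
            rows.append({
                "name": name,
                "position": position,
                # Assumed formula: linear decay from 1 down to 1/total
                "score": round((total - position + 1) / total, 2),
                "frequency": frequency[name],
                "pubid": pub["id"],
            })
    return rows


publications = [
    {"id": "pub.1122072646",
     "concepts": ["electrochemical conversion", "conversion", "CO2",
                  "formate", "formic acid", "acid"]},
    # Hypothetical second record, added only to show frequency > 1
    {"id": "pub.hypothetical", "concepts": ["conversion", "acid"]},
]

rows = concept_rows(publications)
for row in rows:
    print(row)
```

In this sample, `conversion` and `acid` get a frequency of 2 because they occur in both records, while their scores differ per publication depending on where they sit in each list.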






In [14]:
concepts = dsl.query("""search publications for "graphene" where year=2019 return publications[id+title+year+concepts] limit 100""").as_dataframe_concepts()
concepts.head()

Returned Publications: 100 (total = 100511)


Unnamed: 0,name,position,score,frequency,pubid,title,year
0,graphene oxide synthesis,1,1.0,1,pub.1120445400,Graphene oxide synthesis by facile method and ...,2019
1,synthesis,2,0.8,8,pub.1120445400,Graphene oxide synthesis by facile method and ...,2019
2,facile method,3,0.6,1,pub.1120445400,Graphene oxide synthesis by facile method and ...,2019
3,method,4,0.4,27,pub.1120445400,Graphene oxide synthesis by facile method and ...,2019
4,characterization,5,0.2,4,pub.1120445400,Graphene oxide synthesis by facile method and ...,2019


In [15]:
concepts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4137 entries, 0 to 44
Data columns (total 7 columns):
name         4137 non-null object
position     4137 non-null int64
score        4137 non-null float64
frequency    4137 non-null int64
pubid        4137 non-null object
title        4137 non-null object
year         4137 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 258.6+ KB


E.g. sorting by **position** highlights concepts that are important but don't appear in many documents.

In [16]:
concepts.groupby("name").sum().sort_values("position", ascending=True)

Unnamed: 0_level_0,position,score,frequency
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
system stores,1,1.00,1
combined study,1,1.00,1
cerium compounds,1,1.00,1
green reduction,1,1.00,1
polymer surface,1,1.00,1
...,...,...,...
results,591,5.61,289
structure,619,14.17,576
effects,652,7.52,324
applications,667,15.92,900


E.g. sorting by **frequency** highlights concepts that are shared by many documents in our dataset (i.e. the one generated by the original query). Because the sort is ascending, the most frequent concepts appear at the bottom of the table.

In [17]:
concepts.drop_duplicates("name").sort_values("frequency", ascending=True)

Unnamed: 0,name,position,score,frequency,pubid,title,year
0,graphene oxide synthesis,1,1.00,1,pub.1120445400,Graphene oxide synthesis by facile method and ...,2019
69,pure TiO 2,70,0.23,1,pub.1123014472,Вплив модифікування діоксиду титану сіркою та ...,2019
70,gap narrowing,71,0.22,1,pub.1123014472,Вплив модифікування діоксиду титану сіркою та ...,2019
71,nanocomposite samples,72,0.21,1,pub.1123014472,Вплив модифікування діоксиду титану сіркою та ...,2019
74,safranine T,75,0.18,1,pub.1123014472,Вплив модифікування діоксиду титану сіркою та ...,2019
...,...,...,...,...,...,...,...
6,structure,7,0.82,24,pub.1112946271,Synthesis and modelling of the mechanical prop...,2019
3,method,4,0.40,27,pub.1120445400,Graphene oxide synthesis by facile method and ...,2019
17,applications,18,0.11,30,pub.1122256034,The Surface of Polymers,2019
7,materials,8,0.79,35,pub.1112946271,Synthesis and modelling of the mechanical prop...,2019


E.g. sorting by **score** identifies concepts that are both frequent and important.

In [18]:
concepts.groupby("name").sum().sort_values("score", ascending=False)

Unnamed: 0_level_0,position,score,frequency
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
properties,572,25.75,1296
materials,764,21.59,1225
method,532,17.45,729
applications,667,15.92,900
structure,619,14.17,576
...,...,...,...
ecological point,77,0.01,1
probe molecules,69,0.01,1
delivery,100,0.01,1
conventional methods,83,0.01,1


## 3. Helper Methods for 'Grants' queries

###  Extracting Funders: `as_dataframe_funders`

Grant funders are usually returned by the Dimensions API inside a nested JSON object under the `funders` key. 

This method quickly extracts that data and returns a dataframe with **one row per funder**.
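A comparable result can be approximated with plain pandas, using `pandas.json_normalize` with its `record_path`, `meta` and `meta_prefix` parameters. The sketch below is not how DimCli implements `as_dataframe_funders`. The grant records are hypothetical and heavily trimmed, and the assumption is that each grant carries a `funders` list of dicts alongside its own `id` and `title`.

```python
import pandas as pd

# Hypothetical, trimmed grant records (ids, titles and funder pairings are made up;
# real API results carry many more fields)
grants = [
    {"id": "grant.A", "title": "Grant A",
     "funders": [{"id": "grid.421091.f", "acronym": "EPSRC"}]},
    {"id": "grant.B", "title": "Grant B",
     "funders": [{"id": "grid.270680.b", "acronym": "EC"},
                 {"id": "grid.52788.30", "acronym": "WT"}]},
]

# One row per funder; grant-level fields are repeated on each row and
# prefixed with 'grant_' so they do not clash with the funder's own 'id'
funders = pd.json_normalize(
    grants,
    record_path="funders",
    meta=["id", "title"],
    meta_prefix="grant_",
)
print(funders)
```

Note that `json_normalize` raises a `KeyError` if a record lacks the `funders` key, so records without funders would need to be filtered out or given an empty list first.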

In [19]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 9204)


In [20]:
res.as_dataframe_funders().head(10)

Unnamed: 0,id,city_name,types,acronym,state_name,latitude,name,country_name,linkout,longitude,grant_id,grant_title,grant_start_date,grant_end_date
0,grid.421091.f,Swindon,[Government],EPSRC,England,51.567093,Engineering and Physical Sciences Research Cou...,United Kingdom,[https://www.epsrc.ac.uk/],-1.784602,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
1,grid.270680.b,Brussels,[Government],EC,,50.85165,European Commission,Belgium,[http://ec.europa.eu/index_en.htm],4.36367,grant.8585457,Estimating the Prevalence of AntiMicrobial Res...,2020-01-01,2021-12-31
2,grid.270680.b,Brussels,[Government],EC,,50.85165,European Commission,Belgium,[http://ec.europa.eu/index_en.htm],4.36367,grant.8586121,Earth observation service for preventive contr...,2019-11-01,2022-10-31
3,grid.454774.1,New Delhi,[Government],DBT,,28.601473,Department of Biotechnology,India,[http://www.dbtindia.nic.in/],77.23578,grant.8657420,Translational research and clinical developmen...,2019-10-07,2022-10-07
4,grid.248883.d,Ottawa,[Government],CIHR,Ontario,45.381893,Canadian Institutes of Health Research,Canada,[http://www.cihr-irsc.gc.ca/e/193.html],-75.745224,grant.8527034,Mechanisms of Leishmania dissemination and tra...,2019-10-01,2024-09-30
5,grid.52788.30,London,[Nonprofit],WT,,51.525867,Wellcome Trust,United Kingdom,[http://www.wellcome.ac.uk/],-0.135005,grant.8558648,Molecular mechanisms of carbohydrate uptake in...,2019-10-01,2022-09-30
6,grid.52788.30,London,[Nonprofit],WT,,51.525867,Wellcome Trust,United Kingdom,[http://www.wellcome.ac.uk/],-0.135005,grant.8103743,The Chemical Empire: A New History of Syntheti...,2019-10-01,2023-10-01
7,grid.425888.b,Bern,[Government],SNF,,46.94923,Swiss National Science Foundation,Switzerland,[http://www.snf.ch/en],7.432395,grant.8599112,Gauging Global Governance: The Effectiveness o...,2019-10-01,2022-09-30
8,grid.270680.b,Brussels,[Government],EC,,50.85165,European Commission,Belgium,[http://ec.europa.eu/index_en.htm],4.36367,grant.8585082,Understanding the roles of pathogen infection ...,2019-10-01,2021-09-30
9,grid.457875.c,Arlington,[Government],NSF MPS,Virginia,38.880566,Directorate for Mathematical & Physical Sciences,United States,[http://www.nsf.gov/dir/index.jsp?org=MPS],-77.11099,grant.8566624,Collaborative Research: Principal Component An...,2019-10-01,2022-09-30


### Extracting investigators: `as_dataframe_investigators`

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the `investigator_details` sub-key. 

This method quickly extracts that data and returns a dataframe with **one row per investigator**.

> NOTE: `investigator_details` is not returned by default in a grants query, so it must be requested explicitly in the query's `return` statement.
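When the list of fields to return changes often, the query string can be assembled programmatically. The `build_query` function below is a hypothetical convenience helper, not part of DimCli. It simply joins field names with `+` inside square brackets, following the syntax used in the queries in this notebook, and does not escape quotes in the search term.

```python
def build_query(source, term, fields, limit=1000):
    """Compose a DSL search string with an explicit list of return fields."""
    field_list = "+".join(fields)
    return f'search {source} for "{term}" return {source}[{field_list}] limit {limit}'


query = build_query("grants", "malaria", ["basics", "investigator_details"])
print(query)
# search grants for "malaria" return grants[basics+investigator_details] limit 1000
```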

In [21]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigator_details] limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 9204)
Field 'project_num' is deprecated in favor of grant_number. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details


In [22]:
res.as_dataframe_investigators().head(10)

Unnamed: 0,first_name,last_name,middle_name,id,role,affiliations,grant_id,grant_title,grant_start_date,grant_end_date
0,Anotida,Madzvamuse,,ur.01301306304.17,PI,"[{'country': 'United Kingdom', 'state_code': N...",grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
1,Gift,Muchatibaya,,ur.01125437006.53,Co-PI,,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
2,Farai,Nyabadza,,ur.016431672433.40,Co-PI,"[{'country': 'South Africa', 'state_code': Non...",grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
3,Zindoga,Mukandavire,,ur.01176116461.04,Co-PI,,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
4,Jasmina,Panovska-Griffiths,,ur.01037661532.12,Co-PI,,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
5,Edward,Lungu,,ur.0644520733.99,Co-PI,,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
6,Hatson John Boscoh,Njagarah,,ur.010541731643.13,Co-PI,,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
7,Eduard,Campillo Funollet,,ur.014162252725.80,Co-PI,"[{'country': 'United Kingdom', 'state_code': N...",grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
8,K,White,,ur.015153160033.34,Co-PI,,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
9,Istvan,Kiss,Zoltan,ur.016546121033.78,Co-PI,"[{'country': 'United Kingdom', 'state_code': N...",grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30


## Conclusions 

Moving Dimensions API results to pandas dataframes **makes it easier** to **analyze the data** and **answer research questions**. 

Note: the examples above only scratch the surface of what can be done with pandas! 