# Transforming API results into Pandas dataframes

DimCli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas [dataframe objects](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). 

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation software.

>  [pandas](https://pandas.pydata.org/pandas-docs/stable/) is a popular software library written for the Python programming language for data manipulation and analysis.

In [1]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai"

!pip install dimcli -U --quiet

# import all libraries and login
import pandas
import dimcli
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

DimCli v0.6.1.2 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


## 1. Getting started: the `as_dataframe` method

This utility method lets you quickly turn any query result into a dataframe.

In [2]:
# we'll reuse this query later on 
query = """search publications for "graphene" 
            where year in [2013:2019] 
            return publications sort by times_cited limit 1000"""
res = dsl.query(query)

Returned Publications: 1000 (total = 405308)


In [3]:
df = res.as_dataframe()
df.head(10)

Unnamed: 0,year,issue,volume,title,id,type,pages,author_affiliations,journal.id,journal.title
0,2013,4.0,5.0,The chemistry of two-dimensional layered trans...,pub.1050119463,article,263-275,"[[{'first_name': 'Manish', 'last_name': 'Chhow...",jour.1041224,Nature Chemistry
1,2013,7459.0,499.0,Van der Waals heterostructures,pub.1024857999,article,419-425,"[[{'first_name': 'A. K.', 'last_name': 'Geim',...",jour.1018957,Nature
2,2013,,,"Nanoenergy, Nanotechnology Applied for Energy ...",pub.1031762191,book,,,,
3,2016,10.0,40.0,Review of Particle Physics,pub.1059158429,article,100001,"[[{'first_name': 'C.', 'last_name': 'Patrignan...",jour.1327822,Chinese Physics C
4,2013,4.0,135.0,The Li-ion rechargeable battery: a perspective.,pub.1019126274,article,1167-76,"[[{'first_name': 'John B.', 'last_name': 'Good...",jour.1081898,Journal of the American Chemical Society
5,2013,4.0,8.0,Raman spectroscopy as a versatile tool for stu...,pub.1015305822,article,235-246,"[[{'first_name': 'Andrea C.', 'last_name': 'Fe...",jour.1037429,Nature Nanotechnology
6,2014,4.0,8.0,Phosphorene: an unexplored 2D semiconductor wi...,pub.1009826879,article,4033-41,"[[{'first_name': 'Han', 'last_name': 'Liu', 'c...",jour.1038917,ACS Nano
7,2013,4.0,7.0,"Progress, challenges, and opportunities in two...",pub.1038090434,article,2898-926,"[[{'first_name': 'Sheneve Z.', 'last_name': 'B...",jour.1038917,ACS Nano
8,2013,5.0,113.0,Graphene-like two-dimensional materials.,pub.1022830330,article,3766-98,"[[{'first_name': 'Mingsheng', 'last_name': 'Xu...",jour.1077147,Chemical Reviews
9,2013,7.0,8.0,Ultrasensitive photodetectors based on monolay...,pub.1023181230,article,497-501,"[[{'first_name': 'Oriol', 'last_name': 'Lopez-...",jour.1037429,Nature Nanotechnology
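
The dotted column names in the table above (`journal.id`, `journal.title`) are what nested JSON objects look like once they are flattened. As a rough illustration only, the sketch below runs `pandas.json_normalize` on two hand-written records. The record values are hypothetical, and this is not necessarily how DimCli builds its dataframe internally.

```python
import pandas as pd

# Hypothetical records, shaped like the 'publications' list in an API response.
records = [
    {"id": "pub.0001", "year": 2013, "type": "article",
     "journal": {"id": "jour.0001", "title": "Example Journal"}},
    {"id": "pub.0002", "year": 2014, "type": "book"},  # no 'journal' key
]

# json_normalize flattens nested objects into dotted column names.
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat)
```

Records that lack a nested key, such as the book here, simply get missing values in the corresponding columns.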


Pandas dataframes offer a myriad of utilities for inspecting data. Check out the [official docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) or search for a [pandas tutorial](https://www.google.com/search?q=pandas+tutorial) to learn more.

In [4]:
# the table shape
df.shape

(1000, 10)

In [5]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()

2013    359
2014    296
2015    202
2016     90
2017     46
2018      7
Name: year, dtype: int64

In [6]:
# eg distribution of publication [type]
df['type'].value_counts()

article      995
book           2
monograph      2
chapter        1
Name: type, dtype: int64
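
Sorting and CSV export work along the same lines. The following is a minimal sketch on a small hypothetical dataframe (the ids and values are placeholders, not real Dimensions data). It writes the CSV to an in-memory buffer so that it runs anywhere.

```python
import io
import pandas as pd

# Hypothetical stand-in for a dataframe of publications.
pubs = pd.DataFrame({
    "id": ["pub.0001", "pub.0002", "pub.0003"],
    "year": [2014, 2013, 2015],
    "type": ["article", "book", "article"],
})

# Sort by year, newest first.
by_year = pubs.sort_values("year", ascending=False)

# Export as CSV without the index column.
buffer = io.StringIO()
by_year.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

To save to disk, pass a file path to `to_csv` instead of the buffer.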

## 2. Helper Methods for 'Publications' queries

What follows are specialized versions of the `as_dataframe` method for result sets composed of publication records.

###  Extracting authors: `as_dataframe_authors`

Publication authors are usually returned by the Dimensions API inside a nested JSON object under the `author_affiliations` key.

> Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This method lets you quickly extract that data and returns a dataframe with **one row per author**.

In [7]:
authors = res.as_dataframe_authors()
authors.head()

Unnamed: 0,first_name,last_name,corresponding,orcid,current_organization_id,researcher_id,affiliations,pub_id
0,Manish,Chhowalla,True,,grid.430387.b,ur.0633062306.03,"[{'id': 'grid.430387.b', 'name': 'Rutgers, The...",pub.1050119463
1,Hyeon Suk,Shin,,,grid.42687.3f,ur.07617630407.83,"[{'id': 'grid.42687.3f', 'name': 'Ulsan Nation...",pub.1050119463
2,Goki,Eda,,,grid.4280.e,ur.01150450507.27,"[{'id': 'grid.4280.e', 'name': 'National Unive...",pub.1050119463
3,Lain-Jong,Li,,['0000-0002-4059-7783'],grid.45672.32,ur.01313340113.13,"[{'id': 'grid.28665.3f', 'name': 'Academia Sin...",pub.1050119463
4,Kian Ping,Loh,,['0000-0002-1491-743X'],grid.4280.e,ur.0752174033.73,"[{'id': 'grid.4280.e', 'name': 'National Unive...",pub.1050119463
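
To make the "one row per author" idea concrete, here is a sketch of the same kind of flattening done by hand. It assumes `author_affiliations` is a list of author groups, each of which is a list of author dictionaries, as the truncated output earlier suggests. The records are hypothetical and the code is an illustration, not DimCli's actual implementation.

```python
import pandas as pd

# Hypothetical publication records with nested author data.
publications = [
    {"id": "pub.0001", "author_affiliations": [[
        {"first_name": "Ada", "last_name": "Example", "researcher_id": "ur.0001"},
        {"first_name": "Bo", "last_name": "Sample", "researcher_id": "ur.0002"},
    ]]},
    {"id": "pub.0002"},  # e.g. a record with no author data
    {"id": "pub.0003", "author_affiliations": [[
        {"first_name": "Ada", "last_name": "Example", "researcher_id": "ur.0001"},
    ]]},
]

rows = []
for pub in publications:
    # Records without author data contribute no rows.
    for group in pub.get("author_affiliations") or []:
        for author in group:
            # Copy the author fields and remember which publication they came from.
            rows.append({**author, "pub_id": pub["id"]})

authors_sketch = pd.DataFrame(rows)
print(authors_sketch)
```

Keeping `pub_id` on every row is what makes it possible to join the authors back to their publications later.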


Using the authors dataframe, we can easily get the top ten values for `current_organization_id`. 

In [8]:
authors['current_organization_id'].value_counts()[:10]

                 188
grid.59025.3b    145
grid.12527.33    114
grid.5379.8      105
grid.19006.3e     78
grid.59053.3a     77
grid.168010.e     76
grid.13402.34     72
grid.21940.3e     69
grid.4280.e       66
Name: current_organization_id, dtype: int64

> Explanation: the most frequent non-empty value turns out to be grid.59025.3b, i.e. [Nanyang Technological University in Singapore](https://www.grid.ac/institutes/grid.59025.3b). The first result is empty, meaning that for those authors Dimensions has no information about `current_organization_id`.
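
If the empty entry gets in the way, it can be filtered out before counting. The sketch below assumes missing values are stored as empty strings, which the blank label in the output suggests but which is not confirmed here. The ids are hypothetical.

```python
import pandas as pd

# Hypothetical values; the empty string stands for a missing organization.
org_ids = pd.Series(
    ["grid.0001", "", "grid.0002", "grid.0001", "", ""],
    name="current_organization_id",
)

# Keep only non-empty values, then count.
counts = org_ids[org_ids != ""].value_counts()
print(counts)
```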

### Extracting Affiliations: `as_dataframe_authors_affiliations`

As you can see from the results of the previous section, the `affiliations` of each author are yet another nested JSON object.

> Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The `as_dataframe_authors_affiliations` method lets you quickly extract the affiliation data and returns a dataframe with **one row per affiliation**.

This can be useful e.g. if one wants to count research organizations at *the time of writing* (as opposed to `current_organization_id`, which is the *most recent organization* of a researcher). 

In [9]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()

Unnamed: 0,aff_id,aff_name,aff_city,aff_city_id,aff_country,aff_country_code,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,grid.430387.b,"Rutgers, The State University of New Jersey",New Brunswick,5101720.0,United States,US,New Jersey,US-NJ,pub.1050119463,ur.0633062306.03,Manish,Chhowalla
1,grid.42687.3f,Ulsan National Institute of Science and Techno...,Ulsan,1833750.0,South Korea,KR,,,pub.1050119463,ur.07617630407.83,Hyeon Suk,Shin
2,grid.4280.e,National University of Singapore,Singapore,1880250.0,Singapore,SG,,,pub.1050119463,ur.01150450507.27,Goki,Eda
3,grid.28665.3f,Academia Sinica,Taipei,1668340.0,Taiwan,TW,,,pub.1050119463,ur.01313340113.13,Lain-Jong,Li
4,grid.4280.e,National University of Singapore,Singapore,1880250.0,Singapore,SG,,,pub.1050119463,ur.0752174033.73,Kian Ping,Loh
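
For readers curious about how a nested `affiliations` column can be turned into one row per affiliation, here is one possible approach using `DataFrame.explode` and `pandas.json_normalize`. The data is hypothetical, the `aff_` prefix simply mimics the column names above, and this is a sketch rather than the method DimCli uses.

```python
import pandas as pd

# Hypothetical authors dataframe with a nested 'affiliations' column.
authors_nested = pd.DataFrame([
    {"first_name": "Ada", "last_name": "Example", "pub_id": "pub.0001",
     "affiliations": [
         {"id": "grid.0001", "name": "Example University", "country": "Exampleland"},
         {"id": "grid.0002", "name": "Sample Institute", "country": "Samplestan"},
     ]},
    {"first_name": "Bo", "last_name": "Sample", "pub_id": "pub.0001",
     "affiliations": [
         {"id": "grid.0001", "name": "Example University", "country": "Exampleland"},
     ]},
])

# One row per (author, affiliation) pair. Every author here has at least one
# affiliation; empty lists would need handling before the next step.
exploded = authors_nested.explode("affiliations").reset_index(drop=True)

# Turn each affiliation dict into columns and prefix them with 'aff_'.
aff_columns = pd.json_normalize(exploded["affiliations"].tolist()).add_prefix("aff_")

# Put the affiliation columns next to the author and publication identifiers.
aff_sketch = pd.concat(
    [aff_columns, exploded[["pub_id", "first_name", "last_name"]]], axis=1
)
print(aff_sketch)
```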


In [10]:
affiliations.describe(include="all")

Unnamed: 0,aff_id,aff_name,aff_city,aff_city_id,aff_country,aff_country_code,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
count,7008.0,7008,7008.0,7008.0,7008,7008,7008.0,7008.0,7008,7008.0,7008,7008
unique,750.0,1065,460.0,463.0,53,53,56.0,56.0,984,4149.0,3416,1891
top,,Nanyang Technological University,,,China,CN,,,pub.1019661721,,Wei,Zhang
freq,947.0,221,947.0,952.0,1951,1951,5136.0,5136.0,105,158.0,62,295


Let's get the top ten values for `aff_id`. 

In [11]:
affiliations['aff_id'].value_counts()[:10]

                 947
grid.59025.3b    221
grid.12527.33    133
grid.5379.8      115
grid.168010.e    100
grid.19006.3e    100
grid.21940.3e     96
grid.21729.3f     87
grid.59053.3a     82
grid.116068.8     81
Name: aff_id, dtype: int64

> Explanation: the most frequent non-empty value is still [grid.59025.3b](https://www.grid.ac/institutes/grid.59025.3b), which suggests that many authors' current organization is the same one they were affiliated with when they published these articles.

Another example: we can now easily analyze the data by country too. 

In [12]:
affiliations['aff_country'].value_counts()[:10]

China             1951
United States     1563
                   947
South Korea        346
Singapore          330
United Kingdom     312
Australia          171
Japan              154
Canada             138
Germany            130
Name: aff_country, dtype: int64

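> Explanation: the largest share of author affiliations in this dataset is in China (1951), followed by the United States (1563). The empty entry (947) covers affiliations for which no country is available.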
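
To compare countries by share rather than by raw count, `value_counts` accepts `normalize=True`. A minimal sketch on hypothetical data, again assuming missing countries are stored as empty strings:

```python
import pandas as pd

# Hypothetical country values; the counts are placeholders.
countries = pd.Series(
    ["China", "China", "United States", "", "Singapore"],
    name="aff_country",
)

# Drop rows with no country, then compute each country's share of the rest.
known = countries[countries != ""]
shares = known.value_counts(normalize=True)
print(shares)
```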

## 3. Helper Methods for 'Grants' queries

###  Extracting Funders: `as_dataframe_funders`

Grant funders are usually returned by the Dimensions API inside a nested JSON object under the `funders` key.

This method lets you quickly extract that data and returns a dataframe with **one row per funder**.

In [13]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 9204)


In [14]:
res.as_dataframe_funders().head(10)

Unnamed: 0,id,city_name,types,acronym,state_name,latitude,name,country_name,linkout,longitude,grant_id,grant_title,grant_start_date,grant_end_date
0,grid.421091.f,Swindon,[Government],EPSRC,England,51.567093,Engineering and Physical Sciences Research Cou...,United Kingdom,[https://www.epsrc.ac.uk/],-1.784602,grant.8558055,UK-Africa Postgraduate Advanced Study Institut...,2020-03-31,2021-03-30
1,grid.270680.b,Brussels,[Government],EC,,50.85165,European Commission,Belgium,[http://ec.europa.eu/index_en.htm],4.36367,grant.8585457,Estimating the Prevalence of AntiMicrobial Res...,2020-01-01,2021-12-31
2,grid.270680.b,Brussels,[Government],EC,,50.85165,European Commission,Belgium,[http://ec.europa.eu/index_en.htm],4.36367,grant.8586121,Earth observation service for preventive contr...,2019-11-01,2022-10-31
3,grid.454774.1,New Delhi,[Government],DBT,,28.601473,Department of Biotechnology,India,[http://www.dbtindia.nic.in/],77.23578,grant.8657420,Translational research and clinical developmen...,2019-10-07,2022-10-07
4,grid.248883.d,Ottawa,[Government],CIHR,Ontario,45.381893,Canadian Institutes of Health Research,Canada,[http://www.cihr-irsc.gc.ca/e/193.html],-75.745224,grant.8527034,Mechanisms of Leishmania dissemination and tra...,2019-10-01,2024-09-30
5,grid.52788.30,London,[Nonprofit],WT,,51.525867,Wellcome Trust,United Kingdom,[http://www.wellcome.ac.uk/],-0.135005,grant.8558648,Molecular mechanisms of carbohydrate uptake in...,2019-10-01,2022-09-30
6,grid.52788.30,London,[Nonprofit],WT,,51.525867,Wellcome Trust,United Kingdom,[http://www.wellcome.ac.uk/],-0.135005,grant.8103743,The Chemical Empire: A New History of Syntheti...,2019-10-01,2023-10-01
7,grid.425888.b,Bern,[Government],SNF,,46.94923,Swiss National Science Foundation,Switzerland,[http://www.snf.ch/en],7.432395,grant.8599112,Gauging Global Governance: The Effectiveness o...,2019-10-01,2022-09-30
8,grid.270680.b,Brussels,[Government],EC,,50.85165,European Commission,Belgium,[http://ec.europa.eu/index_en.htm],4.36367,grant.8585082,Understanding the roles of pathogen infection ...,2019-10-01,2021-09-30
9,grid.457875.c,Arlington,[Government],NSF MPS,Virginia,38.880566,Directorate for Mathematical & Physical Sciences,United States,[http://www.nsf.gov/dir/index.jsp?org=MPS],-77.11099,grant.8566624,Collaborative Research: Principal Component An...,2019-10-01,2022-09-30
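
With one row per funder, counting grants per funder becomes a simple `groupby`. The sketch below uses the `name` and `grant_id` column names from the table above on hypothetical rows. `nunique` is used defensively, in case a grant appears more than once for the same funder.

```python
import pandas as pd

# Hypothetical funder rows; names and ids are placeholders.
funders = pd.DataFrame({
    "name": ["Funder A", "Funder B", "Funder A", "Funder A"],
    "grant_id": ["grant.0001", "grant.0002", "grant.0003", "grant.0003"],
})

# Number of distinct grants per funder, largest first.
grants_per_funder = (
    funders.groupby("name")["grant_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(grants_per_funder)
```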


### Extracting investigators: `as_dataframe_investigators`

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the `investigator_details` sub-key. 

This method lets you quickly extract that data and returns a dataframe with **one row per investigator**.

> NOTE: `investigator_details` is not returned by default in a grants query, so it must be requested explicitly in the `return` clause of the query

In [16]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigator_details] limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 9204)
Field 'project_num' is deprecated in favor of grant_number. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details


In [None]:
res.as_dataframe_investigators().head(10)

## Conclusions 

Moving Dimensions API results to pandas dataframes **makes it easier** to **analyze the data** and **answer research questions**. 

Note: the examples above only scratch the surface of what can be done with pandas! 