<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab/blob/master/1-getting-started/4%20Using%20Dimcli%20to%20return%20pandas%20dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# Dimcli: returning DataFrames

Dimcli includes a few utilities that make it easier to transform Dimensions JSON data into pandas [dataframe objects](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation software.

>  [pandas](https://pandas.pydata.org/pandas-docs/stable/) is a popular software library written for the Python programming language for data manipulation and analysis.
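
As a quick illustration of the sorting and CSV export mentioned above, here is a minimal sketch. The records are hypothetical stand-ins (not real API output) and `df_demo` is a name chosen just for this example.

```python
import pandas as pd

# Illustrative records only (hypothetical values, not real API output)
records = [
    {"id": "pub.1", "year": 2014, "times_cited": 10},
    {"id": "pub.2", "year": 2013, "times_cited": 42},
    {"id": "pub.3", "year": 2015, "times_cited": 7},
]
df_demo = pd.DataFrame(records)

# Sort by citations, most cited first
sorted_df = df_demo.sort_values("times_cited", ascending=False)

# With no path argument, to_csv returns the CSV as a string;
# pass a file path instead to write to disk
csv_text = sorted_df.to_csv(index=False)
print(csv_text)
```

The same two calls should work unchanged on a dataframe built from real query results.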

In [8]:
import pandas
import dimcli
from dimcli.shortcuts import dslquery
dimcli.login()

DimCli v0.5.4 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


## 1. Getting started: the `as_dataframe` method

This utility method lets you quickly turn any query results into a dataframe.

In [9]:
# we'll reuse this query later on 
query = """search publications for "graphene" where year=[2013:2019] return publications sort by times_cited limit 1000"""
res = dslquery(query)

Returned Publications: 1000 (total = 365807)


In [10]:
df = res.as_dataframe()
df.head(10)

Unnamed: 0,author_affiliations,id,issue,journal.id,journal.title,pages,title,type,volume,year
0,"[[{'first_name': 'Manish', 'last_name': 'Chhow...",pub.1050119463,4,jour.1041224,Nature Chemistry,263,The chemistry of two-dimensional layered trans...,article,5,2013
1,"[[{'first_name': 'A. K.', 'last_name': 'Geim',...",pub.1024857999,7459,jour.1018957,Nature,419,Van der Waals heterostructures,article,499,2013
2,"[[{'first_name': 'Huanping', 'last_name': 'Zho...",pub.1004394295,6196,jour.1346339,Science,542-546,Interface engineering of highly efficient pero...,article,345,2014
3,"[[{'first_name': 'Likai', 'last_name': 'Li', '...",pub.1032956475,5,jour.1037429,Nature Nanotechnology,372-377,Black phosphorus field-effect transistors,article,9,2014
4,"[[{'first_name': 'C.', 'last_name': 'Patrignan...",pub.1059158429,10,jour.1327822,Chinese Physics C,100001,Review of Particle Physics,article,40,2016
5,"[[{'first_name': 'John B.', 'last_name': 'Good...",pub.1019126274,4,jour.1081898,Journal of the American Chemical Society,1167-76,The Li-ion rechargeable battery: a perspective.,article,135,2013
6,"[[{'first_name': 'Michael F. L.', 'last_name':...",pub.1007405937,6119,jour.1346339,Science,535-539,Carbon Nanotubes: Present and Future Commercia...,article,339,2013
7,"[[{'first_name': 'Andrea C.', 'last_name': 'Fe...",pub.1015305822,4,jour.1037429,Nature Nanotechnology,235,Raman spectroscopy as a versatile tool for stu...,article,8,2013
8,"[[{'first_name': 'Han', 'last_name': 'Liu', 'o...",pub.1009826879,4,jour.1038917,ACS Nano,4033-41,Phosphorene: an unexplored 2D semiconductor wi...,article,8,2014
9,"[[{'first_name': 'Martin A.', 'last_name': 'Gr...",pub.1045181228,7,jour.1037430,Nature Photonics,506-514,The emergence of perovskite solar cells,article,8,2014
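
Dotted column names such as `journal.id` and `journal.title` are what you typically get when nested JSON is flattened. The sketch below reproduces that effect with `pandas.json_normalize` on made-up records. It is an assumption about the general technique, not necessarily how `as_dataframe` is implemented inside Dimcli.

```python
import pandas as pd

# Hypothetical records shaped like nested API JSON (values are made up)
records = [
    {"id": "pub.1", "year": 2013, "journal": {"id": "jour.1", "title": "Journal A"}},
    {"id": "pub.2", "year": 2014, "journal": {"id": "jour.2", "title": "Journal B"}},
]

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```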


Pandas dataframes offer a myriad of utilities for inspecting data. Check out the [official docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) or google a [pandas tutorial](https://www.google.com/search?q=pandas+tutorial) to learn more about them.

In [11]:
# the 'describe' method returns basic statistics for all columns of a dataframe
df.describe(include='all')

Unnamed: 0,author_affiliations,id,issue,journal.id,journal.title,pages,title,type,volume,year
count,998,1000,934.0,997,997,996,1000,1000,997.0,1000.0
unique,997,1000,135.0,143,143,987,1000,4,166.0,
top,"[[{'first_name': 'Jingjing', 'last_name': 'Dua...",pub.1045356440,1.0,jour.1015766,Chemical Society Reviews,421-425,Layered SnS2‐Reduced Graphene Oxide Composite ...,article,7.0,
freq,2,1,174.0,106,106,2,1,997,67.0,
mean,,,,,,,,,,2014.114
std,,,,,,,,,,1.103731
min,,,,,,,,,,2013.0
25%,,,,,,,,,,2013.0
50%,,,,,,,,,,2014.0
75%,,,,,,,,,,2015.0


In [12]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()

2013    362
2014    316
2015    206
2016     82
2017     30
2018      4
Name: year, dtype: int64

In [13]:
# eg distribution of publication [type]
df['type'].value_counts()

article      997
monograph      1
book           1
chapter        1
Name: type, dtype: int64
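
The two distributions above can also be combined into a single year-by-type table. A minimal sketch using `pandas.crosstab` on a small made-up dataframe (`demo` is a hypothetical name, and the values are illustrative only):

```python
import pandas as pd

# Made-up sample with the same two columns used above
demo = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2014],
    "type": ["article", "book", "article", "article", "chapter"],
})

# Rows are years, columns are publication types, cells are counts
table = pd.crosstab(demo["year"], demo["type"])
print(table)
```

With the real results, `pd.crosstab(df['year'], df['type'])` should give the equivalent table.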

## 2. Helper Methods for 'Publications' queries

##  Publications authors: `as_dataframe_authors`

Publications authors are usually returned by the Dimensions API inside a nested JSON object in the `author_affiliations` sub-key.

> Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This method lets you quickly extract that data and return a dataframe with **one row per author**.

In [14]:
authors = res.as_dataframe_authors()
authors.head()

Unnamed: 0,affiliations,current_organization_id,first_name,last_name,orcid,researcher_id,pub_id
0,"[{'id': 'grid.430387.b', 'name': 'Rutgers, The...",grid.430387.b,Manish,Chhowalla,,ur.0633062306.03,pub.1050119463
1,"[{'id': 'grid.42687.3f', 'name': 'Ulsan Nation...",grid.42687.3f,Hyeon Suk,Shin,,ur.07617630407.83,pub.1050119463
2,"[{'id': 'grid.4280.e', 'name': 'National Unive...",grid.4280.e,Goki,Eda,,ur.01150450507.27,pub.1050119463
3,"[{'id': 'grid.482254.d', 'name': 'Institute of...",grid.45672.32,Lain-Jong,Li,['0000-0002-4059-7783'],ur.01313340113.13,pub.1050119463
4,"[{'id': 'grid.4280.e', 'name': 'National Unive...",grid.4280.e,Kian Ping,Loh,['0000-0002-1491-743X'],ur.0752174033.73,pub.1050119463
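
To make the "one row per author" idea concrete, here is a plain-Python sketch of the flattening step. The records are hypothetical, and the loop is only an approximation of what such a helper might do, not Dimcli's actual code.

```python
# Hypothetical publication records; the list-of-lists nesting of
# `author_affiliations` mirrors the dataframe output shown earlier
publications = [
    {"id": "pub.1", "author_affiliations": [[
        {"first_name": "Ada", "last_name": "Example"},
        {"first_name": "Bob", "last_name": "Sample"},
    ]]},
    {"id": "pub.2", "author_affiliations": [[
        {"first_name": "Cy", "last_name": "Test"},
    ]]},
]

rows = []
for pub in publications:
    # tolerate publications where the key is missing or empty
    for group in pub.get("author_affiliations") or []:
        for author in group:
            # copy the author fields and record the source publication
            rows.append({**author, "pub_id": pub["id"]})

print(len(rows))
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)`.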


Using the authors dataframe, we can easily get the top ten values for `current_organization_id`. 

In [15]:
authors['current_organization_id'].value_counts()[:10]

                 173
grid.59025.3b    153
grid.168010.e    150
grid.12527.33     93
grid.5379.8       86
grid.59053.3a     86
grid.5333.6       78
grid.116068.8     71
grid.19006.3e     71
grid.13402.34     68
Name: current_organization_id, dtype: int64

> Explanation: the most frequent value is empty (173 rows), meaning that for those authors Dimensions has no info about `current_organization_id`. The most frequent actual organization is grid.59025.3b, i.e. [Nanyang Technological University in Singapore](https://www.grid.ac/institutes/grid.59025.3b), with 153 rows.
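
If the empty values get in the way, they can be filtered out before counting. A minimal sketch, assuming missing organizations are stored as empty strings (as the blank label in the output suggests) and using made-up identifiers:

```python
import pandas as pd

# Made-up identifiers; "" stands for a missing organization (an assumption)
orgs = pd.Series(
    ["grid.A", "", "grid.B", "grid.A", "", ""],
    name="current_organization_id",
)

# Keep only rows with a known organization, then count
known = orgs[orgs != ""]
counts = known.value_counts()

# Number of rows with no organization at all
missing = int((orgs == "").sum())
print(counts.head(10), missing)
```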

## Publications affiliations: `as_dataframe_authors_affiliations`

As you can see from the results of the previous section, the `affiliations` of each author is yet another nested JSON object. 

> Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The `as_dataframe_authors_affiliations` method lets you quickly extract that affiliation data and return a dataframe with **one row per affiliation**.

This can be useful e.g. if one wants to count research organizations at *the time of writing* (as opposed to `current_organization_id`, which is the *most recent organization* of a researcher). 

In [16]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()

Unnamed: 0,aff_city,aff_city_id,aff_country,aff_country_code,aff_id,aff_name,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,New Brunswick,5101717.0,United States,US,grid.430387.b,"Rutgers, The State University of New Jersey",New Jersey,US-NJ,pub.1050119463,ur.0633062306.03,Manish,Chhowalla
1,Ulsan,1833747.0,South Korea,KR,grid.42687.3f,Ulsan National Institute of Science and Techno...,,,pub.1050119463,ur.07617630407.83,Hyeon Suk,Shin
2,Singapore,1880252.0,Singapore,SG,grid.4280.e,National University of Singapore,,,pub.1050119463,ur.01150450507.27,Goki,Eda
3,Taipei,1668341.0,Taiwan,TW,grid.482254.d,"Institute of Atomic and Molecular Sciences, Ac...",,,pub.1050119463,ur.01313340113.13,Lain-Jong,Li
4,Singapore,1880252.0,Singapore,SG,grid.4280.e,National University of Singapore,,,pub.1050119463,ur.0752174033.73,Kian Ping,Loh
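
A similar "one row per affiliation" table can be sketched with `pandas.json_normalize`, using its `record_path`, `meta` and `record_prefix` parameters. The input below is hypothetical, and this illustrates the general technique rather than Dimcli's implementation.

```python
import pandas as pd

# Hypothetical author records, each carrying a nested list of affiliations
authors_demo = [
    {"first_name": "Ada", "last_name": "Example", "pub_id": "pub.1",
     "affiliations": [
         {"id": "grid.A", "name": "University A", "country": "Country X"},
         {"id": "grid.B", "name": "Institute B", "country": "Country Y"},
     ]},
    {"first_name": "Bob", "last_name": "Sample", "pub_id": "pub.1",
     "affiliations": [
         {"id": "grid.A", "name": "University A", "country": "Country X"},
     ]},
]

# One output row per affiliation; author fields are repeated on each row
aff_demo = pd.json_normalize(
    authors_demo,
    record_path="affiliations",
    meta=["pub_id", "first_name", "last_name"],
    record_prefix="aff_",
)
print(aff_demo)
```

An author with two affiliations contributes two rows, which is why the affiliations dataframe is usually longer than the authors one.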


In [17]:
affiliations.describe(include="all")

Unnamed: 0,aff_city,aff_city_id,aff_country,aff_country_code,aff_id,aff_name,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
count,6413,6407.0,6413,6413,6413,7187,2202,2202,7187,7187.0,7187,7187
unique,428,,50,50,718,998,51,51,985,4298.0,3450,1985
top,Beijing,,United States,US,grid.59025.3b,Nanyang Technological University,California,US-CA,pub.1019661721,,Yi,Wang
freq,631,,1949,1949,229,229,548,548,105,140.0,63,300
mean,,3039684.0,,,,,,,,,,
std,,1553733.0,,,,,,,,,,
min,,66093.0,,,,,,,,,,
25%,,1816670.0,,,,,,,,,,
50%,,2174003.0,,,,,,,,,,
75%,,4722625.0,,,,,,,,,,


Let's get the top ten values for `aff_id`. 

In [18]:
affiliations['aff_id'].value_counts()[:10]

grid.59025.3b    229
grid.168010.e    202
grid.12527.33    111
grid.21729.3f    106
grid.19006.3e     98
grid.5379.8       90
grid.9227.e       84
grid.116068.8     83
grid.5333.6       82
grid.59053.3a     82
Name: aff_id, dtype: int64

> Explanation: the most frequent value is still [grid.59025.3b](https://www.grid.ac/institutes/grid.59025.3b), which suggests that many of these authors are still at the organization they were affiliated with when the articles were published.
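
Comparing two top-ten lists only hints at this. A more direct check is to join the two dataframes and compare the organization columns row by row. The sketch below uses made-up data with the same column names; real data would first need rows with a missing `researcher_id` filtered out.

```python
import pandas as pd

# Made-up stand-ins for the authors and affiliations dataframes
authors_demo = pd.DataFrame({
    "researcher_id": ["ur.1", "ur.2", "ur.3"],
    "pub_id": ["pub.1", "pub.1", "pub.2"],
    "current_organization_id": ["grid.A", "grid.B", "grid.C"],
})
aff_demo = pd.DataFrame({
    "researcher_id": ["ur.1", "ur.2", "ur.3"],
    "pub_id": ["pub.1", "pub.1", "pub.2"],
    "aff_id": ["grid.A", "grid.X", "grid.C"],
})

# Match each affiliation row to its author row
merged = aff_demo.merge(authors_demo, on=["researcher_id", "pub_id"], how="inner")

# True where the affiliation at publication time equals the current organization
merged["same_org"] = merged["aff_id"] == merged["current_organization_id"]
share_same = merged["same_org"].mean()
print(share_same)
```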

Another example: we can now easily analyze the data by country too. 

In [19]:
affiliations['aff_country'].value_counts()[:10]

United States     1949
China             1914
Singapore          328
South Korea        313
United Kingdom     261
Germany            201
Japan              168
Australia          157
Switzerland        122
Taiwan             101
Name: aff_country, dtype: int64

> Explanation: the United States (1949) and China (1914) are almost level at the top, each accounting for roughly 30% of the 6413 affiliation rows that have a country. All other countries are far behind.
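
Raw counts are easier to interpret as shares. A minimal sketch with `value_counts(normalize=True)` on a small made-up series:

```python
import pandas as pd

# Made-up sample of affiliation countries
countries = pd.Series(
    ["United States", "China", "United States", "Singapore", "China", "United States"],
    name="aff_country",
)

# normalize=True returns fractions of the non-missing rows instead of counts
shares = countries.value_counts(normalize=True)
print(shares)
```

On the real data, `affiliations['aff_country'].value_counts(normalize=True)[:10]` should give the corresponding proportions.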

## 3. Helper Methods for 'Grants' queries

##  Grants funders: `as_dataframe_funders`

Grant funders are usually returned by the Dimensions API inside a nested JSON object in the `funders` sub-key.

This method lets you quickly extract that data and return a dataframe with **one row per funder**.

In [20]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dslquery(query)

Returned Grants: 1000 (total = 9280)


In [21]:
res.as_dataframe_funders().head(10)

Unnamed: 0,acronym,country_name,id,name,grant_id,grant_title,grant_start_date,grant_end_date
0,WT,United Kingdom,grid.52788.30,Wellcome Trust,grant.8103743,The Chemical Empire: A New History of Syntheti...,2019-10-01,2023-09-30
1,CIHR,Canada,grid.248883.d,Canadian Institutes of Health Research,grant.8527034,Mechanisms of Leishmania dissemination and tra...,2019-10-01,2024-09-30
2,NSF SBE,United States,grid.457916.8,"Directorate for Social, Behavioral & Economic ...",grant.7923573,CAREER: A Case Study of Malaria Elimination Ef...,2019-09-01,2024-08-31
3,WT,United Kingdom,grid.52788.30,Wellcome Trust,grant.8103357,Quantifying the influence of wind on mosquito ...,2019-09-01,2020-08-31
4,CIHR,Canada,grid.248883.d,Canadian Institutes of Health Research,grant.8482815,Investigating the Host-Pathogen Interactions o...,2019-09-01,2020-08-31
5,CIHR,Canada,grid.248883.d,Canadian Institutes of Health Research,grant.8527284,Investigation of antibody-mediated immune mech...,2019-09-01,2020-08-31
6,NSF MPS,United States,grid.457875.c,Directorate for Mathematical & Physical Sciences,grant.8522823,Membrane and Monolith Enzyme Reactors for Prot...,2019-08-01,2022-07-31
7,BBSRC,United Kingdom,grid.418100.c,Biotechnology and Biological Sciences Research...,grant.8483462,Defining mechani1sms of CD8+ T-cell mediated i...,2019-07-31,2022-07-30
8,AMS,United Kingdom,grid.420331.7,Academy of Medical Sciences,grant.7921305,Analysis of innate immune responses following ...,2019-07-15,2021-07-14
9,MSFHR,Canada,grid.453291.8,Michael Smith Foundation for Health Research,grant.7823904,Phosphoinositide kinases: molecular determinan...,2019-07-01,2024-06-30
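
Once funders are in a dataframe, ranking them by number of grants is a short `groupby`. A sketch on made-up rows that reuse the `name` and `grant_id` column names from the output above:

```python
import pandas as pd

# Made-up funder rows (one row per funder-grant pair)
funders_demo = pd.DataFrame({
    "name": ["Funder A", "Funder B", "Funder A", "Funder C", "Funder A"],
    "country_name": ["Country X", "Country Y", "Country X", "Country X", "Country X"],
    "grant_id": ["grant.1", "grant.2", "grant.3", "grant.4", "grant.5"],
})

# Count distinct grants per funder, largest first
grants_per_funder = (
    funders_demo.groupby("name")["grant_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(grants_per_funder)
```

Using `nunique` rather than a plain row count guards against a grant being listed twice for the same funder.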


##  Grants investigators: `as_dataframe_investigators`

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the `investigator_details` sub-key. 

This method lets you quickly extract that data and return a dataframe with **one row per investigator**.

> NOTE: `investigator_details` are not returned by default in a grants query, so they must be requested explicitly in the `return` part of the query

In [22]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigator_details] limit 1000"""
res = dslquery(query)

Returned Grants: 1000 (total = 9280)


In [23]:
res.as_dataframe_investigators().head(10)

Unnamed: 0,affiliations,first_name,id,last_name,middle_name,role,grant_id,grant_title,grant_start_date,grant_end_date
0,"[{'city': 'York', 'state_code': None, 'country...",Sabine,ur.07547552115.05,Clarke,Marie,PI,grant.8103743,The Chemical Empire: A New History of Syntheti...,2019-10-01,2023-09-30
1,,Nathan,,Peters,,PI,grant.8527034,Mechanisms of Leishmania dissemination and tra...,2019-10-01,2024-09-30
2,"[{'state_code': 'US-OR', 'id': 'grid.170202.6'...",Melissa,ur.011503504405.93,Graboyes,,PI,grant.7923573,CAREER: A Case Study of Malaria Elimination Ef...,2019-09-01,2024-08-31
3,"[{'city': 'Liverpool', 'state_code': None, 'co...",Christopher,ur.010302660415.98,Jones,Mark,PI,grant.8103357,Quantifying the influence of wind on mosquito ...,2019-09-01,2020-08-31
4,,Nicola,,Case,,PI,grant.8482815,Investigating the Host-Pathogen Interactions o...,2019-09-01,2020-08-31
5,,Madeleine,,Wiebe,Claire,PI,grant.8527284,Investigation of antibody-mediated immune mech...,2019-09-01,2020-08-31
6,"[{'state_code': 'US-IN', 'id': 'grid.131063.6'...",Merlin,,Bruening,,PI,grant.8522823,Membrane and Monolith Enzyme Reactors for Prot...,2019-08-01,2022-07-31
7,"[{'city': None, 'state_code': None, 'name': 'U...",Tim,,Connelley,,PI,grant.8483462,Defining mechani1sms of CD8+ T-cell mediated i...,2019-07-31,2022-07-30
8,"[{'city': None, 'state_code': None, 'name': 'U...",Musa,,Hassan,Abdul,Co-PI,grant.8483462,Defining mechani1sms of CD8+ T-cell mediated i...,2019-07-31,2022-07-30
9,"[{'state': None, 'state_code': None, 'city': N...",Ruth,ur.010655625375.45,Payne,,PI,grant.7921305,Analysis of innate immune responses following ...,2019-07-15,2021-07-14
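
Two quick checks are often useful on an investigators table: how roles are distributed per grant, and how many investigators lack a researcher `id`. A sketch on made-up rows with the same column names:

```python
import pandas as pd

# Made-up investigator rows; None marks a missing researcher id
inv_demo = pd.DataFrame({
    "grant_id": ["grant.1", "grant.1", "grant.2", "grant.3"],
    "role": ["PI", "Co-PI", "PI", "PI"],
    "id": ["ur.1", None, None, "ur.2"],
})

# One row per grant, one column per role, cells are investigator counts
roles = inv_demo.groupby(["grant_id", "role"]).size().unstack(fill_value=0)

# Fraction of investigators without a researcher id
share_missing_id = inv_demo["id"].isna().mean()
print(roles)
print(share_missing_id)
```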


## Conclusions 

Moving Dimensions API results to pandas dataframes **makes it easier** to **analyze the data** and **answer research questions**. 

Note: the examples above only scratch the surface of what can be done with pandas! 

---
# Want to learn more?

Check out the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many tutorials and reusable Jupyter notebooks for scholarly data analytics. 