# Dimcli: returning DataFrames

DimCli includes a few utilities that make it easier to transform Dimensions JSON data into Pandas [dataframe objects](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). 

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation software.

>  [pandas](https://pandas.pydata.org/pandas-docs/stable/) is a popular software library written for the Python programming language for data manipulation and analysis.

In [1]:
import pandas
import dimcli
from dimcli.shortcuts import dslquery
%dsl_login

DimCli v0.5.3 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


## 1. Getting started: the `as_dataframe` method

This utility method lets you quickly turn any query result into a dataframe.

In [2]:
# we'll reuse this query later on 
query = """search publications for "graphene" where year=[2013:2019] return publications sort by times_cited limit 1000"""
res = dslquery(query)

Returned Publications: 1000 (total = 365474)


In [4]:
df = res.as_dataframe()
df.head(10)

Unnamed: 0,author_affiliations,id,issue,journal,pages,title,type,volume,year
0,"[[{'first_name': 'Manish', 'last_name': 'Chhow...",pub.1050119463,4,"{'id': 'jour.1041224', 'title': 'Nature Chemis...",263,The chemistry of two-dimensional layered trans...,article,5,2013
1,"[[{'first_name': 'A. K.', 'last_name': 'Geim',...",pub.1024857999,7459,"{'id': 'jour.1018957', 'title': 'Nature'}",419,Van der Waals heterostructures,article,499,2013
2,"[[{'first_name': 'Huanping', 'last_name': 'Zho...",pub.1004394295,6196,"{'id': 'jour.1346339', 'title': 'Science'}",542-546,Interface engineering of highly efficient pero...,article,345,2014
3,"[[{'first_name': 'Likai', 'last_name': 'Li', '...",pub.1032956475,5,"{'id': 'jour.1037429', 'title': 'Nature Nanote...",372-377,Black phosphorus field-effect transistors,article,9,2014
4,"[[{'first_name': 'C.', 'last_name': 'Patrignan...",pub.1059158429,10,"{'id': 'jour.1327822', 'title': 'Chinese Physi...",100001,Review of Particle Physics,article,40,2016
5,"[[{'first_name': 'John B.', 'last_name': 'Good...",pub.1019126274,4,"{'id': 'jour.1081898', 'title': 'Journal of th...",1167-76,The Li-ion rechargeable battery: a perspective.,article,135,2013
6,"[[{'first_name': 'Michael F. L.', 'last_name':...",pub.1007405937,6119,"{'id': 'jour.1346339', 'title': 'Science'}",535-539,Carbon Nanotubes: Present and Future Commercia...,article,339,2013
7,"[[{'first_name': 'Andrea C.', 'last_name': 'Fe...",pub.1015305822,4,"{'id': 'jour.1037429', 'title': 'Nature Nanote...",235,Raman spectroscopy as a versatile tool for stu...,article,8,2013
8,"[[{'first_name': 'Han', 'last_name': 'Liu', 'o...",pub.1009826879,4,"{'id': 'jour.1038917', 'title': 'ACS Nano'}",4033-41,Phosphorene: an unexplored 2D semiconductor wi...,article,8,2014
9,"[[{'first_name': 'Martin A.', 'last_name': 'Gr...",pub.1045181228,7,"{'id': 'jour.1037430', 'title': 'Nature Photon...",506-514,The emergence of perovskite solar cells,article,8,2014
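
Some columns in the output above, such as `journal`, still hold nested dictionaries. As a minimal sketch (not part of DimCli itself), such a column can be expanded into flat columns with `pandas.json_normalize`. The `sample` dataframe and the `journal_` prefix are illustrative choices; the two rows are copied from the output above.

```python
import pandas as pd

# Two rows copied from the output above; the real dataframe has 1000 rows.
sample = pd.DataFrame(
    {
        "id": ["pub.1024857999", "pub.1009826879"],
        "journal": [
            {"id": "jour.1018957", "title": "Nature"},
            {"id": "jour.1038917", "title": "ACS Nano"},
        ],
    }
)

# Expand each journal dict into its own columns (journal_id, journal_title).
journal_cols = pd.json_normalize(sample["journal"].tolist()).add_prefix("journal_")

# Place the new columns next to the remaining ones; both share the default index.
flat = pd.concat([sample.drop(columns="journal"), journal_cols], axis=1)
print(flat)
```

In the full result set a few publications have no `journal` value, so those rows would probably need to be dropped or filled before applying the same approach.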


Pandas dataframes offer a myriad of utilities for inspecting data. Check out the [official docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) or search for a [pandas tutorial](https://www.google.com/search?q=pandas+tutorial) to learn more about them.

In [11]:
# the 'describe' method returns basic statistics for all columns of a dataframe
df.describe(include='all')

Unnamed: 0,author_affiliations,id,issue,journal,pages,title,type,volume,year
count,998,1000,934.0,997,996,1000,1000,997.0,1000.0
unique,997,1000,135.0,143,987,1000,4,166.0,
top,"[[{'first_name': 'Jingjing', 'last_name': 'Dua...",pub.1042950947,1.0,"{'id': 'jour.1015766', 'title': 'Chemical Soci...",23-36,"Large, non-saturating magnetoresistance in WTe2",article,7.0,
freq,2,1,174.0,106,2,1,997,67.0,
mean,,,,,,,,,2014.114
std,,,,,,,,,1.103731
min,,,,,,,,,2013.0
25%,,,,,,,,,2013.0
50%,,,,,,,,,2014.0
75%,,,,,,,,,2015.0


In [19]:
# the 'value_counts' method returns the distribution of a specific field, e.g. publication years
df['year'].value_counts()

2013    362
2014    316
2015    206
2016     82
2017     30
2018      4
Name: year, dtype: int64
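
Raw counts can also be expressed as shares of the total by passing `normalize=True` to `value_counts`. The sketch below rebuilds a `year` column from the counts printed above so that it runs on its own; with the real data one would call the method on `df['year']` directly.

```python
import pandas as pd

# Rebuild a 'year' column from the counts printed above (1000 rows in total).
year_counts = {2013: 362, 2014: 316, 2015: 206, 2016: 82, 2017: 30, 2018: 4}
years = pd.Series(
    [year for year, n in year_counts.items() for _ in range(n)], name="year"
)

# normalize=True returns fractions of the total instead of raw counts.
shares = years.value_counts(normalize=True)
print(shares.round(3))
```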

In [20]:
# e.g. the distribution of publication types
df['type'].value_counts()

article      997
monograph      1
book           1
chapter        1
Name: type, dtype: int64
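
Sorting and CSV export, mentioned in the introduction, are each a single call once the data is in a dataframe. The following is a small self-contained sketch: the three rows are taken from the first output above, and the in-memory buffer stands in for a file path such as a hypothetical `publications.csv`.

```python
import io

import pandas as pd

# Three rows taken from the first output above.
sample = pd.DataFrame(
    {
        "id": ["pub.1024857999", "pub.1032956475", "pub.1059158429"],
        "title": [
            "Van der Waals heterostructures",
            "Black phosphorus field-effect transistors",
            "Review of Particle Physics",
        ],
        "year": [2013, 2014, 2016],
    }
)

# Sort with the most recent publications first.
newest_first = sample.sort_values("year", ascending=False)

# Write CSV to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
newest_first.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```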

## 2. Helper Methods for 'Publications' queries

## Publications authors: `as_dataframe_authors`

Publication authors are usually returned by the Dimensions API as a nested JSON object under the `author_affiliations` key.

> Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This method lets you quickly extract that data and return a dataframe with **one row per author**.

In [16]:
authors = res.as_dataframe_authors()
authors.head()

Unnamed: 0,affiliations,current_organization_id,first_name,last_name,orcid,researcher_id,pub_id
0,"[{'id': 'grid.430387.b', 'name': 'Rutgers, The...",grid.430387.b,Manish,Chhowalla,,ur.0633062306.03,pub.1050119463
1,"[{'id': 'grid.42687.3f', 'name': 'Ulsan Nation...",grid.42687.3f,Hyeon Suk,Shin,,ur.07617630407.83,pub.1050119463
2,"[{'id': 'grid.4280.e', 'name': 'National Unive...",grid.4280.e,Goki,Eda,,ur.01150450507.27,pub.1050119463
3,"[{'id': 'grid.482254.d', 'name': 'Institute of...",grid.45672.32,Lain-Jong,Li,['0000-0002-4059-7783'],ur.01313340113.13,pub.1050119463
4,"[{'id': 'grid.4280.e', 'name': 'National Unive...",grid.4280.e,Kian Ping,Loh,['0000-0002-1491-743X'],ur.0752174033.73,pub.1050119463
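
To make the "one row per author" idea concrete, here is a conceptual sketch in plain Python. It is not DimCli's actual implementation: the helper `authors_to_rows` is hypothetical, and the layout of `author_affiliations` (a list of groups, each a list of author dicts) is an assumption based on the output shown earlier.

```python
# One publication record shaped like the 'author_affiliations' column:
# a list of groups, each group being a list of author dicts (assumed layout).
publication = {
    "id": "pub.1050119463",
    "author_affiliations": [
        [
            {
                "first_name": "Manish",
                "last_name": "Chhowalla",
                "researcher_id": "ur.0633062306.03",
            },
            {
                "first_name": "Hyeon Suk",
                "last_name": "Shin",
                "researcher_id": "ur.07617630407.83",
            },
        ]
    ],
}


def authors_to_rows(pub):
    """Return one flat dict per author, tagged with the publication id."""
    rows = []
    # A missing or empty 'author_affiliations' value yields no rows.
    for group in pub.get("author_affiliations") or []:
        for author in group:
            rows.append({**author, "pub_id": pub["id"]})
    return rows


rows = authors_to_rows(publication)
for row in rows:
    print(row["pub_id"], row["first_name"], row["last_name"])
```

A list of flat dicts like `rows` can be passed straight to `pandas.DataFrame` to obtain a table similar to the one above.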


Using the authors dataframe, we can easily get the top ten values for `current_organization_id`. 

In [17]:
authors['current_organization_id'].value_counts()[:10]

                 173
grid.59025.3b    153
grid.168010.e    150
grid.12527.33     93
grid.5379.8       86
grid.59053.3a     86
grid.5333.6       78
grid.116068.8     71
grid.19006.3e     71
grid.13402.34     68
Name: current_organization_id, dtype: int64

> Explanation: the most frequent value is empty (173 authors), meaning that Dimensions has no `current_organization_id` for those authors. The most frequent actual organization is grid.59025.3b, i.e. [Nanyang Technological University in Singapore](https://www.grid.ac/institutes/grid.59025.3b).
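
The blank label in the counts above corresponds to authors without a `current_organization_id`. A possible way to keep them out of the ranking is to turn blanks into missing values first, since `value_counts` skips missing values by default. This sketch assumes the blanks are empty strings and uses a small made-up sample (`authors_sample`) rather than the real query results.

```python
import pandas as pd

# Made-up sample; the empty strings mimic authors with no current organization.
authors_sample = pd.DataFrame(
    {
        "current_organization_id": [
            "grid.59025.3b",
            "",
            "grid.168010.e",
            "grid.59025.3b",
            "",
        ]
    }
)

org_ids = authors_sample["current_organization_id"]

# Replace empty strings with missing values (assumption: blanks are '').
org_ids = org_ids.mask(org_ids == "")

# Count the gaps separately, then rank only the known organizations.
missing = int(org_ids.isna().sum())
top_orgs = org_ids.value_counts().head(10)

print("authors without organization:", missing)
print(top_orgs)
```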

## Publications affiliations: `as_dataframe_authors_affiliations`

As you can see from the results of the previous section, the `affiliations` of each author are yet another nested JSON object.

The `as_dataframe_authors_affiliations` method lets you quickly extract that affiliation data and return a dataframe with **one row per affiliation**.

This can be useful e.g. if one wants to count research organizations at *the time of writing* (as opposed to `current_organization_id`, which is the *most recent organization* of a researcher). 

In [12]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()

Unnamed: 0,aff_city,aff_city_id,aff_country,aff_country_code,aff_id,aff_name,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,New Brunswick,5101717.0,United States,US,grid.430387.b,"Rutgers, The State University of New Jersey",New Jersey,US-NJ,pub.1050119463,ur.0633062306.03,Manish,Chhowalla
1,Ulsan,1833747.0,South Korea,KR,grid.42687.3f,Ulsan National Institute of Science and Techno...,,,pub.1050119463,ur.07617630407.83,Hyeon Suk,Shin
2,Singapore,1880252.0,Singapore,SG,grid.4280.e,National University of Singapore,,,pub.1050119463,ur.01150450507.27,Goki,Eda
3,Taipei,1668341.0,Taiwan,TW,grid.482254.d,"Institute of Atomic and Molecular Sciences, Ac...",,,pub.1050119463,ur.01313340113.13,Lain-Jong,Li
4,Singapore,1880252.0,Singapore,SG,grid.4280.e,National University of Singapore,,,pub.1050119463,ur.0752174033.73,Kian Ping,Loh


In [13]:
affiliations.describe(include="all")

Unnamed: 0,aff_city,aff_city_id,aff_country,aff_country_code,aff_id,aff_name,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
count,6413,6407.0,6413,6413,6413,7187,2202,2202,7187,7187.0,7187,7187
unique,428,,50,50,718,998,51,51,985,4298.0,3450,1985
top,Beijing,,United States,US,grid.59025.3b,Nanyang Technological University,California,US-CA,pub.1019661721,,Yi,Wang
freq,631,,1949,1949,229,229,548,548,105,140.0,63,300
mean,,3039684.0,,,,,,,,,,
std,,1553733.0,,,,,,,,,,
min,,66093.0,,,,,,,,,,
25%,,1816670.0,,,,,,,,,,
50%,,2174003.0,,,,,,,,,,
75%,,4722625.0,,,,,,,,,,


Let's get the top ten values for `aff_id`. 

In [14]:
affiliations['aff_id'].value_counts()[:10]

grid.59025.3b    229
grid.168010.e    202
grid.12527.33    111
grid.21729.3f    106
grid.19006.3e     98
grid.5379.8       90
grid.9227.e       84
grid.116068.8     83
grid.59053.3a     82
grid.5333.6       82
Name: aff_id, dtype: int64

> Explanation: the most frequent value is still [grid.59025.3b](https://www.grid.ac/institutes/grid.59025.3b), which suggests that many authors are still at the organization they were affiliated with when they published these articles.
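
One way to check how often an author's current organization matches the affiliation recorded on the publication is to join the two dataframes on `pub_id` and `researcher_id`. The sketch below uses the five authors of `pub.1050119463` shown in the outputs above; the names `authors_sample`, `affiliations_sample` and `same_org` are illustrative.

```python
import pandas as pd

# The five authors of pub.1050119463, copied from the outputs above.
researcher_ids = [
    "ur.0633062306.03",
    "ur.07617630407.83",
    "ur.01150450507.27",
    "ur.01313340113.13",
    "ur.0752174033.73",
]

authors_sample = pd.DataFrame(
    {
        "pub_id": ["pub.1050119463"] * 5,
        "researcher_id": researcher_ids,
        "current_organization_id": [
            "grid.430387.b",
            "grid.42687.3f",
            "grid.4280.e",
            "grid.45672.32",
            "grid.4280.e",
        ],
    }
)

affiliations_sample = pd.DataFrame(
    {
        "pub_id": ["pub.1050119463"] * 5,
        "researcher_id": researcher_ids,
        "aff_id": [
            "grid.430387.b",
            "grid.42687.3f",
            "grid.4280.e",
            "grid.482254.d",
            "grid.4280.e",
        ],
    }
)

# Pair each affiliation row with the matching author row.
merged = affiliations_sample.merge(
    authors_sample, on=["pub_id", "researcher_id"], how="inner"
)

# True where the affiliation at publication time equals the current organization.
merged["same_org"] = merged["aff_id"] == merged["current_organization_id"]
share_same = merged["same_org"].mean()
print(f"share of affiliations matching the current organization: {share_same:.0%}")
```

On the full dataset, rows with an empty `researcher_id` would match each other within the same publication, so it is probably safer to filter them out before merging.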

Another example: we can now easily analyze the data by country too. 

In [18]:
affiliations['aff_country'].value_counts()[:10]

United States     1949
China             1914
Singapore          328
South Korea        313
United Kingdom     261
Germany            201
Japan              168
Australia          157
Switzerland        122
Spain              101
Name: aff_country, dtype: int64

> Explanation: the United States accounts for the largest number of affiliations in this dataset (1949), closely followed by China (1914).
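
Because the affiliations dataframe has one row per affiliation, `value_counts` on `aff_country` counts affiliation rows rather than publications. As a rough sketch, the number of distinct publications per country can be obtained with `groupby` and `nunique`. The sample below reuses the five affiliation rows of `pub.1050119463` shown earlier.

```python
import pandas as pd

# The five affiliation rows of pub.1050119463 shown earlier.
affiliations_sample = pd.DataFrame(
    {
        "pub_id": ["pub.1050119463"] * 5,
        "aff_country": [
            "United States",
            "South Korea",
            "Singapore",
            "Taiwan",
            "Singapore",
        ],
    }
)

# Affiliation rows per country: Singapore appears twice.
rows_per_country = affiliations_sample["aff_country"].value_counts()

# Distinct publications per country: every country maps to the same single paper.
pubs_per_country = (
    affiliations_sample.groupby("aff_country")["pub_id"]
    .nunique()
    .sort_values(ascending=False)
)

print(rows_per_country)
print(pubs_per_country)
```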

## Conclusions 

Moving Dimensions API results to pandas dataframes **makes it easier** to **analyze the data** and **answer research questions**. 

Note: the examples above only scratch the surface of what can be done with pandas! 

---
# Want to learn more?

Check out the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many tutorials and reusable Jupyter notebooks for scholarly data analytics. 