# The Dimcli Python library: Working with Pandas Dataframes

[Dimcli](https://github.com/lambdamusic/dimcli) includes a few utilities that make it easier to transform Dimensions JSON data into Pandas [dataframe objects](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe). 

Dataframes are then easy to sort, analyse, export as CSV and use within visualisation software.

>  [pandas](https://pandas.pydata.org/pandas-docs/stable/) is a popular software library written for the Python programming language for data manipulation and analysis.


In [1]:
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai"

!pip install dimcli -U --quiet

# import all libraries and login
import pandas
import dimcli
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

Dimcli - Dimensions API Client (v0.6.9)
Connected to endpoint: https://app.dimensions.ai - DSL version: 1.24
Method: dsl.ini file


## 1. General method to transform JSON query results into a dataframe 

The `DslDataset.as_dataframe` method lets you quickly turn the results of any query into a dataframe. 

In [2]:
# we'll reuse this query later on 
query = """search publications for "graphene" 
            where year in [2013:2019] 
            return publications sort by times_cited limit 1000"""
res = dsl.query(query)

Returned Publications: 1000 (total = 425229)


In [3]:
df = res.as_dataframe()
df.head(10)

Unnamed: 0,title,author_affiliations,volume,issue,pages,type,year,id,journal.id,journal.title
0,The chemistry of two-dimensional layered trans...,"[[{'first_name': 'Manish', 'last_name': 'Chhow...",5.0,4.0,263-275,article,2013,pub.1050119463,jour.1041224,Nature Chemistry
1,Van der Waals heterostructures,"[[{'first_name': 'A. K.', 'last_name': 'Geim',...",499.0,7459.0,419-425,article,2013,pub.1024857999,jour.1018957,Nature
2,Interface engineering of highly efficient pero...,"[[{'first_name': 'Huanping', 'last_name': 'Zho...",345.0,6196.0,542-546,article,2014,pub.1004394295,jour.1346339,Science
3,Black phosphorus field-effect transistors,"[[{'first_name': 'Likai', 'last_name': 'Li', '...",9.0,5.0,372-377,article,2014,pub.1032956475,jour.1037429,Nature Nanotechnology
4,The Li-ion rechargeable battery: a perspective.,"[[{'first_name': 'John B.', 'last_name': 'Good...",135.0,4.0,1167-76,article,2013,pub.1019126274,jour.1081898,Journal of the American Chemical Society
5,"Nanoenergy, Nanotechnology Applied for Energy ...",,,,,book,2013,pub.1031762191,,
6,Phosphorene: an unexplored 2D semiconductor wi...,"[[{'first_name': 'Han', 'last_name': 'Liu', 'i...",8.0,4.0,4033-41,article,2014,pub.1009826879,jour.1038917,ACS Nano
7,Raman spectroscopy as a versatile tool for stu...,"[[{'first_name': 'Andrea C.', 'last_name': 'Fe...",8.0,4.0,235-246,article,2013,pub.1015305822,jour.1037429,Nature Nanotechnology
8,Carbon Nanotubes: Present and Future Commercia...,"[[{'first_name': 'Michael F. L.', 'last_name':...",339.0,6119.0,535-539,article,2013,pub.1007405937,jour.1346339,Science
9,The emergence of perovskite solar cells,"[[{'first_name': 'Martin A.', 'last_name': 'Gr...",8.0,7.0,506-514,article,2014,pub.1045181228,jour.1037430,Nature Photonics
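
Note how nested JSON keys such as `journal` show up as dotted columns (`journal.id`, `journal.title`), while records lacking that key get empty cells. The sketch below reproduces that behaviour with `pandas.json_normalize` on a couple of invented records. It illustrates the general flattening idea and is not necessarily how Dimcli builds the dataframe internally.

```python
import pandas as pd

# Invented sample records, shaped like Dimensions publication JSON
records = [
    {"id": "pub.1", "title": "Paper A", "year": 2013,
     "journal": {"id": "jour.1", "title": "Journal X"}},
    {"id": "pub.2", "title": "Book B", "year": 2014},  # no journal key
]

# Nested dictionaries become dotted column names
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat)
```

The second record has no `journal` key, so its `journal.id` and `journal.title` cells are left as missing values.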


Pandas dataframes offer a myriad of utilities for inspecting data. Check out the [official docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) or google a [pandas tutorial](https://www.google.com/search?q=pandas+tutorial) to learn more about them. 

In [4]:
# the table shape
df.shape

(1000, 10)

In [5]:
# the 'value_counts' method returns the distribution of a specific field eg publication [years]
df['year'].value_counts()

2013    324
2014    299
2015    210
2016    105
2017     50
2018     11
2019      1
Name: year, dtype: int64

In [6]:
# eg distribution of publication [type]
df['type'].value_counts()

article    997
book         2
chapter      1
Name: type, dtype: int64
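
Sorting and CSV export work along the same lines. The following is a minimal sketch on a small invented dataframe (`df_small` is a stand-in for `df`, so that the snippet runs without API access).

```python
import pandas as pd

# Invented stand-in for the publications dataframe
df_small = pd.DataFrame({
    "id": ["pub.1", "pub.2", "pub.3"],
    "year": [2014, 2013, 2013],
    "type": ["article", "book", "article"],
})

# Sort by year, then by id
by_year = df_small.sort_values(["year", "id"])

# Relative (rather than absolute) distribution of publication types
share = df_small["type"].value_counts(normalize=True)

# With no path argument, to_csv returns the CSV text as a string;
# pass a filename (e.g. "publications.csv") to write a file instead
csv_text = by_year.to_csv(index=False)
print(csv_text)
```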

## 2. Dataframe Methods for 'Publications' queries

What follows are specialized versions of the `as_dataframe` method for result sets composed of publication records. 

###  Extracting authors: `as_dataframe_authors`

Publication authors are usually returned by the Dimensions API inside a nested JSON object in the `author_affiliations` sub-key. 

> Note: the order of authors in the JSON is consistent with the ordering of authors in the original publication

This method lets you quickly extract that data and return a dataframe with **one row per author**.

In [7]:
authors = res.as_dataframe_authors()
authors.head()

Unnamed: 0,first_name,last_name,initials,corresponding,orcid,current_organization_id,researcher_id,affiliations,pub_id
0,Manish,Chhowalla,,True,,grid.5335.0,ur.0633062306.03,"[{'id': 'grid.430387.b', 'name': 'Rutgers, The...",pub.1050119463
1,Hyeon Suk,Shin,,,,grid.42687.3f,ur.07617630407.83,"[{'id': 'grid.42687.3f', 'name': 'Ulsan Nation...",pub.1050119463
2,Goki,Eda,,,,grid.4280.e,ur.01150450507.27,"[{'id': 'grid.4280.e', 'name': 'National Unive...",pub.1050119463
3,Lain-Jong,Li,,,['0000-0002-4059-7783'],grid.45672.32,ur.01313340113.13,"[{'id': 'grid.28665.3f', 'name': 'Academia Sin...",pub.1050119463
4,Kian Ping,Loh,,,['0000-0002-1491-743X'],grid.4280.e,ur.0752174033.73,"[{'id': 'grid.4280.e', 'name': 'National Unive...",pub.1050119463
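
To make the 'one row per author' idea concrete, here is a rough sketch of the same flattening done by hand on invented records. It assumes the nested list-of-lists shape visible in the `author_affiliations` column earlier on. It is an illustration only, not Dimcli's actual implementation.

```python
import pandas as pd

# Invented records mimicking the nested author_affiliations structure
pubs = [
    {"id": "pub.1", "author_affiliations": [[
        {"first_name": "Ada", "last_name": "Lovelace",
         "affiliations": [{"id": "grid.1", "name": "Org One"}]},
        {"first_name": "Alan", "last_name": "Turing", "affiliations": []},
    ]]},
    {"id": "pub.2"},  # a record without author data
]

rows = []
for pub in pubs:
    # author_affiliations is a list of lists of author dictionaries
    for group in pub.get("author_affiliations", []):
        for author in group:
            # keep a reference to the publication each author came from
            rows.append({**author, "pub_id": pub["id"]})

authors_sketch = pd.DataFrame(rows)
print(authors_sketch[["first_name", "last_name", "pub_id"]])
```

Authors are appended in the order they appear in the JSON, so the original author order is preserved.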


Using the authors dataframe, we can easily get the top ten values for `current_organization_id`. 

In [8]:
authors['current_organization_id'].value_counts()[:10]

                 200
grid.168010.e    157
grid.59025.3b    136
grid.5379.8       88
grid.12527.33     83
grid.116068.8     83
grid.59053.3a     80
grid.19006.3e     75
grid.13402.34     72
grid.5333.6       71
Name: current_organization_id, dtype: int64

> Explanation: the first result is empty, meaning that Dimensions has no `current_organization_id` info for those 200 authors. The most frequent actual value is grid.168010.e, followed by grid.59025.3b, i.e. [Nanyang Technological University in Singapore](https://www.grid.ac/institutes/grid.59025.3b). 
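
If the empty values get in the way, they can be converted to missing values before counting, since `value_counts` skips missing values by default. A minimal sketch on an invented series (the same idea should apply to `authors['current_organization_id']`):

```python
import pandas as pd

# Invented stand-in for authors['current_organization_id']
org_ids = pd.Series(["", "grid.1", "grid.2", "grid.1", ""],
                    name="current_organization_id")

# Turn empty strings into missing values (NaN)
known = org_ids.mask(org_ids == "")

# value_counts ignores missing values unless dropna=False is passed
top = known.value_counts()

# How many authors have no organization info at all
missing = known.isna().sum()

print(top)
print("missing:", missing)
```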

### Extracting Affiliations: `as_dataframe_authors_affiliations`

As you can see from the results of the previous section, the `affiliations` field of each author is yet another nested JSON object. 

> Note: the order of affiliations in the JSON is consistent with the affiliations order in the original publication

The `as_dataframe_authors_affiliations` method lets you quickly extract the affiliations data and return a dataframe with **one row per affiliation**.

This can be useful e.g. if one wants to count research organizations at *the time of writing* (as opposed to `current_organization_id`, which is the *most recent organization* of a researcher). 

In [9]:
affiliations = res.as_dataframe_authors_affiliations()
affiliations.head()

Unnamed: 0,aff_id,aff_name,aff_city,aff_city_id,aff_country,aff_country_code,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
0,grid.430387.b,"Rutgers, The State University of New Jersey",New Brunswick,5101720.0,United States,US,New Jersey,US-NJ,pub.1050119463,ur.0633062306.03,Manish,Chhowalla
1,grid.42687.3f,Ulsan National Institute of Science and Techno...,Ulsan,1833750.0,South Korea,KR,,,pub.1050119463,ur.07617630407.83,Hyeon Suk,Shin
2,grid.4280.e,National University of Singapore,Singapore,1880250.0,Singapore,SG,,,pub.1050119463,ur.01150450507.27,Goki,Eda
3,grid.4280.e,National University of Singapore,Singapore,1880250.0,Singapore,SG,,,pub.1050119463,ur.01150450507.27,Goki,Eda
4,grid.4280.e,National University of Singapore,Singapore,1880250.0,Singapore,SG,,,pub.1050119463,ur.01150450507.27,Goki,Eda


In [10]:
affiliations.describe(include="all")

Unnamed: 0,aff_id,aff_name,aff_city,aff_city_id,aff_country,aff_country_code,aff_state,aff_state_code,pub_id,researcher_id,first_name,last_name
count,7810.0,7810,7810.0,7810.0,7810,7810,7810.0,7810.0,7810,7810.0,7810,7810
unique,727.0,1078,446.0,448.0,51,51,53.0,53.0,985,4363.0,3500,2097
top,,Nanyang Technological University,,,United States,US,,,pub.1019661721,,Yi,Wang
freq,989.0,242,989.0,995.0,2162,2162,5350.0,5350.0,108,185.0,72,325


Let's get the top ten values for `aff_id`. 

In [11]:
affiliations['aff_id'].value_counts()[:10]

                 989
grid.59025.3b    242
grid.168010.e    221
grid.19006.3e    135
grid.4280.e      116
grid.21729.3f    116
grid.8217.c      113
grid.5379.8      106
grid.116068.8    100
grid.12527.33     97
Name: aff_id, dtype: int64

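> Explanation: the most frequent non-empty value is [grid.59025.3b](https://www.grid.ac/institutes/grid.59025.3b), i.e. Nanyang Technological University (see the `top` row for `aff_name` in the `describe` output above). This organization also ranks near the top for `current_organization_id`, which suggests that many of its authors are still at the organization they were affiliated with when these articles were published. 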

Another example: we can now easily analyze the data by country too. 

In [12]:
affiliations['aff_country'].value_counts()[:10]

United States     2162
China             1833
                   989
Singapore          382
South Korea        328
United Kingdom     317
Germany            228
Australia          175
Japan              174
Canada             123
Name: aff_country, dtype: int64

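> Explanation: the largest share of affiliations in this dataset is in the United States, followed by China. The 989 rows with an empty value are affiliations for which no country information is available. 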
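
Keep in mind that these counts are affiliation rows, so a researcher with several affiliations or several publications is counted more than once. The sketch below contrasts the two ways of counting on invented data. It assumes `researcher_id` is populated, which is not always the case in real results.

```python
import pandas as pd

# Invented stand-in for the affiliations dataframe
aff = pd.DataFrame({
    "aff_country": ["United States", "United States", "China", "China", "China"],
    "researcher_id": ["ur.1", "ur.1", "ur.2", "ur.3", "ur.3"],
    "pub_id": ["pub.1", "pub.2", "pub.1", "pub.3", "pub.4"],
})

# Counting rows: one per affiliation, as in the cell above
rows_per_country = aff["aff_country"].value_counts()

# Counting distinct researchers per country instead
researchers_per_country = (
    aff.groupby("aff_country")["researcher_id"]
       .nunique()
       .sort_values(ascending=False)
)

print(rows_per_country)
print(researchers_per_country)
```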

## 3. Dataframe Methods for 'Grants' queries

###  Extracting Funders: `as_dataframe_funders`

Grant funders are usually returned by the Dimensions API inside a nested JSON object in the `funders` sub-key. 

This method lets you quickly extract that data and return a dataframe with **one row per funder**.

In [13]:
# get a sample list of grants
query = """search grants for "malaria" return grants limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 11076)


In [14]:
res.as_dataframe_funders().head(10)

Unnamed: 0,id,linkout,name,country_name,types,longitude,city_name,latitude,acronym,state_name,grant_id,grant_title,grant_start_date,grant_end_date
0,grid.270680.b,[http://ec.europa.eu/index_en.htm],European Commission,Belgium,[Government],4.36367,Brussels,50.85165,EC,,grant.8964341,mAlaRIa Sex dEtermination,2020-10-01,2022-09-30
1,grid.419681.3,[http://www.niaid.nih.gov/Pages/default.aspx],National Institute of Allergy and Infectious D...,United States,[Facility],-77.11183,Bethesda,39.066647,NIAID,Maryland,grant.9018944,Establishment of the New York University Vacci...,2020-04-10,2026-11-30
2,grid.419681.3,[http://www.niaid.nih.gov/Pages/default.aspx],National Institute of Allergy and Infectious D...,United States,[Facility],-77.11183,Bethesda,39.066647,NIAID,Maryland,grant.9020234,Antigen discovery for transmission-blocking va...,2020-04-06,2025-03-31
3,grid.419681.3,[http://www.niaid.nih.gov/Pages/default.aspx],National Institute of Allergy and Infectious D...,United States,[Facility],-77.11183,Bethesda,39.066647,NIAID,Maryland,grant.9020279,Repurposing kinase inhibitor chemotypes as ant...,2020-04-03,2025-03-31
4,grid.270680.b,[http://ec.europa.eu/index_en.htm],European Commission,Belgium,[Government],4.36367,Brussels,50.85165,EC,,grant.8585457,Estimating the Prevalence of AntiMicrobial Res...,2020-04-01,2022-03-31
5,grid.420089.7,[http://www.nichd.nih.gov/Pages/index.aspx],National Institute of Child Health and Human D...,United States,[Facility],-77.10042,Bethesda,39.001095,NICHD,Maryland,grant.9018783,Physiological Functions of Female Reproductive...,2020-04-01,2025-03-31
6,grid.419681.3,[http://www.niaid.nih.gov/Pages/default.aspx],National Institute of Allergy and Infectious D...,United States,[Facility],-77.11183,Bethesda,39.066647,NIAID,Maryland,grant.9019363,Determining the mechanism of antibody-mediated...,2020-04-01,2022-03-31
7,grid.419681.3,[http://www.niaid.nih.gov/Pages/default.aspx],National Institute of Allergy and Infectious D...,United States,[Facility],-77.11183,Bethesda,39.066647,NIAID,Maryland,grant.9019371,Computational models of naturally acquired imm...,2020-04-01,2025-03-31
8,grid.280785.0,[http://www.nigms.nih.gov/Pages/default.aspx],National Institute of General Medical Sciences,United States,[Facility],-77.09938,Bethesda,38.997833,NIGMS,Maryland,grant.9018600,A Biomedical Mass Spectrometry Resource: Ongoi...,2020-04-01,2023-03-31
9,grid.425888.b,[http://www.snf.ch/en],Swiss National Science Foundation,Switzerland,[Government],7.432395,Bern,46.94923,SNF,,grant.8968483,Adressing concerns over gene drive based malar...,2020-04-01,2020-06-30
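
Because each row repeats the grant details next to the funder, it is straightforward to summarize grants per funder. A minimal sketch on invented rows that reuse the column names shown above (funder names and dates are made up):

```python
import pandas as pd

# Invented stand-in for res.as_dataframe_funders()
funders = pd.DataFrame({
    "name": ["Funder A", "Funder B", "Funder B"],
    "grant_id": ["grant.1", "grant.2", "grant.3"],
    "grant_start_date": ["2020-10-01", "2020-04-01", "2020-04-03"],
})

# Dates arrive as strings; convert them before comparing
funders["grant_start_date"] = pd.to_datetime(funders["grant_start_date"])

# Number of distinct grants and earliest start date per funder
summary = funders.groupby("name").agg(
    grants=("grant_id", "nunique"),
    first_start=("grant_start_date", "min"),
)
print(summary)
```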


### Extracting investigators: `as_dataframe_investigators`

Grant investigators are usually returned by the Dimensions API inside a nested JSON object in the `investigator_details` sub-key. 

This method lets you quickly extract that data and return a dataframe with **one row per investigator**.

> NOTE: `investigator_details` is not returned by default in a grants query, so it must be requested explicitly in the `return` statement of the query.

In [15]:
# get a sample list of grants
query = """search grants for "malaria" return grants[basics+investigator_details] limit 1000"""
res = dsl.query(query)

Returned Grants: 1000 (total = 11076)
Field 'title_language' is deprecated in favor of language_title. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
Field 'project_num' is deprecated in favor of grant_number. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details


In [16]:
res.as_dataframe_investigators().head(10)

Unnamed: 0,affiliations,first_name,last_name,role,id,middle_name,grant_id,grant_title,grant_start_date,grant_end_date
0,"[{'state': 'NY', 'state_code': 'US-NY', 'city'...",MARK JOSEPH,MULLIGAN,PI,,,grant.9018944,Establishment of the New York University Vacci...,2020-04-10,2026-11-30
1,"[{'state': '', 'state_code': None, 'city': 'SH...",YAMING,CAO,PI,,,grant.9020234,Antigen discovery for transmission-blocking va...,2020-04-06,2025-03-31
2,"[{'state': '', 'state_code': None, 'city': 'RO...",KELLY,CHIBALE,PI,,,grant.9020279,Repurposing kinase inhibitor chemotypes as ant...,2020-04-03,2025-03-31
3,,WILLIAM,ZUERCHER,Co-PI,,,grant.9020279,Repurposing kinase inhibitor chemotypes as ant...,2020-04-03,2025-03-31
4,,LORI,FERRINS,Co-PI,,,grant.9020279,Repurposing kinase inhibitor chemotypes as ant...,2020-04-03,2025-03-31
5,"[{'state': 'CT', 'state_code': 'US-CT', 'city'...",JIANJUN,SUN,PI,,,grant.9018783,Physiological Functions of Female Reproductive...,2020-04-01,2025-03-31
6,"[{'state': 'MN', 'state_code': 'US-MN', 'city'...",GEOFFREY T,HART,PI,,,grant.9019363,Determining the mechanism of antibody-mediated...,2020-04-01,2022-03-31
7,"[{'state': 'CA', 'state_code': 'US-CA', 'city'...",BRYAN R,GREENHOUSE,PI,,,grant.9019371,Computational models of naturally acquired imm...,2020-04-01,2025-03-31
8,,ATUL J.,BUTTE,Co-PI,,,grant.9019371,Computational models of naturally acquired imm...,2020-04-01,2025-03-31
9,,PRASANNA,JAGANNATHAN,Co-PI,,,grant.9019371,Computational models of naturally acquired imm...,2020-04-01,2025-03-31
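
The `role` column makes it easy to separate principal investigators from co-investigators, or to measure team sizes. A small sketch on invented rows (names and IDs are made up; the role labels `PI` and `Co-PI` are the ones visible in the output above):

```python
import pandas as pd

# Invented stand-in for res.as_dataframe_investigators()
inv = pd.DataFrame({
    "first_name": ["ANN", "BOB", "CAT", "DAN"],
    "last_name": ["ONE", "TWO", "THREE", "FOUR"],
    "role": ["PI", "Co-PI", "Co-PI", "PI"],
    "grant_id": ["grant.1", "grant.1", "grant.1", "grant.2"],
})

# Keep only principal investigators
pis = inv[inv["role"] == "PI"]

# Number of investigators listed on each grant
team_size = inv.groupby("grant_id").size()

# Grants as rows, roles as columns
roles = pd.crosstab(inv["grant_id"], inv["role"])

print(pis)
print(roles)
```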


## 4. Dataframe Methods for 'Concepts' queries

These methods can be used with all content types that support the extraction of concepts, i.e., `publications` or `grants`. See the [official documentation](https://docs.dimensions.ai/dsl/data-sources.html) for more details.

### Extracting Concepts: `as_dataframe_concepts`

The `as_dataframe_concepts` method lets you quickly extract all concepts attached to a record, **one row per concept**, which makes operations like counting or plotting the results easier.

NOTE: concepts are normalized keywords describing the main topics of a document, which are automatically derived from the full text using machine learning. In the JSON data, concepts are returned as an ordered list (the first items are the most relevant), like this one: 

```
{'concepts': ['electrochemical conversion',
  'conversion',
  'CO2',
  'formate',
  'formic acid',
  'acid'],
 'id': 'pub.1122072646'}
```

The `as_dataframe_concepts` method extracts all concepts data from JSON to a dataframe (this is functionally similar to the pandas [explode method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html)). Moreover, it automatically creates a number of metrics that can be used to carry out further analyses:

1. `concepts_count`: an integer representing the total number of concepts per single document ID.
2. `score`: a float representing the weighted importance of the concept within a document. This is obtained by normalizing its ranking against the total number of concepts for a single document. E.g., if a document has 10 concepts in total, the first concept gets score=1, the second score=0.9, and so on.
3. `frequency`: an integer representing how often that concept occurs within the full results-set returned by a query, i.e. how many documents have that concept name. E.g., if a concept appears in 5 documents, frequency=5.
4. `score_avg`: the average (mean) value of all scores for a single concept, across the full set of documents returned by a query. 
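
The four metrics can be reconstructed by hand from the descriptions above. The sketch below does so with plain pandas on two invented documents. It is an interpretation of the definitions (with `score` taken as `(concepts_count - rank) / concepts_count`, using a 0-based rank), not Dimcli's actual code.

```python
import pandas as pd

# Invented documents with ordered concept lists (most relevant first)
docs = pd.DataFrame([
    {"id": "pub.1", "concepts": ["graphene", "oxide", "sensor", "film"]},
    {"id": "pub.2", "concepts": ["sensor", "graphene"]},
])

# 1. concepts_count: number of concepts per document
docs["concepts_count"] = docs["concepts"].str.len()

# One row per concept, keeping the original order
long = (
    docs.explode("concepts")
        .rename(columns={"concepts": "concept"})
        .reset_index(drop=True)
)

# 2. score: position-based weight, 1.0 for the first concept
long["rank"] = long.groupby("id").cumcount()  # 0-based position in the document
long["score"] = (
    (long["concepts_count"] - long["rank"]) / long["concepts_count"]
).round(3)

# 3. frequency: number of documents containing the concept
long["frequency"] = long.groupby("concept")["id"].transform("nunique")

# 4. score_avg: mean score of the concept across all documents
long["score_avg"] = long.groupby("concept")["score"].transform("mean").round(3)

print(long[["id", "concept", "score", "frequency", "score_avg"]])
```

With this formula, a document with 37 concepts gives its second concept a score of 36/37, i.e. about 0.973.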


By sorting and segmenting concepts using these metrics, it is possible to fine-tune the list of extracted concepts and make it more suitable for the application at hand.

In [17]:
q = """search publications for "graphene" 
            where year=2019 
       return publications[id+title+year+concepts] limit 100"""
concepts = dsl.query(q).as_dataframe_concepts()
concepts.head()

Returned Publications: 100 (total = 101412)


Unnamed: 0,id,title,year,concepts_count,concept,score,frequency,score_avg
0,pub.1120769344,Reliability issues of lead-free solder joints ...,2019,37,electronic products,1.0,1,1.0
1,pub.1120769344,Reliability issues of lead-free solder joints ...,2019,37,products,0.973,3,0.755
2,pub.1120769344,Reliability issues of lead-free solder joints ...,2019,37,miniaturization,0.946,1,0.946
3,pub.1120769344,Reliability issues of lead-free solder joints ...,2019,37,high integration,0.919,1,0.919
4,pub.1120769344,Reliability issues of lead-free solder joints ...,2019,37,integration,0.892,4,0.539


**E.g. Sorting by score_avg** highlights concepts that are important throughout the dataset.


In [18]:
concepts_unique = concepts.drop_duplicates("concept")

In [19]:
concepts_unique.sort_values("score_avg", ascending=False)

Unnamed: 0,id,title,year,concepts_count,concept,score,frequency,score_avg
0,pub.1120769344,Reliability issues of lead-free solder joints ...,2019,37,electronic products,1.000,1,1.000
939,pub.1126259617,Intravenous delivery of enzalutamide based on ...,2019,43,background,1.000,1,1.000
526,pub.1122086607,Metal-organic framework-coated magnetite nanop...,2019,52,tumor treatment,1.000,1,1.000
3056,pub.1121964614,Electronic and optoelectronic applications of ...,2019,48,isolation of graphene,1.000,1,1.000
2962,pub.1123924921,Phenolic Resin Foam Composites Reinforced by A...,2019,57,phenolic foam composites,1.000,1,1.000
...,...,...,...,...,...,...,...,...
1186,pub.1123764790,Enhancement of HDO Activity of MoP/SiO2 Cataly...,2019,61,alumina,0.016,1,0.016
155,pub.1123764898,All-Waste Hybrid Composites with Waste Silicon...,2019,66,module,0.015,1,0.015
1958,pub.1123764888,Nitrogen-Doped Porous Carbon Derived from Biom...,2019,66,evolution,0.015,1,0.015
2554,pub.1122803664,Tetrahedral amorphous carbon prepared filter c...,2019,71,LEDs,0.014,1,0.014


**E.g. Sorting by frequency** highlights concepts that are shared by many documents, but do not necessarily have a high score.

In [20]:
concepts_unique.sort_values("frequency", ascending=False)

Unnamed: 0,id,title,year,concepts_count,concept,score,frequency,score_avg
165,pub.1110196431,Tuning phonon transport spectrum for better th...,2019,53,properties,0.830,37,0.678
85,pub.1123764889,Smart Non-Woven Fiber Mats with Light-Induced ...,2019,51,applications,0.059,36,0.404
83,pub.1123764889,Smart Non-Woven Fiber Mats with Light-Induced ...,2019,51,materials,0.098,35,0.570
215,pub.1117611535,Spectrum adapted the expectation-maximization ...,2019,28,method,0.786,24,0.533
74,pub.1123764889,Smart Non-Woven Fiber Mats with Light-Induced ...,2019,51,performance,0.275,21,0.393
...,...,...,...,...,...,...,...,...
1248,pub.1123754647,Nanolayer Film on Poly(Styrene/Ethylene Glycol...,2019,66,PSS,0.076,1,0.076
1246,pub.1123754647,Nanolayer Film on Poly(Styrene/Ethylene Glycol...,2019,66,polyelectrolyte solutions,0.106,1,0.106
1243,pub.1123754647,Nanolayer Film on Poly(Styrene/Ethylene Glycol...,2019,66,multilayers,0.152,1,0.152
1242,pub.1123754647,Nanolayer Film on Poly(Styrene/Ethylene Glycol...,2019,66,polyelectrolyte multilayers,0.167,1,0.167


#### Which concept metrics should you use? 

That depends on the data you have (e.g. how homogeneous it is) and on your project's goals too. 

The various indicators available are meant to help you construct filtered lists of concepts more easily. But it's down to you to determine the right combination of `score` and `frequency` so that the 'right' concepts become more apparent! 
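
As a starting point, one could keep only concepts that pass a minimum threshold on both metrics. The thresholds and the sample values below are invented for illustration. In the notebook the same filter would be applied to `concepts_unique`.

```python
import pandas as pd

# Invented stand-in for concepts_unique (one row per concept)
concepts_sketch = pd.DataFrame({
    "concept": ["graphene oxide", "properties",
                "electronic products", "thermal conductivity"],
    "frequency": [12, 37, 1, 5],
    "score_avg": [0.71, 0.40, 1.00, 0.62],
})

# Hypothetical thresholds: tune these to your own dataset
MIN_FREQUENCY = 2    # drops concepts found in a single document
MIN_SCORE_AVG = 0.5  # drops generic terms that rank low on average

selected = concepts_sketch[
    (concepts_sketch["frequency"] >= MIN_FREQUENCY)
    & (concepts_sketch["score_avg"] >= MIN_SCORE_AVG)
].sort_values(["score_avg", "frequency"], ascending=False)

print(selected)
```

Raising `MIN_FREQUENCY` favours concepts shared across the dataset, while raising `MIN_SCORE_AVG` favours concepts that are central to the documents they appear in.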

Tip: see also the [Topic Modeling Analysis](https://api-lab.dimensions.ai/cookbooks/2-publications/Simple-topic-analysis.html) notebook for more examples about this topic.



## Conclusions 

Moving Dimensions API results to pandas dataframes **makes it easier** to **analyze the data** and **answer research questions**. 

Note: the examples above only scratch the surface of what can be done with pandas! 

> Tip: see also the *Dimcli: Magic Commands* notebook to find out what shortcuts are available for these dataframe methods. 