# Working with lists in the Dimensions API

In this notebook we are going to show:

* How to use lists in order to write more efficient DSL queries
* How lists can be used to concatenate the results of one query with another query 
* How these methods can be used for real-world applications, e.g., getting publications/patents/grants that cite my publications

## Prerequisites: Installing the Dimensions Library and Logging in

In [1]:

# @markdown # Get the API library and login
# @markdown Click the 'play' button on the left (or shift+enter) after entering your API credentials

username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}


!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

import json
import pandas as pd
import numpy as np

DimCli v0.6.8.1 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


## 1. How do we use lists in the Dimensions API?

We use lists in the API because they are easier to read, and easier to work with.

Here is a query without lists. 


How many publications were produced by either Monash University or the University of Melbourne (grid.1002.3, grid.1008.9) in either 2019 or 2020? Be really careful with your brackets!

In [2]:
%%dsldf

search publications 
where 
      (
          research_orgs.id = "grid.1008.9"
       or research_orgs.id = "grid.1002.3"
       )
  and (
          year = 2019 
       or year = 2020
       )
return publications 
limit 1


Returned Publications: 1 (total = 32329)


Unnamed: 0,year,type,title,issue,volume,id,author_affiliations,pages,journal.id,journal.title
0,2020,article,Structural brain changes with lifetime trauma ...,1,11,pub.1125399511,"[[{'first_name': 'Marie-Laure', 'last_name': '...",1733247,jour.1045059,European Journal of Psychotraumatology


The query above could get really messy. What if I wanted 20 institutions? What if I wanted the last ten years? The query would turn into (or, or, or, or, ...) and (or, or, or, or, ...).

By using lists we can quickly add a large number of conditions by means of an easy-to-read square-bracket notation:

In [4]:
%%dsldf
search publications 
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id] 
limit 100

Returned Publications: 100 (total = 32329)


Unnamed: 0,id
0,pub.1125399511
1,pub.1125025149
2,pub.1121881108
3,pub.1125408894
4,pub.1125679504
...,...
95,pub.1127453754
96,pub.1125754478
97,pub.1127424640
98,pub.1126671331
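
To make the comparison concrete, here is a minimal sketch that builds both forms of a filter from plain Python lists. The helper names `or_chain` and `in_list` are hypothetical (they are not part of dimcli), and the explicit list `[2019, 2020]` is used in place of the `[2019:2020]` range shown above.

```python
import json

def or_chain(field, values):
    """Build the long form: (field = v1 or field = v2 ...)."""
    parts = [f"{field} = {json.dumps(v)}" for v in values]
    return "(" + " or ".join(parts) + ")"

def in_list(field, values):
    """Build the short form: field in [v1, v2, ...]."""
    return f"{field} in {json.dumps(list(values))}"

grids = ["grid.1008.9", "grid.1002.3"]
years = [2019, 2020]

# The long form grows by one "or" clause per value ...
print(or_chain("research_orgs.id", grids), "and", or_chain("year", years))
# ... while the list form stays on one short line per field.
print(in_list("research_orgs.id", grids), "and", in_list("year", years))
```

With 20 institutions the first form needs 20 clauses and matching brackets, while the second only needs a longer list.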


## 2. What are all the things that we can make lists of in the Dimensions API?

### What are the internal Entities that we might put in a list?

In [5]:
%dsldocs
dsl_last_results[dsl_last_results['is_entity']==True]

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
6,publications,category_bra,categories,`Broad Research Areas <https://app.dimensions....,True,True,True
7,publications,category_for,categories,`ANZSRC Fields of Research classification <htt...,True,True,True
8,publications,category_hra,categories,`Health Research Areas <https://app.dimensions...,True,True,True
9,publications,category_hrcs_hc,categories,`HRCS - Health Categories <https://app.dimensi...,True,True,True
10,publications,category_hrcs_rac,categories,`HRCS – Research Activity Codes <https://app.d...,True,True,True
...,...,...,...,...,...,...,...
260,datasets,research_org_cities,cities,City of the organisations the publication auth...,True,True,True
261,datasets,research_org_countries,countries,Country of the organisations the publication a...,True,True,True
262,datasets,research_org_states,states,State of the organisations the publication aut...,True,True,True
263,datasets,research_orgs,organizations,GRID organisations linked to the publication a...,True,True,True


### What about lists of ids?

In [6]:
%dsldocs
dsl_last_results[dsl_last_results['field'].str.contains('id')==True]

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
1,publications,altmetric_id,integer,AltMetric Publication ID,True,False,False
23,publications,id,string,Dimensions publication ID.,True,False,False
32,publications,pmcid,string,PubMed Central ID.,True,False,False
33,publications,pmid,string,PubMed ID.,True,False,False
37,publications,reference_ids,string,Dimensions publication ID for publications in ...,True,False,False
48,publications,supporting_grant_ids,string,"Grants supporting a publication, returned as a...",True,False,False
84,grants,id,string,Dimensions grant ID.,True,False,False
105,patents,associated_grant_ids,string,Dimensions IDs of the grants associated to the...,True,False,False
114,patents,cited_by_ids,string,Dimensions IDs of the patents that cite this p...,True,False,False
125,patents,id,string,Dimensions patent ID,True,False,False


### What are the external entities that we can put in a list?

* a list of ISSN's
* a list of External Grant IDs
* a list of DOIs
* a list of categories

## 3. Making a list from the results of a query

The list syntax for the Dimensions API is the same as the list syntax for JSON, so we can use Python's json.dumps function to turn the IDs returned by the previous query into a list for us.

Let's run our example query again.

In [7]:
%%dsldf
search publications 
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id] 
limit 100

Returned Publications: 100 (total = 32329)


Unnamed: 0,id
0,pub.1125399511
1,pub.1125025149
2,pub.1121881108
3,pub.1125408894
4,pub.1125679504
...,...
95,pub.1127453754
96,pub.1125754478
97,pub.1127424640
98,pub.1126671331


In [8]:
json.dumps(list(dsl_last_results.id))



'["pub.1125399511", "pub.1125025149", "pub.1121881108", "pub.1125408894", "pub.1125679504", "pub.1116652110", "pub.1125508088", "pub.1125663277", "pub.1125617654", "pub.1124349366", "pub.1123936545", "pub.1124817049", "pub.1124280046", "pub.1124106056", "pub.1124223401", "pub.1124083721", "pub.1124553291", "pub.1124482579", "pub.1125097280", "pub.1125313688", "pub.1125398558", "pub.1125488822", "pub.1125765172", "pub.1125827595", "pub.1125859842", "pub.1126267225", "pub.1123949259", "pub.1125893190", "pub.1124659985", "pub.1126706935", "pub.1126729120", "pub.1126671333", "pub.1125326520", "pub.1123948338", "pub.1124666940", "pub.1127143155", "pub.1127377346", "pub.1127361753", "pub.1127124910", "pub.1127124921", "pub.1127149809", "pub.1123746077", "pub.1123789526", "pub.1124227184", "pub.1124597518", "pub.1123836600", "pub.1126830206", "pub.1124776361", "pub.1127393844", "pub.1125685156", "pub.1127507905", "pub.1127544663", "pub.1124427106", "pub.1126729265", "pub.1124841398", "pub.112
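
As a small illustration of why `json.dumps` is used here rather than Python's default string conversion, the sketch below compares the two on a short list of IDs taken from the output above. Whether the DSL would also accept single-quoted strings is not covered in this notebook, so treating the JSON form as the safe choice is an assumption.

```python
import json

ids = ["pub.1125399511", "pub.1125025149"]

as_json = json.dumps(ids)  # double-quoted items, same notation as the DSL lists above
as_repr = str(ids)         # Python's own list notation uses single quotes

print(as_json)
print(as_repr)
```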

Let's try to use this list of IDs. 

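Unfortunately, you can't just put the Python expression directly into a %%dsldf query; as the error below shows, the text is parsed as DSL rather than evaluated by Python first: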

In [11]:
%%dsldf
  search publications
  where id in [json.dumps(list(dsl_last_results.id))]

  return publications 


Returned Errors: 1
1 QuerySyntaxError found
1 ParserError found
  * [Line 2:15] ('json') no viable alternative at input '[json'


...so let's get our results back again

In [14]:
%%dsldf
search publications 
where research_orgs.id in ["grid.1008.9","grid.1002.3"]
  and year in [2019:2020]
return publications[id] 
limit 100

Returned Publications: 100 (total = 32329)


Unnamed: 0,id
0,pub.1125399511
1,pub.1125025149
2,pub.1121881108
3,pub.1125408894
4,pub.1125679504
...,...
95,pub.1127453754
96,pub.1125754478
97,pub.1127424640
98,pub.1126671331


...and use the Python way of calling the Dimensions API instead

In [15]:
dsl.query(f"""

 search publications
  where id in {json.dumps(list(dsl_last_results.id))}

  return publications


""").as_dataframe()

f"""

 search publications
  where id in {json.dumps(list(dsl_last_results.id))}

  return publications


"""

Returned Publications: 20 (total = 100)


'\n\n search publications\n  where id in ["pub.1125399511", "pub.1125025149", "pub.1121881108", "pub.1125408894", "pub.1125679504", "pub.1116652110", "pub.1125508088", "pub.1125663277", "pub.1125617654", "pub.1124349366", "pub.1123936545", "pub.1124817049", "pub.1124280046", "pub.1124106056", "pub.1124223401", "pub.1124083721", "pub.1124553291", "pub.1124482579", "pub.1125097280", "pub.1125313688", "pub.1125398558", "pub.1125488822", "pub.1125765172", "pub.1125827595", "pub.1125859842", "pub.1126267225", "pub.1123949259", "pub.1125893190", "pub.1124659985", "pub.1126706935", "pub.1126729120", "pub.1126671333", "pub.1125326520", "pub.1123948338", "pub.1124666940", "pub.1127143155", "pub.1127377346", "pub.1127361753", "pub.1127124910", "pub.1127124921", "pub.1127149809", "pub.1123746077", "pub.1123789526", "pub.1124227184", "pub.1124597518", "pub.1123836600", "pub.1126830206", "pub.1124776361", "pub.1127393844", "pub.1125685156", "pub.1127507905", "pub.1127544663", "pub.1124427106", "pub

### Putting both parts of this example together

In [16]:
# Step 1. Get the list of publications..

pubs = dsl.query("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id] 
                    limit 100
                """).as_dataframe()

# Step 2. Put the list into the next query...

dsl.query_iterative(f"""
                 search publications
                    where id in {json.dumps(list(pubs.id))}
                    return publications
""").as_dataframe().head(5)

Returned Publications: 100 (total = 32329)
1000 / ...
100 / 100
===
Records extracted: 100


Unnamed: 0,volume,year,title,pages,type,author_affiliations,issue,id,journal.id,journal.title
0,11,2020,Structural brain changes with lifetime trauma ...,1733247,article,"[[{'first_name': 'Marie-Laure', 'last_name': '...",1,pub.1125399511,jour.1045059,European Journal of Psychotraumatology
1,11,2020,Exploring cultural differences in the use of e...,1729033,article,"[[{'first_name': 'Amanda', 'last_name': 'Nagul...",1,pub.1125025149,jour.1045059,European Journal of Psychotraumatology
2,25,2020,The large-scale implementation and evaluation ...,1-11,article,"[[{'first_name': 'Bengianni', 'last_name': 'Pi...",1,pub.1121881108,jour.1097842,International Journal of Adolescence and Youth
3,11,2020,Posttraumatic anger: a confirmatory factor ana...,1731127,article,"[[{'first_name': 'Grazia', 'last_name': 'Cesch...",1,pub.1125408894,jour.1045059,European Journal of Psychotraumatology
4,13,2020,Direct assessment of mental health and metabol...,1732665,article,"[[{'first_name': 'Peter S', 'last_name': 'Azzo...",1,pub.1125679504,jour.1041075,Global Health Action


### Doing something useful: Get all the publications that cite my publications

In [17]:
pubs = dsl.query("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id] 
                    limit 100
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query_iterative(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe().head()

Returned Publications: 100 (total = 32329)
1000 / ...
26 / 26
===
Records extracted: 26


Unnamed: 0,volume,year,title,pages,type,author_affiliations,issue,id,journal.id,journal.title
0,1.0,2020,"Non-photochemical quenching, a non-invasive pr...",32-43,article,"[[{'first_name': 'Pranali', 'last_name': 'Deor...",1.0,pub.1125663277,,
1,11.0,2020,Reconstructing evolutionary trajectories of mu...,731,article,"[[{'first_name': 'Yulia', 'last_name': 'Rubano...",1.0,pub.1124588823,jour.1043282,Nature Communications
2,18.0,2020,Reducing ignorance about who dies of what: res...,58,article,"[[{'first_name': 'Alan D.', 'last_name': 'Lope...",1.0,pub.1125488824,jour.1032885,BMC Medicine
3,11.0,2020,From GWAS to Function: Using Functional Genomi...,424,article,"[[{'first_name': 'Eddie', 'last_name': 'Cano-G...",,pub.1127545487,jour.1045144,Frontiers in Genetics
4,,2020,Understanding mental health and its determinan...,102148,article,"[[{'first_name': 'Lisa', 'last_name': 'Willenb...",,pub.1127416286,jour.1042319,Asian Journal of Psychiatry


## 5. How long can lists get?
The limit depends partly on the overall length of the query string. In addition, the `in` filter accepts fewer than 512 items, as the error message below shows.

### This won't work

In [18]:
pubs = dsl.query("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id] 
                    limit 1000
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe()

Returned Publications: 1000 (total = 32329)
Returned Errors: 1
Semantic Error
Semantic errors found:
	Filter operator 'in' requires 0 < items < 512. '1000 is out of this range'.
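
A simple guard can catch this before the query is sent. The sketch below is an assumption built only from the error message above (`0 < items < 512`); the names `MAX_IN_ITEMS` and `fits_in_filter` are hypothetical, and the check does not cover the separate limit on query string length.

```python
# Assumption taken from the error message: 0 < items < 512
MAX_IN_ITEMS = 511

def fits_in_filter(items, max_items=MAX_IN_ITEMS):
    """Return True if the list is non-empty and short enough for an `in` filter."""
    return 0 < len(items) <= max_items

print(fits_in_filter(["pub.x"] * 1000))  # the failing case above
print(fits_in_filter(["pub.x"] * 250))   # a list of 250 items
```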


### This will

In [19]:
pubs = dsl.query("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id] 
                    limit 250
                """)

mypubslist = json.dumps(list(pubs.as_dataframe().id))

dsl.query(f"""
                 search publications
                    where reference_ids in {mypubslist}
                    return publications
""").as_dataframe().head(2)

Returned Publications: 250 (total = 32329)
Returned Publications: 20 (total = 73)


Unnamed: 0,year,type,title,issue,volume,id,author_affiliations,pages,journal.id,journal.title
0,2020,article,"Non-photochemical quenching, a non-invasive pr...",1,1,pub.1125663277,"[[{'first_name': 'Pranali', 'last_name': 'Deor...",32-43,,
1,2020,article,Application of a risk-management framework for...,1,6,pub.1127516507,"[[{'first_name': 'Jan', 'last_name': 'Hudeček'...",15,jour.1052988,npj Breast Cancer


### What if I need a very long list?

The [Dimcli](https://github.com/lambdamusic/dimcli) library can break up your query into chunks. 

We then loop through each chunk, get the results, and stick them back together again at the end.

In [20]:
# Step 1 - same as before - except now we want the query in chunks

pubs_chunks = dsl.query("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id] 
                    limit 1000
                """).chunks(250)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results

query_results = []

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(
          
                  dsl.query_iterative(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return publications
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

pd.concat(query_results).\
   drop_duplicates(subset='id').\
   head(2)
   


Returned Publications: 1000 (total = 32329)
1000 / ...
73 / 73
===
Records extracted: 73
1000 / ...
52 / 52
===
Records extracted: 52
1000 / ...
56 / 56
===
Records extracted: 56
1000 / ...
90 / 90
===
Records extracted: 90


Unnamed: 0,year,type,title,issue,volume,id,author_affiliations,pages,journal.id,journal.title
0,2020,article,"Non-photochemical quenching, a non-invasive pr...",1,1,pub.1125663277,"[[{'first_name': 'Pranali', 'last_name': 'Deor...",32-43,,
1,2020,article,Application of a risk-management framework for...,1,6,pub.1127516507,"[[{'first_name': 'Jan', 'last_name': 'Hudeček'...",15,jour.1052988,npj Breast Cancer
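
The chunk-and-recombine pattern used above can be sketched without the API at all. This is not how dimcli implements `chunks`; it is a plain-Python illustration with hypothetical helper names (`chunked`, `merge_unique`) and made-up IDs.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def merge_unique(batches, key="id"):
    """Concatenate batches of records, keeping the first record seen per key."""
    seen = {}
    for batch in batches:
        for record in batch:
            seen.setdefault(record[key], record)
    return list(seen.values())

# 1000 placeholder IDs split into chunks of 250, as in the cell above
ids = [f"pub.{n}" for n in range(1000)]
chunks = list(chunked(ids, 250))
print(len(chunks), [len(c) for c in chunks])

# Two batches that overlap on one record, as can happen when the same
# citing publication is returned for more than one chunk
batches = [[{"id": "a"}, {"id": "b"}], [{"id": "b"}, {"id": "c"}]]
print(merge_unique(batches))
```

Removing duplicates matters because one publication can cite items from several chunks and would otherwise appear once per chunk.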


## 6. What if I want to get the researchers associated with the publications that cite my institution?

In [21]:
# Step 1 - same as before

pubs_chunks = dsl.query("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id] 
                    limit 1000
                """).chunks(250)

query_results = []

# Step 2 - same as before, but now I am returning researchers instead of publications

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(
          
                  dsl.query(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return researchers limit 1000
                        """).as_dataframe() 
      # Warning: if more than 1000 researchers are involved in this query, you will miss some
      )

# Step 3 join the queries back together, this time using a groupby statement to join the counts back together again

my_researchers = pd.concat(query_results).\
                 groupby(['id','first_name','last_name']).\
                  agg({'count':'sum'}).\
                  sort_values(by='count', ascending=False).\
                  head(10)


Returned Publications: 1000 (total = 32329)
Returned Researchers: 576
Returned Researchers: 318
Returned Researchers: 155
Returned Researchers: 293


## 7. What if I want to get *all* the researchers associated with the publications that cite my institution?

In [22]:
# Step 1 - same as before

pubs_chunks = dsl.query("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9","grid.1002.3"]
                      and year in [2019:2020]
                    return publications[id] 
                    limit 1000
                """).chunks(250)

query_results = []

# Step 2 - almost the same as before - 
#          except now we are asking for the as_dataframe_authors data frame

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(
          
                  dsl.query_iterative(f"""
                        search publications
                            where reference_ids in {mypubslist}
                            return publications
                        """).as_dataframe_authors() # I have changed this line from as_dataframe to as_dataframe_authors
      )

# Step 3 - join the publications back together

researcher_pubs = pd.concat(query_results).\
                drop_duplicates(subset=['researcher_id','pub_id'])


# Step 4 - count up the publications using a groupby statement

my_researchers = researcher_pubs[researcher_pubs['researcher_id'] != ''].\
    groupby(['researcher_id']).\
    agg({'first_name':'max','last_name':'max','pub_id':'count'}).\
    sort_values(by='pub_id', ascending=False).\
    reset_index()
    
my_researchers.\
    head(10)


Returned Publications: 1000 (total = 32329)
1000 / ...
73 / 73
===
Records extracted: 73
1000 / ...
52 / 52
===
Records extracted: 52
1000 / ...
56 / 56
===
Records extracted: 56
1000 / ...
90 / 90
===
Records extracted: 90


Unnamed: 0,researcher_id,first_name,last_name,pub_id
0,ur.014233544433.66,Shirui,Pan,6
1,ur.01167177047.84,Antonis C.,Antoniou,3
2,ur.01125623170.44,Alan D.,Lopez,3
3,ur.010771505735.21,Xingquan,Zhu,3
4,ur.012041436200.28,Alexander,Tsai,2
5,ur.0761250433.09,Eóin,Killackey,2
6,ur.01332326377.02,Anna,Jakubowska,2
7,ur.01357125251.51,Quaid D.,Morris,2
8,ur.01273277721.53,David E.,Goldgar,2
9,ur.01313240102.02,Trinidad,Caldés,2


## 8. ...and if we want details about our researchers, we can put our list of researchers into the researcher API

See [the researcher source docs](https://docs.dimensions.ai/dsl/datasource-researchers.html) for more details.

In [23]:
## First, we need to chunk up our researcher list

query_results = []

for g, rschr in my_researchers.groupby(np.arange(len(my_researchers)) // 250):
          # This does *almost* the same thing as the chunks command used above
     
     myreslist = json.dumps(list(rschr.researcher_id))

     query_results.append(
          
                  dsl.query_iterative(f"""
                        search researchers
                            where id in {myreslist}
                            return researchers
                        """).as_dataframe() # 
      )    


pd.concat(query_results).head()

1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
87 / 87
===
Records extracted: 87


Unnamed: 0,last_name,research_orgs,orcid_id,id,first_name
0,Vilutienė,"[{'id': 'grid.9424.b', 'longitude': 25.335735,...",[0000-0002-1617-8685],ur.015251107357.98,Tatjana
1,Mumtaz,"[{'id': 'grid.440588.5', 'longitude': 108.9137...",,ur.013773205633.31,Syed Muhammad Taskheer
2,Shittu,"[{'id': 'grid.9582.6', 'longitude': 3.9, 'coun...",,ur.015206376207.45,Funmilayo
3,Richards,"[{'id': 'grid.1003.2', 'longitude': 153.00963,...",[0000-0003-4507-5498],ur.01075020113.17,Nicola C
4,Barnes,"[{'id': 'grid.5335.0', 'longitude': 0.114908, ...",[0000-0002-3781-7570],ur.0771733111.06,Daniel R
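
The expression `np.arange(len(my_researchers)) // 250` used above assigns every row a batch number by integer division of its position. A small plain-Python sketch of that trick, with a batch size of 3 instead of 250 and made-up items:

```python
items = list("abcdefgh")
size = 3

# Position 0..2 -> label 0, 3..5 -> label 1, 6..7 -> label 2
labels = [i // size for i in range(len(items))]
print(labels)

# Group the items by their label, which is what groupby does with the labels
groups = {}
for label, item in zip(labels, items):
    groups.setdefault(label, []).append(item)
print(groups)
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size.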


## 9. Patents example (patents -> publications)

Using the same method, we can retrieve all patents citing publications from my institution. 

In [24]:
%dsldocs patents
dsl_last_results[dsl_last_results['field']=='publication_ids']

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
37,patents,publication_ids,string,Dimensions IDs of the publications related to ...,True,False,False


In [25]:
# Step 1 - same as before - except now we want the query in chunks

pubs_chunks = dsl.query_iterative("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2015
                    return publications[id] 
                """).chunks(250)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with patents, and reference_ids was replaced by publication_ids

query_results = []

for c in pubs_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(
          
                  dsl.query_iterative(f"""
                        search patents
                            where publication_ids in {mypubslist}
                            return patents
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

cited_patents = pd.concat(query_results).\
   drop_duplicates(subset='id')

cited_patents.head(2)

1000 / ...
1000 / 8409
2000 / 8409
3000 / 8409
4000 / 8409
5000 / 8409
6000 / 8409
7000 / 8409
8000 / 8409
8409 / 8409
===
Records extracted: 8409
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
1 / 1
===
Records extracted: 1
1000 / ...
===
Records extracted: 0
1000 / ...
1 / 1
===
Records extracted: 1
1000 / ...
6 / 6
===
Records extracted: 6
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
7 / 7
===
Records extracted: 7
1000 / ...
5 / 5
===
Records extracted: 5
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
2 / 2
===
Records extracted: 2
1000 / ...
5 / 5
===
Records extracted: 5
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
8 / 8
===
Records extracted: 8
1000 / ...
3 / 3
===
Records extracted: 3
1000 / ...
2 / 2
==

Unnamed: 0,year,title,times_cited,granted_year,assignee_names,assignees,inventor_names,publication_date,id,filing_status
0,2013,Modified human rotaviruses and uses therefor,0.0,2018.0,"[Murdoch Childrens Research Institute, MURDOCH...","[{'id': 'grid.1058.c', 'country_name': 'Austra...","[Carl Kirkwood, Ruth Frances Bishop, Graeme La...",2018-05-15,US-9969985-B2,Grant
1,2017,METHODS AND APPARATUS FOR IDENTIFYING ONE OR M...,0.0,,[GENOMICS PLC],,"[DING, Zhihao, FROT, Benjamin, JOSTINS, Luke, ...",2018-03-22,WO-2018051072-A1,Application


In [26]:
%dsldocs patents
dsl_last_results[dsl_last_results['type']=='organizations']

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
6,patents,assignees,organizations,Disambiguated GRID organisations who own or ha...,True,True,True
19,patents,current_assignees,organizations,Disambiguated GRID organisations currenlty own...,True,True,True
24,patents,funders,organizations,GRID organisations funding the patent.,True,True,True
33,patents,original_assignees,organizations,Disambiguated GRID organisations that first ow...,True,True,True


In [27]:
import json
cited_patents_assignees = cited_patents.explode('assignees')

cited_patents_assignees['assignee_grid_id'] = cited_patents_assignees['assignees'].\
    apply(lambda g: g['id'] if isinstance(g, dict) else 0 )

cited_patents_assignees['assignee_name'] = cited_patents_assignees['assignees'].\
    apply(lambda g: g['name'] if isinstance(g, dict) else 0 )

cited_patents_assignees.\
    groupby(['assignee_grid_id','assignee_name']).\
    agg({'id':'count'}).\
    sort_values(by='id', ascending=False).\
    head(20) 

Unnamed: 0_level_0,Unnamed: 1_level_0,id
assignee_grid_id,assignee_name,Unnamed: 2_level_1
0,0,29
grid.428999.7,Pasteur Institute,5
grid.1058.c,Murdoch Children's Research Institute,4
grid.453773.1,Wisconsin Alumni Research Foundation,3
grid.419905.0,Nestlé (Switzerland),3
grid.420918.6,Imperial Innovations (United Kingdom),3
grid.1055.1,Peter MacCallum Cancer Centre,3
grid.25879.31,University of Pennsylvania,2
grid.1042.7,Walter and Eliza Hall Institute of Medical Research,2
grid.4444.0,French National Centre for Scientific Research,2
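
The explode-then-count steps above can be illustrated on a few made-up records. This sketch is not the pandas implementation; the `explode` helper, patent IDs, and organisation names are hypothetical, and `None` is used as the placeholder where the cell above uses `0`.

```python
from collections import Counter

# Hypothetical patent records; "assignees" is a list of dicts or missing
patents = [
    {"id": "P1", "assignees": [{"id": "grid.1", "name": "Org One"},
                               {"id": "grid.2", "name": "Org Two"}]},
    {"id": "P2", "assignees": [{"id": "grid.1", "name": "Org One"}]},
    {"id": "P3", "assignees": None},
]

def explode(records, field):
    """Yield one (record id, value) pair per list item; missing lists yield None."""
    for record in records:
        for value in record.get(field) or [None]:
            yield record["id"], value

counts = Counter()
for patent_id, assignee in explode(patents, "assignees"):
    if isinstance(assignee, dict):
        counts[(assignee["id"], assignee["name"])] += 1
    else:
        counts[(None, None)] += 1  # patents without a disambiguated assignee

print(counts.most_common())
```

The `(None, None)` bucket plays the same role as the `0, 0` row at the top of the table above: it collects patents that have no assignee dictionary to read an ID and name from.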


## 10. Clinical Trials (clinical trials -> publications)


Using the same method, we can retrieve all clinical trials citing publications from my institution. 

In [28]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['field']=='research_orgs']

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
26,clinical_trials,research_orgs,organizations,"GRID organizations involved, e.g. as sponsors ...",True,True,True


In [29]:
# Step 1 - same as before - except now we want the query in chunks

clinical_trials_chunks = dsl.query_iterative("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2015
                    return publications[id] 
                """).chunks(400)

# Step 2 - almost the same as before - except now we use a for loop to loop through our results
# We changed 2 things below: publications was replaced with clinical_trials, and reference_ids was replaced by publication_ids

query_results = []

for c in clinical_trials_chunks:

      mypubslist = json.dumps(list(pd.DataFrame(c).id))

      query_results.append(
          
                  dsl.query_iterative(f"""
                        search clinical_trials
                            where publication_ids in {mypubslist}
                            return clinical_trials[all]
                        """).as_dataframe()
      )

# Step 3 - join our results back together again, and get rid of duplicates

cited_clinical_trials = pd.concat(query_results).\
   drop_duplicates(subset='id')

cited_clinical_trials.head(2)

1000 / ...
1000 / 8409
2000 / 8409
3000 / 8409
4000 / 8409
5000 / 8409
6000 / 8409
7000 / 8409
8000 / 8409
8409 / 8409
===
Records extracted: 8409
1000 / ...
13 / 13
===
Records extracted: 13
1000 / ...
18 / 18
===
Records extracted: 18
1000 / ...
21 / 21
===
Records extracted: 21
1000 / ...
8 / 8
===
Records extracted: 8
1000 / ...
12 / 12
===
Records extracted: 12
1000 / ...
11 / 11
===
Records extracted: 11
1000 / ...
6 / 6
===
Records extracted: 6
1000 / ...
6 / 6
===
Records extracted: 6
1000 / ...
11 / 11
===
Records extracted: 11
1000 / ...
12 / 12
===
Records extracted: 12
1000 / ...
13 / 13
===
Records extracted: 13
1000 / ...
9 / 9
===
Records extracted: 9
1000 / ...
9 / 9
===
Records extracted: 9
1000 / ...
7 / 7
===
Records extracted: 7
1000 / ...
12 / 12
===
Records extracted: 12
1000 / ...
14 / 14
===
Records extracted: 14
1000 / ...
10 / 10
===
Records extracted: 10
1000 / ...
9 / 9
===
Records extracted: 9
1000 / ...
4 / 4
===
Records extracted: 4
1000 / ...
1 / 1
===
R

Unnamed: 0,gender,FOR_first,registry,date,funders,date_inserted,category_bra,id,HRCS_RAC,category_hrcs_rac,...,category_hrcs_hc,title,category_for,conditions,acronym,researchers,funder_groups,category_icrp_cso,category_icrp_ct,associated_grant_ids
0,All,"[{'id': '2211', 'name': '11 Medical and Health...",ClinicalTrials.gov,2008-11-01,"[{'id': 'grid.453005.7', 'country_name': 'Aust...",2019-05-18,"[{'id': '4001', 'name': 'Clinical Medicine and...",NCT00408850,"[{'id': '10601', 'name': '6.1 Pharmaceuticals'...","[{'id': '10601', 'name': '6.1 Pharmaceuticals'...",...,"[{'id': '906', 'name': 'Metabolic and Endocrin...",Mechanisms of Sympathetic Overactivity in the ...,"[{'id': '3048', 'name': '1102 Cardiorespirator...",[Metabolic Syndrome],,,,,,
1,All,"[{'id': '2211', 'name': '11 Medical and Health...",ClinicalTrials.gov,2014-12-01,,2019-07-10,"[{'id': '4003', 'name': 'Public Health'}]",NCT02112734,"[{'id': '10301', 'name': '3.1 Primary preventi...","[{'id': '10301', 'name': '3.1 Primary preventi...",...,"[{'id': '903', 'name': 'Inflammatory and Immun...",Can Vitamin D Supplementation in Infants Preve...,"[{'id': '3177', 'name': '1117 Public Health an...",[Food Allergy],VITALITY,"[{'id': 'ur.0670003010.07', 'first_name': 'Kir...",,,,


In [30]:
%dsldocs clinical_trials
dsl_last_results[dsl_last_results['type']=='organizations']

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
17,clinical_trials,funders,organizations,GRID funding organisations that are involved w...,True,True,True
26,clinical_trials,research_orgs,organizations,"GRID organizations involved, e.g. as sponsors ...",True,True,True


In [31]:
cited_clinical_trials_orgs = cited_clinical_trials.explode('research_orgs')

cited_clinical_trials_orgs['research_orgs_grid_id'] = cited_clinical_trials_orgs['research_orgs'].\
    apply(lambda g: g['id'] if isinstance(g, dict) else 0 )

cited_clinical_trials_orgs['research_orgs_name'] = cited_clinical_trials_orgs['research_orgs'].\
    apply(lambda g: g['name'] if isinstance(g, dict) else 0 )

cited_clinical_trials_orgs.\
    groupby(['research_orgs_grid_id','research_orgs_name']).\
    agg({'id':'count'}).\
    sort_values(by='id', ascending=False).\
    head(20) 

Unnamed: 0_level_0,Unnamed: 1_level_0,id
research_orgs_grid_id,research_orgs_name,Unnamed: 2_level_1
grid.1008.9,University of Melbourne,11
grid.431143.0,National Health and Medical Research Council,11
grid.416153.4,Royal Melbourne Hospital,6
grid.21107.35,Johns Hopkins University,6
grid.1058.c,Murdoch Children's Research Institute,5
grid.1002.3,Monash University,5
grid.1055.1,Peter MacCallum Cancer Centre,5
grid.1623.6,The Alfred Hospital,4
grid.413249.9,Royal Prince Alfred Hospital,4
grid.411109.c,Virgen del Rocío University Hospital,4


## 11. Grants (publications -> grants)

Using the same method, we can retrieve all grants funding publications from my institution. 

In [32]:
%dsldocs publications
dsl_last_results[dsl_last_results['field'].str.contains('ids')]

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
37,publications,reference_ids,string,Dimensions publication ID for publications in ...,True,False,False
48,publications,supporting_grant_ids,string,"Grants supporting a publication, returned as a...",True,False,False


In [33]:
# Step 1 - same as before - except now we want the query in chunks

publications = dsl.query_iterative("""
                  search publications 
                    where research_orgs.id in ["grid.1008.9"]
                      and year = 2020
                    return publications[id+supporting_grant_ids] 
                """).as_dataframe()

# Step 2 - this time we can get the grant IDs directly from publications.
# So as a second step, we pull the grants metadata using these identifiers.

pubs_grants = publications.explode('supporting_grant_ids')

grants_from_pubs = pd.DataFrame(pubs_grants.supporting_grant_ids.unique()).\
                   dropna().\
                   rename(columns={0:'id'})

query_results = []

for g, gnts in grants_from_pubs.groupby(np.arange(len(grants_from_pubs)) // 250):
    # This does *almost* the same thing as the chunks command used above:
    # integer division of the row position by 250 puts every 250 rows in one group

    myglist = json.dumps(list(gnts.id))

    query_results.append(
        dsl.query_iterative(f"""
            search grants
                where id in {myglist}
              return grants[all]
            """).as_dataframe()
    )

# Step 3 - join our results back together again, and get rid of duplicates

grant_details = pd.concat(query_results).\
   drop_duplicates(subset='id')

grant_details.head(5)

1000 / ...
1000 / 6005
2000 / 6005
3000 / 6005
4000 / 6005
5000 / 6005
6000 / 6005
6005 / 6005
===
Records extracted: 6005
1000 / ...
248 / 248
===
Records extracted: 248
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
248 / 248
===
Records extracted: 248
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
249 / 249
===
Records extracted: 249
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
250 / 250
===
Records extracted: 250
1000 / ...
198 / 198
===
Records extracted: 198


Unnamed: 0,funding_jpy,category_uoa,funding_org_name,terms,concepts,noun_phrases,FOR_first,start_date,research_org_cities,funders,...,research_org_name,funding_gbp,category_for,foa_number,category_icrp_ct,HRCS_HC,category_hrcs_hc,category_icrp_cso,funding_org_city,funding_org_acronym
0,57190260.0,"[{'id': '30001', 'name': 'A01 Clinical Medicin...",National Human Genome Research Institute,"[non-coding sequence perturbations, different ...","[non-coding sequence perturbations, different ...","[ENCODE, PROJECT SUMMARY, Nucleic Acid Regulat...","[{'id': '2206', 'name': '06 Biological Science...",2019-09-01,"[{'id': 4930956, 'name': 'Boston'}]","[{'id': 'grid.280128.1', 'city_name': 'Bethesd...",...,Massachusetts General Hospital,411118.0,"[{'id': '2620', 'name': '0604 Genetics'}, {'id...",RFA-HG-18-006,,,,,,
1,1030613000.0,"[{'id': '30004', 'name': 'A04 Psychology, Psyc...",National Institute on Aging,"[biological underpinnings, collaboration, U.S....","[biological underpinnings, collaboration, U.S....","[disability, biological insight, chronic disea...","[{'id': '2211', 'name': '11 Medical and Health...",2019-07-15,"[{'id': 5037649, 'name': 'Minneapolis'}]","[{'id': 'grid.419475.a', 'city_name': 'Baltimo...",...,HENNEPIN HEALTHCARE RESEARCH INSTITUTE,7408687.0,"[{'id': '3053', 'name': '1103 Clinical Science...",PAR-18-296,"[{'id': '3793', 'name': 'Colon and Rectal Canc...","[{'id': '911', 'name': 'Cancer'}]","[{'id': '911', 'name': 'Cancer'}]",,,
2,83159520.0,"[{'id': '30011', 'name': 'B11 Computer Science...",National Cancer Institute,"[immunotherapy, experts, various tissue level ...","[immunotherapy, experts, various tissue level ...","[challenge, technology, collection, non-cancer...","[{'id': '2211', 'name': '11 Medical and Health...",2019-03-20,"[{'id': 5150529, 'name': 'Cleveland'}]","[{'id': 'grid.48336.3a', 'city_name': 'Rockvil...",...,Case Western Reserve University,591476.0,"[{'id': '3142', 'name': '1112 Oncology and Car...",PAR-15-332,"[{'id': '3790', 'name': 'Breast Cancer'}]","[{'id': '911', 'name': 'Cancer'}]","[{'id': '911', 'name': 'Cancer'}]","[{'id': '3761', 'name': '4.1 Technology Develo...",,
3,67306340.0,"[{'id': '30005', 'name': 'A05 Biological Scien...",National Health and Medical Research Council,"[cell surface, protective barrier, numerous re...","[cell surface, protective barrier, numerous re...","[muscular dystrophy, cell, membrane microdomai...","[{'id': '2206', 'name': '06 Biological Science...",2019-01-01,"[{'id': 2174003, 'name': 'Brisbane'}]","[{'id': 'grid.431143.0', 'city_name': 'Canberr...",...,University of Queensland,483840.0,"[{'id': '2206', 'name': '06 Biological Science...",Not available,"[{'id': '3816', 'name': 'Not Site-Specific Can...","[{'id': '890', 'name': 'Generic Health Relevan...","[{'id': '890', 'name': 'Generic Health Relevan...",,,
4,67306340.0,"[{'id': '30004', 'name': 'A04 Psychology, Psyc...",National Health and Medical Research Council,"[people, edge novel therapy development, full ...","[people, edge novel therapy development, full ...","[fellowship program, functional recovery, ther...","[{'id': '2211', 'name': '11 Medical and Health...",2019-01-01,"[{'id': 2165798, 'name': 'Geelong'}]","[{'id': 'grid.431143.0', 'city_name': 'Canberr...",...,Deakin University,483840.0,"[{'id': '2211', 'name': '11 Medical and Health...",Not available,,"[{'id': '905', 'name': 'Mental Health'}]","[{'id': '905', 'name': 'Mental Health'}]",,,
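
The batching loop above can be sketched with the standard library alone. `np.arange(len(df)) // 250` assigns row positions 0 to 249 to batch 0, positions 250 to 499 to batch 1, and so on; slicing a list in steps of 250 produces the same batches. `json.dumps` then turns each batch into a double-quoted list literal that can be dropped into a DSL `in` filter. The helper names `chunk_ids` and `build_grants_query` and the sample IDs are illustrative and not part of dimcli.

```python
import json

def chunk_ids(ids, size=250):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def build_grants_query(grant_ids):
    """Build the DSL query text for one batch of grant IDs."""
    id_list = json.dumps(grant_ids)  # a JSON list is also a valid DSL list literal
    return f"search grants where id in {id_list} return grants[all]"

# Made-up IDs, only to show the batch sizes.
sample_ids = [f"grant.{n}" for n in range(1, 601)]
batches = list(chunk_ids(sample_ids))
queries = [build_grants_query(batch) for batch in batches]

print([len(batch) for batch in batches])  # [250, 250, 100]
print(build_grants_query(["grant.1", "grant.2"]))
```

Each query string would then be passed to `dsl.query_iterative(...)`, as in the cell above.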


In [34]:
pubs_grants.groupby('supporting_grant_ids').\
    agg({'id':'count'}).\
    reset_index().\
    rename(columns={'id':'pubs','supporting_grant_ids':'id'}).\
    merge(grant_details[['id','original_title','funding_usd']],
          on='id').\
    sort_values(by='pubs', ascending=False)


Unnamed: 0,id,pubs,original_title,funding_usd
1398,grant.6711717,24,ARC Centre of Excellence in Exciton Science,22669994.0
744,grant.3931418,15,ARC Centre of Excellence in Convergent Bio-Nan...,19674412.0
2219,grant.7874297,15,Advancing Nanomedicine through Particle Techno...,617151.0
2690,grant.7878111,12,"Novel therapies, risk pathways and prevention ...",643236.0
659,grant.3801883,12,Long Term Multidisciplinary Study of Cancer in...,16255245.0
...,...,...,...,...
1214,grant.5476065,1,Pattern Analysis of fMRI via machine learning/...,2113150.0
1216,grant.5476170,1,Mechanisms of Monocytosis in Obesity: Implicat...,740084.0
1217,grant.5476234,1,Using markers to improve pancreatic cancer scr...,4068088.0
1218,grant.5476670,1,CNS TAU KINETICS IN ALZHEIMER'S DISEASE,2925348.0


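### Why didn't I use resulting_publication_ids?

The `resulting_publication_ids` field on grants is deprecated, as the warning in the output below shows, so going from publications to grants via `supporting_grant_ids` is the safer route.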

In [35]:
%%dsldf

search grants 
where resulting_publication_ids in ["pub.1005269097"]

Returned Grants: 3 (total = 3)
Field 'resulting_publication_ids' is deprecated. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details


Unnamed: 0,title_language,start_year,active_year,title,original_title,start_date,language,funding_org_name,id,project_num,funders,end_date
0,en,2018,"[2018, 2019, 2020, 2021]",Nanoscale X-Ray Imaging and Dynamics of Electr...,Nanoscale X-Ray Imaging and Dynamics of Electr...,2018-12-15,en,Office of Basic Energy Sciences,grant.4320525,DE-SC0001805,"[{'id': 'grid.452988.a', 'longitude': -77.0259...",2021-12-14
1,en,2014,"[2014, 2015, 2016, 2017, 2018]",Strain-induced modification of nanoscale mater...,Strain-induced modification of nanoscale mater...,2014-08-15,en,Directorate for Mathematical & Physical Sciences,grant.3660654,1411335,"[{'id': 'grid.457875.c', 'longitude': -77.1109...",2018-07-31
2,en,2009,"[2009, 2010, 2011, 2012, 2013]",Magnetic Transition Metal Nanowires,Magnetic Transition Metal Nanowires,2009-08-15,en,Directorate for Mathematical & Physical Sciences,grant.3100327,0906957,"[{'id': 'grid.457875.c', 'longitude': -77.1109...",2013-09-30
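
Because the publication records already carry `supporting_grant_ids`, the grant-to-publication direction can be rebuilt locally instead of relying on the deprecated field. The sketch below inverts that mapping in plain Python. The sample records are made up, and it assumes that publications without grants either lack the key or hold an empty value.

```python
from collections import defaultdict

# Hypothetical records, shaped like `publications[id+supporting_grant_ids]` results.
pubs = [
    {"id": "pub.1", "supporting_grant_ids": ["grant.a", "grant.b"]},
    {"id": "pub.2", "supporting_grant_ids": ["grant.a"]},
    {"id": "pub.3"},  # no supporting grants
]

def publications_by_grant(records):
    """Map each grant ID to the list of publication IDs it supports."""
    by_grant = defaultdict(list)
    for record in records:
        for grant_id in record.get("supporting_grant_ids") or []:
            by_grant[grant_id].append(record["id"])
    return dict(by_grant)

grant_to_pubs = publications_by_grant(pubs)
print(grant_to_pubs)
```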


## Conclusions

Lists are a simple data structure that can have a great number of applications. 

When used in conjunction with the [DSL](https://docs.dimensions.ai/dsl/tour.html), they make it easy to concatenate the results of one query with another query, e.g. in order to navigate through the links available in Dimensions (from publications to grants, patents, etc.).

See also this [patents tutorial](https://api-lab.dimensions.ai/cookbooks/5-patents/1-Patents-referencing-a-Research-Organization.html) or this [clinical trials tutorial](https://api-lab.dimensions.ai/cookbooks/4-clinical-trials/Clinical_Trials_by_Volume_of_Pubs.html) for more in-depth applications of the queries discussed above.