# Researchers Queries 
This notebook provides some examples of how to query researcher-related data.


#### Prerequisites

This notebook assumes you have installed the [`dimcli`](https://pypi.org/project/dimcli/) library and are familiar with the [Getting Started](https://github.com/digital-science/dimensions-api-examples/tree/master/1%20Getting%20Started) tutorials.

In [None]:
import dimcli
%dsl_login

### How to retrieve all researchers? 

If you have a specific subset of publications in mind (e.g. all publications from a given year range and organization), a common use case is to extract all the researcher information Dimensions holds about that subset.

For example, let's assume we can identify our set of publications as follows:

In [None]:
%%dsl_query
search publications
    where year in [2013:2018] and research_orgs="grid.258806.1"
return publications

There are different ways to get to the researchers data, depending on the search criteria.

#### Approach 1

If one is interested only in the researchers from a certain institution, for a specific range of publication years, this can be achieved via a single query to the `researchers` database, e.g.:

In [None]:
%%dsl_query_loop 
search researchers 
    where research_orgs="grid.258806.1" 
    and first_publication_year>=2013 
    and last_publication_year<=2018 
return researchers

In [None]:
res = _
res['researchers'][10]

> NOTE: the query above returns only researchers whose **first** publication is from 2013 or later and whose **last** publication is from 2018 or earlier. This is most likely only a subset of all researchers who published during that year range!
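
To make the note above concrete, here is a minimal sketch on a few made-up researcher records. Only the `first_publication_year` and `last_publication_year` field names come from the query; the IDs and years are invented for illustration. It compares the Approach 1 conditions with a looser "overlap" test, which keeps anyone whose publishing career intersects the range (and who therefore may have published within it).

```python
# Hypothetical researcher records; the IDs and years are invented for illustration.
sample = [
    {"id": "r1", "first_publication_year": 2014, "last_publication_year": 2017},
    {"id": "r2", "first_publication_year": 2005, "last_publication_year": 2016},
    {"id": "r3", "first_publication_year": 2015, "last_publication_year": 2021},
    {"id": "r4", "first_publication_year": 1999, "last_publication_year": 2010},
]

START, END = 2013, 2018

# Same conditions as the query above: the whole career falls inside the range.
inside = [r["id"] for r in sample
          if r["first_publication_year"] >= START
          and r["last_publication_year"] <= END]

# Looser test: the career merely overlaps the range.
overlapping = [r["id"] for r in sample
               if r["first_publication_year"] <= END
               and r["last_publication_year"] >= START]

print(inside)       # ['r1']
print(overlapping)  # ['r1', 'r2', 'r3']
```

Researchers `r2` and `r3` were active during 2013-2018 but are excluded by the stricter conditions, which is the gap that Approach 2 addresses.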

#### Approach 2

If instead one wants to get all researchers related to publications (for a specific year range) where at least one author is related to a specific GRID affiliation, then this must be done in multiple steps.

The basic idea is to get all the relevant publications, extract all researchers, remove duplicates and return the results. 

So, first we want to get all publications using a `loop` query. This is fundamentally the same as a normal query, but in the background it uses the `limit` and `skip` operators to page through the results, which makes it possible to gather everything for a query that returns more than 1000 records.
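
The paging idea can be sketched in plain Python. This is not how `dimcli` is implemented internally; `fetch_page` and `fetch_all` are hypothetical helpers, and a list of integers stands in for the API results. The sketch only illustrates how repeating a call with an increasing `skip` collects every record, and how the number of calls grows with the size of the result set.

```python
def fetch_page(records, limit, skip):
    """Stand-in for one API call: return at most `limit` records after `skip`."""
    return records[skip:skip + limit]

def fetch_all(records, limit=1000):
    """Collect every record by repeating the call with an increasing skip."""
    results, skip, calls = [], 0, 0
    while True:
        page = fetch_page(records, limit, skip)
        calls += 1
        results.extend(page)
        if len(page) < limit:   # a short page means nothing is left
            break
        skip += limit
    return results, calls

fake_records = list(range(2500))   # pretend the query matches 2500 records
everything, calls = fetch_all(fake_records)
print(len(everything), calls)      # 2500 3
```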

> NOTE: loop queries must be used with caution as they can result in a large number of API calls

In [None]:
%%dsl_query_loop
search publications
    where year in [2013:2018] and research_orgs="grid.258806.1"
return publications

In [None]:
res = _

The second step is to pull out all researcher information from these results.
By looking at a single record we can find out how the data is organized internally: authors are stored in an inner list of dictionaries under the key `author_affiliations`. Since Python indexes start at 0, extracting for example the third author (index 2) of the eleventh publication (index 10) looks like this:

In [None]:
res['publications'][10]['author_affiliations'][0][2]
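
For readers without API access, the nesting can be explored on a hand-written record. The record below is hypothetical: the names and IDs are invented, and the exact set of author fields is an assumption, but the list-inside-a-list shape mirrors the indexing used above.

```python
# Hypothetical record shaped like the results above; names and IDs are invented.
publication = {
    "id": "pub.0000000001",
    "author_affiliations": [[
        {"first_name": "Ada", "last_name": "Example", "researcher_id": "ur.01"},
        {"first_name": "Ben", "last_name": "Sample"},   # no researcher_id
        {"first_name": "Cy", "last_name": "Demo", "researcher_id": "ur.02"},
    ]],
}

authors = publication["author_affiliations"][0]   # the inner list of authors
print(len(authors))                     # 3
print(authors[2]["last_name"])          # Demo (index 2 is the third author)
print("researcher_id" in authors[1])    # False
```

The second author in this sample has no `researcher_id`, which is the case the deduplication step has to tolerate.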

Now we can extract all authors and put them into a single dictionary. 

In order to remove duplicates we can use the (unique) `researcher_id` values as dictionary keys as follows:

In [None]:
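out = {}
for p in res['publications']:
    # authors sit in the first inner list of `author_affiliations`
    for a in p['author_affiliations'][0]:
        try:
            # keying on researcher_id means repeated authors overwrite each other
            out[a['researcher_id']] = a
        except KeyError:
            # authors without a researcher_id cannot be deduplicated, so skip them
            pass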

In [None]:
len(out)
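
The same deduplication can be tried offline on sample data. The sketch below wraps the logic in a hypothetical helper, `unique_researchers`, and runs it on invented records. As an extra precaution, which is an assumption about the data rather than something confirmed here, it also tolerates publications that carry no author data at all.

```python
def unique_researchers(publications):
    """Map researcher_id -> author record, keeping the last record seen."""
    found = {}
    for pub in publications:
        # fall back to one empty author list when the key is missing or empty
        groups = pub.get("author_affiliations") or [[]]
        for author in groups[0]:
            rid = author.get("researcher_id")
            if rid:                      # skip authors with no ID
                found[rid] = author
    return found

# Invented sample: "ur.01" appears twice, one author has no ID,
# and the last publication has no author data.
sample_pubs = [
    {"author_affiliations": [[{"researcher_id": "ur.01", "last_name": "Example"},
                              {"last_name": "Sample"}]]},
    {"author_affiliations": [[{"researcher_id": "ur.01", "last_name": "Example"},
                              {"researcher_id": "ur.02", "last_name": "Demo"}]]},
    {"title": "no author data"},
]
print(sorted(unique_researchers(sample_pubs)))   # ['ur.01', 'ur.02']
```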

That's it! We now have all the researchers. Let's look at one of them:

In [None]:
next(iter(out.values()))

Now let's count how many are currently linked to "grid.258806.1":

In [None]:
counter = 0
for a in out.values():
    # .get() avoids a KeyError for authors with no current organization
    if a.get('current_organization_id') == "grid.258806.1":
        counter += 1

In [None]:
counter
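
A possible variation, sketched here on invented data, is to tally every current organization at once with `collections.Counter` rather than counting a single GRID ID. `sample_out` is a hypothetical stand-in for the `out` dictionary, and `grid.0000.0` is a made-up identifier.

```python
from collections import Counter

# Hypothetical stand-in for the `out` dictionary built above.
sample_out = {
    "ur.01": {"current_organization_id": "grid.258806.1"},
    "ur.02": {"current_organization_id": "grid.0000.0"},   # made-up GRID ID
    "ur.03": {"current_organization_id": "grid.258806.1"},
    "ur.04": {},   # no current organization recorded
}

# Count researchers per organization; missing values are grouped as "unknown".
by_org = Counter(a.get("current_organization_id", "unknown")
                 for a in sample_out.values())
print(by_org["grid.258806.1"])   # 2
print(by_org.most_common(1))     # [('grid.258806.1', 2)]
```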