<a href="https://colab.research.google.com/github/digital-science/dimensions-api-lab" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Dimensions API Lab In Google Colab"/></a>

# Part 3: Funding

## Install Dimensions Library and login

In [None]:
try:
  from google.colab import files
  %load_ext google.colab.data_table
  COLAB_ENV = True
  !pip install dimcli plotly_express  -U
  !mkdir data # to save temp data 
except:
  COLAB_ENV = False


# common libraries
import pandas as pd
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
import time
import json
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import plotly_express as px
from getpass import getpass
# FINALLY..
import dimcli
from dimcli.shortcuts import *

# set up for exports
if not COLAB_ENV:
  from plotly.offline import init_notebook_mode # needed for exports 
  init_notebook_mode(connected=True)


##
# LOG IN 
##

USERNAME = "m.pasin@digital-science.com"  #@param {type: "string"}

if not USERNAME:
  print("====\nERROR: Please enter a valid Dimensions API username")
else:
  password = getpass('====\nEnter password here')
  print('=> username is', USERNAME)
  print('=> password is', "*" * len(password))
  dimcli.login(USERNAME, password)
  dsl = dimcli.Dsl()


### Load previously saved researchers data 

In [None]:
researchers = pd.read_csv("data/2.researchers_impact_metrics.csv")

In [4]:
# note the extra column will be dropped after re-running
researchers.head(5)

Unnamed: 0,researcher_id,pubs,full_name,citations_mean,altmetric_mean,last_pub_year,url
0,ur.0723426172.10,49,Kari Stefansson,84.571429,261.346939,2019,https://app.dimensions.ai/discover/publication...
1,ur.01277776417.51,35,Unnur Thorsteinsdottir,70.342857,168.485714,2019,https://app.dimensions.ai/discover/publication...
2,ur.0641525362.39,30,Gonçalo R Abecasis,106.9,146.6,2019,https://app.dimensions.ai/discover/publication...
3,ur.0637651205.48,27,Daniel F Gudbjartsson,55.37037,153.518519,2019,https://app.dimensions.ai/discover/publication...
4,ur.01313145634.66,27,Andres Metspalu,159.592593,356.518519,2019,https://app.dimensions.ai/discover/publication...
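
The `Unnamed: 0` column shows up when a CSV was written together with its index and then read back. Here is a minimal sketch of that round trip, using a small made-up frame and in-memory buffers instead of the real file (the toy IDs and values are illustrative only):

```python
import io
import pandas as pd

# Toy data standing in for the researchers table (values are made up)
toy = pd.DataFrame({"researcher_id": ["ur.1", "ur.2"], "pubs": [49, 35]})

# Saving with the default index=True writes an unnamed first column...
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# ...while index=False writes only the real columns
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
reloaded_without_index = pd.read_csv(buf_without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'researcher_id', 'pubs']
print(list(reloaded_without_index.columns))  # ['researcher_id', 'pubs']
```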


# Adding another impact measure: funding 

We want to enhance the researchers dataset by adding some funding information:

* total amount of funding for each researcher 
* total number of grants 
* funding end date (useful to understand who is going to publish soon)

### We'll have to do it in two steps


**1. Total grants and last grant year are easy to extract using the `researchers` source.**

In [5]:
%dsldf search researchers where id in ["ur.0723426172.10", "ur.01277776417.51"] return researchers[id+last_grant_year+total_grants]

Returned Researchers: 2 (total = 2)


Unnamed: 0,id,last_grant_year,total_grants
0,ur.01277776417.51,,
1,ur.0723426172.10,2018.0,8.0


**2 Aggregated funding needs to be extracted from the `grants` database.**

> NOTE: this kind of aggregate query will not return any data for a researcher who has no grants!


In [6]:
%dsldf search grants where researchers.id in ["ur.0723426172.10", "ur.01277776417.51"] return researchers[id] aggregate funding

Returned Researchers: 13


Unnamed: 0,count,current_research_org,first_name,funding,id,last_name,orcid_id,research_orgs
0,8,grid.14013.37,Kári,17760368.0,ur.0723426172.10,Stefansson,,"[grid.421812.c, grid.4777.3, grid.38142.3c, gr..."
1,2,grid.412578.d,Raymond Philip,0.0,ur.012662217132.90,Roos,[0000-0002-0613-4048],"[grid.266093.8, grid.419261.9, grid.185006.a, ..."
2,1,,Raymond P,0.0,ur.010520247252.54,Roos,,
3,1,grid.16753.36,Teepu,0.0,ur.01121653260.31,Siddique,[0000-0001-7293-9146],"[grid.26009.3d, grid.414179.e, grid.240684.c, ..."
4,1,grid.438717.e,Mark E,0.0,ur.01127672147.84,Gurney,,"[grid.411451.4, grid.410513.2, grid.421812.c, ..."
5,1,,Richard J,0.0,ur.011316554452.18,Miller,,
6,1,grid.170205.1,Deborah J,0.0,ur.012167132327.75,Nelson,,"[grid.265892.2, grid.170205.1, grid.5379.8]"
7,1,,Mark,0.0,ur.012237141052.77,Gurneyh,,
8,1,,Sara,0.0,ur.012455520474.57,Szuchet,,[grid.170205.1]
9,1,,Jeffrey Robert,2592940.0,ur.01274135317.46,Gulcher,,"[grid.38142.3c, grid.410540.4, grid.170205.1, ..."


## Next: full data for step 1

What we're going to do:

1. loop over all researchers (400 at a time)
2. extract the **total grants** and **last grant year** information
3. collect all data into one single dataframe
4. finally, add the data to our 'researchers' spreadsheet
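
The batching in step 1 relies on `chunks_of`, which the notebook gets from the `dimcli.shortcuts` star import. As a rough illustration of the idea only (this is not dimcli's implementation), a hypothetical `batches` helper might look like this:

```python
from typing import Iterator, List, Sequence

def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

# Made-up IDs, only to show the resulting batch sizes
ids = [f"ur.{n}" for n in range(1000)]
sizes = [len(batch) for batch in batches(ids, 400)]
print(sizes)  # [400, 400, 200]
```

With 17,440 researcher IDs and a batch size of 400, this kind of split yields 44 batches.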



In [9]:
llist = list(researchers['researcher_id'])
#
#
query = """search researchers where id in {} return researchers[id+last_grant_year+total_grants] limit 1000"""
#
#
out = []
for chunk in tqdm(list(chunks_of(list(llist), 400))):
    q = dsl.query(query.format(json.dumps(chunk)))
    out += q.researchers
    time.sleep(1)
# save to a df
df1 = pd.DataFrame.from_dict(out)
print("======\nResearchers used to query: ", len(llist))
print("======\nResearchers returned: ", len(df1))
df1.head(5)

HBox(children=(IntProgress(value=0, max=44), HTML(value='')))


Researchers used to query:  17440
Researchers returned:  17438


Unnamed: 0,id,last_grant_year,total_grants
0,ur.01065203525.67,2023.0,5.0
1,ur.01145654113.82,2021.0,9.0
2,ur.0767751267.74,2021.0,9.0
3,ur.0721636067.97,2022.0,5.0
4,ur.0650743215.20,,


Save the data so that we can use it later

In [None]:
df1.to_csv("data/3.funding-part-1.csv")

## Next: full data for step 2

For this we can do the following 

1. loop over all researchers, chunked in groups of 50
2. query for grants, faceting on researchers and **aggregating funding information**
3. then extract from the results only the researchers we are interested in

> NOTE: since we are querying for grants, each query can return many more researchers than the ones we asked for, because the other researchers on a matching grant are returned too
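
A minimal sketch of step 3, assuming the query results are already in a dataframe with an `id` column. The toy frame and IDs below are made up; `isin` keeps only the rows for the researchers that were requested:

```python
import pandas as pd

# Toy stand-in for a grants query result: it includes researchers we did not ask for
returned = pd.DataFrame({
    "id": ["ur.A", "ur.X", "ur.Y"],
    "funding": [1000.0, 0.0, 250.0],
})

# ur.B has no grants, so it never appears in the results
requested_ids = ["ur.A", "ur.B"]

# Keep only the rows whose id is in the requested list
wanted = returned[returned["id"].isin(requested_ids)]
print(wanted)
```

Note that a requested researcher with no grants (`ur.B` here) is simply missing from the filtered frame rather than listed with zero funding.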

Example query:

In [None]:
%dsldf search grants where researchers.id in ["ur.0723426172.10", "ur.01277776417.51"] return researchers[id] aggregate funding

Returned Researchers: 13


Unnamed: 0,count,funding,id
0,8,17760368.0,ur.0723426172.10
1,2,0.0,ur.012662217132.90
2,1,0.0,ur.010520247252.54
3,1,0.0,ur.01121653260.31
4,1,0.0,ur.01127672147.84
5,1,0.0,ur.011316554452.18
6,1,0.0,ur.012167132327.75
7,1,0.0,ur.012237141052.77
8,1,0.0,ur.012455520474.57
9,1,2592940.0,ur.01274135317.46


Here we chunk using a lower number because each query will return more researchers than the ones we ask for (i.e. the query is grant-based).

In [None]:
llist = list(researchers['researcher_id'])

#
#
query = """search grants where researchers.id in {} return researchers[id] aggregate funding limit 1000"""
#
#
out = []
for chunk in tqdm(list(chunks_of(list(llist), 50))):
    q = dsl.query(query.format(json.dumps(chunk)))
    out += q.researchers
    time.sleep(1)
# save to a df
df2 = pd.DataFrame.from_dict(out)
print("======\nResearchers used to query: ", len(llist))
print("======\nResearchers returned: ", len(df2))

# save to csv just in case
df2.to_csv("data/3.funding-part-2.csv")
df2.head(5)

## Finally: let's merge all the data into the original researchers table

In [15]:
#
# first let's replace all empty values with zeros
#

df1 = df1.fillna(0)
df2 = df2.fillna(0)

#
# helper functions 
#

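def grants_and_year_from_id(researcher_id):
  """Return (total_grants, last_grant_year) for a researcher.
  Falls back to 0 when the ID has no usable row in df1, e.g. because the API did not return it."""
  match = df1[df1['id'] == researcher_id]
  try:
    x = int(match['total_grants'].iloc[0])
  except (IndexError, TypeError, ValueError):
    x = 0
  try:
    y = int(match['last_grant_year'].iloc[0])
  except (IndexError, TypeError, ValueError):
    y = 0
  return (x, y)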

def total_funding_from_id(researcher_id):
    """Since the bulk querying returned several rows for same researcher (due to various random combinations
    of researcher IDs lists in the query filters), we take the max value."""
    return df2[df2['id'] == researcher_id]['funding'].max()
    
#
# merge the results found into original researchers dataset
#

total_grants, last_grant_year, total_funding  = [], [], []

for i, row in tqdm(researchers.iterrows(), total=researchers.shape[0]):
    res_id = row['researcher_id']
    data = grants_and_year_from_id(res_id)
    total_grants.append(data[0])
    last_grant_year.append(data[1])
    total_funding.append(total_funding_from_id(res_id))

researchers['total_grants'] = total_grants
researchers['last_grant_year'] = last_grant_year
researchers['total_funding'] = total_funding
#
# finally..
#
print("=======\nResearchers total:",  len(researchers))
researchers.head(10)

HBox(children=(IntProgress(value=0, max=17440), HTML(value='')))

Researchers total: 17440


Unnamed: 0,researcher_id,pubs,full_name,citations_mean,altmetric_mean,last_pub_year,url,total_grants,last_grant_year,total_funding
0,ur.0723426172.10,49,Kari Stefansson,84.571429,261.346939,2019,https://app.dimensions.ai/discover/publication...,8,2018,17760368.0
1,ur.01277776417.51,35,Unnur Thorsteinsdottir,70.342857,168.485714,2019,https://app.dimensions.ai/discover/publication...,0,0,
2,ur.0641525362.39,30,Gonçalo R Abecasis,106.9,146.6,2019,https://app.dimensions.ai/discover/publication...,12,2023,56939889.0
3,ur.0637651205.48,27,Daniel F Gudbjartsson,55.37037,153.518519,2019,https://app.dimensions.ai/discover/publication...,0,0,
4,ur.01313145634.66,27,Andres Metspalu,159.592593,356.518519,2019,https://app.dimensions.ai/discover/publication...,29,2021,13874871.0
5,ur.01344404521.43,26,Lude Franke,122.769231,208.5,2019,https://app.dimensions.ai/discover/publication...,1,2020,
6,ur.01174076626.46,25,André G. Uitterlinden,112.36,289.52,2019,https://app.dimensions.ai/discover/publication...,1,2014,0.0
7,ur.01264737414.70,24,Tõnu Esko,123.541667,401.916667,2019,https://app.dimensions.ai/discover/publication...,8,2020,1325160.0
8,ur.01220453202.22,24,Eleftheria Zeggini,73.958333,168.166667,2019,https://app.dimensions.ai/discover/publication...,9,2021,15612829.0
9,ur.016704245502.43,23,Mark I McCarthy,100.826087,169.478261,2019,https://app.dimensions.ai/discover/publication...,41,2024,73876004.0
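
The row-by-row loop above scans `df1` and `df2` once per researcher. As an alternative sketch (not the notebook's method), the same three columns can be built with a `groupby` and two left merges. Toy frames with made-up IDs and values stand in for `researchers`, `df1`, and `df2`, and the sketch assumes each `id` appears at most once in `df1`:

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (IDs and values are made up)
researchers_toy = pd.DataFrame({"researcher_id": ["ur.A", "ur.B", "ur.C"]})
df1_toy = pd.DataFrame({
    "id": ["ur.A", "ur.B"],
    "total_grants": [8.0, None],
    "last_grant_year": [2018.0, None],
})
df2_toy = pd.DataFrame({
    "id": ["ur.A", "ur.A", "ur.X"],   # ur.A is repeated, ur.X was not requested
    "funding": [100.0, 100.0, 5.0],
})

# Collapse repeated rows per researcher by taking the max funding value
funding = (df2_toy.groupby("id", as_index=False)["funding"].max()
           .rename(columns={"funding": "total_funding"}))

# Left merges keep every researcher, even those with no match
merged = (researchers_toy
          .merge(df1_toy, how="left", left_on="researcher_id", right_on="id")
          .drop(columns="id")
          .merge(funding, how="left", left_on="researcher_id", right_on="id")
          .drop(columns="id"))

# Missing grant counts and years become 0, as in the helper functions
merged[["total_grants", "last_grant_year"]] = (
    merged[["total_grants", "last_grant_year"]].fillna(0).astype(int)
)
print(merged)
```

As in the loop version, researchers with no row in the funding results end up with a missing `total_funding` value rather than 0.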


Save the data / download it

In [None]:
researchers.to_csv("data/3.researchers_impact_metrics_and_funding.csv", index=False)

In [None]:
if COLAB_ENV:
  files.download("data/3.researchers_impact_metrics_and_funding.csv")

# A Couple of Data Visualizations

In [None]:
temp1 = researchers.sort_values(by=["total_funding"], ascending=False)[:100]
temp2 = researchers.sort_values(by=["last_grant_year"], ascending=False)[:200]

In [20]:
px.scatter(temp1, x="full_name", y="total_funding", hover_name="full_name", size="total_grants", color="total_grants",
           hover_data=['total_funding', 'total_grants', 'last_grant_year', 'citations_mean', 'altmetric_mean', 'last_pub_year'], 
           marginal_y="histogram", title="Researchers By Total Funding")

In [21]:
px.scatter(temp2, x="full_name", y="last_grant_year", hover_name="full_name", size="total_grants",color="total_grants",
           hover_data=['total_funding', 'total_grants', 'last_grant_year', 'citations_mean', 'altmetric_mean', 'last_pub_year'], 
           marginal_y="histogram",  title="Researchers By Grant End Year")

In [22]:
px.scatter(temp2, x="full_name", y="last_grant_year", hover_name="full_name",  size="total_grants",color="total_grants",
           hover_data=['total_funding', 'total_grants', 'last_grant_year', 'citations_mean', 'altmetric_mean', 'last_pub_year'], 
           facet_col="last_pub_year", title="Researchers By Grant End Year & Last Publications Year")

In [23]:
px.density_heatmap(temp2, x="last_grant_year", y="last_pub_year", 
                   marginal_x="histogram", marginal_y="histogram", title="Distribution of Grant End Year VS Last Publications Year")

In [24]:
px.scatter_3d(temp2, x="last_grant_year", y="last_pub_year",  z="citations_mean", 
              color="total_grants", size="total_grants",
              hover_name="full_name", hover_data=['total_funding', 'total_grants', 'last_grant_year', 'citations_mean', 'altmetric_mean', 'last_pub_year'], 
                 title="Citations Mean VS Grant End Year VS Last Publications Year")