# Enriching Grants part 1: Matching your grants records to Dimensions

In this tutorial we are going to **match** a sample local grants dataset to Dimensions.

By *matching* we mean discovering the unique **Dimensions identifier** for each of these grants, so that we can then use the IDs to extract related objects from Dimensions (e.g. the researchers, publications, patents and clinical trials linked to the grants).

## A sample grants list

Our starting point is a [sample list of completed grants on the topic of *vaccines*](http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-0.csv), which contains common fields such as title, funder, grant/project ID and funding amount.

We will show below how to enrich this dataset with Dimensions IDs.

In [1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Apr 19, 2023
==


## Prerequisites

This notebook assumes you have installed the [Dimcli](https://pypi.org/project/dimcli/) library and are familiar with the ['Getting Started' tutorial](https://api-lab.dimensions.ai/cookbooks/1-getting-started/1-Using-the-Dimcli-library-to-query-the-API.html).

In [2]:
!pip install dimcli tqdm -U --quiet 

import dimcli
from dimcli.utils import *

import sys, time
import pandas as pd


print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')  
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

Searching config file credentials for 'https://app.dimensions.ai' endpoint..


==
Logging in..
Dimcli - Dimensions API Client (v1.0.2)
Connected to: <https://app.dimensions.ai/api/dsl> - DSL v2.6
Method: dsl.ini file


## Loading the sample grants data

First we are going to load the sample dataset ["vaccines-grants-sample-part-0.csv"](http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-0.csv). 


In [3]:
grants_source = pd.read_csv("http://api-sample-data.dimensions.ai/data/vaccines-grants-sample-part-0.csv")

Now we can preview the contents of the file.

In [4]:
grants_source.head(5)

| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country |
|---|---|---|---|---|---|---|---|
| 0 | 30410203277 | 疫苗－整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China |
| 1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada |
| 2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada |
| 3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia |
| 4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia |


## Matching grants data

There are two possible situations to consider.



### A) Matching grants when we have a grant number

The Dimensions API includes a [function 'extract_grants'](https://docs.dimensions.ai/dsl/functions.html#function-extract-grants) that makes it easier to find a grant in Dimensions, by using information such as funder name and the funder grant identifier. 

So one approach is the following:

In [5]:
dsl.query("""extract_grants(grant_number="30410203277", 
    funder_name="National Natural Science Foundation of China")""").json

{'grant_id': 'grant.8172033'}

In [6]:
dsl.query("""extract_grants(grant_number="334174", 
    funder_name="National Health and Medical Research Council")""").json

{'grant_id': 'grant.6722306'}

Note: this won't work without a funder name.

In [7]:
dsl.query("""extract_grants(grant_number="30410203277", 
    funder_name="")""").json

{'grant_id': None}

But we can pass a [fundref ID](https://www.crossref.org/services/funder-registry/) instead, if we have it (these IDs are also available via [GRID](https://grid.ac/institutes/grid.419696.5)).

In [8]:
dsl.query("""extract_grants(grant_number="30410203277", 
    fundref="501100001809")""").json

{'grant_id': 'grant.8172033'}
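
The calls above differ only in which funder argument they pass. A minimal sketch of a helper that assembles the query string follows. `build_extract_grants_query` is a hypothetical name (it is not part of Dimcli), and the backslash escaping of double quotes is an assumption that should be checked against the DSL documentation.

```python
def build_extract_grants_query(grant_number, funder_name=None, fundref=None):
    """Build an extract_grants query string (hypothetical helper, not part of Dimcli)."""
    if not grant_number:
        raise ValueError("a grant number is required")
    if not (funder_name or fundref):
        # as shown above, the lookup returns no match without a funder reference
        raise ValueError("pass either funder_name or fundref")

    def quote(value):
        # assumption: the DSL accepts backslash-escaped double quotes in string values
        return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'

    args = [f"grant_number={quote(grant_number)}"]
    if fundref:
        args.append(f"fundref={quote(fundref)}")
    else:
        args.append(f"funder_name={quote(funder_name)}")
    return f"extract_grants({', '.join(args)})"


query = build_extract_grants_query("30410203277", fundref="501100001809")
print(query)
# → extract_grants(grant_number="30410203277", fundref="501100001809")
```

The resulting string can then be passed to `dsl.query(query).json` exactly as in the cells above.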

### B) What if we don't have a grant number?

Then the only option is to:

* query Dimensions using the best grants metadata we have available
* if we get exactly one result, we just take it
* if we get more than one result, we either review the candidates manually at a later point, or develop an algorithm that sorts them by relevance so that we can take the first one
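
The three cases in this list can be sketched as a small function that works on a list of result records. This is only an illustration: `classify_matches` is a hypothetical helper, and the records are assumed to be plain dictionaries with an `id` key, like the grants returned by a search.

```python
def classify_matches(records):
    """Return a (status, value) pair describing a list of candidate grant records."""
    ids = [r["id"] for r in records if r.get("id")]
    if not ids:
        return ("none", None)       # nothing found: the grant stays unmatched
    if len(ids) == 1:
        return ("unique", ids[0])   # exactly one result: we just take it
    return ("ambiguous", ids)       # several results: review or rank them later


# placeholder IDs, for illustration only
print(classify_matches([]))
print(classify_matches([{"id": "grant.EXAMPLE-1"}]))
print(classify_matches([{"id": "grant.EXAMPLE-1"}, {"id": "grant.EXAMPLE-2"}]))
```

Keeping the ambiguous candidates, rather than discarding them, makes it possible to review them manually afterwards.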

Let's take some of the grants without a number:

In [9]:
grants_without_number = grants_source[grants_source['Grant Number'].isnull()]

In [10]:
grants_without_number.head(5)

| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country |
|---|---|---|---|---|---|---|---|
| 38 | | DENGUE VACCINE DEVELOPMENT | 50000.0 | 1985 | 1986 | United States Department of the Army | United States |
| 79 | | Sterilization Vaccine for Cattle | 0.0 | 2006 | 2006 | Council for International Exchange of Scholars | United States |
| 80 | | Vaccine Production in Plants | 0.0 | 2011 | 2011 | Council for International Exchange of Scholars | United States |
| 81 | | Novel vaccine formulations against tuberculosis | 295003.0 | 2002 | 2005 | Canadian Institutes of Health Research | Canada |
| 446 | | Development of recombinant TB vaccine | 101459.0 | 1999 | 2003 | Canadian Institutes of Health Research | Canada |


Now let's try to find the third grant in the list above ("Vaccine Production in Plants"), using only its title and the funder country:

In [11]:
%%dsldf 

search grants 
    in title_only for "Vaccine Production in Plants" 
    where funder_countries.name="United States" 
return grants

Returned Grants: 1 (total = 1)
Time: 0.65s
Field 'funder_countries' is deprecated in favor of funder_org_countries. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details


| | id | title | active_year | end_date | funder_org_name | funder_orgs | grant_number | language | original_title | start_date | start_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | grant.9923948 | Vaccine Production in Plants | [2011] | 2011-09-01 | Council for International Exchange of Scholars | [{'acronym': 'CIES', 'city_name': 'Washington ... | | en | Vaccine Production in Plants | 2011-01-01 | 2011 |


In [12]:
# we're in luck! only one record found
print("The Dimensions ID is: ", dsl_last_results.iloc[0]['id'])

The Dimensions ID is:  grant.9923948


## Back to our grants list

We can set up a loop to go through all the grants we want a Dimensions ID for, so that we can enrich our original dataset with those IDs. A simple approach is the following:

* first, we try the `extract_grants` function
* second, we try the `search` operation as a fallback
  * if that returns more than one record, we simply take the first one (even though in real life we'd want a more sophisticated approach)
* note: we pause for a second after each query to ensure we don't hit the max queries quota (~30 per minute)

NOTE For the purpose of this exercise, you can select fewer than the ~1200 grants in the original list, in order to speed things up.
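
As a hint of what a more sophisticated approach than taking the first record might look like, the sketch below ranks candidates by how well they agree with the local row. The scoring weights are arbitrary assumptions, `score_candidate` and `rank_candidates` are hypothetical helpers, and the candidate records are invented for illustration. Only the field names (`funder_org_name`, `start_year`) come from the search output shown earlier.

```python
def score_candidate(local_row, candidate):
    """Score a Dimensions record against a local grant row (arbitrary weights)."""
    score = 0
    funder = (candidate.get("funder_org_name") or "").strip().lower()
    if funder == local_row["Funder"].strip().lower():
        score += 2  # same funder name
    if candidate.get("start_year") == local_row["Start Year"]:
        score += 1  # same start year
    return score


def rank_candidates(local_row, candidates):
    """Return the candidates sorted from best to worst score (ties keep their order)."""
    return sorted(candidates, key=lambda c: score_candidate(local_row, c), reverse=True)


local_row = {
    "Title": "Engineering Inhalable Vaccines",
    "Funder": "Natural Sciences and Engineering Research Council",
    "Start Year": 2016,
}
# invented candidates with placeholder IDs, for illustration only
candidates = [
    {"id": "grant.EXAMPLE-1", "funder_org_name": "Natural Sciences and Engineering Research Council", "start_year": 2017},
    {"id": "grant.EXAMPLE-2", "funder_org_name": "Natural Sciences and Engineering Research Council", "start_year": 2016},
]
best = rank_candidates(local_row, candidates)[0]
print(best["id"])
# → grant.EXAMPLE-2
```

A ranking of this kind helps when several grants share the same title and funder but differ in their start year.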

In [13]:
grants = grants_source[:100].copy()

In [14]:
grants.head(10)

| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country |
|---|---|---|---|---|---|---|---|
| 0 | 30410203277 | 疫苗－整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China |
| 1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada |
| 2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada |
| 3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia |
| 4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia |
| 5 | 910292 | Dengue virus vaccine. | 130890.0 | 1991 | 1993 | National Health and Medical Research Council | Australia |
| 6 | 578221 | Engineering Inhalable Vaccines | 27386.0 | 2015 | 2016 | Natural Sciences and Engineering Research Council | Canada |
| 7 | IC18980360 | Schistosomiasis Vaccine Network. | 0.0 | 1998 | 2000 | European Commission | Belgium |
| 8 | 7621798 | Pneumococcal Ribosomal Vaccines | 46000.0 | 1977 | 1980 | Directorate for Biological Sciences | United States |
| 9 | 255890 | Rational vaccine design | 7138.0 | 2003 | 2004 | Natural Sciences and Engineering Research Council | Canada |
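
The pause between queries mentioned above can also be expressed as a small throttle that sleeps only for the time still needed. This is a sketch under assumptions: `Throttle` is a hypothetical helper, and the 2-second interval is simply the ~30 queries per minute quota expressed as one query every 2 seconds.

```python
import time


class Throttle:
    """Keep successive calls to wait() at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable, so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# assumption: ~30 queries per minute means one query every 2 seconds
throttle = Throttle(min_interval=2.0)
throttle.wait()  # the first call returns immediately
```

Calling `throttle.wait()` before each query, instead of a fixed `time.sleep(1)`, avoids waiting longer than necessary when a query was already slow.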


Setting up the loop:

In [25]:
# load the progress bar widget for jupyter
from tqdm.notebook import tqdm as progressbar

output = []

def find_grant_first_method(grantno, funder):
  """Method 1: resolve the grant via extract_grants. Returns None if there is no match."""
  # rows without a grant number hold NaN: skip the query, it cannot match
  if pd.isnull(grantno):
    return None
  match = dsl.query(f'''extract_grants(grant_number="{grantno}", funder_name="{funder}")''').json
  grant_id = match.get("grant_id")
  if grant_id:
    print("Found a match with method 1: ", grant_id)
    return grant_id
  return None

def find_grant_second_method(title, country):
  """Method 2: search by exact title and funder country. Returns None if there is no match."""
  # match titles exactly - see also https://docs.dimensions.ai/dsl/language.html#using-triple-quotes
  res = dsl.query(f'''search grants in title_only for """ "{title}" """ where funder_org_countries.name="{country}" return grants''')
  if not res.errors and res.grants and res.grants[0].get('id'):
    # several records may come back: we simply take the first one
    grant_id = res.grants[0].get('id')
    print("=== Found a match with method 2: ", grant_id)
    return grant_id
  return None


for index, row in progressbar(grants.iterrows(), total=grants.shape[0]):
  # get data from table
  grantno, funder = row['Grant Number'], row['Funder']
  # try first method
  grant_id = find_grant_first_method(grantno, funder)
  if not grant_id:
    # try second method
    title, country = row['Title'], row['Funder Country']
    grant_id = find_grant_second_method(title, country)
    if not grant_id:
      print("Failed - skipping")
  # unmatched rows are stored as None, so that output stays aligned with the table
  output.append(grant_id)
  # pause to stay below the queries quota
  time.sleep(1)


  0%|          | 0/100 [00:00<?, ?it/s]

Found a match with method 1:  grant.8172033
Returned Grants: 3 (total = 3)
Time: 0.56s
=== Found a match with method 2:  grant.7715379
Returned Grants: 3 (total = 3)
Time: 3.35s
=== Found a match with method 2:  grant.7715379
Found a match with method 1:  grant.6723913
Found a match with method 1:  grant.6722306
Found a match with method 1:  grant.6716312
Returned Grants: 3 (total = 3)
Time: 7.82s
=== Found a match with method 2:  grant.7715379
Found a match with method 1:  grant.3733803
Found a match with method 1:  grant.3274273
Returned Grants: 2 (total = 2)
Time: 1.62s
=== Found a match with method 2:  grant.2863615
Returned Grants: 2 (total = 2)
Time: 0.55s
=== Found a match with method 2:  grant.2863615
Returned Grants: 2 (total = 2)
Time: 6.07s
=== Found a match with method 2:  grant.2807651
Returned Grants: 2 (total = 2)
Time: 3.28s
=== Found a match with method 2:  grant.2807651
Returned Grants: 2 (total = 2)
Time: 6.

### Enriching the original list

Finally, we can take the Dimensions ID data we extracted and add it to the original grants table as an extra column.

In [26]:
grants["Dimensions ID"] = output

In [27]:
grants.head(10)

| | Grant Number | Title | Funding Amount in USD | Start Year | End Year | Funder | Funder Country | Dimensions ID |
|---|---|---|---|---|---|---|---|---|
| 0 | 30410203277 | 疫苗－整体方案 | 1208.0 | 2004 | 2004 | National Natural Science Foundation of China | China | grant.8172033 |
| 1 | 620792 | Engineering Inhalable Vaccines | 26956.0 | 2017 | 2018 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 |
| 2 | 599115 | Engineering Inhalable Vaccines | 26403.0 | 2016 | 2017 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 |
| 3 | 251564 | HIV Vaccine research | 442366.0 | 2003 | 2007 | National Health and Medical Research Council | Australia | grant.6723913 |
| 4 | 334174 | HIV Vaccine Development | 236067.0 | 2005 | 2009 | National Health and Medical Research Council | Australia | grant.6722306 |
| 5 | 910292 | Dengue virus vaccine. | 130890.0 | 1991 | 1993 | National Health and Medical Research Council | Australia | grant.6716312 |
| 6 | 578221 | Engineering Inhalable Vaccines | 27386.0 | 2015 | 2016 | Natural Sciences and Engineering Research Council | Canada | grant.7715379 |
| 7 | IC18980360 | Schistosomiasis Vaccine Network. | 0.0 | 1998 | 2000 | European Commission | Belgium | grant.3733803 |
| 8 | 7621798 | Pneumococcal Ribosomal Vaccines | 46000.0 | 1977 | 1980 | Directorate for Biological Sciences | United States | grant.3274273 |
| 9 | 255890 | Rational vaccine design | 7138.0 | 2003 | 2004 | Natural Sciences and Engineering Research Council | Canada | grant.2863615 |
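
Because the fallback search simply takes the first record, different local grants can end up with the same Dimensions ID. A quick sanity check is sketched below, using the ten IDs from the table above. `summarize_matches` is a hypothetical helper, and it expects a plain list in which unmatched grants are stored as `None`.

```python
from collections import Counter

# Dimensions IDs of the first ten rows of the enriched table above
dimensions_ids = [
    "grant.8172033", "grant.7715379", "grant.7715379", "grant.6723913",
    "grant.6722306", "grant.6716312", "grant.7715379", "grant.3733803",
    "grant.3274273", "grant.2863615",
]


def summarize_matches(ids):
    """Count matched grants and list the IDs assigned to more than one row."""
    matched = [i for i in ids if i]  # unmatched rows are None
    repeated = {i: n for i, n in Counter(matched).items() if n > 1}
    return {"total": len(ids), "matched": len(matched), "repeated": repeated}


summary = summarize_matches(dimensions_ids)
print(summary)
# → {'total': 10, 'matched': 10, 'repeated': {'grant.7715379': 3}}
```

Here `grant.7715379` is assigned to three rows with the same title but different grant numbers and years, so those rows are good candidates for manual review.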
