# Enriching Grants part 3: adding related Patents and Clinical Trials data

In this third and final part of the grants enrichment tutorial, we extract from Dimensions all the Patents and Clinical Trials information linked to our *vaccines* grants dataset.

This tutorial builds on the previous one, *Enriching Grants with Publications Information from Dimensions*, and assumes that our grants list already includes Dimensions IDs as well as publication counts for each grant.

The enriched grants list we are starting from can be [downloaded here](http://static.michelepasin.org/dsl/grants_enriched_pubs.csv). 

## Load libraries and login
Click the 'play' button on the left (or shift+enter) to run this cell.

In [0]:
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}

#
!pip install dimcli plotly_express -U --quiet 
# !dimcli --init
import dimcli
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()

#
# load common libraries
import pandas as pd
import time
import json
from tqdm import tqdm_notebook as progressbar

DimCli v0.6.1 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)


## Uploading the enriched grants data from part-2

First make sure you have downloaded the file ["grants_enriched_pubs.csv"](http://static.michelepasin.org/dsl/grants_enriched_pubs.csv). 

> If you are using Google Colab, you also want to open up the left sidebar / Files menu (Apple/Ctrl + B) and drag and drop the file in there so that it can be seen by the notebook. 

In [0]:
grants = pd.read_csv("grants_enriched_pubs.csv")
grantsids = list(grants['Grant ID']) 

This file contains 5k recent grants records on the topic of vaccines. Now we can preview the contents of the file.

In [0]:
grants.head(10)

Unnamed: 0,Grant ID,Grant Number,Title,Funding Amount in USD,Start Date,Start Year,End Date,End Year,Research Organization - standardized,GRID ID,Country of Research organization,Funder,Funder Country,pubs
0,grant.8172033,30410203277,疫苗－整体方案,1208,2004-11-30,2004.0,2004-12-31,2004,Sichuan University,grid.13291.38,China,National Natural Science Foundation of China,China,0
1,grant.7715379,620792,Engineering Inhalable Vaccines,26956,2017-04-01,2017.0,2018-03-31,2018,University of Alberta,grid.17089.37,Canada,Natural Sciences and Engineering Research Council,Canada,0
2,grant.6962629,599115,Engineering Inhalable Vaccines,26403,2016-04-01,2016.0,2017-03-31,2017,University of Alberta,grid.17089.37,Canada,Natural Sciences and Engineering Research Council,Canada,0
3,grant.6723913,251564,HIV Vaccine research,442366,2003-01-01,2003.0,2007-12-31,2007,University of Melbourne,grid.1008.9,Australia,National Health and Medical Research Council,Australia,0
4,grant.6722306,334174,HIV Vaccine Development,236067,2005-01-01,2005.0,2009-12-31,2009,Monash University,grid.1002.3,Australia,National Health and Medical Research Council,Australia,1
5,grant.6716312,910292,Dengue virus vaccine.,130890,1991-01-01,1991.0,1993-12-31,1993,Royal Children's Hospital,grid.416107.5,Australia,National Health and Medical Research Council,Australia,0
6,grant.5526688,578221,Engineering Inhalable Vaccines,27386,2015-04-01,2015.0,2016-03-31,2016,University of Alberta,grid.17089.37,Canada,Natural Sciences and Engineering Research Council,Canada,0
7,grant.3733803,IC18980360,Schistosomiasis Vaccine Network.,0,1998-11-01,1998.0,2000-10-31,2000,Pasteur Institute of Lille; University of Edin...,grid.8970.6; grid.4305.2; grid.33058.3d; grid....,France; United Kingdom; Kenya; Belgium; United...,European Commission,Belgium,0
8,grant.3274273,7621798,Pneumococcal Ribosomal Vaccines,46000,1977-08-01,1977.0,1980-01-31,1980,University of Iowa,grid.214572.7,United States,Directorate for Biological Sciences,United States,0
9,grant.2936015,255890,Rational vaccine design,7138,2003-04-01,2003.0,2004-03-31,2004,,,,Natural Sciences and Engineering Research Council,Canada,0


## Extracting linked Patents data

Using a similar methodology as with publications, we can easily extract all patents linked to each grant in two steps. 

* retrieve all the relevant patents records using the `associated_grant_ids` field (see also the [data model](https://docs.dimensions.ai/dsl/data-model.html) and the  [patents fields](https://docs.dimensions.ai/dsl/data-sources.html#patents))
* group patents by grant ID so that we have a single count per grant record

Note: in this case we can process 400 grants per query, because grants generally have far fewer associated patents than publications.
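
The chunking step can be sketched without any API access. The helper `chunks` below is a hypothetical stand-in for `dimcli.shortcuts.chunks_of`, and the grant IDs are made up; the point is only to show what each formatted query string looks like before it is sent to Dimensions.

```python
import json

def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements
    (hypothetical stand-in for dimcli's chunks_of)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# query template, same shape as the one used in this tutorial
query_template = """search patents where associated_grant_ids in {}
  return patents[basics+associated_grant_ids]"""

# made-up IDs, for illustration only
sample_ids = [f"grant.{n}" for n in range(1000, 1005)]

# json.dumps turns each Python list into a double-quoted list literal for the query
queries = [query_template.format(json.dumps(chunk)) for chunk in chunks(sample_ids, 2)]

print(len(queries))
print(queries[0])
```

With 5,000 grants and a chunk size of 400, the same logic yields 13 queries, which is the total reported by the progress bar in the cell output.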

In [0]:
#
# the main query
#
q = """search patents where associated_grant_ids in {} 
  return patents[basics+associated_grant_ids]"""

#
# useful libraries for looping
#
from dimcli.shortcuts import chunks_of
from tqdm import tqdm_notebook as progressbar

#
# let's loop through all grants IDs in chunks and query Dimensions 
#
print("===\nExtracting patents data ...")
results = []

for chunk in progressbar(list(chunks_of(list(grantsids), 400))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=False)
    results += data.patents
    time.sleep(1)

#
# put the patents data into a dataframe, remove duplicates and save
#
patents = pd.DataFrame.from_dict(results)
print("Patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Unique Patents found: ", len(patents))
patents.to_csv("patents.csv", index=False)
# turning lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
patents['associated_grant_ids'] = patents['associated_grant_ids'].apply(lambda x: ','.join(map(str, x))) 
print("===\nDone - data saved as 'patents.csv'")


#
# count patents per grant and enrich the original dataset
#
def patents_for_grantid(grantid):
  # plain substring match: regex=False stops the '.' in the ID from acting as a wildcard
  return patents[patents['associated_grant_ids'].str.contains(grantid, regex=False)]

print("===\nCounting patents per grant...")

l = []
for x in progressbar(grantsids):
  l.append(len(patents_for_grantid(x)))

#
# save the data
#
grants['patents'] = l
grants.to_csv("grants_enriched_pubs_patents.csv", index=False)
print("===\nDone - data saved as 'grants_enriched_pubs_patents.csv'")

===
Extracting patents data ...


HBox(children=(IntProgress(value=0, max=13), HTML(value='')))


Patents found:  415
Unique Patents found:  400
===
Done - data saved as 'patents.csv'
===
Counting patents per grant...


HBox(children=(IntProgress(value=0, max=5000), HTML(value='')))


===
Done - data saved as 'grants_enriched_pubs_patents.csv'
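
Note that `str.contains` performs substring matching, so a grant ID that happens to be a prefix of a longer one would be counted as well. Below is a minimal sketch of an exact-match alternative, assuming the grant IDs are still kept as a list column; the small dataframes and their values are made up for illustration.

```python
import pandas as pd

# toy patents table: each row lists the grants the patent is linked to (made-up values)
patents_demo = pd.DataFrame({
    "id": ["P1", "P2", "P3"],
    "associated_grant_ids": [["grant.12", "grant.123"], ["grant.123"], ["grant.99"]],
})
grants_demo = pd.DataFrame({"Grant ID": ["grant.12", "grant.123", "grant.7"]})

# one row per (patent, grant) pair, then count distinct patents per grant ID
counts = (
    patents_demo.explode("associated_grant_ids")
    .groupby("associated_grant_ids")["id"]
    .nunique()
)

# align the counts with the grants table; grants without patents get 0
grants_demo["patents"] = grants_demo["Grant ID"].map(counts).fillna(0).astype(int)
print(grants_demo)
```

Here `grant.12` is counted once, whereas a substring match would also pick up the patent linked only to `grant.123`. The grouped count also replaces the per-grant loop, which tends to scale better on larger lists.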


Let's quickly preview the patents dataset, and the grants one, which now has an extra column counting patents. 

In [0]:
patents.head(5)

Unnamed: 0,inventor_names,times_cited,granted_year,year,filing_status,associated_grant_ids,assignees,assignee_names,publication_date,id,title
0,"[Daniel J. Smith, Martin A. Taubman]",5.0,2006.0,2004,Grant,"grant.2489571,grant.2489689","[{'id': 'grid.38142.3c', 'name': 'Harvard Univ...","[FORSYTH INSTITUTE, Forsyth Dental Infirmary f...",2006-06-06,US-7056517-B2,Glucosyltransferase immunogens
1,[MARIO PHILIPP],1.0,,2003,Application,grant.2452868,"[{'id': 'grid.265219.b', 'name': 'Tulane Unive...",[Administrators of the Tulane Educational Fund...,2004-04-08,US-20040067517-A1,Surface antigens and proteins useful in compos...
2,"[James Tam, Qitao Yu, Yi-An Lu, Jin-Long Yang]",2.0,,2004,Application,grant.2454716,"[{'id': 'grid.152326.1', 'name': 'Vanderbilt U...","[Vanderbilt University, UNIV VANDERBILT]",2005-05-26,US-20050113292-A1,Compositions of protein mimetics and methods o...
3,"[Clifford J. Beall, Kelly R. Clark, Philip R. ...",,2011.0,2009,Grant,grant.5246582,"[{'id': 'grid.240344.5', 'name': 'Nationwide C...","[Nationwide Children's Hospital Inc, NATIONWID...",2011-05-17,US-7943379-B2,Production of rAAV in vero cells using particu...
4,"[Jon O. Rayner, Jonathan F. Smith, Bolyn Hubby...",,,2014,Application,"grant.2687621,grant.2687606","[{'id': 'grid.422340.3', 'name': 'AlphaVax (Un...","[AlphaVax Inc, ALPHAVAX INC]",2014-07-24,US-20140205629-A1,"TC-83-DERIVED ALPHAVIRUS VECTORS, PARTICLES AN..."


In [0]:
grants.sort_values("patents", ascending=False).head(5)

Unnamed: 0,Grant ID,Grant Number,Title,Funding Amount in USD,Start Date,Start Year,End Date,End Year,Research Organization - standardized,GRID ID,Country of Research organization,Funder,Funder Country,pubs,patents
4250,grant.2452138,R01AI030904,Protective CMI mechanisms of a dual-subtype FI...,4213791,1991-08-01,1991.0,2015-03-31,2015,University of Florida,grid.15276.37,United States,National Institute of Allergy and Infectious D...,United States,31,28
4800,grant.2643965,R43CA081752,A NEW CANDIDATE VACCINE FOR LEUKEMIA AND CANCER,0,1999-09-08,1999.0,2000-06-30,2000,Corixa Corporation,grid.284594.6,United States,National Cancer Institute,United States,0,20
737,grant.2424402,N01AI005396,HIV Vaccine Design and Development Teams-26605396,7545418,2000-06-05,2000.0,2005-06-04,2005,Novartis (United States),grid.418424.f,United States,National Institute of Allergy and Infectious D...,United States,29,13
2684,grant.2687918,U01AI070443,Preclinical development of a chimeric tetraval...,2873420,2006-09-26,2006.0,2010-08-31,2010,Takeda (United States),grid.419849.9,United States,National Institute of Allergy and Infectious D...,United States,3,11
2722,grant.2474639,R01CA084232,Cell-based tumor vaccines targeting CD4+ T lym...,2595114,2000-04-01,2000.0,2014-01-31,2014,"University of Maryland, Baltimore County",grid.266673.0,United States,National Cancer Institute,United States,65,10


## Extracting linked Clinical Trials data

Now we can repeat the same process once more, for Clinical Trials. The field we need is again called `associated_grant_ids` (see also the [clinical trials](https://docs.dimensions.ai/dsl/data-sources.html#clinical-trials) docs).

As with patents, we can process 400 grants per query, because grants generally have far fewer associated clinical trials than publications.

In [0]:
#
# the main query
#
q = """search clinical_trials where associated_grant_ids in {} 
  return clinical_trials[basics+associated_grant_ids]"""


#
# useful libraries for looping
#
from dimcli.shortcuts import chunks_of
from tqdm import tqdm_notebook as progressbar

#
# let's loop through all grants IDs in chunks and query Dimensions 
#
print("===\nExtracting clinical trials data ...")
results = []

for chunk in progressbar(list(chunks_of(list(grantsids), 400))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=False)
    results += data.clinical_trials
    time.sleep(1)

#
# put the clinical trials data into a dataframe, remove duplicates and save
#
clinical_trials = pd.DataFrame.from_dict(results)
print("Clinical Trials found: ", len(clinical_trials))
clinical_trials.drop_duplicates(subset='id', inplace=True)
print("Unique Clinical Trials found: ", len(clinical_trials))
clinical_trials.to_csv("clinical_trials.csv", index=False)
# NOTE turning lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
clinical_trials['associated_grant_ids'] = clinical_trials['associated_grant_ids'].apply(lambda x: ','.join(map(str, x))) 
print("===\nDone - data saved as 'clinical_trials.csv'.")


#
# count clinical trials per grant and enrich the original dataset
#
def cltrials_for_grantid(grantid):
  # plain substring match: regex=False stops the '.' in the ID from acting as a wildcard
  return clinical_trials[clinical_trials['associated_grant_ids'].str.contains(grantid, regex=False)]

print("===\nCounting clinical trials per grant...")
l = []
for x in progressbar(grantsids):
  l.append(len(cltrials_for_grantid(x)))



#
# save the data
#
grants['clinical_trials'] = l
grants.to_csv("grants_enriched_pubs_patents_clinical_trials.csv", index=False)
print("===\nDone - data saved as 'grants_enriched_pubs_patents_clinical_trials.csv'")

===
Extracting clinical trials data ...


HBox(children=(IntProgress(value=0, max=13), HTML(value='')))

Clinical Trials found:  11
Unique Clinical Trials found:  11
===
Done - data saved as 'clinical_trials.csv'.
===
Counting clinical trials per grant...


HBox(children=(IntProgress(value=0, max=5000), HTML(value='')))

===
Done - data saved as 'grants_enriched_pubs_patents_clinical_trials.csv'
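
Because `patents.csv` and `clinical_trials.csv` are written before the list columns are joined into strings, those columns are stored as Python list representations. If you reload the files later, the values come back as plain strings. One possible way to restore them, sketched here with an in-memory CSV and made-up rows, is `ast.literal_eval`.

```python
import ast
import io
import pandas as pd

# write a small frame with a list column to an in-memory CSV (made-up data)
original = pd.DataFrame({
    "id": ["NCT001", "NCT002"],
    "associated_grant_ids": [["grant.1", "grant.2"], ["grant.3"]],
})
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# after reloading, each list is a plain string such as "['grant.1', 'grant.2']"
reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, "associated_grant_ids"]))

# parse each string back into a real Python list
reloaded["associated_grant_ids"] = reloaded["associated_grant_ids"].apply(ast.literal_eval)
print(reloaded.loc[0, "associated_grant_ids"])
```

Rows with missing values would need to be handled separately, since `ast.literal_eval` expects a string.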


In [0]:
clinical_trials.head(5)

Unnamed: 0,investigator_details,associated_grant_ids,title,id,active_years
0,"[[Beryl A Koblin, PhD, Principal Investigator,...",grant.2485747,Project UNITY - A Randomized Trial of Enhanced...,NCT00150098,"[2005, 2006, 2007, 2008, 2009, 2010]"
1,"[[Emmanuel B Walter, MPH, MD, Principal Invest...",grant.2692202,Prevention of Influenza in Infants by Immuniza...,NCT00570037,"[2007, 2008]"
2,"[[Paul Reiter, PhD, Principal Investigator, Oh...","grant.2438819,grant.4103659",Increasing HPV Vaccine Coverage Among Young Ad...,NCT02835755,"[2016, 2017]"
3,"[[Giuseppe Pantaleo, , Study Chair, CTU Lausan...",grant.2687787,A Phase 1b Clinical Trial to Evaluate the Safe...,NCT00961883,
4,"[[Connie Celum, MD, Study Chair, University of...",grant.2687787,"A Multi-Site Evaluation of Virologic, Immunolo...",NCT00029913,"[2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]"


In [0]:
grants.sort_values("clinical_trials", ascending=False).head(5)

Unnamed: 0,Grant ID,Grant Number,Title,Funding Amount in USD,Start Date,Start Year,End Date,End Year,Research Organization - standardized,GRID ID,Country of Research organization,Funder,Funder Country,pubs,patents,clinical_trials
3877,grant.2687787,U01AI068614,Leadership Group for a Global HIV Vaccine Clin...,181539904,2006-06-29,2006.0,2013-05-31,2013,Fred Hutchinson Cancer Research Center,grid.270240.3,United States,National Institute of Allergy and Infectious D...,United States,64,0,2
4498,grant.2435034,P01AI045142,DEVELOPMENT OF A NOVEL MULTIENVELOPE AIDS VACCINE,7044139,1999-05-15,1999.0,2006-07-31,2006,St. Jude Children's Research Hospital,grid.240871.8,United States,National Institute of Allergy and Infectious D...,United States,34,0,1
4791,grant.2695778,U19AI065683,Malaria Vaccine Trials in Mali,4748457,2005-09-01,2005.0,2010-06-30,2010,"University of Maryland, Baltimore",grid.411024.2,United States,National Institute of Allergy and Infectious D...,United States,32,0,1
4229,grant.2687567,U01AI053719,INDO-US Collaboration to Develop Rotavirus Vac...,1130283,2003-03-01,2003.0,2008-02-29,2008,All India Institute of Medical Sciences,grid.413618.9,India,National Institute of Allergy and Infectious D...,United States,2,0,1
3814,grant.2692202,U01IP000074,PiiiTCH Study,0,2006-09-15,2006.0,2008-09-14,2008,Duke University,grid.26009.3d,United States,Centers for Disease Control and Prevention,United States,2,0,1


## Data Exploration

Now we can explore the combined grants, publications, patents and clinical trials dataset using the [plotly express](https://plot.ly/python/plotly-express/) library. 

> If by any chance you couldn't complete the steps above, the final enriched data can be found here: [grants_enriched_pubs_patents_clinical_trials.csv](http://static.michelepasin.org/dsl/grants_enriched_pubs_patents_clinical_trials.csv). Load it with `grants = pd.read_csv()` as above.

In [0]:
import plotly_express as px

### How many linked objects overall? 


In [0]:
df = pd.DataFrame({
    'measure' : ['Grants', 'Grants with pubs', 'Grants with Patents', 'Grants with Clinical Trials'],
    'count' : [len(grants), len(grants[grants['pubs'] > 0]), len(grants[grants['patents'] > 0]), len(grants[grants['clinical_trials'] > 0])],
})

px.bar(df, x="measure", y="count", title=f"Grants: overview of associated objects found")
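
The same counts can also be expressed as shares of the total, which is sometimes easier to read than absolute bars. A small sketch, using a made-up stand-in for the `grants` dataframe that contains only the three count columns:

```python
import pandas as pd

# made-up stand-in for the enriched grants table
grants_demo = pd.DataFrame({
    "pubs": [0, 3, 1, 0],
    "patents": [0, 0, 2, 0],
    "clinical_trials": [0, 0, 0, 1],
})

# fraction of grants with at least one linked object of each type
shares = (grants_demo[["pubs", "patents", "clinical_trials"]] > 0).mean()
print((shares * 100).round(1))
```

On the real data, replacing `grants_demo` with `grants` should give the percentage of grants that have at least one publication, patent or clinical trial.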

### Patents and Clinical Trials by Year

In [0]:
px.bar(grants, x="End Year", y="patents", color="Funding Amount in USD", 
       hover_name="Title", 
       hover_data=['Grant ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
       title=f"Patents per grant")

In [0]:
px.bar(grants, x="End Year", y="clinical_trials", color="Funding Amount in USD", 
       hover_name="Title", 
       hover_data=['Grant ID', 'Start Year', 'End Year', 'Funder', 'Funder Country'],
       title=f"Clinical Trials per grant")

### Patents and Clinical Trials by Grant Funder

In [0]:
# sum only the count column we plot, so that text columns are not aggregated
funders_patents = grants.query('patents > 0').groupby(['Funder', 'Funder Country'], as_index=False)['patents'].sum().sort_values(by="patents", ascending=False)
funders_trials = grants.query('clinical_trials > 0').groupby(['Funder', 'Funder Country'], as_index=False)['clinical_trials'].sum().sort_values(by="clinical_trials", ascending=False)

In [0]:
px.bar(funders_patents,  y="patents", x="Funder", color="Funder Country",
       hover_name="Funder", 
       hover_data=['Funder', 'Funder Country'], 
       title=f"Patents by Funders")

In [0]:
px.bar(funders_trials,  y="clinical_trials", x="Funder", color="Funder Country",
       hover_name="Funder", 
       hover_data=['Funder', 'Funder Country'], 
       title=f"Clinical Trials by Funders")

### Exploring Correlations between dimensions

Tip: points that fall along a straight diagonal indicate a strong correlation between two measures, while points that hug the two axes (forming a 90 degree angle) indicate no correlation. 

In [0]:
px.scatter_matrix(grants, dimensions=["patents", "clinical_trials", "pubs"], color="Funder Country")
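
The visual impression from the scatter matrix can be cross-checked numerically. A minimal sketch using pandas' `DataFrame.corr` (Pearson correlation by default) on a made-up stand-in for the `grants` table:

```python
import pandas as pd

# made-up stand-in for the enriched grants table
grants_demo = pd.DataFrame({
    "pubs": [0, 2, 4, 6, 8],
    "patents": [0, 1, 2, 3, 4],        # exactly proportional to pubs
    "clinical_trials": [1, 0, 0, 1, 0],
})

# pairwise correlation coefficients between the three count columns
corr = grants_demo[["patents", "clinical_trials", "pubs"]].corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one. On the real data, replace `grants_demo` with `grants`.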

--- 

## Conclusion

In this tutorial we have enriched a grants dataset on the topic of 'vaccines' by adding information about linked patents and clinical trials.
