# Patent publication references, for an entire patent family

This notebook shows how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to identify all the publications referenced by patents, for all the patents that belong to the same [patent family](https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/patent-families.html).

There are the steps: 

1. We start from a specific patent Dimensions ID and obtain its family ID
2. Using the family ID, we query the [patents API](https://docs.dimensions.ai/dsl/datasource-patents.html) to search for all related patents and return the publications IDs they reference
3. Finally, we query the [publications API](https://docs.dimensions.ai/dsl/datasource-publications.html) to obtain other useful publication metadata e.g. title, publisher, journals etc.. 

These sample results can be explored in [Google Sheets](https://docs.google.com/spreadsheets/d/17aCE36hsKapt9nOzvP1KtbmmsMuj5O-F7ckcPEU_bUE). 

In [1]:
import datetime
print("==\nCHANGELOG\nThis notebook was last run on %s\n==" % datetime.date.today().strftime('%b %d, %Y'))

==
CHANGELOG
This notebook was last run on Jan 25, 2022
==


## Prerequisites

This notebook assumes you have installed the [Dimcli](https://pypi.org/project/dimcli/) library and are familiar with the ['Getting Started' tutorial](https://api-lab.dimensions.ai/cookbooks/1-getting-started/1-Using-the-Dimcli-library-to-query-the-API.html).

In [2]:
!pip install dimcli tqdm -U --quiet 

import dimcli
from dimcli.utils import *
import sys, json, time, os
from tqdm.notebook import tqdm
import pandas as pd
#

print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  import getpass
  KEY = getpass.getpass(prompt='API Key: ')  
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

[2mSearching config file credentials for 'https://app.dimensions.ai' endpoint..[0m


==
Logging in..
[2mDimcli - Dimensions API Client (v0.9.6)[0m
[2mConnected to: <https://app.dimensions.ai/api/dsl> - DSL v2.0[0m
[2mMethod: dsl.ini file[0m


## 1. Search for the patent ID and return the family ID.

As a starting point, let's take patent ID `US-20210108231-A1`. 

> View this patent record in Dimensions: [Methods and compositions for rna-directed target dna modification and for rna-directed modulation of transcription](https://app.dimensions.ai/details/patent/US-20210108231-A1)

In [3]:
patent_id = "US-20210108231-A1" #@param {type:"string"}

q_family_id = dsl.query(f"""
    search patents
    where id = "{patent_id}"
    return family_id
""")

try:
    family_id = q_family_id['family_id'][0]['id']
    print("Found family_id:",  family_id)
except:
    print("No family ID found. \nFull API results:\n", str(q_family_id.json))

Returned Family_id: 1
[2mTime: 0.60s[0m
Found family_id: 49624232


## 2. Use the family ID to search for all related patents and return the publications IDs they reference

A few things to note about the query below: 

* The `unnest` operator in `return patents[unnest(publication_ids)]` is used to 'explode' lists of its into a single value per column - [more info here](https://docs.dimensions.ai/dsl/language.html#unnesting-multi-value-entity-fields)
* The filter `publication_ids is not empty` means that only patents that have at least one publication reference get returned
* The return statement could be changed to `..return publications` i.e. a facet (aggregation). However remember that all facet queries allow a maximum of 1000 results.. so to ensure we get all results for any family ID we simply return one line per patent and aggregate data manually. 
* Finally, we should keep in mind that results contain duplicate rows cause the more than one patent (in the same family) may be referencing the same publication. We'll dedup the data later but keep this infos, as it'll tell us *which publications are cited most frequently*. 


In [4]:

#
# get all patents from same family
all_patents = []

q_all_patents = dsl.query_iterative(f"""
    search patents
    where family_id = {family_id}
    and publication_ids is not empty
    return patents[unnest(publication_ids)]
""")

df = q_all_patents.as_dataframe()

#
# pivot on IDs and count frequency 
references_list = df.groupby(df.columns.tolist(),as_index=False).size().sort_values("size", ascending=False)


Starting iteration with limit=1000 skip=0 ...[0m
0-55 / 55 (0.81s)[0m
55-55 / 55 (0.56s)[0m
===
Records extracted: 11401[0m


In [5]:
# preview the data 

references_list.head(10)

Unnamed: 0,publication_ids,size
206,pub.1030591890,50
323,pub.1052438070,47
276,pub.1041850060,47
132,pub.1019873131,44
151,pub.1022072971,44
152,pub.1022097335,40
129,pub.1019168198,39
260,pub.1039119530,39
285,pub.1043148894,39
264,pub.1040038815,39


## 3. Enriching the publication IDs with additional metadata 

In this step we query the [publications API](https://docs.dimensions.ai/dsl/datasource-publications.html), using the referenced Dimensions IDs extracted previously in order to retrieve further metadata about publications. 

Since we can have lots of publications to go through, the IDs list is *chunked* into smaller groups so to ensure the resulting API query is never too long ([more info here](https://api-lab.dimensions.ai/cookbooks/1-getting-started/6-Working-with-lists.html#5.-How-Long-can-lists-get?)). 

PS Change the query template `return` statement to customise the metadata returned. 

In [6]:
pubids = list(references_list['publication_ids'])


query_template = """search publications 
                    where id in {}
                    return publications[id+doi+pmid+title+journal+year+publisher+type+dimensions_url]
                    limit 1000"""


#
# loop through all references-publications IDs in chunks and query Dimensions 

print("===\nExtracting publications data ...")
results = []
BATCHSIZE = 300
VERBOSE = False # set to True to see extraction logs

for chunk in tqdm(list(chunks_of(pubids, BATCHSIZE))):
    query = query_template.format(json.dumps(chunk))
    data = dsl.query(query, verbose=VERBOSE)
    results += data.publications
    time.sleep(0.5)

#
# put the cited publication data into a dataframe 

pubs_cited = pd.DataFrame().from_dict(results)
print("===\nCited Publications found: ", len(pubs_cited))



#
# transform the 'journal' column cause it contains nested data 

temp = pubs_cited['journal'].apply(pd.Series).rename(columns={"id": "journal.id", 
                                                            "title": "journal.title"}).drop([0], axis=0)
pubs_cited = pd.concat([pubs_cited.drop(['journal'], axis=1), temp], axis=1)

pubs_cited.head()


===
Extracting publications data ...


  0%|          | 0/2 [00:00<?, ?it/s]

===
Cited Publications found:  363


Unnamed: 0,dimensions_url,doi,id,pmid,publisher,title,type,year,journal.id,journal.title
0,https://app.dimensions.ai/details/publication/...,10.1021/acs.chemrev.7b00499,pub.1100683468,29377672.0,American Chemical Society (ACS),Molecular Mechanism and Evolution of Nuclear P...,article,2018,,
1,https://app.dimensions.ai/details/publication/...,10.1101/168443,pub.1091918134,,Cold Spring Harbor Laboratory,P53 toxicity is a hurdle to CRISPR/CAS9 screen...,preprint,2017,jour.1293558,bioRxiv
2,https://app.dimensions.ai/details/publication/...,10.1016/j.cell.2016.08.056,pub.1035697575,27662091.0,Elsevier,Editing DNA Methylation in the Mammalian Genome,article,2016,jour.1019114,Cell
3,https://app.dimensions.ai/details/publication/...,10.18632/oncotarget.10234,pub.1017128844,27356740.0,"Impact Journals, LLC",CRISPR-dCas9 mediated TET1 targeting for selec...,article,2016,jour.1043645,Oncotarget
4,https://app.dimensions.ai/details/publication/...,10.1038/nature17946,pub.1009172001,27096365.0,Springer Nature,Programmable editing of a target base in genom...,article,2016,jour.1018957,Nature


## 4. Combine the publication metadata with the patent citations information  

In this step we take the results of the patents query from step 2 and merge them with the publication query from step 3. 

The goal is simply to retain the total count of patent citations per publication in the final dataset containing detailed publications metadata. 

In [7]:
# merge two datasets using 'publication id' as key
final_data = pubs_cited.merge(references_list, left_on='id', right_on='publication_ids')

# rename 'size' column
final_data.rename(columns = {"size" : "patents_citations"}, inplace = True)

# show top 5 cited publications
final_data.sort_values("patents_citations", ascending=False, inplace=True)
final_data.head(5)

Unnamed: 0,dimensions_url,doi,id,pmid,publisher,title,type,year,journal.id,journal.title,publication_ids,patents_citations
134,https://app.dimensions.ai/details/publication/...,10.1038/nature09886,pub.1030591890,21455174,Springer Nature,CRISPR RNA maturation by trans-encoded small R...,article,2011,jour.1018957,Nature,pub.1030591890,50
92,https://app.dimensions.ai/details/publication/...,10.1126/science.1225829,pub.1041850060,22745249,American Association for the Advancement of Sc...,A Programmable Dual-RNA–Guided DNA Endonucleas...,article,2012,jour.1346339,Science,pub.1041850060,47
126,https://app.dimensions.ai/details/publication/...,10.1093/nar/gkr606,pub.1052438070,21813460,Oxford University Press (OUP),The Streptococcus thermophilus CRISPR/Cas syst...,article,2011,jour.1018982,Nucleic Acids Research,pub.1052438070,47
78,https://app.dimensions.ai/details/publication/...,10.1126/science.1232033,pub.1022072971,23287722,American Association for the Advancement of Sc...,RNA-Guided Human Genome Engineering via Cas9,article,2013,jour.1346339,Science,pub.1022072971,44
79,https://app.dimensions.ai/details/publication/...,10.1126/science.1231143,pub.1019873131,23287718,American Association for the Advancement of Sc...,Multiplex Genome Engineering Using CRISPR/Cas ...,article,2013,jour.1346339,Science,pub.1019873131,44


### 4.1 Optional: exporting the data to google sheets

NOTE: this will work only Google Colab, or in other Jupyter environment if you have previously enabled the required Google credentials ([more info here](https://digital-science.github.io/dimcli/modules.html?highlight=export_as_gsheets#dimcli.utils.misc_utils.export_as_gsheets)).

In [None]:
export_as_gsheets(final_data)

---
## Conclusions

In this notebook we have shown how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to identify all the publications referenced by patents, for all the patents that belong to the same [patent family](https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/patent-families.html).

This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it'll give you a few basic tools to get started building your own application. For more background, see the [list of fields](https://docs.dimensions.ai/dsl/datasource-patents.html) available via the Patents API. 
