# Python TERMite toolkit - DOCstore

This notebook walks you through how to make calls to the DOCstore API
and some of the post-processing of the JSON output.

Import the required submodule:

In [None]:
from termite_toolkit import docstore

Point to a docstore server and then fill in the authentication details:

In [None]:
docstore_url = 'https://example.docstore.com:port'
user = 'user'
pw = 'pw'
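
If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `DOCSTORE_URL`, `DOCSTORE_USER` and `DOCSTORE_PW` and the helper `load_docstore_settings` are hypothetical and not part of the toolkit. It falls back to the placeholder values used above.

```python
import os


def load_docstore_settings(env=None):
    """Return (url, user, password) from a mapping, defaulting to os.environ.

    The variable names are hypothetical; rename them to match your setup.
    """
    env = os.environ if env is None else env
    return (
        env.get("DOCSTORE_URL", "https://example.docstore.com:port"),
        env.get("DOCSTORE_USER", "user"),
        env.get("DOCSTORE_PW", "pw"),
    )


# Falls back to the placeholder values when the variables are not set.
docstore_url, user, pw = load_docstore_settings()
```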

# Document-level query
We can make a document-level query of DOCstore. In this example we print the DOCstore ID of the first hit for a query on the genes HTT and EGFR.

In [None]:
docs = docstore.DocStoreRequestBuilder()
# specify docstore API endpoint and add authentication
docs.set_url(docstore_url)
docs.set_basic_auth(username=user, password=pw)
# make call to DOCstore document-level query API
docs_json = docs.get_docs(['id:GENE$HTT', 'id:GENE$EGFR'])
print(docs_json)
# print unique id of the first hit
uid = docs_json['hits'][0]['uid']
print(uid)
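
Indexing `['hits'][0]` raises an error when a query returns no hits. A minimal sketch of a more defensive lookup is shown below. It assumes only the response shape used above (a `hits` list whose items carry a `uid`); the helper `first_hit_uid` and the sample response are hypothetical stand-ins, not toolkit functions or real DOCstore output.

```python
def first_hit_uid(response):
    """Return the uid of the first hit, or None when there are no hits."""
    hits = response.get("hits") or []
    if not hits:
        return None
    return hits[0].get("uid")


# Illustrative stand-in for a DOCstore response; real responses carry more fields.
sample_response = {"hits": [{"uid": "doc-001"}, {"uid": "doc-002"}]}

print(first_hit_uid(sample_response))
print(first_hit_uid({"hits": []}))
```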

# Retrieve a specific document
We can also use the document lookup by ID to retrieve data for a specific document. For the purposes of this demo we use the ID from the previous query. The script below prints the authors of this document.

In [None]:
docs = docstore.DocStoreRequestBuilder()
# specify docstore API endpoint and add authentication if necessary
docs.set_url(docstore_url)
docs.set_basic_auth(username=user, password=pw)
# make call to document lookup by ID API (using the uid of the previous query)
docs_json = docs.get_doc_by_id(uid)
# print the authors of the document hit
print(docs_json['hits'][0]['authors'])

# Document co-occurrence dataframe
The script below looks for the occurrence of two entities in the same document. While you can retrieve the output in raw JSON format, the toolkit can also produce a dataframe from it.

In [None]:
docs = docstore.DocStoreRequestBuilder()
# specify docstore API endpoint and add authentication if necessary
docs.set_url(docstore_url)
docs.set_basic_auth(username=user, password=pw)
# make call to DOCstore document co-occurrence API
docs_json = docs.get_dcc_docs(['id:GENE$HTT', 'id:GENE$EGFR'])
# convert json to df
df = docstore.get_docstore_dcc_df(docs_json)
# display the dataframe of hits
df

# Sentence co-occurrence

In this case we're looking for documents where the two entities are mentioned in the same sentence. The output of the script below is a dataframe with one co-occurrence sentence per row.

In [None]:
docs = docstore.DocStoreRequestBuilder()
# specify docstore API endpoint and add authentication if necessary
docs.set_url(docstore_url)
docs.set_basic_auth(username=user, password=pw)
# make call to DOCstore sentence co-occurrence API
docs_json = docs.get_scc_docs(['id:GENE$HTT', 'id:GENE$EGFR'])
# convert json to df
df = docstore.get_docstore_scc_df(docs_json)
# display the dataframe of hits
df
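
Because the dataframe has one sentence per row, a common next step is to count how many co-occurrence sentences each document contributes. The sketch below uses a small stand-in dataframe. The column names `doc_id` and `sentence` are assumptions, so check `df.columns` on the real output and adjust accordingly.

```python
import pandas as pd

# Stand-in for the sentence co-occurrence dataframe; column names are hypothetical.
scc_df = pd.DataFrame(
    {
        "doc_id": ["doc-001", "doc-001", "doc-002"],
        "sentence": ["first sentence", "second sentence", "third sentence"],
    }
)

# Count rows (sentences) per document, most frequent first.
sentences_per_doc = scc_df.groupby("doc_id").size().sort_values(ascending=False)
print(sentences_per_doc)
```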

# Advanced Sentence co-occurrence

In [None]:
docs = docstore.DocStoreRequestBuilder()
# specify docstore API endpoint and add authentication if necessary
docs.set_url(docstore_url)
docs.set_basic_auth(username=user, password=pw)

# make call to DOCstore sentence co-occurrence API
# options_dict is where we add options to fine-tune the results
#
# Options used:
# limit  - maximum number of docs the sentences come from; in this case we get the n most recent
# slop   - distance allowed between the entity pairs; set to 99 for same sentence
# fields - restrict the search to specific fields, e.g. we are ignoring keywords here
docs_json = docs.get_scc_docs(['id:GENE$HTT', 'type:INDICATION'], 
                options_dict={"limit" : 1000, 
                              "slop" : 10, 
                              "fields" : "title,body"
                             })

# convert json to df
df = docstore.get_docstore_scc_df(docs_json)

# printing sample result to screen. 
print(df.head())
# write results to an output file
df.to_csv("demo_output.tsv", index=False, sep="\t", header=True)
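
To check that the file was written as expected, you can read it back with the same tab separator. The sketch below is self-contained: it writes a stand-in dataframe (with hypothetical column names) to a temporary directory rather than to `demo_output.tsv`, so it does not touch real output.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the sentence co-occurrence dataframe; column names are hypothetical.
df = pd.DataFrame(
    {
        "doc_id": ["doc-001", "doc-002"],
        "sentence": ["first example sentence", "second example sentence"],
    }
)

# Write to a temporary directory so the sketch does not overwrite real output.
out_path = os.path.join(tempfile.mkdtemp(), "demo_output.tsv")
df.to_csv(out_path, index=False, sep="\t", header=True)

# Read the file back with the same separator and compare shape and column names.
reloaded = pd.read_csv(out_path, sep="\t")
print(reloaded.shape == df.shape)
print(list(reloaded.columns) == list(df.columns))
```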

# Entity lookup
There is also an API call to look up entity synonyms. The output gives information such as entity ID, type and names. Some synonyms may match more than one entity. In those cases the JSON output will include all the possible options.

In [None]:
docs = docstore.DocStoreRequestBuilder()
# specify docstore API endpoint and add authentication if necessary
docs.set_url(docstore_url)
docs.set_basic_auth(username=user, password=pw)
# returns json with list of synonyms and their IDs
synonym = 'hedgehog'
entity_type = 'GENE'
print(docs.entity_lookup_id(synonym, entity_type))
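
The queries earlier in this notebook use terms of the form `id:GENE$HTT` and `type:INDICATION`. Once you know an entity's type and ID, a small helper can build these strings consistently. The functions below are hypothetical conveniences, not part of the toolkit, and because the shape of the lookup response is not shown here, the type and ID are passed in explicitly.

```python
def entity_id_term(entity_type, entity_id):
    """Build an ID query term such as 'id:GENE$HTT'."""
    return f"id:{entity_type}${entity_id}"


def entity_type_term(entity_type):
    """Build a type query term such as 'type:INDICATION'."""
    return f"type:{entity_type}"


# Terms matching the ones used in the sentence co-occurrence examples above.
query_terms = [entity_id_term("GENE", "HTT"), entity_type_term("INDICATION")]
print(query_terms)
```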