# Python TERMite toolkit - TERMite

We provide a Python library for making calls to our NER engine, TERMite, as well as the TExpress module for defining more complex semantic patterns. The library also enables post-processing of the JSON returned from such requests. This notebook gives you the rundown on how to make a call to TERMite and some of the possible post-processing of the JSON output.

## Install or update Python toolkit

The Python toolkit can simply be installed by running the following command in the terminal:
```
pip3 install termite_toolkit
```
If you already have the toolkit installed, make sure you have the latest version:
```
pip3 install termite_toolkit --upgrade
```
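
To confirm which version ended up installed, the standard library's `importlib.metadata` can be queried from Python. This is a minimal sketch: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the same name used in the pip commands above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "termite_toolkit" is the distribution name used in the pip commands above
version = installed_version("termite_toolkit")
if version is None:
    print("termite_toolkit is not installed")
else:
    print(f"termite_toolkit {version} is installed")
```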

## Example call to TERMite

Making a call to TERMite with the toolkit is easy: simply ```import termite``` from the ```termite_toolkit``` and make a call.

A call is made up of:
* the TERMite API endpoint
* the entities you wish to use for annotation
* a TERMite request
* request execution

Save the TERMite call in a python script and simply run ```python ExampleCall.py``` in the terminal.

Here is some example text to annotate with a TERMite call:

In [1]:
input_text = "The data in Table 2, Row 2 suggest that Telmisartan might be useful to prevent colon cancer (note that Clopidogrel is in both the Drug and Control arm, so we did not investigate Clopidogrel further). Recent cell-based studies reported that Telmisartan exerts anti-tumor effects by activating peroxisome proliferator-activated receptor-γ (Li et al., 2014; Pu, Zhu & Kong, 2016; Wu et al., 2016b). The algorithm presented here provides the first evidence from a randomized clinical trial indicating that Telmisartan may be viable as a repurposed prevention for colon cancer. Phylloquinone (Table 2, Row 4) is a vitamin (vitamin K1) supplement rather than a prescription drug. K vitamins + sorafenib induce apoptosis in human pancreatic cancer cell lines (Wei, Wang & Carr, 2010). A prospective cohort analysis found that individuals who increased their intake of dietary phylloquinone might have a lower risk of cancer than those who did not (Juanola-Falgarona et al., 2014). The data from the randomized trial in Table 2 suggest that vitamin K1 might actually help prevent cancer (OR = 0.27, 95% CI [0.07–0.98]). The potential cancer prevention by vitamin K1 is especially intriguing because one can get more than 1,000% daily value of vitamin K1 by simply eating one cup of cooked kale or spinach (https://www.healthaliciousness.com/articles/food-sources-of-vitamin-k.php)."

Below is an example TERMite call. The API endpoint specified is TERMite's default endpoint. Here we just print the TERMite result to the screen. 

In [2]:
from pprint import pprint
from termite_toolkit import termite

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# specify entities to annotate
entities = "DRUG,INDICATION"

# initialise a request builder
t = termite.TermiteRequestBuilder()

# add items to your TERMite request
t.set_url(termite_home)
t.set_text(input_text)  # this is where we send the text to be annotated
t.set_entities(entities)  # you must specify the vocab names you would like to use for annotation
t.set_subsume(True)
t.set_input_format("txt")
t.set_output_format("json")  # you can try different output formats here e.g. "tsv"
t.set_reject_ambiguous(False)


# once the query object has been built, execute the TERMite request
termite_response = t.execute(display_request=False)

pprint(termite_response)

{'RESP_META': {'CONID': '0:0:0:0:0:0:0:1/94',
               'ENTITIES_LIMIT': '[DRUG, INDICATION]',
               'HTTP_CODE': '200',
               'INPUT_SIZE': 1373,
               'REQID': 'c7e140a4-c606-4cce-b117-3048565ce9e9-2364',
               'RUNTIME_OPTIONS': {'_termitesys.exetermite': 'true',
                                   '_termitesys.exetexpress': 'false',
                                   'rejectAmbig': 'false',
                                   'subsume': 'true'},
               'TERMITE_RUNTIME': 'default',
               'TERMITE_VERS': '6.4.9',
               'Timing_msec_TOTAL': '1',
               '_READY_FORMATTED_WITH': 'com.scibite.termitej.formatter.streamers.JsonStreamFormatter'},
 'RESP_MULTIDOC_PAYLOAD': {'_document': {'DRUG': [{'dependencyMet': True,
                                                   'dictSynList': ['telmisartan',
                                                                   'telmisartan',
                                     

To understand the JSON output of TERMite results [click here](https://help.scibite.com/a/solutions/articles/179705-anatomy-of-a-termite-hit).
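
As a rough illustration of that nesting, the sketch below walks a trimmed, hand-written dict shaped like the response printed above (`RESP_MULTIDOC_PAYLOAD`, then document ID, then entity type, then a list of hits). The helper `iter_hits` is hypothetical and not part of the toolkit, and it assumes each hit carries the `hitID`, `name` and `hitCount` fields that appear in the toolkit outputs elsewhere in this notebook.

```python
# Trimmed, hand-written sample in the shape printed above:
# RESP_MULTIDOC_PAYLOAD -> document ID -> entity type -> list of hits
sample_response = {
    "RESP_META": {"HTTP_CODE": "200"},
    "RESP_MULTIDOC_PAYLOAD": {
        "_document": {
            "DRUG": [
                {"hitID": "CHEMBL1017", "name": "Telmisartan", "hitCount": 3},
                {"hitID": "CHEMBL1336", "name": "Sorafenib", "hitCount": 1},
            ],
            "INDICATION": [
                {"hitID": "D010190", "name": "Pancreatic Neoplasms", "hitCount": 1},
            ],
        }
    },
}


def iter_hits(response):
    """Yield (doc_id, entity_type, hit) for every hit in the payload."""
    payload = response.get("RESP_MULTIDOC_PAYLOAD", {})
    for doc_id, entity_types in payload.items():
        for entity_type, hits in entity_types.items():
            for hit in hits:
                yield doc_id, entity_type, hit


# Flatten the nested payload into simple tuples
rows = [
    (doc_id, entity_type, hit["hitID"], hit["name"], hit["hitCount"])
    for doc_id, entity_type, hit in iter_hits(sample_response)
]
for row in rows:
    print(row)
```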

Use ```help(termite.TermiteRequestBuilder)``` to view the documentation for ```TermiteRequestBuilder()```, including its available functions and how they set the runtime options.

Once you are familiar with making a call in Python, you can also annotate files and pass the TERMite options as a Python dict (the available options are listed on your TERMite server homepage), as in the example below:


In [3]:
from pprint import pprint
from termite_toolkit import termite
import sys
import os

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# input file

parentDir = os.path.dirname(os.path.dirname(os.path.abspath("__file__")))  # this line relatively locates the parent directory
input_file = os.path.join(parentDir, 'sample_scripts/medline_sample.zip')  

# TERMite options
options = {"format": "medline.xml", "output": "json", "entities": "DRUG,GENE,INDICATION"}

# TERMite call as JSON result
termite_json_response = termite.annotate_files(termite_home, input_file, options)

In [4]:
termite.payload_records(termite_json_response)

[{'invalidPositions': False,
  'ft': False,
  'synState': [1, 1, 1, 1, 1, 1],
  'hitCount': 6,
  'sourceTitle': '',
  'sourceID': '',
  'docTitle': '',
  'docID': '26377028',
  'hitID': 'D007249',
  'name': 'Inflammation',
  'frag_vector_array': ["1#s mapping of small bowel {!inflammation!} in Crohn's disease (CD).",
   '2#value of the severity of {!inflammatory!} lesions, quantified by t',
   '6#with moderate or severe {!inflammatory!} activity (LS ≥790) and t',
   '6#790) and those with mild {!inflammatory!} activity (135 ≤ LS <',
   '10#with moderate to severe {!inflammatory!} activity there were high',
   '11#he degree of small bowel {!inflammatory!} activity with SBCE and L'],
  'totnosyns': 2,
  'goodSynCount': 6,
  'nonambigsyns': 2,
  'score': 4,
  'hit_loc_vector': [1, 2, 6, 6, 10, 11],
  'word_pos_array': [10, 11, 18, 26, 6, 7],
  'exact_string': '1#67-79,2#164-176,6#817-829,6#869-881,10#1640-1652,11#1807-1819',
  'exact_array': [{'byteStart': 0,
    'byteEnd': 0,
    'subsum
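
The `frag_vector_array` and `exact_string` fields in the record above pack several values into plain strings. The sketch below unpacks them with the standard library, assuming (as the sample output suggests) that the number before `#` is the location index also listed in `hit_loc_vector`, that `{!...!}` delimits the matched text, and that `start-end` are offsets into the document. The helper names are hypothetical and not part of the toolkit.

```python
import re

# Values copied from the record printed above (shortened)
frag_vector_array = [
    "1#s mapping of small bowel {!inflammation!} in Crohn's disease (CD).",
    "2#value of the severity of {!inflammatory!} lesions, quantified by t",
]
exact_string = "1#67-79,2#164-176,6#817-829"

FRAGMENT = re.compile(r"^(\d+)#(.*)$", re.DOTALL)
MATCHED = re.compile(r"\{!(.*?)!\}")


def parse_fragment(fragment):
    """Split 'N#text with {!match!}' into (N, plain text, [matches])."""
    location, text = FRAGMENT.match(fragment).groups()
    matches = MATCHED.findall(text)
    plain = MATCHED.sub(r"\1", text)  # drop the {! !} markers, keep the words
    return int(location), plain, matches


def parse_exact_string(value):
    """Split 'N#start-end,...' into (N, start, end) integer tuples."""
    spans = []
    for item in value.split(","):
        location, offsets = item.split("#")
        start, end = offsets.split("-")
        spans.append((int(location), int(start), int(end)))
    return spans


parsed_fragments = [parse_fragment(f) for f in frag_vector_array]
parsed_spans = parse_exact_string(exact_string)
print(parsed_fragments[0])
print(parsed_spans)
```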

## TERMite toolkit library

The standard JSON output is the richest format, but it isn't the most human-friendly.

The TERMite toolkit has many built-in functions for parsing outputs. For example, ```get_entity_hits_from_json()``` takes a JSON TERMite response and returns a summary of the hits with additional filtering rules applied. The returned object is a Python dict indexed by entity ID, with associated frequency counts.

Below is an example of post-processing of the results from our first TERMite example call; we've filtered the TERMite hits so that we're only looking at DRUG hits.

In [5]:
filtered_hits = termite.get_entity_hits_from_json(termite_response, 'DRUG', reject_ambig=False)

pprint(filtered_hits)

{'DRUG$CHEMBL1017': {'doc_count': 1,
                     'doc_id': ['_document'],
                     'hit_count': 3,
                     'id': 'CHEMBL1017',
                     'max_relevance_score': 4,
                     'name': 'Telmisartan',
                     'type': 'DRUG'},
 'DRUG$CHEMBL1336': {'doc_count': 1,
                     'doc_id': ['_document'],
                     'hit_count': 1,
                     'id': 'CHEMBL1336',
                     'max_relevance_score': 1,
                     'name': 'Sorafenib',
                     'type': 'DRUG'},
 'DRUG$CHEMBL1550': {'doc_count': 1,
                     'doc_id': ['_document'],
                     'hit_count': 6,
                     'id': 'CHEMBL1550',
                     'max_relevance_score': 4,
                     'name': 'Phytonadione',
                     'type': 'DRUG'},
 'DRUG$CHEMBL1771': {'doc_count': 1,
                     'doc_id': ['_document'],
                     'hit_count': 2,
           
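
Because the summary is an ordinary dict, it can be ranked or filtered with plain Python. The following sketch copies the three complete entries from the output above into a variable named `sample_hits` (a hypothetical name, used so the example runs without a TERMite server) and orders them by `hit_count`.

```python
# The three complete entries from the output above; the fourth entry is
# truncated in that output and is left out here
sample_hits = {
    "DRUG$CHEMBL1017": {"doc_count": 1, "doc_id": ["_document"], "hit_count": 3,
                        "id": "CHEMBL1017", "max_relevance_score": 4,
                        "name": "Telmisartan", "type": "DRUG"},
    "DRUG$CHEMBL1336": {"doc_count": 1, "doc_id": ["_document"], "hit_count": 1,
                        "id": "CHEMBL1336", "max_relevance_score": 1,
                        "name": "Sorafenib", "type": "DRUG"},
    "DRUG$CHEMBL1550": {"doc_count": 1, "doc_id": ["_document"], "hit_count": 6,
                        "id": "CHEMBL1550", "max_relevance_score": 4,
                        "name": "Phytonadione", "type": "DRUG"},
}

# Rank entities by how often they were hit, most frequent first
ranked = sorted(sample_hits.values(), key=lambda hit: hit["hit_count"], reverse=True)
ranked_names = [(hit["name"], hit["hit_count"]) for hit in ranked]
print(ranked_names)
```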

We've also added functionality to convert the json and doc.JSONx outputs into a pandas dataframe, either by individual hits or grouped by TERMite ID.

In [6]:
termite.get_termite_dataframe(termite_response, reject_ambig = False).head()

|  | docID | entityType | hitID | name | score | realSynList | totnosyns | nonambigsyns | frag_vector_array | hitCount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _document | DRUG | CHEMBL1017 | Telmisartan | 4 | [Telmisartan, Telmisartan, Telmisartan] | 1 | 1 | [1#le 2, Row 2 suggest that {!Telmisartan!} mi... | 3 |
| 1 | _document | DRUG | CHEMBL1336 | Sorafenib | 1 | [sorafenib] | 1 | 1 | [5#K vitamins + {!sorafenib!} induce apoptosis... | 1 |
| 2 | _document | DRUG | CHEMBL1550 | Phytonadione | 4 | [Phylloquinone, vitamin K1, phylloquinone, vit... | 2 | 2 | [4#{!Phylloquinone!} (Table 2, Row 4) is a vi,... | 6 |
| 3 | _document | DRUG | CHEMBL1771 | Clopidogrel Bisulfate | 1 | [Clopidogrel, Clopidogrel] | 1 | 1 | [1#colon cancer (note that {!Clopidogrel!} is ... | 2 |
| 4 | _document | INDICATION | D010190 | Pancreatic Neoplasms | 1 | [pancreatic cancer] | 1 | 1 | [5#nduce apoptosis in human {!pancreatic cance... | 1 |


In [7]:
termite.all_entities_df(termite_json_response).head()

|  | id | type | name | hit_count | max_relevance_score | doc_id | doc_count |
|---|---|---|---|---|---|---|---|
| INDICATION$D007249 | D007249 | INDICATION | Inflammation | 468 | 5 | [26377028, 26351389, 26351387, 26254470, 26209... | 244 |
| INDICATION$D003424 | D003424 | INDICATION | Crohn Disease | 2084 | 5 | [26377028, 26374663, 26374662, 26351391, 26351... | 476 |
| INDICATION$D009164 | D009164 | INDICATION | Mycobacterium Infections | 28 | 5 | [26374663, 26417047, 24768214, 25398152, 24360... | 12 |
| INDICATION$D014376 | D014376 | INDICATION | Tuberculosis | 31 | 5 | [26374663, 26417047, 25398152, 24360259, 24295... | 11 |
| INDICATION$D005402 | D005402 | INDICATION | Fistula | 76 | 5 | [26374663, 26589956, 26223842, 26512136, 26351... | 26 |


We've made it easier to identify which VOCabs have hits within the TERMite input, their frequencies, and the most frequent hits:

In [8]:
termite.all_entities(termite_json_response)

['INDICATION', 'DRUG', 'GENE']

In [9]:
termite.entity_freq(termite_json_response)

|  | entityType |
|---|---|
| INDICATION | 3183 |
| DRUG | 683 |
| GENE | 538 |
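
These counts appear to be the number of hit rows per entity type: they sum to 4404, which is the length of the per-document hit table at the end of this notebook. How `entity_freq` computes them internally is an assumption, but the same kind of tally can be sketched with `collections.Counter` over a few rows taken from the first dataframe shown earlier. The variable names here are hypothetical.

```python
from collections import Counter

# (entityType, name) pairs taken from the first dataframe shown earlier
sample_rows = [
    ("DRUG", "Telmisartan"),
    ("DRUG", "Sorafenib"),
    ("DRUG", "Phytonadione"),
    ("DRUG", "Clopidogrel Bisulfate"),
    ("INDICATION", "Pancreatic Neoplasms"),
]

# Count one row per hit, grouped by entity type
entity_counts = Counter(entity_type for entity_type, _ in sample_rows)
for entity_type, count in entity_counts.most_common():
    print(entity_type, count)
```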


In [10]:
termite.top_hits_df(termite_json_response, selection=5)

|  | name | realSynList | totnosyns | hitID |
|---|---|---|---|---|
| 694 | Inflammatory Bowel Diseases | [Inflammatory Bowel Disease, inflammatory bowe... | 2 | D015212 |
| 3443 | Inflammatory Bowel Diseases | [inflammatory bowel diseases, inflammatory bow... | 3 | D015212 |
| 3856 | Obesity | [overweight, obesity, Obesity, overweight, Bod... | 5 | D009765 |
| 920 | nucleotide binding oligomerization domain cont... | [NOD2, CARD15, NOD2, CARD15, NOD2, CARD15, NOD... | 2 | NOD2 |
| 3561 | Infliximab | [Infliximab, infliximab, infliximab, Inflixima... | 1 | CHEMBL1201581 |


We've also made it possible to get the hits from a specific set of vocabs for each document:

In [15]:
filter_entity_types = ['DRUG', 'INDICATION','GENE']
termite.termite_entity_hits_df(termite_json_response, filter_entity_types)['INDICATION']

0                      Inflammation
1                     Crohn Disease
2          Mycobacterium Infections
3                      Tuberculosis
4                           Fistula
                   ...             
4399    Inflammatory Bowel Diseases
4400         Venous Thromboembolism
4401                   Inflammation
4402                  Crohn Disease
4403              Venous Thrombosis
Name: INDICATION, Length: 4404, dtype: object