# Python TERMite toolkit - TERMite

We provide a Python library for making calls to our NER engine, TERMite, as well as to the TExpress module for defining more complex semantic patterns. The library also enables post-processing of the JSON returned from such requests. This notebook shows how to make a call to TERMite and walks through some of the post-processing available for the JSON output.

## Install or update Python toolkit

The Python toolkit can simply be installed by running the following command in the terminal:
```
pip3 install termite_toolkit
```
If you already have the toolkit installed, make sure you have the latest version:
```
pip3 install termite_toolkit --upgrade
```
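
To confirm which version is currently installed before or after upgrading, the standard library can report it. This is a minimal sketch: `installed_version` is a hypothetical helper written for this notebook, and the distribution name is assumed to match the one used in the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "termite_toolkit" is the distribution name used in the pip commands above
current = installed_version("termite_toolkit")
print(current if current else "termite_toolkit is not installed")
```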

## Example call to TERMite

Making a call to TERMite with the toolkit is easy: simply ```import termite``` from the ```termite_toolkit``` and make a call.

A call is made up of:
* the TERMite API endpoint
* the entities you wish to use for annotation
* a TERMite request
* request execution

Save the TERMite call in a Python script (for example ```ExampleCall.py```) and run ```python ExampleCall.py``` in the terminal.

This is some example text we can make a TERMite call on:

In [None]:
input_text = "The data in Table 2, Row 2 suggest that Telmisartan might be useful to prevent colon cancer (note that Clopidogrel is in both the Drug and Control arm, so we did not investigate Clopidogrel further). Recent cell-based studies reported that Telmisartan exerts anti-tumor effects by activating peroxisome proliferator-activated receptor-γ (Li et al., 2014; Pu, Zhu & Kong, 2016; Wu et al., 2016b). The algorithm presented here provides the first evidence from a randomized clinical trial indicating that Telmisartan may be viable as a repurposed prevention for colon cancer. Phylloquinone (Table 2, Row 4) is a vitamin (vitamin K1) supplement rather than a prescription drug. K vitamins + sorafenib induce apoptosis in human pancreatic cancer cell lines (Wei, Wang & Carr, 2010). A prospective cohort analysis found that individuals who increased their intake of dietary phylloquinone might have a lower risk of cancer than those who did not (Juanola-Falgarona et al., 2014). The data from the randomized trial in Table 2 suggest that vitamin K1 might actually help prevent cancer (OR = 0.27, 95% CI [0.07–0.98]). The potential cancer prevention by vitamin K1 is especially intriguing because one can get more than 1,000% daily value of vitamin K1 by simply eating one cup of cooked kale or spinach (https://www.healthaliciousness.com/articles/food-sources-of-vitamin-k.php)."

Below is an example TERMite call. The API endpoint specified is TERMite's default endpoint. Here we just print the TERMite result to the screen. 

In [None]:
from pprint import pprint
from termite_toolkit import termite

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# specify entities to annotate
entities = "DRUG,INDICATION"

# initialise a request builder
t = termite.TermiteRequestBuilder()

# add items to your TERMite request
t.set_url(termite_home)
t.set_text(input_text)  # this is where we send the text to be annotated
t.set_entities(entities)  # you must specify the vocab names you would like to use for annotation
t.set_subsume(True)
t.set_input_format("txt")
t.set_output_format("json")  # you can try different output formats here e.g. "tsv"
t.set_reject_ambiguous(False)


# once the query object has been built, execute the TERMite request
termite_response = t.execute(display_request=False)

pprint(termite_response)

To understand the JSON output of TERMite results [click here](https://help.scibite.com/a/solutions/articles/179705-anatomy-of-a-termite-hit).

Run ```help(termite.TermiteRequestBuilder)``` to view the documentation for ```TermiteRequestBuilder()```, including the available functions and how they set the runtime options.

Once you are familiar with making a call in Python, you can also annotate files and pass the TERMite options as a Python dict (the available options can be viewed on your TERMite server homepage), as in the example below:


In [None]:
from pprint import pprint
from termite_toolkit import termite
import os

# specify termite API endpoint
termite_home = "http://localhost:9090/termite"

# input file

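# "__file__" is quoted so this line also works in a notebook, where __file__ is not defined:
# abspath() resolves against the current working directory and the two dirname() calls return its parent
parentDir = os.path.dirname(os.path.dirname(os.path.abspath("__file__")))
input_file = os.path.join(parentDir, 'sample_scripts/medline_sample.zip')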

# TERMite options
options = {"format": "medline.xml", "output": "json", "entities": "DRUG,GENE,INDICATION"}

# TERMite call as JSON result
termite_json_response = termite.annotate_files(termite_home, input_file, options)

In [None]:
termite.payload_records(termite_json_response)
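
When the list of vocabs is held as a Python list, the options dict can be assembled programmatically. The sketch below is illustrative only: `build_options` is a hypothetical helper, not part of the toolkit, and it uses only the three option keys that appear in the example above.

```python
def build_options(entities, input_format="txt", output_format="json"):
    """Assemble a TERMite options dict from Python values.

    `entities` may be a list of vocab names or an already comma-separated string.
    """
    if isinstance(entities, str):
        entity_string = entities
    else:
        entity_string = ",".join(entities)
    return {"format": input_format, "output": output_format, "entities": entity_string}


# reproduces the options dict from the file example above
options = build_options(["DRUG", "GENE", "INDICATION"], input_format="medline.xml")
print(options)
```

The resulting dict can be passed wherever the hand-written ```options``` dict was used.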

## TERMite toolkit library

The standard JSON output is the richest output format, but it isn't the most human-friendly.

The TERMite toolkit has many built-in functions for parsing outputs. For example, ```get_entity_hits_from_json()``` takes a JSON TERMite response and returns a summary of the hits with additional filtering rules applied. The returned object is a Python dict indexed by entity ID, with associated frequency counts.

Below is an example of post-processing of the results from our first TERMite example call; we've filtered the TERMite hits so that we're only looking at DRUG hits.

In [None]:
filtered_hits = termite.get_entity_hits_from_json(termite_response, 'DRUG', reject_ambig=False)

pprint(filtered_hits)
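
To make the idea of a filtered, ID-indexed summary concrete without a running server, here is a rough sketch on a hand-written stand-in response. It is not the toolkit's implementation: `summarise_hits` is a hypothetical function, the IDs are placeholders, and the response layout and field names (```RESP_PAYLOAD```, ```hitID```, ```hitCount```, ```nonambigsyns```) are assumptions that should be checked against the output of your own server.

```python
# Hand-written stand-in for a TERMite JSON response (layout and field names assumed)
mock_response = {
    "RESP_PAYLOAD": {
        "DRUG": [
            {"hitID": "DRUG_001", "name": "Telmisartan", "hitCount": 3, "nonambigsyns": 1},
            {"hitID": "DRUG_002", "name": "Clopidogrel", "hitCount": 2, "nonambigsyns": 0},
        ],
        "INDICATION": [
            {"hitID": "IND_001", "name": "colon cancer", "hitCount": 2, "nonambigsyns": 1},
        ],
    }
}


def summarise_hits(response, entity_types, reject_ambig=True):
    """Return {hitID: hit count} for the chosen comma-separated entity types."""
    wanted = {name.strip() for name in entity_types.split(",")}
    summary = {}
    for entity_type, hits in response.get("RESP_PAYLOAD", {}).items():
        if entity_type not in wanted:
            continue
        for hit in hits:
            # assumption: a hit with no unambiguous synonyms counts as ambiguous
            if reject_ambig and hit.get("nonambigsyns", 0) == 0:
                continue
            summary[hit["hitID"]] = summary.get(hit["hitID"], 0) + hit["hitCount"]
    return summary


print(summarise_hits(mock_response, "DRUG", reject_ambig=False))
print(summarise_hits(mock_response, "DRUG", reject_ambig=True))
```

With ambiguity rejection switched on, the second placeholder drug is dropped from the summary because its stand-in record has no unambiguous synonyms.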

We've also added functionality to convert the json and doc.JSONx outputs into a pandas dataframe, either by individual hits or grouped by TERMite ID.

In [None]:
termite.get_termite_dataframe(termite_response, reject_ambig=False).head()

In [None]:
termite.all_entities_df(termite_json_response)
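
Conceptually, a per-hit table is the nested JSON flattened into one row per hit. The sketch below shows that flattening with plain Python on a hand-written stand-in response. `hits_to_rows` is a hypothetical helper, the column names are chosen for illustration, and the field names (```RESP_PAYLOAD```, ```hitID```, ```name```, ```hitCount```) are assumptions about the response layout.

```python
# Hand-written stand-in for a TERMite JSON response (layout and field names assumed)
mock_response = {
    "RESP_PAYLOAD": {
        "DRUG": [{"hitID": "DRUG_001", "name": "Telmisartan", "hitCount": 3}],
        "INDICATION": [{"hitID": "IND_001", "name": "colon cancer", "hitCount": 2}],
    }
}


def hits_to_rows(response):
    """Flatten the nested payload into a list of flat dicts, one per hit."""
    rows = []
    for entity_type, hits in response.get("RESP_PAYLOAD", {}).items():
        for hit in hits:
            rows.append(
                {
                    "entity_type": entity_type,
                    "hit_id": hit["hitID"],
                    "name": hit["name"],
                    "hit_count": hit["hitCount"],
                }
            )
    return rows


rows = hits_to_rows(mock_response)
for row in rows:
    print(row)
```

A list of flat dicts like ```rows``` can be handed to ```pandas.DataFrame(rows)``` to obtain a dataframe with one column per key.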


We've made it easier to identify which VOCabs have hits within the TERMite input, their frequencies, and the most frequent hits:

In [None]:
termite.all_entities(termite_json_response)

In [None]:
termite.entity_freq(termite_json_response)

In [None]:
termite.top_hits_df(termite_json_response, selection=5)
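
The kind of aggregation behind these summaries can be sketched with ```collections.Counter``` on a hand-written stand-in response. This is an illustration under assumptions, not the toolkit's code: `vocab_frequencies` and `top_hits` are hypothetical names, and the field names (```RESP_PAYLOAD```, ```name```, ```hitCount```) are assumed.

```python
from collections import Counter

# Hand-written stand-in for a TERMite JSON response (layout and field names assumed)
mock_response = {
    "RESP_PAYLOAD": {
        "DRUG": [
            {"name": "Telmisartan", "hitCount": 3},
            {"name": "Clopidogrel", "hitCount": 2},
        ],
        "INDICATION": [{"name": "colon cancer", "hitCount": 4}],
    }
}


def vocab_frequencies(response):
    """Total hit count per vocab (entity type)."""
    counts = Counter()
    for entity_type, hits in response.get("RESP_PAYLOAD", {}).items():
        counts[entity_type] += sum(hit["hitCount"] for hit in hits)
    return counts


def top_hits(response, selection=5):
    """The `selection` most frequent hits as (name, count) pairs, highest first."""
    pairs = [
        (hit["name"], hit["hitCount"])
        for hits in response.get("RESP_PAYLOAD", {}).values()
        for hit in hits
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:selection]


print(vocab_frequencies(mock_response))
print(top_hits(mock_response, selection=2))
```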

We've also made it possible to get the hits from a specific set of vocabs for each document:

In [None]:
filter_entity_types = ['DRUG', 'INDICATION', 'GENE']
termite.termite_entity_hits_df(termite_json_response, filter_entity_types)['INDICATION']