This notebook outlines a basic analysis of the terms used in the [DEDuCT](https://cb.imsc.res.in/deduct/) database. Many of these terms encode directed relationships with biomarkers, clinical endpoints, and the abundances, activities, or ratios of biological entities, among others.

[Gilda](https://github.com/indralab/gilda) was used for a first pass at normalizing the terms, but it lacks entries from the Disease Ontology, Experimental Factor Ontology, and Human Phenotype Ontology, which would likely be the most appropriate vocabularies. Results from the [EBI Ontology Lookup Service](https://www.ebi.ac.uk/ols/index), which could complement those from Gilda, will be added to this notebook later.
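
As a rough sketch of what such an OLS lookup might look like, the snippet below builds a search URL restricted to the three ontologies named above and flattens the hits into rows similar to the Gilda results. The endpoint URL, the ontology identifiers (`doid`, `efo`, `hp`), and the response fields (`response.docs`, `obo_id`, `label`, `ontology_prefix`) are assumptions about the public OLS search API and should be checked against its current documentation. `build_ols_url`, `parse_ols_docs`, and `search_ols` are hypothetical helper names, and only the standard library is used.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed OLS search endpoint; verify against the current OLS documentation.
OLS_SEARCH_URL = 'https://www.ebi.ac.uk/ols/api/search'
# Assumed OLS identifiers for Disease Ontology, EFO, and HPO.
OLS_ONTOLOGIES = ('doid', 'efo', 'hp')


def build_ols_url(query, ontologies=OLS_ONTOLOGIES, url=OLS_SEARCH_URL):
    """Build a search URL restricted to the given ontologies."""
    params = {'q': query, 'ontology': ','.join(ontologies)}
    return f'{url}?{urlencode(params)}'


def parse_ols_docs(query, res_json):
    """Flatten an OLS-style search response into (query, db, db_id, db_name) rows."""
    docs = res_json.get('response', {}).get('docs', [])
    return [
        (
            query,
            doc.get('ontology_prefix', '').lower(),
            doc.get('obo_id'),
            doc.get('label'),
        )
        for doc in docs
    ]


def search_ols(query):
    """Query OLS over the network and return the flattened rows."""
    with urlopen(build_ols_url(query), timeout=30) as res:
        return parse_ols_docs(query, json.load(res))
```

Keeping the parsing separate from the network call makes it possible to check the row layout against a saved response before querying every term.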

In [1]:
import json
from collections import Counter

import requests
import pandas as pd
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

In [2]:
GILDA_URL = 'http://34.201.164.108:8001'

def post_gilda(text: str, url: str = GILDA_URL) -> requests.Response:
    """Send text to GILDA."""
    return requests.post(f'{url}/ground', json={'text': text})

## Data

In [3]:
chemicals_url = 'https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv'
chemicals_df = pd.read_csv(chemicals_url, index_col=0)

In [4]:
evidences_url = 'https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ExperimentalEvidence.csv'
evidences_df = pd.read_csv(evidences_url, index_col=0)
del evidences_df['Name']

In [5]:
df = evidences_df.join(chemicals_df)
df = df[df['PubChem identifier'].notna()]
df.head()

Primary identifier,Literature identifier,Study type,Dosage unit,Tested concentration – Lower,Tested concentration – Upper,Effective concentration – Lower,Effective concentration – Upper,Endocrine-mediated endpoints,Endocrine mediated systems,CAS Number,PubChem identifier,Name,IUPAC_Name,SMILES (Canonical),INCHI,INCHI_Key
CAS:319-85-7,PMID:2423406,IVR,mg/kg,250.0,250.0,250.0,,Affects spermatogenesis,Reproductive endocrine-mediated perturbations,319-85-7,727.0,beta-Hexachlorocyclohexane,"1,2,3,4,5,6-hexachlorocyclohexane",C1(C(C(C(C(C1Cl)Cl)Cl)Cl)Cl)Cl,InChI=1S/C6H6Cl6/c7-1-2(8)4(10)6(12)5(11)3(1)9...,JLYXXMFPNIAWKQ-CDRYSYESSA-N
CAS:319-85-7,PMID:2423406,IVR,mg/kg,250.0,250.0,250.0,,Decreased ovarian weights,Reproductive endocrine-mediated perturbations,319-85-7,727.0,beta-Hexachlorocyclohexane,"1,2,3,4,5,6-hexachlorocyclohexane",C1(C(C(C(C(C1Cl)Cl)Cl)Cl)Cl)Cl,InChI=1S/C6H6Cl6/c7-1-2(8)4(10)6(12)5(11)3(1)9...,JLYXXMFPNIAWKQ-CDRYSYESSA-N
CAS:319-85-7,PMID:2423406,IVR,mg/kg,250.0,250.0,250.0,,Decreased thymus gland weights,Immunological endocrine-mediated perturbations,319-85-7,727.0,beta-Hexachlorocyclohexane,"1,2,3,4,5,6-hexachlorocyclohexane",C1(C(C(C(C(C1Cl)Cl)Cl)Cl)Cl)Cl,InChI=1S/C6H6Cl6/c7-1-2(8)4(10)6(12)5(11)3(1)9...,JLYXXMFPNIAWKQ-CDRYSYESSA-N
CAS:319-85-7,PMID:2423406,IVR,mg/kg,50.0,50.0,50.0,,Increased liver weights,Hepatic endocrine-mediated perturbations,319-85-7,727.0,beta-Hexachlorocyclohexane,"1,2,3,4,5,6-hexachlorocyclohexane",C1(C(C(C(C(C1Cl)Cl)Cl)Cl)Cl)Cl,InChI=1S/C6H6Cl6/c7-1-2(8)4(10)6(12)5(11)3(1)9...,JLYXXMFPNIAWKQ-CDRYSYESSA-N
CAS:319-85-7,PMID:2423406,IVR,mg/kg,10.0,10.0,10.0,,Increased ovarian weights,Reproductive endocrine-mediated perturbations,319-85-7,727.0,beta-Hexachlorocyclohexane,"1,2,3,4,5,6-hexachlorocyclohexane",C1(C(C(C(C(C1Cl)Cl)Cl)Cl)Cl)Cl,InChI=1S/C6H6Cl6/c7-1-2(8)4(10)6(12)5(11)3(1)9...,JLYXXMFPNIAWKQ-CDRYSYESSA-N


## Normalizing terms containing relationships

The endpoints share several common prefixes and suffixes. A prefix usually gives the relationship type (for example, "Increased" or "Affects"), and the suffix "levels" usually indicates that the abundance of the relationship's target is being measured.
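
To make the decomposition concrete, here is a minimal regex-based sketch that splits an endpoint into a relationship, a term, and an optional suffix. It is an alternative illustration, not the approach used in the cells below; `RELATIONSHIPS` is a small illustrative subset, and `ENDPOINT_RE` and `split_endpoint` are hypothetical names.

```python
import re

# Illustrative subset of relationship prefixes (hypothetical; not the full mapping).
RELATIONSHIPS = {
    'Affects': 'affect',
    'Decreased': 'decrease',
    'Increased': 'increase',
}

# Longest alternatives first so that a longer prefix wins over a shorter one.
_alternatives = '|'.join(
    re.escape(r) for r in sorted(RELATIONSHIPS, key=len, reverse=True)
)
ENDPOINT_RE = re.compile(
    rf'^(?P<relationship>{_alternatives})\b'  # relationship word
    r'(?:\s+(?:in|of)\b)?'                    # optional linking word
    r'\s*(?P<term>.*?)\s*'                    # the entity being measured
    r'(?P<suffix>levels)?$'                   # optional abundance suffix
)


def split_endpoint(endpoint):
    """Return (relationship, term, suffix), or None if no prefix matches."""
    match = ENDPOINT_RE.match(endpoint)
    if match is None:
        return None
    return (
        RELATIONSHIPS[match['relationship']],
        match['term'],
        match['suffix'] or '',
    )


print(split_endpoint('Affects spermatogenesis'))
print(split_endpoint('Increased testosterone levels'))
```

The word boundaries (`\b`) mean that a linking word is only stripped when it stands alone, so a term that merely begins with "in" or "of" is left intact.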

In [6]:
# Map each surface form found at the start of an endpoint to a canonical relationship
relationships = {
    'Affects': 'affect',
    'Decreased': 'decrease',
    'Increased': 'increase',
    'Changes': 'change',
    'Decrease': 'decrease',
    'Induced': 'increase',
    'Increase': 'increase',
    'Elevated': 'increase',
    'Reduced': 'decrease',
    'Causes': 'affect',
    'Induce': 'increase',
    'Impaired': 'decrease',
    'Altered': 'change',
    'Accumulation': 'increase',
    'Alteration': 'change',
}

In [7]:
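def get_prefix(endpoint):
    """Return the relationship word and linking word that start an endpoint."""
    for relationship in relationships:
        for prefix in ('in', 'of'):
            # The trailing space keeps 'in' from matching words such as 'insulin'
            if endpoint.startswith(f'{relationship} {prefix} '):
                return relationship, prefix
        if endpoint.startswith(relationship):
            return relationship, ''
    return None, None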


def get_suffix(endpoint):
    if endpoint.endswith('levels'):
        return 'levels'
    return ''
    
    
unnormalized_df_rows = []
normalized_df_rows = []
for endpoint in df['Endocrine-mediated endpoints'].unique():
    relationship, prefix = get_prefix(endpoint)
    suffix = get_suffix(endpoint)
    if relationship:
        prefix_length = len(f'{relationship} {prefix}')
        if suffix:
            term = endpoint[prefix_length:-len(suffix)]
        else:
            term = endpoint[prefix_length:]
            
        normalized_df_rows.append((
            relationships[relationship], 
            term.strip(),  # also drop the space left before a removed suffix
            endpoint,
        ))
    else:
        unnormalized_df_rows.append(endpoint)

normalized_df = pd.DataFrame(
    normalized_df_rows, 
    columns=['relationship', 'term', 'endpoint'],
)
normalized_df

Unnamed: 0,relationship,term,endpoint
0,affect,spermatogenesis,Affects spermatogenesis
1,decrease,ovarian weights,Decreased ovarian weights
2,decrease,thymus gland weights,Decreased thymus gland weights
3,increase,liver weights,Increased liver weights
4,increase,ovarian weights,Increased ovarian weights
5,increase,uterine weights,Increased uterine weights
6,increase,weights of adrenal gland,Increased weights of adrenal gland
7,increase,weights of pituitary gland,Increased weights of pituitary gland
8,affect,calcium signaling,Affects calcium signaling
9,affect,folliculogenesis,Affects folliculogenesis
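
To see how the endpoints distribute over relationship types, a tally like the following may help. This is a sketch over a few rows copied from the table above rather than the full `normalized_df_rows` list.

```python
from collections import Counter

# A few (relationship, term, endpoint) rows copied from the table above.
rows = [
    ('affect', 'spermatogenesis', 'Affects spermatogenesis'),
    ('decrease', 'ovarian weights', 'Decreased ovarian weights'),
    ('increase', 'liver weights', 'Increased liver weights'),
    ('increase', 'ovarian weights', 'Increased ovarian weights'),
]

# Count how many endpoints use each canonical relationship.
relationship_counts = Counter(relationship for relationship, _, _ in rows)

# Collect the relationships seen for each term, to spot terms that
# appear with more than one direction (e.g. 'ovarian weights').
directions_by_term = {}
for relationship, term, _ in rows:
    directions_by_term.setdefault(term, set()).add(relationship)

print(relationship_counts.most_common())
print({t: d for t, d in directions_by_term.items() if len(d) > 1})
```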


In [8]:
gilda_normalized_df_rows = []

it = tqdm(set(normalized_df.term), desc='Querying Gilda')
for term in it:
    res_json = post_gilda(term).json()
    if not res_json:
        continue
    for result in res_json:
        gilda_normalized_df_row = (
            term, round(result['score'], 3), result['term']['db'].lower(),
            result['term']['id'], result['term']['entry_name'] 
        )
        gilda_normalized_df_rows.append(gilda_normalized_df_row)


In [9]:
gilda_normalized_df = pd.DataFrame(
    gilda_normalized_df_rows,
    columns=['query', 'gilda_score', 'db', 'db_id', 'db_name']
)
gilda_normalized_df

Unnamed: 0,query,gilda_score,db,db_id,db_name
0,developmental process,0.778,go,GO:0032502,developmental process
1,social behavior,0.778,go,GO:0035176,social behavior
2,social behavior,0.762,mesh,D012919,Social Behavior
3,fertility,0.762,mesh,D005298,Fertility
4,glycogenolysis,0.762,mesh,D050261,Glycogenolysis
5,ovulation,0.778,go,GO:0030728,ovulation
6,ovulation,0.762,mesh,D010060,Ovulation
7,lordosis,0.762,mesh,D008141,Lordosis
8,sperm motility,0.778,go,GO:0097722,sperm motility
9,sperm motility,0.762,mesh,D013081,Sperm Motility
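
Several queries in the table return one hit per database (for example, both a `go` and a `mesh` entry for "social behavior"). If a single grounding per query is wanted, one option is to keep the highest-scoring row. The sketch below works on plain tuples in the same order as the table columns; `best_match_per_query` is a hypothetical helper, and whether the top score is the right choice is an open question for this data.

```python
# (query, gilda_score, db, db_id, db_name) rows copied from the table above.
rows = [
    ('social behavior', 0.778, 'go', 'GO:0035176', 'social behavior'),
    ('social behavior', 0.762, 'mesh', 'D012919', 'Social Behavior'),
    ('fertility', 0.762, 'mesh', 'D005298', 'Fertility'),
]


def best_match_per_query(rows):
    """Keep the highest-scoring row for each query (the first row wins ties)."""
    best = {}
    for row in rows:
        query, score = row[0], row[1]
        if query not in best or score > best[query][1]:
            best[query] = row
    return best


best = best_match_per_query(rows)
print(best['social behavior'])
```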


## Normalizing terms without relationships

The endpoints in which no relationship prefix was found were sent to Gilda unchanged, using the same lookup as above.
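
Because the lookup loop is identical for both sets of terms, it could be factored into one helper. The sketch below takes the grounding function as an argument so that it can be exercised without a running Gilda service. `ground_terms` is a hypothetical name, and the result layout (`score`, `term.db`, `term.id`, `term.entry_name`) is taken from the fields this notebook reads from Gilda responses.

```python
def ground_terms(terms, ground):
    """Build (query, score, db, db_id, db_name) rows for each grounding result.

    ``ground`` is any callable that maps a text string to a list of result
    dictionaries shaped like the Gilda responses used in this notebook.
    """
    rows = []
    for term in terms:
        # An empty or missing result list simply contributes no rows.
        for result in ground(term) or []:
            rows.append((
                term,
                round(result['score'], 3),
                result['term']['db'].lower(),
                result['term']['id'],
                result['term']['entry_name'],
            ))
    return rows
```

With the helper defined earlier in this notebook, this could be called as `ground_terms(unnormalized_df_rows, lambda text: post_gilda(text).json())`.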

In [10]:
gilda_unnormalized_df_rows = []
for term in tqdm(unnormalized_df_rows, desc='Querying Gilda'):
    res_json = post_gilda(term).json()
    if not res_json:
        continue
    for result in res_json:
        gilda_unnormalized_df_row = (
            term, round(result['score'], 3), result['term']['db'].lower(),
            result['term']['id'], result['term']['entry_name'] 
        )
        gilda_unnormalized_df_rows.append(gilda_unnormalized_df_row)


In [11]:
# Use a separate name so the results for the relationship terms are not overwritten
gilda_unnormalized_df = pd.DataFrame(
    gilda_unnormalized_df_rows,
    columns=['query', 'gilda_score', 'db', 'db_id', 'db_name']
)
gilda_unnormalized_df

Unnamed: 0,query,gilda_score,db,db_id,db_name
0,Pregnancy complications,0.772,mesh,D011248,Pregnancy Complications
1,Delayed puberty,0.549,mesh,D011628,"Puberty, Delayed"
2,Adenocarcinoma,0.778,mesh,D000230,Adenocarcinoma
3,Paralysis,0.778,mesh,D010243,Paralysis
4,Erectile dysfunction,0.772,mesh,D007172,Erectile Dysfunction
5,Follicular atresia,0.772,mesh,D005496,Follicular Atresia
6,Endometrial hyperplasia,0.772,mesh,D004714,Endometrial Hyperplasia
7,Cognitive impairment,0.549,mesh,D060825,Cognitive Dysfunction
8,Hypospadias,0.778,mesh,D007021,Hypospadias
9,Hepatocellular carcinoma,0.549,mesh,D006528,"Carcinoma, Hepatocellular"
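
In the rows shown, the lower score (0.549) coincides with queries whose wording differs from the matched name, such as "Cognitive impairment" matched to "Cognitive Dysfunction". One way to triage the results is to split them by score and review the lower band by hand. The sketch below does this on plain tuples; the cutoff of 0.6 is a hypothetical value chosen only for illustration, and `split_by_score` is a hypothetical helper.

```python
REVIEW_THRESHOLD = 0.6  # hypothetical cutoff, chosen only for illustration

# (query, gilda_score, db, db_id, db_name) rows copied from the table above.
rows = [
    ('Pregnancy complications', 0.772, 'mesh', 'D011248', 'Pregnancy Complications'),
    ('Delayed puberty', 0.549, 'mesh', 'D011628', 'Puberty, Delayed'),
    ('Adenocarcinoma', 0.778, 'mesh', 'D000230', 'Adenocarcinoma'),
    ('Cognitive impairment', 0.549, 'mesh', 'D060825', 'Cognitive Dysfunction'),
]


def split_by_score(rows, threshold=REVIEW_THRESHOLD):
    """Split rows into (accepted, needs_review) by their Gilda score."""
    accepted, needs_review = [], []
    for row in rows:
        (accepted if row[1] >= threshold else needs_review).append(row)
    return accepted, needs_review


accepted, needs_review = split_by_score(rows)
print([row[0] for row in needs_review])
```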
