# eTransafe Concordance analysis

This notebook implements the use scenario described by Thomas Steger-Hartmann in a publication with Matthew Clark.
The idea is to compare animal observations with clinical observations for drugs that occur in both domains:
1. Determine the drugs that have been used in both the preclinical and the clinical domain
2. Compare the individual MedDRA preferred terms (PTs) for preclinical and clinical findings
3. Compute the concordance matrix
    1. Aggregate the PTs at the PT, HLT, or SOC level
    2. For each aggregated term, count how many drugs fall in the TP, FP, FN, and TN bins
4. Visualize the matrix
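
The aggregation in step 3 can be sketched on toy data. The lookup table and the function name `aggregate` are illustrative assumptions; in the notebook the mapping is obtained from MedDRA.

```python
# Hypothetical PT -> SOC lookup for illustration; the notebook derives this from MedDRA.
PT_TO_SOC = {
    'Tachycardia': 'Cardiac disorders',
    'Bradycardia': 'Cardiac disorders',
    'Nausea': 'Gastrointestinal disorders',
}

def aggregate(pts, pt_to_group):
    """Replace every PT by its group and drop PTs without a known group."""
    return {pt_to_group[pt] for pt in pts if pt in pt_to_group}

preclinical = aggregate({'Tachycardia', 'Nausea'}, PT_TO_SOC)
clinical = aggregate({'Bradycardia'}, PT_TO_SOC)

# At PT level this drug has no term in common between the two domains;
# at SOC level both sides share one group.
shared = preclinical & clinical
print(shared)  # {'Cardiac disorders'}
```

Aggregating to a coarser level therefore tends to turn PT-level mismatches into matches, which is worth keeping in mind when comparing tables computed at different levels.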

(C) 2022 Erasmus University Medical Center, Rotterdam, The Netherlands
Author: Erik M. van Mulligen, e.vanmulligen@erasmusmc.nl

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from knowledgehub.api import KnowledgeHubAPI
from Concordance.condordance_utils import getClinicalDatabases, getPreclinicalDatabases
from Concordance.mapper import Mapper

import ipywidgets as w
from IPython.display import display, Javascript
from ipypublish import nb_setup
import numpy as np
import mysql.connector

import warnings
warnings.filterwarnings('ignore')

In [2]:
api = KnowledgeHubAPI(server='DEV', client_secret='3db5a6d7-4694-48a4-8a2e-e9c30d78f9ab')
mapper = Mapper(api)

## Authenticate for KnowledgeHub

In [4]:
username = w.Text(value='tester',placeholder='Knowledge Hub account', description='username:', disabled=False)
password = w.Password(value='', placeholder='Knowledge Hub password', description='password:', disabled=False)
loginBtn = w.Button(description='Login')
status = w.Output()

def on_button_clicked(_):
    if not api.login(username.value, password.value):
        print("Failed to login")
    else:
        print("successfully logged in")


loginBtn.on_click(on_button_clicked)
w.VBox([username, password, loginBtn])

VBox(children=(Text(value='tester', description='username:', placeholder='Knowledge Hub account'), Password(de…

## Authenticate for the data stored in the database

In [6]:
db = None  # set by dbLoginBtn_click once the connection succeeds

dbhost = w.Text(value='localhost',placeholder='database host', description='host:', disabled=False)
dbdatabase = w.Text(value='concordance-20220524',placeholder='database name', description='database:', disabled=False)
dbusername = w.Text(value='root',placeholder='database username', description='username:', disabled=False)
dbpassword = w.Password(value='', placeholder='database password', description='password:', disabled=False)
dbLoginBtn = w.Button(description='Login')
status = w.Output()

def dbLoginBtn_click(_):
    global db
    try:
        db = mysql.connector.connect(host=dbhost.value, database=dbdatabase.value, user=dbusername.value, password=dbpassword.value)
        print("successfully logged in database")
    except Exception as e:
        # report the reason and let the user retry instead of calling sys.exit
        print(f"failed to log in database: {e}")
    
dbLoginBtn.on_click(dbLoginBtn_click)
w.VBox([dbhost, dbdatabase, dbusername, dbpassword, dbLoginBtn])

VBox(children=(Text(value='localhost', description='host:', placeholder='database host'), Text(value='concorda…

## The database
To compute the concordance tables we constructed a database with all preclinical and clinical findings for drugs (identified by their InChIKeys) that occur in both the preclinical and the clinical data. Preclinical findings are restricted to those that are treatment related and not observed in the control group.

For each preclinical finding we checked with the semantic service whether an equivalent clinical finding exists among the clinical findings, and vice versa. The result is stored in the database per finding as 'mapped'.

For each preclinical and clinical finding we also derived the MedDRA PT it relates to. Preclinical terms are mapped through the semantic service; clinical findings are already expressed as MedDRA PTs. These mappings are stored in the database per finding as 'PT'.
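
A minimal sketch of how such a findings table can be queried per drug and domain. The table layout is an assumption based only on the columns that the notebook's own queries use (`inchi_group`, `finding_code`, `clinical`, `distance`); SQLite stands in for MySQL here, so the placeholder syntax is `?` rather than `%s`, and the function name `pt_codes` is made up.

```python
import sqlite3

# In-memory stand-in for the MySQL database; the column layout is assumed.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE findings '
            '(inchi_group TEXT, finding_code TEXT, clinical INTEGER, distance INTEGER)')
rows = [
    ('AAAAAAAAAAAAAA', 'PT_A', 0, 0),     # preclinical finding, distance set
    ('AAAAAAAAAAAAAA', 'PT_A', 1, 0),     # clinical finding, distance set
    ('AAAAAAAAAAAAAA', 'PT_B', 1, None),  # distance missing, so the query skips it
]
con.executemany('INSERT INTO findings VALUES (?, ?, ?, ?)', rows)

def pt_codes(con, inchi_group, clinical):
    """Distinct finding codes of one drug in one domain (0 = preclinical, 1 = clinical)."""
    cur = con.execute(
        'SELECT DISTINCT finding_code FROM findings '
        'WHERE inchi_group = ? AND clinical = ? AND distance IS NOT NULL',
        (inchi_group, clinical))
    return [r[0] for r in cur.fetchall()]

print(pt_codes(con, 'AAAAAAAAAAAAAA', 0))  # ['PT_A']
print(pt_codes(con, 'AAAAAAAAAAAAAA', 1))  # ['PT_A']
```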



## Drug mapping
We maintain a list of drugs that occur in both the preclinical and the clinical data, each identified by its InChI group (the first 14 characters of an InChIKey). For each drug we store the finding ids associated with it per database.
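
As an illustration of the grouping rule, the sketch below collects InChIKeys by their first 14 characters. The helper name `inchi_group` and the second key are assumptions made for the example; the other two keys appear in the drug overview of this notebook.

```python
from collections import defaultdict

def inchi_group(inchi_key):
    """First block of an InChIKey: the 14 characters before the first hyphen."""
    return inchi_key[:14]

keys = [
    'GDLIGKIOYRNHDA-UHFFFAOYSA-N',
    'GDLIGKIOYRNHDA-UHFFFAOYSA-M',  # hypothetical variant sharing the first block
    'KPYSYYIEGFHWSV-UHFFFAOYSA-N',
]

# Collect all InChIKeys that share the same first block.
by_group = defaultdict(list)
for key in keys:
    by_group[inchi_group(key)].append(key)

for group, members in by_group.items():
    print(group, len(members))
# GDLIGKIOYRNHDA 2
# KPYSYYIEGFHWSV 1
```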

In [12]:
cursor = db.cursor()
cursor.execute('SELECT inchi_group, inchi_keys, names FROM drugs')
drugs = [{'inchi_group': r[0], 'inchi_keys': r[1].split(','), 'names': r[2].split(',')} for r in cursor.fetchall()]
print(f'{len(drugs)} drugs found')

236 drugs found


## Overview of the drugs

In [13]:
pd = nb_setup.setup_pandas(escape_latex=False)
# one row per drug, showing the first InChIKey and the first name of each drug
df = pd.DataFrame({
    'inchi_group': [drug['inchi_group'] for drug in drugs],
    'inchi_key': [drug['inchi_keys'][0] for drug in drugs],
    'name': [drug['names'][0] for drug in drugs],
})
df

| | inchi_group | inchi_key | name |
|---|---|---|---|
| 0 | GDLIGKIOYRNHDA | GDLIGKIOYRNHDA-UHFFFAOYSA-N | Clomipramine HCl |
| 1 | OZVBMTJYIDMWIL | OZVBMTJYIDMWIL-AYFBDAFISA-N | Bromocriptine |
| 2 | KPYSYYIEGFHWSV | KPYSYYIEGFHWSV-UHFFFAOYSA-N | Baclofen |
| 3 | QZUDBNBUXVUHMW | QZUDBNBUXVUHMW-UHFFFAOYSA-N | Clozapin |
| 4 | ZNRGQMMCGHDTEI | ZNRGQMMCGHDTEI-ITGUQSILSA-N | Tropisetron |
| ... | ... | ... | ... |
| 231 | BMPDWHIDQYTSHX | BMPDWHIDQYTSHX-UHFFFAOYSA-N | Licarbazepine |
| 232 | VMZMNAABQBOLAK | VMZMNAABQBOLAK-DBILLSOUSA-N | Pasireotide |
| 233 | HTIQEAQVCYTUBX | HTIQEAQVCYTUBX-QGZVFWFLSA-N | Amlodipine |
| 234 | YMTINGFKWWXKFG | YMTINGFKWWXKFG-UHFFFAOYSA-N | Fenofibrate |


## Concordance table
For each drug, retrieve the preclinical and clinical PTs, aggregated to the chosen MedDRA level. For every term, each drug is then counted in one of four bins:
- true positive (TP): the term is found in both the preclinical and the clinical data of the drug
- false positive (FP): the term is found in the preclinical data but not in the clinical data of the drug
- false negative (FN): the term is found in the clinical data but not in the preclinical data of the drug
- true negative (TN): the term is found in neither the preclinical nor the clinical data of the drug
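
The four bins can be sketched for a single term as follows. The function name `count_bins`, the drug names, and the term `X` are made up for illustration; the notebook cell below performs the same counting over all terms and drugs in the database.

```python
def count_bins(term, preclinical, clinical):
    """Count drugs per bin for one term; both arguments map drug -> set of terms."""
    bins = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
    for drug in preclinical:
        in_pre = term in preclinical[drug]
        in_clin = term in clinical.get(drug, set())
        if in_pre and in_clin:
            bins['tp'] += 1
        elif in_pre:
            bins['fp'] += 1
        elif in_clin:
            bins['fn'] += 1
        else:
            bins['tn'] += 1
    return bins

# Toy data: three made-up drugs and their terms per domain.
preclinical = {'drug_a': {'X'}, 'drug_b': {'X'}, 'drug_c': set()}
clinical = {'drug_a': {'X'}, 'drug_b': set(), 'drug_c': {'X'}}

print(count_bins('X', preclinical, clinical))
# {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 0}
```

Because every drug lands in exactly one bin, the four counts for a term always add up to the number of drugs considered.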

In [14]:
from Concordance.meddra import MedDRA

level = 'soc'
pt_to_group = {}

def getPTDrugFindings(db, drug, clinical):
    cursor = db.cursor()
    # pass the values as query parameters instead of formatting them into the SQL string
    cursor.execute('SELECT DISTINCT finding_code FROM findings WHERE inchi_group = %s AND clinical = %s AND distance IS NOT NULL', (drug, clinical))
    return [r[0] for r in cursor.fetchall()]

def getAllPreclinicalClinicalDistances(db):
    cursor = db.cursor()
    cursor.execute('SELECT finding_code, min(distance) FROM findings WHERE distance is not NULL GROUP BY finding_code')
    return {finding[0]: finding[1] for finding in cursor.fetchall()}

def getAllPreClinicalClinicalPTs(db):
    cursor = db.cursor()
    cursor.execute('SELECT DISTINCT finding_code FROM findings WHERE distance IS NOT NULL')
    return [finding[0] for finding in cursor.fetchall()]

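def getGroup(meddra, pt, level):
    # map a PT code to its group at the requested MedDRA level; results are cached in pt_to_group
    if pt not in pt_to_group:
        if level == 'pt':
            group = meddra.getPt(pt)
        elif level == 'hlt':
            group = meddra.getHLT(pt)
        elif level == 'soc':
            group = meddra.getSoc(pt)
        else:
            raise ValueError(f'unsupported MedDRA level: {level}')
        pt_to_group[pt] = list(group.keys())[0] if len(group) > 0 else None
    return pt_to_group[pt]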

meddra = MedDRA(username=dbusername.value, password=dbpassword.value)
ClinicalDatabases = getClinicalDatabases(api)
PreclinicalDatabases = getPreclinicalDatabases(api)

groups = {}
preclinical_pts = {}
clinical_pts = {}
for drug in drugs:
    inchi_group = drug['inchi_group']
    preclinical_pts[inchi_group] = set([getGroup(meddra, pt, level) for pt in getPTDrugFindings(db=db, drug=inchi_group, clinical=0)])
    clinical_pts[inchi_group] = set([getGroup(meddra, pt, level) for pt in getPTDrugFindings(db=db, drug=inchi_group, clinical=1)])

all_preclinical_clinical_pts = set([getGroup(meddra, pt, level) for pt in getAllPreClinicalClinicalPTs(db=db)])
all_preclinical_clinical_distances = {getGroup(meddra, pt, level): distance for (pt, distance) in getAllPreclinicalClinicalDistances(db=db).items()}

for code in all_preclinical_clinical_pts:
    # the codes were already aggregated to the requested MedDRA level above
    group = code

    if group is not None:
        if group not in groups:
            groups[group] = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0, 'drugs': [], 'distance': all_preclinical_clinical_distances[code]}
        elif abs(groups[group]['distance']) > abs(all_preclinical_clinical_distances[code]):
            groups[group]['distance'] = all_preclinical_clinical_distances[code]

        for drug in drugs:
            inchi_group = drug['inchi_group']
            if inchi_group not in groups[group]['drugs']:
                groups[group]['drugs'].append(inchi_group)
                if code in preclinical_pts[inchi_group]:
                    if code in clinical_pts[inchi_group]:
                        groups[group]['tp'] += 1
                    else:
                        groups[group]['fp'] += 1
                else:
                    if code in clinical_pts[inchi_group]:
                        groups[group]['fn'] += 1
                    else:
                        groups[group]['tn'] += 1



## Concordance tables

In [15]:
from Concordance.condordance_utils import getName

def compute_lrp(group):
    sensitivity = compute_sensitivity(group)
    specificity = compute_specificity(group)
    if specificity is not None and sensitivity is not None:
        return sensitivity / (1 - specificity) if specificity != 1 else None
    else:
        return None

def compute_lrn(group):
    sensitivity = compute_sensitivity(group)
    specificity = compute_specificity(group)
    if specificity is not None and sensitivity is not None:
        return (1 - sensitivity) / specificity if specificity != 0 else None
    else:
        return None
    
def compute_chisquare(group):
    tp = group['tp']
    fp = group['fp']
    fn = group['fn']
    tn = group['tn']
    total = tp + fp + fn + tn
    try:
        # expected counts if preclinical and clinical findings were independent
        e11 = ((tp + fp) * (tp + fn)) / total
        e12 = ((tp + fp) * (fp + tn)) / total
        e21 = ((fn + tn) * (tp + fn)) / total
        e22 = ((fn + tn) * (fp + tn)) / total
        return (((tp - e11)**2)/e11) + (((fp - e12)**2)/e12) + (((fn - e21)**2)/e21) + (((tn - e22)**2)/e22)
    except ZeroDivisionError:
        # an empty table or an empty row or column leaves the statistic undefined
        return None

def compute_sensitivity(group):
    tp = group['tp']
    fn = group['fn']
    return tp / (tp + fn) if (tp + fn) > 0 else None

def compute_specificity(group):
    fp = group['fp']
    tn = group['tn']
    return tn / (fp + tn) if (fp + tn) > 0 else None

group_title = 'MedDRA ' + level.upper()
pd.set_option('display.max_rows', None)
pd.set_option('display.colheader_justify', 'left')
pd.options.display.float_format = '{:.2f}'.format
# one row per group with its counts and the derived concordance metrics
df = pd.DataFrame({
    group_title: [getName(meddra, code, level) for code in groups],
    'min.distance': [groups[code]['distance'] for code in groups],
    'TP': [groups[code]['tp'] for code in groups],
    'FP': [groups[code]['fp'] for code in groups],
    'FN': [groups[code]['fn'] for code in groups],
    'TN': [groups[code]['tn'] for code in groups],
    'Sensitivity': [compute_sensitivity(groups[code]) for code in groups],
    'Specificity': [compute_specificity(groups[code]) for code in groups],
    'LR+': [compute_lrp(groups[code]) for code in groups],
    'LR-': [compute_lrn(groups[code]) for code in groups],
    'chi-square': [compute_chisquare(groups[code]) for code in groups],
})
df = df.sort_values(by=['LR+'], ascending=False)
dfStyler = df.style.set_properties(**{'text-align': 'right'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
dfStyler.set_properties(subset=[group_title], **{'text-align': 'left'}).hide_index()

| MedDRA SOC | min.distance | TP | FP | FN | TN | Sensitivity | Specificity | LR+ | LR- | chi-square |
|---|---|---|---|---|---|---|---|---|---|---|
| Metabolism and nutrition disorders | 4 | 51 | 23 | 94 | 68 | 0.351724 | 0.747253 | 1.391604 | 0.867546 | 2.544733 |
| Ear and labyrinth disorders | 1 | 59 | 28 | 84 | 65 | 0.412587 | 0.698925 | 1.37038 | 0.840452 | 3.010716 |
| Cardiac disorders | 0 | 137 | 30 | 47 | 22 | 0.744565 | 0.423077 | 1.29058 | 0.603755 | 5.50724 |
| Investigations | 0 | 30 | 14 | 119 | 73 | 0.201342 | 0.83908 | 1.251198 | 0.951825 | 0.591719 |
| Eye disorders | 2 | 109 | 41 | 54 | 32 | 0.668712 | 0.438356 | 1.190633 | 0.755752 | 2.49546 |
| Vascular disorders | 5 | 100 | 28 | 77 | 31 | 0.564972 | 0.525424 | 1.190476 | 0.827957 | 1.45679 |
| Nervous system disorders | 4 | 173 | 15 | 41 | 7 | 0.808411 | 0.318182 | 1.18567 | 0.602136 | 1.97319 |
| Infections and infestations | 6 | 185 | 31 | 12 | 8 | 0.939086 | 0.205128 | 1.181431 | 0.296954 | 8.72925 |
| General disorders and administration site conditions | 1 | 151 | 36 | 35 | 14 | 0.811828 | 0.28 | 1.127539 | 0.672043 | 2.019794 |
| Psychiatric disorders | 0 | 12 | 69 | 20 | 135 | 0.375 | 0.661765 | 1.108696 | 0.944444 | 0.165858 |
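
As a worked check, the metrics of the first row (Metabolism and nutrition disorders) can be recomputed from its four counts. This is a sketch, not the notebook's own code: the function name `metrics` is illustrative, and the chi-square uses the closed form for a 2x2 table, which is algebraically the same as summing (observed - expected)^2 / expected over the four cells.

```python
def metrics(tp, fp, fn, tn):
    """Concordance metrics for one 2x2 table; assumes no row or column is empty."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return {
        'sensitivity': sensitivity,
        'specificity': specificity,
        'lr_plus': sensitivity / (1 - specificity),
        'lr_minus': (1 - sensitivity) / specificity,
        # Pearson chi-square without continuity correction, closed form for 2x2
        'chi_square': n * (tp * tn - fp * fn) ** 2
                      / ((tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)),
    }

row = metrics(tp=51, fp=23, fn=94, tn=68)
for name, value in row.items():
    print(f'{name}: {value:.6f}')
# sensitivity: 0.351724
# specificity: 0.747253
# lr_plus: 1.391604
# lr_minus: 0.867546
# chi_square: 2.544733
```

An LR+ above 1 means that a preclinical finding in this group raises the odds of a clinical finding in the same group; here the value of about 1.39 indicates only a modest increase.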


In [16]:
html = df.to_html(index=False, justify='right', border=1)

# write the HTML table to a file; the with-block closes the file even if writing fails
with open("../data/concordance_" + level.upper() + ".html", "w") as text_file:
    text_file.write(html)