# Drug Name Normalization
In this notebook, we evaluate our normalizer with three drug name gold standards and present two use cases.

In [1]:
import sys
sys.path.append('../../')

In [2]:
from preon.normalization import PrecisionOncologyNormalizer
from preon.drug import load_ebi_drugs, load_charite_drug_goldstandard, load_database_drug_goldstandard, load_ctg_drug_goldstandard
from preon.tests.utils import f1_score

In [3]:
import pandas as pd
import daproli as dp

Let's first load the reference drug names from EBI and fit the normalizer.

In [4]:
drug_names, chembl_ids = load_ebi_drugs()
normalizer = PrecisionOncologyNormalizer().fit(drug_names, chembl_ids)

Now we can evaluate it using the provided gold standards.

In [5]:
goldstandards = [
    ("charite", load_charite_drug_goldstandard),
    ("database", load_database_drug_goldstandard),
    ("ctg", load_ctg_drug_goldstandard)
]

In [6]:
for dataset_name, load_dataset in goldstandards:
    drug_names, chembl_ids, _ = load_dataset()
    df_eval = normalizer.evaluate(drug_names, chembl_ids)
    print(f"{dataset_name}: f1_score={f1_score(df_eval)}")

charite: f1_score=0.918918918918919
database: f1_score=0.935251798561151
ctg: f1_score=0.9341317365269461


# User Search
Next, we demonstrate how to use preon to search for drug names.

In [7]:
normalizer.query("Avastin")

(['avastin'], [['CHEMBL1201583']], {'match_type': 'exact'})

We can search for drug names and retrieve their CHEMBL ids by querying the normalizer. The query returns three things: a list of matching normalized drug names (here ['avastin']), a list of associated CHEMBL ids for every returned drug name ([['CHEMBL1201583']]), and some meta information about the match ({'match_type': 'exact'}). We can also search for multi-token drug names like "Ixabepilone Epothilone B analog" and find CHEMBL ids for the relevant tokens.
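The three parts of the returned tuple line up by position. The following is a minimal sketch of unpacking it, using the result shown above as a literal so that it runs without preon installed; the variable names are our own.

```python
# Result copied from the query above; in the notebook it would come from
# normalizer.query("Avastin").
result = (['avastin'], [['CHEMBL1201583']], {'match_type': 'exact'})

names, ids_per_name, meta = result

# names[i] belongs to ids_per_name[i], so the two lists can be zipped.
name_to_ids = dict(zip(names, ids_per_name))

print(name_to_ids)
print(meta['match_type'])
```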

In [8]:
normalizer.query("Ixabepilone Epothilone B analog")

(['ixabepilone'], [['CHEMBL1201752']], {'match_type': 'substring'})

We find the relevant drug name ['ixabepilone'], and preon reports in the meta information that the match is based on a substring. By default, preon only looks for single matching tokens. It can also look for n-grams if we set the n_grams parameter of the query method. Let's take a harder example, say "Isavuconazonium", but misspell it as "Isavuconaconium".
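To make the idea of token n-grams concrete, here is a small sketch of how candidate substrings could be generated from a multi-token query. It is an illustration under our own assumptions, not preon's implementation: the helper `token_ngrams` is hypothetical, and we assume tokens are split on whitespace and lower-cased.

```python
def token_ngrams(query, n):
    """Return all runs of 1 to n consecutive whitespace-separated tokens."""
    tokens = query.lower().split()
    grams = []
    for size in range(1, n + 1):
        # Slide a window of `size` tokens over the query.
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


candidates = token_ngrams("Ixabepilone Epothilone B analog", 2)
print(candidates)
```

With `n=1` only the single tokens are produced, which corresponds to the default behaviour described above; larger values add multi-word candidates such as "epothilone b".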

In [9]:
normalizer.query("Isavuconaconium")

(['isavuconazonium'],
 [['CHEMBL1183349']],
 {'match_type': 'partial', 'edit_distance': 0.06666666666666667})

preon finds the correct drug "Isavuconazonium" and reports in the meta information that it is a partial match with an edit distance of about 0.07 (7%). By default, it returns drug names with a distance smaller than 20%. To change this, set the threshold argument of the query method.
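The reported value of 0.0667 is consistent with one character edit in a 15-character name (1/15). The sketch below reproduces that number with a plain Levenshtein distance divided by the length of the longer string. This normalization is our assumption; preon may compute the distance differently, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance scaled by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


distance = normalized_distance("isavuconaconium", "isavuconazonium")
is_match = distance < 0.2  # 20% threshold mentioned in the text
print(distance, is_match)
```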

# Data Integration
We use preon in the PREDICT project to integrate drug names from different sources. Below, we give an overview of how to do so.

In [10]:
db_names, _, _ = load_database_drug_goldstandard()
ch_names, _, _ = load_charite_drug_goldstandard()

Let's say we wanted to integrate the drug names from the database and charite gold standards. We would normalize both lists of names and join on the returned CHEMBL ids.

In [11]:
normalizer.transform(db_names)

| | Name | Found Names | Found Name IDs | Match Type | Edit Distance | Query Time |
|---|---|---|---|---|---|---|
| 0 | LOXO-292 | [loxo292] | [[CHEMBL4559134]] | exact | | 0.000140 |
| 1 | H3B-8800 | [h3b8800] | [[CHEMBL4802174]] | exact | | 0.000083 |
| 2 | Tubulins | [] | [[None]] | none | | 0.113431 |
| 3 | Androgen Deprivation | [] | [[None]] | none | | 0.131318 |
| 4 | Mitomycin | [mitomycin] | [[CHEMBL105]] | exact | | 0.000087 |
| ... | ... | ... | ... | ... | ... | ... |
| 71 | Thioguanine | [thioguanine] | [[CHEMBL727]] | exact | | 0.000021 |
| 72 | Lenalidomide | [lenalidomide] | [[CHEMBL848]] | exact | | 0.000021 |
| 73 | Gemcitabine | [gemcitabine] | [[CHEMBL888]] | exact | | 0.000022 |
| 74 | Gefitinib | [gefitinib] | [[CHEMBL939]] | exact | | 0.000020 |


The transform method returns a pandas dataframe with one row per query name and the corresponding annotations. Let's normalize the drug names from both gold standards and, for each name, keep only the first CHEMBL id that was found.

In [12]:
db_df = normalizer.transform(db_names)
db_df["Found Name IDs"] = db_df["Found Name IDs"].apply(dp.flatten).apply(lambda ids: ids[0])

ch_df = normalizer.transform(ch_names)
ch_df["Found Name IDs"] = ch_df["Found Name IDs"].apply(dp.flatten).apply(lambda ids: ids[0])
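As a plain-Python illustration of what the two chained `apply` calls do to a single cell, the sketch below flattens a nested id list by one level and keeps the first entry. It assumes that `dp.flatten` removes one level of nesting; the helper `first_id` is hypothetical and, unlike `ids[0]`, returns None instead of raising when the list is empty.

```python
from itertools import chain


def first_id(nested_ids):
    """Flatten one level of nesting and return the first id, or None."""
    flat = list(chain.from_iterable(nested_ids))
    return flat[0] if flat else None


# Cell values as they appear in the "Found Name IDs" column above.
print(first_id([['CHEMBL105']]))  # a matched name
print(first_id([[None]]))         # an unmatched name
```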

We can now extract the query drug names and the found CHEMBL ids from both sources and relate them to each other.

In [13]:
db_names, db_ids = db_df.Name.tolist(), db_df["Found Name IDs"].tolist()
ch_names, ch_ids = ch_df.Name.tolist(), ch_df["Found Name IDs"].tolist()
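One way to relate the two sources is to group the names by CHEMBL id and keep the ids that occur in both. The following is a minimal sketch using only the standard library. The sample lists are illustrative stand-ins for the lists built above (the ids are taken from the table shown earlier, the second source's names are made up), and `group_by_id` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative stand-ins for db_names/db_ids and ch_names/ch_ids.
sample_db_names = ["Mitomycin", "Gefitinib", "Tubulins"]
sample_db_ids = ["CHEMBL105", "CHEMBL939", None]
sample_ch_names = ["gefitinib", "Lenalidomide"]
sample_ch_ids = ["CHEMBL939", "CHEMBL848"]


def group_by_id(names, ids):
    """Map each CHEMBL id to the query names that were normalized to it."""
    grouped = defaultdict(list)
    for name, chembl_id in zip(names, ids):
        if chembl_id is not None:  # skip names that could not be normalized
            grouped[chembl_id].append(name)
    return grouped


db_by_id = group_by_id(sample_db_names, sample_db_ids)
ch_by_id = group_by_id(sample_ch_names, sample_ch_ids)

# Keep only ids present in both sources (an inner join on the CHEMBL id).
shared = {
    chembl_id: (db_by_id[chembl_id], ch_by_id[chembl_id])
    for chembl_id in db_by_id.keys() & ch_by_id.keys()
}
print(shared)
```

Names that preon could not normalize carry None as their id and are dropped before joining, so they never match each other by accident.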