# Drug Name Normalization
In this notebook, we evaluate our normalizer with three drug name gold standards and present two use cases.

In [96]:
import sys
sys.path.append('../../')

In [97]:
import logging
from preon.normalization import PrecisionOncologyNormalizer
from preon.drug import load_ebi_drugs, load_charite_drug_goldstandard, load_database_drug_goldstandard, load_ctg_drug_goldstandard
from preon.tests.utils import f1_score

In [98]:
import pandas as pd
import daproli as dp

Let's first load the reference drug names from EBI and fit the normalizer.

In [99]:
drug_names, chembl_ids = load_ebi_drugs()
normalizer = PrecisionOncologyNormalizer(enable_warnings=False).fit(drug_names, chembl_ids)

Now we can evaluate it using the provided gold standards.

In [100]:
goldstandards = [
    ("charite", load_charite_drug_goldstandard),
    ("database", load_database_drug_goldstandard),
    ("ctg", load_ctg_drug_goldstandard)
]

In [101]:
for dataset_name, load_dataset in goldstandards:
    drug_names, chembl_ids, _ = load_dataset()
    df_eval = normalizer.evaluate(drug_names, chembl_ids)
    print(f"{dataset_name}: f1_score={f1_score(df_eval)}")

charite: f1_score=0.9144050104384134
database: f1_score=0.9275362318840579
ctg: f1_score=0.9285714285714286
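How `f1_score` derives its value from the evaluation dataframe is not shown in this notebook. As a rough illustration only, the sketch below computes an F1 score under one common convention: a wrong ID counts as a false positive and a missing answer counts as a false negative. The helper names and the toy predictions are hypothetical and are not part of preon.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 if there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def count_outcomes(predicted, gold):
    """Count (tp, fp, fn) for one predicted ID (or None) per set of gold IDs.

    Assumed convention: wrong ID -> false positive, no answer -> false negative.
    """
    tp = fp = fn = 0
    for pred, gold_ids in zip(predicted, gold):
        if pred is None:
            fn += 1
        elif pred in gold_ids:
            tp += 1
        else:
            fp += 1
    return tp, fp, fn


# Toy data (made up for illustration): two correct, one missing, one wrong.
predicted = ["CHEMBL105", None, "CHEMBL888", "CHEMBL1"]
gold = [{"CHEMBL105"}, {"CHEMBL727"}, {"CHEMBL888"}, {"CHEMBL939"}]

tp, fp, fn = count_outcomes(predicted, gold)
print(tp, fp, fn, f1_from_counts(tp, fp, fn))
```

With these toy inputs, precision and recall are both 2/3, so the F1 score is 2/3 as well.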


# User Search
Next, we demonstrate how to use preon to search for drug names.

In [102]:
# print warnings (default)
logging.captureWarnings(False)

In [103]:
drug_names, chembl_ids = load_ebi_drugs()
normalizer = PrecisionOncologyNormalizer().fit(drug_names, chembl_ids)

In [104]:
normalizer.query("Avastin")

(['avastin'], [['CHEMBL1201583']], {'match_type': 'exact'})

We can search for drug names and retrieve their CHEMBL ids by querying the normalizer. As a result of our query, we get a list of matching normalized drug names (in this case ['avastin']), a list of associated CHEMBL ids for every returned drug name ([['CHEMBL1201583']]), and some meta information about the match ({'match_type': 'exact'}). We can also search for multi-token drug names like "Ixabepilone Epothilone B analog" and find CHEMBL ids for the relevant tokens.

In [105]:
normalizer.query("Ixabepilone Epothilone B analog")

(['ixabepilone'], [['CHEMBL1201752']], {'match_type': 'substring'})
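The substring match above can be pictured as looking up each token (or run of adjacent tokens) of the query in the set of normalized reference names. The sketch below is a simplified stand-in, not preon's implementation: `normalize`, `ngram_lookup`, and the two-entry `reference` dictionary are assumptions made for illustration, and the normalization rule (lowercase, keep only letters and digits) is a guess.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def ngram_lookup(query: str, reference: dict, n_grams: int = 1):
    """Return (name, ids) pairs for token n-grams of `query` found in `reference`."""
    tokens = query.split()
    hits = []
    seen = set()
    for n in range(1, n_grams + 1):
        for start in range(len(tokens) - n + 1):
            candidate = normalize(" ".join(tokens[start:start + n]))
            if candidate in reference and candidate not in seen:
                seen.add(candidate)
                hits.append((candidate, reference[candidate]))
    return hits


# Tiny hypothetical reference dictionary; the IDs are taken from the outputs above.
reference = {
    "avastin": ["CHEMBL1201583"],
    "ixabepilone": ["CHEMBL1201752"],
}

print(ngram_lookup("Ixabepilone Epothilone B analog", reference))
```

Only the first token is present in the toy dictionary, so the lookup returns the single pair for `ixabepilone`. Raising `n_grams` makes the loop also try runs of two or more adjacent tokens.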

We find the relevant drug name ['ixabepilone'], and preon reports in the meta information that the match is based on a substring. By default, preon only looks for single matching tokens. It can also look for n-grams if the n_grams parameter of the query method is set. Let's take a harder example, say "Isavuconazonium", but misspell it as "Isavuconaconium".

In [106]:
normalizer.query("Isavuconaconium")

(['isavuconazonium'],
 [['CHEMBL1183349']],
 {'match_type': 'partial', 'edit_distance': 0.067})

preon finds the correct drug "Isavuconazonium" and reports a partial match with an edit distance of about 7% (0.067). By default, it returns drug names whose distance is below 20%. To change this cutoff, set the threshold argument of the query method.
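To make the relative distance and the cutoff concrete, here is a minimal sketch. It assumes that the distance is the Levenshtein edit distance divided by the length of the longer string; this notebook does not show how preon computes `edit_distance`, so treat that as an assumption. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance scaled by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def partial_matches(query: str, names, threshold: float = 0.2):
    """Return (name, rounded distance) pairs strictly below `threshold`, closest first."""
    scored = [(name, relative_distance(query, name)) for name in names]
    kept = sorted((pair for pair in scored if pair[1] < threshold), key=lambda p: p[1])
    return [(name, round(dist, 3)) for name, dist in kept]


print(partial_matches("isavuconaconium", ["isavuconazonium", "ixabepilone"]))
```

Under this assumption, the misspelled query differs from `isavuconazonium` by one substitution over 15 characters, which gives 1/15 ≈ 0.067 and stays below the 0.2 cutoff.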

In [107]:
normalizer.query("risolipase en.")



If preon cannot find a match, it warns the user and suggests hyper-parameter changes. In this case, we increase the partial matching threshold to 30% and get a valid result.

In [108]:
normalizer.query("risolipase en.", threshold=.3)

(['rizolipase'],
 [['CHEMBL2108124']],
 {'match_type': 'partial', 'edit_distance': 0.25})

# Data Integration
We use preon in the PREDICT project to integrate drug names from different sources. Below, we give an overview of how to do so. For this example, we write any issued warnings to a log file for inspection.

In [109]:
# store warnings in file
logging.basicConfig(filename='warnings.log', level=logging.WARNING)
logging.captureWarnings(True)
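As a quick check that this setup routes warnings into a file, the standalone sketch below uses only the standard library. It writes to a temporary directory instead of `warnings.log`, and the warning text is made up; preon's actual messages may look different. `force=True` requires Python 3.8 or later.

```python
import logging
import os
import tempfile
import warnings

# Write to a temporary location so the sketch does not touch the working directory.
log_path = os.path.join(tempfile.mkdtemp(), "warnings.log")

# force=True replaces any handlers that were already configured on the root logger.
logging.basicConfig(filename=log_path, level=logging.WARNING, force=True)
logging.captureWarnings(True)      # route warnings.warn(...) to the "py.warnings" logger
warnings.simplefilter("always")    # make sure the demo warning is not filtered out

warnings.warn("no match found for 'example name'")  # hypothetical warning text

logging.shutdown()                 # flush and close the file handler
logging.captureWarnings(False)     # restore the default behaviour (print to stderr)

with open(log_path) as handle:
    logged = handle.read()
print(logged)
```

The log file then contains one `WARNING:py.warnings:...` entry for the emitted warning, which can be inspected after a long normalization run.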

In [110]:
drug_names, chembl_ids = load_ebi_drugs()
normalizer = PrecisionOncologyNormalizer().fit(drug_names, chembl_ids)

In [111]:
db_names, _, _ = load_database_drug_goldstandard()
ch_names, _, _ = load_charite_drug_goldstandard()

Let's say we wanted to integrate the drug names from the database and charite gold standards. We would normalize both lists of names and join on the returned CHEMBL ids.

In [112]:
normalizer.transform(db_names)

|     | Name | Found Names | Found Name IDs | Match Type | Edit Distance | Query Time |
|-----|------|-------------|----------------|------------|---------------|------------|
| 0   | LOXO-292 | [loxo292] | [[CHEMBL4559134]] | exact |  | 0.000044 |
| 1   | H3B-8800 | [h3b8800] | [[CHEMBL4802174]] | exact |  | 0.000026 |
| 2   | Tubulins | [] | [[None]] | none |  | 0.109484 |
| 3   | Androgen Deprivation | [] | [[None]] | none |  | 0.199025 |
| 4   | Mitomycin | [mitomycin] | [[CHEMBL105]] | exact |  | 0.000047 |
| ... | ... | ... | ... | ... | ... | ... |
| 71  | Thioguanine | [thioguanine] | [[CHEMBL727]] | exact |  | 0.000007 |
| 72  | Lenalidomide | [lenalidomide] | [[CHEMBL848]] | exact |  | 0.000008 |
| 73  | Gemcitabine | [gemcitabine] | [[CHEMBL888]] | exact |  | 0.000008 |
| 74  | Gefitinib | [gefitinib] | [[CHEMBL939]] | exact |  | 0.000008 |


Using the transform method, preon returns a comprehensive pandas dataframe that contains the corresponding annotations. Let's normalize the drug names from both gold standards.

In [113]:
db_df = normalizer.transform(db_names)
db_df["Found Name IDs"] = db_df["Found Name IDs"].apply(dp.flatten).apply(lambda ids: ids[0])

ch_df = normalizer.transform(ch_names)
ch_df["Found Name IDs"] = ch_df["Found Name IDs"].apply(dp.flatten).apply(lambda ids: ids[0])

We can now extract the query drug names from both sources and relate them to the CHEMBL ids that were found.

In [114]:
db_names, db_ids = db_df.Name.tolist(), db_df["Found Name IDs"].tolist()
ch_names, ch_ids = ch_df.Name.tolist(), ch_df["Found Name IDs"].tolist()
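The join on the returned ids can then be sketched in plain Python. This is an illustrative helper, not part of preon: `join_on_ids` is a hypothetical name, and the toy lists below stand in for the `db_*` and `ch_*` lists (the right-hand names are made up, while the two ids appear in the table above). Names without a match carry `None` as their id and are skipped.

```python
from collections import defaultdict


def join_on_ids(left_names, left_ids, right_names, right_ids):
    """Pair names from two sources that share the same non-None id."""
    right_by_id = defaultdict(list)
    for name, chembl_id in zip(right_names, right_ids):
        if chembl_id is not None:          # unmatched names cannot be joined
            right_by_id[chembl_id].append(name)

    pairs = []
    for name, chembl_id in zip(left_names, left_ids):
        # .get avoids creating entries; None is never a key, so it yields nothing
        for other in right_by_id.get(chembl_id, []):
            pairs.append((chembl_id, name, other))
    return pairs


# Toy stand-ins for the normalized lists from the two sources.
toy_db_names = ["Mitomycin", "Tubulins", "Gefitinib"]
toy_db_ids = ["CHEMBL105", None, "CHEMBL939"]
toy_ch_names = ["mitomycin C", "Iressa", "unknown agent"]
toy_ch_ids = ["CHEMBL105", "CHEMBL939", None]

for row in join_on_ids(toy_db_names, toy_db_ids, toy_ch_names, toy_ch_ids):
    print(row)
```

When the data stays in dataframes, the same inner join can be expressed with pandas' `DataFrame.merge` on the "Found Name IDs" column after dropping rows whose id is missing.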