# Cancer Type Normalization
In this notebook, we evaluate our normalizer against two cancer type gold standards and present two use cases.

In [1]:
import sys
sys.path.append('../../')

In [2]:
from preon.normalization import PrecisionOncologyNormalizer
from preon.cancer import download_or_load_do_cancers, load_do_flat_mapping, apply_do_flat_mapping_to_ontology, apply_do_flat_mapping_to_goldstandard, \
    load_database_cancer_goldstandard, load_ncbi_cancer_goldstandard
from preon.tests.utils import f1_score

In [3]:
import pandas as pd
import daproli as dp

Let's first load the reference cancer types from Disease Ontology and fit the normalizer.

In [4]:
cancer_types, doids = download_or_load_do_cancers()

# reduce the cancer type hierarchy to just two levels
do_flat_mapping = load_do_flat_mapping()
cancer_types, doids = apply_do_flat_mapping_to_ontology(cancer_types, doids, do_flat_mapping)

normalizer = PrecisionOncologyNormalizer().fit(cancer_types, doids)

Now we can evaluate it using the provided gold standards.

In [5]:
goldstandards = [
    ("database", load_database_cancer_goldstandard),
    ("ncbi", load_ncbi_cancer_goldstandard)
]

In [6]:
for dataset_name, load_dataset in goldstandards:
    cancer_types, doids = load_dataset()
    
    # reduce the cancer type hierarchy in the gold standard as well
    cancer_types, doids = apply_do_flat_mapping_to_goldstandard(cancer_types, doids, do_flat_mapping)
    
    df_eval = normalizer.evaluate(cancer_types, doids, n_grams=3)
    print(f"{dataset_name}: f1_score={f1_score(df_eval)}")

database: f1_score=0.934131736526946
ncbi: f1_score=0.8208955223880596
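For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts. `f1_from_counts` is a hypothetical helper, and the counts are illustrative only; how preon's `f1_score` derives its result from `df_eval` is not shown in this notebook.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score, the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only, not taken from the gold standards above.
score = f1_from_counts(tp=8, fp=2, fn=2)
print(round(score, 3))
```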


# User Search
Next, we demonstrate how to use preon to search for cancer types.

In [7]:
normalizer.query("Carcinoma")

(['carcinoma'], [['DOID:0050687', 'DOID:305']], {'match_type': 'exact'})

We can search for cancer types and retrieve their DOIDs by querying the normalizer. The query returns a list of matching normalized cancer types (in this case ['carcinoma']), a list of associated DOIDs for each returned cancer type ([['DOID:0050687', 'DOID:305']]), and meta information about the match ({'match_type': 'exact'}). We can also search for multi-token cancer types like "adrenal carcinoma treatment and causes" and find DOIDs for the relevant tokens.
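The three-part result described above can be unpacked directly. This is a minimal sketch that reuses the literal tuple from the "Carcinoma" query output instead of calling preon; the variable names are illustrative.

```python
# Result copied from the query("Carcinoma") output above.
result = (["carcinoma"], [["DOID:0050687", "DOID:305"]], {"match_type": "exact"})

# The three parts: normalized names, one DOID list per name, match metadata.
names, doid_lists, meta = result

# Pair each normalized name with its list of DOIDs.
name_to_doids = dict(zip(names, doid_lists))

for name, ids in name_to_doids.items():
    print(f"{name}: {', '.join(ids)} ({meta['match_type']} match)")
```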

In [8]:
normalizer.query("adrenal carcinoma or bladder sarcoma treatment and causes")

(['carcinoma', 'sarcoma'],
 [['DOID:0050687', 'DOID:305'], ['DOID:0050687', 'DOID:1115']],
 {'match_type': 'substring'})

We find the relevant cancer types ['carcinoma', 'sarcoma'], and preon reports in the meta information that the match is based on substrings. By default, preon only looks for single-token matches. It can also look for n-grams when the n_grams parameter is set in the query method. Let's take a harder example, say "mesenchymal cell neoplasm", but misspell it as "mesenchimal zell neoplasm".

In [9]:
normalizer.query("mesenchimal zell neoplasm", n_grams=2)

(['mesenchymalcellneoplasm'],
 [['DOID:0050687', 'DOID:3350']],
 {'match_type': 'partial', 'edit_distance': 0.08695652173913043})

preon finds the correct cancer type "mesenchymal cell neoplasm" and reports in the meta information that it is a partial match with a distance of 8.7%. By default, it returns cancer types with a distance smaller than 20%. To change this, set the threshold argument in the query method.
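To make the distance figure concrete, the sketch below computes a Levenshtein distance and divides it by the length of the longer string. That normalization is an assumption, not preon's documented implementation, and `levenshtein` and `relative_distance` are hypothetical helpers. For this example the misspelling differs from the reference by two substitutions over 23 characters, which gives about 0.087 and is consistent with the value reported above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def relative_distance(query: str, reference: str) -> float:
    """Edit distance divided by the longer string's length (an assumption)."""
    return levenshtein(query, reference) / max(len(query), len(reference))


# Lowercased, whitespace-stripped forms, mirroring the normalized name above.
query = "mesenchimal zell neoplasm".replace(" ", "").lower()
reference = "mesenchymalcellneoplasm"

distance = relative_distance(query, reference)
threshold = 0.2  # the default cut-off mentioned in the text
print(distance, distance < threshold)
```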

# Data Integration
We use preon in the PREDICT project to integrate cancer types from different sources. Below, we give an overview of how to do so.

In [10]:
db_names, _ = load_database_cancer_goldstandard()
ncbi_names, _ = load_ncbi_cancer_goldstandard()

Let's say we want to integrate the cancer types from the database and ncbi gold standards. We would normalize both lists of names and join on the returned DOIDs.

In [11]:
normalizer.transform(db_names)

Unnamed: 0,Name,Found Names,Found Name IDs,Match Type,Edit Distance,Query Time
0,Multiple myeloma,[multiplemyeloma],"[[DOID:0050686, DOID:2531]]",exact,,0.000303
1,Medulloblastoma,[medulloblastoma],"[[DOID:0050686, DOID:3093]]",exact,,0.000190
2,Mantle cell lymphoma,[mantlecelllymphoma],"[[DOID:0050686, DOID:0060083, DOID:2531]]",exact,,0.000511
3,multiple myeloma,[multiplemyeloma],"[[DOID:0050686, DOID:2531]]",exact,,0.000796
4,Colorectal Cancer,[colorectalcancer],"[[DOID:0050686, DOID:3119]]",exact,,0.000097
...,...,...,...,...,...,...
128,Thrombotic Thrombocytopenic Purpura (TTP),[],[[None]],none,,0.012495
129,Vasodilatory Shock,[],[[None]],none,,0.007559
130,All Tumors,[all],"[[DOID:0050686, DOID:2531]]",substring,,0.000078
131,Anemia,[],[[None]],none,,0.005533


Using the transform method, preon returns a comprehensive pandas dataframe that contains the corresponding annotations. Let's normalize the cancer types from both gold standards.

In [12]:
db_df = normalizer.transform(db_names)
db_df["Found Name IDs"] = db_df["Found Name IDs"].apply(dp.flatten).apply(lambda ids: ids[0])

ncbi_df = normalizer.transform(ncbi_names)
ncbi_df["Found Name IDs"] = ncbi_df["Found Name IDs"].apply(dp.flatten).apply(lambda ids: ids[0])
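The flatten-then-take-first step in the cell above can be pictured with plain lists. The `flatten` function here is a hypothetical stand-in written for illustration, not daproli's implementation. Note that in the rows shown earlier the first ID is shared by several different cancer types, so keeping only `ids[0]` groups names coarsely.

```python
def flatten(nested):
    """Flatten arbitrarily nested lists into one flat list (illustrative stand-in)."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# Cell values copied from the "Found Name IDs" column shown above.
matched = [["DOID:0050686", "DOID:2531"]]
unmatched = [[None]]

print(flatten(matched)[0])    # first ID of the first match
print(flatten(unmatched)[0])  # None marks a name without a match
```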

We can now extract the query cancer types and relate them to the DOIDs found in both sources.

In [13]:
db_names, db_ids = db_df.Name.tolist(), db_df["Found Name IDs"].tolist()
ncbi_names, ncbi_ids = ncbi_df.Name.tolist(), ncbi_df["Found Name IDs"].tolist()
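One way to perform the join on IDs is to group the names of each source by ID and keep the IDs that occur in both. The sketch below uses placeholder names and IDs so that it runs on its own; `group_by_id` and the `demo_` variables are hypothetical and not part of preon.

```python
from collections import defaultdict


def group_by_id(names, ids):
    """Map each ID to the list of query names that were normalized to it."""
    groups = defaultdict(list)
    for name, id_ in zip(names, ids):
        if id_ is not None:  # skip names without a match
            groups[id_].append(name)
    return groups


# Placeholder data standing in for the lists built in the cell above.
demo_db_names = ["db name 1", "db name 2", "db name 3"]
demo_db_ids = ["ID:A", "ID:B", None]
demo_ncbi_names = ["ncbi name 1", "ncbi name 2"]
demo_ncbi_ids = ["ID:A", "ID:C"]

db_groups = group_by_id(demo_db_names, demo_db_ids)
ncbi_groups = group_by_id(demo_ncbi_names, demo_ncbi_ids)

# Inner join: keep only the IDs found in both sources.
shared_ids = sorted(db_groups.keys() & ncbi_groups.keys())
integrated = {id_: (db_groups[id_], ncbi_groups[id_]) for id_ in shared_ids}
print(integrated)
```

With real data, the same grouping could be applied to the `db_names`/`db_ids` and `ncbi_names`/`ncbi_ids` lists from the cell above.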