In [1]:
import bdikit as bdi
import pandas as pd

import flair, torch
flair.device = torch.device('cpu') 

In [2]:
dataset = pd.read_csv('./datasets/dou.csv')

columns = [
    "Country",
    "Histologic_type",
    "FIGO_stage",
    "BMI",
    "Age",
    "Race",
    "Ethnicity",
    "Gender",
    "Tumor_Focality",
    "Tumor_Size_cm",
]

dataset[columns].head(10)

Unnamed: 0,Country,Histologic_type,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Focality,Tumor_Size_cm
0,United States,Endometrioid,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Unifocal,2.9
1,United States,Endometrioid,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Unifocal,3.5
2,United States,Endometrioid,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,Unifocal,4.5
3,,Carcinosarcoma,,,,,,,,
4,United States,Endometrioid,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,Unifocal,3.5
5,United States,Serous,IA,20.28,63.0,White,Not-Hispanic or Latino,Female,Unifocal,6.0
6,United States,Endometrioid,IA,55.67,50.0,White,Not-Hispanic or Latino,Female,Unifocal,4.5
7,Other_specify,Endometrioid,IA,25.68,60.0,White,Not-Hispanic or Latino,Female,Unifocal,5.0
8,United States,Serous,IIIA,21.57,83.0,White,Not-Hispanic or Latino,Female,Unifocal,4.0
9,United States,Endometrioid,IA,34.26,69.0,White,Not-Hispanic or Latino,Female,Unifocal,5.2


### Matching the table schema to GDC standard vocabulary

`bdi-kit` offers a suite of functions to help with data harmonization tasks.
For instance, it can help with automatic discovery of one-to-one mappings between the columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a standard data vocabulary such as the GDC (Genomic Data Commons).

To achieve this using `bdi-kit`, we can call the `match_columns()` function with the GDC vocabulary as the target, as follows.

In [3]:
column_mappings = bdi.match_columns(dataset[columns], target='gdc', method='two_phase')
column_mappings

Unnamed: 0,source,target
0,Country,country_of_birth
1,Histologic_type,dysplasia_type
2,FIGO_stage,figo_stage
3,BMI,hpv_positive_type
4,Age,weight
5,Race,race
6,Ethnicity,ethnicity
7,Gender,gender
8,Tumor_Focality,tumor_focality
9,Tumor_Size_cm,tumor_depth
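
For intuition about what a column matcher has to do, the sketch below pairs each source column with the most similar name in a small candidate list, using `difflib` from the Python standard library. This is only a name-based baseline, not the `two_phase` method used above; the `candidates` list and the `naive_match` helper are hypothetical and written just for this illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two column names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_match(source_columns, candidates):
    """Map every source column to its most similar candidate name."""
    return {
        src: max(candidates, key=lambda tgt: name_similarity(src, tgt))
        for src in source_columns
    }


# Hypothetical, tiny excerpt of target names (the real GDC schema has hundreds).
candidates = ["race", "ethnicity", "gender", "figo_stage", "tumor_focality", "bmi"]
matches = naive_match(["Race", "Gender", "FIGO_stage", "BMI"], candidates)
print(matches)
```

A matcher that only looks at names has little to go on when the names differ, as with `Histologic_type` and a diagnosis column, which is one reason to also inspect the column values.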


#### Generating a harmonized table

After discovering a schema mapping, we can generate a new table (DataFrame) using the new column names from the GDC standard vocabulary.

To do so using `bdi-kit`, we can use the function `materialize_mapping()` as follows. Note that the column headers have been renamed to the target schema.

In [4]:
bdi.materialize_mapping(dataset, column_mappings)

Unnamed: 0,country_of_birth,dysplasia_type,figo_stage,hpv_positive_type,weight,race,ethnicity,gender,tumor_focality,tumor_depth
0,United States,Endometrioid,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Unifocal,2.9
1,United States,Endometrioid,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Unifocal,3.5
2,United States,Endometrioid,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,Unifocal,4.5
3,,Carcinosarcoma,,,,,,,,
4,United States,Endometrioid,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,Unifocal,3.5
...,...,...,...,...,...,...,...,...,...,...
99,Ukraine,Endometrioid,IA,29.40,75.0,,,Female,Unifocal,4.2
100,Ukraine,Endometrioid,II,35.42,74.0,,,Female,Unifocal,1.5
101,United States,Serous,II,24.32,85.0,Black or African American,Not-Hispanic or Latino,Female,Unifocal,3.8
102,Ukraine,Serous,IA,34.06,70.0,,,Female,Unifocal,5.0


#### Generating a harmonized table with value mappings

`bdi-kit` can also help translate the values from the source table to the target standard format.

To this end, `bdi-kit` provides the function `match_values()`, which automatically creates value mappings for each string column.
The output of `match_values()` can be fed to `materialize_mapping()`, which materializes the final target table using both schema and value mappings.

In [5]:
value_mappings = bdi.match_values(dataset, column_mapping=column_mappings, target='gdc', method='tfidf')
bdi.materialize_mapping(dataset, value_mappings)

Unnamed: 0,country_of_birth,dysplasia_type,figo_stage,race,ethnicity,gender,tumor_focality
0,United States,,Stage IA,white,not hispanic or latino,female,Unifocal
1,United States,,Stage IA,white,not hispanic or latino,female,Unifocal
2,United States,,Stage IA,white,not hispanic or latino,female,Unifocal
3,,Esophageal Mucosa Columnar Dysplasia,,,,,
4,United States,,Stage IA,white,not hispanic or latino,female,Unifocal
...,...,...,...,...,...,...,...
99,Ukraine,,Stage IA,,,female,Unifocal
100,Ukraine,,Stage III,,,female,Unifocal
101,United States,,Stage III,black or african american,not hispanic or latino,female,Unifocal
102,Ukraine,,Stage IA,,,female,Unifocal


#### Verifying the schema mappings

Sometimes the automatically generated mappings may be incorrect, or you may want to verify them individually.
To help verify the suggested column mappings, `bdi-kit` offers additional APIs for visualizing the data and making modifications when necessary.

For this example, we will use the column `Histologic_type`. We can start by exploring which target columns are most similar to `Histologic_type`.

For this we can use the `top_matches()` function. Here, we notice that `primary_diagnosis` could be a potential target column.


In [6]:
hist_type_matches = bdi.top_matches(dataset, columns=['Histologic_type'], target='gdc')
hist_type_matches

Unnamed: 0,source,target,similarity
0,Histologic_type,described_cases,0.589956
1,Histologic_type,slide_images,0.587552
2,Histologic_type,history_of_tumor_type,0.57464
3,Histologic_type,primary_diagnosis,0.573583
4,Histologic_type,additional_pathology_findings,0.562278
5,Histologic_type,pathology_details,0.562007
6,Histologic_type,pathology_reports,0.547307
7,Histologic_type,relationship_primary_diagnosis,0.524285
8,Histologic_type,diagnoses,0.519854
9,Histologic_type,family_histories,0.516649


##### Viewing the column domains

To verify that `primary_diagnosis` is a good target column, we view and compare the domains of each column using the `preview_domains()` function. For the source table, it returns the list of unique values in the source column. For the GDC target, it returns the list of unique valid values that a column can have.

Here we see that the values seem to be related.

In [7]:
bdi.preview_domains(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc"
)

Unnamed: 0,source_domain,target_domain
0,Endometrioid,Abdominal desmoid
1,Carcinosarcoma,Abdominal fibromatosis
2,Serous,Achromic nevus
3,Clear cell,Acidophil adenocarcinoma
4,,Acidophil adenoma
...,...,...
2620,,Wolffian duct tumor
2621,,Xanthofibroma
2622,,Yolk sac tumor
2623,,Unknown
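
One way to make "the values seem to be related" a bit more concrete is to check which source values occur inside some target value. The sketch below does this on plain Python lists; `target_sample` is a hand-picked handful of `primary_diagnosis` values that appear in this notebook's outputs, not the full target domain, and `covered_by` is a hypothetical helper.

```python
def covered_by(source_values, target_values):
    """For each source value, list the target values that contain it (case-insensitive)."""
    hits = {}
    for src in source_values:
        needle = src.lower()
        hits[src] = [tgt for tgt in target_values if needle in tgt.lower()]
    return hits


source_domain = ["Endometrioid", "Carcinosarcoma", "Serous", "Clear cell"]
# Hand-picked sample of target values, for illustration only.
target_sample = [
    "Abdominal desmoid",
    "Carcinosarcoma, NOS",
    "Clear cell adenoma",
    "Endometrioid adenoma, NOS",
    "Serous carcinoma, NOS",
    "Yolk sac tumor",
]

hits = covered_by(source_domain, target_sample)
for src, tgts in hits.items():
    print(f"{src!r} appears in {len(tgts)} sampled target value(s): {tgts}")
```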


Since `primary_diagnosis` looks like a correct match for `Histologic_type`, we can modify the `column_mappings` variable directly.

In [8]:
column_mappings.loc[column_mappings['source'] == 'Histologic_type', 'target'] = 'primary_diagnosis'
column_mappings


Unnamed: 0,source,target
0,Country,country_of_birth
1,Histologic_type,primary_diagnosis
2,FIGO_stage,figo_stage
3,BMI,hpv_positive_type
4,Age,weight
5,Race,race
6,Ethnicity,ethnicity
7,Gender,gender
8,Tumor_Focality,tumor_focality
9,Tumor_Size_cm,tumor_depth


##### Finding correct value mappings

After finding the correct column, we need to find appropriate value mappings.
Using `preview_value_mappings()`, we can inspect what the value mappings for this column pair would look like after harmonization.

`bdi-kit` implements multiple methods for value mapping discovery, including:
- `edit_distance` - Computes value similarities using Levenshtein's edit distance measure.
- `tfidf` - A method based on tf-idf importance weighting computed over character n-grams.
- `embedding` - Uses BERT word embeddings to compute "semantic similarity" between the values.

To specify a value mapping approach, we can pass the `method` parameter.

In [9]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="edit_distance"
)

Unnamed: 0,source,target,similarity
0,Carcinosarcoma,"Carcinosarcoma, NOS",0.848485
1,Clear cell,Clear cell adenoma,0.714286
2,Endometrioid,Stromal endometriosis,0.666667
3,Serous,Neuronevus,0.625
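
To make the `edit_distance` scores less of a black box, here is a minimal sketch of one common normalized edit similarity: `2 * LCS(a, b) / (len(a) + len(b))`, where LCS is the length of the longest common subsequence (equivalently, an edit distance that allows only insertions and deletions). Compared case-insensitively, it gives the same numbers as the table above for these four pairs, but whether `bdi-kit` computes its score exactly this way is an assumption; `lcs_length` and `indel_similarity` are names made up for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means equal ignoring case."""
    a, b = a.lower(), b.lower()
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * lcs_length(a, b) / total


pairs = [
    ("Carcinosarcoma", "Carcinosarcoma, NOS"),
    ("Clear cell", "Clear cell adenoma"),
    ("Endometrioid", "Stromal endometriosis"),
    ("Serous", "Neuronevus"),
]
for source, target in pairs:
    print(f"{source} -> {target}: {indel_similarity(source, target):.6f}")
```

The last pair shows the weakness of purely character-based scores: `Serous` and `Neuronevus` share five letters in order but are unrelated terms.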


In [10]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="tfidf"
)

Unnamed: 0,source,target,similarity
0,Carcinosarcoma,"Carcinosarcoma, NOS",0.969
1,Endometrioid,"Endometrioid adenoma, NOS",0.897
2,Clear cell,Clear cell adenoma,0.853
3,Serous,"Serous carcinoma, NOS",0.755
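
The idea behind the `tfidf` method can be sketched with the standard library alone: break each string into character n-grams, weight rarer n-grams more heavily, and compare the weighted vectors with cosine similarity. The n-gram size, the smoothing formula, and the three-value `targets` list below are choices made for this illustration, so the scores will not match the ones `bdi-kit` reports.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf_vector(text, doc_freq, n_docs):
    """Weight each n-gram count by a smoothed inverse document frequency."""
    return {
        gram: count * (math.log((1 + n_docs) / (1 + doc_freq.get(gram, 0))) + 1)
        for gram, count in char_ngrams(text).items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Illustrative candidates only; the real target domain is much larger.
targets = ["Serous carcinoma, NOS", "Neuronevus", "Clear cell adenoma"]
# Number of candidate strings in which each n-gram occurs at least once.
doc_freq = Counter(gram for t in targets for gram in char_ngrams(t))

query = tfidf_vector("Serous", doc_freq, len(targets))
scores = {t: cosine(query, tfidf_vector(t, doc_freq, len(targets))) for t in targets}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this sketch `Neuronevus` scores zero against `Serous` because the two strings share no three-character substring, which is consistent with `tfidf` preferring `Serous carcinoma, NOS` in the table above.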


In [11]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="embedding"
)

Unnamed: 0,source,target,similarity
0,Carcinosarcoma,Carcinofibroma,0.919
1,Endometrioid,Endometrioid cystadenocarcinoma,0.81
2,Clear cell,Clear cell carcinoma,0.76
3,Serous,Serous cystoma,0.661


In [12]:
hist_type_vmap = pd.DataFrame(
    columns=["source", "target"],
    data=[
        ("Carcinosarcoma", "Carcinosarcoma, NOS"),
        ("Clear cell", "Clear cell adenocarcinoma, NOS"),
        ("Endometrioid", "Endometrioid carcinoma"),
        ("Serous", "Serous cystadenocarcinoma"),
    ],
)
hist_type_vmap

Unnamed: 0,source,target
0,Carcinosarcoma,"Carcinosarcoma, NOS"
1,Clear cell,"Clear cell adenocarcinoma, NOS"
2,Endometrioid,Endometrioid carcinoma
3,Serous,Serous cystadenocarcinoma
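
Before relying on a hand-written value map like this one, it can be worth checking that it covers every distinct source value. A minimal sketch on plain Python data follows; `uncovered_values` is a hypothetical helper, and `observed` simply repeats the four `Histologic_type` values listed by `preview_domains()` earlier, plus one missing value.

```python
import math


def uncovered_values(observed, value_map):
    """Return the non-missing observed values that have no entry in value_map."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return sorted({v for v in observed if not is_missing(v) and v not in value_map})


value_map = {
    "Carcinosarcoma": "Carcinosarcoma, NOS",
    "Clear cell": "Clear cell adenocarcinoma, NOS",
    "Endometrioid": "Endometrioid carcinoma",
    "Serous": "Serous cystadenocarcinoma",
}
observed = ["Endometrioid", "Carcinosarcoma", "Serous", "Clear cell", float("nan")]

missing = uncovered_values(observed, value_map)
print(missing or "every source value is covered")
```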


### Verifying multiple value mappings at once

Besides verifying value mappings individually, you can also do it for all column mappings at once.

In [13]:
mappings = bdi.preview_value_mappings(
    dataset,
    column_mapping=column_mappings,
    target="gdc",
    method="tfidf",
)

for mapping in mappings:
    print(f"{mapping['source']} => {mapping['target']}")
    display(mapping["mapping"])
    print("")

Country => country_of_birth


Unnamed: 0,source,target,similarity
0,United States,United States,1.0
1,Ukraine,Ukraine,1.0
2,Poland,Poland,1.0
3,,,
4,Other_specify,,



Histologic_type => primary_diagnosis


Unnamed: 0,source,target,similarity
0,Carcinosarcoma,"Carcinosarcoma, NOS",0.969
1,Endometrioid,"Endometrioid adenoma, NOS",0.897
2,Clear cell,Clear cell adenoma,0.853
3,Serous,"Serous carcinoma, NOS",0.755



FIGO_stage => figo_stage


Unnamed: 0,source,target,similarity
0,IIIC2,Stage IIIC2,0.889
1,IIIC1,Stage IIIC1,0.889
2,IVB,Stage IVB,0.854
3,IIIB,Stage IIIB,0.849
4,IIIA,Stage IIIA,0.822
5,II,Stage III,0.687
6,IB,Stage IB,0.649
7,IA,Stage IA,0.586
8,,Unknown,0.35



Race => race


Unnamed: 0,source,target,similarity
0,White,white,1.0
1,Asian,asian,1.0
2,Not Reported,not reported,1.0
3,Black or African American,black or african american,1.0
4,,american indian or alaska native,0.359



Ethnicity => ethnicity


Unnamed: 0,source,target,similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Not-Hispanic or Latino,not hispanic or latino,0.935
2,Not reported,not hispanic or latino,0.268
3,,,



Gender => gender


Unnamed: 0,source,target,similarity
0,Female,female,1.0
1,,unknown,0.29



Tumor_Focality => tumor_focality


Unnamed: 0,source,target,similarity
0,Unifocal,Unifocal,1.0
1,Multifocal,Multifocal,1.0
2,,,





### Fixing remaining value mappings

We need to fix a few value mappings:
- Race
- Ethnicity
- Tumor_Site

For `Race`, we need to fix `nan` -> `american indian or alaska native`.

In [14]:
race_vmap = bdi.preview_value_mappings(
    dataset,
    column_mapping=("Race", "race"),
    target="gdc",
    method="tfidf",
)
race_vmap

Unnamed: 0,source,target,similarity
0,White,white,1.0
1,Asian,asian,1.0
2,Not Reported,not reported,1.0
3,Black or African American,black or african american,1.0
4,,american indian or alaska native,0.359


In [15]:
race_vmap = race_vmap[race_vmap['similarity'] >= 1.0]
race_vmap

Unnamed: 0,source,target,similarity
0,White,white,1.0
1,Asian,asian,1.0
2,Not Reported,not reported,1.0
3,Black or African American,black or african american,1.0


For `Ethnicity`, we need to fix `Not reported` -> `not hispanic or latino`.

In [16]:
ethinicity_vmap = bdi.preview_value_mappings(
    dataset,
    column_mapping=("Ethnicity", "ethnicity"),
    target="gdc",
    method="tfidf",
)
ethinicity_vmap


Unnamed: 0,source,target,similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Not-Hispanic or Latino,not hispanic or latino,0.935
2,Not reported,not hispanic or latino,0.268
3,,,


In [17]:
ethinicity_vmap = ethinicity_vmap[ethinicity_vmap["similarity"] > 0.9]
ethinicity_vmap

Unnamed: 0,source,target,similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Not-Hispanic or Latino,not hispanic or latino,0.935
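
Note that the threshold filter also drops the row whose similarity is missing, since any comparison with NaN is false. The sketch below shows the same behavior on plain Python data; `keep_confident` is a hypothetical helper, and the rows mirror the table printed by cell `In [16]`.

```python
def keep_confident(rows, threshold):
    """Keep rows whose similarity is strictly above the threshold (NaN never passes)."""
    return [row for row in rows if row["similarity"] > threshold]


nan = float("nan")
rows = [
    {"source": "Hispanic or Latino", "target": "hispanic or latino", "similarity": 1.0},
    {"source": "Not-Hispanic or Latino", "target": "not hispanic or latino", "similarity": 0.935},
    {"source": "Not reported", "target": "not hispanic or latino", "similarity": 0.268},
    {"source": None, "target": None, "similarity": nan},
]

kept = keep_confident(rows, 0.9)
print([row["source"] for row in kept])
```

Whether 0.9 is an appropriate cutoff depends on the column; for `Race`, this notebook keeps only exact matches with `>= 1.0` instead.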


For `Tumor_Site`, given that this dataset is about endometrial cancer, all values must be mapped to "Endometrium".

In [18]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Tumor_Site", "tissue_or_organ_of_origin"), target="gdc", method="tfidf"
)

Unnamed: 0,source,target,similarity
0,Anterior endometrium,Endometrium,0.852
1,Posterior endometrium,Endometrium,0.823
2,"Other, specify",Other specified parts of pancreas,0.543
3,,Anal canal,0.301


In [19]:
# Custom mapping function that will be used to map the values of the 'Tumor_Site' column
def map_tumor_site(source_value):
    return "Endometrium"
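
Because `map_tumor_site` ignores its argument, it returns "Endometrium" even when the source value is missing. If missing values should stay missing instead, a variant along the following lines could be used; this is a sketch of an alternative, not what the rest of this notebook does, and `map_tumor_site_keep_missing` is a hypothetical name.

```python
import math


def map_tumor_site_keep_missing(source_value):
    """Map any recorded tumor site to "Endometrium", but pass missing values through."""
    if source_value is None:
        return None
    if isinstance(source_value, float) and math.isnan(source_value):
        return source_value
    return "Endometrium"


print(map_tumor_site_keep_missing("Anterior endometrium"))
print(map_tumor_site_keep_missing("Other, specify"))
```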

#### Combining custom user mappings with suggested mappings

In [20]:
from math import ceil

user_mappings = [
    {
        "source": "Tumor_Site",
        "target": "tissue_or_organ_of_origin",
        "mapper": map_tumor_site,
    },
    {
        "source": "BMI",
        "target": "bmi",
    },
    {
        "source": "Age",
        "target": "days_to_birth",
        "mapper": lambda age: -age * 365.25,
    },
    {
        "source": "Age",
        "target": "age_at_diagnosis",
        "mapper": lambda age: float("nan") if pd.isnull(age) else ceil(age*365.25),
    },
    {
        "source": "Tumor_Size_cm",
        "target": "tumor_largest_dimension_diameter",
    }
]

value_mappings = bdi.match_values(
    dataset, target="gdc", column_mapping=column_mappings, method="tfidf"
)

harmonization_spec = bdi.update_mappings(value_mappings, user_mappings)


In [21]:
harmonized_dataset = bdi.materialize_mapping(dataset, harmonization_spec)
harmonized_dataset

Unnamed: 0,tissue_or_organ_of_origin,bmi,days_to_birth,age_at_diagnosis,tumor_largest_dimension_diameter,country_of_birth,primary_diagnosis,figo_stage,race,ethnicity,gender,tumor_focality
0,Endometrium,38.88,-23376.00,23376.0,2.9,United States,"Endometrioid adenoma, NOS",Stage IA,white,not hispanic or latino,female,Unifocal
1,Endometrium,39.76,-21184.50,21185.0,3.5,United States,"Endometrioid adenoma, NOS",Stage IA,white,not hispanic or latino,female,Unifocal
2,Endometrium,51.19,-18262.50,18263.0,4.5,United States,"Endometrioid adenoma, NOS",Stage IA,white,not hispanic or latino,female,Unifocal
3,Endometrium,,,,,,"Carcinosarcoma, NOS",,,,,
4,Endometrium,32.69,-27393.75,27394.0,3.5,United States,"Endometrioid adenoma, NOS",Stage IA,white,not hispanic or latino,female,Unifocal
...,...,...,...,...,...,...,...,...,...,...,...,...
99,Endometrium,29.40,-27393.75,27394.0,4.2,Ukraine,"Endometrioid adenoma, NOS",Stage IA,,,female,Unifocal
100,Endometrium,35.42,-27028.50,27029.0,1.5,Ukraine,"Endometrioid adenoma, NOS",Stage III,,,female,Unifocal
101,Endometrium,24.32,-31046.25,31047.0,3.8,United States,"Serous carcinoma, NOS",Stage III,black or african american,not hispanic or latino,female,Unifocal
102,Endometrium,34.06,-25567.50,25568.0,5.0,Ukraine,"Serous carcinoma, NOS",Stage IA,,,female,Unifocal
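
The two `Age` mappers can be checked by hand against the first rows of this output. The sketch below re-applies the same conversions as the lambdas in `user_mappings` to the ages 64, 58, and 50 from the source table. It uses `math.isnan` in place of `pd.isnull` so that it runs without pandas; nothing here is specific to `bdi-kit`.

```python
from math import ceil, isnan


# Same conversions as in user_mappings, written as named functions.
def to_days_to_birth(age):
    return -age * 365.25


def to_age_at_diagnosis(age):
    return float("nan") if isnan(age) else ceil(age * 365.25)


for age in (64.0, 58.0, 50.0):
    print(age, to_days_to_birth(age), to_age_at_diagnosis(age))
# 64.0 -23376.0 23376
# 58.0 -21184.5 21185
# 50.0 -18262.5 18263
```

`ceil` returns integers; the harmonized table shows `age_at_diagnosis` with a trailing `.0`, presumably because the column also holds missing values and is therefore stored as floats.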


For comparison, here is what our original data looked like:

In [22]:
original_columns = [m["source"] for m in harmonization_spec]
dataset[original_columns]

Unnamed: 0,Tumor_Site,BMI,Age,Age.1,Tumor_Size_cm,Country,Histologic_type,FIGO_stage,Race,Ethnicity,Gender,Tumor_Focality
0,Anterior endometrium,38.88,64.0,64.0,2.9,United States,Endometrioid,IA,White,Not-Hispanic or Latino,Female,Unifocal
1,Posterior endometrium,39.76,58.0,58.0,3.5,United States,Endometrioid,IA,White,Not-Hispanic or Latino,Female,Unifocal
2,"Other, specify",51.19,50.0,50.0,4.5,United States,Endometrioid,IA,White,Not-Hispanic or Latino,Female,Unifocal
3,,,,,,,Carcinosarcoma,,,,,
4,"Other, specify",32.69,75.0,75.0,3.5,United States,Endometrioid,IA,White,Not-Hispanic or Latino,Female,Unifocal
...,...,...,...,...,...,...,...,...,...,...,...,...
99,"Other, specify",29.40,75.0,75.0,4.2,Ukraine,Endometrioid,IA,,,Female,Unifocal
100,"Other, specify",35.42,74.0,74.0,1.5,Ukraine,Endometrioid,II,,,Female,Unifocal
101,"Other, specify",24.32,85.0,85.0,3.8,United States,Serous,II,Black or African American,Not-Hispanic or Latino,Female,Unifocal
102,"Other, specify",34.06,70.0,70.0,5.0,Ukraine,Serous,IA,,,Female,Unifocal
