# Getting Started

First, import the `bdikit` library.

In [2]:
import bdikit as bdi
import pandas as pd

In this example, we are mapping data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/) to the GDC format.

In [3]:
dataset = pd.read_csv("./datasets/dou.csv")

columns = [
    "Country",
    "Histologic_type",
    "FIGO_stage",
    "BMI",
    "Age",
    "Race",
    "Ethnicity",
    "Gender",
    "Tumor_Focality",
    "Tumor_Size_cm",
]

dataset.head(10)

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
5,United States,,Serous,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,20.28,63.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,6.0
6,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,55.67,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
7,Other_specify,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,25.68,60.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,5.0
8,United States,,Serous,pT3a (FIGO IIIA),pNX,cM0,Staging Incomplete,Stage III,IIIA,21.57,83.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.0
9,United States,FIGO grade 1,Endometrioid,pT1 (FIGO I),pN0,cM0,Staging Incomplete,Stage I,IA,34.26,69.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,5.2


### Matching the table schema to GDC standard vocabulary

`bdi-kit` offers a suite of functions to help with data harmonization tasks.
For instance, it can help with automatic discovery of one-to-one mappings between the columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a standard data vocabulary such as the GDC (Genomic Data Commons).
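
To make the idea of a one-to-one column mapping concrete, here is a minimal sketch that matches columns purely by name similarity, using Python's standard `difflib`. It is not the method `bdi-kit` uses; the helper names `name_similarity` and `match_columns_by_name` are hypothetical and only illustrate what a one-to-one assignment looks like.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_columns_by_name(source_columns, target_columns):
    """Greedy one-to-one matching: the best-scoring pairs are fixed first."""
    scored = sorted(
        (
            (name_similarity(s, t), s, t)
            for s in source_columns
            for t in target_columns
        ),
        reverse=True,
    )
    used_sources, used_targets, mapping = set(), set(), {}
    for _score, s, t in scored:
        # Each source and each target column may be used at most once.
        if s not in used_sources and t not in used_targets:
            mapping[s] = t
            used_sources.add(s)
            used_targets.add(t)
    return mapping


# Toy example: a few source columns and a small, hypothetical target schema.
source_columns = ["Race", "Ethnicity", "Gender"]
target_columns = ["gender", "race", "ethnicity", "figo_stage"]
name_mapping = match_columns_by_name(source_columns, target_columns)
print(name_mapping)
```

Name-only matching breaks down when source and target use different terminology, which is one reason to prefer methods that also look at the column values.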

To achieve this using `bdi-kit`, we can use the `match_schema()` function to match columns to the GDC vocabulary schema as follows.

The GDC schema contains XXX attributes and current SOTA is not able to handle ...  explain why we need this model. point to a description of the model.

Note: This step requires the model to be downloaded, which may take a few minutes.

In [5]:
bdi.match_schema(dataset[columns], target="gdc", method="ct_learning")

Extracting features from 10 columns...


  0%|          | 0/10 [00:00<?, ?it/s]

Table features loaded for 734 columns


Unnamed: 0,source,target
0,Country,country_of_birth
1,Histologic_type,history_of_tumor_type
2,FIGO_stage,figo_stage
3,BMI,average_base_quality
4,Age,age_at_diagnosis
5,Race,race
6,Ethnicity,ethnicity
7,Gender,gender
8,Tumor_Focality,tumor_focality
9,Tumor_Size_cm,tumor_width_measurement


In [4]:
column_mappings = bdi.match_schema(dataset, target="gdc", method="ct_learning")
column_mappings

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 17/17 [00:00<00:00, 25.70it/s]


Table features extracted from 17 columns
Table features loaded for 734 columns
Distances (cosine): [1.10394392 0.98914557 0.92537964 0.94752893 1.05914744 1.08939152
 1.04157806 1.04617294 0.97803455 1.02611567 0.92811078 0.98646065
 0.92494871 0.82554179 1.08864448 1.11371916 0.9982571  0.88276508
 1.03046031 1.14162502 0.98299454 0.89779826 0.9763337  1.07523165
 0.89635589 1.06016441 1.0512307  1.02413479 0.91700193 1.08814005
 0.98184121 1.12561218 1.1200962  1.01722983 1.0703122  1.12627641
 1.07342407 1.05760935 1.1052211  1.20270863 1.05279379 1.08772562
 1.07630705 1.03543143 1.10607188 0.86132086 0.92440717 1.01018515
 0.93720262 1.00066075 0.9507775  0.8202722  0.93760846 1.00147017
 1.09045534 1.07474188 0.97043665 1.06990152 1.08130116 1.03859922
 1.09030101 0.94105161 1.01664633 1.01230161 0.99208111 0.83868857
 1.03597787 0.91045015 1.04946121 1.06219909 0.99042535 1.17921298
 1.06863905 1.11142212 1.1048234  1.04655936 1.07739838 1.02920307
 0.99854448 1.12527391 1.06252

Unnamed: 0,source,target
0,Country,country_of_birth
1,Histologic_Grade_FIGO,tumor_grade
2,Histologic_type,history_of_tumor_type
3,Path_Stage_Primary_Tumor-pT,uicc_pathologic_t
4,Path_Stage_Reg_Lymph_Nodes-pN,ajcc_pathologic_n
5,Clin_Stage_Dist_Mets-cM,uicc_pathologic_m
6,Path_Stage_Dist_Mets-pM,uicc_pathologic_m
7,tumor_Stage-Pathological,ajcc_pathologic_stage
8,FIGO_stage,figo_stage
9,BMI,bmi


In [5]:
bdi.top_matches(dataset, columns=["BMI"], target="gdc")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 1/1 [00:00<00:00, 15.27it/s]


Table features extracted from 1 columns
Table features loaded for 734 columns
Distances (cosine): [0.62180854 0.95828624 0.97060146 1.06355793 1.05457518 0.95031876
 0.87183141 0.98037726 1.08884842 0.59818588 0.75152265 0.83261097
 0.80213099 0.77332142 0.87859477 1.16856598 0.83399758 1.10548815
 0.95137044 1.08006319 0.84199771 1.13298833 1.02036063 1.12404512
 1.19349639 0.82890662 0.8390033  0.89459161 0.89116058 0.87583417
 0.85418167 0.88809974 0.93308411 1.1451897  0.82541944 1.21806511
 1.22505213 1.12013957 0.84643531 0.97380871 0.9138511  0.8798859
 0.96106713 1.1585152  0.95307048 0.80630147 0.83896481 0.96793791
 0.9969389  0.95977895 1.07777176 1.03366965 1.11084185 1.0039425
 1.02620412 0.9678424  0.97017597 0.99921509 0.2632313  0.87646072
 0.88057175 1.06924909 0.95787864 0.78496797 0.63093806 0.96951553
 1.06129685 1.04936767 0.86532366 0.98907561 0.86559347 0.85613342
 0.88012961 1.02773534 0.90308327 0.87087699 0.28419954 0.38251533
 0.80052387 0.87840672 0.93036599

Unnamed: 0,source,target,similarity
0,BMI,average_base_quality,0.736769
1,BMI,bmi,0.7158
2,BMI,body_surface_area,0.617485
3,BMI,intermediate_dimension,0.602427
4,BMI,recist_targeted_regions_sum,0.576925
5,BMI,longest_dimension,0.494575
6,BMI,percent_stromal_cells,0.424426
7,BMI,sequencing_date,0.42442
8,BMI,spindle_cell_percent,0.418992
9,BMI,pmid,0.412501
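
In the output above, the reported similarities appear to equal one minus the cosine distance (for example, the smallest distance, 0.2632313, pairs with the top similarity, 0.736769). The sketch below shows, under that assumption, how candidates could be ranked by cosine similarity. The vectors and column names are made up for illustration, and `cosine_similarity` and `top_k` are hypothetical helpers, not part of `bdi-kit`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, candidates, k=3):
    """Return the k candidate names with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional column embeddings.
source_vec = [1.0, 0.0, 1.0]
target_vecs = {
    "col_a": [1.0, 0.0, 1.0],  # same direction as the source
    "col_b": [1.0, 1.0, 0.0],  # partially aligned
    "col_c": [0.0, 1.0, 0.0],  # orthogonal
}

ranked = top_k(source_vec, target_vecs, k=2)
for name, similarity in ranked:
    # Cosine distance is one minus cosine similarity.
    print(name, round(similarity, 3), round(1.0 - similarity, 3))
```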


In [None]:
bdi.match_schema(dataset[["BMI"]], target="gdc", method="ct_learning")

In [None]:
bdi.top_matches(dataset, columns=["Histologic_type"], target="gdc")

In [None]:
bdi.match_schema(dataset[["FIGO_stage", "BMI"]], target="gdc", method="ct_learning")

In [None]:
column_mappings = bdi.match_schema(dataset[columns], target="gdc", method="two_phase")
column_mappings

### Generating a harmonized table

After discovering a schema mapping, we can generate a new table (DataFrame) using the new column names from the GDC standard vocabulary.

To do so using `bdi-kit`, we can use the function `materialize_mapping()` as follows. Note that the column headers have been renamed to the target schema.

In [None]:
bdi.materialize_mapping(dataset, column_mappings)
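
Conceptually, applying a column mapping amounts to selecting the mapped source columns and renaming them. The following pandas-only sketch shows that idea on a toy table. It assumes the mapping is a DataFrame with `source` and `target` columns, as in the output shown earlier; `rename_to_target` is a hypothetical helper and may differ from what `materialize_mapping()` does internally.

```python
import pandas as pd


def rename_to_target(df: pd.DataFrame, mappings: pd.DataFrame) -> pd.DataFrame:
    """Keep only the mapped source columns and rename them to their target names."""
    rename_map = dict(zip(mappings["source"], mappings["target"]))
    return df[list(rename_map)].rename(columns=rename_map)


toy = pd.DataFrame(
    {"FIGO_stage": ["IA", "IIIA"], "BMI": [38.88, 21.57], "Extra": [1, 2]}
)
toy_mappings = pd.DataFrame(
    {"source": ["FIGO_stage", "BMI"], "target": ["figo_stage", "bmi"]}
)

renamed = rename_to_target(toy, toy_mappings)
print(renamed.columns.tolist())  # ['figo_stage', 'bmi']
```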

### Generating a harmonized table with value mappings

`bdi-kit` can also help with translation of the values from the source table to the target standard format.

To this end, `bdi-kit` provides the function `match_values()` that automatically creates value mappings for each string column.
The output of `match_values()` can be fed to `materialize_mapping()`, which materializes the final target table using both the schema and value mappings.

In [None]:
# JF: why do we have so many "None" in dysplasia_type?
# and why do we have fewer columns -- the previous dataframe has 10 attributes
# and this has 7
value_mappings = bdi.match_values(dataset, column_mapping=column_mappings, target="gdc", method="tfidf")
bdi.materialize_mapping(dataset, value_mappings)
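
As a rough illustration of what a value mapping does to a single column, the sketch below translates source values with a plain dictionary and pandas' `Series.map`. The dictionary and the toy values are assumptions made for the example. In this sketch (not necessarily in `bdi-kit`), any value without an entry in the dictionary becomes missing.

```python
import pandas as pd

# Hypothetical value mapping for one column: source value -> target value.
gender_vmap = {"Female": "female", "Male": "male"}

source = pd.Series(["Female", "Male", "Unknown", None], name="Gender")

# Values that are not keys of the dictionary are mapped to NaN.
translated = source.map(gender_vmap).rename("gender")
print(translated)
```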

### Verifying the schema mappings

Sometimes the automatically generated mappings are incorrect, or you may want to verify them individually.
To verify the suggested column mappings, `bdi-kit` offers additional APIs to visualize the data and make any modifications when necessary. 

For this example, we will use the column `Histologic_type`. We can start by exploring the columns most similar to `Histologic_type`. 

For this, we can use the `top_matches()` function. Here, we notice that `primary_diagnosis` could be a potential target column.


In [None]:
hist_type_matches = bdi.top_matches(dataset, columns=["Histologic_type"], sample=True, attrib_desc=True, target="gdc")
hist_type_matches

### Viewing the column domains

To verify that `primary_diagnosis` is a good target column, we view and compare the domains of each column using the `preview_domain()` function. For the source table, it returns the list of unique values in the source column. For the GDC target, it returns the list of unique valid values that a column can have.

Here we see that the values seem to be related.

In [None]:
bdi.preview_domain(dataset, "Histologic_type")

In [None]:
bdi.preview_domain("gdc", "primary_diagnosis")
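
One simple way to judge whether two domains are related is to check, for each distinct source value, which target values contain it. The sketch below does this with plain Python on a handful of values taken from this example. The helpers `unique_non_null` and `related_target_values` are hypothetical, and the target list is only a small excerpt used for illustration.

```python
def unique_non_null(values):
    """Distinct values in first-seen order, skipping missing entries."""
    return list(dict.fromkeys(v for v in values if v is not None))


def related_target_values(source_value, domain):
    """Target values that contain the source value (case-insensitive)."""
    needle = source_value.lower()
    return [t for t in domain if needle in t.lower()]


source_values = ["Endometrioid", "Serous", "Carcinosarcoma", "Endometrioid", None]
target_domain = [
    "Carcinosarcoma, NOS",
    "Clear cell adenocarcinoma, NOS",
    "Endometrioid carcinoma",
    "Serous cystadenocarcinoma",
]

source_domain = unique_non_null(source_values)
overlap = {s: related_target_values(s, target_domain) for s in source_domain}
for source_value, candidates in overlap.items():
    print(f"{source_value} -> {candidates}")
```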

#### JF: actually, I don't see why this is a correct match, maybe we should have another function that looks for probable matches -- given the values for the first column, return a sample of the second column with similar values. 
Since `primary_diagnosis` looks like a correct match for `Histologic_type`, we can modify the `column_mappings` variable directly.

In [None]:
column_mappings.loc[column_mappings["source"] == "Histologic_type", "target"] = "primary_diagnosis"
column_mappings

### Finding correct value mappings

After finding the correct column, we need to find appropriate value mappings. 
Using `preview_value_mappings()`, we can inspect what the possible value mappings for this would look like after the harmonization.

`bdi-kit` implements multiple methods for value mapping discovery, including:

 - `edit_distance` - Computes value similarities using the Levenshtein edit distance.
 - `tfidf` - A method based on tf-idf importance weighting computed over character n-grams.
 - `embedding` - Uses BERT word embeddings to compute "semantic similarity" between the values.

To specify a value mapping approach, we can pass the `method` parameter.
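
To give a feel for the edit-distance approach, here is a small, self-contained sketch of Levenshtein distance turned into a similarity score. It is an illustration only, not the `bdi-kit` implementation; the function names and the candidate values are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def best_match(value: str, candidates):
    """Candidate with the highest edit similarity to the given value."""
    return max(candidates, key=lambda candidate: edit_similarity(value, candidate))


# Hypothetical candidate values for illustration.
print(best_match("Unifocal", ["Multifocal", "Unifocal", "Unknown"]))
```

Edit distance works well for spelling and casing differences, but it cannot relate values that share meaning without sharing characters.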

In [None]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="edit_distance"
)

In [None]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="tfidf"
)
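
The tf-idf approach can be sketched in plain Python as follows: each value is split into character trigrams, the trigrams are weighted by a smoothed inverse document frequency, and values are compared by cosine similarity. This is a simplified illustration under those assumptions, not the code `bdi-kit` runs; all helper names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Overlapping character n-grams of the lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(documents, n: int = 3):
    """One L2-normalized sparse tf-idf vector (a dict) per input string."""
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    doc_freq = Counter(gram for count in counts for gram in count)
    total = len(documents)
    vectors = []
    for count in counts:
        # Smoothed idf: rare n-grams get a higher weight than common ones.
        vec = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in count.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def sparse_cosine(u, v) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


values = ["Serous", "Serous cystadenocarcinoma", "Endometrioid carcinoma"]
query_vec, *candidate_vecs = tfidf_vectors(values)
scores = [sparse_cosine(query_vec, vec) for vec in candidate_vecs]
print(scores)
```

Here the first candidate shares every trigram of the query, while the second shares none, so only the first receives a positive score.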

In [None]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="embedding"
)

In [None]:
# JF add context  - -manual map
hist_type_vmap = pd.DataFrame(
    columns=["source", "target"],
    data=[
        ("Carcinosarcoma", "Carcinosarcoma, NOS"),
        ("Clear cell", "Clear cell adenocarcinoma, NOS"),
        ("Endometrioid", "Endometrioid carcinoma"),
        ("Serous", "Serous cystadenocarcinoma"),
    ],
)
hist_type_vmap

### Verifying multiple value mappings at once

Besides verifying value mappings individually, you can also do it for all column mappings at once.

In [None]:
mappings = bdi.preview_value_mappings(
    dataset,
    column_mapping=column_mappings,
    target="gdc",
    method="tfidf",
)

for mapping in mappings:
    print(f"{mapping['source']} => {mapping['target']}")
    display(mapping["mapping"])
    print("")

### Fixing remaining value mappings

We need to fix a few value mappings:
- Race
- Ethnicity
- Tumor_Site

For `Race`, we need to fix: `nan` -> `american indian or alaska native`.

In [None]:
race_vmap = bdi.preview_value_mappings(
    dataset,
    column_mapping=("Race", "race"),
    target="gdc",
    method="tfidf",
)
race_vmap

In [None]:
race_vmap = race_vmap[race_vmap["similarity"] >= 1.0]
race_vmap
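
The filtering step above keeps only rows whose similarity is exactly 1.0, which drops low-confidence suggestions such as the `nan` row. The toy example below reproduces that pattern with pandas alone. The similarity scores are invented for illustration, and the column layout (`source`, `target`, `similarity`) is assumed from the code in this section.

```python
import pandas as pd

# Toy value-mapping table with made-up similarity scores.
toy_vmap = pd.DataFrame(
    {
        "source": ["White", "Asian", None],
        "target": ["white", "asian", "american indian or alaska native"],
        "similarity": [1.0, 1.0, 0.2],
    }
)

# Keep only the rows with a perfect similarity score.
confident = toy_vmap[toy_vmap["similarity"] >= 1.0]
print(confident)
```

A slightly lower threshold (for example `> 0.9`) is more forgiving when scores are computed in floating point and near-exact matches should still be accepted.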

For `Ethnicity`, we need to fix: `Not reported` -> `not hispanic or latino`.

In [None]:
ethnicity_vmap = bdi.preview_value_mappings(
    dataset,
    column_mapping=("Ethnicity", "ethnicity"),
    target="gdc",
    method="tfidf",
)
ethnicity_vmap


In [None]:
ethnicity_vmap = ethnicity_vmap[ethnicity_vmap["similarity"] > 0.9]
ethnicity_vmap

For `Tumor_Site`, given that this dataset is about endometrial cancer, all values must be mapped to "Endometrium". So instead of fixing each mapping individually, we will write a custom function that returns "Endometrium" regardless of the input value. Later, we will show how to use this function to transform the dataset.

In [None]:
bdi.preview_value_mappings(
    dataset, column_mapping=("Tumor_Site", "tissue_or_organ_of_origin"), target="gdc", method="tfidf"
)

In [None]:
# Custom mapping function that will be used to map the values of the 'Tumor_Site' column
# JF: why does this always return Endometrium?
def map_tumor_site(source_value):
    return "Endometrium"

### Combining custom user mappings with suggested mappings

Before generating the final harmonized dataset, we can combine the automatically generated value mappings with the fixed mappings provided by the user. To do so, we use the `bdi.update_mappings()` function, which takes a list of mappings (e.g., generated automatically) and a list of user-defined mapping overrides. The overrides are combined with the first list and take precedence whenever the two conflict.

In our example below, all mappings specified in the variable `user_mappings` will override the mappings in `value_mappings` generated by the `bdi.match_values()` function.
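
The override semantics can be pictured as a dictionary merge. The sketch below assumes that two mappings conflict when they share the same `(source, target)` pair; this is an assumption for illustration, as are the helper name `merge_mappings` and the `"matches"` key, none of which are taken from the `bdi-kit` API.

```python
def merge_mappings(base, overrides):
    """Combine two lists of mapping dicts; overrides win on the same (source, target)."""
    merged = {(m["source"], m["target"]): m for m in base}
    for m in overrides:
        merged[(m["source"], m["target"])] = m
    return list(merged.values())


# Hypothetical automatically suggested mappings.
suggested = [
    {"source": "Race", "target": "race", "matches": {"White": "white"}},
    {
        "source": "Tumor_Site",
        "target": "tissue_or_organ_of_origin",
        "matches": {"Anterior endometrium": "Endometrium"},
    },
]

# User-defined overrides: one replaces a suggestion, one adds a new mapping.
overrides = [
    {
        "source": "Tumor_Site",
        "target": "tissue_or_organ_of_origin",
        "mapper": lambda value: "Endometrium",
    },
    {"source": "Age", "target": "age_at_diagnosis"},
]

combined = merge_mappings(suggested, overrides)
for mapping in combined:
    print(mapping["source"], "=>", mapping["target"], "(custom mapper)" if "mapper" in mapping else "")
```

Keying on the pair rather than on the source column alone allows one source column, such as `Age`, to feed several target columns.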

In [None]:
from math import ceil

user_mappings = [
    {
        "source": "Tumor_Site",
        "target": "tissue_or_organ_of_origin",
        "mapper": map_tumor_site,
    },
    {
        "source": "BMI",
        "target": "bmi",
    },
    {
        "source": "Age",
        "target": "days_to_birth",
        "mapper": lambda age: -age * 365.25,
    },
    {
        "source": "Age",
        "target": "age_at_diagnosis",
        "mapper": lambda age: float("nan") if pd.isnull(age) else ceil(age*365.25),
    },
    {
        "source": "Tumor_Size_cm",
        "target": "tumor_largest_dimension_diameter",
    }
]

value_mappings = bdi.match_values(
    dataset, target="gdc", column_mapping=column_mappings, method="tfidf"
)

harmonization_spec = bdi.update_mappings(value_mappings, user_mappings)
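
The two `Age` mappers above convert an age in years into a number of days, using 365.25 days per year. As a quick worked check, the standalone sketch below mirrors both lambdas with only the standard library; the function names are hypothetical.

```python
import math

DAYS_PER_YEAR = 365.25


def age_at_diagnosis_days(age_years: float) -> float:
    """Age in days, rounded up; missing ages stay missing (NaN)."""
    if math.isnan(age_years):
        return float("nan")
    return math.ceil(age_years * DAYS_PER_YEAR)


def days_to_birth(age_years: float) -> float:
    """Negative day offset from the reference date back to birth."""
    return -age_years * DAYS_PER_YEAR


print(age_at_diagnosis_days(64.0))  # 23376
print(age_at_diagnosis_days(58.0))  # 21185 (21184.5 rounded up)
print(days_to_birth(64.0))          # -23376.0
```

The missing-value guard matters because `math.ceil` raises an error on NaN, whereas plain multiplication simply propagates it.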


Finally, we generate the harmonized dataset, with the user-defined value mappings.

In [None]:
# JF: are there still incorrect matches and mappings?
harmonized_dataset = bdi.materialize_mapping(dataset, harmonization_spec)
harmonized_dataset

For comparison, here is what our original data looked like:

In [None]:
# Build a concrete list of source column names; a lazy map object is not a reliable DataFrame indexer
original_columns = [m["source"] for m in harmonization_spec]
dataset[original_columns]