# Getting Started

## Data Harmonization with `bdikit`

Data harmonization is the process of integrating and aligning data from different sources into a consistent format to ensure compatibility and interoperability across data analyses and systems. `bdikit` is a library that helps with key data harmonization steps:
- *Schema Mapping*: In this step, data from various sources are mapped to a unified schema or model. This involves identifying equivalent table columns and establishing relationships between disparate datasets.
- *Value Mapping (Data Standardization)*: This step involves converting data into a common format or structure, using consistent naming conventions, units, and coding systems to ensure uniformity.
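
As a rough illustration of these two steps, independent of `bdikit`, the following pandas sketch renames a column to a target schema and then standardizes its values. The toy table, the target name `gender`, and the value dictionary are all invented for this example.

```python
import pandas as pd

# Toy source table; the column name and values are made up for illustration.
source = pd.DataFrame({"Sex": ["F", "M", "F"]})

# Schema mapping: source column name -> target column name.
schema_mapping = {"Sex": "gender"}

# Value mapping: source value -> standardized target value.
value_mapping = {"F": "female", "M": "male"}

# Step 1: rename columns to the target schema.
harmonized = source.rename(columns=schema_mapping)
# Step 2: translate each value into the target vocabulary.
harmonized["gender"] = harmonized["gender"].map(value_mapping)
print(harmonized)
```

The rest of this guide shows how `bdikit` helps discover such mappings automatically instead of writing them by hand.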

In this example, we describe how `bdikit` can be used to map a dataset from [Dou et al.](https://pubmed.ncbi.nlm.nih.gov/37567170/) to the [GDC (Genomic Data Commons)](https://portal.gdc.cancer.gov/) standard data format.

First, import the `bdikit` library.

In [1]:
import bdikit as bdi
import pandas as pd

Next, we load the data using Pandas.

In [2]:
dataset = pd.read_csv("./datasets/dou.csv")

# Columns from the source table that we want to harmonize in this example.
columns = [
    "Country",
    "Path_Stage_Primary_Tumor-pT",
    "Histologic_type",
    "FIGO_stage",
    "BMI",
    "Age",
    "Race",
    "Ethnicity",
    "Gender",
    "Tumor_Focality",
    "Tumor_Size_cm",
]
# dataset = dataset[columns]
dataset.head(10)

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
5,United States,,Serous,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,20.28,63.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,6.0
6,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,55.67,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
7,Other_specify,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,25.68,60.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,5.0
8,United States,,Serous,pT3a (FIGO IIIA),pNX,cM0,Staging Incomplete,Stage III,IIIA,21.57,83.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.0
9,United States,FIGO grade 1,Endometrioid,pT1 (FIGO I),pN0,cM0,Staging Incomplete,Stage I,IA,34.26,69.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,5.2


In [3]:
dataset

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,Ukraine,FIGO grade 3,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,29.40,75.0,,,Female,"Other, specify",Unifocal,4.2
100,Ukraine,FIGO grade 2,Endometrioid,pT2 (FIGO II),pN0,cM0,Staging Incomplete,Stage II,II,35.42,74.0,,,Female,"Other, specify",Unifocal,1.5
101,United States,,Serous,pT2 (FIGO II),pN0,Staging Incomplete,Staging Incomplete,Stage II,II,24.32,85.0,Black or African American,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.8
102,Ukraine,,Serous,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,34.06,70.0,,,Female,"Other, specify",Unifocal,5.0


### Matching the table schema to GDC standard vocabulary

`bdi-kit` offers a suite of functions to help with data harmonization tasks.
For instance, it can help with automatic discovery of one-to-one mappings between the columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a standard data vocabulary such as the GDC (Genomic Data Commons).

To achieve this using `bdi-kit`, we can use the `match_schema()` function to match columns to the GDC vocabulary schema as follows.

In [4]:
column_mappings = bdi.match_schema(dataset[columns], target="gdc", method="two_phase")
column_mappings

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/11 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
100%|██████████| 11/11 [00:01<00:00,  9.78it/s]


Table features extracted from 11 columns


100%|██████████| 734/734 [00:55<00:00, 13.32it/s]


Table features extracted from 734 columns

Source column:  Country
value_matches:  [ValueMatch(current_value='United States', target_value='Federated States of Micronesia', similarity=0.541)]
source: Country target: country_of_birth score: 0.541
value_matches:  [ValueMatch(current_value='Other_specify', target_value='other', similarity=0.508), ValueMatch(current_value='nan', target_value='american indian or alaska native', similarity=0.332), ValueMatch(current_value='United States', target_value='white', similarity=0.313), ValueMatch(current_value='Poland', target_value='native hawaiian or other pacific islander', similarity=0.27)]
source: Country target: race score: 1.423
value_matches:  [ValueMatch(current_value='Poland', target_value='Andorra', similarity=0.317), ValueMatch(current_value='United States', target_value='Guatemala', similarity=0.261), ValueMatch(current_value='Other_specify', target_value='Jersey', similarity=0.253)]
source: Country target: country_of_residence_at_enro

Unnamed: 0,source,target
0,Country,race
1,Path_Stage_Primary_Tumor-pT,ajcc_pathologic_t
2,Histologic_type,primary_diagnosis
3,FIGO_stage,figo_stage
4,BMI,average_base_quality
5,Age,age_at_last_exposure
6,Race,race
7,Ethnicity,ethnicity
8,Gender,primary_site
9,Tumor_Focality,tumor_focality
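
To build intuition for what a schema matcher does, here is a deliberately naive baseline that compares column names only, using `difflib` from the standard library. It is not the `two_phase` method used above, and the helper `best_name_match` and the short target list are hypothetical.

```python
from difflib import SequenceMatcher


def best_name_match(source_column, target_columns):
    """Return (target, score) for the target name most similar to source_column."""
    scored = [
        (target, SequenceMatcher(None, source_column.lower(), target.lower()).ratio())
        for target in target_columns
    ]
    return max(scored, key=lambda pair: pair[1])


# A tiny, hand-picked subset of target names (assumption for this sketch).
targets = ["race", "ethnicity", "gender", "figo_stage"]
for column in ["Race", "Gender", "FIGO_stage"]:
    print(column, "->", best_name_match(column, targets))
```

A name-only matcher works when source and target names are nearly identical, but it has no access to the column values. Automatic matchers can still produce questionable pairs, such as `BMI` and `average_base_quality` in the output above, which is why the verification steps later in this guide matter.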


### Generating a harmonized table

After discovering a schema mapping, we can generate a new table (DataFrame) using the new column names from the GDC standard vocabulary.

To do so using `bdi-kit`, we can use the function `materialize_mapping()` as follows. Note that the column headers have been renamed to the target schema.

In [4]:
bdi.materialize_mapping(dataset, column_mappings)

Unnamed: 0,country_of_birth,history_of_tumor_type,figo_stage,average_base_quality,age_at_diagnosis,race,ethnicity,gender,tumor_focality,tumor_width_measurement,tumor_level_prostate
0,United States,Endometrioid,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Unifocal,2.9,Anterior endometrium
1,United States,Endometrioid,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Unifocal,3.5,Posterior endometrium
2,United States,Endometrioid,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,Unifocal,4.5,"Other, specify"
3,,Carcinosarcoma,,,,,,,,,
4,United States,Endometrioid,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,Unifocal,3.5,"Other, specify"
...,...,...,...,...,...,...,...,...,...,...,...
99,Ukraine,Endometrioid,IA,29.40,75.0,,,Female,Unifocal,4.2,"Other, specify"
100,Ukraine,Endometrioid,II,35.42,74.0,,,Female,Unifocal,1.5,"Other, specify"
101,United States,Serous,II,24.32,85.0,Black or African American,Not-Hispanic or Latino,Female,Unifocal,3.8,"Other, specify"
102,Ukraine,Serous,IA,34.06,70.0,,,Female,Unifocal,5.0,"Other, specify"
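
Conceptually, materializing a column mapping amounts to selecting the mapped source columns and renaming them. The pandas sketch below shows that idea with a hypothetical helper, `rename_with_mapping`, and a toy table. The actual behavior of `materialize_mapping()` may differ, for example in how unmapped columns are handled.

```python
import pandas as pd


def rename_with_mapping(table, column_mappings):
    """Keep only the mapped source columns and rename them to their targets."""
    rename = dict(zip(column_mappings["source"], column_mappings["target"]))
    return table[list(rename)].rename(columns=rename)


# Toy data; "Extra" has no mapping and is therefore dropped.
table = pd.DataFrame({"Race": ["White"], "Gender": ["Female"], "Extra": [1]})
toy_mappings = pd.DataFrame(
    {"source": ["Race", "Gender"], "target": ["race", "gender"]}
)
renamed = rename_with_mapping(table, toy_mappings)
print(renamed)
```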


### Generating a harmonized table with value mappings

`bdi-kit` can also help with translation of the values from the source table to the target standard format.

To this end, `bdi-kit` provides the function `match_values()` that automatically creates value mappings for each string column.
The output of `match_values()` can be fed to `materialize_mapping()`, which materializes the final target table using both schema and value mappings.

In [5]:
value_mappings = bdi.match_values(dataset, column_mapping=column_mappings, target="gdc", method="tfidf")
bdi.materialize_mapping(dataset, value_mappings)

## Verifying and Correcting Automatic Mappings

### Verifying the schema mappings

Sometimes the automatically generated mappings are incorrect, or you may want to verify them individually.
To verify the suggested column mappings, `bdi-kit` offers additional APIs to visualize the data and make modifications when necessary.

For this example, we will use the column `Histologic_type`. We can start by exploring the columns most similar to `Histologic_type`. 

For this, we can use the `top_matches()` function. Here, we notice that `primary_diagnosis` could be a potential target column.


In [6]:
hist_type_matches = bdi.top_matches(dataset, columns=["Histologic_type"], target="gdc", top_k=40)
hist_type_matches

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 1/1 [00:00<00:00, 17.62it/s]


Table features extracted from 1 columns


100%|██████████| 734/734 [00:55<00:00, 13.25it/s]

Table features extracted from 734 columns





Unnamed: 0,source,target,similarity
0,Histologic_type,disease_type,0.539168
1,Histologic_type,sample_type,0.530217
2,Histologic_type,roots,0.525866
3,Histologic_type,history_of_tumor_type,0.524959
4,Histologic_type,additional_pathology_findings,0.517125
5,Histologic_type,specimen_type,0.511611
6,Histologic_type,morphologic_architectural_pattern,0.482257
7,Histologic_type,histone_variant,0.478656
8,Histologic_type,viral_hepatitis_serologies,0.471346
9,Histologic_type,chromosome,0.471309


### Viewing the column domains

To verify that `primary_diagnosis` is a good target column, we view and compare the domains of each column using the `preview_domain()` function. For the source table, it returns the list of unique values in the source column. For the GDC target, it returns the list of unique valid values that a column can have.

Here we see that the values seem to be related.

In [None]:
bdi.preview_domain(dataset, "Histologic_type")

Unnamed: 0,value_name
0,Endometrioid
1,Carcinosarcoma
2,Serous
3,Clear cell


In [None]:
bdi.preview_domain("gdc", "primary_diagnosis")

Unnamed: 0,value_name,value_description,column_description
0,Abdominal desmoid,An insidious poorly circumscribed neoplasm ari...,Text term used to describe the patient's histo...
1,Abdominal fibromatosis,An insidious poorly circumscribed neoplasm ari...,
2,Achromic nevus,A benign nevus characterized by the absence of...,
3,Acidophil adenocarcinoma,A malignant epithelial neoplasm of the anterio...,
4,Acidophil adenoma,An epithelial neoplasm of the anterior pituita...,
...,...,...,...
2620,Wolffian duct tumor,An epithelial neoplasm of the female reproduct...,
2621,Xanthofibroma,A benign neoplasm composed of fibroblastic spi...,
2622,Yolk sac tumor,A non-seminomatous malignant germ cell tumor c...,
2623,Unknown,"Not known, not observed, not recorded, or refu...",


Since `primary_diagnosis` looks like a correct match for `Histologic_type`, we can modify the `column_mappings` variable directly.

In [None]:
column_mappings.loc[column_mappings["source"] == "Histologic_type", "target"] = "primary_diagnosis"
column_mappings

Unnamed: 0,source,target
0,Country,country_of_birth
1,Histologic_type,primary_diagnosis
2,FIGO_stage,irs_stage
3,BMI,age_at_diagnosis
4,Age,weight
5,Race,race
6,Ethnicity,ethnicity
7,Gender,gender
8,Tumor_Focality,tumor_focality
9,Tumor_Size_cm,tumor_depth


### Finding correct value mappings

After finding the correct column, we need to find appropriate value mappings. 
Using `match_values()`, we can inspect what the possible value mappings for this would look like after the harmonization.

`bdi-kit` implements multiple methods for value mapping discovery, including:

 - `edit_distance` - Computes value similarities using the Levenshtein edit distance measure.
 - `tfidf` - A method based on tf-idf importance weighting computed over character n-grams.
 - `embedding` - Uses BERT word embeddings to compute "semantic similarity" between the values.

To specify a value mapping approach, we can pass the `method` parameter.

In [None]:
bdi.match_values(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="edit_distance"
)

Unnamed: 0,source,target,similarity
0,Carcinosarcoma,"Carcinosarcoma, NOS",0.848485
1,Clear cell,Clear cell adenoma,0.714286
2,Endometrioid,Stromal endometriosis,0.666667
3,Serous,Neuronevus,0.625
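
For readers unfamiliar with edit distance, the following self-contained sketch computes the classic Levenshtein distance and turns it into a similarity in the range 0 to 1. The normalization by the longer string's length is an assumption chosen for illustration; it is not necessarily the formula `bdikit` uses, so the scores will not reproduce the table above.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("Serous", "Serous"))    # 1.0
```

Because the measure only counts character edits, it can pair values that share letters but not meaning, as with `Serous` and `Neuronevus` in the output above.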


In [None]:
bdi.match_values(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="tfidf"
)

Unnamed: 0,source,target,similarity
0,Carcinosarcoma,"Carcinosarcoma, NOS",0.969
1,Endometrioid,"Endometrioid adenoma, NOS",0.897
2,Clear cell,Clear cell adenoma,0.853
3,Serous,"Serous carcinoma, NOS",0.755
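
The idea behind tf-idf over character n-grams can be sketched in plain Python: each value becomes a weighted bag of character trigrams, and two values are compared by cosine similarity. The padding, the smoothed idf formula, and the helper names below are assumptions for this sketch, not a description of the `bdikit` implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(values, n=3):
    """One sparse tf-idf vector (a dict) per value, using a smoothed idf."""
    grams = [Counter(char_ngrams(value, n)) for value in values]
    doc_freq = Counter(gram for counts in grams for gram in counts)
    total = len(values)
    vectors = []
    for counts in grams:
        vectors.append({
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


values = ["Serous", "Serous carcinoma, NOS", "Neuronevus"]
serous, carcinoma, neuronevus = tfidf_vectors(values)
# Shared trigrams make "Serous carcinoma, NOS" the closer candidate.
print(cosine(serous, carcinoma) > cosine(serous, neuronevus))
```

Matching on shared n-grams rewards candidates that contain the source value as a substring, which is consistent with `Serous` being paired with `Serous carcinoma, NOS` by the `tfidf` method above.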


In [None]:
bdi.match_values(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="embedding"
)

Unnamed: 0,source,target,similarity
0,Carcinosarcoma,Carcinofibroma,0.919
1,Endometrioid,Endometrioid cystadenocarcinoma,0.81
2,Clear cell,Clear cell carcinoma,0.76
3,Serous,Serous cystoma,0.661

None of the methods produces the mapping we want for every value, so we define the value mapping for `Histologic_type` manually.


In [None]:
hist_type_vmap = pd.DataFrame(
    columns=["source", "target"],
    data=[
        ("Carcinosarcoma", "Carcinosarcoma, NOS"),
        ("Clear cell", "Clear cell adenocarcinoma, NOS"),
        ("Endometrioid", "Endometrioid carcinoma"),
        ("Serous", "Serous cystadenocarcinoma"),
    ],
)
hist_type_vmap

Unnamed: 0,source,target
0,Carcinosarcoma,"Carcinosarcoma, NOS"
1,Clear cell,"Clear cell adenocarcinoma, NOS"
2,Endometrioid,Endometrioid carcinoma
3,Serous,Serous cystadenocarcinoma


### Verifying multiple value mappings at once

Besides verifying value mappings individually, you can also do it for all column mappings at once.

In [4]:
hist_type_matches

Unnamed: 0,source,target,similarity
0,Histologic_type,disease_type,0.539168
1,Histologic_type,sample_type,0.530217
2,Histologic_type,roots,0.525866
3,Histologic_type,history_of_tumor_type,0.524959
4,Histologic_type,additional_pathology_findings,0.517125
5,Histologic_type,specimen_type,0.511611
6,Histologic_type,morphologic_architectural_pattern,0.482257
7,Histologic_type,histone_variant,0.478656
8,Histologic_type,viral_hepatitis_serologies,0.471346
9,Histologic_type,chromosome,0.471309


In [8]:
mappings = bdi.match_values(
    dataset,
    column_mapping=hist_type_matches.head(34),
    target="gdc",
    method="tfidf",
    # default_missing=None
)

for mapping in mappings:
    mapping.attrs['default_missing'] = "___"
    print(f"{mapping.attrs['source']} => {mapping.attrs['target']} (coverage: {mapping.attrs['coverage']:.2%})")
    display(mapping)

Histologic_type => disease_type (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Serous,"Cystic, Mucinous and Serous Neoplasms",0.563
1,Carcinosarcoma,"Soft Tissue Tumors and Sarcomas, NOS",0.45
2,Clear cell,Acinar Cell Neoplasms,0.319
3,Endometrioid,,


Histologic_type => sample_type (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Mononuclear Cells from Bone Marrow Normal,0.502
1,Carcinosarcoma,,
2,Endometrioid,,
3,Serous,,


Histologic_type => history_of_tumor_type (coverage: 50.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Colorectal Cancer,0.313
1,Carcinosarcoma,Phenochromocytoma or Paraganglioma,0.309
2,Endometrioid,,
3,Serous,,


Histologic_type => additional_pathology_findings (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Endometrioid,Endometriosis,0.746
1,Carcinosarcoma,Carcinoma in situ,0.643
2,Clear cell,Diffuse and early nodular diabetic glomerulosc...,0.273
3,Serous,,


Histologic_type => specimen_type (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Cell,0.615
1,Serous,Serum,0.369
2,Carcinosarcoma,Bone Marrow Components NOS,0.272
3,Endometrioid,,


Histologic_type => morphologic_architectural_pattern (coverage: 50.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Papillary Renal Cell,0.516
1,Carcinosarcoma,"Papillary, NOS",0.259
2,Endometrioid,,
3,Serous,,


Histologic_type => histone_variant (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Endometrioid,Not Reported,0.269
1,Carcinosarcoma,,
2,Clear cell,,
3,Serous,,


Histologic_type => viral_hepatitis_serologies (coverage: 0.00%)


Unnamed: 0,source,target,similarity
0,Serous,,
1,Carcinosarcoma,,
2,Clear cell,,
3,Endometrioid,,


Histologic_type => chromosome (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Endometrioid,Not Reported,0.28
1,Carcinosarcoma,,
2,Clear cell,,
3,Serous,,


Histologic_type => analyte_type_id (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Serous,S,0.414
1,Endometrioid,D,0.298
2,Clear cell,E,0.269
3,Carcinosarcoma,,


Histologic_type => composition (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Cell,0.621
1,Serous,Serum,0.374
2,Carcinosarcoma,Bone Marrow Components NOS,0.279
3,Endometrioid,,


Histologic_type => antigen (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Carcinosarcoma,Squamous Cell Carcinoma Antigen (SCCA),0.514
1,Clear cell,CEA,0.342
2,Serous,NSE,0.28
3,Endometrioid,,


Histologic_type => cog_rhabdomyosarcoma_risk_group (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Endometrioid,Intermediate Risk,0.277
1,Carcinosarcoma,,
2,Clear cell,,
3,Serous,,


Histologic_type => biospecimen_type (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Serous,Serum,0.365
1,Carcinosarcoma,Buccal Mucosa,0.295
2,Clear cell,Muscle Tissue,0.271
3,Endometrioid,,


Histologic_type => analyte_type (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Nuclei RNA,0.313
1,Carcinosarcoma,,
2,Endometrioid,,
3,Serous,,


Histologic_type => single_cell_library (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Carcinosarcoma,Chromium scATAC v1 Library,0.27
1,Endometrioid,,
2,Clear cell,,
3,Serous,,


Histologic_type => laboratory_test (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Cellularity,0.392
1,Carcinosarcoma,,
2,Endometrioid,,
3,Serous,,


Histologic_type => tumor_descriptor (coverage: 50.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Not Allowed To Collect,0.382
1,Carcinosarcoma,NOS,0.302
2,Endometrioid,,
3,Serous,,


Histologic_type => sample_type_id (coverage: 0.00%)


Unnamed: 0,source,target,similarity
0,Serous,,
1,Carcinosarcoma,,
2,Clear cell,,
3,Endometrioid,,


Histologic_type => relationship_primary_diagnosis (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Carcinosarcoma,Sarcoma,0.697
1,Clear cell,Basal Cell Cancer,0.461
2,Endometrioid,Thyroid Cancer,0.253
3,Serous,,


Histologic_type => stain_type (coverage: 0.00%)


Unnamed: 0,source,target,similarity
0,Serous,,
1,Carcinosarcoma,,
2,Clear cell,,
3,Endometrioid,,


Histologic_type => histone_family (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Endometrioid,Not Reported,0.258
1,Carcinosarcoma,,
2,Clear cell,,
3,Serous,,


Histologic_type => pathogenicity (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Likely Pathogenic,0.252
1,Carcinosarcoma,,
2,Endometrioid,,
3,Serous,,


Histologic_type => contiguous_organ_invaded (coverage: 25.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Small Bowel,0.262
1,Carcinosarcoma,,
2,Endometrioid,,
3,Serous,,


Histologic_type => aids_risk_factors (coverage: 75.00%)


Unnamed: 0,source,target,similarity
0,Endometrioid,Coccidioidomycosis,0.377
1,Clear cell,Salmonella Septicemia,0.305
2,Carcinosarcoma,Nocardiosis,0.283
3,Serous,,


Histologic_type => biospecimen_anatomic_site (coverage: 100.00%)


Unnamed: 0,source,target,similarity
0,Clear cell,Cell-Line,0.474
1,Endometrioid,Abdomen,0.376
2,Serous,Venous,0.318
3,Carcinosarcoma,Stomach - Mucosa Only,0.27


Histologic_type => primary_diagnosis (coverage: 100.00%)


Unnamed: 0,source,target,similarity
0,Carcinosarcoma,"Carcinosarcoma, NOS",0.969
1,Endometrioid,"Endometrioid adenoma, NOS",0.897
2,Clear cell,Clear cell adenoma,0.853
3,Serous,"Serous carcinoma, NOS",0.755


In [13]:
import numpy as np

# Rank the candidate mappings by the sum of their similarity scores
# and show the best-scoring one.
scores = [
    np.sum(m["similarity"])
    for m in mappings
]
sorted_mappings = [m for _, m in sorted(zip(scores, mappings), key=lambda it: it[0], reverse=True)]
sorted_mappings[0]

Unnamed: 0,source,target,similarity
0,Carcinosarcoma,"Carcinosarcoma, NOS",0.969
1,Endometrioid,"Endometrioid adenoma, NOS",0.897
2,Clear cell,Clear cell adenoma,0.853
3,Serous,"Serous carcinoma, NOS",0.755


In [15]:
path_stage_matches = bdi.top_matches(dataset, columns=["Path_Stage_Primary_Tumor-pT"], target="gdc", top_k=40)
path_stage_matches

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 1/1 [00:00<00:00, 11.98it/s]


Table features extracted from 1 columns


100%|██████████| 734/734 [00:52<00:00, 13.92it/s]

Table features extracted from 734 columns





Unnamed: 0,source,target,similarity
0,Path_Stage_Primary_Tumor-pT,uicc_pathologic_t,0.67762
1,Path_Stage_Primary_Tumor-pT,ajcc_pathologic_t,0.660834
2,Path_Stage_Primary_Tumor-pT,ensat_pathologic_t,0.660739
3,Path_Stage_Primary_Tumor-pT,uicc_clinical_t,0.645708
4,Path_Stage_Primary_Tumor-pT,ajcc_clinical_t,0.592968
5,Path_Stage_Primary_Tumor-pT,extrathyroid_extension,0.589977
6,Path_Stage_Primary_Tumor-pT,uicc_pathologic_stage,0.584754
7,Path_Stage_Primary_Tumor-pT,somatic_mutation_indexes,0.571665
8,Path_Stage_Primary_Tumor-pT,ajcc_pathologic_stage,0.563254
9,Path_Stage_Primary_Tumor-pT,annotated_somatic_mutations,0.563235


In [17]:
mappings = bdi.match_values(
    dataset,
    column_mapping=path_stage_matches.head(10),
    target="gdc",
    method="tfidf",
    # default_missing=None
)

for mapping in mappings:
    mapping.attrs['default_missing'] = "___"
    print(f"{mapping.attrs['source']} => {mapping.attrs['target']} (coverage: {mapping.attrs['coverage']:.2%})")
    display(mapping)

Path_Stage_Primary_Tumor-pT => uicc_pathologic_t (coverage: 100.00%)


Unnamed: 0,source,target,similarity
0,pT1b (FIGO IB),T1b,0.534
1,pT1a (FIGO IA),T1a,0.509
2,pT3b (FIGO IIIB),T3b,0.436
3,pT3a (FIGO IIIA),T3a,0.418
4,,Unknown,0.373
5,pT1 (FIGO I),T1,0.289
6,pT2 (FIGO II),T2,0.281


Path_Stage_Primary_Tumor-pT => ajcc_pathologic_t (coverage: 100.00%)


Unnamed: 0,source,target,similarity
0,pT1b (FIGO IB),T1b,0.535
1,pT1a (FIGO IA),T1a,0.507
2,pT3b (FIGO IIIB),T3b,0.437
3,pT3a (FIGO IIIA),T3a,0.416
4,,Unknown,0.348
5,pT1 (FIGO I),T1,0.29
6,pT2 (FIGO II),T2,0.281


Path_Stage_Primary_Tumor-pT => ensat_pathologic_t (coverage: 85.71%)


Unnamed: 0,source,target,similarity
0,pT1 (FIGO I),T1,0.424
1,pT2 (FIGO II),T2,0.424
2,pT1a (FIGO IA),T1,0.311
3,pT1b (FIGO IB),T1,0.302
4,pT3a (FIGO IIIA),T3,0.26
5,pT3b (FIGO IIIB),T3,0.256
6,,,


Path_Stage_Primary_Tumor-pT => uicc_clinical_t (coverage: 100.00%)


Unnamed: 0,source,target,similarity
0,pT1b (FIGO IB),T1b,0.536
1,pT1a (FIGO IA),T1a,0.512
2,pT3b (FIGO IIIB),T3b,0.437
3,pT3a (FIGO IIIA),T3a,0.418
4,,Unknown,0.373
5,pT1 (FIGO I),T1,0.297
6,pT2 (FIGO II),T2,0.285


Path_Stage_Primary_Tumor-pT => ajcc_clinical_t (coverage: 100.00%)


Unnamed: 0,source,target,similarity
0,pT1b (FIGO IB),T1b,0.537
1,pT1a (FIGO IA),T1a,0.509
2,pT3b (FIGO IIIB),T3b,0.437
3,pT3a (FIGO IIIA),T3a,0.416
4,,Unknown,0.348
5,pT1 (FIGO I),T1,0.298
6,pT2 (FIGO II),T2,0.285


Path_Stage_Primary_Tumor-pT => extrathyroid_extension (coverage: 0.00%)


Unnamed: 0,source,target,similarity
0,pT1b (FIGO IB),,
1,pT3b (FIGO IIIB),,
2,,,
3,pT1 (FIGO I),,
4,pT3a (FIGO IIIA),,
5,pT1a (FIGO IA),,
6,pT2 (FIGO II),,


Path_Stage_Primary_Tumor-pT => uicc_pathologic_stage (coverage: 71.43%)


Unnamed: 0,source,target,similarity
0,pT3b (FIGO IIIB),Stage IIIB,0.532
1,pT3a (FIGO IIIA),Stage IIIA,0.478
2,,Unknown,0.379
3,pT1b (FIGO IB),Stage IB,0.339
4,pT2 (FIGO II),Stage III,0.265
5,pT1a (FIGO IA),,
6,pT1 (FIGO I),,


Path_Stage_Primary_Tumor-pT => ajcc_pathologic_stage (coverage: 71.43%)


Unnamed: 0,source,target,similarity
0,pT3b (FIGO IIIB),Stage IIIB,0.532
1,pT3a (FIGO IIIA),Stage IIIA,0.478
2,,Unknown,0.379
3,pT1b (FIGO IB),Stage IB,0.339
4,pT2 (FIGO II),Stage III,0.265
5,pT1a (FIGO IA),,
6,pT1 (FIGO I),,


In [8]:
import numpy as np

# Requires `mappings` from the previous cell, so run the cells in order.
# Rank the candidate mappings by the sum of their similarity scores.
scores = [
    np.sum(m["similarity"])
    for m in mappings
]
sorted_mappings = [m for _, m in sorted(zip(scores, mappings), key=lambda it: it[0], reverse=True)]
print(scores[:2])
print(sorted_mappings[1].attrs["target"])
sorted_mappings[1]

In [5]:
country_matches = bdi.top_matches(dataset, columns=["Country"], target="gdc", top_k=20)
country_matches

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 1/1 [00:00<00:00, 17.54it/s]


Table features extracted from 1 columns


100%|██████████| 734/734 [00:48<00:00, 15.13it/s]

Table features extracted from 734 columns





Unnamed: 0,source,target,similarity
0,Country,country_of_birth,0.491808
1,Country,country_of_residence_at_enrollment,0.419452
2,Country,oct_embedded,0.371549
3,Country,race,0.35219
4,Country,submission_enabled,0.325194
5,Country,zone_of_origin_prostate,0.281882
6,Country,is_legacy,0.279952
7,Country,released,0.271991
8,Country,non_nodal_regional_disease,0.268099
9,Country,perineural_invasion_present,0.225525


In [6]:
mappings = bdi.match_values(
    dataset,
    column_mapping=country_matches,
    target="gdc",
    method="tfidf",
    # default_missing=None
)

for mapping in mappings:
    print(f"{mapping.attrs['source']} => {mapping.attrs['target']} (coverage: {mapping.attrs['coverage']:.2%})")
    display(mapping)

Country => country_of_birth (coverage: 60.00%)


Unnamed: 0,source,target,similarity
0,United States,United States,1.0
1,Poland,Poland,1.0
2,Ukraine,Ukraine,1.0
3,,,
4,Other_specify,,


Country => country_of_residence_at_enrollment (coverage: 60.00%)


Unnamed: 0,source,target,similarity
0,United States,United States,1.0
1,Poland,Poland,1.0
2,Ukraine,Ukraine,1.0
3,,,
4,Other_specify,,


Country => race (coverage: 80.00%)


Unnamed: 0,source,target,similarity
0,Other_specify,other,0.502
1,,american indian or alaska native,0.344
2,United States,white,0.297
3,Poland,native hawaiian or other pacific islander,0.274
4,Ukraine,,


Country => zone_of_origin_prostate (coverage: 20.00%)


Unnamed: 0,source,target,similarity
0,Other_specify,Peripheral zone,0.288
1,,,
2,Ukraine,,
3,Poland,,
4,United States,,


Country => non_nodal_regional_disease (coverage: 20.00%)


Unnamed: 0,source,target,similarity
0,United States,Indeterminate,0.325
1,,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


Country => perineural_invasion_present (coverage: 20.00%)


Unnamed: 0,source,target,similarity
0,United States,Not Reported,0.325
1,,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


Country => ethnicity (coverage: 0.00%)


Unnamed: 0,source,target,similarity
0,,,
1,United States,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


Country => consent_type (coverage: 0.00%)


Unnamed: 0,source,target,similarity
0,,,
1,United States,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


Country => overrepresented_sequences (coverage: 20.00%)


Unnamed: 0,source,target,similarity
0,United States,Not Reported,0.338
1,,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


Country => status (coverage: 0.00%)


Unnamed: 0,source,target,similarity
0,,,
1,United States,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


Country => ulceration_indicator (coverage: 20.00%)


Unnamed: 0,source,target,similarity
0,United States,Not Reported,0.315
1,,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


Country => vascular_invasion_present (coverage: 20.00%)


Unnamed: 0,source,target,similarity
0,United States,Not Reported,0.325
1,,,
2,Ukraine,,
3,Poland,,
4,Other_specify,,


### Fixing remaining value mappings

We need to fix a few value mappings:
- Race
- Ethnicity
- Tumor_Site

For `Race`, we need to fix: `nan` -> `american indian or alaska native`.

In [None]:
race_vmap = bdi.match_values(
    dataset,
    column_mapping=("Race", "race"),
    target="gdc",
    method="tfidf",
)
race_vmap

Unnamed: 0,source,target,similarity
0,White,white,1.0
1,Asian,asian,1.0
2,Not Reported,not reported,1.0
3,Black or African American,black or african american,1.0
4,,american indian or alaska native,0.359


In [None]:
race_vmap = race_vmap[race_vmap["similarity"] >= 1.0]
race_vmap

Unnamed: 0,source,target,similarity
0,White,white,1.0
1,Asian,asian,1.0
2,Not Reported,not reported,1.0
3,Black or African American,black or african american,1.0


For `Ethnicity`, we need to fix: `Not reported` -> `not hispanic or latino`.

In [None]:
ethinicity_vmap = bdi.match_values(
    dataset,
    column_mapping=("Ethnicity", "ethnicity"),
    target="gdc",
    method="tfidf",
)
ethinicity_vmap


Unnamed: 0,source,target,similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Not-Hispanic or Latino,not hispanic or latino,0.935
2,Not reported,not hispanic or latino,0.268
3,,,


In [None]:
ethinicity_vmap = ethinicity_vmap[ethinicity_vmap["similarity"] > 0.9]
ethinicity_vmap

Unnamed: 0,source,target,similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Not-Hispanic or Latino,not hispanic or latino,0.935


For `Tumor_Site`, given that this dataset is about endometrial cancer, all values must be mapped to "Endometrium". So instead of fixing each mapping individually, we will write a custom function that returns "Endometrium" regardless of the input value. Later, we will show how to use this function to transform the dataset.

In [None]:
bdi.match_values(
    dataset, column_mapping=("Tumor_Site", "tissue_or_organ_of_origin"), target="gdc", method="tfidf"
)

Unnamed: 0,source,target,similarity
0,Anterior endometrium,Endometrium,0.852
1,Posterior endometrium,Endometrium,0.823
2,"Other, specify",Other specified parts of pancreas,0.543
3,,Anal canal,0.301


In [None]:
# Custom mapping function that will be used to map the values of the 'Tumor_Site' column
def map_tumor_site(source_value):
    return "Endometrium"
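
To see what this constant mapper does in isolation, here is a minimal sketch that applies it to a small pandas `Series` with `Series.map`. This only illustrates the function's behavior; it makes no claim about how `bdikit` applies mappers internally. The sample values are the `Tumor_Site` values from the output above, plus one missing value.

```python
import pandas as pd


def map_tumor_site(source_value):
    # Same constant mapper as above: the input value is ignored.
    return "Endometrium"


# Tumor_Site values from the match_values() output above, plus a missing value.
tumor_site = pd.Series(
    ["Anterior endometrium", "Posterior endometrium", "Other, specify", None],
    dtype="object",
)

# Series.map calls the function on every element, including missing ones
# (unless na_action="ignore" is passed).
mapped = tumor_site.map(map_tumor_site)
print(mapped.tolist())
```

Note that a constant mapper also assigns "Endometrium" to rows where the source value is missing, which is acceptable here only because every sample in this dataset is an endometrial tumor.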

### Combining custom user mappings with suggested mappings

Before generating the final harmonized dataset, we can combine the automatically generated value mappings with the fixed mappings provided by the user. To do so, we use the `bdi.merge_mappings()` function, which takes a list of mappings (e.g., generated automatically) and a list of user-defined mapping overrides. The overrides are combined with the first list and take precedence whenever the two conflict.
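
The precedence rule can be illustrated in plain Python. The sketch below is not `bdikit`'s implementation: `merge_with_overrides` is a hypothetical function, the `note` field exists only to label where each entry came from, and treating two mappings as conflicting when they share the same (source, target) pair is an assumption made for this illustration.

```python
def merge_with_overrides(suggested, overrides):
    """Illustrative only: entries in `overrides` replace entries in
    `suggested` that share the same (source, target) pair (an assumption)."""
    merged = {(m["source"], m["target"]): m for m in suggested}
    for m in overrides:
        merged[(m["source"], m["target"])] = m  # user entry wins on conflict
    return list(merged.values())


suggested = [
    {"source": "Race", "target": "race", "note": "automatic"},
    {"source": "Gender", "target": "gender", "note": "automatic"},
]
overrides = [
    {"source": "Race", "target": "race", "note": "user"},
    {"source": "BMI", "target": "bmi", "note": "user"},
]

merged = merge_with_overrides(suggested, overrides)
for m in merged:
    print(m["source"], "->", m["target"], f"({m['note']})")
```

In this toy run, the user entry replaces the automatic `Race` mapping, the automatic `Gender` mapping is kept, and the user-only `BMI` mapping is added.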

In our example below, all mappings specified in the variable `user_mappings` will override the mappings in `value_mappings` generated by the `bdi.match_values()` function.

In [None]:
from math import ceil

user_mappings = [
    {
        # When no value mapping is needed, specifying the source and target is enough
        "source": "BMI",
        "target": "bmi",
    },
    {
        "source": "Tumor_Size_cm",
        "target": "tumor_largest_dimension_diameter",
    },
    {
        # mapper can be a custom Python function
        "source": "Tumor_Site",
        "target": "tissue_or_organ_of_origin",
        "mapper": map_tumor_site,
    },
    {
        # Lambda functions can also be used as mappers
        "source": "Age",
        "target": "days_to_birth",
        "mapper": lambda age: -age * 365.25,
    },
    {
        "source": "Age",
        "target": "age_at_diagnosis",
        "mapper": lambda age: float("nan") if pd.isnull(age) else ceil(age*365.25),
    },
    {
        # We can also use a data frame to specify value mappings using the `matches` attribute
        "source": "Histologic_type",
        "target": "primary_diagnosis",
        "matches": hist_type_vmap
    },
    # For dataframes that contain the 'source' and 'target' columns as attributes,
    # such as the ones returned by the match_values() function, we can directly
    # use them as mappings
    ethinicity_vmap,
    race_vmap,
]


harmonization_spec = bdi.merge_mappings(value_mappings, user_mappings)
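
As a quick check of the arithmetic in the two `Age` mappers above, the following sketch restates them as named functions and evaluates them for two ages. The function names and the `DAYS_PER_YEAR` constant are introduced here for illustration only, and `math.isnan` stands in for `pd.isnull` so that the snippet needs nothing beyond the standard library.

```python
from math import ceil, isnan

DAYS_PER_YEAR = 365.25  # same factor used in the lambdas above


def to_days_to_birth(age):
    # Negative, because birth precedes the reference date.
    return -age * DAYS_PER_YEAR


def to_age_at_diagnosis(age):
    # Plain-float NaN check standing in for pd.isnull in the mapper above.
    return float("nan") if isnan(age) else ceil(age * DAYS_PER_YEAR)


for age in (64.0, 58.0):
    print(age, to_days_to_birth(age), to_age_at_diagnosis(age))
```

An age of 64 years gives -23376.0 and 23376 days, while 58 years gives -21184.5 and 21185 days, since `ceil` rounds the fractional day up. These values can be cross-checked against the `days_to_birth` and `age_at_diagnosis` columns of the harmonized dataset.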


Finally, we generate the harmonized dataset using the merged mappings, which include the user-defined overrides.

In [None]:
harmonized_dataset = bdi.materialize_mapping(dataset, harmonization_spec)
harmonized_dataset

Unnamed: 0,bmi,tumor_largest_dimension_diameter,tissue_or_organ_of_origin,days_to_birth,age_at_diagnosis,primary_diagnosis,ethnicity,race,country_of_birth,history_of_tumor_type,irs_stage,gender,tumor_focality,tumor_shape
0,38.88,2.9,Endometrium,-23376.00,23376.0,Endometrioid carcinoma,not hispanic or latino,white,United States,,,female,Unifocal,Dome
1,39.76,3.5,Endometrium,-21184.50,21185.0,Endometrioid carcinoma,not hispanic or latino,white,United States,,,female,Unifocal,Dome
2,51.19,4.5,Endometrium,-18262.50,18263.0,Endometrioid carcinoma,not hispanic or latino,white,United States,,,female,Unifocal,
3,,,Endometrium,,,"Carcinosarcoma, NOS",,,,Phenochromocytoma or Paraganglioma,,,,
4,32.69,3.5,Endometrium,-27393.75,27394.0,Endometrioid carcinoma,not hispanic or latino,white,United States,,,female,Unifocal,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,29.40,4.2,Endometrium,-27393.75,27394.0,Endometrioid carcinoma,,,Ukraine,,,female,Unifocal,
100,35.42,1.5,Endometrium,-27028.50,27029.0,Endometrioid carcinoma,,,Ukraine,,,female,Unifocal,
101,24.32,3.8,Endometrium,-31046.25,31047.0,Serous cystadenocarcinoma,not hispanic or latino,black or african american,United States,,,female,Unifocal,
102,34.06,5.0,Endometrium,-25567.50,25568.0,Serous cystadenocarcinoma,,,Ukraine,,,female,Unifocal,


For comparison, here is what our original data looked like:

In [None]:
# Collect the source column of each mapping as a list, in the same order as the harmonized columns
original_columns = [m["source"] for m in harmonization_spec]
dataset[original_columns]

Unnamed: 0,BMI,Tumor_Size_cm,Tumor_Site,Age,Age.1,Histologic_type,Ethnicity,Race,Country,Histologic_type.1,FIGO_stage,Gender,Tumor_Focality,Tumor_Site.1
0,38.88,2.9,Anterior endometrium,64.0,64.0,Endometrioid,Not-Hispanic or Latino,White,United States,Endometrioid,IA,Female,Unifocal,Anterior endometrium
1,39.76,3.5,Posterior endometrium,58.0,58.0,Endometrioid,Not-Hispanic or Latino,White,United States,Endometrioid,IA,Female,Unifocal,Posterior endometrium
2,51.19,4.5,"Other, specify",50.0,50.0,Endometrioid,Not-Hispanic or Latino,White,United States,Endometrioid,IA,Female,Unifocal,"Other, specify"
3,,,,,,Carcinosarcoma,,,,Carcinosarcoma,,,,
4,32.69,3.5,"Other, specify",75.0,75.0,Endometrioid,Not-Hispanic or Latino,White,United States,Endometrioid,IA,Female,Unifocal,"Other, specify"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,29.40,4.2,"Other, specify",75.0,75.0,Endometrioid,,,Ukraine,Endometrioid,IA,Female,Unifocal,"Other, specify"
100,35.42,1.5,"Other, specify",74.0,74.0,Endometrioid,,,Ukraine,Endometrioid,II,Female,Unifocal,"Other, specify"
101,24.32,3.8,"Other, specify",85.0,85.0,Serous,Not-Hispanic or Latino,Black or African American,United States,Serous,II,Female,Unifocal,"Other, specify"
102,34.06,5.0,"Other, specify",70.0,70.0,Serous,,,Ukraine,Serous,IA,Female,Unifocal,"Other, specify"
