# Getting Started

## Data Harmonization with `bdikit`

Data harmonization is the process of integrating and aligning data from different sources into a consistent format, so that the data remain compatible and interoperable across analyses and systems. `bdikit` is a library that helps with two key data harmonization steps:
- *Schema Mapping*: In this step, data from various sources are mapped to a unified schema or model. This involves identifying equivalent table columns and establishing relationships between disparate datasets.
- *Value Mapping (Data Standardization)*: This step involves converting data into a common format or structure, using consistent naming conventions, units, and coding systems to ensure uniformity.
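
To make the two steps concrete, here is a minimal sketch in plain pandas that does not use `bdikit`. The table, the column names, and both mappings are made up for illustration; in practice `bdikit` is meant to help discover such mappings rather than have them written by hand.

```python
import pandas as pd

# Toy source table; column names and values are invented for this sketch.
source = pd.DataFrame(
    {"Gender": ["F", "M", "F"], "Country": ["USA", "Poland", "USA"]}
)

# Schema mapping: source column -> target column (hypothetical target schema).
column_map = {"Gender": "sex", "Country": "participant_country"}

# Value mapping: for each target column, source value -> standardized value.
value_map = {
    "sex": {"F": "Female", "M": "Male"},
    "participant_country": {"USA": "United States"},
}

harmonized = source.rename(columns=column_map)
for column, mapping in value_map.items():
    # replace() leaves values without a mapping (here "Poland") unchanged.
    harmonized[column] = harmonized[column].replace(mapping)

print(harmonized)
```

Keeping the column mapping and the value mapping as separate dictionaries mirrors the two steps listed above and makes each one easy to review on its own.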

In this example, we describe how `bdikit` can be used to harmonize datasets from two papers:
- Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/)
- Cao et al. (https://www.cell.com/cell/fulltext/S0092-8674(21)00997-1)

#### Loading the data

First, import `bdikit` along with pandas and the IPython display helpers.

In [1]:
import bdikit as bdi
import pandas as pd
from IPython.display import display, Markdown

Next, we load our source data using pandas. The column selection below is commented out, so all columns are kept.

In [2]:
df_source = pd.read_csv("./datasets/Dou-ucec-discovery.csv")
# column_names = [
#     "Country",
#     "Gender",
#     "FIGO_stage",
#     "Path_Stage_Reg_Lymph_Nodes-pN",
#     "tumor_Stage-Pathological",
#     "Tumor_Focality",
# ]
# df_source = df_source[column_names]
df_source.head(10)

Unnamed: 0,idx,Proteomics_Participant_ID,Case_excluded,Proteomics_TMT_batch,Proteomics_TMT_plex,Proteomics_TMT_channel,Proteomics_Parent_Sample_IDs,Proteomics_Aliquot_ID,Proteomics_Tumor_Normal,Proteomics_OCT,...,RNAseq_R1_sample_type,RNAseq_R1_filename,RNAseq_R1_UUID,RNAseq_R2_sample_type,RNAseq_R2_filename,RNAseq_R2_UUID,miRNAseq_sample_type,miRNAseq_UUID,Methylation_available,Methylation_quality
0,S001,C3L-00006,No,2,5,128N,C3L-00006-01,CPT0001460012,Tumor,No,...,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_TAGCTT_S17...,8a1efc47-1c29-417f-a425-cdbd09565dcb,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_TAGCTT_S17...,8c3fe9b7-7acd-4867-8d9c-a8e5d1516eda,Tumor,37bcba98-1094-459e-83ae-c23a602416fb,YES,PASS
1,S002,C3L-00008,No,4,16,130N,C3L-00008-01,CPT0001300009,Tumor,No,...,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_GGCTAC_S22...,555725e8-cba5-4676-9b0a-80100cbf9f47,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_GGCTAC_S22...,15235b12-b67a-4678-acc4-ed03d642bd5e,Tumor,492b50d8-ec35-46e7-a65d-06512aaee394,YES,PASS
2,S003,C3L-00032,No,1,2,131,C3L-00032-01,CPT0001420009,Tumor,No,...,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_GTCCGC_S18...,9ae968f3-691d-4db3-9977-1ab3e5af9085,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_GTCCGC_S18...,423b6b09-02aa-4f47-9241-f75c1dad1161,Tumor,1794ff56-db2d-4d1a-8758-cab7fe3d98c1,YES,PASS
3,S004,C3L-00084,Yes,3,11,129N,C3L-00084-01,CPT0000820012,Tumor,No,...,Tumor,170818_UNC32-K00270_0050_AHL2FHBBXX_ATCACG_S5_...,b0a7cdf2-2ad8-4442-91b0-548ea4975554,Tumor,170818_UNC32-K00270_0050_AHL2FHBBXX_ATCACG_S5_...,c83987a5-1c13-4af4-b46c-218fe5f60c34,,,YES,PASS
4,S005,C3L-00090,No,3,12,129C,C3L-00090-01,CPT0001140003,Tumor,No,...,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_GAGTGG_S10...,8ce5618d-9ff6-40f9-aeea-8d8e1633ae38,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_GAGTGG_S10...,06d3fd4a-a623-4146-8500-4f1f17235253,Tumor,a6524c2d-d7dd-4629-980e-b45dbdc92c49,YES,PASS
5,S006,C3L-00098,No,4,14,129N,C3L-00098-02,CPT0000980012,Tumor,No,...,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_TTAGGC_S8_...,31252ba9-e052-4b77-809a-f936379ae00c,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_TTAGGC_S8_...,23be22ae-de50-4d74-a7c0-c890adbc662a,,,YES,PASS
6,S007,C3L-00136,No,4,16,129C,C3L-00136-03,CPT0000730011,Tumor,No,...,Tumor,170818_UNC32-K00270_0050_AHL2FHBBXX_GTCCGC_S10...,df0e2942-c702-4135-81a0-fbec4439d753,Tumor,170818_UNC32-K00270_0050_AHL2FHBBXX_GTCCGC_S10...,4e1ad404-4646-4828-91b9-e3c35a4ce505,,,YES,PASS
7,S008,C3L-00137,No,4,15,130N,C3L-00137-02,CPT0002010011,Tumor,No,...,Tumor,170818_UNC32-K00270_0050_AHL2FHBBXX_GTGAAA_S12...,8fcdd6a1-a7c7-41b5-8b44-e41f2237b236,Tumor,170818_UNC32-K00270_0050_AHL2FHBBXX_GTGAAA_S12...,2bea607d-6eb2-4583-90d7-7823a3d8a572,,,YES,PASS
8,S009,C3L-00139,No,3,11,130N,C3L-00139-01,CPT0001850012,Tumor,No,...,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_CAGATC_S1_...,7785d5a1-a60d-41f9-86f3-e4ebc100704c,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_CAGATC_S1_...,90ced367-0342-4739-93b2-4b1a4af800c4,Tumor,a02b2784-9e7f-41b1-8e53-707ae4371c45,YES,PASS
9,S010,C3L-00143,No,4,14,130C,C3L-00143-01,CPT0001910016,Tumor,No,...,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_ACTTGA_S4_...,6412838b-2f70-4b14-a6ee-3c7baca09fb0,Tumor,170802_UNC31-K00269_0072_AHK3GVBBXX_ACTTGA_S4_...,5d0a26e0-2739-4f38-9350-c685b44911d3,Tumor,872be4b7-1735-48a6-a3a2-7541ec65ea87,YES,PASS


Our goal is to harmonize the data from our source table (`df_source`) with the data from our target table (`df_target`), which we load next.

In [3]:
df_target = pd.read_csv("./datasets/Dou-ucec-confirmatory.csv")
df_target.head(5)

Unnamed: 0,Idx,Case_id,Case_excluded,Batch,Plex,ReporterName,Aliquot_ID,Group,Discovery_study,Age,...,Follow-up_additional_surgery_for_new_tumor,Follow-up_additional_treatment_radiation_therapy_for_new_tumor,Follow-up_additional_treatment_pharmaceutical_therapy_for_new_tumor,Follow-up_additional_treatment_immuno_for_new_tumor,Follow-up_days_from_date_of_collection_to_date_of_last_contact,Follow-up_cause_of_death,Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_death,Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor,Follow-up_procedure_type_of_new_tumor,Follow-up_residual_tumor_after_surgery_for_new_tumor
0,C3L-00086,C3L-00086,No,b4,16.0,128N,CPT0092460003,Tumor,No,56,...,n/a|No|No|No|No,n/a|Yes|Yes|Yes|Yes,n/a|Yes|Yes|Yes|Yes,n/a|No|No|No|No,330.0|701.0|1046.0|1436.0|n/a,n/a|n/a|n/a|n/a|Breast Carcinoma,n/a|n/a|n/a|n/a|1578.0,n/a|n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a|n/a
1,C3L-00898,C3L-00898,No,b4,14.0,128C,CPT0172200008,Tumor,No,54,...,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,396.0|746.0|982.0|1600.0,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a
2,C3L-00943,C3L-00943,No,b4,15.0,130C,CPT0086090003,Tumor,No,63,...,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a,237.0|693.0|1039.0,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a
3,C3L-01064,C3L-01064,No,b3,9.0,129N,CPT0113430004,Tumor,No,54,...,No|No|No|No,No|Yes|No|No,Yes|Yes|Yes|Yes,No|No|No|No,453.0|726.0|1062.0|1447.0,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a,n/a|n/a|n/a|n/a
4,C3L-01277,C3L-01277,No,b4,13.0,130N,CPT0093170003,Tumor,No,61,...,n/a|No|No,n/a|No|Yes,n/a|Yes|No,n/a|No|No,351.0|713.0|967.0,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a,n/a|n/a|n/a


#### Finding column matches between two tables

`bdikit` offers a suite of functions to help with data harmonization tasks.

For instance, it can help automatically discover one-to-one mappings between the source and target dataset columns.
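
As a point of comparison, the sketch below shows what a naive one-to-one column matcher could look like when it relies on column names alone. It uses the standard-library `difflib` and is not how `bdikit` works internally; the function `match_by_name` and the example column lists are hypothetical.

```python
from difflib import SequenceMatcher


def match_by_name(source_columns, target_columns):
    """Greedy one-to-one matching of column names by string similarity."""
    # Score every (source, target) pair, best pairs first.
    scored = sorted(
        (
            (SequenceMatcher(None, s.lower(), t.lower()).ratio(), s, t)
            for s in source_columns
            for t in target_columns
        ),
        reverse=True,
    )
    matches, used_sources, used_targets = {}, set(), set()
    for score, s, t in scored:
        # Accept a pair only if neither column has been matched yet.
        if s not in used_sources and t not in used_targets:
            matches[s] = (t, round(score, 3))
            used_sources.add(s)
            used_targets.add(t)
    return matches


matches = match_by_name(
    ["Gender", "Tumor_Focality", "Country"],
    ["Participant_country", "Sex", "Tumor_focality"],
)
for source_name, (target_name, score) in matches.items():
    print(f"{source_name} -> {target_name} ({score})")
```

In this toy input, `Gender` ends up paired with `Sex` only because it is the last pair left, not because the names look alike. Name similarity alone cannot capture such semantic matches, which is one motivation for methods that also look at column contents.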

With `bdikit`, the `match_schema()` function matches the columns of the two tables as follows.

In [4]:
bdi.match_schema(df_source, df_target, method="ct_learning")

Extracting features from 179 columns...
Extracting features from 213 columns...

Unnamed: 0,source,target
0,idx,xCell_T_cell_CD4+_Th1
1,Proteomics_Participant_ID,Idx
2,Case_excluded,Case_excluded
3,Proteomics_TMT_batch,ABSOLUTE_tumor_purity
4,Proteomics_TMT_plex,Number_of_para-aortic_lymph_nodes_examined
...,...,...
174,RNAseq_R2_UUID,Case_id
175,miRNAseq_sample_type,Mutation_signature_SBS7a
176,miRNAseq_UUID,Case_id
177,Methylation_available,Mutation_signature_SBS42


In [9]:
pd.set_option('display.max_rows', None)
schema_mapping = bdi.match_schema(df_source, df_target, method="ct_learning")
schema_mapping

Extracting features from 179 columns...
Extracting features from 213 columns...

Unnamed: 0,source,target
0,idx,xCell_T_cell_CD4+_Th1
1,Proteomics_Participant_ID,Idx
2,Case_excluded,Case_excluded
3,Proteomics_TMT_batch,ABSOLUTE_tumor_purity
4,Proteomics_TMT_plex,Number_of_para-aortic_lymph_nodes_examined
5,Proteomics_TMT_channel,ReporterName
6,Proteomics_Parent_Sample_IDs,Idx
7,Proteomics_Aliquot_ID,Aliquot_ID
8,Proteomics_Tumor_Normal,Group
9,Proteomics_OCT,POLE


In [7]:
bdi.top_matches(df_source, columns=['Tumor_purity'], target=df_target, top_k=35)

Extracting features from 1 columns...
Extracting features from 213 columns...

Unnamed: 0,source,target,similarity
0,Tumor_purity,Tumor_size_cm,0.108482
1,Tumor_purity,ABSOLUTE_tumor_purity,0.099089
2,Tumor_purity,Mutation_signature_SBS10b,0.094293
3,Tumor_purity,Cibersort_T_cell_regulatory_(Tregs),0.0929
4,Tumor_purity,Cibersort_Monocyte,0.092766
5,Tumor_purity,Cibersort_B_cell_naive,0.092648
6,Tumor_purity,Mutation_signature_SBS1,0.092625
7,Tumor_purity,Cibersort_T_cell_gamma_delta,0.091828
8,Tumor_purity,Progeny_TGFb,0.090625
9,Tumor_purity,Mutation_signature_SBS10a,0.090585


In [21]:
bdi.preview_domain(df_target, column='Treatment_naive')

Unnamed: 0,value_name
0,
1,Surgery|Surgery
2,Surgery
3,Other(Mohs treatment)
4,"Radiation,Surgery"
5,Unknown


In [26]:
# Stemness_score	Progeny_Androgen
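# Note: this call fails because df_target has no 'Ancillary_studies_mlh2' column (see the KeyError below).
bdi.match_values(df_source, df_target, ('MLH2', 'Ancillary_studies_mlh2'), method='tfidf')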
# bdi.match_values(df_source, df_target, ('Tumor_purity', 'ABSOLUTE_tumor_purity'), method='tfidf')

KeyError: 'Ancillary_studies_mlh2'

In [11]:
bdi.match_values(df_source, df_target, ('Proteomics_TMT_batch', 'Batch'), method='tfidf')

Unnamed: 0,source,target,similarity
0,2,b2,0.578
1,4,b4,0.578
2,1,b1,0.578
3,3,b3,0.578
4,5,,
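
A match table like the one above can be turned into a lookup dictionary and applied to a column with plain pandas. The sketch below assumes the result behaves like a DataFrame with `source` and `target` columns, as the output suggests. The toy data only mimics that output, and the variable names are illustrative.

```python
import pandas as pd

# Toy copy of the match table shown above (source value 5 has no target match).
matches = pd.DataFrame(
    {
        "source": [2, 4, 1, 3, 5],
        "target": ["b2", "b4", "b1", "b3", None],
    }
)

# Keep only rows with a proposed target, then build a lookup dictionary.
lookup = matches.dropna(subset=["target"]).set_index("source")["target"].to_dict()

batches = pd.Series([2, 5, 1], name="Proteomics_TMT_batch")
# map() yields NaN for values missing from the lookup, which makes gaps easy to spot.
harmonized = batches.map(lookup)
print(harmonized)
```

Leaving unmatched values as missing, rather than passing them through, is a deliberate choice here: it flags values such as batch `5` for manual review.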


In [25]:
bdi.top_matches(df_source, columns=['MLH2'], target=df_target, top_k=35)

Extracting features from 1 columns...
Extracting features from 213 columns...

Unnamed: 0,source,target,similarity
0,MLH2,Ancillary_studies_mlh1,0.103021
1,MLH2,BMI,0.101541
2,MLH2,CNV_status,0.100899
3,MLH2,Mutation_signature_SBS21,0.097867
4,MLH2,xCell_B_cell,0.097845
5,MLH2,Mutation_load,0.096779
6,MLH2,Mutation_signature_SBS20,0.096626
7,MLH2,xCell_Macrophage,0.096572
8,MLH2,xCell_Myeloid_dendritic_cell,0.095951
9,MLH2,Clinical_staging_distant_metastasis_cm,0.095322


#### Finding value matches between two columns

Once the matching columns are identified, we can standardize the data so that each entity or meaning is represented by a single value.

To do that, `bdikit` provides the function `match_values()` to find values that should potentially be merged. The library supports multiple methods to perform this task, including syntactic and semantic matching algorithms. In this example, we use the `tfidf` method, which matches values based on the similarity of their character n-grams. Please refer to the [bdikit documentation](https://bdi-kit.readthedocs.io/) to learn more about the available methods.
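
To give an intuition for character n-gram matching, here is a small standard-library sketch of TF-IDF weighting with cosine similarity. It is not `bdikit`'s implementation: the n-gram size, the padding, the IDF formula, and all function names are assumptions made for illustration, so the scores will differ from those reported by the library.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(values, n=3):
    """Build one sparse TF-IDF vector (a dict) per input value."""
    counts = [Counter(char_ngrams(v, n)) for v in values]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(values)
    return [
        # Smoothed IDF: n-grams shared by many values get a lower weight.
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


source_values = ["Endometrioid", "Serous"]
target_values = ["Endometrioid carcinoma", "Serous carcinoma", "Clear cell carcinoma"]

vectors = tfidf_vectors(source_values + target_values)
source_vecs = vectors[: len(source_values)]
target_vecs = vectors[len(source_values):]

best = {}
for value, vec in zip(source_values, source_vecs):
    scores = [cosine(vec, t) for t in target_vecs]
    best[value] = target_values[scores.index(max(scores))]

print(best)
```

Because the comparison is purely syntactic, values that mean the same thing but share few characters would not be matched by this kind of method.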

In [16]:
# Matches values from each pair of source-target columns
value_matches = bdi.match_values(df_source, df_target, schema_mapping.head(250), method="tfidf")

# Print value matches
for match in value_matches:
    display(
        Markdown(
            f"<br>**Source column:** {match.attrs['source']}<br>"
            f"**Target column:** {match.attrs['target']}<br>"
        )
    )
    display(match)

<br>**Source column:** Case_excluded<br>**Target column:** Case_excluded<br>

Unnamed: 0,source,target,similarity
0,No,No,1.0
1,Yes,Yes,1.0


<br>**Source column:** Proteomics_TMT_batch<br>**Target column:** ABSOLUTE_tumor_purity<br>

Unnamed: 0,source,target,similarity
0,3,0.33,0.622
1,4,0.4,0.608
2,5,0.5,0.606
3,2,0.2,0.601
4,1,1.0,0.506


<br>**Source column:** Proteomics_TMT_plex<br>**Target column:** Number_of_para-aortic_lymph_nodes_examined<br>

Unnamed: 0,source,target,similarity
0,5,5.0,1.0
1,16,16.0,1.0
2,2,2.0,1.0
3,11,11.0,1.0
4,12,12.0,1.0
5,8,8.0,1.0
6,7,7.0,1.0
7,6,6.0,1.0
8,3,3.0,1.0
9,1,1.0,1.0


<br>**Source column:** Proteomics_TMT_channel<br>**Target column:** ReporterName<br>

Unnamed: 0,source,target,similarity
0,128N,128N,1.0
1,130N,130N,1.0
2,129N,129N,1.0
3,129C,129C,1.0
4,130C,130C,1.0
5,127N,127N,1.0
6,127C,127C,1.0
7,128C,128C,1.0
8,131,131N,0.714


<br>**Source column:** Proteomics_Tumor_Normal<br>**Target column:** Group<br>

Unnamed: 0,source,target,similarity
0,Tumor,Tumor,1.0
1,Adjacent_normal,Adjacent_normal,1.0
2,Enriched_normal,Enriched_Normal,1.0
3,Myometrium_normal,,


<br>**Source column:** Proteomics_OCT<br>**Target column:** POLE<br>

Unnamed: 0,source,target,similarity
0,No,No,1.0
1,Yes,Yes,1.0


<br>**Source column:** Country<br>**Target column:** Participant_country<br>

Unnamed: 0,source,target,similarity
0,United States,United States,1.0
1,,,1.0
2,Ukraine,Ukraine,1.0
3,Poland,Poland,1.0
4,Other_specify,,


<br>**Source column:** Histologic_Grade_FIGO<br>**Target column:** Histologic_grade<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,FIGO grade 1,Other: High grade,0.426
2,FIGO grade 2,Other: High grade,0.426
3,FIGO grade 3,Other: High grade,0.426


<br>**Source column:** Myometrial_invasion_Specify<br>**Target column:** Myometrial_invasion_present_specify<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,50 % or more,,
2,Not identified,,
3,under 50 %,,


<br>**Source column:** Histologic_type<br>**Target column:** Histologic_Type<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Endometrioid,Endometrioid carcinoma,0.855
2,Clear cell,Clear cell carcinoma,0.835
3,Serous,Serous carcinoma,0.717
4,Carcinosarcoma,Serous carcinoma,0.618


<br>**Source column:** Treatment_naive<br>**Target column:** Follow-up_additional_treatment_radiation_therapy_for_new_tumor<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,NO,No|No,0.806
2,YES,Yes|Yes|Yes|Yes,0.787


<br>**Source column:** Tumor_purity<br>**Target column:** Tumor_necrosis<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Normal,,
2,Low,,


<br>**Source column:** Path_Stage_Primary_Tumor-pT<br>**Target column:** Pathologic_staging_primary_tumor_pt<br>

Unnamed: 0,source,target,similarity
0,pT1a (FIGO IA),pT1a (FIGO IA),1.0
1,,,1.0
2,pT3a (FIGO IIIA),pT3a (FIGO IIIA),1.0
3,pT1 (FIGO I),pT1 (FIGO I),1.0
4,pT1b (FIGO IB),pT1b (FIGO IB),1.0
5,pT2 (FIGO II),pT2 (FIGO II),1.0
6,pT3b (FIGO IIIB),pT3b (FIGO IIIB),1.0


<br>**Source column:** Path_Stage_Reg_Lymph_Nodes-pN<br>**Target column:** Pathologic_staging_regional_lymph_nodes_pn<br>

Unnamed: 0,source,target,similarity
0,pN0,pN0,1.0
1,pNX,pNX,1.0
2,,,1.0
3,pN2 (FIGO IIIC2),pN2 (FIGO IIIC2),1.0
4,pN1 (FIGO IIIC1),pN1 (FIGO IIIC1),1.0


<br>**Source column:** Clin_Stage_Dist_Mets-cM<br>**Target column:** Clinical_staging_distant_metastasis_cm<br>

Unnamed: 0,source,target,similarity
0,cM0,cM0,1.0
1,,,1.0
2,Staging Incomplete,Staging Incomplete,1.0
3,cM1,cM1,1.0


<br>**Source column:** Path_Stage_Dist_Mets-pM<br>**Target column:** Clinical_staging_distant_metastasis_cm<br>

Unnamed: 0,source,target,similarity
0,Staging Incomplete,Staging Incomplete,1.0
1,,,1.0
2,pM1,cM1,0.445
3,No pathologic evidence of distant metastasis,Staging Incomplete,0.423


<br>**Source column:** tumor_Stage-Pathological<br>**Target column:** Tumor_stage_pathological<br>

Unnamed: 0,source,target,similarity
0,Stage I,Stage I,1.0
1,Stage IV,Stage IV,1.0
2,,,1.0
3,Stage III,Stage III,1.0
4,Stage II,Stage II,1.0


<br>**Source column:** FIGO_stage<br>**Target column:** Pathologic_staging_primary_tumor_pt<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,IIIB,pT3b (FIGO IIIB),0.689
2,IIIA,pT3a (FIGO IIIA),0.682
3,IA,pT1a [IA],0.567
4,II,pT3a (FIGO IIIA),0.526
5,IB,pT1b (FIGO IB),0.498
6,IIIC2,pT2 [II],0.403
7,IIIC1,pT3a (FIGO IIIA),0.344
8,IVB,,


<br>**Source column:** Diabetes<br>**Target column:** Diabetes<br>

Unnamed: 0,source,target,similarity
0,Yes,Yes,1.0
1,,,1.0
2,Unknown,,
3,No,,


<br>**Source column:** Race<br>**Target column:** Race<br>

Unnamed: 0,source,target,similarity
0,White,White,1.0
1,,,1.0
2,Asian,Asian,1.0
3,Not Reported,Not Reported,1.0
4,Black or African American,Black or African American,1.0


<br>**Source column:** Ethnicity<br>**Target column:** Ethnicity<br>

Unnamed: 0,source,target,similarity
0,Not-Hispanic or Latino,Not-Hispanic or Latino,1.0
1,,,1.0
2,Hispanic or Latino,Hispanic or Latino,1.0
3,Not reported,Not reported,1.0


<br>**Source column:** Gender<br>**Target column:** Sex<br>

Unnamed: 0,source,target,similarity
0,Female,Female,1.0
1,,,1.0


<br>**Source column:** Tumor_Site<br>**Target column:** Tumor_site<br>

Unnamed: 0,source,target,similarity
0,Anterior endometrium,Anterior endometrium,1.0
1,Posterior endometrium,Posterior endometrium,1.0
2,,,1.0
3,"Other, specify",Other,0.558


<br>**Source column:** Tumor_Focality<br>**Target column:** Tumor_focality<br>

Unnamed: 0,source,target,similarity
0,Unifocal,Unifocal,1.0
1,,,1.0
2,Multifocal,Multifocal,1.0


<br>**Source column:** Estrogen_Receptor<br>**Target column:** Ancillary_studies_estrogen_receptor<br>

Unnamed: 0,source,target,similarity
0,Cannot be determined,Cannot be determined,1.0
1,,,1.0
2,Negative,Negative,1.0
3,Positive,Positive : 5 %,0.941
4,Unknown,,


<br>**Source column:** Progesterone_Receptor<br>**Target column:** Ancillary_studies_progesterone_receptor<br>

Unnamed: 0,source,target,similarity
0,Cannot be determined,Cannot be determined,1.0
1,,,1.0
2,Negative,Negative,1.0
3,Positive,Positive : 5 %,0.941
4,Unknown,,


<br>**Source column:** MLH1<br>**Target column:** Ancillary_studies_mlh1<br>

Unnamed: 0,source,target,similarity
0,Intact nuclear expression,Intact nuclear expression,1.0
1,,,1.0
2,Loss of nuclear expression,Loss of nuclear expression,1.0
3,Cannot be determined,Cannot be determined,1.0
4,Unknown,,


<br>**Source column:** MLH2<br>**Target column:** Ancillary_studies_mlh1<br>

Unnamed: 0,source,target,similarity
0,Intact nuclear expression,Intact nuclear expression,1.0
1,,,1.0
2,Cannot be determined,Cannot be determined,1.0
3,Loss of nuclear expression,Loss of nuclear expression,1.0
4,Unknown,,


<br>**Source column:** MSH6<br>**Target column:** Ancillary_studies_msh2<br>

Unnamed: 0,source,target,similarity
0,Loss of nuclear expression,Loss of nuclear expression,1.0
1,Intact nuclear expression,Intact nuclear expression,1.0
2,,,1.0
3,Cannot be determined,Cannot be determined,1.0
4,Unknown,,


<br>**Source column:** PMS2<br>**Target column:** Ancillary_studies_pms2<br>

Unnamed: 0,source,target,similarity
0,Intact nuclear expression,Intact nuclear expression,1.0
1,Loss of nuclear expression,Loss of nuclear expression,1.0
2,,,1.0
3,Cannot be determined,Cannot be determined,1.0
4,Unknown,,


<br>**Source column:** p53<br>**Target column:** Ancillary_studies_p53<br>

Unnamed: 0,source,target,similarity
0,Cannot be determined,Cannot be determined,1.0
1,,,1.0
2,Normal,Normal,1.0
3,Overexpression,Overexpression,1.0
4,Loss of expression,Loss of expression,1.0
5,Unknown,,


<br>**Source column:** MLH1_Promoter_Hypermethylation<br>**Target column:** Ancillary_studies_mlh1_promoter_hypermethylation<br>

Unnamed: 0,source,target,similarity
0,Cannot be determined,Cannot be determined,1.0
1,,,1.0
2,Absent,Absent,1.0
3,Present,Present,1.0


<br>**Source column:** Num_full_term_pregnancies<br>**Target column:** Donor_information_number_of_full_term_pregnancies<br>

Unnamed: 0,source,target,similarity
0,1,1,1.0
1,4 or more,4 or more,1.0
2,,,1.0
3,2,2,1.0
4,3,3,1.0
5,Unknown,Unknown,1.0


<br>**Source column:** CNV_class<br>**Target column:** CNV_status<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,CNV_LOW,CNV_L,0.631
2,CNV_HIGH,CNV_H,0.623


<br>**Source column:** MSI_status<br>**Target column:** MSI_status<br>

Unnamed: 0,source,target,similarity
0,MSI-H,MSI-H,1.0
1,MSS,MSS,1.0
2,,,1.0


<br>**Source column:** POLE_subtype<br>**Target column:** POLE<br>

Unnamed: 0,source,target,similarity
0,No,No,1.0
1,Yes,Yes,1.0
2,,,1.0


<br>**Source column:** JAK1_MS_INDEL<br>**Target column:** MSI_status<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,MS_indel,MSI-H,0.352
2,WT,,


<br>**Source column:** JAK1_Mutation<br>**Target column:** BMI<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Nonsense_Mutation,,0.31
2,WT,,
3,Missense_Mutation,,
4,Frame_Shift_Ins_Nonsense_Mutation,,
5,Frame_Shift_Del_Nonsense_Mutation,,
6,Frame_Shift_Del,,
7,Frame_Shift_Del_Frame_Shift_Ins,,


<br>**Source column:** Genomics_subtype<br>**Target column:** Genomic_subtype<br>

Unnamed: 0,source,target,similarity
0,MSI-H,MSI-H,1.0
1,,,1.0
2,POLE,POLE,1.0
3,CNV_low,CNV_L,0.663
4,CNV_high,CNV_H,0.632


<br>**Source column:** WXS_normal_sample_type<br>**Target column:** Batch<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Blood_normal,,


<br>**Source column:** WXS_tumor_sample_type<br>**Target column:** Tumor_site<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Tumor,,


<br>**Source column:** WGS_normal_sample_type<br>**Target column:** Mutation_signature_SBS7a<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Blood_normal,,


<br>**Source column:** WGS_tumor_sample_type<br>**Target column:** Tumor_site<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Tumor,,


<br>**Source column:** RNAseq_R1_sample_type<br>**Target column:** ARID1A<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Adjacent_normal,,
2,Tumor,,


<br>**Source column:** RNAseq_R2_sample_type<br>**Target column:** ARID1A<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Adjacent_normal,,
2,Tumor,,


<br>**Source column:** miRNAseq_sample_type<br>**Target column:** Mutation_signature_SBS7a<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,Adjacent_normal,,
2,Tumor,,


<br>**Source column:** Methylation_available<br>**Target column:** Mutation_signature_SBS42<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,YES,,


<br>**Source column:** Methylation_quality<br>**Target column:** MSI_status<br>

Unnamed: 0,source,target,similarity
0,,,1.0
1,PASS,MSS,0.414
2,Failed,,


In [11]:
list(df_target["Batch"].unique())

['b4', 'b3', 'b1', 'b2', nan]

In [9]:
list(df_target.columns)

['Idx',
 'Case_id',
 'Case_excluded',
 'Batch',
 'Plex',
 'ReporterName',
 'Aliquot_ID',
 'Group',
 'Discovery_study',
 'Age',
 'Sex',
 'Histologic_Type',
 'Histologic_grade',
 'Tumor_size_cm',
 'Height_at_time_of_surgery_cm',
 'Weight_at_time_of_surgery_kg',
 'BMI',
 'Myometrial_invasion',
 'Myometrial_invasion_present_specify',
 'AJCC_tnm_cancer_staging_edition_used',
 'Pathologic_staging_primary_tumor_pt',
 'Pathologic_staging_regional_lymph_nodes_pn',
 'Number_of_pelvic_lymph_nodes_examined',
 'Tumor_stage_pathological',
 'Race',
 'CNV_ratio',
 'CNV_status',
 'POLE',
 'MSIsensor_ratio',
 'MSI_status',
 'Genomic_subtype',
 'Mutation_load',
 'TP53',
 'PTEN',
 'CTNNB1',
 'ARID1A',
 'PIK3CA',
 'xCell_Myeloid_dendritic_cell_activated',
 'xCell_B_cell',
 'xCell_T_cell_CD4+_memory',
 'xCell_T_cell_CD4+_naive',
 'xCell_T_cell_CD4+_(non-regulatory)',
 'xCell_T_cell_CD4+_central_memory',
 'xCell_T_cell_CD4+_effector_memory',
 'xCell_T_cell_CD8+_naive',
 'xCell_T_cell_CD8+',
 'xCell_T_cell_CD

#### Generating a harmonized table

In [6]:
df_mapped = bdi.materialize_mapping(df_source, value_matches)
df_mapped

Unnamed: 0,participant_country,sex,tumor_stage_pathological,pathologic_staging_regional_lymph_nodes_pn,tumor_focality
0,United States,Female,Stage IA,pN0,Unifocal
1,United States,Female,Stage IV,pNX,Unifocal
2,United States,Female,Stage IA,pN0,Unifocal
3,,,,,
4,United States,Female,Stage IA,pNX,Unifocal
...,...,...,...,...,...
99,,Female,Stage IA,pNX,Unifocal
100,,Female,Stage III,pN0,Unifocal
101,United States,Female,Stage III,pN0,Unifocal
102,,Female,Stage IA,pN0,Unifocal


In [7]:
source_column_names = list(map(lambda m: m.attrs['source'], value_matches))
target_column_names = list(map(lambda m: m.attrs['target'], value_matches))
df_source[source_column_names]

Unnamed: 0,Country,Gender,FIGO_stage,Path_Stage_Reg_Lymph_Nodes-pN,tumor_Stage-Pathological,Tumor_Focality
0,United States,Female,IA,pN0,Stage I,Unifocal
1,United States,Female,IA,pNX,Stage IV,Unifocal
2,United States,Female,IA,pN0,Stage I,Unifocal
3,,,,,,
4,United States,Female,IA,pNX,Stage I,Unifocal
...,...,...,...,...,...,...
99,Ukraine,Female,IA,pNX,Stage I,Unifocal
100,Ukraine,Female,II,pN0,Stage II,Unifocal
101,United States,Female,II,pN0,Stage II,Unifocal
102,Ukraine,Female,IA,pN0,Stage I,Unifocal


In [8]:
pd.concat([df_mapped[target_column_names], df_target[target_column_names]])

Unnamed: 0,participant_country,sex,tumor_stage_pathological,pathologic_staging_regional_lymph_nodes_pn,tumor_stage_pathological.1,tumor_focality
0,United States,Female,Stage IA,pN0,Stage IA,Unifocal
1,United States,Female,Stage IV,pNX,Stage IV,Unifocal
2,United States,Female,Stage IA,pN0,Stage IA,Unifocal
3,,,,,,
4,United States,Female,Stage IA,pNX,Stage IA,Unifocal
...,...,...,...,...,...,...
135,Poland,Male,Stage III,pN2,Stage III,Unifocal
136,China,Female,Stage III,pN2,Stage III,Unifocal
137,China,Male,Stage III,pN2,Stage III,Unifocal
138,Poland,Female,Stage III,pN2,Stage III,Multifocal
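
The repeated `tumor_stage_pathological` column in the output above suggests that `target_column_names` contains the same name more than once. This is an assumption, as the notebook does not say so. A minimal sketch of removing the repeats before concatenating, using made-up one-row tables in place of `df_mapped` and `df_target`:

```python
import pandas as pd

# Hypothetical list of target names with one repeat, as the output above suggests.
target_column_names = [
    "sex",
    "tumor_stage_pathological",
    "tumor_stage_pathological",
    "tumor_focality",
]

# dict.fromkeys() drops repeats while preserving first-seen order.
unique_targets = list(dict.fromkeys(target_column_names))

# Toy stand-ins for df_mapped and df_target.
mapped = pd.DataFrame(
    {"sex": ["Female"], "tumor_stage_pathological": ["Stage I"], "tumor_focality": ["Unifocal"]}
)
target = pd.DataFrame(
    {"sex": ["Male"], "tumor_stage_pathological": ["Stage III"], "tumor_focality": ["Multifocal"]}
)

# ignore_index=True gives the combined table a fresh 0..n-1 index.
combined = pd.concat([mapped[unique_targets], target[unique_targets]], ignore_index=True)
print(combined)
```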
