# BDI-Viz Demo On GDC

## Introduction
Welcome to the BDI-Viz demonstration. This demo is designed to generate potential ground truths based on the GDC schema. Should you encounter any bugs or identify areas for improvement, please contact **Eden Wu**. Alternatively, you can open an issue in the [BDI-Kit](https://github.com/VIDA-NYU/bdi-kit) repository. We appreciate your feedback and collaboration!

In [2]:
import pandas as pd
import numpy as np
import json

import panel as pn
import bdikit as bdi
from bdikit.visualization.schema_matching import BDISchemaMatchingHeatMap
from bdikit import top_matches



### Use Case Exercise 1: Metadata extraction, standardization, and integration 

**Scenario:**
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) program wants to make the data from its multiple studies readily available in an integrated, standardized format in the Cancer Research Data Commons (CRDC). Much of the patient case and sample data are available in supplemental tables in the original primary research papers, but with varying variable names and value types. This exercise is based on Li et al. 2023, "Proteogenomic data and resources for pan-cancer analysis" (Cancer Cell), which integrated and harmonized data from ten CPTAC source studies.

**Task:**
Semi-automatically extract and integrate patient case metadata from the ten source studies in Li et al. Create a harmonized dataset in the Genomic Data Commons (GDC) data format for the 15 variables shown in Table A below. Include only patients who have tumor samples with proteogenomic data and were not excluded in the source studies.

To illustrate, Table A shows data for one patient case from one paper and the corresponding information in the GDC data format. Note that some variables have no one-to-one mapping, and values for some GDC variables can be inferred from other variables in the paper's table data.


| Dou et al. Table S1                |                                 | GDC-formatted data              |                                |
|------------------------------------|---------------------------------|---------------------------------|--------------------------------|
| **Variable**                       | **Value**                       | **Variable**                    | **Value**                      |
| Proteomics_Participant_ID          | C3L-00006                       | case_submitter_id               | C3L-00006                      |
| Age                                | 64                              | age_at_diagnosis                | 23376                          |
| Gender                             | Female                          | gender                          | female                         |
| Race                               | White                           | race                            | white                          |
| Ethnicity                          | Not-Hispanic or Latino          | ethnicity                       | not hispanic or latino         |
| (none)                             | (none)                          | vital_status<sup>1</sup>        | Alive<sup>1</sup>              |
| Histologic_Grade_FIGO              | FIGO grade 1                    | tumor_grade                     | G1                             |
| tumor_Stage-Pathological           | Stage I                         | ajcc_pathologic_stage           | Stage I                        |
| Path_Stage_Reg_Lymph_Nodes-pN      | pN0                             | ajcc_pathologic_n               | N0                             |
| Path_Stage_Primary_Tumor-pT        | pT1a (FIGO IA)                  | ajcc_pathologic_t               | T1a                            |
| Tumor_Focality                     | Unifocal                        | tumor_focality                  | Unifocal                       |
| Tumor_Size_cm                      | 2.9                             | tumor_largest_dimension_diameter | 2.9                            |
| Tumor_Site                         | Anterior endometrium            | tissue_or_organ_of_origin       | Endometrium                    |
| Histologic_type                    | Endometrioid                    | primary_diagnosis               | Endometrioid carcinoma         |
| Histologic_type; Tumor_Site        | Endometrioid; Anterior endometrium | morphology                    | 8380/3                         |
| Case excluded                      | No                              | (None, but presence in this dataset indicates the sample should be included) |                                |
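
Several of the conversions in Table A are mechanical and can be sketched in a few lines. The helper names below are hypothetical, and the 365.25 days-per-year factor is an assumption: it reproduces the 64 → 23376 example in the table, but the source does not state which factor was used.

```python
import re

DAYS_PER_YEAR = 365.25  # assumption: reproduces 64 years -> 23376 days in Table A


def age_years_to_days(age_years):
    """Convert an age in years to whole days (the unit of age_at_diagnosis)."""
    return int(round(age_years * DAYS_PER_YEAR))


def normalize_category(value):
    """Lower-case a categorical value and replace hyphens with spaces."""
    return value.strip().lower().replace("-", " ")


def strip_pathologic_prefix(value):
    """Drop the leading 'p' and any parenthetical note from a staging value."""
    without_note = re.sub(r"\s*\(.*?\)", "", value).strip()
    return re.sub(r"^p(?=[TNM])", "", without_note)


print(age_years_to_days(64))
print(normalize_category("Not-Hispanic or Latino"))
print(strip_pathologic_prefix("pT1a (FIGO IA)"))
print(strip_pathologic_prefix("pN0"))
```

Variables such as `morphology` or `primary_diagnosis` cannot be derived this way; they need a lookup against the GDC vocabulary, which is what the value-matching step later in this demo addresses.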



## Load Dataset
Here, we load the source dataset that needs to be matched, as well as the target dataset, which is a standardized dataset following the GDC schema.

In [3]:
# We load dou.csv as an example; feel free to use any dataset you like :)
source = pd.read_csv("./datasets/dou.csv")
# target = "gdc"

# Alternative to the built-in "gdc" target: load a custom target table
target = pd.read_csv("./datasets/target.csv")
target

Unnamed: 0,case_submitter_id,age_at_diagnosis,race,ethnicity,gender,vital_status,ajcc_pathologic_t,ajcc_pathologic_n,ajcc_pathologic_stage,tumor_grade,tumor_focality,tumor_largest_dimension_diameter,primary_diagnosis,morphology,tissue_or_organ_of_origin,tumor_code,study
0,01BR001,20089.0,black or african american,not hispanic or latino,female,Alive,T2,N1c,Stage II,GX,Not Reported,Not Reported,Invasive carcinoma of no special type,8500/3,"Breast, NOS",BRCA,Krug
1,01BR008,17532.0,black or african american,not hispanic or latino,female,Not Reported,Not Reported,Not Reported,Not Reported,GX,Not Reported,Not Reported,Not Reported,Not Reported,"Breast, NOS",BRCA,Krug
2,01BR009,23376.0,black or african american,not hispanic or latino,female,Not Reported,Not Reported,Not Reported,Not Reported,GX,Not Reported,Not Reported,Not Reported,Not Reported,"Breast, NOS",BRCA,Krug
3,01BR010,23741.0,black or african american,not hispanic or latino,female,Not Reported,Not Reported,Not Reported,Not Reported,GX,Not Reported,Not Reported,Not Reported,Not Reported,"Breast, NOS",BRCA,Krug
4,01BR015,12784.0,white,not hispanic or latino,female,Alive,T2,N1,Stage II,GX,Not Reported,Not Reported,Invasive carcinoma of no special type,8500/3,"Breast, NOS",BRCA,Krug
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
619,C3N-02582,28124.0,Not Reported,Not Reported,male,Dead,T2b,N1,Stage II,G3,Unifocal,5.8,"Adenocarcinoma, NOS",8140/3,"Lung, NOS",LUAD,Gilette
620,C3N-02586,26663.0,Not Reported,Not Reported,male,Dead,T2a,N1,Stage II,G2,Unifocal,3.1,"Adenocarcinoma, NOS",8140/3,"Lung, NOS",LUAD,Gilette
621,C3N-02587,21550.0,Not Reported,Not Reported,female,Alive,T1a,N0,Stage I,G2,Unifocal,2,"Adenocarcinoma, NOS",8140/3,"Lung, NOS",LUAD,Gilette
622,C3N-02588,25202.0,Not Reported,Not Reported,male,Alive,T2a,N0,Stage II,G3,Unifocal,4.5,"Adenocarcinoma, NOS",8140/3,"Lung, NOS",LUAD,Gilette


## Run BDI-Viz

We use a pretrained model to identify the top 20 candidate target columns for each source column. The first computation may take some time, but the results are cached so that later visualizations load quickly.

Once BDI-Viz has finished loading, explore the heatmap and accept or reject candidates as you see fit. When you have finished reviewing, run the next cell to update the manager with the revised scope.

In [4]:
heatmap_manager = BDISchemaMatchingHeatMap(source, target=target, top_k=20)
heatmap_manager.plot_heatmap()
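
To make the idea of "top-k candidates" concrete, here is a minimal sketch that ranks target column names against one source column and keeps the k best. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; this is not the pretrained model that BDI-Viz uses, and `top_k_candidates` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def top_k_candidates(source_column, target_columns, k=3):
    """Return the k target columns most similar to source_column, best first."""
    scored = [
        (SequenceMatcher(None, source_column.lower(), t.lower()).ratio(), t)
        for t in target_columns
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(t, round(score, 3)) for score, t in scored[:k]]


targets = [
    "case_submitter_id",
    "age_at_diagnosis",
    "gender",
    "race",
    "ethnicity",
    "tumor_focality",
]
for candidate, score in top_k_candidates("Tumor_Focality", targets, k=3):
    print(candidate, score)
```

A name-based score like this only captures spelling overlap. The heatmap exists so that a reviewer can override the ranking when names are misleading.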

## Update Column Mapping Scope

In [6]:
from bdikit.mapping_algorithms.column_mapping.algorithms import TwoPhaseSchemaMatcher
from bdikit import GDC_DATA_PATH

two_phase_viz = TwoPhaseSchemaMatcher(top_k_matcher=heatmap_manager)
column_mappings = bdi.match_schema(source, target=target, method=two_phase_viz)
column_mappings

Unnamed: 0,source,target
0,Histologic_Grade_FIGO,tumor_grade
1,Histologic_type,ajcc_pathologic_stage
2,Path_Stage_Primary_Tumor-pT,ajcc_pathologic_t
3,Path_Stage_Reg_Lymph_Nodes-pN,ajcc_pathologic_n
4,tumor_Stage-Pathological,ajcc_pathologic_stage
5,Age,age_at_diagnosis
6,Race,race
7,Ethnicity,ethnicity
8,Gender,gender
9,Tumor_Focality,tumor_focality
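
In the output above, both `Histologic_type` and `tumor_Stage-Pathological` are mapped to `ajcc_pathologic_stage`, while the checkup table at the end of this demo pairs `Histologic_type` with `primary_diagnosis`. A quick check for targets claimed by more than one source column can surface such rows for review. The sketch below works on plain `(source, target)` pairs copied from the output; `find_shared_targets` is a hypothetical helper, not part of bdikit.

```python
from collections import defaultdict

# Column mappings copied from the output above as (source, target) pairs.
column_pairs = [
    ("Histologic_Grade_FIGO", "tumor_grade"),
    ("Histologic_type", "ajcc_pathologic_stage"),
    ("Path_Stage_Primary_Tumor-pT", "ajcc_pathologic_t"),
    ("Path_Stage_Reg_Lymph_Nodes-pN", "ajcc_pathologic_n"),
    ("tumor_Stage-Pathological", "ajcc_pathologic_stage"),
    ("Age", "age_at_diagnosis"),
    ("Race", "race"),
    ("Ethnicity", "ethnicity"),
    ("Gender", "gender"),
    ("Tumor_Focality", "tumor_focality"),
]


def find_shared_targets(pairs):
    """Group source columns by target and keep targets claimed more than once."""
    by_target = defaultdict(list)
    for source_col, target_col in pairs:
        by_target[target_col].append(source_col)
    return {t: s for t, s in by_target.items() if len(s) > 1}


print(find_shared_targets(column_pairs))
```

With a DataFrame such as `column_mappings`, the pairs can be obtained with `column_mappings[["source", "target"]].itertuples(index=False)`.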


## Update Value Mappings

In [7]:
# Drop rows whose target column name is blank or whitespace-only.
column_mappings = column_mappings[column_mappings['target'].str.strip().astype(bool)]
mappings = bdi.match_values(
    source,
    column_mapping=column_mappings,
    target=target,
    method="tfidf",
)

for mapping in mappings:
    print(f"{mapping.attrs['source']} => {mapping.attrs['target']}")
    display(mapping)
    print("")

Histologic_Grade_FIGO => tumor_grade


Unnamed: 0,source,target,similarity
0,FIGO grade 1,,
1,FIGO grade 3,,
2,FIGO grade 2,,
3,,,



Histologic_type => ajcc_pathologic_stage


Unnamed: 0,source,target,similarity
0,Carcinosarcoma,,
1,Serous,,
2,Endometrioid,,
3,Clear cell,,



Path_Stage_Primary_Tumor-pT => ajcc_pathologic_t


Unnamed: 0,source,target,similarity
0,pT1b (FIGO IB),T1b,0.586
1,pT1a (FIGO IA),T1a,0.569
2,pT3a (FIGO IIIA),T3a,0.434
3,pT1 (FIGO I),T1,0.339
4,pT2 (FIGO II),T2,0.336
5,pT3b (FIGO IIIB),,
6,,,



Path_Stage_Reg_Lymph_Nodes-pN => ajcc_pathologic_n


Unnamed: 0,source,target,similarity
0,pNX,NX,0.671
1,pN0,N0,0.606
2,pN1 (FIGO IIIC1),N0 (i+),0.302
3,pN2 (FIGO IIIC2),N0 (i+),0.3
4,,,



tumor_Stage-Pathological => ajcc_pathologic_stage


Unnamed: 0,source,target,similarity
0,Stage I,Stage I,1.0
1,Stage IV,Stage IV,1.0
2,Stage III,Stage III,1.0
3,Stage II,Stage II,1.0
4,,,



Race => race


Unnamed: 0,source,target,similarity
0,White,white,1.0
1,Asian,asian,1.0
2,Not Reported,Not Reported,1.0
3,Black or African American,black or african american,1.0
4,,american indian or alaska native,0.358



Ethnicity => ethnicity


Unnamed: 0,source,target,similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Not reported,Not Reported,1.0
2,Not-Hispanic or Latino,not hispanic or latino,0.935
3,,,



Gender => gender


Unnamed: 0,source,target,similarity
0,Female,female,1.0
1,,,



Tumor_Focality => tumor_focality


Unnamed: 0,source,target,similarity
0,Unifocal,Unifocal,1.0
1,Multifocal,Multifocal,1.0
2,,,
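
Some of the value matches above have low similarity scores and look doubtful, for example `pN1 (FIGO IIIC1)` matched to `N0 (i+)` at 0.302. One simple way to prioritize manual review is to split matches by a similarity cutoff. The sketch below uses rows copied from the `Path_Stage_Reg_Lymph_Nodes-pN` output; the 0.5 threshold and the `split_for_review` helper are illustrative assumptions, not bdikit defaults.

```python
# Rows copied from the Path_Stage_Reg_Lymph_Nodes-pN output above:
# (source value, matched target value, similarity)
value_matches = [
    ("pNX", "NX", 0.671),
    ("pN0", "N0", 0.606),
    ("pN1 (FIGO IIIC1)", "N0 (i+)", 0.302),
    ("pN2 (FIGO IIIC2)", "N0 (i+)", 0.3),
]

REVIEW_THRESHOLD = 0.5  # assumption: illustrative cutoff, not a bdikit default


def split_for_review(rows, threshold=REVIEW_THRESHOLD):
    """Separate matches at or above the threshold from those needing review."""
    confident, needs_review = [], []
    for source_value, target_value, similarity in rows:
        if similarity is not None and similarity >= threshold:
            confident.append((source_value, target_value))
        else:
            # Missing or low similarity: leave the decision to a reviewer.
            needs_review.append((source_value, target_value))
    return confident, needs_review


confident, needs_review = split_for_review(value_matches)
print("confident:", confident)
print("needs review:", needs_review)
```

A threshold only orders the review queue; it does not decide correctness. In the `Path_Stage_Primary_Tumor-pT` output, `pT1 (FIGO I)` is matched to `T1` with a score of just 0.339, so a low score alone is not grounds for rejecting a match.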





## Schema Matching Checkup (GDC | Cao | Dou)

| GDC_format_variable_names                        | cao_variable_names                                      | dou_variable_names            |
|--------------------------------------------------|---------------------------------------------------------|-------------------------------|
| case_submitter_id                                | case_id                                                 | Proteomics_Participant_ID     |
| age_at_diagnosis                                 | age                                                     | Age                           |
| gender                                           | sex                                                     | Gender                        |
| race                                             | race                                                    | Race                          |
| country_of_residence_at_enrollment               | participant_country                                      | Country                       |
| site_of_resection_or_biopsy; tissue_or_organ_of_origin | tumor_site                                              | Tumor_Site                    |
| tumor_focality                                   | tumor_focality                                          | Tumor_Focality                |
| tumor_largest_dimension_diameter                 | tumor_size_cm                                           | Tumor_Size_cm                 |
| vascular_invasion_present; lymphatic_invasion_present | lymph_vascular_invasion                                  |                               |
| perineural_invasion_present                      | perineural_invasion                                      |                               |
| lymph_nodes_positive                             | number_of_lymph_nodes_positive_for_tumor                |                               |
| ajcc_pathologic_n                                | pathologic_staging_regional_lymph_nodes_pn              | Path_Stage_Reg_Lymph_Nodes-pN |
| ajcc_pathologic_t                                | pathologic_staging_primary_tumor_pt                     | Path_Stage_Primary_Tumor-pT   |
| ajcc_pathologic_m                                | pathologic_staging_distant_metastasis_pm                | Path_Stage_Dist_Mets-pM       |
| ajcc_clinical_m                                  | clinical_staging_distant_metastasis_cm                  | Clin_Stage_Dist_Mets-cM       |
| residual_disease                                 | residual_tumor                                          |                               |
| ajcc_pathologic_stage                            | tumor_stage_pathological                                | tumor_Stage-Pathological      |
| bmi                                              | bmi                                                     | BMI                           |
| alcohol_intensity                                | alcohol_consumption                                     |                               |
| tobacco_smoking_status                           | tobacco_smoking_history                                 |                               |
| vital_status                                     | vital_status                                            |                               |
| cause_of_death                                   | cause_of_death                                          |                               |
| figo_stage                                       |                                                         | FIGO_stage                    |
| ethnicity                                        |                                                         | Ethnicity                     |
| primary_diagnosis                                |                                                         | Histologic_type               |
| tumor_grade                                      |                                                         | Histologic_Grade_FIGO         |
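
The checkup table can also be used programmatically. The sketch below compares the column mappings produced earlier in this demo against a subset of the Dou column of the table; both dictionaries are copied by hand from the outputs in this document, and `compare_mappings` is a hypothetical helper.

```python
# Reference pairs (Dou column -> GDC variable), a subset of the checkup table.
reference = {
    "Proteomics_Participant_ID": "case_submitter_id",
    "Age": "age_at_diagnosis",
    "Gender": "gender",
    "Race": "race",
    "Ethnicity": "ethnicity",
    "Histologic_type": "primary_diagnosis",
    "Histologic_Grade_FIGO": "tumor_grade",
    "tumor_Stage-Pathological": "ajcc_pathologic_stage",
    "Path_Stage_Reg_Lymph_Nodes-pN": "ajcc_pathologic_n",
    "Path_Stage_Primary_Tumor-pT": "ajcc_pathologic_t",
    "Tumor_Focality": "tumor_focality",
}

# Predicted pairs copied from the column-mapping output earlier in the demo.
predicted = {
    "Histologic_Grade_FIGO": "tumor_grade",
    "Histologic_type": "ajcc_pathologic_stage",
    "Path_Stage_Primary_Tumor-pT": "ajcc_pathologic_t",
    "Path_Stage_Reg_Lymph_Nodes-pN": "ajcc_pathologic_n",
    "tumor_Stage-Pathological": "ajcc_pathologic_stage",
    "Age": "age_at_diagnosis",
    "Race": "race",
    "Ethnicity": "ethnicity",
    "Gender": "gender",
    "Tumor_Focality": "tumor_focality",
}


def compare_mappings(predicted, reference):
    """Return mismatched pairs and reference columns with no prediction."""
    mismatched = {
        src: (tgt, reference[src])
        for src, tgt in predicted.items()
        if src in reference and reference[src] != tgt
    }
    missing = sorted(set(reference) - set(predicted))
    return mismatched, missing


mismatched, missing = compare_mappings(predicted, reference)
print("mismatched (predicted, expected):", mismatched)
print("missing from predictions:", missing)
```

Rows of the table where one source column corresponds to two GDC variables, such as `Tumor_Site`, are left out of this sketch because a plain dictionary holds only one target per source.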
