# BDI-Viz Demo On GDC

## Introduction
Welcome to the BDI-Viz demonstration. This demo generates candidate ground-truth mappings against the GDC schema. If you encounter a bug or see room for improvement, please contact **Eden Wu** or **Vitoria Guardieiro**, or open an issue in the [BDI-Kit](https://github.com/VIDA-NYU/bdi-kit) repository. We appreciate your feedback and collaboration!

In [1]:
import os, sys

parent_dir = os.path.abspath("..")
# the parent_dir could already be there if the kernel was not restarted,
# and we run this cell again
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

## Load Dataset

In [2]:
import pandas as pd
import numpy as np
import json

import panel as pn
from bdikit import APIManager

pn.extension("mathjax")
pn.extension("vega")


manager = APIManager()

# We load dou.csv as an example; substitute any dataset you like :)
manager.load_dataset("./datasets/dou.csv")



Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,Ukraine,FIGO grade 3,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,29.40,75.0,,,Female,"Other, specify",Unifocal,4.2
100,Ukraine,FIGO grade 2,Endometrioid,pT2 (FIGO II),pN0,cM0,Staging Incomplete,Stage II,II,35.42,74.0,,,Female,"Other, specify",Unifocal,1.5
101,United States,,Serous,pT2 (FIGO II),pN0,Staging Incomplete,Staging Incomplete,Stage II,II,24.32,85.0,Black or African American,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.8
102,Ukraine,,Serous,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,34.06,70.0,,,Female,"Other, specify",Unifocal,5.0


## Run BDI-Viz

We use a pretrained model to identify the top 20 candidate GDC columns for each column in the loaded dataset. The first run may take some time, but the results are cached so that later visualizations load quickly.
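As a rough illustration of what selecting top-k candidates involves, the sketch below ranks target columns by cosine similarity between embedding vectors. This is an assumption made for illustration: the notebook does not show how BDI-Kit scores candidates, and the function `top_k_candidates` and the toy 2-D vectors are hypothetical.

```python
import numpy as np


def top_k_candidates(source_vec, target_vecs, target_names, k=20):
    """Rank target columns by cosine similarity to one source column.

    Hypothetical helper for illustration; not part of BDI-Kit.
    """
    # Normalize so that the dot product equals the cosine similarity
    source = source_vec / np.linalg.norm(source_vec)
    targets = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    scores = targets @ source
    # Indices of the k highest scores, best first
    order = np.argsort(-scores)[:k]
    return [(target_names[i], float(scores[i])) for i in order]


# Toy embeddings invented for this example (real embeddings would come from a model)
names = ["tumor_grade", "country_of_birth", "figo_stage"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

ranked = top_k_candidates(np.array([1.0, 0.1]), vecs, names, k=2)
print(ranked)
```

With real embeddings the same ranking step would apply unchanged; only the vectors differ.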

Once BDI-Viz has finished loading, explore the candidates and accept or reject them as you see fit. When you have finished your review, run the next cell to update the manager with the revised scope.

In [3]:
reduce_scope_plot = manager.reduce_scope()
reduce_scope_plot

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Update Column Mapping Scope

In [4]:
manager.update_scope()

[{'Candidate column': 'Country',
  'Top k columns': [['country_of_birth', '0.5726']]},
 {'Candidate column': 'Histologic_Grade_FIGO',
  'Top k columns': [['histologic_progression_type', '0.6556'],
   ['who_nte_grade', '0.5967'],
   ['tumor_grade', '0.5817'],
   ['tumor_grade_category', '0.5759'],
   ['inpc_grade', '0.5104'],
   ['igcccg_stage', '0.4971'],
   ['who_cns_grade', '0.495'],
   ['risk_factor_method_of_diagnosis', '0.4742'],
   ['enneking_msts_grade', '0.4695'],
   ['adverse_event_grade', '0.4679'],
   ['inss_stage', '0.4639'],
   ['tumor_regression_grade', '0.4593'],
   ['extrathyroid_extension', '0.4561'],
   ['secondary_gleason_grade', '0.4304'],
   ['education_level', '0.4262'],
   ['gleason_grade_tertiary', '0.422'],
   ['uicc_clinical_stage', '0.4165'],
   ['ajcc_clinical_stage', '0.4122'],
   ['data_type', '0.4086'],
   ['ensat_pathologic_stage', '0.401']]},
 {'Candidate column': 'Histologic_type',
  'Top k columns': [['history_of_tumor_type', '0.6757'],
   ['roots', '
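
The scope printed above is a list of dictionaries in which each score is stored as a string. A minimal sketch of flattening that structure into a DataFrame with numeric scores follows. It assumes the return value of `manager.update_scope()` has the shape shown in the output; the excerpt is copied by hand from the first entries, and the 0.58 cutoff is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Hand-copied excerpt of the printed scope (first entries only)
scope = [
    {"Candidate column": "Country",
     "Top k columns": [["country_of_birth", "0.5726"]]},
    {"Candidate column": "Histologic_Grade_FIGO",
     "Top k columns": [["histologic_progression_type", "0.6556"],
                       ["who_nte_grade", "0.5967"],
                       ["tumor_grade", "0.5817"]]},
]

# One row per (source column, candidate target) pair; scores converted to float
rows = [
    {"source": entry["Candidate column"], "target": name, "score": float(score)}
    for entry in scope
    for name, score in entry["Top k columns"]
]
scope_df = pd.DataFrame(rows)

# Keep only candidates at or above an illustrative threshold
strong = scope_df[scope_df["score"] >= 0.58]
print(strong)
```

Converting the scores to floats first matters because comparing the raw strings would sort and filter lexicographically rather than numerically.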

## Generate Ground Truth Pairs

In [5]:
column_mappings = manager.map_columns()

Unnamed: 0,Original Column,Target Column
0,Country,country_of_birth
1,Histologic_Grade_FIGO,histologic_progression_type
2,Histologic_type,dysplasia_type
3,Path_Stage_Primary_Tumor-pT,ajcc_clinical_m
4,Path_Stage_Reg_Lymph_Nodes-pN,figo_stage
5,Clin_Stage_Dist_Mets-cM,inrg_stage
6,Path_Stage_Dist_Mets-pM,last_known_disease_status
7,tumor_Stage-Pathological,tumor_grade_category
8,FIGO_stage,figo_stage
9,BMI,age_at_index
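
Before moving on to value mapping, it can help to check the proposed pairs for obvious conflicts. The sketch below rebuilds the ten pairs printed above as a DataFrame by hand (the type returned by `manager.map_columns()` is not shown, so this is an assumption) and lists source columns that were assigned the same target.

```python
import pandas as pd

# The ten pairs printed above, copied by hand
mappings = pd.DataFrame({
    "Original Column": [
        "Country", "Histologic_Grade_FIGO", "Histologic_type",
        "Path_Stage_Primary_Tumor-pT", "Path_Stage_Reg_Lymph_Nodes-pN",
        "Clin_Stage_Dist_Mets-cM", "Path_Stage_Dist_Mets-pM",
        "tumor_Stage-Pathological", "FIGO_stage", "BMI",
    ],
    "Target Column": [
        "country_of_birth", "histologic_progression_type", "dysplasia_type",
        "ajcc_clinical_m", "figo_stage", "inrg_stage",
        "last_known_disease_status", "tumor_grade_category", "figo_stage",
        "age_at_index",
    ],
})

# Rows whose target is shared with at least one other source column
dupes = mappings[mappings.duplicated("Target Column", keep=False)]
print(dupes)
```

Here both `Path_Stage_Reg_Lymph_Nodes-pN` and `FIGO_stage` map to `figo_stage`, which is a natural place to start a manual review.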


In [None]:
# Set your OpenAI API key here
%env OPENAI_API_KEY=

In [13]:
value_mappings = manager.map_values("LLMAlgorithm")


Column Histologic_Grade_FIGO:


Unnamed: 0,Current Value,Target Value,Similarity
0,FIGO grade 1,G1,1.0
1,FIGO grade 2,G2,1.0
2,,Unknown,1.0
3,FIGO grade 3,G3,1.0



Column Histologic_type:


Unnamed: 0,Current Value,Target Value,Similarity
0,Endometrioid,Colorectal Cancer,0.1
1,Carcinosarcoma,Colorectal Cancer,0.1
2,Serous,Lower Grade Glioma,0.1
3,Clear cell,Phenochromocytoma or Paraganglioma,0.5



Column Path_Stage_Primary_Tumor-pT:


Unnamed: 0,Current Value,Target Value,Similarity
0,pT1a (FIGO IA),Stage IA,1.0
1,,Unknown,1.0
2,pT3a (FIGO IIIA),Stage IIIA,1.0
3,pT1 (FIGO I),Stage IA,1.0
4,pT1b (FIGO IB),Stage IB,1.0
5,pT2 (FIGO II),Stage II,1.0
6,pT3b (FIGO IIIB),Stage IIIB,1.0



Column Path_Stage_Reg_Lymph_Nodes-pN:


Unnamed: 0,Current Value,Target Value,Similarity
0,pN0,Unknown,1.0
1,pNX,Stage X,1.0
2,,Unknown,1.0
3,pN2 (FIGO IIIC2),Stage IIIC2,1.0
4,pN1 (FIGO IIIC1),Stage IIIC1,1.0



Column Clin_Stage_Dist_Mets-cM:


Unnamed: 0,Current Value,Target Value,Similarity
0,cM0,M0,1.0
1,,Unknown,1.0
2,Staging Incomplete,Unknown,0.9
3,cM1,M1,1.0



Column Path_Stage_Dist_Mets-pM:


Unnamed: 0,Current Value,Target Value,Similarity
0,Staging Incomplete,Unknown tumor status,0.9
1,,not reported,1.0
2,No pathologic evidence of distant metastasis,Tumor free,0.9
3,pM1,With tumor,0.6



Column tumor_Stage-Pathological:


Unnamed: 0,Current Value,Target Value,Similarity
0,Stage I,Stage I,1.0
1,Stage IV,Stage IV,1.0
2,,Stage I,0.0
3,Stage III,Stage III,1.0
4,Stage II,Stage II,1.0



Column FIGO_stage:


Unnamed: 0,Current Value,Target Value,Similarity
0,IA,Stage IA,1.0
1,,Unknown,1.0
2,IIIA,Stage IIIA,1.0
3,IIIC2,Stage IIIC2,1.0
4,IB,Stage IB,1.0
5,II,Stage II,1.0
6,IIIC1,Stage IIIC1,1.0
7,IVB,Stage IVB,1.0
8,IIIB,Stage IIIB,1.0



Column Race:


Unnamed: 0,Current Value,Target Value,Similarity
0,White,white,1.0
1,,not reported,1.0
2,Asian,asian,1.0
3,Not Reported,not reported,1.0
4,Black or African American,black or african american,1.0



Column Gender:


Unnamed: 0,Current Value,Target Value,Similarity
0,Female,female,1.0
1,,unknown,1.0



Column Tumor_Site:


Unnamed: 0,Current Value,Target Value,Similarity
0,Anterior endometrium,Unknown,1.0
1,Posterior endometrium,Unknown,1.0
2,"Other, specify",Unknown,1.0
3,,Unknown,1.0



Column Tumor_Focality:


Unnamed: 0,Current Value,Target Value,Similarity
0,Unifocal,Unifocal,1.0
1,,Unknown,1.0
2,Multifocal,Multifocal,1.0



Column Country:


Unnamed: 0,Current Value,Target Value,Similarity
0,United States,United States,1
1,Other_specify,Afghanistan,0
2,Ukraine,Ukraine,1.0
3,Poland,Poland,1.0
4,,-,-



Column Ethnicity:


Unnamed: 0,Current Value,Target Value,Similarity
0,Not-Hispanic or Latino,not hispanic or latino,1.0
1,,not hispanic or latino,0.5
2,Hispanic or Latino,hispanic or latino,1.0
3,Not reported,-,-
