# Data Integration with BDI

In [1]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # disable tokenizers parallelism to silence Hugging Face warnings


First, import the `APIManager` class.

In [2]:
from bdi import APIManager

Set the `BDI_MODEL_PATH` environment variable to the path of the pre-trained model used for mapping recommendations.

In [3]:
# Replace this with the path to your local copy of the model.
os.environ['BDI_MODEL_PATH'] = '../models/arpa/model_20_1.pt'
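
A wrong model path may only surface later, when the model is first loaded. The sketch below checks the file before exporting the variable. The helper name `set_model_path` is hypothetical and not part of `bdi`; only the `BDI_MODEL_PATH` variable comes from this notebook.

```python
import os
from pathlib import Path


def set_model_path(path: str) -> Path:
    """Check that `path` is an existing file, then export it as BDI_MODEL_PATH.

    Hypothetical helper for illustration; it is not part of the bdi package.
    """
    model_path = Path(path).expanduser().resolve()
    if not model_path.is_file():
        raise FileNotFoundError(f"Model file not found: {model_path}")
    os.environ["BDI_MODEL_PATH"] = str(model_path)
    return model_path
```

Calling `set_model_path('../models/arpa/model_20_1.pt')` would then either set the variable or fail immediately with a clear error.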

## Dataset Loading

In this example, we map the Dou dataset to the GDC (Genomic Data Commons) format.

In [4]:
manager = APIManager()

In [5]:
# dataset_path = './datasets/dou.csv'
dataset_path = '../experiments/table-union/data/tables/Dou.csv'  # curated Dou path
dataset = manager.load_dataset(dataset_path)
dataset

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148,,,,,,,,,,,,,,,,,
149,,,,,,,,,,,,,,,,,
150,,,,,,,,,,,,,,,,,
151,,,,,,,,,,,,,,,,,


## Column Mapping

Reduce the scope of GDC by selecting the top-k candidate GDC columns for each column of the dataset. Each candidate is listed with its similarity score.

In [6]:
manager.reduce_scope()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/17 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
100%|██████████| 17/17 [00:00<00:00, 61.10it/s]


Table features extracted from 17 columns


100%|██████████| 734/734 [00:20<00:00, 35.71it/s]

Table features extracted from 734 columns





[{'Candidate column': 'Country',
  'Top k columns': [('country_of_birth', '0.5726'),
   ('country_of_residence_at_enrollment', '0.5151'),
   ('variant_origin', '0.3803'),
   ('zone_of_origin_prostate', '0.3563'),
   ('tumor_confined_to_organ_of_origin', '0.3322'),
   ('race', '0.2936'),
   ('vascular_invasion_present', '0.291'),
   ('lymphatic_invasion_present', '0.287'),
   ('ethnicity', '0.2618'),
   ('perineural_invasion_present', '0.2578')]},
 {'Candidate column': 'Histologic_Grade_FIGO',
  'Top k columns': [('histologic_progression_type', '0.6556'),
   ('who_nte_grade', '0.5967'),
   ('tumor_grade', '0.5817'),
   ('tumor_grade_category', '0.5759'),
   ('inpc_grade', '0.5104'),
   ('igcccg_stage', '0.4971'),
   ('who_cns_grade', '0.495'),
   ('risk_factor_method_of_diagnosis', '0.4742'),
   ('enneking_msts_grade', '0.4695'),
   ('adverse_event_grade', '0.4679')]},
 {'Candidate column': 'Histologic_type',
  'Top k columns': [('history_of_tumor_type', '0.6765'),
   ('roots', '0.6562'
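
The candidate lists can be post-processed with plain Python. The following is a minimal sketch, assuming the structure displayed above (a list of dicts whose scores are strings) is available as a Python object. The function `filter_candidates` and the `0.55` threshold are hypothetical choices for illustration, not part of `bdi`.

```python
# A subset of the candidates shown above; note that scores are strings.
candidates = [
    {
        "Candidate column": "Country",
        "Top k columns": [
            ("country_of_birth", "0.5726"),
            ("country_of_residence_at_enrollment", "0.5151"),
            ("variant_origin", "0.3803"),
        ],
    },
    {
        "Candidate column": "Histologic_Grade_FIGO",
        "Top k columns": [
            ("histologic_progression_type", "0.6556"),
            ("who_nte_grade", "0.5967"),
            ("tumor_grade", "0.5817"),
        ],
    },
]


def filter_candidates(candidates, min_score=0.5):
    """Keep, per source column, the candidates scoring at least `min_score`."""
    filtered = {}
    for entry in candidates:
        filtered[entry["Candidate column"]] = [
            (name, float(score))  # convert the string score to a float
            for name, score in entry["Top k columns"]
            if float(score) >= min_score
        ]
    return filtered


# The threshold is an arbitrary value chosen for this example.
shortlist = filter_candidates(candidates, min_score=0.55)
print(shortlist)
```

A threshold like this is one way to shorten the lists before manual review; the right cutoff depends on the dataset.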

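Map each column of the dataset to a GDC column. The result is a dictionary from source column names to the matched GDC column names.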

In [7]:
column_mappings = manager.map_columns()
column_mappings

running ComaAlgorithm


{'Country': 'country_of_birth',
 'Histologic_Grade_FIGO': 'histologic_progression_type',
 'Histologic_type': 'history_of_tumor_type',
 'Path_Stage_Primary_Tumor-pT': 'ajcc_pathologic_stage',
 'Path_Stage_Reg_Lymph_Nodes-pN': 'ajcc_pathologic_stage',
 'Clin_Stage_Dist_Mets-cM': 'uicc_clinical_stage',
 'Path_Stage_Dist_Mets-pM': 'masaoka_stage',
 'tumor_Stage-Pathological': 'ensat_pathologic_stage',
 'FIGO_stage': 'figo_stage',
 'BMI': 'bmi',
 'Age': 'age_at_onset',
 'Race': 'race',
 'Ethnicity': 'ethnicity',
 'Gender': 'gender',
 'Tumor_Site': 'tumor_shape',
 'Tumor_Focality': 'tumor_focality',
 'Tumor_Size_cm': 'tumor_thickness'}
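
In the mapping above, two source columns share the target `ajcc_pathologic_stage`, so renaming columns directly would produce duplicate names. The sketch below flags such collisions before the mapping is applied. It uses a subset of the mapping shown above, and the helper `duplicate_targets` is hypothetical, not part of `bdi`.

```python
from collections import Counter

# A subset of the column mappings shown above.
column_mappings = {
    "Path_Stage_Primary_Tumor-pT": "ajcc_pathologic_stage",
    "Path_Stage_Reg_Lymph_Nodes-pN": "ajcc_pathologic_stage",
    "FIGO_stage": "figo_stage",
    "BMI": "bmi",
    "Race": "race",
}


def duplicate_targets(mapping):
    """Return the target names that more than one source column maps to."""
    counts = Counter(mapping.values())
    return sorted(target for target, n in counts.items() if n > 1)


collisions = duplicate_targets(column_mappings)

# Keep only the mappings whose target is unique; the rest need manual review.
unambiguous = {
    source: target
    for source, target in column_mappings.items()
    if target not in collisions
}
print(collisions)
print(unambiguous)
```

If the loaded dataset is a pandas `DataFrame`, which its display suggests, the unambiguous part could then be applied with `dataset.rename(columns=unambiguous)`.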

## Value Mapping

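Map the values of each source column to the values of its matched GDC column. For each column, the output lists the current value, the target value, and their similarity.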

In [8]:
manager.map_values()

Column tumor_Stage-Pathological:
| Current Value   | Target Value   | Similarity   |
|-----------------+----------------+--------------|
| Stage I         | Stage I        | 1.0          |
| Stage IV        | Stage IV       | 1.0          |
| Stage III       | Stage III      | 1.0          |
| Stage II        | Stage II       | 1.0          |
| nan             | -              | -            | 

Column Race:
| Current Value             | Target Value              | Similarity   |
|---------------------------+---------------------------+--------------|
| White                     | white                     | 1.0          |
| Asian                     | asian                     | 1.0          |
| Not Reported              | not reported              | 1.0          |
| Black or African American | black or african american | 1.0          |
| nan                       | -                         | -            | 
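
Matches like those in the Race table above could be applied to a column with a simple lookup. This is a minimal sketch that copies the matches into a dictionary by hand. The helper `map_value` and the choice to return `None` for missing or unmatched values are assumptions for illustration, not behavior of `bdi`.

```python
import math

# Matches copied from the Race table above.
race_value_map = {
    "White": "white",
    "Asian": "asian",
    "Not Reported": "not reported",
    "Black or African American": "black or african american",
}


def map_value(value, value_map):
    """Return the target value, or None for missing or unmatched values."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None  # missing values have no target, as in the table
    return value_map.get(value)  # None when the value has no match


# Toy column containing a missing value.
race_column = ["White", float("nan"), "Asian", "Not Reported"]
harmonized = [map_value(v, race_value_map) for v in race_column]
print(harmonized)
```

Returning `None` keeps unmatched values visible for review instead of silently passing the original value through.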

Column Tumor_Focality:
| Current Value   | Target Value   | Similarity   |
