# Data Integration With `bdi-kit`

First, import the `APIManager` class.

In [1]:
from bdikit import APIManager

Set the path to the pre-trained model used for mapping recommendations. The model can be downloaded [here](https://drive.google.com/file/d/1YdCTd-kUMjDJaltQwXN4X9ezTCsfjyft/view).

In [2]:
import os
os.environ['BDIKIT_MODEL_PATH'] = '/Users/rlopez/Downloads/model_20_1.pt'  # replace with the path to your downloaded model
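
A wrong path is easy to miss until the model is actually loaded, so it can help to validate it up front. The helper below is a minimal sketch and not part of `bdi-kit`: `resolve_model_path` is a hypothetical name, and the only thing taken from this notebook is the `BDIKIT_MODEL_PATH` variable.

```python
import os
from pathlib import Path


def resolve_model_path(env_var: str = "BDIKIT_MODEL_PATH") -> Path:
    """Return the path stored in `env_var`, or raise a descriptive error.

    Hypothetical helper for illustration; bdi-kit does not provide it.
    """
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"{env_var} is not set")
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    return path
```

Calling `resolve_model_path()` right after setting the variable fails fast with a readable message if the file is not where you expect it.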

## Dataset Loading

In this example, we map data from [Dou et al.](https://pubmed.ncbi.nlm.nih.gov/37567170/) to the GDC (Genomic Data Commons) format.

In [3]:
manager = APIManager()

In [4]:
dataset_path = './datasets/dou.csv'
dataset = manager.load_dataset(dataset_path)
dataset

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,Ukraine,FIGO grade 3,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,29.40,75.0,,,Female,"Other, specify",Unifocal,4.2
100,Ukraine,FIGO grade 2,Endometrioid,pT2 (FIGO II),pN0,cM0,Staging Incomplete,Stage II,II,35.42,74.0,,,Female,"Other, specify",Unifocal,1.5
101,United States,,Serous,pT2 (FIGO II),pN0,Staging Incomplete,Staging Incomplete,Stage II,II,24.32,85.0,Black or African American,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.8
102,Ukraine,,Serous,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,34.06,70.0,,,Female,"Other, specify",Unifocal,5.0


## Reducing the GDC Scope

Since the GDC contains 700+ attributes, the first step is to select the subset of GDC attributes that are likely matches for the columns in the Dou et al. dataset: the top-k candidates for each column.

By default, it shows the top 5 candidates for 5 columns. You can change this with the `num_columns` and `num_candidates` parameters.
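
To make the idea of top-k candidates concrete, here is a minimal sketch of ranking target attributes by similarity to one source column. It assumes each column is represented by an embedding vector compared with cosine similarity; the notebook does not state how `bdi-kit` computes its scores, and the vectors and function names below are invented for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_candidates(source_vec, target_vecs, k=5):
    """Return the k target names most similar to source_vec, best first."""
    scored = (
        (cosine_similarity(source_vec, vec), name)
        for name, vec in target_vecs.items()
    )
    best = heapq.nlargest(k, scored)
    return [(name, round(score, 4)) for score, name in best]


# Toy 3-dimensional vectors, invented for illustration only.
targets = {
    "country_of_birth": [0.9, 0.1, 0.0],
    "variant_origin": [0.2, 0.8, 0.1],
    "tumor_grade": [0.0, 0.2, 0.9],
}
ranked = top_k_candidates([1.0, 0.0, 0.0], targets, k=2)
print(ranked)
```

Keeping only the k best candidates per column is what shrinks the 700+ GDC attributes to a set small enough to review by hand.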

In [5]:
reduced_scope = manager.reduce_scope()

Table features extracted from 17 columns

Table features extracted from 734 columns

Country:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,country_of_birth,0.5726,The name of the country in which the patient is born.,"Afghanistan, Albania, Algeria, Andorra, Angola, Anguilla, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bah..."
1,country_of_residence_at_enrollment,0.5151,The text term used to describe the patient's country of residence at the time they were enrolled in the study.,"Afghanistan, Albania, Algeria, Andorra, Angola, Anguilla, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bah..."
2,variant_origin,0.3803,The text term used to describe the biological origin of a specific genetic variant.,"Germline, Somatic, Unknown"
3,zone_of_origin_prostate,0.3563,The location or position of the tumor by zone of the prostate.,"Central zone, Overlapping/multiple zones, Peripheral zone, Transition zone, Unknown zone"
4,tumor_confined_to_organ_of_origin,0.3322,The yes/no/unknown indicator used to describe whether the tumor is confined to the organ where it originated and did not spread to a proximal or dista...,"Yes, No, Unknown, Not Reported"



Histologic_Grade_FIGO:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,histologic_progression_type,0.6556,Text term to describe the disease progression as determined by microscopic review of cells and their surrounding extracellular environment in tissues.,"Anaplastic, Poorly differentiated, Unknown, Not Reported"
1,who_nte_grade,0.5967,The WHO (World Health Organization) grading classification of Neuroendocrine Tumors.,"G1, G2, G3, GX, Unknown, Not Reported"
2,tumor_grade,0.5817,"Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.","G1, G2, G3, G4, GB, GX, High Grade, Intermediate Grade, Low Grade, Unknown, Not Reported"
3,tumor_grade_category,0.5759,Describes the number of levels or 'tiers' in the system used to determine the degree of tumor differentiation.,"Four Tier, Three Tier, Not Reported"
4,inpc_grade,0.5104,"Text term used to describe the classification of neuroblastic differentiation within neuroblastoma tumors, as defined by the International Neuroblasto...","Differentiating, Poorly Differentiated, Undifferentiated, Undifferentiated or Poorly Differentiated, Unknown, Not Reported"



Histologic_type:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,history_of_tumor_type,0.6757,Describes the type of the patient's prior diagnosed tumor.,"Colorectal Cancer, Lower Grade Glioma, Phenochromocytoma or Paraganglioma"
1,roots,0.6592,,
2,percent_sarcomatoid_features,0.5766,Numeric value that represents the percentage of sarcomatoid features found in a specific tissue sample.,
3,additional_pathology_findings,0.5398,A section header that includes additional pathologic findings.,"Adenomyosis, Asbestos bodies, Atrophic endometrium, Atypical hyperplasia/Endometrial intraepithelial neoplasia (EIN), Autoimmune atrophic chronic gast..."
4,relationship_primary_diagnosis,0.5278,The text term used to describe the malignant diagnosis of the patient's relative with a history of cancer.,"Adrenal Gland Cancer, Basal Cell Cancer, Bile Duct Cancer, Bladder Cancer, Blood Cancer, Bone Cancer, Brain Cancer, Breast Cancer, Cancer, Cervical Ca..."



Path_Stage_Primary_Tumor-pT:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,uicc_clinical_stage,0.7404,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
1,ajcc_clinical_stage,0.6784,"Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis f...","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IB Cervix, Stage IC, Stage II,..."
2,uicc_pathologic_stage,0.6754,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
3,figo_stage,0.673,"The extent of a cervical or endometrial cancer within the body, especially whether the disease has spread from the original site to other parts of the...","Stage 0, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage IC1, Stage IC2, Stage IC3, Stage II, Stage IIA, Stag..."
4,ajcc_pathologic_stage,0.6702,"The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body based on AJCC staging criteria.","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."



Path_Stage_Reg_Lymph_Nodes-pN:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,figo_stage,0.6017,"The extent of a cervical or endometrial cancer within the body, especially whether the disease has spread from the original site to other parts of the...","Stage 0, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage IC1, Stage IC2, Stage IC3, Stage II, Stage IIA, Stag..."
1,uicc_clinical_stage,0.5684,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
2,inss_stage,0.5541,"Text term used to describe the staging classification of neuroblastic tumors, as defined by the International Neuroblastoma Staging System (INSS).","Stage 1, Stage 2A, Stage 2B, Stage 3, Stage 4, Stage 4S, Unknown, Not Reported"
3,uicc_pathologic_stage,0.5505,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
4,ensat_pathologic_stage,0.5397,An adrenal cancer stage defined according to the European Network for the Study of Adrenal Tumors (ENSAT) criteria.,"Stage I, Stage II, Stage III, Stage IV"


## Column Mapping

Perform column mapping. By default, it uses the [similarity flooding algorithm](https://ieeexplore.ieee.org/document/994702).

In [6]:
column_mappings = manager.map_columns()

Unnamed: 0,Original Column,Target Column
0,Country,country_of_birth
1,Histologic_Grade_FIGO,histologic_progression_type
2,Histologic_type,dysplasia_type
3,Path_Stage_Primary_Tumor-pT,uicc_clinical_m
4,Path_Stage_Reg_Lymph_Nodes-pN,inss_stage
5,Clin_Stage_Dist_Mets-cM,inrg_stage
6,Path_Stage_Dist_Mets-pM,last_known_disease_status
7,tumor_Stage-Pathological,tumor_grade_category
8,FIGO_stage,inss_stage
9,BMI,hpv_positive_type


You can choose a different algorithm for column mapping. A GPT-based algorithm (`GPTAlgorithm`) is provided. To use it, set your OpenAI API key in an environment variable (`export OPENAI_API_KEY='your-api-key-here'`).

In [7]:
column_mappings = manager.map_columns(algorithm='GPTAlgorithm')

Unnamed: 0,Original Column,Target Column
0,Country,country_of_residence_at_enrollment
1,Histologic_Grade_FIGO,tumor_grade
2,Histologic_type,sample_type
3,Path_Stage_Primary_Tumor-pT,figo_stage
4,Path_Stage_Reg_Lymph_Nodes-pN,ajcc_pathologic_n
5,Clin_Stage_Dist_Mets-cM,ajcc_clinical_m
6,Path_Stage_Dist_Mets-pM,ajcc_pathologic_m
7,tumor_Stage-Pathological,ajcc_pathologic_stage
8,FIGO_stage,figo_stage
9,BMI,bmi


You can correct individual column mappings with the `update_column_mappings` method, which takes a list of (original column, target column) pairs.

In [9]:
manager.update_column_mappings([('Histologic_type', 'primary_diagnosis'), ('Path_Stage_Primary_Tumor-pT', 'ajcc_pathologic_t')])

Column mapping updated!


Unnamed: 0,Original Column,Target Column
0,Country,country_of_residence_at_enrollment
1,Histologic_Grade_FIGO,tumor_grade
2,Histologic_type,primary_diagnosis
3,Path_Stage_Primary_Tumor-pT,ajcc_pathologic_t
4,Path_Stage_Reg_Lymph_Nodes-pN,ajcc_pathologic_n
5,Clin_Stage_Dist_Mets-cM,ajcc_clinical_m
6,Path_Stage_Dist_Mets-pM,ajcc_pathologic_m
7,tumor_Stage-Pathological,ajcc_pathologic_stage
8,FIGO_stage,figo_stage
9,BMI,bmi
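
The effect of the update can be pictured as overriding entries in a dictionary. The sketch below reproduces the change shown above on a subset of the columns. It is not `bdi-kit`'s implementation: `apply_mapping_updates` is a hypothetical helper, and raising an error for an unknown source column is a choice made for this example.

```python
def apply_mapping_updates(mappings, updates):
    """Return a copy of `mappings` with each (source, target) pair applied."""
    updated = dict(mappings)  # leave the caller's dictionary untouched
    for source, target in updates:
        if source not in updated:
            # Assumption of this sketch: unknown columns are treated as errors.
            raise KeyError(f"Unknown source column: {source}")
        updated[source] = target
    return updated


# Subset of the GPT-based mappings shown earlier.
current = {
    "Country": "country_of_residence_at_enrollment",
    "Histologic_Grade_FIGO": "tumor_grade",
    "Histologic_type": "sample_type",
    "Path_Stage_Primary_Tumor-pT": "figo_stage",
}
updates = [
    ("Histologic_type", "primary_diagnosis"),
    ("Path_Stage_Primary_Tumor-pT", "ajcc_pathologic_t"),
]
revised = apply_mapping_updates(current, updates)
```

Only the two listed columns change; every other mapping is carried over as is.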


## Value Mapping

Perform value mapping. By default, it uses the edit distance algorithm. In this example, we use an LLM-based algorithm instead.
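
For reference, edit-distance matching can be sketched as follows. The notebook does not say which variant `bdi-kit` uses, so this example assumes plain Levenshtein distance, lowercases both strings, and normalizes by the longer length. All function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(value, targets):
    """Return the (target, similarity) pair with the highest similarity."""
    return max(((t, similarity(value, t)) for t in targets), key=lambda pair: pair[1])
```

With these definitions, `best_match("White", ["white", "asian", "not reported"])` picks `white` with a similarity of 1.0, while a pair such as `FIGO grade 1` and `G1` scores low because the strings differ greatly in length. Pairs like that are where a character-level measure is least informative.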

In [10]:
value_mappings = manager.map_values('LLMAlgorithm')


Column Histologic_Grade_FIGO:


Unnamed: 0,Current Value,Target Value,Similarity
0,FIGO grade 1,G1,1.0
1,FIGO grade 2,G2,1.0
2,,Not Reported,1.0
3,FIGO grade 3,G3,1.0



Column Histologic_type:


Unnamed: 0,Current Value,Target Value,Similarity
0,Endometrioid,"Endometrioid adenocarcinoma, NOS",1.0
1,Carcinosarcoma,"Carcinosarcoma, NOS",1.0
2,Serous,"Serous cystadenocarcinoma, NOS",0.6
3,Clear cell,Clear cell carcinoma,1.0



Column Path_Stage_Primary_Tumor-pT:


Unnamed: 0,Current Value,Target Value,Similarity
0,pT1a (FIGO IA),T1a,1.0
1,,Unknown,1.0
2,pT3a (FIGO IIIA),T3a,1.0
3,pT1 (FIGO I),T1,1.0
4,pT1b (FIGO IB),T1b,1.0
5,pT2 (FIGO II),T2,1.0
6,pT3b (FIGO IIIB),T3b,1.0



Column Path_Stage_Reg_Lymph_Nodes-pN:


Unnamed: 0,Current Value,Target Value,Similarity
0,pN0,N0,1.0
1,pNX,NX,1.0
2,,Unknown,1.0
3,pN2 (FIGO IIIC2),N2,1.0
4,pN1 (FIGO IIIC1),N1,1.0



Column Clin_Stage_Dist_Mets-cM:


Unnamed: 0,Current Value,Target Value,Similarity
0,cM0,M0,0.9
1,,Unknown,1.0
2,Staging Incomplete,Unknown,0.9
3,cM1,M1,1.0



Column Path_Stage_Dist_Mets-pM:


Unnamed: 0,Current Value,Target Value,Similarity
0,Staging Incomplete,Unknown,0.9
1,,Not Reported,1.0
2,No pathologic evidence of distant metastasis,M0,1.0
3,pM1,M1,1.0



Column tumor_Stage-Pathological:


Unnamed: 0,Current Value,Target Value,Similarity
0,Stage I,Stage I,1.0
1,Stage IV,Stage IV,1.0
2,,Unknown,1.0
3,Stage III,Stage III,1.0
4,Stage II,Stage II,1.0



Column FIGO_stage:


Unnamed: 0,Current Value,Target Value,Similarity
0,IA,Stage IA,1.0
1,,Not Reported,1.0
2,IIIA,Stage IIIA,1.0
3,IIIC2,Stage IIIC2,1.0
4,IB,Stage IB,1.0
5,II,Stage II,1.0
6,IIIC1,Stage IIIC1,1.0
7,IVB,Stage IVB,1.0
8,IIIB,Stage IIIB,1.0



Column Race:


Unnamed: 0,Current Value,Target Value,Similarity
0,White,white,1.0
1,,not reported,1.0
2,Asian,asian,1.0
3,Not Reported,not reported,1.0
4,Black or African American,black or african american,1.0



Column Tumor_Site:


Unnamed: 0,Current Value,Target Value,Similarity
0,Anterior endometrium,Corpus uteri,0.7
1,Posterior endometrium,Corpus uteri,0.85
2,"Other, specify",Unknown,1.0
3,,Not Applicable,1.0



Column Tumor_Focality:


Unnamed: 0,Current Value,Target Value,Similarity
0,Unifocal,Unifocal,1.0
1,,Unknown,1.0
2,Multifocal,Multifocal,1.0



Column Country:


Unnamed: 0,Current Value,Target Value,Similarity
0,United States,United States,1.0
1,Other_specify,Andorra,0.0
2,Ukraine,Ukraine,1.0
3,Poland,Poland,1.0
4,,-,-



Column Ethnicity:


Unnamed: 0,Current Value,Target Value,Similarity
0,Not-Hispanic or Latino,not hispanic or latino,1.0
1,,hispanic or latino,0
2,Hispanic or Latino,hispanic or latino,1.0
3,Not reported,-,-



Column Gender:


Unnamed: 0,Current Value,Target Value,Similarity
0,,DNA,0.1
1,Female,-,-
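
Some of the mappings above carry low similarity scores or no match at all (for example, `Other_specify` mapped to `Andorra` with a score of 0.0). A simple way to collect such rows for manual review is sketched below. The 0.8 threshold and the `needs_review` helper are assumptions made for this example and are not part of `bdi-kit`.

```python
def needs_review(rows, threshold=0.8):
    """Return rows whose similarity is missing or below `threshold`.

    Each row is a (current_value, target_value, similarity) tuple of strings,
    mirroring the tables printed above.
    """
    flagged = []
    for current, target, score in rows:
        try:
            value = float(score)
        except ValueError:
            value = None  # a placeholder such as "-" means no match was produced
        if value is None or value < threshold:
            flagged.append((current, target, score))
    return flagged


# Rows copied from the Country table above.
rows = [
    ("United States", "United States", "1.0"),
    ("Other_specify", "Andorra", "0.0"),
    ("Ukraine", "Ukraine", "1.0"),
    ("Poland", "Poland", "1.0"),
    ("", "-", "-"),
]
flagged = needs_review(rows)
```

Here `flagged` contains the `Other_specify` row and the empty-value row, which are the two entries a reviewer would want to check by hand.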
