# Data Integration With `bdi-kit`

First, import the class `APIManager`.

In [1]:
from bdikit import APIManager

## Dataset Loading

This example maps data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/) to the GDC (Genomic Data Commons) format.

In [2]:
manager = APIManager()

In [3]:
dataset_path = './datasets/dou.csv'
dataset = manager.load_dataset(dataset_path)
dataset

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,Ukraine,FIGO grade 3,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage I,IA,29.40,75.0,,,Female,"Other, specify",Unifocal,4.2
100,Ukraine,FIGO grade 2,Endometrioid,pT2 (FIGO II),pN0,cM0,Staging Incomplete,Stage II,II,35.42,74.0,,,Female,"Other, specify",Unifocal,1.5
101,United States,,Serous,pT2 (FIGO II),pN0,Staging Incomplete,Staging Incomplete,Stage II,II,24.32,85.0,Black or African American,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.8
102,Ukraine,,Serous,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,34.06,70.0,,,Female,"Other, specify",Unifocal,5.0
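
The preview shows several empty cells (row 3, for example, only has a histologic type). A quick way to see how sparse each column is before mapping can be sketched with the standard library alone. The `SAMPLE` text is an excerpt of rows 0, 3 and 99 restricted to four columns, and `count_missing` is a hypothetical helper, not part of `bdi-kit`.

```python
import csv
import io

# Excerpt of rows 0, 3 and 99 from the preview above, restricted to four columns.
SAMPLE = """Country,Histologic_type,Race,Tumor_Size_cm
United States,Endometrioid,White,2.9
,Carcinosarcoma,,
Ukraine,Endometrioid,,4.2
"""

def count_missing(csv_text):
    """Return {column: number of empty cells} for a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # DictReader yields None for cells missing from a short row.
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

missing_counts = count_missing(SAMPLE)
for column, n in missing_counts.items():
    print(f"{column}: {n} missing")
```

Columns with many empty cells give the matchers fewer values to work with, so they may deserve a closer look after mapping.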


## Reducing the GDC Scope

Since the GDC contains more than 700 attributes, the first step is to select, for each column in the Dou et al. schema, the subset of GDC attributes most likely to match it (the top-k candidates). The candidates for each column can be explored with the `ScopeReducerExplorer`.

In [4]:
manager.reduce_scope()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.

Table features extracted from 17 columns

100%|██████████| 734/734 [01:21<00:00,  9.05it/s]

Table features extracted from 734 columns
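
To make the idea of top-k candidates concrete, here is a minimal sketch that ranks a handful of GDC attribute names against one source column. It is an illustration only: it scores column names with `difflib.SequenceMatcher`, whereas the log above suggests `bdi-kit` extracts features with a RoBERTa model. The function `top_k_candidates` and the short attribute list are assumptions made for this example.

```python
from difflib import SequenceMatcher

# A few GDC attribute names that appear elsewhere in this notebook.
GDC_SAMPLE_ATTRIBUTES = [
    "country_of_birth",
    "country_of_residence_at_enrollment",
    "tumor_grade",
    "figo_stage",
    "ajcc_pathologic_n",
    "primary_diagnosis",
    "bmi",
]

def top_k_candidates(source_column, target_columns, k=3):
    """Rank target column names by string similarity to source_column."""
    def score(target):
        return SequenceMatcher(None, source_column.lower(), target.lower()).ratio()

    ranked = sorted(target_columns, key=score, reverse=True)
    return [(target, round(score(target), 2)) for target in ranked[:k]]

candidates = top_k_candidates("FIGO_stage", GDC_SAMPLE_ATTRIBUTES, k=3)
for name, similarity in candidates:
    print(f"{name}: {similarity}")
```

Name similarity alone would miss pairs such as `Histologic_type` and `primary_diagnosis`, which is one reason a learned representation of column names and values is a more plausible basis for this step.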


## Column Mapping

Perform column mapping. By default, it uses the [similarity flooding algorithm](https://ieeexplore.ieee.org/document/994702).

In [5]:
column_mappings = manager.map_columns()

Unnamed: 0,Original Column,Target Column
0,Country,country_of_birth
1,Histologic_Grade_FIGO,histologic_progression_type
2,Histologic_type,dysplasia_type
3,Path_Stage_Primary_Tumor-pT,uicc_clinical_m
4,Path_Stage_Reg_Lymph_Nodes-pN,figo_stage
5,Clin_Stage_Dist_Mets-cM,inrg_stage
6,Path_Stage_Dist_Mets-pM,last_known_disease_status
7,tumor_Stage-Pathological,tumor_grade_category
8,FIGO_stage,figo_stage
9,BMI,age_at_index


Users can change the algorithm used for column mapping. We provide a GPT-based algorithm (`GPTAlgorithm`). To use it, set an environment variable with your OpenAI key (`export OPENAI_API_KEY='your-api-key-here'`).
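
Before selecting the GPT-based algorithm, it can help to confirm that the key is visible to the Python process. The check below is a small sketch; `has_openai_key` is a hypothetical helper and only inspects the environment, it does not validate the key with OpenAI.

```python
import os

def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before choosing the GPT-based algorithm.")
```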

In [7]:
column_mappings = manager.map_columns(algorithm='GPTAlgorithm')

Unnamed: 0,Original Column,Target Column
0,Country,country_of_residence_at_enrollment
1,Histologic_Grade_FIGO,tumor_grade
2,Histologic_type,sample_type
3,Path_Stage_Primary_Tumor-pT,figo_stage
4,Path_Stage_Reg_Lymph_Nodes-pN,ajcc_pathologic_n
5,Clin_Stage_Dist_Mets-cM,ajcc_clinical_m
6,Path_Stage_Dist_Mets-pM,ajcc_pathologic_m
7,tumor_Stage-Pathological,ajcc_pathologic_stage
8,FIGO_stage,figo_stage
9,BMI,bmi
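
Since the two algorithms return different targets for most columns, listing the disagreements is a simple way to decide where to look first. The sketch below copies four rows from each output above into plain dictionaries; `diff_mappings` is a hypothetical helper and does not depend on `bdi-kit`.

```python
# Four rows from the default (similarity flooding) output above.
default_mapping = {
    "Country": "country_of_birth",
    "Histologic_Grade_FIGO": "histologic_progression_type",
    "FIGO_stage": "figo_stage",
    "BMI": "age_at_index",
}

# The same four source columns from the GPTAlgorithm output above.
gpt_mapping = {
    "Country": "country_of_residence_at_enrollment",
    "Histologic_Grade_FIGO": "tumor_grade",
    "FIGO_stage": "figo_stage",
    "BMI": "bmi",
}

def diff_mappings(first, second):
    """Return {column: (first_target, second_target)} where the mappings disagree."""
    return {
        column: (first[column], second.get(column))
        for column in first
        if first[column] != second.get(column)
    }

disagreements = diff_mappings(default_mapping, gpt_mapping)
for column, (default_target, gpt_target) in disagreements.items():
    print(f"{column}: {default_target} vs {gpt_target}")
```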


Users can update column mappings through the `update_column_mappings` method.

In [9]:
manager.update_column_mappings([('Histologic_type', 'primary_diagnosis'), ('Path_Stage_Primary_Tumor-pT', 'ajcc_pathologic_t')])

Column mapping updated!


Unnamed: 0,Original Column,Target Column
0,Country,country_of_residence_at_enrollment
1,Histologic_Grade_FIGO,tumor_grade
2,Histologic_type,primary_diagnosis
3,Path_Stage_Primary_Tumor-pT,ajcc_pathologic_t
4,Path_Stage_Reg_Lymph_Nodes-pN,ajcc_pathologic_n
5,Clin_Stage_Dist_Mets-cM,ajcc_clinical_m
6,Path_Stage_Dist_Mets-pM,ajcc_pathologic_m
7,tumor_Stage-Pathological,ajcc_pathologic_stage
8,FIGO_stage,figo_stage
9,BMI,bmi
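
The effect of the update can be pictured as overriding entries in a source-to-target dictionary. The following sketch reproduces the change shown above on three of the rows; `apply_overrides` is a hypothetical stand-in written for illustration and says nothing about how `update_column_mappings` is implemented.

```python
def apply_overrides(mappings, overrides):
    """Return a copy of mappings with (source, target) override pairs applied."""
    updated = dict(mappings)
    for source, target in overrides:
        if source not in updated:
            raise KeyError(f"unknown source column: {source}")
        updated[source] = target
    return updated

# Three rows from the GPTAlgorithm output above.
gpt_mappings = {
    "Histologic_type": "sample_type",
    "Path_Stage_Primary_Tumor-pT": "figo_stage",
    "BMI": "bmi",
}

# The same pairs passed to update_column_mappings above.
overrides = [
    ("Histologic_type", "primary_diagnosis"),
    ("Path_Stage_Primary_Tumor-pT", "ajcc_pathologic_t"),
]

updated = apply_overrides(gpt_mappings, overrides)
for source, target in updated.items():
    print(f"{source} -> {target}")
```

Returning a copy keeps the original mapping available for comparison. Rejecting unknown source columns is a design choice of this sketch that catches typos in column names early.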


## Value Mapping

Perform value mapping. By default, it uses the edit distance algorithm. This example uses an LLM-based algorithm (`LLMAlgorithm`) instead.
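
For reference, edit-distance matching can be sketched in a few lines: compute the Levenshtein distance between a source value and each candidate, normalize it to a similarity, and keep the best candidate. This is a generic illustration, not `bdi-kit`'s implementation; the normalization (one minus distance over the longer length) and the names `levenshtein` and `best_match` are assumptions of this example.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def best_match(value, candidates):
    """Return (candidate, similarity), comparing case-insensitively."""
    def similarity(candidate):
        longest = max(len(value), len(candidate)) or 1
        return 1 - levenshtein(value.lower(), candidate.lower()) / longest

    best = max(candidates, key=similarity)
    return best, round(similarity(best), 2)

match = best_match("White", ["white", "asian", "not reported"])
print(match)
```

Edit distance works well when source and target differ only in case or a few characters, but it has little to go on for pairs such as `No pathologic evidence of distant metastasis` and `M0`, where the link is semantic rather than textual.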

In [10]:
value_mappings = manager.map_values('LLMAlgorithm')


Column Histologic_Grade_FIGO:


Unnamed: 0,Current Value,Target Value,Similarity
0,FIGO grade 1,G1,1.0
1,FIGO grade 2,G2,1.0
2,,Not Reported,1.0
3,FIGO grade 3,G3,1.0



Column Histologic_type:


Unnamed: 0,Current Value,Target Value,Similarity
0,Endometrioid,"Endometrioid adenocarcinoma, NOS",1.0
1,Carcinosarcoma,"Carcinosarcoma, NOS",1.0
2,Serous,"Serous cystadenocarcinoma, NOS",0.6
3,Clear cell,Clear cell carcinoma,1.0



Column Path_Stage_Primary_Tumor-pT:


Unnamed: 0,Current Value,Target Value,Similarity
0,pT1a (FIGO IA),T1a,1.0
1,,Unknown,1.0
2,pT3a (FIGO IIIA),T3a,1.0
3,pT1 (FIGO I),T1,1.0
4,pT1b (FIGO IB),T1b,1.0
5,pT2 (FIGO II),T2,1.0
6,pT3b (FIGO IIIB),T3b,1.0



Column Path_Stage_Reg_Lymph_Nodes-pN:


Unnamed: 0,Current Value,Target Value,Similarity
0,pN0,N0,1.0
1,pNX,NX,1.0
2,,Unknown,1.0
3,pN2 (FIGO IIIC2),N2,1.0
4,pN1 (FIGO IIIC1),N1,1.0



Column Clin_Stage_Dist_Mets-cM:


Unnamed: 0,Current Value,Target Value,Similarity
0,cM0,M0,0.9
1,,Unknown,1.0
2,Staging Incomplete,Unknown,0.9
3,cM1,M1,1.0



Column Path_Stage_Dist_Mets-pM:


Unnamed: 0,Current Value,Target Value,Similarity
0,Staging Incomplete,Unknown,0.9
1,,Not Reported,1.0
2,No pathologic evidence of distant metastasis,M0,1.0
3,pM1,M1,1.0



Column tumor_Stage-Pathological:


Unnamed: 0,Current Value,Target Value,Similarity
0,Stage I,Stage I,1.0
1,Stage IV,Stage IV,1.0
2,,Unknown,1.0
3,Stage III,Stage III,1.0
4,Stage II,Stage II,1.0



Column FIGO_stage:


Unnamed: 0,Current Value,Target Value,Similarity
0,IA,Stage IA,1.0
1,,Not Reported,1.0
2,IIIA,Stage IIIA,1.0
3,IIIC2,Stage IIIC2,1.0
4,IB,Stage IB,1.0
5,II,Stage II,1.0
6,IIIC1,Stage IIIC1,1.0
7,IVB,Stage IVB,1.0
8,IIIB,Stage IIIB,1.0



Column Race:


Unnamed: 0,Current Value,Target Value,Similarity
0,White,white,1.0
1,,not reported,1.0
2,Asian,asian,1.0
3,Not Reported,not reported,1.0
4,Black or African American,black or african american,1.0



Column Tumor_Site:


Unnamed: 0,Current Value,Target Value,Similarity
0,Anterior endometrium,Corpus uteri,0.7
1,Posterior endometrium,Corpus uteri,0.85
2,"Other, specify",Unknown,1.0
3,,Not Applicable,1.0



Column Tumor_Focality:


Unnamed: 0,Current Value,Target Value,Similarity
0,Unifocal,Unifocal,1.0
1,,Unknown,1.0
2,Multifocal,Multifocal,1.0



Column Country:


Unnamed: 0,Current Value,Target Value,Similarity
0,United States,United States,1.0
1,Other_specify,Andorra,0.0
2,Ukraine,Ukraine,1.0
3,Poland,Poland,1.0
4,,-,-



Column Ethnicity:


Unnamed: 0,Current Value,Target Value,Similarity
0,Not-Hispanic or Latino,not hispanic or latino,1.0
1,,hispanic or latino,0
2,Hispanic or Latino,hispanic or latino,1.0
3,Not reported,-,-



Column Gender:


Unnamed: 0,Current Value,Target Value,Similarity
0,,DNA,0.1
1,Female,-,-
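
Several of the mappings above have a low similarity score (for example `Other_specify` to `Andorra` at 0.0, or an empty `Gender` value to `DNA` at 0.1) and are worth checking by hand. A minimal sketch of such a filter follows; the rows are copied from the output above, while the `ValueMapping` type, the `needs_review` helper and the 0.8 threshold are assumptions made for illustration.

```python
from typing import NamedTuple

class ValueMapping(NamedTuple):
    column: str
    current: str
    target: str
    similarity: float

# A few rows copied from the value mapping output above.
mappings = [
    ValueMapping("Race", "White", "white", 1.0),
    ValueMapping("Histologic_type", "Serous", "Serous cystadenocarcinoma, NOS", 0.6),
    ValueMapping("Tumor_Site", "Anterior endometrium", "Corpus uteri", 0.7),
    ValueMapping("Country", "Other_specify", "Andorra", 0.0),
    ValueMapping("Gender", "", "DNA", 0.1),
]

def needs_review(rows, threshold=0.8):
    """Return the mappings whose similarity falls below threshold."""
    return [row for row in rows if row.similarity < threshold]

flagged = needs_review(mappings)
for row in flagged:
    print(f"{row.column}: {row.current!r} -> {row.target!r} ({row.similarity})")
```

A high score does not guarantee a correct mapping, so a filter like this only narrows the review and does not replace it.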
