## Analyzing One Attribute/Column at a Time

In [1]:
import bdikit as bdi
import pandas as pd

In this example, we map data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/) to the GDC (Genomic Data Commons) format.

In [2]:
dataset = pd.read_csv('./datasets/dou.csv')
dataset.head()

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5


Let's work with the attribute/column `FIGO_stage`.

## 1. Using Multiple Functions

We can call the function `top_matches` to list the target columns that best match a given source column, ranked by similarity.

In [3]:
top_matches = bdi.top_matches(dataset, target='gdc', columns=['FIGO_stage'])
top_matches

Extracting features from 1 columns...


Table features loaded for 736 columns


Unnamed: 0,source,target,similarity
0,FIGO_stage,figo_stage,0.695458
1,FIGO_stage,ajcc_pathologic_stage,0.620362
2,FIGO_stage,ajcc_clinical_stage,0.616923
3,FIGO_stage,uicc_pathologic_stage,0.607253
4,FIGO_stage,uicc_clinical_stage,0.599319
5,FIGO_stage,irs_group,0.547192
6,FIGO_stage,inss_stage,0.505369
7,FIGO_stage,cog_liver_stage,0.465713
8,FIGO_stage,iss_stage,0.460099
9,FIGO_stage,masaoka_stage,0.395831
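
The similarity scores in this table can also be used to shortlist candidates before inspecting them by hand. The following is a minimal sketch in plain Python, not part of `bdikit`: the rows are copied by hand from the output above, the helpers `shortlist` and `margin` are hypothetical, and the 0.6 cutoff is an arbitrary assumption chosen only for illustration.

```python
# Rows copied by hand from the top_matches output above
# (the source column is FIGO_stage in every row).
candidates = [
    ("figo_stage", 0.695458),
    ("ajcc_pathologic_stage", 0.620362),
    ("ajcc_clinical_stage", 0.616923),
    ("uicc_pathologic_stage", 0.607253),
    ("uicc_clinical_stage", 0.599319),
    ("irs_group", 0.547192),
    ("inss_stage", 0.505369),
    ("cog_liver_stage", 0.465713),
    ("iss_stage", 0.460099),
    ("masaoka_stage", 0.395831),
]


def shortlist(candidates, min_similarity=0.6):
    """Hypothetical helper: keep candidates scoring at least min_similarity, best first."""
    kept = [c for c in candidates if c[1] >= min_similarity]
    return sorted(kept, key=lambda c: c[1], reverse=True)


def margin(candidates):
    """Hypothetical helper: gap between the best and second-best scores."""
    scores = sorted((score for _, score in candidates), reverse=True)
    return scores[0] - scores[1]


top = shortlist(candidates)
print([name for name, _ in top])
print(round(margin(candidates), 6))
```

A small margin between the first and second candidates is a hint that the top match deserves a closer look, which is what the value previews in the next cells provide.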


In the `top_matches` output, the candidate with the highest similarity for `FIGO_stage` is `figo_stage`. For more context about these two columns, let's look at their unique values with the function `preview_domain`.

In [4]:
bdi.preview_domain(dataset, 'FIGO_stage')

Unnamed: 0,value_name
0,IA
1,
2,IIIA
3,IIIC2
4,IB
5,II
6,IIIC1
7,IVB
8,IIIB


In [5]:
bdi.preview_domain('gdc', 'figo_stage')

Unnamed: 0,value_name,value_description,column_description
0,Stage 0,A FIGO stage term that applies to gynecologic ...,The extent of a cervical or endometrial cancer...
1,Stage I,A FIGO stage term that applies to gynecologic ...,
2,Stage IA,Invasive cancer confined to the original anato...,
3,Stage IA1,A FIGO stage term that applies to gynecologic ...,
4,Stage IA2,A FIGO stage term that applies to gynecologic ...,
5,Stage IB,A FIGO stage term that applies to gynecologic ...,
6,Stage IB1,A FIGO stage term that applies to gynecologic ...,
7,Stage IB2,A FIGO stage term that applies to gynecologic ...,
8,Stage IC,A FIGO stage term that applies to gynecologic ...,
9,Stage IC1,A FIGO stage term that applies to ovarian canc...,
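
Comparing the two previews suggests that the source values look like the target values without the `Stage ` prefix. A rough way to check that observation is sketched below in plain Python. The values are copied by hand from the previews, the helper `prefixed_hits` is hypothetical, the empty source value is represented as an empty string (in the DataFrame it may well be a missing value), and only the ten target values visible in the truncated preview are checked.

```python
# Values copied by hand from the two previews above.
source_values = ["IA", "", "IIIA", "IIIC2", "IB", "II", "IIIC1", "IVB", "IIIB"]

# Only the ten target values shown in the preview; the full target domain is longer.
target_preview = [
    "Stage 0", "Stage I", "Stage IA", "Stage IA1", "Stage IA2",
    "Stage IB", "Stage IB1", "Stage IB2", "Stage IC", "Stage IC1",
]


def prefixed_hits(source_values, target_values, prefix="Stage "):
    """Hypothetical helper: pair each non-empty source value with prefix + value
    when that exact string occurs among the target values."""
    targets = set(target_values)
    return {v: prefix + v for v in source_values if v and prefix + v in targets}


hits = prefixed_hits(source_values, target_preview)
print(hits)  # {'IA': 'Stage IA', 'IB': 'Stage IB'}
```

Source values without a hit here are not necessarily unmatched; they may simply fall outside the part of the target domain that the preview displays.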


To match the values of `FIGO_stage` to the values of `figo_stage`, we can use the function `match_values()`.

In [7]:
value_mappings = bdi.match_values(
    dataset,
    target='gdc',
    column_mapping=('FIGO_stage', 'figo_stage'),
)
value_mappings

Unnamed: 0,source,target,similarity
0,IIIC2,Stage IIIC2,0.889
1,IIIC1,Stage IIIC1,0.889
2,IVB,Stage IVB,0.854
3,IIIB,Stage IIIB,0.849
4,IIIA,Stage IIIA,0.822
5,II,Stage III,0.687
6,IB,Stage IB,0.649
7,IA,Stage IA,0.586
8,,Unknown,0.35
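
One row in the output above deserves a second look: `II` is paired with `Stage III`, whereas a literal reading of the roman numeral would suggest `Stage II`. The sketch below shows one possible way to flag low-similarity rows for review and to apply a manual correction. It is plain Python rather than a `bdikit` feature: the rows are copied by hand from the output (in the shape that pandas' `DataFrame.to_dict('records')` would produce), the helpers `needs_review` and `build_lookup` are hypothetical, the 0.7 cutoff is arbitrary, and it assumes that `Stage II` exists in the GDC `figo_stage` domain, which the truncated preview does not show.

```python
# Rows copied by hand from the match_values output above.
value_mappings_rows = [
    {"source": "IIIC2", "target": "Stage IIIC2", "similarity": 0.889},
    {"source": "IIIC1", "target": "Stage IIIC1", "similarity": 0.889},
    {"source": "IVB", "target": "Stage IVB", "similarity": 0.854},
    {"source": "IIIB", "target": "Stage IIIB", "similarity": 0.849},
    {"source": "IIIA", "target": "Stage IIIA", "similarity": 0.822},
    {"source": "II", "target": "Stage III", "similarity": 0.687},
    {"source": "IB", "target": "Stage IB", "similarity": 0.649},
    {"source": "IA", "target": "Stage IA", "similarity": 0.586},
    {"source": "", "target": "Unknown", "similarity": 0.35},
]


def needs_review(rows, min_similarity=0.7):
    """Hypothetical helper: source values whose match scored below min_similarity."""
    return [r["source"] for r in rows if r["similarity"] < min_similarity]


def build_lookup(rows, overrides=None):
    """Hypothetical helper: source -> target dict, with manual overrides applied last."""
    lookup = {r["source"]: r["target"] for r in rows}
    lookup.update(overrides or {})
    return lookup


flagged = needs_review(value_mappings_rows)
print(flagged)

# Assumption: "Stage II" is a valid value of the GDC figo_stage column.
lookup = build_lookup(value_mappings_rows, overrides={"II": "Stage II"})
harmonized = [lookup[v] for v in ["IA", "II", "IVB", ""]]
print(harmonized)
```

Note that the correct matches `IB` and `IA` also fall below the cutoff, so a similarity threshold alone only indicates where to look; the final decision still needs a human check against the target domain.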


## 2. Using A Single Function

We can wrap the previous function calls in a single helper function:

In [10]:
from IPython.display import display

def map_columns_values(dataset, target, columns, match_index=0):
    # Rank candidate target columns for the requested source columns
    top_matches = bdi.top_matches(dataset, target=target, columns=columns)
    print('Top matches:')
    display(top_matches)
    # Keep the match at position match_index as a (source, target) tuple
    selected_match = top_matches.iloc[[match_index]]
    column_mapping = selected_match.drop(columns=['similarity']).iloc[0]
    column_mapping = tuple(column_mapping)
    # Preview the unique values on both sides of the selected mapping
    preview_domain_source = bdi.preview_domain(dataset, column_mapping[0])
    preview_domain_target = bdi.preview_domain(target, column_mapping[1])
    
    print(f'Preview domain {column_mapping[0]} (source):')
    display(preview_domain_source)
    
    print(f'Preview domain {column_mapping[1]} (target):')
    display(preview_domain_target)
    
    print(f'Value mappings {column_mapping}:')
    value_mappings = bdi.match_values(dataset, target=target, column_mapping=column_mapping)
    display(value_mappings)
    
    return column_mapping, value_mappings

In [11]:
column_mapping, value_mappings = map_columns_values(dataset, 'gdc', ['FIGO_stage'])

Top matches:


Unnamed: 0,source,target,similarity
0,FIGO_stage,figo_stage,0.695458
1,FIGO_stage,ajcc_pathologic_stage,0.620362
2,FIGO_stage,ajcc_clinical_stage,0.616923
3,FIGO_stage,uicc_pathologic_stage,0.607253
4,FIGO_stage,uicc_clinical_stage,0.599319
5,FIGO_stage,irs_group,0.547192
6,FIGO_stage,inss_stage,0.505369
7,FIGO_stage,cog_liver_stage,0.465713
8,FIGO_stage,iss_stage,0.460099
9,FIGO_stage,masaoka_stage,0.395831


Preview domain FIGO_stage (source):


Unnamed: 0,value_name
0,IA
1,
2,IIIA
3,IIIC2
4,IB
5,II
6,IIIC1
7,IVB
8,IIIB


Preview domain figo_stage (target):


Unnamed: 0,value_name,value_description,column_description
0,Stage 0,A FIGO stage term that applies to gynecologic ...,The extent of a cervical or endometrial cancer...
1,Stage I,A FIGO stage term that applies to gynecologic ...,
2,Stage IA,Invasive cancer confined to the original anato...,
3,Stage IA1,A FIGO stage term that applies to gynecologic ...,
4,Stage IA2,A FIGO stage term that applies to gynecologic ...,
5,Stage IB,A FIGO stage term that applies to gynecologic ...,
6,Stage IB1,A FIGO stage term that applies to gynecologic ...,
7,Stage IB2,A FIGO stage term that applies to gynecologic ...,
8,Stage IC,A FIGO stage term that applies to gynecologic ...,
9,Stage IC1,A FIGO stage term that applies to ovarian canc...,


Value mappings ('FIGO_stage', 'figo_stage'):


Unnamed: 0,source,target,similarity
0,IIIC2,Stage IIIC2,0.889
1,IIIC1,Stage IIIC1,0.889
2,IVB,Stage IVB,0.854
3,IIIB,Stage IIIB,0.849
4,IIIA,Stage IIIA,0.822
5,II,Stage III,0.687
6,IB,Stage IB,0.649
7,IA,Stage IA,0.586
8,,Unknown,0.35
