## Changing Parameters of the Matching Methods

In [1]:
import bdikit as bdi
import pandas as pd

In this example, we are mapping data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/) to the GDC format.

In [2]:
dataset = pd.read_csv('./datasets/dou.csv')
dataset.head()

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5


Let's see the domain of the `Tumor_Site` column:

In [3]:
bdi.preview_domain(dataset, "Tumor_Site")

Unnamed: 0,value_name
0,Anterior endometrium
1,Posterior endometrium
2,"Other, specify"
3,


The GDC attribute that matches this column is `tissue_or_organ_of_origin`. It contains 333 unique values:

In [4]:
bdi.preview_domain("gdc", "tissue_or_organ_of_origin")

Unnamed: 0,value_name,value_description,column_description
0,"Abdomen, NOS",The portion of the body that lies between the ...,The text term used to describe the anatomic si...
1,Abdominal esophagus,Clinical esophageal segment composed of smooth...,
2,"Accessory sinus, NOS",Any one of the air-filled spaces within the et...,
3,Acoustic nerve,The cochlear portion of cranial nerve VIII (th...,
4,"Adrenal gland, NOS","A flattened, roughly triangular body resting u...",
...,...,...,...
328,Vestibule of mouth,The area inside the mouth between the cheek or...,
329,"Vulva, NOS","The external, visible part of the female genit...",
330,Waldeyer ring,The ring of lymphoid tissue located in the pha...,
331,Unknown,"Not known, not observed, not recorded, or refu...",


We can match the values of the two columns using the `embedding` method. By default, this method uses a BERT language model.

In [5]:
value_mappings = bdi.match_values(
        dataset,
        column_mapping=('Tumor_Site', 'tissue_or_organ_of_origin'),
        target='gdc',
        method='embedding'
    )
value_mappings

Unnamed: 0,source,target,similarity
0,Posterior endometrium,Posterior wall of hypopharynx,0.792
1,Anterior endometrium,Anterior mediastinum,0.775
2,"Other, specify","Base of tongue, NOS",0.612
3,,Cecum,0.579
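To make the idea behind embedding-based matching concrete, the toy sketch below ranks candidate target values by cosine similarity. It is not bdikit's implementation: `embed`, `cosine_similarity`, and `best_match` are hypothetical helpers, and `embed` simply counts character trigrams as a stand-in for the vectors a neural language model would produce.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Hypothetical stand-in for a language model: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(source_value, target_values):
    """Return the (target, similarity) pair with the highest similarity."""
    source_vec = embed(source_value)
    scored = [(t, cosine_similarity(source_vec, embed(t))) for t in target_values]
    return max(scored, key=lambda pair: pair[1])


# A few target values taken from the outputs in this notebook.
targets = ["Endometrium", "Anterior mediastinum", "Cecum"]
match, score = best_match("Anterior endometrium", targets)
print(match, round(score, 3))
```

The overall shape is the same whatever produces the vectors: embed every value, score each source value against all targets, and keep the highest-scoring target. Changing the embedding model changes the vectors and therefore the ranking.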


We can also pass additional arguments to the matching algorithm. For instance, to improve the results with a BioBERT model, we only need to set the `model_name` parameter through `method_args`:

In [6]:
value_mappings = bdi.match_values(
        dataset,
        column_mapping=('Tumor_Site', 'tissue_or_organ_of_origin'),
        target='gdc',
        method='embedding',
        method_args={'model_name': 'pritamdeka/BioBert-PubMed200kRCT'}
    )
value_mappings

Unnamed: 0,source,target,similarity
0,Anterior endometrium,Endometrium,0.923
1,Posterior endometrium,Endometrium,0.915
2,,"Nasopharynx, NOS",0.892
3,"Other, specify","Palate, NOS",0.86
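In the output above, the missing value and the placeholder "Other, specify" still receive fairly high similarity scores, so a post-processing step may be worth considering. The sketch below is one possible approach, not part of bdikit: it works on rows copied by hand from that output, and `PLACEHOLDERS`, `MIN_SIMILARITY`, and `keep` are hypothetical names chosen for illustration.

```python
# Rows copied from the output above; None stands for the missing source value.
rows = [
    {"source": "Anterior endometrium", "target": "Endometrium", "similarity": 0.923},
    {"source": "Posterior endometrium", "target": "Endometrium", "similarity": 0.915},
    {"source": None, "target": "Nasopharynx, NOS", "similarity": 0.892},
    {"source": "Other, specify", "target": "Palate, NOS", "similarity": 0.86},
]

# Assumptions for illustration only: neither value comes from bdikit.
PLACEHOLDERS = {"Other, specify"}
MIN_SIMILARITY = 0.9


def keep(row):
    """Keep a match only if its source is a real value and the score is high."""
    source = row["source"]
    if source is None or source in PLACEHOLDERS:
        return False
    return row["similarity"] >= MIN_SIMILARITY


kept = [row for row in rows if keep(row)]
for row in kept:
    print(row["source"], "->", row["target"], row["similarity"])
```

The threshold is arbitrary and would need tuning per dataset. With the DataFrame returned by `match_values`, the same conditions could likely be written as a boolean mask over the `source` and `similarity` columns shown in the output.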


We can also set the parameters for the `match_schema` method.