# Data Integration with BDI

First, import the `APIManager` class.

In [1]:
from bdikit import APIManager

Set the `BDIKIT_MODEL_PATH` environment variable to the path of the pre-trained model used for mapping recommendations.

In [2]:
import os
os.environ['BDIKIT_MODEL_PATH'] = '/Users/rlopez/Downloads/model_20_1.pt'  # replace with the path to your copy of the model
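
A mistyped path may only surface later, when the model is first loaded. The sketch below checks the file before exporting the variable. The helper name `set_model_path` is hypothetical and not part of `bdikit`; only the variable name `BDIKIT_MODEL_PATH` comes from the cell above.

```python
import os
from pathlib import Path


def set_model_path(path: str, env_var: str = "BDIKIT_MODEL_PATH") -> Path:
    """Expand and check `path`, then export it through `env_var`.

    Hypothetical convenience helper; bdikit itself only reads the variable.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        # Fail early with a clear message instead of at model-loading time.
        raise FileNotFoundError(f"Model checkpoint not found: {resolved}")
    os.environ[env_var] = str(resolved)
    return resolved


# Example call (the location is an assumption; use your own):
# set_model_path("~/Downloads/model_20_1.pt")
```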

## Dataset Loading

In this example, we map data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/) to the Genomic Data Commons (GDC) format.

In [3]:
manager = APIManager()

In [4]:
dataset_path = './datasets/dou.csv'
dataset = manager.load_dataset(dataset_path)
dataset

Unnamed: 0,Country,Histologic_Grade_FIGO,Histologic_type,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,Clin_Stage_Dist_Mets-cM,Path_Stage_Dist_Mets-pM,tumor_Stage-Pathological,FIGO_stage,BMI,Age,Race,Ethnicity,Gender,Tumor_Site,Tumor_Focality,Tumor_Size_cm
0,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,38.88,64.0,White,Not-Hispanic or Latino,Female,Anterior endometrium,Unifocal,2.9
1,United States,FIGO grade 1,Endometrioid,pT1a (FIGO IA),pNX,cM0,Staging Incomplete,Stage IV,IA,39.76,58.0,White,Not-Hispanic or Latino,Female,Posterior endometrium,Unifocal,3.5
2,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pN0,cM0,Staging Incomplete,Stage I,IA,51.19,50.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,4.5
3,,,Carcinosarcoma,,,,,,,,,,,,,,
4,United States,FIGO grade 2,Endometrioid,pT1a (FIGO IA),pNX,cM0,No pathologic evidence of distant metastasis,Stage I,IA,32.69,75.0,White,Not-Hispanic or Latino,Female,"Other, specify",Unifocal,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148,,,,,,,,,,,,,,,,,
149,,,,,,,,,,,,,,,,,
150,,,,,,,,,,,,,,,,,
151,,,,,,,,,,,,,,,,,
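
The preview shows several rows with no values at all (148 to 151) and one that is almost empty (row 3). The sketch below counts empty rows and missing values per column using only the standard library. The `preview` text is a trimmed copy of a few rows from the output above, and the column name `index` is a stand-in for the unnamed first column.

```python
import csv
import io

# A few rows copied from the preview above (columns trimmed for brevity).
preview = """\
index,Country,Histologic_type,BMI
0,United States,Endometrioid,38.88
3,,Carcinosarcoma,
148,,,
149,,,
"""

rows = list(csv.DictReader(io.StringIO(preview)))
data_columns = [name for name in rows[0] if name != "index"]


def is_empty(row):
    """True when every data column of the row is blank."""
    return all(row[name] == "" for name in data_columns)


# Row labels whose data columns are all blank.
empty_ids = [row["index"] for row in rows if is_empty(row)]

# Number of blank cells in each data column.
missing_per_column = {
    name: sum(row[name] == "" for row in rows) for name in data_columns
}

print(empty_ids)
print(missing_per_column)
```

If `dataset` turns out to be a pandas `DataFrame` (an assumption; the source does not state the return type of `load_dataset`), `dataset.dropna(how="all")` would drop the fully empty rows in one call.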


## Reducing the GDC Scope

Since the GDC contains more than 700 attributes, the first step is to narrow them down to a subset that is likely to match the attributes in the Dou et al. schema: the top-k candidates for each column.
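
To make the top-k idea concrete, here is a minimal sketch that ranks target attributes by cosine similarity to a source column and keeps the best `k`. It assumes each column is represented by an embedding vector and that similarity is a cosine score; the source does not confirm how `reduce_scope` computes its scores, and the toy vectors below are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(source_vec, target_vecs, k):
    """Return the k (name, score) pairs with the highest similarity to source_vec."""
    scored = [(name, cosine(source_vec, vec)) for name, vec in target_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
targets = {
    "gender": [1.0, 0.0, 0.0],
    "race": [0.0, 1.0, 0.0],
    "ethnicity": [0.0, 0.8, 0.6],
}
source = [0.0, 0.6, 0.8]  # stand-in for the source column "Ethnicity"

candidates = top_k(source, targets, k=2)
for name, score in candidates:
    print(f"{name}: {score:.4f}")
```

Keeping only `k` candidates per column means later steps compare each source column against a handful of attributes rather than the full GDC schema.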

In [5]:
reduced_scope = manager.reduce_scope()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
100%|██████████| 17/17 [00:00<0

Table features extracted from 17 columns

100%|██████████| 734/734 [01:31<00:00,  7.99it/s]

Table features extracted from 734 columns

Country:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,country_of_birth,0.5726,The name of the country in which the patient is born.,"Afghanistan, Albania, Algeria, Andorra, Angola, Anguilla, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bah..."
1,country_of_residence_at_enrollment,0.5151,The text term used to describe the patient's country of residence at the time they were enrolled in the study.,"Afghanistan, Albania, Algeria, Andorra, Angola, Anguilla, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bah..."
2,variant_origin,0.3803,The text term used to describe the biological origin of a specific genetic variant.,"Germline, Somatic, Unknown"
3,zone_of_origin_prostate,0.3563,The location or position of the tumor by zone of the prostate.,"Central zone, Overlapping/multiple zones, Peripheral zone, Transition zone, Unknown zone"
4,tumor_confined_to_organ_of_origin,0.3322,The yes/no/unknown indicator used to describe whether the tumor is confined to the organ where it originated and did not spread to a proximal or dista...,"Yes, No, Unknown, Not Reported"
5,race,0.2936,An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within ...,"american indian or alaska native, asian, black or african american, native hawaiian or other pacific islander, white, other, Unknown, unknown, not rep..."
6,vascular_invasion_present,0.291,The yes/no indicator to ask if large vessel or venous invasion was detected by surgery or presence in a tumor specimen.,"Yes, No, Unknown, Not Reported"
7,lymphatic_invasion_present,0.287,"A yes/no indicator to ask if small or thin-walled vessel invasion is present, indicating lymphatic involvement","Yes, No, Unknown, Not Reported"
8,ethnicity,0.2618,"An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provid...","hispanic or latino, not hispanic or latino, Unknown, unknown, not reported, not allowed to collect"
9,perineural_invasion_present,0.2578,a yes/no indicator to ask if perineural invasion or infiltration of tumor or cancer is present.,"Yes, No, Unknown, Not Reported"



Histologic_Grade_FIGO:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,histologic_progression_type,0.6556,Text term to describe the disease progression as determined by microscopic review of cells and their surrounding extracellular environment in tissues.,"Anaplastic, Poorly differentiated, Unknown, Not Reported"
1,who_nte_grade,0.5967,The WHO (World Health Organization) grading classification of Neuroendocrine Tumors.,"G1, G2, G3, GX, Unknown, Not Reported"
2,tumor_grade,0.5817,"Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.","G1, G2, G3, G4, GB, GX, High Grade, Intermediate Grade, Low Grade, Unknown, Not Reported"
3,tumor_grade_category,0.5759,Describes the number of levels or 'tiers' in the system used to determine the degree of tumor differentiation.,"Four Tier, Three Tier, Not Reported"
4,inpc_grade,0.5104,"Text term used to describe the classification of neuroblastic differentiation within neuroblastoma tumors, as defined by the International Neuroblasto...","Differentiating, Poorly Differentiated, Undifferentiated, Undifferentiated or Poorly Differentiated, Unknown, Not Reported"
5,igcccg_stage,0.4971,"The text term used to describe the International Germ Cell Cancer Collaborative Group (IGCCCG), a grouping used to further classify metastatic testicu...","Good Prognosis, Intermediate Prognosis, Poor Prognosis, Unknown, Not Reported"
6,who_cns_grade,0.495,"The WHO (World Health Organization) grading classification of CNS tumors, which is based on histological characteristics such as cellularity, mitotic ...","Grade I, Grade II, Grade III, Grade IV, Grade Not Assigned, Unknown, Not Reported"
7,risk_factor_method_of_diagnosis,0.4742,The clinical or laboratory procedure(s) used in the determination of a diagnosis described in this context as a risk factor.,"Biochemical Assessment, Both Clinical and Biochemical Assessments, Clinical Assessment, Not Reported"
8,enneking_msts_grade,0.4695,"The text term used to describe the surgical grade of the musculoskeletal sarcoma, using the Enneking staging system approved by the Musculoskeletal Tu...","High Grade (G2), Low Grade (G1), Unknown, Not Reported"
9,adverse_event_grade,0.4679,"Numeric representation of the intensity/severity of an unfavorable and unintended sign (including an abnormal laboratory finding), symptom, syndrome, ...","Grade 1, Grade 2, Grade 3, Grade 4, Grade 5"



Histologic_type:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,history_of_tumor_type,0.6765,Describes the type of the patient's prior diagnosed tumor.,"Colorectal Cancer, Lower Grade Glioma, Phenochromocytoma or Paraganglioma"
1,roots,0.6562,,
2,percent_sarcomatoid_features,0.5852,Numeric value that represents the percentage of sarcomatoid features found in a specific tissue sample.,
3,additional_pathology_findings,0.5574,A section header that includes additional pathologic findings.,"Adenomyosis, Asbestos bodies, Atrophic endometrium, Atypical hyperplasia/Endometrial intraepithelial neoplasia (EIN), Autoimmune atrophic chronic gast..."
4,relationship_primary_diagnosis,0.5291,The text term used to describe the malignant diagnosis of the patient's relative with a history of cancer.,"Adrenal Gland Cancer, Basal Cell Cancer, Bile Duct Cancer, Bladder Cancer, Blood Cancer, Bone Cancer, Brain Cancer, Breast Cancer, Cancer, Cervical Ca..."
5,primary_diagnosis,0.525,"Text term used to describe the patient's histologic diagnosis, as described by the World Health Organization's (WHO) International Classification of D...","Abdominal desmoid, Abdominal fibromatosis, Achromic nevus, Acidophil adenocarcinoma, Acidophil adenoma, Acidophil carcinoma, Acinar adenocarcinoma, Ac..."
6,described_cases,0.5208,,
7,supratentorial_localization,0.5184,Text term to specify the location of the supratentorial tumor.,"Cerebral Cortex, Deep Gray (e.g. Basal Ganglia, Thalamus), Frontal lobe, Occipital lobe, Parietal lobe, Spinal Cord, Temporal lobe, White Matter, Unkn..."
8,dysplasia_type,0.5049,The type of dysplasia involved.,"Epithelial, Esophageal Columnar Dysplasia, Esophageal Mucosa Columnar Dysplasia, Keratinizing, Nonkeratinizing, Other, Unknown, Not Reported"
9,disease_type,0.4981,"The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) International Classification of ...","Acinar Cell Neoplasms, Adenomas and Adenocarcinomas, Adnexal and Skin Appendage Neoplasms, Basal Cell Neoplasms, Blood Vessel Tumors, Chronic Myelopro..."



Path_Stage_Primary_Tumor-pT:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,uicc_clinical_stage,0.7404,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
1,ajcc_clinical_stage,0.6784,"Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis f...","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IB Cervix, Stage IC, Stage II,..."
2,uicc_pathologic_stage,0.6754,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
3,figo_stage,0.673,"The extent of a cervical or endometrial cancer within the body, especially whether the disease has spread from the original site to other parts of the...","Stage 0, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage IC1, Stage IC2, Stage IC3, Stage II, Stage IIA, Stag..."
4,ajcc_pathologic_stage,0.6702,"The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body based on AJCC staging criteria.","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
5,inss_stage,0.6422,"Text term used to describe the staging classification of neuroblastic tumors, as defined by the International Neuroblastoma Staging System (INSS).","Stage 1, Stage 2A, Stage 2B, Stage 3, Stage 4, Stage 4S, Unknown, Not Reported"
6,ensat_pathologic_stage,0.598,An adrenal cancer stage defined according to the European Network for the Study of Adrenal Tumors (ENSAT) criteria.,"Stage I, Stage II, Stage III, Stage IV"
7,masaoka_stage,0.5898,"The text term used to describe the Masaoka staging system, a classification that defines prognostic indicators for thymic malignancies and predicts tu...","Stage I, Stage IIa, Stage IIb, Stage III, Stage IVa, Stage IVb"
8,iss_stage,0.5569,The multiple myeloma disease stage at diagnosis.,"I, II, III, Unknown, Not Reported"
9,ann_arbor_clinical_stage,0.5407,"The text term used to describe the clinical classification of lymphoma, as defined by the Ann Arbor Lymphoma Staging System.","Stage I, Stage II, Stage III, Stage IV, Unknown, Not Reported"



Path_Stage_Reg_Lymph_Nodes-pN:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,figo_stage,0.6017,"The extent of a cervical or endometrial cancer within the body, especially whether the disease has spread from the original site to other parts of the...","Stage 0, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage IC1, Stage IC2, Stage IC3, Stage II, Stage IIA, Stag..."
1,uicc_clinical_stage,0.5684,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
2,inss_stage,0.5541,"Text term used to describe the staging classification of neuroblastic tumors, as defined by the International Neuroblastoma Staging System (INSS).","Stage 1, Stage 2A, Stage 2B, Stage 3, Stage 4, Stage 4S, Unknown, Not Reported"
3,uicc_pathologic_stage,0.5505,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
4,ensat_pathologic_stage,0.5397,An adrenal cancer stage defined according to the European Network for the Study of Adrenal Tumors (ENSAT) criteria.,"Stage I, Stage II, Stage III, Stage IV"
5,ajcc_pathologic_stage,0.539,"The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body based on AJCC staging criteria.","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
6,ajcc_clinical_stage,0.5372,"Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis f...","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IB Cervix, Stage IC, Stage II,..."
7,uicc_pathologic_n,0.5264,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"N0, N0 (i+), N0 (i-), N0 (mol+), N0 (mol-), N1, N1a, N1b, N1bI, N1bII, N1bIII, N1bIV, N1c, N1mi, N2, N2a, N2b, N2c, N2mi, N3, N3a, N3b, N3c, N4, NX, U..."
8,enneking_msts_stage,0.514,"Text term used to describe the stage of the musculoskeletal sarcoma, using the Enneking staging system approved by the Musculoskeletal Tumor Society (...","Stage IA, Stage IB, Stage IIA, Stage IIB, Stage III, Unknown, Not Reported"
9,uicc_clinical_n,0.5137,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"N0, N0 (i+), N0 (i-), N0 (mol+), N0 (mol-), N1, N1a, N1b, N1bI, N1bII, N1bIII, N1bIV, N1c, N1mi, N2, N2a, N2b, N2c, N3, N3a, N3b, N3c, N4, NX, Unknown..."



Clin_Stage_Dist_Mets-cM:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,uicc_clinical_m,0.7455,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"cM0 (i+), M0, M1, M1a, M1b, M1c, MX, Unknown, Not Reported"
1,ajcc_clinical_m,0.7344,Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.,"cM0 (i+), M0, M1, M1a, M1b, M1c, MX, Unknown, Not Reported, Not Allowed To Collect"
2,uicc_pathologic_m,0.6992,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"cM0 (i+), M0, M1, M1a, M1b, M1c, M1d, M2, MX, Unknown, Not Reported"
3,ajcc_pathologic_m,0.6908,Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the regi...,"cM0 (i+), M0, M1, M1a, M1b, M1c, M1d, M2, MX, Unknown, Not Reported, Not Allowed To Collect"
4,ensat_clinical_m,0.6473,"A clinical finding about one or more characteristics of adrenal cancer, following the rules of the ENSAT staging v7 classification system as they pert...","M0, M1"
5,inrg_stage,0.6316,"The text term used to describe the staging classification of neuroblastic tumors, as defined by the International Neuroblastoma Risk Group (INRG).","L1, L2, M, Ms, Unknown, Not Reported"
6,enneking_msts_metastasis,0.6272,"Text term and code that represents the metastatic stage of the musculoskeletal sarcoma, using the Enneking staging system approved by the Musculoskele...","No Metastasis (M0), Regional or Distant Metastasis (M1), Unknown, Not Reported"
7,uicc_clinical_stage,0.6271,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
8,masaoka_stage,0.6038,"The text term used to describe the Masaoka staging system, a classification that defines prognostic indicators for thymic malignancies and predicts tu...","Stage I, Stage IIa, Stage IIb, Stage III, Stage IVa, Stage IVb"
9,ajcc_pathologic_stage,0.6011,"The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body based on AJCC staging criteria.","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."



Path_Stage_Dist_Mets-pM:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,uicc_clinical_m,0.7371,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"cM0 (i+), M0, M1, M1a, M1b, M1c, MX, Unknown, Not Reported"
1,ajcc_clinical_m,0.7076,Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.,"cM0 (i+), M0, M1, M1a, M1b, M1c, MX, Unknown, Not Reported, Not Allowed To Collect"
2,uicc_pathologic_m,0.7069,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"cM0 (i+), M0, M1, M1a, M1b, M1c, M1d, M2, MX, Unknown, Not Reported"
3,ajcc_pathologic_m,0.6833,Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the regi...,"cM0 (i+), M0, M1, M1a, M1b, M1c, M1d, M2, MX, Unknown, Not Reported, Not Allowed To Collect"
4,ensat_clinical_m,0.6273,"A clinical finding about one or more characteristics of adrenal cancer, following the rules of the ENSAT staging v7 classification system as they pert...","M0, M1"
5,enneking_msts_metastasis,0.6068,"Text term and code that represents the metastatic stage of the musculoskeletal sarcoma, using the Enneking staging system approved by the Musculoskele...","No Metastasis (M0), Regional or Distant Metastasis (M1), Unknown, Not Reported"
6,metastasis_at_diagnosis,0.5874,The text term used to describe the extent of metastatic disease present at diagnosis.,"Distant Metastasis, Metastasis, NOS, No Metastasis, Regional Metastasis, Unknown, Not Reported"
7,classification_of_tumor,0.5729,Text that describes the kind of disease present in the tumor specimen as related to a specific timepoint.,"metastasis, Premalignant, primary, Prior primary, Progression, recurrence, Synchronous primary, other, Unknown, not reported, Not Allowed To Collect"
8,masaoka_stage,0.5534,"The text term used to describe the Masaoka staging system, a classification that defines prognostic indicators for thymic malignancies and predicts tu...","Stage I, Stage IIa, Stage IIb, Stage III, Stage IVa, Stage IVb"
9,uicc_clinical_stage,0.551,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."



tumor_Stage-Pathological:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,ensat_pathologic_stage,0.8143,An adrenal cancer stage defined according to the European Network for the Study of Adrenal Tumors (ENSAT) criteria.,"Stage I, Stage II, Stage III, Stage IV"
1,ann_arbor_clinical_stage,0.7086,"The text term used to describe the clinical classification of lymphoma, as defined by the Ann Arbor Lymphoma Staging System.","Stage I, Stage II, Stage III, Stage IV, Unknown, Not Reported"
2,cog_liver_stage,0.7066,"The text term used to describe the staging classification of liver tumors, as defined by the Children's Oncology Group (COG). This staging system spec...","Stage I, Stage II, Stage III, Stage IV, Unknown, Not Reported"
3,inss_stage,0.7037,"Text term used to describe the staging classification of neuroblastic tumors, as defined by the International Neuroblastoma Staging System (INSS).","Stage 1, Stage 2A, Stage 2B, Stage 3, Stage 4, Stage 4S, Unknown, Not Reported"
4,iss_stage,0.6993,The multiple myeloma disease stage at diagnosis.,"I, II, III, Unknown, Not Reported"
5,cog_renal_stage,0.6944,"The text term used to describe the staging classification of renal tumors, as defined by the Children's Oncology Group (COG).","Stage I, Stage II, Stage III, Stage IV, Unknown, Not Reported"
6,ann_arbor_pathologic_stage,0.6674,"The text term used to describe the pathologic classification of lymphoma, as defined by the Ann Arbor Lymphoma Staging System.","Stage I, Stage II, Stage III, Stage IV, Unknown, Not Reported"
7,masaoka_stage,0.6129,"The text term used to describe the Masaoka staging system, a classification that defines prognostic indicators for thymic malignancies and predicts tu...","Stage I, Stage IIa, Stage IIb, Stage III, Stage IVa, Stage IVb"
8,uicc_clinical_stage,0.5948,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
9,ajcc_clinical_stage,0.5887,"Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis f...","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IB Cervix, Stage IC, Stage II,..."



FIGO_stage:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,figo_stage,0.8703,"The extent of a cervical or endometrial cancer within the body, especially whether the disease has spread from the original site to other parts of the...","Stage 0, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage IC1, Stage IC2, Stage IC3, Stage II, Stage IIA, Stag..."
1,uicc_pathologic_stage,0.7944,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
2,uicc_clinical_stage,0.7682,The UICC TNM Classification is an anatomically based system that records the primary and regional nodal extent of the tumor and the absence or presenc...,"Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
3,ajcc_clinical_stage,0.7659,"Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis f...","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IB Cervix, Stage IC, Stage II,..."
4,ajcc_pathologic_stage,0.7346,"The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body based on AJCC staging criteria.","Stage 0, Stage 0a, Stage 0is, Stage I, Stage IA, Stage IA1, Stage IA2, Stage IA3, Stage IB, Stage IB1, Stage IB2, Stage IC, Stage II, Stage IIA, Stage..."
5,enneking_msts_stage,0.6924,"Text term used to describe the stage of the musculoskeletal sarcoma, using the Enneking staging system approved by the Musculoskeletal Tumor Society (...","Stage IA, Stage IB, Stage IIA, Stage IIB, Stage III, Unknown, Not Reported"
6,inss_stage,0.6616,"Text term used to describe the staging classification of neuroblastic tumors, as defined by the International Neuroblastoma Staging System (INSS).","Stage 1, Stage 2A, Stage 2B, Stage 3, Stage 4, Stage 4S, Unknown, Not Reported"
7,irs_group,0.6093,"Text term used to describe the classification of rhabdomyosarcoma tumors, as defined by the Intergroup Rhabdomyosarcoma Study (IRS).","Group I, Group Ia, Group Ib, Group II, Group IIa, Group IIb, Group IIc, Group III, Group IIIa, Group IIIb, Group IV, Unknown, Not Reported"
8,igcccg_stage,0.6079,"The text term used to describe the International Germ Cell Cancer Collaborative Group (IGCCCG), a grouping used to further classify metastatic testicu...","Good Prognosis, Intermediate Prognosis, Poor Prognosis, Unknown, Not Reported"
9,masaoka_stage,0.586,"The text term used to describe the Masaoka staging system, a classification that defines prognostic indicators for thymic malignancies and predicts tu...","Stage I, Stage IIa, Stage IIb, Stage III, Stage IVa, Stage IVb"



BMI:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,percent_stromal_cells,0.818,Numeric value to represent the percentage of reactive cells that are present in a malignant tumor sample or specimen but are not malignant such as fib...,
1,necrosis_percent,0.8076,A quantitative measurement of the percent of cells undergoing necrosis compared to the number of total cells present in a sample.,
2,spindle_cell_percent,0.7874,"The percent of uveal melanoma arising from the choroid, ciliary body, or the iris and characterized by the presence of spindle-shaped melanocytes.",
3,recist_targeted_regions_sum,0.7801,"Numeric value that represents the sum of baseline target lesions, as described by the Response Evaluation Criteria in Solid Tumours (RECIST) criteria.",
4,bmi,0.7546,A calculated numerical quantity that represents an individual's weight to height ratio.,
5,intermediate_dimension,0.7513,"Intermediate dimension of the sample, in millimeters.",
6,longest_dimension,0.721,"Numeric value that represents the longest dimension of the sample, measured in millimeters.",
7,average_base_quality,0.6933,Average base quality collected from samtools.,
8,percent_neutrophil_infiltration,0.6483,Numeric value to represent the percentage of infiltration by neutrophils in a tumor sample or specimen.,
9,fragment_standard_deviation_length,0.612,"Standard deviation of the sequenced fragments length (e.g., as predicted by Agilent Bioanalyzer).",



Age:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,age_at_onset,0.8855,Numeric value used to represent the age of the patient when exposure to a specific environmental factor began.,
1,age_at_last_exposure,0.8768,The study participant's age at the time they were last exposed.,
2,age_at_index,0.8745,The patient's age (in years) on the reference or anchor date used during date obfuscation.,
3,age_at_diagnosis,0.8623,Age at the time of diagnosis expressed in number of days since birth.,
4,relationship_age_at_diagnosis,0.8441,The age (in years) when the patient's relative was first diagnosed.,
5,undescended_testis_corrected_age,0.7382,The patient's age when their undescended testis was corrected.,
6,age_is_obfuscated,0.644,The age or other properties related to the patient's age have been modified for compliance reasons. The actual age may differ from what was reported i...,
7,pack_years_smoked,0.6387,Numeric computed value to represent lifetime tobacco exposure defined as number of cigarettes smoked per day x number of years smoked divided by 20.,
8,menopause_status,0.6157,Text term used to describe the patient's menopause status.,"Perimenopausal, Postmenopausal, Premenopausal, Unknown, Not Reported"
9,days_to_birth,0.596,Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.,



Race:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,ethnicity,0.7531,"An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provid...","hispanic or latino, not hispanic or latino, Unknown, unknown, not reported, not allowed to collect"
1,race,0.7397,An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within ...,"american indian or alaska native, asian, black or african american, native hawaiian or other pacific islander, white, other, Unknown, unknown, not rep..."
2,eye_color,0.4766,The color of the iris of the eye,"Amber, Blue, Brown, Gray, Green, Hazel, Red & Violet, Other, Not Reported"
3,channel,0.425,The corresponding color channel used to generate this data file.,"Green, Red"
4,variant_origin,0.4006,The text term used to describe the biological origin of a specific genetic variant.,"Germline, Somatic, Unknown"
5,demographics,0.3645,,
6,supratentorial_localization,0.3533,Text term to specify the location of the supratentorial tumor.,"Cerebral Cortex, Deep Gray (e.g. Basal Ganglia, Thalamus), Frontal lobe, Occipital lobe, Parietal lobe, Spinal Cord, Temporal lobe, White Matter, Unkn..."
7,somatic_annotation_workflows,0.348,,
8,stain_type,0.3314,The text term used to describe the type of stain used on a slide.,"Haemotoxylin and Eosin (H&E), Immunohistochemistry (IHC)"
9,fab_morphology_code,0.329,"A classification system for acute myeloid leukemias, acute lymphoblastic leukemias, and myelodysplastic syndromes. It is based on the morphologic and ...","M0, M1, M2, M3, M4, M5, M6, M7, Not Classified"



Ethnicity:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,ethnicity,0.8505,"An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provid...","hispanic or latino, not hispanic or latino, Unknown, unknown, not reported, not allowed to collect"
1,race,0.6468,An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within ...,"american indian or alaska native, asian, black or african american, native hawaiian or other pacific islander, white, other, Unknown, unknown, not rep..."
2,variant_origin,0.4661,The text term used to describe the biological origin of a specific genetic variant.,"Germline, Somatic, Unknown"
3,demographics,0.3661,,
4,eye_color,0.3564,The color of the iris of the eye,"Amber, Blue, Brown, Gray, Green, Hazel, Red & Violet, Other, Not Reported"
5,measurement_type,0.3227,The method used to measure tumor size.,"Echographic, Pathologic, Radiologic"
6,channel,0.3153,The corresponding color channel used to generate this data file.,"Green, Red"
7,country_of_birth,0.3073,The name of the country in which the patient is born.,"Afghanistan, Albania, Algeria, Andorra, Angola, Anguilla, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bah..."
8,methylation_array_harmonization_workflows,0.2998,,
9,country_of_residence_at_enrollment,0.2993,The text term used to describe the patient's country of residence at the time they were enrolled in the study.,"Afghanistan, Albania, Algeria, Andorra, Angola, Anguilla, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bah..."



Gender:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,gender,0.908,Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal rol...,"female, male, unspecified, unknown, not reported"
1,relationship_gender,0.8735,The text term used to describe the gender of the patient's relative with a history of cancer.,"female, male, unspecified, unknown, not reported"
2,pregnant_at_diagnosis,0.5085,The text term used to indicate whether the patient was pregnant at the time they were diagnosed.,"Yes, No, Unknown, Not Reported"
3,menopause_status,0.5046,Text term used to describe the patient's menopause status.,"Perimenopausal, Postmenopausal, Premenopausal, Unknown, Not Reported"
4,hormonal_contraceptive_type,0.4781,The specific type of hormonal contraceptives used by the subject.,"Progestin, Progestin and Estrogen, Unknown, Not Reported"
5,pregnancy_outcome,0.4511,The text term used to describe the type of pregnancy the patient had.,"Ectopic Pregnancy, Induced Abortion, Live Birth, Miscarriage, Spontaneous Abortion, Stillbirth, Unknown, Not Reported"
6,tumor_shape,0.439,Text term to represent the description of the shape of a tumor determined by clinical or pathological techniques.,"Diffuse, Dome, Mushroom, Unknown"
7,marital_status,0.4323,A demographic parameter indicating a person's current conjugal status.,"Divorced, Domestic Partnership, Married, Never Married, Separated, Widowed"
8,variant_origin,0.4315,The text term used to describe the biological origin of a specific genetic variant.,"Germline, Somatic, Unknown"
9,demographics,0.4067,,



Tumor_Site:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,margins_involved_site,0.6137,The text term used to describe the anatomic sites that were involved in the survival margins.,"Gerota Fascia, Parenchyma, Perinephric Fat, Renal, Renal Capsule, Renal Sinus, Renal Vein, Ureter"
1,tumor_depth_descriptor,0.5781,Text term for the degree to which a tumor has penetrated into organ or tissue.,"Deep, Superficial, Not Reported"
2,max_tumor_bulk_site,0.5583,The site of the tumor where the dimension or diameter is larger than any other part of the tumor.,"Adrenal, Appendix, Ascites/peritoneum, Axillary lymph nodes, Bone marrow, Brain, Breast, Cervical lymph nodes, Colon, Iliac, Iliac-external, Inguinal,..."
3,supratentorial_localization,0.5446,Text term to specify the location of the supratentorial tumor.,"Cerebral Cortex, Deep Gray (e.g. Basal Ganglia, Thalamus), Frontal lobe, Occipital lobe, Parietal lobe, Spinal Cord, Temporal lobe, White Matter, Unkn..."
4,enneking_msts_tumor_site,0.5399,"Text term and code that represents the tumor site of the musculoskeletal sarcoma, using the Enneking staging system approved by the Musculoskeletal Tu...","Extracompartmental (T2), Intracompartmental (T1), Unknown, Not Reported"
5,tumor_level_prostate,0.5308,The level(s) of the prostate from which the tumor originated.,
6,primary_site,0.5265,"The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) International Classification of Di...","Accessory sinuses, Adrenal gland, Anus and anal canal, Base of tongue, Bladder, Bones, joints and articular cartilage of limbs, Bones, joints and arti..."
7,morphologic_architectural_pattern,0.52,A specific morphologic or pathologic architectural pattern was discovered within the sample studied.,"Cohesive, Cribiform, Micropapillary, Non-cohesive, Papillary Renal Cell, Papillary, NOS, Solid, Tubular"
8,tumor_shape,0.4893,Text term to represent the description of the shape of a tumor determined by clinical or pathological techniques.,"Diffuse, Dome, Mushroom, Unknown"
9,biospecimen_type,0.4642,"The text term used to describe the biological material used for testing, diagnostic, treatment or research purposes.","Blood, Bone Marrow, Buccal Mucosa, Buffy Coat, Cerebrospinal Fluid, Connective Tissue, Embryonic Fluid, Embryonic Tissue, Feces, Granulocyte, Involved..."



Tumor_Focality:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,tumor_focality,0.808,The text term used to describe whether the patient's disease originated in a single location or multiple locations.,"Multifocal, Unifocal, Unknown, Not Reported"
1,tumor_shape,0.5626,Text term to represent the description of the shape of a tumor determined by clinical or pathological techniques.,"Diffuse, Dome, Mushroom, Unknown"
2,tumor_depth_descriptor,0.5311,Text term for the degree to which a tumor has penetrated into organ or tissue.,"Deep, Superficial, Not Reported"
3,enneking_msts_tumor_site,0.4829,"Text term and code that represents the tumor site of the musculoskeletal sarcoma, using the Enneking staging system approved by the Musculoskeletal Tu...","Extracompartmental (T2), Intracompartmental (T1), Unknown, Not Reported"
4,biospecimen_type,0.4809,"The text term used to describe the biological material used for testing, diagnostic, treatment or research purposes.","Blood, Bone Marrow, Buccal Mucosa, Buffy Coat, Cerebrospinal Fluid, Connective Tissue, Embryonic Fluid, Embryonic Tissue, Feces, Granulocyte, Involved..."
5,tissue_type,0.4783,Text term that represents a description of the kind of tissue collected with respect to disease status or proximity to tumor tissue.,"Abnormal, Normal, Peritumoral, Tumor, Unknown, Not Reported, Not Allowed To Collect"
6,wilms_tumor_histologic_subtype,0.4597,The text term used to describe the classification of Wilms tumors distinguishing between favorable and unfavorable histologic groups.,"Favorable, Unfavorable, Unknown, Not Reported"
7,residual_tumor_measurement,0.4544,A measurement of the tumor cells that remain in the body following cancer treatment.,"1-10 mm, 11-20 mm, >20 mm, No macroscopic disease"
8,non_nodal_tumor_deposits,0.451,The yes/no/unknown indicator used to describe the presence of tumor deposits in the pericolic or perirectal fat or in adjacent mesentery away from the...,"Yes, No, Unknown, Not Reported"
9,tumor_infiltrating_macrophages,0.4465,Non-neoplastic macrophages that infiltrate a tumor.,"Few, Many, Moderate"



Tumor_Size_cm:


Unnamed: 0,Candidate,Similarity,Description,Values (sample)
0,shortest_dimension,0.7575,"Numeric value that represents the shortest dimension of the sample, measured in millimeters.",
1,size_extraocular_nodule,0.7416,The size of the nodule that is outside the eye.,
2,tumor_width_measurement,0.7034,The numerical measurement of tumor width.,
3,tumor_depth_measurement,0.6989,The numerical measurement of tumor depth.,
4,tumor_thickness,0.6506,A measurement of the thickness of a sectioned slice (of tissue or mineral or other substance) in millimeters (mm).,
5,analyte_quantity,0.6418,The quantity in micrograms (ug) of the analyte(s) derived from the analyte(s) shipped for sequencing and characterization.,
6,average_insert_size,0.6362,Average insert size collected from samtools.,
7,tumor_largest_dimension_diameter,0.5736,"Numeric value used to describe the maximum diameter or dimension of the primary tumor, measured in centimeters.",
8,mitotic_total_area,0.5635,The total area reviewed when calculating the mitotic index ratio.,
9,rin,0.5578,A numerical assessment of the integrity of RNA based on the entire electrophoretic trace of the RNA sample including the presence or absence of degrad...,
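
The similarity ranking alone does not settle which candidate to pick. For `Tumor_Size_cm`, the top candidate `shortest_dimension` is described in millimeters, while `tumor_largest_dimension_diameter`, described in centimeters, appears only at rank 7. The sketch below is one way to scan candidate descriptions for a unit keyword. It uses rows copied by hand from the table above rather than any bdikit call, and the helper `candidates_mentioning` is a hypothetical name introduced for illustration.

```python
# (candidate, similarity, description) rows copied from the Tumor_Size_cm table above.
candidates = [
    ("shortest_dimension", 0.7575,
     "Numeric value that represents the shortest dimension of the sample, "
     "measured in millimeters."),
    ("tumor_width_measurement", 0.7034,
     "The numerical measurement of tumor width."),
    ("tumor_thickness", 0.6506,
     "A measurement of the thickness of a sectioned slice (of tissue or mineral "
     "or other substance) in millimeters (mm)."),
    ("tumor_largest_dimension_diameter", 0.5736,
     "Numeric value used to describe the maximum diameter or dimension of the "
     "primary tumor, measured in centimeters."),
]


def candidates_mentioning(rows, keyword):
    """Return (candidate, similarity) pairs whose description contains keyword."""
    keyword = keyword.lower()
    return [(name, score) for name, score, desc in rows if keyword in desc.lower()]


cm_matches = candidates_mentioning(candidates, "centimeters")
mm_matches = candidates_mentioning(candidates, "millimeters")
print("described in centimeters:", cm_matches)
print("described in millimeters:", mm_matches)
```

A keyword scan like this only narrows the list; whether a lower-ranked candidate is the better target still needs a domain expert's judgment.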


## Column Mapping

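Next, map each column of the Dou et al. dataset to a GDC attribute. The output below lists one target column for each of the first ten source columns.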

In [6]:
column_mappings = manager.map_columns()

Unnamed: 0,Original Column,Target Column
0,Country,country_of_birth
1,Histologic_Grade_FIGO,histologic_progression_type
2,Histologic_type,history_of_tumor_type
3,Path_Stage_Primary_Tumor-pT,uicc_pathologic_stage
4,Path_Stage_Reg_Lymph_Nodes-pN,uicc_pathologic_stage
5,Clin_Stage_Dist_Mets-cM,uicc_clinical_stage
6,Path_Stage_Dist_Mets-pM,ajcc_pathologic_stage
7,tumor_Stage-Pathological,uicc_pathologic_stage
8,FIGO_stage,figo_stage
9,BMI,bmi
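
In the output above, several source columns share the same target column. A quick check for such collisions can be sketched as follows. The notebook does not show the type returned by `map_columns`, so this example works on the table text copied from the output and uses only the Python standard library.

```python
import csv
import io
from collections import defaultdict

# Column mappings copied from the output above.
mapping_csv = """Original Column,Target Column
Country,country_of_birth
Histologic_Grade_FIGO,histologic_progression_type
Histologic_type,history_of_tumor_type
Path_Stage_Primary_Tumor-pT,uicc_pathologic_stage
Path_Stage_Reg_Lymph_Nodes-pN,uicc_pathologic_stage
Clin_Stage_Dist_Mets-cM,uicc_clinical_stage
Path_Stage_Dist_Mets-pM,ajcc_pathologic_stage
tumor_Stage-Pathological,uicc_pathologic_stage
FIGO_stage,figo_stage
BMI,bmi
"""

# Group source columns by the target column they were mapped to.
sources_by_target = defaultdict(list)
for row in csv.DictReader(io.StringIO(mapping_csv)):
    sources_by_target[row["Target Column"]].append(row["Original Column"])

# Keep only targets that received more than one source column.
shared_targets = {
    target: sources
    for target, sources in sources_by_target.items()
    if len(sources) > 1
}
print(shared_targets)
```

Here `uicc_pathologic_stage` receives three source columns (the pT, pN, and overall pathological stage), which suggests that at least some of these mappings deserve a manual review.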


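Automatic mappings can be corrected by hand. Here, `Histologic_Grade_FIGO` was mapped to `histologic_progression_type`, so we reassign it to `tumor_grade`. The updated table follows.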

In [7]:
manager.update_column_mappings('Histologic_Grade_FIGO', 'tumor_grade')

Column mapping updated!


Unnamed: 0,Original Column,Target Column
0,Country,country_of_birth
1,Histologic_Grade_FIGO,tumor_grade
2,Histologic_type,history_of_tumor_type
3,Path_Stage_Primary_Tumor-pT,uicc_pathologic_stage
4,Path_Stage_Reg_Lymph_Nodes-pN,uicc_pathologic_stage
5,Clin_Stage_Dist_Mets-cM,uicc_clinical_stage
6,Path_Stage_Dist_Mets-pM,ajcc_pathologic_stage
7,tumor_Stage-Pathological,uicc_pathologic_stage
8,FIGO_stage,figo_stage
9,BMI,bmi
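
To confirm that an update changed only the intended row, the two mapping tables can be compared. This is a minimal sketch on values copied from the two tables above; it does not call bdikit, and the dictionary names are chosen for illustration.

```python
# Target columns before and after the update, copied from the two tables above.
before = {
    "Country": "country_of_birth",
    "Histologic_Grade_FIGO": "histologic_progression_type",
    "Histologic_type": "history_of_tumor_type",
    "Path_Stage_Primary_Tumor-pT": "uicc_pathologic_stage",
    "Path_Stage_Reg_Lymph_Nodes-pN": "uicc_pathologic_stage",
    "Clin_Stage_Dist_Mets-cM": "uicc_clinical_stage",
    "Path_Stage_Dist_Mets-pM": "ajcc_pathologic_stage",
    "tumor_Stage-Pathological": "uicc_pathologic_stage",
    "FIGO_stage": "figo_stage",
    "BMI": "bmi",
}
after = {
    "Country": "country_of_birth",
    "Histologic_Grade_FIGO": "tumor_grade",
    "Histologic_type": "history_of_tumor_type",
    "Path_Stage_Primary_Tumor-pT": "uicc_pathologic_stage",
    "Path_Stage_Reg_Lymph_Nodes-pN": "uicc_pathologic_stage",
    "Clin_Stage_Dist_Mets-cM": "uicc_clinical_stage",
    "Path_Stage_Dist_Mets-pM": "ajcc_pathologic_stage",
    "tumor_Stage-Pathological": "uicc_pathologic_stage",
    "FIGO_stage": "figo_stage",
    "BMI": "bmi",
}

# Collect every source column whose target differs between the two tables.
changed = {
    column: (before[column], after[column])
    for column in before
    if before[column] != after[column]
}
print(changed)
```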


## Value Mapping

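With the columns mapped, map the values found in each source column to values of its target GDC attribute. Each table below lists the proposed target value and its similarity score; a `-` appears where no target value was proposed.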

In [8]:
value_mappings = manager.map_values()


Column tumor_Stage-Pathological:


Unnamed: 0,Current Value,Target Value,Similarity
0,Stage I,Stage I,1.0
1,Stage IV,Stage IV,1.0
2,Stage III,Stage III,1.0
3,Stage II,Stage II,1.0
4,,-,-



Column Race:


Unnamed: 0,Current Value,Target Value,Similarity
0,White,white,1.0
1,Asian,asian,1.0
2,Not Reported,not reported,1.0
3,Black or African American,black or african american,1.0
4,,-,-



Column Histologic_Grade_FIGO:


Unnamed: 0,Current Value,Target Value,Similarity
0,FIGO grade 1,High Grade,1.0
1,FIGO grade 2,High Grade,1.0
2,FIGO grade 3,High Grade,1.0
3,,-,-



Column Tumor_Focality:


Unnamed: 0,Current Value,Target Value,Similarity
0,Unifocal,Unifocal,1.0
1,Multifocal,Multifocal,1.0
2,,-,-



Column Country:


Unnamed: 0,Current Value,Target Value,Similarity
0,United States,United States,1.0
1,Ukraine,Ukraine,1.0
2,Poland,Poland,1.0
3,Other_specify,-,-
4,,-,-



Column Ethnicity:


Unnamed: 0,Current Value,Target Value,Similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Not-Hispanic or Latino,not hispanic or latino,0.936364
2,,-,-
3,Not reported,-,-



Column Gender:


Unnamed: 0,Current Value,Target Value,Similarity
0,Female,female,1.0
1,,-,-



Column Path_Stage_Reg_Lymph_Nodes-pN:


Unnamed: 0,Current Value,Target Value,Similarity
0,pN2 (FIGO IIIC2),Stage IIIC2,1.0
1,pN1 (FIGO IIIC1),Stage IIIC1,1.0
2,pN0,-,-
3,,-,-
4,pNX,-,-



Column Clin_Stage_Dist_Mets-cM:


Unnamed: 0,Current Value,Target Value,Similarity
0,Staging Incomplete,Stage IC,1.0
1,cM0,-,-
2,cM1,-,-
3,,-,-



Column Path_Stage_Dist_Mets-pM:


Unnamed: 0,Current Value,Target Value,Similarity
0,Staging Incomplete,Stage IC,1.0
1,No pathologic evidence of distant metastasis,-,-
2,pM1,-,-
3,,-,-



Column Tumor_Site:


Unnamed: 0,Current Value,Target Value,Similarity
0,,Unknown,1.0
1,Posterior endometrium,-,-
2,Anterior endometrium,-,-
3,"Other, specify",-,-



Column FIGO_stage:


Unnamed: 0,Current Value,Target Value,Similarity
0,IIIC2,Stage IIIC2,1.0
1,IIIC1,Stage IIIC1,1.0
2,IIIB,-,-
3,IVB,-,-
4,II,-,-
5,IIIA,-,-
6,IB,-,-
7,IA,-,-
8,,-,-



Column Histologic_type:


Unnamed: 0,Current Value,Target Value,Similarity
0,Clear cell,Colorectal Cancer,1.0
1,Carcinosarcoma,-,-
2,Serous,-,-
3,Endometrioid,-,-
4,,-,-



Column Path_Stage_Primary_Tumor-pT:


Unnamed: 0,Current Value,Target Value,Similarity
0,pT3a (FIGO IIIA),Stage IIIA,1.0
1,pT3b (FIGO IIIB),-,-
2,,-,-
3,pT1a (FIGO IA),-,-
4,pT1b (FIGO IB),-,-
5,pT1 (FIGO I),-,-
6,pT2 (FIGO II),-,-
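
A similarity of 1.0 does not guarantee a correct match. In the output above, `FIGO grade 1`, `FIGO grade 2`, and `FIGO grade 3` are all mapped to `High Grade`, and `Clear cell` is mapped to `Colorectal Cancer`. The sketch below shows one heuristic for triaging the results: it lists values with no proposed target and flags proposals whose normalized source text does not appear in the target text. The rows are a hand-copied sample of the tables above, reading `-` as "no match" is an inference from the output, and `normalize` is a hypothetical helper, not part of bdikit.

```python
def normalize(value):
    """Lowercase and keep only letters and digits ('Not-Hispanic' -> 'nothispanic')."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


# (column, current value, target value, similarity) rows sampled from the output above.
rows = [
    ("Race", "White", "white", "1.0"),
    ("Ethnicity", "Not-Hispanic or Latino", "not hispanic or latino", "0.936364"),
    ("Histologic_Grade_FIGO", "FIGO grade 1", "High Grade", "1.0"),
    ("Histologic_type", "Clear cell", "Colorectal Cancer", "1.0"),
    ("Histologic_type", "Serous", "-", "-"),
    ("Clin_Stage_Dist_Mets-cM", "Staging Incomplete", "Stage IC", "1.0"),
    ("FIGO_stage", "IIIC2", "Stage IIIC2", "1.0"),
    ("FIGO_stage", "IA", "-", "-"),
]

# Values for which no target value was proposed.
unmatched = [(col, cur) for col, cur, tgt, _ in rows if tgt == "-"]

# Proposed matches whose source text is not contained in the target text.
needs_review = [
    (col, cur, tgt)
    for col, cur, tgt, _ in rows
    if tgt != "-" and normalize(cur) not in normalize(tgt)
]

print("unmatched:", unmatched)
print("needs review:", needs_review)
```

The containment test is deliberately crude: it accepts `IIIC2` for `Stage IIIC2` and flags `Staging Incomplete` for `Stage IC`, but it would also flag legitimate synonyms, so its output is a review list rather than a verdict.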
