# Quick Start

First, import the `bdikit` library, along with `pandas` to load the data.

In [1]:
import bdikit as bdi
import pandas as pd

In this example, we map a renal cancer data sample from Clark et al. (https://pubmed.ncbi.nlm.nih.gov/31675502/) to the GDC (Genomic Data Commons) format.

In [2]:
dataset = pd.read_csv("./datasets/clark_sample.csv")
dataset.head(5)

| | Gender | Grade | Path_Stage_Primary_Tumor_pT | Tumor_Focality | Tumor_Stage_Pathological |
|---|---|---|---|---|---|
| 0 | Male | G3 | pT3 | Unifocal | Stage III |
| 1 | Male | G3 | pT1b | Unifocal | Stage I |
| 2 | Female | G4 | pT3a | Unifocal | Stage IV |
| 3 | Female | G3 | pT1a | Unifocal | Stage I |
| 4 | Male | G3 | pT3a | Unifocal | Stage III |


__Schema Matching__

`bdi-kit` can help with automatic discovery of one-to-one matches between the columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a standard data vocabulary such as the GDC (Genomic Data Commons).

To do this, we call the `match_schema()` function with `target="gdc"`, which matches the source columns to the GDC vocabulary schema as follows.

In [3]:
column_matches = bdi.match_schema(dataset, target="gdc", method="magneto_ft_bp")
column_matches

| | source | target | similarity |
|---|---|---|---|
| 0 | Gender | gender | 1.0 |
| 1 | Tumor_Focality | tumor_focality | 1.0 |
| 2 | Grade | tumor_grade | 0.83693 |
| 3 | Path_Stage_Primary_Tumor_pT | ajcc_pathologic_t | 0.800312 |
| 4 | Tumor_Stage_Pathological | ajcc_pathologic_stage | 0.784637 |
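
Because the output above is a table of `source`, `target`, and `similarity` values, one way to decide which matches deserve a manual look is to filter on the similarity score. The following is a minimal sketch that rebuilds the table by hand with `pandas` so it runs without `bdi-kit`. The `REVIEW_THRESHOLD` name and its value of 0.8 are assumptions chosen for illustration, not `bdi-kit` defaults.

```python
import pandas as pd

# Column matches copied from the output above (source, target, similarity).
column_matches = pd.DataFrame(
    {
        "source": [
            "Gender",
            "Tumor_Focality",
            "Grade",
            "Path_Stage_Primary_Tumor_pT",
            "Tumor_Stage_Pathological",
        ],
        "target": [
            "gender",
            "tumor_focality",
            "tumor_grade",
            "ajcc_pathologic_t",
            "ajcc_pathologic_stage",
        ],
        "similarity": [1.0, 1.0, 0.83693, 0.800312, 0.784637],
    }
)

# Hypothetical cutoff for illustration only; it is not a bdi-kit default.
REVIEW_THRESHOLD = 0.8

# Keep only the matches whose score falls below the cutoff.
needs_review = column_matches[column_matches["similarity"] < REVIEW_THRESHOLD]
print(needs_review[["source", "target"]].to_string(index=False))
```

With this cutoff, only the `Tumor_Stage_Pathological` match is flagged. A different threshold would flag a different set, so the value should be tuned to the data at hand.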


__Value Matching__

After finding the correct column matches, we need to find appropriate value matches.
Using `match_values()`, we can inspect the candidate value mappings before harmonizing the data. `bdi-kit` implements multiple methods for value mapping discovery.

To specify a value mapping approach, we can pass the `method` parameter.

In [4]:
value_mappings = bdi.match_values(dataset, target="gdc", column_mapping=column_matches, method="tfidf")
bdi.view_value_matches(value_mappings)

**Source column:** Gender<br>**Target column:** gender

| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | Male | male | 1.0 |
| 1 | Female | female | 1.0 |


**Source column:** Tumor_Focality<br>**Target column:** tumor_focality

| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | Unifocal | Unifocal | 1.0 |
| 1 | Multifocal | Multifocal | 1.0 |


**Source column:** Grade<br>**Target column:** tumor_grade

| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | G3 | G3 | 1.0 |
| 1 | G4 | G4 | 1.0 |
| 2 | G2 | G2 | 1.0 |
| 3 | G1 | G1 | 1.0 |


**Source column:** Tumor_Stage_Pathological<br>**Target column:** ajcc_pathologic_stage

| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | Stage III | Stage III | 1.0 |
| 1 | Stage I | Stage I | 1.0 |
| 2 | Stage IV | Stage IV | 1.0 |
| 3 | Stage II | Stage II | 1.0 |


**Source column:** Path_Stage_Primary_Tumor_pT<br>**Target column:** ajcc_pathologic_t

| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | pT3b | T3b | 0.816 |
| 1 | pT3a | T3a | 0.81 |
| 2 | pT2b | T2b | 0.797 |
| 3 | pT1b | T1b | 0.767 |
| 4 | pT2a | T2a | 0.765 |
| 5 | pT1a | T1a | 0.758 |
| 6 | pT3 | T3 | 0.614 |
| 7 | pT4 | T4 | 0.592 |
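
To give some intuition for why values such as `pT3a` and `T3a` match with a similarity below 1.0, the sketch below scores strings with TF-IDF over character n-grams and cosine similarity, using only the standard library. This is not `bdi-kit`'s implementation. The n-gram range, the smoothed IDF formula, and the helper names (`char_ngrams`, `tfidf_vectors`, `cosine`, `best_matches`) are assumptions made for illustration, so the scores it produces will differ from those shown above.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all lowercase character n-grams of length n_min..n_max."""
    text = text.lower()
    return [
        text[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(values):
    """Build one sparse TF-IDF vector (a dict) per input string."""
    docs = [Counter(char_ngrams(v)) for v in values]
    n_docs = len(docs)
    # Document frequency: in how many strings each n-gram occurs.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    # Smoothed inverse document frequency (an assumption, always positive).
    idf = {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_matches(source_values, target_values):
    """Map each source value to its highest-scoring target value."""
    vectors = tfidf_vectors(source_values + target_values)
    src_vecs = vectors[: len(source_values)]
    tgt_vecs = vectors[len(source_values) :]
    result = {}
    for value, vec in zip(source_values, src_vecs):
        score, target = max(
            (cosine(vec, tv), t) for t, tv in zip(target_values, tgt_vecs)
        )
        result[value] = (target, round(score, 3))
    return result


source_values = ["pT3a", "pT1b", "pT2a"]
target_values = ["T1a", "T1b", "T2a", "T3a", "T3b"]
matches = best_matches(source_values, target_values)
for value, (target, score) in matches.items():
    print(f"{value} -> {target} ({score})")
```

In this sketch, the extra `p` prefix contributes n-grams that no target value shares, which is why the best match scores below 1.0 even though the rest of the string is identical.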


__Materializing the Harmonized Dataset__

Finally, we generate the harmonized dataset using the value mappings defined above.

In [5]:
harmonized_dataset = bdi.materialize_mapping(dataset, value_mappings)
harmonized_dataset

| | gender | tumor_focality | tumor_grade | ajcc_pathologic_stage | ajcc_pathologic_t |
|---|---|---|---|---|---|
| 0 | male | Unifocal | G3 | Stage III | T3 |
| 1 | male | Unifocal | G3 | Stage I | T1b |
| 2 | female | Unifocal | G4 | Stage IV | T3a |
| 3 | female | Unifocal | G3 | Stage I | T1a |
| 4 | male | Unifocal | G3 | Stage III | T3a |
| ... | ... | ... | ... | ... | ... |
| 105 | male | Unifocal | G3 | Stage III | T3a |
| 106 | male | Multifocal | G2 | Stage II | T2a |
| 107 | male | Unifocal | G2 | Stage III | T3a |
| 108 | male | Multifocal | G3 | Stage II | T2a |
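
Conceptually, the harmonized output above amounts to renaming each source column to its target column and translating each value through its mapping. The sketch below reproduces that idea with plain `pandas` on two rows taken from the sample. It is not how `materialize_mapping()` is implemented. The `mapping_spec` structure and the `apply_mappings` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Two rows and two columns taken from the sample dataset shown earlier.
source = pd.DataFrame(
    {
        "Gender": ["Male", "Female"],
        "Path_Stage_Primary_Tumor_pT": ["pT3", "pT1a"],
    }
)

# Hypothetical mapping spec: source column -> (target column, value lookup).
mapping_spec = {
    "Gender": ("gender", {"Male": "male", "Female": "female"}),
    "Path_Stage_Primary_Tumor_pT": (
        "ajcc_pathologic_t",
        {"pT3": "T3", "pT1a": "T1a"},
    ),
}


def apply_mappings(df, spec):
    """Build a new DataFrame with renamed columns and translated values."""
    out = pd.DataFrame(index=df.index)
    for src_col, (tgt_col, lookup) in spec.items():
        # Series.map leaves NaN for values that are missing from the lookup.
        out[tgt_col] = df[src_col].map(lookup)
    return out


harmonized = apply_mappings(source, mapping_spec)
print(harmonized)
```

Values absent from a lookup become `NaN` in this sketch, which makes unmapped values easy to spot. Whether that is the desired behavior depends on the harmonization task.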
