# Getting Top-K Value Matches

In this example, we show how to find candidate matches between the values in a table column and the standard values defined in the [GDC Data Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/). We use data from the study published by [Dou et al.](https://pubmed.ncbi.nlm.nih.gov/37567170/).

In [1]:
import bdikit as bdi
import pandas as pd

We start by loading the data file. We will focus on the columns shown below.

In [2]:
dataset = pd.read_csv("./datasets/dou.csv")
columns = [
    "Race",
    "Ethnicity",
    "FIGO_stage",
]

dataset[columns].head(10)

Unnamed: 0,Race,Ethnicity,FIGO_stage
0,White,Not-Hispanic or Latino,IA
1,White,Not-Hispanic or Latino,IA
2,White,Not-Hispanic or Latino,IA
3,,,
4,White,Not-Hispanic or Latino,IA
5,White,Not-Hispanic or Latino,IA
6,White,Not-Hispanic or Latino,IA
7,White,Not-Hispanic or Latino,IA
8,White,Not-Hispanic or Latino,IIIA
9,White,Not-Hispanic or Latino,IA
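
Before matching, it can help to list the distinct values a source column holds, since the outputs later in this example contain one table per distinct source value, including an empty one. The following is a minimal sketch that uses a small stand-in DataFrame built from the ten rows shown above rather than the actual `dou.csv` file; the name `sample` is introduced here for illustration only.

```python
import pandas as pd

# Stand-in for the first ten rows shown above (not the full dataset).
sample = pd.DataFrame(
    {
        "FIGO_stage": ["IA", "IA", "IA", None, "IA", "IA", "IA", "IA", "IIIA", "IA"],
    }
)

# Count each distinct value, keeping missing entries as their own group.
stage_counts = sample["FIGO_stage"].value_counts(dropna=False)
print(stage_counts)
```

With the real data, the same call on `dataset["FIGO_stage"]` would list every stage label that needs a match.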


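To match the values of a single column, we pass a `(source column, target column)` tuple as the `column_mapping` parameter of `top_value_matches()`. The `top_k` parameter sets the maximum number of candidate matches returned for each source value.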

In [3]:
column_mapping = ("FIGO_stage", "figo_stage")

value_mappings = bdi.top_value_matches(
    dataset,
    target="gdc",
    column_mapping=column_mapping,
    top_k=5
)

The function `view_value_matches()` displays the output of `top_value_matches()`, with one table of candidates per source value.

In [4]:
bdi.view_value_matches(value_mappings)

**Source column:** FIGO_stage<br>**Target column:** figo_stage

Unnamed: 0,source,target,similarity
0,IA,Stage IA,0.586
1,IA,Stage IIA,0.563
2,IA,Stage IIIA,0.527
3,IA,Stage IIIAi,0.467
4,IA,Stage IIIA2,0.432


Unnamed: 0,source,target,similarity
0,IB,Stage IB,0.649
1,IB,Stage IIB,0.571
2,IB,Stage IIIB,0.528
3,IB,Stage IB2,0.441
4,IB,Stage IB1,0.441


Unnamed: 0,source,target,similarity
0,II,Stage III,0.687
1,II,Stage IIIAii,0.635
2,II,Stage IIIA,0.598
3,II,Stage IIIC,0.58
4,II,Stage IIIAi,0.566


Unnamed: 0,source,target,similarity
0,IIIA,Stage IIIA,0.822
1,IIIA,Stage IIIAii,0.726
2,IIIA,Stage IIIAi,0.716
3,IIIA,Stage IIIA2,0.674
4,IIIA,Stage IIIA1,0.674


Unnamed: 0,source,target,similarity
0,IIIB,Stage IIIB,0.849
1,IIIB,Stage IIB,0.728
2,IIIB,Stage III,0.545
3,IIIB,Stage IIIA,0.475
4,IIIB,Stage IIIAii,0.471


Unnamed: 0,source,target,similarity
0,IIIC1,Stage IIIC1,0.889
1,IIIC1,Stage IC1,0.651
2,IIIC1,Stage IIIC,0.647
3,IIIC1,Stage IIC,0.538
4,IIIC1,Stage IIIC2,0.536


Unnamed: 0,source,target,similarity
0,IIIC2,Stage IIIC2,0.889
1,IIIC2,Stage IC2,0.651
2,IIIC2,Stage IIIC,0.647
3,IIIC2,Stage IIC,0.538
4,IIIC2,Stage IIIC1,0.536


Unnamed: 0,source,target,similarity
0,IVB,Stage IVB,0.854
1,IVB,Stage IV,0.448
2,IVB,Stage IVA,0.325


Unnamed: 0,source,target,similarity
0,,Unknown,0.35
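
The top-ranked candidate is not always the intended one. For the source value `II`, none of the five candidates listed above is `Stage II`, which is why reviewing several candidates is useful. The sketch below shows one possible way to pick the best candidate per source value and flag low-scoring ones for review. It works on a plain pandas DataFrame copied by hand from the tables above, not on the object returned by `top_value_matches()`, and the name `REVIEW_THRESHOLD` and its value are hypothetical.

```python
import pandas as pd

# Candidate matches copied from the output above for two source values.
candidates = pd.DataFrame(
    {
        "source": ["II"] * 5 + ["IIIA"] * 5,
        "target": [
            "Stage III", "Stage IIIAii", "Stage IIIA", "Stage IIIC", "Stage IIIAi",
            "Stage IIIA", "Stage IIIAii", "Stage IIIAi", "Stage IIIA2", "Stage IIIA1",
        ],
        "similarity": [0.687, 0.635, 0.598, 0.58, 0.566, 0.822, 0.726, 0.716, 0.674, 0.674],
    }
)

# Keep the highest-scoring candidate for each source value.
best = (
    candidates.sort_values("similarity", ascending=False)
    .drop_duplicates("source")
    .set_index("source")
)

# Hypothetical cutoff: flag best matches below this score for manual review.
REVIEW_THRESHOLD = 0.7
best["needs_review"] = best["similarity"] < REVIEW_THRESHOLD
print(best)
```

Here `II` is flagged for review because its best score is below the cutoff, while `IIIA` is not. The cutoff is an arbitrary choice for this illustration and would need tuning on real data.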


We can also pass a `DataFrame` of column mappings to the `column_mapping` parameter of `top_value_matches()`.
Next, we show how the DataFrame returned by `match_schema()` can be used as this argument.
Let's start by creating a DataFrame with column mappings.


In [5]:
column_mappings = bdi.match_schema(dataset[columns], target="gdc", method="coma")
column_mappings

Unnamed: 0,source,target
0,FIGO_stage,figo_stage
1,Ethnicity,ethnicity
2,Race,race
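
Each row of this DataFrame carries the same information as the `(source column, target column)` tuple used earlier. As a minimal sketch of that relationship, the code below rebuilds the table by hand with pandas (the name `column_mappings_df` is introduced here for illustration) and turns each row into a tuple.

```python
import pandas as pd

# Stand-in for the column mappings shown above.
column_mappings_df = pd.DataFrame(
    {
        "source": ["FIGO_stage", "Ethnicity", "Race"],
        "target": ["figo_stage", "ethnicity", "race"],
    }
)

# One (source column, target column) tuple per row.
pairs = list(
    column_mappings_df[["source", "target"]].itertuples(index=False, name=None)
)
print(pairs)
# [('FIGO_stage', 'figo_stage'), ('Ethnicity', 'ethnicity'), ('Race', 'race')]
```

The first tuple is identical to the `column_mapping` value passed in the single-column example.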


Now we pass the `column_mappings` DataFrame as the `column_mapping` argument of `top_value_matches()`.

In [6]:
value_mappings = bdi.top_value_matches(
    dataset,
    target="gdc",
    column_mapping=column_mappings,
    top_k=5
)

In [7]:
bdi.view_value_matches(value_mappings)

**Source column:** FIGO_stage<br>**Target column:** figo_stage

Unnamed: 0,source,target,similarity
0,IA,Stage IA,0.586
1,IA,Stage IIA,0.563
2,IA,Stage IIIA,0.527
3,IA,Stage IIIAi,0.467
4,IA,Stage IIIA2,0.432


Unnamed: 0,source,target,similarity
0,IB,Stage IB,0.649
1,IB,Stage IIB,0.571
2,IB,Stage IIIB,0.528
3,IB,Stage IB2,0.441
4,IB,Stage IB1,0.441


Unnamed: 0,source,target,similarity
0,II,Stage III,0.687
1,II,Stage IIIAii,0.635
2,II,Stage IIIA,0.598
3,II,Stage IIIC,0.58
4,II,Stage IIIAi,0.566


Unnamed: 0,source,target,similarity
0,IIIA,Stage IIIA,0.822
1,IIIA,Stage IIIAii,0.726
2,IIIA,Stage IIIAi,0.716
3,IIIA,Stage IIIA2,0.674
4,IIIA,Stage IIIA1,0.674


Unnamed: 0,source,target,similarity
0,IIIB,Stage IIIB,0.849
1,IIIB,Stage IIB,0.728
2,IIIB,Stage III,0.545
3,IIIB,Stage IIIA,0.475
4,IIIB,Stage IIIAii,0.471


Unnamed: 0,source,target,similarity
0,IIIC1,Stage IIIC1,0.889
1,IIIC1,Stage IC1,0.651
2,IIIC1,Stage IIIC,0.647
3,IIIC1,Stage IIC,0.538
4,IIIC1,Stage IIIC2,0.536


Unnamed: 0,source,target,similarity
0,IIIC2,Stage IIIC2,0.889
1,IIIC2,Stage IC2,0.651
2,IIIC2,Stage IIIC,0.647
3,IIIC2,Stage IIC,0.538
4,IIIC2,Stage IIIC1,0.536


Unnamed: 0,source,target,similarity
0,IVB,Stage IVB,0.854
1,IVB,Stage IV,0.448
2,IVB,Stage IVA,0.325


Unnamed: 0,source,target,similarity
0,,Unknown,0.35


**Source column:** Ethnicity<br>**Target column:** ethnicity

Unnamed: 0,source,target,similarity
0,Hispanic or Latino,hispanic or latino,1.0
1,Hispanic or Latino,not hispanic or latino,0.956


Unnamed: 0,source,target,similarity
0,Not reported,,


Unnamed: 0,source,target,similarity
0,Not-Hispanic or Latino,not hispanic or latino,0.935
1,Not-Hispanic or Latino,hispanic or latino,0.894


Unnamed: 0,source,target,similarity
0,,,


**Source column:** Race<br>**Target column:** race

Unnamed: 0,source,target,similarity
0,White,white,1.0


Unnamed: 0,source,target,similarity
0,Asian,asian,1.0
1,Asian,american indian or alaska native,0.438
2,Asian,native hawaiian or other pacific islander,0.329


Unnamed: 0,source,target,similarity
0,Black or African American,black or african american,1.0
1,Black or African American,american indian or alaska native,0.605
2,Black or African American,native hawaiian or other pacific islander,0.399


Unnamed: 0,source,target,similarity
0,Not Reported,not reported,1.0


Unnamed: 0,source,target,similarity
0,,american indian or alaska native,0.359
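
Some source values have two candidates with very close scores, such as `Hispanic or Latino` (1.0 versus 0.956). One way to spot these cases is to compare the best and second-best scores for each source value. The sketch below does this on rows copied by hand from the tables above. It does not use the object returned by `top_value_matches()`, and the helper `top_two_margin` and the constant `MARGIN_THRESHOLD` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Candidate matches copied from the Ethnicity and Race outputs above.
candidates = pd.DataFrame(
    {
        "source": [
            "Hispanic or Latino", "Hispanic or Latino",
            "Not-Hispanic or Latino", "Not-Hispanic or Latino",
            "Asian", "Asian", "Asian",
        ],
        "target": [
            "hispanic or latino", "not hispanic or latino",
            "not hispanic or latino", "hispanic or latino",
            "asian", "american indian or alaska native",
            "native hawaiian or other pacific islander",
        ],
        "similarity": [1.0, 0.956, 0.935, 0.894, 1.0, 0.438, 0.329],
    }
)


def top_two_margin(scores: pd.Series) -> float:
    """Difference between the best and second-best similarity scores."""
    top = scores.nlargest(2).tolist()
    return top[0] - top[1] if len(top) == 2 else float("nan")


# Margin between the two best candidates of each source value.
margins = candidates.groupby("source")["similarity"].apply(top_two_margin)

# Hypothetical cutoff: a small margin means the top two candidates are hard to tell apart.
MARGIN_THRESHOLD = 0.1
ambiguous = margins[margins < MARGIN_THRESHOLD].index.tolist()
print(ambiguous)
# ['Hispanic or Latino', 'Not-Hispanic or Latino']
```

Both ethnicity values are flagged because their two candidates differ only by the word "not", while `Asian` has a clear winner.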
