# Getting Top-K Value Matches

This example shows how to find candidate matches between the values in a table column and the standard values defined in the [GDC Data Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/). We use data from the study published by Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/).

In [1]:
import bdikit as bdi
import pandas as pd

We start by loading the data file. We will focus on the columns shown below.

In [2]:
dataset = pd.read_csv("./datasets/dou.csv")
columns = [
    "Race",
    "Ethnicity",
    "FIGO_stage",
]

dataset[columns].head(10)

|   | Race | Ethnicity | FIGO_stage |
|---|------|-----------|------------|
| 0 | White | Not-Hispanic or Latino | IA |
| 1 | White | Not-Hispanic or Latino | IA |
| 2 | White | Not-Hispanic or Latino | IA |
| 3 |  |  |  |
| 4 | White | Not-Hispanic or Latino | IA |
| 5 | White | Not-Hispanic or Latino | IA |
| 6 | White | Not-Hispanic or Latino | IA |
| 7 | White | Not-Hispanic or Latino | IA |
| 8 | White | Not-Hispanic or Latino | IIIA |
| 9 | White | Not-Hispanic or Latino | IA |


We can pass a tuple `(source column, target column)` to the `attribute_matches` parameter of `rank_value_matches()`. Here, `top_k=5` asks for up to five candidate target values per source value.

In [3]:
column_mapping = ("FIGO_stage", "figo_stage")

value_mappings = bdi.rank_value_matches(
    dataset,
    target="gdc",
    attribute_matches=column_mapping,
    top_k=5
)
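To make the idea of top-k ranking concrete, here is a minimal sketch that scores candidate target values against one source value using Python's standard-library `difflib`. It is an illustration only: it is not how `bdikit` computes similarity, `rank_candidates` is a hypothetical helper, and the candidate list is copied from the output shown in this example, so the scores will differ from those reported by `rank_value_matches()`.

```python
from difflib import SequenceMatcher


def rank_candidates(source_value, target_values, top_k=5):
    """Return up to top_k (target, score) pairs, best match first.

    Hypothetical helper: uses difflib's character-level ratio as a
    stand-in similarity measure, not bdikit's actual matcher.
    """
    scored = [
        (target, SequenceMatcher(None, source_value.lower(), target.lower()).ratio())
        for target in target_values
    ]
    # Sort by similarity, highest first, and keep only the top_k entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Candidate target values taken from the figo_stage output in this example.
gdc_stages = ["Stage IIIC2", "Stage IC2", "Stage IIIC", "Stage IIC", "Stage IIIC1"]

ranked = rank_candidates("IIIC2", gdc_stages, top_k=3)
for target, score in ranked:
    print(f"{target}: {score:.3f}")
```

The sketch lowercases both strings before comparing them, which is an assumption made here so that differences in capitalization do not lower the score.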

As seen below, the function `view_value_matches()` helps us visualize the output of `rank_value_matches()`.

In [4]:
bdi.view_value_matches(value_mappings)

**Source attribute:** FIGO_stage<br>**Target attribute:** figo_stage

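|   | source_value | target_value | similarity |
|---|--------------|--------------|------------|
| 0 | IIIC2 | Stage IIIC2 | 0.89 |
| 1 | IIIC2 | Stage IC2 | 0.653 |
| 2 | IIIC2 | Stage IIIC | 0.646 |
| 3 | IIIC2 | Stage IIC | 0.537 |
| 4 | IIIC2 | Stage IIIC1 | 0.534 |
| 5 | IIIC1 | Stage IIIC1 | 0.89 |
| 6 | IIIC1 | Stage IC1 | 0.653 |
| 7 | IIIC1 | Stage IIIC | 0.646 |
| 8 | IIIC1 | Stage IIC | 0.537 |
| 9 | IIIC1 | Stage IIIC2 | 0.534 |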


We can also pass a `DataFrame` of column mappings to the `attribute_matches` parameter of `rank_value_matches()`.
Next, we show how the DataFrame returned by `match_schema()` can be used as that argument.
Let's start by creating a DataFrame with column mappings.


In [5]:
column_mappings = bdi.match_schema(dataset[columns], target="gdc", method="coma")
column_mappings

|   | source_attribute | target_attribute | similarity |
|---|------------------|------------------|------------|
| 0 | FIGO_stage | figo_stage | 0.730554 |
| 1 | Ethnicity | ethnicity | 0.728581 |
| 2 | Race | race | 0.714483 |


We now pass the `column_mappings` DataFrame as the `attribute_matches` argument of `rank_value_matches()`.

In [6]:
value_mappings = bdi.rank_value_matches(
    dataset,
    target="gdc",
    attribute_matches=column_mappings,
    top_k=5
)

In [7]:
bdi.view_value_matches(value_mappings)

**Source attribute:** Ethnicity<br>**Target attribute:** ethnicity

|   | source_value | target_value | similarity |
|---|--------------|--------------|------------|
| 0 | Hispanic or Latino | hispanic or latino | 1.0 |
| 1 | Hispanic or Latino | not hispanic or latino | 0.969 |
| 2 | Not reported | not reported | 1.0 |
| 3 | Not-Hispanic or Latino | not hispanic or latino | 0.936 |
| 4 | Not-Hispanic or Latino | hispanic or latino | 0.907 |
| 5 |  |  |  |


**Source attribute:** Race<br>**Target attribute:** race

|   | source_value | target_value | similarity |
|---|--------------|--------------|------------|
| 0 | White | white | 1.0 |
| 1 | White | white | 1.0 |
| 2 | Asian | asian | 1.0 |
| 3 | Asian | american indian or alaska native | 0.454 |
| 4 | Asian | native hawaiian or other pacific islander | 0.345 |
| 5 | Not Reported | not reported | 1.0 |
| 6 | Black or African American | black or african american | 1.0 |
| 7 | Black or African American | american indian or alaska native | 0.616 |
| 8 | Black or African American | native hawaiian or other pacific islander | 0.409 |
| 9 |  |  |  |


**Source attribute:** FIGO_stage<br>**Target attribute:** figo_stage

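|   | source_value | target_value | similarity |
|---|--------------|--------------|------------|
| 0 | IIIC2 | Stage IIIC2 | 0.89 |
| 1 | IIIC2 | Stage IC2 | 0.653 |
| 2 | IIIC2 | Stage IIIC | 0.646 |
| 3 | IIIC2 | Stage IIC | 0.537 |
| 4 | IIIC2 | Stage IIIC1 | 0.534 |
| 5 | IIIC1 | Stage IIIC1 | 0.89 |
| 6 | IIIC1 | Stage IC1 | 0.653 |
| 7 | IIIC1 | Stage IIIC | 0.646 |
| 8 | IIIC1 | Stage IIC | 0.537 |
| 9 | IIIC1 | Stage IIIC2 | 0.534 |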
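
Once the ranked candidates have been reviewed, a common next step is to keep one target value per source value. The sketch below shows one possible way to do that in plain Python, using a few rows copied from the FIGO_stage output above. The `best_matches` helper and the `min_similarity` cutoff of 0.8 are assumptions made for illustration and are not part of `bdikit`.

```python
# Rows copied from the FIGO_stage output above:
# (source_value, target_value, similarity)
ranked_rows = [
    ("IIIC2", "Stage IIIC2", 0.89),
    ("IIIC2", "Stage IC2", 0.653),
    ("IIIC2", "Stage IIIC", 0.646),
    ("IIIC1", "Stage IIIC1", 0.89),
    ("IIIC1", "Stage IC1", 0.653),
    ("IIIC1", "Stage IIIC", 0.646),
]


def best_matches(rows, min_similarity=0.8):
    """Keep the highest-scoring target per source value.

    Hypothetical helper: candidates scoring below min_similarity are
    ignored, so a source value with no strong match is left unmapped.
    """
    best = {}
    for source, target, score in rows:
        if score < min_similarity:
            continue
        if source not in best or score > best[source][1]:
            best[source] = (target, score)
    return {source: target for source, (target, _) in best.items()}


value_map = best_matches(ranked_rows)
print(value_map)
# {'IIIC2': 'Stage IIIC2', 'IIIC1': 'Stage IIIC1'}
```

The cutoff is a judgment call: a higher value leaves more source values unmapped for manual review, while a lower value accepts weaker matches automatically.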
