# Getting Top-K Value Matches

In [1]:
import bdikit as bdi
import pandas as pd

In this example, we are mapping data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/) to the GDC format.

In [2]:
dataset = pd.read_csv("./datasets/dou.csv")
columns = [
    "Race",
    "Ethnicity",
    "FIGO_stage",
]

dataset[columns].head(10)

|   | Race | Ethnicity | FIGO_stage |
|---|------|-----------|------------|
| 0 | White | Not-Hispanic or Latino | IA |
| 1 | White | Not-Hispanic or Latino | IA |
| 2 | White | Not-Hispanic or Latino | IA |
| 3 |  |  |  |
| 4 | White | Not-Hispanic or Latino | IA |
| 5 | White | Not-Hispanic or Latino | IA |
| 6 | White | Not-Hispanic or Latino | IA |
| 7 | White | Not-Hispanic or Latino | IA |
| 8 | White | Not-Hispanic or Latino | IIIA |
| 9 | White | Not-Hispanic or Latino | IA |


We can pass a tuple `(source column, target column)` as the `column_mapping` parameter of `top_value_matches()` to retrieve the top-k value matches for that pair of columns.

In [3]:
column_mapping = ("FIGO_stage", "figo_stage")

value_mappings = bdi.top_value_matches(
    dataset,
    target="gdc",
    column_mapping=column_mapping,
    top_k=5,
)

In [4]:
bdi.view_value_matches(value_mappings)

**Source column:** FIGO_stage
**Target column:** figo_stage
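
To make the idea of "top-k value matches" concrete, here is a minimal, self-contained sketch that ranks candidate target values by plain string similarity and keeps the `k` best. It is not how `bdikit` computes its matches: the helper `top_k_matches`, the use of `difflib.SequenceMatcher`, and the short `candidates` list are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def top_k_matches(source_value, target_values, k=5):
    """Rank target values by string similarity to source_value and keep the k best.

    Hypothetical helper for illustration; bdikit's own matchers may work differently.
    """
    scored = [
        (target, SequenceMatcher(None, source_value.lower(), target.lower()).ratio())
        for target in target_values
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical target vocabulary, not taken from the GDC dictionary.
candidates = ["Stage IA", "Stage IB", "Stage IIIA", "Stage IV", "Not Reported"]

for target, score in top_k_matches("IIIA", candidates, k=3):
    print(f"{target}: {score:.2f}")
```

Each source value gets its own ranked list of candidates, which is why a `top_k` larger than 1 is useful: the best-scoring candidate is not always the correct one, and the remaining entries give a reviewer alternatives to choose from.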

We can also pass a `DataFrame` of column mappings, such as the one returned by `match_schema()`, as the `column_mapping` parameter of `top_value_matches()`:

In [5]:
column_mappings = bdi.match_schema(dataset[columns], target="gdc", method="coma")
column_mappings

|   | source | target |
|---|--------|--------|
| 0 | FIGO_stage | figo_stage |
| 1 | Ethnicity | ethnicity |
| 2 | Race | race |
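
The table above has one row per column pair, with a `source` and a `target` column. As a rough sketch, a table with the same layout can be built by hand with pandas, and each row read back as a `(source column, target column)` tuple. The name `manual_mappings` is hypothetical, and whether `top_value_matches()` accepts a hand-built frame exactly like this one is an assumption that should be checked against the `bdikit` documentation.

```python
import pandas as pd

# Hand-written column mappings, mirroring the `source`/`target` layout shown above.
# `manual_mappings` is a hypothetical name used only in this sketch.
manual_mappings = pd.DataFrame(
    {
        "source": ["FIGO_stage", "Ethnicity", "Race"],
        "target": ["figo_stage", "ethnicity", "race"],
    }
)

# Each row corresponds to one (source column, target column) tuple.
pairs = list(manual_mappings.itertuples(index=False, name=None))
print(pairs)
```

Seen this way, the `DataFrame` form is a batch version of the tuple form: one call covers several column pairs instead of one.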


In [6]:
value_mappings = bdi.top_value_matches(
    dataset,
    target="gdc",
    column_mapping=column_mappings,
    top_k=5,
)

In [7]:
bdi.view_value_matches(value_mappings)

**Source column:** FIGO_stage
**Target column:** figo_stage

**Source column:** Ethnicity
**Target column:** ethnicity

**Source column:** Race
**Target column:** race