## Purpose

This notebook identifies potential duplicate records within your data using vector similarity search.

## Prerequisites

*   A Databricks workspace with access to Databricks Vector Search.
*   A Model Serving endpoint serving an embedding model (e.g., `databricks-gte-large-en`).
*   Bronze tables containing the data you want to analyze, each with an appropriately defined primary key column (referenced as `primary_key` in the entity definitions in this notebook).

## Outputs

*   Creates temporary views named `<entity_name>_duplicate_candidates` for each entity in the provided data model. These views contain potential duplicate records and their similarity scores.

This notebook is the first step in identifying and deduplicating your data. Run it first to create the datasets that list your duplicate candidates.
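
To make the idea of "duplicate candidates with similarity scores" concrete, here is a minimal sketch of similarity-based candidate selection. It is not the notebook's implementation: the notebook relies on Databricks Vector Search, whereas this sketch uses brute-force cosine similarity over toy, hand-written embeddings. The function names, the toy vectors, and the `threshold` parameter are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def duplicate_candidates(embeddings, num_results=4, threshold=0.9):
    """For each record id, return up to num_results other ids scoring at least threshold.

    `embeddings` maps a primary key to its embedding vector. The threshold is a
    hypothetical cut-off chosen for this example only.
    """
    candidates = {}
    for key, vec in embeddings.items():
        scored = [
            (other, cosine_similarity(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != key
        ]
        scored = [(other, score) for other, score in scored if score >= threshold]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        candidates[key] = scored[:num_results]
    return candidates


# Toy embeddings: p1 and p2 point in nearly the same direction, p3 does not.
toy = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.99, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
}
result = duplicate_candidates(toy, num_results=2, threshold=0.9)
```

In this toy run, `p1` and `p2` are reported as candidates for each other and `p3` has none. A brute-force scan compares every pair of records, which is why an indexed vector search service is the practical choice for real tables.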

In [0]:
%run ./config/include/multiThreading

In [0]:
dbutils.widgets.text("embedding_model_endpoint_name", "databricks-gte-large-en", "Embedding Model")
dbutils.widgets.text("num_results", "4", "Num Results")
dbutils.widgets.text("catalog_name", "", "Catalog Name")

In [0]:
embedding_model_endpoint_name = dbutils.widgets.get("embedding_model_endpoint_name")
num_results = int(dbutils.widgets.get("num_results"))
# Strip whitespace so a blank-looking widget value is also rejected
catalog_name = dbutils.widgets.get("catalog_name").strip()
if not catalog_name:
    raise ValueError("Catalog name is required to run this notebook")

In [0]:
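# current_notebook_path is assumed to be defined by the multiThreading include run above.
# Swap only the final path segment; str.replace would also rewrite a parent folder with the same name.
notebookPath = current_notebook_path.rsplit("/", 1)[0] + "/config/detect_duplicates_main"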
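
Deriving the path of a sibling notebook is safest when only the final path segment is swapped. The sketch below shows one way to do that with `posixpath`. The helper name `sibling_notebook_path` and the example workspace paths are hypothetical and only serve to illustrate the behavior.

```python
import posixpath


def sibling_notebook_path(current_path, relative_target):
    """Return relative_target resolved against the folder that contains current_path."""
    parent = posixpath.dirname(current_path)
    return posixpath.join(parent, relative_target)


# Hypothetical workspace path, used only for illustration.
current = "/Users/me/dedupe/dedupe"
target = "config/detect_duplicates_main"

safe = sibling_notebook_path(current, target)

# For comparison: str.replace substitutes every occurrence of the notebook name,
# so a parent folder that shares the name is rewritten as well.
naive = current.replace(current.split("/")[-1], target)
```

Here `safe` is `/Users/me/dedupe/config/detect_duplicates_main`, while `naive` also rewrites the parent folder and yields a path that does not exist.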

In [0]:
# Example usage:  Hardcoded Entities
entities = [
    {"table_name": f"{catalog_name}.bronze.provider", "primary_key": "provider_id", "columns_to_exclude": ["provider_id"]},
    {"table_name": f"{catalog_name}.bronze.speciality", "primary_key": "speciality_id", "columns_to_exclude": ["speciality_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.license_and_credential", "primary_key": "license_credential_id", "columns_to_exclude": ["license_credential_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.affiliations", "primary_key": "affiliation_id", "columns_to_exclude": ["affiliation_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.address_and_location", "primary_key": "address_location_id", "columns_to_exclude": ["address_location_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.contact_information", "primary_key": "contact_id", "columns_to_exclude": ["contact_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.ProviderNetworkParticipation", "primary_key": "network_id", "columns_to_exclude": ["network_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.Employment_and_Contracts", "primary_key": "employment_contract_id", "columns_to_exclude": ["employment_contract_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.Education_and_Training", "primary_key": "education_training_id", "columns_to_exclude": ["education_training_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.Performance_and_QualityMetrics", "primary_key": "performance_metric_id", "columns_to_exclude": ["performance_metric_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.Identifiers", "primary_key": "identifier_id", "columns_to_exclude": ["identifier_id", "provider_id"]},
    {"table_name": f"{catalog_name}.bronze.Digital_Presence", "primary_key": "digital_presence_id", "columns_to_exclude": ["digital_presence_id", "provider_id"]}
]
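
When entities like these are handed to a child notebook, the values travel as strings, so a list such as `columns_to_exclude` has to be serialized on the way in and parsed on the way out. The sketch below shows two round trips that would work. How `detect_duplicates_main` actually parses its parameters is not shown in this notebook, so treat this as an assumption rather than a description of that notebook.

```python
import ast
import json

columns_to_exclude = ["speciality_id", "provider_id"]

# Option 1: Python repr of the list, parsed back with ast.literal_eval.
# literal_eval only evaluates literals, so it is safer than eval().
as_repr = str(columns_to_exclude)
parsed_from_repr = ast.literal_eval(as_repr)

# Option 2: JSON, which is language-neutral and easier to inspect in job logs.
as_json = json.dumps(columns_to_exclude)
parsed_from_json = json.loads(as_json)
```

Both options recover the original list. The sending and receiving notebooks must agree on the format, because a Python `repr` with single quotes is not valid JSON.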

In [0]:
# Build one NotebookData instance per entity. Parameters are passed as strings,
# so the columns_to_exclude list is sent as its string representation.
notebooks = [
    NotebookData(
        notebookPath,
        3600,  # timeout in seconds
        {
            "table_name": entity["table_name"],
            "columns_to_exclude": str(entity["columns_to_exclude"]),
            "primary_key": entity["primary_key"],
            "catalog_name": catalog_name,
            "num_results": str(num_results),
            "embedding_model_endpoint_name": embedding_model_endpoint_name,
        },
    )
    for entity in entities
]

# Number of notebooks to run in parallel
parallel_thread = 12

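try:
    res = parallel_notebooks(notebooks, parallel_thread)
    # Blocking call: wait up to 3600 seconds for each notebook to finish.
    result = [i.result(timeout=3600) for i in res]
    print(result)
except NameError as e:
    # A NameError here usually means the multiThreading include was not run first.
    print(e)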
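
The `NotebookData` class and `parallel_notebooks` helper come from `./config/include/multiThreading`, which is not shown here. As a rough illustration of the pattern, the sketch below fans jobs out over a thread pool and records a per-job outcome, so that one failing notebook does not hide the results of the others. The names `NotebookJob`, `run_in_parallel`, and `fake_runner` are hypothetical, and this is not the include's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class NotebookJob:
    """Hypothetical stand-in for NotebookData: path, timeout in seconds, parameters."""
    path: str
    timeout: int
    parameters: dict = field(default_factory=dict)


def run_in_parallel(jobs, max_workers, runner):
    """Run every job through runner(path, timeout, parameters) on a thread pool.

    Returns (path, status, detail) tuples in submission order. A failure in one
    job is recorded instead of aborting the whole batch.
    """
    outcomes = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(runner, job.path, job.timeout, job.parameters)
            for job in jobs
        ]
        for job, future in zip(jobs, futures):
            try:
                outcomes.append((job.path, "ok", future.result(timeout=job.timeout)))
            except Exception as exc:
                outcomes.append((job.path, "failed", repr(exc)))
    return outcomes


def fake_runner(path, timeout, parameters):
    """Test double. In Databricks the runner could instead be
    lambda path, timeout, params: dbutils.notebook.run(path, timeout, params)."""
    if parameters.get("table_name") == "bad":
        raise RuntimeError("boom")
    return f"done:{parameters['table_name']}"


jobs = [
    NotebookJob("nb", 60, {"table_name": "a"}),
    NotebookJob("nb", 60, {"table_name": "bad"}),
]
outcomes = run_in_parallel(jobs, max_workers=2, runner=fake_runner)
```

Passing the runner in as an argument keeps the sketch testable outside a Databricks workspace. Recording a status per job makes it easier to see which entity failed than a single printed exception would.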