<a href="https://colab.research.google.com/github/daya448/deduplication/blob/main/Deduplication_using_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DeDupify**
### *Identify the duplicate person in the system.*

This notebook outlines the process of setting up a deduplication system using Elasticsearch and an LLM. It doesn't cover every name-matching use case, since these vary with datasets and scenarios, but it provides a practical overview of how a deduplication system operates.

At a high level, we will look at how to:
- Connect to Elasticsearch
- Create an index template with the necessary phonetic analyzers
- Generate context with suitable LLM prompts
- Integrate the Elasticsearch backend with a Streamlit UI
- Start the Streamlit UI

### Setup

Let's start by making sure all required plugins and Elasticsearch clients are installed. We'll also use `getpass` so that the IDs and keys for our Elasticsearch instance can be entered securely instead of being typed into the notebook.

In [None]:
# Install the Elasticsearch client, pandas, the LangChain community integrations,
# Streamlit, and localtunnel
!pip install elasticsearch pandas
!pip install -U langchain-community
!pip install -U streamlit
!npm install -g localtunnel

from getpass import getpass
from elasticsearch import Elasticsearch, helpers


## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=search&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

You will also need an **API key** to access your deployment. You can [create a new API key](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key) from the `Stack Management -> API keys` menu in Kibana. Be sure to copy your key to a safe place as soon as it is created; it is displayed only at creation time.

In [None]:
import os
from getpass import getpass

# Prompt for credentials instead of hardcoding them in the notebook
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Export the values so that processes started from this notebook can read them
os.environ["ELASTIC_CLOUD_ID"] = ELASTIC_CLOUD_ID
os.environ["ELASTIC_API_KEY"] = ELASTIC_API_KEY

# Create Elasticsearch client
es = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY
)
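
If you rerun the notebook often, typing credentials each time gets tedious. One possible variation, sketched below, reads a value from an environment variable and falls back to an interactive prompt only when it is missing. The helper name `read_secret` is hypothetical and not part of the Elasticsearch client.

```python
import os
from getpass import getpass


def read_secret(env_name, prompt):
    """Return the value of env_name, or prompt for it if it is not set."""
    value = os.environ.get(env_name)
    if value:
        return value
    # getpass hides the typed value, so it never shows up in cell output
    return getpass(prompt)


# Example usage (left as a comment so nothing prompts unexpectedly):
# ELASTIC_CLOUD_ID = read_secret("ELASTIC_CLOUD_ID", "Elastic Cloud ID: ")
# ELASTIC_API_KEY = read_secret("ELASTIC_API_KEY", "Elastic API Key: ")
```

The environment variable names are a choice made for this sketch; use whatever names fit your setup.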

### Create an Index template with necessary phonetic analysers

*Important: at this stage it is essential to [follow this documentation](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html) and install the phonetic analysis plugin. If the Elasticsearch cluster runs on Elasticsearch Service (ESS), [follow this documentation](https://www.elastic.co/guide/en/cloud/current/ec-adding-elastic-plugins.html) to enable the plugin.*

In this demo we use the **soundex** algorithm, which is better suited to American names and provides better matches. You may experiment with other encoders, such as `double_metaphone` or `caverphone1`, depending on the dataset you upload.
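
To make the phonetic idea concrete, here is a minimal pure-Python sketch of classic American Soundex. It is illustrative only: the Elasticsearch phonetic plugin ships its own encoder, and the function name `soundex` below is an assumption of this sketch, not an Elasticsearch API.

```python
def soundex(name):
    """Encode a name as a 4-character American Soundex code (illustrative sketch)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")        # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]         # pad or truncate to four characters


print(soundex("Smith"), soundex("Smyth"))   # both encode to S530
print(soundex("Bob"), soundex("Rob"))       # first letters differ, so codes differ
```

Spelling variants such as "Smith" and "Smyth" collapse to the same code, while "Bob" and "Rob" stay distinct because Soundex always preserves the first letter.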

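Below is the index template with the mappings for the "name" and "address" fields. Notice how the analyzers differ per field: `name` is analyzed with double metaphone, its `name.soundex` sub-field with soundex, and `address` uses the default analyzer.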


In [None]:
# Define the index template
index_template = {
        "index_patterns": [
            "names-*"
        ],
        "settings": {
            "index": {
                "analysis": {
                    "filter": {
                        "my_dmetaphone_filter": {
                            "replace": "false",
                            "type": "phonetic",
                            "encoder": "double_metaphone"
                        },
                        "my_soundex": {
                            "type": "phonetic",
                            "encoder": "soundex"
                        }
                    },
                    "analyzer": {
                        "name_analyzer_soundex": {
                            "filter": [
                                "lowercase",
                                "my_soundex"
                            ],
                            "tokenizer": "standard"
                        },
                        "name_analyzer_dmetaphone": {
                            "filter": [
                                "lowercase",
                                "my_dmetaphone_filter"
                            ],
                            "tokenizer": "standard"
                        }
                    }
                },
                "number_of_shards": "1",
                "number_of_replicas": "1"
            }
        },
        "mappings": {
            "properties": {
                "address": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    }
                },
                "name": {
                    "analyzer": "name_analyzer_dmetaphone",
                    "type": "text",
                    "fields": {
                        "soundex": {
                            "analyzer": "name_analyzer_soundex",
                            "type": "text"
                        }
                    }
                }
            }
        }
    }
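
Because the mappings refer to analyzers by name, a typo only surfaces when an index is created from the template. A small sanity check along the following lines can catch that earlier. The helper names `referenced_analyzers` and `missing_analyzers` are hypothetical, and the check assumes only custom analyzers are used (built-in analyzers such as `standard` would be reported as missing).

```python
def referenced_analyzers(properties):
    """Collect analyzer names used by fields, sub-fields, and nested objects."""
    found = set()
    for field in properties.values():
        if "analyzer" in field:
            found.add(field["analyzer"])
        found |= referenced_analyzers(field.get("fields", {}))
        found |= referenced_analyzers(field.get("properties", {}))
    return found


def missing_analyzers(template):
    """Return analyzers that the mappings use but the settings do not define."""
    defined = set(template["settings"]["index"]["analysis"].get("analyzer", {}))
    used = referenced_analyzers(template["mappings"]["properties"])
    return used - defined


# Deliberately incomplete example: the soundex analyzer is referenced but not defined
example_template = {
    "settings": {"index": {"analysis": {"analyzer": {
        "name_analyzer_dmetaphone": {"tokenizer": "standard", "filter": ["lowercase"]}
    }}}},
    "mappings": {"properties": {
        "name": {
            "type": "text",
            "analyzer": "name_analyzer_dmetaphone",
            "fields": {"soundex": {"type": "text", "analyzer": "name_analyzer_soundex"}},
        }
    }},
}

print(missing_analyzers(example_template))
```

Calling `missing_analyzers(index_template)` on the template above should return an empty set if every referenced analyzer is defined.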

In [None]:
# Add the index template to Elasticsearch.
# put_index_template creates a composable template; the legacy put_template API is deprecated.
try:
    response = es.indices.put_index_template(
        name="name-search",
        index_patterns=index_template["index_patterns"],
        template={
            "settings": index_template["settings"],
            "mappings": index_template["mappings"],
        },
    )
    print("Index template created successfully:", response)
except Exception as e:
    print("Error creating index template:", e)

Index template created successfully: {'acknowledged': True}


### Upload Names Dataset and Create Index

The dataset has 101 records and can be found [here](https://raw.githubusercontent.com/daya448/deduplication/refs/heads/main/similar_names_dataset.csv). If you intend to use your own dataset, make sure it has "name" and "address" fields.

To mimic the use case, the dataset was generated with similar-sounding names, spelling variations, different names sharing the same address, and other nuances.


1. **Misspelled Names**: It's a classic case of "Who's who?" One day it's "John Smith," the next day it's "Jon Smith," and if he's feeling extra fancy, it's "Jonathon Smith." Then there's the eternal enigma of Shawn, Shaun, Sean, and Shon, because why settle for one spelling when you can confuse everyone?
2. **Address Variations**: Is it "123 Maple Street" or "123 Maple St"? Or maybe it's "123 Maple Avenue"? Either way, the poor delivery guy is crying in the corner trying to figure it out.
3. **Different Application Channels**: Customers are multitasking pros! They're out here applying for a vehicle loan, a house loan, and a personal loan, all at the same time, leaving us with a treasure trove of records for the same person. It's like they're building a loan portfolio for fun.
4. **Twin Trouble**: Meet Bob and Rob Brown, twin brothers living at the same address, 789 Pine Rd, Brisbane. Sure, their names sound like a sitcom duo, but the system has to crack the case to make sure Bob isn't Rob and Rob isn't Bob, because mixing them up would be an identity crisis waiting to happen!
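
If you bring your own dataset, it may help to confirm the required columns before uploading anything. The sketch below checks a CSV header using only the standard library; `REQUIRED_COLUMNS` and `missing_columns` are names introduced here for illustration.

```python
import csv
import io

# The index template maps exactly these two fields
REQUIRED_COLUMNS = {"name", "address"}


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return REQUIRED_COLUMNS - {column.strip() for column in header}


sample = "name,address,dob\nJohn Smith,\"123 Maple Street, Sydney\",1980-05-15\n"
print(missing_columns(sample))                  # empty set: both columns present
print(missing_columns("full_name,address\n"))   # reports the missing "name" column
```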


In [None]:
# Import required libraries
import pandas as pd
import json

# Load CSV data from a file (Upload the file first in Colab)
from google.colab import files
uploaded = files.upload()  # This will allow you to upload your CSV file

# Specify the filename here (if only one file is uploaded)
csv_filename = list(uploaded.keys())[0]

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_filename)

# Check the CSV content (first few rows)
print("CSV Content:")
print(df.head())

# Define the index name where the data will be ingested
INDEX_NAME = 'names-search'

# Function to create bulk data from the DataFrame
def generate_bulk_data(df):
    for index, row in df.iterrows():
        # Create an action for each document
        yield {
            "_index": INDEX_NAME,
            "_source": row.to_dict()  # Convert the row to dictionary
        }
# Check if the index already exists
if es.indices.exists(index=INDEX_NAME):
    # Delete the index if it exists
    es.indices.delete(index=INDEX_NAME)
    print(f"Index '{INDEX_NAME}' deleted.")

# Use the bulk API to ingest all documents
try:
    response = helpers.bulk(es, generate_bulk_data(df))
    print("Data ingestion complete!")
    print(f"Indexed {response[0]} documents successfully.")
except Exception as e:
    print("Error during bulk ingestion:", e)
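
One thing to watch for with your own data: pandas represents empty CSV cells as `NaN`, which is not valid JSON and may cause the bulk request to be rejected. A minimal sketch of a cleanup step, assuming you are happy to drop missing fields from a document entirely, could look like this. The helper name `drop_missing` is hypothetical.

```python
import math


def drop_missing(record):
    """Return a copy of record without NaN values (pandas' marker for empty cells)."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


row = {"name": "Jane Doe", "address": float("nan"), "dob": "1990-07-10"}
print(drop_missing(row))
```

In `generate_bulk_data`, this would be applied as `"_source": drop_missing(row.to_dict())`.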


#### Verify If Indexed Data Is Searchable

After indexing the dataset, we can confirm that it is searchable by running a simple **match_all** query in Elasticsearch and checking that records are returned.

In [None]:
# Elasticsearch query to fetch all data
query = {
    "query": {
        "match_all": {}
    }
}
INDEX_NAME = 'names-search'
# Execute the query
try:
    response = es.search(index=INDEX_NAME, body=query)
    # Print the response
    print("Query Response:")
    for hit in response['hits']['hits']:
        print(hit)
except Exception as e:
    print(f"Error querying Elasticsearch: {e}")

Query Response:
{'_index': 'names-search', '_id': '3psNJZQBZ1zH6Rk6mmYX', '_score': 1.0, '_source': {'name': 'John Smith', 'address': '123 Maple Street, Sydney', 'dob': '1980-05-15'}}
{'_index': 'names-search', '_id': '35sNJZQBZ1zH6Rk6mmYX', '_score': 1.0, '_source': {'name': 'Jon Smyth', 'address': '123 Maple St., Sydney', 'dob': '1980-05-15'}}
{'_index': 'names-search', '_id': '4JsNJZQBZ1zH6Rk6mmYX', '_score': 1.0, '_source': {'name': 'Jonathan Smithe', 'address': '123 Maple Street, Syd.', 'dob': '1980-05-15'}}
{'_index': 'names-search', '_id': '4ZsNJZQBZ1zH6Rk6mmYX', '_score': 1.0, '_source': {'name': 'Jane Doe', 'address': '456 Oak Avenue, Melbourne', 'dob': '1990-07-10'}}
{'_index': 'names-search', '_id': '4psNJZQBZ1zH6Rk6mmYX', '_score': 1.0, '_source': {'name': 'Janet Doe', 'address': '456 Oak Ave, Melbourne', 'dob': '1990-07-10'}}
{'_index': 'names-search', '_id': '45sNJZQBZ1zH6Rk6mmYX', '_score': 1.0, '_source': {'name': 'Jayne D', 'address': '456 O. Avenue, Melb.', 'dob': '19

### Integrate Elasticsearch backend with Streamlit UI.

Now that the data is indexed and searchable, the remaining steps are:

1. **Prepare the LLM** - In its most basic usage, the LLM is needed here to estimate the match percentage between names.
2. **Build Elasticsearch Query** - Notice how the query is different this time: the name field is matched with a `match_phrase` query and the address field with a `match` query.
3. **Context** - Search Elasticsearch and build context for the LLM.
4. **Display** - Consume the response and present it with duplicate tags.
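
As a rough illustration of the context step, the search hits can be flattened into one "name | address" line per candidate before being placed in the prompt. This is a sketch under assumptions: `build_context` is a hypothetical helper, and the line format is one reasonable choice rather than something the LLM requires.

```python
def build_context(hits):
    """Turn Elasticsearch hits into one 'name | address' line per candidate."""
    lines = []
    for hit in hits:
        source = hit.get("_source", {})
        lines.append(f"{source.get('name', '')} | {source.get('address', '')}")
    return "\n".join(lines)


# Hits shaped like the match_all output shown earlier
sample_hits = [
    {"_source": {"name": "John Smith", "address": "123 Maple Street, Sydney"}},
    {"_source": {"name": "Jon Smyth", "address": "123 Maple St., Sydney"}},
]
print(build_context(sample_hits))
```

Sending only the fields the prompt compares keeps the context short and avoids passing document IDs and scores to the model.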



In [None]:
%%writefile app.py

import streamlit as st
import json
import os
from elasticsearch import Elasticsearch
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import AzureChatOpenAI
from langchain.chains import LLMChain

# Azure OpenAI settings. The API key is read from the AZURE_OPENAI_API_KEY
# environment variable instead of being hardcoded in the source.
os.environ.setdefault("AZURE_OPENAI_ENDPOINT", "https://carlo-gpt-au.openai.azure.com/")
os.environ.setdefault("AZURE_OPENAI_API_VERSION", "2024-08-01-preview")  # Adjust to your API version

# Connect to your Elasticsearch cluster. Credentials are read from the environment;
# ELASTIC_CLOUD_ID and ELASTIC_API_KEY are variable names chosen for this notebook,
# so set them before starting the app.
CLOUD_ID = os.environ["ELASTIC_CLOUD_ID"]
API_KEY = os.environ["ELASTIC_API_KEY"]

# Create Elasticsearch client
es = Elasticsearch(
    cloud_id=CLOUD_ID,
    api_key=API_KEY
)

# Initialize the AzureChatOpenAI model
model = AzureChatOpenAI(
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    deployment_name="carlo-gpt4o",
    model="gpt-4o",
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"]
)


# Function to query Elasticsearch
def query_elasticsearch(index, input_name, input_address, size=5):
    # Require a phrase match on the name and a full-text match on the address
    response = es.search(index=index, body={
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"name": input_name}},
                    {"match": {"address": input_address}}
                ]
            }
        },
        "size": size
    })
    return response['hits']['hits']

# Function to create the prompt and send to OpenAI API
def check_duplicates(search_name,input_address,response_names):
    # Create prompt template
    prompt_template = PromptTemplate(
        template="""
        Compare the Input Name, Input Address against each response name and address and also generate a match percentage for each.

        Input Name: {search_name}
        Input Address: {input_address}
        Response from Elasticsearch with Name and its address:
        {response_names}

        Please format your response in a table format with columns for Response Name,Response Address, Match Percentage, and Duplicate Status.
        If the Match percentage is above 80% consider it as duplicate.
        Short Explanation of Why it is a match.
        Always use Jaro-Winkler algorithm for name similarity comparison.
        Sort the results in descending order of Match Percentage.
        Turn off preamble and only provide the table.
        """,
        input_variables=["search_name","input_address", "response_names"],
    )

    # Prepare prompt input
    prompt_input = {
        "search_name": search_name,
        "input_address": input_address,
        "response_names": response_names
    }

    # Create the LLM chain
    chain = LLMChain(llm=model, prompt=prompt_template)

    # Get the response from OpenAI
    response = chain.run(prompt_input)
    return response

# Main function for Streamlit app
def main():
    st.set_page_config(page_title="Duplicate Detection", layout="centered")

    # Custom CSS for styling
    st.markdown("""
        <style>
            body { font-family: 'Arial', sans-serif; color: #333; }
            .stTextInput input { background-color: #f0f8ff; padding: 10px; border-radius: 5px; }
            .stButton button { background-color: #4CAF50; color: white; border-radius: 5px; }
            .stButton button:hover { background-color: #45a049; }
            .response-table th, .response-table td { padding: 10px; border: 1px solid #ddd; }
            .response-table th { background-color: #f4f4f4; }
            .response-table td { text-align: center; }
        </style>
    """, unsafe_allow_html=True)

    st.title("🔍 Duplicate Detection")
    st.write("Enter the name and address to search for potential duplicates in the database.")
    # Input fields for the name and address to search for
    input_name = st.text_input("Search Name", placeholder="Enter the name you want to search for...")
    input_address = st.text_input("Enter Address", placeholder="Enter the address")

    if st.button("Search 🔍"):
        if input_name and input_address:
            index_name = "names-search"  # Replace with your index name


            # Query Elasticsearch for potential duplicates
            es_response = query_elasticsearch(index_name,input_name,input_address)

            # Build context from results
            if es_response:
                # Collect the name and address of each hit, one per line
                response_names = "\n".join(
                    f"{doc['_source'].get('name', '')} | {doc['_source'].get('address', '')}"
                    for doc in es_response
                )

                # Send the details to the LLM for further processing
                response = check_duplicates(input_name, input_address, response_names)

                # Display the response on the UI
                st.write("### Results Comparison from Dataset")

                # The prompt asks for a table, so the reply is normally Markdown;
                # if the model returned JSON instead, show it as a table
                try:
                    st.table(json.loads(response))
                except json.JSONDecodeError:
                    st.markdown(response)
                    st.write("Used Jaro-Winkler algorithm for name similarity comparison")
            else:
                st.write("❌ No potential duplicates found.")
        else:
            st.write("⚠️ Please enter both the name and address to search.")


if __name__ == "__main__":
    main()

Overwriting app.py
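
The prompt asks the model to apply Jaro-Winkler, but a language model may approximate the score rather than compute it. For cross-checking individual pairs, a minimal pure-Python sketch of standard Jaro-Winkler (prefix scale 0.1, common prefix capped at four characters) is shown below. The function names `jaro` and `jaro_winkler` are introduced here for illustration and are not part of `app.py`.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for strings that share a prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))   # textbook pair, about 0.9611
print(jaro_winkler("john smith", "jon smyth") > 0.8)
```

Lowercasing both strings before comparison, as in the second call, avoids penalising case differences; whether that is appropriate depends on your data.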


### Start UI
Since the tool runs on Colab, we use localtunnel to access the UI from a local browser. The cell below prints the public IP address of the Colab instance; copy and paste it into the localtunnel web page when prompted.

In [None]:
# Start the Streamlit app
!streamlit run app.py &>/content/logs.txt & npx localtunnel --port 8501 --subdomain deduplication & curl ipv4.icanhazip.com
#!streamlit run /usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py



34.121.162.7
your url is: https://deduplication.loca.lt
/tools/node/lib/node_modules/localtunnel/bin/lt.js:81
    throw err;
    ^

Error: connection refused: localtunnel.me:13035 (check your firewall settings)
    at Socket.<anonymous> (/tools/node/lib/node_modules/localtunnel/lib/TunnelCluster.js:52:11)
    at Socket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

Node.js v18.20.5