<a href="https://colab.research.google.com/github/Zeaxanthin80/Semantics-Search-with-Embeddings/blob/main/SemanticSearch_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Semantic Search Engine for <br>Customer Complaints with ***OpenAI***

>Original Source:

>https://drlee.io/building-a-semantic-search-engine-for-customer-complaints-with-openai-a61e9f4f2ba7


---



## Step 1: Understanding Semantic Search

>Semantic search goes beyond keyword matching. It uses machine learning models to encode text into numerical vectors, capturing the context and meaning of the text. These embeddings allow us to compare and rank text data based on semantic similarity, even if the wording differs.
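
To make "semantic similarity" concrete, here is a minimal sketch of the cosine distance used later in this notebook, written in plain Python. The three vectors are made-up toy values, not real embeddings, and the helper name `cosine_distance` is an assumption for illustration. It computes the same quantity as SciPy's `distance.cosine`, which is 1 minus the cosine similarity.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (made-up values for illustration only;
# real embeddings come from the model and have many more dimensions).
late_delivery = [0.9, 0.1, 0.0]
delayed_shipment = [0.8, 0.2, 0.1]
rude_agent = [0.0, 0.2, 0.9]

# Smaller distance means the two texts are treated as more similar.
d_similar = cosine_distance(late_delivery, delayed_shipment)
d_different = cosine_distance(late_delivery, rude_agent)

print(f"late delivery vs. delayed shipment: {d_similar:.3f}")
print(f"late delivery vs. rude agent:       {d_different:.3f}")
```

The first pair points in nearly the same direction, so its distance is close to 0. The second pair is nearly orthogonal, so its distance is close to 1.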

#### **Use Case:**
>Imagine you run a business and have a database of customer complaints. A customer contacts you about a delayed delivery. With semantic search, you can quickly find similar complaints and their resolutions to streamline your response.


---




## Step 2: Setting Up the Environment

In [None]:
import os

from openai import OpenAI
from scipy.spatial import distance
import numpy as np

# Initialize the OpenAI client. Read the API key from the OPENAI_API_KEY
# environment variable instead of hardcoding it in the notebook.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define the embedding function using the OpenAI client
def create_embeddings(texts, model="text-embedding-3-small"):
    embeddings = []
    for text in texts:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embeddings.append(response.data[0].embedding)
    return embeddings
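
The function above sends one request per text. The embeddings endpoint also accepts a list of inputs, so a batched variant is possible. The sketch below is one way to do it, not the original author's method: the function name `create_embeddings_batched`, the explicit `client` parameter, and the `batch_size` default are assumptions for illustration, and the batch size is not a documented limit.

```python
def create_embeddings_batched(client, texts, model="text-embedding-3-small", batch_size=100):
    """Embed texts in batches: one API request per batch instead of per text.

    `client` is an OpenAI client instance. `batch_size` is an illustrative
    value (an assumption), not a documented API limit.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by each item's index so results line up with the input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

Called as `create_embeddings_batched(client, customer_complaints)`, this would return the same list-of-vectors shape as `create_embeddings`, with fewer network round trips.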



---



## Step 3: Data Preparation

In [None]:
# Example customer complaints
customer_complaints = [
    "The delivery of my order was delayed by 3 days, and I had to constantly check the tracking system for updates. This caused inconvenience as it was a gift that I needed urgently.",
    "I received a damaged product in the package, and the box itself was torn. It seems there was no care taken during the shipping process.",
    "The refund process is incredibly slow. I submitted my request weeks ago and still haven't received any confirmation or updates on the status of my refund.",
    "The customer service representative I spoke to was extremely rude and unhelpful, refusing to listen to my concerns or provide a proper resolution.",
    "I never received the order I placed two weeks ago, even though the system marked it as delivered. I feel like my money has been wasted.",
    "The packaging was torn and damaged when my order arrived, making it look like the contents could have been tampered with or mishandled during shipping.",
    "The product I received doesn’t match the description on the website at all. It feels misleading, and I now have to go through the hassle of returning it.",
    "The website is confusing and difficult to navigate, making it hard to find what I was looking for. The search feature also doesn’t provide accurate results.",
    "I was incorrectly charged an extra amount for my order, and I can't figure out why. The customer support hasn't resolved this issue yet.",
    "The warranty claim process is unclear, and I couldn’t find any detailed instructions on the website. I’ve been stuck without a resolution for weeks.",
    "The size I ordered doesn’t fit, even though I followed the size chart on your website. It seems the chart is inaccurate and misleading.",
    "I tried canceling my order before it was shipped, but the system wouldn't let me. Now, I’m stuck with something I don’t need.",
    "The product quality is far below what I expected based on the reviews and description. It feels like I’ve been scammed.",
    "The live chat feature on the website never connects me to an agent. I’ve tried multiple times, and the automated replies aren’t helpful at all.",
    "I’ve been charged for a subscription that I never signed up for. It’s unfair, and I haven’t received any explanation or solution yet.",
    "The instructions for assembling the product were incomplete and confusing. I had to search online to figure out how to put it together.",
    "My account was locked for no reason, and I wasn’t able to make a purchase. Customer support didn’t resolve this issue quickly.",
    "The item was marked as 'in stock,' but after I placed the order, I was informed that it’s on backorder and won’t arrive for weeks.",
    "I received the wrong item in my order, and now I have to go through the hassle of returning it and waiting for a replacement.",
    "The checkout process on the website is frustratingly slow, and my payment failed multiple times before finally going through.",
    "My promotional discount code didn’t work at checkout, and I ended up paying the full price. I contacted support but haven’t heard back yet.",
    "The delivery person left my package outside in the rain, ruining the contents inside. There should be better handling of deliveries.",
    "I had to pay additional customs fees that weren’t disclosed when I placed the order. This hidden charge is unacceptable.",
    "The automated phone system doesn’t connect me to a real person, and I’ve been stuck waiting for a resolution for over a week.",
    "The color of the product I received is completely different from what was shown on the website. It’s not what I ordered at all.",
    "The mobile app keeps crashing whenever I try to add items to my cart. It’s been impossible to complete my purchase.",
    "The tracking information for my shipment hasn’t been updated in days, and I have no idea where my package is.",
    "The item I purchased doesn’t work as advertised. It’s defective and should have been properly tested before being sold.",
    "I paid for expedited shipping, but my order still arrived late. I feel like I wasted money on a service that wasn’t delivered.",
    "I’ve been trying to return a product for over two weeks, but the return label hasn’t been sent to me yet. This delay is unacceptable."
]

In [None]:
# Generate embeddings for the complaints
complaints = []
embeddings = create_embeddings(customer_complaints, model="text-embedding-3-small")

for complaint, embedding in zip(customer_complaints, embeddings):
    complaints.append({"complaint": complaint, "embedding": embedding})



---



## Step 4: Implementing Semantic Search

In [None]:
# Search query
search_text = "Why is my delivery late?"

# Generate the embedding for the query
search_embedding = create_embeddings([search_text])[0]

# Calculate cosine distances between the query and complaints
distances = []
for complaint in complaints:
    dist = distance.cosine(search_embedding, complaint["embedding"])
    distances.append(dist)

# Find the closest complaint
min_dist_ind = np.argmin(distances)
closest_complaint = complaints[min_dist_ind]

# Right-align both labels so the colons line up in the printed output.
search_query_label = "Search Query"
closest_complaint_label = "Closest Complaint"

# Calculate the maximum width of the labels
max_width = max(len(search_query_label), len(closest_complaint_label))

# Print with alignment
print(f"{search_query_label:>{max_width}}: {search_text}")
print(f"{closest_complaint_label:>{max_width}}: {closest_complaint['complaint']}")

     Search Query: Why is my delivery late?
Closest Complaint: The delivery of my order was delayed by 3 days, and I had to constantly check the tracking system for updates. This caused inconvenience as it was a gift that I needed urgently.
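
The cell above returns only the single closest complaint. When several past complaints are relevant, ranking the top few can be more useful. The sketch below uses `np.argsort` on vectorized cosine distances. The function name `top_k_matches` and the toy 2-dimensional vectors are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def top_k_matches(query_vec, doc_vecs, k=3):
    """Return (index, cosine_distance) pairs for the k closest documents."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity of the query against every row at once.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    dists = 1.0 - sims
    # argsort is ascending, so the smallest distances come first.
    order = np.argsort(dists)[:k]
    return [(int(i), float(dists[i])) for i in order]

# Toy vectors (made up) in place of real embeddings.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
toy_query = [1.0, 0.05]

results = top_k_matches(toy_query, toy_docs, k=2)
for idx, dist in results:
    print(idx, round(dist, 3))
```

With the notebook's data, `doc_vecs` would be the list of `c["embedding"]` values from `complaints`, and each returned index maps back to `complaints[idx]["complaint"]`.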




---



## Step 5: Visualizing with Gradio

In [None]:
!pip install gradio -qqq
import gradio as gr

# Define the search function
def find_similar_complaint(query):
    search_embedding = create_embeddings([query])[0]
    distances = [distance.cosine(search_embedding, c["embedding"]) for c in complaints]
    min_dist_ind = np.argmin(distances)
    closest_complaint = complaints[min_dist_ind]
    return f"Query: {query}\n\nMost Similar Complaint: {closest_complaint['complaint']}"

# Create the Gradio interface
interface = gr.Interface(
    fn=find_similar_complaint,
    inputs="text",
    outputs="text",
    title="Semantic Search for Customer Complaints",
    description="Enter a customer query to find similar complaints in the database."
)

# Launch the app
interface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://e13b37a34e11274b20.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)






---



## Step 6: Testing the Application

>Run the "Step 5" cell above to launch the ***Gradio*** app, then test it locally or deploy it online.

>Enter queries like:

>* “Where is my order?”
>* “The product arrived broken.”

>The app will return the most semantically similar complaint from the database.
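
Note that the search always returns some complaint, even for a query unrelated to anything in the database. One possible safeguard is a distance cutoff. The sketch below is an assumption-laden illustration: the function name `best_match_or_none` is hypothetical, and the `max_distance` value of 0.6 is an arbitrary placeholder that would need tuning on real queries.

```python
def best_match_or_none(distances, max_distance=0.6):
    """Return the index of the closest item, or None if nothing is close enough.

    `max_distance` is an assumed cutoff, not a recommended value.
    """
    if not distances:
        return None
    best = min(range(len(distances)), key=distances.__getitem__)
    return best if distances[best] <= max_distance else None

print(best_match_or_none([0.82, 0.35, 0.71]))  # prints 1
print(best_match_or_none([0.82, 0.95, 0.71]))  # prints None
```

In the Gradio function, a `None` result could be turned into a message such as "No similar complaint found."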





---



# Addendum: Semantic Search for Legal Departments
A Use Case and Implementation Guide

**Introduction**

>In legal departments, searching through a vast repository of legal cases, summaries, or statutes to find relevant information is a common yet time-consuming task. Traditional keyword-based search often fails to capture the context or nuances of legal texts, leading to inefficient workflows. This is where semantic search comes into play. By leveraging machine learning embeddings, we can create a system that understands the meaning of legal queries and retrieves the most relevant information.

>This guide demonstrates a practical implementation of semantic search tailored for legal departments, enriched with topics and keywords for enhanced contextual understanding. We will also include a Gradio interface for an interactive user experience and provide independent tasks for readers to practice.


**Use Case: Enhancing Legal Research Efficiency**

>Imagine a legal team working on a data privacy compliance case. They need to quickly find previous rulings, statutes, or policies related to user data protection. Instead of manually sifting through hundreds of documents, they can use a semantic search system to input their query and retrieve the most relevant legal summaries instantly.

**Why Enriched Data Matters**

>By tagging each summary with *topics* and *keywords*, the system can:

>* Provide additional context for retrieved results.
>* Enable faster human validation by showing related metadata (e.g., topic and keywords).
>* Enhance downstream analytics, like clustering or topic-based filtering.
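
As a rough illustration of topic-based filtering, the sketch below narrows a list of records to one topic before any similarity search runs. The function name `filter_by_topic` and the shortened toy records are assumptions. They mirror the `summary` and `topic` fields of the records defined in this guide.

```python
def filter_by_topic(records, topic):
    """Keep only records whose "topic" field equals `topic` (case-insensitive)."""
    wanted = topic.strip().lower()
    return [r for r in records if r.get("topic", "").lower() == wanted]

# Toy records mirroring the structure used in this guide (made-up, shortened text).
toy_summaries = [
    {"summary": "Ruling on an oil spill.", "topic": "Environmental Law"},
    {"summary": "Policy on data encryption.", "topic": "Privacy Law"},
    {"summary": "Rule on breach disclosure.", "topic": "Privacy Law"},
]

privacy_only = filter_by_topic(toy_summaries, "privacy law")
print([r["summary"] for r in privacy_only])
```

Filtering first keeps the search inside one practice area, at the cost of missing relevant documents that were tagged under a different topic.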



---



# Addendum: Semantic Search with Enriched Legal Data
>Here’s an implementation guide:

## Step 1: Define Legal Summaries
>Each legal summary includes the text of the summary, a topic, and keywords. Its embedding is added in Step 2.

In [None]:
legal_summaries = [
    {
        "summary": "The Supreme Court ruled in favor of the plaintiff regarding environmental regulations, emphasizing the need for stricter enforcement to protect endangered species in industrial zones.",
        "topic": "Environmental Law",
        "keywords": ["Supreme Court", "plaintiff", "environmental regulations", "endangered species", "industrial zones"]
    },
    {
        "summary": "A new policy on data privacy was enacted to protect user information, requiring companies to implement stricter encryption standards for all stored data.",
        "topic": "Privacy Law",
        "keywords": ["data privacy", "policy", "user information", "encryption standards", "data storage"]
    },
    {
        "summary": "The landmark case addressed intellectual property rights for AI-generated content, establishing that such creations could not yet qualify for traditional copyright protections.",
        "topic": "Intellectual Property",
        "keywords": ["intellectual property", "AI", "landmark case", "copyright", "AI-generated content"]
    },
    {
        "summary": "Antitrust concerns were raised in a merger between two major tech companies, with regulators questioning the potential for monopolistic practices in the digital advertising sector.",
        "topic": "Antitrust Law",
        "keywords": ["antitrust", "merger", "tech companies", "monopoly", "digital advertising"]
    },
    {
        "summary": "The court dismissed charges of negligence against the pharmaceutical company, citing insufficient evidence to prove a breach of safety protocols during the manufacturing process.",
        "topic": "Healthcare Law",
        "keywords": ["negligence", "pharmaceutical company", "court dismissal", "safety protocols", "manufacturing process"]
    },
    {
        "summary": "The court ruled that a major energy company was liable for damages caused by an oil spill that affected coastal communities and wildlife habitats.",
        "topic": "Environmental Law",
        "keywords": ["oil spill", "energy company", "liability", "coastal communities", "wildlife habitats"]
    },
    {
        "summary": "A federal ruling mandated that companies must disclose all data breaches affecting more than 10,000 users within 48 hours of detection.",
        "topic": "Privacy Law",
        "keywords": ["data breaches", "federal ruling", "disclosure", "user protection", "48-hour rule"]
    },
    {
        "summary": "The court upheld the trademark infringement claim, ruling that the defendant's branding created consumer confusion with an established company's product line.",
        "topic": "Intellectual Property",
        "keywords": ["trademark infringement", "branding", "consumer confusion", "product line", "court ruling"]
    },
    {
        "summary": "An antitrust lawsuit was filed against a leading e-commerce platform for allegedly using its market dominance to suppress smaller competitors.",
        "topic": "Antitrust Law",
        "keywords": ["antitrust lawsuit", "e-commerce", "market dominance", "smaller competitors", "suppression"]
    },
    {
        "summary": "A healthcare company was fined for failing to comply with federal health data privacy regulations, resulting in unauthorized access to patient records.",
        "topic": "Healthcare Law",
        "keywords": ["healthcare company", "fines", "data privacy", "patient records", "non-compliance"]
    },
    {
        "summary": "Environmental advocates sued a chemical manufacturer over illegal waste disposal practices that contaminated local water supplies.",
        "topic": "Environmental Law",
        "keywords": ["illegal waste disposal", "chemical manufacturer", "water contamination", "lawsuit", "environmental advocates"]
    },
    {
        "summary": "New legislation on biometric data collection requires businesses to obtain explicit consent before capturing or storing fingerprints or facial recognition data.",
        "topic": "Privacy Law",
        "keywords": ["biometric data", "explicit consent", "facial recognition", "fingerprints", "data collection"]
    },
    {
        "summary": "The court ruled that a software developer retained ownership of the source code created during a freelance project, reinforcing independent contractor rights.",
        "topic": "Intellectual Property",
        "keywords": ["source code", "software developer", "freelance project", "ownership", "contractor rights"]
    },
    {
        "summary": "Regulators imposed fines on a major telecom company for collusion in price-fixing agreements with its regional partners.",
        "topic": "Antitrust Law",
        "keywords": ["collusion", "price-fixing", "telecom company", "regional partners", "regulators"]
    },
    {
        "summary": "The court held a healthcare provider liable for medical malpractice after failing to diagnose a patient’s life-threatening condition in a timely manner.",
        "topic": "Healthcare Law",
        "keywords": ["medical malpractice", "healthcare provider", "liability", "misdiagnosis", "timely care"]
    }
]



---








## Step 2: Generate Embeddings

In [None]:
for summary in legal_summaries:
    embedding = create_embeddings([summary["summary"]], model="text-embedding-3-small")[0]
    summary["embedding"] = embedding
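
The loop above embeds only the `summary` field, so the topic and keywords are displayed with results but do not influence ranking. One possible variation, not part of the original notebook, is to embed a combined string. The helper name `build_enriched_text` and the field layout of the string are assumptions, and whether this improves retrieval would need to be checked on real queries.

```python
def build_enriched_text(record):
    """Combine topic, keywords, and summary into one string for embedding."""
    keywords = ", ".join(record.get("keywords", []))
    return (
        f"Topic: {record['topic']}\n"
        f"Keywords: {keywords}\n"
        f"Summary: {record['summary']}"
    )

# Shortened example record in the same shape as the entries in `legal_summaries`.
example = {
    "summary": "A new policy on data privacy was enacted to protect user information.",
    "topic": "Privacy Law",
    "keywords": ["data privacy", "policy"],
}
print(build_enriched_text(example))
```

To try it, pass `build_enriched_text(summary)` to `create_embeddings` in place of `summary["summary"]`.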



---



## Step 3: Search Functionality
>The system calculates the semantic similarity between a query and the legal summaries.

In [None]:
# Search text (legal query)
search_text = "biometric data"
search_embedding = create_embeddings([search_text])[0]

# Calculate cosine distances
distances = []
for summary in legal_summaries:
    dist = distance.cosine(search_embedding, summary["embedding"])
    distances.append(dist)

# Find the legal summary with the smallest distance
min_dist_ind = np.argmin(distances)
closest_summary = legal_summaries[min_dist_ind]

print(f"Closest Legal Summary: {closest_summary['summary']}")
print(f"Topic: {closest_summary['topic']}")
print(f"Keywords: {', '.join(closest_summary['keywords'])}")

Closest Legal Summary: New legislation on biometric data collection requires businesses to obtain explicit consent before capturing or storing fingerprints or facial recognition data.
Topic: Privacy Law
Keywords: biometric data, explicit consent, facial recognition, fingerprints, data collection




---



## Interactive Gradio Interface
>Gradio provides a simple and intuitive way to use the semantic search system.

## Code for Gradio Interface

In [None]:
import gradio as gr

# Define the Gradio search function
def find_similar_summary(query):
    search_embedding = create_embeddings([query])[0]
    distances = [distance.cosine(search_embedding, s["embedding"]) for s in legal_summaries]
    min_dist_ind = np.argmin(distances)
    closest_summary = legal_summaries[min_dist_ind]
    return (
        f"Query: {query}\n\n"
        f"Closest Legal Summary: {closest_summary['summary']}\n\n"
        f"Topic: {closest_summary['topic']}\n"
        f"Keywords: {', '.join(closest_summary['keywords'])}"
    )

# Create the Gradio interface
interface = gr.Interface(
    fn=find_similar_summary,
    inputs="text",
    outputs="text",
    title="Semantic Search for Legal Summaries",
    description="Enter a legal query to find the most relevant legal summary."
)

# Launch the app
interface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://44b753920976f144e1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


