# Typo-Tolerant Search Using Levenshtein Distance (P1.2)

This notebook demonstrates how to handle typos in user queries.

With strict keyword matching, a search for `"repot"` returns nothing, even though the dataset contains entries tagged with keywords like `"report"` or `"reports"`.

**Goal**: use Levenshtein distance to detect the closest valid keyword and search using that instead.

We will:
- Load the keywords from `reports.csv`
- Compute edit distances with the `python-Levenshtein` package
- Find the closest matching keyword
- Return matching entries from the dataset
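
For reference, the quantity the library computes can be stated as the standard recurrence below. The symbols are introduced here for illustration only: $a$ and $b$ are the two strings, and $d_{i,j}$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.

$$
d_{i,j} =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + [a_i \neq b_j]\bigr) & \text{otherwise}
\end{cases}
$$

Here $[a_i \neq b_j]$ is 1 when the characters differ and 0 when they match, and the three terms correspond to a deletion, an insertion, and a substitution. Turning `"repot"` into `"report"` takes one insertion (the second `r`), so the distance is 1; reaching `"reports"` takes two insertions, so the distance is 2.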

In [6]:
import pandas as pd
import Levenshtein  # provided by the python-Levenshtein package

# Load the data
df = pd.read_csv("../api/reports.csv")
df.fillna("", inplace=True)

## Extract all unique keywords for comparison

In [7]:
unique_keywords = set()
for kw_list in df["keywords"]:
    keywords = [kw.strip() for kw in kw_list.split(",") if kw.strip()]
    unique_keywords.update(keywords)

print(f"Total unique keywords: {len(unique_keywords)}")

Total unique keywords: 652


## Function to get the closest match by Levenshtein distance

In [8]:
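def closest_keyword(query):
    """Return (keyword, distance) for the known keyword nearest to `query`."""
    query = query.strip()
    # Score every known keyword; min() avoids sorting the whole list.
    distances = ((kw, Levenshtein.distance(query, kw)) for kw in unique_keywords)
    # With no keywords loaded, fall back to the query itself at distance 0.
    return min(distances, key=lambda pair: pair[1], default=(query, 0))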

In [9]:
# Example: simulate typo
typo = "repot"
suggested_keyword, distance = closest_keyword(typo)
print(f"User query: {typo}\nClosest match: {suggested_keyword} (distance: {distance})")

User query: repot
Closest match: report (distance: 1)
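
Note that `closest_keyword` always returns some keyword, even for a query that is nowhere near any of them, and that ties are resolved by set iteration order. One possible refinement, sketched below, is to reject matches beyond a cutoff and break ties alphabetically. The names `levenshtein`, `closest_keyword_within`, `max_distance`, and `sample_keywords` are hypothetical and not part of the notebook, the cutoff of 2 is an arbitrary assumption, and lowercasing the query assumes the stored keywords are lowercase. The pure-Python distance helper is a stand-in so the sketch runs without extra packages; in the notebook, `Levenshtein.distance` could be used in its place.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme (stand-in helper)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def closest_keyword_within(query, keywords, max_distance=2):
    """Return (keyword, distance), or (None, None) if nothing is close enough.

    `max_distance` is an assumed cutoff; tune it to the keyword lengths in use.
    """
    query = query.strip().lower()  # assumes keywords are stored in lowercase
    # Comparing (distance, keyword) tuples breaks ties alphabetically,
    # so the result does not depend on set iteration order.
    best = min(((levenshtein(query, kw), kw) for kw in keywords), default=None)
    if best is None or best[0] > max_distance:
        return None, None
    distance, keyword = best
    return keyword, distance


# Hypothetical keyword set, for illustration only.
sample_keywords = {"report", "reports", "revenue"}
print(closest_keyword_within("repot", sample_keywords))     # ('report', 1)
print(closest_keyword_within("zzzzzzzz", sample_keywords))  # (None, None)
```

A caller can then treat `(None, None)` as "no suggestion" instead of silently searching for an unrelated keyword.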


## Perform search using the corrected keyword

In [10]:
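def search_by_corrected_keyword(query):
    corrected, dist = closest_keyword(query)
    # Keep rows whose comma-separated keyword list contains the corrected keyword exactly.
    keyword_lists = df["keywords"].str.split(",")
    mask = keyword_lists.apply(lambda kws: corrected in [k.strip() for k in kws])
    return corrected, dist, df[mask]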

# Try a search
corrected, dist, result_df = search_by_corrected_keyword("repot")
print(f"Searching with corrected keyword: '{corrected}' (distance: {dist})")
result_df.head()

Searching with corrected keyword: 'report' (distance: 1)


| Unnamed: 0 | ID Data Product | Report Name | Report View | Tags | keywords |
|---|---|---|---|---|---|
| 39 | RPPBI0004 | eCommerce Report 2024 | B2B Digital Report | B2B Digital | 2024, b2b, digital, ecommerce, report |
| 40 | RPPBI0004 | eCommerce Report 2024 | Database Browser | | 2024, browser, database, ecommerce, report |
| 41 | RPPBI0004 | eCommerce Report 2024 | Database Browser (Creation Date) | | 2024, browser, creation, database, date, ecomm... |
| 42 | RPPBI0004 | eCommerce Report 2024 | Digital By Creation Date Performance Report | Performance, Digital | 2024, creation, date, digital, ecommerce, perf... |
| 43 | RPPBI0004 | eCommerce Report 2024 | Digital Performance (Stay Date) | Performance, Digital | 2024, date, digital, ecommerce, performance, r... |
