
# **Import Libraries**

In [6]:
%pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [1]:
import numpy as np
import pandas as pd

In [7]:
from fuzzywuzzy import process, fuzz
import re



# **Load the Dataset**

In [2]:
base_names = pd.read_csv('base_names.csv')
name_variations = pd.read_csv('name_variations.csv')

In [3]:
base_names.head()

Unnamed: 0,Base_Name_ID,Base_Name
0,1,John Smith
1,2,Jennifer Brown
2,3,Michael O'Connor
3,4,Maria Garcia
4,5,Robert Lee


In [4]:
name_variations.head()

Unnamed: 0,Variation,Matches_With_Base_Name
0,Thomas King,Thomas King
1,ThomasKing,Thomas King
2,Maria Garcia,Maria Garcia
3,MaryLewis,Mary Lewis
4,Nancy W.,Nancy Wright


# **Preprocessing**

In [8]:
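def preprocess(name):
    """Lowercase a name, strip punctuation, and normalize whitespace."""
    name = name.lower()
    name = re.sub(r'[^\w\s]', '', name)  # Remove punctuation first
    name = re.sub(r'\s+', ' ', name).strip()  # Then collapse runs of spaces and trim the ends
    return name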

In [9]:
# Apply preprocessing
base_names['Base_Name'] = base_names['Base_Name'].apply(preprocess)
name_variations['Variation'] = name_variations['Variation'].apply(preprocess)
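
A quick spot check on names taken from the tables above shows what this cleaning does and does not do. The snippet below is a minimal sketch that restates the cleaning steps so it runs on its own; the `samples` list is only illustrative.

```python
import re

def preprocess(name):
    # Restated cleaning steps: lowercase, drop punctuation, normalize spaces.
    name = name.lower()
    name = re.sub(r'[^\w\s]', '', name)
    name = re.sub(r'\s+', ' ', name).strip()
    return name

# Names copied from the head() outputs above.
samples = ["Michael O'Connor", "Nancy W.", "ThomasKing"]
cleaned = {raw: preprocess(raw) for raw in samples}

for raw, clean in cleaned.items():
    print(f"{raw} -> {clean}")
```

Note that `ThomasKing` becomes the single token `thomasking`: lowercasing discards the capital letter that marked the word boundary, and nothing in the cleaning reinserts the space.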


# **Fuzzy Matching**

In [32]:
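# Fuzzy matching using fuzzywuzzy
def fuzzy_match(name, choices, scorer=fuzz.token_sort_ratio, threshold=60):
    """Return the best-scoring choice for `name`, or None if its score is below `threshold`."""
    result = process.extractOne(name, choices, scorer=scorer)
    if result:
        # Index rather than unpack: when `choices` is a pandas Series,
        # extractOne returns (match, score, index) instead of (match, score).
        best_match, score = result[0], result[1]
        return best_match if score >= threshold else None
    return None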
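
The scorer used as the default here, `fuzz.token_sort_ratio`, sorts the tokens of both strings alphabetically before comparing them. That helps when name parts are reordered, but it can hurt when a variation drops the space between them. The sketch below approximates both behaviours with the standard library's `difflib` rather than calling fuzzywuzzy itself, so the helper names `ratio` and `token_sort` are illustrative and the scores may differ slightly from what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Character-level similarity on a 0-100 scale, in the spirit of fuzz.ratio.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort(a, b):
    # Sort whitespace-separated tokens alphabetically, then compare:
    # the idea behind fuzz.token_sort_ratio.
    def sort_tokens(s):
        return " ".join(sorted(s.split()))
    return ratio(sort_tokens(a), sort_tokens(b))

print(token_sort("king thomas", "thomas king"))  # 100: token order is ignored
print(ratio("thomasking", "thomas king"))        # 95: only the space differs
print(token_sort("thomasking", "thomas king"))   # 57: compared against "king thomas"
```

With a threshold of 60, the last comparison would be rejected even though the two strings differ by a single space.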

In [33]:
# Perform fuzzy matching
def match_names(base_names, name_variations):
    matches = []
    for _, row in name_variations.iterrows():
        variation = row['Variation']
        match = fuzzy_match(variation, base_names['Base_Name'])
        matches.append({
            'Variation': variation,
            'Match_With_Base_Name': match
        })
    return pd.DataFrame(matches)

In [34]:
# Find matches
matches_df = match_names(base_names, name_variations)

In [35]:
print("Name Matches:")
print(matches_df.head())

Name Matches:
      Variation Match_With_Base_Name
0   thomas king          thomas king
1    thomasking                 None
2  maria garcia         maria garcia
3     marylewis                 None
4       nancy w         nancy wright
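
Because `name_variations` carries the expected answer in `Matches_With_Base_Name`, the predictions can be scored directly. The following is a minimal sketch that uses only the five rows displayed above, not the full dataset; the expected values are the `Matches_With_Base_Name` entries lowercased to match the preprocessing.

```python
# (variation, predicted match, expected match) for the five rows shown above.
rows = [
    ("thomas king", "thomas king", "thomas king"),
    ("thomasking", None, "thomas king"),
    ("maria garcia", "maria garcia", "maria garcia"),
    ("marylewis", None, "mary lewis"),
    ("nancy w", "nancy wright", "nancy wright"),
]

# A prediction counts as correct only if it equals the expected base name.
correct = sum(predicted == expected for _, predicted, expected in rows)
accuracy = correct / len(rows)

print(f"{correct}/{len(rows)} correct ({accuracy:.0%})")  # 3/5 correct (60%)
```

The same comparison over the whole `matches_df` would give a single number to track while adjusting the scorer or threshold.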


# **Conclusion**



*   Both the similarity threshold and the scoring function (for example, `fuzz.partial_ratio` instead of `fuzz.token_sort_ratio`, the default in `fuzzy_match`) changed which variations were matched.
*   Lowering the threshold from 80 to 60 made the matching more inclusive.
*   Even so, the matcher still missed variations written without a space (e.g., "thomasking" for "Thomas King" and "marylewis" for "Mary Lewis"). A likely cause is that `token_sort_ratio` sorts tokens alphabetically before comparing, so "thomas king" is compared as "king thomas", which shares little with the single token "thomasking".
*   More advanced preprocessing, or further tuning of the scorer and threshold, is therefore needed to improve accuracy.
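
One possible preprocessing refinement, offered as a sketch rather than as part of the notebook's method, is to split camel-cased names before lowercasing, since the raw variations `ThomasKing` and `MaryLewis` still carry the word boundary in their capital letters. The helpers `split_camel_case` and `preprocess_with_split` are hypothetical names introduced for illustration.

```python
import re

def split_camel_case(name):
    # Hypothetical helper: insert a space wherever a lowercase letter is
    # immediately followed by an uppercase one ("ThomasKing" -> "Thomas King").
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)

def preprocess_with_split(name):
    name = split_camel_case(name)             # must run before lowercasing
    name = name.lower()
    name = re.sub(r'[^\w\s]', '', name)       # remove punctuation
    return re.sub(r'\s+', ' ', name).strip()  # normalize whitespace

for raw in ["ThomasKing", "MaryLewis", "Michael O'Connor"]:
    print(f"{raw} -> {preprocess_with_split(raw)}")
```

This has limits: it cannot help when the input is already all lowercase, and it would also split surnames with an interior capital (for example, "McDonald" would become "mc donald"), so it should be checked against the full dataset before being adopted.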