# **J040 ASGMT 4 TASK 1**

### Goal
The objective of Task 1 is to resolve user queries by matching them to a set of pre-resolved queries using text similarity techniques. This helps in automating the process of query resolution in NLP applications.

### Approach
We use two main approaches:
- **Fuzzy String Matching**: Uses the `thefuzz` library to compare queries with four string similarity metrics (ratio, partial ratio, token sort ratio, token set ratio).
- **TF-IDF with Cosine Similarity**: Converts queries into TF-IDF vectors and measures how close they are using cosine similarity.

The results from both methods are combined. The fuzzy match is preferred when its score meets the fuzzy threshold, the TF-IDF match serves as a fallback, and a query is left unmatched when neither threshold is met.

In [1]:
!python -m pip install thefuzz

## Importing Libraries

We import essential libraries for data handling, fuzzy string matching, and vector-based similarity calculations. These include pandas for data manipulation, thefuzz for fuzzy matching, and scikit-learn for TF-IDF and cosine similarity.

In [2]:
import pandas as pd
from thefuzz import fuzz, process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity




## Loading the Dataset

We load two datasets:
- `new_queries.csv`: Contains unresolved user queries.
- `resolved_queries.csv`: Contains queries that have already been resolved and can be used as references for matching.

In [None]:
new_queries = pd.read_csv("Data/new_queries.csv")
resolved_queries = pd.read_csv("Data/resolved_queries.csv")

In [4]:
new_queries, resolved_queries

(                              Variation_Query  Matches_With_Query_ID
 0            Unabel to conect to the internet                      1
 1                   Can’t connect to internet                      1
 2                         Intenet not working                      1
 3                Payment failed while chekout                      2
 4   Payment did not go through during chckout                      2
 5                  Payment issue at check out                      2
 6    Application crashes when opening setings                      3
 7            App crash when going to settings                      3
 8            Settings cause the app to chrash                      3
 9               Forgot passwrd and cant reset                      4
 10        Forgotten password, unable to reset                      4
 11                  I can’t reset my password                      4
 12             Unable to uplod file to server                      5
 13        Can't upl

## Text Preprocessing

To improve matching accuracy, we preprocess queries by converting them to lowercase and stripping extra spaces. This standardizes the text and reduces mismatches due to formatting differences.

In [5]:
new_queries["Variation_Query"] = new_queries["Variation_Query"].str.lower().str.strip()
resolved_queries["Pre_Resolved_Query"] = resolved_queries["Pre_Resolved_Query"].str.lower().str.strip()


## Fuzzy Matching Methods

We perform fuzzy matching using four scorers from the `thefuzz` library:
- **ratio**
- **partial_ratio**
- **token_sort_ratio**
- **token_set_ratio**

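For each unresolved query, we take the top candidate under each scorer and keep the scorer whose candidate has the highest score. The match is accepted only if that score is at least the threshold (70 by default).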
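
To make the difference between the four scorers concrete, here is a minimal sketch that approximates them with `difflib.SequenceMatcher` from the Python standard library. It is a stand-in for illustration, not the `thefuzz` implementation: the helper functions are hypothetical re-implementations, and the exact scores may differ from what `thefuzz` reports.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length window of the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    windows = (long_[i:i + len(short)] for i in range(len(long_) - len(short) + 1))
    return max(ratio(short, w) for w in windows)


def token_sort_ratio(a: str, b: str) -> int:
    """Compare after sorting the words, so word order no longer matters."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set_ratio(a: str, b: str) -> int:
    """Compare the shared words against each full word set and keep the best score."""
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    full_a = " ".join(filter(None, [common, " ".join(sorted(ta - tb))]))
    full_b = " ".join(filter(None, [common, " ".join(sorted(tb - ta))]))
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


reference = "unable to connect to the internet"
for name, scorer in [("ratio", ratio), ("partial_ratio", partial_ratio),
                     ("token_sort_ratio", token_sort_ratio), ("token_set_ratio", token_set_ratio)]:
    print(name, scorer("internet connect", reference))
```

In this sketch the query "internet connect" reaches a token-set score of 100 because every query word appears in the reference, while the plain ratio stays below 100. This is why taking the maximum across scorers tends to favor the more forgiving ones.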

In [6]:
def get_all_fuzzy_scores(query, df, threshold=70):
    choices = df["Pre_Resolved_Query"].tolist()
    if not choices:
        # No candidates to compare against
        return None, None, None, None
    # Top (match, score) pair under each scorer
    scores = {
        "ratio": process.extractOne(query, choices, scorer=fuzz.ratio),
        "partial_ratio": process.extractOne(query, choices, scorer=fuzz.partial_ratio),
        "token_sort_ratio": process.extractOne(query, choices, scorer=fuzz.token_sort_ratio),
        "token_set_ratio": process.extractOne(query, choices, scorer=fuzz.token_set_ratio),
    }
    # Keep the scorer whose top candidate has the highest score
    best_method, best_result = max(scores.items(), key=lambda x: x[1][1])
    if best_result[1] >= threshold:
        # Find the Query_ID for the matched text
        qid = df.loc[df["Pre_Resolved_Query"] == best_result[0], "Query_ID"].values[0]
        return best_method, best_result[0], qid, best_result[1]
    return None, None, None, None

fuzzy_results = []
for uq in new_queries["Variation_Query"]:
    method, match, qid, score = get_all_fuzzy_scores(uq, resolved_queries)
    fuzzy_results.append((uq, method, match, qid, score))

fuzzy_df = pd.DataFrame(fuzzy_results, columns=[
    "Unresolved_Query", "Best_Method", "Fuzzy_Match", "Fuzzy_Query_ID", "Fuzzy_Score"
])

## TF-IDF and Cosine Similarity

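We use TF-IDF vectorization to represent queries as numerical vectors, fitting the vectorizer on the unresolved and resolved queries together so that both share one vocabulary. Cosine similarity then measures how close each unresolved query is to each resolved query, and the highest-scoring resolved query is taken as the TF-IDF match. Because TF-IDF works on word overlap, it captures lexical rather than semantic similarity: a misspelled word counts as a different term.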
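
As a rough illustration of what happens inside this step, the following standard-library sketch builds TF-IDF vectors by hand and compares them with cosine similarity. It assumes whitespace tokenization, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The function names are hypothetical, and unlike the notebook's vectorizer it does not remove stop words, so its scores will not match the notebook's.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = ["internet not working", "intenet not working", "payment failed during checkout"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # partial overlap: shares "not" and "working"
print(cosine(vecs[0], vecs[2]))  # no shared words
```

The misspelled "intenet" contributes nothing to the first score because it is a different token from "internet", which is the main weakness of a word-level TF-IDF on typo-heavy queries.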

In [7]:
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(
    new_queries["Variation_Query"].tolist() + resolved_queries["Pre_Resolved_Query"].tolist()
)

n_unresolved = len(new_queries)
unresolved_vecs = tfidf_matrix[:n_unresolved]
resolved_vecs = tfidf_matrix[n_unresolved:]

cosine_sim = cosine_similarity(unresolved_vecs, resolved_vecs)

tfidf_results = []
for i, uq in enumerate(new_queries["Variation_Query"]):
    best_idx = cosine_sim[i].argmax()
    best_score = cosine_sim[i][best_idx]
    matched_row = resolved_queries.iloc[best_idx]
    tfidf_results.append((uq, matched_row["Pre_Resolved_Query"], matched_row["Query_ID"], best_score))

tfidf_df = pd.DataFrame(tfidf_results, columns=[
    "Unresolved_Query", "TFIDF_Match", "TFIDF_Query_ID", "TFIDF_Score"
])

## Combining Results

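We merge the results from fuzzy matching and TF-IDF similarity into a single DataFrame, joined on the unresolved query text. This allows us to compare and select the best match for each query using both approaches. The join assumes that each unresolved query string is unique; duplicate strings would produce extra rows.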

In [8]:
combined = pd.merge(fuzzy_df, tfidf_df, on="Unresolved_Query", how="inner")

## Selecting the Best Match

For each unresolved query, we choose the final match based on the following logic:
- If the fuzzy match score is at least the fuzzy threshold (75 by default), use the fuzzy match.
- Otherwise, if the TF-IDF score is at least its threshold (0.65 by default), use the TF-IDF match.
- If neither threshold is met, no match is assigned.
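
The cascade above can be sketched on plain numbers. The defaults of 75 and 0.65 mirror those of `pick_final` in the next cell; the function name `choose_method` and the score pairs are hypothetical and not taken from the dataset. The second and third examples show a fuzzy score of 72, which would pass the 70 cutoff used during fuzzy matching but not the stricter 75 used here.

```python
from typing import Optional


def choose_method(fuzzy_score: Optional[float], tfidf_score: float,
                  fuzzy_thresh: float = 75, tfidf_thresh: float = 0.65) -> str:
    """Return which matcher would supply the final answer."""
    if fuzzy_score is not None and fuzzy_score >= fuzzy_thresh:
        return "Fuzzy"
    if tfidf_score >= tfidf_thresh:
        return "TFIDF"
    return "No Match"


# Hypothetical (fuzzy_score, tfidf_score) pairs for illustration only
examples = [(91, 0.80), (72, 0.70), (72, 0.40), (None, 0.90)]
decisions = [choose_method(f, t) for f, t in examples]
print(decisions)  # ['Fuzzy', 'TFIDF', 'No Match', 'TFIDF']
```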

In [9]:
def pick_final(row, fuzzy_thresh=75, tfidf_thresh=0.65):
    # Fuzzy_Score is missing when no fuzzy candidate passed the matching threshold
    if pd.notna(row["Fuzzy_Score"]) and row["Fuzzy_Score"] >= fuzzy_thresh:
        return row["Fuzzy_Match"], row["Fuzzy_Query_ID"], f"Fuzzy-{row['Best_Method']}"
    elif row["TFIDF_Score"] >= tfidf_thresh:
        return row["TFIDF_Match"], row["TFIDF_Query_ID"], "TFIDF"
    else:
        return None, None, "No Match"

combined[["Final_Match", "Final_Query_ID", "Method_Used"]] = combined.apply(
    pick_final, axis=1, result_type="expand"
)

## Results

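We display the outcome for each unresolved query: the fuzzy candidate (`Fuzzy_Match`) and the method that supplied the final match (`Method_Used`). Because `Fuzzy_Match` holds the fuzzy candidate rather than the final match, a row can show a candidate together with `No Match` when its score passed the fuzzy matching threshold but not the final selection thresholds.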

In [10]:
combined[['Unresolved_Query', 'Fuzzy_Match', 'Method_Used']]

| | Unresolved_Query | Fuzzy_Match | Method_Used |
|---|---|---|---|
| 0 | unabel to conect to the internet | unable to connect to the internet | Fuzzy-ratio |
| 1 | can’t connect to internet | unable to connect to the internet | Fuzzy-token_set_ratio |
| 2 | intenet not working | | No Match |
| 3 | payment failed while chekout | payment failed during checkout | Fuzzy-ratio |
| 4 | payment did not go through during chckout | payment failed during checkout | No Match |
| 5 | payment issue at check out | payment failed during checkout | No Match |
| 6 | application crashes when opening setings | app crashes when opening settings | Fuzzy-partial_ratio |
| 7 | app crash when going to settings | app crashes when opening settings | Fuzzy-ratio |
| 8 | settings cause the app to chrash | | No Match |
| 9 | forgot passwrd and cant reset | forgot password and unable to reset | Fuzzy-ratio |


# **J040 ASGMT 4 TASK 2**

### Goal
The objective of Task 2 is to normalize and match name variations to their corresponding base names. This is useful for entity resolution and deduplication in NLP tasks.

### Approach
We preprocess names to handle formatting inconsistencies, then use fuzzy string matching (token_sort_ratio) to match each variation to the best base name. The accuracy of the matching is evaluated at the end.

## Importing Libraries

We import the `re` library for regular expressions, which is used in name normalization and preprocessing.

In [11]:
import re

## Loading the Dataset

We load two datasets:
- `name_variations.csv`: Contains different variations of names.
- `base_names.csv`: Contains the canonical base names to match against.

In [None]:
name_variations = pd.read_csv("Data/name_variations.csv")
base_names = pd.read_csv("Data/base_names.csv")

In [13]:
base_names, name_variations

(    Base_Name_ID          Base_Name
 0              1         John Smith
 1              2     Jennifer Brown
 2              3   Michael O'Connor
 3              4       Maria Garcia
 4              5         Robert Lee
 5              6      Linda Johnson
 6              7      William Davis
 7              8   Elizabeth Wilson
 8              9     David Martinez
 9             10        Susan Clark
 10            11    James Rodriguez
 11            12         Mary Lewis
 12            13         Paul Allen
 13            14        Karen Young
 14            15        Thomas King
 15            16       Nancy Wright
 16            17       Daniel Scott
 17            18        Sandra Hill
 18            19  Christopher Green
 19            20      Jessica Adams,
           Variation Matches_With_Base_Name
 0      Thomas  King            Thomas King
 1        ThomasKing            Thomas King
 2      Maria Garcia           Maria Garcia
 3         MaryLewis             Mary Lewis
 4

## Name Preprocessing

We normalize names by:
- Adding spaces between concatenated words (e.g., 'ThomasKing' → 'Thomas King')
- Converting to lowercase
- Removing every character that is not a letter or whitespace (digits and symbols are dropped, not corrected)
- Stripping unnecessary spaces

This ensures consistent formatting for accurate matching.
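
The steps listed above can be traced on a few inputs with the standard library alone. This sketch applies the same regular expressions as the notebook's `normalize_name` but keeps every intermediate string. `normalize_steps` is a hypothetical helper, and "McDonald" is a made-up input that does not come from the dataset.

```python
import re


def normalize_steps(name: str) -> list[str]:
    """Return the intermediate string after each normalization step."""
    steps = [name.strip()]
    steps.append(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", steps[-1]))  # split 'aB' boundaries
    steps.append(steps[-1].lower())
    steps.append(re.sub(r"[^a-z\s]", "", steps[-1]))              # drop digits and symbols
    steps.append(re.sub(r"\s+", " ", steps[-1]).strip())          # collapse whitespace
    return steps


for raw in ["ThomasKing", "Thomas  King", "Dani3l Scott", "N@ncy Wright", "McDonald"]:
    print(f"{raw!r:>16} -> {normalize_steps(raw)[-1]!r}")
```

Two side effects are worth noting. Substituted characters are deleted rather than repaired ("Dani3l" becomes "danil"), so the fuzzy scorer still has to bridge the gap. The lowercase-to-uppercase split would also break surnames such as "McDonald" into "mc donald".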

In [14]:
def normalize_name(name: str) -> str:
    if pd.isna(name):
        return ""
    name = name.strip()
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)
    name = name.lower()
    name = re.sub(r'[^a-z\s]', '', name)
    name = re.sub(r'\s+', ' ', name).strip()
    return name

## Applying Normalization

We apply the normalization function to both name variations and base names, creating new columns with the processed names for matching.

In [15]:
# normalize_name handles missing values itself; astype(str) would first turn NaN into the string "nan"
name_variations["Normalized"] = name_variations["Variation"].apply(normalize_name)
base_names["Normalized"] = base_names["Base_Name"].apply(normalize_name)

## Matching Names

We use the `token_sort_ratio` scorer from the `thefuzz` library to match each normalized name variation to its closest base name. A match is accepted only if its score is at least the threshold (80 by default); otherwise the variation is left unmatched.
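
The "closest candidate, gated by a threshold" pattern can be tried without `thefuzz` by using `difflib.get_close_matches` from the standard library. This is only an analogue: it does not sort tokens the way `token_sort_ratio` does, its scores may differ from those of `thefuzz`, and `best_match` is a hypothetical helper. The candidate list is a small subset of the normalized base names.

```python
from difflib import SequenceMatcher, get_close_matches

# A few normalized base names, enough to illustrate the threshold
bases = ["nancy wright", "daniel scott", "thomas king"]


def best_match(name: str, candidates: list[str], threshold: int = 80):
    """Return (candidate, score) for the closest candidate, or (None, None) below the threshold."""
    hits = get_close_matches(name, candidates, n=1, cutoff=threshold / 100)
    if not hits:
        return None, None
    return hits[0], round(100 * SequenceMatcher(None, hits[0], name).ratio())


print(best_match("danil scott", bases))  # ('daniel scott', 96)
print(best_match("nancy w", bases))      # (None, None)
```

An abbreviated surname such as "nancy w" shares only a prefix with "nancy wright", so in this sketch it falls below the cutoff of 80 and stays unmatched.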

In [16]:
def get_best_match(name, base_names, threshold=80):
    match = process.extractOne(
        name,
        base_names["Normalized"].tolist(),
        scorer=fuzz.token_sort_ratio
    )
    if match and match[1] >= threshold:
        # Get the original base name for reporting
        matched_row = base_names.loc[base_names["Normalized"] == match[0], "Base_Name"].values[0]
        return matched_row, match[1]
    return None, None

## Running Fuzzy Match

We iterate through each name variation, perform fuzzy matching against base names, and store the results for further analysis.

In [17]:
results = []
for name, norm in zip(name_variations["Variation"], name_variations["Normalized"]):
    matched_name, score = get_best_match(norm, base_names)
    results.append((name, matched_name, score))


## Results DataFrame

We convert the matching results into a DataFrame for easy viewing and analysis. The DataFrame shows each variation, its matched base name, and the similarity score.

In [18]:
matches_df = pd.DataFrame(results, columns=["Variation_Name", "Matched_Base_Name", "Score"])
matches_df.head(15)

| | Variation_Name | Matched_Base_Name | Score |
|---|---|---|---|
| 0 | Thomas King | Thomas King | 100.0 |
| 1 | ThomasKing | Thomas King | 100.0 |
| 2 | Maria Garcia | Maria Garcia | 100.0 |
| 3 | MaryLewis | Mary Lewis | 100.0 |
| 4 | Nancy W. | | |
| 5 | Dani3l Scott | Daniel Scott | 96.0 |
| 6 | JOHN smith | John Smith | 100.0 |
| 7 | linda johnson | Linda Johnson | 100.0 |
| 8 | N@ncy Wright | Nancy Wright | 96.0 |
| 9 | William Davis | William Davis | 100.0 |


## Results

We evaluate the accuracy of the matching process by comparing predicted matches to the actual base names. This helps assess the effectiveness of the normalization and matching approach.
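
For reference, the accuracy computed in the next cell is simply the share of rows where the predicted base name equals the true one, with missing values on either side replaced by a common "No Match" label. A minimal standard-library sketch follows; the `accuracy` helper and the label lists are hypothetical and not taken from the dataset.

```python
def accuracy(y_true, y_pred, missing="No Match"):
    """Share of positions where prediction equals truth, treating None as `missing`."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")

    def fill(value):
        return missing if value is None else value

    return sum(fill(t) == fill(p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for illustration only
y_true = ["Thomas King", "Nancy Wright", "Daniel Scott", None]
y_pred = ["Thomas King", None, "Daniel Scott", None]
print(accuracy(y_true, y_pred))  # 0.75
```

Under this convention an unmatched variation counts as correct only when the ground truth is also missing, so a threshold that rejects a valid match lowers the accuracy.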

In [19]:
from sklearn.metrics import accuracy_score

y_true = name_variations['Matches_With_Base_Name'].fillna("No Match")
y_pred = matches_df['Matched_Base_Name'].fillna("No Match")

accuracy_score(y_true, y_pred)

0.95