<a href="https://colab.research.google.com/github/charookc5/Text-Semantics-Classification/blob/main/Semantics_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Story 2: The ML Field Classifier
As an HR manager
 I want to upload employee data from different sources
 So that the system automatically identifies email fields, phone numbers, and addresses

---


How it works:
System sees: "john.doe@company.com" in column "Contact Info"
ML model recognizes email pattern → suggests mapping to "email_address"
System sees: "(555) 123-4567" in column "Phone"
ML model recognizes phone pattern → suggests mapping to "phone_number"
No manual mapping rules are needed: the model infers each field type from the values themselves.
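The flow above goes from individual cell values to one suggested mapping per column. A minimal sketch of that aggregation step, assuming a per-cell predictor is available: `predict_field` below is a hypothetical regex stand-in for the trained model, and `suggest_mapping` is an illustrative name, not part of the notebook.

```python
import re
from collections import Counter

PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def predict_field(cell):
    """Hypothetical stand-in for the trained model: returns a target field or None."""
    if "@" in cell:
        return "email_address"
    if PHONE_RE.fullmatch(cell.strip()):
        return "phone_number"
    return None

def suggest_mapping(column_values):
    """Suggest one target field for a column by majority vote over its cells."""
    votes = Counter(predict_field(str(v)) for v in column_values)
    votes.pop(None, None)  # ignore cells the predictor could not place
    if not votes:
        return None
    field, count = votes.most_common(1)[0]
    # Second element: share of all cells in the column that voted for the winner
    return field, count / len(column_values)

phone_column = ["(555) 123-4567", "N/A", "555-987-6543", "123-456-7890", "N/A"]
print(suggest_mapping(phone_column))  # ('phone_number', 0.6)
```

The returned share (3 of 5 cells here) can serve as a rough confidence value when deciding whether to show the suggestion to the user.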

SAMPLE DATASET (jumbled values)

In [7]:
import pandas as pd

data = {
    "Contact Info": ["john.doe@company.com", "(555) 123-4567", "123 Elm Street, NY", "jane_smith@org.net", "987-654-3210"],
    "Phone": ["(555) 123-4567", "N/A", "555-987-6543", "123-456-7890", "N/A"],
    "Address": ["123 Elm Street, NY", "456 Oak Ave, CA", "789 Pine Rd, TX", "321 Maple Blvd, WA", "N/A"]
}

df = pd.DataFrame(data)
df.to_csv("employee_contacts.csv", index=False)
df.head()


,Contact Info,Phone,Address
0,john.doe@company.com,(555) 123-4567,"123 Elm Street, NY"
1,(555) 123-4567,N/A,"456 Oak Ave, CA"
2,"123 Elm Street, NY",555-987-6543,"789 Pine Rd, TX"
3,jane_smith@org.net,123-456-7890,"321 Maple Blvd, WA"
4,987-654-3210,N/A,N/A


1. Regex + ML Classifier (Random Forest)

---

✅ Pros: Fast, interpretable ❌ Cons: Needs labeled data

In [8]:
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def extract_features(cell):
    return {
        "has_at": "@" in cell,
        "has_digits": bool(re.search(r"\d", cell)),
        "has_phone_pattern": bool(re.search(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", cell)),
        "length": len(cell)
    }

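# Flatten row by row so that each cell lines up with the label of the column it came from.
# Labels are derived from column position, so jumbled cells (e.g. a phone number
# under "Contact Info") become mislabeled, noisy training examples.
X = pd.DataFrame([extract_features(str(cell)) for cell in df.values.flatten()])
y = ["email_address", "phone_number", "address"] * len(df)

# Neither the split nor the forest is seeded, so the accuracy varies between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
clf = RandomForestClassifier().fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))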


Accuracy: 0.75
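To make the feature representation concrete, the sketch below runs `extract_features` (copied from the cell above) on one value of each kind. Only the standard library is needed for this step.

```python
import re

def extract_features(cell):
    # Same four hand-crafted features as in the cell above
    return {
        "has_at": "@" in cell,
        "has_digits": bool(re.search(r"\d", cell)),
        "has_phone_pattern": bool(re.search(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", cell)),
        "length": len(cell),
    }

for value in ["john.doe@company.com", "(555) 123-4567", "123 Elm Street, NY"]:
    print(value, "->", extract_features(value))
```

The email is the only value with `has_at` set, the phone number is the only one matching the phone pattern, and the address has digits but neither of the other two flags. With so few features, any value that is not an email or a phone number looks the same to the classifier apart from its length.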


In [9]:
import time

# Choose one sample from the test set
sample = X_test.iloc[0:1]

# Time the prediction
start_time = time.time()
predicted_label = clf.predict(sample)
end_time = time.time()

# Output results
print("Predicted Label:", predicted_label[0])
print("Execution Time for One Sample:", round(end_time - start_time, 6), "seconds")


Predicted Label: phone_number
Execution Time for One Sample: 0.009379 seconds
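A single timed call like the one above can be dominated by one-off overhead, so the figure is only a rough indication. A hedged sketch of a more stable measurement, taking the median over repeated calls; `median_latency` is a hypothetical helper, and the `sorted` call is just a stand-in workload.

```python
import statistics
import time

def median_latency(fn, *args, repeats=200):
    """Median wall-clock seconds per call of fn(*args) over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload; in the notebook this could be median_latency(clf.predict, sample)
latency = median_latency(sorted, list(range(1000)))
print(f"Median latency: {latency:.6f} seconds")
```

The same helper could wrap the prediction calls of the other approaches so that their timings are compared on equal terms.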


2. Transformer-Based Classification (Sentence-BERT)

---
✅ Pros: Handles fuzzy fields like addresses ❌ Cons: Clusters are unlabeled, so each cluster ID still has to be interpreted and mapped to a field type


In [10]:
!pip install -q sentence-transformers

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df.values.flatten())

kmeans = KMeans(n_clusters=3).fit(embeddings)
labels = kmeans.labels_

print("Cluster labels:", labels)


Cluster labels: [2 0 2 0 1 2 2 0 2 2 0 2 0 1 1]
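The cluster IDs carry no meaning until they are lined up with the values they were assigned to. A small sketch of that inspection step, using the labels printed above and the cell values in the row-major order that `df.values.flatten()` produces:

```python
from collections import defaultdict

# Flattened cell values, row by row, as produced by df.values.flatten()
values = [
    "john.doe@company.com", "(555) 123-4567", "123 Elm Street, NY",
    "(555) 123-4567", "N/A", "456 Oak Ave, CA",
    "123 Elm Street, NY", "555-987-6543", "789 Pine Rd, TX",
    "jane_smith@org.net", "123-456-7890", "321 Maple Blvd, WA",
    "987-654-3210", "N/A", "N/A",
]
# Cluster labels copied from the output printed above
labels = [2, 0, 2, 0, 1, 2, 2, 0, 2, 2, 0, 2, 0, 1, 1]

# Group the values by the cluster they were assigned to
clusters = defaultdict(list)
for value, label in zip(values, labels):
    clusters[label].append(value)

for label in sorted(clusters):
    print(label, sorted(set(clusters[label])))
```

In this run, cluster 0 holds the phone numbers, cluster 1 the "N/A" placeholders, and cluster 2 mixes emails with street addresses, so three clusters did not separate emails from addresses. Because the KMeans call above is not seeded, the assignment may differ on a rerun.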


In [11]:
# Step 1: Install and import
!pip install -q sentence-transformers

import time
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Step 2: Sample dataset
data = {
    "Contact Info": ["john.doe@company.com", "(555) 123-4567", "123 Elm Street, NY", "jane_smith@org.net", "987-654-3210"],
    "Phone": ["(555) 123-4567", "N/A", "555-987-6543", "123-456-7890", "N/A"],
    "Address": ["123 Elm Street, NY", "456 Oak Ave, CA", "789 Pine Rd, TX", "321 Maple Blvd, WA", "N/A"]
}
df = pd.DataFrame(data)

# Step 3: Load model and embed original data
model = SentenceTransformer('all-MiniLM-L6-v2')
original_data = df.values.flatten()
embeddings = model.encode(original_data)

# Step 4: Fit KMeans
kmeans = KMeans(n_clusters=3, random_state=42).fit(embeddings)

# Step 5: Time prediction for new data
new_data = ["789 Mango Lane, TX"]
start_time = time.time()
new_embedding = model.encode(new_data)
predicted_cluster = kmeans.predict(new_embedding)
end_time = time.time()

# Step 6: Output
print("Predicted Cluster:", predicted_cluster[0])
print("Execution Time for New Data:", round(end_time - start_time, 6), "seconds")


Predicted Cluster: 0
Execution Time for New Data: 0.023276 seconds


3. Embedding + Vector Search (Qdrant-style)

---

✅ Pros: Ready for LangChain/Qdrant integration ❌ Cons: Needs semantic prototypes for matching

In [16]:
import re
from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Flatten the DataFrame
flat_values = df.values.flatten()

# Embed values
vectors = model.encode(flat_values)

# Define semantic classifier
def classify(cell):
    cell = str(cell)
    if "@" in cell:
        return "email"
    elif re.match(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", cell):
        return "phone"
    elif any(word in cell.lower() for word in ["street", "ave", "blvd", "rd", "lane"]):
        return "address"
    return "unknown"

# Apply semantic tagging
payloads = [{"value": val, "semantic": classify(val)} for val in flat_values]

# Print the tag and value stored alongside each vector
for i, payload in enumerate(payloads):
    print(f"Vector {i}: {payload['semantic']} → {payload['value']}")


Vector 0: email → john.doe@company.com
Vector 1: phone → (555) 123-4567
Vector 2: address → 123 Elm Street, NY
Vector 3: phone → (555) 123-4567
Vector 4: unknown → N/A
Vector 5: address → 456 Oak Ave, CA
Vector 6: address → 123 Elm Street, NY
Vector 7: phone → 555-987-6543
Vector 8: address → 789 Pine Rd, TX
Vector 9: email → jane_smith@org.net
Vector 10: phone → 123-456-7890
Vector 11: address → 321 Maple Blvd, WA
Vector 12: phone → 987-654-3210
Vector 13: unknown → N/A
Vector 14: unknown → N/A
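The tags above come from rules, not from the vectors themselves. A minimal sketch of the prototype matching mentioned in the Cons line: each field type gets one reference vector, and a new vector takes the label of the most similar prototype by cosine similarity. The 3-dimensional vectors and the `min_score` threshold are illustrative assumptions; real vectors would come from `model.encode`, and the search itself would be delegated to a vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy 3-d vectors standing in for sentence embeddings (illustrative values only)
prototypes = {
    "email": [0.9, 0.1, 0.0],
    "phone": [0.1, 0.9, 0.1],
    "address": [0.0, 0.2, 0.9],
}

def nearest_prototype(vector, min_score=0.5):
    """Return (label, score) of the closest prototype, or 'unknown' below min_score."""
    scored = ((name, cosine(vector, proto)) for name, proto in prototypes.items())
    label, score = max(scored, key=lambda pair: pair[1])
    return (label, score) if score >= min_score else ("unknown", score)

label, score = nearest_prototype([0.1, 0.3, 0.8])
print(label)  # address
```

The threshold keeps values that resemble none of the prototypes, such as "N/A", from being forced into a field type.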


In [13]:
import re
import time
from sentence_transformers import SentenceTransformer

# Load model (if not already loaded)
model = SentenceTransformer('all-MiniLM-L6-v2')

# New unknown data point
new_data = ["456 Banana Blvd, CA"]

# Start timing
start_time = time.time()

# Step 1: Embed the new data
new_vector = model.encode(new_data)

# Step 2: Tag the value with the same rules as classify() above.
# The embedding is not consulted here; the tag is the payload that would be
# stored next to the vector.
if "@" in new_data[0]:
    semantic = "email"
elif re.match(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", new_data[0]):
    semantic = "phone"
elif any(word in new_data[0].lower() for word in ["street", "ave", "blvd", "rd", "lane"]):
    semantic = "address"
else:
    semantic = "unknown"

# End timing
end_time = time.time()

# Output
print(" New Data:", new_data[0])
print(" Predicted Semantic Tag:", semantic)
print(" Execution Time:", round(end_time - start_time, 6), "seconds")


 New Data: 456 Banana Blvd, CA
 Predicted Semantic Tag: address
 Execution Time: 0.017206 seconds


4. Rule-Augmented Hybrid

---
✅ Pros: Simple and effective ❌ Cons: Rules can be brittle


In [14]:
def classify(cell):
    if "@" in cell:
        return "email"
    elif re.match(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", cell):
        return "phone"
    elif any(word in cell.lower() for word in ["street", "ave", "blvd", "rd"]):
        return "address"
    return "unknown"

df_flat = pd.DataFrame(df.values.flatten(), columns=["value"])
df_flat["semantic_type"] = df_flat["value"].apply(classify)
df_flat.head()


,value,semantic_type
0,john.doe@company.com,email
1,(555) 123-4567,phone
2,"123 Elm Street, NY",address
3,(555) 123-4567,phone
4,N/A,unknown
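The brittleness named in the Cons line shows up in the address rule, which tests for keywords as plain substrings. The sketch below compares that test with an assumed alternative that matches the same keywords only as whole words; `classify_whole_word` is illustrative and not part of the notebook.

```python
import re

STREET_WORDS = ["street", "ave", "blvd", "rd"]

def classify_substring(cell):
    """Address rule as written above: plain substring test."""
    return "address" if any(word in cell.lower() for word in STREET_WORDS) else "unknown"

# Assumed alternative: match the keywords only as whole words
STREET_RE = re.compile(r"\b(?:street|ave|blvd|rd)\b", re.IGNORECASE)

def classify_whole_word(cell):
    """Address rule with word boundaries around each keyword."""
    return "address" if STREET_RE.search(cell) else "unknown"

for value in ["456 Oak Ave, CA", "Dave Smith", "Richard Ford"]:
    print(value, "|", classify_substring(value), "|", classify_whole_word(value))
```

The substring rule tags all three values as addresses, because "ave" occurs inside "Dave" and "rd" inside "Richard" and "Ford". The whole-word variant keeps the real address and returns "unknown" for the two names.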


In [15]:
import time
import pandas as pd
import re

# Sample data (if not already defined)
data = {
    "Contact Info": ["john.doe@company.com", "(555) 123-4567", "123 Elm Street, NY", "jane_smith@org.net", "987-654-3210"],
    "Phone": ["(555) 123-4567", "N/A", "555-987-6543", "123-456-7890", "N/A"],
    "Address": ["123 Elm Street, NY", "456 Oak Ave, CA", "789 Pine Rd, TX", "321 Maple Blvd, WA", "N/A"]
}
df = pd.DataFrame(data)

# Flatten the data
df_flat = pd.DataFrame(df.values.flatten(), columns=["value"])

# Define classifier
def classify(cell):
    if "@" in cell:
        return "email"
    elif re.match(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", cell):
        return "phone"
    elif any(word in cell.lower() for word in ["street", "ave", "blvd", "rd"]):
        return "address"
    return "unknown"

# Time the classification
start_time = time.time()
df_flat["semantic_type"] = df_flat["value"].apply(classify)
end_time = time.time()

# Output results
print(df_flat.head())
print(" Execution Time for Rule-Based Classification:", round(end_time - start_time, 6), "seconds")


                  value semantic_type
0  john.doe@company.com         email
1        (555) 123-4567         phone
2    123 Elm Street, NY       address
3        (555) 123-4567         phone
4                   N/A       unknown
 Execution Time for Rule-Based Classification: 0.000712 seconds
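The cells in this section apply rules only. One way to read "hybrid" is to run the cheap rules first and hand anything they cannot place to a learned model. A sketch under that assumption: `classify_hybrid` and `demo_fallback` are hypothetical names, and the fallback here is a toy function where a real one could wrap a trained classifier.

```python
import re

PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def classify_hybrid(cell, fallback):
    """Apply cheap rules first; defer to `fallback` (any callable) when none fires."""
    cell = str(cell)
    if "@" in cell:
        return "email", "rule"
    if PHONE_RE.match(cell):
        return "phone", "rule"
    return fallback(cell), "model"

# Hypothetical fallback used only for this demo
def demo_fallback(cell):
    return "address" if any(ch.isdigit() for ch in cell) else "unknown"

for value in ["jane_smith@org.net", "987-654-3210", "789 Mango Lane, TX", "N/A"]:
    print(value, "->", classify_hybrid(value, demo_fallback))
```

Returning the source of each decision ("rule" or "model") alongside the label makes it easier to see which path handled a value when a mapping looks wrong.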
