
# Comparing Rule-Based NLP vs. ML-Based NLP
### Problem Statement

In this lab, you will compare two approaches to text classification:

- Rule-Based NLP: Using handcrafted rules and lexicons to classify text.

- ML-Based NLP: Training a machine learning model (Naïve Bayes) to classify the same data.

You will use a small SMS Spam dataset to:

- Implement a rule-based classifier that flags messages as “spam” if they contain certain keywords.

- Train and evaluate a scikit-learn Naïve Bayes classifier on the same dataset.

- Compare accuracy, precision, recall, and F1-score for both approaches.

- Analyze the advantages and limitations of each method.

In [1]:
# 1. Setup
# Install dependencies
!pip install --quiet nltk scikit-learn pandas

# Download NLTK stopwords (optional here: CountVectorizer below uses scikit-learn's built-in English stop-word list)
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# 2. Load Dataset
import pandas as pd

# Create SMS dataset
data = [
    ("ham", "Hey, are we still on for lunch today?"),
    ("spam", "URGENT! You've won $1000! Click here now!"),
    ("ham", "Can you pick up milk on your way home?"),
    ("spam", "FREE iPhone! Limited time offer! Call now!"),
    ("ham", "Meeting moved to 3pm tomorrow"),
    ("spam", "Congratulations! You've been selected for a special offer!"),
    ("ham", "Thanks for the birthday wishes!"),
    ("spam", "SALE ALERT: 90% off everything! Don't miss out!"),
    ("ham", "Running late, be there in 10 minutes"),
    ("spam", "You owe $500 in taxes. Pay immediately or face legal action!")
]

df = pd.DataFrame(data, columns=['label','message'])
df.head()


Unnamed: 0,label,message
0,ham,"Hey, are we still on for lunch today?"
1,spam,URGENT! You've won $1000! Click here now!
2,ham,Can you pick up milk on your way home?
3,spam,FREE iPhone! Limited time offer! Call now!
4,ham,Meeting moved to 3pm tomorrow


In [3]:
# 3. Rule-Based Classifier
import re

# Define spam keywords
spam_keywords = {'urgent', 'free', 'offer', 'sale', 'click', 'congratulations', 'winner', 'won', 'alert'}

def rule_based_classifier(text):
    tokens = re.findall(r'\b\w+\b', text.lower())
    return 'spam' if any(word in spam_keywords for word in tokens) else 'ham'

# Apply rule-based classifier
df['pred_rule'] = df['message'].apply(rule_based_classifier)
df

Unnamed: 0,label,message,pred_rule
0,ham,"Hey, are we still on for lunch today?",ham
1,spam,URGENT! You've won $1000! Click here now!,spam
2,ham,Can you pick up milk on your way home?,ham
3,spam,FREE iPhone! Limited time offer! Call now!,spam
4,ham,Meeting moved to 3pm tomorrow,ham
5,spam,Congratulations! You've been selected for a sp...,spam
6,ham,Thanks for the birthday wishes!,ham
7,spam,SALE ALERT: 90% off everything! Don't miss out!,spam
8,ham,"Running late, be there in 10 minutes",ham
9,spam,You owe $500 in taxes. Pay immediately or face...,ham
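Row 9 is the only rule-based error in the table above. A small diagnostic lists which keywords fired for a message and makes the miss easy to see. This is a sketch only: `matched_keywords` is a hypothetical helper, not part of the original notebook.

```python
import re

# Same keyword set as the rule-based classifier above
spam_keywords = {'urgent', 'free', 'offer', 'sale', 'click',
                 'congratulations', 'winner', 'won', 'alert'}

def matched_keywords(text):
    """Return the sorted spam keywords found in `text` (hypothetical helper)."""
    tokens = set(re.findall(r'\b\w+\b', text.lower()))
    return sorted(tokens & spam_keywords)

print(matched_keywords("URGENT! You've won $1000! Click here now!"))
# ['click', 'urgent', 'won']
print(matched_keywords("You owe $500 in taxes. Pay immediately or face legal action!"))
# []
```

The tax message contains none of the listed keywords, so the rule falls through to `'ham'`.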


In [4]:
# 4. ML-Based Classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.3, random_state=42)

# Vectorize text
vec = CountVectorizer(stop_words='english')
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

# Train Naïve Bayes
clf = MultinomialNB().fit(X_train_vec, y_train)
df.loc[X_test.index, 'pred_ml'] = clf.predict(X_test_vec)
df

Unnamed: 0,label,message,pred_rule,pred_ml
0,ham,"Hey, are we still on for lunch today?",ham,
1,spam,URGENT! You've won $1000! Click here now!,spam,ham
2,ham,Can you pick up milk on your way home?,ham,
3,spam,FREE iPhone! Limited time offer! Call now!,spam,
4,ham,Meeting moved to 3pm tomorrow,ham,
5,spam,Congratulations! You've been selected for a sp...,spam,spam
6,ham,Thanks for the birthday wishes!,ham,
7,spam,SALE ALERT: 90% off everything! Don't miss out!,spam,
8,ham,"Running late, be there in 10 minutes",ham,ham
9,spam,You owe $500 in taxes. Pay immediately or face...,ham,
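The empty `pred_ml` cells are the seven training rows; only the three held-out rows receive a prediction. One way to investigate a misclassified held-out message is to check how many of its tokens the vectorizer actually saw during training. The sketch below repeats the split with the same settings; the `coverage` dictionary is an illustrative name, not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

data = [
    ("ham", "Hey, are we still on for lunch today?"),
    ("spam", "URGENT! You've won $1000! Click here now!"),
    ("ham", "Can you pick up milk on your way home?"),
    ("spam", "FREE iPhone! Limited time offer! Call now!"),
    ("ham", "Meeting moved to 3pm tomorrow"),
    ("spam", "Congratulations! You've been selected for a special offer!"),
    ("ham", "Thanks for the birthday wishes!"),
    ("spam", "SALE ALERT: 90% off everything! Don't miss out!"),
    ("ham", "Running late, be there in 10 minutes"),
    ("spam", "You owe $500 in taxes. Pay immediately or face legal action!"),
]
df = pd.DataFrame(data, columns=["label", "message"])

X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.3, random_state=42)

# Fit the vocabulary on the training messages only
vec = CountVectorizer(stop_words="english").fit(X_train)
analyzer = vec.build_analyzer()  # same tokenization and stop-word removal as `vec`

# For each held-out message, split its tokens into seen vs. unseen in training
coverage = {}
for idx, msg in X_test.items():
    tokens = analyzer(msg)
    coverage[idx] = {
        "known": [t for t in tokens if t in vec.vocabulary_],
        "unknown": [t for t in tokens if t not in vec.vocabulary_],
    }
    print(idx, coverage[idx])
```

Tokens listed under `unknown` contribute nothing to the prediction, because `vec.transform` silently drops words that are not in the training vocabulary.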


In [5]:
# 5. Evaluation
from sklearn.metrics import classification_report

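# Note: the rule-based report covers all 10 messages (the rules need no training data),
# while the ML report covers only the 3 held-out test messages.
# target_names follows the alphabetical label order used by scikit-learn: ham, spam.
print("Rule-Based Classifier:")
print(classification_report(df['label'], df['pred_rule'], target_names=["ham", "spam"]))

print("ML-Based Classifier:")
print(classification_report(df.loc[X_test.index, 'label'], df.loc[X_test.index, 'pred_ml'], target_names=["ham", "spam"]))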


Rule-Based Classifier:
              precision    recall  f1-score   support

         ham       0.83      1.00      0.91         5
        spam       1.00      0.80      0.89         5

    accuracy                           0.90        10
   macro avg       0.92      0.90      0.90        10
weighted avg       0.92      0.90      0.90        10

ML-Based Classifier:
              precision    recall  f1-score   support

         ham       0.50      1.00      0.67         1
        spam       1.00      0.50      0.67         2

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3
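The two reports above are computed on different rows: all 10 messages for the rules, 3 held-out messages for Naïve Bayes. A like-for-like comparison scores both classifiers on the same held-out rows. The sketch below hard-codes the labels and predictions from the DataFrame output shown earlier (rows 1, 5 and 8) so that it runs on its own; in the notebook you would index `df` with `X_test.index` instead. `spam_scores` is a hypothetical helper name.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Held-out rows 1, 5, 8 as displayed in the DataFrame output above
y_true    = ['spam', 'spam', 'ham']
pred_rule = ['spam', 'spam', 'ham']
pred_ml   = ['ham',  'spam', 'ham']

def spam_scores(y_true, y_pred):
    """Accuracy plus precision/recall/F1 for the 'spam' class."""
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label='spam', average='binary', zero_division=0)
    return {'accuracy': acc, 'precision': p, 'recall': r, 'f1': f1}

for name, pred in [('rule-based', pred_rule), ('naive bayes', pred_ml)]:
    print(name, spam_scores(y_true, pred))
```

On these three rows the rule-based predictions are all correct, while Naïve Bayes misses one of the two spam messages. Three messages are far too few to draw a general conclusion from.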



## 6. Analysis Questions
- Which classifier achieved higher overall accuracy on the test set?

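Ans: The rule-based classifier. On the three held-out test messages (rows 1, 5 and 8), every rule-based prediction in the table above is correct, whereas the Naïve Bayes model gets 2 of 3 right (accuracy 0.67). Note that the printed rule-based report (accuracy 0.90) covers all 10 messages, so it is not a like-for-like comparison with the ML report.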

- Compare precision and recall for the “spam” class in both approaches. Which method better balances false positives vs. false negatives?

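Ans: Rule-Based: spam precision 1.00 and recall 0.80 over all 10 messages. It produces no false positives and one false negative (the tax message in row 9, which contains none of the keywords).
ML-Based: spam precision 1.00 and recall 0.50 on the 3 test messages. It also produces no false positives, but it misses one of the two spam messages (row 1).

Conclusion: Neither method flags any ham as spam here. The rule-based classifier misses less spam, so it balances false positives vs. false negatives better on this dataset, although a 3-message test set is too small to generalize from.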
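The spam-class figures in the classification reports can be reproduced by hand from the confusion counts. A minimal sketch, with the counts read off the prediction tables above (`prf` is a hypothetical helper name):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Rule-based, all 10 messages: 4 spam caught, 0 ham flagged, 1 spam missed (row 9)
print([round(v, 2) for v in prf(tp=4, fp=0, fn=1)])  # [1.0, 0.8, 0.89]

# Naive Bayes, 3 test messages: 1 spam caught, 0 ham flagged, 1 spam missed (row 1)
print([round(v, 2) for v in prf(tp=1, fp=0, fn=1)])  # [1.0, 0.5, 0.67]
```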

- What are the strengths and weaknesses of rule-based vs. ML-based classification? Provide examples from your results.

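| Classifier | Strengths                                                                                        | Weaknesses                                                                                                                                      |
| ---------- | ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| Rule-Based | - Simple to implement<br>- Explainable<br>- No training required                                 | - Misses unseen variations (e.g. the tax message in row 9)<br>- Rigid rules<br>- Keyword matches ignore context, which risks false positives    |
| ML-Based   | - Learns word weights from data<br>- No hand-maintained keyword list<br>- Improves with more data | - Needs labeled training data<br>- May misclassify if undertrained (e.g. row 1, after training on only 7 messages)                              |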
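How much the ML model suffers from too little training data is hard to judge from a three-message test set. One option, sketched here as an assumption rather than as part of the original lab, is leave-one-out cross-validation: each message is predicted by a model trained on the other nine, so all ten messages contribute to the estimate.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = [
    ("ham", "Hey, are we still on for lunch today?"),
    ("spam", "URGENT! You've won $1000! Click here now!"),
    ("ham", "Can you pick up milk on your way home?"),
    ("spam", "FREE iPhone! Limited time offer! Call now!"),
    ("ham", "Meeting moved to 3pm tomorrow"),
    ("spam", "Congratulations! You've been selected for a special offer!"),
    ("ham", "Thanks for the birthday wishes!"),
    ("spam", "SALE ALERT: 90% off everything! Don't miss out!"),
    ("ham", "Running late, be there in 10 minutes"),
    ("spam", "You owe $500 in taxes. Pay immediately or face legal action!"),
]
labels = np.array([label for label, _ in data])
messages = np.array([msg for _, msg in data], dtype=object)

# The pipeline refits the vectorizer inside every fold, so no test
# vocabulary leaks into training.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

# One prediction per message, each made by a model trained on the other nine
loo_pred = cross_val_predict(model, messages, labels, cv=LeaveOneOut())
print("Leave-one-out accuracy:", accuracy_score(labels, loo_pred))
```

With only ten messages the estimate is still noisy, but it avoids resting the comparison on a single three-message split.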


- How would performance change if you expanded the keyword list? How does ML-based handle unseen patterns?

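A) Rule-based recall would likely improve, because more spam phrasings would match. Precision may drop at the same time, because every added keyword can also appear in legitimate messages and cause false positives.

B) The Naïve Bayes model learns per-class word frequencies from the training data instead of relying on a fixed list, so it can flag a message whose words co-occurred with spam in training even if no handpicked keyword is present. Words it never saw during training are ignored at prediction time, so entirely new phrasing is handled only as well as the training vocabulary allows.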
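The recall/precision trade-off of a longer keyword list can be illustrated with a short sketch. The added keywords and the second message are hypothetical, chosen only for illustration; `classify` mirrors `rule_based_classifier` but takes the keyword set as a parameter.

```python
import re

base_keywords = {'urgent', 'free', 'offer', 'sale', 'click',
                 'congratulations', 'winner', 'won', 'alert'}
# Hypothetical additions aimed at the missed tax message
expanded_keywords = base_keywords | {'owe', 'pay', 'taxes'}

def classify(text, keywords):
    """Flag `text` as spam if any token is in `keywords`."""
    tokens = re.findall(r'\b\w+\b', text.lower())
    return 'spam' if any(t in keywords for t in tokens) else 'ham'

tax_spam = "You owe $500 in taxes. Pay immediately or face legal action!"
new_ham = "Can you pay me back for lunch?"  # hypothetical ham, not in the dataset

for msg in (tax_spam, new_ham):
    print(classify(msg, base_keywords), '->', classify(msg, expanded_keywords), '|', msg)
# ham -> spam | You owe $500 in taxes. Pay immediately or face legal action!
# ham -> spam | Can you pay me back for lunch?
```

The expanded list recovers the missed spam message, but the same keyword `pay` now turns a harmless message into a false positive.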




## Solution Summary
- **Rule-Based** is straightforward to implement and explain but highly dependent on keyword coverage and fails on unseen spam phrasing.

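- **ML-Based** learns word statistics from labeled examples instead of a fixed list and tends to improve as more data is added, but it requires labeled data, feature extraction, and training. On a dataset this small it can misclassify messages whose words were rare or absent in training, as it did with row 1.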