<a href="https://colab.research.google.com/github/ai-wrangler/BA_sms_LLM/blob/main/SMS_LLM_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SMS Spam Classification with Embeddings and HuggingFace LLM
This Colab-ready notebook recreates the Lab 5 text mining workflow from Weka using Python pipelines and HuggingFace's free Inference API. You'll load the original ARFF dataset, build embedding-based classifiers, invoke HuggingFace models for zero-shot spam detection, and compare the evaluation metrics across approaches.

## How to use this notebook in Google Colab
1. Upload `SMS_LLM_Colab.ipynb` to Colab (File → Upload notebook) or open it from Drive.
2. Runtime → Change runtime type → confirm the runtime uses Python 3.10 or later; a GPU is optional.
3. Prepare the dataset: copy `TextCollection_sms.arff` to your Drive or download it locally so you can upload it when prompted.
4. Get a free HuggingFace API token from https://huggingface.co/settings/tokens (create a "Read" token). Store it securely (`Tools → Secrets` in Colab or `google.colab.userdata`). This notebook expects an environment variable called `HF_API_KEY`.
5. Run the cells in order—each is annotated to match the lab workflow and highlight differences between embeddings and LLM-based classification.

**HuggingFace Free Tier Limits:**
- Rate limit: roughly 1,000 requests/day on the free tier (varies by model)
- Most models support 1,024-2,048 tokens per request
- Some popular models may have lower rate limits during peak usage

### Fixing 'Invalid Notebook' Error on GitHub

To resolve the 'state' key missing error when rendering your notebook on GitHub, you should clear all cell outputs before saving and committing your notebook. Here's how to do it in Google Colab:

1.  **Open your notebook** in Google Colab.
2.  Go to the **'Edit'** menu at the top.
3.  Select **'Clear all outputs'**.
4.  **Save the notebook** (File > Save).
5.  Then, you can **download the `.ipynb` file** and upload it to GitHub, or sync it if you are using Google Drive integration with GitHub.
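
The same cleanup can be scripted outside Colab. The following is a minimal sketch, assuming the rendering error comes from a `metadata.widgets` entry that lacks a `state` key; `strip_notebook` is a hypothetical helper name, and it relies only on the standard `json` module.

```python
import json

def strip_notebook(nb: dict) -> dict:
    """Return a copy of a notebook dict with outputs and widget metadata removed."""
    cleaned = json.loads(json.dumps(nb))  # deep copy via a JSON round-trip
    # Assumption: the widget metadata is what GitHub's renderer rejects.
    cleaned.get("metadata", {}).pop("widgets", None)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

# Hypothetical usage (file name taken from the instructions above):
# with open("SMS_LLM_Colab.ipynb", encoding="utf-8") as fh:
#     nb = json.load(fh)
# with open("SMS_LLM_Colab.ipynb", "w", encoding="utf-8") as fh:
#     json.dump(strip_notebook(nb), fh, indent=1)
```

The function leaves markdown cells untouched and does not modify the dictionary passed in.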

In [1]:
# Install libraries that are not included in the base Colab runtime
%pip install -q pandas numpy scikit-learn seaborn matplotlib sentence-transformers scipy liac-arff huggingface_hub



In [None]:
import json
import os
import random
import re
import time
from pathlib import Path
import arff  # liac-arff (used instead of scipy.io.arff, which cannot read string attributes)
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, precision_score, recall_score
)
from sklearn.model_selection import train_test_split

plt.style.use('seaborn-v0_8-darkgrid')

In [None]:
# Reproducibility helpers
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

## Load the SMS Spam ARFF dataset
The lab uses `TextCollection_sms.arff`. Use one of the cells below to make it available in the Colab filesystem. Uploading via the UI is the quickest path if the file is on your laptop.

In [None]:
# Option A: Mount Google Drive (run this if the ARFF lives in Drive)
from google.colab import drive
drive.mount('/content/drive')
# After mounting, set ARFF_PATH = '/content/drive/MyDrive/path/to/TextCollection_sms.arff'

In [None]:
# Option B: Upload the ARFF manually (run this if the file is on your machine)
from google.colab import files
uploaded = files.upload()
ARFF_PATH = next(iter(uploaded))  # use the first uploaded filename

In [None]:
# If you mounted Drive instead of uploading, set the explicit path here.
# Example: ARFF_PATH = '/content/drive/MyDrive/datasets/TextCollection_sms.arff'
ARFF_PATH = locals().get('ARFF_PATH', 'TextCollection_sms.arff')
print(f'Using dataset located at: {ARFF_PATH}')

In [None]:
# Read the ARFF file into a DataFrame and mirror the original lab schema
# Using liac-arff as scipy.io.arff does not support string attributes
with open(ARFF_PATH, 'r') as arff_file:  # close the file handle once parsing finishes
    arff_data = arff.load(arff_file)
raw_data = arff_data['data']
attributes = arff_data['attributes']
column_names = [attr[0] for attr in attributes]

sms_df = pd.DataFrame(raw_data, columns=column_names)
# liac-arff reads strings directly, so no decoding is needed
sms_df = sms_df.rename(columns={'Text': 'message', 'class-att': 'label'})
sms_df['label'] = sms_df['label'].map({'0': 'ham', '1': 'spam'})
sms_df['char_len'] = sms_df['message'].str.len()
sms_df.head()

In [None]:
# Quick class balance check
ax = sms_df['label'].value_counts().sort_index().plot(kind='bar', color=['#4C72B0', '#DD8452'])
ax.set(title='Class distribution', xlabel='Label', ylabel='Count')
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom')
plt.show()
sms_df.describe(include='all')

In [None]:
# Train/test split mirroring the lab evaluation
X_train, X_test, y_train, y_test = train_test_split(
    sms_df['message'],
    sms_df['label'],
    test_size=0.2,
    stratify=sms_df['label'],
    random_state=RANDOM_STATE
)
print(f'Train set: {X_train.shape[0]} messages | Test set: {X_test.shape[0]} messages')

In [None]:
# Shared evaluation helpers for classical models and the LLM
results = []

def capture_metrics(name: str, y_true, y_pred) -> pd.Series:
    metrics = {
        'model': name,
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, pos_label='spam'),
        'recall': recall_score(y_true, y_pred, pos_label='spam'),
        'f1': f1_score(y_true, y_pred, pos_label='spam')
    }
    results.append(metrics)
    print(json.dumps(metrics, indent=2))
    return pd.Series(metrics)

def plot_confusion(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

## Baseline 1: TF–IDF + Logistic Regression
Replicates the bag-of-words style features typically explored in Weka's text mining lab.
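
To make the weighting concrete, here is a small standard-library sketch of the smoothed TF–IDF formula that scikit-learn's `TfidfVectorizer` applies by default: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. The three-message corpus and the `tfidf_weights` helper are made up for illustration and are not part of the lab data; the sketch also ignores stop words, `min_df`, and bigrams.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute smoothed, L2-normalised TF-IDF weights for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    weighted = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(value ** 2 for value in raw.values())) or 1.0
        weighted.append({term: value / norm for term, value in raw.items()})
    return weighted

toy_corpus = ["free entry now", "call me now", "free prize claim now"]  # hypothetical messages
weights = tfidf_weights(toy_corpus)
# 'now' appears in every message, so it is weighted below rarer terms such as 'entry'.
print(weights[0])
```

Terms that occur in many messages are down-weighted, which is why rare, spam-specific phrases tend to dominate the features passed to the classifier.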

In [None]:
tfidf = TfidfVectorizer(lowercase=True, stop_words='english', min_df=3, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

bow_clf = LogisticRegression(max_iter=200, random_state=RANDOM_STATE, n_jobs=None)
bow_clf.fit(X_train_tfidf, y_train)
bow_preds = bow_clf.predict(X_test_tfidf)
capture_metrics('TFIDF + LogisticRegression', y_test, bow_preds)
print(classification_report(y_test, bow_preds))
plot_confusion(y_test, bow_preds, 'TF-IDF Logistic Regression Confusion Matrix')

## Baseline 2: SentenceTransformer Embeddings + Logistic Regression
Uses a semantic embedding (MiniLM) to capture contextual similarity beyond word frequencies.
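
Embedding-based classifiers rely on messages with similar meaning landing close together in vector space, commonly measured with cosine similarity. The sketch below illustrates the measure with hand-written three-dimensional vectors. Real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the toy values and variable names here are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any model)
prize_offer = [0.9, 0.1, 0.0]
cash_reward = [0.8, 0.2, 0.1]
lunch_plan = [0.0, 0.2, 0.9]

print(cosine_similarity(prize_offer, cash_reward))  # close to 1: similar direction
print(cosine_similarity(prize_offer, lunch_plan))   # close to 0: nearly orthogonal
```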

In [None]:
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
X_train_emb = embedder.encode(X_train.tolist(), show_progress_bar=True, batch_size=128)
X_test_emb = embedder.encode(X_test.tolist(), show_progress_bar=True, batch_size=128)

embed_clf = LogisticRegression(max_iter=500, random_state=RANDOM_STATE)
embed_clf.fit(X_train_emb, y_train)
embed_preds = embed_clf.predict(X_test_emb)
capture_metrics('MiniLM Embeddings + LogisticRegression', y_test, embed_preds)
print(classification_report(y_test, embed_preds))
plot_confusion(y_test, embed_preds, 'MiniLM Logistic Regression Confusion Matrix')

## Configure HuggingFace for LLM-based Zero/Few-shot Classification
You need a free HuggingFace API token from https://huggingface.co/settings/tokens. In Colab you can store it via `Tools → Secrets` and retrieve it with `google.colab.userdata.get('HF_API_KEY')`. Alternatively, set `os.environ['HF_API_KEY']` manually (just avoid hard-coding secrets in plain text).

**Available Free Models:**
- `meta-llama/Llama-3.2-3B-Instruct` - Fast and efficient for classification
- `microsoft/Phi-3-mini-4k-instruct` - Compact model, good for simple tasks
- `mistralai/Mistral-7B-Instruct-v0.3` - Balanced performance
- `HuggingFaceH4/zephyr-7b-beta` - Good for instruction following

**Rate Limits:** Free tier allows ~1,000 requests/day. We'll use a smaller sample size (100-200 messages) to stay within limits.

In [None]:
import os
from huggingface_hub import InferenceClient

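# Look for the token under the documented name: environment first, then Colab secrets
HF_API_KEY = os.environ.get('HF_API_KEY')
if HF_API_KEY is None:
    try:
        from google.colab import userdata
        HF_API_KEY = userdata.get('HF_API_KEY')
    except Exception:
        # Not running in Colab, or the secret is missing or inaccessible
        pass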

if not HF_API_KEY:
    raise ValueError('Missing HuggingFace API key. Set HF_API_KEY via Colab secrets or environment variables before continuing.')

# Using Phi-3-mini for fast inference and good performance on classification tasks
HF_MODEL = 'microsoft/Phi-3-mini-4k-instruct'
client = InferenceClient(token=HF_API_KEY)
print(f'HuggingFace Inference API ready with model: {HF_MODEL}')
print('Note: Free tier limit is ~1,000 requests/day. Adjust LLM_SAMPLE_SIZE accordingly.')

### Alternative HuggingFace Models for Spam Detection

You can try different models by changing the `HF_MODEL` variable. Here are recommended free-tier options:

**Recommended Models:**
1. **`microsoft/Phi-3-mini-4k-instruct`** (Default)
   - Size: 3.8B parameters
   - Speed: Fast (~1-2s per request)
   - Best for: Quick classification tasks
   - Context: 4k tokens

2. **`mistralai/Mistral-7B-Instruct-v0.3`**
   - Size: 7B parameters
   - Speed: Moderate (~2-3s per request)
   - Best for: Better accuracy on nuanced messages
   - Context: 8k tokens

3. **`HuggingFaceH4/zephyr-7b-beta`**
   - Size: 7B parameters
   - Speed: Moderate (~2-3s per request)
   - Best for: Instruction following
   - Context: 8k tokens

4. **`meta-llama/Llama-3.2-3B-Instruct`**
   - Size: 3B parameters
   - Speed: Very fast (~1s per request)
   - Best for: Quick responses, good quality
   - Context: 8k tokens

**Free Tier Limits:**
- **Rate Limit**: ~1,000 requests per day
- **Token Limit**: Varies by model (typically 1,024-4,096 tokens per request)
- **Concurrent Requests**: 1-2 at a time
- Monitor usage at: https://huggingface.co/settings/tokens

In [None]:
# Optional: Test your HuggingFace API connection and estimate usage
def test_hf_connection():
    """Test the HuggingFace API and estimate daily quota usage."""
    try:
        test_message = "Win a free iPhone now! Click here!"
        print("Testing HuggingFace API connection...")
        result = classify_with_hf(test_message)
        print(f"✓ API connection successful!")
        print(f"✓ Test classification: '{test_message}' -> {result}")
        print(f"\nDaily quota estimate:")
        print(f"  - LLM_SAMPLE_SIZE: {LLM_SAMPLE_SIZE} messages")
        print(f"  - Estimated API calls: {LLM_SAMPLE_SIZE}")
        print(f"  - Free tier limit: ~1,000 requests/day")
        print(f"  - Remaining quota: ~{1000 - LLM_SAMPLE_SIZE} requests")
        print(f"  - Estimated time: {(LLM_SAMPLE_SIZE * REQUEST_DELAY / 60):.1f} minutes")
        return True
    except Exception as e:
        print(f"✗ API connection failed: {e}")
        print("\nTroubleshooting:")
        print("  1. Check your HF_API_KEY is valid")
        print("  2. Verify at https://huggingface.co/settings/tokens")
        print("  3. Ensure the token has 'Read' permissions")
        print("  4. Check if you've hit daily rate limits")
        return False

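# Uncomment to test your connection. Run the "LLM Inference Loop" cell below first:
# it defines classify_with_hf, LLM_SAMPLE_SIZE, and REQUEST_DELAY, which this test needs.
# test_hf_connection()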

## LLM Inference Loop
HuggingFace free tier has rate limits (~1,000 requests/day), so we evaluate on a smaller stratified subset of the held-out test set (default 100 messages). Adjust `LLM_SAMPLE_SIZE` based on your daily quota needs.

In [None]:
# Reduced sample size to respect free tier limits (1,000 requests/day)
LLM_SAMPLE_SIZE = 100  # Adjust based on your remaining daily quota
llm_eval_df = (
    pd.concat([X_test.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)
    .rename(columns={'message': 'text', 'label': 'label'})
)
if LLM_SAMPLE_SIZE and LLM_SAMPLE_SIZE < len(llm_eval_df):
    # Stratified subsample. train_test_split keeps the 'label' column, whereas
    # groupby.apply(..., include_groups=False) would drop it before the metrics step.
    llm_eval_df, _ = train_test_split(
        llm_eval_df,
        train_size=LLM_SAMPLE_SIZE,
        stratify=llm_eval_df['label'],
        random_state=RANDOM_STATE
    )
    llm_eval_df = llm_eval_df.reset_index(drop=True)

system_prompt = (
    "You are a strict SMS spam filter. Respond with ONLY one word: "
    "either 'Spam' for unsolicited or fraudulent messages, or 'Ham' for regular "
    "personal/business messages. Do not explain or add any other text."
)

def classify_with_hf(text: str, retry: int = 3, backoff: float = 3.0) -> str:
    """
    Classify SMS message using HuggingFace Inference API.
    Uses text_generation method which is more widely supported.
    """
    # Combine system prompt and user message for text generation models
    full_prompt = f"{system_prompt}\n\nMessage: \"{text}\"\nClassification:"
    
    for attempt in range(retry):
        try:
            # Use text_generation instead of chat_completion for better compatibility
            response = client.text_generation(
                prompt=full_prompt,
                model=HF_MODEL,
                max_new_tokens=10,  # We only need one word
                temperature=0.1,  # Low temperature for consistent classification
                return_full_text=False
            )
            
            raw = response.strip().lower()
            
            # Parse response - handle various formats
            if 'spam' in raw and 'ham' in raw:
                raw = raw.split()[0]
            if 'spam' in raw:
                return 'spam'
            if 'ham' in raw:
                return 'ham'
                
        except Exception as error:
            error_str = str(error).lower()
            # Handle rate limiting specifically
            if 'rate limit' in error_str or '429' in error_str:
                wait_time = backoff * (2 ** attempt)  # Exponential backoff
                print(f'Rate limit hit. Waiting {wait_time:.1f}s before retry...')
                time.sleep(wait_time)
            elif attempt == retry - 1:
                print(f'LLM classification failed after retries: {error}')
                return 'ham'  # Default to ham on failure
            else:
                time.sleep(backoff * (attempt + 1))
    
    return 'ham'

# Add small delay between requests to avoid rate limiting
REQUEST_DELAY = 0.5  # seconds between requests

llm_predictions = []
print(f"Processing {len(llm_eval_df)} messages with {REQUEST_DELAY}s delay between requests...")
print(f"Estimated minimum time (delays only, excluding API latency): {(len(llm_eval_df) * REQUEST_DELAY / 60):.1f} minutes\n")

for idx, row in llm_eval_df.iterrows():
    prediction = classify_with_hf(row['text'])
    llm_predictions.append(prediction)
    
    # Small delay to respect rate limits
    if idx < len(llm_eval_df) - 1:
        time.sleep(REQUEST_DELAY)
    
    if (idx + 1) % 10 == 0 or idx + 1 == len(llm_eval_df):
        print(f"Processed {idx + 1}/{len(llm_eval_df)} messages")

llm_eval_df['prediction'] = llm_predictions

capture_metrics(f'{HF_MODEL} (LLM zero-shot)', llm_eval_df['label'], llm_eval_df['prediction'])
print(classification_report(llm_eval_df['label'], llm_eval_df['prediction']))
plot_confusion(llm_eval_df['label'], llm_eval_df['prediction'], f'{HF_MODEL.split("/")[1]} Confusion Matrix (Sample)')
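
The substring checks in `classify_with_hf` treat any reply containing "spam" as spam, including a reply such as "not spam". A stricter parser is one possible refinement. The sketch below is an assumption about how that could look; `parse_label` is a hypothetical helper that is not used elsewhere in this notebook. It returns `None` for unparseable replies so the caller can decide whether to retry or fall back.

```python
import re
from typing import Optional

def parse_label(reply: str) -> Optional[str]:
    """Map a free-text model reply to 'spam', 'ham', or None if no label is found."""
    text = reply.strip().lower()
    # Flip explicitly negated labels such as "not spam" or "not ham".
    negated = re.search(r"\bnot\s+(spam|ham)\b", text)
    if negated:
        return "ham" if negated.group(1) == "spam" else "spam"
    # Otherwise take the first whole-word label that appears.
    match = re.search(r"\b(spam|ham)\b", text)
    return match.group(1) if match else None

print(parse_label("Spam."))             # spam
print(parse_label("  ham\n"))           # ham
print(parse_label("This is not spam"))  # ham
print(parse_label("unsure"))            # None
```

Word boundaries keep unrelated words such as "hamper" from being read as a label.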

In [None]:
results_df = pd.DataFrame(results)
results_df.sort_values('f1', ascending=False).reset_index(drop=True)

### Observations
* **TF–IDF + Logistic Regression** mirrors the original Weka text-mining pipeline and usually delivers high recall on overt spam phrases such as "free entry" or "claim now".
* **MiniLM embeddings** capture semantics and can reduce false positives on nuanced ham, at the cost of downloading the encoder and adding encoding latency.
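* **HuggingFace LLM** (Phi-3-mini) needs no training data but is subject to free-tier rate limits (~1,000 requests/day); it can perform well on context-heavy messages with nuanced spam patterns, although its metrics here come from a small sample and are not directly comparable to the full-test-set baselines.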
* **Rate Limit Management**: We use smaller sample sizes (100 messages) and add delays between requests to stay within free tier limits.
* Hybrid scoring (e.g., fall back to HuggingFace LLM when the classical models disagree) is a strong extension for future lab work.
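
The hybrid idea in the last bullet can be sketched as follows. This is an illustrative outline rather than part of the lab: `hybrid_predict` is a hypothetical helper, and `llm_classify` stands for any callable that maps a message to `'spam'` or `'ham'` (for example `classify_with_hf`). The LLM is consulted only when the two classical models disagree, which keeps API usage low.

```python
from typing import Callable, List, Sequence, Tuple

def hybrid_predict(
    texts: Sequence[str],
    preds_a: Sequence[str],
    preds_b: Sequence[str],
    llm_classify: Callable[[str], str],
) -> Tuple[List[str], int]:
    """Keep agreed labels; ask the LLM only where the two models disagree."""
    final, llm_calls = [], 0
    for text, a, b in zip(texts, preds_a, preds_b):
        if a == b:
            final.append(a)
        else:
            final.append(llm_classify(text))
            llm_calls += 1
    return final, llm_calls

# Stand-in classifier so the sketch runs without network access (an assumption).
def fake_llm(text: str) -> str:
    return "spam" if "free" in text.lower() else "ham"

labels, calls = hybrid_predict(
    ["Free prize inside", "See you at 5", "Free for lunch?"],
    ["spam", "ham", "spam"],
    ["spam", "ham", "ham"],
    fake_llm,
)
print(labels, calls)  # ['spam', 'ham', 'spam'] 1
```

In this notebook the inputs would correspond to `X_test`, `bow_preds`, and `embed_preds`, with the returned call count indicating how much of the daily quota the fallback consumes.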

## Optional: Export Artifacts
If you want to retain the evaluation outputs in Drive, run the cell below and then use the Colab file browser or `drive.mount` to move the CSVs.

In [None]:
results_df.to_csv('spam_lab_results_summary.csv', index=False)
llm_eval_df.to_csv('spam_lab_llm_predictions.csv', index=False)
print('Artifacts saved locally. Upload to Drive if you need persistent storage.')

## Next Steps
1. Try different HuggingFace models like `mistralai/Mistral-7B-Instruct-v0.3` or `HuggingFaceH4/zephyr-7b-beta` to compare quality/speed trade-offs.
2. Add few-shot examples to the system prompt to improve performance on shorthand or code-mixed spam.
3. Experiment with alternative embedding models (`all-mpnet-base-v2`, fastText) and blend their scores with the LLM for ensemble voting.
4. Monitor your HuggingFace API usage at https://huggingface.co/settings/tokens to track daily limits.
5. Consider upgrading to HuggingFace Pro ($9/month) for higher rate limits if you need to process more messages.
6. Implement batching or caching strategies to minimize API calls while maximizing evaluation coverage.
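
For the caching idea in step 6, one simple option is to memoise predictions by normalised message text so that repeated messages cost only one API call. The sketch below is a minimal illustration under that assumption; `CachedClassifier` is a hypothetical wrapper, and the stand-in classifier replaces the real API call.

```python
from typing import Callable, Dict

class CachedClassifier:
    """Wrap a text classifier and reuse results for repeated messages."""

    def __init__(self, classify: Callable[[str], str]) -> None:
        self._classify = classify
        self._cache: Dict[str, str] = {}
        self.api_calls = 0

    def __call__(self, text: str) -> str:
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key not in self._cache:
            self._cache[key] = self._classify(text)
            self.api_calls += 1
        return self._cache[key]

# Stand-in for classify_with_hf so the example runs offline (an assumption).
def fake_classify(text: str) -> str:
    return "spam" if "win" in text.lower() else "ham"

cached = CachedClassifier(fake_classify)
messages = ["WIN cash now", "win  cash now", "Running late"]
print([cached(m) for m in messages], cached.api_calls)  # ['spam', 'spam', 'ham'] 2
```

Wrapping `classify_with_hf` this way would leave the inference loop unchanged while exposing `api_calls` for quota tracking.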