In [None]:
# ===============================
# Task 5: Auto Tagging Support Tickets Using LLM
# Zero-Shot + Few-Shot with HuggingFace
# ===============================

!pip install transformers accelerate datasets -q

import os
import pandas as pd
from transformers import pipeline

# ---------------------------
# Step 1: Load Dataset
# ---------------------------
df = pd.read_csv("customer_support_tickets.csv")
print("Columns in CSV:", df.columns.tolist())

# ---------------------------
# Step 2: Define Candidate Categories
# ---------------------------
candidate_labels = ["Login Issue", "Payment Issue", "Bug", "Feature Request", "Other"]

# ---------------------------
# Step 3: Initialize Zero-Shot Classifier
# ---------------------------
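import torch  # PyTorch backend used by the pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",  # facebook/bart-large-mnli is a faster, English-only option
    device=0 if torch.cuda.is_available() else -1  # GPU 0 if available, otherwise CPU
)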

# ---------------------------
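# Step 4: Resume if Checkpoint Exists
# Note: Steps 6-7 still classify every ticket on each run; the checkpoint
# only restores rows that were already collected in Step 8.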
# ---------------------------
checkpoint_file = "classified_partial.csv"
results = []

if os.path.exists(checkpoint_file):
    print(f"🔄 Resuming from checkpoint: {checkpoint_file}")
    results_df = pd.read_csv(checkpoint_file)
    processed_ids = set(results_df["ticket_id"].tolist())
    results = results_df.to_dict("records")
else:
    processed_ids = set()

# ---------------------------
# Step 5: Prepare Ticket Texts
# ---------------------------
texts = [
    f"{row['Ticket Subject']} - {row['Ticket Description']}"
    for _, row in df.iterrows()
]

# ---------------------------
# Step 6: Zero-Shot (batch all at once)
# ---------------------------
print("⚡ Running Zero-Shot Classification...")
zs_outputs = classifier(texts, candidate_labels, multi_label=True, batch_size=16)

# ---------------------------
# Step 7: Few-Shot Prompts
# ---------------------------
few_shot_prompts = [
    (
        "Example 1:\nTicket: 'I can’t log in to my account.'\nAnswer: Login Issue (0.85), Other (0.15)\n"
        "Example 2:\nTicket: 'My payment failed.'\nAnswer: Payment Issue (0.80), Other (0.20)\n"
        "Example 3:\nTicket: 'App crashes when uploading file.'\nAnswer: Bug (0.75), Feature Request (0.15), Other (0.10)\n"
        f"Now classify this ticket:\nTicket: '{ticket_text}'\nAnswer:"
    )
    for ticket_text in texts
]

print("⚡ Running Few-Shot Classification...")
fs_outputs = classifier(few_shot_prompts, candidate_labels, multi_label=True, batch_size=8)

# ---------------------------
# Step 8: Collect Results
# ---------------------------
for idx, row in df.iterrows():
    if row["Ticket ID"] in processed_ids:
        continue  # Skip already processed tickets

    ticket_text = texts[idx]

    # ---- Zero-Shot ----
    zs = zs_outputs[idx]
    top_idx_zs = sorted(range(len(zs["scores"])), key=lambda i: zs["scores"][i], reverse=True)[:3]
    zs_labels = [zs["labels"][j] for j in top_idx_zs]
    zs_scores = [round(float(zs["scores"][j]), 3) for j in top_idx_zs]

    # ---- Few-Shot ----
    fs = fs_outputs[idx]
    top_idx_fs = sorted(range(len(fs["scores"])), key=lambda i: fs["scores"][i], reverse=True)[:3]
    fs_labels = [fs["labels"][j] for j in top_idx_fs]
    fs_scores = [round(float(fs["scores"][j]), 3) for j in top_idx_fs]

    results.append({
        "ticket_id": row["Ticket ID"],
        "ticket_text": ticket_text,
        "zero_shot_labels": zs_labels,
        "zero_shot_scores": zs_scores,
        "few_shot_labels": fs_labels,
        "few_shot_scores": fs_scores
    })

    if idx % 10 == 0:
        print(f"Processed {idx} tickets...")

    if idx % 500 == 0 and idx > 0:
        # Save checkpoint
        pd.DataFrame(results).to_csv(checkpoint_file, index=False)
        print(f"💾 Saved checkpoint at {idx} tickets")

# ---------------------------
# Step 9: Save Final Results
# ---------------------------
results_df = pd.DataFrame(results)
results_df.to_csv("classified_tickets_results.csv", index=False)
print("✅ Done. Results saved to classified_tickets_results.csv")


Columns in CSV: ['Ticket ID', 'Customer Name', 'Customer Email', 'Customer Age', 'Customer Gender', 'Product Purchased', 'Date of Purchase', 'Ticket Type', 'Ticket Subject', 'Ticket Description', 'Ticket Status', 'Resolution', 'Ticket Priority', 'Ticket Channel', 'First Response Time', 'Time to Resolution', 'Customer Satisfaction Rating']


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


⚡ Running Zero-Shot Classification...
⚡ Running Few-Shot Classification...
Processed 0 tickets...
Processed 10 tickets...
Processed 20 tickets...
Processed 30 tickets...
Processed 40 tickets...
Processed 50 tickets...
Processed 60 tickets...
Processed 70 tickets...
Processed 80 tickets...
Processed 90 tickets...
Processed 100 tickets...
Processed 110 tickets...
Processed 120 tickets...
Processed 130 tickets...
Processed 140 tickets...
Processed 150 tickets...
Processed 160 tickets...
Processed 170 tickets...
Processed 180 tickets...
Processed 190 tickets...
Processed 200 tickets...
Processed 210 tickets...
Processed 220 tickets...
Processed 230 tickets...
Processed 240 tickets...
Processed 250 tickets...
Processed 260 tickets...
Processed 270 tickets...
Processed 280 tickets...
Processed 290 tickets...
Processed 300 tickets...
Processed 310 tickets...
Processed 320 tickets...
Processed 330 tickets...
Processed 340 tickets...
Processed 350 tickets...
Processed 360 tickets...
Processed 3

In [1]:
import ast

import pandas as pd

# Load classified CSV
df = pd.read_csv("classified_tickets_results.csv")

# The list columns were saved to CSV as strings such as "['Bug', 'Other']".
# ast.literal_eval parses them safely; eval would execute arbitrary code.

# Extract top-1 labels
df["zs_top1"] = df["zero_shot_labels"].apply(lambda x: ast.literal_eval(x)[0])
df["fs_top1"] = df["few_shot_labels"].apply(lambda x: ast.literal_eval(x)[0])

# Extract top-1 scores
df["zs_score1"] = df["zero_shot_scores"].apply(lambda x: ast.literal_eval(x)[0])
df["fs_score1"] = df["few_shot_scores"].apply(lambda x: ast.literal_eval(x)[0])

# Compare agreement
agreement = (df["zs_top1"] == df["fs_top1"]).mean()
print(f"Top-1 label agreement between Zero-Shot and Few-Shot: {agreement*100:.2f}%")

# Average top-1 score (confidence)
zs_avg = df["zs_score1"].mean()
fs_avg = df["fs_score1"].mean()
print(f"Average Zero-Shot top-1 score: {zs_avg:.3f}")
print(f"Average Few-Shot top-1 score: {fs_avg:.3f}")

# Quick summary: which one seems more confident?
if zs_avg > fs_avg:
    print("Zero-Shot seems more confident on average.")
elif fs_avg > zs_avg:
    print("Few-Shot seems more confident on average.")
else:
    print("Both have similar confidence.")

# Optional: show 5 example differences
print("\nSample differences (first 5):")
diff = df[df["zs_top1"] != df["fs_top1"]].head(5)
for _, row in diff.iterrows():
    print(f"\nTicket: {row['ticket_text'][:50]}...")
    print(f"Zero-Shot: {row['zs_top1']} ({row['zs_score1']:.2f})")
    print(f"Few-Shot:  {row['fs_top1']} ({row['fs_score1']:.2f})")


Top-1 label agreement between Zero-Shot and Few-Shot: 17.62%
Average Zero-Shot top-1 score: 0.810
Average Few-Shot top-1 score: 0.889
Few-Shot seems more confident on average.

Sample differences (first 5):

Ticket: Product setup - I'm having an issue with the {prod...
Zero-Shot: Payment Issue (0.94)
Few-Shot:  Login Issue (0.95)

Ticket: Peripheral compatibility - I'm having an issue wit...
Zero-Shot: Other (0.94)
Few-Shot:  Login Issue (0.96)

Ticket: Network problem - I'm facing a problem with my {pr...
Zero-Shot: Bug (0.92)
Few-Shot:  Login Issue (0.74)

Ticket: Account access - I'm having an issue with the {pro...
Zero-Shot: Bug (0.53)
Few-Shot:  Login Issue (0.98)

Ticket: Data loss - I'm having an issue with the {product_...
Zero-Shot: Bug (0.90)
Few-Shot:  Login Issue (0.92)


***Overall Steps Followed***

Load Dataset: Read the customer support tickets CSV containing ticket text and metadata.

Define Categories: Set candidate labels for classification:
["Login Issue", "Payment Issue", "Bug", "Feature Request", "Other"]

Zero-Shot Classification: Used joeddav/xlm-roberta-large-xnli with multi_label=True to score every candidate label for each ticket, then kept the top-3 tags.
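The top-3 selection can be sketched on its own, without loading the model. The `top_k` helper and the example scores below are hypothetical; only the `labels`/`scores` keys mirror what the notebook reads from each pipeline result. With multi_label=True each label is scored independently, so the scores do not need to sum to 1.

```python
def top_k(result, k=3):
    """Return the k highest-scoring (label, score) pairs from one result dict."""
    pairs = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(label, round(float(score), 3)) for label, score in pairs[:k]]


# Hypothetical result in the same shape the notebook indexes (scores are made up).
example = {
    "labels": ["Bug", "Login Issue", "Other", "Payment Issue", "Feature Request"],
    "scores": [0.40, 0.91, 0.22, 0.08, 0.03],
}

print(top_k(example))
```

Sorting explicitly, as the notebook does, keeps the selection correct even if a result arrives unsorted.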

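Few-Shot Classification: Prepended three worked examples to each ticket and passed the combined prompt to the same zero-shot pipeline, again keeping the top-3 tags. The model is an NLI classifier, not a generative LLM, so it scores the whole prompt (examples included) against each label rather than completing the "Answer:" line.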
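To make the few-shot setup concrete, here is a rough sketch of the premise/hypothesis pairs an NLI-based zero-shot classifier ends up scoring. The `nli_pairs` helper is hypothetical, and the template string is an assumption about the pipeline's default `hypothesis_template`; check it against the installed transformers version.

```python
# Assumed default hypothesis template of the zero-shot pipeline (verify locally).
HYPOTHESIS_TEMPLATE = "This example is {}."

candidate_labels = ["Login Issue", "Payment Issue", "Bug", "Feature Request", "Other"]


def nli_pairs(premise, labels, template=HYPOTHESIS_TEMPLATE):
    """Build one (premise, hypothesis) pair per candidate label."""
    return [(premise, template.format(label)) for label in labels]


ticket_text = "My payment failed."
few_shot_prompt = (
    "Example 1:\nTicket: 'I can’t log in to my account.'\n"
    "Answer: Login Issue (0.85), Other (0.15)\n"
    f"Now classify this ticket:\nTicket: '{ticket_text}'\nAnswer:"
)

pairs = nli_pairs(few_shot_prompt, candidate_labels)
for premise, hypothesis in pairs[:2]:
    # The example text, including the words "Login Issue", is part of every premise.
    print(repr(premise[:45]), "->", hypothesis)
```

Because the example answers sit inside the premise, label names that appear in the examples may be favored regardless of the ticket being classified.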

Save Results: Stored ticket ID, text, top-3 zero-shot/few-shot labels and scores in a CSV.
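One detail of this step is worth illustrating: list-valued fields are written to CSV as their string representation, so they come back as strings when the file is reloaded. The sketch below uses only the standard library and a made-up row; the column names follow the notebook, the values do not come from the dataset.

```python
import ast
import csv
import io

# Made-up row in the same shape as the notebook's result records.
rows = [
    {
        "ticket_id": 1,
        "zero_shot_labels": ["Bug", "Other", "Login Issue"],
        "zero_shot_scores": [0.9, 0.4, 0.1],
    }
]

# Write to an in-memory CSV; each list is serialized via str().
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_labels = loaded[0]["zero_shot_labels"]

# ast.literal_eval turns the string back into a list without executing code.
labels = ast.literal_eval(raw_labels)
scores = ast.literal_eval(loaded[0]["zero_shot_scores"])
print(type(raw_labels).__name__, labels, scores)
```

Storing such columns as JSON strings, or in a format with native list support, would avoid the parsing step altogether.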

Comparison (Quick Analysis): Evaluated agreement between zero-shot and few-shot top-1 labels and compared their average confidence scores.

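**Quick Comparison Results**

Top-1 label agreement: 17.62% → Zero-shot and Few-shot pick different top tags for most tickets.

Average top-1 confidence:

Zero-Shot: 0.810

Few-Shot: 0.889

Observation: Few-Shot is more confident on average, but confidence is not accuracy. In all five sampled disagreements Few-Shot chose Login Issue, including for tickets titled "Product setup" and "Data loss", which suggests the examples in the prompt may be pulling predictions toward that label.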
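The same comparison can be extended with a per-label count, which makes it easy to spot one method collapsing onto a single tag. This is a minimal sketch with a hypothetical `compare` helper; the input pairs are the five sampled disagreements printed in the output above, not the full result set.

```python
from collections import Counter
from statistics import mean


def compare(zs, fs):
    """Summarize two aligned lists of top-1 (label, score) predictions."""
    return {
        "agreement": mean(1.0 if a[0] == b[0] else 0.0 for a, b in zip(zs, fs)),
        "zs_mean_score": mean(score for _, score in zs),
        "fs_mean_score": mean(score for _, score in fs),
        "zs_label_counts": Counter(label for label, _ in zs),
        "fs_label_counts": Counter(label for label, _ in fs),
    }


# Top-1 predictions from the five sampled disagreements shown in the output.
zs_top1 = [("Payment Issue", 0.94), ("Other", 0.94), ("Bug", 0.92), ("Bug", 0.53), ("Bug", 0.90)]
fs_top1 = [("Login Issue", 0.95), ("Login Issue", 0.96), ("Login Issue", 0.74),
           ("Login Issue", 0.98), ("Login Issue", 0.92)]

summary = compare(zs_top1, fs_top1)
print(summary["zs_label_counts"])
print(summary["fs_label_counts"])
```

On the full results file, a heavily skewed count for one method would be a reason to inspect the prompt before trusting its confidence scores.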

**Conclusion**

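Few-Shot classification is more confident than Zero-Shot on average, but no ground-truth labels were evaluated, so these results do not show that it is more accurate.

The two approaches disagree on the top tag for 82.38% of tickets. In the sampled disagreements, Zero-Shot spreads its predictions across Payment Issue, Other, and Bug, while Few-Shot selects Login Issue each time.

Prompt examples clearly change the model's output even without fine-tuning; whether they improve classification quality still needs to be checked against labelled tickets.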
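One way to test quality claims like these is to hand-label a small sample of tickets and measure top-1 accuracy for each method. The sketch below assumes such labels exist; the `accuracy` helper, the gold labels, and both prediction lists are invented for illustration and are not results from this notebook.

```python
def accuracy(predicted, gold):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one labelled ticket is required")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical hand-assigned labels and predictions (illustrative values only).
gold = ["Login Issue", "Payment Issue", "Bug", "Other"]
zs_pred = ["Login Issue", "Payment Issue", "Bug", "Bug"]
fs_pred = ["Login Issue", "Login Issue", "Login Issue", "Login Issue"]

print(f"Zero-Shot accuracy: {accuracy(zs_pred, gold):.2f}")
print(f"Few-Shot accuracy:  {accuracy(fs_pred, gold):.2f}")
```

Even a few dozen labelled tickets would give a more reliable signal than average confidence alone.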