### **Multi-label Email Classification Fine-Tuning Task on BERT Variants**

**Objective:** In a disorganized inbox, emails from many categories are mixed together, so users have to open each one to work out its type or priority. This task fine-tunes BERT variants for multi-label classification so that every email is automatically tagged with each category that applies to it, improving inbox organization and user efficiency.

**Implementation Steps:**

#### **Project Setup**

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from datasets import load_dataset
import torch
import gc
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from peft import LoraConfig, get_peft_model, TaskType

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, hamming_loss
import evaluate

#### **Load Dataset**

In [2]:
# Load the multi-label emails dataset
dataset = load_dataset("imnim/multiclass-email-classification")

In [3]:
# Check the dataset
print("Dataset: \n", dataset)
print("="*50)
print(f"Dataset Shape: Rows-{dataset['train'].num_rows}, Columns-{len(dataset['train'].column_names)}")
print("="*50)
print("Sample Data: \n", dataset['train'][0])


Dataset: 
 DatasetDict({
    train: Dataset({
        features: ['subject', 'body', 'labels'],
        num_rows: 2105
    })
})
Dataset Shape: Rows-2105, Columns-3
Sample Data: 
 {'subject': 'Meeting Reminder: Quarterly Sales Review Tomorrow', 'body': 'Dear Team, Just a friendly reminder that our Quarterly Sales Review meeting is scheduled for tomorrow at 10:00 AM in the conference room. Please make sure to bring your sales reports and any relevant updates. Coffee and pastries will be provided. Looking forward to a productive meeting. Best regards, [Your Name]', 'labels': ['Business', 'Reminders']}


#### **Data Preprocessing**

In [4]:
# Combine subject and body of each email into a single text field
def combine_text(examples):
    examples["text"] = examples["subject"] + " " + examples["body"]
    return examples


# Apply the function to the dataset
dataset = dataset.map(combine_text)

In [5]:
# Check the updated dataset
dataset["train"][0]

{'subject': 'Meeting Reminder: Quarterly Sales Review Tomorrow',
 'body': 'Dear Team, Just a friendly reminder that our Quarterly Sales Review meeting is scheduled for tomorrow at 10:00 AM in the conference room. Please make sure to bring your sales reports and any relevant updates. Coffee and pastries will be provided. Looking forward to a productive meeting. Best regards, [Your Name]',
 'labels': ['Business', 'Reminders'],
 'text': 'Meeting Reminder: Quarterly Sales Review Tomorrow Dear Team, Just a friendly reminder that our Quarterly Sales Review meeting is scheduled for tomorrow at 10:00 AM in the conference room. Please make sure to bring your sales reports and any relevant updates. Coffee and pastries will be provided. Looking forward to a productive meeting. Best regards, [Your Name]'}

In [6]:
# Split the dataset before further processing
split_dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]
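
The split above is random rather than stratified, and the label binarizer in the next cell is fitted on the training labels only. A label that occurs only in the test split would therefore be dropped during encoding (scikit-learn warns about unknown classes and ignores them). The following is a minimal sketch of a sanity check for that case; the helper names and the toy label lists are hypothetical stand-ins for `train_dataset["labels"]` and `test_dataset["labels"]`.

```python
from collections import Counter


def label_counts(label_lists):
    """Count how often each label occurs across per-email label lists."""
    return Counter(label for labels in label_lists for label in labels)


def unseen_labels(train_label_lists, test_label_lists):
    """Return labels that occur in the test split but never in the training split."""
    train_seen = set(label_counts(train_label_lists))
    test_seen = set(label_counts(test_label_lists))
    return sorted(test_seen - train_seen)


# Hypothetical toy data; in the notebook these would be the "labels" columns
toy_train = [["Business", "Reminders"], ["Promotions"], ["Business"]]
toy_test = [["Promotions"], ["Personal", "Reminders"]]

print(label_counts(toy_train))
print(unseen_labels(toy_train, toy_test))
```

An empty result from `unseen_labels` would suggest that every test label is representable by a binarizer fitted on the training split.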

In [7]:
# Encode multi-labels using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(train_dataset["labels"])
num_labels = len(mlb.classes_)

# Define a function to encode the labels
def encode_labels(examples):
    # Multi-hot encode every label list in the batch
    encoded = mlb.transform(examples["labels"])
    # Cast to float32: the multi-label loss (binary cross-entropy) expects float targets
    examples["labels_encoded"] = encoded.astype(np.float32).tolist()
    return examples

# Apply the function to the dataset
train_dataset = train_dataset.map(encode_labels, batched=True)
test_dataset = test_dataset.map(encode_labels, batched=True)

In [8]:
# Check the encoded dataset
print("Original labels:", train_dataset["labels"][:2])
print("Encoded labels:", train_dataset["labels_encoded"][:2])
print("Label classes:", mlb.classes_)

Original labels: [['Business', 'Reminders'], ['Promotions']]
Encoded labels: [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]
Label classes: ['Business' 'Customer Support' 'Events & Invitations' 'Finance & Bills'
 'Job Application' 'Newsletters' 'Personal' 'Promotions' 'Reminders'
 'Travel & Bookings']
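
To make the encoded output above easier to read, here is a plain-Python sketch of the same multi-hot scheme: one slot per class in sorted order, set to 1.0 when the email carries that label. It is only an illustration of the idea, not the scikit-learn implementation; `multi_hot` and `decode` are hypothetical helper names, and `CLASSES` is copied from the printed `mlb.classes_`.

```python
# Class order copied from the printed mlb.classes_ (alphabetically sorted)
CLASSES = [
    "Business", "Customer Support", "Events & Invitations", "Finance & Bills",
    "Job Application", "Newsletters", "Personal", "Promotions", "Reminders",
    "Travel & Bookings",
]


def multi_hot(labels, classes):
    """Build a float vector with 1.0 at the position of every label present."""
    index = {name: i for i, name in enumerate(classes)}
    vector = [0.0] * len(classes)
    for label in labels:
        vector[index[label]] = 1.0
    return vector


def decode(vector, classes):
    """Map a multi-hot vector back to the label names it represents."""
    return [name for name, value in zip(classes, vector) if value == 1.0]


encoded = multi_hot(["Business", "Reminders"], CLASSES)
print(encoded)
print(decode(encoded, CLASSES))
```

Positions 0 and 8 are set for `['Business', 'Reminders']`, which matches the first encoded row printed by the notebook.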


In [9]:
# Tokenization
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize the combined text and attach the multi-hot labels in one pass
def tokenize_and_encode(examples):
    # Tokenize text; padding is deferred to the data collator (dynamic padding)
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=256,
        padding=False,
        return_tensors=None
    )

    # Encode labels as float32 multi-hot vectors
    encoded_labels = mlb.transform(examples["labels"])
    tokenized["labels"] = encoded_labels.astype(np.float32).tolist()

    return tokenized

# Apply to the train and test splits
train_dataset = train_dataset.map(tokenize_and_encode, batched=True)
test_dataset = test_dataset.map(tokenize_and_encode, batched=True)

# Remove the raw text columns; "labels_encoded" from the earlier cell stays but duplicates "labels"
train_dataset = train_dataset.remove_columns(["subject", "body", "text"])
test_dataset = test_dataset.remove_columns(["subject", "body", "text"])

print("Rebuilt datasets successfully!")
print("Train columns:", train_dataset.column_names)

Rebuilt datasets successfully!
Train columns: ['labels', 'labels_encoded', 'input_ids', 'attention_mask']


In [10]:
# Verify the structure
print("Training sample:", train_dataset[0])

Training sample: {'labels': [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0], 'labels_encoded': [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0], 'input_ids': [101, 14764, 1024, 9046, 2136, 3116, 6203, 2136, 2372, 1010, 2023, 2003, 1037, 5379, 14764, 2008, 2057, 2031, 2256, 4882, 2136, 3116, 5115, 2005, 4826, 2012, 2184, 1024, 4002, 2572, 1012, 3531, 2191, 2469, 2000, 3319, 1996, 11376, 25828, 1998, 2272, 4810, 2007, 2151, 14409, 2030, 20062, 2000, 3745, 1012, 2292, 1005, 1055, 2031, 1037, 13318, 3116, 1998, 6848, 2256, 5082, 2006, 7552, 3934, 1012, 2559, 2830, 2000, 3773, 2017, 2035, 2045, 999, 2190, 12362, 1010, 1031, 2115, 2171, 1033, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [11]:
# Prepare the label ids
id2label = {i: label for i, label in enumerate(mlb.classes_)}
label2id = {label: i for i, label in enumerate(mlb.classes_)}

# Check the mappings
print("ID to Label:", id2label)
print("Label to ID:", label2id)

ID to Label: {0: 'Business', 1: 'Customer Support', 2: 'Events & Invitations', 3: 'Finance & Bills', 4: 'Job Application', 5: 'Newsletters', 6: 'Personal', 7: 'Promotions', 8: 'Reminders', 9: 'Travel & Bookings'}
Label to ID: {'Business': 0, 'Customer Support': 1, 'Events & Invitations': 2, 'Finance & Bills': 3, 'Job Application': 4, 'Newsletters': 5, 'Personal': 6, 'Promotions': 7, 'Reminders': 8, 'Travel & Bookings': 9}
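
These mappings become useful once a model produces one logit per class. A common way to turn such logits into label names is to apply a sigmoid to each logit independently and keep every class above a threshold. The sketch below assumes a threshold of 0.5 and uses made-up logits; `decode_logits` is a hypothetical helper, not part of the notebook or of any library.

```python
import math

# Copied from the printed id2label mapping
ID2LABEL = {
    0: "Business", 1: "Customer Support", 2: "Events & Invitations",
    3: "Finance & Bills", 4: "Job Application", 5: "Newsletters",
    6: "Personal", 7: "Promotions", 8: "Reminders", 9: "Travel & Bookings",
}


def sigmoid(x):
    """Squash a single logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_logits(logits, id2label, threshold=0.5):
    """Return the names of all classes whose probability reaches the threshold."""
    return [id2label[i] for i, logit in enumerate(logits) if sigmoid(logit) >= threshold]


# Made-up logits for one email; only classes 0 and 8 are positive
toy_logits = [2.1, -3.0, -1.5, -2.2, -4.0, -2.7, -1.9, -3.3, 0.8, -2.5]
print(decode_logits(toy_logits, ID2LABEL))
```

Because each class is thresholded on its own, an email can receive zero, one, or several labels, which is the difference from single-label softmax decoding.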


In [12]:
# Define the data collator for dynamic padding
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding=True,           # Enable padding
    return_tensors="pt",    # Return PyTorch tensors
)
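
Dynamic padding means each batch is padded only to the length of its own longest sequence instead of a global maximum. The sketch below imitates that behavior in plain Python to show the shape of the result. It is an illustration, not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, and the pad token id of 0 is an assumption based on the `[PAD]` entry of the BERT uncased vocabulary (in the notebook, `tokenizer.pad_token_id` would be the value to check).

```python
PAD_TOKEN_ID = 0  # assumption: id of [PAD] for the uncased BERT vocabulary


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad token id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths
toy_batch = [[101, 14764, 1024, 102], [101, 2023, 102]]
batch = pad_batch(toy_batch)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask marks padded positions with 0 so the model ignores them, which is why tokenization earlier could leave `padding=False`.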