# Consumer Complaint Classification – Transformers 

## Introduction  

The goal of this project is to develop a **text classification model** that assigns **consumer complaints** to one of five financial categories using **transformer-based models** (DistilBERT and RoBERTa).  

The dataset, obtained from the **Consumer Financial Protection Bureau (CFPB)**, contains **over 2 million consumer complaints** submitted between **2011 and 2024**. Each complaint is a free-text **narrative** describing a financial issue and is labeled with one of **five main categories**:  

- **Loans**  
- **Credit Reporting**  
- **Bank Accounts & Services**  
- **Debt Collection**  
- **Credit Card Services**  

---

## Data Understanding  

The dataset originates from the **Consumer Complaint Database** maintained by the **Consumer Financial Protection Bureau (CFPB)**, a **U.S. federal agency** that mediates disputes between financial institutions and consumers. Consumers submit complaints through an **online form**, detailing their financial issues.  

The dataset was **downloaded from the CFPB website** and underwent **preprocessing** to prepare it for NLP tasks. The key modifications include:  

- Retaining only records where a **"Consumer complaint narrative"** is available.  
- As a result of this filter, the dataset shrinks from **5,842,373** records to **2,023,066** entries.  
- Renaming the **"Consumer complaint narrative"** column to **"narrative"** for ease of coding.  
- Consolidating **18 original product categories** into **5 main categories (product_5)** to address overlaps in classification.  
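
The steps listed above can be sketched on a toy frame as follows. This is a minimal illustration, not the code used to build the dataset: the `Product` column name, the `PRODUCT_TO_5` dictionary, the `prepare_complaints` helper, and the class label strings are assumptions, because the notebook does not show the actual 18-to-5 mapping.

```python
import pandas as pd

# Hypothetical, partial mapping from raw product values to the five
# consolidated classes; the real 18-to-5 mapping is not shown in this notebook.
PRODUCT_TO_5 = {
    "Mortgage": "Loans",
    "Student loan": "Loans",
    "Credit reporting": "Credit Reporting",
    "Checking or savings account": "Bank Accounts & Services",
    "Debt collection": "Debt Collection",
    "Credit card": "Credit Card Services",
}

def prepare_complaints(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a narrative, rename the column, and add product_5."""
    out = raw.dropna(subset=["Consumer complaint narrative"])
    out = out.rename(columns={"Consumer complaint narrative": "narrative"})
    out = out.assign(product_5=out["Product"].map(PRODUCT_TO_5))
    # Rows whose product is missing from the mapping end up as NaN and are dropped.
    return out.dropna(subset=["product_5"]).reset_index(drop=True)

raw = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Debt collection", "Payday loan"],
    "Consumer complaint narrative": ["Late fee dispute", None, "Wrong debt", "High fees"],
})
clean = prepare_complaints(raw)
print(clean[["narrative", "product_5"]])
```

In this toy run the credit card row is dropped for lacking a narrative, and the payday loan row is dropped because the illustrative mapping does not cover it.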

---

## Consumer Complaint Classification Pipeline


In [1]:
# Install required packages
# !pip install transformers datasets pandas numpy scikit-learn torch tqdm

# --------------------
# Standard Libraries
# --------------------
import os
import re
import random

# --------------------
# Data Manipulation
# --------------------
import pandas as pd
import numpy as np

# --------------------
# Visualization
# --------------------
import matplotlib.pyplot as plt
import seaborn as sns


# --------------------
# Machine Learning
# --------------------
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# --------------------
# PyTorch & Transformers
# --------------------
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    get_linear_schedule_with_warmup,
    RobertaTokenizer, 
    RobertaForSequenceClassification
)

# --------------------
# Progress Bar
# --------------------
from tqdm.auto import tqdm

# --------------------
# Set Seed for Reproducibility
# --------------------
def set_seed(seed_value=42):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

set_seed(42)

# --------------------
# Device Configuration
# --------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


Using device: cpu


In [None]:
# Load the data
df = pd.read_csv('data/complaints.csv')
df.head()

In [None]:
# Drop the 'Unnamed: 0' index column if present (errors='ignore' avoids a KeyError when it is absent)
df = df.drop(columns=['Unnamed: 0'], errors='ignore')

In [None]:
# Define sample size per category
sample_size = 10000

df_resampled = df.groupby("product_5").sample(n=sample_size, random_state=42)

df_resampled = df_resampled.reset_index(drop=True)

df_resampled["product_5"].value_counts()
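
Note that `groupby(...).sample(n=...)` samples without replacement by default, so it raises a `ValueError` if any category has fewer than `n` rows. A minimal sketch of a guard is shown below; `balanced_sample` is a hypothetical helper, not part of the notebook, and it caps the per-class sample at the size of the rarest class so the result stays balanced.

```python
import pandas as pd

def balanced_sample(df, label_col, n, seed=42):
    """Draw the same number of rows per class, capped by the rarest class."""
    smallest = int(df[label_col].value_counts().min())
    n_per_class = min(n, smallest)
    sampled = df.groupby(label_col).sample(n=n_per_class, random_state=seed)
    return sampled.reset_index(drop=True)

# Toy data: one class has 5 rows, the other only 3
toy = pd.DataFrame({
    "narrative": [f"complaint {i}" for i in range(8)],
    "product_5": ["Loans"] * 5 + ["Debt Collection"] * 3,
})
balanced = balanced_sample(toy, "product_5", n=4)
print(balanced["product_5"].value_counts())
```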

In [None]:
# Encode the labels (LabelEncoder: str → int)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(df_resampled['product_5'])

# Create label mapping (int → str) for interpretation later
label_mapping = dict(zip(range(len(label_encoder.classes_)), label_encoder.classes_))
reverse_label_mapping = {v: k for k, v in label_mapping.items()}  # Optional reverse

# Print the mapping for verification
print("\nLabel mapping (int → class name):")
for i, label in label_mapping.items():
    print(f"{i}: {label}")
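
`LabelEncoder` stores its classes in sorted order, so the integer ids follow the alphabetical order of the class names. The short sketch below illustrates this on made-up labels (the label strings are assumptions based on the category names in the introduction) and shows `inverse_transform` as an alternative to a hand-built mapping dictionary.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real product_5 strings may differ
names = ["Loans", "Credit Reporting", "Debt Collection", "Loans", "Credit Card Services"]

encoder = LabelEncoder()
ids = encoder.fit_transform(names)

# classes_ is sorted, so id 0 is the alphabetically first class
print(list(encoder.classes_))
print(ids.tolist())

# inverse_transform maps integer ids back to the original strings
decoded = encoder.inverse_transform(ids)
print(list(decoded))
```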

In [None]:
# Split the data
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df_resampled['narrative'].values,
    encoded_labels,
    test_size=0.3,
    random_state=42,
    stratify=encoded_labels
)

val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts,
    temp_labels,
    test_size=0.5,
    random_state=42,
    stratify=temp_labels
)

print(f"\nTraining set size: {len(train_texts)}")
print(f"Validation set size: {len(val_texts)}")
print(f"Test set size: {len(test_texts)}")
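
As a sanity check on the printed sizes, the two-stage split yields a 70/15/15 partition. The sketch below works through the arithmetic, assuming all five categories contribute the full 10,000 rows and that the held-out part is rounded up (which, to the best of our understanding, is how scikit-learn handles a fractional `test_size`). `split_sizes` is a hypothetical helper for illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 5 * 10_000                           # five classes, 10,000 rows each
n_train, n_temp = split_sizes(n_total, 0.3)    # 70% train, 30% held out
n_val, n_test = split_sizes(n_temp, 0.5)       # held-out part split in half
print(n_train, n_val, n_test)                  # 35000 7500 7500
```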

### Transformer Settings

In [None]:
# === SELECT MODEL TYPE ===
robertA_base = False
robertA_base_large = False
distilbert = True  # Set the one you want to True

# === AUTOMATIC CONFIG BASED ON FLAGS ===
if distilbert:
    model_name = 'distilbert-base-uncased'
    model_type = 'distilbert'
    batch_size = 16  # You can keep it higher for smaller models
elif robertA_base:
    model_name = 'roberta-base'
    model_type = 'roberta'
    batch_size = 16
elif robertA_base_large:
    model_name = 'roberta-large'
    model_type = 'roberta'
    batch_size = 8  # roberta-large is heavy — lower the batch size
else:
    raise ValueError("Please set one of the model flags to True.")


In [None]:
# Create a custom dataset
class ConsumerComplaintDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        text = text.strip()

        # Tokenize the text
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
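
The dataset above pads every narrative to 512 tokens, which is simple but spends compute on padding for short complaints. One alternative, not used in this notebook, is to pad each batch only to its longest sequence. The sketch below assumes the dataset returns unpadded `input_ids` tensors; `pad_to_longest` is a hypothetical collate function and the token ids are arbitrary placeholders. The padding id should come from `tokenizer.pad_token_id` (0 for `distilbert-base-uncased`, 1 for RoBERTa).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_to_longest(batch, pad_token_id=0):
    """Collate unpadded items into one batch padded to its longest sequence."""
    ids = [item["input_ids"] for item in batch]
    lengths = torch.tensor([len(x) for x in ids])
    input_ids = pad_sequence(ids, batch_first=True, padding_value=pad_token_id)
    # Attention mask: 1 for real tokens, 0 for padding positions
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    attention_mask = (positions < lengths.unsqueeze(1)).long()
    labels = torch.stack([item["labels"] for item in batch])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Two toy items of different lengths (token ids are placeholders)
batch = [
    {"input_ids": torch.tensor([101, 2026, 102]), "labels": torch.tensor(0)},
    {"input_ids": torch.tensor([101, 2026, 4003, 2003, 102]), "labels": torch.tensor(3)},
]
out = pad_to_longest(batch)
print(out["input_ids"].shape)
print(out["attention_mask"])
```

Such a function would be passed to the loader as `DataLoader(dataset, batch_size=..., collate_fn=pad_to_longest)`.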

In [None]:
if distilbert:
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
elif robertA_base:
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
elif robertA_base_large:
    tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
# Create datasets
train_dataset = ConsumerComplaintDataset(train_texts, train_labels, tokenizer)
val_dataset = ConsumerComplaintDataset(val_texts, val_labels, tokenizer)
test_dataset = ConsumerComplaintDataset(test_texts, test_labels, tokenizer)

# roberta-large is memory-heavy, so use a smaller batch size for it
batch_size = 8 if robertA_base_large else 16

# Create data loaders

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

In [None]:
def train_model(model, train_dataloader, val_dataloader, epochs=4, learning_rate=5e-5):
    # Prepare optimizer and scheduler
    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)
    
    # Total number of training steps
    total_steps = len(train_dataloader) * epochs
    
    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer, 
        num_warmup_steps=0, 
        num_training_steps=total_steps
    )
    
    # Lists to store loss and accuracy
    train_losses = []
    val_losses = []
    val_accuracies = []
    
    for epoch in range(epochs):
        print(f"\nEpoch {epoch+1}/{epochs}")
        print('-' * 40)
        
        # Training
        model.train()
        total_train_loss = 0
        
        progress_bar = tqdm(train_dataloader, desc="Training")
        
        for batch in progress_bar:
            # Move batch to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Clear previous gradients
            model.zero_grad()
            
            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            
            loss = outputs.loss
            total_train_loss += loss.item()
            
            # Backward pass
            loss.backward()
            
            # Clip the norm of the gradients to 1.0
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            
            # Update parameters and learning rate
            optimizer.step()
            scheduler.step()
            
            # Update progress bar
            progress_bar.set_postfix({'loss': loss.item()})
        
        avg_train_loss = total_train_loss / len(train_dataloader)
        train_losses.append(avg_train_loss)
        print(f"Average training loss: {avg_train_loss:.4f}")
        
        # Validation
        model.eval()
        total_val_loss = 0
        predictions = []
        true_labels = []
        
        for batch in tqdm(val_dataloader, desc="Validation"):
            # Move batch to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Forward pass without gradient calculation
            with torch.no_grad():
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )
            
            loss = outputs.loss
            total_val_loss += loss.item()
            
            # Get predictions
            logits = outputs.logits
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            
            # Store predictions and true labels
            predictions.extend(preds)
            true_labels.extend(labels.cpu().numpy())
        
        avg_val_loss = total_val_loss / len(val_dataloader)
        val_losses.append(avg_val_loss)
        
        val_accuracy = accuracy_score(true_labels, predictions)
        val_accuracies.append(val_accuracy)
        
        print(f"Validation Loss: {avg_val_loss:.4f}")
        print(f"Validation Accuracy: {val_accuracy:.4f}")
        print("\nClassification Report:")
        print(classification_report(true_labels, predictions))
    
    return model, train_losses, val_losses, val_accuracies
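
To make the scheduler in `train_model` concrete: with `num_warmup_steps=0`, the learning rate starts at its base value and decays linearly to zero over `total_steps`. The pure-Python sketch below mirrors that rule for illustration; `linear_schedule_lr` is a hypothetical helper, and the step counts assume 35,000 training rows with a batch size of 16 and 3 epochs.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

steps_per_epoch = math.ceil(35_000 / 16)   # batches per epoch, last batch partial
total_steps = steps_per_epoch * 3          # 3 epochs
print(steps_per_epoch, total_steps)        # 2188 6564

for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 5e-5, 0, total_steps))
```

Setting a non-zero `warmup_steps` would ramp the rate up from zero before the decay begins.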

In [None]:
if model_type == 'distilbert':
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    model = DistilBertForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label_mapping),
        output_attentions=False,
        output_hidden_states=False
    )
elif model_type == 'roberta':
    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    model = RobertaForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label_mapping),
        output_attentions=False,
        output_hidden_states=False
    )

model = model.to(device)
# === Training parameters ===
epochs = 3
learning_rate = 5e-5

# === Train the model ===
model, train_losses, val_losses, val_accuracies = train_model(
    model,
    train_dataloader,
    val_dataloader,
    epochs=epochs,
    learning_rate=learning_rate
)

# === Save the model ===
# Convert model name (e.g., "distilbert-base-uncased") into a clean folder name
model_save_path = f"{model_name.replace('/', '_')}_consumer_complaints_model"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)
print(f"Model saved to {model_save_path}")

In [None]:
def evaluate_model(model, test_dataloader):
    model.eval()
    
    predictions = []
    true_labels = []
    
    for batch in tqdm(test_dataloader, desc="Evaluating"):
        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Forward pass without gradient calculation
        with torch.no_grad():
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
        
        # Get predictions
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        
        # Store predictions and true labels
        predictions.extend(preds)
        true_labels.extend(labels.cpu().numpy())
    
    # Calculate metrics
    accuracy = accuracy_score(true_labels, predictions)
    precision = precision_score(true_labels, predictions, average='weighted')
    recall = recall_score(true_labels, predictions, average='weighted')
    f1 = f1_score(true_labels, predictions, average='weighted')
    
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"Test Precision: {precision:.4f}")
    print(f"Test Recall: {recall:.4f}")
    print(f"Test F1 Score: {f1:.4f}")
    
    print("\nClassification Report:")
    print(classification_report(true_labels, predictions))
    
    # Plot confusion matrix
    cm = confusion_matrix(true_labels, predictions)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=label_encoder.classes_, 
                yticklabels=label_encoder.classes_)
    plt.title('Confusion Matrix', fontsize=16)
    plt.xlabel('Predicted Label', fontsize=14)
    plt.ylabel('True Label', fontsize=14)
    plt.tight_layout()
    plt.show()
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'true_labels': true_labels,
        'predictions': predictions
    }

# confusion_matrix is used inside evaluate_model, so import it before the call below
from sklearn.metrics import confusion_matrix

# Evaluate the model on the test set
test_results = evaluate_model(model, test_dataloader)
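
The confusion matrix plotted above carries the same information as the per-class precision and recall in the classification report. As a minimal sketch, assuming the scikit-learn convention of true labels on rows and predictions on columns, the metrics can be derived directly from the matrix. `per_class_metrics` is a hypothetical helper and the 2x2 matrix is made-up data.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall and F1 from a confusion matrix (rows = true labels)."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    precision = true_pos / cm.sum(axis=0)   # column sums: how often each class was predicted
    recall = true_pos / cm.sum(axis=1)      # row sums: how often each class truly occurs
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up matrix: 10 true samples per class
toy_cm = [[8, 2],
          [1, 9]]
precision, recall, f1 = per_class_metrics(toy_cm)
print(np.round(precision, 3), np.round(recall, 3), np.round(f1, 3))
```

The sketch does not guard against a class that is never predicted, in which case the precision denominator is zero.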

In [None]:
def predict_complaint_category(text, model, tokenizer, label_mapping):
    # Tokenize the text
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=512,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    
    # Move inputs to the device
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    
    # Set the model to evaluation mode
    model.eval()
    
    # Forward pass without gradient calculation
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    
    # Get the predictions
    logits = outputs.logits
    prediction_id = torch.argmax(logits, dim=1).item()
    
    # Get the predicted category
    predicted_category = label_mapping[prediction_id]
    
    return predicted_category

# Example usage
example_text = "I am having issues with my credit card. The bank charged me an annual fee even though they said it would be waived."
predicted_category = predict_complaint_category(example_text, model, tokenizer, label_mapping)
print(f"Predicted category: {predicted_category}")
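
`predict_complaint_category` returns only the top label. When a rough confidence value is also useful, the logits can be passed through a softmax first. The sketch below shows the idea on made-up logits; `top_prediction` is a hypothetical helper, and softmax probabilities from a fine-tuned classifier are not guaranteed to be well calibrated.

```python
import torch

def top_prediction(logits, label_mapping):
    """Return the most likely label and its softmax probability."""
    probs = torch.softmax(logits, dim=-1)
    confidence, idx = torch.max(probs, dim=-1)
    return label_mapping[idx.item()], confidence.item()

# Made-up logits for three classes (a real model here would output five)
logits = torch.tensor([0.2, 2.5, -1.0])
mapping = {0: "Loans", 1: "Credit Reporting", 2: "Debt Collection"}
label, confidence = top_prediction(logits, mapping)
print(f"{label} ({confidence:.2f})")
```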

In [None]:
# Example complaints
test_complaints = [
    "I am having issues with my credit card. The bank charged me an annual fee even though they said it would be waived.",
    "My credit report shows incorrect information. There are accounts listed that don't belong to me.",
    "I requested a loan modification three months ago, but I haven't heard anything back from the lender.",
    "A debt collector keeps calling me for a debt that isn't mine. I've told them multiple times it's not my debt.",
    "I cannot access my bank account online. The website keeps showing an error message."
]

# Predict categories
for i, complaint in enumerate(test_complaints):
    category = predict_complaint_category(complaint, model, tokenizer, label_mapping)
    print(f"Complaint {i+1}: {complaint[:50]}...")
    print(f"Predicted category: {category}\n")