<b><h1 style="text-align:center;">COMM493 - Coding AI for Business</h1></b>
<h5 style="text-align:center;">Assignment 2 - Text Classification Via Natural Language Processing</h5>
<h5 style="text-align:center;">Maxwell Brookes - 20244724</h5>
<h5 style="text-align:center;">March 11th, 2025</h5>

### Introduction
**Motivation:** Leverage NLP so that member complaints are categorized and sent to the correct department as quickly as possible. With specialized customer service representatives and efficient sorting, member issues can be resolved faster. This streamlined system should reduce operational costs while increasing member retention and cross-selling opportunities. Satisfied members are more likely to expand their use of loans, investments, and premium services, thereby increasing revenue. Over time, these improvements can compound into higher profitability, allowing for expansion and reinvestment.

**Data:** <a href="https://www.kaggle.com/datasets/abhishek14398/automatic-ticket-classification-dataset/">Automatic Ticket Classification Dataset</a> contains complaint texts that belong to various departments of a financial institution.

**Goal:** Probabilistically map each ticket onto its respective department/category. Using the trained model, any new customer complaint can be classified and routed to its relevant department.

### 0: Set Up Environment
Load imports and define the constants that will be used to prepare the data and train the model.

In [1]:
# load imports
from datetime import datetime
import pandas as pd
import json
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import numpy as np
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker import image_uris

# Initialize SageMaker
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = 'complaints-classification-optimized'

# Precompile regex patterns
REDACTION_PATTERNS = re.compile(
    r'\b(?:X+X|X{2,}(?:/X{2,})+|\d+[-/]?X+|X+[-/]?\d+|X{4,})\b', 
    flags=re.IGNORECASE
)
CLEANING_PATTERNS = [
    (re.compile(r'(\\[nt])+'), ' '),
    (re.compile(r'\$ ?(\d+)'), r'\1 dollars'),
    (re.compile(r'\b(\d+)(?:st|nd|rd|th)\b'), r'\1'),
    (re.compile(r'[^\w\s]'), '')
]



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
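
To make the effect of these patterns concrete, here is a minimal sketch that applies them to a single string with plain `re`. The patterns are copied from the setup cell; `clean_one` and the sample complaint are hypothetical and exist only for illustration.

```python
import re

# Same patterns as the setup cell, repeated so the sketch runs on its own.
REDACTION_PATTERNS = re.compile(
    r'\b(?:X+X|X{2,}(?:/X{2,})+|\d+[-/]?X+|X+[-/]?\d+|X{4,})\b',
    flags=re.IGNORECASE
)
CLEANING_PATTERNS = [
    (re.compile(r'(\\[nt])+'), ' '),                   # literal "\n" / "\t" sequences -> space
    (re.compile(r'\$ ?(\d+)'), r'\1 dollars'),         # "$35" -> "35 dollars"
    (re.compile(r'\b(\d+)(?:st|nd|rd|th)\b'), r'\1'),  # "3rd" -> "3"
    (re.compile(r'[^\w\s]'), ''),                      # drop punctuation
]


def clean_one(text):
    """Hypothetical single-string version of the cleaning steps."""
    cleaned = REDACTION_PATTERNS.sub('[REDACTED]', text)
    for pattern, replacement in CLEANING_PATTERNS:
        cleaned = pattern.sub(replacement, cleaned)
    return cleaned.lower().strip()


# Raw string: the "\n" here is a literal backslash followed by "n".
sample = r"Chase charged me $35 on the 3rd.\nAccount XXXX was closed!"
print(clean_one(sample))
# chase charged me 35 dollars on the 3 account redacted was closed
```

Note that the brackets around `[REDACTED]` are removed by the final punctuation pattern, so the token reaches the model as the plain word `redacted`.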


<b><h2 style="text-align:center;">DATA PREPROCESSING</h2></b>

### 1: Load Dataset
Load the dataset into memory with `json.load`, then flatten the nested records into a DataFrame with `pd.json_normalize`.

In [2]:
with open('complaints.json', 'r') as f:
    data = json.load(f)
df = pd.json_normalize(data)
print('Data Shape:', df.shape)

Data Shape: (78313, 22)


### 2: Clean Rows & Columns
Extract only the useful data from the dataset.
First, rename the desired columns for clarity, then drop all other columns. For this task, only the *category* and *text* columns are required.
Finally, remove all rows where either the *category* or *text* is null, NaN, or empty.

In [3]:
# rename columns
df.rename(columns={
    '_source.product': 'category',
    '_source.complaint_what_happened': 'text'
}, inplace=True)

# drop columns
columns_to_keep = ['category', 'text']
all_columns = df.columns.tolist()
columns_to_drop = [col for col in all_columns if col not in columns_to_keep]
df.drop(columns_to_drop, axis=1, inplace=True)

# null handling: keep rows where both fields are non-null and non-blank
before = df.shape[0]
df = df[
    df['text'].notna() &
    df['text'].str.strip().astype(bool) &
    df['category'].notna() &
    df['category'].str.strip().astype(bool)
].copy()
after = df.shape[0]

# show info
print(f"Dropped {before-after} rows from the dataframe.")
df.columns

Dropped 57241 rows from the dataframe.


Index(['category', 'text'], dtype='object')
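
As a small illustration of the row filter described in this step, the sketch below runs an equivalent check on a toy frame. The toy data and the `is_filled` helper are hypothetical. The explicit `notna()` matters because a missing value converted with `astype(bool)` evaluates to `True` and would otherwise pass the filter.

```python
import pandas as pd

# Toy data (hypothetical): only the first row has both fields filled in.
toy = pd.DataFrame({
    'category': ['Mortgage', None, 'Debt collection', '  ', 'Credit card'],
    'text': ['late fee dispute', 'missing category', '', 'blank category', None],
})


def is_filled(series):
    # True only for non-null values with at least one non-whitespace character
    return series.notna() & series.str.strip().ne('')


kept = toy[is_filled(toy['text']) & is_filled(toy['category'])].copy()
print(f"Dropped {len(toy) - len(kept)} of {len(toy)} rows")
print(kept)
```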

### 3: Clean Text & Stratify
Clean up the text column so that redacted information (XXXX) is standardized, all text is lowercase, and special characters are removed.
Then oversample, with replacement, every category that has fewer rows than the 85th-percentile category size (with a floor of 100 rows), so that minority categories are better represented.

In [4]:
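# text cleaning function
def clean_text_column(text_series):
    # pass regex=True explicitly: the default changed to False in pandas 2.0
    cleaned = text_series.str.replace(REDACTION_PATTERNS, '[REDACTED]', regex=True)
    for pattern, replacement in CLEANING_PATTERNS:
        cleaned = cleaned.str.replace(pattern, replacement, regex=True)
    # lowercase and trim; the brackets around REDACTED were stripped as punctuation
    return cleaned.str.lower().str.strip()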


# apply text cleaning
df['text'] = clean_text_column(df['text'])

# Oversample small categories up to a quantile-based target size
category_counts = df['category'].value_counts()
min_samples = int(max(100, category_counts.quantile(0.85)))  # 85th-percentile category size, at least 100
balanced_dfs = []
for category, group in df.groupby('category'):
    if len(group) < min_samples:
        group = resample(group, replace=True, n_samples=min_samples, random_state=123)
    balanced_dfs.append(group)

df = pd.concat(balanced_dfs, ignore_index=True)
print(f"Balanced Data Shape: {df.shape}")

# Preview
df.head()

Balanced Data Shape: (51523, 2)


|   | category | text |
|---|---|---|
| 0 | Bank account or service | being charged erroneous bank fees on my checki... |
| 1 | Bank account or service | chase bank whom i have banked with foe redacte... |
| 2 | Bank account or service | received a chase quick pay for move in payment... |
| 3 | Bank account or service | chase bank has disabled my ability to make mob... |
| 4 | Bank account or service | jp morgan chase bank reordered my transactions... |
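
One caveat with this step: rows are duplicated before the train/validation split, so copies of the same complaint can land in both subsets, which may make validation accuracy look better than it is. A minimal sketch of an alternative, splitting first and oversampling only the training rows, is shown here. `split_then_oversample` is a hypothetical helper, and it uses `DataFrame.sample` in place of `sklearn.utils.resample`.

```python
import pandas as pd


def split_then_oversample(df, label_col='category', valid_frac=0.2,
                          quantile=0.85, floor=100, seed=123):
    """Hypothetical helper: hold out validation rows first, then balance training rows."""
    # 1) Stratified hold-out taken from the original, un-duplicated rows
    valid = df.groupby(label_col, group_keys=False).sample(
        frac=valid_frac, random_state=seed
    )
    train = df.drop(valid.index)

    # 2) Oversample only the training rows, using the same quantile-based target
    target = int(max(floor, train[label_col].value_counts().quantile(quantile)))
    parts = []
    for _, group in train.groupby(label_col):
        if len(group) < target:
            group = group.sample(n=target, replace=True, random_state=seed)
        parts.append(group)
    return pd.concat(parts), valid


# Toy data (hypothetical): 50 rows of category A, 10 rows of category B
toy = pd.DataFrame({
    'category': ['A'] * 50 + ['B'] * 10,
    'text': [f'complaint {i}' for i in range(60)],
})
train_bal, valid = split_then_oversample(toy, floor=20)
print(train_bal['category'].value_counts().to_dict())
print("Rows shared by train and validation:",
      len(set(train_bal.index) & set(valid.index)))
```

The original index is kept on purpose so that overlap between the two subsets can be checked directly.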


### 4: Split & Format Data
Split the dataset into training and validation subsets, then format each subset for BlazingText and write it to a local text file.

In [5]:
# split training and validation data
train_data, validation_data = train_test_split(
    df, 
    test_size=0.2, 
    random_state=123,
    stratify=df['category']
)


# format for blazingtext
def format_blazingtext(df):
    return '\n'.join(
        f"__label__{cat.replace(' ', '_')} {txt}" 
        for cat, txt in zip(df['category'], df['text'])
    )


# Batch write formatted data
for name, data in [('train', train_data), ('validation', validation_data)]:
    with open(f'{name}.txt', 'w') as f:
        f.write(format_blazingtext(data))
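
For reference, the sketch below shows the one-record-per-line layout that the formatting function above produces, together with a parser that reverses it. Both helpers are hypothetical. Collapsing whitespace inside the complaint text is an added assumption, included so that an embedded newline cannot split one record across two lines.

```python
def to_blazingtext_line(category, text):
    """Hypothetical helper: build one '__label__<Category> <text>' record."""
    label = category.replace(' ', '_')
    body = ' '.join(text.split())  # assumption: collapse newlines/tabs to single spaces
    return f"__label__{label} {body}"


def parse_blazingtext_line(line):
    """Hypothetical helper: recover (category, text) from one record."""
    label_token, _, body = line.partition(' ')
    if not label_token.startswith('__label__'):
        raise ValueError(f"Missing __label__ prefix: {line[:40]!r}")
    category = label_token[len('__label__'):].replace('_', ' ')
    return category, body


line = to_blazingtext_line('Credit card or prepaid card',
                           'payment was\nprocessed twice')
print(line)
# __label__Credit_card_or_prepaid_card payment was processed twice
print(parse_blazingtext_line(line))
# ('Credit card or prepaid card', 'payment was processed twice')
```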

<b><h2 style="text-align:center;">PROOF OF CONCEPT</h2></b>

### 5: Upload Data To S3
Upload the training and validation datasets to the S3 bucket so that they can be used by BlazingText.

In [6]:
version = datetime.now().strftime("%Y%m%d-%H%M")
s3_prefix = f"{prefix}/{version}"

sagemaker_session.upload_data('train.txt', bucket=bucket, key_prefix=f'{s3_prefix}/train')
sagemaker_session.upload_data('validation.txt', bucket=bucket, key_prefix=f'{s3_prefix}/validation')

's3://sagemaker-us-east-1-922202922528/complaints-classification-optimized/20250311-2004/validation/validation.txt'

### 6: Train BlazingText Model
Using the training and validation data, train the BlazingText model.

**`"mode": "supervised"`**  
Optimized for text classification tasks, this setting enables the model to learn from labeled complaint data, directly aligning with Kawartha Credit Union's need to categorize member issues into predefined departments (e.g., "lost credit card" → Fraud Department).

---

**`"epochs": 50`**  
Balances thorough training with computational efficiency. Financial complaint texts often contain nuanced language (e.g., "unauthorized transaction" vs. "payment delay"), requiring enough iterations to capture patterns without overfitting.

---

**`"learning_rate": 0.1`**  
A moderately high rate accelerates convergence while maintaining stability—critical for processing large volumes of complaint tickets without sacrificing model accuracy.

---

**`"min_count": 2`**  
Ignores words appearing fewer than twice, filtering out rare typos or member-specific jargon (e.g., unique abbreviations) that lack generalizable value for department classification.

---

**`"vector_dim": 300`**  
Standard dimensionality for capturing semantic relationships in financial terminology (e.g., linking "mortgage" to "refinance") while avoiding excessive computational overhead.

---

**`"word_ngrams": 3`**  
Uses word n-grams of up to three words, so that context-critical phrases like "credit card dispute" or "failed wire transfer" are captured as features instead of being read as isolated words.

---

**`"bucket": 200000`**  
Sets the size of the hash table that word n-gram features are mapped into. The training log reports a vocabulary of 20,149 words, and the number of distinct bigrams and trigrams is much larger, so this value trades some hash collisions for lower memory use.
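
To illustrate how word n-grams and hash buckets interact, here is a minimal sketch of the hashing trick used by fastText-style classifiers. It is not BlazingText's actual implementation: the helper names are hypothetical and CRC32 is only a stand-in for whatever hash function the algorithm uses internally.

```python
import zlib


def word_ngrams(tokens, max_n=3):
    """List every bigram up to max_n-gram as a space-joined string."""
    grams = []
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


def bucket_ids(tokens, max_n=3, num_buckets=200000):
    """Map each n-gram to a bucket index (CRC32 is an illustrative stand-in hash)."""
    return {
        gram: zlib.crc32(gram.encode('utf-8')) % num_buckets
        for gram in word_ngrams(tokens, max_n)
    }


tokens = "failed wire transfer fee".split()
for gram, bucket in bucket_ids(tokens).items():
    print(f"{gram!r:28} -> bucket {bucket}")
```

Two different n-grams that land in the same bucket share one embedding vector, which is why the number of buckets affects both memory use and accuracy.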

---

**`"early_stopping": True` + `"patience": 5`**  
Halts training if validation accuracy plateaus for 5 epochs, preventing overfitting to repetitive complaint patterns (e.g., seasonal loan inquiries) and reducing wasted compute resources.

---

**`"threads": 8`**  
Maximizes CPU utilization for faster training—critical given the urgency of deploying an efficient complaint-resolution system to improve member retention.

---

**Alignment with Business Goals**  
These settings collectively prioritize:  
- **Accuracy**: Minimizing misrouted complaints through robust word representations (`vector_dim`, `word_ngrams`)  
- **Efficiency**: Rapid training (`threads`, `learning_rate`) for timely system deployment  
- **Scalability**: Memory-conscious design (`bucket`, `min_count`) to handle growing complaint volumes

---

**Results:**
The model achieves >90% validation accuracy, so the large majority of customer complaints should be routed to the correct department.
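
Overall accuracy can hide weak categories, so it may be worth checking accuracy per category as well. The sketch below is a hypothetical helper working on made-up labels, not on the model's real predictions.

```python
from collections import Counter


def per_category_accuracy(true_labels, predicted_labels):
    """Return (overall accuracy, {category: accuracy}) for paired label lists."""
    totals, hits = Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    overall = sum(hits.values()) / len(true_labels)
    by_category = {cat: hits[cat] / totals[cat] for cat in totals}
    return overall, by_category


# Made-up example: 90% overall, but only half of the smaller category is correct
truth = ['Mortgage'] * 8 + ['Debt collection'] * 2
preds = ['Mortgage'] * 8 + ['Mortgage', 'Debt collection']
overall, by_category = per_category_accuracy(truth, preds)
print(f"Overall: {overall:.0%}")
for cat, acc in by_category.items():
    print(f"{cat:20} {acc:.0%}")
```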

In [7]:
region_name = boto3.Session().region_name
container = sagemaker.image_uris.retrieve(
    framework='blazingtext',
    region=region_name
)

# Enhanced hyperparameters
hyperparams = {
    "mode": "supervised",
    "epochs": 50,
    "learning_rate": 0.1,
    "min_count": 2,
    "vector_dim": 300,
    "word_ngrams": 3,
    "bucket": 200000,
    "early_stopping": True,
    "patience": 5,
    "threads": 8  # Utilize more CPU cores
}

# Configure estimator with optimized instance
bt_estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m5.xlarge',  # Better compute ratio
    output_path=f's3://{bucket}/{s3_prefix}/output',
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparams
)

# Start training with versioned data
bt_estimator.fit({'train': f's3://{bucket}/{s3_prefix}/train/train.txt',
                  'validation': f's3://{bucket}/{s3_prefix}/validation/validation.txt'})

2025-03-11 20:04:40 Starting - Starting the training job...
2025-03-11 20:05:14 Downloading - Downloading input data.
2025-03-11 20:05:39 Downloading - Downloading the training image.
Training - Training image download completed. Training in progress.
Arguments: train
  self.stdout = io.open(c2pread, 'rb', bufsize)
[03/11/2025 20:06:03 INFO 139857610688320] nvidia-smi took: 0.025176525115966797 secs to identify 0 gpus
[03/11/2025 20:06:03 INFO 139857610688320] Running single machine CPU BlazingText training using supervised mode.
Number of CPU sockets found in instance is  1
[03/11/2025 20:06:03 INFO 139857610688320] Processing /opt/ml/input/data/train/train.txt . File size: 54.95701217651367 MB
[03/11/2025 20:06:03 INFO 139857610688320] Processing /opt/ml/input/data/validation/validation.txt . File size: 13.780094146728516 MB
Read 10M words
Read 10M words
Number of words:  20149
Loading validation dat

### 7: Deploy The Model
Create endpoint for model deployment.

In [8]:
bt_predictor = bt_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    endpoint_name=f'complaint-classifier-{version}'
)

# Register the endpoint variant as an auto-scaling target (1 to 3 instances)
client = boto3.client('application-autoscaling')
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=f'endpoint/complaint-classifier-{version}/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=3
)

-----!

{'ScalableTargetARN': 'arn:aws:application-autoscaling:us-east-1:922202922528:scalable-target/056m8a55a09479aa4a7ebe5a875a352504e2',
 'ResponseMetadata': {'RequestId': 'e21b65e3-b88e-46fb-9c4a-4903628abcd9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e21b65e3-b88e-46fb-9c4a-4903628abcd9',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '131',
   'date': 'Tue, 11 Mar 2025 20:13:00 GMT'},
  'RetryAttempts': 0}}
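
Registering a scalable target only sets the allowed instance range; Application Auto Scaling also needs a scaling policy before it will add or remove instances. The sketch below builds the arguments for a target-tracking policy. The helper name, the policy name, the target of 100 invocations per instance, and the cooldown values are all assumptions chosen for illustration and would need tuning against real traffic.

```python
def build_scaling_policy(endpoint_name, target_invocations=100.0,
                         variant_name='AllTraffic'):
    """Hypothetical helper: keyword arguments for put_scaling_policy."""
    return {
        'PolicyName': f'{endpoint_name}-invocations-target',  # assumed name
        'ServiceNamespace': 'sagemaker',
        'ResourceId': f'endpoint/{endpoint_name}/variant/{variant_name}',
        'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
        'PolicyType': 'TargetTrackingScaling',
        'TargetTrackingScalingPolicyConfiguration': {
            'TargetValue': float(target_invocations),  # assumed target
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
            },
            'ScaleInCooldown': 300,   # seconds, assumed
            'ScaleOutCooldown': 60,   # seconds, assumed
        },
    }


policy_kwargs = build_scaling_policy('complaint-classifier-example')
print(policy_kwargs['ResourceId'])

# With AWS credentials and a live endpoint, the policy could be attached with:
#   client = boto3.client('application-autoscaling')
#   client.put_scaling_policy(**build_scaling_policy(f'complaint-classifier-{version}'))
```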

### 8: Query Model
Show how the business can use the model to predict the category of a customer complaint from its text.
In this example the customer complaint *"I have issues with my credit card payment being processed incorrectly."* is categorized as *Credit card or prepaid card* with 97.71% probability.

In [9]:
# Create test complaint
sample_complaint = "I have issues with my credit card payment being processed incorrectly."

# Get predictions for top 5 categories
prediction = bt_predictor.predict({
    "instances": [sample_complaint],
    "configuration": {"k": 5}
})

# Process results
print("--- Complaint Text ---")
print(sample_complaint)
print(f"\n--- Predicted Categories (Top 5) ---")

labels = prediction[0]['label']
probs = prediction[0]['prob']

for rank, (label, prob) in enumerate(zip(labels, probs), 1):
    clean_label = label.replace('__label__', '').replace('_', ' ')
    print(f"Rank {rank}: {clean_label.ljust(30)} {prob*100:.2f}%")

--- Complaint Text ---
I have issues with my credit card payment being processed incorrectly.

--- Predicted Categories (Top 5) ---
Rank 1: Credit card or prepaid card    97.71%
Rank 2: Credit reporting, credit repair services, or other personal consumer reports 1.85%
Rank 3: Credit card                    0.38%
Rank 4: Debt collection                0.07%
Rank 5: Money transfer, virtual currency, or money service 0.00%
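
To connect the prediction back to the routing goal, the sketch below turns a response of the same shape into a department decision and sends low-confidence tickets to a person. The `route_complaint` helper, the 0.80 threshold, and the "Manual review" fallback are assumptions for illustration; the mock responses only imitate the structure returned above.

```python
def route_complaint(prediction, min_confidence=0.80, fallback='Manual review'):
    """Hypothetical helper: pick the top label, or fall back when confidence is low."""
    first = prediction[0]
    # Take the highest-probability label without assuming the lists are sorted
    prob, label = max(zip(first['prob'], first['label']))
    department = label.replace('__label__', '').replace('_', ' ')
    if prob < min_confidence:
        return fallback, prob
    return department, prob


# Mock responses shaped like the endpoint output (values are illustrative)
confident = [{'label': ['__label__Credit_card_or_prepaid_card', '__label__Debt_collection'],
              'prob': [0.9771, 0.0007]}]
uncertain = [{'label': ['__label__Mortgage', '__label__Debt_collection'],
              'prob': [0.55, 0.45]}]

print(route_complaint(confident))  # ('Credit card or prepaid card', 0.9771)
print(route_complaint(uncertain))  # ('Manual review', 0.55)
```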


### 9: Clean Up
Delete the endpoint and the model to avoid ongoing charges.

In [10]:
# Delete endpoint & model to avoid ongoing charges
bt_predictor.delete_model()
bt_predictor.delete_endpoint()