<b><h2 style="text-align:center;">COMM493 - Coding AI for Business</h2></b>
<h5 style="text-align:center;">Assignment 2 - Text Classification Via Natural Language Processing</h5>
<h5 style="text-align:center;">Maxwell Brookes - 20244724</h5>
<h5 style="text-align:center;">March 1st, 2025</h5>

### 0: Intro
**Motivation:** TODO

**Data:** The <a href="https://www.kaggle.com/datasets/abhishek14398/automatic-ticket-classification-dataset/">Automatic Ticket Classification Dataset</a> contains ...
A related Kaggle <a href="https://www.kaggle.com/code/abhishek14398/automatic-ticket-classification-case-study-nlp">implementation</a> is also available.


**Goal:** Map each ticket to its department/category. The labeled tickets can then be used to train a supervised model such as logistic regression, a decision tree, or a random forest. Once trained, the model can classify any new customer complaint ticket into its relevant department.
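
As a rough illustration of the supervised approach described in the goal, the sketch below fits a TF-IDF plus logistic regression baseline with scikit-learn. The four tickets and two category names are made-up placeholders rather than rows from the dataset, and this baseline is not the model trained later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy tickets and labels, for illustration only
texts = [
    "a debt collector keeps calling me about a debt i do not owe",
    "collection agency never sent written notice of the debt",
    "my credit card was charged an annual fee after i closed it",
    "the credit card interest rate went up without warning",
]
labels = ["Debt collection", "Debt collection", "Credit card", "Credit card"]

# Turn text into TF-IDF features, then fit a linear classifier on them
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)

# Classify a new, unseen ticket
new_ticket = "i was charged a late fee on my credit card"
predicted = baseline.predict([new_ticket])[0]
print("Predicted category:", predicted)
```

With only four training rows the prediction itself is not meaningful; the point is the fit-then-predict workflow on labeled text.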

<b><h2 style="text-align:center;">DATA PREPROCESSING</h2></b>

### 1: Imports for data preprocessing

In [1]:
import pandas as pd
import json
import re
import string
from sklearn.model_selection import train_test_split
print('Successfully loaded imports for data preprocessing!')

Successfully loaded imports for data preprocessing!


### 2: Load dataset

In [2]:
with open('complaints.json', 'r') as f:
    data = json.load(f)
df = pd.json_normalize(data)
print('Data Shape:', df.shape)

Data Shape: (78313, 22)
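
The `_source.`-prefixed column names that are renamed in the following step come from how `pd.json_normalize` flattens nested objects: each nested key becomes a dotted column name. The sketch below shows this on a single made-up record; the real file has more fields, and the values here are placeholders.

```python
import pandas as pd

# One hypothetical record shaped like an entry in complaints.json
records = [
    {
        "_id": "1",
        "_source": {
            "product": "Debt collection",
            "complaint_what_happened": "example complaint text",
        },
    }
]

# Nested keys under "_source" become "_source.<key>" columns
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```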


### 3: Rename columns

In [3]:
# rename columns
df.rename(columns={
    '_index': 'index',
    '_type': 'type',
    '_id': 'id',
    '_score': 'score',
    '_source.tags': 'tags',
    '_source.zip_code': 'zip_code',
    '_source.complaint_id': 'complaint_id',
    '_source.issue': 'issue',
    '_source.date_received': 'date_received',
    '_source.state': 'state',
    '_source.consumer_disputed': 'consumer_disputed',
    '_source.product': 'category',
    '_source.company_response': 'company_response',
    '_source.company': 'company',
    '_source.submitted_via': 'submitted_via',
    '_source.date_sent_to_company': 'date_sent_to_company',
    '_source.company_public_response': 'company_public_response',
    '_source.sub_product': 'sub_category',
    '_source.timely': 'timely',
    '_source.complaint_what_happened': 'complaint_text',
    '_source.sub_issue': 'sub_issue',
    '_source.consumer_consent_provided': 'consumer_consent_provided'
}, inplace=True)
df.columns

Index(['index', 'type', 'id', 'score', 'tags', 'zip_code', 'complaint_id',
       'issue', 'date_received', 'state', 'consumer_disputed', 'category',
       'company_response', 'company', 'submitted_via', 'date_sent_to_company',
       'company_public_response', 'sub_category', 'timely', 'complaint_text',
       'sub_issue', 'consumer_consent_provided'],
      dtype='object')

### 4: Drop columns, drop rows, clean text

In [4]:
# combine categories with sub_categories
# df['category'] = df['category'] + ' - ' + df['sub_category']

# drop columns
columns_to_keep = ['issue', 'category', 'complaint_text', 'timely']
all_columns = df.columns.tolist()
columns_to_drop = [col for col in all_columns if col not in columns_to_keep]
df.drop(columns_to_drop, axis=1, inplace=True)

# drop rows where 'complaint_text' is null, empty, or contains only whitespace
before = df.shape[0]
mask = df['complaint_text'].notna() & df['complaint_text'].str.strip().astype(bool)
df = df[mask]
after = df.shape[0]
rows_dropped = before - after
print(f"Dropped: {rows_dropped} rows")

# drop rows where 'category' is null, nan, empty, or contains only whitespace
before = df.shape[0]
mask = df['category'].notna() & (df['category'].str.strip().astype(bool))
df = df[mask]
after = df.shape[0]
rows_dropped = before - after
print(f"Dropped: {rows_dropped} rows")


def clean_text(text):
    text = text.lower()  # Make the text lowercase
    text = re.sub(r'\[.*\]', '', text).strip()  # Remove text in square brackets
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\S*\d\S*\s*', '', text).strip()  # Remove words containing numbers
    text = text.replace('xxxx', '')  # Remove the placeholder that masks hidden information
    text = text.replace('  ', ' ')  # Collapse double spaces left by the removals
    return text.strip()


# clean text
df['complaint_text'] = df['complaint_text'].apply(clean_text)

# Preview
df.head()

Dropped: 57241 rows
Dropped: 2109 rows


| | issue | category | timely | complaint_text |
|---|---|---|---|---|
| 1 | Written notification about debt | Debt collection - Credit card debt | Yes | good morning my name is and i appreciate it i... |
| 2 | Other features, terms, or problems | Credit card or prepaid card - General-purpose ... | Yes | i upgraded my card in and was told by the age... |
| 10 | Incorrect information on your report | Credit reporting, credit repair services, or o... | Yes | chase card was reported on however fraudulent ... |
| 11 | Incorrect information on your report | Credit reporting, credit repair services, or o... | Yes | on while trying to book a ticket i came acro... |
| 14 | Managing an account | Checking or savings account - Checking account | Yes | my grand son give me check for i deposit it in... |
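
To make the effect of each cleaning step concrete, the sketch below applies the same sequence of operations as `clean_text` to a single made-up complaint (not taken from the dataset). It also shows that one pass of the double-space replacement can leave a double space behind when several removals land next to each other, which is harmless only if whitespace is normalized again later.

```python
import re
import string


def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*\]', '', text).strip()                         # bracketed text
    text = text.translate(str.maketrans('', '', string.punctuation))  # punctuation
    text = re.sub(r'\S*\d\S*\s*', '', text).strip()                    # tokens with digits
    text = text.replace('xxxx', '')                                    # masked values
    text = text.replace('  ', ' ')                                     # one pass over double spaces
    return text.strip()


# Made-up complaint for illustration
sample = "On 03/14/2024 I called XXXX [redacted] about a $35.00 fee!!"
cleaned = clean_text(sample)
print(repr(cleaned))

# Splitting and re-joining collapses any remaining run of whitespace
normalized = " ".join(cleaned.split())
print(repr(normalized))
```

The date, the dollar amount, the bracketed text, and the masked value are all removed, leaving only lowercase words.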


### 5: Split, format, and save

In [5]:
# Split data into training and validation sets
train_data, validation_data = train_test_split(df, test_size=0.2, random_state=123)
print('Validation Data Shape:', validation_data.shape)
print('Train Data Shape:', train_data.shape)


def format_blazingtext_data(df):
    formatted_data = []
    for _, row in df.iterrows():
        label = row['category'].replace(' ', '_')  # Replace spaces in labels
        text = ' '.join(row['complaint_text'].split())  # Remove extra whitespace
        formatted_data.append(f"__label__{label} {text}")
    return '\n'.join(formatted_data)


# Format training and validation data
train_data_formatted = format_blazingtext_data(train_data)
validation_data_formatted = format_blazingtext_data(validation_data)

# Save to files
with open('train.txt', 'w') as f:
    f.write(train_data_formatted)
with open('validation.txt', 'w') as f:
    f.write(validation_data_formatted)

Validation Data Shape: (3793, 4)
Train Data Shape: (15170, 4)
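
For reference, each line written to `train.txt` and `validation.txt` follows the BlazingText supervised format: a `__label__` prefix, the label with no spaces, then the space-separated text on the same line. The sketch below builds one such line from a made-up row; `to_blazingtext_line` is a hypothetical helper that mirrors the per-row logic above without pandas.

```python
def to_blazingtext_line(label, text):
    """Build one supervised-mode line: __label__<label> <tokens>."""
    safe_label = label.replace(" ", "_")  # the label must be a single token
    tokens = " ".join(text.split())       # collapse whitespace and newlines
    return f"__label__{safe_label} {tokens}"


# Made-up example row
line = to_blazingtext_line("Debt collection", "i  never received\nwritten notice")
print(line)
```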


<b><h2 style="text-align:center;">MODEL TRAINING</h2></b>

### 6: Imports for model training and deployment

In [6]:
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Define S3 bucket and prefix for data storage
bucket = sagemaker_session.default_bucket()
prefix = 'complaints-classification'



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### 7: Upload data to S3

In [7]:
s3_train_path = f's3://{bucket}/{prefix}/train'
s3_val_path = f's3://{bucket}/{prefix}/validation'

sagemaker_session.upload_data('train.txt', bucket=bucket, key_prefix=f'{prefix}/train')
sagemaker_session.upload_data('validation.txt', bucket=bucket, key_prefix=f'{prefix}/validation')

's3://sagemaker-us-east-1-922202922528/complaints-classification/validation/validation.txt'

### 8: Train the BlazingText model

In [8]:
# Get BlazingText image URI (image_uris.retrieve replaces the deprecated get_image_uri)
region_name = boto3.Session().region_name
container = sagemaker.image_uris.retrieve(framework='blazingtext', region=region_name)

# Configure hyperparameters
hyperparams = {
    "mode": "supervised",   # Text classification mode
    "epochs": 10,          # Number of training epochs
    "learning_rate": 0.01, # Learning rate
    "min_count": 2,        # Ignore words with frequency < 2
    "vector_dim": 100,     # Word embedding dimension
    "early_stopping": True,
    "patience": 3,         # Stop if validation accuracy doesn't improve for 3 epochs
    "word_ngrams": 2       # Use bigrams
}

# Create estimator
bt_estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparams
)

# Start training job
bt_estimator.fit({'train': s3_train_path, 'validation': s3_val_path})

2025-03-09 21:32:27 Starting - Starting the training job...
2025-03-09 21:32:41 Starting - Preparing the instances for training...
2025-03-09 21:33:03 Downloading - Downloading input data...
Downloading - Downloading the training image.
Arguments: train
  self.stdout = io.open(c2pread, 'rb', bufsize)
[03/09/2025 21:34:08 INFO 140117156800320] nvidia-smi took: 0.02523207664489746 secs to identify 0 gpus
[03/09/2025 21:34:08 INFO 140117156800320] Running single machine CPU BlazingText training using supervised mode.
Number of CPU sockets found in instance is  1
[03/09/2025 21:34:08 INFO 140117156800320] Processing /opt/ml/input/data/train/train.txt . File size: 19.488234519958496 MB
[03/09/2025 21:34:08 INFO 140117156800320] Processing /opt/ml/input/data/validation/validation.txt . File size: 4.8691558837890625 MB
Read 3M words
Number of words:  15016

2025-03-09 21:34:30 Training - Training image download completed. Train
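
Two of the hyperparameters above, `min_count` and `word_ngrams`, are easier to reason about with a small example. The sketch below is a conceptual illustration on three made-up, already-cleaned complaints; it is not BlazingText's internal implementation, and the variable names are hypothetical.

```python
from collections import Counter

# Made-up, already-cleaned complaints
docs = [
    "i was charged a late fee",
    "the late fee was not authorized",
    "i disputed the charge",
]
tokenized = [doc.split() for doc in docs]

# min_count=2: keep only words that appear at least twice in the corpus
counts = Counter(word for tokens in tokenized for word in tokens)
vocab = {word for word, n in counts.items() if n >= 2}


# word_ngrams=2: adjacent word pairs become additional features
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))


print(sorted(vocab))
print(bigrams(tokenized[0]))
```

Rare words such as "disputed" fall out of the vocabulary, while pairs such as ("late", "fee") give the classifier some word-order information that single words do not carry.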

### 9: Deploy the model for inference

In [None]:
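from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Deploy the trained model to an endpoint that sends and receives JSON
bt_predictor = bt_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

# Example prediction
sample_complaint = "I was charged a fee that I did not authorize."
formatted_sample = clean_text(sample_complaint)

# BlazingText expects {"instances": [...]} and returns one result per instance
response = bt_predictor.predict({"instances": [formatted_sample]})
print("Predicted category:", response[0]['label'][0])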
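
The endpoint exchanges JSON: a request carries an `instances` list of texts, and each prediction comes back with a `label` list and a `prob` list. The sketch below shows how such payloads could be built and parsed offline, assuming that documented format. `build_request` and `parse_response` are hypothetical helpers, and the response body is made up rather than returned by the model.

```python
import json


def build_request(texts):
    """Serialize cleaned complaints into a JSON request body."""
    return json.dumps({"instances": list(texts)})


def parse_response(body):
    """Return (category, probability) for each prediction in a JSON body."""
    results = []
    for item in json.loads(body):
        # Undo the training-time formatting: drop the prefix, restore spaces
        label = item["label"][0].replace("__label__", "", 1).replace("_", " ")
        results.append((label, item["prob"][0]))
    return results


# Hypothetical response body; the label and probability are made up
fake_body = '[{"label": ["__label__Debt_collection"], "prob": [0.93]}]'
request_body = build_request(["i was charged a fee that i did not authorize"])
parsed = parse_response(fake_body)
print(request_body)
print(parsed)
```

Turning underscores back into spaces is only safe when the original category names contain no underscores of their own.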

---

### 10: Clean up (important!)

In [None]:
# Delete endpoint to avoid ongoing charges
bt_predictor.delete_endpoint()