

## **Preprocessing of A2P SMS Dataset**

The following preprocessing steps were performed on the raw SMS dataset to prepare it for AI-based spam classification:

1. **Upload and Load Dataset**

   * The raw CSV file was uploaded and loaded into a DataFrame.
   * The file was decoded with `utf-8-sig`, which reads it as UTF-8 and strips a leading byte-order mark (BOM) if one is present, so non-ASCII characters such as the rupee symbol (`₹`) are read correctly.

2. **Cleaning Messages**

   * All text was converted to **lowercase**.
   * Leading and trailing whitespace was stripped, and every run of whitespace was collapsed to a single space.
   * Internal line breaks and carriage returns were replaced with a space so that each message stays on a single row.

3. **Remove Duplicates and Empty Rows**

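   * Rows with a duplicate `message` value were dropped, keeping the first occurrence, so that repeated messages do not bias training.
   * Fully empty rows and any header rows read in as data (such as `Message`, `Category`) were excluded. In the cells below this is not a separate step: those rows carry none of the three labels, so they drop out when rows are selected by category for balancing.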

4. **Balancing the Dataset**

   * The classes `Transactional`, `Spam`, and `Promotional` were imbalanced.
   * Every class was resampled with replacement to the size of the largest class (4,020 rows each in the run shown below), so the smaller classes were **upsampled** to equal representation.

5. **Final Dataset**

   * The processed dataset includes three columns:

     * `message` → Original SMS text
     * `category` → Message label (`Transactional`, `Promotional`, `Spam`)
     * `cleaned_message` → Preprocessed message text

   * The dataset was saved as `labeled_sms_dataset_FINAL.csv`, ready for model training.
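
Step 3 can also be written as an explicit filter rather than relying on the later category selection. The following is a minimal sketch under stated assumptions: the helper `drop_header_and_empty_rows`, the constant `VALID_LABELS`, and the toy rows are hypothetical and are not part of the notebook cells.

```python
import pandas as pd

# Toy frame standing in for the raw data; these rows are invented for illustration.
raw = pd.DataFrame(
    {
        "message": ["Message", "Your OTP is 1234", None, "   ", "Win a prize now"],
        "category": ["Category", "Transactional", None, "Spam", "Spam"],
    }
)

# The three labels named in this document.
VALID_LABELS = {"Transactional", "Promotional", "Spam"}


def drop_header_and_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep rows with a known label and a non-blank message."""
    has_label = df["category"].isin(VALID_LABELS)
    has_text = df["message"].fillna("").astype(str).str.strip().ne("")
    return df[has_label & has_text].reset_index(drop=True)


filtered = drop_header_and_empty_rows(raw)
print(filtered)
```

Filtering on the label set removes a stray header row and fully empty rows in one pass, and the extra text check also catches rows that have a label but a blank message.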

---

### **Preprocessing Workflow**

```
Raw CSV Dataset (message, category)
           │
           ▼
  Load into Pandas DataFrame
           │
           ▼
   Encoding Fix (utf-8-sig)
   - Corrects special characters like ₹
           │
           ▼
   Remove Extra Headers & Empty Rows
   - Removes any repeated headers or blank lines
           │
           ▼
       Clean Messages
   - Convert text to lowercase
   - Normalize spaces
   - Remove line breaks
           │
           ▼
   Remove Duplicates
   - Ensures unique messages only
           │
           ▼
   Balance Classes
   - Upsample minor classes (Spam, Promotional)
   - Ensures equal representation of all categories
           │
           ▼
Final Dataset:
Columns: message | category | cleaned_message
Saved as: labeled_sms_dataset_FINAL.csv
```


In [12]:
import pandas as pd
import re
from sklearn.utils import resample
from google.colab import files


In [13]:
# Upload your CSV file
uploaded = files.upload()

# Get uploaded filename
file_name = list(uploaded.keys())[0]
print(f"Uploaded file: {file_name}")

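# Read dataset. header=None with explicit names means the file's own header
# line ("Message", "Category") is loaded as data row 0; it is excluded later
# when rows are selected by category. utf-8-sig strips a leading BOM if present.
sms_df = pd.read_csv(file_name, names=["message", "category"], header=None, encoding="utf-8-sig")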
print(f"Total rows loaded: {len(sms_df)}")
sms_df.head()


Saving message_dataset_50k .csv to message_dataset_50k  (2).csv
Uploaded file: message_dataset_50k  (2).csv
Total rows loaded: 50001


|   | message | category |
|---|---------|----------|
| 0 | Message | Category |
| 1 | Final notice. Update your info: https://verify... | Spam |
| 2 | Reset your password now at https://get-rich-fa... | Spam |
| 3 | Your transaction ID is TXN471861. Please keep ... | Transactional |
| 4 | Your package with tracking ID 162556 has been ... | Transactional |
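
To illustrate what `utf-8-sig` changes compared with plain `utf-8`, here is a small sketch using Python's built-in codecs on an in-memory byte string. The sample text is invented for illustration, and the sketch deliberately avoids pandas so that it shows only the codec behaviour.

```python
# Bytes of a tiny CSV that starts with a UTF-8 byte-order mark (BOM).
raw_bytes = "Message,Category\nPay ₹500 now,Spam\n".encode("utf-8-sig")

as_utf8 = raw_bytes.decode("utf-8")          # BOM survives as the character U+FEFF
as_utf8_sig = raw_bytes.decode("utf-8-sig")  # BOM is stripped

print(repr(as_utf8[:8]))
print(repr(as_utf8_sig[:8]))

# The rupee symbol decodes correctly under both codecs; only the BOM handling differs.
print("₹" in as_utf8, "₹" in as_utf8_sig)
```

For a file without a BOM, both codecs decode to the same text, so `utf-8-sig` is a safe choice either way.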


In [14]:
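def clean_sms(text):
    """
    Cleans SMS text:
    - Return an empty string for missing values
    - Strip leading/trailing whitespace and convert to lowercase
    - Collapse every whitespace run (including line breaks) into one space
    """
    if pd.isna(text):
        return ""
    text = str(text).strip().lower()
    # \s matches spaces, tabs, carriage returns, and newlines, so a single
    # substitution both collapses repeated spaces and removes line breaks.
    text = re.sub(r'\s+', ' ', text)
    return text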

print("Cleaning messages...")
sms_df['cleaned_message'] = sms_df['message'].apply(clean_sms)

# Remove duplicates based on the raw message text (keeps the first occurrence)
sms_df = sms_df.drop_duplicates(subset='message').reset_index(drop=True)
print(f"After cleaning & deduplication: {sms_df.shape}")
sms_df.head()


Cleaning messages...
After cleaning & deduplication: (4221, 3)


|   | message | category | cleaned_message |
|---|---------|----------|-----------------|
| 0 | Message | Category | message |
| 1 | Final notice. Update your info: https://verify... | Spam | final notice. update your info: https://verify... |
| 2 | Reset your password now at https://get-rich-fa... | Spam | reset your password now at https://get-rich-fa... |
| 3 | Your transaction ID is TXN471861. Please keep ... | Transactional | your transaction id is txn471861. please keep ... |
| 4 | Your package with tracking ID 162556 has been ... | Transactional | your package with tracking id 162556 has been ... |
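
The cell above deduplicates on the raw `message` column. As a hedged sketch of an alternative, the example below deduplicates on `cleaned_message` instead, which also merges messages that differ only in case or whitespace. The sample rows are invented, and `clean_sms` is restated here only so the block runs on its own.

```python
import re

import pandas as pd


def clean_sms(text):
    """Same cleaning rules as the notebook cell, restated for a standalone example."""
    if pd.isna(text):
        return ""
    return re.sub(r"\s+", " ", str(text).strip().lower())


# Invented rows: the first two differ only in case, spacing, and a line break.
samples = pd.DataFrame({"message": ["Your OTP\r\nis  1234 ", "your otp is 1234", None]})
samples["cleaned_message"] = samples["message"].apply(clean_sms)

by_raw = samples.drop_duplicates(subset="message")
by_cleaned = samples.drop_duplicates(subset="cleaned_message")

print(samples["cleaned_message"].tolist())
print(len(by_raw), len(by_cleaned))
```

Which column to deduplicate on is a design choice: the raw column preserves surface variants, while the cleaned column treats them as the same message.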


In [15]:
print("Balancing categories...")

# Separate by category
df_trans = sms_df[sms_df['category'] == 'Transactional']
df_spam = sms_df[sms_df['category'] == 'Spam']
df_promo = sms_df[sms_df['category'] == 'Promotional']

# Determine target size
target_size = max(len(df_trans), len(df_spam), len(df_promo))

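# Resample every class with replacement to target_size. Smaller classes are
# upsampled; the largest class is also redrawn with replacement (bootstrapped).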
df_trans_up = resample(df_trans, replace=True, n_samples=target_size, random_state=42)
df_spam_up = resample(df_spam, replace=True, n_samples=target_size, random_state=42)
df_promo_up = resample(df_promo, replace=True, n_samples=target_size, random_state=42)

# Combine and shuffle
balanced_df = pd.concat([df_trans_up, df_spam_up, df_promo_up])
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

print("Category distribution after balancing:")
print(balanced_df['category'].value_counts())
balanced_df.head()


Balancing categories...
Category distribution after balancing:
category
Transactional    4020
Promotional      4020
Spam             4020
Name: count, dtype: int64


|   | message | category | cleaned_message |
|---|---------|----------|-----------------|
| 0 | Your transaction ID is TXN179896. Please keep ... | Transactional | your transaction id is txn179896. please keep ... |
| 1 | Check out our latest deals at https://secure.b... | Promotional | check out our latest deals at https://secure.b... |
| 2 | Click here to win an iPhone: https://winfreeca... | Spam | click here to win an iphone: https://winfreeca... |
| 3 | Your transaction ID is TXN418755. Please keep ... | Transactional | your transaction id is txn418755. please keep ... |
| 4 | Check out our latest deals at https://trip.com | Promotional | check out our latest deals at https://trip.com |
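
A variant of the balancing step keeps every original row and only adds sampled copies to the classes that are smaller than the largest one. The sketch below is an assumption-laden illustration, not the notebook's method: the helper `upsample_to_largest` is hypothetical, the toy data is invented, and it uses pandas `DataFrame.sample` in place of `sklearn.utils.resample` to stay dependency-light.

```python
import pandas as pd

# Invented toy data: 6 Transactional, 3 Spam, 1 Promotional.
toy = pd.DataFrame(
    {
        "message": [f"msg {i}" for i in range(10)],
        "category": ["Transactional"] * 6 + ["Spam"] * 3 + ["Promotional"] * 1,
    }
)


def upsample_to_largest(df, label_col="category", seed=42):
    """Hypothetical helper: top up each smaller class to the largest class size."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        shortfall = target - len(group)
        if shortfall > 0:
            # Draw only the missing rows, with replacement, from this class.
            extra = group.sample(n=shortfall, replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


balanced = upsample_to_largest(toy)
print(balanced["category"].value_counts())
```

With this approach the largest class is left untouched and no original message is lost from the smaller classes.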


In [11]:
output_file = "labeled_sms_dataset_FINAL.csv"

# Ensure column order: message, category, cleaned_message
final_df = balanced_df[['message', 'category', 'cleaned_message']]

# Save CSV
final_df.to_csv(output_file, index=False, encoding='utf-8', lineterminator='\n')
print(f"Final cleaned & balanced dataset saved as {output_file}")

# Download CSV
files.download(output_file)


Final cleaned & balanced dataset saved as labeled_sms_dataset_FINAL.csv
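
Before training, it can be useful to confirm that the saved file reloads with the expected columns and labels. The following sketch round-trips a small invented frame through an in-memory buffer rather than the real `labeled_sms_dataset_FINAL.csv`; the constants `EXPECTED_COLUMNS` and `VALID_LABELS` are hypothetical names introduced for the example.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["message", "category", "cleaned_message"]
VALID_LABELS = {"Transactional", "Promotional", "Spam"}

# Invented stand-in for the final dataset.
final_df = pd.DataFrame(
    {
        "message": ["Pay ₹500 now", "Your OTP is 1234"],
        "category": ["Spam", "Transactional"],
        "cleaned_message": ["pay ₹500 now", "your otp is 1234"],
    }
)

# Write to and read back from an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
final_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

columns_ok = list(reloaded.columns) == EXPECTED_COLUMNS
labels_ok = set(reloaded["category"]) <= VALID_LABELS
print(columns_ok, labels_ok)
```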

