In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

from google.colab import auth
auth.authenticate_user()

Mounted at /content/drive


# NLP Complaint Intelligence System
## PHASE 2: Text Embeddings & Feature Representation

---

## Step 1: Load Cleaned Dataset

### Why this step is required

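In production ML pipelines, models rarely consume raw data directly.
They typically consume **cleaned, validated datasets** produced
by an upstream data preparation phase.

Loading the cleaned dataset from a fixed path provides:
- A single source of truth for all downstream steps
- Consistent, repeatable experiments
- Clear separation of responsibilities between cleaning and feature generation

It does not by itself prevent data leakage; that depends on how the data
is split and how features are fitted later. The overall layout mirrors
common ML system design.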


In [3]:
import pandas as pd

clean_path = "/content/drive/MyDrive/PROJECTS/EXTRA PROJECT/NLP_PROJECT_COMPLAINT/data/complaints_clean.csv"
df = pd.read_csv(clean_path)

## Step 2: Basic Dataset Validation After Cleaning

### Why this step is required

Even after cleaning, we must confirm:
- Dataset loaded correctly
- Text and labels are aligned
- No unexpected null values exist

This is a standard safety check before feature generation.
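
The checks listed above can be bundled into one small function. The sketch below is illustrative only: `validate_clean_frame` is a hypothetical helper, not part of the notebook, and the toy frame stands in for `complaints_clean.csv`. It also counts whitespace-only text, which `isnull()` does not flag.

```python
import pandas as pd


def validate_clean_frame(df: pd.DataFrame, text_col: str = "text", label_col: str = "label") -> dict:
    """Run basic sanity checks and return a small report (hypothetical helper)."""
    missing = [c for c in (text_col, label_col) if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")

    # Whitespace-only strings are not nulls, so count them separately.
    blank_text = df[text_col].astype(str).str.strip().eq("").sum()

    return {
        "rows": len(df),
        "null_text": int(df[text_col].isnull().sum()),
        "null_label": int(df[label_col].isnull().sum()),
        "blank_text": int(blank_text),
        "n_labels": int(df[label_col].nunique()),
    }


# Toy frame standing in for the real cleaned dataset.
toy = pd.DataFrame(
    {
        "text": ["late fee charged twice", "   ", "wrong balance reported"],
        "label": ["Credit card", "Mortgage", "Debt collection"],
    }
)
report = validate_clean_frame(toy)
print(report)
```

On the real data, the same function could be called as `validate_clean_frame(df)` right after loading.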


In [4]:
df.shape

(399811, 2)

In [5]:
df.head()

| | text | label |
|---|---|---|
| 0 | i am writing to formally submit a complaint un... | Credit reporting or other personal consumer re... |
| 1 | ucs section a states i have the right to priva... | Credit reporting or other personal consumer re... |
| 2 | i am writing to have the following information... | Credit reporting or other personal consumer re... |
| 3 | transunion still has a bankruptcy on my accoun... | Credit reporting or other personal consumer re... |
| 4 | this account was opened over years ago this da... | Credit reporting, credit repair services, or o... |


In [6]:
df.isnull().sum()

| column | null count |
|---|---|
| text | 0 |
| label | 0 |


## Step 3: Separate Input Text and Target Labels

### Why this step is required

Machine learning pipelines treat:
- Input features (X)
- Target labels (y)

as two independent entities.

Explicit separation:
- Improves clarity
- Prevents accidental data leakage
- Matches production training pipelines
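
One place where keeping `X` and `y` separate helps is when a held-out test set is created. The notebook does not split the data in this phase, so the following is only a sketch on toy data, assuming a stratified split with scikit-learn's `train_test_split`. The `test_size` and `random_state` values are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the complaint text and labels.
toy = pd.DataFrame(
    {
        "text": [f"credit card complaint {i}" for i in range(6)]
        + [f"mortgage complaint {i}" for i in range(6)],
        "label": ["Credit card"] * 6 + ["Mortgage"] * 6,
    }
)
X_toy = toy["text"]
y_toy = toy["label"]

# stratify=y_toy keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Text and labels stay aligned because both Series share the same index.
print(len(X_train), len(X_test))
print(X_train.index.equals(y_train.index))
```

Any feature extractor would then be fitted on `X_train` only and applied to `X_test` with `transform`.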


In [7]:
X = df["text"]
y = df["label"]

In [15]:
y.value_counts(normalize=True)*100

| label | proportion (%) |
|---|---|
| Credit reporting or other personal consumer reports | 56.137025 |
| Credit reporting, credit repair services, or other personal consumer reports | 16.807191 |
| Debt collection | 8.784401 |
| Checking or savings account | 4.625936 |
| Money transfer, virtual currency, or money service | 3.183004 |
| Credit card | 3.143485 |
| Mortgage | 2.084985 |
| Credit card or prepaid card | 1.875886 |
| Vehicle loan or lease | 1.146292 |
| Student loan | 1.029236 |


## Step 4: Why We Start With TF-IDF Embeddings

### Why this step is required

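Before moving to deep learning or transformers, it is good practice
to establish a **strong baseline**.

TF-IDF is used because:
- It is fast, and its sparse output is memory-efficient
- It often performs well on keyword-rich text such as complaints
- It is interpretable: each feature maps to a specific word or n-gram
- It scales to large datasets within Colab's resource limits

Strictly speaking, TF-IDF produces sparse lexical feature vectors rather
than learned dense embeddings; this notebook uses "embeddings" loosely
for any numeric text representation. TF-IDF remains a common baseline
in production NLP systems.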


## Step 5: Convert Text to TF-IDF Feature Vectors

### Why this step is required

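ML models operate on numbers, not raw text.
TF-IDF converts text into numerical vectors that capture:
- How often a term appears in a document (term frequency)
- How rare the term is across the corpus (inverse document frequency)
- Together, how well the term distinguishes one document from the others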

These vectors become the input to classifiers and neural networks.
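
To make the weighting concrete, here is a minimal pure-Python sketch on a three-document toy corpus. It assumes the smoothed IDF form that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. Tokenization is a plain whitespace split, which is simpler than what `TfidfVectorizer` does.

```python
import math
from collections import Counter

docs = [
    "late fee on credit card",
    "credit report shows wrong balance",
    "credit card fee dispute",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_vector(tokens: list) -> dict:
    tf = Counter(tokens)
    weights = {t: tf[t] * idf(t) for t in tf}
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "late" appears in only one document and gets a higher weight than "fee" (two documents), which in turn outweighs "credit" (all three documents).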


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
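tfidf = TfidfVectorizer(
    max_features=50000,   # keep the 50,000 most frequent terms across the corpus
    ngram_range=(1, 2),   # use unigrams and bigrams
    min_df=5,             # drop terms that appear in fewer than 5 documents
    max_df=0.9            # drop terms that appear in more than 90% of documents
)

# Note: this fits the vocabulary and IDF weights on the full dataset.
# If a held-out test set is created later, fit on the training split only
# and call tfidf.transform() on the test split to avoid leakage.
X_tfidf = tfidf.fit_transform(X)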
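
The effect of `ngram_range`, `min_df`, and `max_df` is easier to see on a tiny corpus. The sketch below uses a made-up four-document corpus and thresholds scaled down to match it; it is not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "credit report error",
    "credit report dispute",
    "credit card fee",
    "mortgage payment late",
]

# Unigrams only.
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(corpus)

# Unigrams plus bigrams: adjacent word pairs become features too.
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(corpus)

# min_df=2 drops terms seen in fewer than 2 documents;
# max_df=0.7 drops terms seen in more than 70% of documents.
filtered = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.7).fit(corpus)

print(len(unigrams.vocabulary_), "unigram features")
print(len(bigrams.vocabulary_), "unigram + bigram features")
print(sorted(filtered.vocabulary_))
```

Here "credit" occurs in three of four documents (75%), so `max_df=0.7` removes it, while the bigram "credit report" survives because it occurs in exactly two.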

## Step 6: Understand the Shape of Embeddings

### Why this step is required

Understanding feature shape is critical for:
- Model selection
- Memory planning
- Neural network design

This step confirms how text has been transformed numerically.


In [10]:
X_tfidf.shape

(399811, 50000)
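
For memory planning, it helps to compare what this shape would cost as a dense array with what a sparse matrix actually stores. The sketch below computes the dense figure for the shape above and measures a small toy CSR matrix. `sparse_nbytes` is a hypothetical helper, and the toy matrix is not the real `X_tfidf`.

```python
import numpy as np
from scipy import sparse

# Dense float64 cost of the full (399811, 50000) matrix: roughly 160 GB.
n_rows, n_cols = 399_811, 50_000
dense_bytes = n_rows * n_cols * 8
print(f"dense float64: {dense_bytes / 1e9:.1f} GB")


def sparse_nbytes(matrix) -> int:
    """Bytes held by the three arrays of a CSR matrix (hypothetical helper)."""
    csr = matrix.tocsr()
    return csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes


# Toy matrix: 100 rows, 2000 columns, only 20 non-zero entries per row.
dense_toy = np.zeros((100, 2000))
dense_toy[:, :20] = 1.0
toy = sparse.csr_matrix(dense_toy)

print("non-zeros:", toy.nnz)
print("dense toy bytes:", dense_toy.nbytes)
print("sparse toy bytes:", sparse_nbytes(toy))
```

The same helper could be applied to `X_tfidf` to see its actual footprint, and `X_tfidf.nnz` gives the number of stored values.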

## Step 7: Encode Target Labels Numerically

### Why this step is required

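Neural networks and many ML libraries expect numeric class labels rather than strings.
scikit-learn classifiers accept string labels directly, but encoding them
explicitly keeps the pipeline portable across frameworks.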

Label encoding:
- Converts categories into numeric IDs
- Preserves class identity
- Is reversible for interpretation
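
The three properties above can be checked on a handful of labels. This is a minimal sketch with toy labels; `LabelEncoder` assigns IDs in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Mortgage", "Credit card", "Debt collection", "Credit card"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the index of each name is its numeric ID.
mapping = {name: int(idx) for idx, name in enumerate(encoder.classes_)}

# inverse_transform recovers the original strings from the IDs.
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(list(encoded))
print(list(decoded))
```

The numeric IDs carry no ordinal meaning; they are only identifiers for the classes.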


In [11]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

## Step 8: Verify Encoded Labels

### Why this step is required

Before training any model, we must ensure:
- Correct number of classes
- No label mismatch
- Clean mapping between text and labels


In [12]:
len(label_encoder.classes_)

14

In [13]:
label_encoder.classes_

array(['Checking or savings account', 'Credit card',
       'Credit card or prepaid card',
       'Credit reporting or other personal consumer reports',
       'Credit reporting, credit repair services, or other personal consumer reports',
       'Debt collection', 'Debt or credit management',
       'Money transfer, virtual currency, or money service', 'Mortgage',
       'Payday loan, title loan, or personal loan',
       'Payday loan, title loan, personal loan, or advance loan',
       'Prepaid card', 'Student loan', 'Vehicle loan or lease'],
      dtype=object)
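
The class list above contains several closely related names, for example 'Credit card', 'Credit card or prepaid card', and 'Prepaid card'. A per-class count on the encoded labels makes rare or overlapping classes easy to spot. The sketch below uses toy labels and an arbitrary threshold `min_count`, both assumptions for illustration. Whether to merge such classes is a modeling decision that this notebook does not make.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the real label column.
labels = (
    ["Credit card"] * 5
    + ["Credit card or prepaid card"] * 2
    + ["Prepaid card"] * 1
    + ["Mortgage"] * 4
)

encoder = LabelEncoder()
y_enc = encoder.fit_transform(labels)

# Count how many samples fall into each encoded class.
counts = np.bincount(y_enc, minlength=len(encoder.classes_))
for name, count in zip(encoder.classes_, counts):
    print(f"{count:3d}  {name}")

# Flag classes below a hypothetical minimum sample count.
min_count = 2
rare = [name for name, count in zip(encoder.classes_, counts) if count < min_count]
print("rare classes:", rare)
```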

## PHASE 2 Interim Summary

At this stage:
- Clean text has been converted into numerical embeddings
- TF-IDF features are ready for modeling
- Target labels are encoded safely
- The dataset is prepared for classical ML and neural models

These steps form a reusable feature engineering pipeline for downstream modeling.


# **Saving the TF-IDF Vectorizer and Label Encoder**


In [14]:
import os
import joblib

save_dir = "/content/drive/MyDrive/PROJECTS/EXTRA PROJECT/NLP_PROJECT_COMPLAINT/final_model"
os.makedirs(save_dir, exist_ok=True)

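# Persist the fitted objects so they can be reloaded later with the same
# vocabulary, IDF weights, and label-to-ID mapping.
joblib.dump(tfidf, os.path.join(save_dir, "tfidf_vectorizer.pkl"))
joblib.dump(label_encoder, os.path.join(save_dir, "label_encoder.pkl"))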

['/content/drive/MyDrive/PROJECTS/EXTRA PROJECT/NLP_PROJECT_COMPLAINT/final_model/label_encoder.pkl']
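
Saved artifacts are only useful if they reload cleanly. The following sketch round-trips a toy vectorizer and encoder through `joblib` in a temporary directory, then transforms unseen text with the reloaded vectorizer. The toy texts and the temporary paths are assumptions; the real files live under `save_dir`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

texts = ["late fee on credit card", "mortgage payment was misapplied"]
labels = ["Credit card", "Mortgage"]

vectorizer = TfidfVectorizer().fit(texts)
encoder = LabelEncoder().fit(labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    enc_path = os.path.join(tmp_dir, "label_encoder.pkl")
    joblib.dump(vectorizer, vec_path)
    joblib.dump(encoder, enc_path)

    # Reload as an inference script would.
    loaded_vectorizer = joblib.load(vec_path)
    loaded_encoder = joblib.load(enc_path)

# Use transform (not fit_transform) so the saved vocabulary is reused.
new_features = loaded_vectorizer.transform(["unknown fee on my credit card"])
print(new_features.shape)
print(list(loaded_encoder.classes_))
```

Pickled scikit-learn objects are best reloaded with the same scikit-learn version that saved them, since compatibility across versions is not guaranteed.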