In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

from google.colab import auth
auth.authenticate_user()

Mounted at /content/drive


# NLP Complaint Intelligence System
## PHASE 3: Semantic Embeddings using Sentence Transformers (API-Free)

---

## Step 1: Why Sentence Transformers Are Used

### Why this step is required

With roughly 400,000 complaints to embed, relying on paid or
quota-limited APIs for offline training is costly and fragile.

Sentence Transformers provide:
- Strong semantic embeddings
- No API keys
- No quota limits
- Full control over data, since encoding runs inside the notebook runtime
- Wide adoption in open-source NLP tooling

This makes them a good fit for large datasets and batch pipelines.


## Step 2: Load Cleaned Dataset from Google Drive

### Why this step is required

Embeddings are generated from the cleaned dataset (`complaints_clean.csv`).
Starting from clean text keeps noise out of the vectors and out of the models trained on them.

In [2]:
import pandas as pd

clean_path = "/content/drive/MyDrive/PROJECTS/EXTRA PROJECT/NLP_PROJECT_COMPLAINT/data/complaints_clean.csv"
df = pd.read_csv(clean_path)

df.shape

(399811, 2)

In [3]:
df = df[["text", "label"]].dropna()
df.head()

|   | text | label |
|---|---|---|
| 0 | i am writing to formally submit a complaint un... | Credit reporting or other personal consumer re... |
| 1 | ucs section a states i have the right to priva... | Credit reporting or other personal consumer re... |
| 2 | i am writing to have the following information... | Credit reporting or other personal consumer re... |
| 3 | transunion still has a bankruptcy on my accoun... | Credit reporting or other personal consumer re... |
| 4 | this account was opened over years ago this da... | Credit reporting, credit repair services, or o... |


## Step 3: Prepare Text Data for Embedding

### Why this step is required

The model's `encode` method takes a list of strings, so the text column is converted to a plain Python list.
Labels are kept in a separate list, in the same order, for supervised learning.


In [4]:
texts = df["text"].tolist()
labels = df["label"].tolist()
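Before encoding, it can help to confirm that every entry is a non-empty string and that the two lists stay aligned. The sketch below is one possible check on toy data; `validate_pairs` is a hypothetical helper and is not part of this notebook.

```python
def validate_pairs(texts, labels):
    """Drop blank or non-string texts while keeping text/label pairs aligned."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    kept = [
        (t.strip(), y)
        for t, y in zip(texts, labels)
        if isinstance(t, str) and t.strip()
    ]
    clean_texts = [t for t, _ in kept]
    clean_labels = [y for _, y in kept]
    return clean_texts, clean_labels


# Toy data standing in for df["text"] and df["label"] (hypothetical values)
toy_texts = ["card was charged twice", "   ", "loan payment not applied"]
toy_labels = ["Credit card", "Mortgage", "Student loan"]

clean_texts, clean_labels = validate_pairs(toy_texts, toy_labels)
print(len(clean_texts), len(clean_labels))
```

Filtering pairs together, rather than filtering each list separately, is what keeps every label attached to its own text.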

## Step 4: Install Sentence Transformers Library

### Why this step is required

The `sentence-transformers` package may not be available in every Colab runtime.
Installing it once at the start of each runtime ensures the import below succeeds.


In [5]:
!pip install -q sentence-transformers

## Step 5: Load Pretrained Sentence Transformer Model

### Why this step is required

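The `all-MiniLM-L6-v2` model is:
- Lightweight (about 91 MB of weights)
- Fast enough to encode large datasets in batches
- Widely used as a general-purpose sentence embedding model
- A reasonable fit for short, informal text such as complaints and support messages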


In [6]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Step 6: Generate Semantic Embeddings

### Why this step is required

- Most machine learning models need numeric input and cannot consume raw text directly.
- Embeddings convert each complaint into a dense numerical
  vector that captures semantic meaning, so similar complaints map to nearby vectors.
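One way to read "captures semantic meaning" is that related texts should produce vectors pointing in similar directions, which cosine similarity measures. The sketch below uses made-up 3-dimensional vectors purely for illustration; real sentence embeddings have hundreds of dimensions, and `cosine_similarity` here is a hand-written helper, not a library call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any model)
billing_a = [0.9, 0.1, 0.0]
billing_b = [0.8, 0.2, 0.1]
mortgage = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(billing_a, billing_b)
sim_unrelated = cosine_similarity(billing_a, mortgage)
print(sim_related > sim_unrelated)
```

The two billing-style vectors score close to 1, while the unrelated pair scores close to 0.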

In [7]:
X_embeddings = model.encode(
    texts,
    batch_size=64,
    show_progress_bar=True
)

X_embeddings.shape

(399811, 384)
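The shape above also allows a quick back-of-the-envelope estimate of how many batches the encoder processes and how much memory the matrix occupies. The sketch assumes the embeddings are stored as 32-bit floats (4 bytes per value), which is an assumption about the output dtype rather than something shown in this notebook.

```python
import math

n_texts = 399_811      # rows, from the shape above
batch_size = 64        # value passed to model.encode
dim = 384              # embedding width, from the shape above
bytes_per_value = 4    # assumption: float32 output

n_batches = math.ceil(n_texts / batch_size)
size_mib = n_texts * dim * bytes_per_value / 1024**2

print(n_batches)           # 6248
print(round(size_mib, 1))  # 585.7
```

At a little under 600 MiB, the matrix fits in memory on a standard Colab runtime, which is why it can be held and saved as a single array.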

## Step 7: Verify Embedding and Label Alignment

### Why this step is required

Before training models, we must ensure:
- Each text has one embedding
- Labels align with embeddings

In [8]:
len(X_embeddings), len(labels)

(399811, 399811)
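The length comparison can be turned into a check that fails loudly instead of relying on visual inspection. The following is a minimal sketch on toy data; `check_alignment` is a hypothetical helper, not part of this notebook.

```python
def check_alignment(embeddings, labels, expected_dim):
    """Raise ValueError if counts disagree or any row has the wrong width."""
    if len(embeddings) != len(labels):
        raise ValueError(f"{len(embeddings)} embeddings vs {len(labels)} labels")
    for i, row in enumerate(embeddings):
        if len(row) != expected_dim:
            raise ValueError(f"row {i} has width {len(row)}, expected {expected_dim}")
    return True


# Toy stand-ins for X_embeddings and labels (hypothetical values)
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_labels = ["Mortgage", "Credit card"]

ok = check_alignment(toy_embeddings, toy_labels, expected_dim=3)
print(ok)
```

For a NumPy array, comparing `X_embeddings.shape` against `(len(labels), 384)` achieves the same result in one step.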

## Step 8: Encode Labels Numerically

### Why this step is required

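Many supervised ML models expect numeric class labels.
`LabelEncoder` maps each category to an integer id and stores the mapping, so predictions can later be converted back to category names.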

In [9]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(labels)

len(label_encoder.classes_)

14
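To make the mapping concrete, the sketch below reproduces the idea in plain Python: the unique classes are sorted, each receives an integer id, and the ids can be mapped back to names. It is an illustration on hypothetical labels, not scikit-learn's implementation.

```python
# Hypothetical labels standing in for the complaint categories
toy_labels = ["Mortgage", "Credit card", "Mortgage", "Student loan"]

# Sorted unique classes, analogous to label_encoder.classes_
classes = sorted(set(toy_labels))
class_to_id = {name: i for i, name in enumerate(classes)}

# Forward mapping (like fit_transform) and reverse mapping (like inverse_transform)
encoded = [class_to_id[name] for name in toy_labels]
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

Because the reverse mapping recovers the original labels exactly, a model trained on the integer ids can still report human-readable categories.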

## Step 9: Create Final Model Artifact Directory

### Why this step is required

Production pipelines save all reusable assets:
- Embeddings
- Encoders
- Models

This enables reuse without recomputation.


In [10]:
import os

save_dir = "/content/drive/MyDrive/PROJECTS/EXTRA PROJECT/NLP_PROJECT_COMPLAINT/final_model"
os.makedirs(save_dir, exist_ok=True)

## Step 10: Save Sentence Transformer Embeddings

### Why this step is required

Embedding generation is time-consuming.
Saving them allows fast experimentation and reproducibility.


In [11]:
import joblib

joblib.dump(X_embeddings, f"{save_dir}/sentence_embeddings.pkl")

['/content/drive/MyDrive/PROJECTS/EXTRA PROJECT/NLP_PROJECT_COMPLAINT/final_model/sentence_embeddings.pkl']
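In a later session, `joblib.load` on the same path restores the array without re-encoding. As a possible alternative for plain NumPy arrays, the `.npy` format via `np.save` and `np.load` also works; the sketch below demonstrates a round trip on a small random array in a temporary directory. The array, the file name, and the float32 dtype are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Small random array standing in for X_embeddings (float32 is an assumption)
rng = np.random.default_rng(0)
toy_embeddings = rng.random((5, 384), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentence_embeddings.npy")
    np.save(path, toy_embeddings)  # write the array in NumPy's .npy format
    reloaded = np.load(path)       # read it fully back into memory

same = np.array_equal(toy_embeddings, reloaded)
print(reloaded.shape, reloaded.dtype, same)
```

Checking shape, dtype, and equality after reloading is a cheap way to confirm the saved artifact is usable before building models on it.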

## Step 11: Save Label Encoder

### Why this step is required

Label encoders are required at inference time
to map model outputs back to original categories.


In [12]:
joblib.dump(label_encoder, f"{save_dir}/label_encoder.pkl")

['/content/drive/MyDrive/PROJECTS/EXTRA PROJECT/NLP_PROJECT_COMPLAINT/final_model/label_encoder.pkl']

## PHASE 3 Completion Summary

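At the end of this phase:
- 399,811 complaint texts have been converted into 384-dimensional semantic embeddings
- No external APIs or quotas were used
- Embeddings were generated in batches inside the notebook runtime, so the approach scales with available compute rather than API limits
- The embeddings and the label encoder are saved for downstream modeling

This follows a common pattern in production NLP pipelines: compute embeddings once, persist them, and reuse them across experiments.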
