🧠 Task 2: Text Chunking, Embeddings, and Vector Store Construction
Objective

The goal of this task is to prepare the cleaned CFPB consumer complaint narratives for Retrieval-Augmented Generation (RAG).
This involves:

Chunking long complaint texts into manageable segments

Generating dense vector embeddings

Storing embeddings in a FAISS vector database for semantic search

1. Imports and Environment Setup

In [13]:
import pandas as pd
import numpy as np
from tqdm import tqdm

from sentence_transformers import SentenceTransformer
import faiss




2. Load Preprocessed Complaint Data

We load the cleaned and filtered dataset produced in Task 1 and remove any remaining missing narratives.

In [14]:
df = pd.read_csv("../data/processed/filtered_complaints.csv")

df = df.dropna(subset=["clean_narrative"]).reset_index(drop=True)

df.shape


(80667, 20)

In [15]:
from sklearn.model_selection import train_test_split

# Optional: limit dataset size with stratified sampling
sample_size = 15000  # or 10000
df_sampled, _ = train_test_split(
    df,
    train_size=sample_size,
    stratify=df["Product"],  # preserves each product's share of complaints in the sample
    random_state=42
)
df = df_sampled.reset_index(drop=True)
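
Stratifying on Product should leave each product's share of the data roughly unchanged after sampling. The sketch below is one way to check that, using only the standard library. The helper names `class_shares` and `max_share_drift` and the toy label lists are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_share_drift(before, after):
    """Largest absolute change in any label's share between two label lists."""
    shares_before = class_shares(before)
    shares_after = class_shares(after)
    labels = set(shares_before) | set(shares_after)
    return max(
        abs(shares_before.get(label, 0.0) - shares_after.get(label, 0.0))
        for label in labels
    )


# Toy labels standing in for the Product column (hypothetical values).
full = ["Credit card"] * 60 + ["Personal loan"] * 30 + ["Money transfers"] * 10
sample = ["Credit card"] * 6 + ["Personal loan"] * 3 + ["Money transfers"] * 1

drift = max_share_drift(full, sample)
```

In the notebook, the same check could be run on the Product column before and after sampling, provided a reference to the unsampled DataFrame is kept. A drift close to zero indicates that the sample mirrors the original product mix.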


3. Text Chunking Strategy
Why Chunking Is Necessary

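Complaint narratives can be very long.

Embedding models accept a limited input length and truncate anything beyond it. The model used below, all-MiniLM-L6-v2, truncates input after 256 word pieces by default, so chunk sizes should be chosen with that limit in mind.

Smaller chunks let retrieval return the specific passage that matches a query instead of an entire narrative.

We use overlapping word-based chunks so that text near a chunk boundary keeps its surrounding context.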

Chunking Function

In [16]:
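def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into overlapping chunks of at most chunk_size words."""
    # A step of zero or less would make range() fail or never advance
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    # Sizes are counted in whitespace-separated words, not model tokens
    words = text.split()
    step = chunk_size - overlap  # each chunk starts `step` words after the previous one
    chunks = []

    for i in range(0, len(words), step):
        chunk = words[i:i + chunk_size]
        chunks.append(" ".join(chunk))

    return chunks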
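
One property of the loop in `chunk_text` is worth noting: when a narrative ends inside the overlap region, the final chunk contains only words that the previous chunk already covers. For example, a 300-word text yields a second chunk made of words 250 to 299. The variant below is a sketch of one way to avoid that, not the notebook's method. The name `chunk_text_no_redundant_tail` is hypothetical, and switching to it would change the chunk counts reported in this notebook.

```python
def chunk_text_no_redundant_tail(text, chunk_size=300, overlap=50):
    """Overlapping word chunks that stop once the end of the text is covered."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []

    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # This chunk already reaches the last word, so a further chunk
        # would only repeat words from the overlap region.
        if start + chunk_size >= len(words):
            break

    return chunks


# A synthetic 300-word text: the original loop would produce 2 chunks here.
sample_text = " ".join(f"w{i}" for i in range(300))
sample_chunks = chunk_text_no_redundant_tail(sample_text)
```

With the default settings, texts of up to 300 words produce a single chunk, and longer texts still overlap by 50 words between consecutive chunks.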


Apply Chunking to All Complaints

In [17]:
all_chunks = []
metadata = []

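for idx, row in df.iterrows():
    chunks = chunk_text(row["clean_narrative"])

    # Append one metadata record per chunk so that all_chunks[i] and
    # metadata[i] always describe the same vector
    for chunk in chunks:
        all_chunks.append(chunk)
        metadata.append({
            "Complaint ID": row["Complaint ID"],
            "Product": row["Product"]
        })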

len(all_chunks)


20731

Load Embedding Model

In [18]:
model = SentenceTransformer("all-MiniLM-L6-v2")


Generate Embeddings

In [19]:
embeddings = model.encode(
    all_chunks,
    show_progress_bar=True,
    convert_to_numpy=True
)

embeddings.shape


Batches: 100%|██████████| 648/648 [08:18<00:00,  1.30it/s]


(20731, 384)

Create FAISS Vector Store

In [20]:
dimension = embeddings.shape[1]

index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

index.ntotal


20731
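
`IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and reports squared Euclidean distances, smallest first. The plain-Python sketch below illustrates that computation on toy 2-dimensional vectors. The function `brute_force_l2_search` and the toy data are illustrative assumptions and are far slower than FAISS.

```python
def brute_force_l2_search(vectors, query, k):
    """Return (squared_distances, positions) of the k stored vectors nearest to query."""
    scored = []
    for position, vector in enumerate(vectors):
        # Squared L2 distance: sum of squared coordinate differences, no square root
        squared_distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((squared_distance, position))

    scored.sort()  # ascending by distance, ties broken by position
    top = scored[:k]
    return [d for d, _ in top], [p for _, p in top]


# Toy "index" of four 2-dimensional vectors (hypothetical values).
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_positions = brute_force_l2_search(stored, [1.0, 1.0], k=2)
```

Because the distances are squared, they are suitable for ranking results but should not be read as Euclidean distances without taking the square root.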

Save Vector Store & Metadata

In [21]:
import os

# Paths are relative to the notebook's directory, matching the "../data"
# location used when loading the dataset
vector_dir = "../data/vector_store"
os.makedirs(vector_dir, exist_ok=True)

# Save FAISS index
faiss.write_index(index, os.path.join(vector_dir, "complaints_faiss.index"))

# Save metadata (row i describes vector i in the index)
pd.DataFrame(metadata).to_csv(
    os.path.join(vector_dir, "complaints_metadata.csv"),
    index=False
)


In [22]:
# Write a second copy of the index and metadata directly under ../data
faiss.write_index(index, "../data/complaints_faiss.index")

pd.DataFrame(metadata).to_csv(
    "../data/complaints_metadata.csv",
    index=False
)


Sanity Test Retrieval

In [23]:
query = "credit card charged fees I did not authorize"

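# Embed the query with the same model that produced the chunk embeddings
query_embedding = model.encode([query])
# Retrieve the 5 nearest chunks; each result array has one row per query
distances, indices = index.search(query_embedding, k=5)

# Print the first 200 characters of each retrieved chunk
for i in indices[0]:
    print(all_chunks[i][:200], "\n")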


my card was charged for a total of xxxx dollars which i did n t authorize american express didnt srefund my money 

the bank opened an annual fee credit card without my permission 

i signed up for a xxxx xxxxr card with first progress xx xx year the card has a 29 00 annual fee after activation i never activated the card i never received any statements i recently obtained a copy  

a credit card from barclaysxxxx xxxx account was opened in my name without my knowledge or consent i did not apply for this credit card and i did not authorize any charges on it i discovered the accou 

the credit card added intrest to my credit card bill for no reason and my bill was past due all because of that and they wont take the late fee and interest off my bill 
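
The sanity test prints only chunk text. Because `metadata` was built in the same order as `all_chunks`, each returned position can also be mapped back to its complaint. The sketch below shows one way to do that. The helper `format_hits` and the toy inputs are illustrative assumptions. It skips positions of -1, which FAISS uses to pad results when fewer than k neighbours are available.

```python
def format_hits(distances, positions, chunks, metadata, preview_chars=200):
    """Pair each retrieved position with its metadata, distance, and a text preview."""
    hits = []
    for distance, position in zip(distances, positions):
        if position < 0:  # padding entry, no neighbour found for this slot
            continue
        record = metadata[position]
        hits.append({
            "complaint_id": record["Complaint ID"],
            "product": record["Product"],
            "distance": float(distance),
            "preview": chunks[position][:preview_chars],
        })
    return hits


# Toy stand-ins for all_chunks and metadata (hypothetical values).
toy_chunks = ["card charged without authorization", "loan payment posted late"]
toy_metadata = [
    {"Complaint ID": 101, "Product": "Credit card"},
    {"Complaint ID": 102, "Product": "Personal loan"},
]
hits = format_hits([0.4, 1.2, 9.9], [0, 1, -1], toy_chunks, toy_metadata)
```

In the notebook this would be called as `format_hits(distances[0], indices[0], all_chunks, metadata)`, which makes it easy to confirm that the retrieved chunks come from the expected product categories.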

