🧠 Task 2: Text Chunking, Embeddings, and Vector Store Construction
Objective

The goal of this task is to prepare the cleaned CFPB consumer complaint narratives for Retrieval-Augmented Generation (RAG).
This involves:

Chunking long complaint texts into manageable segments

Generating dense vector embeddings

Storing embeddings in a FAISS vector database for semantic search

1. Imports and Environment Setup

In [3]:
import pandas as pd
import numpy as np
from tqdm import tqdm

from sentence_transformers import SentenceTransformer
import faiss




2. Load Preprocessed Complaint Data

We load the cleaned and filtered dataset produced in Task 1 and remove any remaining missing narratives.

In [4]:
df = pd.read_csv("../data/processed/filtered_complaints.csv")

df = df.dropna(subset=["clean_narrative"]).reset_index(drop=True)

df.shape


(80667, 20)

3. Text Chunking Strategy
Why Chunking Is Necessary

Complaint narratives can be very long

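Embedding models accept a limited input length; all-MiniLM-L6-v2, used below, truncates input beyond 256 word pieces by default

Smaller chunks let retrieval return the specific passage that matches a query instead of an entire narrative

We use overlapping word-based chunks (300 words with a 50-word overlap by default) so that text near a chunk boundary keeps some of its surrounding context. Because each word maps to at least one word piece, a full 300-word chunk exceeds the 256-word-piece limit, and the words past the limit do not contribute to its embedding.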

Chunking Function

In [5]:
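def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # each window starts `step` words after the previous one
    chunks = []

    for i in range(0, len(words), step):
        chunk = words[i:i + chunk_size]
        chunks.append(" ".join(chunk))

    return chunks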
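
To see what the overlap does at the boundaries, here is a small sketch that runs the same windowing logic with toy values (chunk_size=4 and overlap=1, chosen only for illustration) on ten placeholder words. The names chunk_text_demo and chunk_text_no_tail are hypothetical and not part of this notebook. The second function is a possible variant that stops once a window reaches the end of the text, so the last chunk is never just a fragment already covered by the previous one.

```python
def chunk_text_demo(text, chunk_size=300, overlap=50):
    # Same windowing logic as the notebook's function, repeated so this runs alone.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def chunk_text_no_tail(text, chunk_size=300, overlap=50):
    # Hypothetical variant: stop as soon as a window covers the end of the text.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{n}" for n in range(1, 11))  # "w1 w2 ... w10"
with_tail = chunk_text_demo(sample, chunk_size=4, overlap=1)
without_tail = chunk_text_no_tail(sample, chunk_size=4, overlap=1)

for c in with_tail:
    print(c)
print(len(with_tail), len(without_tail))
```

With the notebook's loop, the final chunk here is the single word w10, which already appears at the end of the third chunk. Switching to the variant would change the chunk counts reported later in this notebook, so treat it as an option to evaluate rather than a drop-in replacement.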


Apply Chunking to All Complaints

In [6]:
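all_chunks = []
metadata = []  # metadata[i] describes all_chunks[i]

for idx, row in df.iterrows():
    chunks = chunk_text(row["clean_narrative"])

    # Record the source complaint for every chunk so search hits can be traced back
    for chunk in chunks:
        all_chunks.append(chunk)
        metadata.append({
            "Complaint ID": row["Complaint ID"],
            "Product": row["Product"]
        })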

len(all_chunks)


111561
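
Because all_chunks and metadata are built in lockstep, position i in one list describes position i in the other, and later the same position identifies the vector in the index. The sketch below illustrates that alignment on two made-up complaints (the IDs, products, and texts are placeholders, not CFPB data). It also adds a hypothetical chunk_index field that the notebook does not store, and it uses toy chunk sizes so the output stays short.

```python
def chunk_text_demo(text, chunk_size=300, overlap=50):
    # Same windowing logic as the notebook's function, repeated so this runs alone.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


# Placeholder rows standing in for df; all values are invented for illustration.
rows = [
    {"Complaint ID": 101, "Product": "Credit card", "clean_narrative": "a b c d e f g"},
    {"Complaint ID": 102, "Product": "Savings account", "clean_narrative": "h i j"},
]

demo_chunks, demo_metadata = [], []
for row in rows:
    pieces = chunk_text_demo(row["clean_narrative"], chunk_size=4, overlap=1)
    for n, piece in enumerate(pieces):
        demo_chunks.append(piece)
        demo_metadata.append({
            "Complaint ID": row["Complaint ID"],
            "Product": row["Product"],
            "chunk_index": n,  # hypothetical extra field: position within the complaint
        })

# Each list position ties a chunk to the complaint it came from.
for position, (piece, meta) in enumerate(zip(demo_chunks, demo_metadata)):
    print(position, meta["Complaint ID"], meta["chunk_index"], repr(piece))
```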

Load the Embedding Model

In [7]:
model = SentenceTransformer("all-MiniLM-L6-v2")




Generate Embeddings

In [8]:
embeddings = model.encode(
    all_chunks,
    show_progress_bar=True,
    convert_to_numpy=True
)

embeddings.shape


Batches: 100%|██████████| 3487/3487 [59:35<00:00,  1.03s/it]


(111561, 384)
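
The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes the encode call used the library's default batch size of 32 (the notebook does not set batch_size) and that the embeddings are stored as 32-bit floats. Under those assumptions it reproduces the batch count and per-batch time shown in the progress bar and estimates the in-memory size of the embedding matrix.

```python
import math

n_chunks, dim = 111561, 384   # from embeddings.shape
batch_size = 32               # assumed: default batch size of encode()
bytes_per_float = 4           # assumed: float32 embeddings

n_batches = math.ceil(n_chunks / batch_size)
size_mib = n_chunks * dim * bytes_per_float / 1024**2
elapsed_s = 59 * 60 + 35      # 59:35 from the progress bar
seconds_per_batch = elapsed_s / n_batches

print(n_batches)
print(round(size_mib, 1))
print(round(seconds_per_batch, 2))
```

This gives 3487 batches, roughly 163.4 MiB of embeddings, and about 1.03 seconds per batch, which agrees with the logged run.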

Create FAISS Vector Store

In [10]:
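dimension = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2

# IndexFlatL2 does exact (brute-force) search using squared L2 distance
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)  # FAISS expects float32 vectors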

index.ntotal


111561
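
IndexFlatL2 performs an exact search: it compares the query against every stored vector and ranks them by squared Euclidean distance. The pure-Python sketch below mimics that behavior on three toy 2-D vectors so the ranking is easy to follow. The function name flat_l2_search and the data are illustrative only, and FAISS itself does this work in optimized native code.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors nearest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))  # squared L2 distance
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy unit-length vectors (invented values, not real embeddings).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.0, 1.0]

toy_distances, toy_indices = flat_l2_search(toy_vectors, toy_query, k=2)
print(toy_indices, toy_distances)
```

All three toy vectors have unit length. In that case the squared L2 distance equals 2 - 2*cos(theta), so ranking by L2 distance gives the same order as ranking by cosine similarity. Whether that holds for the notebook's index depends on the embeddings being unit-normalized.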

Save Vector Store & Metadata

In [13]:
faiss.write_index(index, "../data/complaints_faiss.index")

pd.DataFrame(metadata).to_csv(
    "../data/complaints_metadata.csv",
    index=False
)
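
Row i of the metadata CSV corresponds to vector i in the saved index, so the two files must be kept together and in the same order. The metadata written above holds the complaint ID and product but not the chunk text itself. One possible extension, sketched here with the standard csv module and an in-memory buffer instead of a real file, is to store the text in an extra column so that a search hit can be displayed after reloading. The chunk_text column name and the sample records are assumptions for illustration.

```python
import csv
import io

# Placeholder records; in the notebook these would come from all_chunks and metadata.
saved_chunks = ["first chunk text", "second chunk text"]
saved_metadata = [
    {"Complaint ID": 101, "Product": "Credit card"},
    {"Complaint ID": 102, "Product": "Savings account"},
]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["Complaint ID", "Product", "chunk_text"])
writer.writeheader()
for piece, meta in zip(saved_chunks, saved_metadata):
    writer.writerow({**meta, "chunk_text": piece})  # hypothetical extra column

# Reload and look up a row by position, as one would with an index returned by a search.
buffer.seek(0)
reloaded = list(csv.DictReader(buffer))
hit = 1
print(reloaded[hit]["Complaint ID"], reloaded[hit]["chunk_text"])
```

Note that csv.DictReader returns every field as a string, so the complaint ID comes back as "102" rather than the integer 102.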


Sanity Test Retrieval

In [14]:
query = "credit card charged fees I did not authorize"

query_embedding = model.encode([query])
distances, indices = index.search(query_embedding, k=5)

for i in indices[0]:
    print(all_chunks[i][:200], "\n")


credit card was charged with something i did not buy 

a inappropriate fee was charged to my account along with an associated inappropriate interest charge the credit card company has not responded to or acknowledged my dispute of these charges 

my card was charged for a total of xxxx dollars which i did n t authorize american express didnt srefund my money 

i was not notified that the annual fee for the credit card was coming up i was also not able to login into my credit card account so i was unable to make a payment or even see that the annual fee was  

my card was charge unauthorize but credit card company is not willing to correct this charge 
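
index.search returns two arrays with one row per query: the distances and the positions of the matching vectors. A minimal sketch of pairing each hit with its distance and metadata follows. The lists are hard-coded placeholders shaped like the real return values, and the helper name format_hits is hypothetical. FAISS fills in -1 as the position when fewer than k results are available, so the sketch skips those entries.

```python
def format_hits(distances, indices, chunks, metadata, preview=60):
    """Build one readable line per valid hit for a single query."""
    lines = []
    for dist, idx in zip(distances, indices):
        if idx < 0:  # -1 marks a missing result
            continue
        meta = metadata[idx]
        lines.append(f"{dist:.3f} | {meta['Product']} | {chunks[idx][:preview]}")
    return lines


# Placeholder data (invented, not taken from the complaint dataset).
hit_chunks = ["card charged twice", "late fee dispute", "loan payoff delay"]
hit_metadata = [
    {"Product": "Credit card"},
    {"Product": "Credit card"},
    {"Product": "Personal loan"},
]
hit_distances = [[0.412, 0.655, float("inf")]]  # last value is a stand-in for a missing hit
hit_indices = [[0, 1, -1]]

hit_lines = format_hits(hit_distances[0], hit_indices[0], hit_chunks, hit_metadata)
for line in hit_lines:
    print(line)
```

Printing the distance next to each chunk makes it easier to judge how close the lower-ranked results are to the query, which the plain text listing above does not show.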

