# Task 1 – Data Preprocessing

This notebook applies filtering and text-cleaning steps to the CFPB
consumer complaint dataset to prepare it for downstream
retrieval-augmented generation (RAG) tasks.

The core text-cleaning logic is implemented in `src/preprocessing.py`
to enable reuse across notebooks and scripts.


In [1]:
import pandas as pd
import sys
from pathlib import Path

# Ensure src/ is importable
sys.path.append(str(Path("..").resolve()))

from src.preprocessing import clean_text


## Load Raw Complaint Data


In [3]:
df = pd.read_csv("../data/raw/complaints.csv")
df.head()




|  | Date received | Product | Sub-product | Issue | Sub-issue | Consumer complaint narrative | Company public response | Company | State | ZIP code | Tags | Consumer consent provided? | Submitted via | Date sent to company | Company response to consumer | Timely response? | Consumer disputed? | Complaint ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-06-20 | Credit reporting or other personal consumer re... | Credit reporting | Incorrect information on your report | Information belongs to someone else |  |  | Experian Information Solutions Inc. | FL | 32092 |  |  | Web | 2025-06-20 | In progress | Yes |  | 14195687 |
| 1 | 2025-06-20 | Debt collection | Telecommunications debt | Attempts to collect debt not owed | Debt is not yours |  | Company can't verify or dispute the facts in t... | Eastern Account Systems of Connecticut, Inc. | FL | 342XX |  |  | Web | 2025-06-20 | Closed with explanation | Yes |  | 14195688 |
| 2 | 2025-06-20 | Credit reporting or other personal consumer re... | Credit reporting | Improper use of your report | Reporting company used your report improperly |  |  | TRANSUNION INTERMEDIATE HOLDINGS, INC. | AZ | 85225 |  |  | Web | 2025-06-20 | In progress | Yes |  | 14195689 |
| 3 | 2025-06-20 | Credit reporting or other personal consumer re... | Credit reporting | Improper use of your report | Reporting company used your report improperly |  |  | Experian Information Solutions Inc. | AZ | 85225 |  |  | Web | 2025-06-20 | In progress | Yes |  | 14195690 |
| 4 | 2025-06-20 | Credit reporting or other personal consumer re... | Credit reporting | Incorrect information on your report | Account status incorrect |  |  | Experian Information Solutions Inc. | IL | 60628 |  |  | Web | 2025-06-20 | In progress | Yes |  | 14195692 |
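
On large CSV files pandas may infer column types chunk by chunk and warn about columns that mix numeric-looking and text values. The `ZIP code` column above holds both `32092` and the redacted `342XX`, so it is a likely candidate. A minimal sketch of pinning that column to strings at load time, using an in-memory toy CSV (the rows and the `02139` value are illustrative assumptions, not taken from the dataset):

```python
import io

import pandas as pd

# Toy CSV mirroring two columns of the complaint file; values are illustrative
raw = io.StringIO(
    "Complaint ID,ZIP code\n"
    "1,32092\n"
    "2,342XX\n"
    "3,02139\n"
)

# Reading ZIP code as str keeps redacted codes and leading zeros as text
toy = pd.read_csv(raw, dtype={"ZIP code": str})
zip_codes = toy["ZIP code"].tolist()
print(zip_codes)
```

The same `dtype` argument could be passed to the `pd.read_csv` call on `../data/raw/complaints.csv` if type inference turns out to be a problem there.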


## Filter Relevant Product Categories
We restrict the dataset to the financial products required for this project
and remove complaints without narratives.


In [6]:
TARGET_PRODUCTS = [
    "Credit card",
    "Personal loan",
    "Savings account",
    "Money transfer"
]

df_filtered = df[
    df["Product"].isin(TARGET_PRODUCTS) &
    df["Consumer complaint narrative"].notna()
].copy()

df_filtered.shape


(80667, 18)
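
Because `isin` matches whole strings exactly, a target label that is spelled differently in the data (different casing, or folded into a longer category name) contributes no rows and raises no error. A quick sanity check, sketched here with a hypothetical helper `missing_labels` and toy data, is to list the target labels that never occur in the column:

```python
import pandas as pd


def missing_labels(frame, column, targets):
    """Return the target labels that never appear in frame[column]."""
    present = set(frame[column].dropna().unique())
    return [label for label in targets if label not in present]


# Toy frame; the product names are illustrative, not a claim about the real data
toy = pd.DataFrame(
    {"Product": ["Credit card", "Debt collection", "Credit card or prepaid card", None]}
)

absent = missing_labels(toy, "Product", ["Credit card", "Personal loan"])
print(absent)
```

Running the same check as `missing_labels(df, "Product", TARGET_PRODUCTS)` would show whether every entry in `TARGET_PRODUCTS` actually matches a category in the raw file.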

## Clean Complaint Narratives
Normalizing the narratives before embedding reduces surface noise such as
inconsistent casing and punctuation.
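
`clean_text` itself lives in `src/preprocessing.py` and is not reproduced in this notebook. The following is only a sketch of what such a function might look like, assuming it lowercases the text, replaces every non-alphanumeric character with a space, and collapses runs of whitespace; that assumption matches the sample output in this section but has not been checked against the real implementation. The name `clean_text_sketch` is hypothetical.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Illustrative normalizer; NOT the project's actual clean_text."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse repeated whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = clean_text_sketch("Dear CFPB, I have a secured credit card.")
print(cleaned)
```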


In [9]:
df_filtered["clean_narrative"] = (
    df_filtered["Consumer complaint narrative"]
    .astype(str)
    .apply(clean_text)
)

df_filtered[["Consumer complaint narrative", "clean_narrative"]].head()


|  | Consumer complaint narrative | clean_narrative |
|---|---|---|
| 12237 | A XXXX XXXX card was opened under my name by a... | a xxxx xxxx card was opened under my name by a... |
| 13280 | Dear CFPB, I have a secured credit card with c... | dear cfpb i have a secured credit card with ci... |
| 13506 | I have a Citi rewards cards. The credit balanc... | i have a citi rewards cards the credit balance... |
| 13955 | b'I am writing to dispute the following charge... | b i am writing to dispute the following charge... |
| 14249 | Although the account had been deemed closed, I... | although the account had been deemed closed i ... |
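
Row 13955 shows a narrative that starts with `b'`, which looks like the text form of a Python bytes literal; after cleaning, a stray `b` token remains at the start of the narrative. If that is unwanted, one option is to strip the wrapper before calling `clean_text`. The helper below, `strip_bytes_wrapper`, is a hypothetical sketch and assumes the wrapper is always a leading `b'` or `b"` with an optional matching closing quote:

```python
import re

# Matches a leading b' or b" and captures which quote character was used
_BYTES_PREFIX = re.compile(r"""^b(['"])""")


def strip_bytes_wrapper(text: str) -> str:
    """Remove a leading b'...' or b"..." wrapper if one is present."""
    match = _BYTES_PREFIX.match(text)
    if match is None:
        return text
    quote = match.group(1)
    body = text[2:]
    # Drop the closing quote only when it matches the opening one
    if body.endswith(quote):
        body = body[:-1]
    return body


unwrapped = strip_bytes_wrapper("b'I am writing to dispute a charge'")
print(unwrapped)
```

Text that merely begins with the letter "b" is left untouched, since the pattern requires a quote character immediately after it.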


## Save Cleaned Dataset


In [None]:
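output_path = Path("../data/processed/filtered_complaints.csv")
# Create data/processed/ if it does not exist yet; to_csv will not create it
output_path.parent.mkdir(parents=True, exist_ok=True)
df_filtered.to_csv(output_path, index=False)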

print(f"Saved cleaned data to {output_path}")
