# 📊 CFPB Complaints EDA & Preprocessing Notebook

This notebook:
- ✅ Loads the dataset
- ✅ Analyzes the distribution of complaints across products
- ✅ Calculates and visualizes word counts of complaint narratives
- ✅ Counts how many complaints have or lack narratives
- ✅ Filters the dataset for specific products and removes empty narratives
- ✅ Cleans narrative text for embedding quality
- ✅ Saves the final cleaned dataset

All functions come from the modular scripts in the `src` folder, which are imported below.
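To make the narrative checks listed above concrete, here is a minimal sketch of counting complaints with and without narratives and summarizing word counts. The column name `Consumer complaint narrative` and the helper `narrative_stats` are assumptions for illustration; the functions in `src` may work differently.

```python
import pandas as pd

# Assumed column name; adjust it to match the actual CSV header.
NARRATIVE_COL = "Consumer complaint narrative"


def narrative_stats(df: pd.DataFrame, col: str = NARRATIVE_COL) -> dict:
    """Count rows with/without a narrative and summarize word counts."""
    # Treat missing values and whitespace-only strings as "no narrative".
    text = df[col].fillna("").astype(str).str.strip()
    has_text = text.ne("")
    # Word count = number of whitespace-separated tokens.
    word_counts = text[has_text].str.split().str.len()
    return {
        "with_narrative": int(has_text.sum()),
        "without_narrative": int((~has_text).sum()),
        "median_words": float(word_counts.median()) if len(word_counts) else 0.0,
    }


# Tiny hypothetical sample standing in for the real dataset.
sample = pd.DataFrame({
    "Product": ["Credit card", "Mortgage", "Student loan"],
    NARRATIVE_COL: ["I was charged twice for one purchase", None, "   "],
})
stats = narrative_stats(sample)
print(stats)
```

Counting whitespace-only strings as empty keeps blank narratives out of the word-count distribution.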


In [5]:
# Install dependencies first (only needed once per environment)
%pip install pandas matplotlib seaborn scikit-learn

# Import libraries
import os
import re

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Note: you may need to restart the kernel to use updated packages.


In [3]:
import sys
import os

# Add project root directory to sys.path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.insert(0, project_root)


In [7]:

# Load only the 'Product' column from the CSV file
file_path = "../data/raw/complaints.csv"
columns_to_load = ["Product"]

df = pd.read_csv(file_path, usecols=columns_to_load)

# Clean the 'Product' column: drop missing values, strip spaces, convert to lowercase, get unique values
unique_products = (
    df['Product']
    .dropna()
    .str.strip()
    .str.lower()
    .unique()
)

print(unique_products)


['credit reporting or other personal consumer reports' 'debt collection'
 'credit card' 'checking or savings account'
 'money transfer, virtual currency, or money service'
 'vehicle loan or lease' 'debt or credit management' 'mortgage'
 'payday loan, title loan, personal loan, or advance loan' 'prepaid card'
 'student loan' 'credit reporting'
 'credit reporting, credit repair services, or other personal consumer reports'
 'credit card or prepaid card' 'payday loan, title loan, or personal loan'
 'bank account or service' 'money transfers' 'consumer loan' 'payday loan'
 'other financial service' 'virtual currency']
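Several of the labels above overlap (for example `credit card` and `credit card or prepaid card`), so filtering for specific products usually means consolidating raw labels first. The following is a sketch only: the `PRODUCT_MAP` groupings, the `product_group` column, and the `filter_products` helper are hypothetical and are not taken from `src`.

```python
import pandas as pd

# Hypothetical mapping from raw labels (lowercased) to consolidated groups.
PRODUCT_MAP = {
    "credit card": "credit card",
    "credit card or prepaid card": "credit card",
    "payday loan": "personal loan",
    "payday loan, title loan, or personal loan": "personal loan",
    "payday loan, title loan, personal loan, or advance loan": "personal loan",
    "checking or savings account": "savings account",
    "bank account or service": "savings account",
    "money transfers": "money transfer",
    "money transfer, virtual currency, or money service": "money transfer",
}


def filter_products(df: pd.DataFrame, product_col: str = "Product") -> pd.DataFrame:
    """Keep only rows whose product maps to one of the target groups."""
    # Normalize the same way as the unique-value check: strip and lowercase.
    normalized = df[product_col].str.strip().str.lower()
    out = df.assign(product_group=normalized.map(PRODUCT_MAP))
    # Labels missing from the mapping become NaN and are dropped here.
    return out.dropna(subset=["product_group"]).reset_index(drop=True)


sample = pd.DataFrame({"Product": [" Credit card ", "Mortgage", "Payday loan", None]})
filtered = filter_products(sample)
print(filtered)
```

An explicit dictionary keeps the grouping auditable: any label missing from it is dropped rather than silently matched by a substring.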


In [11]:
%pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')


Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\desta\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\desta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True
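As an illustration of what cleaning narrative text for embedding can involve, here is a minimal sketch using only regular expressions. The small inline `STOPWORDS` set is a stand-in so the snippet runs offline; in the notebook, the NLTK stopword list downloaded above could be used instead. The `clean_narrative` helper is hypothetical and may differ from the cleaning done in `src`.

```python
import re

# Small stand-in stopword set so the sketch runs without NLTK data.
STOPWORDS = {"a", "an", "the", "i", "to", "of", "and", "is", "was", "my"}


def clean_narrative(text: str) -> str:
    """Lowercase, drop redaction markers and non-letters, remove stopwords."""
    text = text.lower()
    # CFPB narratives mask personal data with runs of X (e.g. "XXXX"); drop them.
    text = re.sub(r"\bx{2,}\b", " ", text)
    # Replace everything except letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)


example = "I was charged $35.00 on XX/XX/XXXX by my bank!"
cleaned = clean_narrative(example)
print(cleaned)  # charged on by bank
```

Whether to remove stopwords at all depends on the embedding model; many sentence-embedding models expect natural text, so this step is worth treating as optional.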

In [15]:
# Requires the project root on sys.path (added in an earlier cell) so that `src` is importable
from src.preprocessing import run_preprocessing

# Note: The dataset might not contain explicit BNPL product rows,
# but we still recheck and map based on narrative text.
df_clean = run_preprocessing()

# Display first few rows
print(df_clean.head())


🔄 Starting chunked preprocessing...
📦 Processing chunk 1...
📦 Processing chunk 2...
📦 Processing chunk 3...
📦 Processing chunk 4...
📦 Processing chunk 5...
📦 Processing chunk 6...
📦 Processing chunk 7...
📦 Processing chunk 8...
📦 Processing chunk 9...
📦 Processing chunk 10...
📦 Processing chunk 11...
📦 Processing chunk 12...
📦 Processing chunk 13...
📦 Processing chunk 14...
📦 Processing chunk 15...
📦 Processing chunk 16...
📦 Processing chunk 17...
📦 Processing chunk 18...
📦 Processing chunk 19...
📦 Processing chunk 20...
📦 Processing chunk 21...
📦 Processing chunk 22...
📦 Processing chunk 23...
📦 Processing chunk 24...
📦 Processing chunk 25...
📦 Processing chunk 26...
📦 Processing chunk 27...
📦 Processing chunk 28...
📦 Processing chunk 29...
📦 Processing chunk 30...
📦 Processing chunk 31...
📦 Processing chunk 32...
📦 Processing chunk 33...
📦 Processing chunk 34...
📦 Processing chunk 35...
📦 Processing chunk 36...
📦 Processing chunk 37...
📦 Processing chunk 38...
📦 Processing chunk 39..

KeyError: 'narrative_len'

<Figure size 1000x500 with 0 Axes>
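The `KeyError: 'narrative_len'` means some step looked up a column by that name that did not exist, and the empty figure suggests a plot was set up but never drawn. The internals of `run_preprocessing` are not shown here, so the cause is unknown. One possibility is that the column is read before it is created. The sketch below, under that assumption, shows chunked reading that adds `narrative_len` in every chunk. The column name `Consumer complaint narrative` and the `process_in_chunks` helper are hypothetical.

```python
import io

import pandas as pd

NARRATIVE_COL = "Consumer complaint narrative"  # assumed column name


def process_in_chunks(source, chunksize: int = 2) -> pd.DataFrame:
    """Read a CSV in chunks, keep rows with narratives, add narrative_len."""
    parts = []
    reader = pd.read_csv(
        source, usecols=["Product", NARRATIVE_COL], chunksize=chunksize
    )
    for i, chunk in enumerate(reader, start=1):
        print(f"Processing chunk {i}...")
        chunk = chunk.dropna(subset=[NARRATIVE_COL]).copy()
        # Create the column in every chunk, before any later step reads it.
        chunk["narrative_len"] = (
            chunk[NARRATIVE_COL].astype(str).str.split().str.len()
        )
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)


# In-memory CSV standing in for ../data/raw/complaints.csv.
csv_text = (
    "Product,Consumer complaint narrative\n"
    "Credit card,I was charged twice\n"
    "Mortgage,\n"
    "Student loan,My payment was not applied\n"
)
df_demo = process_in_chunks(io.StringIO(csv_text))
print(df_demo)
```

A check such as `assert "narrative_len" in df.columns` just before plotting would surface a missing column with a clearer message than a `KeyError` raised inside the plotting call.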

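✅ **Expected result** (the run shown above stopped with `KeyError: 'narrative_len'`, so these hold only once that error is fixed):
- Dataset filtered to include only target products and non-empty narratives
- Narratives cleaned and ready for embedding
- File saved to `data/filtered_complaints.csv`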
