
https://www.kaggle.com/datasets/sherrytp/consumer-complaints?resource=download
The Consumer Complaint Database is a collection of complaints about consumer financial products and services that we sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database. The database generally updates daily.


=> Goal: use this dataset to classify which product or service a complaint concerns, given the complaint narrative provided by the consumer.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---

# SETUP

The next cells inspect the number of entries, column names, data types, and missing values.

In [None]:
sns.set_style("whitegrid")

df = pd.read_csv('../data/complaints.csv')

# --- Initial Inspection ---
print("--- Dataset Shape ---")
print(df.shape)
print("\n--- Column Info and Missing Values ---")
df.info()

In [None]:
print("\n--- First 5 Rows ---")
pd.set_option('display.max_colwidth', 100) # Widen column display for narratives
print(df.head())
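
`df.info()` reports non-null counts, which are hard to compare across columns at a glance. A minimal sketch of a per-column missing-value summary follows; the helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "Product": ["Credit card", "Mortgage", "Mortgage", "Debt collection"],
    "Consumer complaint narrative": ["text", None, None, None],
})
print(missing_summary(toy))
```

On the real data this would be called as `missing_summary(df)`.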

---

# Target Variable Analysis: 'Product'

The next cell examines the distribution of the target variable `Product`. Knowing how many classes exist and how unevenly they are populated is critical for any classification task.

In [None]:
# Define target and feature columns for clarity
target_column = 'Product'
feature_column = 'Consumer complaint narrative'

# --- Analyze the Target Variable ('Product') ---
print(f"--- Analysis of Target Variable: '{target_column}' ---")

# Get the number of unique product categories
num_classes = df[target_column].nunique()
print(f"Number of unique product categories: {num_classes}")

# Get the frequency of each category
print("\nDistribution of complaints per product:")
print(df[target_column].value_counts())

# --- Visualize the Target Variable Distribution ---
plt.figure(figsize=(12, 8))
sns.countplot(y=df[target_column], order=df[target_column].value_counts().index, palette='viridis')
plt.title('Frequency of Complaints per Product Category', fontsize=16)
plt.xlabel('Number of Complaints', fontsize=12)
plt.ylabel('Product Category', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()

#### Finding 3: Overlapping Product Categories

**Observation:** The `Product` column contains several overlapping and redundant categories. For example, `Credit card` is a sub-category of `Credit card or prepaid card`.

**Risk:** Near-identical complaints can carry different labels, so the model receives contradictory training signal and per-class evaluation metrics become unreliable.

**Decision:** I will consolidate these sub-categories into their more general parent categories. This will create a cleaner, mutually exclusive set of target labels. The specific mapping rules will be implemented as a function in the `src/preprocessing.py` module.
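
A minimal sketch of what such a consolidation function could look like. The names `PRODUCT_MAP` and `consolidate_products` are hypothetical, and the mapping contains only the one rule stated in the observation above; the remaining rules would need to be derived from the actual `value_counts()` output.

```python
import pandas as pd

# Sub-category -> parent category.
# Only this single rule is taken from the observation above;
# any further entries are to be added after inspecting the data.
PRODUCT_MAP = {
    "Credit card": "Credit card or prepaid card",
}


def consolidate_products(products: pd.Series, mapping: dict = PRODUCT_MAP) -> pd.Series:
    """Replace sub-category labels with their parent; other labels are unchanged."""
    # Series.replace with a dict matches whole values, not substrings
    return products.replace(mapping)


# Toy labels (hypothetical sample)
toy = pd.Series(["Credit card", "Credit card or prepaid card", "Mortgage"])
merged = consolidate_products(toy)
print(merged.value_counts())
```

Because the replacement matches whole values, the parent label `Credit card or prepaid card` is left untouched even though it contains the sub-category's name.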

---

# Feature Variable Analysis: 'Consumer complaint narrative'

The next cell examines the characteristics of the feature variable: how many narratives are missing and how long the remaining ones are.

In [None]:
# --- Analyze the Feature Variable ('Consumer complaint narrative') ---
print(f"\n--- Analysis of Feature Variable: '{feature_column}' ---")

# Check for missing values in the narrative column
missing_narratives = df[feature_column].isnull().sum()
print(f"Number of missing narratives: {missing_narratives}")
print(f"Percentage of missing narratives: {missing_narratives / len(df) * 100:.2f}%")

# Decision: For a baseline, we'll remove rows with missing narratives.
df_clean = df.dropna(subset=[feature_column]).copy()

# --- Analyze the length of complaint narratives ---
df_clean['narrative_len'] = df_clean[feature_column].str.len()

print("\n--- Basic Statistics for Narrative Length ---")
print(df_clean['narrative_len'].describe())

# --- Visualize the Narrative Length Distribution ---
plt.figure(figsize=(12, 6))
sns.histplot(df_clean['narrative_len'], bins=50, kde=True)
plt.title('Distribution of Complaint Narrative Lengths', fontsize=16)
plt.xlabel('Length of Narrative (number of characters)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xlim(0, 5000)
plt.show()

## EDA Conclusion and Preprocessing Strategy

### Key Data Findings & Technical Plan

This analysis focuses on the technical characteristics of the data and the immediate actions required for preprocessing.

1.  **Missing Narratives (64%):** A majority of the dataset lacks a consumer complaint narrative. This is a property of the data source: a narrative is published only if the consumer consents.
    * **Action:** For model training, all rows with missing narratives will be dropped.
    * **Impact:** This reduces the training set size to ~1.1 million entries and introduces a consent-based selection bias.

2.  **Right-Skewed Narrative Length:** The complaint texts vary significantly in length, with a long tail of extremely lengthy complaints.
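    * **Action:** No truncation is applied in this phase. Assuming the baseline uses scikit-learn's `TfidfVectorizer` with its default `norm='l2'`, every document vector is scaled to unit length, so long narratives do not dominate through raw term counts alone.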
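
The length-robustness argument for item 2 can be checked on toy text. This is a sketch that assumes the baseline is scikit-learn's `TfidfVectorizer` with default settings (the notebook does not name the library); the two sample documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# One very short and one very long document (hypothetical complaint text)
short_doc = "card charged twice"
long_doc = " ".join(["the bank charged my card twice and refused a refund"] * 50)

# norm="l2" is the scikit-learn default; it is spelled out here for clarity
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform([short_doc, long_doc])

# Euclidean norm of each row of the sparse TF-IDF matrix
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)
```

Both rows have unit norm despite the large difference in character count, which is why raw length alone does not inflate a document's feature magnitudes. Very long narratives can still differ in vocabulary coverage, so truncation may be worth revisiting for later models.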


### Analysis of Systemic Biases & Data Limitations

1.  **Regulatory Bias:** The database explicitly excludes complaints against depository institutions with less than $10 billion in assets. The model will therefore be specialized for the products and complaint patterns of **larger financial institutions**.

2.  **Company Response Bias:** The publication process is tied to a company's response or a 15-day waiting period. This means the dataset is a **curated subset** of all filed complaints, not a raw feed.

**Overall Conclusion:** The final training data represents complaints made against large institutions, published after the institution responded or after 15 days had passed, and where the consumer consented to share their story. These limitations are critical for defining the model's scope.