# Capstone Three – Data Wrangling
_**AI-Powered Personal Spending & Financial Behavior Coach**_

**Data Acquisition Rationale**

This project uses public and synthetic transaction-level data to avoid privacy concerns while still modeling realistic consumer financial behavior. All datasets are programmatically loaded to ensure reproducibility and scalability.

**Data Sources**

- Main dataset: Comprehensive Credit Card Transactions Dataset: https://www.kaggle.com/datasets/rajatsurana979/comprehensive-credit-card-transactions-dataset

- Open banking sample transaction data

- Synthetic labeled transaction data (LLM-assisted, used only for augmentation)

First, we import necessary libraries and load the raw dataset. We’ll inspect the data to understand the available columns, data types, and basic statistics. This ensures we know what we’re working with before cleaning and feature engineering.

In [3]:
import pandas as pd
import numpy as np

# Load dataset
transactions = pd.read_csv('credit_card_transaction_flow.csv')

# Inspect
transactions.head()
transactions.info()
transactions.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Customer ID         50000 non-null  int64  
 1   Name                50000 non-null  object 
 2   Surname             50000 non-null  object 
 3   Gender              44953 non-null  object 
 4   Birthdate           50000 non-null  object 
 5   Transaction Amount  50000 non-null  float64
 6   Date                50000 non-null  object 
 7   Merchant Name       50000 non-null  object 
 8   Category            50000 non-null  object 
dtypes: float64(1), int64(1), object(7)
memory usage: 3.4+ MB


| | Customer ID | Transaction Amount |
|---|---|---|
| count | 50000.0 | 50000.0 |
| mean | 500136.79696 | 442.119239 |
| std | 288232.43164 | 631.669724 |
| min | 29.0 | 5.01 |
| 25% | 251191.5 | 79.0075 |
| 50% | 499520.5 | 182.195 |
| 75% | 749854.25 | 470.515 |
| max | 999997.0 | 2999.88 |


Initial inspection ensures we understand columns, data types, and missing values, which is crucial for extracting business-relevant insights like spending patterns, recurring behaviors, and transaction volume per user.
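The `info()` output above already hints at where gaps sit: `Gender` has 44,953 non-null values out of 50,000, so about 10% of its rows are missing. As a minimal sketch, a per-column report of counts and shares makes this explicit. The `sample` frame and its values are illustrative stand-ins, not rows from the real dataset.

```python
import pandas as pd

# Toy frame standing in for the raw data (values are illustrative only)
sample = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Gender": ["F", None, "M", None],
    "Transaction Amount": [35.47, 2552.72, 115.97, 11.31],
})

# Count and share of missing values per column
missing_report = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean().round(3),
})

# Show only the columns that actually have gaps
print(missing_report[missing_report["missing"] > 0])
```

Running the same two lines against `transactions` would show which columns need an imputation strategy before any aggregation.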

**Standardizing column names** improves readability and keeps column references consistent throughout the notebook, which reduces errors and makes the analysis easier to reproduce.

In [4]:
transactions.columns = transactions.columns.str.strip().str.lower().str.replace(' ', '_')
transactions.head()


| | customer_id | name | surname | gender | birthdate | transaction_amount | date | merchant_name | category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 752858 | Sean | Rodriguez | F | 20-10-2002 | 35.47 | 03-04-2023 | Smith-Russell | Cosmetic |
| 1 | 26381 | Michelle | Phelps | | 24-10-1985 | 2552.72 | 17-07-2023 | Peck, Spence and Young | Travel |
| 2 | 305449 | Jacob | Williams | M | 25-10-1981 | 115.97 | 20-09-2023 | Steele Inc | Clothing |
| 3 | 988259 | Nathan | Snyder | M | 26-10-1977 | 11.31 | 11-01-2023 | Wilson, Wilson and Russell | Cosmetic |
| 4 | 764762 | Crystal | Knapp | F | 02-11-1951 | 62.21 | 13-06-2023 | Palmer-Hinton | Electronics |


All columns are now in snake_case format, making subsequent code cleaner and more consistent for analysis and feature engineering.
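A quick way to confirm the renaming behaved as intended is to check that the resulting names are unique and usable as Python identifiers. The sketch below applies the same string chain to a hypothetical `raw_columns` index with stray whitespace added for illustration.

```python
import pandas as pd

# Hypothetical raw headers, with stray whitespace added for illustration
raw_columns = pd.Index([" Customer ID", "Transaction Amount ", "Merchant Name"])

# Same chain as in the notebook: trim, lowercase, spaces to underscores
clean = raw_columns.str.strip().str.lower().str.replace(" ", "_")

# Names should not collide and should be valid identifiers
assert clean.is_unique
assert all(name.isidentifier() for name in clean)
print(list(clean))
```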

Next, we address **Missing Values**. The `info()` output shows that only `gender` has gaps (44,953 non-null values out of 50,000); the other fills below are defensive. Different strategies are applied depending on column type:
- Categorical columns with missing values are labeled **'unknown'**

- Numeric columns are imputed with **median values** to maintain realistic financial calculations

In [11]:
# Check missing values per column
print(transactions.isnull().sum())

# Categorical columns: label gaps as 'unknown' (direct assignment rather than inplace=True)
for col in ['gender', 'category', 'merchant_name']:
    transactions[col] = transactions[col].fillna('unknown')

# Numeric column: impute with the median
transactions['transaction_amount'] = transactions['transaction_amount'].fillna(transactions['transaction_amount'].median())


Missing values are now handled, ensuring **accurate spending summaries and analyses** for stakeholders. For example, monthly or category-level aggregates will not be skewed by missing data.
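The choice of median over mean matters because transaction amounts are right-skewed: in the `describe()` output the mean is about 442 while the median is about 182. The following sketch, using made-up amounts, shows how a single large purchase pulls a mean-based fill far away from a typical transaction.

```python
import pandas as pd

# Made-up amounts: three small purchases, one large one, one missing value
amounts = pd.Series([10.0, 20.0, 30.0, 3000.0, None])

# Both statistics skip NaN by default
mean_fill = amounts.fillna(amounts.mean())
median_fill = amounts.fillna(amounts.median())

# Compare the value each strategy writes into the missing slot
print("mean fill:  ", mean_fill.iloc[-1])
print("median fill:", median_fill.iloc[-1])
```

Here the mean-based fill is dominated by the one large purchase, while the median-based fill stays close to the typical small amounts.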

Correct data types are critical for analysis. We **convert** the transaction date to datetime for temporal analysis and ensure the amount column is numeric for financial calculations. Note that dates are in DD-MM-YYYY format, so we use dayfirst=True.

In [17]:
# Parse dates correctly
transactions['date'] = pd.to_datetime(transactions['date'], dayfirst=True)
transactions['birthdate'] = pd.to_datetime(transactions['birthdate'], dayfirst=True)

# Ensure amounts are numeric
transactions['transaction_amount'] = transactions['transaction_amount'].astype(float)


With proper types, we can perform temporal analysis and calculate net spending trends per user.
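Day-first parsing matters for values such as `03-04-2023`, which is 3 April in this dataset but would read as 4 March under a month-first convention. As an alternative sketch (not what the notebook cell does), an explicit `format` string makes the expected layout unambiguous, and `errors="coerce"` turns unparseable values into `NaT` so they can be counted. The second, invalid date is invented for illustration.

```python
import pandas as pd

raw = pd.Series(["03-04-2023", "17-07-2023"])

# Explicit DD-MM-YYYY format: raises if a value does not match the layout
parsed = pd.to_datetime(raw, format="%d-%m-%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())

# Invented invalid date (31 February): coerced to NaT instead of raising
bad = pd.to_datetime(pd.Series(["31-02-2023"]), format="%d-%m-%Y", errors="coerce")
print("unparseable rows:", int(bad.isna().sum()))
```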

Create **features** for temporal analysis and financial behavior:

- Month and day of week (the dates carry no time component, so no hour feature is created)

- Expense vs income flag

- Net flow per transaction

In [18]:
# Time-based features
transactions['month'] = transactions['date'].dt.month
transactions['day_of_week'] = transactions['date'].dt.day_name()

# Flag expenses (all are debits here) and net flow
transactions['is_expense'] = 1  # All transactions are spending
transactions['net_flow'] = -transactions['transaction_amount']  # Negative for spending


These features allow stakeholders to understand monthly, weekly, and daily spending patterns, which is crucial for **user behavior segmentation and coaching recommendations**.
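One caveat with a bare `month` number is that January of one year and January of the next fall into the same group. Whether the real data spans more than one year is not established here, so the sketch below is only a precaution: it adds a hypothetical `year_month` column built with `dt.to_period("M")` and aggregates net flow on that key. The rows are invented.

```python
import pandas as pd

# Invented rows spanning two different Januaries
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-11", "2023-01-20", "2024-01-05"]),
    "transaction_amount": [11.31, 50.00, 20.00],
})
toy["net_flow"] = -toy["transaction_amount"]  # negative for spending

# Month number alone merges both years into one group
groups_by_month_number = toy["date"].dt.month.nunique()

# A year-month period keeps them apart (hypothetical extra column)
toy["year_month"] = toy["date"].dt.to_period("M")
monthly = toy.groupby("year_month")["net_flow"].sum()
print(monthly)
```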

**Standardize merchant names and categories** for meaningful aggregation and reporting.

In [19]:
# Normalize text
transactions['merchant_name'] = transactions['merchant_name'].str.strip().str.lower()
transactions['category'] = transactions['category'].str.strip().str.lower()

# Map similar categories to broader groups
category_map = {
    'grocery': 'food', 
    'supermarket': 'food', 
    'restaurant': 'dining', 
    'fast food': 'dining',
    'fuel': 'transport',
    'taxi': 'transport',
    'cosmetic': 'personal_care',
    'electronics': 'tech'
}
transactions['category'] = transactions['category'].replace(category_map)


**Normalized categories** allow stakeholders to aggregate spending by meaningful groups, e.g., total monthly food spending, tech purchases, or transportation costs.
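Because `replace` leaves any value without a map entry unchanged, it can help to list which categories passed through unmapped. The sketch below uses a shortened copy of the map and an invented category series; in the notebook the same set difference could be run against `transactions['category']`.

```python
import pandas as pd

# Shortened copy of the notebook's map, for illustration
category_map = {"cosmetic": "personal_care", "electronics": "tech", "grocery": "food"}

# Invented raw categories with inconsistent case and whitespace
categories = pd.Series(["Cosmetic ", "travel", "Electronics", "clothing"])

# Normalize first so the map keys can match
normalized = categories.str.strip().str.lower()
mapped = normalized.replace(category_map)

# Categories that have no entry in the map and therefore stay as-is
unmapped = sorted(set(normalized) - set(category_map))
print("unmapped categories:", unmapped)
```

Unmapped values are not necessarily a problem, since a category like travel may already be a sensible group on its own, but the list makes that decision explicit.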

Create human-readable transaction descriptions for **LLM** summaries and personalized coaching messages.

In [20]:
transactions['description'] = transactions.apply(
    lambda x: f"Spent ${x['transaction_amount']:.2f} on {x['category'].replace('_', ' ').title()} at {x['merchant_name'].title()}",
    axis=1
)
transactions['description_source'] = 'synthetic'


These descriptions can be used by **LLMs** to generate insightful and personalized summaries, improving user engagement and understanding of spending patterns.
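Row-wise `apply` is easy to read but can be slow on larger tables. As one possible alternative, the same text can be assembled with vectorized string operations. This is a sketch on invented rows, intended to produce the same wording as the cell above.

```python
import pandas as pd

# Invented rows using the already-normalized column values
toy = pd.DataFrame({
    "transaction_amount": [35.47, 2552.72],
    "category": ["personal_care", "travel"],
    "merchant_name": ["smith-russell", "steele inc"],
})

# Build each text fragment column-wise instead of row by row
amount_text = toy["transaction_amount"].map("{:.2f}".format)
category_text = toy["category"].str.replace("_", " ").str.title()
merchant_text = toy["merchant_name"].str.title()

toy["description"] = (
    "Spent $" + amount_text + " on " + category_text + " at " + merchant_text
)
print(toy["description"].tolist())
```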

Save the cleaned, feature-rich dataset for modeling, clustering, and LLM insights.

In [21]:
transactions.to_csv('transactions_clean.csv', index=False)


The cleaned dataset is now saved as `transactions_clean.csv`, ready for business analytics and AI modeling, and can be regenerated by rerunning the notebook.
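One detail to keep in mind when reloading the saved file: CSV stores no type information, so datetime columns come back as plain text unless they are parsed again. The sketch below demonstrates this with an in-memory buffer and invented rows; for the real file, passing `parse_dates=['date', 'birthdate']` to `pd.read_csv` would be the analogous step.

```python
import io
import pandas as pd

# Invented rows with a proper datetime column
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-03", "2023-07-17"]),
    "transaction_amount": [35.47, 2552.72],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reload without and with date parsing
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])

print("plain dtype:   ", plain["date"].dtype)
print("restored dtype:", restored["date"].dtype)
```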

Perform **quick sanity checks** to verify data quality and extract preliminary insights.

In [22]:
# Top categories
transactions['category'].value_counts()

# Monthly net spending
transactions.groupby('month')['net_flow'].sum()

# Sample LLM descriptions
transactions['description'].head(5)


0       Spent $35.47 on Personal Care at Smith-Russell
1    Spent $2552.72 on Travel at Peck, Spence And Y...
2              Spent $115.97 on Clothing at Steele Inc
3    Spent $11.31 on Personal Care at Wilson, Wilso...
4                Spent $62.21 on Tech at Palmer-Hinton
Name: description, dtype: object

Validation ensures **data integrity and provides early business insights**: top spending categories, monthly net outflows, and sample coaching text.
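Visual checks can be complemented with a few programmatic ones. The sketch below collects simple pass/fail conditions in a dictionary; the check names and the toy rows are hypothetical, and the conditions (no missing amounts, positive amounts, `net_flow` equal to the negated amount, datetime-typed dates) mirror what the earlier steps are meant to guarantee.

```python
import pandas as pd

# Invented rows shaped like the cleaned dataset
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-03", "2023-07-17"]),
    "transaction_amount": [35.47, 2552.72],
})
toy["net_flow"] = -toy["transaction_amount"]

# Hypothetical check names mapped to boolean results
checks = {
    "no_missing_amounts": bool(toy["transaction_amount"].notna().all()),
    "amounts_positive": bool((toy["transaction_amount"] > 0).all()),
    "net_flow_matches": bool((toy["net_flow"] == -toy["transaction_amount"]).all()),
    "dates_are_datetime": pd.api.types.is_datetime64_any_dtype(toy["date"]),
}

# Report any check that did not pass
failed = [name for name, ok in checks.items() if not ok]
print("failed checks:", failed)
```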

## Data Wrangling Summary

- Dataset cleaned, structured, and feature-engineered
- Business-relevant features: net cash flow, temporal metrics, expense flags
- LLM-ready descriptions for user-facing coaching insights
- Saved CSV ready for clustering, recommendations, and reports
- Markdown explains each step in business terms