## PRE-PROCESSING & TRAINING DATA DEVELOPMENT

**Purpose**

This notebook prepares the cleaned transaction data for modeling. The goal is to convert raw transactional information into machine-learning-ready features while preserving business meaning.

**Objective**

- Prepare the cleaned transaction dataset for both machine learning and LLM-based financial coaching.
- Frequency-encode high-cardinality categorical columns.
- One-hot encode low-cardinality categorical columns.
- Standardize numeric features.
- Create train/test splits for modeling.
- Preserve text features for LLM use.

In [10]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [12]:
transactions = pd.read_csv("transactions_clean.csv")
transactions.head()

Unnamed: 0,customer_id,name,surname,gender,birthdate,transaction_amount,date,merchant_name,category,month,day_of_week,is_expense,net_flow,description,description_source
0,752858,Sean,Rodriguez,F,2002-10-20,35.47,2023-04-03,smith-russell,personal_care,4,Monday,1,-35.47,Spent $35.47 on Personal Care at Smith-Russell,synthetic
1,26381,Michelle,Phelps,,1985-10-24,2552.72,2023-07-17,"peck, spence and young",travel,7,Monday,1,-2552.72,"Spent $2552.72 on Travel at Peck, Spence And Y...",synthetic
2,305449,Jacob,Williams,M,1981-10-25,115.97,2023-09-20,steele inc,clothing,9,Wednesday,1,-115.97,Spent $115.97 on Clothing at Steele Inc,synthetic
3,988259,Nathan,Snyder,M,1977-10-26,11.31,2023-01-11,"wilson, wilson and russell",personal_care,1,Wednesday,1,-11.31,"Spent $11.31 on Personal Care at Wilson, Wilso...",synthetic
4,764762,Crystal,Knapp,F,1951-11-02,62.21,2023-06-13,palmer-hinton,tech,6,Tuesday,1,-62.21,Spent $62.21 on Tech at Palmer-Hinton,synthetic


In [22]:
# Parse dates correctly to avoid warnings
transactions["birthdate"] = pd.to_datetime(
    transactions["birthdate"], 
    format="%Y-%m-%d",  # matches dataset format
    errors="coerce"
)

transactions["date"] = pd.to_datetime(
    transactions["date"],
    format="%Y-%m-%d",  # transaction dates use the same ISO format as birthdate
    errors="coerce"
)

# Approximate age in whole years as of the most recent transaction date
reference_date = transactions["date"].max()
transactions["age"] = (reference_date - transactions["birthdate"]).dt.days // 365

transactions[["birthdate", "age"]].head()


Unnamed: 0,birthdate,age
0,2002-10-20,21
1,1985-10-24,38
2,1981-10-25,42
3,1977-10-26,46
4,1951-11-02,72
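
The `// 365` division above ignores leap days, so an age can come out one year too high shortly before a birthday. A minimal sketch of a calendar-based alternative, assuming both inputs are already parsed as datetimes; the helper name `age_in_years` and the toy dates are hypothetical and not part of the notebook:

```python
import pandas as pd

def age_in_years(birthdate: pd.Series, reference: pd.Timestamp) -> pd.Series:
    """Completed years between each birthdate and a reference date."""
    years = reference.year - birthdate.dt.year
    # True where the birthday has already occurred in the reference year
    had_birthday = (birthdate.dt.month < reference.month) | (
        (birthdate.dt.month == reference.month)
        & (birthdate.dt.day <= reference.day)
    )
    # Subtract one year where the birthday is still to come
    return years - (~had_birthday).astype(int)

# Toy data (hypothetical); the last birthdate falls one day after the reference date
demo = pd.DataFrame(
    {"birthdate": pd.to_datetime(["2002-10-20", "1951-11-02", "2000-12-31"])}
)
reference_date = pd.Timestamp("2023-12-30")
demo["age"] = age_in_years(demo["birthdate"], reference_date)
demo["age_approx"] = (reference_date - demo["birthdate"]).dt.days // 365
```

In this toy example the two columns agree for the first two rows and differ for the last one, where the birthday is one day after the reference date.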


To support both traditional machine learning and LLM-based financial coaching, features are split into:

1. Structured ML Features

Used for prediction and segmentation:
- category
- merchant_name
- gender
- day_of_week
- age
- is_expense
- month

2. Unstructured Text Features

Reserved for LLM use:
- description
- description_source

In [23]:
ml_features = transactions[["category", "merchant_name", "gender", "day_of_week", "age", "is_expense", "month"]]
y = transactions["transaction_amount"]

text_features = transactions[["description", "description_source"]]


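Machine learning models require numeric inputs.
High-cardinality categorical columns (`category`, `merchant_name`) are frequency-encoded, replacing each value with its share of rows. Low-cardinality columns (`gender`, `day_of_week`) are one-hot encoded.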

In [24]:
ml_features_encoded = ml_features.copy()
high_card_cols = ["category", "merchant_name"]

for col in high_card_cols:
    freq = ml_features_encoded[col].value_counts(normalize=True)
    ml_features_encoded[col] = ml_features_encoded[col].map(freq)


In [25]:
low_card_cols = ["gender", "day_of_week"]

ml_features_encoded = pd.get_dummies(ml_features_encoded, columns=low_card_cols, drop_first=True)
ml_features_encoded.head()


Unnamed: 0,category,merchant_name,age,is_expense,month,gender_M,day_of_week_Monday,day_of_week_Saturday,day_of_week_Sunday,day_of_week_Thursday,day_of_week_Tuesday,day_of_week_Wednesday
0,0.16486,4e-05,21,1,4,False,True,False,False,False,False,False
1,0.16754,2e-05,38,1,7,False,True,False,False,False,False,False
2,0.16522,6e-05,42,1,9,True,False,False,False,False,False,True
3,0.16486,2e-05,46,1,1,True,False,False,False,False,False,True
4,0.16648,2e-05,72,1,6,False,False,False,False,False,True,False
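
The frequency encoding above is computed on the full dataset, and the frequency maps themselves are not kept, so new data could not be encoded the same way later. A minimal sketch of one way to handle this, assuming the maps are fit on training rows only and unseen values fall back to 0.0; the helper names `fit_frequency_maps` and `apply_frequency_maps` and the toy data are hypothetical:

```python
import pandas as pd

def fit_frequency_maps(df, cols):
    """Return {column: Series mapping each value to its relative frequency}."""
    return {col: df[col].value_counts(normalize=True) for col in cols}

def apply_frequency_maps(df, freq_maps, default=0.0):
    """Replace each mapped column with its frequency; unseen values get `default`."""
    out = df.copy()
    for col, freq in freq_maps.items():
        out[col] = out[col].map(freq).fillna(default)
    return out

# Toy data (hypothetical): "groceries" never appears in the training rows
train = pd.DataFrame({"category": ["travel", "tech", "travel", "clothing"]})
new = pd.DataFrame({"category": ["tech", "groceries"]})

freq_maps = fit_frequency_maps(train, ["category"])
train_encoded = apply_frequency_maps(train, freq_maps)
new_encoded = apply_frequency_maps(new, freq_maps)
```

The resulting dictionary could be saved with `joblib.dump` alongside the other artifacts so that a deployed app applies exactly the same mapping.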


In [26]:
numeric_cols = ["age", "is_expense", "month"] + high_card_cols  # include freq-encoded columns

scaler = StandardScaler()
X_scaled = scaler.fit_transform(ml_features_encoded[numeric_cols])


In [27]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)


In [30]:
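import joblib

# Re-split using the full encoded frame so the one-hot columns are kept
# (the previous split contained only the scaled numeric columns)
X = ml_features_encoded
y = transactions["transaction_amount"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Refit the scaler on training rows only and apply it to both splits,
# so the "*_scaled" files are actually scaled and no test statistics leak in
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

# Save splits for the modeling notebook
joblib.dump(X_train, "X_train_scaled.pkl")
joblib.dump(X_test, "X_test_scaled.pkl")
joblib.dump(y_train, "y_train.pkl")
joblib.dump(y_test, "y_test.pkl")

# Save preprocessing artifacts for reuse downstream
joblib.dump(scaler, "scaler.pkl")
joblib.dump(high_card_cols, "high_card_cols.pkl")
joblib.dump(low_card_cols, "low_card_cols.pkl")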


['low_card_cols.pkl']
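
A downstream notebook can restore these files with `joblib.load`. The sketch below round-trips a small stand-in DataFrame through a temporary directory so that it runs on its own; the stand-in values are hypothetical, and in the modeling notebook the same `joblib.load` call would simply point at the file names saved above.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

# Stand-in for X_train; hypothetical values, not taken from the dataset
X_demo = pd.DataFrame({"age": [21, 38], "gender_M": [False, True]})

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "X_train_scaled.pkl")
    joblib.dump(X_demo, path)      # same call the notebook uses to save
    X_loaded = joblib.load(path)   # what the modeling notebook would run

# Values, column names, and dtypes should survive the round trip
round_trip_ok = X_loaded.equals(X_demo)
```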

In [31]:
text_features.head()


Unnamed: 0,description,description_source
0,Spent $35.47 on Personal Care at Smith-Russell,synthetic
1,"Spent $2552.72 on Travel at Peck, Spence And Y...",synthetic
2,Spent $115.97 on Clothing at Steele Inc,synthetic
3,"Spent $11.31 on Personal Care at Wilson, Wilso...",synthetic
4,Spent $62.21 on Tech at Palmer-Hinton,synthetic


**Final Notes**

- Structured features are now ready for ML modeling (regression, clustering).
- Text features are separated for LLM analysis.
- Frequency encoding keeps `merchant_name` as a single numeric column instead of one dummy column per merchant, so the feature matrix stays small even on 50,000+ rows.
- Preprocessing artifacts are saved for downstream modeling and Streamlit deployment.