# Remittance to the Philippines – Segmentation & Clustering

**Dataset Source:**  
https://www.kaggle.com/datasets/joshbuttler/remittance-to-the-philippines

**Input File:**  
data/processed/remittance_cleaned.csv

**Purpose:**  
Identify natural groupings in remittance behavior using:
- Feature engineering
- K-Means clustering
- Hierarchical clustering
- Cluster profiling and interpretation

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

pd.set_option("display.float_format", "{:,.2f}".format)
pd.set_option("display.max_columns", None)

In [None]:
DATA_PATH = "../data/processed/remittance_cleaned.csv"
df = pd.read_csv(DATA_PATH)

df.head()


In [None]:
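# Detect the key columns by name. If there is no "amount" column, fall back to
# the first numeric column; check the printed result, since that may be an ID or year.
amount_col = "amount" if "amount" in df.columns else df.select_dtypes(np.number).columns[0]

# First column whose name mentions country/origin or channel/method, else None
country_cols = [c for c in df.columns if "country" in c.lower() or "origin" in c.lower()]
channel_cols = [c for c in df.columns if "channel" in c.lower() or "method" in c.lower()]

country_col = country_cols[0] if country_cols else None
channel_col = channel_cols[0] if channel_cols else None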

amount_col, country_col, channel_col

In [None]:
# Prepare time features
if "date" in df.columns:
    df["date"] = pd.to_datetime(df["date"])
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month

In [None]:
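# Aggregate per country and year, using whichever of the two is available
group_cols = []

if country_col:
    group_cols.append(country_col)

if "year" in df.columns:
    group_cols.append("year")

# groupby() fails on an empty key list, so stop early with a clear message
if not group_cols:
    raise ValueError("No grouping columns found: expected a country/origin column or a date/year column.")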

cluster_df = (
    df.groupby(group_cols)[amount_col]
      .agg(
          total_amount="sum",
          avg_amount="mean",
          transaction_count="count",
          volatility="std"
      )
      .fillna(0)
      .reset_index()
)

cluster_df.head()

In [None]:
features = ["total_amount", "avg_amount", "transaction_count", "volatility"]

X = cluster_df[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
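inertia = []
# Candidate k values; capped so k stays below the number of rows,
# which silhouette_score requires
K = range(2, min(9, len(X_scaled)))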

for k in K:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertia.append(km.inertia_)

plt.plot(K, inertia, marker="o")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal K")
plt.show()

In [None]:
sil_scores = {}

for k in K:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    sil_scores[k] = silhouette_score(X_scaled, labels)

sil_scores

In [None]:
optimal_k = max(sil_scores, key=sil_scores.get)

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_df["cluster"] = kmeans.fit_predict(X_scaled)

cluster_df.head()

In [None]:
cluster_profile = (
    cluster_df
    .groupby("cluster")[features]
    .mean()
    .round(2)
)

cluster_profile

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

cluster_df["PC1"] = X_pca[:, 0]
cluster_df["PC2"] = X_pca[:, 1]

sns.scatterplot(
    data=cluster_df,
    x="PC1",
    y="PC2",
    hue="cluster",
    palette="tab10"
)
plt.title("PCA Projection of Clusters")
plt.show()
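
A two-component projection can be misleading when PC1 and PC2 carry only a small share of the variance. The following is a minimal sketch of checking that share through `explained_variance_ratio_`. It runs on synthetic stand-in data, not the remittance features, and the names `X_demo`, `pca_demo` and `retained` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four features (assumption: NOT the remittance data).
# Columns 0 and 1 are correlated; columns 2 and 3 are independent noise.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] * 0.9 + rng.normal(scale=0.3, size=200),
    base[:, 1],
    rng.normal(size=200),
])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit a 2-component PCA and sum the share of variance each component explains
pca_demo = PCA(n_components=2).fit(X_demo_scaled)
retained = pca_demo.explained_variance_ratio_.sum()
print(f"Variance retained by PC1 + PC2: {retained:.1%}")
```

In the notebook itself, the same check would be `pca.explained_variance_ratio_.sum()` on the fitted `pca` object. A low value suggests that clusters overlapping in the scatter plot may still be separated in the full feature space.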

In [None]:
linked = linkage(X_scaled, method="ward")

plt.figure(figsize=(14, 6))
dendrogram(
    linked,
    truncate_mode="lastp",
    p=10
)
plt.title("Hierarchical Clustering Dendrogram")
plt.show()
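
The dendrogram is only inspected visually, so the hierarchy never yields cluster labels that can be compared with K-Means. One possible follow-up, sketched here on synthetic blobs rather than the remittance data, is to cut the Ward tree with `scipy.cluster.hierarchy.fcluster` and measure agreement with K-Means using the adjusted Rand index. The names `X_demo`, `n_clusters_demo`, `hier_labels` and `ari` are assumptions made for this example.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three groups (assumption: stands in for X_scaled)
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

n_clusters_demo = 3  # hypothetical; the notebook would use optimal_k

# Cut the Ward tree so that at most n_clusters_demo flat clusters remain
linked_demo = linkage(X_demo_scaled, method="ward")
hier_labels = fcluster(linked_demo, t=n_clusters_demo, criterion="maxclust")

# K-Means with the same settings as the cells above
kmeans_labels = KMeans(
    n_clusters=n_clusters_demo, random_state=42, n_init=10
).fit_predict(X_demo_scaled)

# The adjusted Rand index ignores label numbering; 1.0 means identical partitions
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Adjusted Rand index (K-Means vs. Ward): {ari:.2f}")
```

Applied to the notebook, `linked` and `optimal_k` would take the place of `linked_demo` and `n_clusters_demo`. A value near 1 indicates that both methods recover roughly the same segments.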

In [None]:
OUTPUT_PATH = "../data/processed/remittance_clustered.csv"
cluster_df.to_csv(OUTPUT_PATH, index=False)

OUTPUT_PATH

## Cluster Interpretation

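The descriptions below assume `optimal_k = 4`. K-Means label numbers are arbitrary, so match each description to a cluster using `cluster_profile` rather than the label itself.

- **Cluster 0:** High-volume, high-volatility remittance sources (macro-sensitive).
- **Cluster 1:** Stable, frequent remittance patterns (structural contributors).
- **Cluster 2:** Low-volume or emerging remittance corridors.
- **Cluster 3:** Irregular but occasionally large transfers.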

These clusters can inform:
- Policy prioritization
- Risk exposure assessment
- Financial product targeting
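
Descriptive labels like the ones above can also be derived from the profile table instead of being assigned by hand. The sketch below shows one simple way to do that: rank each cluster on volume and volatility, then build a label from the ranks. The table `profile_demo` contains invented values in the same shape as `cluster_profile`, and the 0.5 rank threshold and the `describe` helper are assumptions for this example.

```python
import pandas as pd

# Hypothetical profile shaped like cluster_profile (values invented for illustration)
profile_demo = pd.DataFrame(
    {
        "total_amount": [900.0, 400.0, 50.0, 300.0],
        "avg_amount": [30.0, 10.0, 5.0, 60.0],
        "transaction_count": [30.0, 40.0, 10.0, 5.0],
        "volatility": [12.0, 2.0, 1.0, 25.0],
    },
    index=pd.Index([0, 1, 2, 3], name="cluster"),
)

# Percentile rank of each cluster on each feature (1.0 = highest)
ranks = profile_demo.rank(pct=True)


def describe(row: pd.Series) -> str:
    """Build a label from rank thresholds (0.5 is an arbitrary cut-off)."""
    volume = "high-volume" if row["total_amount"] > 0.5 else "low-volume"
    stability = "volatile" if row["volatility"] > 0.5 else "stable"
    return f"{volume}, {stability}"


labels_demo = ranks.apply(describe, axis=1)
print(labels_demo)
```

Because the labels follow from the ranks, they stay attached to the right cluster even if a rerun changes the label numbering or the value of `optimal_k`.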