# Exercise 12: DBSCAN Clustering for P-Card Anomaly Detection

**Objective:** Identify unusual government purchase card (P-Card) spending patterns using DBSCAN clustering to detect potential waste or misuse.

**Dataset:** `pcard_summary.csv` - Monthly spending behavior for government cardholders

**Key Variables:**
- `avg_transaction_amount`: Average dollar amount per transaction
- `transactions_per_month`: Number of transactions per month
- `pct_weekend_transactions`: Percentage of transactions occurring on weekends
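
To make the three variable definitions concrete, here is a minimal sketch of how they could be derived from transaction-level records. The `df_txn` table and its column names (`date`, `amount`) are hypothetical and are not the schema behind `pcard_summary.csv`; the sketch also assumes the weekend percentage is on a 0-100 scale.

```python
import pandas as pd

# Hypothetical transaction-level data (illustrative only)
df_txn = pd.DataFrame({
    'cardholder_id': ['A', 'A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime([
        '2024-01-05',  # Friday
        '2024-01-06',  # Saturday
        '2024-02-12',  # Monday
        '2024-02-13',  # Tuesday
        '2024-01-08',  # Monday
        '2024-01-09',  # Tuesday
    ]),
    'amount': [100.0, 50.0, 25.0, 25.0, 300.0, 100.0],
})

# dayofweek: Monday=0 ... Saturday=5, Sunday=6
df_txn['is_weekend'] = df_txn['date'].dt.dayofweek >= 5
df_txn['month'] = df_txn['date'].dt.to_period('M')

df_summary = (
    df_txn
    .groupby('cardholder_id')
    .agg(
        avg_transaction_amount=('amount', 'mean'),
        n_transactions=('amount', 'size'),
        n_months=('month', 'nunique'),
        pct_weekend_transactions=('is_weekend', lambda s: s.mean() * 100),
    )
)

# transactions per month = total transactions / number of active months
df_summary['transactions_per_month'] = (
    df_summary['n_transactions'] / df_summary['n_months']
)

print(df_summary)
```

Dividing by active months rather than calendar months is one possible choice; the source dataset may define the rate differently.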

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
#load dataset
df_pcard = pd.read_csv('pcard_summary.csv')

print(f"Dataset shape: {df_pcard.shape}")
print(f"Columns: {df_pcard.columns.tolist()}")

In [None]:
df_pcard.head(10)

df_pcard.info()

df_pcard.describe()

#check for missing values
df_pcard.isnull().sum()

#examine department distribution
df_pcard['department'].value_counts()


In [None]:
#let's have a look at histograms for the numeric variables
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

#transaction amount
axes[0].hist(df_pcard['avg_transaction_amount'], bins=30, color='steelblue', edgecolor='black')
axes[0].axvline(df_pcard['avg_transaction_amount'].mean(), color='red', linestyle='--', label='Mean')
axes[0].set_xlabel('Average Transaction Amount ($)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Average Transaction Amount')
axes[0].legend()

#transaction counts per month
axes[1].hist(df_pcard['transactions_per_month'], bins=30, color='coral', edgecolor='black')
axes[1].axvline(df_pcard['transactions_per_month'].mean(), color='red', linestyle='--', label='Mean')
axes[1].set_xlabel('Transactions per Month')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Transactions per Month')
axes[1].legend()

#interesting to look at weekend purchases for business cards
axes[2].hist(df_pcard['pct_weekend_transactions'], bins=30, color='mediumseagreen', edgecolor='black')
axes[2].axvline(df_pcard['pct_weekend_transactions'].mean(), color='red', linestyle='--', label='Mean')
axes[2].set_xlabel('Percent Weekend Transactions')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Distribution of Weekend Transaction Percent')
axes[2].legend()

#show
plt.tight_layout()
plt.show()

In [None]:
#create pairwise scatterplot matrix
feature_cols = ['avg_transaction_amount', 'transactions_per_month', 'pct_weekend_transactions']

#build pairs to plot
pairplot = sns.pairplot(
    df_pcard[feature_cols],
    diag_kind='hist',
    plot_kws={'alpha': 0.6, 's': 50},
    diag_kws={'bins': 30, 'edgecolor': 'black'}
)

#show
pairplot.fig.suptitle('Pairwise Relationships of P-Card Variables', y=1.01, fontsize=14)
plt.show()

In [None]:
#create 3D visualization
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

#build scatter plot
scatter = ax.scatter(
    df_pcard['avg_transaction_amount'],
    df_pcard['transactions_per_month'],
    df_pcard['pct_weekend_transactions'],
    c='steelblue',
    alpha=0.6,
    s=50,
    edgecolors='black',
    linewidth=0.5
)

#setup plot
ax.set_xlabel('Avg Transaction Amount ($)', labelpad=10)
ax.set_ylabel('Transactions per Month', labelpad=10)
ax.set_zlabel('Pct Weekend Transactions', labelpad=10)
ax.set_title('3D View of P-Card Spending Patterns', fontsize=14, pad=20)

#show plot
#I like this plot, even though in our viz class
#  we are discouraged from using 3D visualizations
plt.show()

In [None]:
# Extract numeric features
feature_cols = ['avg_transaction_amount', 'transactions_per_month', 'pct_weekend_transactions']
X_features = df_pcard[feature_cols].copy()

#show
print(f"Features shape: {X_features.shape}")

X_features.describe()


In [None]:
#standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_features)

#build scaled data frame
df_scaled = pd.DataFrame(X_scaled, columns=feature_cols)

#show
print("Features after scaling:")

df_scaled.describe()
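
DBSCAN measures Euclidean distance, so a feature recorded in dollars can swamp one recorded as a small count or percentage unless the features are standardized. The sketch below illustrates this on synthetic data (the means and spreads are made up, not taken from the P-Card file): it checks that `StandardScaler` matches a manual z-score and compares how much each feature contributes to pairwise squared distances before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the three features (illustrative values only)
X_demo = np.column_stack([
    rng.normal(250, 80, 200),  # dollars
    rng.normal(12, 4, 200),    # transactions per month
    rng.normal(5, 2, 200),     # percent weekend
])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# StandardScaler uses the population standard deviation (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print("Matches manual z-score:", np.allclose(X_demo_scaled, X_manual))


def distance_share(X):
    """Fraction of total pairwise squared distance contributed by each feature."""
    diffs = X[:, None, :] - X[None, :, :]
    sq_per_feature = (diffs ** 2).sum(axis=(0, 1))
    return sq_per_feature / sq_per_feature.sum()


share_raw = distance_share(X_demo)
share_scaled = distance_share(X_demo_scaled)

print("Share of distance, raw:   ", share_raw.round(3))
print("Share of distance, scaled:", share_scaled.round(3))
```

Note that `describe()` in pandas reports the sample standard deviation (ddof=1), so the scaled columns show a standard deviation slightly above 1 rather than exactly 1.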

In [None]:
#create k-distance plot
min_samples_value = 4

#find nearest neighbors
neighbors_model = NearestNeighbors(n_neighbors=min_samples_value)
neighbors_fit = neighbors_model.fit(X_scaled)

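#find distances
#note: querying the training data returns each point as its own first
#  neighbor (distance 0), which matches DBSCAN's min_samples counting the point itself
distances, indices = neighbors_fit.kneighbors(X_scaled)
distances_sorted = np.sort(distances[:, min_samples_value-1], axis=0)[::-1]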

#build the plot
plt.figure(figsize=(10, 6))
plt.plot(distances_sorted, linewidth=2, color='steelblue')
plt.xlabel('Data Points (sorted by distance)', fontsize=12)
plt.ylabel(f'Distance to Neighbor #{min_samples_value} (counting the point itself)', fontsize=12)
plt.title(f'K-Distance Plot for Epsilon Selection (min_samples={min_samples_value})', fontsize=14)
plt.grid(True, alpha=0.3)

#plot for each eps candidate
for eps_candidate in [0.3, 0.5, 0.7, 1.0]:
    plt.axhline(y=eps_candidate, color='red', linestyle='--', alpha=0.5, label=f'eps={eps_candidate}')

#show
plt.legend()
plt.show()

#show conclusion of eps value determination
print(f"Suggested eps values:")
print(f"25th percentile: {np.percentile(distances_sorted, 25):.3f}")
print(f"50th percentile: {np.percentile(distances_sorted, 50):.3f}")
print(f"75th percentile: {np.percentile(distances_sorted, 75):.3f}")
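
Reading the elbow off a k-distance plot by eye is subjective. One common heuristic, sketched below, picks the point on the sorted curve that lies farthest below the straight line joining its first and last points. The helper `knee_eps` is a hypothetical name, the data is synthetic, and the result is only a starting value to compare against the plot, not a definitive choice of `eps`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knee_eps(k_distances_desc):
    """Return (index, distance) at the knee of a descending k-distance curve.

    Assumes at least two points and a non-constant curve.
    """
    y = np.asarray(k_distances_desc, dtype=float)
    x = np.arange(len(y), dtype=float)
    # rescale both axes to [0, 1] so neither dominates
    x_n = x / x[-1]
    y_n = (y - y.min()) / (y.max() - y.min())
    # after rescaling, the chord runs from (0, 1) to (1, 0): y = 1 - x
    gap_below_chord = (1 - x_n) - y_n
    idx = int(np.argmax(gap_below_chord))
    return idx, float(y[idx])


# Synthetic stand-in: one dense blob plus a few scattered points
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0, 0.3, size=(300, 3)),
    rng.uniform(-4, 4, size=(10, 3)),
])

k = 4  # same role as min_samples_value; the point itself counts as a neighbor
dists, _ = NearestNeighbors(n_neighbors=k).fit(X_demo).kneighbors(X_demo)
k_dist = np.sort(dists[:, k - 1])[::-1]

knee_idx, eps_guess = knee_eps(k_dist)
print(f"Knee at sorted index {knee_idx}, suggested eps ~ {eps_guess:.3f}")
```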

In [None]:
#test multiple parameter combinations
eps_values = [0.3, 0.5, 0.7, 1.0]
min_samples_values = [3, 4, 5, 6]

#collect results for every (eps, min_samples) combination
results_list = []

for eps_value in eps_values:
    for min_samples_value in min_samples_values:
        dbscan_model = DBSCAN(eps=eps_value, min_samples=min_samples_value)
        cluster_labels = dbscan_model.fit_predict(X_scaled)
        
        num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
        num_noise = list(cluster_labels).count(-1)
        pct_noise = (num_noise / len(cluster_labels)) * 100
        
        results_list.append({
            'eps': eps_value,
            'min_samples': min_samples_value,
            'num_clusters': num_clusters,
            'num_noise': num_noise,
            'pct_noise': pct_noise
        })

df_param_results = pd.DataFrame(results_list)

#show clusters and noise found for each parameter combination
print("Parameter Exploration Results:")

df_param_results
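
Cluster count and noise percentage alone do not say how well separated the clusters are. As one optional extra column for the grid, the sketch below computes a silhouette score on the non-noise points only. This is an addition, not part of the original analysis: `silhouette_without_noise` is a hypothetical helper, the data is synthetic, and silhouette favors compact, convex clusters, so it should be read as a rough guide for DBSCAN results.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_without_noise(X, labels):
    """Silhouette score over clustered points only; NaN when undefined."""
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    # silhouette_score needs between 2 and n_samples - 1 distinct labels
    if n_clusters < 2 or n_clusters >= mask.sum():
        return float('nan')
    return silhouette_score(X[mask], labels[mask])


# Synthetic stand-in for X_scaled (three blobs), not the P-Card data
X_raw, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X_demo = StandardScaler().fit_transform(X_raw)

rows = []
for eps_value in [0.1, 0.3, 0.5]:
    labels = DBSCAN(eps=eps_value, min_samples=4).fit_predict(X_demo)
    rows.append({
        'eps': eps_value,
        'num_clusters': len(set(labels)) - (1 if -1 in labels else 0),
        'pct_noise': float(np.mean(labels == -1) * 100),
        'silhouette': silhouette_without_noise(X_demo, labels),
    })

for row in rows:
    print(row)
```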

In [None]:
#visualize parameter exploration
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

#build the data to plot
for min_samp in min_samples_values:
    df_subset = df_param_results[df_param_results['min_samples'] == min_samp]
    axes[0].plot(df_subset['eps'], df_subset['num_clusters'], marker='o', label=f'min_samples={min_samp}')

#build the plots
axes[0].set_xlabel('eps', fontsize=12)
axes[0].set_ylabel('Number of Clusters', fontsize=12)
axes[0].set_title('Effect of Parameters on Cluster Count', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

for min_samp in min_samples_values:
    df_subset = df_param_results[df_param_results['min_samples'] == min_samp]
    axes[1].plot(df_subset['eps'], df_subset['pct_noise'], marker='o', label=f'min_samples={min_samp}')

#setup plots
axes[1].set_xlabel('eps', fontsize=12)
axes[1].set_ylabel('Percentage Noise Points', fontsize=12)
axes[1].set_title('Effect of Parameters on Noise Detection', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

#show
plt.tight_layout()
plt.show()

In [None]:
#select optimal parameters
optimal_eps = 0.5
optimal_min_samples = 4

print(f"Running DBSCAN with eps={optimal_eps} and min_samples={optimal_min_samples}")

#fit final model - woohoo!
dbscan_final = DBSCAN(eps=optimal_eps, min_samples=optimal_min_samples)
final_cluster_labels = dbscan_final.fit_predict(X_scaled)

#setup
df_pcard['cluster_label'] = final_cluster_labels

#show
print(f"Clustering complete!")
print(f"Unique cluster labels: {sorted(df_pcard['cluster_label'].unique())}")
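
A fitted DBSCAN model distinguishes three kinds of points: core points (dense neighborhoods), border points (inside a cluster but not dense themselves), and noise (label `-1`). The sketch below shows how to separate them using the `labels_` and `core_sample_indices_` attributes, on synthetic 2D data rather than the P-Card features.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in: two tight blobs plus two isolated points
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0, 0.2, size=(100, 2)),
    rng.normal(3, 0.2, size=(100, 2)),
    np.array([[10.0, 10.0], [-10.0, 10.0]]),  # far from everything
])

model = DBSCAN(eps=0.5, min_samples=4).fit(X_demo)
labels = model.labels_

# core points have at least min_samples points (including themselves) within eps
is_core = np.zeros(len(labels), dtype=bool)
is_core[model.core_sample_indices_] = True

is_noise = labels == -1
# border points belong to a cluster but are not core points
is_border = ~is_core & ~is_noise

print(f"core: {is_core.sum()}, border: {is_border.sum()}, noise: {is_noise.sum()}")
```

Border points can be worth a second look in an audit setting, since they sit at the edge of a cluster and may switch to noise with a slightly smaller `eps`.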

In [None]:
#count clusters and noise
num_clusters_final = len(set(final_cluster_labels)) - (1 if -1 in final_cluster_labels else 0)
num_noise_final = list(final_cluster_labels).count(-1)
pct_noise_final = (num_noise_final / len(final_cluster_labels)) * 100

#show results
print(f"Number of clusters: {num_clusters_final}")
print(f"Number of outliers: {num_noise_final}")
print(f"Percentage of outliers: {pct_noise_final:.2f}%")
print(f"\nCluster distribution:")

df_pcard['cluster_label'].value_counts().sort_index()

#compute mean values per cluster
cluster_means = (
    df_pcard
    .groupby('cluster_label')[feature_cols]
    .mean()
    .round(2)
)

#show cluster mean values
print("Mean values by cluster:")
cluster_means

#create cluster summary
cluster_summary = (
    df_pcard
    .groupby('cluster_label')
    .agg({
        'cardholder_id': 'count',
        'avg_transaction_amount': 'mean',
        'transactions_per_month': 'mean',
        'pct_weekend_transactions': 'mean'
    })
    .rename(columns={'cardholder_id': 'count'})
    .round(2)
)

#add each cluster's share of all cardholders
cluster_summary['pct_of_total'] = ((cluster_summary['count'] / len(df_pcard)) * 100).round(2)

#show summary of cluster findings
print("Detailed cluster summary:")
cluster_summary

In [None]:
#create several 2D scatterplots

#setup
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

unique_labels = sorted(df_pcard['cluster_label'].unique())
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
color_dict = {label: colors[i] if label != -1 else 'red' for i, label in enumerate(unique_labels)}

#average transaction amount vs transactions per month
for label in unique_labels:
    mask = df_pcard['cluster_label'] == label
    label_name = 'Outliers' if label == -1 else f'Cluster {label}'
    axes[0].scatter(
        df_pcard.loc[mask, 'avg_transaction_amount'],
        df_pcard.loc[mask, 'transactions_per_month'],
        c=[color_dict[label]],
        label=label_name,
        alpha=0.7,
        s=80,
        edgecolors='black',
        linewidth=0.5
    )

axes[0].set_xlabel('Average Transaction Amount ($)', fontsize=11)
axes[0].set_ylabel('Transactions per Month', fontsize=11)
axes[0].set_title('Spending Amount vs Frequency', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

#let's also look at weekend transactions (biz card usage on weekends!)
for label in unique_labels:
    mask = df_pcard['cluster_label'] == label
    label_name = 'Outliers' if label == -1 else f'Cluster {label}'
    axes[1].scatter(
        df_pcard.loc[mask, 'transactions_per_month'],
        df_pcard.loc[mask, 'pct_weekend_transactions'],
        c=[color_dict[label]],
        label=label_name,
        alpha=0.7,
        s=80,
        edgecolors='black',
        linewidth=0.5
    )

axes[1].set_xlabel('Transactions per Month', fontsize=11)
axes[1].set_ylabel('Pct Weekend Transactions', fontsize=11)
axes[1].set_title('Frequency vs Weekend Usage', fontsize=12)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

#average transaction amount vs weekend usage
for label in unique_labels:
    mask = df_pcard['cluster_label'] == label
    label_name = 'Outliers' if label == -1 else f'Cluster {label}'
    axes[2].scatter(
        df_pcard.loc[mask, 'avg_transaction_amount'],
        df_pcard.loc[mask, 'pct_weekend_transactions'],
        c=[color_dict[label]],
        label=label_name,
        alpha=0.7,
        s=80,
        edgecolors='black',
        linewidth=0.5
    )

axes[2].set_xlabel('Average Transaction Amount ($)', fontsize=11)
axes[2].set_ylabel('Pct Weekend Transactions', fontsize=11)
axes[2].set_title('Amount vs Weekend Usage', fontsize=12)
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
#let's look at 3D visualization of the clusters
#  (seemed worth a try)
fig = plt.figure(figsize=(14, 10))
ax = fig.add_subplot(111, projection='3d')

#plot each cluster separately (outliers in red)
for label in unique_labels:
    mask = df_pcard['cluster_label'] == label
    label_name = 'Outliers' if label == -1 else f'Cluster {label}'
    
    ax.scatter(
        df_pcard.loc[mask, 'avg_transaction_amount'],
        df_pcard.loc[mask, 'transactions_per_month'],
        df_pcard.loc[mask, 'pct_weekend_transactions'],
        c=[color_dict[label]],
        label=label_name,
        alpha=0.7,
        s=80,
        edgecolors='black',
        linewidth=0.5
    )

#setup
ax.set_xlabel('Avg Transaction Amount ($)', labelpad=10, fontsize=11)
ax.set_ylabel('Transactions per Month', labelpad=10, fontsize=11)
ax.set_zlabel('Pct Weekend Transactions', labelpad=10, fontsize=11)
ax.set_title('3D Cluster Visualization', fontsize=14, pad=20)
ax.legend(loc='upper right')

#show
plt.show()

In [None]:
#identify outliers
#in this data, the answer is in the outliers, not the clusters
#  (we had a clue from class)
df_outliers = df_pcard[df_pcard['cluster_label'] == -1].copy()

print(f"Number of outlier cardholders: {len(df_outliers)}")
print("\nOutlier cardholder details:")
print(df_outliers)

#compare outliers to typical spending
df_typical = df_pcard[df_pcard['cluster_label'] != -1].copy()

comparison_dict = {
    'metric': feature_cols,
    'typical_mean': [df_typical[col].mean() for col in feature_cols],
    'outlier_mean': [df_outliers[col].mean() if len(df_outliers) > 0 else None for col in feature_cols],
    'typical_std': [df_typical[col].std() for col in feature_cols],
    'outlier_std': [df_outliers[col].std() if len(df_outliers) > 0 else None for col in feature_cols]
}

#compare
df_comparison = pd.DataFrame(comparison_dict).round(2)

print("Comparison of typical vs outlier spending:")
df_comparison

#create outlier flag using .map()
df_pcard['outlier_flag'] = df_pcard['cluster_label'].map(lambda x: 'Outlier' if x == -1 else 'Typical')

#box plots comparing outliers to typical cardholders
#one panel per feature, same layout as the histograms above
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

df_pcard.boxplot(column='avg_transaction_amount', by='outlier_flag', ax=axes[0])
axes[0].set_title('Average Transaction Amount')
axes[0].set_xlabel('Cardholder Type')
axes[0].set_ylabel('Amount ($)')

df_pcard.boxplot(column='transactions_per_month', by='outlier_flag', ax=axes[1])
axes[1].set_title('Transactions per Month')
axes[1].set_xlabel('Cardholder Type')
axes[1].set_ylabel('Count')

df_pcard.boxplot(column='pct_weekend_transactions', by='outlier_flag', ax=axes[2])
axes[2].set_title('Weekend Transaction Percent')
axes[2].set_xlabel('Cardholder Type')
axes[2].set_ylabel('Percentage')

#show
plt.suptitle('Outlier vs Typical Spending Comparison', y=1.02, fontsize=14)
plt.tight_layout()
plt.show()


In [None]:
#analyze department representation
#for a closer look at where problems may lie
print("Department distribution among outliers:")
print(df_outliers['department'].value_counts())
print("\nDepartment distribution among typical cardholders:")
print(df_typical['department'].value_counts())

#calculate department-level outlier rates
dept_outlier_rate = (
    df_pcard
    .groupby('department')['outlier_flag']
    .apply(lambda x: (x == 'Outlier').sum() / len(x) * 100)
    .round(2)
    .sort_values(ascending=False)
)

#show
print("Outlier rate by department:")
dept_outlier_rate
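
An outlier rate by itself can mislead when a department has only a handful of cardholders: one flagged cardholder out of two gives a 50% rate. The sketch below reports the counts alongside the rate. The department names and rows are made up for illustration, and only the `department` and `outlier_flag` columns from the notebook are assumed.

```python
import pandas as pd

# Hypothetical rows standing in for df_pcard
df_demo = pd.DataFrame({
    'department': ['Parks'] * 2 + ['Public Works'] * 10,
    'outlier_flag': ['Outlier', 'Typical'] + ['Outlier'] * 2 + ['Typical'] * 8,
})

dept_summary = (
    df_demo
    .assign(is_outlier=df_demo['outlier_flag'].eq('Outlier'))
    .groupby('department')['is_outlier']
    .agg(n_cardholders='size', n_outliers='sum')
)

dept_summary['outlier_rate_pct'] = (
    dept_summary['n_outliers'] / dept_summary['n_cardholders'] * 100
).round(2)

dept_summary = dept_summary.sort_values('outlier_rate_pct', ascending=False)
print(dept_summary)
```

Here the smaller department ranks first on rate despite having fewer flagged cardholders, which is why the counts belong next to the percentage.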

## Key Findings

### Typical Spending Clusters

I start with the non-outlier clusters, which capture typical P-Card spending patterns. These clusters represent cardholders with:

- **Moderate transaction amounts**: Average transaction values within expected ranges for government purchases
- **Consistent monthly activity**: Regular transaction frequency patterns
- **Standard weekend usage**: Weekend transaction percentages aligned with normal needs

### Outlier Cardholders (the focus of this analysis)

Outliers (`cluster_label = -1`) represent cardholders whose spending patterns deviate significantly from typical behavior. These anomalies warrant a closer look, since they can have either legitimate or problematic explanations.

**Potential legitimate reasons:**
- Larger projects requiring larger purchases
- Emergency or time-sensitive procurement needs
- Specialized roles with unique purchasing requirements
- Field operations requiring weekend transactions (think: emergencies)

**Potential misuse indicators:**
- Unusually high weekend transaction percentages (personal use?)
- Higher average transaction amounts without justification
- Anomalous transaction frequency
- Combinations of unusual patterns across multiple dimensions
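
To see which of these indicators applies to a given outlier, one option is to express each outlier's features as z-scores relative to the typical cardholders and report the largest one. The sketch below uses made-up rows in place of `df_pcard`; the cardholder IDs and values are hypothetical.

```python
import pandas as pd

feature_cols = [
    'avg_transaction_amount',
    'transactions_per_month',
    'pct_weekend_transactions',
]

# Hypothetical rows standing in for df_pcard after clustering
df_demo = pd.DataFrame({
    'cardholder_id': ['C001', 'C002', 'C003', 'C004', 'C005', 'C006'],
    'avg_transaction_amount': [200.0, 220.0, 180.0, 210.0, 950.0, 190.0],
    'transactions_per_month': [10.0, 12.0, 11.0, 9.0, 10.0, 11.0],
    'pct_weekend_transactions': [4.0, 6.0, 5.0, 5.0, 5.0, 40.0],
    'cluster_label': [0, 0, 0, 0, -1, -1],
})

typical = df_demo[df_demo['cluster_label'] != -1]
outliers = df_demo[df_demo['cluster_label'] == -1].set_index('cardholder_id')

# z-score of each outlier relative to the typical group's mean and std
z_scores = (
    (outliers[feature_cols] - typical[feature_cols].mean())
    / typical[feature_cols].std()
)

# feature with the largest absolute deviation for each outlier
outlier_drivers = z_scores.abs().idxmax(axis=1).rename('most_unusual_feature')

print(z_scores.round(1))
print(outlier_drivers)
```

A single dominant feature does not rule out the last indicator in the list: an outlier can be moderately unusual on several dimensions at once, so the full z-score table is worth reading as well.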

### Recommendations

1. **Immediate review**: Conduct detailed audits of identified outlier cardholders
2. **Department analysis**: Investigate departments with higher outlier rates
3. **Policy review**: Evaluate whether spending policies need clarification
4. **Ongoing monitoring**: Implement regular DBSCAN analysis for continuous anomaly detection
5. **Context gathering**: Interview outlier cardholders to understand legitimate business justifications
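
For the ongoing monitoring recommendation, the scaling and clustering steps can be wrapped in one function and rerun on each new period's summary file. This is a minimal sketch under stated assumptions: `flag_pcard_outliers` is a hypothetical helper, the defaults reuse the `eps=0.5` and `min_samples=4` chosen in this notebook, and the demo data is synthetic. Because scikit-learn's `DBSCAN` has no `predict` method, the model is refit on each period rather than applied to new rows.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

FEATURE_COLS = [
    'avg_transaction_amount',
    'transactions_per_month',
    'pct_weekend_transactions',
]


def flag_pcard_outliers(df, eps=0.5, min_samples=4):
    """Return a copy of df with cluster_label and outlier_flag columns added."""
    X = StandardScaler().fit_transform(df[FEATURE_COLS])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    out = df.copy()
    out['cluster_label'] = labels
    out['outlier_flag'] = np.where(labels == -1, 'Outlier', 'Typical')
    return out


# Synthetic month of data: 200 ordinary cardholders plus one extreme spender
rng = np.random.default_rng(7)
df_month = pd.DataFrame({
    'cardholder_id': [f'C{i:03d}' for i in range(201)],
    'avg_transaction_amount': np.append(rng.normal(250, 30, 200), 5000.0),
    'transactions_per_month': np.append(rng.normal(12, 3, 200), 12.0),
    'pct_weekend_transactions': np.append(rng.normal(5, 1.5, 200), 5.0),
})

df_flagged = flag_pcard_outliers(df_month)
print(df_flagged['outlier_flag'].value_counts())
```

Parameters tuned on one period may not suit the next, so rechecking the k-distance plot from time to time is a reasonable safeguard.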

---

### Summary

DBSCAN clustering grouped cardholders with similar P-Card spending behavior and flagged the remaining cardholders as outliers for further investigation. Taken together, `avg_transaction_amount`, `transactions_per_month`, and `pct_weekend_transactions` give a multi-dimensional view of spending patterns that no single variable provides. Regular application of this analysis can help the Office of the Inspector General identify potential waste or misuse before it becomes a larger issue.