# Tier 4: DBSCAN Clustering

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 97cd4a02-d93d-40b1-b75b-fe523d810120

---

## Citation
Brandon Deloatch, "Tier 4: DBSCAN Clustering," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 97cd4a02-d93d-40b1-b75b-fe523d810120
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
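
One way to capture that execution metadata is sketched below. This is an illustration rather than part of the notebook: the helper name `capture_execution_metadata` and the list of packages it inspects are assumptions, and only standard-library calls are used.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("numpy", "pandas", "scikit-learn", "plotly")):
    """Collect a small, JSON-serialisable record of the current run (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


record = capture_execution_metadata()
print(json.dumps(record, indent=2))
```

Storing this record alongside the notebook output makes it easier to tell later which library versions produced a given result.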

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.neighbors import NearestNeighbors
import warnings
warnings.filterwarnings('ignore')

print(" Tier 4: DBSCAN Clustering - Libraries Loaded!")
print("=" * 50)
print("DBSCAN Techniques:")
print("• Density-based clustering with noise detection")
print("• Automatic cluster number determination")
print("• Outlier identification and removal")
print("• Parameter optimization (eps, min_samples)")
print("• Irregular cluster shape handling")
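
The techniques listed above can be seen on a toy problem before the main dataset is built. The following is a minimal sketch, not part of the original notebook: it assumes scikit-learn's `make_moons` generator, hand-picked values `eps=0.2` and `min_samples=5`, and two far-away points appended to act as outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles: a shape that centroid-based methods handle poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Two isolated points (illustrative outliers, far from both moons)
outliers = np.array([[3.0, 3.0], [-3.0, -3.0]])
X = np.vstack([X, outliers])

# eps and min_samples are assumptions chosen for this toy data
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN marks noise with the label -1; every other label is a cluster id
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

The number of clusters is never passed in: it follows from the density parameters, and the isolated points receive the noise label `-1` instead of being forced into a cluster.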

In [None]:
# Generate DBSCAN-optimized datasets
np.random.seed(42)

# Geospatial store locations with density clusters
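n_locations = 800  # Nominal target only; unused below (actual total: 3 clusters of 150-199 points plus 200 noise points)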
city_centers = [(40.7589, -73.9851), (40.6892, -74.0445), (40.8176, -73.9782)] # NYC areas

geo_data = []
cluster_labels_true = []

for i, (lat_center, lon_center) in enumerate(city_centers):
    n_cluster = np.random.randint(150, 200)

    # Generate clustered points with varying density
    lats = np.random.normal(lat_center, 0.02, n_cluster)
    lons = np.random.normal(lon_center, 0.02, n_cluster)

    for lat, lon in zip(lats, lons):
        geo_data.append({
            'latitude': lat,
            'longitude': lon,
            'sales_volume': np.random.lognormal(10, 0.5),
            'foot_traffic': np.random.poisson(100),
            'competition_nearby': np.random.beta(2, 5)
        })
        cluster_labels_true.append(i)

# Add noise points (outliers)
n_noise = 200
for _ in range(n_noise):
    geo_data.append({
        'latitude': np.random.uniform(40.5, 41.0),
        'longitude': np.random.uniform(-74.3, -73.7),
        'sales_volume': np.random.lognormal(8, 1),
        'foot_traffic': np.random.poisson(30),
        'competition_nearby': np.random.beta(5, 2)
    })
    cluster_labels_true.append(-1)  # Noise label

geo_df = pd.DataFrame(geo_data)
geo_df['true_cluster'] = cluster_labels_true

print(" DBSCAN Dataset Created:")
print(f"Total locations: {len(geo_df)}")
print(f"True clusters: {len(set(cluster_labels_true)) - (1 if -1 in cluster_labels_true else 0)}")
print(f"Noise points: {sum(1 for x in cluster_labels_true if x == -1)}")
print(f"Sales range: ${geo_df['sales_volume'].min():,.0f} - ${geo_df['sales_volume'].max():,.0f}")
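
The dataset above stores positions as latitude and longitude in degrees. This notebook clusters on standardized features, but when only the coordinates matter, a common alternative is to run DBSCAN with the haversine metric so that `eps` can be expressed as a ground distance. The sketch below is an assumption-laden illustration, not the notebook's method: the two centers reuse values from `city_centers`, while `EARTH_RADIUS_KM`, `eps_km`, the point spread, and the sample sizes are made up for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, used to convert km to radians

rng = np.random.default_rng(0)
centers = np.array([[40.7589, -73.9851], [40.6892, -74.0445]])  # [lat, lon] in degrees

# 100 illustrative points around each center (std. dev. 0.005 degrees)
coords_deg = np.vstack([rng.normal(center, 0.005, size=(100, 2)) for center in centers])

# The haversine metric expects [latitude, longitude] in radians
coords_rad = np.radians(coords_deg)

eps_km = 1.0  # neighbourhood radius in kilometres (assumed value)
geo_model = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # haversine distances are in radians
    min_samples=5,
    metric="haversine",
    algorithm="ball_tree",
)
geo_labels = geo_model.fit_predict(coords_rad)

n_geo_clusters = len(set(geo_labels)) - (1 if -1 in geo_labels else 0)
print(f"clusters: {n_geo_clusters}, noise: {int(np.sum(geo_labels == -1))}")
```

The trade-off is that this variant ignores the non-spatial columns (`sales_volume`, `foot_traffic`, `competition_nearby`), which the standardized-feature approach in the next cell does use.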

In [None]:
# 1. DBSCAN PARAMETER OPTIMIZATION
print(" 1. DBSCAN PARAMETER OPTIMIZATION")
print("=" * 35)

# Prepare data for clustering
features = ['latitude', 'longitude', 'sales_volume', 'foot_traffic', 'competition_nearby']
scaler = StandardScaler()
geo_scaled = scaler.fit_transform(geo_df[features])

# Find optimal eps using k-distance graph
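def find_optimal_eps(data, k=4):
    """Return the sorted k-distance curve used to choose eps.

    Because the query points are the fitted points, each point is its own
    first neighbour (distance 0), so column k-1 is the distance to the
    (k-1)-th other point.
    """
    nbrs = NearestNeighbors(n_neighbors=k).fit(data)
    distances, _ = nbrs.kneighbors(data)
    return np.sort(distances[:, k - 1])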

k_distances = find_optimal_eps(geo_scaled, k=4)
# A good eps typically sits at the "elbow" of the k-distance curve; the 95th
# percentile is used here as a simple proxy instead of locating the elbow itself
eps_optimal = np.percentile(k_distances, 95)

print(f"Optimal eps estimate: {eps_optimal:.3f}")

# Test different parameter combinations
eps_values = [eps_optimal * 0.5, eps_optimal * 0.75, eps_optimal, eps_optimal * 1.25, eps_optimal * 1.5]
min_samples_values = [3, 4, 5, 6]

dbscan_results = {}
best_score = -1
best_params = None

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        cluster_labels = dbscan.fit_predict(geo_scaled)

        # Calculate metrics (only if we have non-noise clusters)
        n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
        n_noise = list(cluster_labels).count(-1)

        if n_clusters > 1 and n_clusters < len(geo_df) - n_noise:
            # Only calculate silhouette for non-noise points
            non_noise_mask = cluster_labels != -1
            if np.sum(non_noise_mask) > n_clusters:
                silhouette = silhouette_score(geo_scaled[non_noise_mask],
                                              cluster_labels[non_noise_mask])
            else:
                silhouette = -1
        else:
            silhouette = -1

        dbscan_results[(eps, min_samples)] = {
            'labels': cluster_labels,
            'n_clusters': n_clusters,
            'n_noise': n_noise,
            'silhouette': silhouette
        }

        if silhouette > best_score:
            best_score = silhouette
            best_params = (eps, min_samples)

        print(f"eps={eps:.3f}, min_samples={min_samples}: "
              f"clusters={n_clusters}, noise={n_noise}, silhouette={silhouette:.3f}")

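# Apply best DBSCAN model (best_params stays None if no combination yielded a valid silhouette score)
if best_params is None:
    raise RuntimeError("No parameter combination produced at least two clusters; widen the eps/min_samples grid.")
best_eps, best_min_samples = best_params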
final_dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
final_labels = final_dbscan.fit_predict(geo_scaled)

geo_df['dbscan_cluster'] = final_labels
n_clusters_final = len(set(final_labels)) - (1 if -1 in final_labels else 0)
n_noise_final = list(final_labels).count(-1)

print(f"\nBest DBSCAN parameters: eps={best_eps:.3f}, min_samples={best_min_samples}")
print(f"Final results: {n_clusters_final} clusters, {n_noise_final} noise points")
print(f"Best silhouette score: {best_score:.3f}")

# Compare with true labels
if len(set(cluster_labels_true)) > 1:
    ari_score = adjusted_rand_score(cluster_labels_true, final_labels)
    print(f"Adjusted Rand Index vs true clusters: {ari_score:.3f}")
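
The cell above approximates the elbow of the k-distance curve with a fixed percentile. As a rough alternative, the elbow can be estimated directly from the curve. The sketch below is not the notebook's method: `k_distance_elbow` is a hypothetical helper that normalizes the sorted curve to the unit square and picks the point lying furthest below the straight line joining its endpoints, a simple chord-distance heuristic. The demo data (a dense blob plus uniform background points) is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distance_elbow(data, k=4):
    """Estimate eps as the elbow of the sorted k-distance curve (hypothetical helper)."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(data)
    distances, _ = nbrs.kneighbors(data)
    curve = np.sort(distances[:, k - 1])

    span = curve[-1] - curve[0]
    if span == 0:
        # Flat curve: every point has the same k-distance, so there is no elbow
        return float(curve[0]), curve

    # Normalise both axes to [0, 1] so the chord between the endpoints is y = x
    x = np.arange(len(curve)) / (len(curve) - 1)
    y = (curve - curve[0]) / span

    # The curve rises slowly and then sharply; the elbow is where it sags
    # furthest below the chord
    elbow_idx = int(np.argmax(x - y))
    return float(curve[elbow_idx]), curve


rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.3, size=(300, 2))   # one tight cluster
sparse = rng.uniform(-4, 4, size=(30, 2))     # scattered background points
demo = np.vstack([dense, sparse])

eps_elbow, curve = k_distance_elbow(demo, k=4)
print(f"elbow-based eps estimate: {eps_elbow:.3f}")
```

Whichever estimate is used, it is best treated as a starting point for a grid search like the one above rather than as a final value.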

In [None]:
# 2. INTERACTIVE DBSCAN VISUALIZATIONS
print(" 2. INTERACTIVE DBSCAN VISUALIZATIONS")
print("=" * 39)

# Create comprehensive DBSCAN dashboard
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        'K-Distance Graph (Eps Optimization)',
        'Geographic Clusters (DBSCAN Results)',
        'Parameter Grid Search Heatmap',
        'Cluster Characteristics Comparison'
    ],
    # Scattermapbox traces require a 'mapbox' subplot; the other panels use x/y axes
    specs=[[{"type": "xy"}, {"type": "mapbox"}],
           [{"type": "xy"}, {"type": "xy"}]]
)

# K-distance graph (k_distances is already sorted in ascending order)
fig.add_trace(
    go.Scatter(x=np.arange(len(k_distances)), y=k_distances,
               mode='lines', name='4-NN Distance',
               line=dict(color='blue', width=2)),
    row=1, col=1
)
fig.add_hline(y=eps_optimal, line=dict(color='red', dash='dash'),
              annotation_text=f"eps estimate: {eps_optimal:.3f}", row=1, col=1)

# Geographic visualization with DBSCAN results
colors = ['red', 'blue', 'green', 'purple', 'orange', 'yellow', 'pink']
unique_clusters = sorted(set(final_labels))

for cluster in unique_clusters:
    cluster_data = geo_df[geo_df['dbscan_cluster'] == cluster]

    if cluster == -1:  # Noise points
        fig.add_trace(
            go.Scattermapbox(
                lat=cluster_data['latitude'],
                lon=cluster_data['longitude'],
                mode='markers',
                marker=dict(size=6, color='black', opacity=0.6),
                name='Noise',
                text=cluster_data['sales_volume'].round(0),
                hovertemplate='Noise Point<br>Sales: $%{text:,.0f}<extra></extra>'
            ),
            row=1, col=2
        )
    else:
        color = colors[cluster % len(colors)]
        fig.add_trace(
            go.Scattermapbox(
                lat=cluster_data['latitude'],
                lon=cluster_data['longitude'],
                mode='markers',
                marker=dict(size=8, color=color, opacity=0.8),
                name=f'Cluster {cluster}',
                text=cluster_data['sales_volume'].round(0),
                hovertemplate=f'Cluster {cluster}<br>Sales: $%{{text:,.0f}}<extra></extra>'
            ),
            row=1, col=2
        )

# Parameter grid search heatmap
# Create parameter performance matrix (rows: min_samples, columns: eps)
eps_unique = sorted({eps for eps, _ in dbscan_results})
min_samples_unique = sorted({ms for _, ms in dbscan_results})
performance_matrix = np.full((len(min_samples_unique), len(eps_unique)), -1.0)

for i, ms in enumerate(min_samples_unique):
    for j, eps in enumerate(eps_unique):
        # Every (eps, min_samples) pair was evaluated in the grid search above
        performance_matrix[i, j] = dbscan_results[(eps, ms)]['silhouette']

fig.add_trace(
 go.Heatmap(
 z=performance_matrix,
 x=[f"{eps:.3f}" for eps in eps_unique],
 y=[f"{ms}" for ms in min_samples_unique],
 colorscale='Viridis',
 showscale=True,
 hovertemplate='eps: %{x}<br>min_samples: %{y}<br>Silhouette: %{z:.3f}<extra></extra>'
 ),
 row=2, col=1
)

# Cluster characteristics comparison
cluster_stats = geo_df.groupby('dbscan_cluster').agg({
 'sales_volume': 'mean',
 'foot_traffic': 'mean',
 'competition_nearby': 'mean'
}).reset_index()

cluster_stats = cluster_stats[cluster_stats['dbscan_cluster'] != -1] # Exclude noise

# Note: the three metrics share one y-axis even though their scales differ widely
for i, metric in enumerate(['sales_volume', 'foot_traffic', 'competition_nearby']):
    fig.add_trace(
        go.Bar(
            x=cluster_stats['dbscan_cluster'],
            y=cluster_stats[metric],
            name=metric.replace('_', ' ').title(),
            marker_color=colors[i % len(colors)],
            offsetgroup=i  # axis assignment comes from row/col below
        ),
        row=2, col=2
    )

# Update layout
fig.update_layout(
 height=800,
 title="DBSCAN Clustering Analysis Dashboard",
 mapbox=dict(
 style="open-street-map",
 center=dict(lat=40.7589, lon=-73.9851),
 zoom=10
 )
)

fig.update_xaxes(title_text="Point Index", row=1, col=1)
fig.update_xaxes(title_text="Eps Parameter", row=2, col=1)
fig.update_xaxes(title_text="Cluster ID", row=2, col=2)
fig.update_yaxes(title_text="4-NN Distance", row=1, col=1)
fig.update_yaxes(title_text="Min Samples", row=2, col=1)
fig.update_yaxes(title_text="Average Values", row=2, col=2)

fig.show()

# Business insights and ROI
print("\nDBSCAN BUSINESS INSIGHTS:")

total_revenue = 0
for cluster in unique_clusters:
    if cluster != -1:  # Skip noise
        cluster_data = geo_df[geo_df['dbscan_cluster'] == cluster]
        cluster_size = len(cluster_data)
        avg_sales = cluster_data['sales_volume'].mean()
        avg_traffic = cluster_data['foot_traffic'].mean()
        cluster_revenue = cluster_size * avg_sales
        total_revenue += cluster_revenue

        # Determine cluster type
        if avg_sales > geo_df['sales_volume'].median() and avg_traffic > geo_df['foot_traffic'].median():
            cluster_type = "High-Performance Hub"
        elif avg_sales > geo_df['sales_volume'].median():
            cluster_type = "High-Revenue Zone"
        elif avg_traffic > geo_df['foot_traffic'].median():
            cluster_type = "High-Traffic Area"
        else:
            cluster_type = "Standard Zone"

        print(f"\nCluster {cluster}: {cluster_type}")
        print(f"• Locations: {cluster_size}")
        print(f"• Avg sales: ${avg_sales:,.0f}")
        print(f"• Avg traffic: {avg_traffic:.0f} visitors")
        print(f"• Total revenue: ${cluster_revenue:,.0f}")

# ROI calculation
location_optimization_improvement = 0.20 # 20% improvement from better location strategy
marketing_efficiency_gain = 0.15 # 15% marketing efficiency from targeted clusters
noise_point_investigation_cost = n_noise_final * 1000 # $1000 per noise point investigation

roi_revenue_increase = total_revenue * location_optimization_improvement
marketing_savings = total_revenue * 0.05 * marketing_efficiency_gain # 5% of revenue as marketing
total_benefits = roi_revenue_increase + marketing_savings - noise_point_investigation_cost

implementation_cost = 150_000

print(f"\n DBSCAN CLUSTERING ROI:")
print(f"• Total cluster revenue: ${total_revenue:,.0f}")
print(f"• Location optimization gain: ${roi_revenue_increase:,.0f}")
print(f"• Marketing efficiency savings: ${marketing_savings:,.0f}")
print(f"• Investigation costs: ${noise_point_investigation_cost:,.0f}")
print(f"• Net annual benefits: ${total_benefits:,.0f}")
print(f"• Implementation cost: ${implementation_cost:,.0f}")
print(f"• ROI: {(total_benefits - implementation_cost)/implementation_cost*100:.0f}%")

print(f"\n Cross-Reference Learning Path:")
print(f"• Next: Tier4_PCA.ipynb (dimensionality reduction)")
print(f"• Related: Tier6_AnomalyDetection.ipynb (noise detection)")
print(f"• Advanced: Intermediate_Clustering.ipynb (ensemble methods)")
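
The ROI arithmetic in the cell above can be checked by hand with round numbers. The sketch below wraps the same formulas in a hypothetical helper, `clustering_roi`. The default rates and costs mirror the assumed values in the notebook, while the inputs (10,000,000 in cluster revenue and 200 noise points) are invented purely for illustration.

```python
def clustering_roi(total_revenue, n_noise_points,
                   location_gain=0.20, marketing_share=0.05, marketing_gain=0.15,
                   cost_per_noise_point=1_000, implementation_cost=150_000):
    """Reproduce the notebook's ROI formulas for arbitrary inputs (hypothetical helper)."""
    revenue_increase = total_revenue * location_gain
    marketing_savings = total_revenue * marketing_share * marketing_gain
    investigation_cost = n_noise_points * cost_per_noise_point
    net_benefits = revenue_increase + marketing_savings - investigation_cost
    roi_pct = (net_benefits - implementation_cost) / implementation_cost * 100
    return {
        "revenue_increase": revenue_increase,
        "marketing_savings": marketing_savings,
        "investigation_cost": investigation_cost,
        "net_benefits": net_benefits,
        "roi_pct": roi_pct,
    }


# Made-up inputs, chosen only to keep the arithmetic easy to follow
result = clustering_roi(total_revenue=10_000_000, n_noise_points=200)
for key, value in result.items():
    print(f"{key}: {value:,.0f}")
```

With these inputs the net benefit is 2,000,000 + 75,000 - 200,000 = 1,875,000, and the ROI is (1,875,000 - 150,000) / 150,000, or 1150%. The result is driven almost entirely by the assumed 20% location gain, so that assumption deserves the most scrutiny.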