# DBSCAN Clustering using PyCaret

## Assignment Part D

This notebook demonstrates DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering using the PyCaret library.

### What is DBSCAN?
DBSCAN is a density-based clustering algorithm that:
- Groups points that are closely packed
- Marks points in low-density regions as outliers
- Can find arbitrarily shaped clusters
- Does not require specifying the number of clusters beforehand
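
As a minimal sketch of these properties, the snippet below runs scikit-learn's `DBSCAN` directly (not PyCaret) on a hand-made toy array. The data and the `eps`/`min_samples` values are illustrative assumptions, not part of the assignment dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups of three points each, plus one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0],
])

# No number of clusters is passed; only the density parameters.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# The label -1 marks noise; every other label is a cluster id.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"labels: {labels.tolist()}")
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

The isolated point receives the label `-1`, which is how DBSCAN reports outliers.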

### Why PyCaret?
PyCaret is a low-code machine learning library that simplifies the clustering workflow with:
- Automated preprocessing
- Multiple clustering algorithms
- Built-in evaluation metrics
- Visualization tools

## 1. Installation and Setup

In [None]:
# Install required libraries with proper dependency handling
# First, uninstall and reinstall numpy to ensure compatibility
!pip uninstall numpy -y -q
!pip install numpy==1.23.5 -q
!pip install pycaret -q
!pip install plotly -q
!pip install scikit-learn pandas matplotlib seaborn -q

print("Installation complete!")
print("Note: If you still see warnings, restart the runtime (Runtime -> Restart runtime) and run again.")

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_moons, make_blobs, make_circles
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from pycaret.clustering import *
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 2. Load and Prepare Data

We'll use two kinds of data to show how DBSCAN handles non-spherical clusters:
1. A customer segmentation dataset (Mall Customers) loaded from a public URL
2. Synthetic datasets (moons, circles, blobs) that highlight DBSCAN's capabilities

In [None]:
# Load customer segmentation dataset from online source
url = "https://raw.githubusercontent.com/dphi-official/Datasets/master/Mall_Customers.csv"
data = pd.read_csv(url)

print("Dataset shape:", data.shape)
print("\nFirst few rows:")
display(data.head())

print("\nDataset info:")
print(data.info())

print("\nBasic statistics:")
display(data.describe())

In [None]:
# Visualize the data distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Age distribution
axes[0, 0].hist(data['Age'], bins=20, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Age Distribution', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Income distribution
axes[0, 1].hist(data['Annual Income (k$)'], bins=20, edgecolor='black', alpha=0.7, color='orange')
axes[0, 1].set_title('Annual Income Distribution', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Annual Income (k$)')
axes[0, 1].set_ylabel('Frequency')

# Spending Score distribution
axes[1, 0].hist(data['Spending Score (1-100)'], bins=20, edgecolor='black', alpha=0.7, color='green')
axes[1, 0].set_title('Spending Score Distribution', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Spending Score')
axes[1, 0].set_ylabel('Frequency')

# Income vs Spending Score scatter
axes[1, 1].scatter(data['Annual Income (k$)'], data['Spending Score (1-100)'], alpha=0.6)
axes[1, 1].set_title('Income vs Spending Score', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Annual Income (k$)')
axes[1, 1].set_ylabel('Spending Score (1-100)')

plt.tight_layout()
plt.show()

## 3. PyCaret Setup

Initialize the PyCaret clustering environment. This step handles:
- Data preprocessing
- Feature scaling (enabled here with `normalize=True`)
- Missing value treatment
- Categorical encoding
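
Scaling matters for DBSCAN because `eps` is a distance threshold shared by all features. The sketch below shows what z-score scaling does, using scikit-learn's `StandardScaler` on a few made-up rows. It assumes that PyCaret's `normalize=True` applies z-score scaling by default; the values in `X` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (income in k$, spending score) rows, for illustration only.
X = np.array([[15.0, 39.0], [40.0, 81.0], [70.0, 6.0], [120.0, 77.0]])

# Library version of z-score scaling.
X_scaled = StandardScaler().fit_transform(X)

# The same transformation written out: subtract the column mean and
# divide by the column standard deviation (population std, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, each feature has mean 0 and standard deviation 1, so a single `eps` treats both axes comparably.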

In [None]:
# Select features for clustering
# We'll use Annual Income and Spending Score for clear visualization
clustering_data = data[['Annual Income (k$)', 'Spending Score (1-100)']]

# Setup PyCaret environment
cluster_setup = setup(
    data=clustering_data,
    normalize=True,  # Normalize features
    session_id=123,  # For reproducibility
    verbose=False
)

print("PyCaret setup complete!")

## 4. Create and Train DBSCAN Model

### DBSCAN Parameters:
- **eps (epsilon)**: Maximum distance between two samples for one to be considered in the neighborhood of the other
- **min_samples**: Minimum number of samples (including the point itself) within a point's eps-neighborhood for it to be considered a core point
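
To make the two parameters concrete, here is a small sketch that labels each point as core, border, or noise using scikit-learn's `DBSCAN` and its `core_sample_indices_` attribute. The five points and the parameter values are chosen by hand for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four points spaced 1 apart on a line, plus one far-away point.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])

db = DBSCAN(eps=1.1, min_samples=3).fit(X)

# Core points have at least min_samples points (themselves included) within eps.
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True

# Noise points belong to no cluster; border points are in a cluster but not core.
is_noise = db.labels_ == -1
is_border = ~is_core & ~is_noise

roles = np.where(is_core, "core", np.where(is_border, "border", "noise"))
for point, role in zip(X, roles):
    print(point, role)
```

The two middle points each have three points within `eps` and become core points. The two end points of the line have only two, so they are border points attached to a neighboring core point, and the distant point is noise.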

In [None]:
# Create DBSCAN model with PyCaret
dbscan_model = create_model(
    'dbscan',
    num_clusters=None  # DBSCAN determines clusters automatically
)

print("\nDBSCAN Model created successfully!")
print(f"Model parameters: {dbscan_model}")

In [None]:
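# Assign clusters to the data
result = assign_model(dbscan_model)

# PyCaret returns labels as strings such as 'Cluster 0' (noise is 'Cluster -1').
# Convert them to integers so the numeric checks and plots below work.
result['Cluster'] = (
    result['Cluster'].astype(str).str.replace('Cluster ', '', regex=False).astype(int)
)

print("\nClustering results:")
display(result.head(10))

# Check cluster distribution
print("\nCluster distribution:")
print(result['Cluster'].value_counts().sort_index())

# Note: Cluster -1 represents noise/outliers in DBSCAN
n_clusters = result.loc[result['Cluster'] != -1, 'Cluster'].nunique()
n_noise = int((result['Cluster'] == -1).sum())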

print(f"\nNumber of clusters: {n_clusters}")
print(f"Number of noise points (outliers): {n_noise}")
print(f"Percentage of outliers: {(n_noise/len(result))*100:.2f}%")

## 5. Visualize Clustering Results

In [None]:
# Use PyCaret's built-in visualization
print("Cluster Visualization:")
plot_model(dbscan_model, plot='cluster')

In [None]:
# Distribution plot
print("Cluster Distribution:")
plot_model(dbscan_model, plot='distribution')

In [None]:
# Custom detailed visualization
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Plot 1: Clusters with different colors
scatter = axes[0].scatter(
    result['Annual Income (k$)'],
    result['Spending Score (1-100)'],
    c=result['Cluster'],
    cmap='viridis',
    s=100,
    alpha=0.6,
    edgecolors='black'
)
axes[0].set_xlabel('Annual Income (k$)', fontsize=12)
axes[0].set_ylabel('Spending Score (1-100)', fontsize=12)
axes[0].set_title('DBSCAN Clustering Results', fontsize=14, fontweight='bold')
plt.colorbar(scatter, ax=axes[0], label='Cluster')

# Plot 2: Highlight outliers
outliers = result[result['Cluster'] == -1]
non_outliers = result[result['Cluster'] != -1]

axes[1].scatter(
    non_outliers['Annual Income (k$)'],
    non_outliers['Spending Score (1-100)'],
    c=non_outliers['Cluster'],
    cmap='viridis',
    s=100,
    alpha=0.6,
    edgecolors='black',
    label='Clusters'
)

if len(outliers) > 0:
    axes[1].scatter(
        outliers['Annual Income (k$)'],
        outliers['Spending Score (1-100)'],
        c='red',
        s=150,
        alpha=0.8,
        marker='x',  # unfilled marker, so no edgecolors argument
        linewidths=2,
        label='Outliers'
    )

axes[1].set_xlabel('Annual Income (k$)', fontsize=12)
axes[1].set_ylabel('Spending Score (1-100)', fontsize=12)
axes[1].set_title('DBSCAN with Outliers Highlighted', fontsize=14, fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.show()

## 6. Clustering Quality Evaluation

We'll evaluate the clustering using multiple metrics:
- **Silhouette Score**: Measures how similar an object is to its own cluster compared to other clusters (-1 to 1, higher is better)
- **Davies-Bouldin Index**: Average similarity ratio of each cluster with its most similar cluster (lower is better)
- **Calinski-Harabasz Index**: Ratio of between-cluster to within-cluster dispersion (higher is better)
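
To show what the silhouette score measures, the sketch below computes it by hand for one point and compares the result with scikit-learn's `silhouette_samples` and `silhouette_score`. The four points and their labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two clusters of two points each, 4 units apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# For point 0:
#   a = mean distance to the other points in its own cluster
#   b = mean distance to the points in the nearest other cluster
a = np.linalg.norm(X[0] - X[1])
b = np.mean([np.linalg.norm(X[0] - X[2]), np.linalg.norm(X[0] - X[3])])
s0 = (b - a) / max(a, b)

print(f"manual silhouette for point 0: {s0:.4f}")
print(f"sklearn silhouette_samples[0]: {silhouette_samples(X, labels)[0]:.4f}")
print(f"mean silhouette (all points):  {silhouette_score(X, labels):.4f}")
```

A value close to 1 means the point is much closer to its own cluster than to the nearest other one. The overall silhouette score is the mean of these per-point values.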

In [None]:
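# Prepare data for evaluation: noise points (label -1) are excluded from all metrics
# Note: assign_model returns the original (unscaled) features by default, so the
# scores are computed in the original units, not on the normalized data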
non_outlier_data = result[result['Cluster'] != -1]
X_eval = non_outlier_data[['Annual Income (k$)', 'Spending Score (1-100)']].values
labels_eval = non_outlier_data['Cluster'].values

# Calculate clustering metrics
if len(non_outlier_data['Cluster'].unique()) > 1:
    silhouette = silhouette_score(X_eval, labels_eval)
    davies_bouldin = davies_bouldin_score(X_eval, labels_eval)
    calinski_harabasz = calinski_harabasz_score(X_eval, labels_eval)
    
    print("="*60)
    print("CLUSTERING QUALITY METRICS")
    print("="*60)
    print(f"\nSilhouette Score: {silhouette:.4f}")
    print("  → Range: [-1, 1], Higher is better")
    print("  → Interpretation: Measures how similar points are to their own cluster")
    
    print(f"\nDavies-Bouldin Index: {davies_bouldin:.4f}")
    print("  → Range: [0, ∞), Lower is better")
    print("  → Interpretation: Average similarity between each cluster and its most similar one")
    
    print(f"\nCalinski-Harabasz Score: {calinski_harabasz:.4f}")
    print("  → Range: [0, ∞), Higher is better")
    print("  → Interpretation: Ratio of between-cluster to within-cluster dispersion")
    print("="*60)
    
    # Create a summary DataFrame
    metrics_df = pd.DataFrame({
        'Metric': ['Silhouette Score', 'Davies-Bouldin Index', 'Calinski-Harabasz Score'],
        'Value': [silhouette, davies_bouldin, calinski_harabasz],
        'Interpretation': ['Higher is better', 'Lower is better', 'Higher is better']
    })
    
    print("\nMetrics Summary:")
    display(metrics_df)
else:
    print("Fewer than two clusters found (excluding noise). Metrics cannot be calculated.")

In [None]:
# Detailed cluster statistics
print("\nDetailed Cluster Statistics:")
print("="*80)

for cluster_id in sorted(result['Cluster'].unique()):
    cluster_data = result[result['Cluster'] == cluster_id]
    
    if cluster_id == -1:
        print(f"\nOUTLIERS (Cluster {cluster_id}):")
    else:
        print(f"\nCLUSTER {cluster_id}:")
    
    print(f"  Size: {len(cluster_data)} points ({len(cluster_data)/len(result)*100:.2f}%)")
    print(f"  Income - Mean: ${cluster_data['Annual Income (k$)'].mean():.2f}k, "
          f"Std: ${cluster_data['Annual Income (k$)'].std():.2f}k")
    print(f"  Spending - Mean: {cluster_data['Spending Score (1-100)'].mean():.2f}, "
          f"Std: {cluster_data['Spending Score (1-100)'].std():.2f}")
    print("-"*80)

## 7. DBSCAN on Synthetic Datasets

Demonstrate DBSCAN's strength in finding non-spherical clusters using synthetic datasets.
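
As a quick check of this claim outside PyCaret, the sketch below compares scikit-learn's `KMeans` and `DBSCAN` on the moons data, scoring each against the known generating labels with the adjusted Rand index (ARI). The `eps=0.2` value is an assumption tuned by hand for the unscaled moons data, not a general recommendation.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles with known ground-truth labels.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means assumes roughly spherical clusters around centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows dense regions, whatever their shape.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# ARI is 1.0 for a perfect match and close to 0 for a random labeling.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

ARI is only available here because the synthetic data comes with true labels; for the customer data, the internal metrics from Section 6 are used instead.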

In [None]:
# Generate synthetic datasets
np.random.seed(42)

# 1. Moons dataset (non-spherical)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# 2. Circles dataset (non-spherical)
X_circles, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# 3. Blobs dataset (spherical)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

datasets = [
    ('Moons', X_moons),
    ('Circles', X_circles),
    ('Blobs', X_blobs)
]

fig, axes = plt.subplots(2, 3, figsize=(20, 12))

for idx, (name, X) in enumerate(datasets):
    # Original data
    axes[0, idx].scatter(X[:, 0], X[:, 1], alpha=0.6, s=50)
    axes[0, idx].set_title(f'{name} - Original Data', fontsize=14, fontweight='bold')
    axes[0, idx].set_xlabel('Feature 1')
    axes[0, idx].set_ylabel('Feature 2')
    
    # DBSCAN clustering
    df_temp = pd.DataFrame(X, columns=['feature1', 'feature2'])
    
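    # Setup and cluster
    s = setup(data=df_temp, normalize=True, session_id=123, verbose=False)
    model = create_model('dbscan', num_clusters=None, verbose=False)
    result_temp = assign_model(model)
    # Convert PyCaret's 'Cluster N' string labels to integers for plotting
    result_temp['Cluster'] = (
        result_temp['Cluster'].astype(str).str.replace('Cluster ', '', regex=False).astype(int)
    )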
    
    # Plot clustered data
    scatter = axes[1, idx].scatter(
        result_temp['feature1'],
        result_temp['feature2'],
        c=result_temp['Cluster'],
        cmap='viridis',
        alpha=0.6,
        s=50,
        edgecolors='black'
    )
    axes[1, idx].set_title(f'{name} - DBSCAN Clustering', fontsize=14, fontweight='bold')
    axes[1, idx].set_xlabel('Feature 1')
    axes[1, idx].set_ylabel('Feature 2')
    plt.colorbar(scatter, ax=axes[1, idx], label='Cluster')
    
    # Print cluster info
    n_clusters = len(result_temp['Cluster'].unique()) - (1 if -1 in result_temp['Cluster'].values else 0)
    n_noise = list(result_temp['Cluster']).count(-1)
    print(f"{name}: {n_clusters} clusters, {n_noise} outliers")

plt.tight_layout()
plt.show()

## 8. Hyperparameter Tuning

Experiment with different eps and min_samples values to optimize clustering.
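
One common heuristic for narrowing the range of `eps` values to try is the k-distance curve: sort every point's distance to its k-th nearest neighbor and look for the "elbow" where the curve rises sharply. The sketch below uses scikit-learn's `NearestNeighbors` on synthetic blobs; the data and `min_samples=5` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, scaled the same way as the clustering input.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

min_samples = 5

# Querying the training data returns each point as its own nearest neighbor,
# so n_neighbors=min_samples matches DBSCAN's convention of counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from each point to its farthest of those neighbors, in ascending order.
k_distances = np.sort(distances[:, -1])

# Percentiles give a rough numeric view of where the curve starts to climb.
for p in (50, 90, 95):
    print(f"{p}th percentile k-distance: {np.percentile(k_distances, p):.3f}")
```

Plotting `k_distances` with `plt.plot(k_distances)` and reading off the elbow gives a starting value for `eps`, which the grid search can then refine.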

In [None]:
# Tune DBSCAN with different parameters
from sklearn.cluster import DBSCAN

# Standardize the features explicitly so eps is measured in z-score units.
# Depending on the PyCaret version, get_config('X_train') may return the
# untransformed data, so we do not rely on it here. Z-score scaling is
# assumed to match PyCaret's default normalization.
from sklearn.preprocessing import StandardScaler
X_normalized = StandardScaler().fit_transform(clustering_data)

# Try different parameter combinations
eps_values = [0.3, 0.5, 0.7, 1.0]
min_samples_values = [3, 5, 10]

results_tuning = []

for eps in eps_values:
    for min_samples in min_samples_values:
        # Create DBSCAN with custom parameters
        dbscan_custom = DBSCAN(eps=eps, min_samples=min_samples)
        
        # Fit and predict
        labels = dbscan_custom.fit_predict(X_normalized)
        
        # Calculate metrics
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = list(labels).count(-1)
        
        # Silhouette needs at least two clusters; noise points are excluded
        if n_clusters > 1:
            non_noise_idx = labels != -1
            sil_score = silhouette_score(X_normalized[non_noise_idx], labels[non_noise_idx])
        else:
            sil_score = np.nan  # undefined; NaN rows are sorted last below
        
        results_tuning.append({
            'eps': eps,
            'min_samples': min_samples,
            'n_clusters': n_clusters,
            'n_noise': n_noise,
            'noise_pct': (n_noise/len(labels))*100,
            'silhouette': sil_score
        })

# Display tuning results
tuning_df = pd.DataFrame(results_tuning)
tuning_df = tuning_df.sort_values('silhouette', ascending=False)

print("\nHyperparameter Tuning Results:")
print("="*80)
display(tuning_df)

# Find best parameters
best_params = tuning_df.iloc[0]
print(f"\nBest Parameters:")
print(f"  eps: {best_params['eps']}")
print(f"  min_samples: {int(best_params['min_samples'])}")
print(f"  Silhouette Score: {best_params['silhouette']:.4f}")
print(f"  Number of Clusters: {int(best_params['n_clusters'])}")
print(f"  Noise Points: {int(best_params['n_noise'])} ({best_params['noise_pct']:.2f}%)")

## 9. Compare DBSCAN with Other Clustering Algorithms

In [None]:
# Reset setup
s = setup(data=clustering_data, normalize=True, session_id=123, verbose=False)

# Create multiple clustering models
print("Creating and comparing different clustering algorithms...\n")

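# K-Means
kmeans = create_model('kmeans', num_clusters=5, verbose=False)
kmeans_result = assign_model(kmeans)

# DBSCAN
dbscan = create_model('dbscan', verbose=False)
dbscan_result = assign_model(dbscan)

# Hierarchical Clustering
hclust = create_model('hclust', num_clusters=5, verbose=False)
hclust_result = assign_model(hclust)

# Convert PyCaret's 'Cluster N' string labels to integers for plotting and metrics
for df in (kmeans_result, dbscan_result, hclust_result):
    df['Cluster'] = df['Cluster'].astype(str).str.replace('Cluster ', '', regex=False).astype(int)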

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# K-Means
scatter1 = axes[0].scatter(
    kmeans_result['Annual Income (k$)'],
    kmeans_result['Spending Score (1-100)'],
    c=kmeans_result['Cluster'],
    cmap='viridis',
    s=100,
    alpha=0.6,
    edgecolors='black'
)
axes[0].set_title('K-Means Clustering', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Annual Income (k$)')
axes[0].set_ylabel('Spending Score (1-100)')
plt.colorbar(scatter1, ax=axes[0])

# DBSCAN
scatter2 = axes[1].scatter(
    dbscan_result['Annual Income (k$)'],
    dbscan_result['Spending Score (1-100)'],
    c=dbscan_result['Cluster'],
    cmap='viridis',
    s=100,
    alpha=0.6,
    edgecolors='black'
)
axes[1].set_title('DBSCAN Clustering', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Annual Income (k$)')
axes[1].set_ylabel('Spending Score (1-100)')
plt.colorbar(scatter2, ax=axes[1])

# Hierarchical
scatter3 = axes[2].scatter(
    hclust_result['Annual Income (k$)'],
    hclust_result['Spending Score (1-100)'],
    c=hclust_result['Cluster'],
    cmap='viridis',
    s=100,
    alpha=0.6,
    edgecolors='black'
)
axes[2].set_title('Hierarchical Clustering', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Annual Income (k$)')
axes[2].set_ylabel('Spending Score (1-100)')
plt.colorbar(scatter3, ax=axes[2])

plt.tight_layout()
plt.show()

# Compare metrics (noise points are excluded; undefined metrics are reported as NaN)
comparison_data = []
for name, result_df in [('K-Means', kmeans_result), ('DBSCAN', dbscan_result), ('Hierarchical', hclust_result)]:
    # Accept both integer labels and PyCaret's 'Cluster N' strings
    cluster_ids = result_df['Cluster'].astype(str).str.replace('Cluster ', '', regex=False).astype(int)
    mask = cluster_ids != -1
    X_eval = result_df.loc[mask, ['Annual Income (k$)', 'Spending Score (1-100)']].values
    labels_eval = cluster_ids[mask].values
    n_found = len(set(labels_eval))
    
    if n_found > 1:
        sil = silhouette_score(X_eval, labels_eval)
        db = davies_bouldin_score(X_eval, labels_eval)
        ch = calinski_harabasz_score(X_eval, labels_eval)
    else:
        sil, db, ch = np.nan, np.nan, np.nan
    
    comparison_data.append({
        'Algorithm': name,
        'Clusters': n_found,  # noise is not counted as a cluster
        'Silhouette': sil,
        'Davies-Bouldin': db,
        'Calinski-Harabasz': ch
    })

comparison_df = pd.DataFrame(comparison_data)
print("\nAlgorithm Comparison:")
display(comparison_df)

## 10. Conclusion

### Key Takeaways:

1. **DBSCAN Advantages:**
   - Does not require specifying the number of clusters
   - Can find arbitrarily shaped clusters
   - Handles outliers explicitly by labeling them as noise
   - Works well with non-spherical cluster shapes

2. **DBSCAN Limitations:**
   - Sensitive to eps and min_samples parameters
   - Struggles when clusters have very different densities, because a single eps applies everywhere
   - Less effective on high-dimensional data, where distances become less informative (curse of dimensionality)

3. **When to Use DBSCAN:**
   - Unknown number of clusters
   - Non-spherical cluster shapes
   - Need to identify outliers
   - Varying cluster sizes

4. **PyCaret Benefits:**
   - Simplified workflow with automatic preprocessing
   - Built-in visualization tools
   - Easy model comparison
   - Consistent API across different algorithms

### Clustering Quality Summary:
The evaluation metrics help us understand:
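- **Silhouette Score**: How well each point fits its own cluster relative to the nearest other cluster
- **Davies-Bouldin Index**: Average similarity between each cluster and its most similar one (lower means better separation)
- **Calinski-Harabasz Score**: Ratio of between-cluster to within-cluster dispersion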

### Next Steps:
- Experiment with different eps and min_samples values
- Try HDBSCAN for varying density clusters
- Apply dimensionality reduction before clustering high-dimensional data
- Use domain knowledge to interpret and validate clusters
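
As a starting point for the dimensionality-reduction step, the sketch below scales a 20-feature synthetic dataset, projects it to two principal components with scikit-learn's `PCA`, and then runs `DBSCAN` on the projection. The dataset, the number of components, and the `eps`/`min_samples` values are all illustrative assumptions that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data: 300 samples, 20 features.
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=42)

# Scale first so no feature dominates the principal components,
# then keep the two directions with the largest variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=42))
X_2d = reducer.fit_transform(X)

# Cluster in the reduced space, where eps is easier to reason about.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_2d)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Reduced shape: {X_2d.shape}")
print(f"Clusters: {n_clusters}, noise points: {n_noise}")
```

The fraction of variance kept by the projection is available from `reducer[-1].explained_variance_ratio_`, which helps judge whether two components are enough.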