<a href="https://colab.research.google.com/github/akshatamadavi/data_mining/blob/main/clustering/04_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part D: DBSCAN Clustering using PyCaret

This notebook demonstrates DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering using the PyCaret library. DBSCAN is a density-based clustering algorithm that can discover clusters of arbitrary shapes and identify outliers as noise points.

**Key Features of DBSCAN:**
- Does not require specifying the number of clusters beforehand
- Can identify outliers/noise in the data
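- Can find clusters of arbitrary shape, although a single `eps` value struggles when cluster densities differ widely
- Requires two parameters: `eps` (neighborhood radius) and `min_samples` (minimum number of points within `eps`, counting the point itself, for a point to qualify as a core point)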
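To make the roles of `eps` and `min_samples` concrete, here is a minimal NumPy sketch that labels points as core, border, or noise. The helper `classify_points` and the toy coordinates are hypothetical and only illustrate the definitions; a real DBSCAN implementation additionally links core points into clusters.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' under DBSCAN's definitions."""
    # Pairwise Euclidean distances (fine for a small illustrative dataset).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood membership; each point counts as its own neighbor,
    # which matches scikit-learn's convention for min_samples.
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_samples
    # A border point is not core but lies within eps of at least one core point.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
    [0.35, 0.05],                                     # edge of the group
    [3.0, 3.0],                                       # isolated point
])
kinds = classify_points(points, eps=0.3, min_samples=4)
print(kinds)
```

The four tightly packed points come out as core points, the point at the edge of the group as a border point, and the isolated point as noise.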

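## 1. Install and Import Required Libraries

First, install the PyCaret library and import the packages needed for data manipulation and visualization.

In [None]: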
# Install PyCaret
!pip install pycaret --quiet

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons
from pycaret.clustering import *
import warnings
warnings.filterwarnings('ignore')

print('✓ All libraries imported successfully!')

## 2. Load and Prepare Dataset

We'll create a synthetic dataset with non-linear clusters (moons shape) to demonstrate DBSCAN's ability to identify complex cluster shapes.

In [None]:
# Generate synthetic dataset with non-linear clusters (moons shape)
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)

# Convert to DataFrame
df = pd.DataFrame(X, columns=['Feature_1', 'Feature_2'])

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

# Visualize the raw data
plt.figure(figsize=(10, 6))
plt.scatter(df['Feature_1'], df['Feature_2'], c='gray', alpha=0.6, edgecolors='black', s=50)
plt.title('Original Dataset (Moons Shape) - Before Clustering', fontsize=14, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✓ Dataset created and visualized successfully!")

## 3. Setup PyCaret Clustering Environment

Initialize PyCaret's clustering environment with the dataset. `setup` builds the preprocessing pipeline that later calls such as `create_model` run on, and `session_id` fixes the random seed so results are reproducible.

In [None]:
# Initialize PyCaret clustering environment
# This step preprocesses the data and sets up the clustering pipeline
cluster_setup = setup(data=df, session_id=42, verbose=False)

print("✓ PyCaret clustering environment initialized successfully!")

## 4. Create DBSCAN Clustering Model

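Now we'll create a DBSCAN model using PyCaret's `create_model` function. DBSCAN groups points by density rather than by a preset cluster count. No parameters are passed here, so the model uses its default `eps` and `min_samples`, which the cell prints alongside the number of clusters and noise points found.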
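Because the result depends heavily on `eps`, it can help to see how the cluster and noise counts change across a few values before relying on the defaults. The sketch below uses scikit-learn's `DBSCAN` directly on the same moons data; the helper `summarize_dbscan` and the chosen `eps` values are illustrative assumptions, not part of the notebook's pipeline. It assumes PyCaret's `'dbscan'` model wraps scikit-learn's estimator, whose defaults are `eps=0.5` and `min_samples=5`.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Same synthetic data as the notebook (regenerated so the block is self-contained).
X_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

def summarize_dbscan(X, eps_values, min_samples=5):
    """Return (eps, n_clusters, n_noise) for each eps in eps_values."""
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels) - {-1})   # label -1 marks noise, not a cluster
        n_noise = int((labels == -1).sum())
        rows.append((eps, n_clusters, n_noise))
    return rows

sweep = summarize_dbscan(X_demo, eps_values=[0.1, 0.2, 0.3, 0.5])
for eps, n_clusters, n_noise in sweep:
    print(f"eps={eps:.1f}: {n_clusters} cluster(s), {n_noise} noise point(s)")
```

If a smaller radius separates the two moons better, PyCaret's `create_model` is documented to forward extra keyword arguments to the underlying estimator, so something like `create_model('dbscan', eps=0.2)` should work; verify this against the installed PyCaret version.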

In [None]:
# Create DBSCAN model using PyCaret
dbscan_model = create_model('dbscan')

print("✓ DBSCAN model created successfully!")
print("\nModel Details:")
print(f"Number of clusters found: {len(set(dbscan_model.labels_)) - (1 if -1 in dbscan_model.labels_ else 0)}")
print(f"Number of noise points: {list(dbscan_model.labels_).count(-1)}")
print(f"eps (neighborhood radius): {dbscan_model.eps}")
print(f"min_samples: {dbscan_model.min_samples}")

## 5. Visualize Clustering Results and Evaluate Quality

Visualize the DBSCAN clustering results and evaluate the clustering quality using various metrics.

In [None]:
# Get cluster predictions
clusters = assign_model(dbscan_model)

# Visualize clustering using PyCaret's plot_model
print("Generating clustering visualizations...\n")
plot_model(dbscan_model)

# Custom visualization with matplotlib
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Clusters with colors
for cluster_id in sorted(set(dbscan_model.labels_)):
    cluster_mask = dbscan_model.labels_ == cluster_id
    if cluster_id == -1:
        # Noise points drawn as black crosses
        axes[0].scatter(df.loc[cluster_mask, 'Feature_1'], df.loc[cluster_mask, 'Feature_2'],
                       c='black', marker='x', s=50, alpha=0.5, label='Noise')
    else:
        # Color by cluster ID so both panels use the same color per cluster
        axes[0].scatter(df.loc[cluster_mask, 'Feature_1'], df.loc[cluster_mask, 'Feature_2'],
                       color=f'C{cluster_id}', s=100, alpha=0.6, edgecolors='black',
                       label=f'Cluster {cluster_id}')

axes[0].set_title('DBSCAN Clustering Results', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Cluster sizes
cluster_counts = pd.Series(dbscan_model.labels_).value_counts().sort_index()
colors = ['black' if x == -1 else f'C{x}' for x in cluster_counts.index]
# Use the cluster IDs as tick labels so the x-axis matches its label (-1 = noise)
axes[1].bar(cluster_counts.index.astype(str), cluster_counts.values, color=colors, edgecolor='black')
axes[1].set_title('Cluster Sizes', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Cluster ID (-1 = Noise)')
axes[1].set_ylabel('Number of Points')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Evaluate clustering quality
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Filter out noise points; all three metrics require at least two clusters
mask = dbscan_model.labels_ != -1
n_clusters = len(set(dbscan_model.labels_[mask]))
if n_clusters >= 2:
    silhouette_avg = silhouette_score(df[mask], dbscan_model.labels_[mask])
    db_score = davies_bouldin_score(df[mask], dbscan_model.labels_[mask])
    ch_score = calinski_harabasz_score(df[mask], dbscan_model.labels_[mask])

    print("\n" + "="*50)
    print("CLUSTERING QUALITY METRICS")
    print("="*50)
    print(f"Silhouette Score: {silhouette_avg:.4f} (higher is better, range: [-1, 1])")
    print(f"Davies-Bouldin Index: {db_score:.4f} (lower is better)")
    print(f"Calinski-Harabasz Score: {ch_score:.4f} (higher is better)")
    print("="*50)
else:
    print(f"\nFound {n_clusters} cluster(s) after removing noise; the metrics need at least 2.")

print("\n✓ Visualization and evaluation completed successfully!")

## 6. Conclusion

This notebook demonstrated DBSCAN clustering using the PyCaret library. Key takeaways:

**Advantages of DBSCAN:**
- Discovers the number of clusters from the data, so no cluster count needs to be specified beforehand
- Identifies outliers/noise points explicitly (label `-1`)
- Handles non-convex cluster shapes (as shown with the moons dataset)

**PyCaret Benefits:**
- Simple and intuitive API for clustering tasks
- Automated preprocessing and setup
- Built-in visualization and evaluation tools
- Reduces code complexity significantly

**Clustering Quality:**
The evaluation metrics (Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Score) provide quantitative measures of how well-separated and compact the clusters are.
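These internal metrics reward compact, roughly convex clusters, so a correct split of the two moons can still score modestly. Since the data here is synthetic and `make_moons` also returns the true labels, an external measure such as the Adjusted Rand Index offers a complementary check. The sketch below is one way to compute it, assuming scikit-learn's `DBSCAN` with an illustrative `eps=0.2` (not a tuned value); noise points (label `-1`) are simply treated as their own group.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Regenerate the notebook's data, this time keeping the ground-truth labels.
X_demo, y_true = make_moons(n_samples=300, noise=0.1, random_state=42)

# eps=0.2 is an illustrative choice, not a recommendation.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)

# ARI is 1.0 for identical partitions and close to 0 for random labelings;
# it ignores how the cluster IDs themselves are numbered.
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand Index vs. true moons: {ari:.4f}")
```

On real data without ground truth this check is not available, and the internal metrics above remain the main quantitative guide.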

For production use, consider tuning the `eps` and `min_samples` parameters based on your specific dataset characteristics.
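A common heuristic for choosing `eps` is the k-distance curve: for each point, compute the distance to its `min_samples`-th nearest point, sort those distances, and look for the "elbow" where they start rising sharply. The sketch below computes the sorted distances with scikit-learn's `NearestNeighbors` on the moons data; the percentile summary is just one convenient way to inspect the curve without plotting, and the value `min_samples = 5` is an assumption matching scikit-learn's default.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
min_samples = 5  # assumed value; match whatever the DBSCAN model will use

# Querying the training data returns each point as its own nearest neighbor
# (distance 0), so the last column is the distance to the min_samples-th
# point when the point itself is counted.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# Summarize the curve; plotting k_distances against its index shows the elbow.
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")
```

An `eps` near the elbow treats most points as belonging to dense regions while leaving the sparse tail as noise; treat the result as a starting point and confirm it by inspecting the resulting clusters.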
