<a href="https://colab.research.google.com/github/balajisivakumar/stratogrid_Pb/blob/main/Ruff_Determining_Right__Cluster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pandas
!pip install transformers
!pip install torch
!pip install datasets
!pip install chardet

!pip install pymongo
!pip install xlsxwriter
!pip install names spacy
!python -m spacy download en_core_web_sm
!pip install names-dataset
!pip install umap-learn matplotlib seaborn scikit-learn
!pip install faiss-cpu


Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

In [7]:
# ✅ Step 1: Import libraries
import pandas as pd
import numpy as np
import umap
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import time
import warnings

# ✅ Suppress Warnings
warnings.filterwarnings('ignore')

# ✅ Step 2: Load the PCA-reduced embeddings file
file_path = '/content/FAIS_BERT_TO_PCA.pkl'
print("🚀 Loading embeddings...")
df = pd.read_pickle(file_path)
print(f"✅ File loaded: {file_path}")

# ✅ Step 3: Extract PCA embeddings
embedding_cols = [col for col in df.columns if col.startswith('PCA_')]
pca_embeddings = df[embedding_cols].values.astype('float32')

# ✅ Step 4: Updated Search Ranges
n_neighbors_values = [50, 80, 100]
min_dist_values = [0.4, 0.5, 0.6]
min_cluster_size_values = [40, 60, 80]
min_samples_values = [40, 60, 80]
alpha_values = [0.5, 0.8]

best_noise_percentage = 100  # Keep track of best result

# ✅ Loop through combinations
for n_neighbors in n_neighbors_values:
    for min_dist in min_dist_values:
        for min_cluster_size in min_cluster_size_values:
            for min_samples in min_samples_values:
                for alpha in alpha_values:
                    print(f"\n🚀 Trying combination:")
                    print(f"➡️ n_neighbors = {n_neighbors}, min_dist = {min_dist}, min_cluster_size = {min_cluster_size}, min_samples = {min_samples}, alpha = {alpha}")

                    # ✅ Step 5: First UMAP pass → Establish global structure
                    print("🚀 Running global UMAP...")
                    umap_model_global = umap.UMAP(
                        n_neighbors=n_neighbors,
                        min_dist=min_dist,
                        n_components=2,
                        metric='euclidean',
                        n_epochs=2000,
                        random_state=42
                    )

                    with tqdm(total=umap_model_global.n_epochs, desc="Global UMAP Fitting", unit="epoch") as pbar:
                        umap_embeddings_global = umap_model_global.fit_transform(pca_embeddings)
                        pbar.update(umap_model_global.n_epochs)

                    # ✅ Step 6: HDBSCAN clustering → Higher sensitivity, less noise
                    print("🚀 Running HDBSCAN...")
                    clusterer = hdbscan.HDBSCAN(
                        min_cluster_size=min_cluster_size,
                        min_samples=min_samples,
                        cluster_selection_method='leaf',
                        alpha=alpha,
                        metric='euclidean',
                        prediction_data=True
                    )

                    with tqdm(total=len(umap_embeddings_global), desc="Clustering with HDBSCAN", unit="row") as pbar:
                        labels = clusterer.fit_predict(umap_embeddings_global)
                        for _ in range(len(umap_embeddings_global)):
                            pbar.update(1)
                            time.sleep(0.0005)

                    # ✅ Step 7: Save cluster + UMAP to DataFrame
                    df['CLUSTER'] = labels
                    df['UMAP_X'] = umap_embeddings_global[:, 0]
                    df['UMAP_Y'] = umap_embeddings_global[:, 1]

                    # ✅ Step 9: Summary of Results
                    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
                    n_noise = list(labels).count(-1)
                    noise_percentage = (n_noise / len(df)) * 100

                    print("\n🔎 **Clustering Summary:**")
                    print(f"➡️ Total rows processed: {len(df)}")
                    print(f"➡️ Number of clusters: {n_clusters}")
                    print(f"➡️ Noise points (unclustered): {n_noise} ({noise_percentage:.2f}%)")
                    print(f"➡️ Final dimensions after PCA: {pca_embeddings.shape[1]}")

                    # ✅ Step 10: Handle High Noise Warning
                    if noise_percentage > 2:
                        print(f"\n⚠️ **Noise is greater than 2% ({noise_percentage:.2f}%)** — consider increasing `min_samples` or lowering `min_dist`.")
                    else:
                        print(f"\n✅ **Noise is under control at {noise_percentage:.2f}%** — clustering looks good!")

                    # ✅ Keep track of best result
                    if noise_percentage < best_noise_percentage:
                        best_noise_percentage = noise_percentage
                        best_params = {
                            'n_neighbors': n_neighbors,
                            'min_dist': min_dist,
                            'min_cluster_size': min_cluster_size,
                            'min_samples': min_samples,
                            'alpha': alpha,
                            'n_clusters': n_clusters,
                            'noise_percentage': noise_percentage
                        }

# ✅ Final Result Summary
print("\n🏆 **Best Configuration:**")
print(f"➡️ n_neighbors = {best_params['n_neighbors']}")
print(f"➡️ min_dist = {best_params['min_dist']}")
print(f"➡️ min_cluster_size = {best_params['min_cluster_size']}")
print(f"➡️ min_samples = {best_params['min_samples']}")
print(f"➡️ alpha = {best_params['alpha']}")
print(f"➡️ Number of clusters = {best_params['n_clusters']}")
print(f"➡️ Noise percentage = {best_params['noise_percentage']:.2f}%")


🚀 Loading embeddings...
✅ File loaded: /content/FAIS_BERT_TO_PCA.pkl

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 40, min_samples = 40, alpha = 0.5
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:53<00:00,  8.58epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1600.89row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 72
➡️ Noise points (unclustered): 6722 (50.13%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (50.13%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 40, min_samples = 40, alpha = 0.8
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:52<00:00,  8.61epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1600.13row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 72
➡️ Noise points (unclustered): 6722 (50.13%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (50.13%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 40, min_samples = 60, alpha = 0.5
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:51<00:00,  8.63epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1586.60row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 49
➡️ Noise points (unclustered): 6808 (50.77%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (50.77%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 40, min_samples = 60, alpha = 0.8
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [04:16<00:00,  7.81epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1584.57row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 49
➡️ Noise points (unclustered): 6808 (50.77%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (50.77%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 40, min_samples = 80, alpha = 0.5
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:51<00:00,  8.64epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1572.37row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 34
➡️ Noise points (unclustered): 7988 (59.57%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (59.57%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 40, min_samples = 80, alpha = 0.8
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:50<00:00,  8.67epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1576.73row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 34
➡️ Noise points (unclustered): 7988 (59.57%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (59.57%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 60, min_samples = 40, alpha = 0.5
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:51<00:00,  8.65epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1602.39row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 46
➡️ Noise points (unclustered): 6504 (48.50%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (48.50%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 60, min_samples = 40, alpha = 0.8
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:51<00:00,  8.65epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1604.20row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 46
➡️ Noise points (unclustered): 6504 (48.50%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (48.50%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 60, min_samples = 60, alpha = 0.5
🚀 Running global UMAP...


Global UMAP Fitting: 100%|██████████| 2000/2000 [03:49<00:00,  8.70epoch/s]


🚀 Running HDBSCAN...


Clustering with HDBSCAN: 100%|██████████| 13410/13410 [00:08<00:00, 1577.72row/s]



🔎 **Clustering Summary:**
➡️ Total rows processed: 13410
➡️ Number of clusters: 37
➡️ Noise points (unclustered): 6380 (47.58%)
➡️ Final dimensions after PCA: 216

⚠️ **Noise is greater than 2% (47.58%)** — consider increasing `min_samples` or lowering `min_dist`.

🚀 Trying combination:
➡️ n_neighbors = 50, min_dist = 0.4, min_cluster_size = 60, min_samples = 60, alpha = 0.8
🚀 Running global UMAP...


Global UMAP Fitting:   0%|          | 0/2000 [02:59<?, ?epoch/s]


KeyboardInterrupt: 