# HDBSCAN
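Evaluate HDBSCAN, a density-based clustering algorithm that does not need the number of clusters specified in advance, on encodings produced by a SIMS model.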

## References
[How HDBSCAN Works](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)

[Grok Chat](https://grok.com/share/bGVnYWN5_cdb89d47-1ebf-4821-9f2f-d7d1563010e6)

In [21]:
!pip install -q hdbscan scikit-learn numpy onnx skl2onnx onnxruntime

In [9]:
import numpy as np
from hdbscan import HDBSCAN
from sklearn.datasets import make_blobs  # Synthetic data for quick tests; the cells below use SIMS encodings

## Data
Use a SIMS model to generate predicted labels and encodings for the training set (various cortical areas) and the test set (M1).

In [6]:
!cd .. && python scripts/cluster.py predict  \
    public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx \
    checkpoints/allen-celltypes+human-cortex+various-cortical-areas.h5ad \
    --num-samples 1000

Generating predictions and encodings for checkpoints/allen-celltypes+human-cortex+various-cortical-areas.h5ad using model public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx
2162 genes in the sample and not in the model
Processing 1000 cells with batch size 32
100%|██████████████████████████████████████| 1000/1000 [00:03<00:00, 326.66it/s]
Saved encodings to checkpoints/allen-celltypes+human-cortex+various-cortical-areas-encodings.npy
Saved predictions to checkpoints/allen-celltypes+human-cortex+various-cortical-areas-predictions.npy


In [7]:
!cd .. && python scripts/cluster.py predict  \
    public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx \
    data/allen-celltypes+human-cortex+m1.h5ad \
    --num-samples 1000

Generating predictions and encodings for data/allen-celltypes+human-cortex+m1.h5ad using model public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx
2162 genes in the sample and not in the model
Processing 1000 cells with batch size 32
100%|██████████████████████████████████████| 1000/1000 [00:03<00:00, 324.24it/s]
Saved encodings to data/allen-celltypes+human-cortex+m1-encodings.npy
Saved predictions to data/allen-celltypes+human-cortex+m1-predictions.npy


In [16]:
X_train = np.load("../checkpoints/allen-celltypes+human-cortex+various-cortical-areas-encodings.npy")
Y_train = np.load("../checkpoints/allen-celltypes+human-cortex+various-cortical-areas-predictions.npy")

X_test = np.load("../data/allen-celltypes+human-cortex+m1-encodings.npy")
Y_test = np.load("../data/allen-celltypes+human-cortex+m1-predictions.npy")


with open("../public/models/allen-celltypes+human-cortex+various-cortical-areas.classes", "r") as f:
    labels = [line.strip() for line in f]
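
Before clustering, it can help to confirm that the encodings and predictions line up row for row. The following is a minimal sketch, assuming the predictions are stored as a 1-D array of integer class indices into the `.classes` file (the notebook does not show their format). `check_alignment` is a hypothetical helper, and the arrays and class names are synthetic stand-ins rather than the real files.

```python
import numpy as np


def check_alignment(encodings, predictions, class_names):
    """Verify that the arrays line up and return the predicted class names.

    Assumes `predictions` holds integer class indices (an assumption,
    not something the notebook confirms).
    """
    encodings = np.asarray(encodings)
    predictions = np.asarray(predictions)
    if encodings.ndim != 2:
        raise ValueError(f"expected 2-D encodings, got shape {encodings.shape}")
    if predictions.shape != (encodings.shape[0],):
        raise ValueError(
            f"{encodings.shape[0]} encodings but predictions have shape {predictions.shape}"
        )
    if predictions.size and (predictions.min() < 0 or predictions.max() >= len(class_names)):
        raise ValueError("prediction index falls outside the class list")
    return [class_names[i] for i in predictions]


# Synthetic stand-ins for X_train, Y_train and labels
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(5, 8))
demo_Y = np.array([0, 2, 1, 1, 0])
demo_classes = ["type_a", "type_b", "type_c"]  # hypothetical class names

demo_names = check_alignment(demo_X, demo_Y, demo_classes)
print(demo_names)
```

If the real predictions turn out to be per-class scores rather than indices, taking `np.argmax(Y_train, axis=1)` first would be the likely adjustment.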

In [15]:
# Fit HDBSCAN on the training-set encodings for this model
hdbscan_model = HDBSCAN(
    min_cluster_size=100,  # Smallest group treated as a cluster; tune for your data
    min_samples=100,       # Higher values label more points as noise; tune as needed
    prediction_data=True,  # Required for approximate_predict
)
hdbscan_model.fit(X_train)

# Get cluster labels for training data
train_labels = hdbscan_model.labels_
print(f"Number of clusters found: {len(np.unique(train_labels)) - (1 if -1 in train_labels else 0)}")

Number of clusters found: 3
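
The cluster count alone does not show how many cells were labeled as noise or how large each cluster is. The following is a minimal sketch of a summary that could be applied to `train_labels`. `summarize_clusters` is a hypothetical helper, and the label array here is a made-up example, not output from the run above.

```python
import numpy as np


def summarize_clusters(cluster_labels):
    """Return ({cluster_id: size}, noise_fraction) for HDBSCAN-style labels.

    HDBSCAN marks noise points with the label -1.
    """
    cluster_labels = np.asarray(cluster_labels)
    is_noise = cluster_labels == -1
    ids, counts = np.unique(cluster_labels[~is_noise], return_counts=True)
    sizes = {int(i): int(c) for i, c in zip(ids, counts)}
    noise_fraction = float(is_noise.mean()) if cluster_labels.size else 0.0
    return sizes, noise_fraction


# Made-up labels for illustration only
demo_labels = np.array([0, 0, 1, -1, 1, 1, -1, 0])
sizes, noise_fraction = summarize_clusters(demo_labels)
print(sizes, noise_fraction)  # {0: 3, 1: 3} 0.25
```

A high noise fraction would suggest that `min_cluster_size` or `min_samples` is too large for a 1000-cell sample.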


In [18]:
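# Assign test cells to the clusters learned from the training set, without refitting.
# A label of -1 marks noise; strengths holds each cell's membership strength.
from hdbscan import approximate_predict
test_labels, strengths = approximate_predict(hdbscan_model, X_test)

# Counts the training clusters that received at least one test cell
print(f"Number of clusters found: {len(np.unique(test_labels)) - (1 if -1 in test_labels else 0)}")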

Number of clusters found: 2
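
One way to judge whether the clusters are meaningful is to cross-tabulate them against the cell types predicted by the SIMS model. The following is a minimal sketch, assuming `Y_test` holds one integer class index per cell (not confirmed by the notebook). `contingency_table` is a hypothetical helper, and the inputs are made-up values.

```python
import numpy as np


def contingency_table(cluster_labels, class_indices):
    """Count cells for every (cluster, predicted class) pair.

    Returns the sorted cluster ids (rows), sorted class ids (columns),
    and the count matrix. Noise (-1) appears as its own row.
    """
    cluster_labels = np.asarray(cluster_labels).ravel()
    class_indices = np.asarray(class_indices).ravel()
    cluster_ids, row = np.unique(cluster_labels, return_inverse=True)
    class_ids, col = np.unique(class_indices, return_inverse=True)
    table = np.zeros((cluster_ids.size, class_ids.size), dtype=int)
    np.add.at(table, (row, col), 1)  # accumulate repeated (row, col) pairs
    return cluster_ids, class_ids, table


# Made-up cluster labels and class indices for illustration only
demo_clusters = np.array([0, 0, 1, 1, -1, 0])
demo_classes = np.array([2, 2, 5, 5, 5, 2])

cluster_ids, class_ids, table = contingency_table(demo_clusters, demo_classes)
# Fraction of each row that belongs to its most common class
purity = table.max(axis=1) / table.sum(axis=1)
print(cluster_ids, class_ids)
print(table)
```

Rows dominated by a single class indicate clusters that agree with the model's predicted cell types; mixed rows indicate clusters that merge several types.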
