# HDBSCAN
Evaluate HDBSCAN as a non-parametric clustering algorithm

## References
- [How HDBSCAN Works](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)
- [Grok Chat](https://grok.com/share/bGVnYWN5_cdb89d47-1ebf-4821-9f2f-d7d1563010e6)

In [21]:
!pip install -q hdbscan scikit-learn numpy onnx skl2onnx onnxruntime

In [22]:
!cd .. && python scripts/cluster.py encode  \
    public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx \
    checkpoints/allen-celltypes+human-cortex+various-cortical-areas.h5ad

Encoding samples from checkpoints/allen-celltypes+human-cortex+various-cortical-areas.h5ad using model public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx
2162 genes in the sample and not in the model
Processing 49494 cells with batch size 32
100%|████████████████████████████████████| 49494/49494 [02:31<00:00, 327.15it/s]
Saved encodings to checkpoints/allen-celltypes+human-cortex+various-cortical-areas-encodings.npy


In [15]:
import numpy as np
from hdbscan import HDBSCAN

In [25]:
X_train = np.load("../checkpoints/allen-celltypes+human-cortex+various-cortical-areas-encodings.npy")
print(f"Loaded encodings of shape {X_train.shape}")

Loaded encodings of shape (49494, 8)


In [31]:
# Each line of the .classes file is one cell-class name known to the model
with open("../public/models/allen-celltypes+human-cortex+various-cortical-areas.classes", "r") as f:
    cell_classes = [line.strip() for line in f]
print(f"{len(cell_classes)} classes in this model")

20 classes in this model


In [34]:
# Fit HDBSCAN on the training-set encodings for this model
hdbscan_model = HDBSCAN(
    min_cluster_size=100,  # Minimum size of clusters; tune based on your data
    min_samples=100,       # Controls noise sensitivity; tune as needed
    prediction_data=True,  # Required for approximate_predict
)
hdbscan_model.fit(X_train)

# Get cluster labels for training data
train_labels = hdbscan_model.labels_
print(f"Number of clusters found: {len(np.unique(train_labels)) - (1 if -1 in train_labels else 0)}")

Number of clusters found: 32
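
The cluster count alone does not show how many cells were labelled as noise (`-1`) or how unevenly the clusters are sized. A minimal sketch of a summary helper follows; `summarize_labels` is a hypothetical name, not part of `hdbscan`, and the toy array only stands in for `hdbscan_model.labels_`.

```python
import numpy as np

def summarize_labels(labels):
    """Return (n_clusters, noise_fraction, sizes) for an HDBSCAN-style label array.

    Hypothetical helper: label -1 is treated as noise and excluded from sizes.
    """
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    # Map each real cluster id to its member count, skipping the noise label
    sizes = {int(i): int(c) for i, c in zip(ids, counts) if i != -1}
    noise_fraction = float(np.mean(labels == -1)) if labels.size else 0.0
    return len(sizes), noise_fraction, sizes

# Toy labels for illustration only; these are not results from this notebook
toy_labels = np.array([0, 0, 1, -1, 1, 1, -1, 2])
n_clusters, noise_fraction, sizes = summarize_labels(toy_labels)
print(f"{n_clusters} clusters, {noise_fraction:.0%} noise, sizes={sizes}")
```

In the notebook, the same helper could be called as `summarize_labels(train_labels)` to see whether the 32 clusters cover most of the cells.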


In [39]:
!cd .. && python scripts/cluster.py encode  \
    public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx \
    data/allen-celltypes+human-cortex+m1.h5ad

Encoding samples from data/allen-celltypes+human-cortex+m1.h5ad using model public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx
2162 genes in the sample and not in the model
Processing 76533 cells with batch size 32
100%|████████████████████████████████████| 76533/76533 [03:59<00:00, 320.07it/s]
Saved encodings to data/allen-celltypes+human-cortex+m1-encodings.npy


In [45]:
X_test = np.load("../data/allen-celltypes+human-cortex+m1-encodings.npy")
print(f"Loaded test encodings of shape {X_test.shape}")

Loaded test encodings of shape (76533, 8)


In [47]:
# Assign test cells to the clusters learned on the training set, without refitting
from hdbscan import approximate_predict
test_labels, strengths = approximate_predict(hdbscan_model, X_test)
print(f"Test labels: {test_labels}")

# Count the distinct training clusters that received at least one test cell (noise label -1 excluded)
print(f"Number of clusters found: {len(np.unique(test_labels)) - (1 if -1 in test_labels else 0)}")

Test labels: [-1 -1 -1 ... -1 -1 -1]
Number of clusters found: 20
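
The truncated printout shows only `-1` labels, so it is worth checking how many test cells were actually assigned to a cluster and how strong those assignments are. The sketch below uses the `strengths` array returned by `approximate_predict`; the helper name `keep_confident` and the `min_strength=0.5` cutoff are assumptions chosen for illustration, and the arrays are toy data rather than results from this notebook.

```python
import numpy as np

def keep_confident(labels, strengths, min_strength=0.5):
    """Relabel assignments weaker than min_strength as noise (-1).

    Hypothetical helper; the 0.5 threshold is an arbitrary illustrative default.
    """
    labels = np.asarray(labels).copy()
    strengths = np.asarray(strengths)
    labels[strengths < min_strength] = -1
    return labels

# Toy stand-ins for the test_labels and strengths arrays
toy_labels = np.array([-1, 3, 3, 7, -1, 7])
toy_strengths = np.array([0.0, 0.9, 0.2, 0.6, 0.0, 0.4])

filtered = keep_confident(toy_labels, toy_strengths)
noise_before = float(np.mean(toy_labels == -1))
noise_after = float(np.mean(filtered == -1))
print(f"Noise fraction: {noise_before:.2f} before, {noise_after:.2f} after filtering")
```

Applied to `test_labels` and `strengths`, a high noise fraction would suggest that the M1 encodings fall outside the dense regions found in the training data, or that `min_samples` is too strict for this test set.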
