# HDBSCAN
Evaluate HDBSCAN as a non-parametric clustering algorithm for the encodings produced by a SIMS model.

## References
[How HDBSCAN Works](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)

[Grok Chat](https://grok.com/share/bGVnYWN5_cdb89d47-1ebf-4821-9f2f-d7d1563010e6)

In [21]:
!pip install -q hdbscan scikit-learn numpy onnx skl2onnx onnxruntime

In [9]:
import numpy as np
from hdbscan import HDBSCAN
from sklearn.datasets import make_blobs  # For dummy data; replace with your RNA-seq encodings

## Ingest
Generate predicted labels and encodings with a SIMS model for the training set (various cortical areas) and the test set (M1).

In [6]:
!cd .. && python scripts/cluster.py predict  \
    public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx \
    checkpoints/allen-celltypes+human-cortex+various-cortical-areas.h5ad \
    --num-samples 1000

Generating predictions and encodings for checkpoints/allen-celltypes+human-cortex+various-cortical-areas.h5ad using model public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx
2162 genes in the sample and not in the model
Processing 1000 cells with batch size 32
100%|██████████████████████████████████████| 1000/1000 [00:03<00:00, 326.66it/s]
Saved encodings to checkpoints/allen-celltypes+human-cortex+various-cortical-areas-encodings.npy
Saved predictions to checkpoints/allen-celltypes+human-cortex+various-cortical-areas-predictions.npy


In [7]:
!cd .. && python scripts/cluster.py predict  \
    public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx \
    data/allen-celltypes+human-cortex+m1.h5ad \
    --num-samples 1000

Generating predictions and encodings for data/allen-celltypes+human-cortex+m1.h5ad using model public/models/allen-celltypes+human-cortex+various-cortical-areas.onnx
2162 genes in the sample and not in the model
Processing 1000 cells with batch size 32
100%|██████████████████████████████████████| 1000/1000 [00:03<00:00, 324.24it/s]
Saved encodings to data/allen-celltypes+human-cortex+m1-encodings.npy
Saved predictions to data/allen-celltypes+human-cortex+m1-predictions.npy


In [16]:
X_train = np.load("../checkpoints/allen-celltypes+human-cortex+various-cortical-areas-encodings.npy")
Y_train = np.load("../checkpoints/allen-celltypes+human-cortex+various-cortical-areas-predictions.npy")

X_test = np.load("../data/allen-celltypes+human-cortex+m1-encodings.npy")
Y_test = np.load("../data/allen-celltypes+human-cortex+m1-predictions.npy")


with open("../public/models/allen-celltypes+human-cortex+various-cortical-areas.classes", "r") as f:
    labels = [line.strip() for line in f]

## Train

In [15]:
# Initialize and train HDBSCAN on the training set for this model
hdbscan_model = HDBSCAN(
    min_cluster_size=100,  # Smallest group treated as a cluster; tune for your data
    min_samples=100,       # Larger values are more conservative and label more points as noise
    prediction_data=True,  # Required for approximate_predict
)
hdbscan_model.fit(X_train)

# Get cluster labels for training data
train_labels = hdbscan_model.labels_
print(f"Number of clusters found: {len(np.unique(train_labels)) - (1 if -1 in train_labels else 0)}")

Number of clusters found: 3
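The cluster count alone hides how many cells were labelled as noise and how evenly the clusters are populated. A minimal sketch of a summary helper follows; `summarize_clusters` is a hypothetical name, not part of hdbscan, and it assumes only that noise is marked with `-1`, as in `hdbscan_model.labels_`. The toy array stands in for `train_labels`.

```python
import numpy as np

def summarize_clusters(cluster_labels):
    """Return (n_clusters, noise_fraction, sizes) for labels where -1 marks noise.

    Hypothetical helper for illustration; not an hdbscan API.
    """
    cluster_labels = np.asarray(cluster_labels)
    ids, counts = np.unique(cluster_labels, return_counts=True)
    # Cluster sizes keyed by cluster id, excluding the noise label
    sizes = {int(i): int(c) for i, c in zip(ids, counts) if i != -1}
    noise_fraction = float(np.mean(cluster_labels == -1)) if cluster_labels.size else 0.0
    return len(sizes), noise_fraction, sizes

# Toy labels standing in for train_labels
example_labels = np.array([0, 0, 1, -1, 1, 1, -1, 2])
n_clusters, noise_fraction, sizes = summarize_clusters(example_labels)
print(f"clusters={n_clusters} noise={noise_fraction:.0%} sizes={sizes}")
```

In the notebook this would be called as `summarize_clusters(train_labels)`.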


## Evaluate

In [18]:
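# Assign test cells to the clusters learned from the training set, without retraining
from hdbscan import approximate_predict
test_labels, strengths = approximate_predict(hdbscan_model, X_test)

# approximate_predict cannot create new clusters; it only assigns existing ones (or -1 for noise)
print(f"Number of training clusters assigned to test cells: {len(np.unique(test_labels)) - (1 if -1 in test_labels else 0)}")

Number of training clusters assigned to test cells: 2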
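To judge whether the assigned clusters line up with the SIMS predictions, one option is to drop low-confidence assignments and cross-tabulate the rest against the predicted classes. The sketch below uses only NumPy; `confident_mask` and `contingency` are hypothetical helpers, the `0.5` strength threshold is arbitrary, and it assumes the predictions array holds one integer class index per cell, which the notebook does not confirm. The toy arrays stand in for `test_labels`, `strengths`, and `Y_test`.

```python
import numpy as np

def confident_mask(cluster_labels, strengths, min_strength=0.5):
    """True for cells that are not noise and whose strength is at least min_strength."""
    cluster_labels = np.asarray(cluster_labels)
    strengths = np.asarray(strengths)
    return (cluster_labels != -1) & (strengths >= min_strength)

def contingency(cluster_labels, class_ids, n_classes):
    """Count cells per (cluster, class). Rows follow sorted cluster ids, columns are class indices."""
    cluster_labels = np.asarray(cluster_labels)
    class_ids = np.asarray(class_ids)
    cluster_ids = np.unique(cluster_labels)
    table = np.zeros((len(cluster_ids), n_classes), dtype=int)
    for row, cid in enumerate(cluster_ids):
        table[row] = np.bincount(class_ids[cluster_labels == cid], minlength=n_classes)
    return cluster_ids, table

# Toy data standing in for test_labels, strengths, and Y_test (assumed integer class indices)
toy_clusters = np.array([0, 0, 1, 1, -1, 1])
toy_strengths = np.array([0.9, 0.2, 0.8, 0.7, 0.0, 0.95])
toy_classes = np.array([2, 2, 0, 0, 1, 1])

mask = confident_mask(toy_clusters, toy_strengths, min_strength=0.5)
cluster_ids, table = contingency(toy_clusters[mask], toy_classes[mask], n_classes=3)
print(cluster_ids)
print(table)
```

With the notebook's variables, `n_classes` would be `len(labels)`, and the column index of each row's largest count gives the dominant class name via `labels[...]`.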


## Export
Attempt to export the trained HDBSCAN model to ONNX.

In [23]:
# Import libraries for ONNX conversion
import skl2onnx
import onnx
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

In [25]:
# Define the input shape for the ONNX model
# X_train.shape[1] represents the number of features in your input data
input_type = [('input', FloatTensorType([None, X_train.shape[1]]))]

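# Convert the HDBSCAN model to ONNX format.
# skl2onnx has no registered converter for hdbscan.HDBSCAN, so this call raises
# MissingShapeCalculator (see the output below).
onnx_model = skl2onnx.convert_sklearn(
    hdbscan_model,
    initial_types=input_type,
    options={id(hdbscan_model): {'zipmap': False}}
)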

# Save the ONNX model to a file
onnx_model_path = "../models/hdbscan_model.onnx"
with open(onnx_model_path, "wb") as f:
    f.write(onnx_model.SerializeToString())

print(f"ONNX model saved to {onnx_model_path}")

MissingShapeCalculator: Unable to find a shape calculator for type '<class 'hdbscan.hdbscan_.HDBSCAN'>'.
It usually means the pipeline being converted contains a
transformer or a predictor with no corresponding converter
implemented in sklearn-onnx. If the converted is implemented
in another library, you need to register
the converted so that it can be used by sklearn-onnx (function
update_registered_converter). If the model is not yet covered
by sklearn-onnx, you may raise an issue to
https://github.com/onnx/sklearn-onnx/issues
to get the converter implemented or even contribute to the
project. If the model is a custom model, a new converter must
be implemented. Examples can be found in the gallery.


In [28]:
from skl2onnx import update_registered_converter
from skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes
from onnxconverter_common import direct_ops

def hdbscan_shape_calculator(operator):
    calculate_linear_classifier_output_shapes(operator)

def hdbscan_converter(scope, operator, container):
    # This is a simplified converter - actual implementation would need to match HDBSCAN's logic
    X = operator.inputs[0]
    output = operator.outputs[0]
    
    # Add basic operations (this would need to be expanded for full HDBSCAN functionality)
    direct_ops.add_identity(container, X, output.name)

# Register the custom converter
update_registered_converter(
    HDBSCAN,
    "CustomHDBSCAN",
    hdbscan_shape_calculator,
    hdbscan_converter
)

# Then try the conversion again.
# Use the names defined in the earlier cells: skl2onnx.convert_sklearn and input_type.
onnx_model = skl2onnx.convert_sklearn(
    hdbscan_model,
    "hdbscan_model",
    initial_types=input_type
)

ImportError: cannot import name 'direct_ops' from 'onnxconverter_common' (/Users/rcurrie/cell-space/venv/lib/python3.10/site-packages/onnxconverter_common/__init__.py)
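Even with the import fixed, the identity converter above would only copy the input to the output; it would not reproduce cluster assignment. One possible workaround is to replace `approximate_predict` at inference time with a nearest-centroid rule, which is a different and much cruder technique, but one that needs only subtraction, squaring, a sum, and an argmin. The sketch below is an assumption-laden illustration in NumPy, not the notebook author's method: `fit_centroids` and `assign_nearest` are hypothetical helpers, the toy arrays stand in for `X_train`, `train_labels`, and `X_test`, and the rule never returns noise unless the returned distance is thresholded separately.

```python
import numpy as np

def fit_centroids(X, cluster_labels):
    """Mean encoding per cluster, ignoring noise (-1). Assumes at least one cluster exists."""
    X = np.asarray(X, dtype=np.float32)
    cluster_labels = np.asarray(cluster_labels)
    cluster_ids = np.unique(cluster_labels[cluster_labels != -1])
    centroids = np.stack([X[cluster_labels == cid].mean(axis=0) for cid in cluster_ids])
    return cluster_ids, centroids

def assign_nearest(X, cluster_ids, centroids):
    """Assign each row of X to the closest centroid; also return the Euclidean distance to it."""
    X = np.asarray(X, dtype=np.float32)
    # Squared Euclidean distances, shape (n_samples, n_clusters)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return cluster_ids[d2.argmin(axis=1)], np.sqrt(d2.min(axis=1))

# Toy data standing in for X_train, train_labels, and X_test
toy_X = np.array([[0, 0], [0, 2], [10, 10], [10, 12], [50, 50]], dtype=np.float32)
toy_labels = np.array([0, 0, 1, 1, -1])
toy_new = np.array([[1, 1], [9, 10]], dtype=np.float32)

cluster_ids, centroids = fit_centroids(toy_X, toy_labels)
assigned, distances = assign_nearest(toy_new, cluster_ids, centroids)
print(assigned, distances)
```

Whether this approximation is acceptable depends on cluster shape: HDBSCAN can find non-convex clusters that a centroid rule will split or merge, so agreement with `approximate_predict` on `X_test` should be checked before relying on it.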

In [29]:
import onnxconverter_common