# Playing with HDBSCAN 

### Objective: understand the basics of running HDBSCAN clustering on my precomputed distance table

Following basic usage from the HDBSCAN github repository, in the docs directory:
https://github.com/scikit-learn-contrib/hdbscan

In [1]:
import pandas as pd
from sklearn.datasets import make_blobs  # the samples_generator submodule was removed in newer scikit-learn releases


In [2]:
blobs, labels = make_blobs(n_samples=4000, n_features=8, centers = 4) #creates numpy ndarrays with 4 blob centers
type(blobs)

numpy.ndarray

In [3]:
print(blobs.shape)
print(labels.shape)

(4000, 8)
(4000,)


In [4]:
df_blob = pd.DataFrame(blobs) #just a dataframe of the data
df_blob.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,-0.707333,8.452894,8.363299,-2.899506,7.525401,10.012177,7.124912,-9.517923
1,8.342652,-5.772818,0.407365,2.74634,-8.761033,9.122898,1.710239,10.345968
2,6.244715,-10.292425,4.270402,6.608374,3.692815,4.053554,-4.025397,2.589381
3,0.323498,-0.848869,-4.668813,7.090943,-2.282454,8.873566,4.968602,-1.600008
4,8.306039,-6.393436,0.093635,4.210367,-8.623712,9.071091,1.471359,9.653458


In [5]:
df_labels = pd.DataFrame(labels) # ground-truth blob index (0-3) of each sample, not column indices; HDBSCAN itself never sees these
df_labels.head()

Unnamed: 0,0
0,2
1,1
2,0
3,3
4,1
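To make the role of these labels concrete, here is a minimal sketch (the names `points`, `true_labels` and the `random_state` value are my own choices, not from the original run). It regenerates blobs and counts how many samples carry each label, which suggests the labels simply record which blob each point was drawn from.

```python
import numpy as np
from sklearn.datasets import make_blobs

# random_state is an assumption added here so the sketch is reproducible
points, true_labels = make_blobs(
    n_samples=4000, n_features=8, centers=4, random_state=0
)

# Count how many samples belong to each generating blob
samples_per_blob = np.bincount(true_labels)
print(samples_per_blob)
```

These labels are handy later for checking a clustering against the known answer, but the clusterer is fitted on the points alone.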


In [6]:
import hdbscan

In [7]:
clusterer = hdbscan.HDBSCAN() #create cluster object 

In [8]:
clusterer.fit(blobs) #fit the data  

HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, core_dist_n_jobs=4, gen_min_span_tree=False,
    leaf_size=40, match_reference_implementation=False,
    memory=Memory(cachedir=None), metric='euclidean', min_cluster_size=5,
    min_samples=None, p=None)

The clustering is complete! To get the results, read the labels_ attribute.

In [9]:
clusterer.labels_

array([0, 3, 1, ..., 0, 1, 1])

Each data point gets a cluster label (0, 1, 2, ...), returned as an array of integers.

In [10]:
clusterer.labels_.size

4000

Indeed, there are 4000 labels in this array of integers, one per data point (same as n_samples).

In [11]:
clusterer.labels_.max()

3

The maximum label is 3, so there are 4 clusters in total, labelled 0, 1, 2, 3.

stealing wise words:

Importantly HDBSCAN is noise aware -- it has a notion of data samples that are not assigned to any cluster. This is handled by assigning these samples the label -1. But wait, there's more. The hdbscan library implements soft clustering, where each data point is assigned a cluster membership score ranging from 0.0 to 1.0. A score of 0.0 represents a sample that is not in the cluster at all (all noise points will get this score) while a score of 1.0 represents a sample that is at the heart of the cluster (note that this is not the spatial centroid notion of core). You can access these scores via the probabilities_ attribute.

In [12]:
clusterer.probabilities_

array([ 0.69206705,  0.75816776,  0.95033556, ...,  0.62846889,
        0.69155578,  0.80544359])
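A small sketch of how the two arrays might be read together. The values below are made-up stand-ins for clusterer.labels_ and clusterer.probabilities_, and the 0.7 cut-off is an arbitrary assumption, not a recommended setting.

```python
import numpy as np

# Stand-in values for illustration only; in the notebook these would be
# clusterer.labels_ and clusterer.probabilities_
labels = np.array([0, 3, 1, -1, 2, 0, 1, -1])
probabilities = np.array([0.69, 0.76, 0.95, 0.0, 1.0, 0.63, 0.69, 0.0])

# Noise points carry the label -1, so exclude them before counting clusters
noise_mask = labels == -1
n_clusters = np.unique(labels[~noise_mask]).size
n_noise = int(noise_mask.sum())

# Indices of points whose membership score meets a hypothetical threshold
threshold = 0.7
confident = np.flatnonzero(probabilities >= threshold)

print(n_clusters, n_noise, confident)
```

Counting unique non-negative labels is a little safer than labels_.max() + 1, since it still behaves sensibly when every point is noise.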

# Different metrics are supported by HDBSCAN

stealing more wise words:

That is all well and good, but even data that is embedded in a vector space may not want to consider distances between data points to be pure Euclidean distance. What can we do in that case? We are still in good shape, since hdbscan supports a wide variety of metrics, which you can set when creating the clusterer object. For example we can do the following:

In [15]:
clusterer = hdbscan.HDBSCAN(metric='manhattan')
clusterer.fit(blobs)
clusterer.labels_

array([0, 3, 2, ..., 0, 2, 2])
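The cluster labels differ from the Euclidean run partly because the two metrics measure different things. As a quick worked example (using SciPy's distance functions, which are not part of the original notebook), compare both distances for one pair of points:

```python
from scipy.spatial.distance import cityblock, euclidean

a = [0.0, 0.0]
b = [3.0, 4.0]

# Euclidean: straight-line distance, sqrt(3^2 + 4^2)
d_euclidean = euclidean(a, b)

# Manhattan (cityblock): sum of absolute coordinate differences, 3 + 4
d_manhattan = cityblock(a, b)

print(d_euclidean, d_manhattan)
```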

See the list of supported metric names, which hdbscan exposes as dist_metrics.METRIC_MAPPING:

In [16]:
hdbscan.dist_metrics.METRIC_MAPPING

{'arccos': hdbscan.dist_metrics.ArccosDistance,
 'braycurtis': hdbscan.dist_metrics.BrayCurtisDistance,
 'canberra': hdbscan.dist_metrics.CanberraDistance,
 'chebyshev': hdbscan.dist_metrics.ChebyshevDistance,
 'cityblock': hdbscan.dist_metrics.ManhattanDistance,
 'cosine': hdbscan.dist_metrics.ArccosDistance,
 'dice': hdbscan.dist_metrics.DiceDistance,
 'euclidean': hdbscan.dist_metrics.EuclideanDistance,
 'hamming': hdbscan.dist_metrics.HammingDistance,
 'haversine': hdbscan.dist_metrics.HaversineDistance,
 'infinity': hdbscan.dist_metrics.ChebyshevDistance,
 'jaccard': hdbscan.dist_metrics.JaccardDistance,
 'kulsinski': hdbscan.dist_metrics.KulsinskiDistance,
 'l1': hdbscan.dist_metrics.ManhattanDistance,
 'l2': hdbscan.dist_metrics.EuclideanDistance,
 'mahalanobis': hdbscan.dist_metrics.MahalanobisDistance,
 'manhattan': hdbscan.dist_metrics.ManhattanDistance,
 'matching': hdbscan.dist_metrics.MatchingDistance,
 'minkowski': hdbscan.dist_metrics.MinkowskiDistance,
 'p': hdbscan.dis

# Precomputed Distance metric 

If you create the clusterer with the metric set to precomputed, then the clusterer will assume that, rather than being handed an array of points in a vector space, it is receiving an all-pairs distance matrix.

In [18]:
from sklearn.metrics.pairwise import pairwise_distances

In [19]:
distance_matrix = pairwise_distances(blobs)
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(distance_matrix)
clusterer.labels_

array([0, 2, 3, ..., 0, 3, 3])
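Since my real input is a distance table rather than raw points, here is a rough sketch of how a pandas table might be checked and handed over. The helper `distance_table_to_matrix` is hypothetical (my own name), the random points are stand-in data, and the float64 conversion reflects my understanding that hdbscan expects a double-precision matrix; treat all of it as assumptions to verify against the docs.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def distance_table_to_matrix(table: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: validate a square distance table, return float64."""
    matrix = np.ascontiguousarray(table.to_numpy(dtype=np.float64))
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError("distance table must be square")
    if not np.allclose(matrix, matrix.T):
        raise ValueError("distance table must be symmetric")
    if not np.allclose(np.diag(matrix), 0.0):
        raise ValueError("self-distances must be zero")
    return matrix


# Stand-in data: 200 random points turned into an all-pairs distance table
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 8))
table = pd.DataFrame(squareform(pdist(points)))

matrix = distance_table_to_matrix(table)

# Only run the clustering step if hdbscan is installed
try:
    import hdbscan
except ImportError:
    hdbscan = None

if hdbscan is not None:
    precomputed_labels = hdbscan.HDBSCAN(metric="precomputed").fit_predict(matrix)
    print(precomputed_labels[:10])
```

Validating up front is a design choice: a non-square or asymmetric table usually means the table was built or loaded incorrectly, and it is easier to catch that before clustering than to debug odd labels afterwards.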

### Experimental Playground

In [14]:
make_blobs?