# ST-HDBSCAN: Spatiotemporal Hierarchical DBSCAN for Trajectory Data

## Abstract

The study of human mobility has advanced greatly in recent years due to the availability of
commercial large-scale GPS trajectory datasets [3]. However, the validity of findings that use
these datasets depends heavily on the robustness of their pre-processing methods. An important
step in the processing of mobility data is the detection of stops within GPS trajectories, for
which many clustering algorithms have been proposed [4, 8, 6, 1]. Yet, the high sparsity of
commercial GPS data can affect the performance of these stop-detection algorithms.
DBSCAN identifies dense regions, but for a given ε it often over-clusters or under-clusters
because of noise and weakly connected points. ST-DBSCAN [4] uses two distance thresholds,
Eps1 for spatial values and Eps2 for non-spatial values, and compares the average non-spatial
value of a cluster (such as temperature) with each incoming value to prevent adjacent clusters
from merging. However, datasets that include this kind of non-spatial information are not
comparable to realistic GPS-based trajectories. A promising algorithm is T-DBSCAN [6], which
searches forward in time for a continuous density-based neighborhood of core points: points
that are spatially close (within Eps) and within a roaming threshold (CEps) are included in a
cluster. Additionally, we used a time-augmented DBSCAN algorithm, TA-DBSCAN, which recursively
processes the clusters obtained from DBSCAN to resolve initial clusters that overlap in time.
However, methods that validate stop-detection algorithms on synthetic data [9] show that these
algorithms can omit, merge, or split stops depending on the choice of ε and the sparsity of
the data.

Fine parameters (small ε) may completely miss a stop at a large location, whereas coarse parameters (large ε) may fail to differentiate stops at small neighboring locations [3]. Because venues vary in stop duration and area, the appropriate parameter choice can differ from venue to venue [9].

To address this parameter selection limitation, we propose ST-HDBSCAN, a spatiotemporal variation of Hierarchical DBSCAN [5]. Unlike DBSCAN, which relies on a single density threshold to cluster points, our variation constructs separate structures for spatial and temporal distances that preserve density-based connections in both dimensions. Pruning the hierarchical tree needed for cluster formation therefore accounts for varying spatiotemporal densities. As a result, clusters emerge without requiring specific time and space thresholds, and the method works across different levels of data sparsity.
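The trade-off between fine and coarse ε can be illustrated with a deliberately simplified stand-in for DBSCAN: positions on a 1-D line (in meters) are grouped wherever consecutive gaps are at most ε, and groups with fewer than `min_pts` pings are dropped as noise. The function name, the venues, and all coordinates below are invented for illustration and are not part of the method described here.

```python
def stops_at_eps(positions, eps, min_pts=3):
    """Group sorted 1-D positions into runs whose consecutive gaps are <= eps."""
    xs = sorted(positions)
    groups, current = [], [xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev <= eps:
            current.append(cur)
        else:
            groups.append(current)
            current = [cur]
    groups.append(current)
    # Runs with too few pings are treated as noise, as in density-based clustering
    return [g for g in groups if len(g) >= min_pts]

# Hypothetical pings: one large venue sampled sparsely, two small neighboring venues
large_venue = [0, 30, 60, 90]
small_a = [200, 202, 204]
small_b = [215, 217, 219]
pings = large_venue + small_a + small_b

fine = stops_at_eps(pings, eps=5)     # separates the small venues, misses the large one
coarse = stops_at_eps(pings, eps=40)  # finds the large venue, merges the small ones
print(len(fine), len(coarse))
```

Neither setting recovers all three stops, which is the limitation a hierarchical, multi-density approach is meant to avoid.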

In [78]:
%load_ext autoreload
%autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [79]:
import pandas as pd
import numpy as np
from datetime import timedelta
import pygeohash as gh
import geopandas as gpd
from matplotlib import cm
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from pyproj import Transformer
from collections import defaultdict

In [80]:
import nomad.io.base as loader
import nomad.constants as constants
import nomad.stop_detection.hdbscan as HDBSCAN
import nomad.filters as filters
import nomad.city_gen as cg

In [48]:
traj_cols = {'user_id':'uid',
             'datetime':'local_datetime',
             'latitude':'latitude',
             'longitude':'longitude'}

data = loader.from_file("../../nomad/data/gc_sample.csv")

In [49]:
# We create a time offset column with different UTC offsets (in seconds)
data['tz_offset'] = 0
data.loc[data.index[:5000],'tz_offset'] = -7200
data.loc[data.index[-5000:], 'tz_offset'] = 3600

# create datetime column as a string, then parse it;
# utc=True normalizes the mixed offsets, so the parsed column holds UTC datetimes
data['local_datetime'] = loader._unix_offset_to_str(data.timestamp, data.tz_offset)
data['local_datetime'] = pd.to_datetime(data['local_datetime'], utc=True)

# create x, y columns in web mercator
gdf = gpd.GeoSeries(gpd.points_from_xy(data.longitude, data.latitude),
                        crs="EPSG:4326")
projected = gdf.to_crs("EPSG:3857")
data['x'] = projected.x
data['y'] = projected.y
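As a sanity check on the projected columns, the spherical Web Mercator (EPSG:3857) forward formula can be evaluated by hand for a single point. This is a minimal standard-library sketch, not a replacement for the `to_crs` call; the function name and the constant name are chosen here for illustration, and the coordinates are those of the first row of the sample.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS 84 semi-major axis)

def to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

x, y = to_web_mercator(-36.667334, 38.321711)
print(round(x), round(y))
```

Distances in this projection are stretched by roughly 1/cos(latitude), so thresholds expressed in projected units are not exact ground meters at this latitude.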

In [50]:
data

| | uid | timestamp | latitude | longitude | tz_offset | local_datetime | x | y |
|---|---|---|---|---|---|---|---|---|
| 0 | wizardly_joliot | 1704119340 | 38.321711 | -36.667334 | -7200 | 2024-01-01 14:29:00+00:00 | -4.081789e+06 | 4.624973e+06 |
| 1 | wizardly_joliot | 1704119700 | 38.321676 | -36.667365 | -7200 | 2024-01-01 14:35:00+00:00 | -4.081792e+06 | 4.624968e+06 |
| 2 | wizardly_joliot | 1704155880 | 38.320959 | -36.666748 | -7200 | 2024-01-02 00:38:00+00:00 | -4.081724e+06 | 4.624866e+06 |
| 3 | wizardly_joliot | 1704156000 | 38.320936 | -36.666739 | -7200 | 2024-01-02 00:40:00+00:00 | -4.081723e+06 | 4.624863e+06 |
| 4 | wizardly_joliot | 1704156840 | 38.320924 | -36.666747 | -7200 | 2024-01-02 00:54:00+00:00 | -4.081724e+06 | 4.624861e+06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25830 | angry_spence | 1705303380 | 38.320399 | -36.667438 | 3600 | 2024-01-15 07:23:00+00:00 | -4.081801e+06 | 4.624787e+06 |
| 25831 | angry_spence | 1705303740 | 38.320413 | -36.667469 | 3600 | 2024-01-15 07:29:00+00:00 | -4.081804e+06 | 4.624789e+06 |
| 25832 | angry_spence | 1705303980 | 38.320384 | -36.667455 | 3600 | 2024-01-15 07:33:00+00:00 | -4.081802e+06 | 4.624785e+06 |
| 25833 | angry_spence | 1705304340 | 38.320349 | -36.667473 | 3600 | 2024-01-15 07:39:00+00:00 | -4.081804e+06 | 4.624780e+06 |
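Every `local_datetime` value in the output above carries a `+00:00` offset even though the rows have different `tz_offset` values, because parsing with `utc=True` converts all offsets to UTC. The sketch below reproduces the conversion for the first row using only the standard library. It does not call `loader._unix_offset_to_str`, whose exact string format is not shown here, and the helper name `local_iso` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def local_iso(unix_ts, offset_seconds):
    """Format a Unix timestamp as ISO 8601 local time with a fixed UTC offset."""
    tz = timezone(timedelta(seconds=offset_seconds))
    return datetime.fromtimestamp(unix_ts, tz).isoformat()

local = local_iso(1704119340, -7200)  # local wall-clock time, two hours behind UTC
as_utc = datetime.fromisoformat(local).astimezone(timezone.utc)
print(local, as_utc.isoformat())
```

If local wall-clock times are needed downstream (for example, to reason about night-time stops), the offset has to be reapplied after this normalization.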


In [51]:
user_sample = data.loc[data.uid == "angry_spence"]
user_sample = user_sample[['timestamp', 'x', 'y']]

In [52]:
user_sample

| | timestamp | x | y |
|---|---|---|---|
| 24139 | 1704104460 | -4.081702e+06 | 4.624871e+06 |
| 24140 | 1704104820 | -4.081697e+06 | 4.624867e+06 |
| 24141 | 1704104940 | -4.081696e+06 | 4.624866e+06 |
| 24142 | 1704105540 | -4.081698e+06 | 4.624865e+06 |
| 24143 | 1704105720 | -4.081699e+06 | 4.624866e+06 |
| ... | ... | ... | ... |
| 25830 | 1705303380 | -4.081801e+06 | 4.624787e+06 |
| 25831 | 1705303740 | -4.081804e+06 | 4.624789e+06 |
| 25832 | 1705303980 | -4.081802e+06 | 4.624785e+06 |
| 25833 | 1705304340 | -4.081804e+06 | 4.624780e+06 |


In [65]:
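# Run the clustering pipeline for one user, one step at a time
time_pairs = HDBSCAN._find_bursts(user_sample['timestamp'], 120)
core_distances = HDBSCAN._compute_core_distance(user_sample, time_pairs, 4)
mrd = HDBSCAN._compute_mrd_graph(user_sample, core_distances)  # mutual reachability graph
mst_edges = HDBSCAN._mst(mrd)  # minimum spanning tree of that graph
# NOTE: mst_ext is called unqualified and is not imported in the cells above,
# so it must already be defined in the session for this line to run
mstext_edges = mst_ext(mst_edges, core_distances)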
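The cell above chains core distances, a mutual reachability graph, and a minimum spanning tree. The sketch below shows those three steps with the standard HDBSCAN definitions on four 1-D points. It is a toy illustration under stated assumptions, not nomad's implementation: it ignores the temporal pairs entirely, and every function name and value in it is invented here.

```python
def core_distances(xs, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    cores = []
    for i, xi in enumerate(xs):
        dists = sorted(abs(xi - xj) for j, xj in enumerate(xs) if j != i)
        cores.append(dists[k - 1])
    return cores

def mutual_reachability(xs, cores, i, j):
    """max(core(i), core(j), d(i, j)): sparse points are pushed away from everything."""
    return max(cores[i], cores[j], abs(xs[i] - xs[j]))

def prim_mst(xs, cores):
    """Minimum spanning tree of the mutual reachability graph as (i, j, weight) edges."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(xs):
        # Cheapest edge that connects a point in the tree to a point outside it
        w, i, j = min(
            (mutual_reachability(xs, cores, i, j), i, j)
            for i in in_tree
            for j in range(len(xs))
            if j not in in_tree
        )
        edges.append((i, j, w))
        in_tree.add(j)
    return edges

xs = [0.0, 1.0, 2.0, 10.0]       # three nearby points and one isolated point
cores = core_distances(xs, k=2)
mst = prim_mst(xs, cores)
print(cores, mst)
```

Removing the heaviest tree edges first is what produces the cluster hierarchy: here the isolated point separates long before the three nearby points split.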

In [114]:
label_map, hierarchy = HDBSCAN.hdbscan(mstext_edges, min_cluster_size=2)

In [115]:
from collections import Counter
print(Counter(label_map.values()))

Counter({-1: 1647, 15: 2, 106: 2, 50: 2, 148: 2, 18: 2, 127: 2, 20: 2, 59: 2, 30: 1, 25: 1, 36: 1, 33: 1, 9: 1, 16: 1, 112: 1, 146: 1, 32: 1, 159: 1, 38: 1, 91: 1, 46: 1, 84: 1, 72: 1, 14: 1, 124: 1, 156: 1, 116: 1, 152: 1, 104: 1, 89: 1, 73: 1, 10: 1, 64: 1, 82: 1, 94: 1, 134: 1, 67: 1, 130: 1, 80: 1, 62: 1, 93: 1})
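The counts above include many labels with a single member even though `min_cluster_size=2` was requested. A small helper like the following lists such labels so they can be inspected; the helper name and the toy `label_map` are hypothetical and only illustrate the check.

```python
from collections import Counter

def undersized_clusters(label_map, min_cluster_size):
    """Return the non-noise labels that have fewer than min_cluster_size members."""
    sizes = Counter(label for label in label_map.values() if label != -1)
    return sorted(cid for cid, n in sizes.items() if n < min_cluster_size)

# Toy mapping of timestamp -> label; -1 marks noise
toy_label_map = {10: -1, 20: 5, 30: 5, 40: 7, 50: -1}
too_small = undersized_clusters(toy_label_map, min_cluster_size=2)
print(too_small)
```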


In [93]:
from collections import defaultdict

def _compute_cluster_stability(hierarchy, label_map, timestamps, min_cluster_size):
    """
    Compute the stability of each cluster from the hierarchy.

    Parameters
    ----------
    hierarchy : list of tuples
        Each tuple is (scale, parent_cluster_id, [child_cluster_ids]).
    label_map : dict
        {timestamp: final cluster label}.
    timestamps : iterable
        All unique timestamps (for example, a set or a pandas Series).
    min_cluster_size : int
        The minimum cluster size allowed. Currently unused in this function.

    Returns
    -------
    stability : dict
        {cluster_id: stability score}
    """
    lambda_min = {0: 0.0} # λ_min(Ci) is the minimum density level at which Ci exists
    lambda_max = {} # λ_max(xj,Ci) is the density level beyond which object xj no longer belongs to cluster Ci
    
    # Track which cluster each timestamp was in at which scale
    membership = defaultdict(list)  # timestamp: [(cluster_id, scale_exit)]

    # Initially all points are in cluster 0
    for ts in timestamps:
        membership[ts].append((0, None))

    for scale, parent_id, child_ids in hierarchy:
        # Parent no longer exists at this scale
        lambda_max[parent_id] = scale
        # Children begin to exist at this scale
        for child_id in child_ids:
            lambda_min[child_id] = scale

        # Reassign points in label_map to child clusters
        for ts, label in label_map.items():
            if label in child_ids:
                membership[ts].append((label, None))
        
        # Close every open membership in the parent: its points leave the parent at this scale
        for ts, ts_evolution in membership.items():
            for i in range(len(ts_evolution)):
                cid, exit_scale = ts_evolution[i]
                if cid == parent_id and exit_scale is None:
                    ts_evolution[i] = (cid, scale)

    # Stability: S(C_i) = Σ_{x_j ∈ C_i} (λ_max(x_j, C_i) − λ_min(C_i))
    stability = defaultdict(float)
    for ts in timestamps:
        for cid, exit_scale in membership[ts]:
            if cid not in lambda_min:
                continue
            birth = lambda_min[cid]  # λ_min(C_i)
            # λ_max(x_j, C_i): use the recorded exit scale, else the scale at which the
            # cluster split, else its birth (a cluster that never splits contributes zero)
            death = exit_scale if exit_scale is not None else lambda_max.get(cid, birth)
            if death > birth:
                stability[cid] += death - birth  # λ_max(x_j, C_i) − λ_min(C_i)

    return dict(stability)
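A worked instance of the stability formula may help when checking the function above. In the original HDBSCAN formulation, the density level is λ = 1/ε, so λ grows as the distance scale shrinks. Whether the `scale` values in `hierarchy` are distances or already λ values is not stated in this notebook, so the conversion below is an assumption, and the function names and numbers are invented for illustration.

```python
def to_lambda(distance):
    """Density level as defined for HDBSCAN: lambda = 1 / distance."""
    return 1.0 / distance

def cluster_stability(birth_distance, exit_distances):
    """S(C) = sum over points of (lambda when the point leaves C - lambda when C is born)."""
    lam_birth = to_lambda(birth_distance)
    return sum(to_lambda(d) - lam_birth for d in exit_distances)

# A cluster that appears at distance 2.0; two points leave at 1.0, one persists until 0.5
score = cluster_stability(2.0, [1.0, 1.0, 0.5])
print(score)
```

If raw distances were used in place of λ, each point's exit value would be smaller than the cluster's birth value and every term would be negative, so it is worth confirming which convention the hierarchy uses.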

In [94]:
def _select_stable_clusters(hierarchy, stability):
    """
    Implements the HDBSCAN bottom-up optimization algorithm for selecting
    the most stable, non-overlapping clusters.

    Parameters
    ----------
    hierarchy : list of (scale, parent_id, [child_ids])
        The cluster split history.
    stability : dict
        {cluster_id: stability score}

    Returns
    -------
    selected_clusters : set
        Set of selected cluster IDs.
    """

    # Build tree (parent -> children)
    tree = defaultdict(list)
    for _, parent, children in hierarchy:
        tree[parent].extend(children)

    # Bottom-up dynamic programming
    best_stability = {}
    selected = {}

    def deselect_descendants(cluster_id):
        # Selecting a cluster excludes every cluster below it, not only its direct children
        stack = list(tree[cluster_id])
        while stack:
            node = stack.pop()
            selected[node] = False
            if node in tree:
                stack.extend(tree[node])

    def dfs(cluster_id):
        # Leaf cluster: its own stability is the best achievable
        if cluster_id not in tree:
            best_stability[cluster_id] = stability.get(cluster_id, 0.0)
            selected[cluster_id] = True
            return best_stability[cluster_id]

        child_stabilities = sum(dfs(child) for child in tree[cluster_id])
        current_stability = stability.get(cluster_id, 0.0)

        if current_stability >= child_stabilities:
            # The parent is at least as stable as its best descendants: keep it instead
            best_stability[cluster_id] = current_stability
            selected[cluster_id] = True
            deselect_descendants(cluster_id)
        else:
            best_stability[cluster_id] = child_stabilities
            selected[cluster_id] = False

        return best_stability[cluster_id]

    # Cluster 0 is the root, which initially contains every point
    dfs(0)

    final_clusters = {cid for cid, is_selected in selected.items() if is_selected}

    return final_clusters
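The bottom-up rule can also be written so that each call returns the chosen set for its subtree, which means that keeping a parent automatically discards everything beneath it. The following self-contained sketch runs that formulation on a toy tree; the function name, the tree, and the stability values are invented for illustration.

```python
def best_selection(tree, stability, node):
    """Return (score, chosen_clusters) for the subtree rooted at node."""
    own = stability.get(node, 0.0)
    children = tree.get(node, [])
    if not children:
        return own, {node}
    child_score, child_set = 0.0, set()
    for child in children:
        score, chosen = best_selection(tree, stability, child)
        child_score += score
        child_set |= chosen
    # Keep the parent only if it is at least as stable as its best descendants combined
    if own >= child_score:
        return own, {node}
    return child_score, child_set

# Toy hierarchy: root 0 splits into 1 and 2, and cluster 1 splits into 3 and 4
toy_tree = {0: [1, 2], 1: [3, 4]}
toy_stability = {0: 1.0, 1: 5.0, 2: 2.0, 3: 1.5, 4: 1.0}
total, chosen = best_selection(toy_tree, toy_stability, 0)
print(total, sorted(chosen))
```

Cluster 1 is kept over its children 3 and 4 because its stability exceeds their sum, while the root is replaced by clusters 1 and 2.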

In [106]:
def _hdbscan_labels(label_map, final_clusters):
    """
    Assign final labels to each timestamp based on selected stable clusters.
    
    Parameters
    ----------
    label_map : dict
        {timestamp: cluster_id} mapping produced by the clustering step.
    final_clusters : set
        Set of selected cluster IDs.

    Returns
    -------
    final_labels : dict
        {timestamp: final cluster label (-1 if noise)}
    """
    final_labels = {}
    
    for ts, cid in label_map.items():
        if cid in final_clusters:
            final_labels[ts] = cid
        else:
            final_labels[ts] = -1  # noise

    return final_labels

In [108]:
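# min_cluster_size matches the hdbscan call above; the function does not currently use it
stability = _compute_cluster_stability(hierarchy, label_map, user_sample['timestamp'], min_cluster_size=2)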
final_clusters = _select_stable_clusters(hierarchy, stability)

In [111]:
stability

{0: np.float64(92577.55222618669)}