# ST-HDBSCAN: Spatiotemporal Hierarchical DBSCAN for Trajectory Data

## Abstract

The study of human mobility has advanced greatly in recent years due to the availability of
commercial large-scale GPS trajectory datasets [3]. However, the validity of findings that use
these datasets depends heavily on the robustness of their pre-processing methods. An important
step in the processing of mobility data is the detection of stops within GPS trajectories, for
which many clustering algorithms have been proposed [4, 8, 6, 1]. Yet, the high sparsity of
commercial GPS data can affect the performance of these stop-detection algorithms.
DBSCAN identifies dense regions, but for a given ε it can over-cluster or under-cluster
because of noise and weakly connected points. ST-DBSCAN [4] uses two distance thresholds,
Eps1 for spatial values and Eps2 for non-spatial values. It compares the average non-spatial
value of a cluster, such as temperature, with each incoming value to avoid merging adjacent
clusters. Nevertheless, datasets that include this kind of information are not comparable to
realistic GPS-based trajectories. A promising algorithm is T-DBSCAN [6], which searches
forward in time for a continuous density-based neighborhood of core points. Points that are
spatially close (within Eps) and within a roaming threshold (CEps) are included in a cluster.
Additionally, we used a time-augmented DBSCAN algorithm, TA-DBSCAN, which recursively
processes the clusters obtained from DBSCAN to resolve initial clusters that overlap in time.
However, validation methods based on synthetic data [9] show that these algorithms can omit,
merge, or split stops depending on the choice of ε and the sparsity of the data.

Fine parameters (small ε) may completely miss a stop at a large location, whereas coarse parameters (large ε) may fail to differentiate stops at small neighboring locations [3]. Because venues differ in typical stop duration and area, the appropriate parameters can differ from venue to venue [9]. To address this parameter selection limitation, we propose a spatiotemporal variation of Hierarchical DBSCAN [5], ST-HDBSCAN. Unlike DBSCAN, which relies on a single density threshold to cluster points, our variation constructs separate structures for spatial and temporal distances that preserve density-based connections in both dimensions. When the hierarchical tree used for cluster formation is pruned, this accounts for varying spatiotemporal densities. As a result, clusters emerge without requiring specific time and space thresholds, and the method works across different levels of data sparsity.
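The trade-off between fine and coarse ε can be illustrated with a toy example. The sketch below is not DBSCAN itself: it uses a simplified single-linkage grouping with a distance cutoff `eps` as a stand-in, and the function name `chain_groups`, the coordinates, and `min_pts` are all hypothetical choices made for illustration.

```python
import math

def chain_groups(points, eps, min_pts=3):
    """Chain together points closer than eps; keep groups with >= min_pts points.

    Simplified stand-in for a density-based method with a single threshold.
    """
    unvisited = set(range(len(points)))
    groups = []
    while unvisited:
        stack = [unvisited.pop()]
        group = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            group |= near
            stack.extend(near)
        if len(group) >= min_pts:
            groups.append(sorted(group))
    return groups

# Hypothetical pings (meters): one large venue with sparse pings,
# and two small venues 20 m apart with dense pings.
large_venue = [(x, 0.0) for x in (0, 30, 60, 90, 120)]
small_a = [(x, 0.0) for x in (500, 504, 508)]
small_b = [(x, 0.0) for x in (528, 532, 536)]
pings = large_venue + small_a + small_b

fine = chain_groups(pings, eps=10)    # small venues found, large venue missed
coarse = chain_groups(pings, eps=40)  # large venue found, small venues merged
print([len(g) for g in fine], [len(g) for g in coarse])
```

With the fine cutoff the large venue never reaches `min_pts` and is dropped; with the coarse cutoff the two small venues collapse into one group. No single value recovers all three stops in this toy setup.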

In [37]:
%load_ext autoreload
%autoreload


In [38]:
import pandas as pd
import numpy as np
from datetime import timedelta
import pygeohash as gh
import geopandas as gpd
from matplotlib import cm
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from pyproj import Transformer
import heapq
from collections import defaultdict

In [39]:
import nomad.io.base as loader
import nomad.constants as constants
import nomad.stop_detection.hdbscan as HDBSCAN
import nomad.filters as filters
import nomad.city_gen as cg

In [6]:
traj_cols = {'user_id':'uid',
             'datetime':'local_datetime',
             'latitude':'latitude',
             'longitude':'longitude'}

data = loader.from_file("../../nomad/data/gc_sample.csv")

In [7]:
# We create a time offset column with different UTC offsets (in seconds)
data['tz_offset'] = 0
data.loc[data.index[:5000],'tz_offset'] = -7200
data.loc[data.index[-5000:], 'tz_offset'] = 3600

# build an offset-aware datetime string, then parse it (utc=True normalizes every value to UTC)
data['local_datetime'] = loader._unix_offset_to_str(data.timestamp, data.tz_offset)
data['local_datetime'] = pd.to_datetime(data['local_datetime'], utc=True)

# create x, y columns in web mercator
gdf = gpd.GeoSeries(gpd.points_from_xy(data.longitude, data.latitude),
                        crs="EPSG:4326")
projected = gdf.to_crs("EPSG:3857")
data['x'] = projected.x
data['y'] = projected.y
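For reference, the spherical Web Mercator forward projection that `to_crs("EPSG:3857")` applies can be sketched in a few lines. This is a minimal sketch, not geopandas' or pyproj's implementation; the function names `to_web_mercator` and `scale_factor` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857

def to_web_mercator(lon_deg, lat_deg):
    """Spherical Web Mercator forward projection (degrees -> meters)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y

def scale_factor(lat_deg):
    """Local length inflation of Web Mercator at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))

# First ping of the sample data (longitude, latitude)
x, y = to_web_mercator(-36.667334, 38.321711)
k = scale_factor(38.321711)
print(round(x), round(y), round(k, 3))
```

One consequence worth keeping in mind: Web Mercator inflates lengths by roughly 1 / cos(latitude), so at the latitude of this sample, distances computed from `x` and `y` are noticeably larger than ground distances. Any distance-based parameter expressed in meters inherits that inflation.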

In [8]:
data

| | uid | timestamp | latitude | longitude | tz_offset | local_datetime | x | y |
|---|---|---|---|---|---|---|---|---|
| 0 | wizardly_joliot | 1704119340 | 38.321711 | -36.667334 | -7200 | 2024-01-01 14:29:00+00:00 | -4.081789e+06 | 4.624973e+06 |
| 1 | wizardly_joliot | 1704119700 | 38.321676 | -36.667365 | -7200 | 2024-01-01 14:35:00+00:00 | -4.081792e+06 | 4.624968e+06 |
| 2 | wizardly_joliot | 1704155880 | 38.320959 | -36.666748 | -7200 | 2024-01-02 00:38:00+00:00 | -4.081724e+06 | 4.624866e+06 |
| 3 | wizardly_joliot | 1704156000 | 38.320936 | -36.666739 | -7200 | 2024-01-02 00:40:00+00:00 | -4.081723e+06 | 4.624863e+06 |
| 4 | wizardly_joliot | 1704156840 | 38.320924 | -36.666747 | -7200 | 2024-01-02 00:54:00+00:00 | -4.081724e+06 | 4.624861e+06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25830 | angry_spence | 1705303380 | 38.320399 | -36.667438 | 3600 | 2024-01-15 07:23:00+00:00 | -4.081801e+06 | 4.624787e+06 |
| 25831 | angry_spence | 1705303740 | 38.320413 | -36.667469 | 3600 | 2024-01-15 07:29:00+00:00 | -4.081804e+06 | 4.624789e+06 |
| 25832 | angry_spence | 1705303980 | 38.320384 | -36.667455 | 3600 | 2024-01-15 07:33:00+00:00 | -4.081802e+06 | 4.624785e+06 |
| 25833 | angry_spence | 1705304340 | 38.320349 | -36.667473 | 3600 | 2024-01-15 07:39:00+00:00 | -4.081804e+06 | 4.624780e+06 |
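Every `local_datetime` value in the output above carries a `+00:00` offset even though `tz_offset` differs between rows. That is the expected effect of `pd.to_datetime(..., utc=True)`, which converts offset-aware values to UTC. The standard-library sketch below reproduces the first row. The helper `unix_offset_to_iso` is hypothetical, and treating it as equivalent to `loader._unix_offset_to_str` is an assumption, since that function's output format is not shown here.

```python
from datetime import datetime, timedelta, timezone

def unix_offset_to_iso(ts, offset_seconds):
    """Format a Unix timestamp as local time with an explicit UTC offset.

    Hypothetical helper; the real loader function may format differently.
    """
    tz = timezone(timedelta(seconds=offset_seconds))
    return datetime.fromtimestamp(ts, tz).isoformat()

# First row of the sample: timestamp 1704119340, offset -7200 s
local = unix_offset_to_iso(1704119340, -7200)

# Normalizing to UTC recovers the value shown in the local_datetime column
utc = datetime.fromisoformat(local).astimezone(timezone.utc)

print(local)            # local wall-clock time with -02:00 offset
print(utc.isoformat())  # same instant expressed in UTC
```

If the local wall-clock time is needed downstream, the offset has to be kept in a separate column (as `tz_offset` is here), because a single pandas datetime column holds one timezone.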


In [9]:
user_sample = data.loc[data.uid == "angry_spence"]
user_sample = user_sample[['timestamp', 'x', 'y']]

In [21]:
user_sample

| | timestamp | x | y |
|---|---|---|---|
| 24139 | 1704104460 | -4.081702e+06 | 4.624871e+06 |
| 24140 | 1704104820 | -4.081697e+06 | 4.624867e+06 |
| 24141 | 1704104940 | -4.081696e+06 | 4.624866e+06 |
| 24142 | 1704105540 | -4.081698e+06 | 4.624865e+06 |
| 24143 | 1704105720 | -4.081699e+06 | 4.624866e+06 |
| ... | ... | ... | ... |
| 25830 | 1705303380 | -4.081801e+06 | 4.624787e+06 |
| 25831 | 1705303740 | -4.081804e+06 | 4.624789e+06 |
| 25832 | 1705303980 | -4.081802e+06 | 4.624785e+06 |
| 25833 | 1705304340 | -4.081804e+06 | 4.624780e+06 |


In [40]:
HDBSCAN._find_bursts(user_sample['timestamp'], 120, burst_col=True)

| | timestamp | burst_label |
|---|---|---|
| 0 | 1704104460 | 0 |
| 1 | 1704104820 | 0 |
| 2 | 1704104940 | 0 |
| 3 | 1704105540 | 0 |
| 4 | 1704105720 | 0 |
| ... | ... | ... |
| 1691 | 1705303380 | 29 |
| 1692 | 1705303740 | 29 |
| 1693 | 1705303980 | 29 |
| 1694 | 1705304340 | 29 |
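The implementation of `_find_bursts` is not shown in this notebook. One plausible reading of the output above is gap-based labeling: a new burst starts whenever the time between consecutive pings exceeds a threshold. The sketch below implements that reading only. The name `label_bursts` is hypothetical, and interpreting the argument `120` as minutes is an assumption, motivated by the first rows being up to 600 seconds apart while sharing label 0.

```python
def label_bursts(timestamps, max_gap_seconds):
    """Assign consecutive integer labels, starting a new burst after each long gap.

    Hypothetical sketch; not the nomad implementation.
    """
    labels = []
    current = 0
    prev = None
    for ts in timestamps:
        if prev is not None and ts - prev > max_gap_seconds:
            current += 1
        labels.append(current)
        prev = ts
    return labels

# First five timestamps from the sample, plus one made-up ping ~4 hours later
stamps = [1704104460, 1704104820, 1704104940, 1704105540, 1704105720,
          1704120000]

as_minutes = label_bursts(stamps, 120 * 60)  # threshold read as 120 minutes
as_seconds = label_bursts(stamps, 120)       # threshold read as 120 seconds
print(as_minutes, as_seconds)
```

Under the minutes reading the first five pings share one burst, as in the output above. Under the seconds reading they would be split, which is why the sketch assumes minutes.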


In [5]:
def hdbscan(mst_ext, min_cluster_size):
    hierarchy = []
    
    # 4.1 For the root of the tree assign all objects the same label (single “cluster”).
    all_pings = set()
    
    for u, v, _ in mst_ext:
        all_pings.add(u)
        all_pings.add(v)
        
    label_map = {ts: 0 for ts in all_pings} # {'t1':0, 't2':0, 't3':0}
    active_clusters = {0: set(all_pings)} # e.g. { 0: {'t1', 't2', 't3'} }
    
    # sort edges in decreasing order of weight
    mst_ext_sorted = sorted(mst_ext, key=lambda x: -x[2]) 
    
    # group edges by weight
    dendrogram_scales = defaultdict(list)
    for u, v, w in mst_ext_sorted:
        dendrogram_scales[w].append((u, v))

    current_label_id = max(label_map.values()) + 1

    # Iteratively remove all edges from MSText in decreasing order of weights
    # 4.2.1 Before each removal, set the dendrogram scale value of the current hierarchical level as the weight of the edge(s) to be removed.
    for scale, edges in dendrogram_scales.items():
        affected_clusters = set()
        edges_to_remove = []
        
        for u, v in edges:
            if label_map.get(u) != label_map.get(v): # if labels are different continue
                continue
            cluster_id = label_map[u]
            affected_clusters.add(cluster_id)
            edges_to_remove.append((u, v))

        # 4.2.2: For each affected cluster, reassign components
        for cluster_id in affected_clusters:
            if cluster_id == -1 or cluster_id not in active_clusters:
                continue  # skip noise or already removed clusters
            
            members = active_clusters[cluster_id]
    
            # build connectivity graph (excluding removed edges)
            G = _connectivity_graph(members, mst_ext, edges_to_remove)
            components = _connected_components(G)
            non_spurious = [c for c in components if len(c) >= min_cluster_size]

            # cluster has disappeared
            if not non_spurious:
                for ts in members:
                    label_map[ts] = -1  # noise
                del active_clusters[cluster_id]
            # cluster has just shrunk
            elif len(non_spurious) == 1:
                remaining = non_spurious[0]
                for ts in members:
                    label_map[ts] = cluster_id if ts in remaining else -1
                active_clusters[cluster_id] = remaining
            # true cluster split: multiple valid subclusters
            elif len(non_spurious) > 1:
                new_ids = []
                del active_clusters[cluster_id]
                # members left in spurious components become noise;
                # otherwise they would keep the label of the deleted parent
                for ts in members:
                    label_map[ts] = -1
                for comp in non_spurious:
                    for ts in comp:
                        label_map[ts] = current_label_id
                    active_clusters[current_label_id] = set(comp)
                    new_ids.append(current_label_id)
                    current_label_id += 1

                hierarchy.append((scale, cluster_id, new_ids))

    return {"label_map": label_map, "hierarchy": hierarchy}

def _connectivity_graph(nodes, edges, removed_edges):
    '''
    nodes: set of timestamps {t1,t2,t3} 
    edges: list of (u, v, w) tuples
    removed_edges: list of (u,v) tuples
    '''
    graph = defaultdict(set)
    removed_set = set(frozenset(e) for e in removed_edges)
    for u, v, _ in edges:
        if frozenset((u, v)) in removed_set:
            continue
        if u in nodes and v in nodes and u != v:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def _connected_components(graph):
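    '''
    graph: dict mapping each node to the set of its neighbors.
    Returns a list of sets, one per connected component. Nodes with no
    remaining edges are not keys of the graph and are therefore not returned.
    '''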
    seen = set()
    components = []

    for node in graph:
        if node in seen:
            continue
        stack = [node]
        comp = set()
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                comp.add(n)
                stack.extend(graph[n] - seen)
        components.append(comp)

    return components

In [11]:
core_distances = compute_core_distance(user_sample, 4)
mrd = compute_mrd(user_sample, core_distances)
mst_edges = mst(mrd)
mstext_edges = mst_ext(mst_edges, core_distances)
hdbscan(mstext_edges, 5)

KeyboardInterrupt: 
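The helpers called in the cell above (`compute_core_distance`, `compute_mrd`, `mst`, `mst_ext`) are not defined in this section. For orientation, the sketch below shows the standard HDBSCAN preprocessing steps they appear to correspond to: core distances, mutual reachability distance, a minimum spanning tree, and the extension with self-edges. It is a brute-force illustration with hypothetical names, it uses only planar Euclidean distance, and it ignores the separate temporal structure that ST-HDBSCAN adds. It also assumes the convention that the core distance excludes the point itself.

```python
import heapq
import math

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest other point."""
    core = {}
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        core[i] = dists[min_samples - 1]
    return core

def mutual_reachability(points, core, i, j):
    """max(core(i), core(j), d(i, j)), the standard HDBSCAN edge weight."""
    return max(core[i], core[j], math.dist(points[i], points[j]))

def prim_mst(points, core):
    """Prim's algorithm over the complete mutual-reachability graph."""
    n = len(points)
    visited = {0}
    heap = [(mutual_reachability(points, core, 0, j), 0, j)
            for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(visited) < n:
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        edges.append((u, v, w))
        for k in range(n):
            if k not in visited:
                heapq.heappush(
                    heap, (mutual_reachability(points, core, v, k), v, k))
    return edges

def extend_mst(edges, core):
    """Add a self-edge per point, weighted by its core distance."""
    return edges + [(i, i, d) for i, d in core.items()]

# Two hypothetical square-shaped stops, far apart
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
core = core_distances(pts, 2)
tree = prim_mst(pts, core)
tree_ext = extend_mst(tree, core)
heaviest = max(w for _, _, w in tree)
print(len(tree), len(tree_ext), round(heaviest, 3))
```

The resulting `(u, v, w)` tuples have the same shape that `hdbscan(mst_ext, min_cluster_size)` iterates over. Both the core-distance and the MST steps here evaluate all pairs of points, which scales quadratically; with about 1,700 pings that cost may be a reason a run takes long enough to be interrupted, although the source does not say which step was slow.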