## DTW Time-Series Clustering Demo

This demo runs the full pipeline: it generates a simulated data set, imports it into Ananke, and then clusters it with one of three distance measures, namely short time-series (STS), dynamic time warping (DTW), or derivative dynamic time warping (DDTW). The final cell scores the results, searching across the tested parameter (epsilon) values for the clustering that best matches the ground truth.
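To make the three distance measures concrete, here is a minimal pure-Python sketch of each. It is an illustration only, not Ananke's implementation: the function names (`dtw_distance`, `derivative_estimate`, `ddtw_distance`, `slopes`, `sts_distance`) are hypothetical, the DTW is the exact quadratic-time dynamic program rather than any approximation a library might use, and the derivative estimate assumes the Keogh-Pazzani formula with the endpoints copied from their neighbours.

```python
import math

def dtw_distance(a, b):
    """Exact DTW with absolute difference as the local cost (O(len(a) * len(b)))."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def derivative_estimate(x):
    """Assumed Keogh-Pazzani estimate; needs len(x) >= 3. Endpoints copy their neighbours."""
    d = [((x[i] - x[i - 1]) + (x[i + 1] - x[i - 1]) / 2.0) / 2.0
         for i in range(1, len(x) - 1)]
    return [d[0]] + d + [d[-1]]

def ddtw_distance(a, b):
    """DDTW sketch: DTW applied to the estimated derivatives instead of the raw values."""
    return dtw_distance(derivative_estimate(a), derivative_estimate(b))

def slopes(x, t):
    """Slope of each segment between consecutive (possibly unevenly spaced) time points."""
    return [(x[k + 1] - x[k]) / (t[k + 1] - t[k]) for k in range(len(x) - 1)]

def sts_distance(a, b, t):
    """STS sketch: Euclidean distance between the two series' segment slopes."""
    return math.sqrt(sum((sa - sb) ** 2 for sa, sb in zip(slopes(a, t), slopes(b, t))))

# The same peak, shifted by one time point
base = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0, 0]
t = list(range(len(base)))

print(dtw_distance(base, shifted))
print(ddtw_distance(base, shifted))
print(sts_distance(base, shifted, t))
```

For this pair, DTW warps the shift away entirely and returns 0, while STS compares slopes point by point and returns sqrt(6), roughly 2.45. That difference is the reason a shifted signal can look near to its seed under DTW but far under STS.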

In [1]:
from ananke._database_rework import TimeSeriesData
from ananke._ts_simulation import *
from ananke._efficientcluster import auto_cluster, get_nearest_clusters
import os



In [19]:
#Tunable parameters
timepoints = [0,7,14,21,28,35,49,56,63,65,70,77,84,91,97,102,109,114]
nsamples = len(timepoints)
nclust = 50
nts_per_clust = 10
nnoise = 250
distance_measure = 'dtw'

In [20]:
if os.path.isfile("simulation.h5"):
    os.remove("simulation.h5")
tsdata = TimeSeriesData("simulation.h5")
dataset_names = tsdata.initialize_by_shape(timepoints=timepoints)

Creating required data sets in new HDF5 file at simulation.h5


In [39]:
sim = gen_table(fl_sig=0, w_sig=6,
                fl_bg=-6, w_bg=6,
                bg_disp_mu=0, bg_disp_sigma=1,
                sig_disp_mu2=0, sig_disp_sigma2=1,
                n_clust=nclust, n_sig=nts_per_clust, n_tax_sig=1, n_bg=nnoise,
                len_arima=100, len_ts=nsamples, len_signal=nsamples-5)
X = sim['table']
Y = sim['signals']

tsdata.register_timeseries(["ts%d" % (x,) for x in range(nclust*nts_per_clust+nnoise)])
#Store each simulated row of X as the time series with the matching index
for i, ts in enumerate(X):
    tsdata.set_timeseries_data(data=ts, index=i, action='replace')

In [51]:
#To re-cluster the same random data matrix X with another distance measure, clear the previous results first:
#tsdata.reset_clusters()

#Note: Each distance measure has its own distinct distribution of distances, so it's not possible
#to nail down a single range of epsilon that works across measures

#For Derivative Dynamic Time Warping distance:
#auto_cluster(tsdata, n_precompute = 0, param_min=0.0001, param_max = 0.001, param_step = 0.0001, distance_measure = distance_measure, n_threads = 3)
#For Dynamic Time Warping distance:
auto_cluster(tsdata, n_precompute = 0, param_min=0.1, param_max = 0.2, param_step = 0.01, distance_measure = distance_measure, n_threads = 3)
#For Short Time-Series Distance:
#auto_cluster(tsdata, n_precompute = 0, param_min=0.001, param_max = 0.01, param_step = 0.001, distance_measure = distance_measure, n_threads = 3)

Importing time-series matrix
Normalizing with sklearn
Initializing TimeSeries objects
[Progress counter: "N time series initialized and sorted", printed for N = 0, 1, 2, ...; output truncated]

After 1500 samples of the distances, the max distance was 1.884196
Computing for 0.190000




Computing for 0.180000
Computing for 0.170000
Computing for 0.160000
Computing for 0.150000
Computing for 0.140000
Computing for 0.130000
Computing for 0.120000
Computing for 0.110000
Computing for 0.100000
Found 0 neighbours
Pre-computation of most abundant sequences complete
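The log above steps epsilon from 0.19 down to 0.10, which is consistent with a half-open sweep over `[param_min, param_max)` in steps of `param_step`, visited in descending order. The sketch below reproduces that sequence under this assumption; `epsilon_grid` is a hypothetical helper, not a function from Ananke, and how `auto_cluster` actually enumerates its values is not confirmed here.

```python
def epsilon_grid(param_min, param_max, param_step):
    """Hypothetical helper: epsilon values in [param_min, param_max), largest first."""
    # Count steps with integers to avoid accumulating floating-point error
    n_steps = int(round((param_max - param_min) / param_step))
    values = [round(param_min + k * param_step, 6) for k in range(n_steps)]
    return list(reversed(values))

grid = epsilon_grid(param_min=0.1, param_max=0.2, param_step=0.01)
for epsilon in grid:
    print("Computing for %f" % epsilon)
```

Counting steps with an integer and rounding each value keeps the grid free of drift such as 0.15000000000000002, which matters if epsilon is later used as a lookup key.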


The code below takes the pure seed signals, uses them to "fish" for the nearest cluster in the graph, and then searches through the epsilon values to find the closest clustering scheme to the ground truth.

Each block of 10 consecutive seeds (for example, seeds 0-9) belongs to the same cluster, so for each seed we check whether we can "fish" out all 10 sequences of its cluster. This means that there are `nclust * nts_per_clust**2` possible true positives (TPs): each of the `nclust * nts_per_clust` seeds can fetch up to `nts_per_clust` within-cluster sequences. False positives (FPs) count the number of *other* sequences included in the best-scoring clustering scheme.
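The counting argument above can be checked with a few lines of arithmetic. This sketch assumes, as the scoring code does, that the time series are stored cluster by cluster, so the members of a seed's cluster occupy one contiguous block of indexes; `true_member_range` is a hypothetical helper named here for illustration.

```python
nclust = 50
nts_per_clust = 10

def true_member_range(seed_index, nts_per_clust):
    """Inclusive index range of the sequences sharing a cluster with the given seed."""
    minimum = seed_index - seed_index % nts_per_clust
    maximum = minimum + nts_per_clust - 1
    return minimum, maximum

# Every seed can recover at most nts_per_clust sequences (its whole cluster)
n_seeds = nclust * nts_per_clust
max_tp = n_seeds * nts_per_clust

print(true_member_range(0, nts_per_clust))
print(true_member_range(23, nts_per_clust))
print(max_tp)
```

Seed 23, for instance, maps to the range 20 through 29, and the maximum TP count for these settings is 5000.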

In [52]:
TP_total = 0
FP_total = 0
for i in range(Y.shape[0]):
    seed = Y[i,:]
    #Find the nearest cluster IDs for each epsilon
    clusters = get_nearest_clusters(tsdata, seed, distance_measure)
    #Indexes minimum..maximum hold the sequences from the same cluster as this seed
    minimum = i-i%nts_per_clust
    maximum = minimum + nts_per_clust - 1
    TP_max = 0
    FP_max = None #Reset per seed; None means no clustering scheme has been scored yet
    for epsilon, cluster_id in clusters:
        #Returns the indexes of the sequences in the nearest cluster
        cluster_member_indexes = tsdata.get_cluster(cluster_id, epsilon)
        TP = 0
        FP = 0
        for member in cluster_member_indexes:
            if minimum <= member <= maximum:
                TP += 1
            else:
                FP += 1
        #Prefer the scheme with the fewest FPs, breaking ties by the most TPs
        if FP_max is None or FP < FP_max:
            TP_max = TP
            FP_max = FP
        elif FP == FP_max and TP > TP_max:
            TP_max = TP
    TP_total += TP_max
    FP_total += FP_max if FP_max is not None else 0
print(TP_total)
print(FP_total)



3598
0


As an example, 50 clusters with 10 distinct time series each give 5000 possible TPs, which is a useful denominator for judging how this is performing; the run above recovers 3598 of them with 0 FPs. Initial results show that for simulations with shifts, STS is able to detect the expected proportion (the proportion expected with a shift of 0), but DTW does significantly better. DDTW seems to perform poorly, which may be because I'm using the wrong output values or the wrong parameter ranges for it.
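Turning the printed totals into proportions makes runs with different settings comparable. The sketch below uses the totals printed by the scoring cell (3598 TPs, 0 FPs) and the 5000-TP denominator; the variable names and the choice to report a recovered fraction and a precision are my own, not something the pipeline outputs.

```python
TP_total = 3598      # printed by the scoring cell
FP_total = 0         # printed by the scoring cell
max_tp = 50 * 10**2  # nclust * nts_per_clust**2

# Fraction of all possible within-cluster matches that were recovered
recovered_fraction = TP_total / max_tp

# Share of fetched sequences that belong to the seed's true cluster
fetched = TP_total + FP_total
precision = TP_total / fetched if fetched else float("nan")

print("recovered: %.4f" % recovered_fraction)
print("precision: %.4f" % precision)
```

For this DTW run the recovered fraction is about 0.72 and the precision is 1.0, so the best-scoring clusters contain no foreign sequences but miss roughly 28% of the true members.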