## DTW Time-Series Clustering Demo

This demo walks through a pipeline that generates a simulated data set, imports it into Ananke, and clusters it using STS, DTW, or DDTW distances. The final function then scores the results, searching across the parameter (epsilon) values for the clustering that best matches the ground truth.
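
To make the choice of distance concrete, here is a minimal sketch of two of the measures named above, written from their textbook definitions. It is not Ananke's implementation: `sts_distance` and `dtw_distance` are hypothetical helper names, and the absolute-difference cost in the DTW recursion is an assumption.

```python
import numpy as np

def sts_distance(x, y, t):
    """Short time-series (STS) distance: Euclidean distance between the
    piecewise slopes of two series sampled at the same time points t."""
    x, y, t = (np.asarray(a, dtype=float) for a in (x, y, t))
    dt = np.diff(t)
    return float(np.sqrt(np.sum((np.diff(x) / dt - np.diff(y) / dt) ** 2)))

def dtw_distance(x, y):
    """Classic O(n*m) dynamic time warping with an absolute-difference cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # cost[i, j] = cheapest alignment of x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

t = [1, 3, 4, 9]            # irregular sampling times, as in this demo
a = [0.0, 2.0, 3.0, 3.0]
b = [5.0, 7.0, 8.0, 8.0]    # same shape as a, shifted up by 5
c = [0.0, 0.0, 2.0, 3.0]    # same values as a, delayed by one step

print(sts_distance(a, b, t), dtw_distance(a, b))  # STS ignores the vertical offset
print(sts_distance(a, c, t), dtw_distance(a, c))  # DTW absorbs the time delay
```

STS compares slopes, so it is insensitive to a constant offset but sensitive to timing, while DTW aligns values and tolerates delays. DDTW applies the same warping recursion to estimated derivatives rather than to the raw values.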

# Setup

This demo is best run in a Conda kernel. If you have conda or miniconda, install the following packages:

 - `conda install plotly h5py`

In [6]:
from ananke._database_rework import TimeSeriesData
from ananke._ts_simulation import *
from ananke._efficientcluster import auto_cluster, get_nearest_cluster
from sklearn.metrics import adjusted_rand_score
import numpy as np
import random
import os

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import Scatter, Figure, Layout
init_notebook_mode(connected=True)

In [7]:
#Tunable parameters

nsamples = 180
timepoints = list(np.cumsum([random.randint(1,15) for i in range(nsamples)]))
nclust = 50
nts_per_clust = 10
#noise-to-signal ratio: 1 gives 1:1 (as many noise series as signal series); set to 2 for 2:1
nsr_ratio = 1
shift_amount = 0
signal_variance = 1
distance_measure = 'sts'
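
As a quick sanity check on these parameters, the sketch below rebuilds the irregular time axis and counts the series the simulation will contain. The `random.seed(0)` call is not in the original notebook; it is added here only so repeated runs produce the same time points.

```python
import random
import numpy as np

random.seed(0)  # hypothetical seed, added only for reproducibility
nsamples = 180
# Gaps of 1 to 15 time units between consecutive samples
gaps = [random.randint(1, 15) for _ in range(nsamples)]
timepoints = list(np.cumsum(gaps))

nclust, nts_per_clust, nsr_ratio = 50, 10, 1
n_signal = nclust * nts_per_clust   # series that carry a cluster signal
n_noise = nsr_ratio * n_signal      # background series with no signal
print(len(timepoints), n_signal, n_noise, n_signal + n_noise)
# → 180 500 500 1000
```

Because every gap is at least 1, the time points are strictly increasing, which is what an irregularly sampled series needs.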

In [8]:
if os.path.isfile("simulation.h5"):
    os.remove("simulation.h5")
tsdata = TimeSeriesData("simulation.h5")
dataset_names = tsdata.initialize_by_shape(timepoints=timepoints)

Creating required data sets in new HDF5 file at simulation.h5


In [9]:
nnoise = nsr_ratio*nclust*nts_per_clust
sim = gen_table(fl_sig=0, w_sig=6,
                fl_bg=-6, w_bg=6,
                bg_disp_mu=0, bg_disp_sigma=1,
                sig_disp_mu2=0, sig_disp_sigma2=signal_variance,
                n_clust=nclust, n_sig=nts_per_clust, n_tax_sig=1, n_bg=nnoise,
                len_arima=2*nsamples, len_ts=nsamples, len_signal=nsamples-shift_amount)
X = sim['table']
Y = sim['signals']

tsdata.register_timeseries(["ts%d" % (x,) for x in range(nclust*nts_per_clust+nnoise)])
#Copy each simulated row into the HDF5-backed store
for i, ts in enumerate(X):
    tsdata.set_timeseries_data(data=ts, index=i, action='replace')

In [25]:
#Clear previous results; run this to re-cluster the same random data matrix X,
#for example with another distance measure
tsdata.reset_clusters()

#Note: Each distance measure has its own distribution of distances, so no single
#range of epsilon works across all measures

#For Derivative Dynamic Time Warping distance:
#auto_cluster(tsdata, n_precompute = 0, param_min=0.01, param_max = 0.1, param_step = 0.01, distance_measure = distance_measure, n_threads = 3)
#For Dynamic Time Warping distance:
#auto_cluster(tsdata, n_precompute = 0, param_min=0.01, param_max = 0.25, param_step = 0.01, distance_measure = distance_measure, n_threads = 3)
#For Short Time-Series Distance:
auto_cluster(tsdata, n_precompute = 0, param_min=0.0001, param_max = 0.001, param_step = 0.0001, distance_measure = distance_measure, n_threads = 3)

Importing time-series matrix
Normalizing with sklearn
Initializing TimeSeries objects
1 time series initialized and sorted2 time series initialized and sorted3 time series initialized and sorted4 time series initialized and sorted5 time series initialized and sorted6 time series initialized and sorted7 time series initialized and sorted8 time series initialized and sorted9 time series initialized and sorted10 time series initialized and sorted11 time series initialized and sorted12 time series initialized and sorted13 time series initialized and sorted14 time series initialized and sorted15 time series initialized and sorted16 time series initialized and sorted17 time series initialized and sorted18 time series initialized and sorted19 time series initialized and sorted20 time series initialized and sorted21 time series initialized and sorted22 time series initialized and sorted23 time series initialized and sorted24 time series initialized and sorted25 time ser

Computing for 0.000800
7
Computing for 0.000700
6
Computing for 0.000600
5
Computing for 0.000500
4
Computing for 0.000400
3
Computing for 0.000300
2
Computing for 0.000200
1
Computing for 0.000100
0
Found 0 neighbours
Pre-computation of most abundant sequences complete


The code below takes the pure seed signals, uses them to "fish" for the nearest cluster in the graph, and then searches through the epsilon values to find the closest clustering scheme to the ground truth.


Seeds 0-9 all belong to the same cluster, so we check whether all 10 sequences can be "fished" out given only one seed. This means there are `nclust * nts_per_clust**2` possible true positives, reached when all 10 within-cluster sequences are retrieved from each of a cluster's 10 seeds. False positives (FPs) count the *other* sequences included in the best-scoring clustering scheme.
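
The scoring idea can be tried on a toy case before running it against the real store. The sketch below assumes the same layout as the simulation (signal series stored in contiguous blocks of `nts_per_clust` rows) and uses the adjusted Rand index from scikit-learn. `block_start` and `score_retrieval` are hypothetical helper names, and the retrieved index sets are made up for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

nts_per_clust = 10
nts = 30  # toy size: 3 clusters of 10 signal series each

def block_start(i, size=nts_per_clust):
    """First row index of the contiguous block that row i belongs to."""
    return i - i % size

def score_retrieval(seed_index, retrieved):
    """ARI between the seed's true block and a retrieved set of row indexes."""
    start = block_start(seed_index)
    # Label 0 = in the seed's cluster, label 1 = everything else
    ground_truth = np.ones(nts, dtype=int)
    ground_truth[start:start + nts_per_clust] = 0
    prediction = np.ones(nts, dtype=int)
    prediction[np.asarray(list(retrieved), dtype=int)] = 0
    return adjusted_rand_score(ground_truth, prediction)

perfect = score_retrieval(13, range(10, 20))   # exactly the seed's block
partial = score_retrieval(13, range(10, 15))   # misses half of the block
polluted = score_retrieval(13, range(10, 25))  # pulls in 5 outside series
print(perfect, partial, polluted)
```

Only the exact block scores 1.0. Both missing members and extra members lower the score, which is why the search keeps the maximum over epsilon for each seed.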

In [8]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def search_and_score(signal_matrix):
    signal_ari_scores = {}
    nts = signal_matrix.shape[0]
    for i in range(nts):
        seed = signal_matrix[i,:]
        #Find the nearest cluster IDs for each epsilon
        clusters = get_nearest_cluster(tsdata, seed, distance_measure, n_threads = 4)
        #Clusters are in contiguous chunks of nts_per_clust in the file
        #e.g., if there's 10 per cluster, the 0-9 will be cluster 1, 10-19 cluster 2, etc.
        minimum = i-i%nts_per_clust
        if minimum not in signal_ari_scores:
            signal_ari_scores[minimum] = np.array([])
        maximum = minimum + nts_per_clust
        ground_truth = np.ones(nts)
        ground_truth[minimum:maximum] = 0
        ari_scores = []
        for epsilon, cluster_id in clusters:
            #Returns the indexes of the sequences in the nearest cluster
            cluster_member_indexes = tsdata.get_cluster(cluster_id, epsilon)
            #Ensure that we don't try to score the noise
            # Should we score the noise? Including noise in a cluster should negatively impact the score...
            cluster_member_indexes = cluster_member_indexes[cluster_member_indexes < nts]
            prediction = np.ones(nts)
            prediction[cluster_member_indexes] = 0
            ari_scores.append(adjusted_rand_score(ground_truth, prediction))
        signal_ari_scores[minimum] = np.append(signal_ari_scores[minimum], max(ari_scores))
    return signal_ari_scores

signal_ari_scores = search_and_score(Y)
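#How well was each cluster retrieved? What is the best possible cluster that can be fished out from the data?
best_rand_per_seed = ["%d: %f" % (x, max(signal_ari_scores[x])) for x in signal_ari_scores]
#Mean of the best ARI over all clusters (best_rand_per_seed holds strings, so average the raw scores)
mean_best_ari = np.mean([max(scores) for scores in signal_ari_scores.values()])
best_rand_per_seed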

['0: 1.000000',
 '10: 1.000000',
 '20: 1.000000',
 '30: 1.000000',
 '40: 1.000000',
 '50: 1.000000',
 '60: 1.000000',
 '70: 1.000000',
 '80: 1.000000',
 '90: 1.000000',
 '100: 1.000000',
 '110: 1.000000',
 '120: 1.000000',
 '130: 1.000000',
 '140: 1.000000',
 '150: 1.000000',
 '160: 1.000000',
 '170: 1.000000',
 '180: 1.000000',
 '190: 1.000000',
 '200: 1.000000',
 '210: 1.000000',
 '220: 1.000000',
 '230: 1.000000',
 '240: 1.000000',
 '250: 1.000000',
 '260: 1.000000',
 '270: 1.000000',
 '280: 1.000000',
 '290: 1.000000',
 '300: 1.000000',
 '310: 1.000000',
 '320: 1.000000',
 '330: 1.000000',
 '340: 1.000000',
 '350: 1.000000',
 '360: 1.000000',
 '370: 1.000000',
 '380: 1.000000',
 '390: 1.000000',
 '400: 1.000000',
 '410: 1.000000',
 '420: 1.000000',
 '430: 1.000000',
 '440: 1.000000',
 '450: 1.000000',
 '460: 1.000000',
 '470: 1.000000',
 '480: 1.000000',
 '490: 1.000000']

In [26]:
def plot_cluster(seed_index):
    #Plot the seed signal, its observed series, and the nearest cluster at each epsilon
    signal = Y[seed_index,:]
    observed = tsdata._h5t["data/timeseries/matrix"][seed_index,:]
    clusters = get_nearest_cluster(tsdata, signal, distance_measure)
    for epsilon, cluster_id in clusters:
        #Each series is divided by its sum so that all traces share a scale
        data = [{'name':'signal', 'x': timepoints, 'y': signal/sum(signal)},
                {'name':'actual', 'x': timepoints, 'y': observed/sum(observed)}]
        cluster_member_indexes = tsdata.get_cluster(cluster_id, epsilon)
        for ts_id in cluster_member_indexes:
            ts = tsdata._h5t["data/timeseries/matrix"][ts_id,:]
            data.append({'name': str(ts_id), 'y': ts/sum(ts), 'x': timepoints})
        #One figure per epsilon value
        iplot(data)

In [27]:
plot_cluster(0)

In [10]:
#If you want to open the .h5 file somewhere else, this has to be released
del tsdata