In [1]:
library("TSdist")
library("proxy")

"package 'TSdist' was built under R version 3.6.1"
Loading required package: proxy
"package 'proxy' was built under R version 3.6.1"
Attaching package: 'proxy'

The following objects are masked from 'package:stats':

    as.dist, dist

The following object is masked from 'package:base':

    as.matrix

Registered S3 method overwritten by 'xts':
  method     from
  as.zoo.xts zoo 
Loaded TSdist v3.5. See ?TSdist for help, citation("TSdist") for use in publication.



The main function here is similarity_TS.
* Input:
    * A data matrix of size $(\sum_{i=1}^{n}T_{i}) \times p$, with the $n$ samples stacked row-wise and one column per feature.
    * One of the following:
        * len_time: the sequence of lengths $T_{i}$, $i=1,2,\dots,n$; each $T_{i}$ is an integer.
        * time_cut: the first time index of each sample $(1,T_{1}+1,T_{1}+T_{2}+1,\dots)$ (not implemented yet).
    * type: the measure used to compute the distance. "L2" uses the Euclidean distance; "cor" skips normalization and returns the length-weighted average correlation matrix directly, so its entries can be negative; any other value is passed to TSdist as a distance name.
        
* Method:
    * Z-score normalize each time series separately, e.g. times 1 to $T_{1}$ of feature 1 form one series.
    * For each pair of features, compute the distance given by type within each of the $n$ samples, denoted by $d_{i}$.
    * Average the per-sample distances, weighting each by its length:
    $$d = \frac{\sum_{i=1}^{n}d_{i}T_{i}}{\sum_{i=1}^{n}T_{i}}$$
    * The similarity between the two features is then $$\frac{1}{1+d}$$
    
* Output: The **similarity** matrix (inversely related to distance) between features (time series generators), with entries in $(0,1]$. Entries are kept at most 1 because the matrix is later raised to the power $\beta$.
* Remark: 
    * (Important!) If any single time series (one feature within one sample) is constant, normalized_X will contain NA values, because $\sigma = 0$ in the Z-score normalization. This case should be handled in the data processing step.
    * Later we may need the actual time stamps (instead of the index of an observation within its time series), since some measures require them.
    * After this step, the similarity is raised to the power $\beta$ and then used to calculate the TOM.
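
The Method steps can be traced by hand for a single pair of features. The following is a minimal base-R sketch, not the function defined below: the vectors `x` and `y` are illustrative values for two features over $n=2$ samples with $T_1=3$ and $T_2=2$, and the Euclidean distance stands in for type.

```r
# Two features observed over n = 2 samples with lengths T1 = 3 and T2 = 2
# (illustrative values, stacked sample after sample)
x <- c(1, 1, 1.1, 1.2, 1.3)
y <- c(1, 0.5, 1.5, 5, 15)
len_time <- c(3, 2)

# First row index of each sample: 1, T1 + 1, ...
starts <- cumsum(len_time) - len_time + 1

d_i <- numeric(length(len_time))
for (i in seq_along(len_time)) {
  rows <- starts[i]:(starts[i] + len_time[i] - 1)
  # Z-score each series within the sample (sd() uses the n - 1 denominator,
  # the same convention as scale())
  zx <- (x[rows] - mean(x[rows])) / sd(x[rows])
  zy <- (y[rows] - mean(y[rows])) / sd(y[rows])
  # Per-sample Euclidean distance d_i
  d_i[i] <- sqrt(sum((zx - zy)^2))
}

# Length-weighted average distance and the resulting similarity
d <- sum(d_i * len_time) / sum(len_time)
sim <- 1 / (1 + d)
print(c(d = d, similarity = sim))
```

In the second sample both series have only two points and move in the same direction, so their z-scores coincide and that sample contributes a distance of zero; the weighting by $T_i$ keeps such a short sample from dominating the average.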

# Functions

In [44]:
# Z-score normalize each time series (one sample's block of rows, per column),
# not each full column
normalize_TS = function(X,len_time){
    normalized_X = matrix(0,dim(X)[1],dim(X)[2])
    flag = 1 # first row index of the current time series
    for (t in len_time){
        rows = flag:(flag+t-1)
        # drop=FALSE keeps the block a matrix even when t == 1
        normalized_X[rows,] = scale(X[rows,,drop=FALSE])
        flag = flag + t
    }
    return (normalized_X)
}

# The main function:
# compute the feature-by-feature similarity matrix with the measure given by type
similarity_TS = function(X,len_time,type){
    p = dim(X)[2] # num of features
    n = length(len_time) # number of samples
    
    ## type = "cor": no normalization needed ##
    if (type == "cor"){
        # length-weighted average of the per-sample correlation matrices
        if (n==1){
            cor_mat = cor(X)
        }else{
            # accumulate the length-weighted sum elementwise; initialize with the first sample
            cor_mat = cor(X[1:len_time[1],])*len_time[1]
            flag = 1+len_time[1] # the first time index of the current sample
            for (sample in 2:n){
                cor_mat = (cor_mat + 
            cor(X[flag:(flag+len_time[sample]-1),])*len_time[sample])
                
                flag = flag+len_time[sample]
            }
            cor_mat = cor_mat/(sum(len_time))
        }     
           
        return (cor_mat)
        } # end "cor"

    X = normalize_TS(X,len_time) # normalize time series
    # check for missing values (possibly due to sigma = 0)
    tmp = sum(is.na(X))
    if(tmp>0){
        stop(sprintf("There are %d missing values in normalized X",tmp))
    }
    
    # compute pairwise similarity with the chosen measure
    
    # "L2": dist(M) computes distances between the rows of M, hence t(X)
    if (type == "L2"){
        # first compute the length-weighted average distance
        if (n==1){
            dist_obj = dist(t(X),method = "euclidean")
        }else{
            # accumulate the length-weighted sum elementwise; initialize with the first sample
            dist_obj = dist(t(X[1:len_time[1],]))*len_time[1]
            flag = 1+len_time[1] # the first time index of the current sample
            for (sample in 2:n){
                dist_obj = (dist_obj + 
            dist(t(X[flag:(flag+len_time[sample]-1),]))*len_time[sample])
                
                flag = flag+len_time[sample]
            }
            dist_obj = dist_obj/(sum(len_time))
        }     
    
        # transform dist to similarity
        sim_obj = 1/(1+dist_obj)
        # similarity matrix
        sim_matrix = as.matrix(sim_obj)
        diag(sim_matrix) = 1
        
        return (sim_matrix)
        } # end "L2"
    
    # ("cor" is handled before normalization above and has already returned)
    
    # Other measures: use proxy's dist() with the TSdist package
    # Note: these are much slower than the "L2" path
    if (n==1){
            dist_obj = dist(t(X),method="tsDistances",distance=type)
        }else{
            # accumulate the length-weighted sum elementwise; initialize with the first sample
            dist_obj = dist(t(X[1:len_time[1],]),
                            method="tsDistances",distance=type)*len_time[1]
            flag = 1+len_time[1] # the first time index of the current sample
            for (sample in 2:n){
                dist_obj = (dist_obj + 
            dist(t(X[flag:(flag+len_time[sample]-1),]),
                   method="tsDistances",distance=type)*len_time[sample])
                
                flag = flag+len_time[sample]
            }
            dist_obj = dist_obj/(sum(len_time))
        }
    
    # transform dist to similarity
    sim_obj = 1/(1+dist_obj)
    # similarity matrix
    sim_matrix = as.matrix(sim_obj)
    diag(sim_matrix) = 1
        
    return (sim_matrix)
    
    
    }
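
The same pattern appears three times in similarity_TS: compute something on each sample's block of rows, weight it by $T_i$, and divide by $\sum_i T_i$. As a sketch only, that pattern could be factored into one helper. The name `weighted_segment_average` and its argument `f` are hypothetical and not part of the notebook.

```r
# Hypothetical helper: apply f to each sample's block of rows and
# return the length-weighted average of the results.
weighted_segment_average <- function(X, len_time, f) {
  starts <- cumsum(len_time) - len_time + 1  # first row of each sample
  total <- 0
  for (i in seq_along(len_time)) {
    rows <- starts[i]:(starts[i] + len_time[i] - 1)
    # drop = FALSE keeps the block a matrix even for a single row
    total <- total + f(X[rows, , drop = FALSE]) * len_time[i]
  }
  total / sum(len_time)
}

# Small 5 x 3 example with two samples of lengths 3 and 2
X_small <- cbind(c(1, 1, 1.1, 1.2, 1.3),
                 c(2, 2, 2.2, 2.0, 2.2),
                 c(1, 0.5, 1.5, 5, 15))
len_small <- c(3, 2)

# Length-weighted average correlation, as in the "cor" branch
cor_avg <- weighted_segment_average(X_small, len_small, cor)
print(round(cor_avg, 7))
```

With a helper like this, the "L2" branch would pass `function(M) stats::dist(t(M))` on the normalized matrix instead of `cor`; the single-sample case needs no special handling because the weights cancel.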

# For Testing

In [42]:
# For test1 (small)
# data matrix n=2 p=3
X = matrix(0,5,3)
len_time = c(3,2) # T1=3, T2=2
X[1:3,1] = c(1,1,1.1)
X[1:3,2] = c(2,2,2.2)
X[1:3,3] = c(1,0.5,1.5)
X[4:5,1] = c(1.2,1.3)
X[4:5,2] = c(2,2.2)
X[4:5,3] = c(5,15)
X

0,1,2
1.0,2.0,1.0
1.0,2.0,0.5
1.1,2.2,1.5
1.2,2.0,5.0
1.3,2.2,15.0
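
The first feature above is close to constant in its first three rows. If it were exactly constant, the normalization would produce missing values, and similarity_TS would only report how many. A minimal sketch of a pre-check that locates such segments before normalizing follows; the helper name `find_constant_segments` is hypothetical, and the data is a separate illustrative matrix.

```r
# Hypothetical helper: list (sample, feature) pairs whose time series has
# zero or undefined standard deviation and would break z-scoring.
find_constant_segments <- function(X, len_time) {
  starts <- cumsum(len_time) - len_time + 1  # first row of each sample
  hits <- NULL
  for (i in seq_along(len_time)) {
    rows <- starts[i]:(starts[i] + len_time[i] - 1)
    s <- apply(X[rows, , drop = FALSE], 2, sd)
    # sd is NA for a single observation and 0 for a constant series
    bad <- which(is.na(s) | s == 0)
    if (length(bad) > 0) {
      hits <- rbind(hits, cbind(sample = i, feature = bad))
    }
  }
  hits  # NULL when every segment can be normalized
}

# Illustrative data: feature 1 is constant within the first sample
X_check <- cbind(c(1, 1, 1, 2, 3),
                 c(1, 2, 3, 4, 5))
hits <- find_constant_segments(X_check, c(3, 2))
print(hits)
```

The check uses an exact comparison with zero, so a series that is constant only up to floating-point noise would pass it; a small tolerance may be preferable on real data.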


In [50]:
similarity_TS(X,len_time,"cor")

0,1,2
1.0,1.0,0.9196152
1.0,1.0,0.9196152
0.9196152,0.9196152,1.0


In [25]:
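# The output below comes from a 100-feature data set (as in test2); only the first 10 rows are shown
similarity_TS(X,len_time,"L2")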

1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
1.0000000,0.2712880,0.2708736,0.2666440,0.2738338,0.2756299,0.2680209,0.2675231,0.2746991,0.2686042,...,0.2703564,0.2746343,0.2669990,0.2807680,0.2651419,0.2681634,0.2762139,0.2736541,0.2754371,0.2697074
0.2712880,1.0000000,0.2695106,0.2590738,0.2662311,0.2705772,0.2626558,0.2686639,0.2688095,0.2759955,...,0.2730886,0.2760505,0.2716959,0.2779832,0.2831031,0.2655002,0.2640211,0.2605827,0.2829801,0.2682277
0.2708736,0.2695106,1.0000000,0.2685960,0.2578985,0.2721030,0.2603473,0.2687779,0.2667065,0.2696955,...,0.2675862,0.2725721,0.2622864,0.2593892,0.2748874,0.2736117,0.2763455,0.2693900,0.2793047,0.2653180
0.2666440,0.2590738,0.2685960,1.0000000,0.2724580,0.2665557,0.2791955,0.2700945,0.2706909,0.2638446,...,0.2723927,0.2673047,0.2648090,0.2760754,0.2711149,0.2656687,0.2692360,0.2695154,0.2638088,0.2711493
0.2738338,0.2662311,0.2578985,0.2724580,1.0000000,0.2711123,0.2741979,0.2749225,0.2739001,0.2585896,...,0.2723641,0.2883240,0.2701709,0.2635820,0.2693028,0.2660517,0.2711135,0.2789905,0.2642208,0.2693667
0.2756299,0.2705772,0.2721030,0.2665557,0.2711123,1.0000000,0.2707531,0.2777763,0.2633576,0.2741308,...,0.2719840,0.2699428,0.2839706,0.2758510,0.2658374,0.2672773,0.2687347,0.2718709,0.2681811,0.2707329
0.2680209,0.2626558,0.2603473,0.2791955,0.2741979,0.2707531,1.0000000,0.2706527,0.2656564,0.2608681,...,0.2793622,0.2694121,0.2662464,0.2694506,0.2680143,0.2608927,0.2726216,0.2844767,0.2698562,0.2636083
0.2675231,0.2686639,0.2687779,0.2700945,0.2749225,0.2777763,0.2706527,1.0000000,0.2603744,0.2755307,...,0.2666572,0.2706184,0.2714903,0.2649002,0.2598555,0.2641490,0.2636968,0.2603846,0.2735281,0.2733314
0.2746991,0.2688095,0.2667065,0.2706909,0.2739001,0.2633576,0.2656564,0.2603744,1.0000000,0.2636202,...,0.2662694,0.2731825,0.2586833,0.2740684,0.2685530,0.2679858,0.2737455,0.2670148,0.2622098,0.2671355
0.2686042,0.2759955,0.2696955,0.2638446,0.2585896,0.2741308,0.2608681,0.2755307,0.2636202,1.0000000,...,0.2699634,0.2772925,0.2744138,0.2658088,0.2633177,0.2590284,0.2689154,0.2587223,0.2718973,0.2587170
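
The off-diagonal entries above all sit near 0.27. Assuming this matrix was computed on independent Gaussian noise with 5 time points per sample (as in the test2 setup), that is roughly what one would expect. For two series z-scored with the $n-1$ denominator over $T$ points, the Euclidean distance is determined by their Pearson correlation $r$, namely $d_i = \sqrt{2(T-1)(1-r)}$. The sketch below checks that identity numerically and then simulates the typical similarity for $T=5$; the seed and the number of replications are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
T_len <- 5   # time points per sample (assumed, as in test2)

# Check the identity on one random pair of series
x <- rnorm(T_len)
y <- rnorm(T_len)
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
d_direct <- sqrt(sum((zx - zy)^2))
d_from_r <- sqrt(2 * (T_len - 1) * (1 - cor(x, y)))
print(c(direct = d_direct, from_cor = d_from_r))

# Typical distance between independent noise series, averaged over many samples
r <- replicate(2000, cor(rnorm(T_len), rnorm(T_len)))
d_mean <- mean(sqrt(2 * (T_len - 1) * (1 - r)))
sim_typical <- 1 / (1 + d_mean)
print(sim_typical)
```

Under these assumptions the simulated similarity should land close to the values in the matrix, which suggests that the "L2" similarities of unrelated features carry little contrast when each sample is this short.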


In [26]:
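# For test2 (bigger)
n = 100 # number of samples
T_len = 5 # time points per sample (T_len rather than T, since T is an alias for TRUE in R)
p = 100 # number of features
len_time = rep(T_len,n)
X = matrix(rnorm(n*T_len*p),n*T_len,p) # (n*T_len) x p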

In [11]:
system.time(similarity_TS(X,len_time,"L2"))

   user  system elapsed 
   0.52    0.03    0.54 

# Test of speed

In [22]:
system.time(dist(t(X),method="tsDistances",distance="euclidean"))

   user  system elapsed 
   4.70    0.36    6.14 

In [24]:
system.time(dist(t(X)))

   user  system elapsed 
   0.03    0.00    0.03 

In [23]:
system.time(TSDatabaseDistances(t(X),distance="euclidean"))

   user  system elapsed 
   4.79    0.14    5.89 
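
Each timing above comes from a single system.time() call, which can vary noticeably from run to run. A minimal sketch for steadier comparisons is to repeat the call and take the median elapsed time. The helper `time_median` and the matrix `X_demo` are hypothetical, and only the base `stats::dist` path is timed so that the sketch runs without TSdist.

```r
# Hypothetical helper: run f() several times and return the median elapsed time
time_median <- function(f, reps = 5) {
  elapsed <- vapply(seq_len(reps),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

set.seed(1)  # arbitrary seed
X_demo <- matrix(rnorm(500 * 100), 500, 100)  # 500 time points x 100 features

# stats::dist is called explicitly because loading proxy masks dist()
t_base <- time_median(function() stats::dist(t(X_demo)))
print(t_base)
```

The same helper could wrap the tsDistances and TSDatabaseDistances calls from the cells above to compare all three paths on equal footing.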