Takens' Parameter Search - Potential bug/enhancement #605

Open
jcoll3 opened this issue Sep 14, 2021 · 5 comments

Labels: discussion (Discussion required)

jcoll3 commented Sep 14, 2021

Hello,

I am currently experimenting with time series classifiers using Takens' embedding. The function takens_embedding_optimal_parameters is behaving somewhat differently than I expected. I am getting the following error.

ValueError: Not enough time stamps (176) to produce at least one 7-dimensional vector under the current choice of time delay (30).

I understand why this is happening, but it seems like this parameter combination should simply be skipped instead of raising an error. I could just reduce max dimension and/or max delay, but each of these values is valid when the other is small enough, so shrinking either one would needlessly narrow the search.
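
For reference, a quick back-of-the-envelope check of the numbers in the error message (a delay vector of dimension d with time delay tau spans (d - 1)*tau + 1 time stamps):

n_timestamps = 176
dimension = 7
time_delay = 30

span = (dimension - 1) * time_delay + 1  # 181, which exceeds 176, hence the error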

It looks like the issue arises here:

X_embedded = _time_delay_embedding(X, time_delay=time_delay,

Maybe there could be a check on line 55, or potentially a flag for when _time_delay_embedding is being used in a parameter search.

Thanks,
Joe

ulupo commented Sep 14, 2021

Thanks for the report @jcoll3 ! I agree that this seems like undesirable behaviour. May I just squeeze a minimal reproducible example out of you to make sure we're talking about the same thing?

ulupo added the discussion (Discussion required) label Sep 14, 2021

jcoll3 commented Sep 14, 2021

Thanks for the fast reply @ulupo! Here is a small example that causes the behavior.

import numpy as np
from gtda.time_series import takens_embedding_optimal_parameters

time_series = np.arange(200)
max_delay = 30
max_dim = 30
stride = 1

# Raises ValueError: the search tries (time_delay, dimension) pairs whose
# delay vectors need more than 200 time stamps
takens_embedding_optimal_parameters(time_series, max_delay, max_dim, stride, n_jobs=2)

ulupo commented Sep 15, 2021

Thanks @jcoll3 ! I can indeed reproduce the bad behaviour.

Personally, I think we should pre-empt such situations by restricting the range of the for loop in

for dim in range(1, max_dimension + 3))

Specifically, I think we need to replace max_dimension with max_dimension_ where

max_dimension_  = min(max_dimension,
                      (X.shape[1] - 1) // time_delay - 1)

This is because we actually need embeddings up to and including dimension max_dimension_ + 2 to exist.
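
In other words, something along these lines (just a sketch, not the exact code in the module):

# Sketch only: cap the dimensions explored so that even a
# (max_dimension_ + 2)-dimensional embedding fits in the series, i.e.
# (max_dimension_ + 1) * time_delay + 1 <= X.shape[1]
max_dimension_ = min(max_dimension,
                     (X.shape[1] - 1) // time_delay - 1)
for dim in range(1, max_dimension_ + 3):
    ...  # same per-dimension computation as before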

Incidentally, one could (and should) argue that checks should also be applied before the mutual information for loop in

for time_delay in range(1, max_time_delay + 1))

What do you think?

ulupo commented Sep 15, 2021

Actually, I think the situation is even stricter than what I said above. To avoid errors in the line

distances, indices = neighbor.kneighbors(X_embedded)

one would also need to make sure that X_embedded has at least three points -- because we are calling NearestNeighbors with 2 neighbours per sample. But this is another easy check involving the stride parameter.

ulupo self-assigned this Sep 15, 2021

jcoll3 commented Sep 15, 2021

I think we're on the right track. I worked out the math for the minimum length needed to get 3 samples from any (stride, delay, dimension) combination. The third vector starts at X[2*stride]. Each vector spans (dimension-1)*time_delay+1 points. So we get the inequality (dimension-1)*time_delay + 1 + 2*stride <= len(X).

Then _max_dimension = min(max_dimension, (len(X) - 2*stride - 1)//time_delay + 1). You can use the following snippet to test that the math is right; it always yields exactly 3 embedded points.

import numpy as np
from gtda.time_series import SingleTakensEmbedding

params_are_valid = True
debug = False
for time_delay in range(1, 10):
    for dim in range(2, 10):
        for stride in range(1, 10):
            # Shortest series satisfying (dim - 1)*time_delay + 1 + 2*stride <= len(X)
            time_series = np.arange((dim - 1) * time_delay + 1 + stride * 2)
            last_point = time_series[-1]
            embedder = SingleTakensEmbedding(
                parameters_type="fixed",
                time_delay=time_delay,
                dimension=dim,
                stride=stride,
            )
            embedding = embedder.fit_transform(time_series)

            # Expect exactly 3 embedded vectors, the last one ending at the final point
            params_are_valid = params_are_valid and (len(embedding) == 3) and (embedding[-1, -1] == last_point)
            if debug:
                print("time_series: ", time_series)
                print("time_delay: ", time_delay)
                print("dim: ", dim)
                print("stride: ", stride)
                print("embedding len: ", len(embedding))
                print("embedding: ", embedding)
                print()

print(params_are_valid)

The only remaining issue I see is that the following loops add to max_dimension, so I am currently using _max_dimension = min(max_dimension, ((len(X) - 2*stride - 1)//time_delay + 1) - 3). Notice the trailing -3, left separate for emphasis. This feels a bit hacky to me because I don't fully understand what the for loops are doing. Thoughts?

for dim in range(1, max_dimension + 3))

for dim in range(2, max_dimension + 1)]
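
For concreteness, here is roughly what I am doing locally in the meantime (a rough sketch; the function name is mine, not the library's):

def _capped_max_dimension(X, time_delay, stride, max_dimension):
    # At least three embedded vectors exist whenever
    # (dim - 1) * time_delay + 1 + 2 * stride <= len(X)
    largest_valid_dim = (len(X) - 2 * stride - 1) // time_delay + 1
    # Trailing -3 because the loops above go beyond max_dimension itself
    return min(max_dimension, largest_valid_dim - 3)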
