Takens' Parameter Search - Potential bug/enhancement #605

Open
jcoll3 opened this issue Sep 14, 2021 · 5 comments

Labels: discussion (Discussion required)

jcoll3 commented Sep 14, 2021

Hello,

I am currently experimenting with time series classifiers using Takens' embedding. The function takens_embedding_optimal_parameters is behaving somewhat differently than I expected. I am getting the following error.

ValueError: Not enough time stamps (176) to produce at least one 7-dimensional vector under the current choice of time delay (30).

I understand why this is happening, but it seems like this parameter combination should simply be skipped instead of raising an error. I could just reduce max dimension and/or max delay, but each of these values is valid when the other is small enough, so shrinking either one would needlessly narrow the search.
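
For reference, a quick back-of-the-envelope check of the numbers in the error message (a delay vector of dimension d with time delay tau spans (d - 1)*tau + 1 time stamps):

n_timestamps = 176
dimension = 7
time_delay = 30

span = (dimension - 1) * time_delay + 1  # 181, which exceeds 176, hence the error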

It looks like the issue arises here:

X_embedded = _time_delay_embedding(X, time_delay=time_delay,

Maybe there could be a check on line 55, or potentially a flag for when _time_delay_embedding is being used in a parameter search.

Thanks,
Joe

ulupo commented Sep 14, 2021

Thanks for the report @jcoll3 ! I agree that this seems like undesirable behaviour. May I just squeeze a minimal reproducible example out of you to make sure we're talking about the same thing?

ulupo added the discussion (Discussion required) label Sep 14, 2021

jcoll3 commented Sep 14, 2021

Thanks for the fast reply @ulupo! Here is a small example that causes the behavior.

import numpy as np
from gtda.time_series import takens_embedding_optimal_parameters

time_series = np.arange(200)
max_delay = 30
max_dim = 30
stride = 1

# Raises ValueError: the search tries (time_delay, dimension) pairs whose
# delay vectors need more than 200 time stamps
takens_embedding_optimal_parameters(time_series, max_delay, max_dim, stride, n_jobs=2)

ulupo commented Sep 15, 2021

Thanks @jcoll3 ! I can indeed reproduce the bad behaviour.

Personally, I think we should pre-empt such situations by restricting the range of the for loop in

for dim in range(1, max_dimension + 3))

Specifically, I think we need to replace max_dimension with max_dimension_ where

max_dimension_  = min(max_dimension,
                      (X.shape[1] - 1) // time_delay - 1)

This is because we actually need embeddings up to and including dimension max_dimension_ + 2 to exist.
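
In other words, something along these lines (just a sketch, not the exact code in the module):

# Sketch only: cap the dimensions explored so that even a
# (max_dimension_ + 2)-dimensional embedding fits in the series, i.e.
# (max_dimension_ + 1) * time_delay + 1 <= X.shape[1]
max_dimension_ = min(max_dimension,
                     (X.shape[1] - 1) // time_delay - 1)
for dim in range(1, max_dimension_ + 3):
    ...  # same per-dimension computation as before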

Incidentally, one could (and should) argue that checks should also be applied before the mutual information for loop in

for time_delay in range(1, max_time_delay + 1))

What do you think?

ulupo commented Sep 15, 2021

Actually, I think the situation is even stricter than what I said above. To avoid errors in the line

distances, indices = neighbor.kneighbors(X_embedded)

one would also need to make sure that X_embedded has at least three points -- because we are calling NearestNeighbors with 2 neighbours per sample. But this is another easy check involving the stride parameter.

ulupo self-assigned this Sep 15, 2021

jcoll3 commented Sep 15, 2021

I think we're on the right track. I worked out the math for the minimum length needed to get 3 samples from any (stride, delay, dimension) combination. The third vector starts at X[2*stride]. Each vector spans (dimension-1)*time_delay+1 points. So we get the inequality (dimension-1)*time_delay + 1 + 2*stride <= len(X).

Then _max_dimension = min(max_dimension, (len(X) - 2*stride - 1)//time_delay + 1). You can use the following snippet to test that the math is right; it always yields exactly 3 embedded points.

import numpy as np
from gtda.time_series import SingleTakensEmbedding

params_are_valid = True
debug = False
for time_delay in range(1, 10):
    for dim in range(2, 10):
        for stride in range(1, 10):
            # Shortest series satisfying (dim - 1)*time_delay + 1 + 2*stride <= len(X)
            time_series = np.arange((dim - 1) * time_delay + 1 + stride * 2)
            last_point = time_series[-1]
            embedder = SingleTakensEmbedding(
                parameters_type="fixed",
                time_delay=time_delay,
                dimension=dim,
                stride=stride,
            )
            embedding = embedder.fit_transform(time_series)

            # Expect exactly 3 embedded vectors, the last one ending at the final point
            params_are_valid = params_are_valid and (len(embedding) == 3) and (embedding[-1, -1] == last_point)
            if debug:
                print("time_series: ", time_series)
                print("time_delay: ", time_delay)
                print("dim: ", dim)
                print("stride: ", stride)
                print("embedding len: ", len(embedding))
                print("embedding: ", embedding)
                print()

print(params_are_valid)

The only remaining issue I see is that the following loops add to max_dimension, so I am currently using _max_dimension = min(max_dimension, ((len(X) - 2*stride - 1)//time_delay + 1) - 3). Notice the trailing -3, left separate for emphasis. This feels a bit hacky to me because I don't fully understand what the for loops are doing. Thoughts?

for dim in range(1, max_dimension + 3))

for dim in range(2, max_dimension + 1)]
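
For concreteness, here is roughly what I am doing locally in the meantime (a rough sketch; the function name is mine, not the library's):

def _capped_max_dimension(X, time_delay, stride, max_dimension):
    # At least three embedded vectors exist whenever
    # (dim - 1) * time_delay + 1 + 2 * stride <= len(X)
    largest_valid_dim = (len(X) - 2 * stride - 1) // time_delay + 1
    # Trailing -3 because the loops above go beyond max_dimension itself
    return min(max_dimension, largest_valid_dim - 3)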
