
resample: ValueError: x and y arrays must be equal in length along interpolation axis. #361

Closed · adelavega opened this issue Jan 28, 2019 · 27 comments · Fixed by #568

@adelavega (Collaborator)

When trying to resample a dense variable (physio) with a sampling rate of 0.49999984750004656, the length of the x array given to interp1d is one longer than values. My guess is that this has to do with a rounding error from the funky sampling_rate.

@adelavega added the bug label Jan 28, 2019
@adelavega (Collaborator, Author)

Actually, looks like the issue is len(self.index) != len(self.values) for that variable. Not sure why though.

@adelavega (Collaborator, Author)

This traces back to _build_entity_index.
When calculating the index, it uses the following formula:
reps = int(math.ceil(run.duration * sampling_rate))

In this dataset, with duration=876.00031328201294 and sampling_rate=500, that works out to
438000.15664100647, which math.ceil rounds up to 438001.

The question is whether such aggressive ceiling rounding is fair here. I can imagine why only 438000 samples are provided. Maybe regular np.round is more appropriate, or perhaps this function could be made sensitive to the actual number of samples?
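
For reference, a minimal standalone sketch of the off-by-one (just the arithmetic above; not pybids code):

import math

duration = 876.00031328201294   # run duration in seconds
sampling_rate = 500             # Hz

n_samples = duration * sampling_rate       # 438000.15664100647
print(int(math.ceil(n_samples)))           # 438001 -> index one sample longer than the data
print(int(round(n_samples)))               # 438000 -> matches the number of values provided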

@tyarkoni (Collaborator)

I think changing it to round is probably fine. But either way, what we should probably do is add a check (called after every update or transformation) that makes sure that all of the internal arrays have the same length (i.e., duration, onset, amplitude, and values) for sparse variables, and that the index has the same length as the values for dense ones.
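
A minimal sketch of what such a check might look like (the function and attribute names here are hypothetical, not the actual pybids API):

def check_lengths(var):
    """Raise if a variable's internal arrays are inconsistent in length."""
    if hasattr(var, 'index'):
        # Dense-like variable: the index must match the values.
        if len(var.index) != len(var.values):
            raise ValueError('Index length (%d) does not match values length (%d).'
                             % (len(var.index), len(var.values)))
    else:
        # Sparse-like variable: onset, duration, amplitude, and values must agree.
        lengths = {len(var.onset), len(var.duration), len(var.amplitude), len(var.values)}
        if len(lengths) > 1:
            raise ValueError('onset, duration, amplitude, and values differ in length.')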

@adelavega (Collaborator, Author) commented Jan 29, 2019

Agreed. But what if it's off by one? Should this check adjust it? I can see either number of samples being valid (although round seems like a saner default to me if we had to choose).

@tyarkoni (Collaborator)

Yes, the idea would be to trim or pad one or the other (or at the very least, raise a more comprehensible exception). Probably the values should take priority, with the index adjusted to match their dimensions.

@effigies (Collaborator)

Possibly related to #355, #358?

@adswa (Contributor) commented Jan 29, 2019

I believe I ran into this (and referenced it incoherently in the lower end of #355)

@adelavega (Collaborator, Author) commented Jan 29, 2019

@effigies try changing the rounding from math.ceil to round in your PR (in _build_entity_index, but I think it occurs in at least one other place) to see if that fixes your issue. For issues such as mine (where the actual number of volumes is undetermined and falls between whole numbers), we'd need to implement a function that reconciles the number of volumes.

@effigies (Collaborator)

f936aed

@adelavega (Collaborator, Author)

This was fixed by #365

@adelavega (Collaborator, Author)

This is coming up for me again, in the context of densifying a sparse variable after HRF convolution (in #411), for a dataset with a TR of 2.00000061.

Has anyone else seen this issue crop up again? I thought #365 added a check to align the values even if there was a rounding mismatch.

@adelavega reopened this Mar 6, 2019
@adelavega (Collaborator, Author)

Looks like it's because, for me, this issue is arising at __init__ (the index is generated incorrectly there given a run_info) rather than in resample.

Since this happens in conjunction with Convolve, it's possible that the inaccuracy is actually occurring upstream. Either way, it seems that there should be a check here.

@tyarkoni (Collaborator) commented Mar 6, 2019

I was looking at the resampling code earlier today and spotted some potential problems. I'm not sure if this issue is coming from the same place (probably not), but FWIW, this line is probably broken:

# This...
x_new = np.linspace(0, n - 1, num=num)
# Should be something like...
x_new = np.linspace(0, n, num=num, endpoint=False)
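
A toy illustration of the difference, outside of pybids (hypothetical n and num, just to show the two grids):

import numpy as np

n, num = 10, 5   # downsample 10 samples to 5 points

old_grid = np.linspace(0, n - 1, num=num)              # [0.   2.25 4.5  6.75 9.  ]
new_grid = np.linspace(0, n, num=num, endpoint=False)  # [0. 2. 4. 6. 8.]

# With endpoint=False the spacing is a constant n / num, matching how sample
# times are laid out for a regularly sampled signal; the original call instead
# stretches the grid so that it lands exactly on the last sample.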

That would however probably require tweaking some things elsewhere in the code (e.g., in _build_entity_index()). So probably the whole resampling logic needs a thorough re-evaluation. Given that we need to update it to handle sparse acquisition (see here), it might be worth just re-implementing from scratch—I wrote this fairly hastily, and didn't give much thought to various edge cases.

If we really want to do this right, we may also want to consider selecting resampling methods intelligently based on data and target sampling rate (e.g., in cases where we're downsampling by a relatively small factor, we should maybe use scipy.signal.decimate instead of interp1d).
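
To illustrate the motivation (a rough sketch with made-up signal parameters, not a proposed implementation): decimate applies an anti-aliasing filter before downsampling, whereas plain interpolation just reads values off the interpolant and lets high-frequency content alias.

import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import decimate

sr_old, sr_new = 100, 10                     # downsample by a factor of 10
t_old = np.arange(0, 10, 1 / sr_old)
signal = np.sin(2 * np.pi * 0.5 * t_old) + np.sin(2 * np.pi * 22 * t_old)

t_new = np.arange(0, 10, 1 / sr_new)
via_interp = interp1d(t_old, signal)(t_new)         # the 22 Hz component aliases into the output
via_decimate = decimate(signal, sr_old // sr_new)   # low-pass filtered first, then downsampled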

@adelavega (Collaborator, Author)

I think for me it was simply a rounding error, so I changed my code to match, but it's not clear to me which is "correct". Probably worth looking at, and worth abstracting that logic into a single shared function for consistency (if/when resampling is overhauled).

@adelavega (Collaborator, Author)

@tyarkoni This resampling issue just happened to come up for me. In a run with 1033 binary "speech" events, resampling to 10 Hz results in an average value of around 0.3. But below 5 Hz (and especially at TR), lots of these values seem to get lost/averaged out. In the end, at sampling_rate = TR, only a single non-zero value remained. I'll look into it in more detail tomorrow.

@tyarkoni (Collaborator) commented Mar 7, 2019

How long are the events, and how widely are they spaced? This doesn't necessarily sound problematic—possibly just an indication that the sampling rate is too low (even at 10 Hz).

Also, are you talking about the convolved values, or just the raw values run through just the Resample transformation? The 0.3 thing would be weird if the latter, but seems plausible if the former.

@adelavega (Collaborator, Author) commented Mar 7, 2019

It only happens on raw binary values with very short durations (0.01-0.5 s). Usually they come in chunks. I don't think I'd noticed before because I always convolve this regressor.

I suppose the code for to_dense will omit values shorter than the sampling interval, whereas I expected some interpolation to occur.

Edit: The issue is actually that if an event's duration * sampling_rate is less than 0.5, it gets rounded down to 0 and the event is omitted:

durations = np.round(self.duration * sampling_rate).astype(int)
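
A small numeric illustration of that line (toy durations):

import numpy as np

sampling_rate = 0.5                                # Hz, i.e. one sample every 2 s
duration = np.array([0.01, 0.1, 0.5, 1.0, 2.0])    # event durations in seconds

print(np.round(duration * sampling_rate).astype(int))   # [0 0 0 0 1]
# Everything but the 2 s event rounds to zero samples and silently disappears.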

@tyarkoni (Collaborator) commented Mar 7, 2019

Yeah, that makes sense. The resampling/densification code needs a complete overhaul. I have a plan for it, but need to find the time. Bunch of things to review in the next week or so, but probably after that.

@adelavega (Collaborator, Author)

No rush, convolution "fixes" this for me for now.

@adelavega (Collaborator, Author) commented Jan 3, 2020

Proof of concept of the seemingly faulty interpolation in resample (this has diverged a lot from the original issue, so maybe a new issue is best).

Given the following SparseRunVariable:

   onset  amplitude  duration
0      1          1         1
1      3          1         1
2      5          1         1
3      7          1         1
4      9          0         1
5     11          1         1

applying to_dense(1) gives you the values that you'd expect:

0   0
1   1
2   0
3   1
4   0
5   1
6   0
7   1
8   0
9   0
10  0
11  1

However, with a sampling_rate that is sparser than the events (e.g. 0.5), it does not seem to interpolate:

0  0
1  0
2  0
3  0
4  0
5  0

@satra (Contributor) commented Jan 3, 2020

@adelavega - have you tested this without using the antialiasing filter? I suspect that is what is causing this to return what you have.

Just from looking at the dense variable (which looks to be high frequency), the output looks reasonably correct.

@adelavega (Collaborator, Author)

Thanks, although I don't think that's the issue.

Looks like to_dense does not do any resampling at all. That happens in the resample function, which is only available for Dense variables.

However, when requesting get_design_matrix(sampling_rate=0.5), it uses to_dense to turn any Sparse variables into Dense ones, rather than resample.

Looks like we should be doing a to_dense (at a high frequency) and then downsampling using resample?

@satra (Contributor) commented Jan 3, 2020

Looks like we should be doing a to_dense (at a high frequency) and then downsampling using resample?

Which for this particular vector could still likely return zeros. Upsampling does not change frequency characteristics, downsampling does (via aliasing), hence there is an asymmetry between the two processes.

@adelavega (Collaborator, Author)

Trying with slightly different inputs, the output of resample looks more like what I'd expect:

In [127]: onsets = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]                                                                                                                                                       

In [128]: amplitude = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1] 

In [139]: test_var.to_dense(1).values                                                                                                                                                                        
Out[139]: 
    0
0   0
1   1
2   0
3   1
4   0
5   1
6   0
7   1
8   0
9   1
10  0
11  1
12  0
13  0
14  0
15  0
16  0
17  1
18  0
19  1
In [141]: test_var.to_dense(1).resample(0.5).values                                                                                                                                                          
Out[141]: 
          0
0 -0.000382
1  0.614899
2  0.450140
3  0.528530
4  0.475532
5  0.534520
6  0.070803
7  0.018681
8  0.361497
9  1.000018

@adelavega (Collaborator, Author)

I think you make a valid point @satra, thanks, but I think there's a more basic problem here with how we handle automatic densification in get_design_matrix. Arguably, the user should specify their own Resample transformation to avoid this, which is fair. But I think the default behavior is pretty confusing.

@effigies (Collaborator) commented Jan 6, 2020

To ensure a faithful representation of sparse variables with dense variables, I think the sampling rate needs to be 1/interval where interval = gcd(*onset_times, *durations). onset_times and durations should probably be in milliseconds, not seconds.

If we're going to allow a sampling rate with ToDense, then we can check whether it's larger than this minimum, and insert a resampling step if needed.
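
A quick sketch of that calculation (hypothetical onsets and durations, expressed in integer milliseconds):

from functools import reduce
from math import gcd

onsets_ms = [1000, 3000, 5000, 7000, 9000, 11000]
durations_ms = [1000] * 6

interval_ms = reduce(gcd, onsets_ms + durations_ms)    # 1000 ms
min_sampling_rate = 1000 / interval_ms                 # 1 Hz suffices for this example
print(interval_ms, min_sampling_rate)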

@adelavega (Collaborator, Author)

Tal and I just discussed this, and I think the take-home is that ToDense is dumb on purpose, as it's a cheap alternative to Resample. It's really meant for situations where you have a dense variable that happens to be stored as sparse, and you want to convert it while ensuring no resampling occurs.

However, the problem is that get_design_matrix, via BIDSVariable.resample, was actually calling to_dense when it should be calling resample. That is basically a bug, as it's inconsistent with resample in BIDSCollection.resample. I'll try fixing this, and hopefully that is enough to solve the issue; if not, we can add something like what you suggest.

It would be worthwhile to think about smarter ways to apply resampling that inspect the variable itself to ensure the right thing is done, such as what you suggested, @effigies, so I may open an issue on that, although it does not seem critical.
