aggregate_downsample(): much slower for non-Quantity columns #13093

Open

orionlee opened this issue Apr 9, 2022 · 6 comments
Comments

@orionlee
Contributor

orionlee commented Apr 9, 2022

Description

TimeSeries aggregate_downsample(): when it downsamples a non-Quantity column (Column, NdarrayMixin), it is noticeably slower. We should make these cases comparable to Quantity columns.

In practice, this could affect columns such as the cadence number.

The slowdown:

column type    20k samples    200k samples
Quantity       ~0.2 sec       ~2.0 sec
Column         ~0.42 sec      ~4.2 sec
NdarrayMixin   0.39 sec       ~3.75 sec

The numbers are based on Astropy 4.3.1. Astropy 5.0.4 shows a similar slowdown, but it also has the additional performance regression reported in #13058, so Astropy 4.3.1's numbers are used here.

Profiling shows that the additional overhead for Column and NdarrayMixin is incurred during the reduceat operations, primarily in the __array_finalize__() method of the respective classes.
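A profile of the call can be collected with something along these lines (a minimal sketch using the standard-library profiler; the setup below is illustrative, not part of the report itself):

import cProfile
import pstats

import numpy as np
from astropy import units as u
from astropy.time import Time
from astropy.timeseries import TimeSeries, aggregate_downsample

num_points = 20000
time = Time(2457000 + np.arange(num_points) / 24 / 60 * 2, format="jd")
ts = TimeSeries(time=time, data=dict(col1=np.ones(num_points)))  # plain Column

profiler = cProfile.Profile()
profiler.enable()
aggregate_downsample(ts, aggregate_func=np.nanmean, time_bin_size=10 * u.minute)
profiler.disable()

# __array_finalize__ and the reduceat helper should show up near the top
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)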

Script to produce the numbers

from astropy.time import Time
from astropy.timeseries import TimeSeries, aggregate_downsample
from astropy import units as u
import astropy.table  # only needed for the NdarrayMixin option below
import numpy as np
from timeit import default_timer as timer
import sys

aggregate_func = np.nanmean

# the scale of a typical TESS 2-minute cadence lightcurve
num_points = 20000
if len(sys.argv) > 1:
    num_points = int(sys.argv[1])

time = Time(2457000 + np.arange(0, num_points) / 24 / 60 * 2, format="jd")

# pick the type for the column to bin
#
# col1 = np.ones(num_points) * u.electron
col1 = np.ones(num_points)
# col1 = astropy.table.NdarrayMixin(np.ones(num_points))

# other mixin column types are not supported by aggregate_downsample()

ts = TimeSeries(time=time, data=dict(col1=col1))

start = timer()
ts_b = aggregate_downsample(ts, aggregate_func=aggregate_func, time_bin_size=10*u.minute)
end = timer()

print("ts.bin(10 min) elapsed time:", (end - start))
print("col1 type:", type(ts["col1"]))
print(len(ts), len(ts_b))
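To reproduce the 200k-sample numbers, the script can be saved and run with the sample count as an argument, e.g. python downsample_bench.py 200000 (the file name is arbitrary).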

Affected versions

The slowdown is observed in v4.3.1 and v5.0.4, and probably affects earlier versions as well.

@dhomeier
Contributor

> Profiling shows that the additional overhead for Column and NdarrayMixin is incurred during the reduceat operations, primarily in the __array_finalize__() method of the respective classes.

So it has nothing to do with converting to MaskedArray in those cases?

@orionlee
Contributor Author

orionlee commented Apr 20, 2022

No. The issue existed before MaskedArray came into the picture, back in Astropy 4.
I have not explicitly tested the MaskedArray case yet, though I suspect the situation would be similar.

By process of elimination, I think it is all the array slicing during reduceat() that takes up the time when the array is a Column or NdarrayMixin, e.g.:

else:
    result.append(function(array[indices[i]:indices[i+1]]))
result.append(function(array[indices[-1]:]))
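A quick standalone check consistent with that guess (a sketch of my own, timings are only illustrative): slicing a Column in a tight loop is noticeably slower than slicing the underlying ndarray, because every Column slice is wrapped in a new Column via __array_finalize__().

import numpy as np
from astropy.table import Column
from timeit import timeit

arr = np.ones(200_000)
col = Column(arr)

def slice_all(a, width=10):
    # mimic the per-bin slicing done in the reduceat() loop above
    return [a[i:i + width] for i in range(0, len(a), width)]

print("ndarray:", timeit(lambda: slice_all(arr), number=10))
print("Column: ", timeit(lambda: slice_all(col), number=10))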

@dhomeier
Contributor

dhomeier commented Apr 20, 2022

> No. The issue existed before MaskedArray came into the picture, back in Astropy 4.

Not sure I can follow. That case has always been converted to MaskedArray since timeseries was introduced around 3.2, and reduceat also seems to have existed in more or less that form.
But I can confirm that changing that part to use array(np.nan) for ndarray instead, fully analogous to the Quantity case, does not affect the runtime in any noticeable way.

> By process of elimination, I think it is all the array slicing during reduceat() that takes up the time when the array is a Column or NdarrayMixin.

But how would that behave differently? It is called in exactly the same way and with the same function, whether ndarray or Quantity.value is passed as array. And the NdarrayMixin case is spurious here, since for your example the column still identifies as np.ndarray (otherwise it would be skipped with the warning before reduceat is ever called).

Interestingly, for int-type columns I find timings somewhere in between, closer to the Quantity case. All in all, though, the performance impact does not seem serious enough to give this a high priority.
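For reference on the point about the column still identifying as np.ndarray: Column and NdarrayMixin are both ndarray subclasses, so an isinstance check against np.ndarray passes for them. This is standard numpy subclassing behaviour, shown here only for illustration:

import numpy as np
from astropy.table import Column, NdarrayMixin
from astropy import units as u

print(isinstance(Column(np.ones(3)), np.ndarray))        # True
print(isinstance(NdarrayMixin(np.ones(3)), np.ndarray))  # True
print(isinstance(np.ones(3) * u.electron, u.Quantity))   # True -> handled via .value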

@dhomeier
Contributor

@orionlee I think I found the cause: in the second case the Column is in fact passed directly rather than its value. Proposed fix in #13126; tests and comments welcome!

@orionlee
Contributor Author

orionlee commented Apr 20, 2022

I tried a similar fix in PR #13069, but realized it might be too complicated to be bundled in that PR.

#13069 (comment)

@dhomeier
Contributor

I think it does not even have to be that complicated, since the check in

if not isinstance(values, (np.ndarray, u.Quantity)):
already ensures that values is an np.ndarray at that point (I just shuffled the cases around a bit in the PR, hoping to make the logic clearer).
Feel free to include either of the two options from #13126 in #13069, as it is more useful there anyway.
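A minimal sketch of the idea being discussed (not necessarily the exact change in #13126; the helper name below is hypothetical): unwrap the column to a bare ndarray before the per-bin slicing, the same way Quantity columns are already reduced to their .value.

import numpy as np
from astropy import units as u
from astropy.table import Column

def unwrap_for_binning(values):
    # Hypothetical helper, not astropy API: hand the slicing loop a plain
    # ndarray so each per-bin slice is a cheap view rather than a new Column
    # that has to go through __array_finalize__().
    if isinstance(values, u.Quantity):
        return values.value   # what already happens for Quantity columns
    if isinstance(values, Column):
        return values.data    # plain ndarray view of the Column data
    return values

print(type(unwrap_for_binning(Column(np.ones(5)))))  # <class 'numpy.ndarray'>
print(type(unwrap_for_binning(np.ones(5) * u.s)))    # <class 'numpy.ndarray'>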
