aggregate_downsample(): much slower for non-Quantity columns #13093

Open

orionlee opened this issue Apr 9, 2022 · 6 comments
Comments

@orionlee
Contributor

orionlee commented Apr 9, 2022

Description

TimeSeries aggregate_downsample(): when it downsamples a non-Quantity column (Column, NdarrayMixin), it is noticeably slower. We should make these cases comparable to Quantity columns.

In practice, this could affect columns such as the cadence number.

The slowdown:

column type    20k samples    200k samples
Quantity       ~0.2 sec       ~2.0 sec
Column         ~0.42 sec      ~4.2 sec
NdarrayMixin   0.39 sec       ~3.75 sec

The numbers are based on Astropy 4.3.1. Astropy 5.0.4 shows a similar slowdown, but it also has the additional performance regression reported in #13058, so Astropy 4.3.1's numbers are used here.

Profiling shows that the additional overhead for Column and NdarrayMixin is incurred during the reduceat operations, primarily in the __array_finalize__() method of the respective classes.
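A profile of the call can be collected with something along these lines (a minimal sketch using the standard-library profiler; the setup below is illustrative, not part of the report itself):

import cProfile
import pstats

import numpy as np
from astropy import units as u
from astropy.time import Time
from astropy.timeseries import TimeSeries, aggregate_downsample

num_points = 20000
time = Time(2457000 + np.arange(num_points) / 24 / 60 * 2, format="jd")
ts = TimeSeries(time=time, data=dict(col1=np.ones(num_points)))  # plain Column

profiler = cProfile.Profile()
profiler.enable()
aggregate_downsample(ts, aggregate_func=np.nanmean, time_bin_size=10 * u.minute)
profiler.disable()

# __array_finalize__ and the reduceat helper should show up near the top
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)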

Script to produce the numbers

from astropy.time import Time
from astropy.timeseries import TimeSeries, aggregate_downsample
from astropy import units as u
import astropy.table  # only needed for the NdarrayMixin option below
import numpy as np
from timeit import default_timer as timer
import sys

aggregate_func = np.nanmean

# the scale of a typical TESS 2-minute cadence lightcurve
num_points = 20000
if len(sys.argv) > 1:
    num_points = int(sys.argv[1])

time = Time(2457000 + np.arange(0, num_points) / 24 / 60 * 2, format="jd")

# pick the type for the column to bin
#
# col1 = np.ones(num_points) * u.electron
col1 = np.ones(num_points)
# col1 = astropy.table.NdarrayMixin(np.ones(num_points))

# other mixin column types are not supported by aggregate_downsample()

ts = TimeSeries(time=time, data=dict(col1=col1))

start = timer()
ts_b = aggregate_downsample(ts, aggregate_func=aggregate_func, time_bin_size=10*u.minute)
end = timer()

print("ts.bin(10 min) elapsed time:", (end - start))
print("col1 type:", type(ts["col1"]))
print(len(ts), len(ts_b))
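To reproduce the 200k-sample numbers, the script can be saved and run with the sample count as an argument, e.g. python downsample_bench.py 200000 (the file name is arbitrary).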

Affected versions

The slowdown is observed in v4.3.1 and v5.0.4, and probably affects earlier versions as well.

@dhomeier
Contributor

> Profiling shows that the additional overhead for Column and NdarrayMixin is incurred during the reduceat operations, primarily in the __array_finalize__() method of the respective classes.

So it has nothing to do with converting to MaskedArray in those cases?

@orionlee
Contributor Author

orionlee commented Apr 20, 2022

No. The issue existed before MaskedArray came into the picture, back in Astropy 4.
I have not explicitly tested the MaskedArray case yet, though I suspect the situation would be similar.

By process of elimination, I think it is all the array slicing during reduceat() that takes up the time when the array is a Column or NdarrayMixin, e.g.:

else:
    result.append(function(array[indices[i]:indices[i+1]]))
result.append(function(array[indices[-1]:]))
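A quick standalone check consistent with that guess (a sketch of my own, timings are only illustrative): slicing a Column in a tight loop is noticeably slower than slicing the underlying ndarray, because every Column slice is wrapped in a new Column via __array_finalize__().

import numpy as np
from astropy.table import Column
from timeit import timeit

arr = np.ones(200_000)
col = Column(arr)

def slice_all(a, width=10):
    # mimic the per-bin slicing done in the reduceat() loop above
    return [a[i:i + width] for i in range(0, len(a), width)]

print("ndarray:", timeit(lambda: slice_all(arr), number=10))
print("Column: ", timeit(lambda: slice_all(col), number=10))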

@dhomeier
Contributor

dhomeier commented Apr 20, 2022

> No. The issue existed before MaskedArray came into the picture, back in Astropy 4.

Not sure I can follow. That case has always been converted to MaskedArray since timeseries was introduced around 3.2, and reduceat also seems to have existed in more or less that form.
But I can confirm that changing that part to use array(np.nan) for ndarray instead, fully analogous to the Quantity case, does not affect the runtime in any noticeable way.

> By process of elimination, I think it is all the array slicing during reduceat() that takes up the time when the array is a Column or NdarrayMixin.

But how would that behave differently? It is called in exactly the same way and with the same function, whether ndarray or Quantity.value is passed as array. And the NdarrayMixin case is spurious here, since for your example the column still identifies as np.ndarray (otherwise it would be skipped with the warning before reduceat is ever called).

Interestingly, for int-type columns I find timings somewhere in between, closer to the Quantity case. All in all, though, the performance impact does not seem serious enough to give this a high priority.
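For reference on the point about the column still identifying as np.ndarray: Column and NdarrayMixin are both ndarray subclasses, so an isinstance check against np.ndarray passes for them. This is standard numpy subclassing behaviour, shown here only for illustration:

import numpy as np
from astropy.table import Column, NdarrayMixin
from astropy import units as u

print(isinstance(Column(np.ones(3)), np.ndarray))        # True
print(isinstance(NdarrayMixin(np.ones(3)), np.ndarray))  # True
print(isinstance(np.ones(3) * u.electron, u.Quantity))   # True -> handled via .value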

@dhomeier
Contributor

@orionlee I think I found the cause: in the second case the Column is in fact passed directly rather than its value. Proposed fix in #13126; tests and comments welcome!

@orionlee
Contributor Author

orionlee commented Apr 20, 2022

I tried a similar fix in PR #13069, but realized it might be too complicated to be bundled in that PR.

#13069 (comment)

@dhomeier
Contributor

I think it does not even have to be that complicated, since the check in

if not isinstance(values, (np.ndarray, u.Quantity)):
already ensures that values is an np.ndarray at that point (I just shuffled the cases around a bit in the PR, hoping to make the logic clearer).
Feel free to include either of the two options from #13126 in #13069, as it is more useful there anyway.
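A minimal sketch of the idea being discussed (not necessarily the exact change in #13126; the helper name below is hypothetical): unwrap the column to a bare ndarray before the per-bin slicing, the same way Quantity columns are already reduced to their .value.

import numpy as np
from astropy import units as u
from astropy.table import Column

def unwrap_for_binning(values):
    # Hypothetical helper, not astropy API: hand the slicing loop a plain
    # ndarray so each per-bin slice is a cheap view rather than a new Column
    # that has to go through __array_finalize__().
    if isinstance(values, u.Quantity):
        return values.value   # what already happens for Quantity columns
    if isinstance(values, Column):
        return values.data    # plain ndarray view of the Column data
    return values

print(type(unwrap_for_binning(Column(np.ones(5)))))  # <class 'numpy.ndarray'>
print(type(unwrap_for_binning(np.ones(5) * u.s)))    # <class 'numpy.ndarray'>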
