The following are a few ideas for tweaking the `aggregate_downsample()` API to improve its performance in practice.
They all fall into the bucket of changes that do not touch the internals of the implementation, but are API changes that would let users get some performance gain.
1. Change the default of the `aggregate_func` parameter to use a faster median when available.

Currently, `aggregate_func`'s default is `np.nanmedian`.
Suggestion: change the default to use the fastest median available, i.e. use bottleneck's `nanmedian` if it is installed.
In my test, this cuts down `aggregate_downsample()` running time by a factor of 2-3 on a timeseries with 200k samples.
The approach would be similar to what the `astropy.stats.sigma_clip()` API already does; see its `_parse_cenfunc(cenfunc)` helper.
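The default-selection logic could look something like the sketch below. The helper name `_best_nanmedian` is hypothetical, not an existing astropy function; it only illustrates the "prefer bottleneck when installed, fall back to NumPy" idea:

```python
import numpy as np

def _best_nanmedian():
    """Return bottleneck's nanmedian when installed, else np.nanmedian.

    Hypothetical helper sketching the proposed default selection;
    bottleneck's nanmedian is a much faster drop-in for np.nanmedian.
    """
    try:
        import bottleneck
        return bottleneck.nanmedian
    except ImportError:
        return np.nanmedian

# Either backend gives the same answer, just at different speeds.
aggregate_func = _best_nanmedian()
result = aggregate_func(np.array([1.0, np.nan, 3.0]))  # -> 2.0
```

Since both functions agree on results, the change would be invisible to users except for the speedup.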
2. Let users optionally specify a subset of columns to bin, via a new optional `columns` parameter.

A TimeSeries from an actual observation often has many columns, and in practice users may not care about downsampling all of them.
E.g., a TESS SPOC lightcurve FITS file has 20+ columns. If users only want to bin the flux, they could call `aggregate_downsample(ts, columns=['flux', 'flux_err'], time_bin_size=10*u.minute)`, which could easily cut the running time by a factor of 10.
3. Let users specify a different `aggregate_func` for error columns.

Taking a TESS SPOC lightcurve FITS file as an example again, to properly bin `flux` one currently has to:
i. call `aggregate_downsample()` once to get the binned `flux`,
ii. call `aggregate_downsample()` again with root mean square as `aggregate_func` to get the binned `flux_err`.
The two calls could be reduced to one if `aggregate_downsample()` let users optionally specify `aggregate_func` on a per-column basis.
In terms of API changes, one option is to add a new optional `aggregate_func_selector` parameter that selects the `aggregate_func` per column.
With such a change, binning a TESS lightcurve could be done in one call, handling both regular columns and error columns.
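The intended per-column dispatch can be sketched with plain functions. Note that `aggregate_func_selector` is a proposed parameter, not part of astropy; the selector below, and its `_err`-suffix convention, are illustrative assumptions:

```python
import numpy as np

def rms(values, axis=None):
    """Root mean square -- the aggregator the issue suggests for error columns."""
    return np.sqrt(np.nanmean(np.asarray(values) ** 2, axis=axis))

def select_aggregate_func(column_name):
    """Hypothetical selector: RMS for *_err columns, nanmedian otherwise."""
    return rms if column_name.endswith('_err') else np.nanmedian

# One pass over mixed columns instead of two aggregate_downsample() calls:
flux = np.array([1.0, 2.0, 3.0])
flux_err = np.array([0.3, 0.4, 0.0])
binned_flux = select_aggregate_func('flux')(flux)        # nanmedian -> 2.0
binned_err = select_aggregate_func('flux_err')(flux_err)  # rms
```

Internally, `aggregate_downsample()` could call such a selector once per column to pick the aggregator, so regular and error columns are handled in the same pass.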
Additional context

The bottleneck aggregator idea comes from the discussion in Fix aggregate_downsample() performance degradation #13069.
The `columns` and `aggregate_func_selector` ideas originated from a Lightkurve use case: the `Lightcurve.bin()` implementation currently calls `aggregate_downsample()` twice for each `bin()` call.