The following are a few ideas for tweaking the `aggregate_downsample()` API to improve its performance in practice.
They all fall into the bucket of changes that do not touch the internals of the implementation, but are API changes that would let users get some performance gain.
1. Change the default of the `aggregate_func` parameter to use a faster median when available.

Currently, `aggregate_func`'s default is `np.nanmedian`.
Suggestion: change the default to use the fastest median available, i.e. use bottleneck's `nanmedian` if it is installed.
In my test, this cuts down `aggregate_downsample()` running time by a factor of 2-3 on a timeseries with 200k samples.
The approach would be similar to what the `astropy.stats.sigma_clip()` API already does; see its `_parse_cenfunc(cenfunc)` helper.
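The default-selection logic could look something like the sketch below. The helper name `_best_nanmedian` is hypothetical, not an existing astropy function; it only illustrates the "prefer bottleneck when installed, fall back to NumPy" idea:

```python
import numpy as np

def _best_nanmedian():
    """Return bottleneck's nanmedian when installed, else np.nanmedian.

    Hypothetical helper sketching the proposed default selection;
    bottleneck's nanmedian is a much faster drop-in for np.nanmedian.
    """
    try:
        import bottleneck
        return bottleneck.nanmedian
    except ImportError:
        return np.nanmedian

# Either backend gives the same answer, just at different speeds.
aggregate_func = _best_nanmedian()
result = aggregate_func(np.array([1.0, np.nan, 3.0]))  # -> 2.0
```

Since both functions agree on results, the change would be invisible to users except for the speedup.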
2. Let users optionally specify a subset of columns to bin, via a new optional `columns` parameter.

A TimeSeries from an actual observation often has many columns, and in practice users may not care about downsampling all of them.
E.g., a TESS SPOC lightcurve FITS file has 20+ columns. If users only want to bin the flux, they could call `aggregate_downsample(ts, columns=['flux', 'flux_err'], time_bin_size=10*u.minute)`, which could easily cut the running time by a factor of 10.
3. Let users specify a different `aggregate_func` for error columns.

Taking a TESS SPOC lightcurve FITS file as an example again, to properly bin `flux` one currently has to:
i. call `aggregate_downsample()` once to get the binned `flux`,
ii. call `aggregate_downsample()` again with root mean square as `aggregate_func` to get the binned `flux_err`.
The two calls could be reduced to one if `aggregate_downsample()` let users optionally specify `aggregate_func` on a per-column basis.
In terms of API changes, one option is to add a new optional `aggregate_func_selector` parameter that selects the `aggregate_func` per column.
With such a change, binning a TESS lightcurve could be done in one call, handling both regular columns and error columns.
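The intended per-column dispatch can be sketched with plain functions. Note that `aggregate_func_selector` is a proposed parameter, not part of astropy; the selector below, and its `_err`-suffix convention, are illustrative assumptions:

```python
import numpy as np

def rms(values, axis=None):
    """Root mean square -- the aggregator the issue suggests for error columns."""
    return np.sqrt(np.nanmean(np.asarray(values) ** 2, axis=axis))

def select_aggregate_func(column_name):
    """Hypothetical selector: RMS for *_err columns, nanmedian otherwise."""
    return rms if column_name.endswith('_err') else np.nanmedian

# One pass over mixed columns instead of two aggregate_downsample() calls:
flux = np.array([1.0, 2.0, 3.0])
flux_err = np.array([0.3, 0.4, 0.0])
binned_flux = select_aggregate_func('flux')(flux)        # nanmedian -> 2.0
binned_err = select_aggregate_func('flux_err')(flux_err)  # rms
```

Internally, `aggregate_downsample()` could call such a selector once per column to pick the aggregator, so regular and error columns are handled in the same pass.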
Additional context

The bottleneck aggregator idea comes from the discussion in Fix aggregate_downsample() performance degradation #13069.
The `columns` and `aggregate_func_selector` ideas originated from a Lightkurve use case: the `Lightcurve.bin()` implementation currently calls `aggregate_downsample()` twice for each `bin()` call.