Hello! Thank you for offering this fantastic package. It has been very helpful for getting some quantile mapping up and running quickly. As a thank you, I wanted to share an optimization I did on gridded bias correction that you might find useful. I specialize in vectorizing/optimizing gridded data, but do not have the capacity right now to open a full PR.
I used cm.adjust_3d on a ~111x111 lat lon grid with 40,000 time steps. The progress bar estimated around 2.5 hours for this but I didn't run it in full. With the below implementation, it ran in 1 minute.
You need a dask cluster running and a dask-backed dataset, of course, to reap the full benefits, but the implementation speeds up in-memory datasets too.
```python
import numpy as np
import xarray as xr
from cmethods import CMethods as cm


def quantile_map_3d(
    obs: xr.DataArray,
    simh: xr.DataArray,
    simp: xr.DataArray,
    n_quantiles: int,
    kind: str,
):
    """Quantile mapping vectorized for 3D operations."""

    def qmap(
        obs: xr.DataArray,
        simh: xr.DataArray,
        simp: xr.DataArray,
        n_quantiles: int,
        kind: str,
    ) -> np.array:
        """Helper for apply_ufunc to vectorize/parallelize the bias correction step."""
        return cm.quantile_mapping(
            obs=obs, simh=simh, simp=simp, n_quantiles=n_quantiles, kind=kind
        )

    result = xr.apply_ufunc(
        qmap,
        obs,
        simh,
        # Need to spoof a fake time axis since the 'time' coord on the full dataset
        # is different than the 'time' coord on the training dataset.
        simp.rename({"time": "t2"}),
        dask="parallelized",
        vectorize=True,
        # This will vectorize over the time dimension, so will submit each grid
        # cell independently.
        input_core_dims=[["time"], ["time"], ["t2"]],
        # Need to denote that the final output dataset will be labeled with the
        # spoofed time coordinate.
        output_core_dims=[["t2"]],
        kwargs={"n_quantiles": n_quantiles, "kind": kind},
    )
    # Rename to proper coordinate name.
    result = result.rename({"t2": "time"})
    # ufunc will put the core dimension (time) at the end, so we want to preserve
    # the original order, where time is commonly first.
    result = result.transpose(*obs.dims)
    return result
```
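For anyone wanting to try the pattern without installing cmethods, here is a minimal, self-contained sketch of the same `apply_ufunc` trick. The names `mean_shift` and `apply_pointwise` are hypothetical placeholders: `mean_shift` is a trivial additive correction standing in for `cm.quantile_mapping`, just to show how the dimension spoofing and transpose work end to end.

```python
import numpy as np
import xarray as xr

def mean_shift(obs, simh, simp):
    # Trivial stand-in for a real 1-D bias-correction routine: shift the
    # projection by the mean bias between observations and historical sim.
    return simp + (obs.mean() - simh.mean())

def apply_pointwise(obs, simh, simp):
    result = xr.apply_ufunc(
        mean_shift,
        obs,
        simh,
        simp.rename({"time": "t2"}),  # spoof a separate time axis for the projection
        vectorize=True,
        input_core_dims=[["time"], ["time"], ["t2"]],
        output_core_dims=[["t2"]],
    )
    return result.rename({"t2": "time"}).transpose(*obs.dims)

# Tiny synthetic grid: 5 time steps on a 3x4 lat/lon grid.
rng = np.random.default_rng(0)
coords = {"time": np.arange(5), "lat": np.arange(3), "lon": np.arange(4)}
obs = xr.DataArray(
    rng.random((5, 3, 4)), coords=coords, dims=("time", "lat", "lon")
)
simh = obs + 1.0  # "historical" simulation with a constant +1 bias
simp = obs + 1.0  # "projection" with the same bias

corrected = apply_pointwise(obs, simh, simp)
print(corrected.dims)  # ('time', 'lat', 'lon') — time restored to the front
```

Since the bias here is a constant offset, the corrected projection recovers the observations exactly, which makes the behavior easy to verify.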
The nice thing about this is that it can handle 1D datasets without any issue. The limitation is that the inputs always have to be xarray objects. But it works with dask or in-memory datasets, and with arbitrary dimensions, as long as a labeled time dimension exists.
The other great thing is that you could wrap every bias correction method in apply_ufunc like this, without the need for a separate adjust_3d function. A user could then pass in 1D or 2D+ data without any change in code.
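That "wrap every method" idea could look something like the sketch below. The `vectorize_correction` decorator is a hypothetical helper, not part of cmethods; it turns any 1-D `(obs, simh, simp, **kwargs)` correction into one that accepts n-dimensional labeled data. The demo uses a trivial scaling function so the sketch runs standalone.

```python
import numpy as np
import xarray as xr

def vectorize_correction(func):
    """Hypothetical decorator: lift a 1-D bias-correction function to n-D
    xarray inputs with a labeled 'time' dimension, via apply_ufunc."""
    def wrapper(obs, simh, simp, **kwargs):
        result = xr.apply_ufunc(
            func,
            obs,
            simh,
            simp.rename({"time": "t2"}),  # spoofed projection time axis
            dask="parallelized",
            vectorize=True,
            input_core_dims=[["time"], ["time"], ["t2"]],
            output_core_dims=[["t2"]],
            kwargs=kwargs,  # forwarded to func (e.g. n_quantiles, kind)
        )
        return result.rename({"t2": "time"}).transpose(*obs.dims)
    return wrapper

# Demo with a trivial correction standing in for a real cmethods routine.
@vectorize_correction
def scale(obs, simh, simp, factor=1.0):
    return simp * factor

da = xr.DataArray(
    np.ones((4, 2, 2)),
    coords={"time": np.arange(4), "lat": [0, 1], "lon": [0, 1]},
    dims=("time", "lat", "lon"),
)
out = scale(da, da, da, factor=2.0)
```

The decorator keeps each 1-D routine untouched, so existing single-cell code paths would keep working while gridded inputs get vectorized for free.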
Hey @riley-brady, thank you for sharing your suggestions! I will try it out, and if it brings such improvements, it will definitely be part of the next release. I was looking for a solution like this for a long time while developing this package at my old workplace. I created issue #6 back then, but then got lost in other projects.
I also need some time to explore and implement this solution - hopefully during the upcoming week.
To really get the power of this on a larger dataset, you need your file stored as chunked Zarr and a local dask cluster running. That said, this will still work, and be much faster than the nested for-loop, for simple in-memory datasets (1D, 2D, 3D, and above).
But to the point of your issue #6 and the Stack Overflow post: if you're going bigger than your RAM, you definitely want the data stored as Zarr and processed with dask.
If you haven't used zarr or dask before and need some pointers, please let me know!
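For illustration, an out-of-core run might be set up roughly like this. This is an untested workflow sketch: the file paths, the variable name `tas`, and the chunk sizes are all placeholders, and it assumes the `quantile_map_3d` function from above. The one important constraint is that the time dimension must not be chunked, since each grid cell's full series is consumed at once as a core dimension.

```python
import xarray as xr
from dask.distributed import Client

client = Client()  # start a local dask cluster (prints a dashboard link)

# Keep 'time' in a single chunk (-1); chunk only over space so each task
# processes a block of grid cells independently. Paths/variable are examples.
chunks = {"time": -1, "lat": 20, "lon": 20}
obs = xr.open_zarr("obs.zarr")["tas"].chunk(chunks)
simh = xr.open_zarr("simh.zarr")["tas"].chunk(chunks)
simp = xr.open_zarr("simp.zarr")["tas"].chunk(chunks)

result = quantile_map_3d(obs, simh, simp, n_quantiles=100, kind="+")

# Writing to Zarr triggers the (lazy) computation, streaming chunk by chunk
# so the full dataset never has to fit in RAM.
result.to_dataset(name="tas").to_zarr("corrected.zarr")
```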