[timeseries] Parallelize TimeSeriesDataFrame construction from iterable dataset #2977
return pd.Series(target, name="target", index=idx).to_frame()
cls._validate_iterable(iterable_dataset)
all_ts = Parallel(n_jobs=-1)(
We should allow users to override this via hyperparameters (HPs) such as env.num_workers.
This code is only executed outside of the TimeSeriesPredictor (when a user constructs a TimeSeriesDataFrame with TimeSeriesDataFrame.from_iterable_dataset), so this method has no access to HPs.
How about we add an optional kwarg num_cpus: int = -1 to TimeSeriesDataFrame.from_iterable_dataset that controls the number of jobs here?
Yes, adding it to from_iterable_dataset sounds good to me.
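For illustration, the proposal in this thread could look roughly like the sketch below. It is not the code merged in this PR: the helper `_item_to_frame`, the fixed daily frequency, the string `start` values, and the standalone function (rather than a classmethod on TimeSeriesDataFrame) are all assumptions made to keep the example self-contained. The only parts taken from the discussion are the `num_cpus: int = -1` kwarg and passing it to joblib's `Parallel(n_jobs=...)`.

```python
from typing import Any, Dict, Iterable

import pandas as pd
from joblib import Parallel, delayed


def _item_to_frame(item_id: int, item: Dict[str, Any], freq: str = "D") -> pd.DataFrame:
    """Hypothetical helper: convert one GluonTS-style entry into a long-format frame.

    Assumes each entry has a "start" timestamp and a "target" sequence, and
    that all series share the same (daily) frequency.
    """
    target = list(item["target"])
    timestamps = pd.date_range(start=item["start"], periods=len(target), freq=freq)
    idx = pd.MultiIndex.from_product(
        [[item_id], timestamps], names=["item_id", "timestamp"]
    )
    return pd.Series(target, name="target", index=idx).to_frame()


def from_iterable_dataset(
    iterable_dataset: Iterable[Dict[str, Any]], num_cpus: int = -1
) -> pd.DataFrame:
    """Build one frame from all entries, converting entries in parallel.

    num_cpus is forwarded to joblib as n_jobs: -1 uses all available cores,
    1 runs sequentially in the calling process.
    """
    all_ts = Parallel(n_jobs=num_cpus)(
        delayed(_item_to_frame)(item_id, item)
        for item_id, item in enumerate(iterable_dataset)
    )
    # joblib returns results in submission order, so item order is preserved.
    return pd.concat(all_ts)
```

A caller who wants to cap resource usage would then pass, for example, `from_iterable_dataset(dataset, num_cpus=4)`, while the default of `-1` keeps the behavior shown in the diff above.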
tonyhoo
left a comment
LGTM! Thanks for the improvement
Shouldn't this be implemented using ray.io, which would allow you to scale to a cluster?
@jmakov while we could make this one operation more scalable, it wouldn't really affect the overall scalability of the library because of other bottlenecks (e.g., the fact that all code relies on in-memory pd.DataFrames). With the current design we can handle datasets with <100M rows which covers the majority of use cases. If we decide to add support for even larger datasets, it will require a substantial re-design of the library, which is currently not planned.
Thanks for the clarification. I see, one would need to refactor to something like a Ray dataset, which supports streaming to nodes. Not trivial.
Description of changes:
Parallelize construction of TimeSeriesDataFrame from a GluonTS-style iterable dataset. Instead of processing the items sequentially with a single core, parallelize the work across items with joblib (for example, on electricity we go from 337s -> 30s).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.