[timeseries] Parallelize TimeSeriesDataFrame construction from iterable dataset#2977

Merged
shchur merged 3 commits into autogluon:master from shchur:parallelize-from-iterable on Mar 3, 2023

Conversation

@shchur (Collaborator) commented Feb 27, 2023

Description of changes:

  • Parallelize the construction of TimeSeriesDataFrame from a GluonTS-style iterable dataset. Instead of processing the items sequentially on a single core, parallelize the work across items with joblib.
    • This significantly reduces processing time for large datasets (e.g., for the electricity dataset, from 337s to 30s).
    • This speeds up our benchmarking and benefits users who store their datasets in GluonTS format.
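The pattern described above can be sketched as follows. This is a hedged, simplified illustration, not the AutoGluon implementation: the helper `_item_to_df`, the hourly frequency, and the `from_iterable_dataset` wrapper shown here are assumptions; only the joblib-based per-item parallelization comes from this PR.

```python
import pandas as pd
from joblib import Parallel, delayed


def _item_to_df(item_id, item):
    # Hypothetical helper: convert one GluonTS-style item (a dict with a
    # "start" timestamp and a 1-D "target" array) into a DataFrame indexed
    # by (item_id, timestamp). Hourly frequency is assumed for illustration.
    timestamps = pd.date_range(
        start=item["start"], periods=len(item["target"]), freq="h"
    )
    idx = pd.MultiIndex.from_product(
        [[item_id], timestamps], names=["item_id", "timestamp"]
    )
    return pd.Series(item["target"], name="target", index=idx).to_frame()


def from_iterable_dataset(iterable_dataset, num_cpus=-1):
    # Parallelize the conversion across items with joblib; results are
    # returned in submission order, so concatenation preserves item order.
    all_ts = Parallel(n_jobs=num_cpus)(
        delayed(_item_to_df)(i, item) for i, item in enumerate(iterable_dataset)
    )
    return pd.concat(all_ts)


# Toy usage example with two short series
dataset = [
    {"start": "2023-01-01", "target": [1.0, 2.0, 3.0]},
    {"start": "2023-01-01", "target": [4.0, 5.0]},
]
df = from_iterable_dataset(dataset, num_cpus=2)
```

With `n_jobs=-1` joblib uses all available cores, which is where the reported 337s to 30s speedup on large datasets comes from.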

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shchur shchur requested a review from tonyhoo February 27, 2023 12:27
@github-actions (Contributor):

Job PR-2977-4dc6c4a is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2977/4dc6c4a/index.html

Review thread on the code introduced in this PR:

    return pd.Series(target, name="target", index=idx).to_frame()

    cls._validate_iterable(iterable_dataset)
    all_ts = Parallel(n_jobs=-1)(
@tonyhoo (Contributor):

We should allow users to override this from HPs such as env.num_workers

@shchur (Collaborator, Author):

This code is only executed outside of the TimeSeriesPredictor (when the user constructs a TimeSeriesDataFrame with TimeSeriesDataFrame.from_iterable_dataset), so this method has no access to HPs.

How about we add an optional kwarg num_cpus: int = -1 to TimeSeriesDataFrame.from_iterable_dataset that controls the number of jobs here?
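The suggested kwarg amounts to forwarding a user-facing `num_cpus` to joblib's `n_jobs`. A minimal sketch of that forwarding pattern, assuming a hypothetical `process_items` function (only the `num_cpus: int = -1` kwarg name and default come from this thread):

```python
from joblib import Parallel, delayed


def process_items(items, num_cpus: int = -1):
    # num_cpus is forwarded to joblib's n_jobs; -1 means "use all cores".
    # Here we just compute each item's target length as a stand-in for the
    # real per-item conversion work.
    return Parallel(n_jobs=num_cpus)(
        delayed(len)(item["target"]) for item in items
    )


lengths = process_items([{"target": [1, 2, 3]}, {"target": [4, 5]}], num_cpus=1)
# lengths == [3, 2]
```

Passing `num_cpus=1` disables parallelism entirely, which is useful for debugging or in environments where spawning worker processes is undesirable.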

@tonyhoo (Contributor):

yes, adding to from_iterable_dataset sounds good to me

@shchur (Collaborator, Author):

@tonyhoo Done 👍

@github-actions (Contributor):

Job PR-2977-2b36c84 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2977/2b36c84/index.html

@tonyhoo (Contributor) left a comment:


LGTM! Thanks for the improvement

@shchur shchur merged commit 1eddecd into autogluon:master Mar 3, 2023
@shchur shchur deleted the parallelize-from-iterable branch March 3, 2023 07:40
@jmakov commented Jun 27, 2024

Shouldn't this be implemented using ray.io, which would allow scaling out to a cluster?

@shchur (Collaborator, Author) commented Jun 28, 2024

@jmakov while we could make this one operation more scalable, it wouldn't really affect the overall scalability of the library because of other bottlenecks (e.g., the fact that all code relies on in-memory pd.DataFrames). With the current design we can handle datasets with <100M rows, which covers the majority of use cases. If we decide to add support for even larger datasets, it will require a substantial re-design of the library, which is currently not planned.

@jmakov commented Jun 28, 2024

Thanks for the clarification. I see: one would need to refactor to something like Ray Datasets, which support streaming data to nodes. Not trivial.
