[timeseries] Parallelize TimeSeriesDataFrame construction from iterable dataset#2977

Merged
shchur merged 3 commits into autogluon:master from shchur:parallelize-from-iterable on Mar 3, 2023

Conversation

@shchur (Collaborator) commented Feb 27, 2023

Description of changes:

  • Parallelize the construction of TimeSeriesDataFrame from a GluonTS-style iterable dataset. Instead of processing the items sequentially on a single core, parallelize the work across items with joblib.
    • This significantly reduces processing time for large datasets (e.g., for the electricity dataset, from 337s to 30s).
    • This speeds up our benchmarking and benefits users who store their datasets in GluonTS format.
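The pattern described above can be sketched as follows. This is a hedged, simplified illustration, not the AutoGluon implementation: the helper `_item_to_df`, the hourly frequency, and the `from_iterable_dataset` wrapper shown here are assumptions; only the joblib-based per-item parallelization comes from this PR.

```python
import pandas as pd
from joblib import Parallel, delayed


def _item_to_df(item_id, item):
    # Hypothetical helper: convert one GluonTS-style item (a dict with a
    # "start" timestamp and a 1-D "target" array) into a DataFrame indexed
    # by (item_id, timestamp). Hourly frequency is assumed for illustration.
    timestamps = pd.date_range(
        start=item["start"], periods=len(item["target"]), freq="h"
    )
    idx = pd.MultiIndex.from_product(
        [[item_id], timestamps], names=["item_id", "timestamp"]
    )
    return pd.Series(item["target"], name="target", index=idx).to_frame()


def from_iterable_dataset(iterable_dataset, num_cpus=-1):
    # Parallelize the conversion across items with joblib; results are
    # returned in submission order, so concatenation preserves item order.
    all_ts = Parallel(n_jobs=num_cpus)(
        delayed(_item_to_df)(i, item) for i, item in enumerate(iterable_dataset)
    )
    return pd.concat(all_ts)


# Toy usage example with two short series
dataset = [
    {"start": "2023-01-01", "target": [1.0, 2.0, 3.0]},
    {"start": "2023-01-01", "target": [4.0, 5.0]},
]
df = from_iterable_dataset(dataset, num_cpus=2)
```

With `n_jobs=-1` joblib uses all available cores, which is where the reported 337s to 30s speedup on large datasets comes from.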

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shchur shchur requested a review from tonyhoo February 27, 2023 12:27
@github-actions (Contributor):

Job PR-2977-4dc6c4a is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2977/4dc6c4a/index.html

Review thread on the code introduced in this PR:

    return pd.Series(target, name="target", index=idx).to_frame()

    cls._validate_iterable(iterable_dataset)
    all_ts = Parallel(n_jobs=-1)(
@tonyhoo (Contributor):

We should allow users to override this from HPs such as env.num_workers

@shchur (Collaborator, Author):

This code is only executed outside of the TimeSeriesPredictor (when the user constructs a TimeSeriesDataFrame with TimeSeriesDataFrame.from_iterable_dataset), so this method has no access to HPs.

How about we add an optional kwarg num_cpus: int = -1 to TimeSeriesDataFrame.from_iterable_dataset that controls the number of jobs here?
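The suggested kwarg amounts to forwarding a user-facing `num_cpus` to joblib's `n_jobs`. A minimal sketch of that forwarding pattern, assuming a hypothetical `process_items` function (only the `num_cpus: int = -1` kwarg name and default come from this thread):

```python
from joblib import Parallel, delayed


def process_items(items, num_cpus: int = -1):
    # num_cpus is forwarded to joblib's n_jobs; -1 means "use all cores".
    # Here we just compute each item's target length as a stand-in for the
    # real per-item conversion work.
    return Parallel(n_jobs=num_cpus)(
        delayed(len)(item["target"]) for item in items
    )


lengths = process_items([{"target": [1, 2, 3]}, {"target": [4, 5]}], num_cpus=1)
# lengths == [3, 2]
```

Passing `num_cpus=1` disables parallelism entirely, which is useful for debugging or in environments where spawning worker processes is undesirable.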

@tonyhoo (Contributor):

yes, adding to from_iterable_dataset sounds good to me

@shchur (Collaborator, Author):

@tonyhoo Done 👍

@github-actions (Contributor):

Job PR-2977-2b36c84 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2977/2b36c84/index.html

@tonyhoo (Contributor) left a comment:


LGTM! Thanks for the improvement

@shchur shchur merged commit 1eddecd into autogluon:master Mar 3, 2023
@shchur shchur deleted the parallelize-from-iterable branch March 3, 2023 07:40
@jmakov commented Jun 27, 2024

Shouldn't this be implemented using ray.io, which would allow scaling out to a cluster?

@shchur (Collaborator, Author) commented Jun 28, 2024

@jmakov while we could make this one operation more scalable, it wouldn't really affect the overall scalability of the library because of other bottlenecks (e.g., the fact that all code relies on in-memory pd.DataFrames). With the current design we can handle datasets with <100M rows, which covers the majority of use cases. If we decide to add support for even larger datasets, it will require a substantial re-design of the library, which is currently not planned.

@jmakov commented Jun 28, 2024

Thanks for the clarification. I see: one would need to refactor to something like Ray Datasets, which support streaming data to nodes. Not trivial.
