[timeseries] Speed up the train/val splitter #2586

Merged (3 commits) on Dec 23, 2022

@shchur (Collaborator) commented Dec 20, 2022

Description of changes:

  • Speed up the append_suffix_to_item_id function
  • Replace the surprisingly slow DataFrame.loc[index] operation with DataFrame.query("item_id in @index").
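
To illustrate the second change, here is a minimal sketch on a hypothetical toy frame (not the AutoGluon code itself). It assumes a pandas DataFrame with a MultiIndex whose levels are named item_id and timestamp, and shows that selecting items with DataFrame.query returns the same rows as DataFrame.loc. The names `selected`, `via_loc`, and `via_query` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 items with 3 daily timestamps each.
item_ids = ["A", "B", "C", "D"]
timestamps = pd.date_range("2022-01-01", periods=3, freq="D")
index = pd.MultiIndex.from_product(
    [item_ids, timestamps], names=["item_id", "timestamp"]
)
df = pd.DataFrame({"target": np.arange(len(index), dtype=float)}, index=index)

# Items to keep (illustrative stand-in for the `index` variable in the PR text).
selected = ["A", "C"]

# Label-based selection on the first index level.
via_loc = df.loc[selected]

# Equivalent selection with query; named index levels can be referenced
# like columns, and @selected refers to the local Python variable.
via_query = df.query("item_id in @selected")

# Both approaches keep the 2 selected items x 3 timestamps.
assert len(via_loc) == len(via_query) == 6
```

Which variant is faster depends on the pandas version and the data; the timings reported in this PR were measured on the M5 subset, not on a toy frame like this one.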

Testing on a subset of 5000 items from the M5 competition dataset:

Using code currently on master:

Loaded dataset with 7559974 rows and 5000 items.
df.slice_by_timestep(None, -prediction_length): 5.75s
LastWindowSplitter.split(df, prediction_length): 63.83s

After current PR:

Loaded dataset with 7559974 rows and 5000 items.
df.slice_by_timestep(None, -prediction_length): 1.38s
LastWindowSplitter.split(df, prediction_length): 8.30s

Code for reproducing the results:

import time
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame
from autogluon.timeseries.splitter import LastWindowSplitter

prediction_length = 28
# Dataset consists of the first 5000 items of the M5 competition dataset
raw_data = pd.read_parquet("../m5/data/subset.parquet")
static = pd.read_parquet("../m5/data/static.parquet")

raw_data["item_id"] = raw_data["item_id"].astype("str")
static["item_id"] = static["item_id"].astype("str")
static.set_index("item_id", inplace=True)

print(f"Loaded dataset with {len(raw_data)} rows and {raw_data['item_id'].nunique()} items.")
df = TimeSeriesDataFrame(raw_data, static_features=static)

start = time.time()
df.slice_by_timestep(None, -prediction_length)
print(f"df.slice_by_timestep(None, -prediction_length): {time.time() - start:.2f}s")

start = time.time()
splitter = LastWindowSplitter()
train_data, val_data = splitter.split(df, prediction_length)
print(f"LastWindowSplitter.split(df, prediction_length): {time.time() - start:.2f}s")

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-2586-77c7f57 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2586/77c7f57/index.html

@shchur added this to the 0.6.2 Release milestone on Dec 21, 2022

@canerturkmen (Contributor) left a comment

LGTM! Just some questions.

@github-actions

Job PR-2586-12d2b30 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2586/12d2b30/index.html

@github-actions

Job PR-2586-7d41cb1 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2586/7d41cb1/index.html

@shchur merged commit 78b0426 into autogluon:master on Dec 23, 2022
@shchur deleted the faster-splitter branch on December 23, 2022 13:51
Labels
module: timeseries related to the timeseries module