Speed up TimeSeriesImputerTransform #217

Merged (17 commits) on Feb 6, 2024

alex-hse-repository (Collaborator) commented Jan 15, 2024

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Flame graphs (before)

  • constant
Screenshot 2024-01-15 at 16 16 49
  • mean
Screenshot 2024-01-15 at 16 17 09
  • seasonal
Screenshot 2024-01-15 at 16 17 26

Proposed Changes

  1. Rewrite the NaN-restoring logic to use masks (we now store a binary NaN mask instead of nan_timestamps)

  2. Optimize the _fill implementation for the different strategies:

    • constant, forward_fill -- remove redundant slicing
    • mean -- use SimpleImputer from sklearn, rewritten to work correctly on a subset of segments
    • seasonal, running_mean -- use MeanTransform to obtain the values used for imputation
  3. Fix a bug in TimeSeriesImputerTransform when it is created with window=-1 and seasonality!=1

  4. Add a new strategy, "seasonal_statistics" -- fill missing values using only existing values
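
To illustrate item 1, here is a minimal sketch of mask-based NaN restoring on a toy wide frame. It uses plain pandas only; the frame, the segment names, and the constant fill are made up for illustration, and this is not the code from imputation.py.

```python
import numpy as np
import pandas as pd

# Toy wide frame: rows are timestamps, columns are segments (hypothetical names).
index = pd.date_range("2000-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {
        "segment_0": [1.0, np.nan, 3.0, np.nan, 5.0],
        "segment_1": [np.nan, 2.0, 2.0, 4.0, np.nan],
    },
    index=index,
)

# Remember where the NaNs were as a boolean mask with the same shape as the data.
nan_mask = df.isna()

# Fill the gaps (a constant fill stands in for any strategy here).
filled = df.fillna(0.0)

# Put the NaNs back in one vectorized call instead of looping over stored timestamps.
restored = filled.mask(nan_mask)
```

The mask has the same index and columns as the data, so restoring is a single aligned operation regardless of how many segments have gaps.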

Flame graphs (after)

  • constant (x5): 24.00 -> 5.44
Screenshot 2024-01-30 at 15 59 29
  • mean (x5): 24.65 -> 5.85
Screenshot 2024-01-30 at 16 01 00
  • seasonal (x100): ??? -> 6.13 (50.58 -> 0.6 for 1k)
Screenshot 2024-01-30 at 16 02 47
  • seasonal_statistics (new): 6.82
Screenshot 2024-01-30 at 16 06 33

Additional thoughts

  • Rewrite update_columns_from_pandas with set_columns_wide
  • Base operations such as to_pandas now take most of the time, so it looks like we can stop optimizing at this point -- however, we could think about optimizing the base operations themselves, as they might take a long time at scale
  • We can add strategies "median" and "most_frequent" from SimpleImputer
  • We can implement transformation on new segments for some of the strategies
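
As a rough idea of what "median" and "most_frequent" would do per segment, here is a pandas-only sketch. The strategy names mirror the `strategy` values of sklearn's SimpleImputer, but the helper `impute_per_segment` and the toy data are hypothetical, and the sketch does not call SimpleImputer or etna.

```python
import numpy as np
import pandas as pd

# Toy wide frame: one column per segment (hypothetical names and values).
df = pd.DataFrame(
    {
        "segment_0": [1.0, np.nan, 1.0, 4.0, 10.0],
        "segment_1": [np.nan, 2.0, 2.0, 4.0, np.nan],
    }
)


def impute_per_segment(wide: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Fill NaNs in each column with a statistic of that column's observed values."""
    if strategy == "mean":
        fill_values = wide.mean()
    elif strategy == "median":
        fill_values = wide.median()
    elif strategy == "most_frequent":
        # mode() lists tied modes in ascending order, so row 0 is the smallest mode.
        fill_values = wide.mode().iloc[0]
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return wide.fillna(fill_values)


median_filled = impute_per_segment(df, "median")
mode_filled = impute_per_segment(df, "most_frequent")
```

On this toy data segment_0 gets 2.5 under "median" and 1.0 under "most_frequent", which shows the two strategies can diverge on skewed segments.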

Code to reproduce

import pandas as pd
import numpy as np

from etna.datasets import TSDataset, generate_ar_df
from etna.transforms import TimeSeriesImputerTransform

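def add_nans(df: pd.DataFrame, nans_frac: float) -> pd.DataFrame:
    # Each row becomes NaN with probability `nans_frac` (True is listed first, so its probability goes first)
    nan_mask = np.random.choice([True, False], size=df.shape[0], p=[nans_frac, 1 - nans_frac])
    # np.nan instead of np.NAN: the upper-case alias was removed in NumPy 2.0
    df.loc[nan_mask, "target"] = np.nan
    return df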

df = add_nans(generate_ar_df(n_segments=10000, periods=100, start_time="2000-01-01"), nans_frac=0.5)
ts = TSDataset(df=TSDataset.to_dataset(df), freq="D")

#transform = TimeSeriesImputerTransform(in_column="target", strategy="constant", constant_value=0)
#transform = TimeSeriesImputerTransform(in_column="target", strategy="forward_fill")
#transform = TimeSeriesImputerTransform(in_column="target", strategy="mean")
#transform = TimeSeriesImputerTransform(in_column="target", strategy="seasonal", window=3, seasonality=7)
transform = TimeSeriesImputerTransform(in_column="target", strategy="seasonal_statistics", window=3, seasonality=7)
transform.fit_transform(ts)

Closing issues

@alex-hse-repository alex-hse-repository added this to the Optimization milestone Jan 15, 2024
@alex-hse-repository alex-hse-repository self-assigned this Jan 15, 2024
github-actions bot commented Jan 15, 2024

🚀 Deployed on https://deploy-preview-217--etna-docs.netlify.app


codecov bot commented Jan 30, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (6a6f180) 89.03% compared to head (6b07273) 89.04%.

| Files | Patch % | Lines |
|---|---|---|
| etna/transforms/missing_values/imputation.py | 98.46% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #217   +/-   ##
=======================================
  Coverage   89.03%   89.04%           
=======================================
  Files         199      199           
  Lines       13183    13200   +17     
=======================================
+ Hits        11738    11754   +16     
- Misses       1445     1446    +1     


etna/transforms/math/statistics.py (review thread resolved)
etna/transforms/missing_values/imputation.py (review thread resolved)
- If "seasonal" then replace missing dates using seasonal moving average
- If "seasonal" then replace missing dates using seasonal moving average in autoregressive manner

- If "seasonal_statistics" then replace missing dates using seasonal moving average on existing values
Collaborator

Could you explain why you suggest this name?

Collaborator Author

I couldn't come up with a better name. Do you have any ideas?

Collaborator

If it isn't autoregressive, how does it work with missing values? This isn't clear from the docs. We should clarify the differences.
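
One possible reading of the two docstring lines, sketched on a single series with NumPy. This is an interpretation for discussion, not the code in imputation.py: the function names are hypothetical, `window` is assumed to mean "number of previous seasonal lags", and leaving a gap unfilled when no lag is available is also an assumption.

```python
import numpy as np


def seasonal_lags(t, window, seasonality):
    """Indices t - s, t - 2s, ..., t - window*s that fall inside the series."""
    return [t - k * seasonality for k in range(1, window + 1) if t - k * seasonality >= 0]


def fill_seasonal(values, window, seasonality, autoregressive):
    original = np.asarray(values, dtype=float)
    out = original.copy()
    # Autoregressive: read from `out`, so earlier fills feed later ones.
    # Existing-only: read from `original`, so only observed values count.
    source = out if autoregressive else original
    for t in np.flatnonzero(np.isnan(original)):  # left to right
        lag_values = source[seasonal_lags(t, window, seasonality)]
        observed = lag_values[~np.isnan(lag_values)]
        if observed.size > 0:  # assumption: leave the gap if no lag is available
            out[t] = observed.mean()
    return out


series = [1.0, 0.0, 3.0, 0.0, np.nan, 0.0, np.nan]
ar_filled = fill_seasonal(series, window=2, seasonality=2, autoregressive=True)
existing_filled = fill_seasonal(series, window=2, seasonality=2, autoregressive=False)
```

Under these assumptions both variants fill index 4 with 2.0 (the mean of 3.0 and 1.0). At index 6 the autoregressive variant averages the freshly imputed 2.0 with 3.0 and gives 2.5, while the existing-only variant skips the imputed lag and gives 3.0.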

etna/transforms/missing_values/imputation.py (outdated, review thread resolved)
etna/transforms/missing_values/imputation.py (review thread resolved)
etna/transforms/missing_values/imputation.py (review thread resolved)
@alex-hse-repository alex-hse-repository merged commit e978ad0 into master Feb 6, 2024
16 checks passed