Add get_anomalies_isolation_forest method #375

Merged: 12 commits merged on Jun 5, 2024

@alex-hse-repository (Collaborator) commented May 30, 2024

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

To Discuss

  1. The signature of this method differs from the others (there is no in_column here). What should we return when index_only=False? -- Add in_column and use this column to populate the output series values.
  2. The current implementation might be quite slow on large datasets. Do we want to make any optimizations now? -- No
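
To make the first point concrete, here is a minimal sketch of the two return shapes for a single segment. It is not the code from this PR: `segment_anomalies` is a hypothetical helper built directly on scikit-learn's `IsolationForest`, and the toy frame with an injected outlier is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def segment_anomalies(df: pd.DataFrame, in_column: str, index_only: bool = True, **forest_kwargs):
    """Hypothetical helper: flag anomalous timestamps in one segment's feature frame."""
    # Isolation forest cannot handle NaNs, so incomplete rows are dropped first
    clean = df.dropna()
    model = IsolationForest(random_state=0, **forest_kwargs)
    # fit_predict returns -1 for outliers and 1 for inliers
    labels = model.fit_predict(clean.to_numpy())
    anomalies = clean.loc[labels == -1, in_column]
    if index_only:
        # Only the timestamps of the detected points
        return list(anomalies.index)
    # Timestamps plus the values taken from `in_column`
    return anomalies


rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=100, freq="D")
frame = pd.DataFrame({"target": rng.normal(size=100), "exog_0": rng.normal(size=100)}, index=index)
frame.loc[index[10], "target"] = 50.0  # inject an obvious outlier

timestamps = segment_anomalies(frame, in_column="target", index_only=True)
series = segment_anomalies(frame, in_column="target", index_only=False)
```

With index_only=True the result is a plain list of timestamps; with index_only=False the same timestamps index a series whose values come from in_column, which is why the column has to be specified in that case.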

Performance Benchmark

It looks like we can try to speed up the computation by parallelizing the procedure across segments: dataframe operations are quite lightweight compared to the isolation forest itself.

  • periods = 100, n_segments = 100: most of the time is spent in the isolation forest (about 2 min)
[Screenshot 2024-06-03 at 11:05:18]
  • periods = 100, n_segments = 10000, 50 features, 1 tree in the forest: most of the time is spent in the isolation forest (about 45 s), with about 15-20 s on dataframe operations
[Screenshot 2024-06-03 at 11:08:04]
from etna.analysis.outliers import get_anomalies_isolation_forest
import numpy as np
import pandas as pd

from etna.datasets import TSDataset, generate_ar_df
from omegaconf import OmegaConf, DictConfig
import hydra


@hydra.main(version_base="1.3.2", config_path=".", config_name="config")
def main(cfg: DictConfig):
    cfg = OmegaConf.to_container(cfg, resolve=True)

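    def add_nans(df: pd.DataFrame, nans_frac: float):
        # Mark roughly `nans_frac` of the rows as missing in the target column
        nan_mask = np.random.choice([True, False], size=df.shape[0], p=[nans_frac, 1 - nans_frac])
        # np.nan is the canonical spelling; the np.NAN alias was removed in NumPy 2.0
        df.loc[nan_mask, "target"] = np.nan
        return df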

    df = add_nans(df=generate_ar_df(n_segments=cfg["n_segments"], periods=cfg["periods"], start_time="2000-01-01"), nans_frac=cfg["nans_frac"])
    for i in range(cfg["n_features"]):
        df[f"exog_{i}"] = np.random.normal(size=len(df))
    ts = TSDataset(df=df, freq="D")

    get_anomalies_isolation_forest(
        ts=ts,
        features_to_use=[f"exog_{i}" for i in range(cfg["features_to_use"])],
        features_to_ignore=cfg["features_to_ignore"],
        ignore_missing=cfg["ignore_missing"],
        index_only=cfg["index_only"],
        n_estimators=1,
    )


if __name__ == "__main__":
    main()
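
The per-segment parallelization suggested in the benchmark notes could look roughly like the sketch below. It is an illustration under stated assumptions, not the optimization adopted in this PR: `detect_one_segment` and `detect_all_segments` are hypothetical names, the segments are represented as a plain dict of feature frames rather than a TSDataset, and joblib with thread-based workers is one possible choice among several.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest


def detect_one_segment(features: pd.DataFrame, n_estimators: int = 10) -> list:
    """Hypothetical per-segment worker: return the timestamps flagged as outliers."""
    clean = features.dropna()
    model = IsolationForest(n_estimators=n_estimators, random_state=0)
    labels = model.fit_predict(clean.to_numpy())  # -1 marks outliers
    return list(clean.index[labels == -1])


def detect_all_segments(segments: dict, n_jobs: int = 2) -> dict:
    """Fit one forest per segment, spreading the segments over `n_jobs` workers."""
    # Each segment is independent, so the fits can run concurrently
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(detect_one_segment)(frame) for frame in segments.values()
    )
    return dict(zip(segments.keys(), results))


rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=100, freq="D")
segments = {}
for i in range(4):
    frame = pd.DataFrame({"target": rng.normal(size=100), "exog_0": rng.normal(size=100)}, index=index)
    frame.loc[index[5 + i], "target"] = 50.0  # one obvious outlier per segment
    segments[f"segment_{i}"] = frame

parallel_result = detect_all_segments(segments, n_jobs=2)
sequential_result = {name: detect_one_segment(frame) for name, frame in segments.items()}
```

Because every forest uses a fixed random_state, the parallel and sequential runs flag the same timestamps, which makes it easy to check that parallelizing does not change the output. Whether threads or processes give the better speedup on real data would need to be measured.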

Closing issues

closes #356

github-actions bot commented May 30, 2024

🚀 Deployed on https://deploy-preview-375--etna-docs.netlify.app

@brsnw250 (Collaborator)

  1. Let's add in_column, which would be mandatory when index_only=False, and use this column to populate the output series values.
  2. It is better to do optimization here.


codecov bot commented Jun 5, 2024

Codecov Report

Attention: Patch coverage is 0% with 59 lines in your changes missing coverage. Please review.

Project coverage is 9.70%. Comparing base (5cb8485) to head (12b3fa8).
Report is 1 commits behind head on master.

Files                                                    Patch %   Lines
...tna/analysis/outliers/isolation_forest_outliers.py    0.00%     57 Missing ⚠️
etna/analysis/__init__.py                                0.00%     1 Missing ⚠️
etna/analysis/outliers/__init__.py                       0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #375       +/-   ##
==========================================
- Coverage   88.88%   9.70%   -79.19%     
==========================================
  Files         224     225        +1     
  Lines       15381   15431       +50     
==========================================
- Hits        13672    1498    -12174     
- Misses       1709   13933    +12224     


@alex-hse-repository alex-hse-repository merged commit 93ae431 into master Jun 5, 2024
16 checks passed
egoriyaa pushed a commit that referenced this pull request Jun 5, 2024
egoriyaa added a commit that referenced this pull request Jun 5, 2024
* Revert "Add `get_anomalies_isolation_forest` method (#375)"

This reverts commit 93ae431.

* fix imports

* fix imports

* fix imports hard

* Revert "fix imports hard"

This reverts commit 10f34c4.

* chore: update changelog

---------

Co-authored-by: Egor Baturin <egoriyaa@github.com>
egoriyaa pushed a commit that referenced this pull request Jun 5, 2024
egoriyaa added a commit that referenced this pull request Jun 6, 2024
(cherry picked from commit 93ae431)

Co-authored-by: alex-hse-repository <55380696+alex-hse-repository@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Add get_anomalies_isolation_forest
2 participants