
Add parts argument to load_dataset function #79

Merged
merged 4 commits into internal_datasets from issue-74 on Sep 9, 2023

Conversation

ostreech1997 (Collaborator)

Before submitting (must-do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

Closing issues

Closes #74.
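
As a hypothetical sketch of how the new argument might be used (the dataset name, the accepted part values, and the return types below are assumptions for illustration, not taken from this PR; the part names mirror the full/train/test files saved by the script further down):

from etna.datasets import load_dataset

# Load a single part of a dataset (part names assumed to be
# "full" / "train" / "test").
ts_train = load_dataset("electricity_15T", parts="train")

# Load several parts at once (tuple interface assumed).
ts_train, ts_test = load_dataset("electricity_15T", parts=("train", "test"))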

codecov bot commented Sep 8, 2023

Codecov Report

❗ No coverage uploaded for pull request base (internal_datasets@959b3a0).
Patch has no changes to coverable lines.

Additional details and impacted files
@@                 Coverage Diff                  @@
##             internal_datasets      #79   +/-   ##
====================================================
  Coverage                     ?   88.61%           
====================================================
  Files                        ?      194           
  Lines                        ?    12488           
  Branches                     ?        0           
====================================================
  Hits                         ?    11066           
  Misses                       ?     1422           
  Partials                     ?        0           

☔ View full report in Codecov by Sentry.


@d-a-bunin d-a-bunin self-requested a review September 8, 2023 10:28
@ostreech1997 ostreech1997 added the datasets Work with datasets label Sep 8, 2023
@ostreech1997 ostreech1997 added this to the Internal datasets milestone Sep 8, 2023
github-actions bot commented Sep 8, 2023

🚀 Deployed on https://deploy-preview-79--etna-docs.netlify.app

ostreech1997 (Collaborator, Author)

Script for measuring the time to save and load data in wide and long formats:

import tempfile
import urllib.request
import zipfile
from pathlib import Path
import time

import pandas as pd

from etna.datasets.tsdataset import TSDataset


dataset_dir = Path.home() / ".etna" / "electricity_15T"


def _download_dataset_zip(url: str, file_name: str, **kwargs) -> pd.DataFrame:
    """Download a zip archive and read ``file_name`` from it into a DataFrame."""
    try:
        with tempfile.TemporaryDirectory() as td:
            temp_path = Path(td) / "temp.zip"
            urllib.request.urlretrieve(url, temp_path)
            with zipfile.ZipFile(temp_path) as f:
                f.extractall(td)
                df = pd.read_csv(Path(td) / file_name, **kwargs)
    except Exception as err:
        raise Exception(f"Error during downloading and reading dataset. Reason: {repr(err)}") from err
    return df


def prepare_data():
    """Download the electricity dataset and build full/train/test long-format frames."""
    url = "https://archive.ics.uci.edu/static/public/321/electricityloaddiagrams20112014.zip"
    dataset_dir.mkdir(exist_ok=True, parents=True)
    data = _download_dataset_zip(url=url, file_name="LD2011_2014.txt", sep=";", dtype=str)

    data = data.rename({"Unnamed: 0": "timestamp"}, axis=1)
    data["timestamp"] = pd.to_datetime(data["timestamp"])
    dt_list = sorted(data["timestamp"].unique())
    # Melt to long format: one row per (timestamp, segment) pair.
    data = data.melt("timestamp", var_name="segment", value_name="target")
    # Values use a decimal comma in the source file.
    data["target"] = data["target"].str.replace(",", ".").astype(float)
    # Hold out the last 15 * 24 timestamps as the test part.
    data_train = data[data["timestamp"].isin(dt_list[: -15 * 24])]
    data_test = data[data["timestamp"].isin(dt_list[-15 * 24:])]
    return data, data_train, data_test


def save_wide():
    """Pivot each part to wide format with TSDataset.to_dataset and save as gzipped CSV."""
    data, data_train, data_test = prepare_data()
    TSDataset.to_dataset(data).to_csv(dataset_dir / "electricity_15T_full.csv.gz", index=True, compression="gzip")
    TSDataset.to_dataset(data_train).to_csv(
        dataset_dir / "electricity_15T_train.csv.gz", index=True, compression="gzip"
    )
    TSDataset.to_dataset(data_test).to_csv(dataset_dir / "electricity_15T_test.csv.gz", index=True, compression="gzip")


def load_wide():
    """Read the wide-format CSV (two-level header) and build a TSDataset directly."""
    data = pd.read_csv(
        dataset_dir / "electricity_15T_full.csv.gz",
        compression="gzip",
        header=[0, 1],
        index_col=[0],
        parse_dates=[0],
    )
    _ = TSDataset(data, freq="15T")


def save_long():
    """Save each part as-is in long format as gzipped CSV."""
    data, data_train, data_test = prepare_data()
    data.to_csv(dataset_dir / "electricity_15T_full.csv.gz", index=False, compression="gzip")
    data_train.to_csv(dataset_dir / "electricity_15T_train.csv.gz", index=False, compression="gzip")
    data_test.to_csv(dataset_dir / "electricity_15T_test.csv.gz", index=False, compression="gzip")


def load_long():
    """Read the long-format CSV and pivot to wide before building a TSDataset."""
    data = pd.read_csv(
        dataset_dir / "electricity_15T_full.csv.gz",
        compression="gzip",
        parse_dates=[0],
    )
    _ = TSDataset(TSDataset.to_dataset(data), freq="15T")


def main():
    # All reported times are in minutes (elapsed seconds divided by 60).
    time_start = time.time()
    save_wide()
    time_end = time.time()
    print("Time for saving data in wide format:", (time_end - time_start) / 60)

    time_start = time.time()
    load_wide()
    time_end = time.time()
    print("Time for loading data in wide format:", (time_end - time_start) / 60)

    time_start = time.time()
    save_long()
    time_end = time.time()
    print("Time for saving data in long format:", (time_end - time_start) / 60)

    time_start = time.time()
    load_long()
    time_end = time.time()
    print("Time for loading data in long format:", (time_end - time_start) / 60)


if __name__ == "__main__":
    main()
    

Results (times in minutes):

Time for saving data in wide format: 6.355555299917857
Time for loading data in wide format: 0.18998863299687704
Time for saving data in long format: 11.959305040041606
Time for loading data in long format: 1.4660529494285583
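
On these numbers the wide format is both faster to save and noticeably faster to load, since load_long has to pivot with TSDataset.to_dataset on every read. A minimal toy example of that conversion (synthetic data, made-up segment names):

import pandas as pd

from etna.datasets.tsdataset import TSDataset

# Toy long-format frame: one row per (timestamp, segment) pair.
long_df = pd.DataFrame(
    {
        "timestamp": list(pd.date_range("2021-01-01", periods=3, freq="15T")) * 2,
        "segment": ["MT_001"] * 3 + ["MT_002"] * 3,
        "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Pivot to the wide format TSDataset expects:
# timestamp index, (segment, feature) MultiIndex columns.
wide_df = TSDataset.to_dataset(long_df)
ts = TSDataset(wide_df, freq="15T")
print(wide_df.shape)  # (3, 2): 3 timestamps x 2 segments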

@ostreech1997 ostreech1997 merged commit a7dced2 into internal_datasets Sep 9, 2023
15 checks passed
@ostreech1997 ostreech1997 deleted the issue-74 branch September 9, 2023 15:49
ostreech1997 added a commit that referenced this pull request Dec 4, 2023