
Add parts argument to load_dataset function #79

Merged
merged 4 commits into internal_datasets from issue-74 on Sep 9, 2023

Conversation

ostreech1997 (Collaborator)

Before submitting (must-do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

Closing issues

Closes #74.
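
As a hypothetical sketch of how the new argument might be used (the dataset name, the accepted part values, and the return types below are assumptions for illustration, not taken from this PR; the part names mirror the full/train/test files saved by the script further down):

from etna.datasets import load_dataset

# Load a single part of a dataset (part names assumed to be
# "full" / "train" / "test").
ts_train = load_dataset("electricity_15T", parts="train")

# Load several parts at once (tuple interface assumed).
ts_train, ts_test = load_dataset("electricity_15T", parts=("train", "test"))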

codecov bot commented Sep 8, 2023

Codecov Report

❗ No coverage uploaded for pull request base (internal_datasets@959b3a0).
Patch has no changes to coverable lines.

Additional details and impacted files
@@                 Coverage Diff                  @@
##             internal_datasets      #79   +/-   ##
====================================================
  Coverage                     ?   88.61%           
====================================================
  Files                        ?      194           
  Lines                        ?    12488           
  Branches                     ?        0           
====================================================
  Hits                         ?    11066           
  Misses                       ?     1422           
  Partials                     ?        0           

☔ View full report in Codecov by Sentry.


@d-a-bunin d-a-bunin self-requested a review September 8, 2023 10:28
@ostreech1997 ostreech1997 added the datasets Work with datasets label Sep 8, 2023
@ostreech1997 ostreech1997 added this to the Internal datasets milestone Sep 8, 2023
github-actions bot commented Sep 8, 2023

🚀 Deployed on https://deploy-preview-79--etna-docs.netlify.app

ostreech1997 (Collaborator, Author)

Script for measuring the time to save and load data in wide and long formats:

import tempfile
import urllib.request
import zipfile
from pathlib import Path
import time

import pandas as pd

from etna.datasets.tsdataset import TSDataset


dataset_dir = Path.home() / ".etna" / "electricity_15T"


def _download_dataset_zip(url: str, file_name: str, **kwargs) -> pd.DataFrame:
    """Download a zip archive and read ``file_name`` from it into a DataFrame."""
    try:
        with tempfile.TemporaryDirectory() as td:
            temp_path = Path(td) / "temp.zip"
            urllib.request.urlretrieve(url, temp_path)
            with zipfile.ZipFile(temp_path) as f:
                f.extractall(td)
                df = pd.read_csv(Path(td) / file_name, **kwargs)
    except Exception as err:
        raise Exception(f"Error during downloading and reading dataset. Reason: {repr(err)}") from err
    return df


def prepare_data():
    """Download the electricity dataset and build full/train/test long-format frames."""
    url = "https://archive.ics.uci.edu/static/public/321/electricityloaddiagrams20112014.zip"
    dataset_dir.mkdir(exist_ok=True, parents=True)
    data = _download_dataset_zip(url=url, file_name="LD2011_2014.txt", sep=";", dtype=str)

    data = data.rename({"Unnamed: 0": "timestamp"}, axis=1)
    data["timestamp"] = pd.to_datetime(data["timestamp"])
    dt_list = sorted(data["timestamp"].unique())
    # Melt to long format: one row per (timestamp, segment) pair.
    data = data.melt("timestamp", var_name="segment", value_name="target")
    # Values use a decimal comma in the source file.
    data["target"] = data["target"].str.replace(",", ".").astype(float)
    # Hold out the last 15 * 24 timestamps as the test part.
    data_train = data[data["timestamp"].isin(dt_list[: -15 * 24])]
    data_test = data[data["timestamp"].isin(dt_list[-15 * 24:])]
    return data, data_train, data_test


def save_wide():
    """Pivot each part to wide format with TSDataset.to_dataset and save as gzipped CSV."""
    data, data_train, data_test = prepare_data()
    TSDataset.to_dataset(data).to_csv(dataset_dir / "electricity_15T_full.csv.gz", index=True, compression="gzip")
    TSDataset.to_dataset(data_train).to_csv(
        dataset_dir / "electricity_15T_train.csv.gz", index=True, compression="gzip"
    )
    TSDataset.to_dataset(data_test).to_csv(dataset_dir / "electricity_15T_test.csv.gz", index=True, compression="gzip")


def load_wide():
    """Read the wide-format CSV (two-level header) and build a TSDataset directly."""
    data = pd.read_csv(
        dataset_dir / "electricity_15T_full.csv.gz",
        compression="gzip",
        header=[0, 1],
        index_col=[0],
        parse_dates=[0],
    )
    _ = TSDataset(data, freq="15T")


def save_long():
    """Save each part as-is in long format as gzipped CSV."""
    data, data_train, data_test = prepare_data()
    data.to_csv(dataset_dir / "electricity_15T_full.csv.gz", index=False, compression="gzip")
    data_train.to_csv(dataset_dir / "electricity_15T_train.csv.gz", index=False, compression="gzip")
    data_test.to_csv(dataset_dir / "electricity_15T_test.csv.gz", index=False, compression="gzip")


def load_long():
    """Read the long-format CSV and pivot to wide before building a TSDataset."""
    data = pd.read_csv(
        dataset_dir / "electricity_15T_full.csv.gz",
        compression="gzip",
        parse_dates=[0],
    )
    _ = TSDataset(TSDataset.to_dataset(data), freq="15T")


def main():
    # All reported times are in minutes (elapsed seconds divided by 60).
    time_start = time.time()
    save_wide()
    time_end = time.time()
    print("Time for saving data in wide format:", (time_end - time_start) / 60)

    time_start = time.time()
    load_wide()
    time_end = time.time()
    print("Time for loading data in wide format:", (time_end - time_start) / 60)

    time_start = time.time()
    save_long()
    time_end = time.time()
    print("Time for saving data in long format:", (time_end - time_start) / 60)

    time_start = time.time()
    load_long()
    time_end = time.time()
    print("Time for loading data in long format:", (time_end - time_start) / 60)


if __name__ == "__main__":
    main()
    

Results (times in minutes):

Time for saving data in wide format: 6.355555299917857
Time for loading data in wide format: 0.18998863299687704
Time for saving data in long format: 11.959305040041606
Time for loading data in long format: 1.4660529494285583
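
On these numbers the wide format is both faster to save and noticeably faster to load, since load_long has to pivot with TSDataset.to_dataset on every read. A minimal toy example of that conversion (synthetic data, made-up segment names):

import pandas as pd

from etna.datasets.tsdataset import TSDataset

# Toy long-format frame: one row per (timestamp, segment) pair.
long_df = pd.DataFrame(
    {
        "timestamp": list(pd.date_range("2021-01-01", periods=3, freq="15T")) * 2,
        "segment": ["MT_001"] * 3 + ["MT_002"] * 3,
        "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Pivot to the wide format TSDataset expects:
# timestamp index, (segment, feature) MultiIndex columns.
wide_df = TSDataset.to_dataset(long_df)
ts = TSDataset(wide_df, freq="15T")
print(wide_df.shape)  # (3, 2): 3 timestamps x 2 segments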

@ostreech1997 ostreech1997 merged commit a7dced2 into internal_datasets Sep 9, 2023
15 checks passed
@ostreech1997 ostreech1997 deleted the issue-74 branch September 9, 2023 15:49
ostreech1997 added a commit that referenced this pull request Dec 4, 2023