
Dataset Manager

Manage and automate your project's datasets with YAML files.


Current Support: Python 3.5, 3.6, 3.7, 3.8

How it Works

This project creates a file called identifier.yaml inside your dataset directory, with these fields:

source: https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv

description: this dataset is a test dataset

identifier: the name used to reference the dataset; it is the file name, with a yaml extension.

source: the location of the dataset.

description: a description of your dataset, so you can remember it later.

Each dataset is a YAML file inside the dataset directory.
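
For example, a dataset referenced as train would live in a file named train.yaml (a minimal sketch built from the source and description shown above):

# train.yaml: the identifier comes from the file name
source: https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv
description: this dataset is a test dataset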

Installing

With pip just:

pip install dataset_manager

With conda:

conda install dataset_manager

Using

You can manage your datasets with a list of commands and integrate them with Pandas or other data analysis tools.

Manager functions

Show all Datasets

Returns a table with all datasets from the dataset path.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.show_datasets()

Create a Dataset

Creates a dataset, with any extra information you want, inside the defined dataset_path.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.create_dataset(identifier, source, description, **kwargs)
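
For example, to register the Titanic dataset shown earlier (the argument values are illustrative):

manager.create_dataset(
    "train",
    "https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv",
    "this dataset is a test dataset"
)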

Remove a Dataset

Removes a dataset from the dataset_path.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.remove_dataset(identifier)

Prepare Datasets

Downloads and unzips all datasets.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.prepare_datasets()

Using Multiple Filesystems

This manager is integrated with Pyfilesystem2, so you can use any of its built-in filesystems, a third-party extension, or an extension of your own.

With Pyfilesystem2, you can download, extract, and manage datasets anywhere.

from fs.tempfs import TempFS
from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download, TempFS())

manager.prepare_datasets() # all datasets are downloaded and extracted to temporary files, respecting your local_path_to_download hierarchy
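
Third-party filesystems plug in the same way. For instance, with the fs-s3fs extension installed (a sketch; the bucket name is hypothetical):

from fs_s3fs import S3FS
from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download, S3FS("my-bucket"))

manager.prepare_datasets() # datasets are downloaded and extracted into the S3 bucket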

Get one Dataset

Gets a dataset line as a dict, for example to read it with Pandas:

import pandas as pd
from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

dataset = manager.get_dataset(identifier)

df = pd.read_csv(dataset.uri)

Dataset functions

Download Dataset

Downloads the dataset from its source. It only downloads once, because a cache is validated. It works with the HTTP, HTTPS, and FTP protocols.

dataset = manager.get_dataset(identifier)

dataset.download()

Unzip Dataset

Unzips the dataset based on its uri. It works with zip files and other formats supported by the fs.archive library.

dataset = manager.get_dataset(identifier)

dataset.unzip()

Prepare Dataset

Prepare combines the two functions above: it downloads and then unzips the dataset.

dataset = manager.get_dataset(identifier)

dataset.prepare()
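
Conceptually, prepare() is a shorthand for calling the two functions above in sequence (a sketch; the download cache still applies):

dataset = manager.get_dataset(identifier)

dataset.download() # fetched from source, cached after the first run
dataset.unzip() # extracts the downloaded archive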

Contributing

Just open a pull request and be happy!

Let's grow together ;)