This dataset was created with the Easydata template for building a dataset from scratch: https://cookiecutter-easydata.readthedocs.io/en/latest/New-Dataset-Template/

In [1]:
# Basic utility functions
import logging
import os
import pathlib
from pprint import pprint

from src.log import logger
from src import paths
from src.utils import list_dir
from functools import partial

# data functions
from src.data import DataSource, Dataset, DatasetGraph, Catalog
from src import helpers

2021-10-26 10:12:51,274 - utils - INFO - NumExpr defaulting to 4 threads.


In [2]:
# Optionally set the log level to DEBUG
logger.setLevel(logging.DEBUG)

## Create the Datasource

In [3]:
ds_name = 'dataset-challenge'
dsrc = DataSource(ds_name)

In [4]:
license = """
CC-BY-4.0 is a common form of dataset license. Here you would put the license for your data, along with any attribution and other information necessary to keep in line with the terms included in the original license.

All data that you use should have an explicit license kept with it. To keep the license with the data, Easydata uses a license as one of its metadata fields which can be accessed via

`.LICENSE`

for any Dataset object.

For more on licenses, see the references at the end of the `04-Data-Challenge` notebook.
"""

In [5]:
descr = """
The `.DESCR` is where Easydata keeps a description of the dataset. In this example, the Dataset object is a container with metadata but no data.

For this dataset, `ds.data` returns `None`.

A basic description should always stay with the data. Easydata stores it in a metadata field that can be accessed via

`.DESCR`

for any Dataset object.

When you transform the data, it is good practice to append a note describing what the transformation did to the end of the `.DESCR` text.

You can add any metadata you want to `ds.metadata`, as it is basically a dict with fancy wrapping paper that lets you access any key via ALL CAPS.
"""

In [6]:
dsrc.add_metadata(contents=descr, force=True)
dsrc.add_metadata(contents=license, kind='LICENSE', force=True)
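
The description above says that metadata behaves like a dict whose keys can also be read as ALL-CAPS attributes. The following is a minimal sketch of that idea only. `DatasetSketch` is a hypothetical stand-in written for illustration, not Easydata's `Dataset` class, and the real implementation may differ.

```python
class DatasetSketch:
    """Hypothetical container: data plus a metadata dict (not Easydata's Dataset)."""

    def __init__(self, data=None, metadata=None):
        self.data = data
        self.metadata = dict(metadata or {})

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so .data and
        # .metadata are unaffected. ALL-CAPS names map to lower-case keys.
        if name.isupper() and name.lower() in self.metadata:
            return self.metadata[name.lower()]
        raise AttributeError(name)


ds_sketch = DatasetSketch(metadata={"descr": "A description.", "license": "CC-BY-4.0"})
ds_sketch.metadata["extra"] = {}  # new keys become readable as ds_sketch.EXTRA

print(ds_sketch.DESCR)
print(ds_sketch.LICENSE)
print(ds_sketch.EXTRA, ds_sketch.data)
```

Keeping the lookup in `__getattr__` means ordinary attributes are never shadowed, and a missing key raises `AttributeError` like any other missing attribute.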

In [7]:
from src.data.extra import process_extra_files

process_function = process_extra_files
process_function_kwargs = {
    'file_glob': '*.csv',
    'do_copy': True,
    'extra_dir': ds_name + '.extra',
    'extract_dir': ds_name,
}



In [8]:
dsrc.process_function = partial(process_function, **process_function_kwargs)
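
The cell above uses `functools.partial` to bind the keyword arguments ahead of time, so the process function can later be called with no arguments. A small sketch of that mechanism follows. `process_stub` is a hypothetical stand-in that only echoes its arguments; it does not reproduce what `process_extra_files` actually does.

```python
from functools import partial


def process_stub(*, file_glob, do_copy, extra_dir, extract_dir):
    """Hypothetical stand-in for a process function; it only echoes its arguments."""
    return {
        "file_glob": file_glob,
        "do_copy": do_copy,
        "extra_dir": extra_dir,
        "extract_dir": extract_dir,
    }


name = "dataset-challenge"
kwargs = {
    "file_glob": "*.csv",
    "do_copy": True,
    "extra_dir": name + ".extra",
    "extract_dir": name,
}

bound = partial(process_stub, **kwargs)
result = bound()  # no arguments needed: everything was bound up front

print(bound.func.__name__, bound.keywords)
```

A `partial` object keeps the wrapped function in `.func` and the bound keyword arguments in `.keywords`, which makes the bound call easy to inspect or record.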

In [9]:
dsrc.update_catalog()

2021-10-26 10:12:51,807 - catalog - DEBUG - Loaded 2 records from 'datasources' Catalog.
2021-10-26 10:12:51,810 - catalog - DEBUG - Verifying serialization for catalog 'datasources'
2021-10-26 10:12:51,815 - catalog - DEBUG - Writing entry:'dataset-challenge' to catalog:'datasources'.
2021-10-26 10:12:51,817 - datasets - DEBUG - Updated datasource:dataset-challenge in catalog


## Create the Corresponding Dataset

In [10]:
# DatasetGraph was already imported in In [1]; repeated so this section can run on its own
from src.data import DatasetGraph

In [11]:
dag = DatasetGraph(catalog_path=paths['catalog_path'])

2021-10-26 10:12:51,851 - catalog - DEBUG - Loaded 3 records from 'transformers' Catalog.
2021-10-26 10:12:51,853 - catalog - DEBUG - Verifying serialization for catalog 'transformers'
2021-10-26 10:12:51,857 - catalog - DEBUG - Loaded 3 records from 'datasets' Catalog.
2021-10-26 10:12:51,858 - catalog - DEBUG - Verifying serialization for catalog 'datasets'
2021-10-26 10:12:51,863 - datasets - DEBUG - Loaded DatasetGraph with 3 nodes and 3 edges.


In [12]:
dag.add_source(output_dataset=ds_name, datasource_name=ds_name, overwrite_catalog=True)

2021-10-26 10:12:51,873 - catalog - DEBUG - Writing entry:'_dataset-challenge' to catalog:'transformers'.
2021-10-26 10:12:51,877 - datasets - INFO - Regenerating output Dataset 'dataset-challenge' and adding to catalog
2021-10-26 10:12:51,878 - datasets - DEBUG - Generating edge traversal list for Dataset:'dataset-challenge'
2021-10-26 10:12:51,879 - datasets - DEBUG - traverse: examining vertex:'dataset-challenge'
2021-10-26 10:12:51,880 - datasets - DEBUG - traverse: all input dependencies:[] satisfied for edge: '_dataset-challenge'
2021-10-26 10:12:51,881 - datasets - DEBUG - Traversal complete. Edges to process: ['_dataset-challenge']
2021-10-26 10:12:51,882 - datasets - DEBUG - process_edge: Processing input datasets for edge:'_dataset-challenge'
2021-10-26 10:12:51,883 - datasets - DEBUG - process_edge:Applying transformer: {'transformer_module': 'src.data.datasets', 'transformer_name': 'dataset_from_datasource', 'transformer_kwargs': {'dataset_name': 'dataset-challenge', 'datas

0it [00:00, ?it/s]

2021-10-26 10:12:51,944 - datasets - INFO - Generated output datasets: ['dataset-challenge'] via edge:'_dataset-challenge'
2021-10-26 10:12:51,949 - datasets - DEBUG - Updating hashes for dataset 'dataset-challenge': {'hashes': {'data': 'sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b', 'target': 'sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b'}}.
2021-10-26 10:12:51,950 - datasets - DEBUG - process_edge: Updating catalog entry for dataset-challenge
2021-10-26 10:12:51,951 - catalog - DEBUG - Writing entry:'dataset-challenge' to catalog:'datasets'.
2021-10-26 10:12:51,953 - datasets - DEBUG - process_edge: Overwriting 'dataset-challenge' in `dataset_path`
2021-10-26 10:12:51,955 - datasets - DEBUG - Updating hashes for dataset 'dataset-challenge': {'hashes': {'data': 'sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b', 'target': 'sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b'}}.
2021-10-26 10:12:51,958 - datasets - DEBUG - Wrote Dataset Metadata: dataset-challenge.metadata
2021-10-26 10:12:51

{'_dataset-challenge': {'transformations': [{'transformer_module': 'src.data.datasets',
    'transformer_name': 'dataset_from_datasource',
    'transformer_kwargs': {'dataset_name': 'dataset-challenge',
     'datasource_name': 'dataset-challenge'}}],
  'output_datasets': ['dataset-challenge']}}
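
The debug log above shows a traversal that examines a dataset, checks that the inputs of the edge producing it are satisfied, and ends with a list of edges to process. A rough sketch of that kind of dependency-ordered traversal is given below. The function name `edges_to_process`, the `input_datasets` key, and the second edge `clean` are assumptions made for illustration; this is not Easydata's traversal code.

```python
def edges_to_process(target, edges):
    """Return edge names in an order that satisfies dependencies for `target`.

    `edges` maps an edge name to a dict with "input_datasets" and
    "output_datasets" lists. The graph is assumed to be acyclic.
    """
    # Map each dataset to the edge that produces it.
    producer = {
        out: name for name, edge in edges.items() for out in edge["output_datasets"]
    }
    ordered, seen = [], set()

    def visit(dataset):
        edge = producer[dataset]
        if edge in seen:
            return
        seen.add(edge)
        for dep in edges[edge].get("input_datasets", []):
            visit(dep)
        ordered.append(edge)  # every input edge is scheduled before this one

    visit(target)
    return ordered


edges = {
    "_dataset-challenge": {
        "input_datasets": [],
        "output_datasets": ["dataset-challenge"],
    },
    # Hypothetical downstream edge, added only to show ordering.
    "clean": {
        "input_datasets": ["dataset-challenge"],
        "output_datasets": ["dataset-challenge-clean"],
    },
}

print(edges_to_process("dataset-challenge", edges))
print(edges_to_process("dataset-challenge-clean", edges))
```

A source dataset has no input dependencies, so its traversal list contains a single edge, which matches the shape of the log output for `dataset-challenge`.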

In [13]:
ds = Dataset.from_catalog(ds_name)

2021-10-26 10:12:52,009 - catalog - DEBUG - Loaded 3 records from 'transformers' Catalog.
2021-10-26 10:12:52,011 - catalog - DEBUG - Verifying serialization for catalog 'transformers'
2021-10-26 10:12:52,015 - catalog - DEBUG - Loaded 3 records from 'datasets' Catalog.
2021-10-26 10:12:52,017 - catalog - DEBUG - Verifying serialization for catalog 'datasets'
2021-10-26 10:12:52,020 - datasets - DEBUG - Loaded DatasetGraph with 3 nodes and 3 edges.
2021-10-26 10:12:52,021 - datasets - DEBUG - Generating edge traversal list for Dataset:'dataset-challenge'
2021-10-26 10:12:52,022 - datasets - DEBUG - traverse: examining vertex:'dataset-challenge'
2021-10-26 10:12:52,023 - datasets - DEBUG - traverse: all input dependencies:[] satisfied for edge: '_dataset-challenge'
2021-10-26 10:12:52,024 - datasets - DEBUG - Traversal complete. Edges to process: ['_dataset-challenge']
2021-10-26 10:12:52,028 - datasets - DEBUG - process_edge: Processing input datasets for edge:'_dataset-challenge'
2021

0it [00:00, ?it/s]

2021-10-26 10:12:52,105 - datasets - INFO - Generated output datasets: ['dataset-challenge'] via edge:'_dataset-challenge'
2021-10-26 10:12:52,112 - datasets - DEBUG - Updating hashes for dataset 'dataset-challenge': {'hashes': {'data': 'sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b', 'target': 'sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b'}}.
2021-10-26 10:12:52,113 - datasets - DEBUG - process_edge: Reloading Dataset catalog after processing edge:'_dataset-challenge'
2021-10-26 10:12:52,122 - catalog - DEBUG - Loaded 3 records from 'datasets' Catalog.
2021-10-26 10:12:52,123 - catalog - DEBUG - Verifying serialization for catalog 'datasets'


In [14]:
ds = Dataset.load(ds_name)

2021-10-26 10:12:52,157 - catalog - DEBUG - Loaded 3 records from 'transformers' Catalog.
2021-10-26 10:12:52,160 - catalog - DEBUG - Verifying serialization for catalog 'transformers'
2021-10-26 10:12:52,164 - catalog - DEBUG - Loaded 3 records from 'datasets' Catalog.
2021-10-26 10:12:52,166 - catalog - DEBUG - Verifying serialization for catalog 'datasets'
2021-10-26 10:12:52,168 - datasets - DEBUG - Loaded DatasetGraph with 3 nodes and 3 edges.
2021-10-26 10:12:52,169 - datasets - DEBUG - Verifying hashes using Dataset catalog.
2021-10-26 10:12:52,173 - catalog - DEBUG - Loaded 3 records from 'datasets' Catalog.
2021-10-26 10:12:52,175 - catalog - DEBUG - Verifying serialization for catalog 'datasets'
2021-10-26 10:12:52,179 - datasets - DEBUG - Load dataset-challenge from disk...
2021-10-26 10:12:52,181 - datasets - DEBUG - Loaded dataset-challenge from disk.
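
The log lines above mention hash verification, and earlier output records hashes in the form `sha1:<hexdigest>` for both `data` and `target`. The sketch below illustrates that general pattern under a loud assumption: it hashes the pickled bytes of each object with `hashlib.sha1`, which may not be the scheme Easydata uses, so the digests it prints are not expected to match the ones in the log. `sha1_tag` and `verify_hashes` are hypothetical names.

```python
import hashlib
import pickle


def sha1_tag(obj):
    """Return 'sha1:<hexdigest>' for the pickled bytes of obj (an assumed scheme)."""
    return "sha1:" + hashlib.sha1(pickle.dumps(obj, protocol=4)).hexdigest()


def verify_hashes(dataset, recorded):
    """Compare freshly computed hashes against recorded ones, key by key."""
    return {key: sha1_tag(dataset[key]) == value for key, value in recorded.items()}


# A dataset with no data and no target, like the metadata-only example here.
empty = {"data": None, "target": None}
recorded = {key: sha1_tag(value) for key, value in empty.items()}

print(recorded["data"] == recorded["target"])  # identical inputs give identical hashes
print(verify_hashes(empty, recorded))
print(verify_hashes({"data": [1], "target": None}, recorded))
```

Under this scheme, two fields holding the same value produce the same hash, which would be consistent with the identical `data` and `target` hashes in the log if both fields are empty.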


In [15]:
print(ds.DESCR)


The `.DESCR` is where Easydata keeps a description of the dataset. In this example, the Dataset object is a container with metadata but no data.

For this dataset, `ds.data` returns `None`.

A basic description should always stay with the data. Easydata stores it in a metadata field that can be accessed via

`.DESCR`

for any Dataset object.

When you transform the data, it is good practice to append a note describing what the transformation did to the end of the `.DESCR` text.

You can add any metadata you want to `ds.metadata`, as it is basically a dict with fancy wrapping paper that lets you access any key via ALL CAPS.
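
The description recommends appending a note to `.DESCR` whenever the data is transformed. One way this could look is sketched below with a plain dict standing in for the metadata. `append_to_descr` is a hypothetical helper and the example strings are made up; neither comes from Easydata.

```python
def append_to_descr(metadata, note):
    """Return a copy of metadata with `note` appended to its 'descr' entry."""
    updated = dict(metadata)
    updated["descr"] = updated.get("descr", "").rstrip() + "\n\n" + note.strip() + "\n"
    return updated


before = {"descr": "Raw survey responses.\n", "license": "CC-BY-4.0"}
after = append_to_descr(before, "Transformation: dropped rows with missing values.")

print(after["descr"])
```

Returning a copy leaves the input metadata untouched, so the original dataset's description stays valid while the transformed dataset carries the extended one.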



In [16]:
print(ds.LICENSE)


CC-BY-4.0 is a common form of dataset license. Here you would put the license for your data, along with any attribution and other information necessary to keep in line with the terms included in the original license.

All data that you use should have an explicit license kept with it. To keep the license with the data, Easydata uses a license as one of its metadata fields which can be accessed via

`.LICENSE`

for any Dataset object.

For more on licenses, see the references at the end of the `04-Data-Challenge` notebook.



In [17]:
ds.EXTRA

{}

In [18]:
ds.data