Full documentation can be found at http://datamaestro.rtfd.io
This project aims at grouping utilities to deal with the numerous and heterogeneous datasets present on the Web. It aims to be:
- a reference for available resources, listing datasets
- a tool to automatically download and process resources (when freely available)
- integrated with the experimaestro experiment manager
- (planned) a tool for copying data from one computer to another
Each dataset is uniquely identified by a qualified name such as com.lecun.mnist, which is usually the reversed domain name of the website associated with the dataset.
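As a rough illustration of this naming convention, the sketch below builds an identifier from a domain and a dataset name. The `qualified_name` helper is hypothetical and not part of datamaestro; real identifiers are chosen by whoever writes the dataset definition, and (as the MNIST example shows) subdomains such as `yann` may be left out.

```python
def qualified_name(domain: str, dataset: str) -> str:
    """Build a dataset identifier from a website domain and a dataset name.

    Hypothetical helper for illustration only; it is not part of datamaestro.
    """
    # Reverse the dot-separated domain components: "lecun.com" -> "com.lecun"
    reversed_domain = ".".join(reversed(domain.lower().split(".")))
    return f"{reversed_domain}.{dataset.lower()}"


mnist_id = qualified_name("lecun.com", "mnist")
print(mnist_id)  # com.lecun.mnist
```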
The main repository only deals with very generic processing (downloading, basic pre-processing and data types). Plugins can then be registered to provide access to domain-specific datasets:
- NLP and information access related datasets
  Natural Language Processing (e.g. Sentiment101) and information access (e.g. TREC) datasets
- image-related datasets
  Image-related datasets (e.g. MNIST)
- machine learning
  Generic machine learning datasets
The command line interface allows interacting with the datasets. The commands are listed below; help can be found by typing datamaestro COMMAND --help:
- search: search datasets by name, tags and/or tasks
- download: download files (if accessible on the Internet) or ask for a download path otherwise
- prepare: download dataset files and output a JSON document containing paths and other dataset information
- repositories: list the available repositories
- orphans: list data directories that do not correspond to any registered dataset (and allow cleaning them up)
- create-dataset: create a dataset definition
The command line interface can automatically download the different resources. Datamaestro extensions can provide additional processing tools.
$ datamaestro search tag:image
[image] com.lecun.mnist
$ datamaestro prepare com.lecun.mnist
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz: 32.8kB [00:00, 92.1kB/s]
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz: 9.92MB [00:00, 10.6MB/s]
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
...JSON...
The previous command also prints a JSON document on standard output:
{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}
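A script can consume this output directly. The sketch below is one possible way to do it, not an official client: `prepare` shells out to the CLI and assumes that the `datamaestro` executable is on the PATH and that standard output contains only the JSON document (with log lines going elsewhere), while `iter_paths` is a hypothetical helper that collects every "path" entry. The sample document reuses the (elided) paths shown above.

```python
import json
import subprocess
from typing import Iterator, Tuple


def iter_paths(node, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted key, path) pairs for every "path" entry in the JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str):
                yield prefix, value
            else:
                name = f"{prefix}.{key}" if prefix else key
                yield from iter_paths(value, name)


def prepare(dataset_id: str) -> dict:
    """Run `datamaestro prepare` and parse its standard output as JSON.

    Assumption: stdout holds only the JSON document.
    """
    done = subprocess.run(
        ["datamaestro", "prepare", dataset_id],
        stdout=subprocess.PIPE,
        check=True,
        text=True,
    )
    return json.loads(done.stdout)


# Sample document copied from the output above (paths are elided there too).
sample = json.loads("""
{
  "train": {
    "images": {"path": ".../data/image/com/lecun/mnist/train_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/train_labels.idx"}
  },
  "test": {
    "images": {"path": ".../data/image/com/lecun/mnist/test_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/test_labels.idx"}
  },
  "id": "com.lecun.mnist"
}
""")

paths = dict(iter_paths(sample))
for name, path in sorted(paths.items()):
    print(name, path)
```

With datamaestro installed, `dict(iter_paths(prepare("com.lecun.mnist")))` should give the same mapping with real paths, provided the assumptions above hold.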
For those using Python, this is even more convenient since the IDX format is supported directly:
In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))
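For readers curious about where the dtype and shape above come from, here is a minimal sketch of decoding an IDX header using only the standard library. It follows the IDX layout described on the MNIST page (two zero bytes, a type code, the number of dimensions, then one big-endian 32-bit size per dimension). It is not how datamaestro reads these files; `read_idx_header` and `IDX_TYPES` are names made up for this example.

```python
import struct
from typing import Tuple

# Type codes defined by the IDX format
IDX_TYPES = {
    0x08: "uint8",
    0x09: "int8",
    0x0B: "int16",
    0x0C: "int32",
    0x0D: "float32",
    0x0E: "float64",
}


def read_idx_header(buf: bytes) -> Tuple[str, Tuple[int, ...]]:
    """Return (dtype name, shape) decoded from the start of an IDX buffer."""
    # Magic number: 2 zero bytes, 1 byte type code, 1 byte dimension count
    zero, type_code, ndim = struct.unpack_from(">HBB", buf, 0)
    if zero != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    # One big-endian unsigned 32-bit integer per dimension
    shape = struct.unpack_from(f">{ndim}I", buf, 4)
    return IDX_TYPES[type_code], shape


# Header built by hand to mimic the MNIST training images file
header = bytes([0, 0, 0x08, 3]) + struct.pack(">3I", 60000, 28, 28)
dtype_name, shape = read_idx_header(header)
print(dtype_name, shape)  # uint8 (60000, 28, 28)
```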
Each dataset (or set of related datasets) is described in Python using a mix of declarative and imperative statements. This makes it possible to quickly define how to download a dataset using the datamaestro declarative API; the imperative part is used when creating the JSON output, and is integrated with experimaestro.
Its syntax is described in the documentation.
For MNIST, this corresponds to:
from datamaestro_image.data import ImageClassification, LabelledImages, Base, IDXImage
from datamaestro.download.single import filedownloader
from datamaestro.definitions import argument, datatasks, datatags, dataset
from datamaestro.data.tensor import IDX
@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
    ImageClassification,
    url="http://yann.lecun.com/exdb/mnist/",
)
def MNIST(train_images, train_labels, test_images, test_labels):
    """The MNIST database

    The MNIST database of handwritten digits, available from this page, has a
    training set of 60,000 examples, and a test set of 10,000 examples. It is a
    subset of a larger set available from NIST. The digits have been
    size-normalized and centered in a fixed-size image.
    """
    return {
        "train": LabelledImages(
            images=IDXImage(path=train_images),
            labels=IDX(path=train_labels)
        ),
        "test": LabelledImages(
            images=IDXImage(path=test_images),
            labels=IDX(path=test_labels)
        ),
    }
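To make the declarative/imperative split more concrete, here is a toy sketch of how stacked decorators can collect resource declarations and later pass resolved paths to the function. This is not datamaestro's implementation: `ToyDefinition`, `toy_dataset` and `toy_filedownloader` are hypothetical names, nothing is downloaded, and deriving the argument name by stripping the file extension is an assumption made to mirror the MNIST definition above.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Callable, Dict, Tuple


@dataclass
class ToyDefinition:
    """Hypothetical container: a dataset function plus its declared resources."""

    build: Callable[..., dict]
    # argument name -> (file name, source URL)
    resources: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def prepare(self, datadir: str) -> dict:
        # Imperative part: resolve each declared resource to a local path
        # (this sketch downloads nothing) and call the dataset function.
        paths = {
            varname: str(PurePosixPath(datadir) / filename)
            for varname, (filename, _url) in self.resources.items()
        }
        return self.build(**paths)


def toy_dataset(func: Callable[..., dict]) -> ToyDefinition:
    """Innermost decorator: wrap the function in a definition object."""
    return ToyDefinition(build=func)


def toy_filedownloader(filename: str, url: str):
    """Declarative part: register a resource on an existing definition."""

    def annotate(definition: ToyDefinition) -> ToyDefinition:
        # Assumption: the argument name is the file name without extension
        varname = filename.rsplit(".", 1)[0]
        definition.resources[varname] = (filename, url)
        return definition

    return annotate


@toy_filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@toy_filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@toy_dataset
def toy_mnist(train_images, train_labels):
    return {
        "train": {
            "images": {"path": train_images},
            "labels": {"path": train_labels},
        }
    }


result = toy_mnist.prepare("/tmp/data")
print(result["train"]["images"]["path"])  # /tmp/data/train_images.idx
```

Decorators are applied from the bottom up, so the dataset decorator runs first and each downloader decorator then annotates the object it returned, which is why the dataset decorator sits closest to the function.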