
Sets

Read datasets in a standard way.

Example

from sets import Mnist

# Download, parse and cache the dataset.
train, test = Mnist()

# Sample random batches.
data, target = train.sample(50)

# Iterate over all examples.
for data, target in test:
    pass

Datasets

| Dataset | Description | Format | Size |
| --- | --- | --- | --- |
| Mnist | Standard dataset of handwritten digits. | Data is normalized to 0-1 range. Targets are one-hot encoded. | 60k/10k |
| SemEvalRelation | Relation classification from the SemEval 2010 conference. | String sentences with entity tags <e1> and <e2>. | 8k |
| Ocr | Handwritten letter sequences. | Binary images of 16x8 pixels. | 6877 |
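
The documented MNIST format can be checked directly. Below is a minimal sketch, assuming sample() unpacks into data and target arrays as in the example above.

from sets import Mnist
import numpy as np

train, test = Mnist()
data, target = train.sample(1)
assert data.min() >= 0.0 and data.max() <= 1.0  # pixels in 0-1 range
assert np.isclose(target.sum(), 1.0)            # one row, one-hot target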

Processes

| Utility | Description |
| --- | --- |
| Concat | Concatenate the specified columns of a dataset. |
| Glove | Replace words by pre-trained vectors from the GloVe model. |
| Normalize | Fit mean and std to a dataset and then normalize any dataset by that. |
| OneHot | Replace words by their index in a specified list. |
| Split | Split a dataset according to one or more ratios. |
| Tokenize | Split and pad sentences using NLTK. Preserve tags in angle brackets. |
| WordDistance | Add a column of offsets to the provided words. |
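
These steps are chained by calling them on datasets, as in the embedding example below. The following sketch normalizes the MNIST data column and splits off a validation set; the call signatures for Normalize and Split are assumptions modeled on the Tokenize and Glove example, not documented behavior.

import sets

train, test = sets.Mnist()

# Assumed usage: the first call fits mean and std, later calls reuse them.
normalize = sets.Normalize()
train = normalize(train, columns=['data'])
test = normalize(test, columns=['data'])

# Assumed signature: split the rows by a ratio.
split = sets.Split()
train, valid = split(train, 0.8)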

Interface

The Dataset class holds data columns as immutable Numpy arrays of equal length. Strings index columns and integers index rows. Indexing by value, slice, and list is supported.

| Attribute | Description |
| --- | --- |
| dataset.columns | Sorted list of column names. |
| dataset.column, dataset['column'] | Get a copy of the column's Numpy array. |
| del dataset['column'] | Drop one or more columns. |
| len(dataset) | Number of rows. Each column has this length. |
| for row in dataset | Iterate over all rows as tuples, sorted by column name. |
| dataset.sample(size) | Return a new dataset of size randomly sampled rows. |
| dataset.copy() | Perform a deep copy. |
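
For concreteness, here is a short sketch of this interface, using the column names from the MNIST example above.

import sets

train, test = sets.Mnist()

print(train.columns)        # sorted column names, e.g. ['data', 'target']
data = train['data']        # copy of the column's Numpy array
print(len(train))           # number of rows
subset = train[:100]        # slices and lists select rows

batch = train.sample(50)    # new dataset of 50 randomly sampled rows
clone = train.copy()        # deep copy
del clone['target']         # drop a column

for data, target in train:  # rows as tuples, sorted by column name
    break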

The Step class is used for producing and processing datasets. All steps have a __call__() function that returns one or more dataset objects. For example, a parser may return the training set and the test set.

train, test = sets.Mnist()

An embedding class may take a dataset with string values as a parameter to __call__() and return a version of that dataset with the words replaced by their embeddings.

tokenize = sets.Tokenize()
glove = sets.Glove()
dataset = tokenize(dataset, columns=['data'])
dataset = glove(dataset, columns=['data'])

Caching

By default, datasets are cached inside ~/.dataset/sets/. You can change this by setting the directory variable in the configuration. To save even more time, apply the @sets.disk_cache(basename, directory, method=False) decorator to your whole pipeline. It hashes the function arguments to determine whether a valid cache exists.
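
A minimal sketch of the decorator, with placeholder values for the basename and cache directory:

import sets

# The basename and directory below are placeholders for illustration.
@sets.disk_cache('mnist_pipeline', '/tmp/sets-cache')
def pipeline():
    train, test = sets.Mnist()
    return train, test

# The first call runs the pipeline and writes the cache; later calls with
# the same arguments load the cached result instead.
train, test = pipeline()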

Configuration

The configuration is a YAML file named .setsrc. Sets looks for this file in the current working directory, in the user's home directory, and at the path given by the SETS_CONFIG environment variable.

directory: ~/.dataset/sets

Contributions

Parsers for new datasets are welcome.

License

Released under the MIT license.