

Read datasets in a standard way.


from sets import Mnist

# Download, parse and cache the dataset.
train, test = Mnist()

# Sample random batches.
data, target = train.sample(50)

# Iterate over all examples.
for data, target in test:
    pass  # Process each example here.


| Dataset | Description | Format | Size |
| --- | --- | --- | --- |
| Mnist | Standard dataset of handwritten digits. | Data is normalized to 0-1 range. Targets are one-hot encoded. | 60k/10k |
| SemEvalRelation | Relation classification from the SemEval 2010 conference. | String sentences with entity tags `<e1>` and `<e2>`. | 8k |
| Ocr | Handwritten letter sequences. | Binary images of 16x8 pixels. | 6877 |


| Utility | Description |
| --- | --- |
| Concat | Concatenate the specified columns of a dataset. |
| Glove | Replace words by pre-trained vectors from the GloVe model. |
| Normalize | Fit mean and std to a dataset and then normalize any dataset by that. |
| OneHot | Replace words by their index in a specified list. |
| Split | Split a dataset according to one or more ratios. |
| Tokenize | Split and pad sentences using NLTK. Preserve tags in angle brackets. |
| WordDistance | Add a column of offsets to the provided words. |
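
The fit-then-apply pattern behind Normalize can be sketched in plain Python. This is an illustration only, not the library's code: the `fit_normalizer` helper is hypothetical, and it works on a flat list of floats rather than on dataset columns.

```python
import statistics


def fit_normalizer(values):
    """Fit mean and std on `values` and return a function that normalizes
    any sequence with those fitted statistics (hypothetical helper)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        std = 1.0  # Avoid division by zero for constant data.

    def normalize(other):
        return [(x - mean) / std for x in other]

    return normalize


# Fit on the training values, then reuse the same statistics for test data.
normalize = fit_normalizer([2.0, 4.0, 6.0])
train_normalized = normalize([2.0, 4.0, 6.0])
test_normalized = normalize([4.0, 8.0])
```

Fitting once and reusing the statistics keeps training and test data on the same scale, which is presumably why the utility separates the two steps.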


The Dataset class holds data columns as immutable NumPy arrays of equal length. Strings index columns and integers index rows. Indexing by value, slice and list is supported.

| Attribute | Description |
| --- | --- |
| `dataset.columns` | Sorted list of column names. |
| `dataset.column`, `dataset['column']` | Get a copy of this column's NumPy array. |
| `del dataset['column']` | Drop one or more columns. |
| `len(dataset)` | Number of rows. Each column will be of that length. |
| `for row in dataset` | Iterate over all rows as tuples. Tuples are sorted by column names. |
| `dataset.sample(size)` | Return a new dataset of `size` randomly sampled rows. |
| `dataset.copy()` | Perform a deep copy. |
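
To make the indexing rules concrete, here is a toy stand-in written in plain Python. It is a sketch under assumptions, not the library's class: `MiniDataset` and its keyword-argument constructor are hypothetical, tuples play the role of the immutable NumPy arrays, and sampling without replacement is a guess at the behavior.

```python
import random


class MiniDataset:
    """Toy stand-in for the Dataset described above (hypothetical)."""

    def __init__(self, **columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError('all columns must have the same length')
        # Tuples are immutable, like the arrays the real class holds.
        self._columns = {
            name: tuple(values) for name, values in columns.items()}

    @property
    def columns(self):
        return sorted(self._columns)

    def __len__(self):
        return len(next(iter(self._columns.values()), ()))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Strings index columns.
            return self._columns[key]
        if isinstance(key, int):
            # Integers index rows; values are ordered by column name.
            return tuple(self._columns[name][key] for name in self.columns)
        raise TypeError('unsupported key type')

    def __iter__(self):
        return (self[index] for index in range(len(self)))

    def sample(self, size):
        # Assumption: rows are drawn without replacement.
        indices = random.sample(range(len(self)), size)
        return MiniDataset(**{
            name: [values[i] for i in indices]
            for name, values in self._columns.items()})


dataset = MiniDataset(data=[0.1, 0.2, 0.3], target=[1, 0, 1])
first_row = dataset[0]     # Row tuple ordered as ('data', 'target').
batch = dataset.sample(2)  # New dataset with 2 randomly chosen rows.
```

Slice and list indexing are left out to keep the sketch short.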

The Step class is used for producing and processing datasets. All steps have a __call__() function that returns one or more dataset objects. For example, a parser may return the training set and the test set.

train, test = sets.Mnist()
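
A custom step might look roughly like the following. This is a minimal sketch, not the library's base class: `RatioSplit` is a hypothetical name, plain lists stand in for dataset objects, and the step is instantiated before it is called.

```python
class RatioSplit:
    """Hypothetical step whose __call__() returns two datasets."""

    def __init__(self, ratio):
        if not 0 <= ratio <= 1:
            raise ValueError('ratio must be between 0 and 1')
        self.ratio = ratio

    def __call__(self, rows):
        boundary = int(len(rows) * self.ratio)
        # Return two datasets, like a parser returning train and test sets.
        return rows[:boundary], rows[boundary:]


split = RatioSplit(0.75)
first_part, second_part = split(['a', 'b', 'c', 'd'])
```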

An embedding step may take a dataset with string values as an argument to __call__() and return a version of that dataset in which the words are replaced by their embeddings.

tokenize = sets.Tokenize()
glove = sets.Glove()
dataset = tokenize(dataset, columns=['data'])
dataset = glove(dataset, columns=['data'])


By default, datasets will be cached inside ~/.dataset/sets/. You can change the directory by specifying the directory variable in the configuration. To save even more time, use the @sets.disk_cache(basename, directory, method=False) decorator and apply it to your whole pipeline. It hashes function arguments in order to determine if a cache is valid.
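
The idea of validating a cache by hashing function arguments can be illustrated with the standard library alone. The decorator below is a simplified sketch, not the implementation of `sets.disk_cache`: the name `disk_cache_sketch`, the pickle file format and the file naming scheme are assumptions, and it only works for arguments and results that can be pickled.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_cache_sketch(basename, directory):
    """Cache a function's return value on disk, keyed by its arguments."""
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            # Hash the pickled arguments to get a key for this call.
            key = pickle.dumps((args, sorted(kwargs.items())))
            digest = hashlib.sha256(key).hexdigest()[:16]
            path = os.path.join(directory, f'{basename}-{digest}.pickle')
            if os.path.exists(path):
                with open(path, 'rb') as file_:
                    return pickle.load(file_)
            result = function(*args, **kwargs)
            os.makedirs(directory, exist_ok=True)
            with open(path, 'wb') as file_:
                pickle.dump(result, file_)
            return result
        return wrapper
    return decorator


calls = []
cache_dir = tempfile.mkdtemp()


@disk_cache_sketch('pipeline', cache_dir)
def pipeline(size):
    calls.append(size)  # Record how often the body actually runs.
    return list(range(size))


first = pipeline(3)   # Computed and written to disk.
second = pipeline(3)  # Loaded from disk; the body does not run again.
```

Calling the function with different arguments produces a different hash, so a stale result is never returned for a new input.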


The configuration is a YAML file named .setsrc. Sets looks for the file in the current working directory, in the user's home directory and at the location given by the SETS_CONFIG environment variable.

directory: ~/.dataset/sets
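
A lookup across these locations could be sketched as follows. The `find_config` helper is hypothetical, and the search order shown (working directory, then home directory, then environment variable) is an assumption rather than documented behavior. Parsing the YAML content is left out.

```python
import os


def find_config(filename='.setsrc', env_var='SETS_CONFIG'):
    """Return the first existing candidate path, or None (hypothetical)."""
    candidates = [
        os.path.join(os.getcwd(), filename),
        os.path.join(os.path.expanduser('~'), filename),
    ]
    from_env = os.environ.get(env_var)
    if from_env:
        # Assumption: the variable holds the path of the file itself.
        candidates.append(os.path.expanduser(from_env))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


config_path = find_config()  # None if no configuration file exists.
```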


Parsers for new datasets are welcome.


Released under the MIT license.