Read datasets in a standard way.
```python
from sets import Mnist

# Download, parse and cache the dataset.
train, test = Mnist()

# Sample random batches.
data, target = train.sample(50)

# Iterate over all examples.
for data, target in test:
    pass
```
| Dataset | Description | Format | Size |
| ------- | ----------- | ------ | ---- |
| Mnist | Standard dataset of handwritten digits. | Data is normalized to 0-1 range. Targets are one-hot encoded. | 60k/10k |
| | Relation classification from the SemEval 2010 conference. | String sentences with entity tags. | |
| | Handwritten letter sequences. | Binary images of 16x8 pixels. | 6877 |
| Step | Description |
| ---- | ----------- |
| | Concatenate the specified columns of a dataset. |
| Glove | Replace words by pre-trained vectors from the Glove model. |
| | Fit mean and std to a dataset and then normalize any dataset by that. |
| | Replace words by their index in a specified list. |
| | Split a dataset according to one or more ratios. |
| Tokenize | Split and pad sentences using NLTK. Preserves tags in angle brackets. |
| | Add a column of offsets to the provided words. |
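The fit-then-apply behavior of the normalization step above can be sketched in plain Python. The class name and interface here are illustrative only, not the library's actual API:

```python
import statistics

class NormalizeSketch:
    """Fit mean and std on one dataset, then normalize any dataset by them."""

    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, values):
        # Remember the statistics of the reference dataset.
        self.mean = statistics.mean(values)
        self.std = statistics.pstdev(values)

    def __call__(self, values):
        # Normalize any values using the previously fitted statistics.
        return [(x - self.mean) / self.std for x in values]
```

Fitting once and reusing the statistics ensures that the training set and the test set are normalized consistently.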
The Dataset class holds data columns as immutable Numpy arrays of equal
length. Strings index columns and integers index rows. Indexing by value,
slice and list is also supported.
| Method | Description |
| ------ | ----------- |
| | Sorted list of columns. |
| | Get a copy of this column's Numpy array. |
| | Drop one or more columns. |
| | Number of rows. Each column will be of that length. |
| | Iterate over all rows as tuples. Tuples are sorted by column names. |
| | Return new dataset of |
| | Perform a deep copy. |
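The indexing semantics described above can be illustrated with a minimal stand-in class. This is a sketch of the described behavior only, not the library's implementation; it stores tuples instead of Numpy arrays and omits slice and list indexing:

```python
import copy

class DatasetSketch:
    """Equal-length immutable columns; strings index columns, ints index rows."""

    def __init__(self, **columns):
        lengths = {len(values) for values in columns.values()}
        assert len(lengths) == 1, 'all columns must have equal length'
        # Tuples stand in for immutable Numpy arrays.
        self._columns = {name: tuple(values) for name, values in columns.items()}

    @property
    def columns(self):
        # Sorted list of column names.
        return sorted(self._columns)

    def __len__(self):
        # Number of rows; every column has this length.
        return len(next(iter(self._columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # String key: return a copy of that column.
            return list(self._columns[key])
        # Integer key: return the row as a tuple, sorted by column name.
        return tuple(self._columns[name][key] for name in self.columns)

    def __iter__(self):
        # Iterate over all rows as tuples.
        return (self[index] for index in range(len(self)))

    def copy(self):
        # Perform a deep copy.
        return copy.deepcopy(self)
```

For example, `dataset['data']` returns a column while `dataset[0]` returns the first row as a tuple.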
The Step class is used for producing and processing datasets. All steps have
a __call__() method that returns one or more dataset objects. For example,
a parser may return the training set and the test set.

```python
train, test = sets.Mnist()
```
An embedding class may take a dataset with string values as a parameter to
__call__() and return a version of this dataset with the words replaced by
their vectors.

```python
tokenize = sets.Tokenize()
glove = sets.Glove()
dataset = tokenize(dataset, columns=['data'])
dataset = glove(dataset, columns=['data'])
```
By default, datasets will be cached inside `~/.dataset/sets/`. You can change
the directory by specifying the `directory` variable in the configuration. To
save even more time, use the caching decorator with `method=False` and apply
it to your whole pipeline. It hashes function arguments in order to determine
if a cache is valid.
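The argument-hashing idea behind such a decorator can be sketched with the standard library. The decorator name, hash choice and file layout below are assumptions for illustration, not the library's actual code:

```python
import functools
import hashlib
import os
import pickle
import tempfile

def disk_cache_sketch(directory=None):
    """Cache a function's return value on disk, keyed by a hash of its arguments."""
    directory = directory or os.path.join(tempfile.gettempdir(), 'cache')

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Hash the arguments to decide whether a cached result is valid.
            key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            digest = hashlib.md5(key).hexdigest()
            path = os.path.join(directory, digest + '.pickle')
            if os.path.exists(path):
                # Same arguments as before: load the cached result.
                with open(path, 'rb') as file_:
                    return pickle.load(file_)
            result = func(*args, **kwargs)
            os.makedirs(directory, exist_ok=True)
            with open(path, 'wb') as file_:
                pickle.dump(result, file_)
            return result
        return wrapper
    return decorator
```

Calling the decorated pipeline twice with the same arguments runs it only once; changing any argument produces a new hash and therefore a fresh computation.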
The configuration is a YAML file named `.setsrc`. Sets looks for the file in
the current working directory and the user's home directory.
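The lookup order can be sketched as follows; the function name is hypothetical, and only the two locations named above are searched:

```python
import os

def find_config_sketch(filename='.setsrc'):
    """Return the first matching config file, or None if none exists."""
    candidates = [
        # The current working directory takes precedence.
        os.path.join(os.getcwd(), filename),
        # Fall back to the user's home directory.
        os.path.join(os.path.expanduser('~'), filename),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

A project-local `.setsrc` thus overrides a user-wide one.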
Parsers for new datasets are welcome.
Released under the MIT license.