Sayan Sinha edited this page Mar 16, 2017 · 8 revisions

Documentation

This project involves building a “BinDeps.jl for data”, intended to make the creation of data-providing packages easier. The package is meant to make it easy to download and unzip large files and to check their integrity in a cross-platform way. Facilities for downloading specific datasets can then be built on top of it. It was developed with Julia Version 0.5.1-pre+55 (2017-02-13 09:11 UTC), Commit 8d4ef37, on Linux 4.4.0.

Requirements

The package was tested with the following versions:

  • Julia 0.5
  • JSON 0.8.3
  • BinDeps 0.4.7
  • GZip 0.2.20

Installation

julia> Pkg.clone("https://github.com/americast/DataDeps.jl.git")
julia> Pkg.build("DataDeps")

Using the package

julia> using DataDeps

The default directory for storing datasets is the datasets directory inside the DataDeps package directory (DataDeps/datasets).

Features:

  • Add dataset

Use "DataDeps.add(<url>, <name>)" to download and unpack the dataset available at <url> as <name>. The dataset is saved into datasets/<name> inside the DataDeps package directory.
Example usage:

julia> DataDeps.add("https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz", "test")

Alternatively, "DataDeps.add(<url>, <name>, <dir>)" may be used, where <dir> specifies the directory in which the dataset is to be saved.

  • List downloaded datasets

julia> DataDeps.list()

Alternatively, "DataDeps.list(<dir>)" may be used, where <dir> is the directory in which to look for datasets.

  • Segregate data from a dataset

WORK IN PROGRESS

This feature separates the data into input and output and stores them in two arrays, which may be used for training, validation or testing.
First, the type of file needs to be specified. To do so, call the function setseries with the following parameters:

  1. Name of the file. If the files form a series, the digit that identifies each file in the series is replaced by an asterisk (*).
  2. The way the files are to be read. The default value is "single"; the other option is "series".
  3. Starting integer of the series. This is to be given only if argument 2 is "series".
  4. Ending integer of the series. This is to be given only if argument 2 is "series".

Example usage:

julia> DataDeps.setseries("data_batch_*.bin", "series", 1, 5)
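To make the meaning of the series arguments concrete, the following is a minimal sketch of which file names the pattern above stands for. The helper expand_series is hypothetical and is not part of DataDeps; how setseries performs the substitution internally is not described here, so this only illustrates the naming convention.

```julia
# Hypothetical helper, NOT part of DataDeps: list the file names that a
# series pattern stands for by putting each integer in place of the asterisk.
function expand_series(pattern, lo, hi)
    parts = split(pattern, "*")   # text before and after the asterisk
    # Rejoin the parts with the series number in between, once per integer.
    return [join(parts, string(i)) for i in lo:hi]
end

files = expand_series("data_batch_*.bin", 1, 5)
for f in files
    println(f)
end
```

With the arguments from the example, this lists data_batch_1.bin through data_batch_5.bin. The sketch assumes the pattern contains a single asterisk; with several asterisks, every one of them would receive the same number.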

Next, the data, segregated into input and output, is returned in two separate variables.

julia> x,y = DataDeps.traindata("test")

Alternatively, "DataDeps.traindata(<name>, <type>, <dir>)" may be used, where <type> is the type of the input data (by default "matrix") and <dir> is the directory in which to look for datasets.