General purpose data loaders #133

Open
etrain opened this issue May 22, 2015 · 5 comments

Comments

@etrain
Contributor

etrain commented May 22, 2015

We have included a number of data loaders tailored to standard academic datasets with KeystoneML, but it would be good to include general purpose WAV and image loaders in the project as well.

In particular, much of the work we did with ImageNet involved working around bugs in Java image libraries and some of that work can be repurposed.

@tomerk
Contributor

tomerk commented May 22, 2015

I think there are a few common patterns we've seen so far:

  • directory structure (each item on its own line so we can load it with sc.textFile, each item a separate file in a directory, everything in a single tar.gz file, etc.)
  • data item storage format (string document as raw text, vector as csv, image as csv, image as binary, wav as binary, etc.)
  • label storage (label is the containing folder name, label next to the item on the same line, label attached to an item uid in a separate "labels" file, etc.)
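
The three label-storage conventions in the last bullet can each be reduced to a small function that returns a label for an item. A minimal Python sketch, assuming hypothetical helper names (`label_from_folder`, `label_from_line`, `label_from_index`) and a tab-separated layout; none of these are part of KeystoneML:

```python
import posixpath


def label_from_folder(path):
    """Label is the name of the directory that contains the item."""
    return posixpath.basename(posixpath.dirname(path))


def label_from_line(line, sep="\t"):
    """Label sits next to the item on the same line: '<label><sep><item>'."""
    label, item = line.rstrip("\n").split(sep, 1)
    return label, item


def label_from_index(uid, labels_file_lines, sep="\t"):
    """Label is looked up by item uid in a separate 'labels' file."""
    # Each line of the labels file is assumed to be '<uid><sep><label>'.
    index = dict(l.rstrip("\n").split(sep, 1) for l in labels_file_lines)
    return index[uid]
```

A loader that accepts any one of these as a parameter would allow mixing directory layouts with label conventions, at the cost of a slightly wider API.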

@tomerk
Contributor

tomerk commented May 22, 2015

It's probably best either to encourage storing data in one particular way and support a single fast pattern, or to somehow allow mixing and matching among these. That said, I have found that sc.wholeTextFiles seems to be slower than sc.textFile, especially when reading from S3, where there was a difference of multiple orders of magnitude for the newsgroups data, which is only several megabytes.
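
One way to settle on the line-per-item pattern is to pack a directory of small documents into a single file up front, so the data can later be read line by line rather than file by file. A rough sketch using only the Python standard library; the function name `pack_directory`, the `path<TAB>text` layout, and the whitespace-flattening step are assumptions for illustration, not something KeystoneML provides:

```python
import os


def pack_directory(src_dir, out_path):
    """Write every file under src_dir as one 'relative/path<TAB>text' line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # deterministic traversal order
            for name in sorted(files):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir).replace(os.sep, "/")
                with open(full, encoding="utf-8", errors="replace") as f:
                    # Collapse newlines and tabs so each document stays on one line.
                    text = " ".join(f.read().split())
                out.write(rel + "\t" + text + "\n")
                count += 1
    return count
```

The output file should live outside `src_dir`, otherwise the walk would pick it up as an input. Flattening whitespace loses the original line structure, which is acceptable for bag-of-words style features but not for every pipeline.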

@agibsonccc

This looks like a great start. I took the following approach:
https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark/src/main/java/org/deeplearning4j/spark/util/MLLibUtil.java#L112

I assumed a directory structure like the one mentioned above. In deep learning I typically see images as well as text in directories. Most unstructured data takes some form of hierarchical storage layout. I'm assuming you guys could use that to your advantage.

The problem you're going to run into (for desirable patterns) is time series data. For example, when working with video encoders (a big part of the problems I typically solve), there are several ways you can vectorize an image or audio file. It's usually desirable to have the data in frames.
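
Framing usually means slicing a long sample sequence into fixed-length, possibly overlapping windows. A minimal sketch in plain Python; the name `frame_signal` and the `frame_len`/`hop` parameters are illustrative assumptions, and trailing samples that do not fill a whole frame are dropped:

```python
def frame_signal(samples, frame_len, hop):
    """Split a sequence into frames of frame_len samples, advancing by hop."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    # Index of the last position where a full frame still fits.
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, last_start + 1, hop)]
```

With `hop` smaller than `frame_len` the frames overlap; with `hop` equal to `frame_len` they tile the signal without overlap.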

I'm not sure how far you guys would go with this, but it'd be great to see this done right (and in a more integrated fashion).

I personally have to target more platforms than servers (phones are a big one for us) but I'd be happy to share lessons learned or contrib in some way.

@etrain
Contributor Author

etrain commented May 22, 2015

In the ImageLoaderUtils class we have a function that takes in a filename and produces a label which is dataset specific (e.g. VOC and ImageNet have different labelsMap functions).

Right now this is built for reading from .tar files with hierarchical layouts embedded in them, but I think we could generalize this to layouts on HDFS. One thing we want to discourage, however, is having lots of tiny files on HDFS, because they really hurt HDFS performance. So the current pattern (one tar file per class, or any other sensible way to get a relatively small number of big files) should be encouraged.
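
The tar-based pattern can be sketched with Python's standard `tarfile` module: stream the members of one archive and derive each label from the member's path. The function name `iter_labeled_items` and the default convention that the label is the immediate parent directory are assumptions for illustration; this is not the ImageLoaderUtils code, which uses dataset-specific label maps.

```python
import posixpath
import tarfile


def iter_labeled_items(fileobj, label_fn=None):
    """Yield (label, name, bytes) for each regular file in a tar archive."""
    if label_fn is None:
        # Assumed default: the label is the immediate parent directory.
        label_fn = lambda name: posixpath.basename(posixpath.dirname(name))
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            data = tar.extractfile(member).read()
            yield label_fn(member.name), member.name, data
```

Passing a different `label_fn` per dataset keeps the archive-reading logic shared while the filename-to-label mapping stays dataset specific.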

Re: time series data/performance - this is probably a separate issue, but we've talked a lot about support for hypercubes as a first-class data structure, both as a local data structure and (eventually) a distributed data structure. Image is an instantiation of this, but the APIs could be much richer.

@shivaram
Contributor

cc @thisisdhaas @sjyk who are also interested in general purpose data loaders for data that comes from SampleClean
