[WIP] Indexed recordIO support in InputSplit by ptrendx · Pull Request #289 · dmlc/dmlc-core

ptrendx · 2017-07-26T19:35:13Z

Not complete yet, don't merge

piiswrong · 2017-08-02T17:02:22Z

piiswrong · 2017-08-02T23:47:08Z

include/dmlc/io.h

+                            unsigned num_parts,
+                            const char *type,
+                            const bool shuffle = false,
+                            const int seed = 0,


not sure defaulting seed to 0 is a good idea.

The seed is added to a magic number inside IndexedRecordioInputSplit. This is actually the same as in the image iterator in MXNet (seed defaults to 0 + magic).
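
To make the "seed + magic" point concrete, here is a minimal sketch, assuming the splitter simply offsets the user-supplied seed by a fixed constant before seeding its generator. The name `kRandMagic`, its value, and the helper `ShuffledOrder` are placeholders for illustration; the actual constant is not shown in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical constant: stands in for the magic number mentioned above.
constexpr unsigned kRandMagic = 111;

// Returns a permutation of [0, n) that is fully determined by `seed`.
// With the default seed of 0 the generator is seeded with the magic alone.
std::vector<size_t> ShuffledOrder(size_t n, int seed = 0) {
  std::vector<size_t> order(n);
  std::iota(order.begin(), order.end(), size_t{0});
  std::mt19937 rnd(kRandMagic + static_cast<unsigned>(seed));
  std::shuffle(order.begin(), order.end(), rnd);
  return order;
}
```

Under this assumption a default seed of 0 still gives a reproducible, non-trivial shuffle, and callers who want a different order only need to pass a different seed.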

piiswrong · 2017-08-02T23:47:19Z

include/dmlc/io.h

+                            const char *type,
+                            const bool shuffle = false,
+                            const int seed = 0,
+                            const size_t batch_size = 256);


batch_size shouldn't have a default value

This is only a hint (like HintChunkSize), and only IndexedRecordioInputSplit actually uses it. Readers can return any number of samples anyway, so consumers of the read data need to be prepared to handle that.
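
A rough sketch of what "be prepared to handle that" can look like on the consumer side: regroup whatever the reader returns into fixed-size batches. The `ReadFn` interface and the `Rebatch` helper are hypothetical and only stand in for a real InputSplit consumer; `batch_size` is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Record = std::string;
using Chunk = std::vector<Record>;

// Hypothetical reader: fills `out` with some records and returns false once
// the data is exhausted. The record count per call may ignore any hint.
using ReadFn = std::function<bool(Chunk*)>;

// Regroups the reader's output into batches of exactly `batch_size` records;
// only the final batch may be smaller.
std::vector<Chunk> Rebatch(const ReadFn& read, size_t batch_size) {
  std::vector<Chunk> batches;
  Chunk pending, chunk;
  while (read(&chunk)) {
    for (auto& rec : chunk) {
      pending.push_back(std::move(rec));
      if (pending.size() == batch_size) {
        batches.push_back(std::move(pending));
        pending.clear();  // reset the moved-from vector to a known empty state
      }
    }
    chunk.clear();
  }
  if (!pending.empty()) batches.push_back(std::move(pending));
  return batches;
}
```

When the reader happens to honor the hint, each chunk maps onto exactly one batch and the `pending` buffer never carries records over between reads.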

piiswrong · 2017-08-02T23:54:42Z

src/io/cached_input_split.h

  }
  virtual void HintChunkSize(size_t chunk_size) {
-    buffer_size_ = std::max(chunk_size / sizeof(size_t), buffer_size_);
+    buffer_size_ = std::max(chunk_size / sizeof(uint32_t), buffer_size_);


Even though recordIO is aligned to 4 bytes, the reader was reading into the chunk's data vector, whose element type was size_t (so 8-byte aligned). This is problematic because there is no guarantee that a recordIO entry is aligned to 8 bytes. That's why I changed the chunk's internal data vector to uint32_t.
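
The alignment argument can be checked with a little arithmetic. The sketch below assumes a record layout of an 8-byte header (magic word plus length word) followed by the payload padded to a multiple of 4 bytes; the helper names `RecordBytes` and `WordsFor` are illustrative and not part of dmlc-core.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes occupied by one record, assuming an 8-byte header followed by the
// payload padded up to a multiple of 4 bytes.
constexpr size_t RecordBytes(size_t payload_len) {
  return 8 + ((payload_len + 3) / 4) * 4;
}

// Number of buffer elements of type `Word` needed to hold `chunk_bytes`,
// rounding up to whole elements.
template <typename Word>
constexpr size_t WordsFor(size_t chunk_bytes) {
  return (chunk_bytes + sizeof(Word) - 1) / sizeof(Word);
}
```

Under that assumption a record with a 1-byte payload occupies 12 bytes, so the next record starts at an offset that is a multiple of 4 but not of 8. A buffer of 4-byte words can address that boundary exactly, while a buffer of 8-byte words cannot.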

tqchen · 2017-08-03T02:21:35Z

Can you comment a bit on the index file format and mechanism, e.g. does every shard need to load the entire index file?

ptrendx · 2017-08-03T04:56:51Z

It is a very simple file format (one line per entry with 2 integers: entry id and offset), introduced in apache/mxnet#2887 a year ago but not actually used anywhere (although the im2rec.py script does generate the index file alongside the recordIO file).
Each worker needs the whole index file, but each worker is then responsible only for its own portion of the index.
Shuffle introduces global shuffling, but only inside a worker's part (so there is no need to share seeds between workers etc.). I thought about shuffling groups of a few entries in case performance turned out to be bad, but at least on the systems I tested it was fine as-is. This should really improve the shuffle option in the MXNet recordIO iterator, because right now it shuffles only inside a single chunk, which is often smaller than or comparable to a single batch (so there is effectively no shuffling).
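
The mechanism described above can be sketched roughly as follows: every worker parses the full index, takes its own slice, and shuffles only that slice with a local generator. The helpers `ReadIndex` and `WorkerPart`, and the choice of contiguous equal-sized slices, are assumptions made for illustration; the PR's actual partitioning code is not shown in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <random>
#include <sstream>
#include <utility>
#include <vector>

// One index entry: (record id, byte offset into the recordIO file).
using IndexEntry = std::pair<size_t, size_t>;

// Parses whitespace-separated "<id> <offset>" lines, keeping file order.
std::vector<IndexEntry> ReadIndex(std::istream& in) {
  std::vector<IndexEntry> index;
  size_t id, offset;
  while (in >> id >> offset) index.emplace_back(id, offset);
  return index;
}

// Returns the contiguous slice of `index` owned by worker `rank` out of
// `num_parts` (assumed > 0), optionally shuffled with a worker-local
// generator so that no seed has to be shared between workers.
std::vector<IndexEntry> WorkerPart(const std::vector<IndexEntry>& index,
                                   unsigned rank, unsigned num_parts,
                                   bool shuffle, int seed) {
  const size_t n = index.size();
  const size_t step = (n + num_parts - 1) / num_parts;  // ceil(n / num_parts)
  const size_t begin = std::min(n, step * rank);
  const size_t end = std::min(n, begin + step);
  std::vector<IndexEntry> part(index.begin() + begin, index.begin() + end);
  if (shuffle) {
    std::mt19937 rnd(static_cast<unsigned>(seed));
    std::shuffle(part.begin(), part.end(), rnd);
  }
  return part;
}
```

Because each worker shuffles only the entries it owns, the set of records per worker stays fixed and only the read order within that set changes.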
Motivation for this PR is twofold:

1. I introduced a new, much faster IO pipeline in MXNet ([WIP] New faster version of the RecordIO iterator, apache/mxnet#7152), which unfortunately makes it hard to implement shuffling inside the iterator, so shuffling needs to be done at a lower level.
2. Having InputSplit return precisely a batch's worth of images makes the IO pipeline even faster: the overflow storage is not used, so no unnecessary copies are made, and the cost of OpenMP synchronization between threads is paid only once per batch (in my tests on DGX-1 I went from 6.2k imgs/s to 7k imgs/s when testing just IO on the RN50 pipeline).

piiswrong · 2017-08-11T17:34:06Z

Makefile


 line_split.o: src/io/line_split.cc
 recordio_split.o: src/io/recordio_split.cc
+indexed_recordio_split.o: src/io/indexed_recordio_split.cc


may need to add this to amalgamation

ptrendx added 6 commits July 26, 2017 12:32

First version, no working shuffle yet

9bc2458

Added .vimrc to .gitignore

0aa67ca

Fixes for the non-shuffle version

f149751

Fixes for threaded iterator

7645df5

Shuffle

2dda330

Fixes for shuffle

d7731cf

Fix lint

79f073d

piiswrong reviewed Aug 2, 2017


piiswrong reviewed Aug 11, 2017


piiswrong merged commit 6fd2d23 into dmlc:master Aug 15, 2017

piiswrong merged 7 commits into dmlc:master from ptrendx:indexed_recordio
