[WIP] Indexed recordIO support in InputSplit #289
| unsigned num_parts, | ||
| const char *type, | ||
| const bool shuffle = false, | ||
| const int seed = 0, |
Not sure defaulting the seed to 0 is a good idea.
The seed is added to a magic number inside IndexedRecordioInputSplit. This is actually the same as in the image iterator in MXNet (the seed defaults to 0 + magic).
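A minimal sketch of the "seed + magic" idea described above, not the code from this PR. The constant `kRandMagic` and the helper `ShuffledOrder` are hypothetical names, and the value 111 is a placeholder; the point is only that a default seed of 0 still yields a fixed, reproducible shuffle.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical magic constant; the real value used by the splitter is not shown in this thread.
constexpr int kRandMagic = 111;

// Returns a permutation of [0, num_records) driven by (magic + seed).
// With the default seed of 0 the generator is seeded with the magic number alone.
std::vector<size_t> ShuffledOrder(size_t num_records, int seed = 0) {
  std::vector<size_t> order(num_records);
  std::iota(order.begin(), order.end(), size_t{0});
  std::mt19937 rng(static_cast<unsigned>(kRandMagic + seed));
  std::shuffle(order.begin(), order.end(), rng);
  return order;
}
```

Calling `ShuffledOrder(n)` twice with the same seed gives the same order within one build, which is the property that makes a default seed usable.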
| const char *type, | ||
| const bool shuffle = false, | ||
| const int seed = 0, | ||
| const size_t batch_size = 256); |
batch_size shouldn't have a default value
This is only a hint (like HintChunkSize), and only IndexedRecordioInputSplit actually uses it. Readers can return any number of samples anyway, so consumers of the read data need to be prepared to handle that.
| } | ||
| virtual void HintChunkSize(size_t chunk_size) { | ||
| buffer_size_ = std::max(chunk_size / sizeof(size_t), buffer_size_); | ||
| buffer_size_ = std::max(chunk_size / sizeof(uint32_t), buffer_size_); |
Even though recordIO is aligned to 4 bytes, the reader was reading into the chunk's data vector, whose element type was size_t (so 8-byte aligned). This is problematic because there is no guarantee that a recordIO entry is aligned to 8 bytes. That is why I changed the chunk's internal data vector to uint32_t.
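To illustrate the alignment argument, here is a small sketch with hypothetical helper names (`IsWordAligned`, `WordIndex`); it is not taken from the PR. It shows that a record offset that is a multiple of 4 always maps to a whole element of a `uint32_t` buffer, but not necessarily to a whole element of a buffer of 8-byte words.

```cpp
#include <cstddef>
#include <cstdint>

// True when byte_offset falls exactly on a Word boundary.
template <typename Word>
bool IsWordAligned(size_t byte_offset) {
  return byte_offset % sizeof(Word) == 0;
}

// Element index in a buffer of Word that contains byte_offset.
// Only meaningful as a record start when IsWordAligned<Word>(byte_offset) holds.
template <typename Word>
size_t WordIndex(size_t byte_offset) {
  return byte_offset / sizeof(Word);
}
```

For example, a record starting at byte 12 is element 3 of a `uint32_t` buffer, while in a `uint64_t` buffer it would start in the middle of element 1.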
Can you comment a bit on the index file format and mechanism, e.g. does every shard need to load the entire index file?
It is a very simple file format (one line per entry with two integers: entry id and offset), introduced in apache/mxnet#2887 a year ago but not actually used anywhere (although the im2rec.py script does generate the index file alongside the recordIO file).
| line_split.o: src/io/line_split.cc | ||
| recordio_split.o: src/io/recordio_split.cc | ||
| indexed_recordio_split.o: src/io/indexed_recordio_split.cc |
May need to add this to the amalgamation build as well.
Not complete yet, don't merge