Enable training fast rcnn with multiple GPUs#2358
Conversation
---
Hi, thank you a lot for the effort. The current version is fine, but we are replacing mx.model with mx.module.Module. It would be great if you could make it use the new module instead. Please let me know if you want to do this. Thanks. cc @precedenceguo Please verify correctness.
---
Since there are not enough examples of variable-shape input in the Module API, we think we should stick to the old model API for now. The detection example rebinds in each iteration, so it is not intuitive to use the Module API.
---
The results are verified, and the acceleration ratio is provided in the README.
---
Variable-shape input is implemented in the bucketing module.
---
The idea of bucketing seems to me to be a way of padding inputs to predefined bucket shapes or symbols in order to share memory and save time. However, in the fully convolutional network case, it is unclear how to choose "buckets" for variable image sizes. A naive approach would pick some scales and sizes as buckets, but padding the data with many zeros may result in inferior performance.
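The padding concern above can be made concrete with a small sketch. The bucket shapes below are hypothetical, chosen only for illustration:

```python
# Hypothetical (height, width) buckets for variable-size detection images.
BUCKETS = [(600, 800), (800, 1000), (1000, 1200)]

def pick_bucket(h, w):
    """Return the smallest bucket that an (h, w) image fits into, or None."""
    for bh, bw in BUCKETS:
        if h <= bh and w <= bw:
            return (bh, bw)
    return None

def zero_fill_fraction(h, w):
    """Fraction of the padded tensor that would be zero filler."""
    bh, bw = pick_bucket(h, w)
    return 1.0 - (h * w) / (bh * bw)

# A 620x810 image lands in the 800x1000 bucket and becomes ~37% zeros,
# which is the kind of waste the comment above is worried about.
frac = zero_fill_fraction(620, 810)
```

Shrinking that waste would require many finer-grained buckets, which erodes the memory-sharing benefit bucketing is meant to provide.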
---
Good to merge after tests finish. @tqchen
---
I found that the following change in executor_manager.py line 285, `n = int(texec.outputs[0].shape[0] / (islice.stop - islice.start))`, causes my program to fail with this error: `[23:02:50] /dl/mxnet/dmlc-core/include/dmlc/logging.h:235: [23:02:50] include/mxnet/ndarray.h:230: Check failed: (shape_[0]) >= (end) Slice end index out of range`
---
I know, this is because the change assumes that the number of outputs in the last layer equals the label size. This is true for SoftmaxOutput, but wrong for a CTC layer: in a CTC layer, the label size is always smaller than the number of outputs in the last layer.
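A minimal sketch of why that assumption breaks; the shapes below are illustrative, not taken from a real network:

```python
def infer_n(output_rows, slice_start, slice_stop):
    # Mirrors the quoted executor_manager.py logic: divide the output's
    # leading dimension by the length of the label slice.
    return output_rows // (slice_stop - slice_start)

# SoftmaxOutput: one output row per label row, so n == 1 and later
# label slicing stays in range.
n_softmax = infer_n(32, 0, 32)

# CTC: the output has one row per time step (say 80 steps for a batch
# of 32) while the labels still have only 32 rows, so n is grossly
# overestimated and a slice sized by n runs past the end of the label
# array, triggering the "Slice end index out of range" check failure
# quoted above.
n_ctc = infer_n(32 * 80, 0, 32)
```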
---
I made a workaround change in pull request #2326.
```python
data_shapes = {}
batch_size = 0
for k, v in train_data.provide_data:
    if k == 'data':
```
This is not good. In the previous implementation, providing 'data' in provide_data was not required; this change means we must have a 'data' element in provide_data. In one of my tasks, I do not provide 'data'. I think there are better ways to calculate batch_size.
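One possible alternative, sketched here with illustrative input names, is to take the batch size from the leading dimension of the first input rather than requiring a key literally named 'data':

```python
def infer_batch_size(provide_data):
    """Derive batch_size from the first (name, shape) pair, mirroring
    the structure of MXNet's provide_data, without assuming that any
    input is named 'data'."""
    if not provide_data:
        raise ValueError("provide_data is empty")
    _name, shape = provide_data[0]
    return shape[0]

# Works for an iterator whose inputs are not named 'data'.
bs = infer_batch_size([('sequence', (16, 35)), ('init_state', (16, 512))])
```

This assumes all inputs share the same leading batch dimension, which holds for typical DataIter implementations but would itself need checking in a real fix.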
@Seanlinx Could you revert this and find a better fix for rcnn?
---
I think this pull request needs more testing. It will break example/rnn/lstm_bucket.py.
This reverts commit 004c238.
No description provided.