This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Enable training fast rcnn with multiple GPUs#2358

Merged
tqchen merged 4 commits into apache:master from Seanlinx:master
Jun 8, 2016

Conversation

@Seanlinx
Contributor

@Seanlinx Seanlinx commented Jun 7, 2016

No description provided.

@piiswrong
Contributor

Hi, thank you a lot for the effort. The current version is fine.

But we are replacing mx.model with mx.module.Module. It would be great if you could make it use the new module instead.
Module is easier to use and has more functionality. It shouldn't be hard to move.

Please let me know if you want to do this.

Thanks.

cc @precedenceguo Please verify correctness

@Seanlinx
Contributor Author

Seanlinx commented Jun 7, 2016

Since there are not enough examples of variable-shape input in the Module API, we think we should stick to the old model API for now. The detection example rebinds in each iteration, so it is not intuitive to use the Module API.
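For context, here is a minimal sketch (plain Python with a stand-in object, not the real MXNet executor API) of the trade-off being discussed: a shape-keyed executor cache works well when input shapes repeat, as in bucketed RNNs, but with arbitrary detection image sizes nearly every batch triggers a fresh bind.

```python
# Hypothetical stand-in for shape-keyed executor caching.
cache = {}

def get_executor(shape):
    """Return a cached 'executor' for this shape, binding a new one if needed."""
    if shape not in cache:
        cache[shape] = object()  # stand-in for an expensive bind()
    return cache[shape]

# An RNN with a few bucketed sequence lengths reuses executors...
for length in [10, 20, 10, 20, 10]:
    get_executor((length,))
assert len(cache) == 2

# ...but detection images rarely repeat a shape exactly, so the
# cache keeps growing and almost every batch rebinds.
for shape in [(600, 800), (601, 801), (750, 990)]:
    get_executor(shape)
assert len(cache) == 5
```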

@winstywang
Contributor

The results are verified, and the acceleration ratio is provided in the README.

@piiswrong
Contributor

Variable shape is implemented in the bucketing module.

@Seanlinx
Contributor Author

Seanlinx commented Jun 7, 2016

The idea of bucketing seems to me a way of padding inputs to predefined bucket shapes or symbols in order to share memory and save time. However, in the fully convolutional network case, it is not clear how to choose buckets for variable image sizes. A naive way would be to pick some scales and sizes as buckets, but padding the data with a lot of zeros may result in inferior performance.
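To make the padding concern concrete, here is a hedged sketch (plain Python, with made-up image and bucket sizes) of bucketing variable-size images: each image is padded up to the smallest bucket that fits it, and the zero-filled fraction can easily reach half the tensor.

```python
def pick_bucket(h, w, buckets):
    """Return the smallest (bh, bw) bucket that fits an (h, w) image.

    `buckets` must be sorted from smallest to largest area.
    """
    for bh, bw in buckets:
        if bh >= h and bw >= w:
            return bh, bw
    raise ValueError("no bucket fits image of size (%d, %d)" % (h, w))

def padding_fraction(h, w, buckets):
    """Fraction of the padded tensor that is zero-filled."""
    bh, bw = pick_bucket(h, w, buckets)
    return 1.0 - (h * w) / float(bh * bw)

# Hypothetical bucket shapes, sorted by area.
BUCKETS = [(600, 800), (600, 1000), (800, 1200)]

# An exact fit wastes nothing...
assert pick_bucket(600, 800, BUCKETS) == (600, 800)

# ...but a 601x801 image barely misses the first two buckets,
# lands in (800, 1200), and wastes roughly half the tensor on zeros.
frac = padding_fraction(601, 801, BUCKETS)
assert frac > 0.49
```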

@piiswrong
Contributor

piiswrong commented Jun 8, 2016

Good for merging after the tests finish. @tqchen

@tqchen tqchen merged commit 004c238 into apache:master Jun 8, 2016
@xlvector
Contributor

I find that the following change in executor_manager.py

```python
285 n = int(texec.outputs[0].shape[0] / (islice.stop - islice.start))
286 new_slice = slice(islice.start * n, islice.stop * n)
287 labels_slice = [label[new_slice] for label in labels]
```

causes my program to fail with the following error:

```
[23:02:50] /dl/mxnet/dmlc-core/include/dmlc/logging.h:235: [23:02:50] include/mxnet/ndarray.h:230: Check failed: (shape_[0]) >= (end) Slice end index out of range
Traceback (most recent call last):
  File "toy_ctc.py", line 162, in <module>
    batch_end_callback=mx.callback.Speedometer(BATCH_SIZE, 50),)
  File "../../python/mxnet/model.py", line 796, in fit
    sym_gen=self.sym_gen)
  File "../../python/mxnet/model.py", line 253, in _train_multi_device
    executor_manager.update_metric(eval_metric, data_batch.label)
  File "../../python/mxnet/executor_manager.py", line 444, in update_metric
    self.curr_execgrp.update_metric(metric, labels)
  File "../../python/mxnet/executor_manager.py", line 287, in update_metric
    labels_slice = [label[new_slice] for label in labels]
  File "../../python/mxnet/ndarray.py", line 220, in __getitem__
    return self._slice(in_slice.start, in_slice.stop)
  File "../../python/mxnet/ndarray.py", line 260, in _slice
    self.handle, start, stop, ctypes.byref(handle)))
  File "../../python/mxnet/base.py", line 77, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:02:50] include/mxnet/ndarray.h:230: Check failed: (shape_[0]) >= (end) Slice end index out of range
```

@xlvector
Contributor

I know, this is because the change assumes that the number of output rows in the last layer is the same as the label size. This is true for SoftmaxOutput, but is wrong for a CTC layer.

In a CTC layer, the label size is always smaller than the number of output rows in the last layer.
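A minimal sketch of the failing arithmetic (plain Python with hypothetical sizes, mimicking the three changed lines rather than calling MXNet): the code derives a per-example output multiplier `n` from the output shape and scales the label slice by it, which overruns the label array whenever the last layer emits more rows than there are labels, as with CTC.

```python
def scaled_label_slice(output_rows, islice):
    """Mimic the executor_manager change: scale the device slice by the
    number of output rows per example."""
    n = output_rows // (islice.stop - islice.start)
    return slice(islice.start * n, islice.stop * n)

batch = 32                    # examples on this device
islice = slice(0, batch)

# SoftmaxOutput: one output row per example, so the slice is unchanged.
soft = scaled_label_slice(batch, islice)
assert (soft.start, soft.stop) == (0, 32)

# CTC-style layer: seq_len output rows per example, so the scaled
# slice end (32 * 25 = 800) far exceeds the 32-row label array.
seq_len = 25
ctc = scaled_label_slice(batch * seq_len, islice)
labels = list(range(batch))   # one label row per example
assert ctc.stop > len(labels)  # triggers "Slice end index out of range"
```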

@xlvector
Contributor

xlvector commented Jun 12, 2016

I made a tricky change in pull request #2326:

```python
data_shapes = {}
batch_size = 0
for k, v in train_data.provide_data:
    if k == 'data':
```
This is not good.

In the previous implementation, providing 'data' in provide_data was not necessary. This change means we must have a 'data' element in provide_data. In one of my tasks, I do not provide 'data'.

I think there are better ways to calculate batch_size.
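One such alternative (a sketch, not what either PR actually does): read the batch dimension from whichever entry comes first in provide_data, instead of requiring a key literally named 'data'.

```python
def infer_batch_size(provide_data):
    """Read the batch dimension from the first (name, shape) pair.

    Assumes every input shares the same leading batch dimension,
    which holds for the usual DataIter layouts.
    """
    if not provide_data:
        raise ValueError("provide_data is empty")
    _, shape = provide_data[0]
    return shape[0]

# Works whether or not an input is literally called 'data'.
assert infer_batch_size([('image', (16, 3, 224, 224))]) == 16
assert infer_batch_size([('data', (8, 100))]) == 8
```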


@Seanlinx Could you revert this and find a better fix for rcnn?

@xlvector
Contributor

I think this pull request needs more testing. It will break example/rnn/lstm_bucket.py.
And it forces us to provide a 'data' element in provide_data.

Seanlinx added a commit to Seanlinx/mxnet that referenced this pull request Jun 15, 2016
tqchen pushed a commit that referenced this pull request Jun 15, 2016
