This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Enable training fast rcnn with multiple GPUs#2358

Merged
tqchen merged 4 commits into apache:master from Seanlinx:master
Jun 8, 2016

Conversation

@Seanlinx
Contributor

@Seanlinx Seanlinx commented Jun 7, 2016

No description provided.

@piiswrong
Contributor

Hi, thank you a lot for the effort. The current version is fine.

But we are replacing mx.model with mx.module.Module. It would be great if you could make it use the new module instead.
Module is easier to use and has more functionality. It shouldn't be hard to move.

Please let me know if you want to do this.

Thanks.

cc @precedenceguo Please verify correctness

@Seanlinx
Contributor Author

Seanlinx commented Jun 7, 2016

Since there are not enough examples of variable-shape input in the Module API, we think we should stick to the old model API for now. The detection example rebinds in each iteration, so it is not intuitive to use the Module API.
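For context, here is a minimal sketch (plain Python with a stand-in object, not the real MXNet executor API) of the trade-off being discussed: a shape-keyed executor cache works well when input shapes repeat, as in bucketed RNNs, but with arbitrary detection image sizes nearly every batch triggers a fresh bind.

```python
# Hypothetical stand-in for shape-keyed executor caching.
cache = {}

def get_executor(shape):
    """Return a cached 'executor' for this shape, binding a new one if needed."""
    if shape not in cache:
        cache[shape] = object()  # stand-in for an expensive bind()
    return cache[shape]

# An RNN with a few bucketed sequence lengths reuses executors...
for length in [10, 20, 10, 20, 10]:
    get_executor((length,))
assert len(cache) == 2

# ...but detection images rarely repeat a shape exactly, so the
# cache keeps growing and almost every batch rebinds.
for shape in [(600, 800), (601, 801), (750, 990)]:
    get_executor(shape)
assert len(cache) == 5
```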

@winstywang
Contributor

The results are verified, and the acceleration ratio is provided in the README.

@piiswrong
Contributor

Variable shape is implemented in the bucketing module.

@Seanlinx
Contributor Author

Seanlinx commented Jun 7, 2016

The idea of bucketing seems to me a way of padding inputs to predefined bucket shapes or symbols in order to share memory and save time. However, in the fully convolutional network case, it is not clear how to choose buckets for variable image sizes. A naive way would be to pick some scales and sizes as buckets, but padding the data with a lot of zeros may result in inferior performance.
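To make the padding concern concrete, here is a hedged sketch (plain Python, with made-up image and bucket sizes) of bucketing variable-size images: each image is padded up to the smallest bucket that fits it, and the zero-filled fraction can easily reach half the tensor.

```python
def pick_bucket(h, w, buckets):
    """Return the smallest (bh, bw) bucket that fits an (h, w) image.

    `buckets` must be sorted from smallest to largest area.
    """
    for bh, bw in buckets:
        if bh >= h and bw >= w:
            return bh, bw
    raise ValueError("no bucket fits image of size (%d, %d)" % (h, w))

def padding_fraction(h, w, buckets):
    """Fraction of the padded tensor that is zero-filled."""
    bh, bw = pick_bucket(h, w, buckets)
    return 1.0 - (h * w) / float(bh * bw)

# Hypothetical bucket shapes, sorted by area.
BUCKETS = [(600, 800), (600, 1000), (800, 1200)]

# An exact fit wastes nothing...
assert pick_bucket(600, 800, BUCKETS) == (600, 800)

# ...but a 601x801 image barely misses the first two buckets,
# lands in (800, 1200), and wastes roughly half the tensor on zeros.
frac = padding_fraction(601, 801, BUCKETS)
assert frac > 0.49
```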

@piiswrong
Contributor

piiswrong commented Jun 8, 2016

Good for merging after the tests finish. @tqchen

@tqchen tqchen merged commit 004c238 into apache:master Jun 8, 2016
@xlvector
Contributor

I find that the following change in executor_manager.py

```python
285 n = int(texec.outputs[0].shape[0] / (islice.stop - islice.start))
286 new_slice = slice(islice.start * n, islice.stop * n)
287 labels_slice = [label[new_slice] for label in labels]
```

causes my program to fail with the following error:

```
[23:02:50] /dl/mxnet/dmlc-core/include/dmlc/logging.h:235: [23:02:50] include/mxnet/ndarray.h:230: Check failed: (shape_[0]) >= (end) Slice end index out of range
Traceback (most recent call last):
  File "toy_ctc.py", line 162, in <module>
    batch_end_callback=mx.callback.Speedometer(BATCH_SIZE, 50),)
  File "../../python/mxnet/model.py", line 796, in fit
    sym_gen=self.sym_gen)
  File "../../python/mxnet/model.py", line 253, in _train_multi_device
    executor_manager.update_metric(eval_metric, data_batch.label)
  File "../../python/mxnet/executor_manager.py", line 444, in update_metric
    self.curr_execgrp.update_metric(metric, labels)
  File "../../python/mxnet/executor_manager.py", line 287, in update_metric
    labels_slice = [label[new_slice] for label in labels]
  File "../../python/mxnet/ndarray.py", line 220, in __getitem__
    return self._slice(in_slice.start, in_slice.stop)
  File "../../python/mxnet/ndarray.py", line 260, in _slice
    self.handle, start, stop, ctypes.byref(handle)))
  File "../../python/mxnet/base.py", line 77, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:02:50] include/mxnet/ndarray.h:230: Check failed: (shape_[0]) >= (end) Slice end index out of range
```

@xlvector
Contributor

I know, this is because the change assumes that the number of output rows in the last layer is the same as the label size. This is true for SoftmaxOutput, but is wrong for a CTC layer.

In a CTC layer, the label size is always smaller than the number of output rows in the last layer.
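A minimal sketch of the failing arithmetic (plain Python with hypothetical sizes, mimicking the three changed lines rather than calling MXNet): the code derives a per-example output multiplier `n` from the output shape and scales the label slice by it, which overruns the label array whenever the last layer emits more rows than there are labels, as with CTC.

```python
def scaled_label_slice(output_rows, islice):
    """Mimic the executor_manager change: scale the device slice by the
    number of output rows per example."""
    n = output_rows // (islice.stop - islice.start)
    return slice(islice.start * n, islice.stop * n)

batch = 32                    # examples on this device
islice = slice(0, batch)

# SoftmaxOutput: one output row per example, so the slice is unchanged.
soft = scaled_label_slice(batch, islice)
assert (soft.start, soft.stop) == (0, 32)

# CTC-style layer: seq_len output rows per example, so the scaled
# slice end (32 * 25 = 800) far exceeds the 32-row label array.
seq_len = 25
ctc = scaled_label_slice(batch * seq_len, islice)
labels = list(range(batch))   # one label row per example
assert ctc.stop > len(labels)  # triggers "Slice end index out of range"
```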

@xlvector
Contributor

xlvector commented Jun 12, 2016

I made a tricky change in pull request #2326:

```python
data_shapes = {}
batch_size = 0
for k, v in train_data.provide_data:
    if k == 'data':
```
This is not good.

In the previous implementation, providing 'data' in provide_data was not necessary. This change means we must have a 'data' element in provide_data. In one of my tasks, I do not provide 'data'.

I think there are better ways to calculate batch_size.
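One such alternative (a sketch, not what either PR actually does): read the batch dimension from whichever entry comes first in provide_data, instead of requiring a key literally named 'data'.

```python
def infer_batch_size(provide_data):
    """Read the batch dimension from the first (name, shape) pair.

    Assumes every input shares the same leading batch dimension,
    which holds for the usual DataIter layouts.
    """
    if not provide_data:
        raise ValueError("provide_data is empty")
    _, shape = provide_data[0]
    return shape[0]

# Works whether or not an input is literally called 'data'.
assert infer_batch_size([('image', (16, 3, 224, 224))]) == 16
assert infer_batch_size([('data', (8, 100))]) == 8
```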


@Seanlinx Could you revert this and find a better fix for rcnn?

@xlvector
Contributor

I think this pull request needs more testing. It will break example/rnn/lstm_bucket.py.
And it forces us to provide a 'data' element in provide_data.

Seanlinx added a commit to Seanlinx/mxnet that referenced this pull request Jun 15, 2016
tqchen pushed a commit that referenced this pull request Jun 15, 2016
