Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shuffle option to chainermn.scatter_dataset #92

Merged
merged 10 commits into from Aug 24, 2017
Merged

Conversation

keisukefukuda
Copy link
Member

@keisukefukuda keisukefukuda commented Jun 13, 2017

This PR provides shuffle optional argument to chainermn.scatter_dataset.
Additionally, it adds root option, corresponding to MPI_Scatter's root argument.

@iwiwi iwiwi added the feature label Jul 31, 2017
@shu65 shu65 added this to the v1.0.0 milestone Aug 16, 2017
@iwiwi iwiwi self-requested a review August 16, 2017 05:25
@iwiwi iwiwi self-assigned this Aug 16, 2017
@keisukefukuda keisukefukuda changed the title [WIP] Add shuffle option to chainermn.scatter_dataset Add shuffle option to chainermn.scatter_dataset Aug 18, 2017
@iwiwi
Copy link
Contributor

iwiwi commented Aug 18, 2017

Could you also add shuffle=True to examples?

import warnings


def scatter_dataset(dataset, comm):
def scatter_dataset(dataset, comm, root=0, shuffle=False, seed=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add explanation of new arguments to the docstring

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!


# We cannot use `mpi_comm.scatter`. This is due to MPI4py's bug.
# For large datasets, when using `mpi_comm.scatter`, it causes MemoryError.
if comm.rank == 0:
# import sys
# sys.stderr.write("scatter_dataset(): root={}".format(root))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove unused code 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

if shuffle:
order = numpy.random.RandomState(seed).permutation(n_total_samples)
else:
order = numpy.arange(n_total_samples)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order = None will also work, and it would probably bring (slight) memory/network usage improvement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reflected the comment. Thanks.

@@ -15,6 +16,8 @@ def scatter_dataset(dataset, comm):
dataset: A dataset (e.g., ``list``, ``numpy.ndarray``,
``chainer.datasets.TupleDataset``, ...).
comm: ChainerMN communicator or MPI4py communicator.
shuffle: Shuffle the dataset before being scattered.
root: The root process of the scatter operation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add explanation.

  • I checked Chainer's dataset.py and the sentence of shuffle is improved.
  • Added type specifications to shuffle and root.
  • Added exaplanation of seed
    Thanks!

@iwiwi
Copy link
Contributor

iwiwi commented Aug 23, 2017

Could you also add shuffle=True to ImageNet example?

@keisukefukuda
Copy link
Member Author

I added shuffle=True to the imagenet example.
(Only to train data, not test data. Is it OK?)

@iwiwi
Copy link
Contributor

iwiwi commented Aug 24, 2017

LGTM! I will merge after travis tests are passed.

@iwiwi iwiwi merged commit ede666c into master Aug 24, 2017
@iwiwi iwiwi deleted the add-shuffled-scatter branch August 24, 2017 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants