
MnistDatasetFetcher shuffles on reset causing unexpected behavior #6299

Closed
GuutBoy opened this issue Aug 28, 2018 · 4 comments · Fixed by #6302
Comments

GuutBoy commented Aug 28, 2018

If the shuffle parameter is set on the MnistDatasetFetcher, it will re-shuffle the dataset on each call to reset(). Below is the relevant code from MnistDatasetFetcher:

@Override
public void reset() {
  cursor = 0;
  curr = null;
  if (shuffle)
    MathUtils.shuffleArray(order, rng);
}

This appears to be a bug. In particular, it seems to cause a MnistDataSetIterator constructed with numExamples < 60000 to iterate over a new subset of the data on each call to reset(). The documentation is not super explicit, but this does not seem to be the intended behavior.

An example of this is demonstrated by the code below:

import java.io.IOException;
import org.deeplearning4j.datasets.iterator.impl.MnistDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;

public class MnistDatasetExperiments {

  public static void main(String[] args) throws IOException {
    // args: batchSize=1, numExamples=1, binarize=false, train=false, shuffle=true, rngSeed=0
    MnistDataSetIterator it = new MnistDataSetIterator(1, 1, false, false, true, 0);
    DataSet data1 = it.next();
    it.reset();
    DataSet data2 = it.next();
    if (data1.get(0).getFeatures().equals(data2.get(0).getFeatures())) {
      System.out.println("Success 😃");
    } else {
      System.out.println("Failure 😭");
    }
  }
}

Here we construct an iterator over a single example with a batch size of 1. I would expect this iterator to return the same dataset on the first and second calls to next() (i.e., after the call to reset()).
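For reference, a repeatable reset can be sketched in plain Java by re-seeding the RNG and restoring the base order before each reshuffle. This is only an illustration of the idea, not the actual DL4J fix; the shuffle helper below is a hand-rolled Fisher-Yates stand-in for MathUtils.shuffleArray:

```java
import java.util.Arrays;
import java.util.Random;

public class RepeatableShuffle {

    // Hand-rolled Fisher-Yates shuffle (stand-in for MathUtils.shuffleArray)
    static void shuffle(int[] order, Random rng) {
        for (int i = order.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }

    public static void main(String[] args) {
        long seed = 0;
        int[] order = {0, 1, 2, 3, 4};

        // First "epoch": shuffle once and remember the order
        Random rng = new Random(seed);
        shuffle(order, rng);
        int[] firstEpoch = order.clone();

        // On "reset": restore the base order and re-seed the RNG,
        // so the reshuffle reproduces exactly the same permutation
        Arrays.sort(order);
        rng = new Random(seed);
        shuffle(order, rng);

        System.out.println(Arrays.equals(firstEpoch, order)); // prints "true"
    }
}
```

With this scheme, shuffle still randomizes the order once per seed, but every reset sees the same permutation, so a subset iterator would visit the same examples each epoch.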

@AlexDBlack AlexDBlack self-assigned this Aug 29, 2018

Charele commented Aug 29, 2018

Why must we get the same dataset?
I think it isn't necessary.

@AlexDBlack (Contributor)

Thanks for reporting.
Fixed here: #6302


GuutBoy commented Aug 29, 2018

@Charele It is a design choice and of course you could go with the current behavior. Either way, the actual behavior should probably be clearly documented to avoid confusion.

In my case I wanted to train a classifier over a number of epochs on a limited set of examples. To do this I used a MnistDatasetIterator with numExamples set to, say, 500, and called fit on a MultiLayerNetwork with the iterator as the argument. I would expect this to train the classifier on a fixed dataset of 500 examples, but because of this issue the classifier is trained on a new random subset each epoch. This was very confusing, because I ended up with a very good classifier even when training on a very small dataset.
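The effect described above can be sketched in plain Java. The class below is a hypothetical stand-in (not the DL4J MnistDatasetFetcher): reset() reshuffles the full index array, and each "epoch" only reads the first NUM_EXAMPLES indices, so every epoch sees a different random subset of the data:

```java
import java.util.Arrays;
import java.util.Random;

public class ShuffleOnResetDemo {
    static int[] order = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; // 10 "examples"
    static final int NUM_EXAMPLES = 3;                    // subset size per epoch
    static Random rng = new Random(42);

    // Mimics the reported reset() behavior: reshuffles on every reset
    static void reset() {
        for (int i = order.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = order[i];
            order[i] = order[j];
            order[j] = t;
        }
    }

    public static void main(String[] args) {
        for (int epoch = 0; epoch < 3; epoch++) {
            reset();
            // Each epoch reads only the first NUM_EXAMPLES indices,
            // which change after every reset
            System.out.println("epoch " + epoch + ": "
                    + Arrays.toString(Arrays.copyOf(order, NUM_EXAMPLES)));
        }
    }
}
```

Over many epochs the model therefore effectively trains on far more than NUM_EXAMPLES distinct examples, which explains the surprisingly good classifier observed above.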

AlexDBlack added a commit that referenced this issue Aug 30, 2018
* Another pass on javadoc link formatting

* #6299 Mnist iterator subset shuffling repeatability

* #6128 fix StackVertex output type

* #6101 DataVec ObjectDetectionRecordReader image center validation

* #6280 validate and throw exception for invalid loss/activation combinations

* Cleanup and fix tests given new validation

* Another round of javadoc link fixes

* Re-enable some now passing tests

* Tweak arbiter max candidates condition to exclude queued candidates

* Small final test fix
sshepel pushed a commit that referenced this issue Aug 30, 2018
lock bot commented Sep 29, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.