
MnistDatasetFetcher shuffles on reset causing unexpected behavior #6299

Closed
GuutBoy opened this issue Aug 28, 2018 · 4 comments · Fixed by #6302

Comments

@GuutBoy commented Aug 28, 2018

If the shuffle parameter is set on MnistDatasetFetcher, it shuffles the dataset on every call to reset(). Below is the relevant code from MnistDatasetFetcher:

@Override
public void reset() {
  cursor = 0;
  curr = null;
  if (shuffle)
    MathUtils.shuffleArray(order, rng);
}

This appears to be a bug. In particular, it seems to cause an MnistDataSetIterator constructed with numExamples < 60000 to iterate over a new random subset of the data on each call to reset(). The documentation is not explicit on this point, but it does not seem to be the intended behavior.

An example of this is demonstrated by the code below:

import java.io.IOException;
import org.deeplearning4j.datasets.iterator.impl.MnistDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;

public class MnistDatasetExperiments {

  public static void main(String[] args) throws IOException {
    MnistDataSetIterator it = new MnistDataSetIterator(1, 1, false, false, true, 0);
    DataSet data1 = it.next();
    it.reset();
    DataSet data2 = it.next();
    if (data1.get(0).getFeatures().equals(data2.get(0).getFeatures())) {
      System.out.println("Success 😃");
    } else {
      System.out.println("Failure 😭");
    }
  }
}

Here we construct an iterator over a single example with a batch size of 1. I would expect this iterator to return the same example on the first call to next() and on the call to next() after reset().
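One way to make reset() repeatable is sketched below in plain Java. This is a hypothetical illustration, not the actual patch merged in #6302: the idea is to restrict the shuffle to the first numExamples slots of the order array, so a reset re-orders the same subset instead of drawing a new one. The class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Random;

public class SubsetStableShuffle {

    /**
     * Shuffle only order[0..numExamples), leaving the remaining entries
     * untouched. A reset() that calls this re-orders the subset without
     * swapping new examples into it.
     */
    public static void shuffleSubset(int[] order, int numExamples, Random rng) {
        // Standard Fisher-Yates, restricted to the first numExamples slots
        for (int i = numExamples - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }

    /** Sorted copy of the first numExamples entries, for comparing membership. */
    public static int[] subsetMembers(int[] order, int numExamples) {
        int[] subset = Arrays.copyOf(order, numExamples);
        Arrays.sort(subset);
        return subset;
    }
}
```

With this variant, subsetMembers(order, numExamples) is identical before and after every reset, which is what the reproduction program above expects.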

@AlexDBlack AlexDBlack self-assigned this Aug 29, 2018

@Charele commented Aug 29, 2018

Why must we get the same dataset?
I think it isn't necessary.

AlexDBlack added a commit that referenced this issue Aug 29, 2018
@AlexDBlack (Contributor) commented Aug 29, 2018

Thanks for reporting.
Fixed here: #6302

@GuutBoy (Author) commented Aug 29, 2018

@Charele It is a design choice, and of course you could go with the current behavior. Either way, the actual behavior should be clearly documented to avoid confusion.

In my case I wanted to train a classifier for a number of epochs on a limited set of examples. To do this I used an MnistDataSetIterator with numExamples set to, say, 500, and called fit on a MultiLayerNetwork with the iterator as the argument. I would expect this to train the classifier on a fixed data set of 500 examples but, because of this issue, the classifier is trained on a new random subset each epoch. This was very confusing, because I ended up with a very good classifier even when training on a very small data set.
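A workaround for this use case can be sketched in plain Java: draw the subset once up front, cache it, and reshuffle only the cached copy each epoch, so every epoch trains on the same examples. This is illustrative only; CachedSubset and its methods are invented for the sketch and are not DL4J API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CachedSubset<T> {

    private final List<T> subset;

    /** Draw numExamples items from the full data once, up front. */
    public CachedSubset(List<T> fullData, int numExamples, Random rng) {
        List<T> copy = new ArrayList<>(fullData);
        Collections.shuffle(copy, rng);
        this.subset = new ArrayList<>(copy.subList(0, numExamples));
    }

    /**
     * A freshly shuffled view of the cached subset for one epoch.
     * The members never change across epochs; only their order does.
     */
    public List<T> epochView(Random rng) {
        List<T> view = new ArrayList<>(subset);
        Collections.shuffle(view, rng);
        return view;
    }
}
```

Training loops would then call epochView once per epoch, instead of relying on the iterator's reset() behavior.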

AlexDBlack added a commit that referenced this issue Aug 30, 2018
DL4J: Misc fixes (#6302)
* Another pass on javadoc link formatting

* #6299 Mnist iterator subset shuffling repeatability

* #6128 fix StackVertex output type

* #6101 DataVec ObjectDetectionRecordReader image center validation

* #6280 validate and throw exception for invalid loss/activation combinations

* Cleanup and fix tests given new validation

* Another round of javadoc link fixes

* Re-enable some now passing tests

* Tweak arbiter max candidates condition to exclude queued candidates

* Small final test fix
sshepel added a commit that referenced this issue Aug 30, 2018
DL4J: Misc fixes (#6302)
@lock commented Sep 29, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 29, 2018

3 participants