Expected behavior: KFoldIterator should split the dataset into k folds as evenly as possible.
Observed behavior: The last fold is often very small (fewer than k elements, and sometimes empty), which could explain the high variance of results in issue #5343.
Explanation:
If the dataset size (n) is divisible by the desired number of folds (k) without remainder, it splits the dataset evenly.
However, if there is a remainder, it divides the data into only (k-1) equally sized folds and assigns the remainder to the last fold. The last fold therefore holds only n mod (k-1) elements. This means that the test set of the last fold will be extremely small and will probably produce NaN results in evaluation due to missing classes.
If n is not divisible by k but is divisible by (k-1), for example n = 99 and k = 10, this even creates an empty fold, which causes exceptions later on.
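To make the arithmetic concrete, here is a minimal sketch of the fold sizes that the explanation above implies. It is not the ND4J source: the class name `DescribedFoldSizes` and the method `describedSizes` are hypothetical, and the code only reproduces the size calculation as described in this issue.

```java
import java.util.Arrays;

// Hypothetical sketch, not the ND4J implementation.
// It reproduces only the fold-size arithmetic described in this issue.
class DescribedFoldSizes {

    // Even split when n % k == 0; otherwise (k-1) folds of n / (k-1) elements
    // plus a last fold that receives the remainder n % (k-1).
    static int[] describedSizes(int n, int k) {
        int[] sizes = new int[k];
        if (n % k == 0) {
            Arrays.fill(sizes, n / k);
        } else {
            Arrays.fill(sizes, n / (k - 1));
            sizes[k - 1] = n % (k - 1);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // prints [11, 11, 11, 11, 11, 11, 11, 11, 11, 0]
        System.out.println(Arrays.toString(describedSizes(99, 10)));
        // prints [11, 11, 11, 11, 11, 11, 11, 11, 11, 6]
        System.out.println(Arrays.toString(describedSizes(105, 10)));
    }
}
```

With n = 99 and k = 10 the last fold is empty, and with n = 105 it holds 6 elements while every other fold holds 11.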
Thanks for the issue (and @RajaniVM for flagging) - easy to confirm with size 99, 10 splits:
@Test
public void test5974() {
    // 99 examples, 10 folds: n is not divisible by k, but is divisible by k - 1
    DataSet ds = new DataSet(Nd4j.linspace(1, 99, 99).transpose(), Nd4j.linspace(1, 99, 99).transpose());
    KFoldIterator iter = new KFoldIterator(10, ds);
    while (iter.hasNext()) {
        DataSet fold = iter.next();
        System.out.println(fold);
    }
}
java.lang.IllegalArgumentException: NDArrayIndex is out of range. Beginning index: 99 must be less than its size: 99
    at org.nd4j.linalg.indexing.NDArrayIndex.validate(NDArrayIndex.java:440)
    at org.nd4j.linalg.indexing.NDArrayIndex.resolve(NDArrayIndex.java:345)
    at org.nd4j.linalg.api.ndarray.BaseNDArray.get(BaseNDArray.java:5083)
    at org.nd4j.linalg.dataset.DataSet.getRange(DataSet.java:236)
    at org.nd4j.linalg.dataset.api.iterator.KFoldIterator.nextFold(KFoldIterator.java:193)
    at org.nd4j.linalg.dataset.api.iterator.KFoldIterator.next(KFoldIterator.java:163)
    at org.nd4j.linalg.dataset.KFoldIteratorTest.test5974(KFoldIteratorTest.java:182)
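For comparison, one way to get the expected behavior (folds as even as possible) is to give the first n mod k folds one extra element. The sketch below is only an illustration of that idea under the assumption that fold sizes may differ by at most one; the names `EvenFoldSizes` and `evenSizes` are hypothetical, and this is not necessarily how KFoldIterator was or will be fixed.

```java
import java.util.Arrays;

// Hypothetical sketch of an even split, not the ND4J implementation.
class EvenFoldSizes {

    // Every fold gets n / k elements; the first n % k folds get one more,
    // so no fold is empty when n >= k and sizes differ by at most one.
    static int[] evenSizes(int n, int k) {
        int[] sizes = new int[k];
        int base = n / k;
        int remainder = n % k;
        for (int i = 0; i < k; i++) {
            sizes[i] = base + (i < remainder ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // prints [10, 10, 10, 10, 10, 10, 10, 10, 10, 9]
        System.out.println(Arrays.toString(evenSizes(99, 10)));
    }
}
```

For the failing case of 99 examples and 10 folds, this yields nine folds of 10 and one fold of 9, so no range request would start at index 99.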
Thanks to @RajaniVM for noticing the problem!