# Loading Data
This tutorial focuses on how to feed data into a training or inference program. Most training and inference modules in MXNet accept data iterators, which is especially useful when reading large datasets from filesystems. An iterator does some preprocessing and generates batches for the neural network.

MXNet provides basic iterators for MNIST and RecordIO images. To hide the cost of I/O, MXNet uses a prefetch strategy that runs data fetching in parallel with the learning process: data is automatically fetched by an independent thread. Here we discuss the API conventions and several of the provided iterators.

## Jupyter Scala kernel
Add the MXNet Scala jar, which is created as part of the MXNet Scala package installation, to the classpath as follows:

**Note**: The process for adding this jar to your Scala kernel classpath can differ depending on the Scala kernel you are using.

We used the [jupyter-scala kernel](https://github.com/alexarchambault/jupyter-scala) to create this notebook.

```
classpath.addPath(<path_to_jar>)

// e.g.
classpath.addPath("mxnet-full_2.11-osx-x86_64-cpu-0.1.2-SNAPSHOT.jar")
```

## Basic Data Iterator

MXNet's data iterator returns a batch of data in each `next` call. We first introduce what a data batch looks like and then how to write a basic data iterator.

### Data Batch
A data batch often contains n examples and the corresponding labels. Here n is called the batch size.
The following code defines a valid data batch that can be read by most training/inference modules.

In [2]:
import ml.dmlc.mxnet._
import scala.collection.immutable.ListMap

class DataBatch(val data: IndexedSeq[NDArray],
                val label: IndexedSeq[NDArray],
                val index: IndexedSeq[Long],
                val pad: Int,
                // the key for the bucket that should be used for this batch,
                // for bucketing io only
                val bucketKey: AnyRef = null,
                // use ListMap to indicate the order of data/label loading
                // (must match the order of input data/label)
                private val providedData: ListMap[String, Shape] = null,
                private val providedLabel: ListMap[String, Shape] = null) {
  /**
   * Dispose its data and labels
   * The object shall never be used after it is disposed.
   */
  def dispose(): Unit = {
    if (data != null) {
      data.foreach(arr => if (arr != null) arr.dispose())
    }
    if (label != null) {
      label.foreach(arr => if (arr != null) arr.dispose())
    }
  }

  // The name and shape of data
  def provideData: ListMap[String, Shape] = providedData

  // The name and shape of label
  def provideLabel: ListMap[String, Shape] = providedLabel
}

import ml.dmlc.mxnet._
import scala.collection.immutable.ListMap
defined class DataBatch

We explain what each attribute means:
- **data** is a list of NDArrays, each of which has a first dimension of length $n$. For example, if an example is an image of size $224 \times 224$ with RGB channels, then the array shape should be (n, 3, 224, 224). Note that the image batch format used by MXNet is

$$\textrm{batch_size} \times \textrm{num_channel} \times \textrm{height} \times \textrm{width}$$ 

The channels are often in RGB order.

Each array will be copied into a free variable of the Symbol later. The mapping from arrays to free variables should be given by the `provideData` attribute of the iterator, which will be discussed shortly.
- **label** is also a list of NDArrays. Often each NDArray is a 1-dimensional array with shape (n,). For classification, each class is represented by an integer starting from 0.
- **pad** is an integer that shows how many examples are used merely for padding and should be ignored in the results. A nonzero padding is often used when we reach the end of the data and the total number of examples is not divisible by the batch size.
- **providedData** is a ListMap of name and shape of the data.
- **providedLabel** is a ListMap of name and shape of the label.
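
To make the role of **pad** concrete, here is a minimal sketch in plain Scala with no MXNet dependency. The object and function names are hypothetical, and the sketch assumes that the padded examples sit at the end of the last batch.

```scala
object PadExample {
  // Number of padding examples needed so that the last batch is full.
  def padCount(numExamples: Int, batchSize: Int): Int =
    (batchSize - numExamples % batchSize) % batchSize

  // Drop the results that correspond to padded examples.
  // Assumption: padded examples are the last `pad` entries of the batch.
  def dropPadded[A](results: Seq[A], pad: Int): Seq[A] =
    results.dropRight(pad)
}
```

For instance, with 1000 examples and a batch size of 128, `PadExample.padCount(1000, 128)` gives 24, so the last 24 results of the final batch would be ignored.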


### Symbol and Data Variables
Before moving on to the iterator, we first look at how to find which variables in a Symbol are for input data. In MXNet, an operator (`Symbol.*`) has one or more input variables and output variables; some operators may have additional auxiliary variables for internal states. If an input variable of an operator is not assigned the output of another operator when the operator is created, then this input variable is free. We need to assign external data to it before running.

The following code defines a simple multilayer perceptron (MLP) and then prints all free variables.

In [3]:
val numClasses = 10

val data = Symbol.Variable("data")
val fc1 = Symbol.FullyConnected(name = "fc1")()(Map("data" -> data, "num_hidden" -> 64))
val act1 = Symbol.Activation(name = "relu1")()(Map("data" -> fc1, "act_type" -> "relu"))
val fc2 = Symbol.FullyConnected(name = "fc2")()(Map("data" -> act1, "num_hidden" -> numClasses))
val mlp = Symbol.SoftmaxOutput(name = "softmax")()(Map("data" -> fc2))

mlp.listArguments()
mlp.listOutputs()

numClasses: Int = 10
data: Symbol = ml.dmlc.mxnet.Symbol@30a65ea0
fc1: Symbol = ml.dmlc.mxnet.Symbol@200fe851
act1: Symbol = ml.dmlc.mxnet.Symbol@28fd3448
fc2: Symbol = ml.dmlc.mxnet.Symbol@671608bf
mlp: Symbol = ml.dmlc.mxnet.Symbol@65c719a3
res2_6: IndexedSeq[String] = ArrayBuffer(
  "data",
  "fc1_weight",
  "fc1_bias",
  "fc2_weight",
  "fc2_bias",
  "softmax_label"
)
res2_7: IndexedSeq[String] = ArrayBuffer("softmax_output")

As can be seen, a variable is named either by its operator's name if it is atomic (e.g. `Symbol.Variable("data")`) or by the `opname_varname` convention. The `varname` often indicates what the variable is for:

- weight : the weight parameters
- bias : the bias parameters
- output : the output
- label : input label

From the above example, we now know that there are four variables for parameters and two for input data: `data` for examples and `softmax_label` for the corresponding labels.
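
The naming convention can be used to separate parameters from input variables programmatically. The following is a minimal sketch in plain Scala; the object `ArgNames` and its members are hypothetical helpers, and the list of names is copied from the `listArguments()` output above rather than queried from MXNet.

```scala
object ArgNames {
  // Suffixes that, by the opname_varname convention, mark learnable parameters.
  val paramSuffixes: Seq[String] = Seq("_weight", "_bias")

  def isParam(name: String): Boolean = paramSuffixes.exists(s => name.endsWith(s))

  // Split argument names into (parameters, inputs).
  def split(args: Seq[String]): (Seq[String], Seq[String]) = args.partition(isParam)

  // Argument names of the MLP, as printed by listArguments() above.
  val mlpArgs: Seq[String] =
    Seq("data", "fc1_weight", "fc1_bias", "fc2_weight", "fc2_bias", "softmax_label")
}
```

Calling `ArgNames.split(ArgNames.mlpArgs)` separates the four parameter names from `data` and `softmax_label`. This relies purely on name suffixes, so it only works when the symbols follow the convention.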

The following example defines a matrix factorization objective function with rank 10 for recommendation systems. It has three input variables: `user` for user IDs, `item` for item IDs, and `score` for the rating that `user` gives to `item`.

In [4]:
val numUsers = 1000
val numItems = 1000
val k = 10 

// input
val user = Symbol.Variable("user")
val item = Symbol.Variable("item")
val score = Symbol.Variable("score")

// user feature lookup
val user1 = Symbol.Embedding()()(Map("data" -> user, "input_dim" -> numUsers, "output_dim" -> k))

// item feature lookup
val item1 = Symbol.Embedding()()(Map("data" -> item, "input_dim" -> numItems, "output_dim" -> k))

// predict by the inner product, which is elementwise product and then sum
val pred0 = user1 * item1
val pred1 = Symbol.sum_axis()()(Map("data" -> pred0, "axis" -> 1))
val pred2 = Symbol.Flatten()()(Map("data" -> pred1))

// loss layer
val pred = Symbol.LinearRegressionOutput()()(Map("data" -> pred2, "label" -> score))

numUsers: Int = 1000
numItems: Int = 1000
k: Int = 10
user: Symbol = ml.dmlc.mxnet.Symbol@50a64721
item: Symbol = ml.dmlc.mxnet.Symbol@6f09fabd
score: Symbol = ml.dmlc.mxnet.Symbol@6cdc9943
user1: Symbol = ml.dmlc.mxnet.Symbol@2971e218
item1: Symbol = ml.dmlc.mxnet.Symbol@4b0c2758
pred0: Symbol = ml.dmlc.mxnet.Symbol@6bf0e096
pred1: Symbol = ml.dmlc.mxnet.Symbol@636e7a2a
pred2: Symbol = ml.dmlc.mxnet.Symbol@4529257d
pred: Symbol = ml.dmlc.mxnet.Symbol@6f35ec67

### Data Iterators
Now we are ready to show how to create a valid MXNet data iterator. An iterator should extend the `DataIter` class and override the following methods:

- **reset()** to restart reading from the beginning.
- **provideData()** to return a ListMap of (String, Shape) pairs, each of which stores an input data variable name and its shape.
- **provideLabel()** to return a ListMap of (String, Shape) pairs, which provides information about the input labels.
- **getData()** and **getLabel()** to get the data and label of the current batch.
- **getPad()** to get the number of padding examples.
- **getIndex()** to get the index of the current batch.
- **next()** to return a data batch.
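
The core of this protocol (`reset`, `hasNext`, `next` and padding) can be sketched without MXNet. The following toy example is an illustration only: `ToyBatch` and `ToyIter` are hypothetical names, not part of the MXNet API, and the sketch assumes the last batch is padded with zeros.

```scala
object ToyIterExample {
  // Hypothetical stand-in for a batch: the examples it holds plus a pad count.
  final case class ToyBatch(data: IndexedSeq[Float], pad: Int)

  // Hypothetical iterator over an in-memory sequence; not the MXNet DataIter API.
  class ToyIter(examples: IndexedSeq[Float], batchSize: Int) extends Iterator[ToyBatch] {
    require(batchSize > 0, "batchSize must be positive")
    private var cursor = 0

    // Restart reading from the beginning.
    def reset(): Unit = { cursor = 0 }

    override def hasNext: Boolean = cursor < examples.length

    // Return the next batch, padding the last one with zeros if needed.
    override def next(): ToyBatch = {
      if (!hasNext) throw new NoSuchElementException("end of data")
      val slice = examples.slice(cursor, cursor + batchSize)
      val pad = batchSize - slice.length
      cursor += batchSize
      ToyBatch(slice ++ IndexedSeq.fill(pad)(0f), pad)
    }
  }
}
```

With five examples and a batch size of 2, the third batch holds one real example and one padded entry, so its `pad` is 1. Calling `reset()` makes the iterator usable for another epoch.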

The following code defines a simple iterator that returns some random data each time.

In [5]:
    // Generate a dim(0) x dim(1) matrix of floats in [-1, 1) with a fixed seed
    def dataGen(dim: Array[Int]): Array[Array[Float]] = {
        val r = new scala.util.Random(100)
        Array.fill(dim(0), dim(1)) { 2 * r.nextFloat - 1 }
    }

    // Generate (dim - lowLimit) integer labels in [0, highLimit), stored as floats
    def labelGen(lowLimit: Int, highLimit: Int, dim: Int): Array[Float] = {
        val r = new scala.util.Random(100)
        val label = for (i <- lowLimit + 1 to dim) yield r.nextInt(highLimit).toFloat
        label.toArray
    }

defined function dataGen
defined function labelGen

In [6]:
import scala.collection.mutable.ArrayBuffer
val numBatches: Int=10

class SimpleIter(dataNames: String, dataShapes: Shape, dataGen: Array[Array[Float]],
                 labelNames: String, labelShapes: Shape, labelGen: Array[Float]) extends DataIter {

  val _provideData = ListMap(dataNames -> dataShapes)
  val _provideLabel = ListMap(labelNames -> labelShapes)
  var curBatch = 0

  // Get next data batch from iterator
  override def next(): DataBatch = {
    if (!hasNext) throw new NoSuchElementException

    val data = Array(NDArray.array(dataGen.flatten.toArray, shape = dataShapes))
    val label = Array(NDArray.array(labelGen, shape = labelShapes))
    curBatch += 1

    new DataBatch(data = data, label = label, index = getIndex(), pad = getPad(),
                  providedData = _provideData, providedLabel = _provideLabel)
  }

  // Reset the iterator
  override def reset(): Unit = {
    curBatch = 0
  }

  // Check for next batch
  override def hasNext: Boolean = {
    curBatch < numBatches
  }

  // The batch size is the first dimension of the data shape
  // (the original returned numBatches here, which is the number of batches)
  override def batchSize: Int = dataShapes(0)
  // Get data of current batch
  override def getData(): IndexedSeq[NDArray] = IndexedSeq()
  // Get the index of current batch
  override def getIndex(): IndexedSeq[Long] = IndexedSeq[Long]()
  // Get label of current batch
  override def getLabel(): IndexedSeq[NDArray] = IndexedSeq()
  // Get the number of padding examples in current batch
  override def getPad(): Int = 0
  // The name and shape of data provided by this iterator
  override def provideData: ListMap[String, Shape] = _provideData
  // The name and shape of label provided by this iterator
  override def provideLabel: ListMap[String, Shape] = _provideLabel

}

import scala.collection.mutable.ArrayBuffer
numBatches: Int = 10
defined class SimpleIter

Now we can feed the data iterator into a training problem. Here we use the Module class; more details about this class are discussed in module.ipynb.

In [7]:
import ml.dmlc.mxnet.module.{FitParams, Module}

val n = 32
val data = new SimpleIter("data", Shape(n,100), 
                  dataGen(Array(n,100)),
                  "softmax_label", Shape(n), 
                  labelGen(0, numClasses, n))

val mod = new Module(mlp)
mod.fit(data, numEpoch=5)

import ml.dmlc.mxnet.module.{FitParams, Module}
n: Int = 32
data: SimpleIter = non-empty iterator
mod: module.Module = ml.dmlc.mxnet.module.Module@4f52fe71

For the Symbol `pred`, by contrast, we need to provide three inputs: two for examples and one for the label. Refer to the MatrixFactorization tutorial to learn more.
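
For reference, the prediction that `pred` computes for a single (user, item) pair is the inner product of the two embedding vectors, that is, an elementwise product followed by a sum. Below is a minimal plain-Scala sketch of that arithmetic; the object `InnerProduct` is a hypothetical helper and the vectors are made-up values, not learned embeddings.

```scala
object InnerProduct {
  // Elementwise product followed by a sum, i.e. the inner product.
  def predict(userVec: Array[Float], itemVec: Array[Float]): Float = {
    require(userVec.length == itemVec.length, "vectors must have the same rank")
    userVec.zip(itemVec).map { case (u, v) => u * v }.sum
  }
}
```

For example, `InnerProduct.predict(Array(1f, 2f, 3f), Array(4f, 5f, 6f))` evaluates to 1*4 + 2*5 + 3*6 = 32.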


## More Iterators
MXNet provides multiple efficient data iterators as follows:

### MNISTIter
MNISTIter is an easy way to iterate over the MNIST dataset.

**Parameters:**

- "image" and "label" – Dataset Param: paths to the MNIST image and label data.
- "batch_size" (int, optional, default='128') – Batch Param: batch size.
- "shuffle" – Augmentation Param: whether to shuffle the data.
- "flat" (boolean, optional, default=False) – Augmentation Param: whether to flatten the data into 1D.
- "seed" (int, optional, default='0') – Augmentation Param: random seed.
- "silent" (boolean, optional, default=False) – Auxiliary Param: whether to print out data info.
- "num_parts" (int, optional, default='1') – partition the data into multiple parts.
- "part_index" (int, optional, default='0') – the index of the part to read.
- "prefetch_buffer" (long (non-negative), optional, default=4) – maximum number of batches to prefetch.
- "dtype" ({None, 'float16', 'float32', 'float64', 'int32', 'uint8'}, optional, default='None') – output data type. None means no change.

In [9]:
val params = Map(
      "image" -> "data/train-images-idx3-ubyte",
      "label" -> "data/train-labels-idx1-ubyte",
      "data_shape" -> "(784,)",
      "batch_size" -> "100",
      "shuffle" -> "1",
      "flat" -> "1",
      "silent" -> "0",
      "seed" -> "10"
    )

    val mnistPack = IO.MNISTPack(params)

    val nBatch = 600
    var batchCount = 0
    for(batch <- mnistPack) {
      batchCount += 1
    }

    // create DataIter
    val mnistIter = mnistPack.iterator
    // get the name and shape of data provided by this iterator 
    val provideData = mnistIter.provideData
    // get the name and shape of label provided by this iterator 
    val provideLabel = mnistIter.provideLabel
     
    // reset the iterator
    mnistIter.reset()
    batchCount = 0
    // check if iterator has next batch of data
    while (mnistIter.hasNext) {
      mnistIter.next()
      batchCount += 1
    }
 
    mnistIter.reset()
    // get next data batch from iterator
    mnistIter.next()
    // get label of current batch
    val label0 = mnistIter.getLabel().head.toArray
    // get data of current batch
    val data0 = mnistIter.getData().head.toArray
    mnistIter.next()
    mnistIter.next()
    mnistIter.next()
    mnistIter.reset()
    mnistIter.next()
    val label1 = mnistIter.getLabel().head.toArray
    val data1 = mnistIter.getData().head.toArray


params: Map[String, String] = Map(
  "silent" -> "0",
  "seed" -> "10",
  "flat" -> "1",
  "image" -> "data/train-images-idx3-ubyte",
  "label" -> "data/train-labels-idx1-ubyte",
  "shuffle" -> "1",
  "data_shape" -> "(784,)",
  "batch_size" -> "100"
)
mnistPack: DataPack = MXDataPack(
  ml.dmlc.mxnet.DataBatch@1c83152e,
  ml.dmlc.mxnet.DataBatch@18820148,
  ml.dmlc.mxnet.DataBatch@1410a0a5,
  ml.dmlc.mxnet.DataBatch@45dfe674,
  ml.dmlc.mxnet.DataBatch@4171b184,
  ml.dmlc.mxnet.DataBatch@497170a3,
  ml.dmlc.mxnet.DataBatch@58f5f4a0,
  ml.dmlc.mxnet.DataBatch@6223558c,
  ml.dmlc.mxnet.DataBatch@2e1235dd,
  ml.dmlc.mxnet.DataBatch@6ca4bcd4,
  ml.dmlc.mxnet.DataBatch@1b030514,
  ml.dmlc.mxnet.DataBatch@63f4bccd,
  ml.dmlc.mxnet.DataBatch@5c77d1b3,
  ml.dmlc.mxnet.DataBatch@15f

### ImageRecordIter
ImageRecordIter iterates over image RecordIO files.
It reads image batches from RecordIO files and offers a rich set of data augmentation options.


In [10]:
val params = Map(
      "path_imgrec" -> "data/cifar/train.rec",
      "mean_img" -> "data/cifar/cifar10_mean.bin",
      "rand_crop" -> "False",
      "rand_mirror" -> "False",
      "shuffle" -> "False",
      "data_shape" -> "(3,28,28)",
      "batch_size" -> "100",
      "preprocess_threads" -> "4",
      "prefetch_buffer" -> "1"
    )
    val imgRecIter = IO.ImageRecordIter(params)
    val nBatch = 500
    var batchCount = 0
    // test provideData
    val provideData = imgRecIter.provideData
    val provideLabel = imgRecIter.provideLabel
    
    // Reset the iterator
    imgRecIter.reset()
    while (imgRecIter.hasNext) {
      imgRecIter.next()
      batchCount += 1
    }

    imgRecIter.reset()
    // Get next batch of iterator
    imgRecIter.next()
    // Get label of current batch
    val label0 = imgRecIter.getLabel().head.toArray
    // Get data of current batch
    val data0 = imgRecIter.getData().head.toArray


params: Map[String, String] = Map(
  "prefetch_buffer" -> "1",
  "path_imgrec" -> "data/cifar/train.rec",
  "mean_img" -> "data/cifar/cifar10_mean.bin",
  "rand_mirror" -> "False",
  "shuffle" -> "False",
  "preprocess_threads" -> "4",
  "rand_crop" -> "False",
  "data_shape" -> "(3,28,28)",
  "batch_size" -> "100"
)
imgRecIter: DataIter = non-empty iterator
nBatch: Int = 500
batchCount: Int = 500
provideData: ListMap[String, Shape] = Map("data" -> (100,3,28,28))
provideLabel: ListMap[String, Shape] = Map("label" -> (100))
res9_9: DataBatch = ml.dmlc.mxnet.DataBatch@5233832d
label0

### ResizeIter
ResizeIter resizes a DataIter to a given number of batches per epoch. It may produce an incomplete batch in the middle of an epoch due to padding from the internal iterator.

It takes the input arguments **dataIter** (the internal data iterator), **reSize** (the number of batches per epoch to resize to) and **resetInternal** (whether to reset the internal iterator on ResizeIter.reset), and returns a resized iterator.
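
The idea of resizing an epoch can be sketched over ordinary Scala iterators. This is a hypothetical wrapper for illustration, not the MXNet `ResizeIter` class: the name `Resized` and the `makeSource` factory are assumptions, and the sketch assumes that the source is restarted whenever it runs out before the epoch is complete.

```scala
object ResizeSketch {
  // Hypothetical wrapper: yields exactly `size` items per epoch,
  // restarting the source whenever it runs out.
  class Resized[A](makeSource: () => Iterator[A], size: Int) extends Iterator[A] {
    private var source = makeSource()
    private var produced = 0

    override def hasNext: Boolean = produced < size

    override def next(): A = {
      if (!hasNext) throw new NoSuchElementException("epoch finished")
      if (!source.hasNext) source = makeSource() // wrap around to the start
      produced += 1
      source.next()
    }

    // Start a new epoch; optionally restart the source as well.
    def reset(resetInternal: Boolean): Unit = {
      produced = 0
      if (resetInternal) source = makeSource()
    }
  }
}
```

With a three-item source resized to five items, one epoch visits the source once in full and then wraps around for two more items. With `reset(false)`, the next epoch continues from where the source left off instead of starting over.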


In [11]:
import ml.dmlc.mxnet.io.{NDArrayIter, ResizeIter, PrefetchingIter}

val params = Map(
      "image" -> "data/train-images-idx3-ubyte",
      "label" -> "data/train-labels-idx1-ubyte",
      "data_shape" -> "(784,)",
      "batch_size" -> "100",
      "shuffle" -> "1",
      "flat" -> "1",
      "silent" -> "0",
      "seed" -> "10"
    )

    val mnistIter = IO.MNISTIter(params)
    val nBatch = 400
    var batchCount = 0

    // Resize a Mnist data iterator
    val resizeIter = new ResizeIter(mnistIter, nBatch, false)

    while(resizeIter.hasNext) {
      resizeIter.next()
      batchCount += 1
    }


import ml.dmlc.mxnet.io.{NDArrayIter, ResizeIter, PrefetchingIter}
params: Map[String, String] = Map(
  "silent" -> "0",
  "seed" -> "10",
  "flat" -> "1",
  "image" -> "data/train-images-idx3-ubyte",
  "label" -> "data/train-labels-idx1-ubyte",
  "shuffle" -> "1",
  "data_shape" -> "(784,)",
  "batch_size" -> "100"
)
mnistIter: DataIter = non-empty iterator
nBatch: Int = 400
batchCount: Int = 400
resizeIter: io.ResizeIter = empty iterator

### PrefetchingIter

PrefetchingIter performs prefetching for other data iterators. It takes one or more DataIters and combines them with prefetching.

This iterator will create another thread to perform next() and then store the data in memory. It potentially accelerates the data read, at the cost of more memory usage.
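
The producer/consumer pattern behind prefetching can be sketched with a bounded queue from the Java standard library. This is a simplified illustration, not the MXNet `PrefetchingIter` implementation: the `prefetch` helper and its `bufferSize` parameter are hypothetical, and error handling in the worker thread is omitted.

```scala
import java.util.concurrent.ArrayBlockingQueue

object PrefetchSketch {
  // Hypothetical helper: a background thread pulls items from `source`
  // into a bounded queue while the caller consumes them.
  def prefetch[A](source: Iterator[A], bufferSize: Int): Iterator[A] = {
    val queue = new ArrayBlockingQueue[Option[A]](bufferSize)
    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        source.foreach(item => queue.put(Some(item))) // blocks when the buffer is full
        queue.put(None)                               // end-of-data marker
      }
    })
    worker.setDaemon(true)
    worker.start()

    // Consumer side: take from the queue until the end marker arrives.
    Iterator.continually(queue.take()).takeWhile(_.isDefined).map(_.get)
  }
}
```

The queue capacity bounds the extra memory used: a larger `bufferSize` lets the worker run further ahead of the consumer but keeps more batches in memory at once.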

In [12]:
val params = Map(
      "image" -> "data/train-images-idx3-ubyte",
      "label" -> "data/train-labels-idx1-ubyte",
      "data_shape" -> "(784,)",
      "batch_size" -> "100",
      "shuffle" -> "1",
      "flat" -> "1",
      "silent" -> "0",
      "seed" -> "10"
    )

    val mnistPack1 = IO.MNISTPack(params)
    val mnistPack2 = IO.MNISTPack(params)

    val nBatch = 600
    var batchCount = 0

    val mnistIter1 = mnistPack1.iterator
    val mnistIter2 = mnistPack2.iterator

    var prefetchIter = new PrefetchingIter(
        IndexedSeq(mnistIter1, mnistIter2),
        IndexedSeq(Map("data" -> "data1"), Map("data" -> "data2")),
        IndexedSeq(Map("label" -> "label1"), Map("label" -> "label2"))
    )

    // Check for next batch
    while(prefetchIter.hasNext) {
      prefetchIter.next()
      batchCount += 1
    }

    // The name and shape of data provided by this iterator
    val provideData = prefetchIter.provideData
    // The name and shape of label provided by this iterator
    val provideLabel = prefetchIter.provideLabel

    prefetchIter.reset()
    prefetchIter.next()
    val label0 = prefetchIter.getLabel().head.toArray
    val data0 = prefetchIter.getData().head.toArray

    prefetchIter.dispose()

params: Map[String, String] = Map(
  "silent" -> "0",
  "seed" -> "10",
  "flat" -> "1",
  "image" -> "data/train-images-idx3-ubyte",
  "label" -> "data/train-labels-idx1-ubyte",
  "shuffle" -> "1",
  "data_shape" -> "(784,)",
  "batch_size" -> "100"
)
mnistPack1: DataPack = MXDataPack(
  ml.dmlc.mxnet.DataBatch@206a0ce5,
  ml.dmlc.mxnet.DataBatch@19b9b2f4,
  ml.dmlc.mxnet.DataBatch@60961087,
  ml.dmlc.mxnet.DataBatch@aa498fd,
  ml.dmlc.mxnet.DataBatch@7ad9b068,
  ml.dmlc.mxnet.DataBatch@2e2383d5,
  ml.dmlc.mxnet.DataBatch@7ee1acbe,
  ml.dmlc.mxnet.DataBatch@50acb0ef,
  ml.dmlc.mxnet.DataBatch@67411062,
  ml.dmlc.mxnet.DataBatch@55ce1a74,
  ml.dmlc.mxnet.DataBatch@2639c82f,
  ml.dmlc.mxnet.DataBatch@13272fcf,
  ml.dmlc.mxnet.DataBatch@7c0aefc9,
  ml.dmlc.mxnet.DataBatch@593

### NDArrayIter

NDArrayIter iterates over NDArrays. NDArray is the basic ndarray/tensor-like data structure in MXNet.
It takes the following parameters:
- **data** (NDArrayIter supports single or multiple data and label arrays)
- **label** (same as data, but not fed to the model during testing)
- **dataBatchSize** (batch size)
- **shuffle** (whether to shuffle the data)
- **lastBatchHandle** ("pad", "discard" or "roll_over"): how to handle the last batch

This iterator will pad, discard or roll over the last batch if the size of the data is not a multiple of the batch size. Roll over is intended for training and can cause problems if used for prediction.
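
The effect of the last-batch strategy on the number of batches per epoch is simple arithmetic. The sketch below covers only "pad" and "discard"; the object `LastBatch` and its function are hypothetical helpers, not part of the MXNet API.

```scala
object LastBatch {
  // Number of batches per epoch for the "pad" and "discard" strategies.
  def numBatches(numExamples: Int, batchSize: Int, lastBatchHandle: String): Int =
    lastBatchHandle match {
      case "pad"     => (numExamples + batchSize - 1) / batchSize // round up
      case "discard" => numExamples / batchSize                   // round down
      case other     =>
        throw new IllegalArgumentException(s"not covered by this sketch: $other")
    }
}
```

For 1000 examples and a batch size of 128, this gives 8 batches with "pad" and 7 with "discard".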

In [13]:
val shape0 = Shape(Array(1000, 2, 2))
    val data = IndexedSeq(NDArray.ones(shape0), NDArray.zeros(shape0))
    val shape1 = Shape(Array(1000, 1))
    val label = IndexedSeq(NDArray.ones(shape1))
    val batchData0 = NDArray.ones(Shape(Array(128, 2, 2)))
    val batchData1 = NDArray.zeros(Shape(Array(128, 2, 2)))
    val batchLabel = NDArray.ones(Shape(Array(128, 1)))

    // lastBatchHandle = pad
    val dataIter0 = new NDArrayIter(data, label, 128, false, "pad")
    var batchCount = 0
    val nBatch0 = 8
    while(dataIter0.hasNext) {
      val tBatch = dataIter0.next()
      batchCount += 1
     }

    // lastBatchHandle = discard
    val dataIter1 = new NDArrayIter(data, label, 128, false, "discard")
    val nBatch1 = 7
    batchCount = 0
    while(dataIter1.hasNext) {
      val tBatch = dataIter1.next()
      batchCount += 1
    }

    // empty label (for prediction)
    val dataIter2 = new NDArrayIter(data = data, dataBatchSize = 128, lastBatchHandle = "discard")
    batchCount = 0
    while(dataIter2.hasNext) {
      val tBatch = dataIter2.next()
      batchCount += 1
    }


shape0: Shape = (1000,2,2)
data: IndexedSeq[NDArray] = Vector(ml.dmlc.mxnet.NDArray@bc757ea0, ml.dmlc.mxnet.NDArray@a1db81b0)
shape1: Shape = (1000,1)
label: IndexedSeq[NDArray] = Vector(ml.dmlc.mxnet.NDArray@fc450a2b)
batchData0: NDArray = ml.dmlc.mxnet.NDArray@137c6494
batchData1: NDArray = ml.dmlc.mxnet.NDArray@4ea5965d
batchLabel: NDArray = ml.dmlc.mxnet.NDArray@85e2cd0f
dataIter0: NDArrayIter = empty iterator
batchCount: Int = 7
nBatch0: Int = 8
dataIter1: NDArrayIter = empty iterator
nBatch1: Int = 7
dataIter2: NDArrayIter = empty iterator

## Implementation
Iterators can be implemented in either C++ or front-end languages such as Python. The C++ definition is at [include/mxnet/io.h](https://github.com/dmlc/mxnet/blob/master/include/mxnet/io.h), and all C++ implementations are located in [src/io](https://github.com/dmlc/mxnet/tree/master/src/io). These implementations rely heavily on [dmlc-core](https://github.com/dmlc/dmlc-core), which supports reading data from various data formats and filesystems.

## Further Readings
- [Data loading API](http://mxnet.io/api/scala/io.html)
- [Design of efficient data format](http://mxnet.io/architecture/note_data_loading.html)