# **iMaterialist Challenge (Furniture) at FGVC5**
## **TFNW Kaggle Team**


In [8]:
# Import numpy for the high performance in-memory matrix/array storage and operations.
import numpy as np

# Import h5py for high-performance storage of big data in HDF5 files.
import h5py

# Import time to record timing
import time


## Reading in the Dataset Batches

We are ready to read in the dataset and feed it to our model. We have stored the dataset as a series of batch files, so we will want to read one batch file at a time and feed it into the model.

### Feeder

The function feeder() performs the feed task for training our model. We use a generator. From Wikipedia:

>*A generator is a special routine that can be used to control the iteration behaviour of a loop. A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values.  However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.*

The feeder() function takes a list of the HDF5 batch files. Each time the generator it returns is asked for the next item, it reads in the next batch file and extracts the images and corresponding labels, which are returned as the next batch of data (X, Y).

Let's take a closer look. At first glance, the 'while True' may look like an infinite loop that never returns. That's not the case. The key to understanding this is the yield statement, which acts much like a return statement. From Wikipedia:

>*The yield statement is used to define generators, replacing the return of a function to provide a result to its caller without destroying local variables. Unlike a function, where on each call it starts with new set of variables, a generator will resume the execution where it was left off.*
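
To make the quoted behaviour concrete, here is a minimal sketch of a generator that has nothing to do with our dataset. The name `count_up` is made up for illustration.

```python
def count_up(limit):
    """Yield 0, 1, ..., limit - 1, one value per request."""
    n = 0
    while n < limit:
        # Execution pauses here; n is preserved until the next request
        yield n
        n += 1

gen = count_up(3)      # calling the function only creates the generator
first = next(gen)      # runs the body up to the first yield
second = next(gen)     # resumes right after the yield, with n still set
rest = list(gen)       # drains whatever is left
```

Note that `count_up(3)` by itself runs none of the body. The code only executes when a value is requested with `next()` or by a loop.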

### Feeding Epochs

The 'while True' loop gives the feeder() function the ability to continuously cycle through all the batch files. For example, if we have 100 batch files and 2 epochs, the first 100 calls to the feeder will sequentially feed the 100 batch files (see below). The 101st call will cycle back to the top of the while loop and start over again, hence starting the 2nd epoch.
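
The cycling behaviour can be checked without any real data. The sketch below uses a stand-in generator, `cycle_names` (a hypothetical name), that yields file names instead of image arrays, and three placeholder names instead of 100 batch files.

```python
def cycle_names(files):
    """Same loop structure as feeder(), but yields the file name only."""
    while True:
        for name in files:
            yield name

names = ["batch_0", "batch_1", "batch_2"]   # placeholder names
gen = cycle_names(names)

# Two epochs over three files take six requests
served = [next(gen) for _ in range(2 * len(names))]
```

The fourth request wraps around to the first name again, which corresponds to the start of the second epoch.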

### Feeding Batches within an Epoch

The 'for batch_file in files:' loop is where the batch files are fed one at a time. On each iteration, this loop selects the next batch file in the list files.

### Feeding a Batch

The 'with h5py.File(batch_file, 'r') as hf:' statement opens the current batch file, from which we read in the images and corresponding labels. The file is in HDF5 format. The images are stored in the dataset 'images' and the labels in the dataset 'labels'. HDF5 allows random access into the file using indexes.
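
As a rough sketch of that layout, the snippet below writes a tiny stand-in batch file and reads it back. The file name, array sizes, and label values are invented for illustration; only the 'images' and 'labels' names come from our batch files.

```python
import os
import tempfile

import h5py
import numpy as np

# Tiny stand-in batch: 4 RGB images of 8x8 pixels and 4 integer labels
images = np.random.randint(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
labels = np.array([0, 1, 2, 3])

path = os.path.join(tempfile.mkdtemp(), "example_batch.h5")  # hypothetical name
with h5py.File(path, "w") as hf:
    hf.create_dataset("images", data=images)
    hf.create_dataset("labels", data=labels)

with h5py.File(path, "r") as hf:
    # Indexing a dataset reads only the requested part from disk
    first_image = hf["images"][0]
    # [:] reads the whole dataset into a NumPy array
    all_labels = hf["labels"][:]
    stored_dtype = hf["images"].dtype
```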

To access the 'images' dataset, we use dictionary-style access syntax: hf['images']. Likewise for the 'labels' dataset. We then read the entire contents of each into memory as a NumPy array using the slice syntax [:]. For the image data, we normalize the pixel values by converting them from a 0 .. 255 integer range to a 0 .. 1 floating point range.

Finally, we get to the yield keyword. At this point the feeder() will return the data read into the variables X and Y and, importantly, remember where the function left off. The next time a batch is requested from the feeder(), it will pick up where it left off and continue until it reaches the yield statement again.
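
One detail worth knowing: dividing a uint8 NumPy array by a Python integer produces a 64-bit floating point array. The sketch below shows this on a small made-up array, along with one possible way to get 32-bit floats instead. Whether that is preferable depends on the model being fed.

```python
import numpy as np

pixels = np.array([[0, 128, 255]], dtype=np.uint8)   # made-up pixel values

# Same expression as in feeder(): true division promotes to float64
normalized = pixels / 255

# A possible alternative: convert first, then divide in 32-bit floats
normalized32 = pixels.astype(np.float32) / np.float32(255)
```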

### Note on Normalization

One could ask why we did not store the images in the batch files already normalized. We could have, and it would improve performance when reading in the batch files.

But it is a trade-off of speed vs. disk space. The 0 .. 255 range can be represented as an 8-bit integer, requiring only one byte per pixel in the stored file. If we pre-convert to 0 .. 1, we have a floating point number. At a minimum, we would need to store this value as a 16-bit floating point value, requiring two bytes per pixel and doubling the size of the stored file. 16-bit floats have low precision, so with many layers in the neural network we could run into numerical problems during training, such as vanishing gradients. In that case, we would need to store the values as 32-bit floating point, which would require four bytes per pixel, quadrupling the size of the stored file.
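
To put numbers on the trade-off, here is the arithmetic for one batch of 198 images of 300 x 300 pixels with 3 channels (the shape of the first batch in the output further below). Sizes are for the raw pixel values only and ignore any file format overhead or compression.

```python
# Elements in one batch: images x height x width x channels
n_values = 198 * 300 * 300 * 3

bytes_per_value = {"uint8": 1, "float16": 2, "float32": 4, "float64": 8}

# Size of the image data in MiB (2**20 bytes) for each storage type
size_mib = {name: n_values * nbytes / 2**20
            for name, nbytes in bytes_per_value.items()}

for name, mib in size_mib.items():
    print(f"{name:8s} {mib:8.2f} MiB")
```

The float64 row comes to about 407.87 MiB, which appears consistent with the 407.87 MB that pympler reports for the NumPy arrays of batch 0, since dividing by 255 produces 64-bit floats in memory.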


In [2]:
def feeder(files):
    """ A generator for feeding batches of images to a neural network
    files = list of paths to HDF5 batch files (images/labels)
    """
    # We use an infinite loop here so that the generator can be called for an unlimited number of epochs
    while True:
        # Read each batch file one at a time in sequential order
        for batch_file in files:
            # Read the batch from disk (HDF5 file)
            with h5py.File(batch_file, 'r') as hf:
                # Read in the images
                # Normalize the pixel data (convert 0..255 to 0..1)
                X = hf['images'][:] / 255 
                # Read in the corresponding labels
                Y = hf['labels'][:]
                
                # Generator - yield a tuple of X (images) and Y (labels)
                yield X, Y

### Memory Profiling

When finding the optimal balance between memory (space) and speed (performance), it's generally a good idea to do some memory profiling. We install/import the pympler module for memory profiling.

In [3]:
# Install pympler, used for memory monitoring
!pip install pympler



## Running the Feeder

We are ready now to run the feeder. I have constructed the code below for feeding the training data in batches to a neural network.

### Accessing the Batch Files

We start by getting a list of all the batch files we have stored in our training subdirectory (contents/train). We then calculate the number of batch files by taking the length of this list.
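
A slightly more defensive variant of this step is sketched below. It sorts the names so the order is the same on every run (os.listdir() returns entries in arbitrary order) and keeps only files with a given extension. The '.h5' extension and the helper name `list_batches` are assumptions for illustration; the actual batch files may be named differently.

```python
import os
import tempfile

def list_batches(directory, extension=".h5"):
    """Return the sorted paths of the files in directory ending in extension."""
    return [os.path.join(directory, name)
            for name in sorted(os.listdir(directory))
            if name.endswith(extension)]

# Demonstration on a temporary directory with empty placeholder files
demo_dir = tempfile.mkdtemp()
for name in ["batch_1.h5", "batch_0.h5", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

found = [os.path.basename(p) for p in list_batches(demo_dir)]
```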

### Running an Epoch

Next, we set the number of epochs (n_epochs) and loop through our epoch range. In each loop iteration, we feed the entire training data in batches. We do this by keeping a count of the number of batches we have run. The epoch is complete when the count equals the total number of batches (i.e., the number of batch files).

### Running a Batch

Within the epoch loop, we iterate over the generator returned by feeder(). On each iteration, the generator yields the images (X) and corresponding labels (Y) from the next batch file.
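
An alternative to counting batches by hand is to take exactly one epoch's worth of items from the generator with itertools.islice. The sketch below uses a stand-in generator (`fake_feeder`, a hypothetical name) that yields small tuples instead of image arrays.

```python
from itertools import islice

def fake_feeder(files):
    """Stand-in for feeder(): yields (file name, label) pairs forever."""
    while True:
        for i, name in enumerate(files):
            yield name, i

files = ["a", "b", "c"]      # placeholder batch names
gen = fake_feeder(files)     # one generator shared by all epochs

seen = []
for epoch in range(2):
    # islice stops after len(files) items, i.e. one full epoch
    for X, Y in islice(gen, len(files)):
        seen.append((epoch, X, Y))
```

Because the generator is created once and shared, the second epoch continues from where the first one stopped, which is the start of the file list again.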

### Garbage Collection

After each batch is processed (fed into the model), the X and Y data are no longer needed. In CPython, most objects are freed as soon as their last reference goes away, but objects caught in reference cycles are only reclaimed when the cyclic garbage collector runs, which happens periodically.

Until then, the space for such objects continues to accumulate on the heap.

If the data is processed much faster than unused memory is reclaimed, we could have memory space problems. To be safe, we explicitly call the garbage collector after each batch is processed.
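
As a point of reference, on CPython an object with no remaining references is released right away, while objects in a reference cycle wait for the collector. The sketch below illustrates the difference using a made-up class `Blob`. Automatic collection is switched off for the demonstration so the result does not depend on when the collector happens to run.

```python
import gc
import weakref

class Blob:
    """Placeholder for an object that holds batch data."""

gc.disable()                      # keep the cyclic collector from running on its own
try:
    # Case 1: no reference cycle, freed as soon as the last reference is dropped
    plain = Blob()
    plain_ref = weakref.ref(plain)
    del plain
    plain_freed_immediately = plain_ref() is None

    # Case 2: the object refers to itself, so its reference count never hits zero
    cyclic = Blob()
    cyclic.me = cyclic
    cyclic_ref = weakref.ref(cyclic)
    del cyclic
    cyclic_alive_before_collect = cyclic_ref() is not None

    gc.collect()                  # explicit collection reclaims the cycle
    cyclic_freed_after_collect = cyclic_ref() is None
finally:
    gc.enable()
```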

In [3]:
import os

# Import the garbage (memory management) module
import gc

# Import pympler.tracker for memory monitoring
from pympler import tracker
# Create object to monitor the heap
tr = tracker.SummaryTracker()

# Directory where the training batches are stored
batches = "contents/train/"

# Get a list of all the training batch files
batchlist = []
for batch in os.listdir(batches):
    batchlist.append(os.path.join(batches, batch))
    
# Number of batches
nbatches = len(batchlist)

# Number of epochs
n_epochs = 1
    
# Loop through each epoch, each time feeding the entire training set.
for epoch in range(n_epochs):
    # Iteratively call the feeder() function
    nbatch = 0
    for X, Y in feeder(batchlist):
        # Print some information so you can see that the next batch file was fed
        print("BATCH #:", nbatch)
        print("X", X.shape)
        print("Y", Y.shape)
        tr.print_diff()
        
        # HERE is where you feed the training batch data to the neural network
        
        # This line will force garbage collection of unused memory.
        gc.collect()
        
        # Stop once every batch file has been fed (end of the epoch)
        nbatch += 1
        if nbatch == nbatches:
            break
    

BATCH #: 0
X (198, 300, 300, 3)
Y (198,)
                      types |   # objects |   total size
      <class 'numpy.ndarray |           2 |    407.87 MB
               <class 'list |        9256 |    872.02 KB
                <class 'str |        9330 |    669.57 KB
                <class 'int |        1850 |     50.60 KB
               <class 'dict |           3 |      1.03 KB
  <class 'method_descriptor |          12 |    864     B
               <class 'type |           0 |    576     B
               <class 'cell |           7 |    336     B
               <class 'code |           2 |    288     B
    function (null_wrapper) |           2 |    272     B
              <class 'tuple |           4 |    256     B
            <class 'weakref |           3 |    240     B
  <class 'functools.partial |           2 |    160     B
        function (<lambda>) |           1 |    136     B
      function (store_info) |           1 |    136     B
BATCH #: 1
X (192, 300, 300, 3)
Y (192,)
      

                  types |   # objects |   total size
  <class 'numpy.ndarray |           0 |     10.30 MB
            <class 'str |           0 |     54     B
BATCH #: 19
X (196, 300, 300, 3)
Y (196,)
                  types |   # objects |   total size
  <class 'numpy.ndarray |           0 |     12.36 MB
            <class 'str |           0 |     54     B
BATCH #: 20
X (193, 300, 300, 3)
Y (193,)
                  types |   # objects |       total size
            <class 'str |           0 |         54     B
  <class 'numpy.ndarray |           0 |   -6480012     B
BATCH #: 21
X (194, 300, 300, 3)
Y (194,)
                  types |   # objects |   total size
  <class 'numpy.ndarray |           0 |      2.06 MB
            <class 'str |           0 |     54     B
BATCH #: 22
X (196, 300, 300, 3)
Y (196,)
                  types |   # objects |   total size
  <class 'numpy.ndarray |           0 |      4.12 MB
            <class 'str |           0 |     54     B
BATCH #: 23
X (193, 300, 

BATCH #: 40
X (185, 300, 300, 3)
Y (185,)
                            types |   # objects |        total size
              function (<lambda>) |          10 |           1.33 KB
                     <class 'cell |          18 |         864     B
                    <class 'tuple |          12 |         752     B
                      <class 'str |           2 |         215     B
                     <class 'code |           1 |         144     B
   function (_schedule_in_thread) |           1 |         136     B
                      <class 'int |          -1 |         -28     B
                    <class 'float |          -2 |         -48     B
                   <class 'method |          -1 |         -64     B
  <class 'tornado.ioloop._Timeout |          -1 |         -64     B
                     <class 'list |          -2 |        -144     B
        <class 'functools.partial |          -2 |        -160     B
          function (null_wrapper) |          -2 |        -272     B
      

BATCH #: 57
X (194, 300, 300, 3)
Y (194,)
        types |   # objects |   total size
  <class 'str |           0 |     54     B
BATCH #: 58
X (195, 300, 300, 3)
Y (195,)
                  types |   # objects |   total size
  <class 'numpy.ndarray |           0 |      2.06 MB
            <class 'str |           0 |     54     B
BATCH #: 59
X (190, 300, 300, 3)
Y (190,)
                      types |   # objects |        total size
               <class 'dict |           1 |         288     B
                <class 'str |           1 |         146     B
               <class 'cell |           3 |         144     B
               <class 'code |           1 |         144     B
    function (null_wrapper) |           1 |         136     B
        function (<lambda>) |           1 |         136     B
              <class 'tuple |           2 |         120     B
  <class 'functools.partial |           1 |          80     B
               <class 'list |           1 |          72     B
      <cl