# **iMaterialist Challenge (Furniture) at FGVC5**
## **TFNW Kaggle Team**

### **Get Started**

To get started, you will:



1.   Join the Competition
2.   Create a Kaggle API key
3.   Install the Kaggle Python module
4.   Download the dataset for the competition

**Join the Competition**

Go to Kaggle.com and log in.

Once logged in, go to the competition page and select the JOIN COMPETITION button:

https://www.kaggle.com/c/imaterialist-challenge-furniture-2018

**Create Kaggle API Key**

To download this dataset you will use the Kaggle API, which requires a Kaggle API key. To create your API key:



1.   Click on your Profile
2.   Select My Account
3.   Under API, select Create New API Token
4.   Download the API Key (kaggle.json)
5.   Copy the API key to:  ~/.kaggle/kaggle.json
6.   On Windows, that would be: \Users\<username>\.kaggle\kaggle.json
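
Steps 5 and 6 can also be done from Python. The following is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_key` is hypothetical, and the location of the downloaded file (for example a Downloads folder) is an assumption you will need to adjust.

```python
import shutil
import stat
from pathlib import Path

def install_kaggle_key(downloaded_key, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle/ and return the new path.

    downloaded_key - path of the kaggle.json file saved by the browser (assumed location)
    home - home directory to install into (defaults to the current user's home)
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(str(downloaded_key), str(target))
    # Restrict the key to the current user (this has little effect on Windows)
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, `install_kaggle_key(Path.home() / "Downloads" / "kaggle.json")` would place the key where the Kaggle module looks for it, assuming the browser saved it to Downloads.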

**Install the Kaggle Python Module**

C:> pip install kaggle




In [1]:
# Get the latest version of pip
!python -m pip install --upgrade pip

# Install the Kaggle module
!pip install kaggle

Requirement already up-to-date: pip in c:\users\user\appdata\local\programs\python\python35\lib\site-packages




### Download the Dataset Dictionary

The dataset consists of a dataset dictionary and the data. The dataset dictionary is a JSON file that contains the URL location and label for each image in the dataset. We need to download this first, using the Kaggle Python module:

C:> kaggle competitions download -c imaterialist-challenge-furniture-2018

This will place the data under:

Linux/Mac: ~/.kaggle/competitions/imaterialist-challenge-furniture-2018<br/>
Windows: \Users\<username>\.kaggle\competitions\imaterialist-challenge-furniture-2018
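
To avoid hard-coding a per-user path, the download location can be built from the home directory. This is a small sketch under the assumption that the files land in the directory described above; other versions of the Kaggle client may place them elsewhere. The helper names `competition_dir` and `dictionary_path` are hypothetical.

```python
from pathlib import Path

COMPETITION = "imaterialist-challenge-furniture-2018"

def competition_dir(home=None, competition=COMPETITION):
    """Directory where the competition files are assumed to be downloaded."""
    home = Path(home) if home is not None else Path.home()
    return home / ".kaggle" / "competitions" / competition

def dictionary_path(batch_type, home=None):
    """Path of train.json, validation.json or test.json."""
    return competition_dir(home) / (batch_type + ".json")

print(dictionary_path("train"))
```

The same call works on Linux, Mac and Windows because `Path.home()` resolves the user's home directory on each platform.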

In [2]:
!kaggle competitions download -c imaterialist-challenge-furniture-2018

train.json: Skipping, found more recently modified local copy (use --force to force download)
validation.json: Skipping, found more recently modified local copy (use --force to force download)
test.json: Skipping, found more recently modified local copy (use --force to force download)
sample_submission_randomlabel.csv: Skipping, found more recently modified local copy (use --force to force download)


Next, we will install and import some libraries.

In [3]:
# numpy for high-performance in-memory matrix/array storage and operations.
!pip install numpy
# h5py for storing large datasets in the HDF5 file format.
!pip install h5py
# Pillow, the Python image manipulation library (replaces PIL)
!pip install pillow
# requests for HTTP operations
!pip install requests



In [5]:
# Import numpy for high-performance in-memory matrix/array storage and operations.
import numpy as np

# Import h5py for storing large datasets in the HDF5 file format.
import h5py

# Import PIL.Image, the Python image manipulation library.
from PIL import Image

# Import json for parsing the dataset dictionary and requests for HTTP operations
import json, requests

# Import the byte and string IO library for extracting data returned (response) from HTTP requests.
from io import BytesIO, StringIO

# Import time to record timing
import time


# **Loading the Dataset (ETL)**

## **Overview**

The image data and the corresponding labels are loaded and stored in several steps. To reduce loading time and memory requirements, the process is broken into batches, which are loaded and stored in concurrent (parallel) groups.

Both the size of batches and the number of concurrent load/stores of batches is configurable. The overall process is as follows:

1. Load the Dataset Dictionary (described below) for the training (or test or validation) data.
2. Determine the number of images to load/store from the data dictionary.
3. Split the loading/storing of images into batches, based on batch size.
4. Load the groups of batches sequentially; within each group, the batches are processed (loaded and stored) concurrently.
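
The grouping in steps 3 and 4 can be sketched as follows. This is an illustration only: `plan_batch_groups` is a hypothetical helper, it counts whole batches only, and the image count is the number of training images reported later in this notebook.

```python
def plan_batch_groups(num_images, batch_size=200, concurrent=5):
    """Return a list of groups; each group lists the batch indices processed concurrently."""
    # Only whole batches are counted (integer division)
    batches = num_images // batch_size
    groups = []
    for i in range(0, batches, concurrent):
        # The last group may hold fewer than `concurrent` batches
        groups.append(list(range(i, min(i + concurrent, batches))))
    return groups

groups = plan_batch_groups(194828)
print(len(groups))   # 195
print(groups[0])     # [0, 1, 2, 3, 4]
print(groups[-1])    # [970, 971, 972, 973]
```

With 194,828 images and a batch size of 200 there are 974 whole batches, so the final group holds four batches instead of five.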

## **Dispatch**

The function load_dispatcher() handles the dispatching of loading/storing of batches. It starts by
taking the location of the dataset dictionary and loads it into memory. The dataset dictionary contains the location of each image and corresponding label.

The dataset dictionary is in json format, as follows, where [image] is a list of image locations and [annotations] is a list of corresponding labels.
 
 {<br/>
"images" : [image],<br/>
"annotations" : [annotation],<br/>
}

If the location is a URL, the dispatcher makes an HTTP request (requests.get(url)) for the dataset dictionary and takes it from the body of the response, which arrives as raw byte data. If the location is a local file, the dispatcher opens the file instead. In both cases the JSON is parsed into a Python dictionary.
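
A minimal sketch of this step, separate from the dispatcher itself, is shown here. The helper names `load_dictionary` and `url_label_pairs` and the 30 second timeout are assumptions for illustration; the `url` and `label_id` field names are the ones the notebook's loading code reads.

```python
import json

def load_dictionary(location):
    """Load a dataset dictionary from an http(s) URL or from a local JSON file."""
    if location.startswith("http"):
        # Imported here so the local-file branch works without the requests package
        import requests
        response = requests.get(location, timeout=30)
        response.raise_for_status()
        # Response.json() decodes the body and parses it as JSON
        return response.json()
    with open(location, encoding="utf-8") as f:
        return json.load(f)

def url_label_pairs(datadict):
    """Pair the first URL of each image with its label, by position in the two lists."""
    return [(img["url"][0], ann["label_id"])
            for img, ann in zip(datadict["images"], datadict["annotations"])]
```

Calling `url_label_pairs(load_dictionary(path_to_train_json))` gives a flat list of (URL, label) tuples, which is convenient for spot-checking the dictionary before starting a long load.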
 
The dispatcher calculates the number of batches based on the number of images and the batch size. The batches are then sequenced into groups (for i in range(0, batches, concurrent)), where the size of the group is the number of concurrent threads. For each group, the dispatcher creates a thread for each batch to load/store asynchronously (i.e., job). The dispatcher waits for all the concurrent jobs to complete. Based on the time to process the batch group, the dispatcher estimates the time to load/store all the remaining batches.
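
The time estimate can be reproduced by hand. The sketch below uses the same arithmetic as the dispatcher in a hypothetical helper, `estimate_remaining_minutes`; the input values (974 whole batches, group times of 542, 472 and 415 seconds) are taken from the training run logged at the end of this notebook.

```python
def estimate_remaining_minutes(batches, i, concurrent, elapse):
    """Groups counted from batch index i, times the seconds the last group took, in whole minutes."""
    seconds = int(((batches - i) / concurrent) * elapse)
    return int(seconds / 60)

batches = 194828 // 200   # 974 whole batches of training images

print(estimate_remaining_minutes(batches, 0, 5, 542))    # 1759
print(estimate_remaining_minutes(batches, 15, 5, 472))   # 1508
print(estimate_remaining_minutes(batches, 50, 5, 415))   # 1278
```

Note that the count starts at `i`, the first batch of the group that has just finished, so the estimate is high by roughly one group's worth of time.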

Note that you can set the size of the batches, the size to rescale the images to, whether to convert to grayscale, whether to normalize the pixel values, and the number of concurrent batch loads.

In [6]:
# Library for thread execution
import threading

def load_dispatcher(url, batch_type, batch_size=200, size=(300,300), grayscale=False, normalize=False, concurrent=5):
    """ Load the Data in Batches 
    url - location of data dictionary (http(s) URL or local file path)
    batch_type - training, validation or test
    batch_size - size of the batch
    size - size to rescale image to
    grayscale - flag to convert image to grayscale
    normalize - flag to scale pixel values from 0 .. 255 to 0 .. 1
    concurrent - the number of concurrent (parallel) batch loads
    """
    
    # First retrieve the dataset dictionary, which is in a JSON format. 
    # Dictionary is stored remotely: we make an HTTP request
    if url.startswith("http"):
        # The response body is bytes, not a file object, so let requests parse the JSON
        datadict = requests.get(url).json()
    # Dictionary is stored locally
    else:
        with open( url ) as f:
            datadict = json.load( f )
   
    # The number of whole batches (images left over after the last whole batch are not loaded)
    batches = len(datadict['images']) // batch_size
    
    # Sequentially load each batch group (i.e., concurrent)
    for i in range(0, batches, concurrent):
        # Start time for the batch group
        start_time = time.time()
        
        # List of threads, corresponding to the processing of each batch in the batch group
        threads = []
        # Create and start a processing thread for each batch in the batch group.
        # The last group may hold fewer than `concurrent` batches.
        for j in range(min(concurrent, batches - i)):
            t = threading.Thread(target=load_and_store_batch, args=(datadict, batch_type, i + j, batch_size, size, grayscale, normalize, ))
            # Keep track (remember) of the thread
            threads.append(t)
            # Start the thread
            t.start()
        # Wait for all threads in the group to complete
        for t in threads:
            t.join()
                  
        # Calculate elapsed time in seconds to load this batch group
        elapse = int(time.time() - start_time)
            
        # Estimate remaining time in minutes for loading the remaining batches
        # (approximate: the count still includes the group that just finished).
        remaining = int( ( ( batches - i ) / concurrent ) * elapse ) / 60
        
        print("Remaining time %d mins" % remaining)

## **Load and Store the Data Batches**

We have a large number of images: 194,828 training, 6,400 validation, 12,800 test. If we tried to load them all at once, we would need about 54GB of memory!
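
The 54GB figure is consistent with a back-of-the-envelope calculation. The sketch below assumes every image is held as an uncompressed 300 by 300 RGB array with one byte per channel, which is the default output of the loader in this notebook; that the figure was derived this way is an assumption.

```python
# Image counts quoted above
counts = {"train": 194828, "validation": 6400, "test": 12800}

# Assumed in-memory layout: 300 x 300 pixels, 3 channels, 1 byte per channel
width, height, channels = 300, 300, 3
bytes_per_image = width * height * channels   # 270,000 bytes

total_images = sum(counts.values())           # 214,028 images
total_bytes = total_images * bytes_per_image

print("%.1f GiB" % (total_bytes / 2**30))     # 53.8 GiB
```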

Instead, we will load the data into smaller training (validation and test) batches and store them separately on disk. We use the loaded data dictionary and sequentially move through it (batch_size at a time) to build the batches and save each one as an HDF5 file.

In [7]:
# Library for building file paths that work on any operating system
import os

def load_and_store_batch(datadict, batch_type, pos, batch_size, size, grayscale, normalize):
    """ Process loading (extraction), handling (transformation) and storing (loading) as a batch 
    datadict - data image/label dictionary
    batch_type - training, validation or test
    pos - the batch slice position in the data (i.e., the first, the second, etc)
    batch_size - size of the batch
    size - size to rescale image to
    grayscale - flag to convert image to grayscale
    normalize - flag to scale pixel values from 0 .. 255 to 0 .. 1
    """
    start_time = time.time()
    
    start = pos * batch_size
    images, labels = load_batch(datadict, start, batch_size, size, grayscale, normalize )
        
    # Calculate elapsed time in seconds to load this batch
    elapse = int(time.time() - start_time)
        
    print("Batch Loaded %d: %d secs" % (pos, elapse))
        
    # Write the batch to disk as one HDF5 file holding both the images and their labels
    path = os.path.join('contents', batch_type, 'images' + str(pos) + '.h5')
    with h5py.File(path, 'w') as hf:
        hf.create_dataset("images",  data=images)
        hf.create_dataset("labels",  data=labels)

## **Extraction / Dataset Dictionary**

The load_batch() function uses the location of each image specified in a slice of the dataset dictionary to load the images into memory and transform them, as a batch. The caller, load_and_store_batch(), then stores the batch on disk. The slice is defined by a start position in the dataset dictionary and a length, denoted by batch_size.

## **Transform / Load**
 
The images are a mix of grayscale, RGB and RGBA (alpha channel) images, and are of different pixel sizes. For the neural network, they all need to be the same size. We will rescale each of our images to be 300 by 300 pixels (default), but you can choose another scale with the size parameter. The image data will then be packed into a high-performance numpy 3D array. The rows and columns are the height and width (300,300) and the third dimension is the channels (3).

If the parameter 'grayscale' is True, all the images are converted to grayscale (single channel); otherwise all the grayscale and RGBA images are converted to RGB.

If the parameter 'normalize' is True, all the pixel values are converted from 0 .. 255 (int) to 0 .. 1 (float). *Warning* - non-normalized, the pixels are stored as 8-bit integers. If normalized, they are stored as 64-bit floating point, and the file will be 8 times as big.
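
The size difference is easy to confirm with numpy. This is a small illustration using an all-zero array as a stand-in for one resized RGB image; the float32 variant at the end is a hypothetical alternative that this notebook does not use.

```python
import numpy as np

# Stand-in for one resized RGB image: 300 x 300 pixels, 3 channels, 8-bit integers
image = np.zeros((300, 300, 3), dtype=np.uint8)

# True division promotes the 8-bit integers to 64-bit floats
normalized = image / 255

print(image.dtype, image.nbytes)             # uint8 270000
print(normalized.dtype, normalized.nbytes)   # float64 2160000

# Hypothetical alternative (not used in this notebook): float32 halves the normalized size
smaller = normalized.astype(np.float32)
print(smaller.dtype, smaller.nbytes)         # float32 1080000
```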

The images will be stored in the list images[] and the corresponding labels in the list labels[].

The load_batch() function loads a batch of images from the training, validation or test data and applies the transformations. The images and corresponding labels are then returned as two lists.

In [8]:
timeout = 6   # Timeout (seconds) for reading the image from the web
retries = 2   # Number of times to try reading the image over the network

def load_batch(datadict, start, batch_size, size, grayscale, normalize):
    """ Load a batch of images and labels
    datadict - data image/label dictionary
    start - index to start reading batch of images
    batch_size - number of images to read (None = all images)
    size - size to rescale image to
    grayscale - flag if image should be converted to grayscale
    normalize - flag if pixel values should be scaled from 0 .. 255 to 0 .. 1
    """
    
    images = [] # List containing the images
    labels = [] # List containing the corresponding labels for the images
    
    # Number of images to load
    total = len(datadict['images'])
    if batch_size is None:
        batch_size = total
      
    # Target PIL mode: 'L' is single-channel grayscale, 'RGB' is three channels.
    # ('LA' would add an alpha channel, so the array would not be 2D.)
    # Converting every image to the target mode also handles grayscale,
    # palette and RGBA (4 channel) source images.
    mode = 'L' if grayscale else 'RGB'
        
    not_loaded = 0 # Number of images that failed to load in the batch
            
    # Load the batch of images/labels from the Data Dictionary (never read past the end)
    end = min(start + batch_size, total)
    for i in range(start, end): 
        image_url = datadict['images'][i]['url'][0]
        label_id  = datadict['annotations'][i]['label_id']

        # Try to read the image over the network up to retries number of times
        image = None
        for attempt in range(retries):
            # Download, convert and resize the image, then turn it into an array
            try:
                # Make HTTP request for the image data
                response = requests.get(image_url, timeout=timeout)

                # Use the PIL.Image library to decode the image data as an uncompressed RGB or grayscale bitmap
                pixels = Image.open(BytesIO(response.content)).convert(mode)

                # Resize the images to all be the same size
                pixels = pixels.resize(size, resample=Image.LANCZOS)

                # Load the image into a numpy array: 2D for grayscale, 3D for RGB
                image = np.asarray(pixels)
                break
            except Exception:
                # Download or decode failed: try again until the attempts run out
                #print("CAN'T FETCH IMAGE", image_url)
                continue
                
        # Skip the image if every attempt failed (counted once)
        if image is None:
            not_loaded += 1
            continue
            
        # Normalize the image (convert pixel values from int range 0 .. 255 to float range 0 .. 1)
        if normalize:
            image = image / 255
            
        # add image to images list
        images.append( image )
        # add corresponding label to labels list
        labels.append( label_id )
        
        if (i+1) % 50 == 0:
            print('%d Images added, %d not loaded' % ((i + 1), not_loaded))

    return images, labels
        

## Execute the Load

The paths below are for my laptop. You will need to change them to the location on your laptop.

Running as 5 concurrent threads in batches of 200 takes about half a day on my laptop with my local Internet service.

In [None]:
# Create directories for the HDF5-encoded batches
!mkdir contents
!mkdir contents\\train
!mkdir contents\\validation
!mkdir contents\\test

# Data dictionaries
train_url      = 'C:\\Users\\User\\.kaggle\\competitions\\imaterialist-challenge-furniture-2018\\train.json'
test_url       = 'C:\\Users\\User\\.kaggle\\competitions\\imaterialist-challenge-furniture-2018\\test.json'
validation_url = 'C:\\Users\\User\\.kaggle\\competitions\\imaterialist-challenge-furniture-2018\\validation.json'

# Load the Training Batches
load_dispatcher(train_url, "train")

# Load the Validation Batches
load_dispatcher(validation_url, "validation")

# Load the Test Batches
load_dispatcher(test_url, "test")

A subdirectory or file contents already exists.
A subdirectory or file contents\\train already exists.
A subdirectory or file contents\\validation already exists.
A subdirectory or file contents\\test already exists.


850 Images added, 1 not loaded
450 Images added, 1 not loaded
50 Images added, 1 not loaded
250 Images added, 2 not loaded
650 Images added, 2 not loaded
100 Images added, 1 not loaded
900 Images added, 3 not loaded
500 Images added, 2 not loaded
950 Images added, 3 not loaded
700 Images added, 4 not loaded
150 Images added, 1 not loaded
550 Images added, 2 not loaded
300 Images added, 4 not loaded
200 Images added, 1 not loaded
Batch Loaded 0: 228 secs
350 Images added, 5 not loaded
600 Images added, 4 not loaded
Batch Loaded 2: 272 secs
750 Images added, 6 not loaded
800 Images added, 9 not loaded
Batch Loaded 3: 444 secs
1000 Images added, 6 not loaded
Batch Loaded 4: 448 secs
400 Images added, 8 not loaded
Batch Loaded 1: 542 secs
Remaining time 1759 mins
1250 Images added, 0 not loaded
1450 Images added, 1 not loaded
1650 Images added, 3 not loaded
1050 Images added, 2 not loaded
1850 Images added, 1 not loaded
1300 Images added, 0 not loaded
1500 Images added, 1 not loaded
...


3850 Images added, 4 not loaded
3600 Images added, 5 not loaded
Batch Loaded 17: 302 secs
3150 Images added, 2 not loaded
3300 Images added, 5 not loaded
3800 Images added, 3 not loaded
Batch Loaded 18: 364 secs
3900 Images added, 4 not loaded
3200 Images added, 3 not loaded
Batch Loaded 15: 383 secs
3350 Images added, 6 not loaded
3950 Images added, 4 not loaded
3400 Images added, 6 not loaded
Batch Loaded 16: 440 secs
4000 Images added, 4 not loaded
Batch Loaded 19: 472 secs
Remaining time 1508 mins
4650 Images added, 0 not loaded
4250 Images added, 0 not loaded
4450 Images added, 1 not loaded
4050 Images added, 0 not loaded
4850 Images added, 4 not loaded
4700 Images added, 1 not loaded
4100 Images added, 0 not loaded
4500 Images added, 2 not loaded
4750 Images added, 2 not loaded
4900 Images added, 9 not loaded
4150 Images added, 0 not loaded
4550 Images added, 5 not loaded
4300 Images added, 1 not loaded
4800 Images added, 4 not loaded
Batch Loaded 23: 314 secs
...


10750 Images added, 3 not loaded
10350 Images added, 5 not loaded
10200 Images added, 2 not loaded
Batch Loaded 50: 282 secs
10950 Images added, 5 not loaded
10550 Images added, 5 not loaded
10800 Images added, 4 not loaded
Batch Loaded 53: 322 secs
10400 Images added, 7 not loaded
Batch Loaded 51: 334 secs
11000 Images added, 7 not loaded
Batch Loaded 54: 380 secs
10600 Images added, 8 not loaded
Batch Loaded 52: 415 secs
Remaining time 1278 mins
11450 Images added, 2 not loaded
11050 Images added, 0 not loaded
11650 Images added, 1 not loaded
11850 Images added, 3 not loaded
11500 Images added, 3 not loaded
11250 Images added, 2 not loaded
11100 Images added, 2 not loaded
11700 Images added, 2 not loaded
11150 Images added, 3 not loaded
11300 Images added, 2 not loaded
11200 Images added, 4 not loaded
Batch Loaded 55: 328 secs
11750 Images added, 4 not loaded
11350 Images added, 5 not loaded
11900 Images added, 5 not loaded
11400 Images added, 5 not loaded
Batch Loaded 56: 442 secs
1