# **iMaterialist Challenge (Furniture) at FGVC5**
## **TFNW Kaggle Team**

### **Get Started**

To get started, you will:



1.   Join the Competition
2.   Create a Kaggle API key
3.   Install the Kaggle Python module
4.   Download the dataset for the competition

**Join the Competition**

Go to Kaggle.com and log in.

Once logged in, go to the competition page and select the JOIN COMPETITION button:

https://www.kaggle.com/c/imaterialist-challenge-furniture-2018

**Create Kaggle API Key**

You will download this dataset with the Kaggle API, which requires a Kaggle API key. To create your API key:



1.   Click on your Profile
2.   Select My Account
3.   Under API, Select Create New API Token
4.   Download the API Key (kaggle.json)
5.   Copy the API key to:  ~/.kaggle/kaggle.json
6.   On Windows, that would be: \Users\<username>\.kaggle\kaggle.json
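
Steps 5 and 6 can also be done from Python. The following is a minimal sketch, not part of the original workflow: the helper name `install_kaggle_key` and its parameters are hypothetical, and it assumes you know where your browser saved `kaggle.json`. It copies the key into the `.kaggle` folder of the home directory and restricts the file to the current user (the permission change has little effect on Windows).

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_key(downloaded_key, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle/kaggle.json.

    downloaded_key - path to the kaggle.json file saved by the browser
    home - home directory to install into (default: the current user's)
    """
    home = Path(home) if home is not None else Path.home()

    # Create the .kaggle folder if it does not exist yet
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)

    # Copy the key into place
    target = target_dir / "kaggle.json"
    shutil.copyfile(downloaded_key, target)

    # Make the key readable and writable by the owner only
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, if the key was saved to a Downloads folder (an assumption about your browser settings), you could call `install_kaggle_key(Path.home() / "Downloads" / "kaggle.json")`.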

**Install the Kaggle Python Module**

C:> pip install kaggle




In [None]:
# Get the latest version of pip
!python -m pip install --upgrade pip

# Install the Kaggle module
!pip install kaggle



### Download the Dataset Dictionary

The dataset consists of a dataset dictionary and the data. The dataset dictionary is a JSON file that contains the URL location and label for each image in the dataset. We need to download this first using the Kaggle Python module:

C:> kaggle competitions download -c imaterialist-challenge-furniture-2018

This will place the data under:

Linux/Mac: ~/.kaggle/competitions/imaterialist-challenge-furniture-2018<br/>
Windows: \Users\<username>\.kaggle\competitions\imaterialist-challenge-furniture-2018

In [None]:
!kaggle competitions download -c imaterialist-challenge-furniture-2018
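
Rather than typing the download location by hand, the path can be built from the home directory. This is a small sketch under the assumption that the files were saved in the location listed above; the helper names `competition_dir` and `dictionary_path` are hypothetical, and the file names `train.json`, `validation.json` and `test.json` are the ones used later in this notebook.

```python
import os

# Competition name as used in the download command
COMPETITION = "imaterialist-challenge-furniture-2018"


def competition_dir(home=None):
    """Return the folder where the competition files are expected."""
    if home is None:
        home = os.path.expanduser("~")
    return os.path.join(home, ".kaggle", "competitions", COMPETITION)


def dictionary_path(name, home=None):
    """Return the path of a dataset dictionary, e.g. name='train'."""
    return os.path.join(competition_dir(home), name + ".json")


print(dictionary_path("train"))
```

Because the path is joined with `os.path.join`, the same code works on Windows, Linux and Mac.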

Next, we will install and import some libraries.

In [None]:
# numpy for high performance in-memory matrix/array storage and operations.
!pip install numpy
# h5py for high performance storage of big data in HDF5 files.
!pip install h5py
# Pillow, the Python image manipulation library (a maintained fork of PIL)
!pip install pillow
# requests for HTTP operations
!pip install requests

In [None]:
# Import numpy for the high performance in-memory matrix/array storage and operations.
import numpy as np

# Import h5py for high performance storage of big data in HDF5 files.
import h5py

# Import PIL.Image, the Python image manipulation library.
from PIL import Image

# Import json for parsing the dataset dictionary and requests for HTTP operations
import json, requests

# Import the Byte and String IO library for extracting data returned (response) from HTTP requests.
from io import BytesIO, StringIO

# Import time to record timing
import time


# **Loading the Dataset (ETL)**

**Extraction / Dataset Dictionary**

The load_batches() function takes the location of the dataset dictionary and loads it into memory. Together with load_batch(), it then uses the location of each image listed in the dictionary to load the images into memory in batches, transform them, and store the batches on disk.

The dataset dictionary is in JSON format, as follows, where [image] is a list of image locations and [annotation] is a list of corresponding labels.
 
 {<br/>
"images" : [image],<br/>
"annotations" : [annotation],<br/>
}
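
To make the structure concrete, here is a small sketch that pairs each image URL with its label. The field names `url` (a list of URLs) and `label_id` follow how load_batch() indexes the dictionary further down; the sample URLs and label values are made-up placeholders, and the helper name `image_label_pairs` is hypothetical.

```python
import json

# A tiny stand-in for the dataset dictionary (placeholder URLs and labels)
sample = json.loads("""
{
  "images": [
    {"url": ["http://example.com/a.jpg"]},
    {"url": ["http://example.com/b.jpg"]}
  ],
  "annotations": [
    {"label_id": 5},
    {"label_id": 12}
  ]
}
""")


def image_label_pairs(datadict):
    """Return a list of (first image URL, label) tuples."""
    pairs = []
    for image, annotation in zip(datadict["images"], datadict["annotations"]):
        pairs.append((image["url"][0], annotation["label_id"]))
    return pairs


pairs = image_label_pairs(sample)
print(pairs)
```

This relies on the two lists being in the same order, which is also what load_batch() assumes when it uses one index for both.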
 
When the dataset dictionary is stored remotely, the load_batches() function makes an HTTP request (requests.get(url)) for it and takes the dictionary from the contents of the response (requests.get(url).content). The function wraps the contents as raw byte data (BytesIO) and parses the dataset dictionary as JSON (json.load). When the dictionary is stored locally, it is read directly from the file.

The load_batches() function then splits the image and corresponding label extraction into batches, where the size of each batch is specified by the batch_size parameter.
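
The index arithmetic behind the batches can be sketched as follows. The helper `batch_ranges` is hypothetical and only illustrates the idea; note that this sketch includes a final, smaller batch when the total is not a multiple of batch_size.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover `total` items in order."""
    for start in range(0, total, batch_size):
        # Clamp the end so the last batch stops at the final item
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(450, 200))
print(ranges)  # [(0, 200), (200, 400), (400, 450)]
```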

**Transform / Load**
 
The images are all RGB images, but they have different pixel sizes. For the neural network, they all need to be the same size. We will rescale each of our images to 300 by 300 pixels (the default), but you can choose another scale with the size parameter. The image data will then be packed into a high performance numpy 3D array. The first two dimensions are the height and width (300, 300) and the third dimension is the channels (3).
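
The transform step can be tried on its own without downloading anything. In this sketch a synthetic solid-color image stands in for a downloaded photo, and the helper name `to_array` is hypothetical. Unlike load_batch(), the sketch calls convert("RGB") first so that the result always has three channels; that is an addition for illustration only.

```python
import numpy as np
from PIL import Image


def to_array(image, size=(300, 300)):
    """Resize a PIL image and return it as a (height, width, 3) array."""
    # Force three channels (assumption: not done in load_batch)
    image = image.convert("RGB")
    # PIL takes the size as (width, height)
    image = image.resize(size, resample=Image.LANCZOS)
    # numpy reports the shape as (height, width, channels)
    return np.asarray(image)


# A synthetic 640x480 image in place of a downloaded photo
source = Image.new("RGB", (640, 480), color=(200, 30, 30))
array = to_array(source)
print(array.shape, array.dtype)  # (300, 300, 3) uint8
```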

The images will be stored in the list images[] and the corresponding labels in the list labels[].

The load_batch() function loads a batch of images from the training, validation or test data and applies the transform. The images and corresponding labels are then returned as two lists.

In [None]:
def load_batch(datadict, start, batch_size=200, size=(300,300), grayscale=False ):
    """ Load one batch of images and labels
    datadict - data image/label dictionary
    start - index to start reading batch of images
    batch_size - number of images to read (None = all images)
    size - (width, height) to rescale each image to
    grayscale - flag if image should be converted to grayscale
    """
    
    images = [] # List containing the images
    labels = [] # List containing the corresponding labels for the images
    
    # Number of images to load
    if batch_size is None:
        batch_size = len(datadict['images'])
      
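    # Expected array shape. PIL sizes are (width, height) but numpy shapes are
    # (height, width), so the two values are swapped.
    if grayscale:
        # Final shape of image Height, Width
        shape = (size[1], size[0])
    else:
        # Final shape of image Height, Width, Channels(3)
        shape = (size[1], size[0], 3)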
            
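    # Load the batch of images/labels from the Data Dictionary
    # (the end index is clamped so the batch cannot run past the last image)
    end = min(start + batch_size, len(datadict['images']))
    not_loaded = 0  # Number of images in this batch that failed to load
    for i in range(start, end): 
        image_url = datadict['images'][i]['url'][0]
        label_id  = datadict['annotations'][i]['label_id']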

        # Download, resize and convert images to arrays
        try:
            # Make HTTP request for the image data
            response = requests.get(image_url, timeout=5)
            
            # Use the PIL.Image library to load the image data as an uncompressed RGB or Grayscale bitmap
            if grayscale:
                # 'L' is 8-bit single-channel grayscale, giving a (height, width) array
                image = Image.open(BytesIO(response.content)).convert('L')
            else:
                image = Image.open(BytesIO(response.content))
            
            # Resize the image to be all the same size
            image = image.resize(size, resample=Image.LANCZOS)
            
            # Load the image into a 3D numpy array
            image = np.asarray(image)
            
            # Discard image if it does not fit the final shape
            assert image.shape == shape
        except Exception as ex:
            not_loaded += 1
            continue
        
        # add image to images list
        images.append( image )
        # add corresponding label to labels list
        labels.append( label_id )
        
        # Report progress after every 50th image in the batch
        if (i - start + 1) % 50 == 0:
            print('%d Images added, %d not loaded' % (len(images), not_loaded))

    return images, labels
        

### Load and Store the Data Batches

We have a large number of images: 194,828 training, 6,400 validation, 12,800 test. If we tried to load them all at once as 300 by 300 RGB arrays, we would need about 54GB of memory! 

Instead, we will load the data in smaller batches and store them separately on disk. We use the loaded data dictionary and move through it sequentially (batch_size images at a time) to build the batches and save them as HDF5 files, a high performance file format.
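
The memory figure can be checked with a few lines of arithmetic. This sketch assumes every image is stored as a 300 by 300 array with 3 channels and one byte per channel value (uint8); the helper name `dataset_bytes` is hypothetical.

```python
def dataset_bytes(num_images, size=(300, 300), channels=3, bytes_per_value=1):
    """Bytes needed to hold num_images uncompressed images in memory."""
    width, height = size
    return num_images * width * height * channels * bytes_per_value


# Image counts quoted above
counts = {"train": 194828, "validation": 6400, "test": 12800}

total_images = sum(counts.values())
total_bytes = dataset_bytes(total_images)

print("%d images, %.1f GiB" % (total_images, total_bytes / 2**30))
# 214028 images, 53.8 GiB
```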

Note, you can set the size of the batches, the size to rescale the images to, and whether to convert to grayscale.

In [None]:
import os

def load_batches(url, batch_type, batch_size=200, size=(300,300), grayscale=False):
    """ Load the Data in Batches 
    url - location of data dictionary
    batch_type - training, validation or test
    batch_size - size of the batch
    size - size to rescale image to
    grayscale - flag to convert image to grayscale
    """
    
    # First retrieve the dataset dictionary, which is in a JSON format. 
    # Dictionary is stored remotely: We will make an HTTP request
    if url.startswith("http"):
        # json.load needs a file-like object, so wrap the raw bytes in BytesIO
        datadict = json.load( BytesIO( requests.get(url).content ) )
    # Dictionary is stored locally
    else:
        with open( url ) as f:
            datadict = json.load( f )
   
    # The number of batches, rounded up so a final partial batch is not dropped
    total = len(datadict['images'])
    batches = (total + batch_size - 1) // batch_size
    
    # Directory for this batch type, e.g. contents\train on Windows
    out_dir = os.path.join('contents', batch_type)
    
    # Sequentially Load each batch
    for i in range(batches):
        start_time = time.time()
        
        start = i * batch_size
        # The last batch may hold fewer than batch_size images
        count = min(batch_size, total - start)
        images, labels = load_batch(datadict, start, count, size, grayscale )
        
        # Calculate elapsed time in seconds to load this batch
        elapse = int(time.time() - start_time)
        # Estimate remaining time in minutes for loading remaining batches.
        remaining = int( (batches - i - 1) * elapse / 60 )
        
        print("Batch Loaded %d: %d secs, remaining %d mins" % (i, elapse, remaining))
        
        # Write the batch to disk as HDF5 files
        with h5py.File(os.path.join(out_dir, 'images' + str(i) + '.h5'), 'w') as hf:
            hf.create_dataset("images",  data=images)
        with h5py.File(os.path.join(out_dir, 'labels' + str(i) + '.h5'), 'w') as hf:
            hf.create_dataset("labels",  data=labels)
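
Once the batches are on disk, a single batch can be read back when it is needed for training. This is a minimal sketch, assuming the batches were written as `images<i>.h5` and `labels<i>.h5` inside a folder per batch type, with dataset names `images` and `labels` as in load_batches(). The helper name `read_batch` is hypothetical.

```python
import os

import h5py


def read_batch(directory, index):
    """Read images<index>.h5 and labels<index>.h5 from `directory`."""
    images_path = os.path.join(directory, "images%d.h5" % index)
    labels_path = os.path.join(directory, "labels%d.h5" % index)

    # Slicing with [:] copies the whole dataset into a numpy array
    with h5py.File(images_path, "r") as hf:
        images = hf["images"][:]
    with h5py.File(labels_path, "r") as hf:
        labels = hf["labels"][:]

    return images, labels
```

For example, `images, labels = read_batch(os.path.join("contents", "train"), 0)` would return the first training batch, with images as a 4D array (batch, height, width, channels) for RGB data.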

The paths below are for my laptop. You will need to change them to the location of the files on your machine.

In [None]:
# Create Directories for the HDF5 encoded batches
!mkdir contents
!mkdir contents\\train
!mkdir contents\\validation
!mkdir contents\\test

# Data dictionaries
train_url      = 'C:\\Users\\User\\.kaggle\\competitions\\imaterialist-challenge-furniture-2018\\train.json'
test_url       = 'C:\\Users\\User\\.kaggle\\competitions\\imaterialist-challenge-furniture-2018\\test.json'
validation_url = 'C:\\Users\\User\\.kaggle\\competitions\\imaterialist-challenge-furniture-2018\\validation.json'

# Load the Training Batches
load_batches(train_url, "train")

# Load the Validation Batches
load_batches(validation_url, "validation")

# Load the Test Batches
load_batches(test_url, "test")