# Data Exploration

The data provided for this challenge is split as follows:

 + Training set: 50,000 samples
 + Validation set: 50,000 samples
 + Test set: 100,000 samples

There are 1,000 possible labels, and each image can have up to 5 objects in it.

### Sample images

In [3]:
sample_images_path = 'sample-images/'
sample_images = ['n01440764_14034.JPEG', 'n01440764_7154.JPEG', 'n01443537_15790.JPEG', 'n01443537_3885.JPEG']

In [4]:
from IPython.display import display, HTML

html = ["<td><img src='{}' /></td>".format(sample_images_path + img) for img in sample_images]
display(HTML("<table><tr>{}</tr></table>".format(''.join(html))))

### Labels

The training and validation sets are labeled such that each sample image has between zero and five bounding boxes, each describing where an object is in the image and what that object is. Each sample is labeled in the following format:

 + ImageId - the ID of the sample image
 + Width - width of the image (in pixels)
 + Height - height of the image (in pixels)
 + PredictionString - a set of zero or more groups of five values, representing:
     + synsetId - the synset ID of the object in the image
     + x_min - left coordinate of the top left corner (in pixels) of the bounding box around the object in the image
     + y_min - top coordinate of the top left corner (in pixels) of the bounding box around the object in the image
     + x_max - right coordinate of the bottom right corner (in pixels) of the bounding box around the object in the image
     + y_max - bottom coordinate of the bottom right corner (in pixels) of the bounding box around the object in the image
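
The format described above can be unpacked with a few lines of Python. This is a minimal sketch, not code from the original notebook: the `Box` type and the `parse_prediction_string()` helper are hypothetical names, and the sketch assumes the values are separated by whitespace, as in the sample rows shown in this notebook.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One labeled object: its synset ID plus the pixel corners of its box."""
    synset_id: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def parse_prediction_string(prediction_string: str) -> List[Box]:
    """Split a PredictionString into groups of five values (hypothetical helper)."""
    tokens = prediction_string.split()
    if len(tokens) % 5 != 0:
        raise ValueError(
            'expected groups of five values, got {} tokens'.format(len(tokens)))
    boxes = []
    for i in range(0, len(tokens), 5):
        # First token is the synset ID, the next four are pixel coordinates.
        synset_id, *coords = tokens[i:i + 5]
        boxes.append(Box(synset_id, *(int(c) for c in coords)))
    return boxes


# An empty string yields an empty list, matching the "zero or more" case.
print(parse_prediction_string('n02017213 115 49 448 294'))
```

Note that pandas may load a missing PredictionString as `NaN` rather than an empty string, so a real pipeline would likely need to handle that case before calling the helper.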

In [5]:
import pandas as pd

In [6]:
loc_train_solution = pd.read_csv('LOC_train_solution.csv')
loc_val_solution = pd.read_csv('LOC_val_solution.csv')
loc_synset_mapping_path = 'LOC_synset_mapping.txt'

In [10]:
loc_train_solution.head()

| Unnamed: 0 | ImageId | PredictionString |
|---|---|---|
| 0 | n02017213_7894 | n02017213 115 49 448 294 |
| 1 | n02017213_7261 | n02017213 91 42 330 432 |
| 2 | n02017213_5636 | n02017213 230 104 414 224 |
| 3 | n02017213_6132 | n02017213 46 82 464 387 |
| 4 | n02017213_7659 | n02017213 103 66 331 335 |


In [11]:
loc_val_solution.head()

| Unnamed: 0 | ImageId | PredictionString |
|---|---|---|
| 0 | ILSVRC2012_val_00048981 | n03995372 85 1 499 272 |
| 1 | ILSVRC2012_val_00037956 | n03481172 131 0 499 254 |
| 2 | ILSVRC2012_val_00026161 | n02108000 38 0 464 280 |
| 3 | ILSVRC2012_val_00026171 | n03109150 0 14 216 299 |
| 4 | ILSVRC2012_val_00008726 | n02119789 255 142 454 329 n02119789 44 21 322 ... |


### Bounding boxes as percentages

Since we'll be using TensorFlow for this, we'll use [tf.image.draw_bounding_boxes](https://www.tensorflow.org/api_docs/python/tf/image/draw_bounding_boxes) to see the bounding boxes during training and at inference. That API expects each box in the order `[y_min, x_min, y_max, x_max]`, with float values in `[0.0, 1.0]` relative to the height and width of the image. We'll therefore need to reorder the labeled values and convert them from absolute pixels to fractions of the image size.
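
As an illustration of that conversion, here is a minimal sketch in plain Python. The helper name `to_normalized_box()` is hypothetical, and the image size used in the example call is an assumption, since the label rows shown earlier do not include the width and height. The output order is `[y_min, x_min, y_max, x_max]`, which is the order `tf.image.draw_bounding_boxes` documents for its `boxes` argument.

```python
def to_normalized_box(x_min, y_min, x_max, y_max, width, height):
    """Convert a pixel box to [y_min, x_min, y_max, x_max] fractions of the image size.

    Hypothetical helper: width and height are the image dimensions in pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError('width and height must be positive')
    return [
        y_min / height,
        x_min / width,
        y_max / height,
        x_max / width,
    ]


# First training row (115 49 448 294), with an ASSUMED image size of 500x375.
print(to_normalized_box(115, 49, 448, 294, width=500, height=375))
```

The sketch does not clamp values, so a box that extends past the image edge would produce a value outside `[0.0, 1.0]`.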

### Synset mapping

The labels above are synset IDs, which we'll want to map to their corresponding English names. The function `parse_synset_mapping()` builds a dict that maps each synset ID to the list of English names for that synset. Since we'll also need to convert each synset ID to an integer, we generate a second map for that, along with its inverse.

In [13]:
def parse_synset_mapping(path):
    """Parse the synset mapping file into a dictionary mapping <synset_id>:[<synonyms in English>]
    This assumes an input file formatted as:
        <synset_id> <category>, <synonym...>
    Example:
        n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
    """
    synset_map = {}
    with open(path, 'r') as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # The synset ID is everything before the first space;
            # the rest is a comma-separated list of English names.
            synset_id, _, labels = line.partition(' ')
            synset_map[synset_id] = [label.strip() for label in labels.split(',')]
    return synset_map

def generate_synset_to_int_mapping(synset_mapping):
    """Map each synset ID to an integer index, in file order (dicts keep insertion order in Python 3.7+)."""
    return {key: index for index, key in enumerate(synset_mapping)}

def generate_int_to_synset_mapping(synset_mapping):
    """Map each integer index back to its synset ID (the inverse of the mapping above)."""
    return {index: key for index, key in enumerate(synset_mapping)}

In [14]:
synset_mapping = parse_synset_mapping(loc_synset_mapping_path)
synset_to_int = generate_synset_to_int_mapping(synset_mapping)
int_to_synset = generate_int_to_synset_mapping(synset_mapping)

In [83]:
len(synset_mapping)

1000

In [8]:
for index, (key, val) in enumerate(synset_mapping.items()):
    if (index > 5):
        break
    print('{} {}'.format(key, val))

n01440764 ['tench', 'Tinca tinca']
n01443537 ['goldfish', 'Carassius auratus']
n01484850 ['great white shark', 'white shark', 'man-eater', 'man-eating shark', 'Carcharodon carcharias']
n01491361 ['tiger shark', 'Galeocerdo cuvieri']
n01494475 ['hammerhead', 'hammerhead shark']
n01496331 ['electric ray', 'crampfish', 'numbfish', 'torpedo']


In [12]:
for index, (key, val) in enumerate(synset_to_int.items()):
    if (index > 5):
        break
    print('{} {}'.format(key, val))

n01440764 0
n01443537 1
n01484850 2
n01491361 3
n01494475 4
n01496331 5


In [15]:
for index, (key, val) in enumerate(int_to_synset.items()):
    if (index > 5):
        break
    print('{} {}'.format(key, val))

0 n01440764
1 n01443537
2 n01484850
3 n01491361
4 n01494475
5 n01496331
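
To show how the maps fit together, the sketch below turns an integer class index back into a readable name. It is only an illustration: `describe_class()` is a hypothetical helper, and the two `sample_*` dicts are small stand-ins copied from the output printed above rather than the full 1,000-entry maps.

```python
# Small stand-ins for the full maps, copied from the printed output above.
sample_synset_mapping = {
    'n01440764': ['tench', 'Tinca tinca'],
    'n01443537': ['goldfish', 'Carassius auratus'],
}
sample_int_to_synset = {0: 'n01440764', 1: 'n01443537'}


def describe_class(class_index, int_to_synset, synset_mapping):
    """Return the first English name and the synset ID for a class index (hypothetical helper)."""
    synset_id = int_to_synset[class_index]
    return '{} ({})'.format(synset_mapping[synset_id][0], synset_id)


print(describe_class(1, sample_int_to_synset, sample_synset_mapping))
```

Using the first entry of each name list as the display name is a design choice for this sketch; the remaining entries are synonyms for the same synset.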
