# Data Exploration

The data provided for this challenge is split into three sets:

 + Training set: 50,000 samples
 + Validation set: 50,000 samples
 + Test set: 100,000 samples

There are 1,000 possible labels, and each image can contain up to 5 objects.

### Sample images

In [3]:
sample_images_path = 'sample-images/'
sample_images = ['n01440764_14034.JPEG', 'n01440764_7154.JPEG', 'n01443537_15790.JPEG', 'n01443537_3885.JPEG']

In [4]:
from IPython.display import display, HTML

html = ["<td><img src='{}' /></td>".format(sample_images_path + img) for img in sample_images]
display(HTML("<table><tr>{}</tr></table>".format(''.join(html))))

### Labels

In [8]:
import pandas as pd

In [13]:
loc_train_solution = pd.read_csv('LOC_train_solution.csv')
loc_val_solution = pd.read_csv('LOC_val_solution.csv')
loc_synset_mapping_path = 'LOC_synset_mapping.txt'

In [10]:
loc_train_solution.head()

| | ImageId | PredictionString |
|---|---|---|
| 0 | n02017213_7894 | n02017213 115 49 448 294 |
| 1 | n02017213_7261 | n02017213 91 42 330 432 |
| 2 | n02017213_5636 | n02017213 230 104 414 224 |
| 3 | n02017213_6132 | n02017213 46 82 464 387 |
| 4 | n02017213_7659 | n02017213 103 66 331 335 |


In [11]:
loc_val_solution.head()

| | ImageId | PredictionString |
|---|---|---|
| 0 | ILSVRC2012_val_00048981 | n03995372 85 1 499 272 |
| 1 | ILSVRC2012_val_00037956 | n03481172 131 0 499 254 |
| 2 | ILSVRC2012_val_00026161 | n02108000 38 0 464 280 |
| 3 | ILSVRC2012_val_00026171 | n03109150 0 14 216 299 |
| 4 | ILSVRC2012_val_00008726 | n02119789 255 142 454 329 n02119789 44 21 322 ... |
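
Each `PredictionString` above appears to pack one or more annotations as groups of five tokens: a synset ID followed by four integers. The sketch below splits such a string into individual boxes. It assumes the integers are ordered `xmin ymin xmax ymax`, which matches the sample rows but is not stated in this notebook. `Box` and `parse_prediction_string` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One annotated object (hypothetical helper type)."""
    synset_id: str
    xmin: int  # coordinate order is an assumption
    ymin: int
    xmax: int
    ymax: int


def parse_prediction_string(prediction: str) -> List[Box]:
    """Split a PredictionString into one Box per group of five tokens.

    Assumed layout: <synset_id> <xmin> <ymin> <xmax> <ymax>, repeated.
    """
    tokens = prediction.split()
    if len(tokens) % 5 != 0:
        raise ValueError('expected groups of 5 tokens, got {}'.format(len(tokens)))
    boxes = []
    for start in range(0, len(tokens), 5):
        label, *coords = tokens[start:start + 5]
        boxes.append(Box(label, *(int(c) for c in coords)))
    return boxes


# First row of the training solution shown earlier.
single = parse_prediction_string('n02017213 115 49 448 294')
print(single)

# Two training rows concatenated by hand to mimic a multi-object image.
double = parse_prediction_string('n02017213 115 49 448 294 n02017213 91 42 330 432')
print(len(double))
```

Applied to a whole column, for example with `loc_val_solution['PredictionString'].map(parse_prediction_string).map(len)`, this would give the number of annotated objects per image, which is one way to check the object count per image.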


In [81]:
def parse_synset_mapping(path):
    """Parse the synset mapping file into a dict of <synset_id>: [<English synonyms>].

    This assumes an input file formatted as:
        <synset_id> <category>, <synonym...>
    Example:
        n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
    """
    synset_map = {}
    with open(path, 'r') as fp:
        for line in fp:
            if not line.strip():
                continue  # skip blank lines
            # Split off the synset ID; the rest is a comma-separated list of names.
            synset_id, _, names = line.strip().partition(' ')
            synset_map[synset_id] = [name.strip() for name in names.split(',')]
    return synset_map

In [82]:
synset_mapping = parse_synset_mapping(loc_synset_mapping_path)

In [83]:
len(synset_mapping)

1000

In [84]:
# Print the first six entries of the mapping.
for index, (key, val) in enumerate(synset_mapping.items()):
    if index > 5:
        break
    print('{} {}'.format(key, val))

n01440764 ['tench', 'Tinca tinca']
n01443537 ['goldfish', 'Carassius auratus']
n01484850 ['great white shark', 'white shark', 'man-eater', 'man-eating shark', 'Carcharodon carcharias']
n01491361 ['tiger shark', 'Galeocerdo cuvieri']
n01494475 ['hammerhead', 'hammerhead shark']
n01496331 ['electric ray', 'crampfish', 'numbfish', 'torpedo']
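
With the mapping in hand, a synset ID can be turned into a readable name. The sketch below is a minimal illustration, not part of the original notebook: `primary_label` and `example_mapping` are hypothetical names, and the example relies on the observation that the sample image file names begin with the synset ID followed by an underscore.

```python
def primary_label(synset_id, mapping):
    """Return the first English name listed for a synset ID, or the ID itself if unknown."""
    names = mapping.get(synset_id)
    return names[0] if names else synset_id


# Small excerpt of the mapping printed above, so the example runs on its own.
example_mapping = {
    'n01440764': ['tench', 'Tinca tinca'],
    'n01443537': ['goldfish', 'Carassius auratus'],
}

# Two of the sample image file names used earlier in the notebook.
for filename in ['n01440764_14034.JPEG', 'n01443537_15790.JPEG']:
    synset_id = filename.split('_')[0]  # assumes a '<synset_id>_<number>' naming scheme
    print(filename, '->', primary_label(synset_id, example_mapping))
```

In the notebook itself, `synset_mapping` would take the place of `example_mapping`.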
