
![](https://theknclan.com/wp-content/uploads/2017/10/635980679147435890-488367249_FashionHeader.png)

# Extensive EDA of iMaterialist (Fashion) Dataset with Object Detection and Color Analysis

This notebook explores the [dataset](https://www.kaggle.com/c/imaterialist-challenge-fashion-2018) of the iMaterialist Challenge (Fashion) at FGVC5.

About the iMaterialist (Fashion) Competition:

As shoppers move online, it would be a dream come true to have products in photos classified automatically. But, automatic product recognition is tough because for the same product, a picture can be taken in different lighting, angles, backgrounds, and levels of occlusion. Meanwhile different fine-grained categories may look very similar, for example, royal blue vs turquoise in color. Many of today’s general-purpose recognition machines simply cannot perceive such subtle differences between photos, yet these differences could be important for shopping decisions.

Tackling issues like this is why the Conference on Computer Vision and Pattern Recognition (CVPR) has put together a workshop specifically for data scientists focused on fine-grained visual categorization called the FGVC5 workshop. As part of this workshop, CVPR is partnering with Google, Wish, and Malong Technologies to challenge the data science community to help push the state of the art in automatic image classification.

In this competition, FGVC workshop organizers with Wish and Malong Technologies challenge you to develop algorithms that will help with an important step towards automatic product detection – to accurately assign attribute labels for fashion images. Individuals/Teams with top submissions will be invited to present their work live at the FGVC5 workshop.  




**Contents**

**1. Descriptive Statistics**   
&nbsp;&nbsp;&nbsp;&nbsp;  1.1 Counts of Images and Labels  
&nbsp;&nbsp;&nbsp;&nbsp;     1.2 Top Labels in the dataset  
&nbsp;&nbsp;&nbsp;&nbsp;     1.3 Most Common Co-occurring Labels  
&nbsp;&nbsp;&nbsp;&nbsp;     1.4 Images with maximum Labels  
&nbsp;&nbsp;&nbsp;&nbsp;     1.5 Images with single Label  
&nbsp;&nbsp;&nbsp;&nbsp;     1.6 Frequency Distribution of Images in different label count buckets  
**2. Colors Used in the Images**     
&nbsp;&nbsp;&nbsp;&nbsp;     2.1 Top Average Color of the images  
&nbsp;&nbsp;&nbsp;&nbsp;     2.2 Dominant Colors present in the images  
&nbsp;&nbsp;&nbsp;&nbsp;     2.3 Common Color Palettes  
**3. Object Detection**  
&nbsp;&nbsp;&nbsp;&nbsp;     3.1 Top Colors Detected in the images  
&nbsp;&nbsp;&nbsp;&nbsp;     3.2 Top Objects Detected in the images  

## Dataset Preparation 

In [3]:
# IPython's Image is not imported: PIL's Image (imported below) would shadow it anyway
from IPython.display import HTML, display
from collections import Counter
import pandas as pd 
import json
import random


from plotly.offline import init_notebook_mode, iplot
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from wordcloud import WordCloud
from plotly import tools
import seaborn as sns
from PIL import Image

import tensorflow as tf
import numpy as np

init_notebook_mode(connected=True)
%matplotlib inline 

In [4]:
## read the dataset 

train_path = 'data/train.json'
test_path  = 'data/test.json'
valid_path = 'data/validation.json'

def load_json(path):
    """Read a JSON file, closing the handle once it is parsed."""
    with open(path) as f:
        return json.load(f)

train_inp = load_json(train_path)
test_inp  = load_json(test_path)
valid_inp = load_json(valid_path)

## 1. Descriptive Statistics

## 1.1 How many Images and how many distinct labels are there in the dataset?

In [5]:
print('Keys in the training json', [*train_inp.keys()])
print('Keys in the test json', [*test_inp.keys()])
print('Keys in the validation json', [*valid_inp.keys()])

Keys in the training json ['info', 'images', 'annotations', 'license']
Keys in the test json ['images']
Keys in the validation json ['images', 'annotations']


In [4]:
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

take(5, train_inp.items())

[('info',
  {'url': 'https://www.wish.com',
   'dateCreated': '2-27-2018',
   'version': '2',
   'description': 'Train Set for FGVC5 CVPR 2018 by https://www.wish.com',
   'year': '2018'}),
 ('images',
  [{'url': 'https://contestimg.wish.com/api/webimage/570f35feb2f4b95d223aa9b1-large',
    'imageId': '1'},
   {'url': 'https://contestimg.wish.com/api/webimage/5468f1c0d96b290ff8e5c805-large',
    'imageId': '2'},
   {'url': 'https://contestimg.wish.com/api/webimage/546410237d57f323e72ca414-large',
    'imageId': '3'},
   {'url': 'https://contestimg.wish.com/api/webimage/550b955fdd699c1a0351f84e-large',
    'imageId': '4'},
   {'url': 'https://contestimg.wish.com/api/webimage/54451f33355b4e0fd3028a30-large',
    'imageId': '5'},
   {'url': 'https://contestimg.wish.com/api/webimage/571e0b1cea3cc75d8a004f37-large',
    'imageId': '6'},
   {'url': 'https://contestimg.wish.com/api/webimage/52cbee3f34067e3d742181de-large',
    'imageId': '7'},
   {'url': 'https://contestimg.wish.com/api/webim

In [6]:

def get_stats(data):
    total_images = len(data['images'])

    all_annotations = []
    if 'annotations' in data:
        for each in data['annotations']:
            all_annotations.extend(each['labelId'])
    total_labels = len(set(all_annotations))
    return total_images, total_labels, all_annotations

total_images, total_labels, train_annotations = get_stats(train_inp)
print (train_path, "- Total Images:", total_images)
print (train_path, "- Total Labels:", total_labels)

total_images, total_labels, test_annotations = get_stats(test_inp)
print (test_path, " - Total Images:", total_images)
print (test_path, " - Total Labels:", total_labels)

total_images, total_labels, valid_annotations = get_stats(valid_inp)
print (valid_path, "- Total Images:", total_images)
print (valid_path, "- Total Labels:", total_labels)

data/train.json - Total Images: 1014544
data/train.json - Total Labels: 228
data/test.json  - Total Images: 39706
data/test.json  - Total Labels: 0
data/validation.json - Total Images: 9897
data/validation.json - Total Labels: 225


The train dataset provides about 1 million images, labelled with 228 distinct labels. Two other sources of data are available as well, the test data and the validation data, but in this notebook I have only used images from the train dataset.
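
The counts above show 228 distinct labels in train and 225 in validation, so at least three train labels never occur in the validation annotations. The following is a minimal sketch of how such labels could be listed with a set difference. The helper name `labels_missing_from` and the toy annotations are made up for illustration; the real inputs would be `train_inp['annotations']` and `valid_inp['annotations']`.

```python
def labels_missing_from(reference_annotations, other_annotations):
    """Return labels used in the reference split but absent from the other split."""
    reference = {label for ann in reference_annotations for label in ann["labelId"]}
    other = {label for ann in other_annotations for label in ann["labelId"]}
    # Label ids are strings in the JSON files, so sort them numerically.
    return sorted(reference - other, key=int)

# Toy annotations in the same shape as the entries of train_inp["annotations"].
toy_train = [
    {"imageId": "1", "labelId": ["66", "105"]},
    {"imageId": "2", "labelId": ["17", "3"]},
]
toy_valid = [{"imageId": "1", "labelId": ["66", "17"]}]

missing = labels_missing_from(toy_train, toy_valid)
print(missing)
```

On the real data the call would be `labels_missing_from(train_inp['annotations'], valid_inp['annotations'])`.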

## 1.2 Which are the most used Labels in the dataset?

In [7]:
train_imgs_df = pd.DataFrame.from_records(train_inp["images"])
train_labels_df = pd.DataFrame.from_records(train_inp["annotations"])
train_df = pd.merge(train_imgs_df,train_labels_df,on="imageId",how="outer")
# np.int was an alias for the builtin int and was removed in NumPy 1.24
train_df["imageId"] = train_df["imageId"].astype(int)
print(train_df.head(2))
print(train_df.dtypes)

valid_imgs_df = pd.DataFrame.from_records(valid_inp["images"])
valid_labels_df = pd.DataFrame.from_records(valid_inp["annotations"])
valid_df = pd.merge(valid_imgs_df,valid_labels_df,on="imageId",how="outer")
valid_df["imageId"] = valid_df["imageId"].astype(int)
print(valid_df.head(2))
print(valid_df.dtypes)

test_df = pd.DataFrame.from_records(test_inp["images"])
test_df["imageId"] = test_df["imageId"].astype(int)
print(test_df.head(2))
print(test_df.dtypes)

   imageId                                                url  \
0        1  https://contestimg.wish.com/api/webimage/570f3...   
1        2  https://contestimg.wish.com/api/webimage/5468f...   

                       labelId  
0        [95, 66, 137, 70, 20]  
1  [36, 66, 44, 214, 105, 133]  
imageId     int64
url        object
labelId    object
dtype: object
   imageId                                                url  \
0        1  https://contestimg.wish.com/api/webimage/568e1...   
1        2  https://contestimg.wish.com/api/webimage/5452f...   

                                     labelId  
0            [62, 17, 66, 214, 105, 137, 85]  
1  [95, 17, 66, 214, 164, 137, 20, 204, 184]  
imageId     int64
url        object
labelId    object
dtype: object
   imageId                                                url
0        1  https://contestimg.wish.com/api/webimage/568e1...
1        2  https://contestimg.wish.com/api/webimage/5452f...
imageId     int64
url        object
dtype: obj

In [8]:
print("## Training Data.")
print(train_df.isna().any(),"\n")

print("## Testing Data.")
print(test_df.isna().any(),"\n")

print("## Validation Data.")
print(valid_df.isna().any())


## Training Data.
imageId    False
url        False
labelId    False
dtype: bool 

## Testing Data.
imageId    False
url        False
dtype: bool 

## Validation Data.
imageId    False
url        False
labelId    False
dtype: bool


In [9]:
train_labels = Counter(train_annotations)

xvalues = list(train_labels.keys())
yvalues = list(train_labels.values())

colores = ["#%06x" % random.randint(0, 0xFFFFFF) for _ in range(len(xvalues)) ]

trace1 = go.Bar(x=xvalues, y=yvalues, opacity=0.8, name="label count", marker=dict(color=colores))
layout = dict(width=1000, title='Distribution of different labels in the train dataset', legend=dict(orientation="h"));

fig = go.Figure(data=[trace1], layout=layout);
iplot(fig);

In [10]:
valid_labels = Counter(valid_annotations)

xvalues = list(valid_labels.keys())
yvalues = list(valid_labels.values())
print(xvalues)

trace1 = go.Bar(x=xvalues, y=yvalues, opacity=0.8, name="label count", marker=dict(color='rgba(20, 20, 20, 1)'))
layout = dict(width=800, title='Distribution of different labels in the valid dataset', legend=dict(orientation="h"));

fig = go.Figure(data=[trace1], layout=layout);
iplot(fig);

['62', '17', '66', '214', '105', '137', '85', '95', '164', '20', '204', '184', '122', '19', '186', '180', '44', '154', '190', '222', '153', '226', '53', '171', '111', '70', '14', '98', '12', '175', '54', '138', '116', '176', '56', '210', '61', '106', '49', '15', '148', '115', '181', '36', '78', '193', '144', '103', '99', '178', '135', '47', '59', '18', '128', '87', '30', '108', '25', '102', '225', '48', '147', '209', '183', '194', '131', '203', '133', '212', '126', '77', '65', '73', '43', '32', '97', '130', '45', '201', '21', '2', '169', '88', '40', '79', '208', '159', '158', '63', '165', '192', '207', '182', '5', '92', '151', '136', '189', '10', '9', '187', '8', '38', '91', '100', '205', '41', '81', '142', '117', '120', '110', '211', '191', '155', '52', '218', '170', '55', '28', '114', '220', '168', '150', '113', '7', '216', '224', '119', '31', '141', '101', '217', '172', '80', '75', '69', '197', '124', '13', '132', '179', '74', '26', '143', '166', '71', '22', '94', '72', '51', '4', '

In [11]:
def get_images_for_labels(labellist, data):
    image_ids = []
    for each in data['annotations']:
        if all(x in each['labelId'] for x in labellist):
            image_ids.append(each['imageId'])
            if len(image_ids) == 2:
                break
    image_urls = []
    for each in data['images']:
        if each['imageId'] in image_ids:
            image_urls.append(each['url'])
    return image_urls

In [12]:
# most common labels 

temps = train_labels.most_common(10)
labels_tr = ["Label-"+str(x[0]) for x in temps]
values = [x[1] for x in temps]

colores = ["#%06x" % random.randint(0, 0xFFFFFF) for _ in range(10) ]

trace1 = go.Bar(x=labels_tr, y=values, opacity=0.7, name="label count", marker=dict(color=colores))
layout = dict(height=400, title='Top 10 Labels in the train dataset', legend=dict(orientation="h"));

fig = go.Figure(data=[trace1], layout=layout);
iplot(fig);

Label 66 is the most used label, with almost 750K images in the training dataset tagged with it.

In [13]:
temps = valid_labels.most_common(10)
labels_vl = ["Label-"+str(x[0]) for x in temps]
values = [x[1] for x in temps]

trace1 = go.Bar(x=labels_vl, y=values, opacity=0.7, name="label count", marker=dict(color='rgba(120, 120, 120, 0.8)'))
layout = dict(height=400, title='Top 10 Labels in the valid dataset', legend=dict(orientation="h"));

fig = go.Figure(data=[trace1], layout=layout);
iplot(fig);

In the validation dataset, label 66 is again the most used label, but the second most used label is label 17 rather than label 105 as in the training dataset.
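
Because the two splits differ greatly in size (about 1 million training images versus 9,897 validation images), raw counts are hard to compare across them. A minimal sketch of one way to compare label rankings is to divide each count by the number of images in its split. The function name `label_shares` and all numbers in the toy counters are invented for illustration; on the real data the inputs would be `train_labels` and `valid_labels` together with the image totals printed earlier.

```python
from collections import Counter

def label_shares(label_counts, n_images):
    """Map each label to the fraction of images in the split that carry it."""
    return {label: count / n_images for label, count in label_counts.items()}

# Made-up counts, only to show the shape of the comparison.
toy_train_counts = Counter({"66": 750, "105": 500, "17": 400})
toy_valid_counts = Counter({"66": 8, "17": 6, "105": 5})

train_share = label_shares(toy_train_counts, n_images=1000)
valid_share = label_shares(toy_valid_counts, n_images=10)

# Print labels by decreasing train share, next to their validation share.
for label in sorted(train_share, key=train_share.get, reverse=True):
    print(label, round(train_share[label], 2), round(valid_share.get(label, 0.0), 2))
```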

## 1.3 What are the most Common Co-Occurring Labels in the dataset

Since every image can carry multiple labels, it is interesting to see which labels occur together.

In [14]:
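# Most commonly co-occurring labels

def cartesian_reduct(alist):
    # Note: each unordered pair is visited as (x, y) and again as (y, x),
    # so every co-occurring pair is appended twice per image.
    results = []
    for x in alist:
        for y in alist:
            if x == y:
                continue
            srtd = sorted([int(x),int(y)])
            srtd = " AND ".join([str(x) for x in srtd])
            results.append(srtd)
    return results 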

co_occurance = []
for i, each in enumerate(train_inp['annotations']):
    prods = cartesian_reduct(each['labelId'])
    co_occurance.extend(prods)

In [15]:
coocur = Counter(co_occurance).most_common(10)
labels = list(reversed([str(x[0]) for x in coocur]))
values = list(reversed([x[1] for x in coocur]))

colores = ["#%06x" % random.randint(0, 0xFFFFFF) for _ in range(10) ]

trace1 = go.Bar(x=values, y=labels, opacity=0.7, orientation="h", name="co-occurrence count", marker=dict(color=colores, colorscale='Rainbow'))
layout = dict(height=400, title='Most common co-occurring Labels in the dataset', legend=dict(orientation="h"));

fig = go.Figure(data=[trace1], layout=layout);
iplot(fig);

From the above graph, the pairs (label 66, label 105) and (label 66, label 171) have been used most often while labelling the images, with total counts of 460K and 445K respectively. Apart from the most frequently occurring label 66, label 105 and label 153 have been used repeatedly in the dataset.
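
Co-occurrence counts of this kind can also be gathered with `itertools.combinations`, which yields each unordered pair of an image's labels exactly once, unlike a double loop over the label list, which visits every pair in both orders. The sketch below is an illustration under that assumption rather than the code used for the graph; `count_label_pairs` and the toy annotations are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_label_pairs(annotations):
    """Count how many annotations contain each unordered pair of labels."""
    pair_counts = Counter()
    for ann in annotations:
        # Deduplicate and sort numerically so (66, 105) and (105, 66) share one key.
        labels = sorted({int(label) for label in ann["labelId"]})
        pair_counts.update(combinations(labels, 2))
    return pair_counts

# Toy annotations in the same shape as train_inp["annotations"].
toy_annotations = [
    {"imageId": "1", "labelId": ["66", "105", "171"]},
    {"imageId": "2", "labelId": ["105", "66"]},
    {"imageId": "3", "labelId": ["66"]},
]

pair_counts = count_label_pairs(toy_annotations)
print(pair_counts.most_common(2))
```

With this approach each count equals the number of images that carry both labels of the pair.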

## 1.4 Which Images are tagged with Maximum Labels

Some images carry a single label, while others have as many as 20. Let's look at the images with the largest number of labels in the dataset.
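
Picking the few images with the most labels does not require sorting the whole annotation list, and looking up their URLs does not require scanning the image list each time. A minimal sketch, assuming the same `imageId`, `labelId` and `url` fields as in the JSON files: `heapq.nlargest` selects the top entries and a dictionary built once maps ids to URLs. The helper `top_labelled` and the `example.com` URLs are hypothetical.

```python
import heapq

def top_labelled(annotations, images, n=5):
    """Return (url, label_count) for the n annotations with the most labels."""
    # Built once, so each lookup is a dictionary access instead of a list scan.
    url_by_id = {img["imageId"]: img["url"] for img in images}
    top = heapq.nlargest(n, annotations, key=lambda ann: len(ann["labelId"]))
    return [(url_by_id.get(ann["imageId"]), len(ann["labelId"])) for ann in top]

# Toy data with placeholder URLs.
toy_images = [{"imageId": str(i), "url": f"https://example.com/{i}.jpg"} for i in (1, 2, 3)]
toy_annotations = [
    {"imageId": "1", "labelId": ["66", "105"]},
    {"imageId": "2", "labelId": ["66", "105", "171"]},
    {"imageId": "3", "labelId": ["66"]},
]

result = top_labelled(toy_annotations, toy_images, n=2)
print(result)
```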

In [16]:
def get_image_url(imgid, data):
    for each in data['images']:
        if each['imageId'] == imgid:
            return each['url']

srtedlist = sorted(train_inp['annotations'], key=lambda d: len(d['labelId']), reverse=True)

In [17]:
for img in srtedlist[:5]:
    iurl = get_image_url(img['imageId'], train_inp)  
    labelpair = ", ".join(img['labelId'])
    imghtml = """Labels: """+ str(labelpair) +""" &nbsp;&nbsp; <b>Total Labels: """+ str(len(img['labelId'])) + """</b><br>""" + "<img src="+iurl+" width=200px; style='float:left'>"
    display(HTML(imghtml))

## 1.5 Which Images have a perfect label, i.e. a Single Label

Let's look at some of the images which have only one label.

In [18]:
# Show the five images with the fewest labels (the tail of the descending sort)
for img in srtedlist[-5:]:
    iurl = get_image_url(img['imageId'], train_inp)  
    labelpair = ", ".join(img['labelId'])
    imghtml = """<b> Label: """+ str(labelpair) +"""</b><br>""" + "<img src="+iurl+" width=200px; height=200px; style='float:left'>"
    display(HTML(imghtml))

## 1.6 Frequency Distribution of Images with respective Label Counts in the dataset

Let's visualize how many images there are in each label count bucket.

In [19]:
lbldst = Counter([len(x['labelId']) for x in srtedlist])

labels = list(lbldst.keys())
values = list(lbldst.values())

trace1 = go.Bar(x=labels, y=values, opacity=0.7, name="image count")
layout = dict(height=400, title='Frequency distribution of images with respective labels counts ', legend=dict(orientation="h"));

fig = go.Figure(data=[trace1], layout=layout);
iplot(fig);

In [118]:
def display_label(label_id, label_mat, df, num_disp=8):
    # Use the matrix passed in rather than the global train_image_mat
    data_col = label_mat.getcol(label_id)
    # Row indices of the matrix are image ids, so they can be matched against df["imageId"]
    tar_col = np.random.choice(np.where(data_col.toarray() == 1.0)[0],size=num_disp).tolist()
    urls = df[df["imageId"].isin(tar_col)]["url"].tolist()
    img_style = "width: 110px; margin: 0px; float: left; border: 1px solid black;"
    images_list = ''.join([f"<img style='{img_style}' src='{u}' />" for u in urls])
    header_str = "<h2>Label {:d}</h2>".format(label_id)
    #display(HTML(header_str))
    #display(HTML(images_list))

    return header_str, images_list


In [91]:
from scipy.sparse import csr_matrix

train_image_arr = train_df[["imageId","labelId"]].apply(lambda x: [(x["imageId"],int(i)) for i in x["labelId"]], axis=1).tolist()
train_image_arr = [item for sublist in train_image_arr for item in sublist]
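# np.int was an alias for the builtin int and was removed in NumPy 1.24
train_image_row = np.array([d[0] for d in train_image_arr]).astype(int)
train_image_col = np.array([d[1] for d in train_image_arr]).astype(int)
train_image_vals = np.ones(len(train_image_col))
# Rows are image ids and columns are label ids; both start at 1, so row 0 and column 0 stay empty
train_image_mat = csr_matrix((train_image_vals, (train_image_row, train_image_col)))
print(train_image_mat.shape)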

(1014545, 229)


In [120]:
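# Build the whole table as one HTML string: each display() call produces its own
# output, so a <table> opened in one call would not span the following calls.
cells_per_row = 8
rows = []
cells = []

for label in range(1,train_image_mat.shape[1]):
    header, images = display_label(label, train_image_mat, train_df, 1)
    cells.append('<td>{} <br />{}</td>'.format(header, images))
    if len(cells) == cells_per_row:
        rows.append('<tr>{}</tr>'.format(''.join(cells)))
        cells = []

# Flush the last, possibly incomplete, row
if cells:
    rows.append('<tr>{}</tr>'.format(''.join(cells)))

display(HTML('<table style="width:100%; border: 1px solid black;">{}</table>'.format(''.join(rows))))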

Most images in the dataset have 5 or 6 labels.
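
A statement like this can be checked numerically from the per-image label counts. The sketch below is an illustration on toy data, not the notebook's own computation: `label_count_summary` is a hypothetical helper that reports the mean, median and most frequent number of labels per image using only the standard library.

```python
from collections import Counter
from statistics import mean, median

def label_count_summary(annotations):
    """Summarise how many labels each annotated image carries."""
    counts = [len(ann["labelId"]) for ann in annotations]
    distribution = Counter(counts)
    return {
        "mean": mean(counts),
        "median": median(counts),
        "mode": distribution.most_common(1)[0][0],
        "distribution": dict(sorted(distribution.items())),
    }

# Toy annotations with 5, 6, 5 and 1 labels respectively.
toy_annotations = [
    {"imageId": "1", "labelId": ["66", "105", "171", "17", "153"]},
    {"imageId": "2", "labelId": ["66", "105", "171", "17", "153", "20"]},
    {"imageId": "3", "labelId": ["66", "105", "137", "70", "20"]},
    {"imageId": "4", "labelId": ["66"]},
]

summary = label_count_summary(toy_annotations)
print(summary)
```

On the real data the function would be called with `train_inp['annotations']`.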

## 2. Colors Used in the Images 

In the e-commerce industry, colors play a very important role in customer behaviour. Some people are more inclined towards soft colors while others prefer warm colors. In this section, let's visualize what types of colors are used in the images.

## 2.1 Common Average Color of the Images 
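
The average color of an image is the per-channel mean of its pixels. As a minimal sketch, assuming the image is available as a `(height, width, 3)` RGB array (with Pillow, `np.asarray(img.convert("RGB"))` produces one), the mean can be taken in a single vectorized NumPy call instead of a Python loop over every pixel. The function name `average_color` and the toy array are made up for illustration.

```python
import numpy as np

def average_color(pixels):
    """Per-channel mean of a (height, width, 3) RGB array, as a tuple of floats."""
    pixels = np.asarray(pixels, dtype=np.float64)
    # Flatten the two spatial axes, then average each of the three channels.
    channel_means = pixels.reshape(-1, 3).mean(axis=0)
    return tuple(float(v) for v in channel_means)

# Toy 2x2 image: two pure red pixels, one pure green, one pure blue.
toy = np.array(
    [[[255, 0, 0], [255, 0, 0]],
     [[0, 255, 0], [0, 0, 255]]],
    dtype=np.uint8,
)

avg = average_color(toy)
print(avg)
```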

In [27]:

import requests 
from io import BytesIO

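def compute_average_image_color(img):
    # Convert first: getpixel returns a 3-tuple only for RGB images
    # (RGBA, palette and grayscale images would break the unpacking below)
    img = img.convert('RGB')
    width, height = img.size
    count, r_total, g_total, b_total = 0, 0, 0, 0
    for x in range(0, width):
        for y in range(0, height):
            r, g, b = img.getpixel((x,y))
            r_total += r
            g_total += g
            b_total += b
            count += 1
    return (r_total/count, g_total/count, b_total/count)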

In [28]:
srtedlist = sorted(train_inp['annotations'], key=lambda d: len(d['labelId']))
average_colors = {}
for img in srtedlist[:10]:
    
    iurli = get_image_url(img['imageId'], train_inp)

    response = requests.get(iurli)
    img = Image.open(BytesIO(response.content))
           
    average_color = compute_average_image_color(img)
    if average_color not in average_colors:
        average_colors[average_color] = 0
    average_colors[average_color] += 1

In [29]:
for average_color in average_colors:
    average_color1 = (int(average_color[0]),int(average_color[1]),int(average_color[2]))
    image_url = "<span style='display:inline-block; min-width:200px; background-color:rgb"+str(average_color1)+";padding:10px 10px;'>"+str(average_color1)+"</span>"
#     print (image_url)
    display(HTML(image_url))

## 2.2 Most Dominant Colors Used in the Images 
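
This notebook relies on the ColorThief library for dominant colors. To make the idea concrete, the sketch below shows a different and much simpler technique, coarse histogram binning: each channel is split into a few buckets and the fullest bucket wins. It is not ColorThief's algorithm; `dominant_color`, the `levels` parameter and the toy pixel list are assumptions made for illustration.

```python
from collections import Counter

def dominant_color(pixels, levels=8):
    """Most frequent color after binning each RGB channel into `levels` buckets."""
    step = 256 // levels
    bins = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    winning_bin, _ = bins.most_common(1)[0]
    # Report the centre of the winning bin as the representative color.
    return tuple(b * step + step // 2 for b in winning_bin)

# Toy pixels: nine reddish, four bluish.
toy_pixels = [(250, 10, 10)] * 6 + [(245, 5, 12)] * 3 + [(10, 10, 240)] * 4

color = dominant_color(toy_pixels)
print(color)
```

Binning groups near-identical shades together, so slightly different reds count as one color; fewer levels give coarser groups.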

In [51]:
## top used colors in images 
from colorthief import ColorThief
import urllib.request

from PIL import Image
import requests
from io import BytesIO
import os


img_style = "width: 200px; margin: 0px; float: left; border: 1px solid black;"

def dominant_color_from_url(url,tmp_file='tmp.jpg'):
    '''Downloads the image file and returns its dominant color and a 6-color palette'''
    urllib.request.urlretrieve(url, tmp_file)
    color_thief = ColorThief(tmp_file)
    dominant_color = color_thief.get_color(quality=1)
    palette = color_thief.get_palette(color_count=6)
    # Clean up the temporary download
    os.remove(tmp_file)
    return dominant_color, palette

pallets = []
for img in srtedlist[:6]:
    iurli = get_image_url(img['imageId'], train_inp)

    # dominant_color_from_url downloads the image itself, so no separate request is needed here
    dominant_color, palette = dominant_color_from_url(iurli)
    image_url = "Dominant color: <span style='display:inline-block; min-width:200px; background-color:rgb"+str(dominant_color)+";padding:10px 10px;'>"+str(dominant_color)+"</span>"
    image     = "<img style=\'"+ img_style +"\' src=\'"+iurli+"\'/>"

    paleta = 'Palette: '
    for pall in palette:
        paleta += "<span style='background-color:rgb"+str(pall)+";padding:20px 10px;'>"+str(pall)+"</span>"

    display(HTML(image_url))
    print("")
    display(HTML(paleta))
    print("")
    display(HTML(image))

    pallets.append(palette)

## 2.3 Common Color Palettes of the Images

In [47]:
for pallet in pallets:
    img_url = ""
    for pall in pallet:
        img_url += "<span style='background-color:rgb"+str(pall)+";min-width:300px; padding:20px 10px;'>"+str(pall)+"</span>"
    img_url += "<br>"
    display(HTML(img_url))

- Reference: [TensorFlow Object Detection Notebook](https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb)  
- Pre-Trained Models Reference: [PreTrained Models](https://github.com/tensorflow/models/tree/676a4f70c20020ed41b533e0c331f115eeffe9a3/research/object_detection)  
- Link to download the Utils: https://github.com/tensorflow/models/tree/master/research/object_detection/utils