This notebook shows how to find the images that contain high-frequency labels. Note that I only look at the human labels, because they have high confidence and there are far fewer of them than machine labels.

In [359]:
import pandas as pd
import numpy as np


In [360]:
# Read all the human labels

train_human_labels = pd.read_csv('/media/yshi/4TB/openImage/metadata/train_human_labels.csv', dtype = {'ImageID': str, "Source": str, 'LabelName':str, "Confidence": np.float64})

In [361]:
train_human_labels.shape

(8036466, 4)

So there are 8036466 labels in total. (This differs from the total number of images because an image can carry more than one label.) Next, we load the full list of trainable classes.

In [362]:
label_descriptions = pd.read_csv('/media/yshi/4TB/openImage/metadata/classes-trainable.csv')

In [363]:
label_descriptions.shape

(7178, 1)

In [365]:
# trainable_labels = list(label_descriptions.label_code)

Next, we load the class descriptions, so that later we can check that the high-frequency labels in the results are sensible categories.

In [369]:
class_descriptions = pd.read_csv('/media/yshi/4TB/openImage/metadata/class-descriptions.csv', dtype=str)

In [370]:
label_code_to_description = dict(zip(class_descriptions.label_code, class_descriptions.description))

In [371]:
class_descriptions.head()

Unnamed: 0,label_code,description
0,/m/0100nhbf,Sprenger's tulip
1,/m/0104x9kv,Vinegret
2,/m/0105jzwx,Dabu-dabu
3,/m/0105ld7g,Pistachio ice cream
4,/m/0105lxy5,Woku


Some descriptions correspond to more than one unique label code. That might be something we should consider in the future.

In [375]:
class_description_counts = class_descriptions.groupby('description').count().rename(columns={'label_code':'count'})

In [376]:
repeated_descriptions = class_description_counts[class_description_counts['count'] > 1].sort_values('count', ascending=False)

In [377]:
repeated_descriptions

Unnamed: 0_level_0,count
description,Unnamed: 1_level_1
Trunk,4
Wedge,3
Referee,3
Crop,3
Punch,3
Morning glory,3
Lilac,3
Eel,3
Mint,3
Windflower,3
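
To see which label codes hide behind a repeated description, the codes can be collected per description. This is a minimal sketch on made-up data: the frame `toy_classes` and its codes are illustrative assumptions, not rows from class-descriptions.csv.

```python
import pandas as pd

# Toy stand-in for class_descriptions; the codes and names are made up.
toy_classes = pd.DataFrame({
    "label_code": ["/m/a1", "/m/a2", "/m/b1", "/m/c1", "/m/c2", "/m/c3"],
    "description": ["Trunk", "Trunk", "Tulip", "Mint", "Mint", "Mint"],
})

# Collect the label codes behind each description ...
codes_per_description = toy_classes.groupby("description")["label_code"].apply(list)

# ... and keep only the descriptions that map to more than one code.
ambiguous = codes_per_description[codes_per_description.map(len) > 1]
print(ambiguous)
```

Listing the codes side by side makes it easier to decide later whether such labels should be merged or kept apart.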


Reference for selecting rows based on column values: https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas

In [381]:
# keep only the labels that are in the trainable dataframe
trainable_human_all = train_human_labels.loc[train_human_labels['LabelName'].isin(label_descriptions['label_code'])]

In [311]:
# select all the training images that start with 0, 1, 2, or 3 (because they are all the data I've downloaded so far)
trainable_human_all = trainable_human_all[trainable_human_all['ImageID'].str.match('[0-3]')]

In [312]:
trainable_human_all

Unnamed: 0,ImageID,Source,LabelName,Confidence
0,000002b66c9c498e,crowdsource-verification,/m/01kcnl,1.0
1,000002b66c9c498e,verification,/m/012mj,1.0
2,000002b66c9c498e,verification,/m/012yh1,1.0
3,000002b66c9c498e,verification,/m/014sv8,1.0
4,000002b66c9c498e,verification,/m/016c68,1.0
5,000002b66c9c498e,verification,/m/016q19,1.0
6,000002b66c9c498e,verification,/m/019nj4,1.0
7,000002b66c9c498e,verification,/m/019_nn,1.0
8,000002b66c9c498e,verification,/m/019sc6,1.0
10,000002b66c9c498e,verification,/m/01bsxb,1.0
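
The two filters above (trainable labels, then ImageIDs starting with 0 to 3) can also be written as boolean masks and combined. A minimal sketch, assuming made-up IDs and codes in `toy_labels` and `toy_trainable`:

```python
import pandas as pd

# Toy stand-ins for train_human_labels and classes-trainable; values are made up.
toy_labels = pd.DataFrame({
    "ImageID": ["0a1f", "3c2e", "4d9b", "f00d", "1b7a"],
    "LabelName": ["/m/x", "/m/y", "/m/x", "/m/z", "/m/q"],
})
toy_trainable = pd.DataFrame({"label_code": ["/m/x", "/m/y", "/m/z"]})

# Mask 1: the label is one of the trainable codes.
is_trainable = toy_labels["LabelName"].isin(toy_trainable["label_code"])

# Mask 2: the ImageID starts with 0, 1, 2 or 3.
# str.match anchors at the start of the string, so '[0-3]' tests the first character only.
is_downloaded = toy_labels["ImageID"].str.match("[0-3]")

toy_subset = toy_labels[is_trainable & is_downloaded]
print(toy_subset)
```

Keeping the masks as named variables makes it easy to count how many rows each condition removes on its own.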


In [382]:
# Next we count the occurrences of each label name. Since we don't need the Confidence column, we reuse it to hold the count.
trainable_label_counts = trainable_human_all.groupby('LabelName').count().rename(columns={'Confidence':'Counts'}).drop(columns=['ImageID', 'Source'])

In [383]:
trainable_label_counts = trainable_label_counts.reset_index()

In [385]:
# join the class descriptions
trainable_label_counts_sorted = trainable_label_counts.set_index('LabelName').join(class_descriptions.set_index('label_code')).sort_values('Counts', ascending=False)
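
For comparison, the same count-and-describe step can be sketched with `value_counts`, which returns the counts already sorted in descending order. The frames `toy_labels` and `toy_classes` below are made-up assumptions used only for illustration, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for trainable_human_all and class_descriptions; values are made up.
toy_labels = pd.DataFrame({
    "LabelName": ["/m/p", "/m/p", "/m/p", "/m/t", "/m/t", "/m/c"],
})
toy_classes = pd.DataFrame({
    "label_code": ["/m/p", "/m/t", "/m/c"],
    "description": ["Person", "Tree", "Cat"],
})

# One row per label code, most frequent first.
toy_counts = toy_labels["LabelName"].value_counts().rename("Counts").to_frame()

# Attach the human-readable description by looking up each label code.
code_to_description = dict(zip(toy_classes["label_code"], toy_classes["description"]))
toy_counts["description"] = toy_counts.index.map(code_to_description)
print(toy_counts)
```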


In [386]:
trainable_label_counts_sorted.shape

(7172, 2)

7172 means that, of the 7178 trainable labels, 7172 appear in the subset downloaded to my computer. Next, we select the labels whose counts are over 600. (Of course, we can change this threshold later.)

In [388]:
trainable_label_counts_select = trainable_label_counts_sorted[trainable_label_counts_sorted['Counts'] > 600]

In [389]:
trainable_label_counts_select

Unnamed: 0_level_0,Counts,description
LabelName,Unnamed: 1_level_1,Unnamed: 2_level_1
/m/01g317,807090,Person
/m/09j2d,610840,Clothing
/m/0dzct,331942,Human face
/m/07j7r,315026,Tree
/m/05s2s,266978,Plant
/m/07yv9,228351,Vehicle
/m/0cgh4,183125,Building
/m/01prls,165525,Land vehicle
/m/09j5n,134458,Footwear
/m/0jbk,123529,Animal


So 447 of the labels appear more than 600 times in my subset. Next, we join these labels with the per-image label rows, so that each frequent label is paired with the images that carry it.

In [390]:
train_image_select = trainable_label_counts_select.join(trainable_human_all.set_index('LabelName'))
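
The join in the cell above is one-to-many: each selected label is repeated once for every image row that carries it, and labels that were not selected drop out. A minimal sketch of that behavior, assuming made-up frames `toy_selected` and `toy_labels`:

```python
import pandas as pd

# Toy stand-in for trainable_label_counts_select (indexed by LabelName).
toy_selected = pd.DataFrame(
    {"Counts": [3, 2], "description": ["Person", "Tree"]},
    index=pd.Index(["/m/p", "/m/t"], name="LabelName"),
)

# Toy stand-in for trainable_human_all; /m/c is not among the selected labels.
toy_labels = pd.DataFrame({
    "ImageID": ["a1", "b2", "c3", "a1", "b2", "c3"],
    "LabelName": ["/m/p", "/m/p", "/m/p", "/m/t", "/m/t", "/m/c"],
})

# Left join on the index: rows of toy_selected are duplicated per matching label row.
toy_joined = toy_selected.join(toy_labels.set_index("LabelName"))
print(toy_joined)
```

The result has one row per (label, image) pair, which is why its length counts labels rather than images.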

In [391]:
train_image_select

Unnamed: 0_level_0,Counts,description,ImageID,Source,Confidence
LabelName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
/m/011_f4,648,String instrument,0000201cd362f303,verification,1.0
/m/011_f4,648,String instrument,0000aa810854dc2e,verification,1.0
/m/011_f4,648,String instrument,0000b3e5921ab7ff,verification,1.0
/m/011_f4,648,String instrument,0000c33c6f4b8518,verification,1.0
/m/011_f4,648,String instrument,0000d59fa570d973,verification,1.0
/m/011_f4,648,String instrument,0002024f996741eb,verification,1.0
/m/011_f4,648,String instrument,00027ae34cb09507,verification,1.0
/m/011_f4,648,String instrument,0002a3c01c926a49,verification,1.0
/m/011_f4,648,String instrument,0002eed6350f1840,verification,1.0
/m/011_f4,648,String instrument,00036d5834d932bf,verification,1.0


So there are 6304532 labels in total. We still need to find how many images there are. (Again, an image can have more than one label.)

In [392]:
import os

In [393]:
datapath = []
for root, dirs, files in os.walk('/media/yshi/4TB/openImage/train/'):
    for filename in files:
        # split off the extension so the image ID is kept whatever the extension length
        nm, ext = os.path.splitext(filename)
        fullpath = os.path.join(os.path.abspath(root), filename)
        datapath.append((nm, fullpath))
datapath = pd.DataFrame(datapath, columns=['filename', 'fullpath'])
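
An equivalent index can be sketched with `pathlib`, where `Path.stem` drops the extension whatever its length. This is only an illustration: the helper name `index_images` and the file names in the throwaway directory are assumptions, not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def index_images(root):
    """Return a DataFrame with one row per file found recursively under root."""
    rows = [
        (path.stem, str(path.resolve()))
        for path in Path(root).rglob("*")
        if path.is_file()
    ]
    return pd.DataFrame(rows, columns=["filename", "fullpath"])


# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "0000201cd362f303.jpg").touch()
    (Path(tmp) / "sub" / "3cef2522b36f8c57.jpeg").touch()
    toy_index = index_images(tmp)

print(toy_index["filename"].tolist())
```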

In [394]:
datapath.shape

(493994, 2)

493994 is the total number of images in my subset. But how many of them can be used to train the 447 high-frequency labels above?

In [395]:
data_select = datapath.loc[datapath['filename'].isin(train_image_select['ImageID'])]

In [396]:
data_select

Unnamed: 0,filename,fullpath
0,3cef2522b36f8c57,/media/yshi/4TB/openImage/train/3cef2522b36f8c...
1,3b4a15883505f0b6,/media/yshi/4TB/openImage/train/3b4a15883505f0...
2,1f7b221bfa9aacb9,/media/yshi/4TB/openImage/train/1f7b221bfa9aac...
3,0ae10d73844dea18,/media/yshi/4TB/openImage/train/0ae10d73844dea...
4,0b83dc18b027bc81,/media/yshi/4TB/openImage/train/0b83dc18b027bc...
5,3043cb9122c4d81c,/media/yshi/4TB/openImage/train/3043cb9122c4d8...
6,0cc625bbafb01cbe,/media/yshi/4TB/openImage/train/0cc625bbafb01c...
7,026f29e240a92aff,/media/yshi/4TB/openImage/train/026f29e240a92a...
8,2ca0b30070eb8e46,/media/yshi/4TB/openImage/train/2ca0b30070eb8e...
9,03cccddeded8eaa1,/media/yshi/4TB/openImage/train/03cccddeded8ea...


There are 487973 rows, so 487973 of the 493994 downloaded images carry at least one of the high-frequency labels.

In [397]:
train_image_select = train_image_select.loc[train_image_select['ImageID'].isin(datapath['filename'])]

In [398]:
train_image_select

Unnamed: 0_level_0,Counts,description,ImageID,Source,Confidence
LabelName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
/m/011_f4,648,String instrument,0000201cd362f303,verification,1.0
/m/011_f4,648,String instrument,0000aa810854dc2e,verification,1.0
/m/011_f4,648,String instrument,0000b3e5921ab7ff,verification,1.0
/m/011_f4,648,String instrument,0000c33c6f4b8518,verification,1.0
/m/011_f4,648,String instrument,0000d59fa570d973,verification,1.0
/m/011_f4,648,String instrument,0002024f996741eb,verification,1.0
/m/011_f4,648,String instrument,00027ae34cb09507,verification,1.0
/m/011_f4,648,String instrument,0002a3c01c926a49,verification,1.0
/m/011_f4,648,String instrument,0002eed6350f1840,verification,1.0
/m/011_f4,648,String instrument,00036d5834d932bf,verification,1.0


So in these 487973 images there are 1868410 labels, an average of about 3.8 labels per image.
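
The image count and the per-image average can be read directly from a label table. A minimal sketch, assuming a made-up frame `toy_selected_rows` in place of `train_image_select`; the last line only redoes the division for the two figures quoted above.

```python
import pandas as pd

# Toy stand-in for train_image_select; one row per (image, label) pair, values made up.
toy_selected_rows = pd.DataFrame({
    "ImageID": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "LabelName": ["/m/p", "/m/t", "/m/c", "/m/p", "/m/p", "/m/t"],
})

n_labels = len(toy_selected_rows)                  # label rows
n_images = toy_selected_rows["ImageID"].nunique()  # distinct images
mean_labels_per_image = n_labels / n_images

# The same ratio for the figures quoted in the text.
quoted_ratio = 1868410 / 487973
print(n_labels, n_images, mean_labels_per_image, round(quoted_ratio, 2))
```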