# Example code for exploring `dataset/meta_data/full-candidate-subsets.pkl`

### Understanding `dataset/meta_data/full-candidate-subsets.pkl`
The metadata file `dataset/meta_data/full-candidate-subsets.pkl` is the most important piece of metadata in MetaShift: it provides the full subset information of MetaShift. To facilitate understanding, we have provided a notebook, `dataset/understanding_full-candidate-subsets-pkl.ipynb`, that shows how to extract information from it.

The pickle file stores a `collections.defaultdict(set)` object that contains *17,938* keys. Each key is a string giving the subset name, such as `dog(frisbee)`, and the corresponding value is a set of the IDs of the images that belong to this subset. The image IDs can be used to retrieve the image files from the Visual Genome dataset that you just downloaded. In our current version, *13,543* out of *17,938* subsets have at least 25 valid images. In addition, `dataset/meta_data/full-candidate-subsets.pkl` is derived from the [scene graph annotation](https://nlp.stanford.edu/data/gqa/sceneGraphs.zip), so check it out if your project needs additional information about each image.
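
As a minimal sketch of the structure described above, the following builds a toy `defaultdict(set)`, round-trips it through pickle, and counts the subsets above a size threshold. The subset contents, the image IDs, the file name `toy-subsets.pkl`, and the helper `count_large_subsets` are all made up for illustration; they are not part of MetaShift.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path

# Toy stand-in for the real metadata: subset name -> set of image-ID strings.
# The image IDs below are invented for illustration only.
toy_subsets = defaultdict(set)
toy_subsets["dog(frisbee)"].update({"101", "102", "103"})
toy_subsets["dog(grass)"].update({"102", "104"})
toy_subsets["cat(sofa)"].add("105")


def count_large_subsets(subsets, min_images):
    """Count the subsets that hold at least `min_images` image IDs."""
    return sum(1 for ids in subsets.values() if len(ids) >= min_images)


# Round-trip through pickle, mirroring how a file like this is written and read.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy-subsets.pkl"
    with open(toy_path, "wb") as f:
        pickle.dump(toy_subsets, f)
    with open(toy_path, "rb") as f:
        loaded = pickle.load(f)

print(type(loaded).__name__, len(loaded))
print(sorted(loaded["dog(frisbee)"]))
print(count_large_subsets(loaded, min_images=2))
```

One thing to keep in mind with a `defaultdict`: indexing a missing key (for example `loaded["no-such-subset"]`) silently inserts an empty set and changes `len(loaded)`. Using `"name" in loaded` or `loaded.get("name")` avoids that side effect.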

In [1]:
import pickle
import numpy as np
import json, re, math
from collections import Counter, defaultdict
from itertools import repeat
import pprint
import os, errno
from PIL import Image
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import shutil # for copy files
import networkx as nx # graph vis
import pandas as pd
from sklearn.decomposition import TruncatedSVD


In [2]:
# Compare with Visual Genome based MetaDataset
def load_candidate_subsets():
    pkl_save_path = "./meta_data/full-candidate-subsets.pkl" 
    with open(pkl_save_path, "rb") as pkl_f:
        load_data = pickle.load( pkl_f )
        print('pickle load', len(load_data), pkl_save_path)
    return load_data

VG_node_name_to_img_id = load_candidate_subsets()

pickle load 17938 ./meta_data/full-candidate-subsets.pkl


In [7]:
assert isinstance(VG_node_name_to_img_id, defaultdict)
print('full-candidate-subsets.pkl is a ', type(VG_node_name_to_img_id) )

full-candidate-subsets.pkl is a  <class 'collections.defaultdict'>


In [6]:
print('full-candidate-subsets.pkl contains', len(VG_node_name_to_img_id), 'keys/subsets')

full-candidate-subsets.pkl contains 17938 keys/subsets


In [14]:
img_IDs = sorted(VG_node_name_to_img_id['dog(frisbee)'])
img_IDs[:10]

['1592975',
 '2315447',
 '2316506',
 '2316592',
 '2317615',
 '2317786',
 '2318046',
 '2318049',
 '2318692',
 '2319590']
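
The image IDs above can then be mapped to files on disk. The sketch below is only an illustration: it assumes each image is stored as `<image ID>.jpg` in a single directory, which this section does not state, so adjust the directory and extension to match your download. The helper name `image_paths_for_ids` is hypothetical.

```python
import tempfile
from pathlib import Path


def image_paths_for_ids(image_ids, image_dir, extension=".jpg"):
    """Build candidate paths for the given IDs and split them into found/missing.

    Assumption: files are named "<image ID><extension>" inside `image_dir`.
    """
    found, missing = [], []
    for image_id in sorted(image_ids):
        candidate = Path(image_dir) / f"{image_id}{extension}"
        (found if candidate.is_file() else missing).append(candidate)
    return found, missing


# Demo on a temporary directory that contains only one of the two images.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "1592975.jpg").touch()
    found, missing = image_paths_for_ids({"1592975", "2315447"}, demo_dir)
    print("found:", [p.name for p in found])
    print("missing:", [p.name for p in missing])
```

Reporting the missing paths separately makes it easier to spot an incomplete download or a wrong directory before any images are loaded.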

In [11]:
IMAGE_SUBSET_SIZE_THRESHOLD = 25
counter_large_enough_subsets = 0
node_name_to_img_id = VG_node_name_to_img_id
# Count the subsets that contain at least IMAGE_SUBSET_SIZE_THRESHOLD images.
for node_name, imageID_set in node_name_to_img_id.items():
    if len(imageID_set) >= IMAGE_SUBSET_SIZE_THRESHOLD:
        counter_large_enough_subsets += 1

print('Number of large enough subsets (i.e., with >={} images):'.format(IMAGE_SUBSET_SIZE_THRESHOLD), counter_large_enough_subsets)

Number of large enough subsets (i.e., with >=25 images): 13543
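
Because every key follows the `name(name)` pattern, it can be split into its two parts for grouping. The sketch below is one possible way to do this; reading the outer part as a subject class and the inner part as its context is an interpretation rather than something stated in this section, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Matches keys shaped like "dog(frisbee)": text, then text in parentheses.
SUBSET_NAME_PATTERN = re.compile(r"^(?P<subject>[^()]+)\((?P<context>[^()]+)\)$")


def split_subset_name(name):
    """Return (subject, context) for a key like 'dog(frisbee)', or None."""
    match = SUBSET_NAME_PATTERN.match(name)
    if match is None:
        return None
    return match.group("subject"), match.group("context")


def subsets_per_subject(subset_names):
    """Count how many subsets share the same outer (subject) name."""
    counts = Counter()
    for name in subset_names:
        parsed = split_subset_name(name)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


example_names = ["rice(bowl)", "rice(meat)", "spoon(rice)", "dog(frisbee)"]
print(split_subset_name("dog(frisbee)"))
print(subsets_per_subject(example_names).most_common(1))
```

With the real metadata loaded, passing `VG_node_name_to_img_id.keys()` to `subsets_per_subject` would give the number of subsets per outer name.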


In [15]:
VG_node_name_to_img_id.keys()

dict_keys(['rice(bowl)', 'rice(meat)', 'rice(spoon)', 'rice(plate)', 'spots(banana)', 'banana(plate)', 'banana(bowl)', 'meat(rice)', 'meat(spoon)', 'meat(bowl)', 'meat(onions)', 'meat(plate)', 'spoon(rice)', 'spoon(banana)', 'spoon(meat)', 'spoon(tablecloth)', 'spoon(bowl)', 'spoon(meal)', 'spoon(onions)', 'spoon(dish)', 'spoon(plate)', 'plantains(banana)', 'tablecloth(bowl)', 'tablecloth(plate)', 'tablecloth(spoon)', 'bowl(rice)', 'bowl(banana)', 'bowl(meat)', 'bowl(spoon)', 'bowl(tablecloth)', 'bowl(bananas)', 'bowl(onions)', 'bowl(dish)', 'bowl(plate)', 'meal(plate)', 'meal(spoon)', 'bananas(plate)', 'bananas(bowl)', 'onions(meat)', 'onions(bowl)', 'onions(plate)', 'dish(rice)', 'dish(meat)', 'dish(spoon)', 'dish(bowl)', 'dish(plate)', 'plate(rice)', 'plate(banana)', 'plate(meat)', 'plate(spoon)', 'plate(tablecloth)', 'plate(bowl)', 'plate(meal)', 'plate(bananas)', 'plate(onions)', 'ground(boy)', 'ground(leaves)', 'ground(snow)', 'bushes(tree)', 'road(boy)', 'leaves(ground)', 'leave