# Example code for exploring `dataset/attributes_MetaShift/attributes-candidate-subsets.pkl`

### Understanding `dataset/attributes_MetaShift/attributes-candidate-subsets.pkl`
`dataset/attributes_MetaShift/attributes-candidate-subsets.pkl` stores the metadata for MetaShift-Attributes, where each subset is defined by the attribute of the subject, e.g. `cat(orange)`, `cat(white)`, `dog(sitting)`, `dog(jumping)`. 

`attributes-candidate-subsets.pkl` has the same data format as `full-candidate-subsets.pkl`. To facilitate understanding, we have provided a notebook `dataset/attributes_MetaShift/understanding_attributes-candidate-subsets-pkl.ipynb` to show how to extract information from it. 

The pickle file stores a `collections.defaultdict(set)` object with *4,962* keys. Each key is a subset name such as `cat(orange)`, and the corresponding value is the set of IDs of the images that belong to this subset. The image IDs can be used to retrieve the image files from the Visual Genome dataset that you just downloaded.
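As a minimal sketch of this layout, the toy mapping below mimics the structure with made-up image IDs and shows how a subset name could be split into its subject and attribute. The IDs and the `parse_subset_name` helper are illustrative assumptions and are not part of MetaShift.

```python
import re
from collections import defaultdict

# Toy stand-in for the unpickled object; the image IDs here are made up.
toy_subsets = defaultdict(set)
toy_subsets["cat(orange)"].update({"1001", "1002"})
toy_subsets["dog(sitting)"].add("1003")

def parse_subset_name(name):
    """Split 'subject(attribute)' into (subject, attribute). Hypothetical helper."""
    match = re.fullmatch(r"(.+?)\((.+)\)", name)
    if match is None:
        raise ValueError(f"unexpected subset name: {name!r}")
    return match.group(1), match.group(2)

for name in sorted(toy_subsets):
    subject, attribute = parse_subset_name(name)
    print(subject, attribute, sorted(toy_subsets[name]))
```

Note that indexing a `defaultdict` with a missing key silently inserts an empty set, so `name in toy_subsets` or `toy_subsets.get(name)` is the safer way to test whether a subset exists.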

### Understanding `dataset/attributes_MetaShift/structured-attributes-candidate-subsets.pkl`
`dataset/attributes_MetaShift/structured-attributes-candidate-subsets.pkl` is very similar to `dataset/attributes_MetaShift/attributes-candidate-subsets.pkl`, but stores the metadata in a more structured way. The pickle file stores a 3-level nested dictionary, with the following structure:

```plain
.
├── key: 'color'
│   └── key: 'cat'
│       └── key: 'orange'
│           └── value: a list of image IDs
├── key: 'activity'
│   └── key: 'dog'
│       ├── key: 'sitting'
│       │   └── value: a list of image IDs
│       └── ...
```

See the full attribute ontology in `ATTRIBUTE_CONTEXT_ONTOLOGY` in `dataset/Constants.py`.
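To make the relationship between the two files concrete, here is a rough sketch that flattens a nested dictionary of this shape into `subject(attribute)` names. The toy data, the image IDs, and the `flatten_structured` helper are assumptions for illustration; the sketch also assumes the flat name simply drops the top-level attribute type.

```python
# Toy nested dictionary in the layout shown above; the image IDs are made up.
toy_structured = {
    "color": {"cat": {"orange": ["1001", "1002"], "white": ["1004"]}},
    "activity": {"dog": {"sitting": ["1003"]}},
}

def flatten_structured(structured):
    """Map 'subject(attribute)' -> sorted image IDs. Hypothetical helper."""
    flat = {}
    # The top-level key (e.g. 'color') is not part of the flat name, so it is ignored.
    for _attribute_type, subjects in structured.items():
        for subject, attributes in subjects.items():
            for attribute, img_ids in attributes.items():
                flat[f"{subject}({attribute})"] = sorted(img_ids)
    return flat

flat_subsets = flatten_structured(toy_structured)
print(sorted(flat_subsets))
```

If the same subject and attribute pair appeared under two attribute types, this sketch would keep only the last one, so a real conversion may need to merge the ID collections instead.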

## Part A: Understanding `attributes-candidate-subsets.pkl`

In [4]:
import pickle
import numpy as np
from collections import Counter, defaultdict
import pprint
from PIL import Image
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import shutil # for copy files
import networkx as nx # graph vis
import pandas as pd

In [5]:
# Visual Genome based MetaShift
def load_candidate_subsets():
    pkl_save_path = "./attributes-candidate-subsets.pkl" 
    with open(pkl_save_path, "rb") as pkl_f:
        load_data = pickle.load( pkl_f )
        print('pickle load', len(load_data), pkl_save_path)
    return load_data

VG_node_name_to_img_id = load_candidate_subsets()

pickle load 4962 ./attributes-candidate-subsets.pkl


In [6]:
assert isinstance(VG_node_name_to_img_id, defaultdict)
print('attributes-candidate-subsets.pkl is a ', type(VG_node_name_to_img_id) )

attributes-candidate-subsets.pkl is a  <class 'collections.defaultdict'>


In [7]:
print('attributes-candidate-subsets.pkl contains', len(VG_node_name_to_img_id), 'keys (or, subsets)')

attributes-candidate-subsets.pkl contains 4962 keys (or, subsets)


In [8]:
img_IDs = sorted(VG_node_name_to_img_id['cat(orange)'])
print('Number of images in this subset:', len(img_IDs) )
img_IDs[:10]

Number of images in this subset: 107


['107962',
 '2315813',
 '2318038',
 '2318872',
 '2319323',
 '2320055',
 '2320210',
 '2320521',
 '2321421',
 '2324716']
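The IDs can then be turned into file paths for the downloaded Visual Genome images. The sketch below is only an illustration: the `IMAGE_DIR` location and the `<id>.jpg` naming are assumptions about the local setup, and `image_paths` is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical location of the downloaded Visual Genome images; adjust as needed.
IMAGE_DIR = Path("./VG_images")

def image_paths(img_ids, image_dir=IMAGE_DIR, suffix=".jpg"):
    """Build one candidate file path per image ID (assumes '<id>.jpg' naming)."""
    return [image_dir / f"{img_id}{suffix}" for img_id in img_ids]

example_ids = ["107962", "2315813"]  # first two IDs from the output above
paths = image_paths(example_ids)

# Report how many of the candidate paths are missing, without failing if none exist.
missing = [p for p in paths if not p.exists()]
print(len(paths), "paths built,", len(missing), "not found on disk")
```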

In [11]:
# VG_node_name_to_img_id.keys()

## Part B: Understanding `structured-attributes-candidate-subsets.pkl`

In [15]:
# Visual Genome based MetaShift
def load_structured_candidate_subsets():
    pkl_save_path = "./structured-attributes-candidate-subsets.pkl" 
    with open(pkl_save_path, "rb") as pkl_f:
        load_data = pickle.load( pkl_f )
        print('pickle load', len(load_data), pkl_save_path)
    return load_data

structured_VG_node_name_to_img_id = load_structured_candidate_subsets()

pickle load 23 ./structured-attributes-candidate-subsets.pkl


The lookup in the next cell is equivalent to
```py
img_IDs = sorted(VG_node_name_to_img_id['cat(orange)'])
```
which we saw in "Part A: Understanding `attributes-candidate-subsets.pkl`".

In [20]:
img_IDs = sorted(structured_VG_node_name_to_img_id['color']['cat']['orange'])
print('Number of images in this subset:', len(img_IDs) )
img_IDs[:10]

Number of images in this subset: 107


['107962',
 '2315813',
 '2318038',
 '2318872',
 '2319323',
 '2320055',
 '2320210',
 '2320521',
 '2321421',
 '2324716']

In [21]:
structured_VG_node_name_to_img_id['color']['cat'].keys()

dict_keys(['gray', 'black', 'white', 'brown', 'orange', 'yellow', 'tan', 'dark', 'gold', 'light brown', 'pink', 'red', 'beige', 'green'])
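Beyond listing the keys, it can be handy to see how many images each attribute contributes for a subject. The following is a small sketch on toy data; the image IDs and the `attribute_sizes` helper are made up for illustration, and with the real file you would pass `structured_VG_node_name_to_img_id` instead.

```python
# Toy stand-in for structured_VG_node_name_to_img_id; the image IDs are made up.
toy_structured = {
    "color": {
        "cat": {
            "orange": ["1001", "1002"],
            "white": ["1004"],
            "gray": ["1005", "1006", "1007"],
        }
    },
}

def attribute_sizes(structured, attribute_type, subject):
    """Return (attribute, image count) pairs, largest subset first. Hypothetical helper."""
    attributes = structured[attribute_type][subject]
    sizes = [(attribute, len(img_ids)) for attribute, img_ids in attributes.items()]
    # Sort by descending count, breaking ties alphabetically by attribute name.
    return sorted(sizes, key=lambda pair: (-pair[1], pair[0]))

print(attribute_sizes(toy_structured, "color", "cat"))
```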