# Flickr30k dataset exploration

* This notebook investigates how captions are generated on the Flickr30k dataset, in order to understand what is needed to produce captions on the UnRel dataset.

## Imports

In [1]:
import json
import os

In [2]:
CWD = os.getcwd()
CWD

'/home/inzouzouwetrust/MVA/Cours_S1/RECVIS/RECVIS_final_project'

In [3]:
dataset_filename = os.path.join(CWD, "NBT", "data", "flickr30k", "dataset_flickr30k.json")

In [12]:
with open(dataset_filename, "r") as f:
    dataset = json.load(f, encoding="utf-8")

In [25]:
splits = {}

for img in dataset["images"]:
    if splits.get(img["split"]):
        splits[img["split"]] += 1
    else:
        splits[img["split"]] = 1

In [26]:
splits

{u'test': 1000, u'train': 29000, u'val': 1014}

**Comment:**

* My earlier remark that only the "val" split was used in NBT might be wrong: Karpathy's original splits do include a "test" split.
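
* The split tally above can also be written with ``collections.Counter``. This is only a sketch on a hypothetical toy list (``toy_images`` stands in for ``dataset["images"]``), not a change to the cells above.

```python
from collections import Counter

# Hypothetical toy stand-in for dataset["images"]; only the "split" key matters here.
toy_images = [{"split": "train"}, {"split": "train"}, {"split": "val"}, {"split": "test"}]

# Counter handles the "first time we see this key" case for us.
split_counts = Counter(img["split"] for img in toy_images)
print(dict(split_counts))
```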

## dataset_unrel.json

* My goal now is to preprocess the UnRel captions that we produced in order to generate a file with the same structure as the original ``dataset_flickr30k.json``.

In [4]:
import csv

In [5]:
root = {"dataset": "UnRel",
        "images": list()}

unrel_dataset_filename = os.path.join(CWD, "data", "unrelcropped.csv")
with open(unrel_dataset_filename, "r") as f:
    reader = csv.reader(f, delimiter=",")
    for i, row in enumerate(reader):
        if i == 0: # Header information
            continue
            
        filename = row[0]
        sent1 = row[1]
        sent1id = int(row[2])
        sent2 = row[3]
        sent2id = int(row[4])
        sent3 = row[5]
        sent3id = int(row[6])
        imgid = int(filename.split(".")[0])
        
        #sent1_tokens = {}
        #for i, word in enumerate(sent1.split(" ")):
        #    sent1_tokens[i] = word.strip(".")
        
        sentences = list()
        for tup in [(sent1, sent1id), (sent2, sent2id), (sent3, sent3id)]:
            sent, sentid = tup
            # Keep tokens as an ordered list of words: integer dict keys
            # would be turned into strings by json.dump
            tokens = [word.strip(".") for word in sent.split(" ")]
            sentences.append({"tokens": tokens,
                              "raw": sent,
                              "imgid": imgid,
                              "sentid": sentid})
        
        # Build the per-image dictionary
        tmp = {"sentids": [sent1id, sent2id, sent3id],
               "imgid": imgid,
               "split": "test",
               "filename": filename,
               "sentences": sentences}
        
        # Append it to the images list
        root["images"].append(tmp)
        
# Dump the json
with open(os.path.join(CWD, "data", "dataset_unrel.json"), "w") as f:
    json.dump(root, f)
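
* A quick sanity check on the generated structure could look like the sketch below. The helper ``find_format_problems`` and the toy entry are hypothetical; the expected keys are the ones listed by ``dataset["images"][0].keys()`` later in this notebook.

```python
# Keys of one image entry, as printed later in this notebook.
EXPECTED_IMAGE_KEYS = {"sentids", "imgid", "sentences", "split", "filename"}


def find_format_problems(root):
    """Return a list of readable problems found in a dataset dictionary (hypothetical helper)."""
    problems = []
    for image in root["images"]:
        name = image.get("filename")
        missing = EXPECTED_IMAGE_KEYS - set(image)
        if missing:
            problems.append("%s: missing keys %s" % (name, sorted(missing)))
        sentences = image.get("sentences", [])
        # The image-level id list should mirror the ids of its sentences.
        if image.get("sentids") != [s.get("sentid") for s in sentences]:
            problems.append("%s: sentids do not match the sentences" % name)
        # Every sentence should point back to the image that owns it.
        if any(s.get("imgid") != image.get("imgid") for s in sentences):
            problems.append("%s: a sentence points to another image" % name)
    return problems


# Hypothetical toy entry with made-up values.
toy_root = {"dataset": "UnRel",
            "images": [{"sentids": [0, 1, 2],
                        "imgid": 7,
                        "split": "test",
                        "filename": "7.jpg",
                        "sentences": [{"raw": "A dog.", "imgid": 7, "sentid": 0},
                                      {"raw": "One dog.", "imgid": 7, "sentid": 1},
                                      {"raw": "The dog.", "imgid": 7, "sentid": 2}]}]}

print(find_format_problems(toy_root))  # an empty list means nothing was flagged
```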

## cap_unrel.json

* We do it the **COCO** way. ``root`` is a list with one entry per image; each entry is a list of *captions*, and each caption is a sequence of *tokens*.

In [13]:
dataset_filename = os.path.join(CWD, "data", "dataset_unrel.json")
caption_filename = os.path.join(CWD, "data", "cap_unrel.json")

with open(dataset_filename, "r") as f:
    dataset = json.load(f, encoding="utf-8")

In [14]:
n_images = len(dataset["images"])
n_captions = 3
print("Annotated images in UnRel: ", n_images)
print("Number of captions per image: ", n_captions)

('Annotated images in UnRel: ', 115)
('Number of captions per image: ', 3)


In [21]:
root = list()

# One entry per image, holding the tokens of each of its captions
for image in dataset["images"]:
    sentences = [sentence["tokens"] for sentence in image["sentences"]]
    root.append(sentences)
    
with open(caption_filename, "w") as f:
    json.dump(root, f)
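
* To confirm that every image ends up with the expected number of captions, a small check along these lines may help. ``caption_shape_problems`` and ``toy_captions`` are hypothetical names used only for illustration.

```python
def caption_shape_problems(captions, n_captions):
    """Return the indices of images whose caption count differs from n_captions."""
    return [i for i, image_captions in enumerate(captions)
            if len(image_captions) != n_captions]


# Hypothetical toy data: two images, the second one deliberately one caption short.
toy_captions = [
    [["a", "dog"], ["the", "dog"], ["one", "dog"]],
    [["a", "cat"], ["the", "cat"]],
]

print(caption_shape_problems(toy_captions, n_captions=3))  # -> [1]
```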

## dic_unrel.json

* We do it the **COCO** way but using **Flickr30k** vocabulary since there is a greater overlap (see Categories_Exploration notebook).

* We only wish to modify the ``images`` field in ``dic_flickr30k.json`` to populate it with the **UnRel** images instead. We save the result as ``dic_unrel.json``.

In [22]:
dataset_filename = os.path.join(CWD, "data", "dataset_unrel.json")
flickr30k_dic_filename = os.path.join(CWD, "NBT", "data", "flickr30k", "dic_flickr30k.json")
unrel_dic_filename = os.path.join(CWD, "data", "dic_unrel.json")

with open(dataset_filename, "r") as f:
    dataset = json.load(f, encoding="utf-8")
    
with open(flickr30k_dic_filename, "r") as f:
    flickr30k_dic = json.load(f, encoding="utf-8")

In [27]:
dataset["images"][0].keys()

[u'sentids', u'imgid', u'sentences', u'split', u'filename']

In [28]:
flickr30k_dic["images"][0].keys()

[u'file_path', u'id', u'split']

In [32]:
# Shallow copy: only the top-level "images" key is replaced below,
# every other field is shared with flickr30k_dic
unrel_dic = dict(flickr30k_dic)

# Build a list of images to replace the current flickr30k_dic field
images = list()
for image in dataset["images"]:
    images.append({"file_path": image["filename"],
                   "id": image["imgid"],
                   "split": image["split"]})
unrel_dic["images"] = images

# Save UnRel dictionary
with open(unrel_dic_filename, "w") as f:
    json.dump(unrel_dic, f)
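
* A note on the copy: ``dict(...)`` only copies the top level. That is enough here because we rebind ``images`` rather than mutate it, but the sketch below (with a made-up ``vocab`` key standing in for the other fields) shows where a shallow copy would leak changes back into the original.

```python
import copy

# Hypothetical stand-in for flickr30k_dic; "vocab" is a made-up key for illustration.
original = {"vocab": {"1": "dog"}, "images": [{"id": 0}]}

shallow = dict(original)
shallow["images"] = []           # rebinding a top-level key leaves `original` alone
shallow["vocab"]["2"] = "cat"    # mutating a shared nested object changes both

deep = copy.deepcopy(original)
deep["vocab"]["3"] = "bird"      # a deep copy is fully independent

print(original["images"])        # still [{'id': 0}]
print(sorted(original["vocab"])) # ['1', '2']
```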

## Now it is time to modify the source code to introduce the UnRel dataset

* Here we keep track of the changes we have made so far:
  * ``demo.py`` => ``demo_unrel.py``
    * Remove ``bboxs`` and ``masks`` in demo code (not used anyway)
  * ``main.py`` => ``main_unrel.py``
  * ``dataloader_flickr30k.py`` => ``dataloader_unrel.py``
    * Add ``utils.RandomCrop``
    * Rewrite the proposals h5 file loading part
    * Get rid of useless ``gt_seq``, ``gt_bboxes``, ``mask``, ``input_seq``...
  * ``utils.RandomCropWithBbox`` => ``utils.RandomCrop``
    * Remove the ``bboxs`` component in the random crop
    
* Next step: try ``demo_unrel.py``.

**Current progress:**

* The proposals I am using only contain the 

## Language evaluation .json file

* This section aims to create a .json file similar to ``caption_flickr30k.json`` found in ``tools/coco-caption/pycocotools/annotations`` so that the language evaluation can be conducted.

* The expected file format is as follows: ``root = {"images": [{"file_name": ..., "id": ...}, ...], "info": None, "licenses": None, "type": "captions", "annotations": [{"image_id": ..., "id": ..., "caption": ...}, ...]}``. Here ``images`` holds one entry per caption (so each image is duplicated) and ``caption`` is the raw sentence.
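
* To make the format concrete, here is a tiny instance of it. All values (file name, ids, captions) are made up for illustration and do not come from UnRel.

```python
import json

# Hypothetical values: one image ("7.jpg", id 7) with two captions.
example = {
    "images": [{"file_name": "7.jpg", "id": 7},
               {"file_name": "7.jpg", "id": 7}],  # repeated once per caption
    "info": None,
    "licenses": None,
    "type": "captions",
    "annotations": [{"image_id": 7, "id": 0, "caption": "A dog rides a bike."},
                    {"image_id": 7, "id": 1, "caption": "A dog on a bicycle."}],
}

# None is written as null in the resulting JSON.
print(json.dumps(example, indent=2, sort_keys=True))
```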

In [1]:
import json
import os

CWD = os.getcwd()

In [6]:
with open(os.path.join(CWD, "data", "dataset_unrel.json"), "r") as f:
    dataset = json.load(f)

In [7]:
dataset.keys()

[u'images', u'dataset']

In [11]:
root = {"images": list(),
        "info": None,
        "licenses": None,
        "type": "captions",
        "annotations": list()}

# One "images" entry and one "annotations" entry per caption
for image in dataset["images"]:
    for sent in image["sentences"]:
        root["images"].append({"file_name": image["filename"],
                               "id": image["imgid"]})
        root["annotations"].append({"image_id": image["imgid"],
                                    "id": sent["sentid"],
                                    "caption": sent["raw"]})
        
# Dump the json
with open(os.path.join(CWD, "NBT", "tools", "coco-caption", "annotations", "caption_unrel.json"), "w") as f:
    json.dump(root, f)