# Reading the RefCOCOg dataset
The visual grounding task in this assignment uses the RefCOCOg dataset, a variant of the Referring Expression Generation (REG) dataset.
The RefCOCOg dataset consists of 25,799 images, each of which has an average of 3.7 referring expressions. As with RefCOCO+, location words
are not allowed in the referring expressions, which contain only appearance-based descriptions that are
independent of viewer perspective. This makes the dataset well suited to visual grounding, as it
requires learning a mapping between the visual appearance of an object and its corresponding linguistic
label.

To access and download the dataset, you can use the [Google Drive link provided](https://drive.google.com/uc?id=1P8a1g76lDJ8cMIXjNDdboaRR5-HsVmUb). Please note that the annotations folder contains two refer files in pickle format (`refs(google).p` and `refs(umd).p`). For this exercise, we will use the second one, `refs(umd).p`.

There is no predefined torchvision dataset class appropriate for the visual grounding task. As a result, in this notebook we create a custom dataset class to load and read the dataset correctly.

In [None]:
!mkdir dataset
#!gdown 1xijq32XfEm6FPhUb7RsZYWHc2UuwVkiq <-- this was the resource shared by the professor
!gdown 1P8a1g76lDJ8cMIXjNDdboaRR5-HsVmUb
!mv refcocog.tar.gz ./dataset/
!ls dataset
!tar -xf dataset/refcocog.tar.gz -C dataset
!ls dataset

Inspect the data directory using the built-in `os.walk()` to walk through each subdirectory and count the files present.

In [None]:
from pathlib import Path
import os
data_path = Path("dataset/refcocog/")
for dirpath, dirnames, filenames in os.walk(data_path):
  print(f"There are {len(dirnames)} directories and {len(filenames)} files in '{dirpath}'.")

# JSON Parsing
Little documentation about the structure of the RefCOCOg files is available online. Moreover, the files are too large to be inspected with a text editor. Hence, the following code explores the structure of the `instances.json` file.

In [None]:
import json


data_path = Path("dataset/refcocog/")

# Opening JSON file
f = open(data_path/"annotations/instances.json")
  
# returns JSON object as a dictionary
data = json.load(f)
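
Before walking through the levels one by one, a compact overview of the nesting can help. The sketch below is one possible way to get it: it lists each key with its type and, for containers, its length. The helper names `describe` and `summarize` and the `toy` dictionary are made up for illustration, and the code assumes that all elements of a list share the same structure, so only the first element is inspected.

```python
def describe(value):
    """Short type description, with the length for containers."""
    if isinstance(value, (dict, list)):
        return f"{type(value).__name__}[{len(value)}]"
    return type(value).__name__


def summarize(obj, depth=0, max_depth=2):
    """Return indented 'key: type' lines for a decoded JSON value."""
    lines = []
    if isinstance(obj, list):
        # Assumption: list elements share one structure,
        # so the first element is taken as representative.
        obj = obj[0] if obj else None
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            lines.append("  " * depth + f"{key}: {describe(value)}")
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines


# Made-up miniature of a COCO-style file, only for demonstration.
toy = {
    "info": {"year": 2014},
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "annotations": [],
}
print("\n".join(summarize(toy)))
```

On the real file the same call would be `print("\n".join(summarize(data)))`, with `data` being the dictionary returned by `json.load` above.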

## First level

In [None]:
# Iterating through the json list
print("FIRST LEVEL")
for i in data:
    print(i)
print("==============")

## INFO

In [None]:
print("INFO")
for i in data["info"]:
    print(i)
print("==============")

print("INFO - DESCRIPTION")
print(data["info"]["description"])
print("==============")

print("INFO - URL")
print(data["info"]["url"])
print("==============")

print("INFO - VERSION")
print(data["info"]["version"])
print("==============")

print("INFO - YEAR")
print(data["info"]["year"])
print("==============")

print("INFO - CONTRIBUTOR")
print(data["info"]["contributor"])
print("==============")

print("INFO - DATE_CREATED")
print(data["info"]["date_created"])
print("==============")

## IMAGES

In [None]:
import cv2 as cv
from google.colab.patches import cv2_imshow #this module is required otherwise cv2.imshow() is disabled in Colab, because it causes Jupyter sessions to crash
import numpy as np
import matplotlib.pyplot as plt

print("IMAGES")
print("data['images'].len : "+str(len(data["images"]))) #25799 (train 21899 + val 1300 + test 2600)

print("Sample") #show only the first three examples since the whole dataset is too big

print(data["images"][0])
cv_image = cv.imread("dataset/refcocog/images/"+data["images"][0]["file_name"])
cv2_imshow(cv_image);

print(data["images"][1])
cv_image = cv.imread("dataset/refcocog/images/"+data["images"][1]["file_name"])
cv2_imshow(cv_image);

print(data["images"][2])
cv_image = cv.imread("dataset/refcocog/images/"+data["images"][2]["file_name"])
cv2_imshow(cv_image);

print("==============")

## LICENSES

In [None]:
print("LICENSES")
for i in data["licenses"]:
    print(i)
print("==============")

## ANNOTATIONS

In [None]:
import random

print("ANNOTATIONS")
print("data['annotations'].len : "+str(len(data["annotations"]))) #208960

print("Sample") #get only three random examples since the whole dataset is too big

#bbox: [xmin, ymin, width, height]

for j in range(3):
  random_number = random.randint(1, 20000)
  print("data['annotations']["+str(random_number)+"]")
  for i in data["annotations"][random_number]:
      print(i)
      print(data["annotations"][random_number][i])

  print("")
  print("")

The segmentation format depends on whether the instance represents a single object (iscrowd=0 in which case polygons are used) or a collection of objects (iscrowd=1 in which case RLE is used). Note that a single object (iscrowd=0) may require multiple polygons, for example if occluded. Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people).
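
To make the two segmentation layouts concrete, here is a minimal sketch that tells them apart through `iscrowd`. It assumes the usual COCO conventions: a polygon is a flat list `[x1, y1, x2, y2, ...]`, and an RLE segmentation is a dictionary with `size` and `counts` entries. The helper names and the two sample annotations are invented for illustration and are not taken from `instances.json`.

```python
def segmentation_kind(ann):
    """Return 'rle' for crowd annotations and 'polygons' otherwise."""
    return "rle" if ann.get("iscrowd", 0) == 1 else "polygons"


def polygon_points(flat):
    """Turn a flat [x1, y1, x2, y2, ...] list into [(x1, y1), (x2, y2), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Made-up annotations that mimic the two layouts.
single = {"iscrowd": 0, "segmentation": [[10.0, 20.0, 30.0, 20.0, 30.0, 40.0]]}
crowd = {"iscrowd": 1, "segmentation": {"size": [480, 640], "counts": [5, 3, 2]}}

for ann in (single, crowd):
    kind = segmentation_kind(ann)
    if kind == "polygons":
        # A single object may carry several polygons (e.g. when occluded).
        for poly in ann["segmentation"]:
            print(kind, polygon_points(poly))
    else:
        print(kind, "mask size:", ann["segmentation"]["size"])
```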

## CATEGORIES

In [None]:
print("CATEGORIES")
for i in data["categories"]:
    print(i)
print("==============")

## Inspect the annotations of a random image in the dataset

In [None]:
import cv2 as cv
from google.colab.patches import cv2_imshow #this module is required otherwise cv2.imshow() is disabled in Colab, because it causes Jupyter sessions to crash
import numpy as np
import matplotlib.pyplot as plt

random_number = random.randint(1, 20000)
image_id = data["images"][random_number]["id"]
image_filename = data["images"][random_number]["file_name"]
print("image number: "+str(random_number))
print("image id: "+str(image_id))
print(image_filename)

cv_image = cv.imread("dataset/refcocog/images/"+image_filename)
cv2_imshow(cv_image);

In [None]:
#Get annotations about this image
#and store the bounding boxes
bbox_list = list()
for ann in data["annotations"]:
    if ann['image_id']==image_id:
      print("ann_id: "+str(ann["id"]))
      #for i in ann:
        #print(i)
        #print(ann[i])
      bbox_list.append(ann["bbox"])

for i in range(len(bbox_list)):
  bbox = bbox_list[i] #bbox: [xmin, ymin, width, height]

  color = (random.randint(0,255), random.randint(0,255), random.randint(0,255)) #select a random bbox color (randint includes both endpoints)

  cv.rectangle(cv_image, (int(bbox[0]), int(bbox[1])), (int(bbox[0]+bbox[2]), int(bbox[1]+bbox[3])), color, 2)

cv2_imshow(cv_image);

The function `cv::rectangle` draws a rectangle outline or a filled rectangle whose two opposite corners are pt1 and pt2.

**Parameters**
* `img`: Image.
* `pt1`: Vertex of the rectangle.
* `pt2`: Vertex of the rectangle opposite to `pt1`.
* `color`: Rectangle color or brightness (grayscale image).
* `thickness`: Thickness of the lines that make up the rectangle. Negative values, like `FILLED`, mean that the function draws a filled rectangle.
* `lineType`: Type of the line. See `LineTypes`.
* `shift`: Number of fractional bits in the point coordinates.
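
Since the annotations store boxes as `[xmin, ymin, width, height]` while `cv.rectangle` expects two opposite corners, the conversion can be isolated in a small helper. This is only a sketch: the name `bbox_to_corners` is hypothetical, and the coordinates are truncated to integers in the same way as in the cell above.

```python
def bbox_to_corners(bbox):
    """Convert [xmin, ymin, width, height] into the integer corners (pt1, pt2)."""
    x, y, w, h = bbox
    pt1 = (int(x), int(y))          # top-left corner
    pt2 = (int(x + w), int(y + h))  # bottom-right corner
    return pt1, pt2


pt1, pt2 = bbox_to_corners([0.0, 45.95, 238.92, 408.64])
print(pt1, pt2)  # (0, 45) (238, 454)
```

The two points can then be passed directly as the second and third arguments of `cv.rectangle`.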


## Close file

In [None]:
# Closing file
f.close()

# Read the image natural language descriptions
Natural language annotations of the images are stored in `dataset/refcocog/annotations/refs(umd).p`.

In [None]:
from pprint import pprint
import pickle
import random

annotationRoot = "dataset/refcocog/annotations/"
pickleFile = open(annotationRoot+"refs(umd).p", "rb") #open a file, where you stored the pickled data

# load the pickled data from that file
data = pickle.load(pickleFile)

print(len(data))  #49822

print("Three random objects")
print("")
for i in range(3):
  random_number = random.randrange(len(data))  # valid indices are 0..len(data)-1
  print("complete object: " + str(data[random_number]))
  for j in data[random_number]: #explore each field
    if j=="sentences":  #print sentences in a suitable way
      print("sentences["+str(len(data[random_number]["sentences"]))+"]:")
      for k in data[random_number]["sentences"]:
        for sentence_element in k:
          print(sentence_element+": "+str(k[sentence_element]))
        print(".")
    else:
      print(j+": "+str(data[random_number][j]))
  print("")
  print("---")
  print("")

# close the file
pickleFile.close()

Print the referring expressions for the image selected above.

In [None]:
for ref in data:
  if ref["image_id"] == image_id:
    print(ref["file_name"])
    for sentence in ref["sentences"]:
      print(sentence["sent"])
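
The loop above scans every reference object for a single image. When several images have to be looked up, one possible alternative is to group the sentences by `image_id` once and then query a dictionary. The sketch below assumes only the fields already used in this notebook (`image_id`, `sentences`, `sent`); the function name and the `toy_refs` list are made up for illustration.

```python
from collections import defaultdict


def sentences_by_image(refs):
    """Map each image_id to the list of all its referring sentences."""
    index = defaultdict(list)
    for ref in refs:
        for sentence in ref["sentences"]:
            index[ref["image_id"]].append(sentence["sent"])
    return dict(index)


# Made-up reference objects with the same field names as refs(umd).p.
toy_refs = [
    {"image_id": 7, "sentences": [{"sent": "a red bus"}, {"sent": "the bus on the road"}]},
    {"image_id": 9, "sentences": [{"sent": "a man holding an umbrella"}]},
    {"image_id": 7, "sentences": [{"sent": "a dog near the bus"}]},
]
index = sentences_by_image(toy_refs)
print(index[7])
```

An image can appear in several reference objects (one per annotated object), which is why the sentences are appended rather than overwritten.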

# Dataset example: FashionMNIST

In [None]:
import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt


training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

In [None]:
labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

In [None]:
# Display image and label.
train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0].squeeze()
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")

# Create custom dataset

In [None]:
#annotations_file: refcocog/annotations/instances.json
#img_dir: refcocog/images
#reference_exp_file: refcocog/annotations/refs(umd).p

img_dir = Path("dataset/refcocog/images/")
annotations_file = Path("dataset/refcocog/annotations/instances.json")
reference_exp_file = Path("dataset/refcocog/annotations/refs(umd).p")

f = open(annotations_file)
annotations_json = json.load(f)
f.close()

In [None]:
ann_id2index = {
    ann['id']: index
    for index, ann in enumerate(annotations_json["annotations"])
}

In [None]:
import os
import json
from pathlib import Path
import pandas as pd
import torch
from torchvision.io import read_image
from torch.utils.data import Dataset
from torchvision import transforms
from typing import Tuple, Dict, List
import pickle

class CocoDataset(Dataset):

    #split: train, test or val
    #img_transform: apply list of transformations on the processed images
    #exp_transform: apply list of transformations on the processed reference expressions
    #target_transform: apply list of transformations on the bounding box
    #limit: number of dataset elements which we want to consider
    def __init__(
        self,
        split,
        img_transform=None,
        exp_transform=None,
        target_transform=None,
        limit=None
    ):
        self.img_transform = img_transform
        self.exp_transform = exp_transform
        self.target_transform = target_transform

        # Internally the dataset is a list of couples (X,Y)
        # X: index of the reference expression object in refs(umd).p
        # Y: index of the annotation object in instances.json
        self.items = self.load_dataset_index(split, limit)

    def __len__(self):
        return len(self.items)

    #return ((image, sentences), bounding box)
    def __getitem__(self, idx):

        image = self.getImage(idx)
        bbox = self.getBoundingBox(idx)
        sentences = self.getSentences(idx)

        if self.img_transform:
            image = self.img_transform(image)

        if self.exp_transform:
            # build a new list: reassigning the loop variable would discard the result
            sentences = [self.exp_transform(s) for s in sentences]

        if self.target_transform:
            bbox = self.target_transform(bbox)

        return (image, sentences), bbox

    #according to the split [train, test, val]
    #return a list of couples (X,Y) such that:
    # X: index of the reference expression object in refs(umd).p
    # Y: index of the annotation object in instances.json
    def load_dataset_index(self, split, limit):

      # load the pickled reference expressions once and keep them in memory,
      # so that the getters below do not reopen the file on every access
      with open(reference_exp_file, "rb") as pickleFile:
          self.ref_exp_data = pickle.load(pickleFile)

      #iterate over all the reference expressions related to the
      #target split
      items = [
          (ref_index, ann_id2index[ref['ann_id']])
          for ref_index, ref in enumerate(self.ref_exp_data)
          if ref['split'] == split
      ]

      #keep only the first `limit` elements, if a limit was given
      return items if limit is None else items[:limit]


    def getImage(self, idx):
      obj_index = self.items[idx][0]

      file_name = self.ref_exp_data[obj_index]["file_name"]
      # In refs(umd).p the image file name is stored with the following
      # format:
      #   - COCO_train2014_[image_id]_[annotation_id].jpg
      #   - Example: `COCO_train2014_000000130518_104426.jpg`
      # However in the folder images there is not a distinct image
      # file for each annotation. As a consequence the files in
      # images folder are characterized by the following structure:
      #   - COCO_train2014_[image_id].jpg
      # so we have to split the file_name to remove the last part
      # about the annotation id
      spl_filename = file_name.split("_")
      file_name = '_'.join(spl_filename[:-1])+".jpg"

      return read_image(os.path.join(img_dir,file_name))

    def getBoundingBox(self, idx):
      obj_index = self.items[idx][1]

      # annotations_json was loaded once from instances.json in a previous cell;
      # return a copy so that callers can modify the bounding box in place
      return list(annotations_json["annotations"][obj_index]["bbox"])


    def getSentences(self, idx):
      obj_index = self.items[idx][0]

      # the sentences come from the data loaded in load_dataset_index
      sentences = self.ref_exp_data[obj_index]["sentences"]

      return sentences
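
The file name handling described in the comments of `getImage` can be checked in isolation. The sketch below uses `str.rsplit` instead of `split` and `join`, which is a swapped-in but equivalent way to drop the trailing annotation id; the function name `image_file_name` is hypothetical.

```python
def image_file_name(ref_file_name):
    """Drop the trailing '_[annotation_id]' part of a refs(umd).p file name."""
    # rsplit with maxsplit=1 cuts only at the last underscore
    stem, _annotation_part = ref_file_name.rsplit("_", 1)
    return stem + ".jpg"


print(image_file_name("COCO_train2014_000000130518_104426.jpg"))
# COCO_train2014_000000130518.jpg
```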
    

In the following cells we experiment with the custom dataset.

In [None]:
from pathlib import Path
import torch
from torchvision import transforms

In [None]:
# Augment train data (https://www.learnpytorch.io/04_pytorch_custom_datasets/)
train_transforms = transforms.Compose([
    #transforms.Resize((64, 64)),
    #transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor()
])

# Don't augment test data, only reshape
test_transforms = transforms.Compose([
    #transforms.Resize((64, 64)),
    transforms.ToTensor()
])

In [None]:
train_data_custom = CocoDataset(
    split="train",
    img_transform=None,
    exp_transform=None,
    target_transform=None,
    limit=3
)

test_data_custom = CocoDataset(
    split="test",
    img_transform=None,
    exp_transform=None,
    target_transform=None,
    limit=3
)

Display 2 random dataset instances

In [None]:
import random
import numpy as np
import matplotlib.pyplot as plt
import cv2 as cv
from torchvision.utils import draw_bounding_boxes

def display_random_images(dataset: torch.utils.data.dataset.Dataset):
    
    # Get random sample indexes
    random_samples_idx = random.sample(range(len(dataset)), k=2)

    # Setup plot
    plt.figure(figsize=(50, 50))

    # Loop through samples and display random samples 
    for i, targ_sample in enumerate(random_samples_idx):
        input, output = dataset[targ_sample]

        img = input[0]
        sentences = input[1]
        bbox = output #[xmin, ymin, width, height]

        # add bounding box
        # Parameters
        # image: image of type Tensor of shape (C x H x W).
        # boxes: tensor of size [N,4] containing bounding boxes coordinates in (xmin, ymin, xmax, ymax) format. N is the number of bounding boxes
        # ...it also accepts more optional parameters such as labels, colors, fill, width, etc.
  
        color = (random.randint(0,255), random.randint(0,255), random.randint(0,255)) #select a random bbox color (randint includes both endpoints)

        # convert bbox from (xmin, ymin, width, height) format
        # to (xmin, ymin, xmax, ymax) format
        bbox[2] += bbox[0]
        bbox[3] += bbox[1]

        #avoid deprecation warning
        for b in range(4):
          bbox[b] = int(bbox[b])

        bbox = torch.tensor(bbox, dtype=torch.int)
        #print(bbox)
        #print(bbox.size()) #[4]
        bbox = bbox.unsqueeze(0)
        #print(bbox.size()) #[1,4]

        # draw bounding box on the input image
        img=draw_bounding_boxes(img, bbox, width=3, colors=color)

        # Adjust image tensor shape for plotting: [color_channels, height, width] -> [height, width, color_channels]
        targ_image_adjust = img.permute(1, 2, 0)

        # Plot adjusted samples
        plt.subplot(1, 5, i+1)
        plt.imshow(targ_image_adjust)
        plt.axis("off")
        title = sentences[0]['raw']
        plt.title(title)

display_random_images(train_data_custom)

Implement dataloaders

In [None]:
from torch.utils.data import DataLoader
train_dataloader_custom = DataLoader(
    dataset=train_data_custom, # use custom created train Dataset
    batch_size=1, # how many samples per batch?
    num_workers=0, # how many subprocesses to use for data loading? (higher = more)
    shuffle=True, # shuffle the data?
) 

test_dataloader_custom = DataLoader(
    dataset=test_data_custom, # use custom created test Dataset
    batch_size=1,
    num_workers=0,
    shuffle=False, # usually there is no need to shuffle testing data
)

train_dataloader_custom, test_dataloader_custom
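
The loaders above use `batch_size=1`. With larger batches, the default collation would try to stack the images into one tensor, which requires all images in a batch to have the same shape, and the number of sentences also varies from sample to sample. One possible workaround, sketched below, is a custom `collate_fn` that keeps the samples in plain lists. The function name `collate_keep_lists` is hypothetical, and the demo batch uses strings in place of image tensors so that the snippet runs without the dataset.

```python
def collate_keep_lists(batch):
    """Collate ((image, sentences), bbox) samples without stacking them."""
    images = [image for (image, _sentences), _bbox in batch]
    sentences = [sents for (_image, sents), _bbox in batch]
    bboxes = [bbox for (_image, _sentences), bbox in batch]
    return (images, sentences), bboxes


# Stand-in samples: strings replace the image tensors (made-up data).
batch = [
    (("image_a", [{"raw": "a red bus"}]), [0.0, 45.95, 238.92, 408.64]),
    (("image_b", [{"raw": "a dog"}, {"raw": "the small dog"}]), [10.0, 20.0, 30.0, 40.0]),
]
(images, sentences), bboxes = collate_keep_lists(batch)
print(len(images), [len(s) for s in sentences], bboxes[1])
```

Passing `collate_fn=collate_keep_lists` to `DataLoader` would then yield lists of images, sentence lists and bounding boxes instead of stacked tensors; resizing the images to a common shape is the other common option.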

In [None]:
# Get image and label from custom DataLoader
input, output = next(iter(train_dataloader_custom))

# input is a list of two elements
# input[0] is the image, of shape [batch_size, color_channels, height, width] (here batch_size is 1 and no resize is applied)
# input[1] are the sentences

# output is a list of tensors
# [tensor([0.], dtype=torch.float64), tensor([45.9500], dtype=torch.float64), tensor([238.9200], dtype=torch.float64), tensor([408.6400], dtype=torch.float64)]
# [xmin, ymin, width, height]


# Setup plot
plt.figure(figsize=(15, 15))

img = input[0][0]
sentences = input[1]

#convert output to a list of int
bbox = list()
for b in range(4):
  bbox.append(int(output[b].item()))

# add bounding box
# Parameters
# image: image of type Tensor of shape (C x H x W).
# boxes: tensor of size [N,4] containing bounding boxes coordinates in (xmin, ymin, xmax, ymax) format. N is the number of bounding boxes
# ...it also accepts more optional parameters such as labels, colors, fill, width, etc.

color = (random.randint(0,255), random.randint(0,255), random.randint(0,255)) #select a random bbox color (randint includes both endpoints)

# convert bbox from (xmin, ymin, width, height) format
# to (xmin, ymin, xmax, ymax) format
bbox[2] += bbox[0]
bbox[3] += bbox[1]

bbox = torch.tensor(bbox, dtype=torch.int)
bbox = bbox.unsqueeze(0)

# draw bounding box on the input image
img=draw_bounding_boxes(img, bbox, width=3, colors=color)

# Adjust image tensor shape for plotting: [color_channels, height, width] -> [height, width, color_channels]
targ_image_adjust = img.permute(1, 2, 0)


# Plot adjusted samples
plt.subplot(1, 1, 1)
plt.imshow(targ_image_adjust)
plt.axis("off")
title = sentences[0]['raw'][0]
plt.title(title)