# Prepare COCO
The COCO dataset is split into separate train, validation and test sets. Every image has a unique ID.

This notebook combines and renames the images in the different datasets to the format `<img_id>.jpg`.
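
As a minimal sketch of the renaming rule, the helpers below assume the downloaded files are named like `COCO_val2014_000000000192.jpg`, i.e. the image ID is the last `_`-separated field. The function names `coco_image_id` and `merged_name` are hypothetical and not part of the notebook. The zero-padding of the ID is kept in the new filename.

```python
import os


def coco_image_id(filename: str) -> int:
    """Return the numeric image ID from a COCO-style filename (assumed pattern)."""
    stem = os.path.splitext(filename)[0]          # drop the '.jpg' extension
    return int(stem.rsplit('_', maxsplit=1)[-1])  # last '_'-separated field


def merged_name(filename: str) -> str:
    """Return the filename with the dataset prefix removed, padding kept."""
    return filename.rsplit('_', maxsplit=1)[-1]
```

Both helpers also accept an already renamed file, because `rsplit` returns the whole name when it contains no underscore.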

Then, a reduced version of COCO is created containing only the images that are relevant for the task.

In [1]:
import os
import shutil
from tqdm import tqdm
import pandas as pd

The dataset folders `train2014`, `val2014` and `test2014` must be [downloaded](https://cocodataset.org/#download), extracted from their zip files and placed in `res/mscoco/`.
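
The cells in this notebook silently skip any dataset folder that does not exist, so it can help to check up front which ones are missing. The sketch below is one way to do that; `missing_datasets` is a hypothetical helper, and its default folder names are taken from the `datasets` list used later in the notebook.

```python
import os


def missing_datasets(root, names=('train2014', 'val2014', 'test2014')):
    """Return the dataset folder names that are not present under `root`."""
    return [name for name in names
            if not os.path.isdir(os.path.join(root, name))]
```

After changing into the `res` directory, `missing_datasets('mscoco')` would return an empty list when all three folders are in place.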

In [2]:
RES_DIR = '../res'

%cd {RES_DIR}
os.listdir()



['2022-Hirota-CVPR.pdf',
 'IntersectionSAT-OSCAR-NICPL-NICEQ.csv',
 'reduced_coco_10780',
 'annotations',
 'mscoco',
 'merged_coco']

In [3]:
datasets = ['train2014', 'val2014', 'test2014']

In [4]:
idss = []
for ds in datasets:
    ids = []
    ds_path = os.path.join('mscoco', ds)
    if not os.path.exists(ds_path):
        continue
    for filename in os.listdir(ds_path):
        id_ = int(filename.rsplit('_', maxsplit=1)[1].split('.')[0])
        ids.append(id_)
    print(ds, len(os.listdir(ds_path)), len(set(ids)))
    idss.extend(ids)
print('total', len(idss), len(set(idss)))

val2014 1 1
total 1 1


## Merge datasets in a single folder
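Each image is written to `merged_coco` under its new name. Copy the files to keep the original dataset folders, or move them to save disk space.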
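
The choice between copying and moving can be made explicit with a flag instead of commenting lines in and out. This is only a sketch: `transfer` and its `move` parameter are hypothetical names, and it uses the same `shutil.copyfile` and `shutil.move` calls as the cells in this notebook.

```python
import shutil


def transfer(src, dst, move=False):
    """Copy `src` to `dst`, or move it when `move` is True."""
    if move:
        shutil.move(src, dst)      # the source file is removed: saves space
    else:
        shutil.copyfile(src, dst)  # the source file is kept
```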

In [5]:
dst_dir = 'merged_coco'
if not os.path.exists(dst_dir):
    os.makedirs(dst_dir)

In [6]:
for ds in datasets:
    ds_path = os.path.join('mscoco', ds)
    if not os.path.exists(ds_path):
        continue
    for src_file in tqdm(os.listdir(ds_path), desc=ds):
        dst_file = src_file.rsplit('_', maxsplit=1)[1]
        src = os.path.join(ds_path, src_file)
        dst = os.path.join(dst_dir, dst_file)
        
        shutil.copyfile(src, dst)  # copy if you want to keep the previous dataset
        # shutil.move(src, dst)  # move to save space

val2014: 100%|██████████████████████████████████| 1/1 [00:00<00:00, 5363.56it/s]


In [7]:
print(dst_dir, len(os.listdir(dst_dir)))

merged_coco 1


### Remove previous dataset folders

In [8]:
# Uncomment to delete the original folders, which live under 'mscoco'.
# for ds in datasets:
#     shutil.rmtree(os.path.join('mscoco', ds))

## Reduce the COCO dataset for our task
Only the images for which we have annotations are needed. We therefore create a reduced COCO dataset that keeps only those images.
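
The selection step amounts to a membership test on the image ID. A minimal sketch, assuming the merged filenames have the form `<img_id>.jpg`: the IDs are converted to a `set` once so that each lookup is cheap. `select_annotated` is a hypothetical helper, not part of the notebook.

```python
def select_annotated(filenames, annotated_ids):
    """Return the filenames whose numeric ID appears in `annotated_ids`."""
    ids = set(annotated_ids)  # build the lookup set once
    return [name for name in filenames
            if int(name.split('.')[0]) in ids]
```

With the annotation table loaded as `df`, the IDs could be passed as `df.index.tolist()`.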

In [9]:
df = pd.read_csv('IntersectionSAT-OSCAR-NICPL-NICEQ.csv', index_col='img_id')
df

| img_id | caption_list | pred_oscar | pred_sat | pred_nicplus | pred_niceq |
|---|---|---|---|---|---|
| 192 | ['A group of baseball players is crowded at th... | a baseball player holding a bat on top of a fi... | a batter catcher and umpire during a baseball ... | a baseball player holding a bat on top of a fi... | a baseball player holding a bat on a field. |
| 241 | ['a man standing holding a game controller and... | a man standing in a living room holding a nint... | a couple of people that are playing a video game | a group of people playing a game with nintendo... | a group of people playing a video game. |
| 294 | ['A man standing in front of a microwave next ... | a man standing in front of a bunch of pots and... | a woman is pouring wine into a wine glass | a woman standing in a kitchen preparing food. | a man standing in a kitchen holding a knife. |
| 328 | ['Three men in military suits are sitting on a... | a group of three men sitting on top of a bench. | a group of people sitting on a bench | a black and white photo of a group of people s... | a black and white photo of a group of people s... |
| 338 | ['Two people standing in a kitchen looking aro... | a couple of women standing in a kitchen next t... | a group of people standing in a kitchen | a woman standing in a kitchen next to a stove. | a woman standing in a kitchen next to a stove. |
| ... | ... | ... | ... | ... | ... |
| 579902 | ['A person riding a motorcycle down a street.'... | a man riding a motorcycle down a street next t... | a man riding a motorcycle down a street | a man riding a motorcycle down a street. | a man riding a motorcycle down a street. |
| 580197 | ['Two men in bow ties standing next to steel r... | a couple of men standing next to each other in... | a man in a suit and tie in a room | a man in a suit and tie standing next to a woman. | a man in a suit and tie standing next to anoth... |
| 580294 | ['Person cooking an eggs on a black pot on a s... | a woman in a kitchen making pancakes on a stove. | a woman is preparing food in a kitchen | a woman standing in a kitchen preparing food. | a woman standing in a kitchen preparing food. |
| 581317 | ['A woman holding a small item in a field.', '... | a woman standing in a field looking at her cel... | a woman in a field with a cell phone | a woman standing in a field with a frisbee. | a woman is standing in the grass talking on a ... |


In [10]:
src_dir = 'merged_coco'
dst_dir = f'reduced_coco_{len(df)}'

if not os.path.exists(dst_dir):
    os.makedirs(dst_dir)

In [11]:
df.index.values

array([   192,    241,    294, ..., 580294, 581317, 581357])

In [12]:
for filename in tqdm(os.listdir(src_dir)):
    id_ = int(filename.split('.')[0])

    if id_ in df.index.values:
        src = os.path.join(src_dir, filename)
        dst = os.path.join(dst_dir, filename)
        
        shutil.copyfile(src, dst)  # copy if you want to keep the previous dataset
        # shutil.move(src, dst)  # move to save space

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 9300.01it/s]


In [13]:
print(dst_dir, len(os.listdir(dst_dir)))

reduced_coco_10780 0
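
In this run only a single image was available in `merged_coco`, so nothing was copied even though the annotation table lists 10780 IDs. A quick way to see which annotated images are still missing is sketched below; `missing_ids` is a hypothetical helper, and it assumes every file in the folder is named `<img_id>.jpg`.

```python
import os


def missing_ids(expected_ids, directory):
    """Return, in ascending order, the expected IDs with no file in `directory`."""
    present = {int(name.split('.')[0]) for name in os.listdir(directory)}
    return sorted(set(expected_ids) - present)
```

For example, `len(missing_ids(df.index.tolist(), dst_dir))` would give the number of annotated images that have not been copied yet.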


### Remove temporary `merged_coco` folder

In [14]:
# shutil.rmtree(src_dir)