Before running this notebook, complete the following steps.
1. Ensure that all the images from the Visual Genome dataset are in a _single_ folder. The raw dataset download splits the images into two folders.
2. Download the _json_pretrain.zip_ file from the ALBEF repository and extract it. There are many different ways to make image-caption pairs from Visual Genome; we use the same pairs as the ALBEF authors.
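
Step 1 can be scripted. The following is a minimal sketch, not part of the original notebook: the helper name `merge_image_folders` is hypothetical, and the source folder names in the usage comment (`VG_100K`, `VG_100K_2`) are an assumption about how the raw download is laid out, so adjust them to match your copy.

```python
import shutil
from pathlib import Path


def merge_image_folders(source_dirs, dest_dir):
    """Move every .jpg from each source folder into dest_dir; return the number moved."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for source in map(Path, source_dirs):
        for image in sorted(source.glob('*.jpg')):
            target = dest_dir / image.name
            if target.exists():
                # Never overwrite silently; a duplicate file name signals a problem.
                raise FileExistsError(f'{target} already exists')
            shutil.move(str(image), str(target))
            moved += 1
    return moved


# Hypothetical usage (folder names are assumptions, not taken from this notebook):
# root = Path('/path/to/visual-genome')
# merge_image_folders([root / 'VG_100K', root / 'VG_100K_2'], root / 'vg-images')
```

Raising on a name collision is a deliberate choice here: the image file names are the Visual Genome ids used later in the notebook, so two files with the same name should be investigated rather than merged.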

In [9]:
import pandas as pd
import json
from pathlib import Path
from tqdm.notebook import tqdm
import os

In [11]:
ALBEF_VG_PATH = Path('/net/acadia10a/data/zkhan/json_pretrain/vg.json')
VG_DOWNLOAD_ROOT = Path('/net/acadia10a/data/zkhan/visual-genome-sandbox/')
VG_IMAGES_PATH = VG_DOWNLOAD_ROOT / 'vg-images'

In [12]:
assert all(p.exists() for p in (ALBEF_VG_PATH, VG_IMAGES_PATH))

In [4]:
with open(ALBEF_VG_PATH, 'r') as f:
    albef_json = json.load(f)

Here we use the `image_id` key in ALBEF's pretraining JSON to locate the corresponding image in our downloaded copy of the dataset. We then replace the image path the ALBEF authors used with one that works for our copy.

In [18]:
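for pair in tqdm(albef_json):
    # ALBEF ids look like 'vg_1'; the number after the underscore is the Visual Genome image id.
    vg_id = int(pair['image_id'].split('_')[-1])
    image_name = f'{vg_id}.jpg'
    absolute_path = VG_IMAGES_PATH / image_name
    assert absolute_path.exists(), f'Missing image: {absolute_path}'
    # Store the path relative to VG_DOWNLOAD_ROOT, e.g. 'vg-images/1.jpg'.
    # .name (rather than .stem) keeps the folder name intact even if it contains a dot.
    relative_path = os.path.join(VG_IMAGES_PATH.name, image_name)
    pair['image'] = relative_path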

  0%|          | 0/768536 [00:00<?, ?it/s]
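
The loop above stops at the first missing image. If the download might be incomplete, it can help to see every missing id at once. The sketch below is an illustrative variant, not the notebook's own code: the function name `split_by_image_presence` is hypothetical, and it assumes each pair has an `image_id` of the form `vg_<number>`, as in the ALBEF JSON used here.

```python
from pathlib import Path


def split_by_image_presence(pairs, images_dir):
    """Return (found, missing): pairs with a rewritten relative 'image' path,
    and the image_ids whose .jpg file is not present in images_dir."""
    images_dir = Path(images_dir)
    found, missing = [], []
    for pair in pairs:
        vg_id = int(pair['image_id'].split('_')[-1])  # 'vg_1' -> 1
        image_name = f'{vg_id}.jpg'
        if (images_dir / image_name).exists():
            # Copy the pair so the input list is left unmodified.
            found.append({**pair, 'image': f'{images_dir.name}/{image_name}'})
        else:
            missing.append(pair['image_id'])
    return found, missing
```

Calling `found, missing = split_by_image_presence(albef_json, VG_IMAGES_PATH)` and inspecting `missing` before building the dataframe shows whether any images still need to be downloaded.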

In [20]:
vg_df = pd.DataFrame(albef_json)

In [23]:
vg_df.head()

Unnamed: 0,image,caption,image_id
0,vg-images/1.jpg,trees line the sidewalk,vg_1
1,vg-images/1.jpg,sidewalk is made of bricks,vg_1
2,vg-images/1.jpg,cars are parked along the edge of the street,vg_1
3,vg-images/1.jpg,Trees with sparse foilage,vg_1
4,vg-images/1.jpg,A tall brick building with many windows,vg_1


In [22]:
vg_df.to_csv(VG_DOWNLOAD_ROOT / 'visual-genome-pairs.tsv', sep='\t', index=False)

We write the dataframe to a tab-separated file with relative rather than absolute paths, because that makes regenerating the JSON _much_ easier if we change paths or need to run the code in a different setting. When we generate the actual JSON pairs that get fed into ALBEF, we use absolute paths.

In [24]:
vg_df['image'] = vg_df['image'].apply(lambda s: str(VG_DOWNLOAD_ROOT / s))

In [26]:
vg_df.to_json(VG_DOWNLOAD_ROOT / 'visual-genome-pairs.json', orient='records')
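
Because the tab-separated file stores relative paths, the JSON can be regenerated for a different machine without rerunning the cells above. The following is a minimal sketch of that step. The function name `rebuild_pairs_json` is hypothetical, and it uses the standard-library `csv` module in place of pandas so that it has no third-party dependencies; it assumes the file has an `image` column, as written above.

```python
import csv
import json
from pathlib import Path


def rebuild_pairs_json(tsv_path, new_root, json_path):
    """Read the relative-path TSV, prefix each image path with new_root,
    and write the records as a JSON list. Returns the number of records."""
    new_root = Path(new_root)
    with open(tsv_path, newline='', encoding='utf-8') as f:
        records = list(csv.DictReader(f, delimiter='\t'))
    for record in records:
        record['image'] = str(new_root / record['image'])
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f)
    return len(records)


# Hypothetical usage on another machine (paths are placeholders):
# rebuild_pairs_json('visual-genome-pairs.tsv', '/data/visual-genome', 'visual-genome-pairs.json')
```

The output is a list of objects with the same keys as the TSV columns, which matches the `orient='records'` layout written by the cell above.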