
Data prepare #2

Closed · liuhl-source opened this issue May 13, 2021 · 2 comments
@liuhl-source

When I try to train the model, there are some problems with the DataLoader. I get many errors such as
'Error while read file idx 433 in conceptual_caption_val_0 -> cannot identify image file <_io.BytesIO object at 0x7f36766d9bd0>'.
Many images cannot be loaded, and I don't know why. Do you have any suggestions? Or can you share the scripts for downloading the GCC and SBU datasets? Thank you very much! :)

@dandelin
Owner

I guess the image files were corrupted in the first place.
It also happened to me, so I made the dataset fall back to another random index when such an error occurs.
However, considering the size of the dataset (several million images), I remember that the number of corrupted files was not that large.
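A minimal sketch of that fallback behavior (hypothetical class and names, assuming a plain list of image paths, Pillow, and a PyTorch Dataset; not the exact ViLT implementation):

import random
from PIL import Image
from torch.utils.data import Dataset

class RobustImageDataset(Dataset):
    # Hypothetical sketch: retry with a random index when a file cannot be decoded.
    def __init__(self, image_paths):
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        try:
            return Image.open(self.image_paths[idx]).convert('RGB')
        except OSError:
            # Pillow's 'cannot identify image file' error is an OSError subclass,
            # so corrupted files land here; fall back to a random other sample.
            return self[random.randint(0, len(self) - 1)]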

FYI, I used aria2c to download the images directly from the image URL lists:
$ aria2c -i uris.txt
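For GCC, the Conceptual Captions TSVs store one caption<TAB>url pair per line, so a few lines of Python can produce a URL list for aria2c (the TSV filename below is an assumption; use whatever Google's download page gave you):

# Assumption: each TSV line is 'caption<TAB>url' (the Conceptual Captions release format).
with open('Train_GCC-training.tsv') as tsv, open('uris.txt', 'w') as out:
    for line in tsv:
        parts = line.rstrip('\n').split('\t')
        if len(parts) == 2:
            out.write(parts[1] + '\n')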

@fawazsammani commented Jul 21, 2021

This is how I downloaded SBU (it took a long time, several days for the complete dataset of 1M images).

import requests
import shutil
import os

if not os.path.isdir('images'):
    os.mkdir('images')

num_images_to_download = 1_000_000

urls = []
with open('sbu/SBU_captioned_photo_dataset_urls.txt') as f:
    for line in f:
        urls.append(line.strip())

captions = []
with open('sbu/SBU_captioned_photo_dataset_captions.txt') as f:
    for line in f:
        captions.append(line.strip())

img_id = 0
filtered_captions = []

for url, caption in zip(urls, captions):

    try:
        response = requests.get(url, stream=True, timeout=10)
    except requests.RequestException:
        # Skip URLs that time out or refuse the connection instead of crashing a days-long run.
        continue

    if response.status_code != 200:
        # Skip any unsuccessful response, not just 404s.
        continue

    # Increment only on success so images/<img_id>.png matches
    # line <img_id> of filtered_sbu_caps.txt.
    img_id += 1

    # Note: SBU serves JPEGs; the .png extension here is only a filename,
    # not a format conversion.
    response.raw.decode_content = True
    with open('images/' + str(img_id) + '.png', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    filtered_captions.append(caption)

    if img_id == num_images_to_download:
        break

with open('filtered_sbu_caps.txt', 'w') as f:
    for c in filtered_captions:
        f.write(c + '\n')
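Even with the status checks above, some files still arrive truncated or corrupted, which is exactly the 'cannot identify image file' error from the original report. A small sketch (assuming Pillow and the images/ layout above) to find them after the fact:

import os
from PIL import Image

bad_files = []
for name in sorted(os.listdir('images')):
    path = os.path.join('images', name)
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check; raises on truncated or corrupted files
    except (OSError, SyntaxError):
        bad_files.append(path)

print(len(bad_files), 'corrupted files:', bad_files[:10])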
