
Data prepare #2

Closed · liuhl-source opened this issue May 13, 2021 · 2 comments
@liuhl-source

When I try to train the model, there are some problems with the DataLoader. I get many errors such as
'Error while read file idx 433 in conceptual_caption_val_0 -> cannot identify image file <_io.BytesIO object at 0x7f36766d9bd0>'.
Many images cannot be loaded, and I don't know why. Do you have any suggestions? Or can you share the scripts for downloading the GCC and SBU datasets? Thank you very much! :)

@dandelin
Owner

I guess the image files were corrupted in the first place.
It also happened to me, so I made the dataset fall back to another random index when such an error occurs.
However, considering the size of the dataset (several million images), I remember that the number of corrupted files was not that large.
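A minimal sketch of that fallback behavior (hypothetical class and names, assuming a plain list of image paths, Pillow, and a PyTorch Dataset; not the exact ViLT implementation):

import random
from PIL import Image
from torch.utils.data import Dataset

class RobustImageDataset(Dataset):
    # Hypothetical sketch: retry with a random index when a file cannot be decoded.
    def __init__(self, image_paths):
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        try:
            return Image.open(self.image_paths[idx]).convert('RGB')
        except OSError:
            # Pillow's 'cannot identify image file' error is an OSError subclass,
            # so corrupted files land here; fall back to a random other sample.
            return self[random.randint(0, len(self) - 1)]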

FYI, I used aria2c to download the images directly from the image URL lists:
$ aria2c -i uris.txt
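For GCC, the Conceptual Captions TSVs store one caption<TAB>url pair per line, so a few lines of Python can produce a URL list for aria2c (the TSV filename below is an assumption; use whatever Google's download page gave you):

# Assumption: each TSV line is 'caption<TAB>url' (the Conceptual Captions release format).
with open('Train_GCC-training.tsv') as tsv, open('uris.txt', 'w') as out:
    for line in tsv:
        parts = line.rstrip('\n').split('\t')
        if len(parts) == 2:
            out.write(parts[1] + '\n')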

@fawazsammani commented Jul 21, 2021

This is how I downloaded SBU (it took a long time, several days for the complete dataset of 1M images).

import requests
import shutil
import os

if not os.path.isdir('images'):
    os.mkdir('images')

num_images_to_download = 1_000_000

urls = []
with open('sbu/SBU_captioned_photo_dataset_urls.txt') as f:
    for line in f:
        urls.append(line.strip())

captions = []
with open('sbu/SBU_captioned_photo_dataset_captions.txt') as f:
    for line in f:
        captions.append(line.strip())

img_id = 0
filtered_captions = []

for url, caption in zip(urls, captions):

    try:
        response = requests.get(url, stream=True, timeout=10)
    except requests.RequestException:
        # Skip URLs that time out or refuse the connection instead of crashing a days-long run.
        continue

    if response.status_code != 200:
        # Skip any unsuccessful response, not just 404s.
        continue

    # Increment only on success so images/<img_id>.png matches
    # line <img_id> of filtered_sbu_caps.txt.
    img_id += 1

    # Note: SBU serves JPEGs; the .png extension here is only a filename,
    # not a format conversion.
    response.raw.decode_content = True
    with open('images/' + str(img_id) + '.png', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    filtered_captions.append(caption)

    if img_id == num_images_to_download:
        break

with open('filtered_sbu_caps.txt', 'w') as f:
    for c in filtered_captions:
        f.write(c + '\n')
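Even with the status checks above, some files still arrive truncated or corrupted, which is exactly the 'cannot identify image file' error from the original report. A small sketch (assuming Pillow and the images/ layout above) to find them after the fact:

import os
from PIL import Image

bad_files = []
for name in sorted(os.listdir('images')):
    path = os.path.join('images', name)
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check; raises on truncated or corrupted files
    except (OSError, SyntaxError):
        bad_files.append(path)

print(len(bad_files), 'corrupted files:', bad_files[:10])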
