Question about GCC dataset download #45

Open
yr666666 opened this issue Dec 26, 2021 · 1 comment

Comments

@yr666666

root
├── images_train
│   ├── 0000          # First four letters of the image name
│   │   ├── 0000000   # Image Binary
│   │   ├── 0000001
│   │   └── ...
│   ├── 0001
│   │   ├── 0001000
│   │   ├── 0001001
│   │   └── ...

Hello, please forgive the basic question. I don't understand what "0000 # First four letters of the image name" and "0000000 # Image Binary" mean in your DATA.md. Could you explain what "Image Binary" and "First four letters of the image name" refer to? Thanks

@dandelin
Owner

Hi @yr666666

GCC (CC3M) provides the dataset as a list of image URLs and their associated captions.
Since the original filenames are unordered and come in various formats, I renamed the images during download to a zero-padded sequential number with no extension (no .jpg, .png, ...).
So these renamed "image files (binaries)" have names such as 0000000, 0000001, ..., 2983222.
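
To illustrate the naming, here is a minimal sketch, not the actual download script. The helper name `sequential_name` and the fixed width of 7 are assumptions inferred from the example names above.

```python
def sequential_name(index: int, width: int = 7) -> str:
    """Return the zero-padded, extension-less name for the index-th image.

    Hypothetical helper: the width of 7 is inferred from names such as
    0000000 and 2983222, not taken from the repository's code.
    """
    return f"{index:0{width}d}"


# The first two images and the last one mentioned in the thread.
for i in (0, 1, 2983222):
    print(sequential_name(i))
```

Each name is just the image's position in the download order, so the original URL filename and extension play no role.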

Putting all the files in a single directory would slow down disk-related operations.
So I partitioned them into directories named after the "first four letters of the image name" (its first four digits), so that each directory holds at most 1000 files.
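
The directory layout can be sketched as follows. This is an assumption-laden illustration rather than the repository's code: `shard_path` is a hypothetical helper, and `root/images_train` is taken from the tree quoted in the question.

```python
from pathlib import PurePosixPath


def shard_path(root: str, image_name: str, prefix_len: int = 4) -> PurePosixPath:
    """Map an image name to its location under the sharded layout.

    Hypothetical helper: the directory is the first `prefix_len`
    characters of the name, and the file keeps its full name.
    """
    return PurePosixPath(root) / image_name[:prefix_len] / image_name


print(shard_path("root/images_train", "0000001"))  # root/images_train/0000/0000001
print(shard_path("root/images_train", "0001001"))  # root/images_train/0001/0001001
```

With 7-digit names and a 4-digit prefix, only the last three digits vary inside a directory, which is where the limit of 1000 files per directory comes from.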
