Question about GCC dataset download #45

yr666666 · 2021-12-26T08:14:14Z

root
├── images_train
│ ├── 0000 # First four letters of the image name
│ │ ├── 0000000 # Image Binary
│ │ ├── 0000001
│ │ └── ...
│ ├── 0001
│ │ ├── 0001000
│ │ ├── 0001001
│ │ └── ...

Hello, please forgive my stupid question. I don't understand what "0000 # First four letters of the image name" and "0000000 # Image Binary" mean in your DATA.md. Could you explain what "Image Binary" and "First four letters of the image name" are? Thanks.

dandelin · 2021-12-30T15:32:28Z

Hi @yr666666

GCC (CC3M) provides the dataset as a list of image URLs and their related captions.
Since the original filenames are unordered and come in various formats, I renamed each file to a zero-padded sequence number without an extension (like .jpg, .png, ...) during the download.
So these renamed "image files (binaries)" have names such as 0000000, 0000001, ..., 2983222, etc.

Putting all files in a single directory slows down disk-related operations.
So I partitioned them into subdirectories named after the "first four letters of the image name" (that is, its first four digits: 0001001 goes into 0001), so that each directory holds at most 1000 files.
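
To make the layout concrete, here is a minimal Python sketch of how a sequence number could map to a path under this scheme. The helper name `image_path` and its parameters are hypothetical, and the seven-digit width is inferred from the example names in this thread rather than taken from the actual download script.

```python
from pathlib import Path


def image_path(root: str, index: int, split: str = "images_train") -> Path:
    """Return the assumed on-disk path for the image with this sequence number.

    Hypothetical helper: the 7-digit zero padding is an assumption based on
    names like 0000000 and 2983222 mentioned above.
    """
    name = f"{index:07d}"    # extensionless file name, e.g. 1001 -> "0001001"
    subdir = name[:4]        # first four characters, e.g. "0001"
    return Path(root) / split / subdir / name


print(image_path("root", 1001).as_posix())
```

With seven-digit names and a four-digit directory prefix, only the last three digits vary inside a directory, which is why each one holds at most 1000 files.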
