This program downloads, verifies, and resizes the images and metadata of the Open Images dataset (https://github.com/openimages/dataset). It is designed to run as fast as the available hardware and bandwidth allow by using asynchronous I/O and parallelism. Each image's size and md5 sum are validated against the values found in the dataset metadata. The download results are stored in CSV files with the same format as the original images.csv, so subsequent use for training, etc. can proceed knowing that all the listed images are available. Many (over 2%) of the original images are no longer available or have changed; these and any other failed downloads are recorded in a separate results file. If you use a naive script or program such as curl or wget to download the images, you will end up with a lot of "unavailable" PNG images, XML files, and some images that don't match the originals, which is why validating the size and md5 sum during download is important. In addition, some of Flickr's servers may be down at any given time, so the resulting failures must be handled properly.
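For illustration, the core check is simple: compare the downloaded bytes' length and MD5 digest against the values listed in the metadata. Here is a minimal Scala sketch of that idea (not the program's actual code; the lowercase-hex encoding of the expected MD5 is an assumption and the metadata may encode it differently):

```scala
import java.security.MessageDigest

// Sketch only: accept a downloaded image if its size and MD5 match the
// values from the dataset metadata. The expected MD5 is assumed to be a
// hex string here.
def verifyImage(bytes: Array[Byte], expectedSize: Long, expectedMd5Hex: String): Boolean = {
  val md5Hex = MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString
  bytes.length.toLong == expectedSize && md5Hex.equalsIgnoreCase(expectedMd5Hex)
}
```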
- 2017-11-20 Version 3.0 released. Add support for downloading Open Images Dataset V3 and V1 (in addition to V2).
- 2017-09-08 Version 1.0 released.
The application is written in Scala and requires a Java 8 JRE to run. If you don't have a Java JRE installed, see https://www.java.com/en/download/help/download_options.xml for instructions on how to install one.
Open Images Downloader is distributed with a shell script and batch file generated by sbt-native-packager (http://www.scala-sbt.org/sbt-native-packager/index.html), so all you need to do is download and extract the distribution and then execute open_images_downloader (or the .bat file on Windows).
The resizing functionality depends on the ImageMagick program `convert`, so if you want to do resizing, `convert` must be on the `PATH`. ImageMagick provides excellent resizing quality and very fast performance. It's easy to install on a Linux distribution using a package manager (e.g. apt or yum), and not too hard on most other OSes. See https://www.imagemagick.org/script/binary-releases.php.
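For reference, a resize is essentially a call to `convert` with an ImageMagick geometry string. The following is a rough Scala sketch of how such a call could be shelled out (illustrative only; the exact arguments this program passes may differ):

```scala
import scala.sys.process._

// Sketch: ShrinkToFit-style resize into a boxSize x boxSize bounding box.
// The trailing '>' in the geometry tells ImageMagick to only shrink images
// larger than the box, never enlarge. Returns the process exit code.
def shrinkToFit(srcPath: String, dstPath: String, boxSize: Int): Int =
  Seq("convert", srcPath, "-resize", s"${boxSize}x${boxSize}>", dstPath).!
```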
The application is a command line application. If you're running it on a server, I'd recommend using screen or tmux so that it continues running if the ssh connection is interrupted.
The code is written in a portable manner, but I haven't tested it on any OS besides Ubuntu Linux, so if you use a different OS and run into issues, let me know by opening an issue on GitHub and I'll do my best to help you out.
This program is flexible and supports a number of use cases depending on how much storage you want to use and how you want to use the data. The original images can optionally be stored, and you can also choose whether to resize and store the resized images. The metadata download and extraction is optional as well. If the original images are found locally because you previously downloaded them, they are used as the source for a resize, and a resize is skipped if an output file with size > 0 already exists. This means the program can be interrupted and restarted, resuming where it left off. It also means that if you have the original images stored locally, you can resize all the images with different parameters without re-downloading anything.
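A minimal sketch of that skip/resume decision (illustrative only; the paths follow the default directory layout described below):

```scala
import java.nio.file.{Files, Path}

// Sketch: when resuming, a resized file with size > 0 is skipped, and a
// locally stored original (if present) is used as the resize source
// instead of re-downloading.
def nextAction(originalPath: Path, resizedPath: Path): String =
  if (Files.exists(resizedPath) && Files.size(resizedPath) > 0) "skip"
  else if (Files.exists(originalPath) && Files.size(originalPath) > 0) "resize from local original"
  else "download, then resize"
```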
If you want to minimize the amount of space used, store only small 224x224 images compressed at JPEG quality 50, and use less bandwidth by downloading the 300K urls, use the following command line options:
$ open_images_downloader --nodownload-metadata --download-300k \
--resize-mode FillCrop --resize-compression-quality 50
If you want to save the images with a maximum side of 640, preserving the original aspect ratio and JPEG quality, and use less bandwidth by downloading the 300K urls, use the following command line options. Note that the 300K images don't look as nice as the original images resized by ImageMagick. The 300K urls return images that are 640 pixels on the largest side, so the resize step only changes images that are larger than 640. Not all images have 300K urls; in that case, the original url is used and those images are resized.
$ open_images_downloader --nodownload-metadata --download-300k \
--resize-box-size 640
If you want to download and save all the original images and metadata, resize them to a maximum side of 1024, and save the resized images in a subdirectory named images-resized-1024:
$ open_images_downloader --save-original-images --resize-box-size 1024 \
--resized-images-subdirectory images-resized-1024
There are also options for controlling how many concurrent HTTP connections are made, which you may want to use to reduce the impact on your local network. (Don't worry about Flickr: their servers can easily handle a few hundred connections from a single system downloading as fast as possible, and you won't be blocked for "abuse".)
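Assuming an akka-http based client (the TODO list below mentions akka), those options map naturally onto akka-http's connection pool settings. The following is only a sketch of that mapping, not the program's actual configuration; the values shown are the defaults documented in the help text below:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.settings.ConnectionPoolSettings

// Sketch: translate the connection-related command line options into
// akka-http connection pool settings.
def poolSettings(system: ActorSystem): ConnectionPoolSettings =
  ConnectionPoolSettings(system)
    .withMaxConnections(5)     // --max-host-connections
    .withPipeliningLimit(4)    // --http-pipelining-limit
    .withMaxRetries(15)        // --max-retries
```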
Here is the complete command line help:
open_images_downloader 3.0 by Dan Nuffer
Usage: open_images_downloader[.bat] [OPTION]...
Options:
--check-md5-if-exists If an image already exists locally
in <image dir> and is the same
size as the original, check the
md5 sum of the file to determine
whether to download it. Default is
on
--nocheck-md5-if-exists
--dataset-version <arg> The version of the dataset to
download. 1, 2, or 3. 3 was
released 2017-11-16, 2 was
released 2017-07-20, and 1 was
released 2016-09-28. Default is 3.
--download-300k Download the image from the url in
the Thumbnail300KURL field. This
disables verifying the size and
md5 hash and results in lower
quality images, but may be much
faster and use less bandwidth and
storage space. These are resized
to a max dim of 640, so if you use
--resize-mode=ShrinkToFit and
--resize-box-size=640 you can get
a full consistently sized set of
images. For the few images that
don't have a 300K url the original
is downloaded and needs to be
resized. Default is off
--nodownload-300k
--download-images Download and extract
images_2017_07.tar.gz and all
images. Default is on
--nodownload-images
--download-metadata Download and extract the metadata
files (annotations and classes).
Default is on
--nodownload-metadata
--http-pipelining-limit <arg> The maximum number of parallel
pipelined http requests per
connection. Default is 4
--log-file <arg> Write a log to <file>. Default is
to not write a log
--log-to-stdout Write the log to stdout. Default
is on
--nolog-to-stdout
--max-host-connections <arg> The maximum number of parallel
connections to a single host.
Default is 5
--max-retries <arg> Number of times to retry failed
downloads. Default is 15.
--max-total-connections <arg> The maximum number of parallel
connections to all hosts. Must be
a power of 2 and > 0. Default is
128
--original-images-subdirectory <arg> name of the subdirectory where the
original images are stored.
Default is images-original
--resize-box-size <arg> The number of pixels used by
resizing for the side of the
bounding box. Default is 224
--resize-compression-quality <arg> The compression quality. If
specified, it will be passed with
the -quality option to imagemagick
convert. See
https://www.imagemagick.org/script/command-line-options.php#quality
for the meaning of different
values and defaults for various
output formats. If unspecified,
-quality will not be passed and
imagemagick will use its default
--resize-images Resize images. Default is on
--noresize-images
--resize-mode <arg> One of ShrinkToFit, FillCrop, or
FillDistort. ShrinkToFit will
resize images larger than the
specified size of bounding box,
preserving aspect ratio. Smaller
images are unchanged. FillCrop
will fill the bounding box, by
first either shrinking or growing
the image and then doing a
center-crop on the larger
dimension. FillDistort will fill
the bounding box, by either
shrinking or growing the image,
modifying the aspect ratio as
necessary to fit. Default is
ShrinkToFit
--resize-output-format <arg> The format (and extension) to use
for the resized images. Valid
values are those supported by
ImageMagick. See
https://www.imagemagick.org/script/formats.php
and/or run identify -list format.
Default is jpg
--resized-images-subdirectory <arg> name of the subdirectory where the
resized images are stored. Default
is images-resized
--root-dir <arg> top-level directory for storing
the Open Images dataset. Default
is . (current working directory)
--save-original-images Save full-size original images.
This will use over 18 TB of space.
Default is off
--nosave-original-images
--save-tar-balls Save the downloaded .tar.gz and
.tar files. This uses more space
but can save time when resuming
from an interrupted execution.
Default is off
--nosave-tar-balls
--help Show help message
--version Show version of this program
Install sbt from http://www.scala-sbt.org/
Run `sbt compile` to build.
Run `sbt test` to run unit tests.
Run `sbt universal:packageBin` to create the distribution .zip.
Run `sbt universal:packageZipTarball` to create the distribution .tgz. The output is stored in target/universal.
- UI
- Show # of images/sec
- Show # of bytes/sec read from internet
- Show # of bytes/sec written to disk
- Show last 100(?) successful downloads
- Show last 100(?) failed downloads
- Show downloads in progress w/percentage complete
- Show # of resizes in progress
- Show overall number of downloads w/% of total completed
- ncurses based progress UI
- graphical progress UI
- display images as they are downloaded
- optionally convert from non-standard cmyk format
- optionally validate the image is a proper jpeg
- I made a script to do this before: https://github.com/dnuffer/detect_corrupt_jpeg. It uses magic mime-type detection, and PIL open(). I also tested out using jpeginfo and ImageMagick identify.
- There are results from running my detect_corrupt_jpeg script at:
- partial with jpeginfo and identify: /storage/data/pub/ml_datasets/openimages/images_2016_08/train/detect_corrupt_jpeg.out
- on oi v1 with magic and PIL: /storage/data/pub/ml_datasets/openimages/images_2016_08/train/detect_corrupt_jpeg2.out
- Don't want to be too strict. If it's got some weird format, but can still be decoded into an image, that's fine.
- Don't want to be too lenient. If it's missing part of the image, that should fail.
- ImageMagick identify -verbose https://www.imagemagick.org/discourse-server/viewtopic.php?t=20045
- The end-goal is to check that it can be decoded by TensorFlow into a valid image. TensorFlow uses libjpeg-turbo under the covers, so I could use the djpeg program from the libjpeg-turbo-progs package. I'm not sure how good it is at detecting corruption, however; I'll need to experiment with it. (A rough decode-check sketch appears after this list.)
- option to enable/disable 3-letter image subdirs
- option to do multiple image processes - resize to multiple sizes, multiple resize modes, convert to other formats.
- output to tfrecord
- optionally save metadata to a db (jpa?, shiny?)
- optionally save annotations to a db (jpa?, shiny?)
- distribute across machines using akka
- resume downloading partially downloaded files? Seems a bit pointless, none of the files are that big. Would be interesting to implement anyway! The file images_2017_07.tar.gz is the largest.
- check the license (scrape the website?)
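As noted in the JPEG-validation items above, one loose but easy check is simply whether the JVM can decode the file at all. A rough Scala sketch of that idea (an assumption about one possible approach, not something the program does today):

```scala
import java.io.File
import javax.imageio.ImageIO

// Sketch: treat a file as a usable image only if ImageIO can decode it.
// This is looser than a libjpeg-turbo/djpeg based check (the stated end
// goal above), but it catches non-images and badly truncated files.
def decodesAsImage(file: File): Boolean =
  try ImageIO.read(file) != null
  catch { case _: Exception => false }
```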