Edits to README and download-dataset script with new instructions
andrefaraujo committed May 22, 2019
1 parent eeb15e4 commit e4b7a4a
Showing 2 changed files with 61 additions and 16 deletions.
59 changes: 49 additions & 10 deletions README.md
@@ -4,7 +4,8 @@ This is the second version of the Google Landmarks dataset, which contains
images annotated with labels representing human-made and natural landmarks. The
dataset can be used for landmark recognition and retrieval experiments. This
version of the dataset contains approximately 5 million images, split into 3
-sets of images: `train`, `index` and `test`.
+sets of images: `train`, `index` and `test`. The dataset is presented in our
+[Google AI blog post](https://ai.googleblog.com/2019/05/announcing-google-landmarks-v2-improved.html).

This dataset is associated with two Kaggle challenges, on
[landmark recognition](https://kaggle.com/c/landmark-recognition-2019) and
@@ -18,15 +19,7 @@ For reference, the previous version of the Google Landmarks dataset is available

## Download `train` set

-### Using the provided script
-Running `download-dataset.sh` will automatically download, extract, and verify the images in the current directory.
-
-```bash
-chmod +x download-dataset.sh
-./download-dataset.sh
-```
-
-Note: This script downloads files in parallel. To adjust the number of parallel downloads, modify `NUM_PROC` in the script.
+There are 4,132,914 images in the `train` set.

### Download the labels and metadata

@@ -50,6 +43,19 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir train && cd train
bash ../download-dataset.sh train 499
```

This will automatically download, verify and extract the images to the `train`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.
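
If you prefer to fetch a single archive by hand, the equivalent of one
iteration of the script is roughly the following (a sketch based on the URL
pattern in `download-dataset.sh`; `images_000.tar` is just one example archive
out of `images_000.tar`...`images_499.tar`):

```bash
# Download one train archive and its checksum file (same URLs the script uses).
curl -Os https://s3.amazonaws.com/google-landmark/train/images_000.tar
curl -Os https://s3.amazonaws.com/google-landmark/md5sum/train/md5.images_000.txt
```

Verification and extraction of downloaded archives are covered in the sections
below.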

### `train` image licenses

All images in the `train` set have CC-BY licenses without the NonDerivs (ND)
@@ -58,6 +64,8 @@ restriction. To verify the license for a particular image, please refer to

## Download `index` set

There are 761,757 images in the `index` set.

### Download the list of images

- `index.csv`: single-column CSV with an `id` field. `id` is a 16-character string.
@@ -75,12 +83,27 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir index && cd index
bash ../download-dataset.sh index 99
```

This will automatically download, verify and extract the images to the `index`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.

### `index` image licenses

All images in the `index` set have CC-0 or Public Domain licenses.

## Download `test` set

There are 117,577 images in the `test` set.

### Download the list of images

- `test.csv`: single-column CSV with an `id` field. `id` is a 16-character string.
@@ -98,6 +121,19 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir test && cd test
bash ../download-dataset.sh test 19
```

This will automatically download, verify and extract the images to the `test`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.

### `test` image licenses

All images in the `test` set have CC-0 or Public Domain licenses.
@@ -115,6 +151,9 @@ For example, the md5sum file corresponding to the `images_000.tar` file in the

And similarly for the other files.

If you use the provided `download-dataset.sh` script, file integrity is checked
automatically right after each download.
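
For a manually downloaded archive, a minimal check along the lines of the
script's own verification might look like this (a sketch, assuming the checksum
files are in the standard `md5sum` format):

```bash
# Verify images_000.tar against its published md5 file.
if [[ "$OSTYPE" == "linux-gnu" ]]; then
  md5sum -c md5.images_000.txt
else
  # macOS ships `md5` instead of `md5sum`; compare the digests manually.
  [[ "$(md5 -q images_000.tar)" == "$(awk '{print $1}' md5.images_000.txt)" ]] \
    && echo "images_000.tar: OK" || echo "images_000.tar: FAILED"
fi
```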

## Extracting the data

We recommend that the set of TAR files corresponding to each dataset split be
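
If you extract the archives manually instead of using the script, a simple loop
over the downloaded TAR files of a split is enough (an illustrative sketch; run
it inside the directory holding that split's archives):

```bash
# Extract every downloaded archive of the current split in place.
for f in images_*.tar; do
  tar -xf "$f"
done
```
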
18 changes: 12 additions & 6 deletions download-dataset.sh
@@ -16,17 +16,23 @@
# Number of processes to run in parallel.
NUM_PROC=6

-# Inclusive upper limit for file downloads.
-# Default of N=499 will download all files, i.e. images_000.tar...images_499.tar
-N=499
+# Dataset split to download.
+# Options: train, test, index.
+SPLIT=$1
+
+# Inclusive upper limit for file downloads. Should be set according to split:
+# train --> 499.
+# test --> 19.
+# index --> 99.
+N=$2

download_check_and_extract() {
  local i=$1
  images_file_name=images_$1.tar
  images_md5_file_name=md5.images_$1.txt
-  images_tar_url=https://s3.amazonaws.com/google-landmark/train/$images_file_name
-  images_md5_url=https://s3.amazonaws.com/google-landmark/md5sum/train/$images_md5_file_name
-  echo "Downloading $images_file_name..."
+  images_tar_url=https://s3.amazonaws.com/google-landmark/$SPLIT/$images_file_name
+  images_md5_url=https://s3.amazonaws.com/google-landmark/md5sum/$SPLIT/$images_md5_file_name
+  echo "Downloading $images_file_name and its md5sum..."
  curl -Os $images_tar_url > /dev/null
  curl -Os $images_md5_url > /dev/null
  if [[ "$OSTYPE" == "linux-gnu" ]]; then
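
The excerpt above shows only the per-file download function; a driver loop such
as the following is one way it could be run in parallel across `NUM_PROC`
processes (an illustrative sketch, not necessarily the exact mechanism used in
`download-dataset.sh`):

```bash
# Launch downloads for indices 000..N, keeping at most NUM_PROC jobs in flight.
for i in $(seq -f "%03g" 0 "$N"); do
  download_check_and_extract "$i" &
  if (( $(jobs -r -p | wc -l) >= NUM_PROC )); then
    wait  # let the current batch finish before starting more
  fi
done
wait
```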
