Edits to README and download-dataset script with new instructions

andrefaraujo committed May 22, 2019
1 parent eeb15e4 commit e4b7a4ab345018cb873cdc1d141e9141287ba832
Showing 2 changed files with 61 additions and 16 deletions.
  1. +49 −10 README.md
  2. +12 −6 download-dataset.sh
@@ -4,7 +4,8 @@ This is the second version of the Google Landmarks dataset, which contains
images annotated with labels representing human-made and natural landmarks. The
dataset can be used for landmark recognition and retrieval experiments. This
version of the dataset contains approximately 5 million images, split into 3
sets of images: `train`, `index` and `test`.
sets of images: `train`, `index` and `test`. The dataset is presented in our
[Google AI blog post](https://ai.googleblog.com/2019/05/announcing-google-landmarks-v2-improved.html).

This dataset is associated with two Kaggle challenges, on
[landmark recognition](https://kaggle.com/c/landmark-recognition-2019) and
@@ -18,15 +19,7 @@ For reference, the previous version of the Google Landmarks dataset is available

## Download `train` set

### Using the provided script
Running `download-dataset.sh` will automatically download, extract, and verify the images in the current directory.

```bash
chmod +x download-dataset.sh
./download-dataset.sh
```

Note: This script downloads files in parallel. To adjust the number of parallel downloads, modify `NUM_PROC` in the script.
There are 4,132,914 images in the `train` set.

### Download the labels and metadata

@@ -50,6 +43,19 @@ them, access the following link:

And similarly for the other files.
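
For a quick manual check, a single archive and its checksum file can be fetched directly; the URL pattern below is the one used by `download-dataset.sh` for the `train` split:

```bash
# Download one image archive and its md5 checksum file directly.
# For the train split the index runs from 000 to 499.
curl -Os https://s3.amazonaws.com/google-landmark/train/images_000.tar
curl -Os https://s3.amazonaws.com/google-landmark/md5sum/train/md5.images_000.txt
```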

#### Using the provided script

```bash
mkdir train && cd train
bash ../download-dataset.sh train 499
```

This will automatically download, verify and extract the images to the `train`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.
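
As a rough sketch of how such parallel dispatch can be driven (hypothetical; the script's actual mechanism may differ), `xargs -P` can run the per-file helper concurrently:

```bash
# Hypothetical fan-out: process indices 000..N, at most NUM_PROC at a time.
# Assumes download_check_and_extract and SPLIT are exported to child shells.
export -f download_check_and_extract
export SPLIT
seq -f "%03g" 0 "$N" | xargs -P "$NUM_PROC" -I{} bash -c 'download_check_and_extract "$1"' _ {}
```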

### `train` image licenses

All images in the `train` set have CC-BY licenses without the NonDerivs (ND)
@@ -58,6 +64,8 @@ restriction. To verify the license for a particular image, please refer to

## Download `index` set

There are 761,757 images in the `index` set.

### Download the list of images

- `index.csv`: single-column CSV with an `id` field. `id` is a 16-character string.
@@ -75,12 +83,27 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir index && cd index
bash ../download-dataset.sh index 99
```

This will automatically download, verify and extract the images to the `index`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.

### `index` image licenses

All images in the `index` set have CC-0 or Public Domain licenses.

## Download `test` set

There are 117,577 images in the `test` set.

### Download the list of images

- `test.csv`: single-column CSV with an `id` field. `id` is a 16-character string.
@@ -98,6 +121,19 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir test && cd test
bash ../download-dataset.sh test 19
```

This will automatically download, verify and extract the images to the `test`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.

### `test` image licenses

All images in the `test` set have CC-0 or Public Domain licenses.
@@ -115,6 +151,9 @@ For example, the md5sum file corresponding to the `images_000.tar` file in the

And similarly for the other files.

If you use the provided `download-dataset.sh` script, file integrity is verified
automatically right after each download.
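
To spot-check a single archive by hand (assuming the checksum files put the hash in the first field, as standard `md5sum` output does):

```bash
# Compare the local hash of one archive against the published one.
md5sum images_000.tar | cut -d' ' -f1   # local hash (Linux; use `md5 -q` on macOS)
cut -d' ' -f1 md5.images_000.txt        # published hash
```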

## Extracting the data

We recommend that the set of TAR files corresponding to each dataset split be
@@ -16,17 +16,23 @@
# Number of processes to run in parallel.
NUM_PROC=6

# Inclusive upper limit for file downloads.
# Default of N=499 will download all files, i.e. images_000.tar...images_499.tar
N=499
# Dataset split to download.
# Options: train, test, index.
SPLIT=$1

# Inclusive upper limit for file downloads. Should be set according to split:
# train --> 499.
# test --> 19.
# index --> 99.
N=$2

download_check_and_extract() {
local i=$1
images_file_name=images_$1.tar
images_md5_file_name=md5.images_$1.txt
images_tar_url=https://s3.amazonaws.com/google-landmark/train/$images_file_name
images_md5_url=https://s3.amazonaws.com/google-landmark/md5sum/train/$images_md5_file_name
echo "Downloading $images_file_name..."
images_tar_url=https://s3.amazonaws.com/google-landmark/$SPLIT/$images_file_name
images_md5_url=https://s3.amazonaws.com/google-landmark/md5sum/$SPLIT/$images_md5_file_name
echo "Downloading $images_file_name and its md5sum..."
curl -Os $images_tar_url > /dev/null
curl -Os $images_md5_url > /dev/null
if [[ "$OSTYPE" == "linux-gnu" ]]; then
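
The hunk ends at the platform check; a minimal sketch of the kind of verification that plausibly follows (hypothetical code, not the committed script):

```bash
# Hypothetical continuation: compute the local md5 in a platform-appropriate
# way, compare it with the published value, and extract only on a match.
if [[ "$OSTYPE" == "linux-gnu" ]]; then
  local_md5=$(md5sum "$images_file_name" | cut -d' ' -f1)
else
  local_md5=$(md5 -q "$images_file_name")  # macOS
fi
expected_md5=$(cut -d' ' -f1 "$images_md5_file_name")
if [[ "$local_md5" == "$expected_md5" ]]; then
  tar -xf "$images_file_name"
  echo "$images_file_name verified and extracted."
else
  echo "md5 mismatch for $images_file_name." >&2
fi
```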
