Edits to README and download-dataset script with new instructions
andrefaraujo committed May 22, 2019
1 parent eeb15e4 commit e4b7a4a
Showing 2 changed files with 61 additions and 16 deletions.
59 changes: 49 additions & 10 deletions README.md
@@ -4,7 +4,8 @@ This is the second version of the Google Landmarks dataset, which contains
images annotated with labels representing human-made and natural landmarks. The
dataset can be used for landmark recognition and retrieval experiments. This
version of the dataset contains approximately 5 million images, split into 3
-sets of images: `train`, `index` and `test`.
+sets of images: `train`, `index` and `test`. The dataset is presented in our
+[Google AI blog post](https://ai.googleblog.com/2019/05/announcing-google-landmarks-v2-improved.html).

This dataset is associated with two Kaggle challenges, on
[landmark recognition](https://kaggle.com/c/landmark-recognition-2019) and
@@ -18,15 +19,7 @@ For reference, the previous version of the Google Landmarks dataset is available

## Download `train` set

-### Using the provided script
-Running `download-dataset.sh` will automatically download, extract, and verify the images in the current directory.
-
-```bash
-chmod +x download-dataset.sh
-./download-dataset.sh
-```
-
-Note: This script downloads files in parallel. To adjust the number of parallel downloads, modify `NUM_PROC` in the script.
+There are 4,132,914 images in the `train` set.

### Download the labels and metadata

@@ -50,6 +43,19 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir train && cd train
bash ../download-dataset.sh train 499
```

This will automatically download, verify and extract the images to the `train`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.
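
If you prefer to fetch a single archive by hand, the equivalent of one
iteration of the script is roughly the following (a sketch based on the URL
pattern in `download-dataset.sh`; `images_000.tar` is just one example archive
out of `images_000.tar`...`images_499.tar`):

```bash
# Download one train archive and its checksum file (same URLs the script uses).
curl -Os https://s3.amazonaws.com/google-landmark/train/images_000.tar
curl -Os https://s3.amazonaws.com/google-landmark/md5sum/train/md5.images_000.txt
```

Verification and extraction of downloaded archives are covered in the sections
below.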

### `train` image licenses

All images in the `train` set have CC-BY licenses without the NonDerivs (ND)
@@ -58,6 +64,8 @@ restriction. To verify the license for a particular image, please refer to

## Download `index` set

There are 761,757 images in the `index` set.

### Download the list of images

- `index.csv`: single-column CSV with an `id` field. `id` is a 16-character string.
@@ -75,12 +83,27 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir index && cd index
bash ../download-dataset.sh index 99
```

This will automatically download, verify and extract the images to the `index`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.

### `index` image licenses

All images in the `index` set have CC-0 or Public Domain licenses.

## Download `test` set

There are 117,577 images in the `test` set.

### Download the list of images

- `test.csv`: single-column CSV with an `id` field. `id` is a 16-character string.
@@ -98,6 +121,19 @@ them, access the following link:

And similarly for the other files.

#### Using the provided script

```bash
mkdir test && cd test
bash ../download-dataset.sh test 19
```

This will automatically download, verify and extract the images to the `test`
directory.

Note: This script downloads files in parallel. To adjust the number of parallel
downloads, modify `NUM_PROC` in the script.

### `test` image licenses

All images in the `test` set have CC-0 or Public Domain licenses.
@@ -115,6 +151,9 @@ For example, the md5sum file corresponding to the `images_000.tar` file in the

And similarly for the other files.

If you use the provided `download-dataset.sh` script, file integrity is checked
automatically right after each download.
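
For a manually downloaded archive, a minimal check along the lines of the
script's own verification might look like this (a sketch, assuming the checksum
files are in the standard `md5sum` format):

```bash
# Verify images_000.tar against its published md5 file.
if [[ "$OSTYPE" == "linux-gnu" ]]; then
  md5sum -c md5.images_000.txt
else
  # macOS ships `md5` instead of `md5sum`; compare the digests manually.
  [[ "$(md5 -q images_000.tar)" == "$(awk '{print $1}' md5.images_000.txt)" ]] \
    && echo "images_000.tar: OK" || echo "images_000.tar: FAILED"
fi
```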

## Extracting the data

We recommend that the set of TAR files corresponding to each dataset split be
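
If you extract the archives manually instead of using the script, a simple loop
over the downloaded TAR files of a split is enough (an illustrative sketch; run
it inside the directory holding that split's archives):

```bash
# Extract every downloaded archive of the current split in place.
for f in images_*.tar; do
  tar -xf "$f"
done
```
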
18 changes: 12 additions & 6 deletions download-dataset.sh
@@ -16,17 +16,23 @@
# Number of processes to run in parallel.
NUM_PROC=6

-# Inclusive upper limit for file downloads.
-# Default of N=499 will download all files, i.e. images_000.tar...images_499.tar
-N=499
+# Dataset split to download.
+# Options: train, test, index.
+SPLIT=$1
+
+# Inclusive upper limit for file downloads. Should be set according to split:
+# train --> 499.
+# test --> 19.
+# index --> 99.
+N=$2

download_check_and_extract() {
  local i=$1
  images_file_name=images_$1.tar
  images_md5_file_name=md5.images_$1.txt
-  images_tar_url=https://s3.amazonaws.com/google-landmark/train/$images_file_name
-  images_md5_url=https://s3.amazonaws.com/google-landmark/md5sum/train/$images_md5_file_name
-  echo "Downloading $images_file_name..."
+  images_tar_url=https://s3.amazonaws.com/google-landmark/$SPLIT/$images_file_name
+  images_md5_url=https://s3.amazonaws.com/google-landmark/md5sum/$SPLIT/$images_md5_file_name
+  echo "Downloading $images_file_name and its md5sum..."
  curl -Os $images_tar_url > /dev/null
  curl -Os $images_md5_url > /dev/null
  if [[ "$OSTYPE" == "linux-gnu" ]]; then
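
The excerpt above shows only the per-file download function; a driver loop such
as the following is one way it could be run in parallel across `NUM_PROC`
processes (an illustrative sketch, not necessarily the exact mechanism used in
`download-dataset.sh`):

```bash
# Launch downloads for indices 000..N, keeping at most NUM_PROC jobs in flight.
for i in $(seq -f "%03g" 0 "$N"); do
  download_check_and_extract "$i" &
  if (( $(jobs -r -p | wc -l) >= NUM_PROC )); then
    wait  # let the current batch finish before starting more
  fi
done
wait
```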
