Data Preparation

This directory contains the scripts and instructions to prepare the datasets used in the experiments.

Please set data_root to the directory where the datasets will be stored.

Overview

data_root
|-- ID_ImageNet1K
|   |-- test/ e.g. n01440764/           # 50,000 JPEG images
|   `-- val/ e.g. n01440764/            #  1,000 JPEG images
|-- ID_VOC
|   |-- test/ e.g. aeroplane/           #    906 jpg images
|   `-- val/ e.g. aeroplane/            #     94 jpg images
|-- OOD_COCO
|   `-- test/images                     #  1,000 jpeg images
|-- OOD_ImageNet22K
|   `-- test/ e.g. n01937909/           # 18,335 JPEG images
|-- OOD_Places
|   `-- test/images                     # 10,000 jpg images
|-- OOD_Sun
|   `-- test/images                     # 10,000 jpg images
|-- OOD_Texture
|   `-- test/images/ e.g. banded/       #  5,640 jpg images
`-- OOD_iNaturalist
    `-- test/images                     # 10,000 jpg images

In-Domain (ID) Datasets Preparation

1. ID_ImageNet1K

We use the ImageNet-1000 (ILSVRC2012) dataset for ID validation and testing. The original dataset contains 1.2 million training images and 50,000 validation images from 1000 classes, and is widely used for image classification. We follow MCM to construct the ImageNet1K ID test set from the validation set. Additionally, we curated an ImageNet1K ID validation set from the training set by randomly selecting one image for each label.
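The "one random image per label" selection described above could be sketched roughly as follows. This is an illustration only, not the script used to build the released validation set: the function name `sample_one_per_class`, the fixed seed, and the one-sub-folder-per-class layout of the training set are all assumptions.

```python
import random
from pathlib import Path


def sample_one_per_class(train_dir, seed=0):
    """Pick one image at random from every class sub-folder of train_dir.

    Assumes (hypothetically) a layout of train_dir/<class_name>/<image files>.
    Returns a dict mapping class name -> chosen file name.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    chosen = {}
    for class_dir in sorted(Path(train_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        # Sort before sampling so the result does not depend on filesystem order.
        images = sorted(p.name for p in class_dir.iterdir() if p.is_file())
        if images:
            chosen[class_dir.name] = rng.choice(images)
    return chosen
```

Sorting the class folders and file names before sampling keeps the result stable across machines for a given seed.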

ID_ImageNet1K_val

We provide the curated ImageNet1K ID validation set here. Please download it and extract it to data_root/ID_ImageNet1K/val/.

We also provide the code to reproduce the ImageNet1K ID validation set if needed.

Construct ImageNet1K ID validation set

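export data_root=/path/to/data_root
export SVD_OOD=/path/to/SVD_OOD
mkdir -p $data_root/downloads
cd $data_root/downloads

# download train(task1&2) images, 138GB
wget https://download_link_to_ILSVRC_2012/ILSVRC2012_img_train.tar
# extract into ImageNet_train (tar -C requires the directory to exist)
mkdir -p ImageNet_train
tar -xvf ILSVRC2012_img_train.tar -C ImageNet_train

# restore the ImageNet1K ID validation set from the training set
cd $SVD_OOD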
python data/utils/restore_files.py \
--json_file data/ID_ImageNet1K/imagenet1k_val_data.json \
--source_folder $data_root/downloads/ImageNet_train \
--target_folder $data_root/ID_ImageNet1K/val
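For readers who want to see what a manifest-driven restore step might look like, here is a minimal sketch. It is not the repository's `restore_files.py`: the function name `restore_from_manifest` and the manifest layout (a JSON object mapping each class folder to a list of file names) are assumptions made purely for illustration, and the real JSON files may be organized differently.

```python
import json
import shutil
from pathlib import Path


def restore_from_manifest(json_file, source_folder, target_folder):
    """Copy the files listed in a JSON manifest from source to target.

    Hypothetical manifest format: {"<class_name>": ["<file_name>", ...], ...}.
    Returns (number of files copied, list of relative paths not found).
    """
    with open(json_file) as f:
        manifest = json.load(f)

    copied, missing = 0, []
    for class_name, file_names in manifest.items():
        out_dir = Path(target_folder) / class_name
        out_dir.mkdir(parents=True, exist_ok=True)
        for name in file_names:
            src = Path(source_folder) / class_name / name
            if src.is_file():
                shutil.copy2(src, out_dir / name)  # copy, keep the source intact
                copied += 1
            else:
                missing.append(f"{class_name}/{name}")
    return copied, missing
```

Copying rather than moving leaves the downloaded source untouched, so the step can be rerun safely; the returned `missing` list makes an incomplete download easy to spot.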

ID_ImageNet1K_test

Download the ImageNet-1K dataset (source).

export data_root=/path/to/data_root
export SVD_OOD=/path/to/SVD_OOD
cd $data_root
# download valid(all tasks) images, 6.3GB
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar -P downloads

mkdir -p ./ID_ImageNet1K/test
tar -xvf downloads/ILSVRC2012_img_val.tar -C $data_root/ID_ImageNet1K/test

Execute the following script (source) to restructure the ImageNet1K ID test set from the original validation set.
```
cd $SVD_OOD
bash ./data/utils/restore_imagenet1k_test.sh $data_root/ID_ImageNet1K/test
```
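The validation archive extracts to a single flat folder, so restructuring amounts to moving each image into a sub-folder named after its class. The sketch below shows that idea in Python; it is not the shell script referenced above. The function name `group_by_class` and the `labels` mapping (file name to class folder name) are hypothetical, and the mapping itself must come from the official validation ground truth.

```python
import shutil
from pathlib import Path


def group_by_class(flat_dir, labels):
    """Move each image in flat_dir into a sub-folder named after its class.

    labels: dict mapping file name -> class folder name (hypothetical input,
    e.g. {"img_0001.JPEG": "class_a"}).
    Returns the number of files moved.
    """
    flat_dir = Path(flat_dir)
    moved = 0
    for file_name, class_name in labels.items():
        src = flat_dir / file_name
        if not src.is_file():
            continue  # already moved or never extracted; skip it
        class_dir = flat_dir / class_name
        class_dir.mkdir(exist_ok=True)
        shutil.move(str(src), str(class_dir / file_name))
        moved += 1
    return moved
```

Skipping files that are no longer in the flat folder makes the function safe to call a second time after a partial run.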

2. ID_VOC

The Pascal VOC (Visual Object Classes) dataset is a benchmark dataset widely used in computer vision, featuring annotated images across multiple object categories. We use the Pascal-VOC subset collected by GL-MCM as the ID validation and test sets. Each image contains ID objects from a single class and one or more OOD objects. The ID validation and test sets are split 1:9 within each class, resulting in 94 and 906 images, respectively.
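A per-class 1:9 split can be sketched as below. This is only an illustration of the idea: the function name `split_per_class`, the seed, and the rounding rule are assumptions, so it is not expected to reproduce the exact 94/906 partition, which is fixed by the provided JSON files.

```python
import random


def split_per_class(files_by_class, val_fraction=0.1, seed=0):
    """Split each class's file list into (val, test) at roughly val_fraction.

    files_by_class: dict mapping class name -> list of file names.
    Returns two dicts with the same keys: val files and test files.
    """
    rng = random.Random(seed)
    val, test = {}, {}
    for class_name, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the shuffle is reproducible
        rng.shuffle(shuffled)
        # Rounding rule is an assumption; keep at least one val image per class.
        n_val = max(1, round(len(shuffled) * val_fraction)) if shuffled else 0
        val[class_name] = shuffled[:n_val]
        test[class_name] = shuffled[n_val:]
    return val, test
```

Splitting inside each class, rather than over the pooled images, keeps every class represented in both sets.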

Download the datasets.tar.gz to data_root/downloads from Google Drive. This file will be reused in OOD datasets preparation.

Unzip and extract ID_VOC_val and ID_VOC_test to data_root/ID_VOC.

cd $data_root/downloads
# download `datasets.tar.gz` from Google Drive
# unzip file
tar -xzvf datasets.tar.gz

# clean hidden files, e.g. ._2008_003846.jpg
find datasets -type f -name ".*" -delete

cd $SVD_OOD

# extract ID_VOC_val and ID_VOC_test
mkdir -p $data_root/ID_VOC
python data/utils/restore_files.py \
--json_file data/ID_VOC/voc_val_data.json \
--source_folder $data_root/downloads/datasets/ID_VOC_single \
--target_folder $data_root/ID_VOC/val

python data/utils/restore_files.py \
--json_file data/ID_VOC/voc_test_data.json \
--source_folder $data_root/downloads/datasets/ID_VOC_single \
--target_folder $data_root/ID_VOC/test

Out-of-Domain (OOD) Datasets Preparation

1. iNaturalist_test, Places_test, Sun_test, Texture_test

Execute the following script to download and extract the iNaturalist, Places, Sun, and Texture OOD datasets.

cd $data_root/downloads

# download and unzip iNaturalist
wget http://pages.cs.wisc.edu/~huangrui/imagenet_ood_dataset/iNaturalist.tar.gz
tar -xvf iNaturalist.tar.gz
mkdir -p $data_root/OOD_iNaturalist/test
mv iNaturalist/images $data_root/OOD_iNaturalist/test

# download and unzip Places
wget http://pages.cs.wisc.edu/~huangrui/imagenet_ood_dataset/Places.tar.gz
tar -xvf Places.tar.gz
mkdir -p $data_root/OOD_Places/test
mv Places/images $data_root/OOD_Places/test


# download and unzip Sun
wget http://pages.cs.wisc.edu/~huangrui/imagenet_ood_dataset/SUN.tar.gz
tar -xvf SUN.tar.gz
mkdir -p $data_root/OOD_Sun/test
mv SUN/images $data_root/OOD_Sun/test

# download and unzip Texture
wget https://www.robots.ox.ac.uk/~vgg/data/dtd/download/dtd-r1.0.1.tar.gz
tar -xvf dtd-r1.0.1.tar.gz
mkdir -p $data_root/OOD_Texture/test
mv dtd/images $data_root/OOD_Texture/test
rm $data_root/OOD_Texture/test/images/waffled/.directory

2. COCO_test

MCM curated a Pascal-VOC OOD test set (VOC for short) with 4,000 images that do not overlap with the MS-COCO ID classes, which we use as OOD testing data for the MS-COCO ID test set.

Reuse the datasets.tar.gz downloaded earlier, and extract COCO_test to data_root/OOD_COCO/test.

cd $data_root

# extract COCO_test
mkdir -p OOD_COCO/test
mv $data_root/downloads/datasets/OOD_COCO/images OOD_COCO/test

3. ImageNet22K_test

The ImageNet-22K dataset, formerly known as ImageNet-21K, addresses the underestimation of its additional value compared to the standard ImageNet-1K pretraining, aiming to provide high-quality pretraining for a broader range of models. We use the filtered subset collected by multi-label-ood as the OOD test set for the MS-COCO and Pascal-VOC ID test sets.

Download the ImagenetOOD_for_COCO_VOC.tar to data_root/downloads from Google Drive.

Extract ImageNet22K_test:

cd $data_root/downloads
# download `ImagenetOOD_for_COCO_VOC.tar` from Google Drive
tar -xvf ImagenetOOD_for_COCO_VOC.tar

cd ../
# extract ImageNet22K_test
mkdir -p $data_root/OOD_ImageNet22K
mv $data_root/downloads/ImageNet-22K OOD_ImageNet22K
mv OOD_ImageNet22K/ImageNet-22K OOD_ImageNet22K/test

Check the data structure

Execute the following script to check the data structure.

cd $SVD_OOD
python data/utils/check_data_structure.py --data_root $data_root

You should see the following output:

Comparing folder structure...

Comparing ID_ImageNet1K...
Split test: matched! Number of images:  50000
Split val: matched! Number of images:  1000

Comparing ID_VOC...
Split test: matched! Number of images:  906
Split val: matched! Number of images:  94

Comparing OOD_iNaturalist...
Split test: matched! Number of images:  10000

Comparing OOD_Sun...
Split test: matched! Number of images:  10000

Comparing OOD_Places...
Split test: matched! Number of images:  10000

Comparing OOD_Texture...
Split test: matched! Number of images:  5640

Comparing OOD_ImageNet22K...
Split test: matched! Number of images:  18335

Comparing OOD_COCO...
Split test: matched! Number of images:  1000
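If the check script is unavailable, the image counts can be verified independently with a short sketch like the one below. It is not the repository's `check_data_structure.py`: the names `count_images` and `check_counts`, the set of accepted file suffixes, and the `"<dataset>/<split>"` key format are assumptions for illustration.

```python
from pathlib import Path

# Suffixes treated as images (an assumption; compared case-insensitively).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(split_dir):
    """Recursively count image files under split_dir; 0 if it does not exist."""
    split_dir = Path(split_dir)
    if not split_dir.is_dir():
        return 0
    return sum(
        1
        for p in split_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def check_counts(data_root, expected):
    """Compare image counts with expected values.

    expected: dict mapping "<dataset>/<split>" -> expected count,
    e.g. {"ID_VOC/val": 94}.
    Returns {key: (found, expected)} for mismatching entries only.
    """
    mismatches = {}
    for key, want in expected.items():
        found = count_images(Path(data_root) / key)
        if found != want:
            mismatches[key] = (found, want)
    return mismatches
```

For example, `check_counts(data_root, {"ID_VOC/val": 94, "ID_VOC/test": 906, "OOD_Texture/test": 5640})` returns an empty dict when those three splits match the counts listed above.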
