Open source release of the evaluation benchmark suite described in "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms"

realistic-ssl-evaluation

This repository contains the code for Realistic Evaluation of Deep Semi-Supervised Learning Algorithms, by Avital Oliver*, Augustus Odena*, Colin Raffel*, Ekin D. Cubuk, and Ian J. Goodfellow, arXiv preprint arXiv:1804.09170.

If you use the code in this repository for a published research project, please cite this paper.

The code is designed to run on Python 3 using the dependencies listed in requirements.txt. You can install the dependencies by running pip3 install -r requirements.txt.

The latest version of this repository can be found here.

Prepare datasets

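For SVHN and CIFAR-10, we provide scripts to automatically download and preprocess the data. The same script is also run for ImageNet 32x32, but that dataset must first be downloaded manually as described in the next section. Run the scripts as follows: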

python3 build_tfrecords.py --dataset_name=cifar10
python3 build_tfrecords.py --dataset_name=svhn
python3 build_tfrecords.py --dataset_name=imagenet_32

Final accuracy numbers for semi-supervised learning can vary significantly based on which labels from the training set are retained. We specify which images should have their labels retained via "label maps". This codebase is distributed with the label maps we use in our paper in the data/ subdirectory. If desired, you can generate new label maps by using the build_label_map.py script, but note that in doing so you will not be able to reliably compare your results to others (such as our own) which are based on different label maps.
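
To make the idea of a label map concrete, here is a minimal sketch. It is not the logic of build_label_map.py: it assumes a hypothetical representation in which a label map is simply a sorted list of training-set indices whose labels are retained. The function name make_label_map, the class-balanced sampling, and the seed values are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_label_map(labels, n_labeled, seed=0):
    """Pick n_labeled indices, spread evenly over classes, using a fixed seed.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    classes = sorted(by_class)
    if n_labeled % len(classes) != 0:
        raise ValueError("n_labeled must be divisible by the number of classes")
    per_class = n_labeled // len(classes)

    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    chosen = []
    for cls in classes:
        chosen.extend(rng.sample(by_class[cls], per_class))
    return sorted(chosen)


# Toy data: 100 examples over 10 classes.
toy_labels = [i % 10 for i in range(100)]
map_a = make_label_map(toy_labels, n_labeled=20, seed=0)
map_b = make_label_map(toy_labels, n_labeled=20, seed=0)
map_c = make_label_map(toy_labels, n_labeled=20, seed=1)

print(map_a == map_b)  # the same seed reproduces the same label map
print(len(map_a), len(map_c))
```

Changing the seed generally selects a different labeled subset, which is why accuracies obtained with different label maps are not directly comparable.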

Preparing ImageNet 32x32 dataset for fine-tuning experiment

First you'll need to download the 32x32 version of the ImageNet dataset by following the instructions here. Unzip the resulting files and put them in a directory called 'data/imagenet_32'. You'll then need to convert those files (which are pickle files) into .npy files. You can do this by executing:

mkdir data/imagenet_32
unzip Imagenet32_train.zip -d data/imagenet_32
unzip Imagenet32_val.zip -d data/imagenet_32
python3 convert_imagenet.py

ImageNet32x32 is the only dataset which must be downloaded manually, due to licensing issues.

Running experiments

All of the experiments in our paper are accompanied by a .yml file in runs/. These .yml files are intended to be used with tmuxp (https://github.com/tmux-python/tmuxp), a session manager for tmux. They provide a simple way to create a tmux session with all of the relevant tasks running (model training and evaluation). The .yml files are named according to their corresponding figure/table/section in the paper. For example, if you want to run an experiment evaluating VAT with 500 labels as shown in Figure 3, you could run:

tmuxp load runs/figure-3-svhn-500-vat.yml

Of course, you can also run the code without using tmuxp. Each .yml file specifies the commands needed for running each experiment. For example, the file mentioned above, runs/figure-3-svhn-500-vat.yml, runs the following commands:

CUDA_VISIBLE_DEVICES=0 python3 train_model.py --verbosity=0 --primary_dataset_name='svhn' --secondary_dataset_name='svhn' --root_dir=/mnt/experiment-logs/figure-3-svhn-500-vat --n_labeled=500 --consistency_model=vat --hparam_string=""  2>&1 | tee /mnt/experiment-logs/figure-3-svhn-500-vat_train.log
CUDA_VISIBLE_DEVICES=1 python3 evaluate_model.py --split=test --verbosity=0 --primary_dataset_name='svhn' --root_dir=/mnt/experiment-logs/figure-3-svhn-500-vat --consistency_model=vat --hparam_string=""  2>&1 | tee /mnt/experiment-logs/figure-3-svhn-500-vat_eval_test.log
CUDA_VISIBLE_DEVICES=2 python3 evaluate_model.py --split=valid --verbosity=0 --primary_dataset_name='svhn' --root_dir=/mnt/experiment-logs/figure-3-svhn-500-vat --consistency_model=vat --hparam_string=""  2>&1 | tee /mnt/experiment-logs/figure-3-svhn-500-vat_eval_valid.log
CUDA_VISIBLE_DEVICES=3 python3 evaluate_model.py --split=train --verbosity=0 --primary_dataset_name='svhn' --root_dir=/mnt/experiment-logs/figure-3-svhn-500-vat --consistency_model=vat --hparam_string=""  2>&1 | tee /mnt/experiment-logs/figure-3-svhn-500-vat_eval_train.log

Note that these commands are formulated to write out results to /mnt/experiment-logs. You will either need to create this directory or modify them to write to a different directory. Further, the .yml files are written to assume that this source tree lives in /root/realistic-ssl-evaluation.
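
If you would rather write to a different directory, one option is to generate the command lines rather than edit them by hand. The following is a minimal sketch under stated assumptions: build_commands and its parameter names are hypothetical helpers, not part of this repository, and only flags that appear in the commands above are used. The CUDA_VISIBLE_DEVICES prefix and the tee redirection are left out for brevity.

```python
import shlex


def build_commands(experiment, log_root, dataset="svhn", n_labeled=500,
                   consistency_model="vat", hparam_string=""):
    """Return the training command and one evaluation command per split.

    Hypothetical helper; it only reuses flags shown in the commands above.
    """
    root_dir = f"{log_root}/{experiment}"
    shared = [
        "--verbosity=0",
        f"--primary_dataset_name={dataset}",
        f"--root_dir={root_dir}",
        f"--consistency_model={consistency_model}",
        f"--hparam_string={hparam_string}",
    ]
    commands = {
        "train": ["python3", "train_model.py", *shared,
                  f"--secondary_dataset_name={dataset}",
                  f"--n_labeled={n_labeled}"],
    }
    for split in ("test", "valid", "train"):
        commands[f"eval_{split}"] = [
            "python3", "evaluate_model.py", f"--split={split}", *shared]
    # shlex.join quotes arguments where needed, so each result is one shell line.
    return {name: shlex.join(argv) for name, argv in commands.items()}


cmds = build_commands("figure-3-svhn-500-vat", log_root="/tmp/experiment-logs")
for name, cmd in cmds.items():
    print(f"{name}: {cmd}")
```

The example writes to /tmp/experiment-logs, an arbitrary path chosen for illustration; the directory still has to exist before the commands are run.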

A note on reproducibility

While the focus of our paper is reproducibility, ultimately exact comparison to the results in our paper will be confounded by subtle differences such as the version of TensorFlow used, random seeds, etc. In other words, simply copying the numbers stated in our paper may not provide a means for reliable comparison. As a result, if you'd like to use our implementation of baseline methods as a point of comparison for, e.g., a new semi-supervised learning technique, we'd recommend re-running our experiments from scratch in the same environment as your new technique.
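
One way to keep track of such differences is to store a short description of the environment next to each set of results. The sketch below is an illustration only and is not part of this repository: describe_environment, the package list, and the seed value are assumptions. It uses only the Python standard library (importlib.metadata requires Python 3.8 or later).

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages, seed):
    """Collect version information worth storing next to experiment results."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "packages": versions,
    }


env = describe_environment(["tensorflow", "numpy"], seed=1234)
print(json.dumps(env, indent=2, sort_keys=True))
```

Comparing two such records makes it easier to tell whether a gap between a baseline and a new technique could stem from the environment rather than the method.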

Simulating small validation sets

The following command runs evaluation on a set of checkpoints, with multiple resamples of small validation sets (as in Figure 5 in the paper):

python3 evaluate_checkpoints.py --primary_dataset_name='cifar10' --checkpoints='/mnt/experiment-logs/section-4-3-cifar-fine-tuning/default/model.ckpt-1000000,/mnt/.../model.ckpt-...,...'
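
As a rough picture of what resampling a small validation set means, consider the following sketch. It is not the implementation in evaluate_checkpoints.py: it assumes that a 0/1 correctness value is available for each validation example, and the function name, subset size, number of resamples, and seed are illustrative assumptions.

```python
import random


def resampled_accuracies(correct, subset_size, n_resamples, seed=0):
    """Accuracy on n_resamples random subsets, each of size subset_size.

    `correct` holds one 0/1 entry per validation example (1 = predicted correctly).
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_resamples):
        subset = rng.sample(correct, subset_size)  # sampling without replacement
        accuracies.append(sum(subset) / subset_size)
    return accuracies


# Toy model that is right on 90% of 5,000 validation examples.
correct = [1] * 4500 + [0] * 500
accs = resampled_accuracies(correct, subset_size=100, n_resamples=10)
print(accs)
print(max(accs) - min(accs))  # spread caused purely by the small validation set
```

The model's underlying accuracy is fixed in this toy setup, so any variation between the printed values comes only from which examples happen to land in each small validation set.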

Results are printed to stdout for each evaluation run, and at the end a string representation of the entire list of validation accuracies for each resampled validation set and each checkpoint is printed:

{'/mnt/experiment-logs/table-1-svhn-1000-pi-model-run-5/default/model.ckpt-500001': [0.86, 0.93, 0.92, 0.91, 0.9, 0.94, 0.91, 0.88, 0.88, 0.89]}
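
Since the final line has the form of a Python dictionary literal, it can be read back for further analysis. The sketch below assumes the printed line was captured as a string (for example from a log file) and summarizes the accuracies per checkpoint; the variable names and the choice of summary statistics are illustrative.

```python
import ast
import statistics

# The example output shown above, captured as a string.
printed = ("{'/mnt/experiment-logs/table-1-svhn-1000-pi-model-run-5/default/"
           "model.ckpt-500001': [0.86, 0.93, 0.92, 0.91, 0.9, 0.94, 0.91, 0.88, 0.88, 0.89]}")

# literal_eval parses Python literals only, so it is safer than eval here.
results = ast.literal_eval(printed)

summary = {}
for checkpoint, accuracies in results.items():
    summary[checkpoint] = {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
    }

for checkpoint, stats in summary.items():
    print(checkpoint)
    print("  mean={mean:.3f} stdev={stdev:.3f} min={min:.2f} max={max:.2f}".format(**stats))
```

For the example above, the ten resampled accuracies range from 0.86 to 0.94 for a single checkpoint, which illustrates how noisy a small validation set can be.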

Disclaimer

This is not an official Google product.