This page provides instructions for running the two other distillation settings we explored in the paper, details on how to customize our code, and instructions for testing. We use `MNIST -> USPS` as the example; commands for other adaptation cases are similar.
- Train many networks on `MNIST`:

  ```sh
  # 200 networks for training
  python main.py --mode train --dataset MNIST --arch LeNet --n_nets 200 \
      --epochs 40 --decay_epochs 20 --lr 2e-4
  # 20 networks for testing
  python main.py --mode train --dataset MNIST --arch LeNet --n_nets 20 \
      --epochs 40 --decay_epochs 20 --lr 2e-4 --phase test
  ```
- Train distilled images on these 200 networks to work well on `USPS`, sampling 4 of them at each iteration:

  ```sh
  python main.py --mode distill_adapt --source_dataset MNIST --dataset USPS \
      --arch LeNet --train_nets_type loaded --n_nets 200 --sample_n_nets 4 \
      --test_nets_type loaded --test_n_nets 20
  ```

  `test_n_nets` here can be less than the number of testing networks trained in the previous step, since these are only used to monitor results during training. At full evaluation time, all networks can be used. See the Testing section below for details.
See the Distributed Training section below for using many more networks, e.g., the 2000 training networks used in the experiments in our paper.
This is similar to the above, but now we only need to train one network, and then train the distilled images only on it. E.g.,

```sh
# train 1 network
python main.py --mode train --dataset MNIST --arch LeNet --n_nets 1 \
    --epochs 40 --decay_epochs 20 --lr 2e-4
# train distilled images
python main.py --mode distill_adapt --source_dataset MNIST --dataset USPS \
    --arch LeNet --train_nets_type loaded --n_nets 1 \
    --test_nets_type same_as_train
```
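For intuition, below is a minimal, self-contained sketch of the bilevel optimization that the `distill_*` modes perform: distilled images (and a learned learning rate) are optimized so that a network trained on them does well on real data. This is a toy illustration with stand-in data, a single inner gradient step, and generic variable names that are not the repository's; it assumes PyTorch >= 2.0 for `torch.func.functional_call`. The actual implementation differs in many details (multiple steps and epochs, per-step learning rates, sampled networks, etc.).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

# Stand-in for real (target-dataset) images and labels.
target_x = torch.randn(64, 1, 28, 28)
target_y = torch.randint(0, 10, (64,))

# Learnable distilled data: one image per class, plus a learnable inner-loop lr.
distilled_x = torch.randn(10, 1, 28, 28, requires_grad=True)
distilled_y = torch.arange(10)
distill_lr = torch.tensor(0.02, requires_grad=True)
optimizer = torch.optim.Adam([distilled_x, distill_lr], lr=1e-3)

for it in range(100):
    # "unknown_init"-style behavior: sample a fresh network every iteration.
    net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    params = dict(net.named_parameters())

    # Inner step: one gradient step on the distilled images.
    inner_logits = functional_call(net, params, (distilled_x,))
    inner_loss = F.cross_entropy(inner_logits, distilled_y)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    updated = {name: p - distill_lr * g
               for (name, p), g in zip(params.items(), grads)}

    # Outer step: the updated network should classify real data well;
    # backpropagate through the inner step into the distilled images and lr.
    outer_loss = F.cross_entropy(functional_call(net, updated, (target_x,)), target_y)
    optimizer.zero_grad()
    outer_loss.backward()
    optimizer.step()
```

Roughly speaking, in the adaptation setting `target_x`/`target_y` would come from the target dataset (e.g., `USPS`), while the networks being updated are pre-trained on the source dataset.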
Here we use the same mode as the basic setting, but with the special initialization method `"imagenet_pretrained"`. For example, the following command adapts an ImageNet-pretrained `AlexNet` to the `PASCAL_VOC` dataset:

```sh
python main.py --mode distill_basic --dataset PASCAL_VOC --arch AlexNet \
    --distill_steps 1 --distill_epochs 3 --init imagenet_pretrained \
    --train_nets_type known_init --test_nets_type same_as_train \
    --epochs 200 --decay_epochs 30 --distill_lr 0.001
```
Use `--dataset CUB200` to adapt to the `CUB200` dataset instead.
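As a rough illustration of what an ImageNet-pretrained initialization can look like, the hypothetical helper below copies matching parameters from torchvision's pretrained AlexNet into a model. This is not the repository's `init_weights` code (see `networks/utils.py` for the real implementation), just a hedged sketch of the general idea.

```python
import torch
import torchvision

def load_imagenet_alexnet_weights(model: torch.nn.Module) -> torch.nn.Module:
    """Copy ImageNet-pretrained AlexNet parameters into `model` wherever the
    parameter name and shape match; everything else is left untouched."""
    pretrained = torchvision.models.alexnet(weights="IMAGENET1K_V1").state_dict()
    own = model.state_dict()
    own.update({k: v for k, v in pretrained.items()
                if k in own and v.shape == own[k].shape})
    model.load_state_dict(own)
    return model
```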
For the malicious attack setting, we use `Cifar10` as the example.
- Similar to the adaptation setting, we start by training many networks:

  ```sh
  # 2000 networks for training
  python main.py --mode train --dataset Cifar10 --arch AlexCifarNet --n_nets 2000 \
      --epochs 40 --decay_epochs 20 --lr 2e-4
  # 200 networks for testing
  python main.py --mode train --dataset Cifar10 --arch AlexCifarNet --n_nets 200 \
      --epochs 40 --decay_epochs 20 --lr 2e-4 --phase test
  ```
- Distill for the malicious attack objective, so that these well-optimized networks misclassify a certain class (`attack_class`) as another (`target_class`) after training on the distilled images:

  ```sh
  python main.py --mode distill_attack --dataset Cifar10 --arch AlexCifarNet \
      --train_nets_type loaded --n_nets 2000 --sample_n_nets 4 \
      --test_nets_type loaded --test_n_nets 20 \
      --attack_class 0 --target_class 1 --lr 0.02
  ```
See the Distributed Training section below for training and loading this many networks, e.g., the 2000 training networks used in the experiments in our paper.
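A natural way to measure the success of this attack is the fraction of `attack_class` examples that a network trained on the distilled images predicts as `target_class`. The function below is an illustrative sketch of such a metric, not the exact objective or metric used in the repository.

```python
import torch

def attack_success_rate(logits: torch.Tensor, labels: torch.Tensor,
                        attack_class: int, target_class: int) -> float:
    """Fraction of examples with true label `attack_class` that the network
    predicts as `target_class` (illustrative metric only)."""
    preds = logits.argmax(dim=1)
    mask = labels == attack_class
    if mask.sum() == 0:
        return 0.0
    return (preds[mask] == target_class).float().mean().item()
```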
We often need to load multiple networks into GPU memory (e.g., for the results presented in the paper, we use 2000 networks for training in the adaptation and malicious attack settings). A single GPU cannot hold all these networks. In such cases, we can use NCCL distributed training by specifying a `world_size` larger than `1`. You then need to start `world_size` processes with identical arguments except `device_id`, each with a different environment variable `RANK` representing the process rank in `[0, 1, ..., world_size - 1]`.
There are two ways to initialize a process group in PyTorch for distributed training:

- TCP init. Specify the environment variables `MASTER_ADDR` and `MASTER_PORT`, representing an address and port accessible from all ranks.
- File system init. Specify the environment variable `INIT_FILE`, representing the path to a file accessible from all ranks (see the sketch below).
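For reference, a minimal sketch of how these environment variables could map onto `torch.distributed` initialization; this is an assumption about what `main.py` does internally, not a copy of its code.

```python
import os
import torch.distributed as dist

def init_distributed(world_size: int):
    """Initialize the NCCL process group from RANK plus either INIT_FILE
    (file system init) or MASTER_ADDR/MASTER_PORT (TCP init)."""
    rank = int(os.environ["RANK"])
    if "INIT_FILE" in os.environ:
        init_method = "file://" + os.environ["INIT_FILE"]
    else:
        init_method = "tcp://{}:{}".format(os.environ["MASTER_ADDR"],
                                           os.environ["MASTER_PORT"])
    dist.init_process_group(backend="nccl", init_method=init_method,
                            world_size=world_size, rank=rank)
```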
For example,
- These commands start 2 processes that each train 1000 networks on a different GPU (2000 in total):

  ```sh
  # rank 0: gpu 0, train [0, 1000)
  env RANK=0 INIT_FILE=/tmp/distill_init \
      python main.py --mode train --dataset MNIST --arch LeNet --n_nets 2000 \
      --epochs 40 --decay_epochs 20 --lr 2e-4 --world_size 2 --device_id 0
  # rank 1: gpu 1, train [1000, 2000)
  env RANK=1 INIT_FILE=/tmp/distill_init \
      python main.py --mode train --dataset MNIST --arch LeNet --n_nets 2000 \
      --epochs 40 --decay_epochs 20 --lr 2e-4 --world_size 2 --device_id 1
  ```
- These commands start 4 processes that collectively train distilled images for 2000 pre-trained networks. Each process loads 500 networks on a different GPU, and samples 1 network in each iteration:

  ```sh
  # rank 0: gpu 0, load [0, 500)
  env RANK=0 MASTER_ADDR=XXXXX MASTER_PORT=23456 \
      python main.py --mode distill_adapt --source_dataset MNIST --dataset USPS \
      --arch LeNet --train_nets_type loaded --n_nets 2000 --sample_n_nets 4 \
      --test_nets_type loaded --test_n_nets 20 --world_size 4 --device_id 0
  # rank 1: gpu 1, load [500, 1000)
  env RANK=1 MASTER_ADDR=XXXXX MASTER_PORT=23456 \
      python main.py --mode distill_adapt --source_dataset MNIST --dataset USPS \
      --arch LeNet --train_nets_type loaded --n_nets 2000 --sample_n_nets 4 \
      --test_nets_type loaded --test_n_nets 20 --world_size 4 --device_id 1
  # rank 2: gpu 2, load [1000, 1500)
  env RANK=2 MASTER_ADDR=XXXXX MASTER_PORT=23456 \
      python main.py --mode distill_adapt --source_dataset MNIST --dataset USPS \
      --arch LeNet --train_nets_type loaded --n_nets 2000 --sample_n_nets 4 \
      --test_nets_type loaded --test_n_nets 20 --world_size 4 --device_id 2
  # rank 3: gpu 3, load [1500, 2000)
  env RANK=3 MASTER_ADDR=XXXXX MASTER_PORT=23456 \
      python main.py --mode distill_adapt --source_dataset MNIST --dataset USPS \
      --arch LeNet --train_nets_type loaded --n_nets 2000 --sample_n_nets 4 \
      --test_nets_type loaded --test_n_nets 20 --world_size 4 --device_id 3
  ```
Distributed training works not only with multiple GPUs within a single node, but also across multiple nodes in a cluster.
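If starting each rank by hand is inconvenient, a small launcher script can set `RANK` and `--device_id` per process. The sketch below is a convenience wrapper, not part of the repository, and assumes it is run from the repository root.

```python
import os
import subprocess
import sys

WORLD_SIZE = 2  # one process per GPU

procs = []
for rank in range(WORLD_SIZE):
    # Each rank gets its own RANK variable and --device_id, everything else is identical.
    env = dict(os.environ, RANK=str(rank), INIT_FILE="/tmp/distill_init")
    cmd = [sys.executable, "main.py", "--mode", "train",
           "--dataset", "MNIST", "--arch", "LeNet", "--n_nets", "2000",
           "--epochs", "40", "--decay_epochs", "20", "--lr", "2e-4",
           "--world_size", str(WORLD_SIZE), "--device_id", str(rank)]
    procs.append(subprocess.Popen(cmd, env=env))

exit_codes = [p.wait() for p in procs]
sys.exit(max(exit_codes))
```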
Using `--phase test`, we can evaluate our trained distilled images and various baselines by specifying the following options:
- `test_distilled_images`: Source of the distilled images to be evaluated. This must be one of `"loaded"`, `"random_train"`, `"average_train"`, and `"kmeans_train"`, specifying whether to load the trained distilled images or to compute baseline images from the training set (see the sketch after this list). Default: `"loaded"`.
- `test_distilled_lrs`: Learning rates used to evaluate the distilled images. This must be one of `"loaded"`, `"fix [lr]"`, and `"nearest_neighbor [k] [p]"`. `"fix"` uses a constant learning rate for all steps. `"nearest_neighbor"` instead uses the distilled images for `k`-nearest-neighbor classification with the `p`-norm. Default: `"loaded"`.
- `test_n_runs`: Number of times to run the entire evaluation process (i.e., constructing distilled images and evaluating them). This is useful when using stochastic methods to construct distilled images, e.g., `"random_train"` and `"kmeans_train"`. Default: `1`.
- `test_n_nets`: Number of test networks used in each run. Default: `1`.
- `test_distill_epochs`: Number of epochs over which to apply the distilled images. If `None`, this is set equal to the `distill_epochs` used for training. Default: `None`.
- `test_optimize_n_runs`: For stochastic methods to construct distilled images, setting this to a non-`None` value optimizes the obtained distilled images by evaluating them on `test_niter` batches of training images and picking the best `test_n_runs` out of `test_optimize_n_runs` total sets of distilled images. Default: `None`.
- `test_optimize_n_nets`: Number of training networks used to optimize distilled images. Only meaningful if `test_optimize_n_runs` is not `None`. Default: `20`.
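To make the baseline names above concrete, here is one plausible way such baseline "distilled" images could be constructed from a training set. This is an illustrative re-implementation, not the repository's code, and it assumes scikit-learn for the k-means variant.

```python
import numpy as np
from sklearn.cluster import KMeans

def baseline_images(images, labels, per_class, method="random_train", seed=0):
    """Build `per_class` baseline images for every class.

    images: (N, ...) float array; labels: (N,) int array.
    method: "random_train" | "average_train" | "kmeans_train".
    """
    rng = np.random.default_rng(seed)
    out = []
    for c in np.unique(labels):
        cls = images[labels == c]
        if method == "random_train":        # random training images per class
            idx = rng.choice(len(cls), size=per_class, replace=False)
            out.append(cls[idx])
        elif method == "average_train":     # per-class mean image, repeated
            out.append(np.repeat(cls.mean(axis=0, keepdims=True), per_class, axis=0))
        elif method == "kmeans_train":      # per-class k-means centroids
            flat = cls.reshape(len(cls), -1)
            km = KMeans(n_clusters=per_class, n_init=10, random_state=seed).fit(flat)
            out.append(km.cluster_centers_.reshape((per_class,) + cls.shape[1:]))
        else:
            raise ValueError(f"unknown method: {method}")
    return np.concatenate(out, axis=0)
```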
For example,
- To evaluate the trained distilled images, applied over 10 epochs, for the `MNIST -> USPS` adaptation setting with unknown initialization on 200 networks:

  ```sh
  python main.py --mode distill_adapt --source_dataset MNIST --dataset USPS \
      --arch LeNet --train_nets_type loaded --n_nets 200 --sample_n_nets 4 \
      --phase test --test_nets_type loaded --test_n_nets 200 \
      --test_distilled_images loaded --test_distilled_lrs loaded \
      --test_distill_epochs 10
  ```
- To evaluate optimized random training images as distilled images, with a fixed 0.3 learning rate, for the basic `MNIST` distillation setting on 200 networks:

  ```sh
  python main.py --mode distill_basic --dataset MNIST --arch LeNet \
      --phase test --train_nets_type unknown_init --test_nets_type unknown_init \
      --test_distilled_images random_train --test_distilled_lrs fix 0.3 \
      --test_n_nets 200 --test_n_runs 20 \
      --test_optimize_n_runs 50 --test_optimize_n_nets 20
  ```
To obtain the pre-trained weights (`--mode train`), we recommend using the following settings (a convenience script for running all of them is sketched after the list):

- `--epochs 40 --decay_epochs 20 --lr 2e-4` for `MNIST` with `LeNet`,
- `--epochs 130 --decay_epochs 40 --lr 2e-4` for `USPS` with `LeNet`,
- `--epochs 65 --decay_epochs 20 --lr 2e-4` for `SVHN` with `LeNet`,
- `--epochs 50 --decay_epochs 7 --lr 1e-3` for `Cifar10` with `AlexCifarNet`.
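The sketch below sweeps all of the recommended settings above. It is a hypothetical helper, not part of the repository, and assumes it is run from the repository root.

```python
import subprocess
import sys

# (dataset, arch, epochs, decay_epochs, lr) from the recommendations above.
RECOMMENDED = [
    ("MNIST",   "LeNet",        "40",  "20", "2e-4"),
    ("USPS",    "LeNet",        "130", "40", "2e-4"),
    ("SVHN",    "LeNet",        "65",  "20", "2e-4"),
    ("Cifar10", "AlexCifarNet", "50",  "7",  "1e-3"),
]

for dataset, arch, epochs, decay_epochs, lr in RECOMMENDED:
    subprocess.run([sys.executable, "main.py", "--mode", "train",
                    "--dataset", dataset, "--arch", arch,
                    "--epochs", epochs, "--decay_epochs", decay_epochs,
                    "--lr", lr], check=True)
```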
Below we list some of the options you may want to tune:
- `distill_steps`: Number of gradient steps in applying distilled images. Each step is associated with a new batch of distilled images, so this also affects the total number of images. Default: `10`.
- `distill_epochs`: Number of passes cycling over the gradient steps. E.g., with `distill_steps=10` and `distill_epochs=3`, the images of the `10` steps are iterated over `3` times, leading to a total of `30` gradient steps (see the illustration after this list). This does not change the total number of distilled images. Default: `3`.
- `distilled_images_per_class_per_step`: Number of distilled images per class in each step. Default: `1`.
- `distill_lr`: Initial value of the trained learning rates for distillation. Default: `0.001`.
- `train_nets_type`: How the initial weights for training are obtained. This must be one of `"unknown_init"` (randomly initialize weights at every iteration), `"known_init"` (initialize weights once before training and keep them fixed throughout training), and `"loaded"` (weights loaded from disk). Default: `"unknown_init"`.
- `n_nets`: Number of networks available to train the distilled images in each iteration. E.g., with `train_nets_type="unknown_init"` and `n_nets=4`, each training iteration samples 4 new sets of initial weights. Default: `4`.
- `sample_n_nets`: Number of networks subsampled from the `n_nets` networks for training. This option is useful when training with pre-trained weights. E.g., with `train_nets_type="loaded"`, `n_nets=2000`, and `sample_n_nets=4`, in each iteration 4 out of the 2000 loaded networks are randomly selected for training. Default: same as `n_nets`.
- `test_nets_type`: How the initial weights for testing are obtained. This must be one of `"unknown_init"`, `"same_as_train"` (the same weights used for training are used in testing), and `"loaded"`. Default: `"unknown_init"`.
- `init`: Initialization method used to sample the initial weights. This must be one of `"xavier"`, `"xavier_unif"`, `"kaiming"`, `"kaiming_out"`, `"orthogonal"`, `"default"`, and `"imagenet_pretrained"`. `"default"` uses the default initialization method in PyTorch. `"imagenet_pretrained"` only works with `AlexNet` and loads a particular set of weights pre-trained on ImageNet. The others call the corresponding initialization methods. See the `init_weights` function in `networks/utils.py` for details. Default: `"xavier"`.
- `init_param`: Parameter passed to the chosen initialization method, e.g., the `gain` argument for `"xavier"`. See the `init_weights` function in `networks/utils.py` for details. Default: `1`.
- `device_id`: The device index used in this training process. If negative, the CPU is used. Default: `0`.
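To make the `distill_steps` / `distill_epochs` interaction concrete, here is a tiny illustration of the resulting schedule; the code is generic, not the repository's internals.

```python
# With distill_steps=10 and distill_epochs=3, the same 10 batches of distilled
# images are applied 3 times: 30 gradient steps, but only 10 distinct batches
# (so the number of distilled images does not change).
distill_steps, distill_epochs = 10, 3
schedule = [(epoch, step) for epoch in range(distill_epochs)
            for step in range(distill_steps)]
print(len(schedule))                        # 30 gradient steps in total
print(len({step for _, step in schedule}))  # 10 distinct distilled batches
```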
We include tests checking the correctness of our custom gradient computation. You can run them with `python test_train_distilled_image.py`, optionally appending `-v` to enable verbose mode, which prints the details of the numerical gradient tests.
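As background, numerical gradient tests of this kind typically compare analytic gradients against finite differences, e.g., via `torch.autograd.gradcheck`. The snippet below is a generic example of that mechanism, not the repository's test code.

```python
import torch

def loss_fn(images):
    # A toy differentiable function of the "distilled images".
    weight = torch.full((images[0].numel(), 1), 0.5, dtype=torch.double)
    return (images.flatten(1) @ weight).sigmoid().sum()

# gradcheck requires double-precision inputs with requires_grad=True.
images = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(loss_fn, (images,), eps=1e-6, atol=1e-4)
print("numerical gradient check passed")
```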