solver configuration that converges more reliably. thanks to @ducha-aiki
in #3 for the suggestion of linearly decreasing the learning rate through training. note that the provided model was trained with the old solver configuration. in our experiments, this new solver configuration leads to model accuracy that is greater than or equal to that of the old configuration.
forresti committed Mar 26, 2016
1 parent 69c0afe commit 0bc03d9
Showing 2 changed files with 5 additions and 5 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -20,7 +20,7 @@ Helpful hints:
1. **Getting the SqueezeNet model:** `git clone <this repo>`.
In this repository, we include Caffe-compatible files for the model architecture, the solver configuration, and the pretrained model (4.8MB uncompressed).

2. **Batch size.** For the SqueezeNet model in our paper, we used a batch size of 1024. If implemented naively on a single GPU, this may result in running out of memory. An effective workaround is to use hierarchical batching (sometimes called "delayed batching"). Caffe supports hierarchical batching by processing `train_val.prototxt>batch_size` training samples concurrently in memory. After `solver.prototxt>iter_size` iterations, the gradients are summed and the model is updated. Mathematically, the effective batch size is `batch_size * iter_size`. In the included prototxt files, we have set `(batch_size=32, iter_size=32)`, but any combination of batch_size and iter_size that multiplies to 1024 will produce equivalent results. In fact, with the same random number generator seed, the model will be fully reproducible if trained multiple times. Finally, note that in Caffe `iter_size` is applied while training on the training set but not while testing on the test set.
2. **Batch size.** We have experimented with batch sizes ranging from 32 to 1024. In this repo, our default batch size is 512. If implemented naively on a single GPU, a batch size this large may result in running out of memory. An effective workaround is to use hierarchical batching (sometimes called "delayed batching"). Caffe supports hierarchical batching by processing `train_val.prototxt>batch_size` training samples concurrently in memory. After `solver.prototxt>iter_size` iterations, the gradients are summed and the model is updated. Mathematically, the effective batch size is `batch_size * iter_size`. In the included prototxt files, we have set `(batch_size=32, iter_size=16)`, but any combination of batch_size and iter_size that multiplies to 512 will produce equivalent results (a sketch of where these two settings live appears after this list). In fact, with the same random number generator seed, the model will be fully reproducible if trained multiple times. Finally, note that in Caffe `iter_size` is applied while training on the training set but not while testing on the test set.

3. **Implementing Fire modules.** In the paper, we describe the `expand` portion of the Fire layer as a collection of 1x1 and 3x3 filters. Caffe does not natively support a convolution layer that has multiple filter sizes. To work around this, we implement `expand1x1` and `expand3x3` layers and concatenate the results together in the channel dimension.
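As a concrete illustration of hint 3, the expand stage of one Fire module can be written in Caffe prototxt roughly as sketched below. This is a sketch, not a copy of the shipped file: the layer names follow the `fire2/...` convention of the released prototxt, the filter counts are illustrative, and the ReLU layers are omitted for brevity.

```
layer {
  name: "fire2/expand1x1"
  type: "Convolution"
  bottom: "fire2/squeeze1x1"
  top: "fire2/expand1x1"
  convolution_param { num_output: 64  kernel_size: 1 }
}
layer {
  name: "fire2/expand3x3"
  type: "Convolution"
  bottom: "fire2/squeeze1x1"
  top: "fire2/expand3x3"
  convolution_param { num_output: 64  kernel_size: 3  pad: 1 }
}
layer {
  name: "fire2/concat"
  type: "Concat"   # joins its bottoms along the channel axis (axis 1) by default
  bottom: "fire2/expand1x1"
  bottom: "fire2/expand3x3"
  top: "fire2/concat"
}
```

Both expand branches read from the squeeze output, and `pad: 1` keeps the 3x3 branch's spatial dimensions equal to the 1x1 branch's, so the channel-wise concatenation is valid.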

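Regarding hint 2, the two halves of the hierarchical-batching setting live in different files: `batch_size` is a field of the data layer in `train_val.prototxt`, while `iter_size` sits in `solver.prototxt`. A minimal sketch, assuming an LMDB training source (the path below is a placeholder, not the repository's actual path):

```
# train_val.prototxt -- 32 images are held in memory per forward/backward pass
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "path/to/ilsvrc12_train_lmdb"   # placeholder path
    batch_size: 32
    backend: LMDB
  }
}

# solver.prototxt -- gradients from 16 such passes are accumulated before each update
iter_size: 16   # effective batch size = 32 * 16 = 512
```
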
8 changes: 4 additions & 4 deletions SqueezeNet_v1.0/solver.prototxt
@@ -8,12 +8,12 @@

test_iter: 2000 #not subject to iter_size
test_interval: 1000
base_lr: 0.08
base_lr: 0.04
display: 40
max_iter: 85000
iter_size: 32 #global batch size = batch_size * iter_size
max_iter: 170000
iter_size: 16 #global batch size = batch_size * iter_size
lr_policy: "poly"
power: 0.5
power: 1.0 #linearly decrease LR
momentum: 0.9
weight_decay: 0.0002
snapshot: 1000

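For reference on the solver change above: with `lr_policy: "poly"`, Caffe computes the learning rate as `base_lr * (1 - iter / max_iter) ^ power`, so `power: 1.0` ramps the rate linearly from `base_lr` down to 0 at `max_iter`. An annotated sketch of the new settings (the comments are editorial, not part of the shipped file):

```
base_lr: 0.04      # starting learning rate (halved relative to the old configuration)
max_iter: 170000   # doubled; the learning rate reaches 0 here
lr_policy: "poly"  # lr = base_lr * (1 - iter / max_iter) ^ power
power: 1.0         # a power of 1.0 makes the decay linear
```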