My replication code for the AlexNet paper.
tldr: go to Results
The data used is ImageNet 2012. I downloaded it from Kaggle. The paper also experiments with ImageNet 2010, but I couldn't find that dataset. The authors also experiment with pretraining on ImageNet Fall 2011, which isn't available anymore. The closest would be the current ImageNet-21K, but I don't have enough compute for that.
Download and extract ImageNet
```sh
mkdir -p data/imagenet && cd data/imagenet
kaggle competitions download -c imagenet-object-localization-challenge
unzip imagenet-object-localization-challenge.zip
```
Instructions on how to configure the Kaggle CLI are here.
Also, if you want to train on tiny-imagenet, download the data as follows:

```sh
mkdir -p data && cd data
wget https://image-net.org/data/tiny-imagenet-200.zip
unzip tiny-imagenet-200.zip
cd ..
```
I coded a couple more tasks, but their datasets are downloaded automatically.
Install the dependencies found in requirements.txt. I used Python 3.9 and PyTorch 1.12. You should modify the CUDA version according to your hardware.
These are the command-line options:

```
$ python -m alexnet --help
Usage: python -m alexnet [OPTIONS]

Options:
  --task [mnist|fashion-mnist|cifar10|cifar100|tiny-imagenet|imagenet]
                         [required]
  --batch-size INTEGER   [default: 128]
  --dropout FLOAT        [default: 0.5]
  --learn-rate FLOAT     [default: 0.0001]
  --seed INTEGER         [default: 12331]
  --extra-logging        Whether to log histograms of parameters and
                         grads.
  --fast-dev-run         Run only a couple of steps, to check if
                         everything is working properly.
  --help                 Show this message and exit.
```
The available tasks can be seen above. Default hparams were chosen according to the paper and my own experimentation. To run a setup pretty close to the one in the paper, simply run:
```sh
python -m alexnet --task imagenet
```
See the training curves in the tensorboard.dev experiment.
All experiments were done using an RTX 3090. ImageNet training took ~3 days to reach 100 epochs (hardware has come a long way since 2012; compare with the five to six days it took on two GTX 580s).
Below are the results on different tasks.
|          | imagenet (my experiment) | imagenet (paper) | mnist | fashion-mnist | cifar10 | cifar100 | tiny-imagenet |
|----------|--------------------------|------------------|-------|---------------|---------|----------|---------------|
| error@1  | 0.48                     | 0.40             | 0.01  | 0.09          | 0.13    | 0.41     | 0.57          |
| error@5  | 0.25                     | 0.18             | 0.00  | 0.00          | 0.01    | 0.16     | 0.32          |
The results aren't that close, but also not that far off. As we can see in the training curves, we might have been able to obtain better results with more training time, since the val error still seemed to be going down when training stopped. For this scope, however, it's good enough. Below are the notes I took from the paper while implementing, along with where my setup deviates.
- for rectangular images, first rescale so that the shorter side is 256 px, then crop the longer side to 256 px, so images are always square (see the preprocessing sketch after this list)
- subtract the "mean activity" from each pixel; apparently this is the mean activity per channel
- ReLU non-linearity everywhere
- training on multiple GPUs (see section 3.2 and fig. 2). I will not do this; instead one can leverage the `groups` option in `nn.Conv2d` to emulate the behaviour (see the grouped-conv sketch after this list)
- local response norm: the PyTorch implementation divides alpha by n, so in order to replicate the paper, alpha should be multiplied by n
- net description taken from section 3.5 and figure 2
- augmentation: at train time, extract random 224 x 224 patches and their horizontal reflections
- augmentation: at test time, extract ten 224 x 224 patches (corners + center, plus horizontal reflections); the result is the prediction averaged over the 10 patches (both crop schemes appear in the preprocessing sketch after this list)
- augmentation: PCA color augmentation, see paper section 4.1 (a sketch follows this list)
- dropout
- optim SGD, see the exact update rule in paper section 5 (not used, see notes below; a sketch of the rule also follows this list)
- batch size 128
- dropout of 0.5
- momentum 0.9 (see notes about optimization below)
- weight decay 0.0005 (see notes about optimization below)
- weight init is Gaussian with mean 0, std 0.01. Biases in the 2nd, 4th and 5th conv layers and in the fc layers are initialized to constant 1; biases in the other layers are initialized to 0 (see the init sketch after this list)
- same lr for all layers, starting at 0.01 (close, but see notes)
- decay learn rate by factor of 10 when val error stops improving. This results in 3 reductions during training.
- 90 epochs, which took 5/6 days.
- Optimizer changed: another user also saw poor performance using `optim.SGD`. I even implemented the actual optimizer step described in the paper, since it's a little different from PyTorch's algorithm, but I saw no improvement. I kept the Adam optimizer with a learn rate of 1e-4, also lowered on plateau (see the optimizer sketches after this list).
- Apparently convergence is super sensitive to parameter initialization. Out of three seed values I tried, only one made the net learn in the imagenet experiment (using the exact same hparams elsewhere). The current default seed is the one that worked.