This is the PyTorch implementation for:
SaccadeCam: Adaptive Visual Attention for Monocular Depth Sensing
If you find our work useful in your research, please consider citing our paper:
```
@InProceedings{Tilmon_2021_ICCV,
    author    = {Tilmon, Brevin and Koppal, Sanjeev J.},
    title     = {SaccadeCam: Adaptive Visual Attention for Monocular Depth Sensing},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {6009-6018}
}
```
Download the entire raw KITTI dataset by running:
```shell
wget -i splits/kitti_archives_to_download.txt -P kitti_data/
```
Then unzip with:
```shell
cd kitti_data
unzip "*.zip"
cd ..
```
Following monodepth2, our default settings expect that you have converted the PNG images to JPEG with the following command, which also deletes the raw KITTI `.png` files:
```shell
find kitti_data/ -name '*.png' | parallel 'convert -quality 92 -sampling-factor 2x2,1x1,1x1 {.}.png {.}.jpg && rm {}'
```
You can skip this step and include the `--png` flag instead, but the results will be slightly different and loading times will be slower.
This code was developed on Ubuntu 16.04 with an NVIDIA GTX 1080 Ti, using the following dependencies:
- opencv-python==4.5.1.48
- protobuf==3.14.0
- tensorboardx==1.4
- torch==1.7.1
- torchvision==0.8.2
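If helpful, these pinned versions can be installed with pip in one step (a suggested command, not from the original setup instructions; older torch wheels may require a matching CUDA toolkit):
```shell
pip install opencv-python==4.5.1.48 protobuf==3.14.0 tensorboardx==1.4 torch==1.7.1 torchvision==0.8.2
```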
The models and tensorboard events are saved to `~/tmp/<model_name>` by default. This can be changed with the `--log_dir` flag.
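To monitor training, you can point TensorBoard at the same directory (standard TensorBoard usage; adjust the path if you changed `--log_dir`):
```shell
tensorboard --logdir ~/tmp/<model_name>
```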
`--defocused_scale` was set to 0.25, 0.2, 0.15, and 0.0125. `--wac_scale` was always 0.75 in our results.
See `camera.py` for how the following bandwidth flags affect image resolution. Conceptually, `--defocused_scale` is the percentage of full-resolution pixels used to form the target resolution, and `--wac_scale` is the percentage of target-resolution pixels used to form the periphery of a SaccadeCam-rendered image, before full-resolution fovea are added.
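As a rough illustration, here is a minimal sketch of how these scales could map to image sizes, assuming each scale applies to total pixel count (so each spatial dimension scales by its square root). The function and variable names are hypothetical; `camera.py` remains the authoritative implementation:
```python
def bandwidth_resolutions(full_h, full_w, defocused_scale, wac_scale):
    """Hypothetical helper: derive target and periphery (WAC) resolutions.

    Assumes each scale is a fraction of total pixels, so height and width
    are each scaled by the square root of the fraction. See camera.py for
    the actual mapping used in this repo.
    """
    s = defocused_scale ** 0.5
    target_h, target_w = round(full_h * s), round(full_w * s)
    # The periphery keeps wac_scale of the *target* resolution's pixels.
    t = wac_scale ** 0.5
    wac_h, wac_w = round(target_h * t), round(target_w * t)
    return (target_h, target_w), (wac_h, wac_w)

# Example with a KITTI-sized frame and the paper's default scales:
print(bandwidth_resolutions(375, 1242, defocused_scale=0.25, wac_scale=0.75))
# -> ((188, 621), (163, 538)) under the square-root assumption
```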
See `experiments/train.sh` to train various models that should closely reproduce the results from our paper. An example from `train.sh` for training our network is as follows:
```shell
CUDA_VISIBLE_DEVICES=0 python ../train_main.py \
  --model_name attention/RR_depth_ \
  --use_stereo --frame_ids 0 --split eigen_full --fovea 0 --num_epochs 20 \
  --depth_lr 1e-4 --oracleC \
  --defocused_scale 0.25 --wac_scale 0.75 \
  --disable_automasking --weight_regions --fovea_weight 0.15
```
`--oracleC` means we train the depth network rather than the attention network. See `experiments/train.sh` for more training configurations.
`--fovea` is 0 to load WAC resolutions, 1 to load target resolutions, and 2 to load full resolution.
`--weight_regions` weights the fovea regions in the loss to encourage the network to focus on them, and `--fovea_weight` determines how much extra weight the fovea regions receive. See `trainer_main.py` for more.
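For intuition only, here is a minimal sketch of one way such region weighting could look, assuming a binary fovea mask and a per-pixel loss map; the actual scheme lives in `trainer_main.py` and may differ:
```python
import torch

def fovea_weighted_loss(loss_map, fovea_mask, fovea_weight=0.15):
    """Hypothetical example of fovea-region loss weighting.

    loss_map:   per-pixel loss, shape (B, 1, H, W)
    fovea_mask: 1.0 inside fovea regions, 0.0 in the periphery
    This up-weights fovea pixels by (1 + fovea_weight); the real
    weighting used by --weight_regions is defined in trainer_main.py.
    """
    weights = 1.0 + fovea_weight * fovea_mask
    return (loss_map * weights).mean()
```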
`test.sh` reproduces the tables in our paper. You will need to change where the models are loaded from to test your own results. An example from `test.sh` for testing our network is as follows:
```shell
python ../test.py \
  --eval_stereo --fovea 0 --defocused_scale 0.25 --wac_scale 0.75 --frame_ids 0 --gpus 0 \
  --deformable --weight_regions --epoch_to_load 6
```
As a guide, the table below lists which model epochs were used for our results. We can consistently reproduce our results with these epochs.
| defocused/wac scales | no weighting epochs | weighting epochs |
|---|---|---|
| 0.25/0.75 | 17 | 7 |
| 0.2/0.75 | 13 | 16 |
| 0.15/0.75 | 11 | 14 |
| 0.0125/0.75 | 2 | 1 |
The network encoder and decoder architectures and reprojection layers were adapted from monodepth2. We reimplemented `layers.py`, `resnet_encoder.py`, and `depth_decoder.py` based on their paper to avoid infringing on their license. Our contributions aim to improve existing self-supervised depth techniques, such as monodepth2, with adaptive resolution placement.