
Feature Pyramid Transformer

Implementation for the paper: Feature Pyramid Transformer.


  1. Overview
  2. Requirements
  3. Data Preparation
  4. Pretrained Model
  5. Model Training
  6. Inference
  7. Citation
  8. Question


Overview

Feature interactions across space and scales underpin modern visual recognition systems because they introduce beneficial visual contexts. Conventionally, spatial contexts are passively hidden in the CNN's increasing receptive fields or actively encoded by non-local convolution. Yet, the non-local spatial interactions are not across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing in different scales. To this end, we propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT). It transforms any feature pyramid into another feature pyramid of the same size but with richer contexts, by using three specially designed transformers in self-level, top-down, and bottom-up interaction fashion. FPT serves as a generic visual backbone with fair computational overhead. We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks, using various backbones and head networks, and observe consistent improvement over all the baselines and the state-of-the-art methods.

Overall structure of our proposed FPT. Different texture patterns indicate different feature transformers, and different colors represent feature maps with different scales. "Conv" denotes a 3 × 3 convolution with an output dimension of 256. Without loss of generality, the top/bottom layer feature maps have no rendering/grounding transformer.
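The self-level and cross-scale interactions can be illustrated with a toy NumPy sketch. This is purely illustrative and not the repo's implementation: it drops the paper's learned query/key/value projections and the rendering/grounding designs, and the helper names `self_transformer` and `cross_scale` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_transformer(feat):
    """Non-local (self-level) interaction over one pyramid level.

    feat: (C, H, W) feature map. Every spatial position attends to every
    other position of the same map (queries = keys = values here, a
    deliberate simplification of the paper's learned projections)."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T          # (HW, C): one token per position
    attn = softmax(x @ x.T / np.sqrt(c))  # (HW, HW) affinity between positions
    out = attn @ x                        # aggregate context for each position
    return out.T.reshape(c, h, w)

def cross_scale(query_feat, kv_feat):
    """Cross-scale interaction: positions of one level attend to another
    level (resized by nearest-neighbour indexing so shapes match)."""
    c, h, w = query_feat.shape
    _, h2, w2 = kv_feat.shape
    rows = np.arange(h) * h2 // h         # naive nearest-neighbour resize
    cols = np.arange(w) * w2 // w
    resized = kv_feat[:, rows][:, :, cols]
    q = query_feat.reshape(c, h * w).T
    kv = resized.reshape(c, h * w).T
    attn = softmax(q @ kv.T / np.sqrt(c))
    return (attn @ kv).T.reshape(c, h, w)
```

Both helpers preserve the query level's shape, which mirrors FPT's property of transforming a pyramid into one of the same size but with richer context.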


Requirements

  • Packages
    • pytorch=0.4.0
    • torchvision>=0.2.0
    • cython
    • matplotlib
    • numpy
    • scipy
    • opencv
    • pyyaml
    • packaging
    • dropblock
    • pycocotools
    • tensorboardX — for logging the losses in Tensorboard
  • 8 GPUs and CUDA 8.0 or higher. Some operations only have GPU implementations.
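Before launching training, it can save time to confirm the packages above are importable. This is a hypothetical helper, not part of the repo; note that some pip names differ from import names (opencv imports as `cv2`, pyyaml as `yaml`, cython as `Cython`).

```python
import importlib.util

# pip package -> importable module name, for the list above
REQUIRED = ["torch", "torchvision", "Cython", "matplotlib", "numpy",
            "scipy", "cv2", "yaml", "packaging", "dropblock",
            "pycocotools", "tensorboardX"]

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]
```

Running `missing_packages()` in your environment lists anything still to install.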

Data Preparation

Create a data folder under the repo root:

cd {repo_root}
mkdir data
  • COCO: Download COCO images and annotations from the official website.

    And make sure to put the files as the following structure:

    ├── annotations
    |   ├── instances_minival2014.json
    │   ├── instances_train2014.json
    │   ├── instances_train2017.json
    │   ├── instances_val2014.json
    │   ├── instances_val2017.json
    │   ├── instances_valminusminival2014.json
    │   ├── ...
    └── images
        ├── train2014
        ├── train2017
        ├── val2014
        ├── val2017
        ├── ...

    Feel free to put COCO at any place you want, and then soft link the dataset under the data/ folder:

    ln -s path/to/coco data/coco 

    We recommend putting the images on an SSD for better training performance.
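A quick sanity check of the layout above can catch path mistakes before training starts. This is a hypothetical helper (not shipped with the repo) that only verifies a few of the expected entries:

```python
from pathlib import Path

def check_coco_layout(root="data/coco"):
    """Return the expected COCO entries missing under `root`."""
    base = Path(root)
    expected = [
        "annotations/instances_train2017.json",
        "annotations/instances_val2017.json",
        "images/train2017",
        "images/val2017",
    ]
    return [p for p in expected if not (base / p).exists()]
```

An empty return value means the 2017 split is in place; extend the list for the 2014 files if you train on that split.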

Pretrained Model

ImageNet Pretrained Model from Caffe

Download them and put them into the {repo_root}/data/pretrained_model.

If you want to use PyTorch pre-trained models, please remember to convert images from BGR to RGB, and also use the same data preprocessing (ImageNet mean/std normalization) as the PyTorch pretrained models.
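The conversion can be sketched as follows, assuming images are loaded with OpenCV (which returns BGR, HWC, uint8) and that the torchvision ImageNet statistics apply; the function name is illustrative, not from the repo:

```python
import numpy as np

# Normalization statistics used by torchvision's ImageNet-pretrained models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess_for_pytorch(img_bgr):
    """img_bgr: (H, W, 3) uint8 image in BGR order (OpenCV's default).
    Returns a (3, H, W) float32 array normalized the way torchvision
    pretrained models expect."""
    img = img_bgr[..., ::-1].astype(np.float32) / 255.0  # BGR -> RGB, scale to [0, 1]
    img = (img - IMAGENET_MEAN) / IMAGENET_STD           # per-channel normalize
    return img.transpose(2, 0, 1).astype(np.float32)     # HWC -> CHW
```

Caffe-style pretrained weights instead expect BGR input with per-channel pixel means subtracted, which is why the two weight families are not interchangeable without matching preprocessing.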

ImageNet Pretrained Model from Detectron

NOTE: Caffe pretrained weights have slightly better performance than the PyTorch pretrained weights.

Model Training

Train from scratch

Take Mask R-CNN with a ResNet-50 backbone as an example.

python tools/ --dataset coco2017 --cfg configs/e2e_fptnet_R-50_mask.yaml --use_tfboard --bs {batch_size} --nw {num_workers}

Use --bs to override the default batch size with a value that fits your GPUs. Similarly for --nw: the number of data loader threads defaults to 4.

Specify --use_tfboard to log the losses on TensorBoard.

Finetune from a checkpoint

python tools/ ... --load_ckpt {path/to/the/checkpoint}

or using Detectron's checkpoint file

python tools/ ... --load_detectron {path/to/the/checkpoint}

Resume training with the same dataset and batch size

python tools/ ... --load_ckpt {path/to/the/checkpoint} --resume

When resuming training, the step count and optimizer state are also restored from the checkpoint. For the SGD optimizer, the optimizer state contains the momentum buffer for each trainable parameter.
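The resume behaviour can be sketched in a simplified, framework-agnostic way. The repo's real checkpoints are PyTorch state dicts saved with `torch.save`, but the idea is the same: model weights, optimizer state (e.g. SGD momentum buffers), and the global step travel together, so resuming reproduces the optimization trajectory:

```python
import pickle

def save_checkpoint(path, step, model_state, optimizer_state):
    """Bundle step counter, weights, and optimizer state into one file."""
    with open(path, "wb") as f:
        pickle.dump({"step": step, "model": model_state,
                     "optimizer": optimizer_state}, f)

def load_checkpoint(path):
    """Restore everything --resume needs: step, weights, optimizer state."""
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["model"], ckpt["optimizer"]
```

Detectron checkpoints lack the optimizer state in this format, which is consistent with --resume not being supported for --load_detectron.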

NOTE: --resume is not yet supported for --load_detectron

Set config options in command line

  python tools/ ... --no_save --set {config.name1} {value1} {config.name2} {value2} ...
  • For example, to run in debugging mode:
    python tools/ ... --no_save --set DEBUG True
    This loads fewer annotations to accelerate training. Add --no_save to avoid saving any checkpoint or logging.
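The --set mechanics can be sketched as below. This is a hedged illustration of how Detectron-style "KEY VALUE" pairs are typically merged into a nested config; the repo's actual implementation lives in its config utilities, and `merge_cfg_list` is a hypothetical name:

```python
import ast

def merge_cfg_list(cfg, opts):
    """Merge alternating KEY VALUE pairs (dotted keys) into nested dict cfg."""
    assert len(opts) % 2 == 0, "--set expects KEY VALUE pairs"
    for key, raw in zip(opts[0::2], opts[1::2]):
        node = cfg
        parts = key.split(".")
        for p in parts[:-1]:
            node = node.setdefault(p, {})   # descend, creating levels as needed
        try:
            value = ast.literal_eval(raw)   # "True" -> True, "8" -> 8
        except (ValueError, SyntaxError):
            value = raw                     # plain strings stay strings
        node[parts[-1]] = value
    return cfg
```

For instance, `--set DEBUG True TRAIN.BATCH_SIZE 8` would set a boolean at the top level and an integer under the TRAIN subtree.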

Show command line help messages

python --help


Inference

Evaluate the training results

For example, on coco2017 val set

python tools/ --dataset coco2017 --cfg configs/e2e_fptnet_R-50_mask.yaml --load_ckpt {path/to/your/checkpoint}

Results visualization

python tools/ --dataset coco --cfg configs/e2e_fptnet_R-50_mask.yaml --load_ckpt {path/to/your/checkpoint} --image_dir {dir/of/input/images}  --output_dir {dir/to/save/visualizations}

My nn.DataParallel

  • Keep certain keyword inputs on the CPU. The official DataParallel broadcasts all input Variables to the GPUs. However, many RPN-related computations are done on the CPU, so it is unnecessary to put those inputs on the GPUs.
  • Allow different blob sizes on different GPUs. To save GPU memory, images are padded separately for each GPU.
  • Work with return values of dictionary type.
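The first two behaviours can be sketched without any framework. This is a hedged, list-based illustration of the scatter step only, not the repo's `nn.DataParallel` subclass: keys listed in `cpu_keys` (the key name "roidb" is a hypothetical example) are replicated to every device untouched, while the rest are chunked per device (the real code scatters CUDA tensors, and each GPU's chunk may be padded to a different size).

```python
def scatter_inputs(batch, n_devices, cpu_keys=("roidb",)):
    """Split a batch dict across n_devices, keeping cpu_keys unsplit."""
    shards = [dict() for _ in range(n_devices)]
    for key, value in batch.items():
        if key in cpu_keys:
            for shard in shards:
                shard[key] = value        # stays on CPU, shared by every device
        else:
            size = (len(value) + n_devices - 1) // n_devices
            for i, shard in enumerate(shards):
                shard[key] = value[i * size:(i + 1) * size]  # per-device chunk
    return shards
```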


Citation

If our work is useful for your research, please consider citing:

  @inproceedings{fpt2020,
    author = {Dong Zhang and Hanwang Zhang and Jinhui Tang and Meng Wang and Xiansheng Hua and Qianru Sun},
    title = {Feature Pyramid Transformer},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year = {2020}
  }


Question

If you have any questions, please contact ''.
