[AAAI20] Deep Object Co-segmentation via Spatial-Semantic Network Modulation (Oral paper)
- PDF: arXiv
Object co-segmentation aims to segment the objects shared across multiple relevant images, and it has numerous applications in computer vision. This paper presents a spatial and semantic modulated deep network framework for object co-segmentation. A backbone network is adopted to extract multi-resolution image features. With the multi-resolution features of the relevant images as input, we design a spatial modulator to learn a mask for each image. The spatial modulator captures the correlations of image feature descriptors via unsupervised learning. The learned mask can roughly localize the shared foreground object while suppressing the background. We model the semantic modulator as a supervised image classification task, and propose a hierarchical second-order pooling module to transform the image features for classification. The outputs of the two modulators manipulate the multi-resolution features by a shift-and-scale operation so that the features focus on segmenting co-object regions. The proposed model is trained end-to-end without any intricate post-processing. Extensive experiments on four image co-segmentation benchmark datasets demonstrate the superior accuracy of the proposed method compared to state-of-the-art methods.
Overview of our method
We propose a spatial-semantic modulated deep network for object co-segmentation. Image features extracted by a backbone network are used to learn a spatial modulator and a semantic modulator. The outputs of the modulators guide the up-sampling of the image features to generate the co-segmentation results. Learning the network parameters is formulated as a multi-task learning problem, and the whole network is trained end-to-end.
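As a rough illustration of how two modulators could steer a feature map, the sketch below applies a shift-and-scale operation in NumPy. It assumes that the semantic modulator supplies one scale per channel and that the spatial modulator supplies a per-pixel mask added as a shift. This assignment, the function name `modulate`, and the tensor shapes are assumptions made for illustration and are not taken from the released code.

```python
import numpy as np


def modulate(features, spatial_mask, semantic_vector):
    """Shift-and-scale modulation of one feature map (illustrative only).

    features:        (C, H, W) feature map from the backbone
    spatial_mask:    (H, W) mask, assumed here to act as a per-pixel shift
    semantic_vector: (C,) vector, assumed here to act as a per-channel scale
    """
    scale = semantic_vector[:, None, None]  # broadcast over H and W
    shift = spatial_mask[None, :, :]        # broadcast over channels
    return scale * features + shift


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 3, 3))
mask = np.zeros((3, 3))
mask[1, 1] = 1.0                             # pretend the object sits in the centre
gamma = np.array([1.0, 0.5, 0.0, 2.0])       # hypothetical channel importances
out = modulate(features, mask, gamma)
print(out.shape)
```

Under these assumptions, a channel whose scale is zero keeps only the spatial shift, while the mask raises responses at likely foreground positions in every channel.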
For the spatial modulation branch, an unsupervised learning method is proposed to learn a mask for each image. With the fused multi-resolution image features as input, we formulate the mask learning as an integer programming problem whose continuous relaxation has a closed-form solution. The learned parameter indicates whether each image pixel belongs to the foreground or the background.
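To make the idea of a relaxed integer program with a closed-form solution concrete, here is a minimal sketch of one common relaxation of this kind. It assumes the objective is to pick the locations whose descriptors correlate most with each other across the image group, i.e. maximize `s^T G s` over binary `s`, and relaxes `s` to a unit-norm real vector so that the solution is the leading eigenvector of the correlation matrix `G`. The exact objective is not stated in this README, so the formulation, the function name `spectral_masks`, and the min-max rescaling are assumptions.

```python
import numpy as np


def spectral_masks(feature_maps):
    """Soft co-object masks from a group of (C, H, W) feature maps.

    Relaxation assumed here: maximise s^T G s with ||s|| = 1, whose
    closed-form maximiser is the leading eigenvector of G.
    """
    # One L2-normalised descriptor per spatial location of every image.
    descs = [f.reshape(f.shape[0], -1).T for f in feature_maps]  # (H*W, C) each
    x = np.concatenate(descs, axis=0)
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    g = x @ x.T                          # (N, N) descriptor correlations
    _, eigvecs = np.linalg.eigh(g)       # eigenvalues in ascending order
    s = eigvecs[:, -1]                   # leading eigenvector
    if s.sum() < 0:                      # resolve the sign ambiguity
        s = -s
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)  # rescale to [0, 1]

    # Split the long vector back into one mask per image.
    masks, start = [], 0
    for f in feature_maps:
        _, h, w = f.shape
        masks.append(s[start:start + h * w].reshape(h, w))
        start += h * w
    return masks


# Toy group: channel 0 marks the shared object, channels 1 and 2 mark
# two different backgrounds.
a = np.zeros((3, 2, 2)); a[0, 0, :] = 1; a[1, 1, :] = 1
b = np.zeros((3, 2, 2)); b[0, :, 0] = 1; b[2, :, 1] = 1
masks = spectral_masks([a, b])
print(masks[0] > 0.5)
print(masks[1] > 0.5)
```

In this toy group the shared descriptor appears at four locations and each background at only two, so the leading eigenvector concentrates on the shared object. Thresholding the soft mask then gives a per-pixel foreground/background decision.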
In the semantic modulation branch, we design a hierarchical second-order pooling (HSP) operator to transform the convolutional features for object classification. Second-order pooling (SP) has been shown to capture high-order statistical dependencies among features. The proposed HSP module stacks two SP layers, which are dedicated to capturing the long-range channel-wise dependency of the holistic feature representation. The output of the HSP layer is fed into a fully-connected layer for object classification and is also used as the semantic modulator.
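The following sketch shows what second-order pooling computes and how two such layers might be stacked. It takes second-order pooling to mean the channel covariance of a feature map, which is the usual definition. The stacking scheme in `hsp`, the projection matrices `w1` and `w2`, and the final row-mean reduction are hypothetical choices made for illustration; the actual HSP layer may differ.

```python
import numpy as np


def second_order_pool(x):
    """Channel covariance of x, where x is (C, N): C channels at N positions."""
    centered = x - x.mean(axis=1, keepdims=True)
    return centered @ centered.T / x.shape[1]


def hsp(features, w1, w2):
    """Two stacked second-order pooling layers (hypothetical arrangement).

    features: (C, H, W) feature map
    w1:       (C1, C) channel reduction, standing in for a 1x1 convolution
    w2:       (C2, C1) projection applied between the two pooling layers
    """
    x = w1 @ features.reshape(features.shape[0], -1)  # (C1, H*W)
    cov1 = second_order_pool(x)                       # (C1, C1)
    y = w2 @ cov1                                     # (C2, C1)
    cov2 = second_order_pool(y)                       # (C2, C2)
    return cov2.mean(axis=1)                          # (C2,) channel descriptor


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 25))
cov = second_order_pool(x)

feats = rng.standard_normal((8, 5, 5))
w1 = rng.standard_normal((4, 8))
w2 = rng.standard_normal((3, 4))
descriptor = hsp(feats, w1, w2)
print(cov.shape, descriptor.shape)
```

A vector such as `descriptor` is the kind of channel-wise summary that could be passed to a fully-connected classifier and reused to modulate the channels.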
For a fair comparison with recent deep learning methods, we conduct extensive evaluations on four widely used benchmark datasets: a subset of MSRC, Internet, a subset of iCoseg, and PASCAL-VOC. Among them:
- The MSRC subset includes 7 classes (bird, car, cat, cow, dog, plane, sheep), and each class contains 10 images.
- The Internet dataset has 3 categories: airplane, car, and horse. Each category has 100 images, some of which have noisy labels.
- The iCoseg subset contains 8 categories, each with a different number of images.
- PASCAL-VOC is the most challenging dataset, with 1037 images of 20 categories selected from the PASCAL-VOC 2010 dataset.
- Create github repo (2019.11.18)
- Release arXiv pdf (2019.12.2)
- All results (soon)
- Test and Train code (soon)