Link to full thesis: filipskogh.com/thesis.pdf
Video object segmentation is a fundamental problem in computer vision with applications across many fields. Over the past few years, video object segmentation has witnessed rapid progress, catalyzed by increasingly large datasets. These datasets, which consist of pixel-accurate masks with object associations across frames, are especially labor-intensive and costly to annotate, prohibiting truly large-scale datasets. We
propose a video object segmentation model that can be trained exclusively with bounding boxes, a
much cheaper type of annotation. To achieve this, our method employs loss functions tailored to box
annotations that leverage self-supervision through color similarity and spatio-temporal coherence.
We validate our approach against traditional fully-supervised methods and various other settings on YouTube-VOS
and DAVIS, achieving over 90% relative performance on both benchmarks.
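To illustrate the idea of a color-similarity self-supervision term, here is a minimal NumPy sketch of a pairwise loss in the spirit of box-supervised segmentation methods such as BoxInst: if two neighbouring pixels have similar colors, the predicted mask probabilities are encouraged to agree. The function name, the neighbourhood (right/down pairs only), and the similarity threshold are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

def pairwise_color_loss(mask_probs, image, threshold=0.3):
    """Illustrative pairwise color-similarity loss (assumed form, not
    the thesis's exact loss).

    mask_probs: (H, W) predicted foreground probabilities in [0, 1]
    image:      (H, W, 3) pixel colors in [0, 1]
    """
    loss, count = 0.0, 0
    for dy, dx in [(0, 1), (1, 0)]:  # right and down neighbour pairs
        h, w = mask_probs.shape[0] - dy, mask_probs.shape[1] - dx
        a, b = mask_probs[:h, :w], mask_probs[dy:, dx:]
        ca, cb = image[:h, :w], image[dy:, dx:]
        # color similarity in (0, 1]: high when neighbours look alike
        sim = np.exp(-np.linalg.norm(ca - cb, axis=-1))
        edge = sim >= threshold  # only supervise similar-looking pairs
        # probability that both pixels receive the same label
        p_same = a * b + (1.0 - a) * (1.0 - b)
        loss += -np.log(np.clip(p_same[edge], 1e-6, None)).sum()
        count += int(edge.sum())
    return loss / max(count, 1)
```

A uniform mask over a uniform-color region incurs near-zero loss, while a mask that splits pixels of identical color is penalized, which is how color similarity can substitute for pixel-level mask labels inside a bounding box.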
@inproceedings{cheng2022xmem,
title={{XMem}: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model},
author={Cheng, Ho Kei and Schwing, Alexander G.},
booktitle={ECCV},
year={2022}
}