asmekal/iccv2019-notes

iccv2019-notes

Disclaimer: this is a personal perspective. I may have missed a lot of things and cannot promise that there are no mistakes.

The marks for papers [?/10] are again totally subjective and based on their usefulness to me. Roughly, >=9 means must read; >=6 means must read if you are working in the specified area. If a paper title is crossed out, I decided not to go into it (even though I was interested during the main conference; the reason may be that the area is very far from my current interests). If a paper is listed without comments, I will probably add them later.

Intro

Over 7,500 participants, 4 days of main conference + 60 workshops and 12 tutorials. 1075 accepted papers (10% orals) at the main conference alone.

All of the papers can be found at CVF open access. The official videos from the oral sessions will be available on the CVF YouTube channel.

It was absolutely infeasible to track everything, so I almost completely skipped the following topics:

  • Autonomous driving
  • 3D, Point clouds
  • Video analysis
  • Computer Vision in medical images
  • Captioning, Visual Grounding, Visual Question Answering

I spent a little time on:

  • Domain adaptation, zero-shot, few-shot, unsupervised, self-supervised, semi-supervised. TL;DR the motivation is to learn as fast as humans, from a few examples and with little or no data. In practice it still works poorly and cannot compete with supervised methods. We cannot do this well enough yet, but when we can it will be a giant step forward.
  • Knowledge distillation, federated learning. TL;DR many papers have controversial results: sometimes it works, sometimes it doesn't, sometimes it is very useful, sometimes useless. You can try it, but do not expect much.
  • Deepfakes in images and videos. TL;DR you cannot completely trust any digital image/video anymore. There is a lot of activity in the area and several datasets already exist. The problem is that when you know the "deepfake attack" method and have trained on data produced with this method you can get ~70-95% accuracy (which is itself not much), but when you don't know the method your deepfake detector may be close to random (50%).

I took a closer look at:

  • Semantic and instance segmentation, object detection
  • New architectures, modules, losses, augmentations, optimization methods
  • Neural architecture search
  • Interpretability
  • Text detection and recognition
  • Network compression
  • GANs, style transfer

TL;DR - detected trends and advancements

  • Avoid ImageNet pretraining for transfer learning. Self-supervised techniques seem to work better. It seems (and is shown in several papers) that it is probably much better to train from scratch on your data: initialization from pretrained weights may speed up convergence but does not guarantee that the final metrics will be better. Instead of training from scratch you can also try to combine your training with self-supervised methods.

  • Efficient layers instead of convolutions. Several layers were proposed as drop-in replacements for vanilla convolutions. The most notable example is probably OctConv. The problem with such new layers is that vanilla convolution has highly optimized implementations, which is not true for these newly proposed layers even if they require less computation in theory.

  • Efficient loss functions. Several new loss functions were proposed as replacements for CE, and it seems that they should be the default choice now: they are easy to implement, outperform Center Loss or OHEM strategies, sometimes provide clearer class separation in the embedding space, and even work well for imbalanced classification. See the Losses section for details.

  • Going from anchor-based object detection to dense predictions. Several papers propose this move; see the Instance Segmentation and Object Detection sections below for more details.

  • Generative models from single image. This includes better deepfakes, neural talking heads and GANs on single image (SinGAN and InGAN).

  • Revival of auxiliary intermediate classifiers. I especially liked this paper where the authors apply distillation between the final classifier and the intermediate classifiers, which improved results a lot.

  • Attempts at fashion generation and try-on. There are a lot of works in this field but they do not really work for now; however, it is just a matter of time.

  • Preregistration Workshop. The current scheme for ML conferences is not how science normally works. Normally you have a hypothesis and then run experiments to prove or disprove it; in ML papers the pipeline is the opposite: having the results, you hypothesize to explain them, which is known as HARKing (Hypothesizing After the Results are Known). The consequences are a strong positive bias, SOTA-hacking (a paper is much more likely to be accepted if it claims to beat SOTA) and, more importantly, poor generalization of the results (say, you ran experiments on 10 classification datasets and achieved SOTA on 3 of them; you publish a paper with these 3 and completely omit the other 7, which is absolutely terrible for science). The idea of preregistration is to fight this problem by separating hypothesis generation from hypothesis validation. It looks like a very promising direction despite having lots of problems. This is probably the most important idea I heard at the entire conference. Preregistration workshop link. Here is the link to the video which clearly explains the problem (I will post it when it is available).
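To make the "10 datasets, report 3" example from the preregistration item concrete, here is a minimal simulation sketch. It assumes a hypothetical method with zero true gain, so the measured gain on each dataset is pure Gaussian noise. The function name `simulate_paper`, the noise level and the number of simulated papers are illustrative assumptions, not anything taken from the workshop.

```python
import random
import statistics


def simulate_paper(rng, n_datasets=10, n_reported=3, noise=1.0):
    """Return (mean gain over all datasets, mean gain over the reported ones).

    Assumption: the method has no real effect, so every measured gain is noise.
    """
    gains = [rng.gauss(0.0, noise) for _ in range(n_datasets)]
    # Cherry-picking: keep only the datasets where the method looks best.
    reported = sorted(gains, reverse=True)[:n_reported]
    return statistics.mean(gains), statistics.mean(reported)


rng = random.Random(0)
papers = [simulate_paper(rng) for _ in range(2000)]

honest = statistics.mean(all_mean for all_mean, _ in papers)
cherry_picked = statistics.mean(reported_mean for _, reported_mean in papers)

print(f"mean gain over all 10 datasets:       {honest:+.3f}")
print(f"mean gain over the 3 reported ones:   {cherry_picked:+.3f}")
```

Even though the simulated method does nothing, the reported subset shows a clearly positive average gain, which is the positive bias described above.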

Papers by topic

Augmentation

Modules

Conv-replacements

  • [10/10][OctConv is both faster and more accurate; drop-in replacement for vanilla conv] Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution. The catch is that the final architecture has to be optimized (in terms of framework matrix operations), otherwise it will actually be slower. Gives up to 30% speed-up with better accuracy. The idea of the paper is to explicitly decompose features into high-frequency (H, W, C_h) and low-frequency (H // octave, W // octave, C_l) parts, process them separately, and then exchange information between them.

  • [9/10][Suspicious layer which surprisingly improves both accuracy and speed and is a drop-in replacement for vanilla convolution] Dynamic Multi-scale Filters for Semantic Segmentation. Replaces vanilla conv with the following 2-branch structure: the first branch computes a KxK kernel via adaptive_pool(KxK) -> conv1x1; the second branch applies a 1x1 conv to the features; then the 2 branches are merged via a depthwise conv whose kernel is the one computed by the first branch, followed by an additional 1x1 conv. The ablation study shows that it can give +7% mIoU compared to vanilla conv. Now why is it suspicious? The top-left element of the computed kernel is essentially taken from the top-left part of the image, and likewise the bottom-right kernel element comes from the bottom-right part of the image. These very different elements are then applied to very similar local features. My intuition fails to explain why this makes sense; maybe we have to add an additional global pooling for the kernel and first convolve the kernel with it.

  • [6/10][Essentially local self-attention with inefficient (not optimized) computation] Local Relation Networks for Image Recognition
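The high/low-frequency split and information exchange described for OctConv can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: it assumes an octave of 2, uses 1x1 (pointwise) convolutions in place of general KxK kernels, and the helper names (`conv1x1`, `avg_pool2`, `upsample2`, `octave_conv`) are made up for this example.

```python
import numpy as np


def conv1x1(x, w):
    """Pointwise convolution: x is (H, W, C_in), w is (C_in, C_out)."""
    return x @ w


def avg_pool2(x):
    """2x2 average pooling on an (H, W, C) map with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def upsample2(x):
    """Nearest-neighbour 2x upsampling on an (H, W, C) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def octave_conv(x_h, x_l, w_hh, w_hl, w_lh, w_ll):
    """One octave layer: two intra-frequency paths plus two exchange paths."""
    # High-frequency output: high->high plus upsampled low->high.
    y_h = conv1x1(x_h, w_hh) + upsample2(conv1x1(x_l, w_lh))
    # Low-frequency output: low->low plus pooled high->low.
    y_l = conv1x1(x_l, w_ll) + conv1x1(avg_pool2(x_h), w_hl)
    return y_h, y_l


rng = np.random.default_rng(0)
x_h = rng.normal(size=(8, 8, 6))   # high-frequency features (H, W, C_h)
x_l = rng.normal(size=(4, 4, 2))   # low-frequency features (H // 2, W // 2, C_l)
w_hh, w_hl = rng.normal(size=(6, 5)), rng.normal(size=(6, 3))
w_lh, w_ll = rng.normal(size=(2, 5)), rng.normal(size=(2, 3))

y_h, y_l = octave_conv(x_h, x_l, w_hh, w_hl, w_lh, w_ll)
print(y_h.shape, y_l.shape)
```

The low-frequency branch works on a quarter of the spatial positions, which is where the theoretical savings come from; whether that turns into a real speed-up depends on how well the framework executes the four paths.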

Pooling variations

Multi-scale Feature aggregation

Attention

Semantic segmentation

  • [10/10][Non-uniform downsampling of high-resolution images] Efficient segmentation: learning downsampling near semantic boundaries. A network creates a non-uniform downsampling grid that devotes more space to semantic boundaries. The results are reasonably better than with uniform downsampling. Three steps: 1) non-uniform downsampling (the image is downsampled to a very small resolution, for example 32x32 or 64x64; the downsampling network is trained at this resolution, with ground truth derived from a reasonable optimization problem on the ground-truth segmentation map), which is a very fast stage; 2) the main segmentation network runs on the non-uniformly downsampled image; 3) the result is upsampled (which can be done because we know the downsampling strategy).
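The three steps can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the sampling columns are hand-picked instead of being predicted by a network, the "segmentation network" is replaced by an oracle that reads the ground truth at the sampled positions, and upsampling is plain nearest-sample lookup. It only shows why spending samples near a boundary helps.

```python
import numpy as np


def downsample(image, rows, cols):
    """Step 1: keep only the pixels at the given sampling rows and columns."""
    return image[np.ix_(rows, cols)]


def upsample(small, rows, cols, height, width):
    """Step 3: give every full-resolution pixel the value of its nearest sample."""
    nearest_r = np.abs(np.arange(height)[:, None] - rows[None, :]).argmin(axis=1)
    nearest_c = np.abs(np.arange(width)[:, None] - cols[None, :]).argmin(axis=1)
    return small[np.ix_(nearest_r, nearest_c)]


height, width = 8, 64
gt = np.zeros((height, width), dtype=int)
gt[:, 37:] = 1                       # one vertical semantic boundary at column 37

rows = np.arange(height)             # rows are kept as-is to keep the toy small
uniform_cols = np.arange(4, width, 8)                       # 8 evenly spaced columns
boundary_cols = np.array([8, 24, 34, 36, 37, 38, 40, 56])   # 8 columns, dense near 37


def pixel_errors(cols):
    small = downsample(gt, rows, cols)   # step 2 stand-in: oracle labels on the small grid
    restored = upsample(small, rows, cols, height, width)
    return int((restored != gt).sum())


errors_uniform = pixel_errors(uniform_cols)
errors_boundary = pixel_errors(boundary_cols)
print("uniform grid errors:", errors_uniform)
print("boundary-aware grid errors:", errors_boundary)
```

With the same budget of 8 columns, the grid that concentrates samples around the boundary restores it exactly in this toy case, while the uniform grid misplaces it by a few pixels in every row.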

Adversarial training

  • [9/10][Instead of using a global discriminator between the ground truth and the predicted segmentation map they use a "gambler" which maps (image, predicted segmap) to CE weights so as to maximize sum(weights * CELoss)] I Bet You Are Wrong: Gambling Adversarial Networks for Structured Semantic Segmentation. Seems to improve performance a lot compared to previous adversarial training approaches. An additional benefit is that the gambler does not see the GT, so it is less sensitive to errors in the GT.
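A minimal NumPy sketch of the sum(weights * CELoss) objective follows. The gambler network itself is not modelled; its raw outputs are just an array here. The softmax normalisation of the weights (so that the gambler has a fixed betting budget and cannot simply inflate every weight) is an assumption of this sketch, as are all function names.

```python
import numpy as np


def pixel_cross_entropy(probs, labels):
    """Per-pixel CE. probs: (H, W, C) class probabilities, labels: (H, W) int ids."""
    h, w = labels.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(picked)


def betting_map(gambler_logits):
    """Turn raw gambler outputs (H, W) into positive weights that sum to 1."""
    e = np.exp(gambler_logits - gambler_logits.max())
    return e / e.sum()


def gambling_objective(probs, labels, gambler_logits):
    """The gambler maximizes this value; the segmenter minimizes it."""
    weights = betting_map(gambler_logits)
    return float((weights * pixel_cross_entropy(probs, labels)).sum())


# A 1x2 "image": the first pixel is predicted confidently, the second one is wrong.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]]])
labels = np.array([[0, 0]])

uniform_bet = gambling_objective(probs, labels, np.array([[0.0, 0.0]]))
focused_bet = gambling_objective(probs, labels, np.array([[0.0, 3.0]]))
print(uniform_bet, focused_bet)
```

A gambler that bets on the badly predicted pixel gets a higher objective than one that bets uniformly, so the segmenter is pushed to fix exactly the pixels where its mistakes are easiest to spot.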

Context aggregation

  • [9/10][SOTA on cityscapes-val, proposed global-local context module to aggregate multidimensional features] Adaptive Context Network for Scene Parsing

  • [6/10][Self-attention on ASPP features flattened + concatenated] Asymmetric Non-local Neural Networks for Semantic Segmentation. Instead of global self-attention (which is very costly) they 1) use ASPP, 2) flatten all ASPP maps, 3) concatenate the resulting 1x1 maps, 4) use attention on these concatenated features (which means that you can select 0.1 * global (1x1) pool + 0.3 * 2x2pool[0,0] + 0.01 * 2x2pool[0,1] + ...). The module improves the final metric, as expected.
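The "0.1 * global pool + 0.3 * 2x2pool[0,0] + ..." picture from the Asymmetric Non-local note can be sketched as attention whose keys and values are a handful of pooled vectors instead of all H*W positions. This is a simplified illustration under assumptions: average pooling at sizes 1 and 2 stands in for the multi-scale maps, there are no learned query/key/value projections, the 1/sqrt(C) scaling is a common convention rather than something stated above, and all names are hypothetical.

```python
import numpy as np


def adaptive_avg_pool(x, k):
    """Average-pool an (H, W, C) map to (k, k, C); assumes H and W divisible by k."""
    h, w, c = x.shape
    return x.reshape(k, h // k, k, w // k, c).mean(axis=(1, 3))


def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def pooled_attention(x, pool_sizes=(1, 2)):
    """Every pixel attends to S = sum(k * k) pooled vectors instead of H * W pixels."""
    h, w, c = x.shape
    queries = x.reshape(h * w, c)
    # Flatten each pooled map and concatenate: (S, C).
    anchors = np.concatenate(
        [adaptive_avg_pool(x, k).reshape(k * k, c) for k in pool_sizes]
    )
    attn = softmax(queries @ anchors.T / np.sqrt(c))   # (H * W, S) mixing weights
    out = (attn @ anchors).reshape(h, w, c)
    return out, attn


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
out, attn = pooled_attention(features)
print(out.shape, attn.shape)
```

The attention matrix is (H*W) x S rather than (H*W) x (H*W), which is where the cost saving over global self-attention comes from.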

Make use of class prior

Make use of boundaries

  • [8/10][Separate (cheap) shape stream from image gradients and dual-task learning (shape + segmentation)] Gated-SCNN: Gated Shape CNN for Semantic Segmentation. 1) a very cheap 3-layer shape stream which accepts image gradients + 1st-layer CNN features and exchanges information with the main backbone via a gating mechanism; 2) dual loss (edge detection + semantic segmentation) + a consensus regularization penalty (checks that the semantic segmentation output is consistent with the predicted edges).

  • [8/10][Another approach for using boundaries: first, learn the boundary as an (N+1)-th class, then introduce UAGs and some crazy stuff] Boundary-Aware Feature Propagation for Scene Segmentation

Salient object detection (note that everyone exploits edges & boundaries in some way)

  • EGNet: Edge Guidance Network for Salient Object Detection

  • Selectivity or Invariance: Boundary aware Salient Object Detection

  • Stacked Cross Refined Network for Edge-aware Salient Object Detection

Other

Instance segmentation

  • [9/10][Learn prototypes and coefficients to combine them; can be 3-10x faster than Mask R-CNN with comparable accuracy] YOLACT: Real-time Instance Segmentation. Each anchor predicts bbox + classes + prototype weights. A separate branch predicts the prototypes.

  • [9/10][Backbone + point proposal -> mask of the object containing the point] AdaptIS: Adaptive Instance Selection Network. The proposed network can generate an instance mask given a point on that instance. The backbone extracts features. Features + point -> small net with AdaIN (where the norms are computed from the point info) -> instance mask. To get all objects in the image the authors train a separate "point proposal" branch, which is trained after everything else is frozen and predicts the binary label "will this point be good for object mask prediction?". The top k% of points from this branch are sampled and used for predicting objects.
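For the YOLACT item above, the "prototypes + per-instance weights" idea boils down to a linear combination followed by a sigmoid. Below is a minimal NumPy sketch; the two hand-made prototypes and the coefficient values are toy assumptions (in the real model both come from network branches), and the later steps such as cropping by the predicted box are left out.

```python
import numpy as np


def assemble_masks(prototypes, coefficients):
    """prototypes: (H, W, K); coefficients: (N, K) -> soft masks (N, H, W) in (0, 1)."""
    logits = np.einsum("hwk,nk->nhw", prototypes, coefficients)
    return 1.0 / (1.0 + np.exp(-logits))


# Two toy 4x4 prototypes: +1 on the left half / top half, -1 elsewhere.
left_right = np.where(np.arange(4) < 2, 1.0, -1.0)[None, :] * np.ones((4, 1))
top_bottom = left_right.T
prototypes = np.stack([left_right, top_bottom], axis=-1)   # (4, 4, 2)

# One row of prototype weights per detected instance.
coefficients = np.array([
    [5.0, 0.0],    # instance 0: "left half"
    [0.0, -5.0],   # instance 1: "bottom half" (negative weight flips the prototype)
])

soft_masks = assemble_masks(prototypes, coefficients)
binary_masks = soft_masks > 0.5
print(binary_masks.astype(int))
```

Because the prototypes are shared by all instances, each extra instance costs only one small matrix product, which fits the speed claim in the note.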

Object detection

Text detection and recognition

  • SNIDER: Single Noisy Image Denoising and Rectification for Improving License Plate Recognition

  • State-of-the-art in action: unconstrained text detection

  • Convolutional character networks

  • Large-scale Tag-based Font retrieval with Generative Feature Learning

  • Chinese Street View Text: Large-scale Chinese Reading and partially supervised learning

  • TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting

  • Efficient and accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

  • What's wrong with Scene Text Recognition Model Comparisons? Dataset and Model Analysis

  • Towards unconstrained text spotting

  • Controllable artistic text style transfer via shape-matching GAN

Content generation, generative models, GANs, style transfer

GAN Training Stability improvements

Video synthesis

  • [9/10] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models video

  • Markov decision process for video generation

source person -> pose; target person + source pose -> synthesis

  • Dance Dance Generation: Motion Transfer for Internet Videos

  • Everybody Dance Now (University of California) video

Image extension

(also SinGAN and InGAN)

  • Boundless: Generative Adversarial Network for Image Extension

  • Very Long Natural Scenery Image Prediction by Outpainting

Style transfer

  • [the examples really look like a simple color transform] Photorealistic style transfer via Wavelet Transforms

  • A closed-form solution to universal style transfer

  • Understanding whitening and coloring transform for universal style transfer

  • [5/10] [Style transfer on entire image + semantic segmentation masks = style transfer for selected object classes] Class-based styling: real-time localized style transfer with semantic segmentation

Fashion, clothes try-on

In general all these methods still work quite poorly, but at least they work to some extent.

  • FW-GAN (Flow navigated warping gan for video virtual try-on)

  • Personalized Fashion Design (Cong Yu et al)

Neural Architecture Search

Compression

My knowledge of compression techniques is quite limited, so do not really trust these quality marks.

Anomaly detection

  • Real time aerial suspicious analysis (asana): system for identification and re-identification of suspicious individuals in crowds using the bayesian scatter-net hybrid network

  • Detecting the unexpected by image resynthesis

  • memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection

Other small topics

Sounds

Imbalanced classes

  • [9/10][Loss for classification+clustering specifically for imbalanced clustering; good separation of embeddings in the vector space] Gaussian margin for max-margin class imbalanced learning

  • [7/10][3-player adversarial game between a convex generator, a multi-class classifier network, and a real/fake discriminator to perform oversampling in deep learning systems. The convex generator generates new samples from the minority classes as convex combinations of existing instances, aiming to fool the discriminator and to make the classifier misclassify the generated samples] Generative Adversarial Minority Oversampling
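The "convex combinations of existing instances" step of the oversampling paper can be sketched in plain Python. This only shows how a new minority sample is formed; the adversarial training is not modelled. Random scores stand in for the generator's output, and turning them into convex weights with a softmax is an assumption of this sketch, as are the function and variable names.

```python
import math
import random


def convex_combination(points, raw_scores):
    """Softmax the raw scores into convex weights and mix the points with them."""
    top = max(raw_scores)
    exps = [math.exp(s - top) for s in raw_scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # non-negative, sum to 1
    dim = len(points[0])
    sample = [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]
    return sample, weights


# Three existing minority-class instances in 2-D (toy data).
minority = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]

rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in minority]   # stand-in for generator outputs
new_sample, weights = convex_combination(minority, scores)
print(new_sample, weights)
```

Since the weights are non-negative and sum to 1, the new sample always lies inside the convex hull of the existing minority instances, so the generator cannot drift far from the real data.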

Knowledge distillation

Self-supervised

Losses

Interpretability

  • Explaining Neural Networks Semantically and Qualitatively

  • fooling network interpretation in image classification

  • Seeing what a GAN cannot generate

Clustering

  • Subspace structure-aware spectral clustering for robust subspace clustering

  • Invariant information clustering for unsupervised image classification and segmentation

  • GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions

  • Deep Comprehensive Mining for Image Clustering

Human uncertainty for training

Motion in the dark

  • Seeing Motion in the dark
  • Learning to see moving objects in the dark

Other (random)

  • [8/10][ImageNet pretraining may improve convergence speed but does not necessarily lead to better results; training from scratch is better] Rethinking ImageNet Pre-training

  • [9/10][Learn multiple prototypes per class to detect noisy labels; train on both noisy and pseudo labels; no need to clean data; no specific noise distribution assumptions; SOTA for noisy classification] Deep Self-learning From Noisy Labels

  • [9/10][Fast second order optimizer (cost of backward ~ 2-3 * costs of forward)] Small steps and giant leaps: Minimal Newton solvers for Deep Learning. Converges better than Adam, SGD & co.

  • Selective Sparse Sampling for Fine-Grained Image Recognition

  • Dynamic anchor feature selection for single shot object detection

  • VideoBERT: A Joint model for Video and Language Representation Learning

  • PR Product: A substitute for inner product in neural networks

  • Deep Meta Metric Learning

  • [9/10][Slow net on 1/N frames, fast net on (N-1)/N frames] SlowFast Networks for Video Recognition

  • [6/10][One of many works on domain adaptation] Self-training with progressive augmentation for Unsupervised Person Re-identification

  • Learning to paint with model-based deep reinforcement learning

  • Joint Demosaicing and Denoising by Fine-tuning of Bursts of Raw Images

  • Improving CNN Classifiers by Estimating Test-time Priors

  • Joint Acne Image Grading and Counting via Label Distribution Learning

  • [RANSAC-like method to fit arbitrary shapes or arbitrary counts] Progressive-X: Efficient, Anytime, Multi-Model Fitting Algorithm

  • Noise flow: noise modeling with conditional normalizing flows

  • [comic colorization] Tag2Pix: Line Art Colorization Using Text Tag With SECat and Changing Loss

  • Learning Lightweight Lane Detection CNNs by Self Attention Distillation

  • Transductive Learning for Zero-shot Object Detection

Other notes:

  • Book "Explainable AI: Interpreting, explaining and visualizing deep learning"
