# Unsupervised Learning of Video Representations using LSTMs

* Babelfish / Deep Elastic: Part 2 - Deep NLP [1]
* 김무성

# Contents
* Abstract
* 1.Introduction
    - 1.1. Why Unsupervised Learning?
    - 1.2. Our Approach
    - 1.3. Related Work
* 2.Model Description
    - 2.1. Long Short Term Memory
    - 2.2. LSTM Autoencoder Model
    - 2.3. LSTM Future Predictor Model
    - 2.4. Conditional Decoder
    - 2.5. A Composite Model
* 3.Experiments
    - 3.1. Datasets
    - 3.2. Visualization and Qualitative Analysis
    - 3.3. Action Recognition on UCF-101/HMDB-51
    - 3.4. Comparison of Different Model Variants
    - 3.5. Comparison with Other Action Recognition Benchmarks

#### Reference
[2] Unsupervised Learning of Video Representations using LSTMs slide - https://docs.google.com/presentation/d/1aF-HdZwR3jfHkyS_BL2jRYMM2dTFvT4zUBs4alN-GPs/edit#slide=id.p

# Abstract

<font color="red">We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences.</font>

# 1. Introduction
* 1.1. Why Unsupervised Learning?
* 1.2. Our Approach
* 1.3. Related Work

## 1.1. Why Unsupervised Learning?

* The costly work of collecting more labelled data and the tedious work of doing more clever engineering can go a long way in solving particular problems, but this is ultimately unsatisfying as a machine learning solution. 
* This highlights the need for using unsupervised learning to find and represent structure in videos. 

## 1.2. Our Approach

* Our model works as follows. 
    - The Encoder LSTM runs through a sequence of frames to come up with a representation. 
    - This representation is then decoded through another LSTM to produce a target sequence. 
    - We consider different choices of the target sequence.
        - <font color="red">One choice is to predict the same sequence as the input</font>. 
            - The motivation is similar to that of autoencoders 
                - We wish to capture all that is needed to reproduce the input but at the same time go through the inductive biases imposed by the model. 
        - <font color="red">Another option is to predict the future frames</font>. 
            - Here the motivation is to learn a representation that extracts all that is needed to extrapolate the motion and appearance beyond what has been observed. 
        - These two natural choices can also be combined. 
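
The target choices above can be sketched as simple array slicing. This is a minimal illustration rather than the authors' code: the helper name `make_targets`, its argument names, and the flattened `(time, pixels)` clip layout are assumptions made for the example.

```python
import numpy as np

def make_targets(clip, n_input, n_future):
    """Split a clip of shape (time, pixels) into encoder input and decoder targets.

    Hypothetical helper, for illustration only.
    """
    if n_input + n_future > len(clip):
        raise ValueError("clip is too short for the requested split")
    inputs = clip[:n_input]                           # frames the encoder reads
    recon_target = inputs                             # choice 1: reproduce the input
    future_target = clip[n_input:n_input + n_future]  # choice 2: predict what follows
    return inputs, recon_target, future_target

# Toy clip: 6 frames of 4 pixels each, where frame t is filled with the value t.
clip = np.repeat(np.arange(6.0)[:, None], 4, axis=1)
inputs, recon_target, future_target = make_targets(clip, n_input=4, n_future=2)
```

Combining the two choices amounts to training against both `recon_target` and `future_target` from the same encoded representation.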

<img src="figures/cap5.png" width=600 />

## 1.3. Related Work

# 2. Model Description
* 2.1. Long Short Term Memory
* 2.2. LSTM Autoencoder Model
* 2.3. LSTM Future Predictor Model
* 2.4. Conditional Decoder
* 2.5. A Composite Model

## 2.1. Long Short Term Memory

#### Reference
* [7] Understanding LSTM Networks (in Korean) - http://roboticist.tistory.com/m/post/571

<img src="figures/cap1.png" width=600 />

<img src="figures/eq1.png" width=600 />
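
As a rough companion to the equations in the figure, here is one LSTM time step in NumPy. It sketches the common LSTM formulation without peephole connections, so it may differ in detail from the equations shown. The function name `lstm_step`, the stacked weight layout, and the gate ordering are assumptions for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step (no peepholes).

    x: (input_dim,); h_prev, c_prev: (hidden_dim,)
    W: (4 * hidden_dim, input_dim), U: (4 * hidden_dim, hidden_dim), b: (4 * hidden_dim,)
    Rows are stacked in the order: input gate, forget gate, output gate, candidate.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * n:1 * n])   # input gate
    f = sigmoid(z[1 * n:2 * n])   # forget gate
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:4 * n])   # candidate cell update
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), np.zeros(hidden_dim), np.zeros(hidden_dim), W, U, b)
```

The forget gate `f` scales the old cell state and the input gate `i` scales the new candidate, which is what lets the cell carry information across many time steps.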

## 2.2. LSTM Autoencoder Model
* Why should this learn good features?

#### Reference
* [9] CS231n: Convolutional Neural Networks for Visual Recognition - Lecture 11 - Training ConvNets in practice : Data augmentation, transfer learning, Distributed training, CPU/GPU bottlenecks, Efficient convolutions - http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
* [6] CS231n: Convolutional Neural Networks for Visual Recognition - Lecture 14 - ConvNets for videos Unsupervised learning - http://cs231n.stanford.edu/slides/winter1516_lecture14.pdf
* [8] Deep Learning: Theory and Practice (in Korean) - https://wikidocs.net/3413

<img src="figures/cap2.png" width=600 />

## Why should this learn good features?

* The state of the encoder LSTM after the last input has been read is the representation of the input video. 
* The decoder LSTM is being asked to reconstruct back the input sequence from this representation. 
* In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video. 
* However, an important question for any autoencoder-style model is what prevents it from learning an identity mapping and effectively copying the input to the output.
* In that case all the information about the input would still be present, but the representation would be no better than the input.
* <font color="red">There are two factors that control this behaviour</font>. 
    - First, the fact that there are only a fixed number of hidden units makes it unlikely that the model can learn trivial mappings for arbitrary length input sequences. 
    - Second, the same LSTM operation is used to decode the representation recursively. This means that the same dynamics must be applied on the representation at any stage of decoding. 
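
The two factors can be made concrete with a small NumPy sketch: the encoder compresses a sequence of any length into a state of fixed size, and the decoder reuses one set of weights at every step. This is an untrained toy under stated assumptions, not the authors' implementation. `TinyLSTM`, `autoencode`, `W_out`, the dimensions, and the choice to feed the decoder zeros at each step are all invented for illustration.

```python
import numpy as np

class TinyLSTM:
    """Minimal LSTM cell (no peepholes); hypothetical helper for this sketch."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.n = hidden_dim
        self.W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def step(self, x, state):
        h, c = state
        z = self.W @ np.concatenate([x, h]) + self.b
        # Input, forget, and output gates are the first three blocks of z.
        i, f, o = (1.0 / (1.0 + np.exp(-z[k * self.n:(k + 1) * self.n])) for k in range(3))
        g = np.tanh(z[3 * self.n:])
        c = f * c + i * g
        return o * np.tanh(c), c

def autoencode(frames, encoder, decoder, W_out):
    """Encode frames into one fixed-size state, then decode one frame per step."""
    state = (np.zeros(encoder.n), np.zeros(encoder.n))
    for x in frames:                      # encoder reads the whole sequence
        state = encoder.step(x, state)
    representation = state                # fixed size, whatever len(frames) is
    outputs = []
    zero_input = np.zeros(frames.shape[1])
    for _ in range(len(frames)):          # the same decoder weights at every step
        state = decoder.step(zero_input, state)
        outputs.append(W_out @ state[0])  # linear readout from the hidden state
    return representation, np.stack(outputs)

rng = np.random.default_rng(0)
frame_dim, hidden_dim = 16, 8
encoder = TinyLSTM(frame_dim, hidden_dim, rng)
decoder = TinyLSTM(frame_dim, hidden_dim, rng)
W_out = rng.normal(scale=0.1, size=(frame_dim, hidden_dim))

frames = rng.random((10, frame_dim))
(h_rep, c_rep), recon = autoencode(frames, encoder, decoder, W_out)
mse = np.mean((recon - frames) ** 2)      # reconstruction error of the untrained model
```

Whether `frames` holds 10 or 100 frames, `h_rep` and `c_rep` keep the same size, so the model cannot simply store every input frame verbatim.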

## 2.3. LSTM Future Predictor Model
* Why should this learn good features?

Another natural unsupervised learning task for sequences is <font color="blue">predicting the future</font>. 
* This is the approach used in language models for modeling sequences of words. 
* The design of the <font color="red">Future Predictor Model</font> is the same as that of the Autoencoder Model, except that the decoder LSTM in this case predicts frames of the video that come after the input sequence (Fig. 3).

<img src="figures/cap3.png" width=600 />

#### Long sequence (future)
* Ranzato et al. (2014) use a similar model but predict only the next frame at each time step.
* This model, on the other hand, predicts a long sequence into the future.
* Here again we can consider <font color="red">two variants of the decoder </font>
    - conditional and 
    - unconditioned.

## Why should this learn good features?

* In order to predict the next few frames correctly, the model needs information about which objects and background are present and how they are moving so that the motion can be extrapolated. 
* <font color="red">The hidden state coming out from the encoder will try to capture this information</font>. 
* <font color="blue">Therefore, this state can be seen as a representation of the input sequence.</font>

## 2.4. Conditional Decoder

* For each of these two models, we can consider two possibilities 
    - one in which the decoder LSTM is <font color="red">conditioned on the last generated frame</font> and 
    - the other in which it is not.
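
The difference between the two decoder variants comes down to what the decoder receives as input at each step. The loop below is a sketch under assumptions: `decode`, `step_fn`, and `readout_fn` are hypothetical names, and a toy `tanh` recurrence stands in for the decoder LSTM so the example stays short.

```python
import numpy as np

def decode(step_fn, readout_fn, state, n_steps, frame_dim, conditional):
    """Unroll a decoder for n_steps.

    step_fn(x, state) -> new state and readout_fn(state) -> frame are
    placeholders for the decoder LSTM and its output layer.
    """
    x = np.zeros(frame_dim)               # nothing has been generated before step 1
    frames = []
    for _ in range(n_steps):
        state = step_fn(x, state)
        frame = readout_fn(state)
        frames.append(frame)
        if conditional:
            x = frame                     # feed the last generated frame back in
        # otherwise x stays zero: the decoder sees only its own state
    return np.stack(frames)

# Toy recurrent cell standing in for an LSTM (an assumption for this example).
A = 0.5 * np.eye(3)
B = 0.1 * np.eye(3)
step_fn = lambda x, h: np.tanh(A @ h + B @ x)
readout_fn = lambda h: 2.0 * h

h0 = np.array([0.3, -0.2, 0.1])
uncond = decode(step_fn, readout_fn, h0, n_steps=4, frame_dim=3, conditional=False)
cond = decode(step_fn, readout_fn, h0, n_steps=4, frame_dim=3, conditional=True)
```

Both variants produce the same first frame, since nothing has been generated yet. They diverge from the second step on, once the conditional decoder starts seeing its own output.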

## 2.5. A Composite Model

<img src="figures/cap4.png" width=600 />

This composite model tries to <font color="red">overcome the shortcomings</font> that each model suffers on its own. 

#### Memorization
* A high-capacity autoencoder would suffer from the tendency to learn trivial representations that just memorize the inputs. 
* However, this memorization is not useful at all for predicting the future. Therefore, the composite model cannot just memorize the inputs. 

#### Forgetting 
* On the other hand, the future predictor suffers from the tendency to store information only about the last few frames since those are most important for predicting the future, 
    - i.e., in order to predict $v_t$, the frames $\{v_{t-1}, \ldots, v_{t-k}\}$ are much more important than $v_0$, for some small value of $k$.
* Therefore the representation at the end of the encoder will have forgotten about a large part of the input. 
* But if we ask the model to also predict all of the input sequence, then it cannot just pay attention to the last few frames.
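
One way to read the composite objective is as a sum of two errors computed from the same encoder state. The sketch below assumes a squared-error loss with equal weight on both terms. The function name `composite_loss` and the toy arrays are hypothetical, and the loss used in practice may differ by dataset.

```python
import numpy as np

def composite_loss(recon, recon_target, future, future_target):
    """Sum of squared errors over both decoder outputs (loss choice is an assumption)."""
    recon_loss = np.sum((recon - recon_target) ** 2)
    future_loss = np.sum((future - future_target) ** 2)
    return recon_loss + future_loss, recon_loss, future_loss

# A model that memorizes the input perfectly but predicts a blank future.
seen = np.ones((4, 2))
upcoming = np.full((3, 2), 2.0)
total, recon_loss, future_loss = composite_loss(seen, seen, np.zeros((3, 2)), upcoming)
```

Here perfect memorization drives `recon_loss` to zero, yet `total` stays large because of `future_loss`. The reverse also holds: a representation that tracks only the last few frames is penalized through the reconstruction term.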

# 3. Experiments
* 3.1. Datasets
* 3.2. Visualization and Qualitative Analysis
* 3.3. Action Recognition on UCF-101/HMDB-51
* 3.4. Comparison of Different Model Variants
* 3.5. Comparison with Other Action Recognition Benchmarks

## 3.1. Datasets

## 3.2. Visualization and Qualitative Analysis
* Experiments on MNIST
* Experiments on Natural Image Patches
* Generalization over time scales
* Out-of-domain Inputs
* Visualizing Features

## Experiments on MNIST

<img src="figures/cap5.png" width=600 />

## Experiments on Natural Image Patches

<img src="figures/cap6.png" width=600 />

## Generalization over time scales

<img src="figures/cap7.png" width=600 />

## Out-of-domain Inputs

<img src="figures/cap9.png" width=600 />

## Visualizing Features

<img src="figures/cap10.png" width=600 />

<img src="figures/cap11.png" width=600 />

## 3.3. Action Recognition on UCF-101/HMDB-51

<img src="figures/cap8.png" width=400 />

<img src="figures/cap14.png" width=600 />

<img src="figures/cap12.png" width=600 />

## 3.4. Comparison of Different Model Variants

<img src="figures/cap13.png" width=600 />

<img src="figures/cap5.png" width=600 />

<img src="figures/cap15.png" width=600 />

## 3.5. Comparison with Other Action Recognition Benchmarks

<img src="figures/cap16.png" width=600 />

# References

* [1] Unsupervised Learning of Video Representations using LSTMs /  ICML 2015 / arXiv:1502.04681  / Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov - http://arxiv.org/abs/1502.04681
* [2] Unsupervised Learning of Video Representations using LSTMs slide - https://docs.google.com/presentation/d/1aF-HdZwR3jfHkyS_BL2jRYMM2dTFvT4zUBs4alN-GPs/edit#slide=id.p
* [3] Topics in Computer Vision (CSC2523): Deep Learning in Computer Vision Winter 2016 - http://www.cs.utoronto.ca/~fidler/teaching/2015/CSC2523.html
* [4] code (original) - http://www.cs.toronto.edu/~nitish/unsupervised_video/
* [5] code (emansim/unsupervised-videos) - https://github.com/emansim/unsupervised-videos
* [6] CS231n: Convolutional Neural Networks for Visual Recognition - Lecture 14 - ConvNets for videos Unsupervised learning - http://cs231n.stanford.edu/slides/winter1516_lecture14.pdf
* [7] Understanding LSTM Networks (in Korean) - http://roboticist.tistory.com/m/post/571
* [8] Deep Learning: Theory and Practice (in Korean) - https://wikidocs.net/3413
* [9] CS231n: Convolutional Neural Networks for Visual Recognition - Lecture 11 - Training ConvNets in practice : Data augmentation, transfer learning, Distributed training, CPU/GPU bottlenecks, Efficient convolutions - http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf