# Neural Dense Non-Rigid Structure from Motion with Latent Space Constraints [1]

## ECCV 2020
<center><img src="NRSfM_figs/title.png" width=1000px alt="default"/></center>

[1] Sidhu, Vikramjit, et al. "Neural dense non-rigid structure from motion with latent space constraints." Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI. Springer International Publishing, 2020.

## Non-rigid Structure from Motion 

Rigid structure from motion (SfM) assumes that the object keeps a constant shape (rigid body).

Non-rigid structure from motion (NRSfM) deals with objects that deform or change shape over time, such as human faces, animals, and other flexible objects.

<center><img src="NRSfM_figs/example_nrsfm.png" width=1000px alt="default"/></center>


## Workflow: SfM

<!-- $\color{#EF5645}{\text{Given:}}$ Image points $x_{ij}$ representing $X_j$ in camera $i$ with projection matrix $M_i$: 

After data centering: $\hat x_{ij} = A_i \hat X_j$, where we do not know $A_i$ and $\hat X_j$.

Goal: Factor out $A_i$ and $\hat X_j$, for all $i$ and $j$.

$$D = \begin{bmatrix}
A_1 \hat X_1 & A_1 \hat X_2 & ... & A_1 \hat X_n\\
A_2 \hat X_1 & A_2 \hat X_2 & ... & A_2 \hat X_n\\
... &  ... & ... & ...\\
A_m \hat X_1 & A_m \hat X_2 & ... & A_m \hat X_n\\
\end{bmatrix}
$$

which we write: -->

Consider a set of $P$ points densely tracked across $T$ frames:
- Measurement: 2D position of the $p$-th point in frame $t$, $\mathbf{w}_t^p=\hat{\mathbf{w}}_t^p - \mathbf{t}_t$, i.e. the raw track $\hat{\mathbf{w}}_t^p$ centered by the camera translation $\mathbf{t}_t$
- Reconstruction: 3D coordinates of the $p$-th point in frame $t$, $\mathbf{s}_t^p=[x_t^p, y_t^p, z_t^p]^T$

We build a linear system that maps the 3D point coordinates to their 2D measurements:

<center><img src="NRSfM_figs/WRS.png" width=500px alt="default"/></center>

Note: in previous slides, we wrote this as $D = MS$. The notation maps as follows:

$D \rightarrow W$: measurement matrix, $M \rightarrow R$: motion matrix, $S$: structure (shape) matrix
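As a rough illustration of $W = RS$ in the rigid case, the sketch below stacks per-frame orthographic projections into a $2T \times P$ measurement matrix and recovers a rank-3 factorization by SVD, in the spirit of Tomasi-Kanade. The sizes, the random cameras, and the helper name `random_orthographic_camera` are assumptions made for this example only; the recovered factors are valid only up to a $3 \times 3$ ambiguity.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 6, 40  # frames and tracked points (illustrative sizes)

def random_orthographic_camera(rng):
    """Hypothetical helper: 2x3 matrix with orthonormal rows (orthographic camera)."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q[:2, :]

# Rigid case: the same centred 3 x P shape is observed in every frame.
S_rigid = rng.standard_normal((3, P))
S_rigid -= S_rigid.mean(axis=1, keepdims=True)

R_list = [random_orthographic_camera(rng) for _ in range(T)]

# Stack the per-frame projections R_t S into the 2T x P measurement matrix W.
W = np.vstack([R_t @ S_rigid for R_t in R_list])

# Rank-3 factorization via SVD; split the singular values between both factors.
U, sv, Vt = np.linalg.svd(W, full_matrices=False)
R_hat = U[:, :3] * np.sqrt(sv[:3])            # 2T x 3 motion estimate
S_hat = np.sqrt(sv[:3])[:, None] * Vt[:3]     # 3 x P structure estimate

print(np.linalg.matrix_rank(W, tol=1e-8), np.allclose(R_hat @ S_hat, W))
```

For a rigid object the measurement matrix has rank at most 3, which is what makes the factorization possible; non-rigid deformation breaks this property.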

## Workflow: Non-rigid SfM:

<center><img src="NRSfM_figs/workflow.png" width=500px alt="default"/></center>

**Shape Representation for Non-rigid objects**:

The 3D shape $\mathbf{S}_t$ at time $t$ is often represented as:
$$
\mathbf{S}_t = \mathbf{\bar{S}} + \sum_{i=1}^k c_{it} \mathbf{B}_i
$$
where $\mathbf{\bar{S}}$ is the mean shape, $\mathbf{B}_i$ are the basis shapes, $c_{it}$ are the deformation coefficients at time $t$, and $k$ is the number of basis shapes.
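A minimal NumPy sketch of this linear model, with randomly generated mean shape, basis shapes, and coefficients standing in for real data (all sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, P, k = 5, 30, 3  # frames, points, basis shapes (illustrative sizes)

S_mean = rng.standard_normal((3, P))        # mean shape
B = 0.1 * rng.standard_normal((k, 3, P))    # basis shapes B_i
c = rng.standard_normal((k, T))             # deformation coefficients c_{it}

# S_t = S_mean + sum_i c_{it} B_i, evaluated for all frames at once -> (T, 3, P)
S = S_mean[None] + np.einsum("it,idp->tdp", c, B)

# Check one frame against the explicit sum over basis shapes.
t = 2
S_t_explicit = S_mean + sum(c[i, t] * B[i] for i in range(k))
print(np.allclose(S[t], S_t_explicit))
```

Every frame shares the same mean shape and basis; only the $k$ coefficients change over time.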



## Workflow: Non-rigid SfM:

<center><img src="NRSfM_figs/workflow.png" width=500px alt="default"/></center>

   - **Mean Shape and Deformations**: 
      - The 3D shape of a non-rigid object at any given time can be thought of as a deviation from a mean shape. This is often represented using a linear combination of basis shapes.
   - **Basis Shapes**: 
      - A set of shapes that, when combined, can approximate the deformations of the object. 
      - Typically learned from training data or through optimization techniques.

The core component of NRSfM is a carefully designed deformation model with handcrafted priors

## Contributions

<center><img src="NRSfM_figs/overview.png" width=1000px alt="default"/></center>

This work adopts a different formulation that regresses the deformation modes in an unsupervised manner during neural network training.

- Core contribution: a new neural deformation model component based on an auto-decoder
    - Fully differentiable dense neural NRSfM approach
    - Priors are encoded in the energy function to improve reconstruction accuracy
- Extensive experimental evaluation and several applications of the deformation model, including
    - shape compression, interpolation and completion
    - fast direct non-rigid 3D reconstruction from monocular image sequences

## Method Overview
<center><img src="NRSfM_figs/method_overview_1.png" width=500px alt="default"/></center>

<center><img src="NRSfM_figs/method_overview.png" width=500px alt="default"/></center>


### Modelling Learnable Deformation with Neural Networks

<center><img src="NRSfM_figs/method_overview.png" width=500px alt="default"/></center>

<!-- Non-rigid shape representation:
$$
[
\mathbf{S}_t = \mathbf{\bar{S}} + \sum_{i=1}^k c_{it} \mathbf{B}_i
]
$$ -->

In this work:

- Deformation auto-decoder $f_{\boldsymbol{\theta}}$: 
    - $f_{\boldsymbol{\theta}}$ is a function of the per-frame latent code $\mathbf{z}_t$, built from nine fully-connected layers
    <!-- - the weight matrix of the ﬁnal fully-connected layer of $f_{\mathbf{\theta}}$ can be interpreted as a low-rank linear subspace where every vector denotes a 3D displacement from the mean shape. -->
    
- Relation to the shape space: $\mathbf{S}_t=\overline{\mathbf{S}}+f_{\boldsymbol{\theta}}\left(\mathbf{z}_t\right)$
    - Both $\overline{\mathbf{S}}$ and $\mathbf{R}_t$ are initialized by rigid factorization of $\mathbf{W}$ (Tomasi-Kanade factorization method)
    - The decoder output plays the role of the weighted sum of basis shapes in linear subspace models: $f_{\boldsymbol{\theta}}\left(\mathbf{z}_t\right) \rightarrow \sum_{i=1}^k c_{it} \mathbf{B}_i$
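The forward pass $\mathbf{S}_t=\overline{\mathbf{S}}+f_{\boldsymbol{\theta}}(\mathbf{z}_t)$ can be sketched in plain NumPy as below. Only the nine fully-connected layers come from the slide; the latent size, hidden width, ReLU activation, and random initialization are assumptions chosen for illustration, and no training loop is shown. In an auto-decoder there is no encoder: the latent codes $\mathbf{z}_t$ are free variables optimized together with $\boldsymbol{\theta}$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 4, 50        # frames and points (illustrative sizes)
latent_dim = 32     # assumed size of each latent code z_t
hidden = 64         # assumed hidden width
n_layers = 9        # nine fully-connected layers, as stated on the slide

# Layer sizes: latent_dim -> hidden (x8) -> 3P
sizes = [latent_dim] + [hidden] * (n_layers - 1) + [3 * P]
theta = [
    (rng.standard_normal((m, n)) * np.sqrt(1.0 / m), np.zeros(n))
    for m, n in zip(sizes[:-1], sizes[1:])
]

def f_theta(z_t, theta):
    """Decode one latent code into a 3 x P displacement from the mean shape."""
    h = z_t
    for i, (weight, bias) in enumerate(theta):
        h = h @ weight + bias
        if i < len(theta) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumption)
    return h.reshape(3, -1)

# One latent code per frame; here random, in practice optimized with theta.
z = 0.01 * rng.standard_normal((T, latent_dim))
S_mean = rng.standard_normal((3, P))  # stands in for the rigid initialization

S = np.stack([S_mean + f_theta(z_t, theta) for z_t in z])  # (T, 3, P)
print(S.shape)
```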

### Differentiable Energy (Loss) Function

<center><img src="NRSfM_figs/energy_function.png" width=600px alt="default"/></center>

- $\mathbf{E}_{data}$: penalize the image re-projection errors

    - $\mathbf{E}_{\text {data }}(\boldsymbol{\theta}, \mathbf{z}, \mathbf{R})=\left\|\mathbf{W}-\mathbf{R}\left(\left(\mathbf{1}_T \otimes \overline{\mathbf{S}}\right)+f_{\boldsymbol{\theta}}(\mathbf{z})\right)\right\|_\epsilon$

- $\mathbf{E}_{temp}$, $\mathbf{E}_{spat}$, $\mathbf{E}_{traj}$, $\mathbf{E}_{latent}$: encode priors as soft (differentiable) regularizers.
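The data term can be sketched as follows, assuming that $\|\cdot\|_\epsilon$ denotes a Huber-type robust norm with threshold $\epsilon$ summed over all entries, that $\mathbf{R}$ is block-diagonal with one $2 \times 3$ camera per frame, and that $f_{\boldsymbol{\theta}}(\mathbf{z})$ is supplied as a precomputed $3T \times P$ array. The helper names `huber`, `block_diagonal`, and `e_data` are hypothetical.

```python
import numpy as np

def huber(residual, eps=1.0):
    """Huber loss summed over all entries (assumed form of the epsilon-norm)."""
    a = np.abs(residual)
    return np.where(a <= eps, 0.5 * a**2, eps * (a - 0.5 * eps)).sum()

def block_diagonal(cameras):
    """Place T cameras of size 2x3 on the diagonal of a 2T x 3T matrix."""
    T = len(cameras)
    out = np.zeros((2 * T, 3 * T))
    for t, R_t in enumerate(cameras):
        out[2 * t:2 * t + 2, 3 * t:3 * t + 3] = R_t
    return out

def e_data(W, R, S_mean, deform, eps=1.0):
    """Re-projection error || W - R((1_T kron S_mean) + f_theta(z)) ||_eps."""
    T = R.shape[0] // 2
    S = np.kron(np.ones((T, 1)), S_mean) + deform  # (3T, P) stacked shapes
    return huber(W - R @ S, eps)

rng = np.random.default_rng(0)
T, P = 3, 10
cameras = [np.linalg.qr(rng.standard_normal((3, 3)))[0][:2] for _ in range(T)]
R = block_diagonal(cameras)
S_mean = rng.standard_normal((3, P))
deform = 0.1 * rng.standard_normal((3 * T, P))   # stands in for f_theta(z)

W_clean = R @ (np.kron(np.ones((T, 1)), S_mean) + deform)
W_noisy = W_clean + 0.05 * rng.standard_normal(W_clean.shape)

print(e_data(W_clean, R, S_mean, deform), e_data(W_noisy, R, S_mean, deform))
```

Noise-free measurements give a vanishing data term, while perturbed tracks are penalized quadratically for small residuals and linearly for large ones.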



### Temporal & Trajectory Regularization

<center><img src="NRSfM_figs/temporal_regularization.png" width=500px alt="default"/></center>

$\mathbf{E}_{temp}$: enforces temporal-preserving regularisation of the 3D shape via its latent space

$\mathbf{E}_{traj}$: imposes a subspace constraint on point trajectories throughout the whole sequence


### Spatial Regularization

<center><img src="NRSfM_figs/spatial_regularization&apos;.png" width=500px alt="default"/></center>

$\mathbf{E}_{spat}$: spatial-preserving regularisation over a local neighbourhood. 

It is suited to dense observations, where most points in a local neighbourhood follow a similar motion pattern.

### Latent Space Constraints

$\mathbf{E}_{latent}(\mathbf{z})=\| \mathcal{F}(\mathbf{z}) \|_1$, where $\mathcal{F}(\cdot)$ denotes the Fourier transform operator.

- Imposes sparsity constraints over the latent vector by encouraging the sparsity of the Fourier series and removing less relevant frequency components.
- Period Detection: The period of the sequence can be recovered by extracting the dominant frequency
- Sequence Segmentation: The latent space is temporally segmented. Similar values are decoded into similar shapes.
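The latent term and the period detection can be illustrated on a synthetic periodic latent sequence. The sketch below assumes the Fourier transform is taken along the time axis of a $T \times d$ latent matrix and uses NumPy's real FFT; the signal, its period of 20 frames, and the helper name `e_latent` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, latent_dim = 120, 4   # frames and latent size (illustrative)
true_period = 20         # period of the synthetic sequence, in frames

# Synthetic latent codes: each dimension is a sinusoid with the same period.
t = np.arange(T)
amps = rng.uniform(0.5, 1.5, size=latent_dim)
phases = rng.uniform(0.0, 2.0 * np.pi, size=latent_dim)
z = amps * np.sin(2.0 * np.pi * t[:, None] / true_period + phases)  # (T, d)

def e_latent(z):
    """L1 norm of the Fourier coefficients of z along the time axis."""
    return np.abs(np.fft.rfft(z, axis=0)).sum()

# Period detection: find the dominant non-zero frequency bin.
spectrum = np.abs(np.fft.rfft(z, axis=0)).sum(axis=1)
spectrum[0] = 0.0                    # ignore the constant (DC) component
k_dom = int(np.argmax(spectrum))     # dominant frequency bin
period = T / k_dom                   # recovered period in frames

print(e_latent(z), k_dom, period)
```

A periodic latent sequence concentrates its energy in few frequency bins, so minimizing the L1 norm of the spectrum favours such sequences over noisy ones.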


## Experiments

### Kinect paper
<center><img src="NRSfM_figs/exp_paper.png" width=500px alt="default"/></center>

### Synthetic Face Sequences

<center><img src="NRSfM_figs/exp_face.png" width=500px alt="default"/></center>

<center><img src="NRSfM_figs/exp_face_2.png" width=500px alt="default"/></center>

### Kinect paper
<center><img src="NRSfM_figs/exp_paper_2.png" width=500px alt="default"/></center>

### Effect of spatial constraints

<center><img src="NRSfM_figs/exp_spatial.png" width=500px alt="default"/></center>

<center><img src="NRSfM_figs/exp_spatial_2.png" width=500px alt="default"/></center>

### Effect of latent constraints

<center><img src="NRSfM_figs/exp_latent_training.png" width=500px alt="default"/></center>

<center><img src="NRSfM_figs/exp_latent.png" width=500px alt="default"/></center>

## Applications

### Shape Compression, Interpolation, Denoising and Completion.

<center><img src="NRSfM_figs/app_denoising.png" width=500px alt="default"/></center>

<center><img src="NRSfM_figs/app_completion.png" width=500px alt="default"/></center>


### Direct Monocular Non-rigid 3D Reconstruction with Occlusion Handling.

<center><img src="NRSfM_figs/app_reconstruction.png" width=500px alt="default"/></center>

<center><img src="NRSfM_figs/app_reconstruction_2.png" width=500px alt="default"/></center>

## Remarks:

- This paper introduced a neural non-rigid structure-from-motion approach
    - Neural network (implicit):
        - Deep features, implicit but powerful & robust
        - Nonlinearity, with learnability
    - Image Processing: 
        - SfM workflow stemming from factorization methods
        - encode priors into regularization terms in energy (loss) function

Combining neural learning methods (e.g., extracting deep features, modelling implicit relations) with "traditional" image processing techniques and priors contributes to further improvements.
