<a href="https://colab.research.google.com/github/djbradyAtOpticalSciencesArizona/multiframeEstimation/blob/main/better_image_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Better Image Representation - Learn from NeRF<sup>1</sup>  and SRN<sup>2</sup> 

Thoughts from discussion with Qian   
Minghao   

Mar 31, 2021

djb update 4 april 2021

(SRN is the follow-up to DeepVoxels<sup>3</sup>)   
1: [NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis](https://arxiv.org/abs/2003.08934)   
2: [Scene representation networks: Continuous 3D-structure-aware neural scene representations](https://papers.nips.cc/paper/2019/hash/b5dc4e5d9b495d0196f61d45b26ef33e-Abstract.html)   
3: [DeepVoxels: Learning Persistent 3D Feature Embeddings](https://vsitzmann.github.io/deepvoxels/)

## NeRF and SRN: implicit 3D representation.

These two neural networks have similar inputs, outputs, and pipelines. The pipeline can be written as:
\begin{align*}
    \mathbf{D} &= \mathcal{F}(\{\mathbf{X}\}) \\
    \mathbf{Y} &= \mathcal{G}(\mathbf{D}, \mathbf{V})
\end{align*}
where
- $\{\mathbf{X}\}$: input, many (e.g., 100) 2D images   
- $\mathbf{D}$: representation, one block of data (an array or network weights) representing a single 3D object or a class of objects   
- $\{\mathbf{Y}\}$: displayed output, 2D images of the object from the given viewpoints $\{\mathbf{V}\}$   
- $\mathcal{F}$: can be regarded as fusion and compression
- $\mathcal{G}$: can be regarded as extraction and projection.

As shown in this DeepVoxels demonstration image, $\{\mathbf{X}\}$ are the images of the vase taken from different viewpoints (top left). $\mathbf{D}$ is drawn as a colorful cube. Given viewpoints $\{\mathbf{V}\}$, the outputs $\{\mathbf{Y}\}$ are rendered from $\mathbf{D}$ (bottom right).
![DeepVoxels](https://github.com/djbradyAtOpticalSciencesArizona/multiframeEstimation/blob/main/imgs/mh_image_representation/DeepVoxels_sketch.png?raw=1)
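
To make the role of $\mathcal{G}$ concrete, here is a minimal toy sketch. It assumes $\mathbf{D}$ is an explicit voxel grid and replaces the viewpoint $\mathbf{V}$ with a choice of axis, so rendering becomes a plain orthographic sum. This is only an illustration: NeRF and SRN render perspective views through learned functions, and the `render` helper and the box-shaped object are made up for this example.

```python
import numpy as np

def render(D, axis):
    """Toy G(D, V): orthographic projection of a voxel grid along one axis.

    D    : (N, N, N) array of non-negative densities (the representation)
    axis : 0, 1 or 2, a crude stand-in for the viewpoint V
    """
    return D.sum(axis=axis)

# Hypothetical object: a solid box inside an otherwise empty 8x8x8 volume.
D = np.zeros((8, 8, 8))
D[2:6, 3:5, 1:7] = 1.0

Y0 = render(D, axis=0)   # view along the first axis
Y2 = render(D, axis=2)   # view along the third axis

# Both views come from the same D, so they are correlated (for instance they
# share the same total density), but neither 2D view determines the other.
print(Y0.shape, Y2.shape, Y0.sum(), Y2.sum())
```

Projecting $\mathbf{D}$ to any view is a single reduction, while recovering $\mathbf{D}$ from a handful of such views is the hard direction that $\mathcal{F}$ has to solve.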

SRN and NeRF store $\mathbf{D}$ in different formats, and therefore use different $\mathcal{F}$ and $\mathcal{G}$.    
NeRF's $\mathbf{D}$ is the set of weights of the network $\mathcal{G}$, and its $\mathcal{F}$ is a training process.   
SRN's $\mathbf{D}$ is more complex, as it aims to work for a class of objects, where a latent vector $z$ chooses which object to render.

The input $\{\mathbf{X}\}$ is highly redundant. A person can get a decent understanding of a 3D object from just 3 or 4 images taken from different viewpoints. The process $\{\mathbf{X}\} \rightarrow \mathbf{D}$ is therefore more or less an overdetermined problem. As a result, the representation $\mathbf{D}$ can take less disk space than the input $\{\mathbf{X}\}$.

The displayed output $\{\mathbf{Y}\}$ can be regarded as projections of the 3D object, and it lies in the same space as the input $\{\mathbf{X}\}$. This makes the network easier to train, since we can split a set of $\{\mathbf{X}\}$ into input and ground-truth data.

By the way, I notice that $\{\mathbf{X}\}$ and $\{\mathbf{Y}\}$ are all 2D images. Why? My opinion is that this is a restriction of our sensors and displays. We have 2D sensors and we have 2D displays. I bet that the displayed output wouldn't be images if we had commercially available 3D displays.

## Straightforward 2D analogy

However, it's not that easy to find a proper analogy for 2D images. 

SRN finds a function $\Phi: (x,y,z) \rightarrow \mathbf{v}$, where $(x,y,z)$ is a 3D coordinate and $\mathbf{v}$ is a representation vector containing color and radiance information. It then uses $\Phi$ to derive $\mathcal{G}$.   
NeRF works similarly, but traces values along rays rather than at points.  

Of course we can try the most straightforward analogy: finding a function $\Phi: (x,y) \rightarrow \mathbf{v}$. A similar method was proposed in CPPN<sup>1</sup>, which uses a neural network to implement a function $\Phi: (x,y) \rightarrow v$. That paper uses it as a pattern generator rather than as a representation or compression scheme. I'm still scanning the papers that cite CPPN.  
Anyway, we can train such a network, and it should be a reasonable inpainting algorithm, since we can train on the existing pixels and predict the missing ones.    
However, although we can make the $(x,y)$ grid arbitrarily dense, I don't think it would easily handle super-resolution tasks. As mentioned in the SRN paper, they tried a denser $(x,y,z)$ grid, but the resulting voxel array didn't contain much high-frequency information beyond that of the input images.
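
The inpainting idea can be sketched in a few lines. The example below is a stand-in, not a CPPN: instead of a neural network it fits $\Phi$ by linear least squares on a fixed set of low-frequency Fourier features, and the test image, the 30% missing rate, and the frequency range are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical smooth test image on a 32x32 grid, coordinates in [0, 1).
n = 32
ys, xs = np.meshgrid(np.arange(n) / n, np.arange(n) / n, indexing="ij")
image = np.sin(2 * np.pi * xs) * np.cos(2 * np.pi * ys)

coords = np.stack([xs.ravel(), ys.ravel()], axis=1)   # (n*n, 2)
values = image.ravel()                                # (n*n,)

# Hide roughly 30% of the pixels; only the remaining ones are used for fitting.
known = rng.random(n * n) > 0.3

# Integer spatial frequencies up to 2 cycles per image in each direction.
freqs = np.array([(i, j) for i in range(-2, 3) for j in range(-2, 3)],
                 dtype=float).T                       # (2, 25)

def features(xy):
    """Fourier features [cos, sin] of 2D coordinates."""
    proj = 2 * np.pi * xy @ freqs
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

# Fit Phi on the known pixels. Some feature columns are redundant, which is
# harmless because lstsq returns the minimum-norm solution.
w, *_ = np.linalg.lstsq(features(coords[known]), values[known], rcond=None)

def phi(xy):
    """Fitted implicit representation Phi: (x, y) -> v."""
    return features(xy) @ w

# Inpainting: predict the pixels that were hidden during fitting.
err = np.max(np.abs(phi(coords[~known]) - values[~known]))
print(f"max error on hidden pixels: {err:.2e}")

# Phi can be sampled on a denser 64x64 grid, but the result contains no
# frequencies beyond those already present in the fitted basis.
m = 64
dy, dx = np.meshgrid(np.arange(m) / m, np.arange(m) / m, indexing="ij")
dense = phi(np.stack([dx.ravel(), dy.ravel()], axis=1)).reshape(m, m)
```

The hidden pixels are recovered here only because the test image lies inside the span of the chosen features. Sampling `phi` more densely gives a larger array, not new detail, which matches the super-resolution concern above.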

A main difference between the 3D algorithms and this straightforward 2D analogy, in my opinion, is the relation between the representation $\mathbf{D}$ and the displayed output $\{\mathbf{Y}\}$.   
In the 3D algorithms, given a specific viewpoint $\mathbf{V}$, we can project $\mathbf{D}$ to $\mathbf{Y}$. The reverse projection is very hard. What's more, although different $\mathbf{Y}$s are closely correlated, it's hard to infer one from another.   
In the 2D analogy, the output, a 2D pixel map, is itself a function $\mathbf{I}: (x,y) \rightarrow v$ with band-limited output, which is almost identical to the proposed representation $\Phi: (x,y) \rightarrow v$. What's more, the output can be fully represented with just one image, not a set of images. Thus, there's not much advantage in using a function to represent an image. 

1: [Compositional Pattern Producing Networks: A Novel Abstraction of Development](https://link.springer.com/content/pdf/10.1007/s10710-007-9028-8.pdf)

## Expand a 2D image

A possible way out is to "expand" a 2D image. 

For example, we can add color channels. We can estimate an implicit representation $\Phi: (x,y, \lambda) \rightarrow \mathbf{v}$ from multiframe input or CASSI-style input.    
In this way, the images at different $\lambda$ are closely correlated, but not easily transformed into one another. The implicit representation $\Phi$ can easily be projected to an image at a given $\lambda$, while the reverse is hard. That should be a better analogy.   
I don't know if there is already a paper about this. I'm searching for one.

Of course we can think of more ways to expand an image. An existing example is estimating a BTF (Bidirectional Texture Function) with a neural network<sup>1</sup>. I'm still thinking about more options.

1: [Neural BTF Compression and Interpolation](http://rgl.epfl.ch/publications/Rainer2019Neural)

# Neural Image Representation

An imaging system consists of three components: 
 

1.   The analog-to-digital layer, which converts the physical field distribution into discrete numbers. This may be, for example, a focal lens combined with a color filter array. At the end of this layer we measure $g = Hf$, where $f$ is the object radiance distribution. Various nonlinearities could generalize this mapping to $g = H(f)$.
2.   A coding layer that translates the measured data into features. JPEG or MPEG coding is an example of such a layer. The representation $\mathbf{D}$ referenced above belongs to this layer.
3. A decoding layer that returns the estimate of the unknown field.
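
As a concrete instance of the linear measurement $g = Hf$ in component 1, the sketch below applies a color filter array to an RGB image. The RGGB layout and the `bayer_mosaic` helper are assumptions made for illustration, and lens blur and noise are left out.

```python
import numpy as np

def bayer_mosaic(f):
    """Toy H for an RGGB color filter array.

    f : (3, H, W) RGB radiance samples, with H and W even
    g : (H, W) single-channel measurement, one color sample per pixel
    """
    g = np.empty(f.shape[1:], dtype=f.dtype)
    g[0::2, 0::2] = f[0, 0::2, 0::2]   # red sites
    g[0::2, 1::2] = f[1, 0::2, 1::2]   # green sites on red rows
    g[1::2, 0::2] = f[1, 1::2, 0::2]   # green sites on blue rows
    g[1::2, 1::2] = f[2, 1::2, 1::2]   # blue sites
    return g

rng = np.random.default_rng(0)
f1 = rng.random((3, 4, 4))
f2 = rng.random((3, 4, 4))

# H is linear: H(a*f1 + b*f2) equals a*H(f1) + b*H(f2).
lhs = bayer_mosaic(2.0 * f1 + 3.0 * f2)
rhs = 2.0 * bayer_mosaic(f1) + 3.0 * bayer_mosaic(f2)
print(np.allclose(lhs, rhs))
```

This $H$ keeps one of every three input values, so the decoding layer has to supply the other two thirds from prior structure in the scene.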

Conventionally, the coding layer is considered only for compression, but DeepVoxels uses the coding layer to help with image estimation. We cannot estimate the image directly from measured data; better results are obtained if we find features before training for inversion. 



# Thoughts and plans
#### after discussion with Minghao about Dr. Brady's new notes 
#### Qian
#### 4/6/21


## thoughts

Feature representation is useful when the desired pixel maps are redundant but cannot easily be transformed into one another, e.g., viewpoints in view synthesis tasks $(x, y, v)$, channels in hyperspectral tasks $(x, y, \lambda)$, and frames in video tasks $(x, y, t)$. These are all high-dimensional (HD) data that span more than the spatial dimensions.

Like the input of a view synthesis task, the input data of a multiframe system is also HD, as it involves the time dimension. In view of the advantages of the "deep voxel" representation, we should be able to improve components 2 and 3 by finding a better representation of a multiframe scene than pixel maps. 

A problem in multiframe estimation tasks like demosaicing and super-resolution is that only one good-looking version of the final result is needed, not all possible estimates. For example, a system that produces all demosaiced estimates of the same image isn't more interesting than a system that produces only one of the best estimates. This lessens the need for a demosaicing or super-resolution feature that involves a third "demosaicing dimension" or "resolution dimension". 

I think a practical way to produce meaningful outputs is to still combine those tasks with time, channel, or viewpoint. For example, as "multiframe" naturally involves time, we can design a demosaicing system that takes sensor data at timestamps 1 to T and a target timestamp t, and outputs the color pixel map at time t. Or we can design a demosaicing system that takes sensor data at timestamps 1 to T and outputs the color pixel maps at all times 1 to T. The intermediate feature should be a compact representation that can be used to estimate all potential outputs, which are in color but redundant in content.
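
In the notation of the pipeline at the top of these notes, the first variant can be written as follows, where $\mathbf{X}_\tau$ (the sensor frame at timestamp $\tau$) and $\mathbf{Y}_t$ (the color pixel map at time $t$) are labels introduced here only for illustration:
\begin{align*}
    \mathbf{D} &= \mathcal{F}(\{\mathbf{X}_\tau\}_{\tau=1}^{T}) \\
    \mathbf{Y}_t &= \mathcal{G}(\mathbf{D}, t)
\end{align*}
Here the target timestamp $t$ plays the role of the viewpoint $\mathbf{V}$, and the second variant simply evaluates $\mathcal{G}$ at every $t = 1, \dots, T$.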

Given the new definition of the coding layer, estimation may be done in component 2, component 3, or both. If we leave estimation solely to component 2, the feature should be able to produce all the pixel maps we want nearly losslessly. If we only estimate in component 3, the feature should put the input data in the best position to accomplish the estimation task. The latter may require task-specific and physically based features (e.g., deep voxel/multiplane image (MPI) for view synthesis and [deep voxel flow](https://openaccess.thecvf.com/content_iccv_2017/html/Liu_Video_Frame_Synthesis_ICCV_2017_paper.html) for video interpolation). In tasks like demosaicing and super-resolution, physically based feature representations are more ambiguous. I'll investigate their feasibility.

## plans

I plan to start with a basic demosaicing system following the thoughts above. Admittedly, although the general components are the same (coding layer and decoding layer), there are many possible variants in implementation. As a first try, let the coding layer do compression and estimation and the decoding layer do reconstruction: train a decoding layer first, freeze it, and train a coding layer on top of it.

data size: *n_frames x n_channels x H x W*

* input: multiframe Bayer data X, *16x64x64*
* output: colored estimation compact representation Y_comp, size undefined.

1. Generate 16-frame video libraries using Blender to get ground-truth frames: multiframe colored estimation Y, *16x3x512x512*.
2. Use the forward model to generate the input Bayer data.
3. Prepare Y_comp. Design and train a decoder inside an autoencoder. Inputs and outputs are Y, and the loss can be MSE.
4. Design and train the estimation network, mapping X -> Y_comp.

So far, steps 1 and 2 are finished.
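
The order of steps 3 and 4 (train the decoder, freeze it, then train the estimator) can be sketched with linear stand-ins. Everything here is an assumption for illustration: the data are synthetic, PCA plays the role of the autoencoder, least squares plays the role of the estimation network, and plain subsampling replaces the Bayer forward model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 "scenes" of dimension 48 lying on a 5-dim subspace.
n_scenes, y_dim, latent_dim = 200, 48, 5
basis = rng.standard_normal((latent_dim, y_dim))
Y = rng.standard_normal((n_scenes, latent_dim)) @ basis   # ground truth Y

# Stand-in forward model: keep every third entry of Y.
X = Y[:, ::3]                                              # (200, 16)
train, test = slice(0, 150), slice(150, 200)

# Step 3: linear autoencoder (PCA) trained on Y with an MSE objective.
mean = Y[train].mean(axis=0)
_, _, Vt = np.linalg.svd(Y[train] - mean, full_matrices=False)
components = Vt[:latent_dim]          # decoder weights, frozen from here on

def encode(y):
    """Map Y to its compact representation Y_comp."""
    return (y - mean) @ components.T

def decode(z):
    """Frozen decoder: map Y_comp back to Y."""
    return z @ components + mean

Y_comp = encode(Y[train])

# Step 4: estimation network X -> Y_comp (affine least squares here).
def with_bias(x):
    return np.hstack([x, np.ones((x.shape[0], 1))])

W, *_ = np.linalg.lstsq(with_bias(X[train]), Y_comp, rcond=None)

# Inference on held-out scenes: X -> Y_comp -> Y through the frozen decoder.
Y_hat = decode(with_bias(X[test]) @ W)
mse = np.mean((Y_hat - Y[test]) ** 2)
print(f"held-out MSE: {mse:.2e}")
```

The held-out error is near zero only because the synthetic scenes are exactly low-rank and the measurement is linear. With real video data, the size of Y_comp (listed as undefined above) sets the trade-off between compactness and reconstruction quality.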


<font size=1> Note about Component 1: check operations that are allowed or feasible in an optical system.</font>