# Spatial Transformer

Introduced in: [Spatial Transformer Networks](https://arxiv.org/abs/1506.02025)

**Convolutional Neural Networks are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner.**

This module allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process.

**Use of spatial transformers allows models to learn invariance to translation, scale, rotation and more generic warping.**

![](../../assets/spatial-transformer-networks-1.png)

### Localisation Network

The localisation network takes the input feature map and outputs the parameters $\theta$ of the transformation to be applied to that feature map.

**NOTE**: The transformation parameters can also be passed on to subsequent layers!

### Parameterised Sampling Grid

The sampling grid defines, for each location of the output feature map (target $t$ in $V$), the location of the input feature map (source $s$ in $U$) from which its value is gathered.

![](../../assets/spatial-transformer-networks-2.png)

If the transformation $\mathcal{T}_{\theta}$ is a 2D affine transformation, it is defined as:

$$
\begin{pmatrix}
x_i^s \\
y_i^s
\end{pmatrix} = \mathcal{T}_{\theta}(G_i) = 
\begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}
\begin{pmatrix}
x_i^t \\
y_i^t \\
1
\end{pmatrix}
$$

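where $\left(x_i^t, y_i^t\right)$ are the target coordinates of the regular grid $G$ in the output feature map and $\left(x_i^s, y_i^s\right)$ are the source coordinates in the input feature map. Both use height- and width-normalised coordinates, so $x_i^t, y_i^t \in [-1, 1]$ within the bounds of the output and $x_i^s, y_i^s \in [-1, 1]$ within the bounds of the input.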
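The affine grid above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `affine_grid` and the choice to place the first and last grid points exactly at $-1$ and $1$ are assumptions made for the example.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map a regular target grid through a 2x3 affine matrix (hypothetical helper).

    Returns source coordinates of shape (out_h, out_w, 2), ordered (x, y),
    in the same normalised [-1, 1] convention as the target grid.
    """
    theta = np.asarray(theta, dtype=np.float64).reshape(2, 3)
    # Regular target grid G: x runs along the width, y along the height.
    xt, yt = np.meshgrid(np.linspace(-1.0, 1.0, out_w),
                         np.linspace(-1.0, 1.0, out_h))
    # Homogeneous target coordinates (x_t, y_t, 1), shape (3, out_h * out_w).
    grid_t = np.stack([xt.ravel(), yt.ravel(), np.ones(out_h * out_w)])
    # (2, 3) @ (3, N) -> (2, N): row 0 holds x_s, row 1 holds y_s.
    grid_s = theta @ grid_t
    return grid_s.T.reshape(out_h, out_w, 2)

# The identity transformation maps every target point onto itself.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
grid = affine_grid(identity, out_h=3, out_w=4)
```

With the identity matrix, the source grid equals the target grid, so the output feature map would be a plain resampling of the input.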

A more constrained class of transformation is the one used for attention:

$$
\begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix} =
\begin{bmatrix}
s & 0 & t_x \\
0 & s & t_y
\end{bmatrix}
$$

allowing cropping, translation and isotropic scaling by varying $s$, $t_x$ and $t_y$.
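As a small worked example of the attention transformation, the sketch below maps the four corners of the target grid through the matrix for one choice of $s$, $t_x$ and $t_y$. The helper name `attention_theta` and the numeric values are illustrative assumptions.

```python
import numpy as np

def attention_theta(s, t_x, t_y):
    """Build the 2x3 attention matrix from scale and translation (hypothetical helper)."""
    return np.array([[s, 0.0, t_x],
                     [0.0, s, t_y]])

# Corners of the target grid in homogeneous coordinates (x_t, y_t, 1).
corners_t = np.array([[-1.0, -1.0, 1.0],
                      [ 1.0, -1.0, 1.0],
                      [-1.0,  1.0, 1.0],
                      [ 1.0,  1.0, 1.0]])

theta = attention_theta(s=0.5, t_x=0.25, t_y=0.0)
# Each row becomes the source location (x_s, y_s) read by that corner.
corners_s = corners_t @ theta.T
```

Here the output reads from the window $x^s \in [-0.25, 0.75]$, $y^s \in [-0.5, 0.5]$ of the input: a crop of half the width and height, shifted to the right.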

Other types of transformation functions are also possible:
- Projective transformation
- 16-point thin plate spline transformation ([REF](https://khanhha.github.io/posts/Thin-Plate-Splines-Warping/))
  - **Well suited to reducing error under elastic deformations**

### Differentiable Image Sampling

To perform a spatial transformation of the input feature map, a sampler must take the set of sampling points $\mathcal{T}_\theta(G_i)$, along with the input feature map $U$ and produce the sampled output feature map $V$:

$$
V_i^c = \sum_n^H \sum_m^W U_{nm}^c k\left(x_i^s-m; \Phi_x\right)k\left(y_i^s-n; \Phi_y\right) \quad \forall i \in [1 \dots H'W'] \quad \forall c \in [1 \dots C]
$$

where:
- $\Phi_x$ and $\Phi_y$ are the parameters of a generic sampling kernel $k()$ that defines the interpolation of the input feature map (i.e. image)
- $U_{nm}^c$ is the value of the input at location $(n, m)$ in channel $c$
- $V_i^c$ is the output value for pixel $i$ at location $\left(x_i^t, y_i^t\right)$ in channel $c$

**NOTE**: The sampling is identical for each channel, so every channel is transformed in the same way.

### Integer sampling

This sampling kernel equates to just copying the value at the nearest pixel in the input feature map.

$$
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \delta \left( \left\lfloor x_i^s + 0.5 \right\rfloor - m \right) \delta \left( \left\lfloor y_i^s + 0.5 \right\rfloor - n \right)
$$

where:
- $\left\lfloor x_i^s + 0.5 \right\rfloor$ rounds $x_i^s$ to the nearest integer
- $\delta()$ is the Kronecker delta function (1 if its argument is 0, and 0 otherwise)
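A minimal NumPy sketch of integer sampling follows. It assumes the source coordinates have already been converted from the normalised range to 0-based pixel units and that the feature map is stored as `(H, W, C)`. The function name `nearest_sample` is hypothetical. Points that round to a location outside the input yield 0, as the double sum has no matching term.

```python
import numpy as np

def nearest_sample(U, xs, ys):
    """Copy the nearest input pixel for each source point (hypothetical helper).

    U:      input feature map, shape (H, W, C).
    xs, ys: arrays of source coordinates in 0-based pixel units (assumption).
    """
    H, W, C = U.shape
    # floor(x + 0.5) rounds to the nearest integer, as in the formula.
    m = np.floor(np.asarray(xs, dtype=np.float64) + 0.5).astype(int)
    n = np.floor(np.asarray(ys, dtype=np.float64) + 0.5).astype(int)
    # The Kronecker deltas only fire for pixels that exist in U.
    inside = (m >= 0) & (m < W) & (n >= 0) & (n < H)
    V = np.zeros(m.shape + (C,), dtype=U.dtype)
    V[inside] = U[n[inside], m[inside]]
    return V

U = np.arange(12.0).reshape(3, 4, 1)   # U[n, m, 0] = 4 * n + m
V = nearest_sample(U, xs=[0.4, 2.6, 5.0], ys=[0.0, 1.4, 1.0])
```

The three points round to pixels $(n, m) = (0, 0)$, $(1, 3)$ and $(1, 5)$; the last one lies outside the 4-pixel-wide input and therefore samples 0.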

### Bilinear sampling

This sampling kernel computes a weighted sum of the values at the pixels surrounding the sampling point in the input feature map.

$$
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - \left \vert x_i^s - m \right \vert ) \max(0, 1 - \left \vert y_i^s - n \right \vert )
$$
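The bilinear kernel can be written almost literally in NumPy. The sketch below evaluates the full double sum over all input pixels, which mirrors the formula but is only practical for small inputs. It assumes source coordinates in 0-based pixel units and an `(H, W, C)` layout. The name `bilinear_sample` is hypothetical.

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Bilinear sampling as a dense double sum (hypothetical helper).

    U:      input feature map, shape (H, W, C).
    xs, ys: arrays of source coordinates in 0-based pixel units (assumption).
    """
    H, W, C = U.shape
    xs = np.asarray(xs, dtype=np.float64)
    ys = np.asarray(ys, dtype=np.float64)
    # Kernel weights max(0, 1 - |x_s - m|) for every column index m.
    wx = np.maximum(0.0, 1.0 - np.abs(xs[..., None] - np.arange(W)))
    # Kernel weights max(0, 1 - |y_s - n|) for every row index n.
    wy = np.maximum(0.0, 1.0 - np.abs(ys[..., None] - np.arange(H)))
    # V[..., c] = sum_n sum_m wy[..., n] * wx[..., m] * U[n, m, c]
    return np.einsum('...n,...m,nmc->...c', wy, wx, U)

U = np.arange(12.0).reshape(3, 4, 1)   # U[n, m, 0] = 4 * n + m
V = bilinear_sample(U, xs=[0.5, 2.0], ys=[0.5, 1.0])
```

The first point sits midway between four pixels and receives their average (2.5); the second falls exactly on pixel $(n, m) = (1, 2)$ and reproduces its value (6.0). Since the weights are piecewise linear in $x_i^s$ and $y_i^s$, the output has well-defined (sub-)gradients with respect to both the input values and the sampling coordinates.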

## Code

None.