I implement the "ResNet-LDDMM: Advancing the LDDMM Framework using Deep Residual Networks" paper found here: https://arxiv.org/abs/2102.07951. The paper uses deep residual networks together with a kinetic-energy minimisation principle to learn diffeomorphic registrations, with performance the authors claim exceeds other non-deep-learning approaches. I implemented it out of interest in using deep nets as ODE solvers: the paper proves an equivalence between resnets and forward-Euler schemes, with a straightforward way to implement kinetic-energy minimisation. Thanks to the flexibility of neural networks, the full shape morphism can be reconstructed from the learnt net. Below are my notes from reading the paper along with personal tests; this repo also includes a full TensorFlow implementation so you can run your own registrations using Chamfer's distance, provided you have .obj data.
A taster of what can be generated with an L2 loss:
Image registration is the process of finding a set of coordinates describing all possible deformations of a shape, given the shape's properties. It's heavily used in statistical medical analysis and in medical diagnosis. For example, if different equipment at different medical facilities observes a single patient's lungs, registering the patient's lungs allows far more seamless integration of the varying medical data, especially if the data is observed at varying times. The registration also allows for finding the most likely trajectory the shape took through time to get from its source shape to its target shape. The picture at the start of the blog illustrates the register of a fencer through a few seconds of time.
LDDMM methods have recently been a popular and highly successful family of methods for estimating such flows. They exploit the group structure of the manifold of diffeomorphisms of the ambient space: a deformation is built by composing many small smooth displacements, and an optimal registration is a minimal-energy path through this group.
The image at the start of this post shows what's possible with ResNet-LDDMM. Due to having a wee little laptop, I instead use a different dataset to show the power of LDDMMs; however, there is no non-computational reason you couldn't produce the above image. MIT's MoSculp uses neural networks to find points on moving objects, which it then tracks and forms into a 4D (space and time) model. Effectively, it registers the observed flow of the shape, as is the topic of this post. It requires a video as input and produces the time-sculpture as output. This paper lets you do exactly what MoSculp does, but requires only the first and last video frames. ResNet-LDDMM fills in all the rest!
Let $v_t$ be a time-dependent velocity field on $\mathbb{R}^d$, and let $\phi_t$ be the flow it generates, i.e. $\frac{\partial \phi_t}{\partial t} = v_t(\phi_t)$ with $\phi_0 = \mathrm{id}$, for all $t \in [0, 1]$.
The dynamics of the flow have some self-imposed conditions due to the nature of the problem. Generally speaking, a medical scan of some body part (like an arm) is a shape that wouldn't (topologically) change the number of holes it has, or form angular kinks; instead, as the shape moves, it maintains smooth curvature and its topology (flexing the arm doesn't create holes, and precise angles don't form at a small enough scale). These restrictions result in the shape having a diffeomorphic flow. This imposes a restriction on the velocity fields $v_t$: they must be regular enough that the flow they generate stays smooth and invertible.
LDDMM computes diffeomorphic transformations by integrating the velocity field $v_t$ over time: the flow $\phi_t$ solves the ODE above, and the registration is the endpoint $\phi_1$ applied to the source shape.
The second term represents the integral over time of the squared norm of the velocity field generated by the flow, i.e. the kinetic energy of the deformation.
Minimising these two terms over all the admissible dynamics defined by such velocity fields yields the registration: a deformation that matches the target while spending as little energy as possible.
LDDMM exploits the fact that the space of velocity fields integrable (over $[0, 1]$) in a suitable norm forms a well-behaved Hilbert space, so the minimisation can be carried out with standard variational tools.
LDDMM seeks to, under some assumptions, (1) minimise the sum of the data and regularisation terms, and (2) implicitly solve the ODE describing the shape motion.
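Written out (notation mine, following the standard LDDMM formulation): $v_t$ is the time-dependent velocity field, $\phi_t$ the flow it generates, $q_0$ the source shape, $D$ a shape-discrepancy metric, and $\lambda$ a trade-off weight:

```latex
\min_{v}\;
\underbrace{D\big(\phi_1 \cdot q_0,\; q_{\mathrm{target}}\big)}_{\text{data term}}
\;+\; \lambda
\underbrace{\int_0^1 \lVert v_t \rVert_V^2 \, dt}_{\text{regularisation}}
\quad \text{subject to} \quad
\dot{\phi}_t = v_t(\phi_t), \;\; \phi_0 = \mathrm{id}.
```

The constraint is the ODE of point (2); the two terms in the objective are exactly the data and regularisation terms of point (1).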
As proposed by Sylvain Arguillere, Boulbaba Ben Amor, and Ling Shao, this can be very neatly formulated into the language of neural nets:
(1) The data + regularisation terms can be minimised via gradient descent by considering them as the loss function.
(2) A special class of neural net, residual nets, propagates information similarly to the iterative forward-Euler method for approximating solutions to ODEs.
Hence, residual nets can be used to solve ODEs, specifically finding the dynamics which describe suitable shape registrations.
Let $f_{\theta_k}$ denote the function computed by the $k$-th residual block. The output of a residual block given signal $x_k$ is $x_{k+1} = x_k + \epsilon\, f_{\theta_k}(x_k)$, which is exactly one forward-Euler step of the ODE $\dot{x} = f_\theta(x)$ with step size $\epsilon$. Therefore, by stacking residual blocks on top of one another, one can construct a discretised ODE solver that learns the most appropriate velocity field for the task at hand.
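The equivalence can be sketched numerically. Below is a minimal numpy sketch (names mine) where the "learnt" layer is replaced by a fixed function $f(x) = -x$, whose exact flow is $x(t) = x_0 e^{-t}$, so we can check the stacked blocks against the true solution:

```python
import numpy as np

def residual_block(x, f, eps):
    """One residual block: x_{k+1} = x_k + eps * f(x_k).
    This is exactly one forward-Euler step of dx/dt = f(x)."""
    return x + eps * f(x)

# Illustrative stand-in for a learnt layer (choice mine):
# f(x) = -x, whose exact flow is x(t) = x0 * exp(-t).
f = lambda x: -x

N = 100          # number of stacked blocks = Euler steps over t in [0, 1]
eps = 1.0 / N    # step size
x = np.array([1.0])
for _ in range(N):
    x = residual_block(x, f, eps)

print(float(x[0]))  # close to exp(-1) ≈ 0.368
```

Doubling the number of blocks halves the step size, and the composition converges to the true flow, which is the sense in which depth plays the role of integration time.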
The prior section shows how residual blocks can be composed to solve an ODE, with the loss function favouring solutions that have low kinetic energy while eventually matching the target shape. To finish the equivalence with LDDMM, one requires that the search space of velocity fields generated by the blocks contains only fields whose flows are smooth and invertible.
Considering two inputs $x$ and $y$, the difference of the block outputs is bounded: $\|(x + \epsilon f_\theta(x)) - (y + \epsilon f_\theta(y))\| \le (1 + \epsilon L)\|x - y\|$, where $L$ is a Lipschitz constant of $f_\theta$ (ReLU networks with bounded weights are Lipschitz). And so the block is Lipschitz continuous with respect to its input.
Each block is further invertible, and this is determined by the Picard-Lindelof theorem. Due to the Lipschitz continuity above, the underlying ODE has a unique solution through every point, so its flow map is invertible; for a small enough step size $\epsilon$, the discrete block inherits this invertibility.
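The Lipschitz bound can be checked empirically. Here is a toy sketch (weight matrix and names are mine, not from the paper) of a single ReLU residual block, verifying that output differences never exceed $(1 + \epsilon\, \sigma_{\max}(W))\|x - y\|$:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# A toy residual block with a fixed, made-up weight matrix W.
W = np.array([[0.5, -0.2],
              [0.1,  0.3]])
eps = 0.1

def block(x):
    return x + eps * relu(W @ x)

# ReLU is 1-Lipschitz, so the block's Lipschitz constant is at most
# 1 + eps * sigma_max(W), where sigma_max is the spectral norm of W.
L = 1.0 + eps * np.linalg.svd(W, compute_uv=False)[0]

# Empirically check the bound on random input pairs.
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(1000):
    x, y = rng.normal(size=(2, 2))
    worst = max(worst, np.linalg.norm(block(x) - block(y))
                       / np.linalg.norm(x - y))

print(worst <= L)  # True: no pair violates the bound
```

Keeping $\epsilon \, \sigma_{\max}(W)$ small is also what keeps the discrete block invertible, which is why step size matters here and not just depth.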
There is one final note on why the term diffeomorphism has been used with some caveats. The ReLU activation function isn't smooth, and so the layer ends up being invertibly Lipschitz continuous rather than truly diffeomorphic. To preserve true diffeomorphisms, other (smooth) activation functions such as tanh or softplus could be used instead.
The neural network ends up having a fairly nice shape. Given an input point cloud of $n$ points in $\mathbb{R}^d$, each residual block maps $\mathbb{R}^{n \times d}$ to itself, so blocks can be stacked freely and depth plays the role of integration time.
We've shown that the residual network's topology offers the ability to solve ODEs via its equivalence to the forward-Euler numerical method. The network itself has constraints to satisfy before it can be adapted to other topologies for other numerical schemes; however, the proven property of invertible Lipschitz continuity is perfect for image-registration purposes when the network blocks have the same input and output size.
The loss function is chosen to be composed of a data term and a regularisation term: the first drives the diffeomorphism to eventually match the target shape, and the second selects a diffeomorphic path that minimises kinetic energy. The paper's data term wasn't properly defined here; due to issues in implementation, I omit those metrics and use a simpler one instead.
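As a concrete sketch (names and the weighting parameter `lam` are mine), the two-term loss can be written in a few lines of numpy, with the regulariser as a Riemann sum over the per-layer velocities:

```python
import numpy as np

def kinetic_energy(velocities, eps):
    """Discretised regulariser: sum_k eps * ||v_k||^2, approximating
    the integral of the squared velocity-field norm over time."""
    return sum(eps * float(np.sum(v ** 2)) for v in velocities)

def registration_loss(pred, target, velocities, eps, lam=1.0):
    """Data term (a plain L2 match, as in my simpler tests) plus the
    kinetic-energy regulariser weighted by lam."""
    data = float(np.sum((pred - target) ** 2))
    return data + lam * kinetic_energy(velocities, eps)

# A perfect match reached with zero displacement costs nothing:
pts = np.zeros((4, 2))
print(registration_loss(pts, pts, [np.zeros((4, 2))], eps=0.1))  # 0.0
```

In the full implementation `velocities` would be the per-block outputs $f_{\theta_k}(x_k)$, and the data term would be swapped for Chamfer's distance as discussed below.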
Hating to also be a tease, I have a slightly weaker computer than the paper's authors! My diffeomorphisms are slightly less complex and far less exciting; I do, however, wish to show that the net does what it says it does! All unspecified net parameters are those used in the paper.
Here is a diffeomorphism between a smile and a neutral facial expression.
Tada! Nice and diffeomorphic.
In the above, I end up using the simple L2 norm. The main issue with this is that the correspondence between input and output coordinates must be determined prior to learning the net. This is practically infeasible: given two (say, scanned) point clouds of an object, one would have no way of knowing which point corresponds to which without manual labelling or some other algorithm to compute the correspondences. To avoid this issue, a different loss function is used, one which is invariant to arbitrarily re-ordering the point clouds. This makes the loss far more versatile, since target point-cloud locations don't need specification with respect to input point clouds.
Specifically, Chamfer's distance is $d_{CD}(X, Y) = \sum_{x \in X} \min_{y \in Y} \|x - y\|_2^2 + \sum_{y \in Y} \min_{x \in X} \|x - y\|_2^2$.
Therefore it's a generalised L2 norm, except each point of X and Y is compared with its nearest neighbour in the other cloud rather than with a fixed partner. Some examples of 3D registrations using this loss:
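For reference, the whole loss fits in a couple of numpy lines (a sketch of the formula above, not the repo's TensorFlow code):

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between point clouds X (n, d) and Y (m, d).
    Each point is matched to its nearest neighbour in the other cloud, so no
    point-to-point correspondence is needed in advance."""
    # Pairwise squared distances, shape (n, m), via broadcasting.
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).sum() + d2.min(axis=0).sum())

X = np.array([[0.0, 0.0], [1.0, 0.0]])
print(chamfer_distance(X, X))        # 0.0: identical clouds match perfectly
print(chamfer_distance(X, X + 0.5))  # > 0: a shifted cloud is penalised
```

Note the value is unchanged under any permutation of the rows of X or Y, which is exactly the correspondence-invariance that makes it usable on raw scans.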
Doing more complicated image registrations does come with limitations. In my tests, it's very fiddly having to find the optimal number of neural-network layers and the optimal weighting parameters. In the prior image, even after trying many alternative parameters and network depths, the leg of the human isn't registered correctly: for some reason the kinetic energy is minimised by having the two legs tie at the knee and switch position. This lack of clarity and precision is quite unfortunate. Given the diffeomorphic limitations of the net, one further suggestion might be to limit the gradients of each layer manually so that tangling doesn't occur.
- The neural-net topology itself can be modified to fit arbitrary ODE numerical methods for solving problems that require invertibility and diffeomorphic motion. Forward-Euler schemes aren't the only type!
- Overfitting doesn't exist in this problem (each registration is an optimisation for a single pair of shapes), so training for as many epochs as possible is best!
- Learning the register of two images gives far more data from few samples, improving net generalisability as a form of data augmentation. Alternatively, the loss function could include something like Chamfer's distance between the samples and, further, their register.
- The width of the residual blocks determines how finely the underlying space of approximated derivatives can be resolved.