# The L1 vs L2 Norm

We have formulated our point cloud registration algorithm as an optimization problem that minimizes the following objective function:

$$
f_1(x, y, z, \theta, \phi, \xi) =
\sum_B g(\left|\mathbf{b}\right|)
\;
(\left|\mathbf{b} - \mathbf{a}_{\textrm{S},\textrm{min}}\right|/\rho(|\mathbf{b}|) - 1)
\;
u(\rho(|\mathbf{b}|) - \left|\mathbf{b} - \mathbf{a}_{\textrm{S},\textrm{min}}\right|)
$$

The objective function $f_1$ uses a weighted sum of the distances between pairs of matching points.  We will refer to this approach as "using the L1 norm".

Alternatively, we could have optimized over the square root of the sum of the squared distances between pairs of matching points.  Up to a normalization by the number of points, this is the RMS error; we will refer to this approach as "using the L2 norm".  If we had, we would be minimizing an objective function with a form similar to this:

$$
f_2(x, y, z, \theta, \phi, \xi) =
\left[ \sum_B g(\left|\mathbf{b}\right|)
\;
\left|\mathbf{b} - \mathbf{a}_{\textrm{S},\textrm{min}}\right|^2
\;
u(\rho(|\mathbf{b}|) - \left|\mathbf{b} - \mathbf{a}_{\textrm{S},\textrm{min}}\right|)
\right]^{1/2}
$$

For some problems it may be appropriate to minimize the L1 norm, and for other problems it may be appropriate to minimize the L2 norm.  Deciding which is more correct completely depends on the context and the problem at hand.
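As a rough sketch of how the two objectives differ, the snippet below evaluates both from precomputed quantities. It assumes that $u$ is the unit step function and that the distances $\left|\mathbf{b} - \mathbf{a}_{\textrm{S},\textrm{min}}\right|$ have already been computed. The names `b_norm`, `d`, `g_flat`, and `rho_const`, along with the flat weight and constant cutoff, are hypothetical placeholders and not the forms used in our algorithm.

```python
import numpy as np


def f1(b_norm, d, g, rho):
    """L1-style objective: weighted sum of (d / rho - 1) over points with d < rho."""
    cutoff = rho(b_norm)
    inside = d < cutoff  # assumed reading of u(rho - d) as a unit step
    return np.sum(g(b_norm)[inside] * (d[inside] / cutoff[inside] - 1.0))


def f2(b_norm, d, g, rho):
    """L2-style objective: root of the weighted sum of d**2 over points with d < rho."""
    cutoff = rho(b_norm)
    inside = d < cutoff
    return np.sqrt(np.sum(g(b_norm)[inside] * d[inside] ** 2))


# Hypothetical placeholder weight and cutoff (not the forms used in our algorithm).
def g_flat(r):
    return np.ones_like(r)


def rho_const(r):
    return np.full_like(r, 2.0)


b_norm = np.array([0.0, 10.0, 20.0])  # |b|: distance of each point from the origin
d = np.array([0.0, 1.0, 3.0])         # |b - a_S,min|: distance to the matched point

print(f1(b_norm, d, g_flat, rho_const))
print(f2(b_norm, d, g_flat, rho_const))
```

In this toy input the third point lies beyond the cutoff, so neither objective sees it; the remaining distances enter `f1` linearly and `f2` quadratically.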

## Background Reading

Here is the abstract of a paper about this:

> L2-norm, also known as the least squares method was widely used in the adjustment calculus. The weaknesses of the least squares method were the effect of gross measurements error on the solution and the disturbance and absorbance of gross error on the solution. L1-norm, also known as the least absolute values method, was affected by almost none or very little from gross error. Therefore L1-norm method, used for parameters estimation in some special case, has been successfully used for outlier measurements detection. In this study, the L1 and L2-norm adjustment methods have been taken relatively to each other's advantages and disadvantages and the numerical application of the two-dimensional similarity coordinate transformation were made and the results of both methods are discussed.

[Full article here](https://www.researchgate.net/publication/228416411_The_comparison_of_L1_and_L2-norm_minimization_methods).  Note that it is not particularly well written.

[This blog post](http://www.chioka.in/differences-between-the-l1-norm-and-the-l2-norm-least-absolute-deviations-and-least-squares/) also has some interesting discussion.

## Robust Results

In our notebook titled "The Ideal Form of G", we saw that our registration algorithm found exactly the correct registration vector, regardless of how many points in the data set we considered.  This was surprising because one would expect the randomness in the peripheral points to affect the registration vector whenever those points were included.

After further investigation, we believe that these results, although surprising, are correct.  We do not believe they are due to an error in our formulation of the problem or in our implementation of the solution.

We believe our registration algorithm was able to find the precise registration vector because our objective function, $f_1$, uses the L1 norm.

In particular, there were enough undistorted points in the center of the volume that the minimum of $f_1$ remained at the exact registration, despite the distorted points.  If the distortions in the outside points had been larger, we believe the registration result would have changed.

This explains why our third example, which had slight amounts of distortion even in the central points, varied slightly as the $g$-cutoff increased.
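The mechanism can be sketched in one dimension, where the offset minimizing the sum of absolute residuals is the median and the offset minimizing the sum of squared residuals is the mean. The residual values below are made up for illustration: seven "central" points with zero error and four distorted "peripheral" points.

```python
import numpy as np

# Hypothetical residuals: seven exact central points, four distorted peripheral ones.
residuals = np.array([0.0] * 7 + [0.8, 1.5, -0.4, 2.1])


def l1_cost(c):
    """Sum of absolute residuals after shifting by a candidate offset c."""
    return np.sum(np.abs(residuals - c))


def l2_cost(c):
    """Square root of the sum of squared residuals after shifting by c."""
    return np.sqrt(np.sum((residuals - c) ** 2))


# Brute-force search over candidate offsets on a 0.001 grid.
candidates = np.linspace(-1.0, 3.0, 4001)
best_l1 = candidates[np.argmin([l1_cost(c) for c in candidates])]
best_l2 = candidates[np.argmin([l2_cost(c) for c in candidates])]

print("L1 minimizer:", best_l1, " median:", np.median(residuals))
print("L2 minimizer:", best_l2, " mean:", np.mean(residuals))
```

Because more than half of the residuals are exactly zero, the L1 minimizer stays at zero no matter how large the four distorted values are, while the L2 minimizer is pulled toward them.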

## Example of L1 Norm's "Robustness"


In order to further verify our result, we reproduced it in a smaller, easier to visualize dataset.

Here is an example 2D case where we can see a similar phenomenon.  Note that the objective function we are optimizing is slightly different; however, we believe the results to be analogous.

In [None]:
%matplotlib inline

from math import sqrt

import scipy.optimize
import matplotlib.pyplot as plt
import numpy as np


def build_f1(points):
    """Return the L1 objective: sum of absolute residuals of y = m*x + b."""
    def f(inputs):
        m, b = inputs
        return sum(abs(m*x + b - y) for x, y in points)
    return f


def build_f2(points):
    """Return the L2 objective: square root of the sum of squared residuals."""
    def f(inputs):
        m, b = inputs
        return sqrt(sum((m*x + b - y)**2 for x, y in points))
    return f


def plot_f1_and_f2(points):
    """Scatter the points and overlay the lines fitted with each norm."""
    x = np.array([x for x, y in points])
    y = np.array([y for x, y in points])

    plt.scatter(x, y, color='r', label='Raw Data')

    # Fit y = m*x + b by minimizing the L1 objective.
    result = scipy.optimize.minimize(build_f1(points), x0=[0, 1])
    m1, b1 = result.x
    plt.plot(x, m1*x + b1, 'b-', label='L1 Norm')

    # Fit the same line model by minimizing the L2 objective.
    result = scipy.optimize.minimize(build_f2(points), x0=[0, 1])
    m2, b2 = result.x
    plt.plot(x, m2*x + b2, 'g-', label='L2 Norm')
    plt.legend(loc='upper left')

In [None]:
plot_f1_and_f2([(-4, -3), (-3, -5), (-2, -2), (-1, -1), (0, 0), (1, 1), (2, 2), (3, 6), (4, 3)]) 

Notice how the line derived from optimizing the L2 norm is tilted because of the "distorted" points in the periphery, while the line derived from the L1 norm is not.
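As a numerical cross-check of the plot, the sketch below fits the same nine points without a general-purpose minimizer. This is a swapped-in technique, not the code used above: the L1 fit is posed as a linear program (the standard least-absolute-deviations formulation) and solved with `scipy.optimize.linprog`, and the L2 fit uses `np.polyfit`, which has the same minimizer as the square-rooted objective.

```python
import numpy as np
from scipy.optimize import linprog

points = [(-4, -3), (-3, -5), (-2, -2), (-1, -1), (0, 0),
          (1, 1), (2, 2), (3, 6), (4, 3)]
x = np.array([p[0] for p in points], dtype=float)
y = np.array([p[1] for p in points], dtype=float)
n = len(x)

# L2 fit: ordinary least squares for y = m*x + b.
m2, b2 = np.polyfit(x, y, 1)

# L1 fit as a linear program over the variables [m, b, t_1, ..., t_n]:
#   minimize sum(t_i)  subject to  -t_i <= m*x_i + b - y_i <= t_i
c = np.concatenate([[0.0, 0.0], np.ones(n)])
A_upper = np.column_stack([x, np.ones(n), -np.eye(n)])    #  m*x + b - t <= y
A_lower = np.column_stack([-x, -np.ones(n), -np.eye(n)])  # -m*x - b - t <= -y
res = linprog(c,
              A_ub=np.vstack([A_upper, A_lower]),
              b_ub=np.concatenate([y, -y]),
              bounds=[(None, None)] * 2 + [(0, None)] * n,
              method="highs")
m1, b1 = res.x[:2]

print(f"L1 fit: slope={m1:.3f}, intercept={b1:.3f}")
print(f"L2 fit: slope={m2:.3f}, intercept={b2:.3f}")
```

For these points the L1 fit is the line $y = x$ through the five central points, while the L2 fit has slope $67/60 \approx 1.117$ and intercept $1/9 \approx 0.111$.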

## L2 vs L1 for Our Registration Problem

We believe that the comment in the abstract of *Potter 2000*:

> Registration techniques based on matching of landmark points located far from the magnet isocenter are especially prone to MR distortions.

is made based on conclusions drawn from various groups that used the L2 norm when performing their registration.

> The evaluation of the registration accuracy with landmark techniques is a topic of some controversy. A commonly used definition of the registration error in landmark registration techniques is the root-mean-square (RMS) distance between corresponding landmarks. The error is then estimated only at the landmark locations, and is assumed uniform throughout the registered volume.

As far as we can tell, all of the papers they reference use the L2 norm as well.

**We believe that for this registration problem, the L1 norm is much more appropriate than the L2 because we know that there is a certain bounded amount of distortion in points that are far from the origin.**

The L2 norm is inappropriate for this problem because it weights outliers much more heavily than the L1 norm does.  We know that the external points have significant distortion, so we certainly do not want to weight them more than the central points.
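
One way to see the difference in weighting is to look at how a single point's contribution changes with its residual. Here $r$ is a symbol introduced only for this illustration: the signed distance error of one point. It is not part of the formulation above.

$$
\frac{\partial}{\partial r}\,|r| = \operatorname{sign}(r),
\qquad
\frac{\partial}{\partial r}\,r^2 = 2r
$$

Under the L1 norm, every point pulls on the solution with the same strength regardless of how far off it is.  Under the L2 norm, the pull grows in proportion to the residual, so a point with three times the error pulls three times as hard.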




## Why Even Use the Outside Points?

Really there are two questions:

1. Should we use the L1 or L2 norm?
2. How many points should we use?

As mentioned, we believe we should use the L1 norm.  We also believe that we should weight the points in the middle more heavily by using a $g$ that tapers off as we leave the origin.  Finally, we believe that we should ignore points that are too far away from their "expected position" by using a $\rho$ that grows larger as we move further away from the origin.

By combining the effects of $g$, $\rho$, and the L1 norm, we believe we will be able to robustly find the proper registration, despite there being distortion in the points that are further from the center.

However, one may ask why we should even take the risk: why not use the L1 norm on the central points alone?

We are less certain of this conclusion, but we suspect that by taking into account the full set of points, our algorithm will be

- more robust to errors and noise in the point locations near the isocenter
- more robust to false positives near the isocenter
- more robust to missing points (false negatives) near the isocenter
- more robust to larger displacements of the phantom (i.e. it could be off by a full 1.5 cm and our algorithm should still be able to find the proper central location).

All of that being said, we suspect that none of these items will be a problem in practice, and as a result, it will likely be more or less equivalent to use just the central points or the entire set of points.