# Least squares

Let's consider a set of measurements $d_{i}$, $i = 1, \dots, N$, of a given physical quantity. This data set is called the *observed data*. Let's also assume that each observation $d_{i}$ can be properly approximated by a function $y_{i}$, $i = 1, \dots, N$, given by:

$$
y_{i} = a_{i1} \, x_{1} + a_{i2} \, x_{2} + \cdots + a_{iM} \, x_{M} \: ,
$$

where $a_{ij}$ are known variables and $x_{j}$ are unknown variables, $i = 1, \dots, N$, $j = 1, \dots, M$.

Then, by considering the $N$ measurements, we obtain

$$
\begin{split}
d_{1} &\approx \; &a_{11} \, x_{1} + &a_{12} \, x_{2} + \cdots + &a_{1M} \, x_{M} \\
d_{2} &\approx &a_{21} \, x_{1} + &a_{22} \, x_{2} + \cdots + &a_{2M} \, x_{M} \\
\vdots & &\vdots & \vdots &\vdots\\
d_{N} &\approx &a_{N1} \, x_{1} + &a_{N2} \, x_{2} + \cdots + &a_{NM} \, x_{M}
\end{split}
$$

or, equivalently,

$$
\mathbf{d} \approx \mathbf{y} = \mathbf{A} \, \mathbf{x} \: ,
$$

where

$$
\mathbf{x} = 
\left[ \begin{array}{c}
x_{1} \\
x_{2} \\
\vdots \\
x_{M}
\end{array} \right] \: ,
$$

$$
\mathbf{y} = 
\left[ \begin{array}{c}
y_{1} \\
y_{2} \\
\vdots \\
y_{N}
\end{array} \right] \: ,
$$

$$
\mathbf{d} = 
\left[ \begin{array}{c}
d_{1} \\
d_{2} \\
\vdots \\
d_{N}
\end{array} \right]
$$

and

$$
\mathbf{A} = 
\left[ \begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1M} \\
a_{21} & a_{22} & \cdots & a_{2M} \\
\vdots & \vdots &        & \vdots \\
a_{N1} & a_{N2} & \cdots & a_{NM}
\end{array} \right] \: .
$$
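As a concrete illustration of the matrix form $\mathbf{y} = \mathbf{A} \, \mathbf{x}$, the sketch below builds a small example with NumPy. The straight-line model (each row of $\mathbf{A}$ equal to $[1, t_{i}]$), the values in `t`, and the parameter vector `x` are all assumptions made for illustration only; they do not come from the text above.

```python
import numpy as np

# Hypothetical example: N = 5 measurements, M = 2 unknowns.
# Row i of A is [1, t_i], so y_i = x_1 + x_2 * t_i (a straight line).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A = np.column_stack([np.ones_like(t), t])  # shape (N, M) = (5, 2)

x = np.array([1.0, 2.0])  # assumed parameter vector
y = A @ x                 # predicted data, one value per measurement

print(A.shape)  # (5, 2)
print(y)        # [1. 3. 5. 7. 9.]
```

Each element of `y` is the dot product of one row of `A` with `x`, which is exactly the sum $a_{i1} \, x_{1} + \cdots + a_{iM} \, x_{M}$.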

Now, consider the problem of determining the vector $\mathbf{x}$ from the measurements $\mathbf{d}$ and the $N \times M$ matrix $\mathbf{A}$. Mathematically, this problem amounts to determining a vector $\mathbf{x} = \mathbf{x}^{\ast}$ producing a $\mathbf{y}$ *"as close as possible"* to $\mathbf{d}$. To solve this problem, we need to define what *"as close as possible"* means. The notion of *"closeness"* is intrinsically related to the notion of *"distance"* and, consequently, to the notion of <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)">norm</a>.

For example, let's consider a vector $\mathbf{r} = \mathbf{d} - \mathbf{y}$, which is defined as the difference between the measurements $\mathbf{d}$ and the model $\mathbf{y}$. The length of this vector can be determined by the following scalar function:

$$
\begin{split}
\| \mathbf{r} \|_{2} &= \sqrt{\mathbf{r}^{\top}\mathbf{r}} \\
&= \sqrt{\sum \limits_{i = 1}^{N} r_{i}^{2}}
\end{split} \: ,
$$

where $r_{i} = d_{i} - y_{i}$. This function is a norm that quantifies the "*distance*" between the vectors $\mathbf{d}$ and $\mathbf{y}$. It is called the <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm">Euclidean norm</a>. Notice that this function equals zero if $\mathbf{d} = \mathbf{y}$ and is greater than zero if $\mathbf{d} \ne \mathbf{y}$.
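The two expressions for the Euclidean norm can be checked numerically. The sketch below uses made-up vectors `d` and `y` (assumptions chosen so that the result is a round number) and compares both formulas with NumPy's built-in `np.linalg.norm`.

```python
import numpy as np

d = np.array([4.0, 6.0, 1.0])  # hypothetical observed data
y = np.array([1.0, 2.0, 1.0])  # hypothetical predicted data

r = d - y                            # residual vector [3, 4, 0]
norm_dot = np.sqrt(r @ r)            # sqrt(r^T r)
norm_sum = np.sqrt(np.sum(r ** 2))   # sqrt of the sum of squared elements
norm_np = np.linalg.norm(r)          # NumPy's default vector 2-norm

print(norm_dot, norm_sum, norm_np)   # 5.0 5.0 5.0
print(np.linalg.norm(d - d))         # 0.0, the norm vanishes when d equals y
```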

So, the problem of determining a vector $\mathbf{x} = \mathbf{x}^{\ast}$ producing a $\mathbf{y}$ *"as close as possible"* to $\mathbf{d}$ can be thought of as the problem of determining a vector $\mathbf{x} = \mathbf{x}^{\ast}$ that minimizes the Euclidean norm of the difference between the measurements $\mathbf{d}$ and the model $\mathbf{y}$.

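In practical situations, instead of determining the vector $\mathbf{x}^{\ast}$ minimizing the Euclidean norm of $\mathbf{r}$, we determine the vector $\mathbf{x}^{\ast}$ minimizing the squared Euclidean norm of $\mathbf{r}$. Both problems have the same solution, because the square root is a monotonically increasing function, but the squared norm is easier to differentiate. It is given by: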

$$
\begin{split}
\| \mathbf{r} \|_{2}^{2} &= \mathbf{r}^{\top}\mathbf{r} \\
&= \sum \limits_{i = 1}^{N} r_{i}^{2}
\end{split} \: .
$$

Notice that the squared Euclidean norm of $\mathbf{r}$ is a scalar function depending on the unknowns $\mathbf{x}$ and can be written as follows:

$$
\Phi(\mathbf{x}) = \left[ \mathbf{d} - \mathbf{A}\mathbf{x} \right]^{\top}\left[ \mathbf{d} - \mathbf{A}\mathbf{x} \right] \: .
$$
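A minimal sketch of $\Phi(\mathbf{x})$ as a Python function follows. The function name `phi` and the small matrix, data vector, and trial parameters are hypothetical, chosen only so the result is easy to verify by hand.

```python
import numpy as np

def phi(x, A, d):
    """Squared Euclidean norm of the residual d - A x."""
    r = d - A @ x
    return r @ r

# Hypothetical example with N = 3 and M = 2.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
d = np.array([1.0, 2.0, 4.0])
x = np.array([1.0, 1.0])  # trial parameter vector

# A x = [1, 2, 3], so r = [0, 0, 1] and phi = 1.
print(phi(x, A, d))  # 1.0
```

For a fixed $\mathbf{A}$ and $\mathbf{d}$, the value returned changes only with `x`, which is why $\Phi$ is treated as a function of the unknowns.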

Assuming that there is only one vector $\mathbf{x}^{\ast}$ producing the minimum of $\Phi(\mathbf{x})$, we can state that

$$
\Phi(\mathbf{x}^{\ast} + \Delta \mathbf{x})
\begin{cases}
\gt \Phi(\mathbf{x}^{\ast}) \, , \: \text{if} \:\: \| \Delta \mathbf{x}\|_{2} \ne 0 \\
= \Phi(\mathbf{x}^{\ast}) \, , \: \text{if} \:\: \| \Delta \mathbf{x}\|_{2} = 0 \\
\end{cases} \: .
$$
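This property can be probed numerically. The sketch below is an illustration under several assumptions: the data are made up, the matrix has linearly independent columns (so the minimizer is unique), and the minimizer is obtained with NumPy's ready-made solver `np.linalg.lstsq` rather than by the derivation this text is building toward.

```python
import numpy as np

def phi(x, A, d):
    """Squared Euclidean norm of the residual d - A x."""
    r = d - A @ x
    return r @ r

# Hypothetical data: straight-line model with N = 4 and M = 2.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
d = np.array([1.1, 2.9, 5.2, 6.8])

# Ready-made least-squares solver, used here only to obtain x*.
x_star, *_ = np.linalg.lstsq(A, d, rcond=None)

# Evaluate phi at x* plus 1000 random nonzero perturbations.
rng = np.random.default_rng(0)
deltas = rng.normal(size=(1000, 2))
increases = np.array(
    [phi(x_star + dx, A, d) - phi(x_star, A, d) for dx in deltas]
)

print(np.all(increases > 0))  # True
```

Random sampling does not prove the inequality, but a single perturbation that lowered `phi` would show that `x_star` is not the minimizer.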

In addition, because $\Phi(\mathbf{x})$ is differentiable, the minimizer $\mathbf{x}^{\ast}$ must satisfy the following equation:

$$
\nabla \Phi(\mathbf{x}^{\ast}) = 
\left[ \begin{array}{c}
0 \\
\vdots \\
0
\end{array} \right] \: ,
$$

where

$$
\nabla \Phi(\mathbf{x}) = 
\left[ \begin{array}{c}
\dfrac{\partial \, \Phi(\mathbf{x})}{\partial \, x_{1}} \\
\vdots \\
\dfrac{\partial \, \Phi(\mathbf{x})}{\partial \, x_{M}} 
\end{array} \right]
$$

is the gradient of $\Phi(\mathbf{x})$.
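The gradient can be approximated numerically, one partial derivative at a time, before deriving it analytically. The sketch below uses central finite differences; the helper `numerical_gradient`, the step size `h`, and the data are assumptions for illustration, and the minimizer again comes from NumPy's `np.linalg.lstsq`.

```python
import numpy as np

def phi(x, A, d):
    """Squared Euclidean norm of the residual d - A x."""
    r = d - A @ x
    return r @ r

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences."""
    grad = np.zeros_like(x, dtype=float)
    for j in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[j] = h  # perturb only the j-th unknown
        grad[j] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

# Hypothetical data: straight-line model with N = 4 and M = 2.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
d = np.array([1.1, 2.9, 5.2, 6.8])

# Ready-made least-squares solver, used here only to obtain x*.
x_star, *_ = np.linalg.lstsq(A, d, rcond=None)

f = lambda x: phi(x, A, d)
grad_at_min = numerical_gradient(f, x_star)        # close to [0, 0]
grad_elsewhere = numerical_gradient(f, np.zeros(2))  # clearly nonzero

print(grad_at_min)
print(grad_elsewhere)
```

The approximate gradient is close to the null vector at `x_star` and far from it at an arbitrary point, in agreement with the condition $\nabla \Phi(\mathbf{x}^{\ast}) = \mathbf{0}$.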

It is reasonable to think that, to calculate the vector $\mathbf{x}^{\ast}$, we first need to determine the gradient $\nabla \Phi(\mathbf{x})$.