<a href="https://colab.research.google.com/github/aecins/tutorials/blob/main/least_squares/weighted_least_squares.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})

# Create a weighted least squares problem ([wiki](https://en.wikipedia.org/wiki/Weighted_least_squares))
We set up a simple linear least squares problem by creating values for:
- true parameter values $x$ (sampled from uniform distribution)
- independent variables $A$ (sampled from uniform distribution)
- dependent variables $b$ (constructed as $Ax + \mathcal{N}(0, \Sigma)$ where $\Sigma$ is a diagonal matrix i.e. residuals are uncorrelated but have different variances)
$$
\Sigma =
\begin{bmatrix}
\sigma_0^2 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \sigma_n^2 \\
\end{bmatrix}
$$

This satisfies the assumptions that weighted least squares makes about the measurement errors:
- Measurement errors are uncorrelated
- Measurement errors have different variances ([heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity))
- Measurement errors are zero-mean and normally distributed

These assumptions can be summarized as a single assumption:
- the measurement error vector is drawn from a multivariate Gaussian distribution with zero mean and a diagonal covariance matrix $\Sigma$.
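
As a small illustration of this single assumption, the sketch below draws noise vectors from a zero-mean Gaussian with a diagonal covariance and checks that the empirical covariance is close to $\Sigma$. The sigma values, the number of draws, and the seed are arbitrary choices made for this example and are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical per-measurement standard deviations (illustrative values).
sigmas = np.array([0.5, 1.0, 2.0, 4.0])
Sigma = np.diag(sigmas**2)  # diagonal covariance matrix

# Draw many noise vectors from N(0, Sigma); each row is one draw.
num_draws = 200_000
noise = rng.multivariate_normal(np.zeros(len(sigmas)), Sigma, size=num_draws)

# Off-diagonal entries should be near zero and diagonal entries near sigma_i^2.
empirical_cov = np.cov(noise, rowvar=False)
print(np.round(empirical_cov, 2))
```

Because $\Sigma$ is diagonal, sampling each component independently with its own standard deviation (as `generate_Ab` does further down) should give the same distribution.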

In [None]:
# Function that generates random values drawn from uniform distribution with a given range.
def generate_random_uniform(y_size, x_size, value_min, value_max):
    assert(value_max > value_min)

    values = np.random.rand(y_size, x_size)
    values = values * (value_max - value_min)
    values = values + value_min

    return values

In [None]:
# Create true parameters of a linear least squares problem by sampling them uniformly
# in the range [-X_TRUE_MAGNITUDE; X_TRUE_MAGNITUDE]
NUM_PARAMETERS = 4
X_TRUE_MAGNITUDE = 10

x_true = generate_random_uniform(NUM_PARAMETERS, 1, -X_TRUE_MAGNITUDE, X_TRUE_MAGNITUDE)
print(x_true)

[[-2.941]
 [-3.764]
 [ 1.385]
 [-2.768]]


In [None]:
# Function that generates random independent variables (A) and corresponding noisy dependent variables (b)
# for a linear least squares problem.
def generate_Ab(x_true, num_measurements, A_magnitude, b_noise_sigmas):
    # First generate independent variables aka A matrix.
    # These are generated from a uniform distribution in the range [-A_magnitude; A_magnitude]
    A = generate_random_uniform(num_measurements, len(x_true), -A_magnitude, A_magnitude)

    # Next generate dependent variables aka vector b.
    # These are generated as values predicted by the true values of the model + measurement noise
    measurement_noise = (np.random.normal(0, b_noise_sigmas, [num_measurements, 1]))
    b = A.dot(x_true) + measurement_noise

    return [A, b]

# Create measurement sigmas.
NUM_MEASUREMENTS = 10000
B_NOISE_SIGMA_MAGNITUDE = 1000000
b_sigmas = generate_random_uniform(NUM_MEASUREMENTS, 1, 0, B_NOISE_SIGMA_MAGNITUDE)

# Generate measurements.
A_MAGNITUDE = 10
[A, b] = generate_Ab(x_true, NUM_MEASUREMENTS, A_MAGNITUDE, b_sigmas)

# Validate residuals
The noise added to each measurement is sampled from a zero mean Gaussian distribution with different sigmas. The set of all noise measurements can be considered to be a sample from a [Gaussian mixture model (GMM)](https://en.wikipedia.org/wiki/Mixture_model) where all components have zero mean and are equally probable. The mean and standard deviation of such Gaussian mixture model are:
$$
\mu_{gmm} = 0 \\
\sigma_{gmm} = \sqrt{\dfrac{1}{N}\sum_{i} \sigma_i^2}
$$
(see this [link](https://stats.stackexchange.com/questions/16608/what-is-the-variance-of-the-weighted-mixture-of-two-gaussians) for proof).

We can check that the system was constructed correctly by checking that residuals evaluated at ground truth solution are distributed the same way as the measurement noise.

In [None]:
# Check that the residuals evaluated at true parameter values have the same statistics as measurement noise.
# NOTE: accuracy of the mean and standard deviation estimates depends on NUM_MEASUREMENTS.
residuals_true = A.dot(x_true) - b
residuals_expected_mean = 0
residuals_expected_sigma = np.sqrt(np.mean(np.square(b_sigmas)))
print("Statistics of residuals evaluated at true parameter values")
print("mean  : estimated {:f}, expected {:f}".format(np.mean(residuals_true), residuals_expected_mean))
print("sigma : estimated {:f}, expected {:f}".format(np.std(residuals_true), residuals_expected_sigma))

Statistics of residuals evaluated at true parameter values
mean  : estimated -4342.408709, expected 0.000000
sigma : estimated 576235.420587, expected 572802.760831


# Solve using ordinary least squares
We can attempt to solve the problem using ordinary least squares.

In [None]:
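# Calculate RMSE of parameters.
# NOTE: this divides the L2 norm of the error by the number of parameters n;
# a strict RMSE would divide by sqrt(n) instead.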
def parameter_rmse(x_estimated, x_true):
    assert(len(x_estimated) == len(x_true))
    return np.linalg.norm(x_estimated - x_true) / len(x_estimated)

In [None]:
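# Calculate covariance of the solution using bootstrap.
# NOTE: every iteration draws a fresh A and b from the generative model
# (a Monte Carlo simulation rather than resampling of one fixed dataset).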
NUM_BOOTSTRAP_ITERATIONS = 10000
def bootstrap_covariance(solver_function):
    x_estimates = np.zeros((NUM_PARAMETERS, NUM_BOOTSTRAP_ITERATIONS))
    for i in range(NUM_BOOTSTRAP_ITERATIONS):
        # Generate A and b.
        [A, b] = generate_Ab(x_true, NUM_MEASUREMENTS, A_MAGNITUDE, b_sigmas)

        # Solve.
        x_estimated = solver_function(A, b, b_sigmas)

        # Append to solutions
        x_estimates[:, i] = x_estimated[:, 0]

    return np.cov(x_estimates)

In [None]:
# Solve linear least squares using numpy solver.
def solve_least_squares_numpy(A, b, b_noise_sigmas):
    # NOTE: b_noise_sigmas are ignored
    return np.linalg.lstsq(A, b, rcond=None)[0]

x_estimated = solve_least_squares_numpy(A, b, b_sigmas)
residuals_estimated = A.dot(x_estimated) - b
rmse = parameter_rmse(x_estimated, x_true)
print("Root mean Squared Error of parameters estimated using OLS solution:\n {:f}".format(rmse))
print("Average residual:\n {:f}".format(np.mean(residuals_estimated)))

# Calculate bootstrap covariance of the solution
x_covariance = bootstrap_covariance(solve_least_squares_numpy)
print("Solution covariance estimated using bootstrap:")
print(x_covariance)

Root mean Squared Error of parameters estimated using OLS solution:
 637.022794
Average residual:
 -4401.073490
Solution covariance estimated using bootstrap:
[[ 972482.050 -19447.076 -13566.337  17088.933]
 [-19447.076  970392.787 -2460.736 -917.164]
 [-13566.337 -2460.736  1004987.336 -15665.740]
 [ 17088.933 -917.164 -15665.740  992996.862]]


We see that using the ordinary least squares method to solve this problem leads to:
- high RMS error on the parameters
- high solution covariance

Intuitively, this happens because the optimization problem penalizes all residuals equally, whether they correspond to high-noise or low-noise constraints. As a result, the optimizer may choose a parameter estimate that slightly increases the residual of a low-noise constraint in order to slightly reduce the residual of a high-noise constraint. The true solution, by contrast, has a very low residual on low-noise constraints and a relatively high residual on high-noise constraints.
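
To make this intuition concrete, here is a minimal sketch with a single parameter and two groups of measurements, one with low noise and one with high noise. All values (`x_true`, the two sigma levels, the number of trials, the seed) are assumptions chosen for illustration. It compares how far the equally weighted and the inverse-variance weighted estimates scatter around the true value.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Hypothetical 1-parameter problem: b_i = a_i * x_true + noise_i.
x_true = 3.0
num_measurements = 200
a = rng.uniform(-1.0, 1.0, size=num_measurements)

# Alternate between low-noise and high-noise measurements (illustrative values).
sigmas = np.where(np.arange(num_measurements) % 2 == 0, 0.1, 100.0)
weights = 1.0 / sigmas**2

num_trials = 2000
ols_errors = np.empty(num_trials)
wls_errors = np.empty(num_trials)
for t in range(num_trials):
    b = a * x_true + rng.normal(0.0, sigmas)

    # OLS: minimize sum_i (a_i x - b_i)^2, every residual counts equally.
    x_ols = np.sum(a * b) / np.sum(a * a)

    # WLS: minimize sum_i ((a_i x - b_i) / sigma_i)^2, noisy residuals count less.
    x_wls = np.sum(weights * a * b) / np.sum(weights * a * a)

    ols_errors[t] = x_ols - x_true
    wls_errors[t] = x_wls - x_true

print("OLS estimate error std:", np.std(ols_errors))
print("WLS estimate error std:", np.std(wls_errors))
```

In this setup the equally weighted estimate is dominated by the high-noise measurements, so its spread should be much larger than that of the weighted estimate.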

# Solving weighted least squares
## Residuals
It was shown that the best estimate of the parameters for a weighted least squares problem can be obtained by solving an ordinary least squares problem where each residual is scaled by the inverse of the standard deviation of the corresponding measurement noise [[link](https://en.wikipedia.org/wiki/Weighted_least_squares#:~:text=Aitken%20showed%20that,of%20the%20measurement)]. The $i$-th residual becomes:
$$
r_i = \dfrac{1}{\sigma_i}(Ax - b)_i
$$
Such residuals are called **whitened residuals**. Whitened residuals evaluated at the true parameter values are normally distributed with zero mean and unit variance (assuming all assumptions of weighted least squares hold).
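
The unit-variance property can be checked numerically. The sketch below builds a small synthetic problem (sizes, ranges, and seed are assumptions for illustration, not the notebook's values) and compares the standard deviation of raw and whitened residuals at the true parameters.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed for reproducibility

# Hypothetical problem dimensions and value ranges.
num_measurements, num_parameters = 50_000, 3
A = rng.uniform(-10.0, 10.0, size=(num_measurements, num_parameters))
x_true = rng.uniform(-10.0, 10.0, size=(num_parameters, 1))

# Each measurement gets its own noise standard deviation.
sigmas = rng.uniform(0.1, 5.0, size=(num_measurements, 1))
b = A @ x_true + rng.normal(0.0, sigmas)

# Residuals at the true parameters, before and after whitening.
raw_residuals = A @ x_true - b
whitened_residuals = raw_residuals / sigmas

print("raw residual std     :", np.std(raw_residuals))
print("whitened residual std:", np.std(whitened_residuals))
print("whitened residual mean:", np.mean(whitened_residuals))
```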

If we construct the weight matrix $W$ to be the inverse of the measurement covariance matrix $\Sigma$, then we can express the residual vector in matrix form:
$$
W = \Sigma^{-1}  =
\begin{bmatrix}
\frac{1}{\sigma_0^2} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{1}{\sigma_n^2} \\
\end{bmatrix} \\
r = W^{\frac{1}{2}}(Ax - b)
$$

## Whitening transform
Consider the column vector containing all of the residuals, evaluated at the true parameter values, to be a random variable $\mathbf{r}$. This random variable has covariance matrix $\Sigma$. Transforming this variable by $W^{\frac{1}{2}} = \Sigma^{-\frac{1}{2}}$ creates a new variable $W^{\frac{1}{2}}\mathbf{r}$ that has an identity covariance matrix. A linear transformation that transforms a random variable to have identity covariance is known as a *[whitening transform](https://github.com/aecins/Least-Squares-Notebooks/blob/main/whitening_transform.ipynb)*.
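
A minimal numerical sketch of the whitening transform for a diagonal covariance follows. The sigma values, sample count, and seed are illustrative assumptions. For diagonal $\Sigma$, the matrix $\Sigma^{-\frac{1}{2}}$ is simply the diagonal matrix of $1/\sigma_i$.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed for reproducibility

# Hypothetical noise standard deviations (illustrative values).
sigmas = np.array([0.2, 1.0, 5.0])
Sigma = np.diag(sigmas**2)

# For a diagonal covariance, W^{1/2} = Sigma^{-1/2} = diag(1 / sigma_i).
W_sqrt = np.diag(1.0 / sigmas)

# Sample residual vectors; after the transpose each column is one draw of r.
num_draws = 100_000
r = rng.normal(0.0, sigmas, size=(num_draws, len(sigmas))).T
r_whitened = W_sqrt @ r

# np.cov treats rows as variables, which matches the layout used here.
print("covariance before whitening:\n", np.round(np.cov(r), 2))
print("covariance after whitening:\n", np.round(np.cov(r_whitened), 2))
```

The same conclusion follows algebraically, since $W^{\frac{1}{2}} \Sigma (W^{\frac{1}{2}})^T = I$.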

## Jacobian
Just as we derived the whitened residuals of a weighted least squares problem, we can also derive the whitened Jacobian:
$$
\begin{align*}
r(x) & = W^{\frac{1}{2}}(Ax - b) \\
J & = \begin{bmatrix}\dfrac{dr_i}{dx_j}\end{bmatrix} = W^{\frac{1}{2}}A
\end{align*}
$$
Note that $W^{\frac{1}{2}}A$ is equivalent to dividing each row of $A$ by the standard deviation of the corresponding measurement noise.
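
The row-scaling equivalence can be checked on a tiny example. This is a sketch under assumed sizes and value ranges, with a hypothetical helper `whitened_residual` that is not part of the notebook. It builds $W^{\frac{1}{2}}$ explicitly, compares the matrix product with row-wise division, and also compares against a central finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed for reproducibility

# Hypothetical small problem (sizes and ranges are illustrative).
A = rng.uniform(-10.0, 10.0, size=(6, 3))
b = rng.uniform(-10.0, 10.0, size=(6, 1))
sigmas = rng.uniform(0.5, 2.0, size=(6, 1))

# Explicit W^{1/2} for a diagonal covariance: diag(1 / sigma_i).
W_sqrt = np.diag(1.0 / sigmas[:, 0])

J_matrix = W_sqrt @ A  # Jacobian as a matrix product
J_rows = A / sigmas    # same Jacobian via row-wise broadcasting


def whitened_residual(x):
    """Whitened residual r(x) = W^{1/2} (A x - b)."""
    return (A @ x - b) / sigmas


# Central finite differences, perturbing one parameter at a time.
x0 = rng.uniform(-1.0, 1.0, size=(3, 1))
eps = 1e-6
J_numeric = np.zeros((6, 3))
for j in range(3):
    step = np.zeros((3, 1))
    step[j, 0] = eps
    diff = whitened_residual(x0 + step) - whitened_residual(x0 - step)
    J_numeric[:, j] = diff[:, 0] / (2 * eps)

print("max |W^(1/2) A - A / sigmas|   :", np.max(np.abs(J_matrix - J_rows)))
print("max |analytic - finite diff| :", np.max(np.abs(J_matrix - J_numeric)))
```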

In [None]:
# Validate Jacobian of the residual function.
NUM_JACOBIAN_TESTS = 100
max_value_difference = 0

# Calculate Jacobian.
J = A / b_sigmas

for i in range(NUM_JACOBIAN_TESTS):
    # Create two random parameter values.
    x = generate_random_uniform(NUM_PARAMETERS, 1, -X_TRUE_MAGNITUDE, X_TRUE_MAGNITUDE)
    y = generate_random_uniform(NUM_PARAMETERS, 1, -X_TRUE_MAGNITUDE, X_TRUE_MAGNITUDE)

    # Calculate residuals
    r_x = (A.dot(x) - b) / b_sigmas
    r_y = (A.dot(y) - b) / b_sigmas
#     print(x - y)
#     print(r_x - r_y)

    # Calculate predicted residual
    delta = x - y
    r_x_predicted = r_y + (J.dot(delta))
    max_value_difference = max(max_value_difference, np.linalg.norm(r_x - r_x_predicted))

print("Maximum norm of the difference between true and predicted values of the residual: {:.3f}".format(max_value_difference))

Maximum norm of the difference between true and predicted values of the residual: 0.000


## Normal equations
To solve the weighted least squares problem we want to find the set of parameters $x^*$ that minimizes the sum of squared whitened residuals:
$$
\begin{align*}
\DeclareMathOperator*{\argmin}{arg\,min}
x^* & = \argmin_x ||W^{\frac{1}{2}}(Ax - b)||^2 \\
    & = \argmin_x ||W^{\frac{1}{2}}Ax - W^{\frac{1}{2}}b||^2
\end{align*}
$$
This is equivalent to solving an ordinary least squares problem with the following substitution:
$$
\begin{align*}
A' & = W^{\frac{1}{2}}A \\
b' & = W^{\frac{1}{2}}b
\end{align*}
$$

The normal equations for weighted least squares can be obtained by plugging the expressions for $A'$ and $b'$ into the ordinary least squares normal equations:
$$
\begin{align*}
A'^T A' x^* & = A'^T b' \\
A^T W A x^* & = A^T Wb
\end{align*}
$$
The optimal solution is obtained as:
$$
x^* = (A^T W A)^{-1}A^T Wb
$$

### Normal equations derivation
$$
\begin{align*}
A'^TA' x^* & = A'^T b' \\
(W^{\frac{1}{2}}A)^TW^{\frac{1}{2}}A x^* & = (W^{\frac{1}{2}}A)^T W^{\frac{1}{2}}b \\
\end{align*}
$$
Since $W$ is symmetric we have:
$$
(W^{\frac{1}{2}}A)^T = A^T (W^{\frac{1}{2}})^T  = A^T W^{\frac{1}{2}}
$$
Plugging this back in, we get the normal equations for weighted least squares:
$$
\begin{align*}
A^T W^{\frac{1}{2}} W^{\frac{1}{2}}A x^* & = A^T W^{\frac{1}{2}} W^{\frac{1}{2}}b \\
A^T W A x^* & = A^T Wb \\
\end{align*}
$$
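
The equivalence between the weighted normal equations and ordinary least squares on the whitened system can be cross-checked numerically. The sketch below uses assumed problem sizes and ranges. It solves $A^T W A x^* = A^T W b$ with an explicit $W$ and compares the result with `np.linalg.lstsq` applied to $A'$ and $b'$.

```python
import numpy as np

rng = np.random.default_rng(5)  # fixed seed for reproducibility

# Hypothetical problem (sizes and ranges are illustrative).
num_measurements, num_parameters = 200, 3
A = rng.uniform(-10.0, 10.0, size=(num_measurements, num_parameters))
x_true = rng.uniform(-10.0, 10.0, size=(num_parameters, 1))
sigmas = rng.uniform(0.1, 10.0, size=(num_measurements, 1))
b = A @ x_true + rng.normal(0.0, sigmas)

# Route 1: weighted normal equations with an explicit weight matrix W = Sigma^{-1}.
W = np.diag(1.0 / sigmas[:, 0] ** 2)
x_normal = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Route 2: ordinary least squares on the whitened system A' x = b'.
A_whitened = A / sigmas
b_whitened = b / sigmas
x_lstsq = np.linalg.lstsq(A_whitened, b_whitened, rcond=None)[0]

print("normal equations:", x_normal.ravel())
print("whitened lstsq  :", x_lstsq.ravel())
print("true parameters :", x_true.ravel())
```

Building the dense $W$ is only practical for small problems; for larger ones the row-wise division used in route 2 avoids the $m \times m$ matrix entirely.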

In [None]:
# Solve weighted linear least squares problem.
def solve_weighted_least_squares(A, b, b_noise_sigmas):
    # First whiten the measurements: divide each row by its noise standard deviation.
    A_whitened = A / b_noise_sigmas
    b_whitened = b / b_noise_sigmas

    # Next solve an ordinary least squares problem with whitened measurements
    # via the normal equations.
    info = A_whitened.transpose().dot(A_whitened)
    Atb_whitened = A_whitened.transpose().dot(b_whitened)
    return np.linalg.solve(info, Atb_whitened)

x_estimated = solve_weighted_least_squares(A, b, b_sigmas)
residuals_estimated = A.dot(x_estimated) - b
rmse = parameter_rmse(x_estimated, x_true)
print("Root mean Squared Error of parameters estimated using weighted least squares solution:\n {:f}".format(rmse))
print("Average residual:\n {:f}".format(np.mean(residuals_estimated)))

# Calculate bootstrap covariance of the solution
x_covariance = bootstrap_covariance(solve_weighted_least_squares)
print("Solution covariance estimated using bootstrap:")
print(x_covariance)

Root mean Squared Error of parameters estimated using weighted least squares solution:
 2.058716
Average residual:
 -4342.449140
Solution covariance estimated using bootstrap:
[[ 179.534  1.314  1.237  1.347]
 [ 1.314  183.273 -0.230  0.326]
 [ 1.237 -0.230  181.422 -2.836]
 [ 1.347  0.326 -2.836  184.400]]


## Covariance
Given a weighted least squares problem, the covariance matrix of the solution can be estimated as:
$$cov(x^*) = (A^T \Sigma^{-1} A)^{-1}$$
where $\Sigma$ is the covariance matrix of the measurement noise [[link](https://en.wikipedia.org/wiki/Weighted_least_squares#:~:text=When%20W%20%3D%20M%E2%88%921%2C%20this%20simplifies%20to)][[proof](https://github.com/aecins/Least-Squares-Notebooks/blob/main/least_squares_covariance_derivation.ipynb)].

In [None]:
# Calculate covariance of the weighted least squares solution.
A_prime = A / b_sigmas
cov_inverse = A_prime.transpose().dot(A_prime)
cov = np.linalg.inv(cov_inverse)
print(cov)

[[ 219.170  46.953 -112.475 -79.246]
 [ 46.953  165.369 -42.285  1.161]
 [-112.475 -42.285  268.194  124.747]
 [-79.246  1.161  124.747  119.841]]
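
The covariance formula above can be compared against a Monte Carlo estimate on a small synthetic problem. The following is a sketch under stated assumptions: the sizes, ranges, and seed are illustrative, and $A$ is held fixed across trials with only the noise redrawn, which is the setting the formula describes. This differs from `bootstrap_covariance` above, which redraws $A$ in every iteration.

```python
import numpy as np

rng = np.random.default_rng(6)  # fixed seed for reproducibility

# Hypothetical problem with a fixed design matrix A.
num_measurements, num_parameters = 100, 2
A = rng.uniform(-10.0, 10.0, size=(num_measurements, num_parameters))
x_true = rng.uniform(-10.0, 10.0, size=(num_parameters, 1))
sigmas = rng.uniform(1.0, 20.0, size=(num_measurements, 1))

# Analytic covariance (A^T Sigma^{-1} A)^{-1}, computed from the whitened A.
A_whitened = A / sigmas
info = A_whitened.T @ A_whitened
cov_analytic = np.linalg.inv(info)

# Monte Carlo: each column of B is one noisy realization of b for the same A.
num_trials = 20_000
noise = rng.normal(0.0, sigmas, size=(num_measurements, num_trials))
B = A @ x_true + noise

# Solve the weighted normal equations for all realizations at once.
estimates = np.linalg.solve(info, A_whitened.T @ (B / sigmas))
cov_empirical = np.cov(estimates)  # rows are parameters, columns are trials

print("analytic covariance:\n", cov_analytic)
print("Monte Carlo covariance:\n", cov_empirical)
```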
