# WEIGHTED LEAST SQUARES

<br>

## Introduction

<br>
A special case of GLS called <b>Weighted Least Squares</b> (<b>WLS</b>) occurs when the disturbance terms are uncorrelated but still exhibit heteroscedasticity.

<br>
Given the following covariance matrix, where the off-diagonal entries of $\mathbf{\Sigma}$ are zero but the diagonal entries (the variances) may still differ from one another, we can compute a simpler version of the transformation matrix $\mathbf{G}$ : 

$
    \quad
    \mathrm{V}(\boldsymbol{\varepsilon}) 
    \quad = \quad
    \mathbf{\Sigma}
    \quad = \quad
    \begin{bmatrix}
        {\sigma_1}^2  &  0             &  \dots   &  0             \\
        0             &  {\sigma_2}^2  &  \dots   &  0             \\ 
        \vdots        &  \vdots        &  \ddots  &  \vdots        \\ 
        0             &  0             &  \dots   &  {\sigma_m}^2 
    \end{bmatrix} _\textit{ m x m }
    \quad = \quad
    \boldsymbol{\mathbf{W}^{-1}}
    \qquad \Rightarrow \qquad
    \mathbf{G} 
    \quad = \quad \mathbf{\Sigma}^{\ -1/2} 
    \quad = \quad 
        \begin{bmatrix}
            {\sigma_1}^{-1}  &  0                &  \dots   &  0                \\
            0                &  {\sigma_2}^{-1}  &  \dots   &  0                \\ 
            \vdots           &  \vdots           &  \ddots  &  \vdots           \\ 
            0                &  0                &  \dots   &  {\sigma_m}^{-1}
        \end{bmatrix} _\textit{ m x m }   
$

<br>
Please note that, in the context of WLS, the inverse of the matrix $\mathbf{\Sigma}$ will be referred to as $\mathbf{W}$.
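
<br>
As a minimal sketch of the relationship between $\mathbf{\Sigma}$, $\mathbf{W}$ and $\mathbf{G}$ in the diagonal case, the three matrices can be built from a vector of error standard deviations. The values in `sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical error standard deviations for m = 4 observations (illustrative values).
sigma = np.array([0.5, 1.0, 2.0, 4.0])

Sigma = np.diag(sigma**2)      # covariance matrix of the disturbances
W = np.diag(1.0 / sigma**2)    # weight matrix, W = Sigma^{-1}
G = np.diag(1.0 / sigma)       # transformation matrix, G = Sigma^{-1/2}

# G'G recovers W, and G whitens the covariance: G Sigma G' = I.
assert np.allclose(G.T @ G, W)
assert np.allclose(G @ Sigma @ G.T, np.eye(len(sigma)))
```

Because every matrix involved is diagonal, no matrix square root or inversion routine is needed: element-wise reciprocals are enough.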

## Minimization Objective

<br>
The WLS minimization objective can be seen as a particular version of its GLS analogue, with a simpler structure : 

<br>
$
    \quad
    \begin{align}
    S_{WLS}(\hat{\boldsymbol{\beta}}, \mathbf{W}) 
    &=
    \newline
    &=
        \boldsymbol{c^2}
        \ (\mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} 
        \ \mathbf{\Sigma}^{-1} 
        \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) 
    \newline 
    &=
        \boldsymbol{c^2}
        \ (\mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} 
        \ \mathbf{W} 
        \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) 
    \newline \newline
    &= \boldsymbol{c^2} \sum _{i=1}^{m} \boldsymbol{\mathbf{W}_{ii}}
        \big[ \boldsymbol{\mathbf{Y}_i} - (\boldsymbol{\mathbf{X}_i})^{\top} \boldsymbol{\hat{\beta}} \big] ^\boldsymbol{2}
    \quad = \quad \boldsymbol{c^2} \sum_{i=1}^{m} \boldsymbol{\mathbf{W}_{ii}} \ {\boldsymbol{\mathbf{e}_i}}^\boldsymbol{2}
    \end{align}
$
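
<br>
The sketch below evaluates the objective in both its quadratic form and its weighted-sum form, and checks that they agree. It assumes the constant $c^2$ equals 1 (a positive constant does not change the minimizer), and the data, the weights and the helper name `wls_objective` are hypothetical.

```python
import numpy as np

def wls_objective(beta, X, y, w):
    """Weighted sum of squared residuals, with w holding the diagonal of W."""
    e = y - X @ beta
    return float(np.sum(w * e**2))

# Small illustrative data set: intercept column plus one regressor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 4.0])
w = np.array([4.0, 1.0, 0.25])    # diagonal entries of W
beta = np.array([1.0, 1.5])       # candidate coefficient vector

# The same quantity written as the quadratic form e' W e.
e = y - X @ beta
quadratic_form = float(e @ np.diag(w) @ e)

assert np.isclose(wls_objective(beta, X, y, w), quadratic_form)
```

Only the second residual is non-zero here, so the objective reduces to its weight times its squared residual.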

## WLS Estimators

<br>
The same can be said of the WLS estimators :

$
    \quad
    \begin{align}
        \boldsymbol{\hat{\beta}_{WLS-1}}
        &= 
        \newline
        &=  \big[ \mathbf{X}^{\top} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \big] ^{-1} 
            \mathbf{X}^{\top} \ \mathbf{\Sigma}^{-1} \ \mathbf{Y}   
        \newline
        &=  \big[ \mathbf{X}^{\top} \ \mathbf{W} \ \mathbf{X} \big] ^{-1} 
            \mathbf{X}^{\top} \ \mathbf{W} \ \mathbf{Y}   
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}_{WLS}(\boldsymbol{\varepsilon^*}) 
        &= 
        \newline
        &= \mathrm{V}_{GLS}(\boldsymbol{\varepsilon^*}) 
        \newline
        &= \boldsymbol{c^2} \boldsymbol{\textit{I}} _\textit{ m x m }
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}(\boldsymbol{\hat{\beta}_{WLS-1}}) 
        &=
        \newline
        &= \big[ \mathbf{X}^{\top} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \big] ^{-1} 
        \newline
        &= \big[ \mathbf{X}^{\top} \ \mathbf{W} \ \mathbf{X} \big] ^{-1} 
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}_{WLS}(\mathbf{Y^*}) 
        &= 
        \newline
        &= \mathrm{V}_{GLS}(\mathbf{Y^*})  
        \newline
        & = \boldsymbol{c^2} \ \boldsymbol{\textit{I}} _\textit{ m x m }       
    \end{align}    
$

$
    \quad
    \begin{align}
        \boldsymbol{{s^2}_{WLS}} 
        &=
        \newline
        &=
            \dfrac{1}{m - p} 
            \ \boldsymbol{c^2}
            \ (\mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} 
            \ \mathbf{\Sigma}^{-1} 
            \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) 
        \newline
        &=
            \dfrac{1}{m - p} 
            \ \boldsymbol{c^2}
            \ (\mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} 
            \ \mathbf{W} 
            \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) 
    \end{align}    
$

<br>
We can think of OLS estimation as a special case of WLS (and consequently of GLS) in which the error terms are not only uncorrelated but also homoscedastic ($\boldsymbol{\mathbf{W}_{ii}} = 1$ for every $i$).
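
<br>
The coefficient estimator and its covariance matrix can be sketched in a few lines of NumPy. The helper `wls_fit`, the simulated data and the assumption that the error standard deviations are known exactly are all illustrative. The sketch also checks two equivalences: WLS equals OLS on the data transformed by $\mathbf{G}$, and WLS with unit weights equals plain OLS.

```python
import numpy as np

def wls_fit(X, y, w):
    """Return the WLS coefficients and [X' W X]^{-1}, with w the diagonal of W."""
    XtW = X.T * w                      # same as X.T @ np.diag(w), without building W
    XtWX = XtW @ X
    beta = np.linalg.solve(XtWX, XtW @ y)
    cov = np.linalg.inv(XtWX)
    return beta, cov

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(1.0, 10.0, size=m)
X = np.column_stack([np.ones(m), x])
sigma = 0.3 * x                        # assumed known error standard deviations
y = X @ np.array([2.0, 0.5]) + rng.normal(0.0, sigma)

beta_wls, cov_wls = wls_fit(X, y, 1.0 / sigma**2)

# Equivalent route: OLS on the transformed data G y, G X with G = diag(1 / sigma).
beta_transformed, *_ = np.linalg.lstsq(X / sigma[:, None], y / sigma, rcond=None)
assert np.allclose(beta_wls, beta_transformed)

# Special case: unit weights reproduce the OLS estimate.
beta_unit, _ = wls_fit(X, y, np.ones(m))
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_unit, beta_ols)
```

Multiplying `X.T` by the weight vector avoids forming the full $m \times m$ matrix $\mathbf{W}$, which matters when $m$ is large.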


## Weights Estimation TODO

<br>

### Direct Estimation TODO

<br>
If there are replicates in the data, the most obvious way to estimate the weights is to set the weight for each data point equal to the reciprocal of the sample variance obtained from the set of replicate measurements to which the data point belongs. Mathematically :

<br>
$
    \quad
    \boldsymbol{\mathbf{W}_{ij}} 
    = 
        \dfrac {1}{ \boldsymbol{{\sigma^2}_i} } = 
        \big[ 
            \dfrac
            { \sum_{j=1}^{m_i}(\boldsymbol{\mathbf{Y}_{ij}} - \boldsymbol{\overline{\mathbf{Y}}_i})^2 } 
            {m_i - 1} 
        \big]^{-1}
$

<br>
where :

<br>
<ul style="list-style-type:square">
    <li>
        $\boldsymbol{\mathbf{W}_{ij}}$ are the weights indexed by their regressor variable levels and replicate measurements
    </li>
    <br>
    <li>
        $\boldsymbol{i}$ indexes the unique combinations of regressor variable values, and $\boldsymbol{j}$ indexes the
        replicates within each combination of regressor variable values
    </li>
    <br>
    <li>
        $\boldsymbol{{\sigma^2}_i}$ is the sample variance of the response variable at the $i^\text{th}$ combination
        of regressor variable values
    </li>
    <br>
    <li>
        $m_i$ is the number of replicate observations at the $i^\text{th}$ combination of regressor variable values
    </li>
    <br>
    <li>
        $\boldsymbol{\mathbf{Y}_{ij}}$ are the individual data points indexed by their regressor variable levels and replicate
        measurements
    </li>
    <br>
    <li>
        $\boldsymbol{\overline{\mathbf{Y}}_i}$ is the mean of the responses at the $i^\text{th}$ combination of regressor
        variable levels
    </li>
</ul>

<br>
Unfortunately, although this method is attractive, it rarely works well. This is because when the weights are estimated this way, they are usually extremely variable. As a result, the estimated weights do not correctly control how much each data point should influence the parameter estimates. 

<br>
This method can work, but it requires a very large number of replicates at each combination of predictor variables. In fact, if this method is used with too few replicate measurements, the parameter estimates can actually be more variable than they would have been if the unequal variation were ignored.
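
<br>
A minimal sketch of direct estimation, assuming a hypothetical design with three regressor levels and five replicates at each level. Every observation receives the reciprocal of the sample variance of its own replicate group.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: 3 regressor levels with 5 replicates each.
levels = np.array([1.0, 2.0, 3.0])
true_sd = np.array([0.5, 1.0, 2.0])
x = np.repeat(levels, 5)
y = 2.0 + 0.5 * x + rng.normal(0.0, np.repeat(true_sd, 5))

weights = np.empty_like(y)
for level in levels:
    group = x == level
    # Reciprocal of the sample variance (ddof=1 gives the m_i - 1 denominator).
    weights[group] = 1.0 / np.var(y[group], ddof=1)

# Compare the estimated weights with the weights implied by the true variances.
print("estimated:", weights[::5])
print("true     :", 1.0 / true_sd**2)
```

Rerunning the sketch with different seeds gives a feel for how much the estimated weights move when only a handful of replicates is available per level.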


### A better strategy TODO

<br>
A better strategy for estimating the weights is to find a function that relates the variance of the response at each combination of regressor variable values to the regressor variables themselves :

<br>
$
    \quad
    \boldsymbol{ {\hat{\sigma}^2}_i } \quad \approx \quad \mathbf{g} (\boldsymbol{\mathbf{X}_i}, \mathbf{\gamma})
$

<br>
This approach to estimating the weights usually provides more precise estimates than direct estimation because fewer quantities have to be estimated and there is more data to estimate each one.
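
<br>
One possible way to implement this strategy is sketched below. It assumes, purely for illustration, a log-linear variance function $\mathbf{g}(x, \gamma) = \exp(\gamma_0 + \gamma_1 \log x)$, fitted by regressing the log of the squared OLS residuals on $\log x$. Both this functional form and the simulated data are assumptions, not something prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 500
x = rng.uniform(1.0, 10.0, size=m)
X = np.column_stack([np.ones(m), x])
y = X @ np.array([2.0, 0.5]) + rng.normal(0.0, 0.3 * x)   # true variance = 0.09 * x**2

# Step 1: OLS fit and residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Step 2: fit the assumed variance function on the log scale.
Z = np.column_stack([np.ones(m), np.log(x)])
gamma, *_ = np.linalg.lstsq(Z, np.log(resid**2), rcond=None)

# Step 3: weights are the reciprocals of the fitted variances.
# The intercept gamma[0] is biased on the log scale, but it only rescales all
# weights by a common factor, which leaves the WLS coefficients unchanged.
w = 1.0 / np.exp(Z @ gamma)

# Step 4: WLS with the estimated weights.
XtW = X.T * w
beta_fwls = np.linalg.solve(XtW @ X, XtW @ y)
```

Only two quantities (`gamma[0]` and `gamma[1]`) are estimated from all 500 observations, which is the reason this route tends to give steadier weights than estimating one variance per replicate group.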

## Considerations

<br>
It is fundamental to understand that, unlike linear and non-linear least squares regression, <b>WLS</b> regression is <b>not associated with a particular specification</b> of the functional form (the function used to describe the relationship between the dependent and independent variables). 

<br>
Instead, WLS reflects the behavior of the disturbance terms, and it can be used with functions that are either linear or non-linear in the parameters. It works by incorporating extra non-negative constants (weights) into the fitting criterion. 

<br>
Optimizing the weighted fitting criterion to find the parameter estimates allows the weights to determine the contribution of each observation to the final parameter estimates.


<br>
Weighted Least Squares estimation is particularly relevant when our aim is :

<br>
<ul style="list-style-type:square">
    <li>
        <b>focusing accuracy</b> <br>
        [<b>C1</b>]
    </li>
    <br>
    <li>
        <b>discounting imprecision</b> <br>
        [<b>C2</b>]
    </li>
    <br>
    <li>
        <b>solving other optimization problems</b> <br>
        [<b>C3</b>]
    </li>
</ul>

### [C1] Focusing accuracy

<br>
Depending on the context of the regression, we may want to dedicate more attention to the predicted response for specific values of the regressors : values we expect to see often again, or values for which mistakes are especially costly. 

<br>
When assigning a larger weight to the $i^\text{th}$ residual (or to a certain set of residuals), we are modifying the objective function so that the regression hyperplane will be pulled towards that specific observation of the response variable. In other words, the regression is trying harder to fit a certain set of observations, while "neglecting" others with smaller weights.

<br>
In this first context, we choose the weights to reflect our priorities. 
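
<br>
The pulling effect can be illustrated with a toy data set in which four points lie exactly on a line and a fifth does not. The data and the helper `wls_line` are hypothetical. As the weight on the fifth point grows, the fitted line moves towards it and its residual shrinks.

```python
import numpy as np

def wls_line(x, y, w):
    """Intercept and slope of the weighted least squares line."""
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 8.0])   # last point lies off the line through the others

residuals = []
for weight in (1.0, 10.0, 100.0):
    w = np.array([1.0, 1.0, 1.0, 1.0, weight])
    b0, b1 = wls_line(x, y, w)
    residuals.append(y[-1] - (b0 + b1 * x[-1]))
    print(f"weight={weight:6.1f}  residual at last point={residuals[-1]:.3f}")
```

The price of the better fit at the last point is a worse fit at the other four, which is exactly the trade-off the weights are meant to express.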


### [C2] Discounting imprecision

<br>
As we already know, OLS corresponds to Maximum Likelihood Estimation when the disturbance term $\boldsymbol{\varepsilon}$ is IID Gaussian white noise. This means that, with the disturbance term having constant variance, we are measuring the regression curve with the same precision everywhere. 

<br>
However, the magnitude of the noise is often not constant, and the data are then said to exhibit heteroscedasticity. In the presence of heteroscedasticity, even if each and every noise term is still Gaussian, OLS is no longer efficient, and in fact no longer matches the MLE.

<br>
We also know that not all is lost; if the covariance matrix of the disturbance term is known (or can be estimated through OLS regression residuals), we can use WLS to recover the efficiency of parameter estimation. This is done by "modulating" the objective function in such a way that each data point has the proper amount of influence over the parameter estimation process. 

<br>
There is no way we can estimate the regression function as accurately where the noise is large as we can where the noise is small; <b>a procedure that treats all of the observations equally would give high-variance (less precisely measured) points more influence than they should have, and small-variance points too little</b>.

<br>
Trying to give equal attention to all parts of the input space is a waste of resources; we should aim to fit well where the noise is small (observations measured with high precision), and expect to fit poorly where the noise is large.
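
<br>
The loss of efficiency can be made concrete with a small Monte Carlo sketch. It assumes a hypothetical model whose error standard deviation grows proportionally to the regressor, and it assumes the true weights are known. Both estimators are unbiased here, but the WLS slope should vary less across simulated data sets.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_sims = 100, 2000
x = np.linspace(1.0, 10.0, m)
X = np.column_stack([np.ones(m), x])
sigma = 0.3 * x                  # noise grows with x (assumed known)
w = 1.0 / sigma**2
beta_true = np.array([2.0, 0.5])

XtW = X.T * w
ols_slopes = np.empty(n_sims)
wls_slopes = np.empty(n_sims)
for s in range(n_sims):
    y = X @ beta_true + rng.normal(0.0, sigma)
    ols_slopes[s] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    wls_slopes[s] = np.linalg.solve(XtW @ X, XtW @ y)[1]

print("OLS slope variance:", ols_slopes.var())
print("WLS slope variance:", wls_slopes.var())
```

The size of the gap depends on how strongly the variance changes across the data; with nearly constant variance the two estimators behave almost identically.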


### [C3] Solving other optimization problems

<br>
There are a number of other optimization problems which can be transformed into (or approximated by) WLS. The most important of these arises from Generalized Linear Models, where the mean response is some non-linear function of a linear predictor (logistic regression is one example).

<br>
In this third case, the weights come from the optimization problem we are actually in the process of solving.
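
<br>
Logistic regression gives a compact illustration: it can be fitted by iteratively reweighted least squares (IRLS), where each iteration solves a WLS problem whose weights and working response come from the current fit. The function name `logistic_irls`, the fixed iteration count and the simulated data are assumptions made for this sketch.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Fit logistic regression by repeatedly solving a WLS problem (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))    # fitted probabilities
        w = mu * (1.0 - mu)                # weights produced by the current iterate
        z = eta + (y - mu) / w             # working response
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

rng = np.random.default_rng(4)
m = 1000
x = rng.normal(size=m)
X = np.column_stack([np.ones(m), x])
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))
y = (rng.uniform(size=m) < p).astype(float)

beta_hat = logistic_irls(X, y)
print("estimated coefficients:", beta_hat)
```

A production implementation would add a convergence test and guard against fitted probabilities that are numerically 0 or 1, which would make the weights vanish.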


## Advantages

<br>
Like all of the least squares methods discussed so far, weighted least squares <b>is an efficient method that makes good use of (it extracts more information from) small data sets</b>. It also shares the ability to provide different types of easily interpretable statistical intervals for estimation, prediction, calibration and optimization. 

<br>
In addition, as discussed above, the main advantage that weighted least squares enjoys over other methods is the <b>ability to handle regression situations in which the data points are of varying quality</b>. If the standard deviation of the random errors in the data is not constant across all levels of the explanatory variables, using weighted least squares with proper weights yields the most precise parameter estimates possible.


## Disadvantages

<br>
The biggest disadvantage of weighted least squares is the fact that the theory behind <b>this method is based on the assumption that the weights are known exactly</b>. This is almost never the case in real applications, of course, so estimated weights must be used instead. 

<br>
The effect of using estimated weights is difficult to assess. Experience indicates that small variations in the weights due to estimation do not often affect a regression analysis or its interpretation. If the weights can be estimated with high enough precision, their use can significantly improve the parameter estimates compared to the results that would be obtained if all of the data points were equally weighted. However, when the weights are estimated from small numbers of replicated observations, the results of an analysis can be very badly and unpredictably affected. 

<br>
This is especially likely to be the case when the weights for extreme values of the regressors are estimated using only a few observations. It is important to remain aware of this potential problem, and to only use weighted least squares when the weights can be estimated precisely relative to one another.

<br>
Weighted Least Squares regression, like the other least squares methods, is also sensitive to the <b>effects of outliers and high-leverage points</b>. If potential outliers are not investigated and dealt with appropriately, they will likely have a negative impact on the parameter estimation and other aspects of a weighted least squares analysis. If a weighted least squares regression actually increases the influence of an outlier, the results of the analysis may be far inferior to an unweighted least squares analysis.


## Alternatives TODO

<br>
It is wise to use WLS only when the weights can be estimated with high precision. When dealing with heteroscedasticity, it is common to run OLS instead, and use a different variance estimator. 

<br>
While White’s consistent estimator does not require the form of the heteroscedasticity to be known, it is not a very efficient strategy: the coefficients are still the OLS estimates, and only their standard errors are corrected. However, if you don’t know the weights for your data, it may be your best choice. For a full explanation of how to implement White’s consistent estimator, see White’s original 1980 paper.
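
<br>
A minimal sketch of this alternative, using the basic form of White’s estimator (often labelled HC0): the OLS coefficients are kept, and their covariance is estimated with a "sandwich" built from the squared residuals. The helper `ols_with_white_se` and the simulated data are hypothetical.

```python
import numpy as np

def ols_with_white_se(X, y):
    """OLS coefficients with White (HC0) heteroscedasticity-consistent standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = (X.T * e**2) @ X               # X' diag(e_i^2) X
    cov_hc0 = XtX_inv @ meat @ XtX_inv    # sandwich estimator
    return beta, np.sqrt(np.diag(cov_hc0))

rng = np.random.default_rng(5)
m = 500
x = rng.uniform(1.0, 10.0, size=m)
X = np.column_stack([np.ones(m), x])
y = X @ np.array([2.0, 0.5]) + rng.normal(0.0, 0.3 * x)

beta_ols, se_white = ols_with_white_se(X, y)

# Classical OLS standard errors, which assume constant variance, for comparison.
e = y - X @ beta_ols
s2 = e @ e / (m - X.shape[1])
se_classical = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

print("White SE    :", se_white)
print("classical SE:", se_classical)
```

No weights are needed at any point, which is what makes this approach attractive when the variance structure is unknown.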

## References

<br>
<ul style="list-style-type:square">    
    <li>
        National Taiwan University - Chung-Ming Kuan - 
        <a href="https://bit.ly/2IJ3hmG">
        Generalized Least Squares Theory</a>
    </li>
    <br>
    <li>
        Carleton University - Ba Chu - 
        <a href="https://bit.ly/2J9PpB4">
        Generalized Least Squares Theory</a>
    </li>
    <br>
    <li>
        National Institute of Standards and Technology - 
        <a href="https://bit.ly/2GJ8tRD">
        Accounting for Non-Constant Variation Across the Data TODO</a>
    </li>
    <br>
    <li>
        National Institute of Standards and Technology - 
        <a href="https://bit.ly/2J75EyX">
        Weighted Least Squares Regression</a>
    </li>
    <br>
    <li>
        Carnegie Mellon University - Cosma Shalizi - 
        <a href="https://bit.ly/2IICB5y">
        Extending Linear Regression: Weighted Least Squares, Heteroskedasticity, Local Polynomial Regression</a>
    </li>
</ul>
