# GENERALIZED LEAST SQUARES

<br>

## Introduction

<br>
Until now we have assumed the error terms to be spherical ($ \ \mathbf{E} [ \varepsilon \varepsilon^{\top} \mid X ] = \sigma^2 \textit{I} \ $). Although the assumptions of homoscedasticity and independence of the error terms are not needed to compute the OLS estimates, we will see that they do affect the statistical properties of the OLS estimators and of the resulting test statistics. 

<b>Generalized least squares</b> (<b>GLS</b>) is an extension of the OLS method that allows unbiased and efficient estimation of population parameters $\boldsymbol{\beta}$ when the error terms are affected by either <b>heteroscedasticity or correlations (or both)</b>, as long as the form of heteroscedasticity and correlation is known independently of the data. 

<br>
GLS can be used to perform linear regression when there is a certain degree of correlation between the residuals in a regression model. In these cases, OLS and weighted least squares can be statistically inefficient, or even give misleading inferences.


## Problems with OLS

<br>
A quick look at the previous notebooks will remind us that :

<br>
<blockquote>
$
    \begin{align}
        &
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
            \text{by OLS estimation}
        \newline
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
        &= (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{Y}         
        \newline
        &= {\boldsymbol{\beta}} + (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\boldsymbol{\varepsilon}
    \end{align}
    \\
    \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} 
    = \overline{\mathbf{Y}} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
$
</blockquote>


<blockquote>
$
    \begin{align}
        &
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            & \text{by strict exogeneity (} \textbf{A2} \text{)}  
        \newline
        & \mathbf{E} \big[ \boldsymbol{\hat{\beta}_{OLS-1}} \big] = \boldsymbol{\beta_1}
        \newline
        & \mathbf{E} \big[ \boldsymbol{\hat{\beta}_{OLS-0}} \big] = \boldsymbol{\beta_0}
    \end{align}
$
</blockquote>


<blockquote>
$
    \begin{align}
        &
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            & \text{by strict exogeneity (} \textbf{A2} \text{)}  
        \newline
        & \mathrm{Var}(\boldsymbol{\mathbf{Y}_i}) = \mathrm{Var} ( \boldsymbol{\varepsilon_i} )
        \newline
        & \mathrm{V}(\mathbf{Y}) = \mathrm{V}(\boldsymbol{\varepsilon})
        \newline
        & \mathrm{V}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}) 
        = 
            (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}
            \ \mathrm{V}(\boldsymbol{\varepsilon}) \         
            \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1}  
    \end{align}
$
</blockquote>


<br>
Suppose now that instead of $\quad \mathrm{V}(\boldsymbol{\varepsilon}) = \boldsymbol{\sigma^2} \boldsymbol{\textit{I}} \quad$, the covariance matrix of the disturbance term is $ \quad \mathrm{V}(\boldsymbol{\varepsilon}) = \boldsymbol{\Sigma} = \boldsymbol{\sigma^2} \ \mathbf{\Omega} \quad $, where the matrix $\mathbf{\Omega}$ is a positive definite matrix accounting for both heteroscedasticity and autocorrelation.

<br>
Since the only assumption required for unbiasedness is strict exogeneity, if we were to estimate our model using OLS, <b>our estimators would still be unbiased</b>.

<br>
We also know that $\mathrm{V}(\boldsymbol{\varepsilon})$ is no longer a scalar covariance matrix, and hence there is no guarantee that the OLS estimator is the most efficient within the class of linear unbiased estimators (<b>loss of efficiency</b>).

<br>
Apart from efficiency, a more serious consequence is that <b>hypothesis testing based on the standard OLS estimation becomes invalid</b> : as the <b><i>t</i></b> and <b><i>F</i></b> statistics depend on the elements of the estimated covariance matrix 
$ \boldsymbol{s^2} \ (\mathbf{X}^{\top}\mathbf{X})^{-1} $, they no longer have the desired t and F distributions under the null hypothesis. Consequently, the inferences based on these tests become invalid.

<br>
In practice, we rarely know the true properties of $\mathbf{Y}$; therefore it is important to consider an estimation method that remains valid when $\mathrm{Var}(\mathbf{Y})$ has a more general form.
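
<br>
To make the consequences for inference concrete, here is a minimal NumPy sketch comparing the true (sandwich) covariance of the OLS estimator with what the spherical formula would report. The design matrix, the heteroscedastic $\mathbf{\Sigma}$ and the helper name `ols_true_cov` are assumptions chosen purely for illustration.

```python
import numpy as np

def ols_true_cov(X, Sigma):
    """Sandwich covariance (X'X)^-1 X' Sigma X (X'X)^-1 of the OLS estimator."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ Sigma @ X @ XtX_inv

# Hypothetical design: intercept plus one regressor, m = 6 observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones_like(x), x])

# Assumed heteroscedastic covariance: Var(eps_i) proportional to x_i ** 2.
Sigma = np.diag(x ** 2)

V_true = ols_true_cov(X, Sigma)

# Spherical formula sigma^2 (X'X)^-1, using the average variance as sigma^2.
sigma2_bar = np.trace(Sigma) / len(x)
V_naive = sigma2_bar * np.linalg.inv(X.T @ X)

print("true std. errors :", np.sqrt(np.diag(V_true)))
print("naive std. errors:", np.sqrt(np.diag(V_naive)))
```

The two sets of standard errors differ, which is why t and F statistics built on $ \boldsymbol{s^2} \ (\mathbf{X}^{\top}\mathbf{X})^{-1} $ can mislead when the errors are not spherical.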


## Derivation

<br>
The intuition behind GLS is to find a transformation $\mathbf{G}$ (of the population regression equation) that delivers a new error term which is homoscedastic and uncorrelated, as the Gauss-Markov assumptions require, while retaining the other assumptions made so far. If the transformed specification complies with the Gauss-Markov assumptions, then by the Gauss-Markov theorem the OLS estimator applied to it is BLUE. 

<br>
Let $\mathbf{G}$ be a $ \ \textit{ m x m } \ $ non-stochastic matrix, where $\textit{m}$ is the number of observations, and consider the "transformed" specification

<br>
$
    \quad
    \begin{align}
        &
            \mathbf{G} \ \mathbf{Y} = \mathbf{G} \ \mathbf{X} \ \boldsymbol{\beta} + \mathbf{G} \ \boldsymbol{\varepsilon} 
        \newline
        \text{or} \quad
        &
            \boldsymbol{\mathbf{Y}^*} = \boldsymbol{\mathbf{X}^*} \ \boldsymbol{\beta} + \boldsymbol{\varepsilon^*} 
    \end{align}
$

<br>
where $\boldsymbol{\mathbf{Y}^*} (=\mathbf{G} \ \mathbf{Y}) $ denotes the transformed dependent variable and 
$ \boldsymbol{\mathbf{X}^*} (= \mathbf{G} \ \mathbf{X})$ is the matrix of transformed explanatory variables.

<br>
It can be seen that $\boldsymbol{\mathbf{X}^*}$ has full column rank $\textit{p}$, provided that $\mathbf{G}$ is nonsingular; the identification requirement thus carries over under nonsingular transformations. It follows that population parameters can still be estimated by OLS using these transformed variables.

<br>
It is also easy to see that the specification is still linear with respect to the parameters.

<br>
Before going through the OLS estimation of the parameters, let's ask ourselves what are the properties brought by this transformation $\mathbf{G}$ ? Without further assumptions or knowledge, we can say that :

<br>
<ul style="list-style-type:square">
    <li>
        $            
            \mathbf{E}[\boldsymbol{\varepsilon^*}] 
            \ = \ \mathbf{E}[\mathbf{G} \boldsymbol{\varepsilon}] 
            \ = \ \mathbf{G} \ \mathbf{E}[\boldsymbol{\varepsilon}] 
            \ = \ 0                        
        $    
    </li>
    <br>
    <li>
        $
            \mathbf{E}[\boldsymbol{\varepsilon^*}\boldsymbol{\varepsilon^{*\top}}]
            \ = \ \mathbf{E}[ \ (\mathbf{G} \ \boldsymbol{\varepsilon}) \ (\mathbf{G} \ \boldsymbol{\varepsilon})^{\top} \ ] 
            \ = \ \mathbf{E}[ \ \mathbf{G} \ \boldsymbol{\varepsilon} \ \boldsymbol{\varepsilon}^{\top} \ \mathbf{G}^{\top} \ ]
            \ = \ \mathbf{G} \ \mathbf{E}[ \ \boldsymbol{\varepsilon} \ \boldsymbol{\varepsilon}^{\top} \ ] \ \mathbf{G}^{\top} 
            \ = \ \mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top} 
            \ = \ \boldsymbol{\sigma^2} \ \mathbf{G} \ \mathbf{\Omega} \ \mathbf{G}^{\top}    
        $    
    </li>
    <br>
    <li>
        $
            \mathbf{E}[\mathbf{Y^*}] 
            \ = \ \mathbf{E}[\boldsymbol{\mathbf{X}^*} \ \boldsymbol{\beta}] + \mathbf{E}[\boldsymbol{\varepsilon^*}]
            \ = \ \mathbf{E}[\mathbf{G} \ \mathbf{X} \ \boldsymbol{\beta}]
            \ = \ \mathbf{G} \ \mathbf{E}[\mathbf{X} \ \boldsymbol{\beta}]
            \ = \ \mathbf{G} \ \mathbf{E}[\mathbf{Y}]
        $    
    </li>
</ul>    
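
<br>
The properties listed above can be checked exactly on a small example. The sketch below uses a discrete error distribution constructed so that its mean is zero and its covariance is exactly $\mathbf{\Sigma}$; the particular $\mathbf{\Sigma}$, $\mathbf{G}$ and $\mathbf{X}$ are hypothetical choices, not taken from the text.

```python
import numpy as np

m = 4
X = np.column_stack([np.ones(m), np.arange(1.0, m + 1)])  # full column rank p = 2

# Assumed covariance: AR(1)-type correlation with rho = 0.6 (positive definite).
idx = np.arange(m)
Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical nonsingular, non-stochastic transformation (unit upper bidiagonal).
G = np.eye(m) + 0.5 * np.eye(m, k=1)

# Errors take the 2m values +/- sqrt(m) * L[:, j] with equal probability,
# where Sigma = L L'. This gives E[eps] = 0 and E[eps eps'] = Sigma exactly.
L = np.linalg.cholesky(Sigma)
support = np.sqrt(m) * np.hstack([L, -L])  # shape (m, 2m), one draw per column

eps_star = G @ support                      # transformed errors
mean_star = eps_star.mean(axis=1)           # E[eps*]
cov_star = eps_star @ eps_star.T / support.shape[1]  # E[eps* eps*']

print(np.allclose(mean_star, 0.0))
print(np.allclose(cov_star, G @ Sigma @ G.T))
print(np.linalg.matrix_rank(G @ X))
```

The last line also illustrates that $\mathbf{G} \ \mathbf{X}$ keeps the column rank of $\mathbf{X}$ when $\mathbf{G}$ is nonsingular.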

<br>
The resulting OLS estimators and the associated variances are : 


$
    \quad
    \begin{align}
        \boldsymbol{\hat{\beta}_{GLS-1}}
        &= 
        \newline
        &= (\mathbf{X^{*\top}} \mathbf{X^*})^{-1} \mathbf{X^{*\top}} \mathbf{Y^*}  
        \newline
        &=  
            \big[ (\mathbf{G} \ \mathbf{X})^{\top} (\mathbf{G} \ \mathbf{X}) \big] ^{-1} 
            (\mathbf{G} \ \mathbf{X})^{\top} (\mathbf{G} \ \mathbf{Y})
        \newline
        &=  
            \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
            \mathbf{X}^{\top} \ \mathbf{G}^{\top} \mathbf{G} \ \mathbf{Y}   
    \end{align}    
$

$
    \quad
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{GLS-0} 
        &= 
        \newline
        &= \overline{\mathbf{Y}}^{\ *} - \boldsymbol{\hat{\beta}_{GLS-1}} \overline{\mathbf{X}}^{\ *}
        \newline
        &= \mathbf{G} \ \overline{\mathbf{Y}} - \boldsymbol{\hat{\beta}_{GLS-1}} \mathbf{G} \ \overline{\mathbf{X}}  
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}(\boldsymbol{\varepsilon^*}) 
        &= 
        \newline
        &= \mathrm{V}(\mathbf{G} \ \boldsymbol{\varepsilon}) 
        \newline
        &= 
            \mathbf{G} \ \mathrm{V}(\boldsymbol{\varepsilon}) \ \mathbf{G}^{\top}
            = \mathbf{G} \ \boldsymbol{\Sigma} \ \mathbf{G}^{\top}
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}(\boldsymbol{\hat{\beta}_{GLS-1}}) 
        &=
        \newline
        &= 
            (\mathbf{X^{* \top}} \mathbf{X^*})^{-1} \mathbf{X^{* \top}}
            \ \mathrm{V}(\boldsymbol{\varepsilon^*}) \         
            \mathbf{X^*} (\mathbf{X^{* \top}}\mathbf{X^*})^{-1}  
        \newline
        &= 
            \big[ (\mathbf{G} \ \mathbf{X})^{\top} (\mathbf{G} \ \mathbf{X}) \big] ^{-1} 
            \ (\mathbf{G} \ \mathbf{X})^{\top}
            \ (\mathbf{G} \ \boldsymbol{\Sigma} \ \mathbf{G}^{\top})
            \ (\mathbf{G} \ \mathbf{X}) 
            \ \big[ (\mathbf{G} \ \mathbf{X})^{\top} (\mathbf{G} \ \mathbf{X}) \big] ^{-1} 
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}(\mathbf{Y^*}) 
        &=
        \newline
        &= \mathrm{V}(\boldsymbol{\varepsilon^*})  
        = \mathbf{G} \ \boldsymbol{\Sigma} \ \mathbf{G}^{\top}
    \end{align}    
$

The estimator $\boldsymbol{\hat{\beta}_{GLS-1}}$ is unbiased for any non-stochastic and non-singular matrix $\mathbf{G}$. The next question is : can we
find a transformation matrix that yields the most efficient estimator among all linear unbiased estimators? In other words, is there a matrix $\mathbf{G}$ such that $ \mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top} = \boldsymbol{c^2} \boldsymbol{\textit{I}} $ for some finite positive number $\boldsymbol{c^2}$?
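
<br>
As a rough illustration of the transformed estimator, the following sketch runs OLS on $(\mathbf{G} \ \mathbf{X}, \ \mathbf{G} \ \mathbf{Y})$ for two different choices of $\mathbf{G}$. The function name `transformed_ols`, the data and both transformations are made up for the example.

```python
import numpy as np

def transformed_ols(X, Y, G):
    """OLS on (G X, G Y), i.e. [X'G'GX]^-1 X'G'G Y."""
    Xs, Ys = G @ X, G @ Y
    beta, *_ = np.linalg.lstsq(Xs, Ys, rcond=None)
    return beta

m = 6
x = np.arange(1.0, m + 1)
X = np.column_stack([np.ones(m), x])
beta_true = np.array([2.0, -0.5])

G_identity = np.eye(m)        # plain OLS
G_weights = np.diag(1.0 / x)  # hypothetical nonsingular alternative

# Without noise, every nonsingular G recovers beta exactly.
Y_exact = X @ beta_true
print(transformed_ols(X, Y_exact, G_weights))

# With one fixed, made-up disturbance vector the two estimates differ.
eps = np.array([0.3, -0.2, 0.5, -0.4, 0.1, -0.3])
Y_noisy = Y_exact + eps
b_identity = transformed_ols(X, Y_noisy, G_identity)
b_weights = transformed_ols(X, Y_noisy, G_weights)
print(b_identity, b_weights)
```

Every nonsingular $\mathbf{G}$ yields an estimator centred on $\boldsymbol{\beta}$, but on a given sample the estimates differ, so the choice of $\mathbf{G}$ matters for efficiency.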

### Diagonalizable Matrices

<br>
A square $ \ \textit{ m x m } \ $ matrix $\mathbf{A}$ is said to be diagonalizable if it can be written in the form 
$ \quad \mathbf{A} = \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{-1} \quad $ where 

<ul style="list-style-type:square">
    <li>
        $\mathbf{D}$ is a diagonal $ \ \textit{ m x m } \ $ matrix with the eigenvalues of $\mathbf{A}$ as its entries
    </li>
    <br>
    <li>
        $\mathbf{Q}$ is a nonsingular $ \ \textit{ m x m } \ $ matrix consisting of the eigenvectors corresponding to the
        eigenvalues in $\mathbf{D}$
    </li>
</ul>

<br>
<b>Property</b> :
A square $ \ \textit{ m x m } \ $ matrix $\mathbf{A}$ is diagonalizable <b>if and only if</b> it has $ \ \textit{m} \ $ linearly independent eigenvectors, i.e. if the rank of the matrix formed by the eigenvectors is $ \ \textit{m} \ $. 
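
<br>
A small numerical check of the definition, using `numpy.linalg.eig` on an arbitrary non-symmetric matrix chosen for illustration (its eigenvalues are 2 and 5) :

```python
import numpy as np

# Illustrative non-symmetric matrix with two distinct real eigenvalues.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, Q = np.linalg.eig(A)  # columns of Q are eigenvectors
D = np.diag(eigvals)

# Q has rank m = 2, so A is diagonalizable and A = Q D Q^-1.
A_rebuilt = Q @ D @ np.linalg.inv(Q)

print(np.linalg.matrix_rank(Q))
print(np.allclose(A_rebuilt, A))
```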


### Orthogonal Matrices

<br>
A square $ \ \textit{ m x m } \ $ matrix $\mathbf{Q}$ is said to be orthogonal if it has real entries and its columns and rows are orthogonal unit vectors (orthonormal vectors), i.e. if its transpose is equal to its inverse : 

$
    \quad
    \mathbf{Q}^{\top} \mathbf{Q} = \mathbf{Q} \ \mathbf{Q}^{\top} = \boldsymbol{\textit{I}}
$
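
<br>
For instance, a plane rotation is orthogonal. The sketch below checks this for an arbitrarily chosen angle :

```python
import numpy as np

theta = np.pi / 6  # arbitrary angle, for illustration only
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Transpose and inverse coincide for an orthogonal matrix.
print(np.allclose(Q.T @ Q, np.eye(2)))
print(np.allclose(Q @ Q.T, np.eye(2)))
print(np.allclose(np.linalg.inv(Q), Q.T))
```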

### Orthogonally Diagonalizable Matrices

<br>
A square $ \ \textit{ m x m } \ $ matrix $\mathbf{A}$ is said to be orthogonally diagonalizable if there exists an orthogonal matrix $\mathbf{Q}$ such that $\mathbf{A} = \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{\top}$, where the entries of the diagonal matrix $\mathbf{D}$ are the eigenvalues of $\mathbf{A}$, and the columns of $\mathbf{Q}$ are the corresponding eigenvectors.

<br>
In other words, a square $ \ \textit{ m x m } \ $ matrix $\mathbf{A}$ is said to be orthogonally diagonalizable if it is diagonalizable by means of an orthogonal matrix $\mathbf{Q}$ :

<br>
$
    \quad
    \begin{align}
        \mathbf{A} = \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{-1}
        &\qquad \Leftrightarrow \qquad 
        \mathbf{D} = \mathbf{Q}^{-1} \mathbf{A} \ \mathbf{Q}      
        \newline
        \mathbf{A} = \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{\top}
        &\qquad \Leftrightarrow \qquad  
        \mathbf{D} = \mathbf{Q}^{\top} \ \mathbf{A} \ \mathbf{Q}
    \end{align}
$


<br>
<b>Theorem</b> : Every orthogonally diagonalizable matrix is symmetric. 
<br>
<b>Proof</b> : 

<br>
$
    \quad
    \begin{align}
        \mathbf{A} &= \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{-1} = \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{\top}
        \newline \newline
        \mathbf{A}^{\top}            
        &=         
        \newline
        &= (\mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{\top})^{\top}
        = (\mathbf{Q}^{\top})^{\top} \ (\mathbf{Q} \ \mathbf{D})^{\top}
        = \mathbf{Q} \ \mathbf{D}^{\top} \ \mathbf{Q}^{\top} 
        \newline
        &= \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{\top} 
        \newline
        &= \mathbf{A}
    \end{align}
$

<br>
<b>Theorem</b> : Every symmetric matrix is orthogonally diagonalizable.
<br>
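<b>Proof</b> (sketch) : Every (real) $ \ \textit{ m x m } \ $ symmetric matrix has $\textit{m}$ real eigenvalues (counted by their multiplicities); for each eigenvalue, we can find a real eigenvector associated with it. Eigenvectors associated with distinct eigenvalues are orthogonal, and the eigenspace of each eigenvalue has dimension equal to its multiplicity, so an orthonormal basis can be chosen within it (e.g. by the Gram-Schmidt process). Collecting these $\textit{m}$ orthonormal eigenvectors as the columns of $\mathbf{Q}$ gives $\mathbf{A} = \mathbf{Q} \ \mathbf{D} \ \mathbf{Q}^{\top}$.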

<br>
We are now ready to state the main theorem regarding orthogonally diagonalizable matrices.

<br>
<b>The Spectral Theorem</b> : A (real) matrix is orthogonally diagonalizable <b>if and only if</b> it is symmetric.

<br>
Earlier, we made the easy observation that if $\mathbf{A}$ is orthogonally diagonalizable, then it is
<b>necessary</b> that $\mathbf{A}$ be symmetric. The Spectral Theorem says that the symmetry of $\mathbf{A}$ is also 
<b>sufficient</b> : a real symmetric matrix must be orthogonally diagonalizable. This is the part of the
theorem that is hard and that seems surprising, because it is not easy to see whether a matrix is diagonalizable at all.
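
<br>
The "sufficient" direction can be observed numerically with `numpy.linalg.eigh`, which is designed for symmetric (Hermitian) matrices. The matrix below is an arbitrary symmetric example, not taken from the text.

```python
import numpy as np

# Arbitrary real symmetric matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, Q = np.linalg.eigh(A)  # real eigenvalues, orthonormal eigenvectors
D = np.diag(eigvals)

print(np.allclose(Q.T @ Q, np.eye(3)))  # Q is orthogonal
print(np.allclose(Q @ D @ Q.T, A))      # A = Q D Q'
print(np.allclose(Q.T @ A @ Q, D))      # D = Q' A Q
```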

### Orthogonal Diagonalization at work

<br>
It is now easy to understand that in order to address the loss of efficiency due to heteroscedasticity and/or correlation,  we should choose a non-stochastic and non-singular matrix $\mathbf{G}$ such that $ \mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top} = \boldsymbol{c^2} \boldsymbol{\textit{I}} $ for some finite positive number $\boldsymbol{c^2}$.

<br>
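To find the desired transformation matrix $\mathbf{G}$, note that $\mathbf{\Sigma}$ is a symmetric and positive definite matrix so that it can be orthogonally diagonalized as $ \mathbf{C}^{\top} \ \mathbf{\Sigma} \ \mathbf{C} = \mathbf{\Lambda} \quad$ (equivalently $ \mathbf{\Sigma} = \mathbf{C} \ \mathbf{\Lambda} \ \mathbf{C}^{\top} $), where $\mathbf{C}$ is the orthogonal matrix whose columns are the eigenvectors corresponding to the eigenvalues on the diagonal of $\mathbf{\Lambda}$. Positive definiteness guarantees that all these eigenvalues are strictly positive, so $\mathbf{\Lambda}^{\ -1/2}$ is well defined.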

<br>
For $\mathbf{G} = \boldsymbol{c} \ \mathbf{\Sigma}^{\ -1/2} = \boldsymbol{c} \ \mathbf{C} \ \mathbf{\Lambda}^{\ -1/2} \ \mathbf{C}^{\top}$, which is symmetric (the non-symmetric choice $\mathbf{G} = \boldsymbol{c} \ \mathbf{\Lambda}^{\ -1/2} \ \mathbf{C}^{\top}$ works equally well), we have

<br>
$
    \quad
    \begin{align}
        \mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top}
        &= 
        \newline
        &= \boldsymbol{c^2} \ \mathbf{\Sigma}^{\ -1/2} \ \mathbf{\Sigma} \ (\mathbf{\Sigma}^{\ -1/2})^{\top}
        \newline
        &= \boldsymbol{c^2} \ \mathbf{\Sigma}^{\ -1/2} \ \mathbf{\Sigma} \ \mathbf{\Sigma}^{\ -1/2}
        \newline
        &= \boldsymbol{c^2} \ \boldsymbol{\textit{I}} _\textit{ m x m }
    \end{align}
$
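
<br>
A minimal sketch of this construction, assuming a small positive definite $\mathbf{\Sigma}$ invented for the example. The helper name `inv_sqrt_spd` is hypothetical, and the scaling constant is applied explicitly as $\boldsymbol{c} \ \mathbf{\Sigma}^{\ -1/2}$.

```python
import numpy as np

def inv_sqrt_spd(Sigma):
    """Symmetric inverse square root C Lambda^(-1/2) C' of an SPD matrix."""
    lam, C = np.linalg.eigh(Sigma)  # Sigma = C diag(lam) C'
    return C @ np.diag(lam ** -0.5) @ C.T

# Made-up symmetric positive definite covariance matrix.
Sigma = np.array([[4.0, 1.2, 0.0],
                  [1.2, 1.0, 0.3],
                  [0.0, 0.3, 2.0]])

Sigma_inv_sqrt = inv_sqrt_spd(Sigma)

c = 3.0                      # any finite positive scaling
G = c * Sigma_inv_sqrt

print(np.allclose(Sigma_inv_sqrt @ Sigma @ Sigma_inv_sqrt.T, np.eye(3)))
print(np.allclose(G @ Sigma @ G.T, c ** 2 * np.eye(3)))
```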


<br>
<b>Property</b> : The inverse of the orthogonally diagonalizable matrix $\mathbf{\Sigma}$. Since $\mathbf{G}$ is nonsingular, we can pre-multiply by $\mathbf{G}^{-1}$ and post-multiply by $(\mathbf{G}^{\top})^{-1}$. Note that $\mathbf{G}$ is in general not orthogonal, so $\mathbf{G}^{\top}$ cannot be replaced by $\mathbf{G}^{-1}$.

$
    \quad
    \begin{align}
        & 
            \mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top}
            = \boldsymbol{c^2} \ \boldsymbol{\textit{I}}
        \newline
        \Rightarrow \quad
        &
            \mathbf{G}^{-1} \ \big[ \mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top} \big] \ (\mathbf{G}^{\top})^{-1}
            = \mathbf{G}^{-1} \ (\boldsymbol{c^2} \ \boldsymbol{\textit{I}}) \ (\mathbf{G}^{\top})^{-1}
        \newline
        \Rightarrow \quad
        &
            \mathbf{\Sigma} 
            = \boldsymbol{c^2} \ \mathbf{G}^{-1} \ (\mathbf{G}^{\top})^{-1}
            = \boldsymbol{c^2} \ (\mathbf{G}^{\top} \ \mathbf{G})^{-1}
    \end{align}
$

$
    \quad
    \begin{align} 
        \mathbf{\Sigma}^{-1} 
        &=
        \newline
        &= 
            \big[ \boldsymbol{c^2} \ (\mathbf{G}^{\top} \ \mathbf{G})^{-1} \big]^{-1}
        \newline
        &= (\boldsymbol{c^2})^{-1} \ \mathbf{G}^{\top} \ \mathbf{G}                        
    \end{align}
$

$
    \quad
    \mathbf{G}^{\top} \ \mathbf{G} = \boldsymbol{c^2} \ \mathbf{\Sigma}^{-1} _\textit{ m x m }
$

For the symmetric choice $\mathbf{G} = \boldsymbol{c} \ \mathbf{\Sigma}^{\ -1/2}$ this can be checked directly :

$
    \quad
    \begin{align}
        \mathbf{G}^{\top} \ \mathbf{G} 
        &=
        \newline
        &= (\boldsymbol{c} \ \mathbf{\Sigma}^{\ -1/2})^{\top} \ (\boldsymbol{c} \ \mathbf{\Sigma}^{\ -1/2})
        \newline
        &= \boldsymbol{c^2} \ \mathbf{\Sigma}^{\ -1/2} \ \mathbf{\Sigma}^{\ -1/2}
        \newline
        &= \boldsymbol{c^2} \ \mathbf{\Sigma}^{-1} _\textit{ m x m }
    \end{align}
$
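
<br>
The identity $\mathbf{G}^{\top} \ \mathbf{G} = \boldsymbol{c^2} \ \mathbf{\Sigma}^{-1}$ holds for any nonsingular $\mathbf{G}$ with $\mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top} = \boldsymbol{c^2} \boldsymbol{\textit{I}}$, not only for the symmetric square root. As a sketch under assumed inputs, the code below compares the symmetric choice with a Cholesky-based one (a swapped-in alternative that the text does not discuss) on a made-up $\mathbf{\Sigma}$.

```python
import numpy as np

# Made-up symmetric positive definite covariance and scaling constant.
Sigma = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
c = 2.0
Sigma_inv = np.linalg.inv(Sigma)

# Choice 1: symmetric inverse square root from the eigendecomposition.
lam, C = np.linalg.eigh(Sigma)
G_sym = c * C @ np.diag(lam ** -0.5) @ C.T

# Choice 2: inverse of the Cholesky factor L, where Sigma = L L'.
L = np.linalg.cholesky(Sigma)
G_chol = c * np.linalg.inv(L)

for G in (G_sym, G_chol):
    print(np.allclose(G @ Sigma @ G.T, c ** 2 * np.eye(3)),
          np.allclose(G.T @ G, c ** 2 * Sigma_inv))
```

The two matrices differ, yet both whiten the errors and both give the same $\mathbf{G}^{\top} \ \mathbf{G}$, which is the only quantity the estimator depends on.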

## BLUE

<br>
The OLS estimator of this transformed specification is (by construction) BLUE :

$
    \quad
    \begin{align}
        \boldsymbol{\hat{\beta}_{GLS-1}}
        &= 
        \newline
        &=  \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
            \mathbf{X}^{\top} \ \mathbf{G}^{\top} \mathbf{G} \ \mathbf{Y}   
        \newline
        &=  \big[ \mathbf{X}^{\top} \ \boldsymbol{c^2} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \big] ^{-1} 
            \mathbf{X}^{\top} \ \boldsymbol{c^2} \ \mathbf{\Sigma}^{-1} \ \mathbf{Y}   
        \newline
        &=  \big[ \mathbf{X}^{\top} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \big] ^{-1} 
            \mathbf{X}^{\top} \ \mathbf{\Sigma}^{-1} \ \mathbf{Y}   
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}(\boldsymbol{\varepsilon^*}) 
        &= 
        \newline
        &= \mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top}
        \newline
        &= \boldsymbol{c^2} \boldsymbol{\textit{I}} _\textit{ m x m }
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}(\boldsymbol{\hat{\beta}_{GLS-1}}) 
        &=
        \newline
        &= 
            \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
            \ \mathbf{X}^{\top} \ \mathbf{G}^{\top}  
            \ (\mathbf{G} \ \mathbf{\Sigma} \ \mathbf{G}^{\top})
            \ (\mathbf{G} \ \mathbf{X}) 
            \ \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
        \newline
        &= 
            \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
            \ \mathbf{X}^{\top} \ \mathbf{G}^{\top}  
            \ (\boldsymbol{c^2} \boldsymbol{\textit{I}})
            \ (\mathbf{G} \ \mathbf{X}) 
            \ \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
        \newline
        &= 
            \boldsymbol{c^2}
            \ \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
            \ \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big]
            \ \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
        \newline
        &= \boldsymbol{c^2} \ \big[ \mathbf{X}^{\top} \ \mathbf{G}^{\top} \ \mathbf{G} \ \mathbf{X} \big] ^{-1} 
        \newline
        &= \boldsymbol{c^2} \ \big[ \mathbf{X}^{\top} \ \boldsymbol{c^2} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \big] ^{-1} 
        \newline
        &= \big[ \mathbf{X}^{\top} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \big] ^{-1} 
    \end{align}    
$

$
    \quad
    \begin{align}
        \mathrm{V}(\mathbf{Y^*}) 
        &= 
        \newline
        &= \mathrm{V}(\boldsymbol{\varepsilon^*}) 
        \newline
        & = \boldsymbol{c^2} \ \boldsymbol{\textit{I}} _\textit{ m x m }       
    \end{align}    
$

<br>
As the GLS estimator does not depend on $\boldsymbol{c}$, we can set $\boldsymbol{c} = 1$ without loss of generality, i.e. 
$ \quad \mathbf{G} = \mathbf{\Sigma}^{\ -1/2} $ .
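
<br>
The closed form can be sketched in a few lines of NumPy. The function names `gls_fit` and `transformed_ols`, the data and the covariance structure are all assumptions made for illustration; the comparison with the OLS sandwich covariance is included only to show the efficiency gain numerically.

```python
import numpy as np

def gls_fit(X, Y, Sigma):
    """GLS estimate [X' S^-1 X]^-1 X' S^-1 Y and its covariance [X' S^-1 X]^-1."""
    Si_X = np.linalg.solve(Sigma, X)   # Sigma^-1 X, without forming the inverse
    XtSiX = X.T @ Si_X                 # X' Sigma^-1 X
    # Si_X.T equals X' Sigma^-1 because Sigma is symmetric.
    beta = np.linalg.solve(XtSiX, Si_X.T @ Y)
    return beta, np.linalg.inv(XtSiX)

def transformed_ols(X, Y, G):
    """OLS on the transformed data (G X, G Y)."""
    Xs, Ys = G @ X, G @ Y
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)

m = 5
x = np.arange(1.0, m + 1)
X = np.column_stack([np.ones(m), x])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.3])  # made-up responses

# Assumed known covariance: AR(1)-type correlation, variances growing with x.
idx = np.arange(m)
R = 0.5 ** np.abs(idx[:, None] - idx[None, :])
Sigma = np.sqrt(x)[:, None] * R * np.sqrt(x)[None, :]

lam, C = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = C @ np.diag(lam ** -0.5) @ C.T

beta_gls, cov_gls = gls_fit(X, Y, Sigma)
beta_c1 = transformed_ols(X, Y, 1.0 * Sigma_inv_sqrt)
beta_c5 = transformed_ols(X, Y, 5.0 * Sigma_inv_sqrt)

# OLS sandwich covariance under the same Sigma, for comparison.
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv
gap_eigenvalues = np.linalg.eigvalsh(cov_ols - cov_gls)

print(beta_gls, beta_c1, beta_c5)
print(gap_eigenvalues)
```

The three estimates coincide whatever the value of $\boldsymbol{c}$, and the eigenvalues of the covariance gap are non-negative up to rounding, as expected for a BLUE.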

### Estimation of the variance

<br>
As usual, since the variance of the error terms $\boldsymbol{\sigma^2}$ is unobservable (the error terms being unobservable themselves), we will actually compute an estimate $\boldsymbol{s^2}$ of it, based on the regression residuals :

<br>
$
    \quad
    \boldsymbol{{s^2}_{GLS}} 
    \quad = \quad
        \dfrac{1}{m - p} 
        \ (\boldsymbol{\mathbf{Y}^{*}} - \boldsymbol{\mathbf{X}^{*}} \hat{\boldsymbol{\beta}})^{T} 
        \ (\boldsymbol{\mathbf{Y}^{*}} - \boldsymbol{\mathbf{X}^{*}} \hat{\boldsymbol{\beta}})      
    \quad = \quad 
        \dfrac{1}{m - p} 
        \ \boldsymbol{c^2}
        \ (\mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} 
        \ \mathbf{\Sigma}^{-1} 
        \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) 
$
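
<br>
Both expressions for the variance estimate can be compared numerically. The sketch below does so for a made-up data set with an assumed diagonal $\mathbf{\Sigma}$; `gls_s2_both_ways` is a hypothetical helper name.

```python
import numpy as np

def gls_s2_both_ways(X, Y, Sigma, c=1.0):
    """Return s^2 from the transformed residuals and from the Sigma^-1 form."""
    m, p = X.shape
    lam, C = np.linalg.eigh(Sigma)
    G = c * C @ np.diag(lam ** -0.5) @ C.T  # G = c * Sigma^(-1/2)
    Xs, Ys = G @ X, G @ Y
    beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)

    e_star = Ys - Xs @ beta                 # residuals of the transformed model
    s2_transformed = e_star @ e_star / (m - p)

    e = Y - X @ beta                        # residuals on the original scale
    s2_direct = c ** 2 * (e @ np.linalg.solve(Sigma, e)) / (m - p)
    return s2_transformed, s2_direct

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([0.8, 2.3, 2.9, 4.4, 4.6, 6.5])  # made-up responses
Sigma = np.diag(x)                             # assumed heteroscedastic covariance

s2_a, s2_b = gls_s2_both_ways(X, Y, Sigma, c=1.0)
s2_a2, s2_b2 = gls_s2_both_ways(X, Y, Sigma, c=2.0)
print(s2_a, s2_b)
print(s2_a2, s2_b2)
```

Unlike the estimator itself, $\boldsymbol{s^2}$ scales with $\boldsymbol{c^2}$, consistent with $\mathrm{V}(\boldsymbol{\varepsilon^*}) = \boldsymbol{c^2} \boldsymbol{\textit{I}}$.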


## Minimization Objective

<br>
The GLS estimator is the value $\boldsymbol{\hat{\beta}_{GLS-1}}$ of the parameter vector that minimizes the following criterion function :

<br>
$
    \quad
    \begin{align}
    S_{GLS}(\hat{\boldsymbol{\beta}}, \mathbf{\Sigma}) 
    &=
    \newline
    &= 
        \sum_{i=1}^{m} (\boldsymbol{ {\mathbf{e}_i}^{*} })^\boldsymbol{2}
        =  \sum _{i=1}^{m} 
        \big[ \boldsymbol{ {\mathbf{Y}_i}^{*} } - (\boldsymbol{ {\mathbf{X}_i}^{*} })^{\top} \hat{\boldsymbol{\beta}} \big] ^{2}
    \newline \newline
    &= 
        (\boldsymbol{\mathbf{Y}^{*}} - \boldsymbol{\mathbf{X}^{*}} \hat{\boldsymbol{\beta}})^{T} \ 
        (\boldsymbol{\mathbf{Y}^{*}} - \boldsymbol{\mathbf{X}^{*}} \hat{\boldsymbol{\beta}})  
    \newline \newline
    &= 
        \big[ \mathbf{G} \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) \big] ^{T} \ 
        \big[ \mathbf{G} \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) \big]
        \quad = \quad
        \big[ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} \ \mathbf{G}^{T} \big] \
        \big[ \mathbf{G} \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) \big] 
    \newline
    &=
        ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} 
        \ ( \mathbf{G}^{T} \ \mathbf{G} )
        \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) 
    \newline \newline
    &=
        \boldsymbol{c^2}
        \ (\mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} )^{T} 
        \ \mathbf{\Sigma}^{-1} 
        \ ( \mathbf{Y} - \mathbf{X} \ \hat{\boldsymbol{\beta}} ) 
    \end{align}
$

$
    \quad
    \begin{align}
    S_{OLS}(\hat{\boldsymbol{\beta}}) 
    &=
    \newline
    &= 
        \sum_{i=1}^{m} \boldsymbol{{\mathbf{e}_i}^2}
        = \sum _{i=1}^{m}(\boldsymbol{\mathbf{Y}_i} - \hat{\mathbf{Y}}_\boldsymbol{i})^{2} 
    \newline
    &= (\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{T} \ (\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}})  
    \end{align}
$

<br>
Compared with its OLS analogue, this criterion function is a weighted sum of squared residuals, with weights given by $\mathbf{\Sigma}^{-1}$, and hence a <b>generalized version of the standard OLS criterion function</b>.
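
<br>
A short worked derivation may help connect this criterion function to the closed-form estimator obtained earlier. Writing $\mathbf{b}$ (a symbol introduced here only for illustration) for a generic candidate value of the parameter vector, the first-order condition reads :

<br>
$
    \quad
    \begin{align}
        S_{GLS}(\mathbf{b}, \mathbf{\Sigma})
        &= 
            \boldsymbol{c^2}
            \ (\mathbf{Y} - \mathbf{X} \ \mathbf{b})^{T} 
            \ \mathbf{\Sigma}^{-1} 
            \ (\mathbf{Y} - \mathbf{X} \ \mathbf{b})
        \newline \newline
        \dfrac{\partial S_{GLS}}{\partial \mathbf{b}}
        &= 
            - 2 \ \boldsymbol{c^2} \ \mathbf{X}^{T} \ \mathbf{\Sigma}^{-1} \ (\mathbf{Y} - \mathbf{X} \ \mathbf{b}) 
            = 0
        \newline \newline
        \Rightarrow \quad
        \mathbf{X}^{T} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \ \mathbf{b}
        &= \mathbf{X}^{T} \ \mathbf{\Sigma}^{-1} \ \mathbf{Y}
        \newline \newline
        \Rightarrow \quad
        \mathbf{b}
        &= 
            \big[ \mathbf{X}^{T} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} \big] ^{-1} 
            \mathbf{X}^{T} \ \mathbf{\Sigma}^{-1} \ \mathbf{Y}
    \end{align}
$

<br>
Since the Hessian $ 2 \ \boldsymbol{c^2} \ \mathbf{X}^{T} \ \mathbf{\Sigma}^{-1} \ \mathbf{X} $ is positive definite when $\mathbf{X}$ has full column rank, this stationary point is the unique minimizer; the constant $\boldsymbol{c^2}$ plays no role in its location.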

## Feasible Generalized Least Squares

<br>
The biggest disadvantage of generalized least squares is that the method assumes the covariance matrix of the disturbance terms is known a priori. This is almost never the case in real applications, since the error terms are unobservable. It is therefore necessary to compute an estimate of this covariance matrix. 

<br>
In the next notebooks we will see a special case of GLS called Weighted Least Squares, and how to estimate the covariance matrix of the original disturbance terms in order to actually implement these two methods.


## References

<br>
<ul style="list-style-type:square">
    <li>
        National Taiwan University - Chung-Ming Kuan - 
        <a href="https://bit.ly/2IJ3hmG">
        Generalized Least Squares Theory</a>
    </li>
    <br>
    <li>
        Carleton University - Ba Chu - 
        <a href="https://bit.ly/2J9PpB4">
        Generalized Least Squares Theory</a>
    </li>
    <br>
    <li>
        Washington University in St. Louis - Ron Freiwald - 
        <a href="https://bit.ly/2s9oD1g">
        Orthogonally Diagonalizable Matrices</a>
    </li>
    <br>
    <li>
        The University of Manchester - MATH10212 - Peter J. Eccles - 
        <a href="https://bit.ly/2IK3mCi">
        Linear Algebra : Brief Lecture Notes, Note 10</a>
    </li>
    <br>
    <li>
        Wolfram MathWorld - 
        <a href="https://bit.ly/2Lscu09">
        Diagonalizable Matrix</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2GK4LXL">
        Diagonalizable Matrix</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2vrr0wU">
        Orthogonal Matrix</a>
    </li>
</ul>