# LINEAR REGRESSION
<br>


## Introduction

<br>
In statistics, linear regression is a linear approach to modeling the relationship between a dependent variable $\mathbf{Y}$ and one or more independent variables $\mathbf{X}$. The case of one explanatory variable is called simple linear regression, whereas the case of more than one explanatory variable is called multiple linear regression.

<br>
In this series of notebooks about linear regression we will use the following notation and terminology : <br>

<ul style="list-style-type:square">
    <li>
        $\mathbf{Y}$ is a $ \ \textit{m x 1} \ $ vector of $m$ observations on the dependent variable, equivalently called endogenous
        variable, regressand, response variable, explained variable etc.
    </li>
    <br>
    <li>
        $\mathbf{X}$ is a $ \ \textit{m x p} \ $ matrix of $m$ observations on $p$ independent variables, also known as exogenous
        variables, regressors, input variables, explanatory variables, covariates, predictor variables etc. Since our model will
        usually contain a constant term, the first column in $\mathbf{X}$ will contain only ones; this column should be treated
        exactly the same as any other column in the matrix
    </li> 
    <br>
    <li>
        $\boldsymbol{\beta}$ is a $ \ \textit{p x 1} \ $ vector of unknown population parameters (or effects) that we want to
        estimate; these estimates are often called regression coefficients
    </li>    
    <br>
    <li>
        $\boldsymbol{\varepsilon}$ is a $ \ \textit{m x 1} \ $ vector of disturbances, equivalently called error terms or noise
    </li>
</ul> 


<br>
The linear regression model assumes that the relationship between the dependent variable <b>$\mathbf{Y}$</b> and the independent variables <b>$\mathbf{X}$</b> can be described by a function that is linear in the population parameters : 

<br>
$
    \quad
    \begin{bmatrix}
        Y_1 \\
        Y_2 \\
        \vdots \\
        \vdots \\
        Y_m
    \end{bmatrix}_\textit{ m x 1}
    \quad = \qquad
    \begin{bmatrix}
        1      & X_{12} & X_{13} & \dots & X_{1p} \\
        1      & X_{22} & X_{23} & \dots & X_{2p} \\
        \vdots & \vdots & \vdots & \dots & \vdots \\
        \vdots & \vdots & \vdots & \ddots & \vdots \\
        1      & X_{m2} & X_{m3} & \dots & X_{mp} \\
    \end{bmatrix}_\textit{ m x p}  
    \quad 
    \begin{bmatrix}
        \beta_1 \\
        \beta_2 \\
        \vdots \\
        \vdots \\
        \beta_p
    \end{bmatrix}_\textit{ p x 1}
    \quad + \quad
    \begin{bmatrix}
        \varepsilon_1 \\
        \varepsilon_2 \\
        \vdots \\
        \vdots \\
        \varepsilon_m
    \end{bmatrix}_\textit{ m x 1}
$


<br>
$ 
    \quad
    \mathbf{Y}  
        \ = \ f(\mathbf{X}) + \boldsymbol{\varepsilon}
        \ = \ \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} 
    \qquad \text{or} \qquad
    \boldsymbol{\mathbf{Y}_i}
        \ = \ f(\boldsymbol{\mathbf{X}_i}) + \boldsymbol{\varepsilon_i}
        \ = \ \sum_{j=1}^{p} \boldsymbol{\beta_j \mathbf{X}_{ij}} + \boldsymbol{\varepsilon_i}
        \ = \ \boldsymbol{\mathbf{X}_i}^{\top} \boldsymbol{\beta} + \boldsymbol{\varepsilon_i}
        \ = \ \mathbf{E} [\boldsymbol{\mathbf{Y}_i} \mid \boldsymbol{\mathbf{X}_i}] + \boldsymbol{\varepsilon_i}
$ 
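
<br>
As a minimal sketch, the two forms of the equation can be checked against each other on simulated data. The sizes, the parameter values in `beta`, and the noise scale are hypothetical choices made only for illustration; NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

m, p = 100, 3                    # m observations, p columns (constant included)

# Design matrix: the first column holds only ones, the remaining p - 1
# columns hold the observed regressors.
X = np.column_stack([np.ones(m), rng.normal(size=(m, p - 1))])

beta = np.array([2.0, 0.5, -1.0])      # hypothetical population parameters (p x 1)
eps = rng.normal(scale=0.3, size=m)    # disturbances (m x 1)

# Matrix form: Y = X beta + eps
Y = X @ beta + eps

# Observation form: Y_i = sum_j beta_j X_ij + eps_i
i = 4
Y_i = sum(beta[j] * X[i, j] for j in range(p)) + eps[i]

print(X.shape, beta.shape, Y.shape)
print(np.isclose(Y[i], Y_i))
```

The constant term needs no special handling: it is simply the coefficient on the column of ones.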

<br>
Both forms express the Population Regression Equation (PRE), which consists of two parts : <br>

<ul style="list-style-type:square">
    <li>
        the Population Regression Function (<b>PRF</b>, also known as the systematic component) 
        $ \ f(\mathbf{X}) \ $,
        which describes the relationship between the dependent and the independent variables as a linear (with respect to the
        unknown population parameters) function of the independent variables $\mathbf{X}$        
    </li>
    <br>
    <li>
        the disturbance term $\boldsymbol{\varepsilon_i}$ (also known as error term or stochastic component), which accounts for
        the deviation of the observed value $\boldsymbol{\mathbf{Y}_i}$ from its expected value (the value of the PRF for the
        corresponding $\boldsymbol{\mathbf{X}_i}$)
    </li>
</ul> 

<br>
Omitting the disturbance is equivalent to saying that <b>$\mathbf{Y}$</b> follows this linear combination with no deviation at all from its expected value; in other words, that the data behave deterministically. In this series of notebooks we will not examine deterministic relationships, for which an equation describes the data exactly. Instead, we are interested in statistical relationships, for which the relationship between the variables is not perfect. <br>

<br>
The sample regression function (<b>SRF</b>) is the sample counterpart of the population regression function (PRF); since the SRF is obtained from a given sample, each new sample will generate different estimates. The fitted model thus takes the form 
$ \ \mathbf{\hat{Y}} = \mathbf{X} \boldsymbol{\hat{\beta}} $ .
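
<br>
The sketch below illustrates the sample dependence of the SRF. It assumes the estimates are obtained by ordinary least squares (via `np.linalg.lstsq`), which this section does not itself prescribe; the helper names `draw_sample` and `fit_srf` and the population parameters are hypothetical.

```python
import numpy as np

def draw_sample(rng, m, beta, sigma=1.0):
    """Draw one sample (X, Y) from the hypothetical population Y = X beta + eps."""
    X = np.column_stack([np.ones(m), rng.normal(size=(m, len(beta) - 1))])
    Y = X @ beta + rng.normal(scale=sigma, size=m)
    return X, Y

def fit_srf(X, Y):
    """Least-squares estimate of beta for the given sample (assumed estimator)."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat

rng = np.random.default_rng(1)
beta = np.array([1.0, 2.0])          # hypothetical population parameters

# Two independent samples from the same population.
X_a, Y_a = draw_sample(rng, 50, beta)
X_b, Y_b = draw_sample(rng, 50, beta)

beta_hat_a = fit_srf(X_a, Y_a)
beta_hat_b = fit_srf(X_b, Y_b)

Y_hat_a = X_a @ beta_hat_a           # fitted values: Y_hat = X beta_hat

print("sample A:", beta_hat_a)
print("sample B:", beta_hat_b)
```

Both estimates should land near the population parameters, yet they are not identical, because each one depends on the particular sample drawn.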

<br>
Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of $\mathbf{Y}$ given $\mathbf{X}$, rather than on the joint probability distribution of the two, which is the domain of multivariate analysis.

## Interpretation

<br>
A fitted linear regression model can be used to identify the relationship between a single predictor variable $\boldsymbol{\mathbf{X}_j}$ and the response variable $\mathbf{Y}$ when all the other predictor variables in the model are "held fixed". 

<br>
Specifically, the interpretation of $\boldsymbol{\beta_j}$ is the expected change in the response variable for a one-unit change in $\boldsymbol{\mathbf{X}_j}$ when the other covariates are held fixed; in other words, the expected value of the partial derivative of $\mathbf{Y}$ with respect to $\boldsymbol{\mathbf{X}_j}$. This is sometimes called the unique effect of $\boldsymbol{\mathbf{X}_j}$ on $\mathbf{Y}$. In contrast, the marginal effect of $\boldsymbol{\mathbf{X}_j}$ on $\mathbf{Y}$ can be assessed using a correlation coefficient or simple linear regression model relating only $\boldsymbol{\mathbf{X}_j}$ to $\mathbf{Y}$; this effect is the total derivative of $\mathbf{Y}$ with respect to $\boldsymbol{\mathbf{X}_j}$.

<br>
It is possible that the unique effect can be nearly zero even when the marginal effect is large. This may imply that some other covariate "captures" all the information in $\boldsymbol{\mathbf{X}_j}$, so that once that variable is in the model, there is no contribution of $\boldsymbol{\mathbf{X}_j}$ to the variation in the response variable. 
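
<br>
This situation can be reproduced with a small simulation. In the sketch below, the data generating process is hypothetical: `y` depends on `x1` only, while `x2` is a noisy copy of `x1`. Least squares is assumed as the estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 2000

x1 = rng.normal(size=m)
x2 = x1 + 0.5 * rng.normal(size=m)     # x2 is strongly correlated with x1
y = 3.0 * x1 + rng.normal(size=m)      # y depends on x1 only (hypothetical DGP)

ones = np.ones(m)

# Marginal effect of x2: simple regression of y on x2 alone.
marginal, *_ = np.linalg.lstsq(np.column_stack([ones, x2]), y, rcond=None)

# Unique effect of x2: multiple regression with x1 also in the model.
unique, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)

print("marginal slope of x2:", marginal[1])
print("unique slope of x2:  ", unique[2])
```

On its own, `x2` appears to have a strong effect on `y`; once `x1` is in the model, the coefficient on `x2` falls to roughly zero, because `x1` already carries the information that `x2` shares with `y`.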

<br>
Conversely, the unique effect of $\boldsymbol{\mathbf{X}_j}$ can be large while its marginal effect is nearly zero. This would happen if the other covariates explain a great deal of the variation of $\mathbf{Y}$, but mainly in a way that is complementary to what is captured by $\boldsymbol{\mathbf{X}_j}$. In this case, including the other variables in the model reduces the part of the variability of $\mathbf{Y}$ that is unrelated to $\boldsymbol{\mathbf{X}_j}$, thereby strengthening the apparent relationship with the latter.
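
<br>
A sketch of this converse case follows. As an illustrative choice, the strength of each relationship is measured with a correlation coefficient: the plain correlation for the marginal relationship, and the correlation of residuals after regressing out the other covariate (a partial correlation) for the relationship once that covariate is accounted for. The data generating process and the helper `residualize` are hypothetical.

```python
import numpy as np

def residualize(v, Z):
    """Return the part of v left after a least-squares regression on Z."""
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef

rng = np.random.default_rng(3)
m = 2000

x1 = rng.normal(size=m)
x2 = rng.normal(size=m)                              # independent of x1
y = 0.5 * x1 + 10.0 * x2 + 0.1 * rng.normal(size=m)  # hypothetical DGP

# Marginal association: plain correlation between x1 and y.
marginal_corr = np.corrcoef(x1, y)[0, 1]

# Association after accounting for x2: correlate what is left of x1 and y
# once x2 (and a constant) has been regressed out of both.
Z = np.column_stack([np.ones(m), x2])
partial_corr = np.corrcoef(residualize(x1, Z), residualize(y, Z))[0, 1]

print("marginal correlation of x1 with y:", marginal_corr)
print("partial correlation given x2:     ", partial_corr)
```

Here `x2` accounts for almost all of the variation in `y`, so the contribution of `x1` is barely visible marginally. Removing the variation explained by `x2` exposes a nearly exact linear relationship between `x1` and what remains of `y`.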

<br>
Care must be taken when interpreting regression results. The notion of a "unique effect" is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of a predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variable when the predictors are correlated with each other and are not assigned following a study design.


## Errors vs Residuals

<br>
In regression analysis, the distinction between errors and residuals is subtle and important.<br>

<ul style="list-style-type:square">
    <li>
        <b>$\boldsymbol{\varepsilon}$</b> : 
        the error (or disturbance) term is the difference between the observed value of <b>$\mathbf{Y}$</b> and its true
        (unobserved) expected value under the PRF. It represents factors other than $\mathbf{X}$ that affect <b>$\mathbf{Y}$</b>; 
        it pertains to the true data generating process (DGP) and is therefore not observable.
    </li>
    <br>
    <li>
      <b>e</b> : 
      the residual is the difference between the observed value and the estimated (fitted) value of <b>$\mathbf{Y}$</b> ($\mathbf{e} = \mathbf{Y} - \mathbf{\hat{Y}}$)
    </li>
</ul>

<br>
Assumptions like normality, homoscedasticity, and independence all apply to the error terms of the DGP, not to the model residuals, and it is important to remember that $\boldsymbol{\varepsilon \neq \mathbf{e}}$. <br>
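
<br>
The distinction can be made concrete with simulated data, where the errors are known only because we generate them ourselves. The sketch below uses hypothetical parameter values and assumes a least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 30

X = np.column_stack([np.ones(m), rng.normal(size=m)])
beta = np.array([1.0, 2.0])      # hypothetical true parameters
eps = rng.normal(size=m)         # errors: known here only because we simulate
Y = X @ beta + eps

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat             # residuals: computable from the data alone

# eps - e = (Y - X beta) - (Y - X beta_hat) = X (beta_hat - beta),
# so the two coincide only if the estimate equals the true parameters.
gap = eps - e
print(np.allclose(gap, X @ (beta_hat - beta)))
print(np.max(np.abs(gap)))
```

With real data only `e` is available; `eps` would require knowing the true `beta`.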

<br>
In practice we use the residuals to assess and choose models (do they look uncorrelated? do they have constant variance?), but we must remember throughout that the residuals are only constructs of the data and of the estimated parameters. 

<br>
Unobserved exogenous variables are sometimes called "disturbances" or "errors"; they represent factors omitted from the
model but judged to be relevant for explaining the behavior of variables in the model. Background factors in structural equations differ fundamentally from residual terms in regression equations. 

<br>
The former are part of physical reality and are responsible for variations observed in the data. They are treated like any other variable, though we often cannot measure their values precisely and must resign ourselves to merely acknowledging their existence and assessing qualitatively how they relate to other variables in the system.

<br>
The latter are artifacts of analysis which, by definition, are uncorrelated with the regressors. 
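
<br>
The contrast can be sketched with a simulation in which a background factor is omitted from the fitted model. The structural equation, the factor `z`, and all numeric values below are hypothetical; the fit assumes least squares with a constant term, under which the residuals are orthogonal to the regressors by construction.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 500

x = rng.normal(size=m)
z = 0.8 * x + rng.normal(size=m)      # background factor, omitted from the model
eps = z + 0.5 * rng.normal(size=m)    # disturbance absorbs the omitted factor
y = 1.0 + 2.0 * x + eps               # hypothetical structural equation

X = np.column_stack([np.ones(m), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat                  # regression residuals

corr_error = np.corrcoef(x, eps)[0, 1]   # disturbance vs regressor
corr_resid = np.corrcoef(x, e)[0, 1]     # residual vs regressor

print("corr(x, eps):", corr_error)
print("corr(x, e):  ", corr_resid)
print("X^T e:       ", X.T @ e)
```

The disturbance is clearly correlated with the regressor, while the residuals are uncorrelated with it up to rounding error. The residuals therefore cannot reveal that the background factor is related to `x`.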


## References

<br>
<ul style="list-style-type:square">
    <li>
         University of Valencia - Ezequiel Uriel - 
         <a href="https://bit.ly/2x9cSh6">
         The simple regression model : estimation and properties</a>        
    </li>
    <br>
    <li>
        New York University - 
        <a href="https://stanford.io/2KMHCGL">
        OLS in Matrix Form</a>        
    </li>
    <br>
    <li>
        Pennsylvania State University - STAT 501 - 
        <a href="https://bit.ly/2J28hlM">
        What is Simple Linear Regression?</a>
    </li>
    <br>
    <li>
        Research Gate - 
        <a href="https://bit.ly/2s67hDa">
        What is the difference between error terms and residuals in econometrics (or in regression models)?</a>
    </li>
    <br>
    <li>
        Rick H. Hoyle - <a href="https://bit.ly/2KPkLtZ">
        Handbook of Structural Equation Modeling</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/1LmnkPf">
        Linear regression</a>
    </li>    
</ul>
