<a href="https://colab.research.google.com/github/edoardochiarotti/class_datascience/blob/main/2024/06_Linear-Regression-Model/06_Linear_regression_model.ipynb"
   target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Linear Regression Model

<img src="https://i.imgflip.com/83mjm6.jpg" width="500">

In [None]:
# PACKAGES
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import random as rd
import statistics as st
import pandas as pd
import os
import statsmodels.api as sm
import re
!pip install stargazer
from stargazer.stargazer import Stargazer

# FUNCTIONS FROM PACKAGES
from numpy.linalg import inv
from sklearn.linear_model import LinearRegression

# SEABORN THEME
scale = 0.4
W = 16*scale
H = 9*scale
sns.set(rc = {'figure.figsize':(W,H)})
sns.set_style("white")

- In this class we'll cover the Linear Regression Model (LRM), and we'll estimate its parameters with the Ordinary Least Squares (OLS) estimator. We'll go over the theory first, and then we'll apply the theory to data in Python.
- This class is based on several **sources**. The main ones are:
    - Class notes by Nicolas Berman and Daniele Rinaldo at The Geneva Graduate Institute.
    - Textbook [Econometric Analysis](https://www.amazon.com/Econometric-Analysis-William-H-Greene/dp/0135132452) by William H. Greene (sixth edition, Pearson).
    - Fiona Burlig's class [ARE 212: Multiple equation estimation](https://www.fionaburlig.com/teaching/are212).
    - Matteo Courthoud's class [Machine Learning for Economics](https://matteocourthoud.github.io/course/ml-econ/).

## Outline
- [The Mean Model (Recap)](#The-Mean-Model-(Recap))
- [The Linear Regression Model](#The-Linear-Regression-Model)
- [The Linear Regression Model in Matrix Notation](#The-Linear-Regression-Model-in-Matrix-Notation)
- [The Ordinary Least Squares (OLS) Estimator](#The-Ordinary-Least-Squares-(OLS)-Estimator)
- [Goodness of Fit with the OLS Estimator](#Goodness-of-Fit-with-the-OLS-Estimator)
- [Hypothesis Testing with the OLS Estimator](#Hypothesis-Testing-with-the-OLS-Estimator)
- [The OLS Estimator in Python](#The-OLS-Estimator-in-Python)
    - [Data](#Data)
    - [OLS Canned Routine](#OLS-Canned-Routine)
    - [Visualizing Estimated Fitted Values and Estimated Residuals](#Visualizing-Estimated-Fitted-Values-and-Estimated-Residuals)
    - [Interpretation of Regression Results](#Interpretation-of-Regression-Results)

## The Mean Model (Recap) <a name="The-Mean-Model-(Recap)"></a>
- The population of a variable of interest can be represented by a probability distribution. In the mean model of the population, we assume that this probability distribution is normal with mean $\beta$ and variance $\sigma^2$. For example, the distribution of CO2 emissions per capita of all countries can be represented by a normal distribution with mean, say, 5 tonnes per capita (or more generally $\beta$) and variance, say, 1 tonne per capita (or more generally $\sigma^2$). In concise notation, the **mean model of the population**:
<br><br>
$$
y \sim \mathcal{N}(\beta,\,\sigma^2)
$$ 
<br>
- We can re-express the population model in regression form with the population regression equation (exact same model). In a nutshell, we consider $y$ as having a deterministic component, or mean component, given by $\beta$, and a random component, which we can refer to as $\epsilon$:
<br><br>
$$
y = \beta + \epsilon \qquad \text{with } \epsilon \sim \mathcal{N}(0,\,\sigma^2)
$$
<br>
- Let's consider now our random sample $(y_1, ..., y_N)$. We assume that each observation of the random sample is generated by an underlying process (Data Generating Process, DGP) described by the distribution of the population, i.e. a normal distribution with mean $\beta$ and variance $\sigma^2$. For example, let's think that we are about to draw our sample of countries for which we have data on CO2 emissions per capita. We assume that there is a 95% chance that CO2 emissions per capita of the first country in our sample will be between 3 tonnes ($5-1\times2$) and 7 tonnes ($5+1\times2$). In other words, we assume that the DGP for the first country follows the distribution of the population, i.e. a normal distribution with mean 5 (or more generally $\beta$) and variance 1 (or more generally $\sigma^2$). 
- Good, now let's generalize this and make the same assumptions for all countries that we will observe in our sample. In concise notation, the **mean model of the DGP** (in linear-regression form):
<br><br>
$$
y_i = \beta + \epsilon_i \qquad \text{with } \epsilon_i \sim \mathcal{N}(0,\,\sigma^2), \, \, \text{for } i=1,...,N
$$
<br>
- In addition, the exact same model of the DGP can be re-expressed by explicitly listing all the assumptions we make on the process that generates the data (less concise notation). For $i=1,...,N$:
    1. **Linearity**: $ y_i = \beta + \epsilon_i $
    2. **Zero mean**: $ \mathrm{E}(\epsilon_i)=0$
    3. **Homoscedasticity and nonautocorrelation**: each disturbance $\epsilon_i$ has the same finite variance $\sigma^2$ and is uncorrelated with every other disturbance $\epsilon_j$ (i.e. the random components of the random sample are independent and identically distributed, i.i.d.)
    4. **Normal distribution**: the disturbances are normally distributed.
- We can express the mean model of the data generating process in matrix notation by defining $\boldsymbol{x}=(1, ..., 1)'$: $\boldsymbol{y} = \boldsymbol{x}\beta + \boldsymbol{\epsilon}$ with
<br>
$\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_N,\,\sigma^2\boldsymbol{I}_N)$.
- When we do not know $\beta$ and $\sigma^2$, we need to estimate them using estimators. The generic estimators for these coefficients can be denoted as $\hat{\beta}$ and $\hat{\sigma}^2$, with generic predictions $\hat{\boldsymbol{y}}=\boldsymbol{x}\hat{\beta}$ and generic residuals $\hat{\boldsymbol{\epsilon}}=\boldsymbol{y} - \boldsymbol{x}\hat{\beta}$. Specifically, always defining $\boldsymbol{x}=(1, ..., 1)'$, we have seen that we can estimate $\beta$ with the sample-mean estimator $\hat{\beta}_{SM} = (\boldsymbol{x}'\boldsymbol{x})^{-1}(\boldsymbol{x}'\boldsymbol{y}) \sim \mathcal{N}(\beta,\sigma^2(\boldsymbol{x}'\boldsymbol{x})^{-1})$, which gives sample-mean predictions $\hat{\boldsymbol{y}}_{SM}=\boldsymbol{x}\hat{\beta}_{SM}$ and sample-mean residuals $\hat{\boldsymbol{\epsilon}}_{SM}=\boldsymbol{y} - \boldsymbol{x}\hat{\beta}_{SM}$. Furthermore, $\sigma^2$ can be estimated with the sample-variance estimator $\hat{\sigma}^2_{SV} = \frac{\hat{\boldsymbol{\epsilon}}_{SM}\,'\hat{\boldsymbol{\epsilon}}_{SM}}{N-1}$ with $\frac{N-1}{\sigma^2}\hat{\sigma}^2_{SV} \sim \mathcal{\chi}^2_{N-1}$. We can then use the central limit theorem and the theory of statistical tests to test whether the estimates from $\hat{\beta}_{SM}$ and $\hat{\sigma}^2_{SV}$ are statistically different from zero.
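- A quick numerical check of the recap, as a minimal sketch on simulated data (the seed, the sample size and the values $\beta=5$, $\sigma^2=1$ are made up for illustration and only mirror the CO2 example): the matrix formula with $\boldsymbol{x}=(1, ..., 1)'$ should reproduce the usual sample mean and sample variance.

```python
import numpy as np

# Simulated sample (assumed values, for illustration only)
rng = np.random.default_rng(0)
N = 1_000
beta, sigma2 = 5.0, 1.0
y = beta + rng.normal(0.0, np.sqrt(sigma2), size=N)

# x is the N x 1 column of ones
x = np.ones((N, 1))

# Sample-mean estimator in matrix form: (x'x)^{-1} x'y
beta_hat_sm = np.linalg.inv(x.T @ x) @ (x.T @ y)

# Sample-mean residuals and sample-variance estimator (N - 1 degrees of freedom)
resid_sm = y - x @ beta_hat_sm
sigma2_hat_sv = (resid_sm @ resid_sm) / (N - 1)

print(beta_hat_sm[0], y.mean())        # matrix formula vs. plain sample mean
print(sigma2_hat_sv, y.var(ddof=1))    # matrix formula vs. plain sample variance
```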

<img src="https://i.imgflip.com/83mkck.jpg" width="300">

## The Linear Regression Model <a name="The-Linear-Regression-Model"></a>
- When we expressed the mean model of the population in regression form, we split the variable of interest (say CO2 emissions per capita) into a deterministic component, given by its mean (say, 5 tonnes per capita, or more generally $\beta$), and a random component, given by a disturbance with mean 0 and variance, say, 1 (or more generally $\sigma^2$).
- In the linear regression model, we say that the deterministic component, or mean component, of $y$ is no longer simply equal to $\beta$, but is a linear function of another variable, which we call $x$. For example, we can say that, on average (as we are talking about the mean component), CO2 emissions per capita equal a linear function of GDP per capita. The (univariate) **linear regression model of the population**, with the split between this new deterministic component and the random component, can be written as follows:
<br><br>
$$
y = \beta_0 + x\beta_1 + \epsilon \qquad \text{with } \epsilon|x \sim \mathcal{N}(0,\,\sigma^2)
$$
<br>
- OK let's spot the differences with the mean model of the population. First, the mean component is no longer $\beta$, but it's $\beta_0 + x\beta_1$. Second, we now have a new variable $x$, which is a random variable. Third, given that we are now saying that $y$ depends on $x$, the distribution of $y$ is now a conditional distribution on $x$. Or, equivalently, the distribution of the random part of $y$, namely $\epsilon$, is conditional on $x$. 
- Quick note: remember that the same thing of what we just wrote can be written without the regression form, like this: $y|x \sim \mathcal{N}(\beta_0+x\beta_1,\sigma^2)$. Right? Goood.
- OK now let's think that we'll have to work with samples and data (with the final goal of estimating our population parameters). As usual, let's make some assumptions on how our sample of data will be generated. The underlying process that generates our sample of data can be described by the following **linear regression model of the DGP**:
<br><br>
$$
y_i = \beta_0 + x_i\beta_1 + \epsilon_i \qquad \text{with } \epsilon_i|x_{i} \sim \mathcal{N}(0,\,\sigma^2), \, \, \text{for } i=1,...,N
$$
<br>
- As before, the model of the data generating process can be re-expressed by explicitly listing all the assumptions we make on the data generating process (less concise notation). For $i=1,...,N$:
    1. **Linearity**: $ y_i = \beta_0 + x_{i}\beta_1 + \epsilon_i $
    2. **Zero mean, or exogeneity of the independent variable**: $ \mathrm{E}(\epsilon_i|x_i)=0$.  This means that the independent variable will not carry useful information for prediction of the disturbance.
    3. **Homoscedasticity and nonautocorrelation**: each disturbance $\epsilon_i$ has the same finite conditional variance $\sigma^2$ and is uncorrelated with every other disturbance $\epsilon_j$ (i.e. the random components of the random sample are independent and identically distributed, i.i.d.)
    4. **Data generation**: $x_i$ is a random variable.
    5. **Normal distribution**: the disturbances are normally distributed conditional on $x_i$, i.e. $\epsilon_i|x_i \sim \mathcal{N}(0,\sigma^2)$.
- The following graph conceptualizes what we have just written for a random sample of 3 observations $(y_1,y_2,y_3)$:

<img src="https://i.ibb.co/7VjDL5D/Screen-Shot-2023-10-24-at-15-37-31.png" width="800">

## The Linear Regression Model in Matrix Notation <a name="The-Linear-Regression-Model-in-Matrix-Notation"></a>
- As we did for the mean model, we can express the linear regression model of the data generating process (not of the population) in matrix notation:
<br><br>
$$
\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \qquad \text{with } \boldsymbol{\epsilon}|\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{0}_N,\sigma^2\boldsymbol{I}_N)
$$
<br>
- Let's take a moment to think about what we are writing. $\boldsymbol{y}$ is a N-dimensional random vector of the about-to-be-observed data of the variable of interest (for example, CO2 emissions per capita). $\boldsymbol{X}$ is a $N\times2$ matrix, with the first column being a N-dimensional constant vector of 1s, i.e. $(1,...,1)'$, and the second column being the N-dimensional random vector of the about-to-be-observed data of our predictor, i.e. $(x_1,...,x_N)'$ (for example GDP per capita). $\boldsymbol{\beta}$ is a 2-dimensional  vector of coefficients $(\beta_0, \beta_1)'$. $\boldsymbol{X\beta}$ represents the deterministic component of $\boldsymbol{y}$. $\boldsymbol{\epsilon}$ is a N-dimensional random vector of disturbances $(\epsilon_1,...,\epsilon_N)'$. You should write this down and expand the matrices, as we did for the sample-mean model.
- Let's then re-write the linear regression model by explicitly mentioning all the assumptions, this time using matrix notation:
    1. **Linearity**: $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
    2. **Full rank**: $\boldsymbol{X}$ is full rank, i.e. there is no exact linear relationship among any of the independent variables in the model.
    3. **Zero mean, or exogeneity of the independent variables**: $ \mathrm{E}(\boldsymbol{\epsilon}|\boldsymbol{X})=\boldsymbol{0}_N$. This implies $\mathrm{E}(\boldsymbol{y}|\boldsymbol{X}) = \boldsymbol{X}\boldsymbol{\beta}$.
    4. **Homoscedasticity and nonautocorrelation**: $\mathrm{V} (\boldsymbol{\epsilon}|\boldsymbol{X})=\sigma^2\boldsymbol{I}_N$. If we consider assumption 3, this assumption can be re-written as $\mathrm{E}(\boldsymbol{\epsilon}\boldsymbol{\epsilon}'|\boldsymbol{X}) = \sigma^2\boldsymbol{I}_N$ (note the outer product $\boldsymbol{\epsilon}\boldsymbol{\epsilon}'$, which is an $N\times N$ matrix).
    5. **Data generation**: $\boldsymbol{X}$ may be fixed or random.
    6. **Normal distribution**: $\boldsymbol{\epsilon}|\boldsymbol{X}$ is normally distributed.
- Note that assumptions 3, 4 and 6 together can be written using the following concise notation: $\boldsymbol{\epsilon}|\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{0}_N,\sigma^2\boldsymbol{I}_N)$.
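- To make the matrix notation concrete, here is a minimal sketch that simulates one sample from this DGP. The coefficient values, the disturbance standard deviation and the uniform distribution of the predictor are all made-up assumptions, not values from the data we use later.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
beta = np.array([1.0, 0.5])    # hypothetical (beta_0, beta_1)
sigma = 0.3                    # hypothetical standard deviation of the disturbances

x = rng.uniform(0.0, 10.0, size=N)        # random predictor (assumption 5)
X = np.column_stack([np.ones(N), x])      # N x 2 matrix: column of 1s and predictor
eps = rng.normal(0.0, sigma, size=N)      # drawn independently of X (assumptions 3, 4, 6)
y = X @ beta + eps                        # linearity (assumption 1)

print(X.shape)                     # dimensions of X: N rows, 2 columns
print(np.linalg.matrix_rank(X))    # rank 2 means full column rank (assumption 2)
```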

## The Ordinary Least Squares (OLS) Estimator <a name="The-Ordinary-Least-Squares-(OLS)-Estimator"></a>
- $\boldsymbol{\beta}$ and $\sigma$ are unknown parameters that we want to estimate. As per usual, we'll estimate them using some estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}$.
- For generic estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}$ (any estimator), we can write the **generic predicted values** for $\boldsymbol{y}$, namely $\hat{\boldsymbol{y}}$, and the **generic residuals** $\hat{\boldsymbol{\epsilon}}$:
<br><br>
$$
\hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}
$$
<br>
$$
\hat{\boldsymbol{\epsilon}}=\boldsymbol{y} - \hat{\boldsymbol{y}}=\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}}
$$
<br>
- This should look familiar. It's very similar to the generic predicted values and generic residuals of the sample-mean model, with the difference that now instead of having $\boldsymbol{x}=(1, ..., 1)'$, we have $\boldsymbol{X}$, which is a $N\times2$ matrix, with the first column being a N-dimensional constant vector of 1s, i.e. $(1, ..., 1)'$, and the second column being the N-dimensional random vector of the about-to-be-observed data of our predictor, i.e. $\boldsymbol{x}=(x_1, ..., x_N)'$.
- The Ordinary Least Squares (OLS) estimator finds the $\hat{\boldsymbol{\beta}}$ that minimizes the sum of squares of the residuals. Minimizing the sum of squared residuals means minimizing the (squared vertical) distance between the regression line and the observations, or equivalently maximizing the fraction of the variance of $y$ explained by our model. Why do we minimize the squares? Well, minimizing the simple sum of the residuals does not make sense, as negative and positive values could cancel out. Another possibility would be to minimize the sum of the absolute values of the residuals, but this estimator does not usually have good properties, and poses technical difficulties.
- Let's go back to linear notation for a second. The OLS estimator is defined as the solution of the following **minimization problem**:
<br><br>
$$
\min_{\hat{\beta}_0,\hat{\beta}_1} f(\hat{\beta}_0,\hat{\beta}_1) = \sum_{i=1}^{N}(\hat{\epsilon}_i)^2 = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{N}(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2
$$
<br>
- As this function is always nonnegative, continuous and strictly convex (provided the $x_i$ are not all equal), the minimization gives the following set of first order conditions:
<br><br>
$$
\begin{pmatrix}
\frac{\partial f(\hat{\beta}_0,\hat{\beta}_1)}{\partial \hat{\beta}_0} \\
\frac{\partial f(\hat{\beta}_0,\hat{\beta}_1)}{\partial \hat{\beta}_1}
\end{pmatrix} = 0 \Longleftrightarrow
\begin{pmatrix}
-2\sum_{i=1}^{N}(y_i - \hat{\beta}_{0,OLS} - \hat{\beta}_{1,OLS}x_i) \\
-2\sum_{i=1}^{N}x_i(y_i - \hat{\beta}_{0,OLS} - \hat{\beta}_{1,OLS}x_i)
\end{pmatrix} = 0
$$
<br>
- This is a system with 2 unknowns ($\hat{\beta}_{0,OLS}$ and $\hat{\beta}_{1,OLS}$) and 2 equations. We can therefore solve for $\hat{\beta}_{0,OLS}$ and $\hat{\beta}_{1,OLS}$. Let's write down the **solutions** for $\hat{\beta}_{1,OLS}$ and $\hat{\beta}_{0,OLS}$ (try to do the steps yourself):
<br><br>
$$
\hat{\beta}_{1,OLS} = \frac{\sum_{i=1}^{N}(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^{N}(x_i-\overline{x})^2}
$$
<br>
<br>
$$
\hat{\beta}_{0,OLS} = \overline{y}-\hat{\beta}_{1,OLS}\overline{x}
$$
<br>
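- The scalar solutions can be checked numerically. The sketch below uses simulated data (the seed and the values $\beta_0=1$, $\beta_1=0.5$ are assumptions for illustration), computes the slope as the sum of cross-deviations over the sum of squared deviations of $x$, and verifies that both first order conditions hold at the solution.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.normal(3.0, 1.0, size=N)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.3, size=N)   # hypothetical beta_0 = 1, beta_1 = 0.5

# Closed-form OLS solutions for the univariate model
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# First order conditions: the residuals sum to zero and are orthogonal to x
resid = y - beta0_hat - beta1_hat * x
print(beta0_hat, beta1_hat)
print(resid.sum(), (x * resid).sum())   # both are zero up to floating-point error
```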
- Going back to **matrix notation**, the minimization problem can be written as follows:
<br><br>
$$
\min_{\hat{\boldsymbol{\beta}}} f(\hat{\boldsymbol{\beta}}) = \hat{\boldsymbol{\epsilon}}'\hat{\boldsymbol{\epsilon}} = (\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}})'(\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}}) = \boldsymbol{y}'\boldsymbol{y} - \hat{\boldsymbol{\beta}}'\boldsymbol{X}'\boldsymbol{y} - \boldsymbol{y}'\boldsymbol{X}\boldsymbol{\hat{\beta}} + \hat{\boldsymbol{\beta}}'\boldsymbol{X}'\boldsymbol{X}\hat{\boldsymbol{\beta}}
$$
<br>
- And in matrix notation, the minimization gives the following set of first order conditions:
<br><br>
$$
\frac{\partial \hat{\boldsymbol{\epsilon}}'\hat{\boldsymbol{\epsilon}}}{\partial \hat{\boldsymbol{\beta}}} = \boldsymbol{0} \Longleftrightarrow \boldsymbol{0} - \boldsymbol{X}'\boldsymbol{y} - \boldsymbol{X}'\boldsymbol{y} + \boldsymbol{X}'\boldsymbol{X} \hat{\boldsymbol{\beta}}_{OLS} + (\hat{\boldsymbol{\beta}}_{OLS}'\boldsymbol{X}'\boldsymbol{X})' = \boldsymbol{0}
$$
<br>
- We can solve for $\hat{\boldsymbol{\beta}}_{OLS}$:
<br><br>
$$
\hat{\boldsymbol{\beta}}_{OLS} = (\boldsymbol{X}'\boldsymbol{X})^{-1}(\boldsymbol{X}'\boldsymbol{y})
$$
<br>
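- The matrix formula translates almost literally into NumPy. This is a sketch on simulated data (all numbers are made up). Explicitly inverting $\boldsymbol{X}'\boldsymbol{X}$ mirrors the formula, while `np.linalg.solve` and `np.linalg.lstsq` are numerically safer ways to get the same estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
x = rng.normal(3.0, 1.0, size=N)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.3, size=N)   # hypothetical coefficients

X = np.column_stack([np.ones(N), x])               # N x 2 design matrix

# Literal translation of (X'X)^{-1} (X'y)
beta_hat_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Same estimates from the normal equations X'X b = X'y, without an explicit inverse
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimates from NumPy's least-squares solver
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat_inv, beta_hat_solve, beta_hat_lstsq)
```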
- This expression should also look very familiar. It's **very similar to the expression of the sample-mean estimator** in matrix notation, with the difference that now instead of having $\boldsymbol{x}=(1, ..., 1)'$, we have $\boldsymbol{X}$, which is a $N\times2$ matrix, with the first column being a N-dimensional constant vector of 1s, i.e. $(1, ..., 1)'$, and the second column being the N-dimensional random vector of the about-to-be-observed data of our predictor, i.e. $\boldsymbol{x}=(x_1, ..., x_N)'$. So basically we can say that the OLS estimator is just a generalization of the sample-mean estimator as we move from a simple constant vector of 1s, i.e. $\boldsymbol{x}=(1, ..., 1)'$, to a matrix in which the first column is this constant vector of 1s and the second column is the random vector $\boldsymbol{x}$. Or, if you want, we can say that the sample-mean estimator is just a specific case of the OLS estimator with $\boldsymbol{X}=\boldsymbol{x}=(1, ..., 1)'$.

<img src="https://i.imgflip.com/83mlt2.jpg" width="300">

- So basically everything that we have learned from the sample-mean estimator applies to the OLS estimator, with some slight differences as the OLS is more general (but basically same thing). One slight difference is that, since now $\boldsymbol{X}$ is a mix of constants (1s) and a random variable ($\boldsymbol{x}$), the distribution of the estimator will no longer be a simple marginal distribution, but a **conditional distribution** on $\boldsymbol{X}$. So in theory we should write:
<br><br>
$$
\hat{\boldsymbol{\beta}}_{OLS}|\boldsymbol{X} = (\boldsymbol{X}'\boldsymbol{X})^{-1}(\boldsymbol{X}'\boldsymbol{y})
$$
<br>
- We, as everyone does, will almost always write it by omitting the conditional sign, unless we are studying its moments and distribution. But let's not forget that the distribution of $\hat{\boldsymbol{\beta}}_{OLS}$ is a conditional distribution (on $\boldsymbol{X}$).
- So as we did for the sample-mean estimator in the mean model, let's say something on the distribution of $\hat{\boldsymbol{\beta}}_{OLS}$. It can be shown that, given the assumptions 1-6 of the linear regression model for the data generating process mentioned above, the OLS estimator follows a conditional normal distribution (conditional on $\boldsymbol{X}$), with conditional **mean of the OLS estimator** $\boldsymbol{\beta}$ and conditional **variance-covariance matrix of the OLS estimator** $\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}$. In the usual concise notation:
<br><br>
$$
\hat{\boldsymbol{\beta}}_{OLS}|\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\beta},\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1})
$$
<br>
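- One way to get a feel for this conditional distribution is a small Monte Carlo sketch: keep $\boldsymbol{X}$ fixed, redraw the disturbances many times, and compare the simulated mean and variance-covariance matrix of the OLS estimates with $\boldsymbol{\beta}$ and $\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}$. The seed, sample size, number of replications and parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, R = 50, 20_000                 # sample size and number of replications
beta = np.array([1.0, 0.5])       # hypothetical true coefficients
sigma = 0.3                       # hypothetical true standard deviation

# Draw X once and keep it fixed: everything below is conditional on X
x = rng.uniform(0.0, 10.0, size=N)
X = np.column_stack([np.ones(N), x])
XtX_inv = np.linalg.inv(X.T @ X)

# R samples of y that share the same X (one sample per row)
eps = rng.normal(0.0, sigma, size=(R, N))
Y = X @ beta + eps

# OLS estimates for every replication: row r is ((X'X)^{-1} X' y_r)'
beta_hats = Y @ X @ XtX_inv

mc_mean = beta_hats.mean(axis=0)
mc_cov = np.cov(beta_hats, rowvar=False)
theory_cov = sigma**2 * XtX_inv

print(mc_mean, beta)        # simulated mean vs. true coefficients
print(mc_cov)               # simulated variance-covariance matrix
print(theory_cov)           # sigma^2 (X'X)^{-1}
```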
- Again, it should look very familiar. OK we have an estimator for $\boldsymbol{\beta}$, let's look at the estimator for $\sigma^2$. Let's define the $N\times1$ vectors of **OLS fitted (or predicted) values** and **OLS residuals**:
<br><br>
$$
\hat{\boldsymbol{y}}_{OLS}=\boldsymbol{X}\hat{\boldsymbol{\beta}}_{OLS}
$$
<br>
$$
\hat{\boldsymbol{\epsilon}}_{OLS}=\boldsymbol{y} - \hat{\boldsymbol{y}}_{OLS}= \boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}}_{OLS}
$$
<br>
- You will not be surprised to know that an unbiased estimator for $\sigma^2$ is very similar to the **sample-variance estimator**, i.e. the sum of the squared OLS residuals divided by the number of degrees of freedom. In our case we estimate two coefficients, so the degrees of freedom are $N-2$:
<br><br>
$$
\hat{\sigma}^2_{OLS}|\boldsymbol{X}=\frac{\hat{\boldsymbol{\epsilon}}_{OLS}\,'\hat{\boldsymbol{\epsilon}}_{OLS}}{N-2}
$$
<br>
- The distribution of this estimator is conditional on $\boldsymbol{X}$. The estimator $\hat{\sigma}^2_{OLS}|\boldsymbol{X}$ has conditional mean $\mathrm{E}(\hat{\sigma}^2|\boldsymbol{X})={\sigma}^2$ and conditional variance $\mathrm{V}(\hat{\sigma}^2|\boldsymbol{X})=\frac{2\sigma^4}{N-2}$, and its transformation $\frac{N-2}{\sigma^2}\hat{\sigma}^2_{OLS}|\boldsymbol{X}$ follows a conditional Chi-square distribution with $N-2$ degrees of freedom, i.e. $\frac{N-2}{\sigma^2}\hat{\sigma}^2_{OLS}|\boldsymbol{X} \sim \mathcal{\chi^2}_{(N-2)}$. In many statistical packages the square root of the estimate, i.e. $\sqrt{\hat{\sigma}^2_{OLS}}$, is reported as the Root Mean Squared Error (RMSE).
- As written above, the variance-covariance matrix of the OLS estimator is $\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}$. The diagonal elements are the variances of the respective coefficient estimators. Let $S^{kk}$ be the k-th diagonal element of $(\boldsymbol{X}'\boldsymbol{X})^{-1}$. In our case, the first diagonal element, i.e. $\sigma^2S^{11}$, is the variance of the estimator $\hat{\beta}_{0,OLS}$, and the second diagonal element, i.e. $\sigma^2S^{22}$, is the variance of the estimator $\hat{\beta}_{1,OLS}$. The square root of the first diagonal element, i.e. $\sqrt{\sigma^2S^{11}}$, is called conditional **standard error of the OLS estimator** $\hat{\beta}_0$, and the square root of the second diagonal element, i.e. $\sqrt{\sigma^2S^{22}}$, is called conditional standard error of the OLS estimator $\hat{\beta}_1$. If we do not know $\sigma^2$, we need to estimate it with $\hat{\sigma}^2_{OLS}$, and therefore we'll say that the estimator for the standard error of the estimator $\hat{\beta}_{0,OLS}$ is $\sqrt{\hat{\sigma}^2_{OLS}S^{11}}$, and that the estimator for the standard error of the estimator $\hat{\beta}_{1,OLS}$ is $\sqrt{\hat{\sigma}^2_{OLS}S^{22}}$.
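- The estimators of $\sigma^2$ and of the standard errors can be sketched in a few lines of NumPy. The data are simulated and all numbers are made up. Note that NumPy indexes from 0, so $S^{11}$ and $S^{22}$ correspond to positions `[0, 0]` and `[1, 1]` of the inverse.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 500
sigma = 0.3                                   # hypothetical true standard deviation
x = rng.uniform(0.0, 10.0, size=N)
X = np.column_stack([np.ones(N), x])
y = X @ np.array([1.0, 0.5]) + rng.normal(0.0, sigma, size=N)

# OLS estimates and residuals
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Unbiased estimator of sigma^2: sum of squared residuals over N - 2
sigma2_hat = resid @ resid / (N - 2)

# Estimated variance-covariance matrix and standard errors of the coefficients
vcov_hat = sigma2_hat * XtX_inv
se_hat = np.sqrt(np.diag(vcov_hat))

print(sigma2_hat, sigma**2)    # estimate vs. the value used in the simulation
print(np.sqrt(sigma2_hat))     # the root of the estimate (RMSE)
print(se_hat)                  # standard errors of beta_0 hat and beta_1 hat
```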


## Goodness of Fit with the OLS Estimator <a name="Goodness-of-Fit-with-the-OLS-Estimator"></a>
- The OLS estimator, by definition, maximizes the goodness-of-fit of the regression. The goodness-of-fit is the share of the variance of the dependent variable explained by the model (how well your model fits the data).
- Let's define the following expressions:
    - **Total Sum of Squares**: TSS = $\sum_{i=1}^{N}(y_i-\overline{y})^2$. This is the total sample variation of the dependent variable.
    - **Explained Sum of Squares**: ESS = $\sum_{i=1}^{N}(\hat{y}_{i,OLS}-\overline{y})^2$. This is the total sample variation of the OLS predicted values of the dependent variable.
    - **Sum of Squared Residuals**: SSR = $\sum_{i=1}^{N}(\hat{\epsilon}_{i,OLS})^2=\hat{\boldsymbol{\epsilon}}_{OLS}\,'\hat{\boldsymbol{\epsilon}}_{OLS}$. This is the total sample variation of the OLS residuals, i.e. of the unexplained part of the dependent variable.
- The OLS estimator minimizes the SSR. We can show that TSS = ESS + SSR. 
- The usual measure of goodness-of-fit is the **R-Squared**, defined as the fraction of the variance of y explained by the model:
<br>
$R^2=\frac{ESS}{TSS}=\frac{\sum_{i=1}^{N}(\hat{y}_{i,OLS}-\overline{y})^2}{\sum_{i=1}^{N}(y_{i}-\overline{y})^2}=1-\frac{SSR}{TSS}=1-\frac{\hat{\boldsymbol{\epsilon}}_{OLS}\,'\hat{\boldsymbol{\epsilon}}_{OLS}}{TSS}$
<br><br>
- The fact that the $R^2$ is maximized by the OLS estimation is a desirable property. However, one shouldn’t pay attention only to this statistic to determine whether a model is useful or not. We are in general more interested in the partial effect of a given variable.
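- The decomposition TSS = ESS + SSR and the two equivalent expressions for the $R^2$ can be verified on simulated data. This is only a sketch: the seed and the coefficient values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 300
x = rng.normal(0.0, 1.0, size=N)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=N)   # hypothetical coefficients

# OLS fit with a constant
X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
resid = y - y_hat

# Sums of squares
TSS = np.sum((y - y.mean()) ** 2)       # total variation of y
ESS = np.sum((y_hat - y.mean()) ** 2)   # variation of the fitted values
SSR = resid @ resid                     # variation left in the residuals

r2 = ESS / TSS
print(TSS, ESS + SSR)        # the two numbers coincide
print(r2, 1 - SSR / TSS)     # two equivalent ways of computing the R-squared
```

- With a single predictor and a constant, the $R^2$ also equals the squared sample correlation between $x$ and $y$, which can be checked with `np.corrcoef(x, y)[0, 1] ** 2`.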

## Hypothesis Testing with the OLS Estimator <a name="Hypothesis-Testing-with-the-OLS-Estimator"></a>
- Again, this is going to be very similar to the sample-mean estimator. Let's test whether the coefficient $\beta_1$ is statistically different from $\beta_{1,0}$ (could be any value) at the 5% significance level. Obviously, what follows also holds for the other coefficient $\beta_0$, or for any coefficient (instead of $\beta_1$ we would write $\beta_i$, and we would use the corresponding diagonal element of $(\boldsymbol{X}'\boldsymbol{X})^{-1}$).
- Under the null hypothesis $H_0: \beta_1 = \beta_{1,0}$ with known $\sigma^2$, we know that $\hat{\beta}_{1,OLS}|\boldsymbol{X} \sim \mathcal{N}(\beta_{1,0}, \sigma^2S^{22})$, where $S^{22}$ is the second diagonal element of $(\boldsymbol{X}'\boldsymbol{X})^{-1}$, i.e. the one associated with $\hat{\beta}_{1,OLS}$. When we normalize the estimator we obtain the formula for the z-score test statistic $\frac{\hat{\beta}_{1,OLS}-\beta_{1,0}}{\sqrt{\sigma^2S^{22}}} |\boldsymbol{X} \sim \mathcal{N}(0, 1)$. When we do not know $\sigma^2$, we replace it with $\hat{\sigma}^2_{OLS}$ and the test statistic is called the t-statistic, as it follows a t-distribution with $N-2$ degrees of freedom, i.e. $\frac{\hat{\beta}_{1,OLS}-\beta_{1,0}}{\sqrt{\hat{\sigma}^2_{OLS}S^{22}}} |\boldsymbol{X} \sim t_{N-2}$. It follows a t-distribution because it's the ratio of a standard normal variable to the square root of an independent Chi-square variable divided by its degrees of freedom (see Greene for the proof). Finally, as $N$ grows the t-distribution converges to the standard normal distribution, so in large-enough samples we can use the usual critical value of 1.96 for the 5% significance level.
- Note that the test you'll be doing most is testing whether the population coefficient is statistically different from zero or not. In this case, we'll have that under the null hypothesis $H_0: \beta_1 = 0$, the t-statistic is $\frac{\hat{\beta}_{1,OLS}}{\sqrt{\hat{\sigma}^2_{OLS}S^{22}}} |\boldsymbol{X}$.
- This should look very familiar, the concepts and test statistics are all very similar to the sample-mean estimator ...

<img src="https://i.imgflip.com/83mmux.jpg" width="500">
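
- The test of $H_0: \beta_1 = 0$ can be sketched by hand with NumPy and `scipy.stats`. The data are simulated with a deliberately small sample (all numbers are made up), so that the difference between the exact t critical value and 1.96 is visible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
N = 30
x = rng.uniform(0.0, 2.0, size=N)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=N)   # hypothetical coefficients

# OLS estimates, residuals and estimator of sigma^2
X = np.column_stack([np.ones(N), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - 2)

# Standard error of beta_1 hat: second diagonal element, index [1, 1] in NumPy
se_beta1 = np.sqrt(sigma2_hat * XtX_inv[1, 1])

# t-statistic under H0: beta_1 = 0, two-sided p-value and 5% critical value
t_stat = (beta_hat[1] - 0.0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=N - 2)
t_crit = stats.t.ppf(0.975, df=N - 2)

print(t_stat, p_value)
print(t_crit)    # exact critical value for N - 2 = 28 degrees of freedom
```

- We reject $H_0$ at the 5% level when `abs(t_stat) > t_crit`, or equivalently when `p_value < 0.05`.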

## The OLS Estimator in Python <a name="The-OLS-Estimator-in-Python"></a>
- OK now we'll apply all these concepts and equations to real data using Python. When we did the sample-mean model, we considered CO2 emissions per capita from the QoG Environmental Indicators (QoG-EI) dataset (exercises notebook). We found that, on average, one person emits 5.016 tonnes of **CO2** per year. In addition, looking at the p-value of 0.00000000000038846703634227, we can say that this result is statistically significant at the 1% level (the p-value is smaller than 0.01). This means that, based on our sample and assumptions, we can be pretty sure that a person emits more than 0 tonnes of CO2 per year. Strictly speaking, the p-value is not the probability that this claim is wrong: it says that, if the true mean were 0, the probability of observing a sample mean at least as far from 0 as ours would be 0.00000000000038846703634227.
- Great. But let's say now we want more. Instead of just studying the average value of CO2 emissions per capita, now we want to understand what drives CO2 emissions per capita (on average). What could it be? For the sake of this exercise, let's pick something "naive". Could it be the case that the amount of **money we have** influences how much we emit? So, more in general, the research question would be: is wealth driving climate change? The mechanism would be the following: the richer we are, the more goods we consume, and the production of those goods emits CO2. So, intuitively, people with high income consume lots of carbon-intensive goods and services, e.g. they fly more, and therefore emit more.
- How do we translate this into our model? Simple. Instead of assuming that the mean component of CO2 emissions per capita is simply $\beta$, we'll assume that its mean component is $\beta_0+\beta_1x$, where $x$ is GDP per capita. In other words, we assume that, on average, CO2 emissions per capita linearly depend on the value of the GDP per capita. Or similarly, that we can use GDP per capita to predict CO2 emissions per capita.

### Data <a name="Data"></a>
- Let's get **QoG**:

In [None]:
# get data
link = "https://www.qogdata.pol.gu.se/data/qog_ei_sept21.xlsx"
df_qog = pd.read_excel(link)

- In QoG-EI there is no GDP per capita. But we have CO2 emissions over GDP in billions of US dollars (`edgar_co2gdp`), CO2 emissions per capita (`edgar_co2pc`) and total CO2 emissions (`edgar_co2t`). So we can infer GDP, population and GDP per capita as follows:
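- Before applying this to the real data, the algebra can be checked on a toy table. The numbers below are made up and only the column names are borrowed from QoG-EI: dividing total emissions by emissions over GDP gives GDP, and dividing total emissions by emissions per capita gives population.

```python
import pandas as pd

# Two hypothetical countries (made-up numbers, for illustration only)
toy = pd.DataFrame({
    "edgar_co2t": [100.0, 50.0],     # total emissions
    "edgar_co2gdp": [0.5, 0.25],     # emissions / GDP
    "edgar_co2pc": [10.0, 2.0],      # emissions / population
})

# total / (total / GDP) = GDP
toy["gdp"] = toy["edgar_co2t"] / toy["edgar_co2gdp"]
# total / (total / population) = population
toy["pop"] = toy["edgar_co2t"] / toy["edgar_co2pc"]
# GDP per capita
toy["gdp_pc"] = toy["gdp"] / toy["pop"]

print(toy)
```

- Equivalently, GDP per capita is just emissions per capita divided by emissions over GDP, since total emissions cancel out.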

In [None]:
# get variables
indexes = ["ccodealp","year"]
variabs_co2 = ["edgar_co2gdp","edgar_co2t","edgar_co2pc"]
variabs_control = ["oecd_cctr_gdp"]
variabs = variabs_co2 + variabs_control
df = df_qog.loc[:,np.append(indexes,variabs)]

# make gdp per capita
df["gdp"] = (df["edgar_co2gdp"]/df["edgar_co2t"])**(-1) # billions US dollars
df["pop"] = (df["edgar_co2pc"]/df["edgar_co2t"])**(-1) # millions
df["gdp_pc"] = df["gdp"]/df["pop"] # thousands of US dollars
variabs = np.append(variabs, ["gdp","pop","gdp_pc"])

- As you noticed we've got also `oecd_cctr_gdp`, which is Climate change-related tax revenue as a percentage of gross domestic product (GDP). Includes taxes, fees and charges, tradable permits, deposit-refund systems, subsidies, and voluntary approaches related to the domain of climate change. We'll use it later on.
- This is a panel. Let's make it simple and work on a cross section for now:

In [None]:
# make cross section
df = df.groupby("ccodealp")[variabs].mean().reset_index().dropna()

# summary statistics
df.describe()

- OK so our dependent variable is CO2 emissions per capita (`edgar_co2pc`), which has an average of 4.83 tonnes per year. The independent variable of interest is GDP per capita (`gdp_pc`), with a mean of 19 thousand dollars per year. Let's plot them:

In [None]:
sns.scatterplot(x='gdp_pc', y='edgar_co2pc', data=df, color = "r", s = 20)

- OK so it looks like one country is far away from the rest of the group with GDP per capita around 100 thousand dollars per year and emissions per capita above 25 tonnes. Let's do a "quick and dirty" cleaning and drop this outlier, no further questions asked:

In [None]:
# drop outliers quick and dirty
df = df.loc[df["gdp_pc"] < 80,:].copy() # copy, so that adding columns later does not trigger a SettingWithCopyWarning
sns.scatterplot(x='gdp_pc', y='edgar_co2pc', data=df, color = "r", s = 20)
df.shape

- OK looks a bit better. Now there seems to be lots of grouping for lower values of both variables, while values are more spread as both variables increase. This is a classic situation in which log transforming the variables might provide a more homogeneous grouping. Let's try it:

In [None]:
# maybe logs?
df["ln_gdp_pc"] = np.log(df["gdp_pc"])
df["ln_edgar_co2pc"] = np.log(df["edgar_co2pc"])
sns.scatterplot(x='ln_gdp_pc', y='ln_edgar_co2pc', data=df, color = "r", s = 20)

### OLS Canned Routine <a name="OLS-Canned-Routine"></a>
- OK good, let's work with this data.
- We are assuming that there is a linear relationship between the log of CO2 emissions per capita and the log of GDP per capita (plus some randomness), i.e. the regression model of the population is $ln\_edgar\_co2pc = \beta_0 + \beta_1 ln\_gdp\_pc + \epsilon$. We can use this sample to estimate the population parameters $\beta_0$ and $\beta_1$. As we have seen above, a good estimator for these parameters is the OLS estimator, which minimizes the sum of the squares of the residuals. Let's use the canned method `sm.OLS.from_formula` to get the OLS estimates for $\beta_0$ and $\beta_1$:

In [None]:
# canned ols
ols_canned_results = sm.OLS.from_formula('ln_edgar_co2pc ~ ln_gdp_pc', df).fit()
ols_canned_results_table = ols_canned_results.summary().tables[1]
ols_canned_results_table
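
- For a model with a single regressor and a constant, the OLS estimates have a well-known closed form: the slope is the sum of cross-deviations of $x$ and $y$ divided by the sum of squared deviations of $x$, and the intercept makes the line pass through the point of means. Here is a minimal sketch in plain Python, so that the arithmetic is visible. The five data points are made up for illustration (they are not the CO2 and GDP data), and the variable names are our own:

```python
# Toy sample (made-up numbers, NOT the CO2/GDP data used in this class)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of cross-deviations and sum of squared deviations of x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Closed-form OLS estimates for one regressor plus a constant
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

print(f"intercept: {beta0_hat:.4f}, slope: {beta1_hat:.4f}")
```

- On real data, these two numbers should match the `Intercept` and slope rows that the canned routine reports in its first column.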

### Visualizing Estimated Fitted Values and Estimated Residuals <a name="Visualizing-Estimated-Fitted-Values-and-Estimated-Residuals"></a>
- The canned routine gives us the coefficient estimates in the first column. The other columns report respectively the estimated standard errors of the OLS estimators, the t statistics for the estimates, the p-values for the estimates, and the 95% confidence intervals around the estimates. Let's focus on the coefficient estimates for now.
- The estimate for $\beta_0$ is $-2.4732$, and the estimate for $\beta_1$ is $1.3136$. Visually, these numbers can be interpreted as the intercept and the slope of the line of the Least Squares Fit (the code is borrowed from the really amazing [class](https://matteocourthoud.github.io/course/ml-econ/01_regression/) of Matteo Courthoud on machine learning):

In [None]:
# graph ols
def make_OLS_scatterplot():
    
    # Init figure
    fig, ax = plt.subplots(1,1)
    ax.set_title('OLS Scatter Plot');

    # Plot scatter and best fit line
    sns.regplot(x=df.ln_gdp_pc, y=df.ln_edgar_co2pc, ax=ax, order=1, ci=None, scatter_kws={'color':'r', 's':20})
    ax.legend(['Data','Least Squares Fit']);
    
make_OLS_scatterplot()

- As the estimates of $\beta_0$ and $\beta_1$ have been obtained by minimizing the sum of the squares of the residuals, this line is the one with the lowest sum of squared vertical distances from the observations. If we worked with a different estimator than OLS, the fit line would generally be in a different position.
- We can visualize the residuals as follows:

In [None]:
# regressor, outcome, and OLS fitted values used to draw the residual lines
xdata = df.ln_gdp_pc.values.reshape(-1,1)
ydata = df.ln_edgar_co2pc.values
yhat_OLS = ols_canned_results.fittedvalues

# scatter plot with the fit line and the residuals
def make_OLS_scatterplot_withres():

    # Init figure
    fig, ax = plt.subplots(1,1)
    ax.set_title('OLS Scatter Plot with OLS Residuals');

    # Add residuals
    sns.regplot(x=df.ln_gdp_pc, y=df.ln_edgar_co2pc, ax=ax, order=1, ci=None, scatter_kws={'color':'r', 's':20})
    ax.vlines(xdata, np.minimum(ydata,yhat_OLS), np.maximum(ydata,yhat_OLS), linestyle='--', color='k', alpha=0.5, linewidth=1)
    plt.legend(['Data','Least Squares Fit','Residuals']);
    
make_OLS_scatterplot_withres()
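
- To make the "least squares" idea concrete, here is a small sketch that fits the line on a made-up sample of five points (not our CO2 data) and compares the sum of squared residuals of the OLS line with that of two slightly different lines. The helper names `ols_line` and `ssr` are our own, hypothetical ones:

```python
# Toy sample (made-up numbers, NOT the CO2/GDP data used in this class)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def ols_line(x, y):
    """Return (intercept, slope) of the OLS line for one regressor plus a constant."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def ssr(b0, b1, x, y):
    """Sum of squared residuals of the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = ols_line(x, y)

# Residuals: vertical distances between each observation and the fit line
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ssr_ols = ssr(b0, b1, x, y)
ssr_steeper = ssr(b0, b1 + 0.1, x, y)   # same intercept, slightly steeper line
ssr_shifted = ssr(b0 + 0.1, b1, x, y)   # same slope, line shifted upwards

print(f"sum of residuals: {sum(residuals):.10f}")
print(f"SSR (OLS): {ssr_ols:.4f}, steeper: {ssr_steeper:.4f}, shifted: {ssr_shifted:.4f}")
```

- With a constant in the model, the OLS residuals sum to zero (up to floating-point error), and any other intercept or slope gives a larger sum of squared residuals.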

### Interpretation of Regression Results <a name="Interpretation-of-Regression-Results"></a>
- OK now that we have an understanding of what the numbers in our table are, let's interpret them. Remember that both our variables are in logs (i.e. we have a **log-log model**), so the interpretation will be in percent changes (the coefficient is an elasticity). For a summary of how to interpret coefficients with different combinations of level and logged variables check [here](https://towardsdatascience.com/how-to-interpret-linear-regression-coefficients-8f4450e38001) or [here](https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/).
- The coefficient estimate for `ln_gdp_pc` of 1.3136 means that, on average, if GDP per capita increases by one percent, we'd expect CO2 emissions per capita to increase by 1.3157 percent (i.e. $(1.01^{1.3136}-1)\times100$). Or, if we want to multiply by 10 to make it more meaningful, a 10-percent increase in GDP per capita is associated with an increase in CO2 emissions per capita of about 13.3 percent (i.e. $(1.10^{1.3136}-1)\times100$).
- In addition, as the p-value is lower than 5%, we reject the null hypothesis that the population parameter $\beta_1$ is equal to zero at the 5% significance level (the p-value of 0.000 ... etc is the probability of getting an estimate at least as far from zero as ours if $\beta_1$ were actually zero). We can get to the same conclusion by noticing that the 95% confidence interval of the coefficient estimate does not contain 0.
- What if we wanted to study the effect of an increase of 1000 US dollars in GDP per capita on the tonnes of CO2 per capita, rather than a 1% or a 10% increase? Well, we'd have to estimate a **level-level model**:

In [None]:
# level level
ols_canned_results_levlev = sm.OLS.from_formula('edgar_co2pc ~ gdp_pc', df).fit()
ols_canned_results_table_levlev = ols_canned_results_levlev.summary().tables[1]
ols_canned_results_table_levlev

- So if GDP per capita increases by 1000 dollars, on average we'd expect CO2 emissions per capita to increase by approximately a quarter of a tonne (0.2385). Put differently, people in countries where GDP per capita is 1000 dollars higher emit on average 0.2385 tonnes more per person.
- Finally, we did not interpret the **intercept**, as its interpretation does not make much economic sense. For example, in the level-level model the estimate of 0.1874 means that the average CO2 emissions per capita when GDP per capita is zero are around 0.19 tonnes. However, GDP per capita is never zero, which is why the estimate of $\beta_0$ does not make much economic sense. Still, it is always good practice to include the constant in a regression model, as (i) it usually improves the estimates, (ii) otherwise we should come up with a reason why it should not be there (which is usually hard to do), and (iii) by now we are fond of it, as it allowed us to learn the mean model (even though it now has a different meaning). Long story short, make sure you always have it.
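
- The interpretations above are just arithmetic on the reported slope coefficients, and can be checked with a few lines of plain Python. The coefficients are copied by hand from the two tables above, and the function name `pct_change_in_y` is our own:

```python
# Slope estimates copied by hand from the two regression tables above
beta1_loglog = 1.3136   # log-log model: elasticity of CO2 pc with respect to GDP pc
beta1_levlev = 0.2385   # level-level model: tonnes of CO2 pc per 1000 dollars of GDP pc


def pct_change_in_y(beta, pct_change_in_x):
    """Exact percent change in y implied by a log-log slope `beta`
    when x changes by `pct_change_in_x` percent."""
    return ((1 + pct_change_in_x / 100) ** beta - 1) * 100


exact_1 = pct_change_in_y(beta1_loglog, 1)     # 1% more GDP per capita
exact_10 = pct_change_in_y(beta1_loglog, 10)   # 10% more GDP per capita
approx_10 = beta1_loglog * 10                  # common shortcut: beta times the % change

# Level-level model: change in tonnes for 1 extra unit of gdp_pc (1000 dollars)
extra_tonnes = beta1_levlev * 1

print(f"log-log, +1% GDP pc:  {exact_1:.4f}% more CO2 pc")
print(f"log-log, +10% GDP pc: {exact_10:.2f}% more CO2 pc (shortcut: {approx_10:.2f}%)")
print(f"level-level, +1000 USD GDP pc: {extra_tonnes:.4f} tonnes more CO2 pc")
```

- The shortcut of multiplying the coefficient by the percent change works well for small changes, and drifts away from the exact number as the change gets larger.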

- Finally, you must have noticed that the tables we worked with are not the same as the **standard regression tables** in academic papers. There is a way to get those too, using the Stargazer package:

In [None]:
stargazer = Stargazer([ols_canned_results])
stargazer

<img src="https://i.imgflip.com/83mpks.jpg" width="500">

- OK thank you Chris Martin, though all you are adding to our former table is just a nice sparkling way to see at which level the results are statistically significant (3 stars 1 percent, 2 stars 5 percent, 1 star 10 percent). Why not.
- Finally, also note that this table shows the R-squared and the Adjusted R-squared. We did not cover the adjusted one yet (and rightfully so, as the two are practically the same in our case). But we can interpret the R-squared. As seen above:
<br>
$R^2=\frac{ESS}{TSS}=\frac{\sum_{i=1}^{N}(\hat{y}_{i,OLS}-\overline{y})^2}{\sum_{i=1}^{N}(y_{i}-\overline{y})^2}=1-\frac{SSR}{TSS}=1-\frac{\hat{\boldsymbol{\epsilon}}_{OLS}\,'\hat{\boldsymbol{\epsilon}}_{OLS}}{TSS}$
<br><br>
- So an **R-Squared** of 0.88 means that the fit line (or prediction) explains around 88% of the variation in our dependent variable. Indeed, you saw in the scatterplot how our fit line was nicely fitting the cloud of data points. Not bad. There isn't a right or wrong level of R-Squared, so we should always interpret it with caution. For sure, if it's like 0.00001, then that's a red flag that our OLS fit line is not doing a very good job in fitting the data.
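
- As a quick check of the formula above, here is a minimal sketch that computes the R-squared in both ways, $ESS/TSS$ and $1-SSR/TSS$, on a made-up sample of five points (not our CO2 data). The variable names are our own:

```python
# Toy sample (made-up numbers, NOT the CO2/GDP data used in this class)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS fit for one regressor plus a constant
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares from the formula
tss = sum((yi - y_bar) ** 2 for yi in y)                # total variation in y
ess = sum((yh - y_bar) ** 2 for yh in y_hat)            # variation explained by the fit
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # variation left in the residuals

r2_from_ess = ess / tss
r2_from_ssr = 1 - ssr / tss

print(f"TSS = {tss:.4f}, ESS = {ess:.4f}, SSR = {ssr:.4f}")
print(f"R2 = ESS/TSS = {r2_from_ess:.6f}, R2 = 1 - SSR/TSS = {r2_from_ssr:.6f}")
```

- The two versions agree because, with a constant in the model, $TSS = ESS + SSR$ for the OLS fit.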