<a href="https://colab.research.google.com/github/edoardochiarotti/class_datascience/blob/main/2024/06_Linear-Regression-Model/06_Linear_regression_model_exercises.ipynb"
   target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Linear Regression Model - Exercises

<img src="https://i.imgflip.com/83mjm6.jpg" width="500">

In [1]:
# PACKAGES
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import random as rd
import statistics as st
import pandas as pd
import os
import statsmodels.api as sm
import re

# FUNCTIONS FROM PACKAGES
from numpy.linalg import inv

# SEABORN THEME
scale = 0.4
W = 16*scale
H = 9*scale
sns.set(rc = {'figure.figsize':(W,H)})
sns.set_style("white")

## Content
- [Exercise 1: Function for OLS coefficients](#Exercise-1:-Function-for-OLS-coefficients)
- [Exercise 2: Make a nice output table](#Exercise-2:-Make-a-nice-output-table)
- [Exercise 3: Add statistics to the output table](#Exercise-3:-Add-statistics-to-the-output-table)

- As done in class, let's consider the relationship between CO2 emissions per capita and income per capita.
- So, instead of assuming that the mean component of CO2 emissions per capita is simply $\beta$, we'll assume that its mean component is $\beta_0+\beta_1x$, where $x$ is GDP per capita. In other words, we assume that, on average, CO2 emissions per capita linearly depend on the value of the GDP per capita. Or similarly, that we can use GDP per capita to predict CO2 emissions per capita.
- Let's get **QoG** and add the variables as done in class:

In [None]:
# get data
link = "https://www.qogdata.pol.gu.se/data/qog_ei_sept21.xlsx"
df_qog = pd.read_excel(link)

In [None]:
# get variables
indexes = ["ccodealp","year"]
variabs_co2 = ["edgar_co2gdp","edgar_co2t","edgar_co2pc"]
variabs_control = ["oecd_cctr_gdp"]
variabs = variabs_co2 + variabs_control
df = df_qog.loc[:,np.append(indexes,variabs)]

# make gdp per capita
df["gdp"] = (df["edgar_co2gdp"]/df["edgar_co2t"])**(-1) # billions US dollars
df["pop"] = (df["edgar_co2pc"]/df["edgar_co2t"])**(-1) # millions
df["gdp_pc"] = df["gdp"]/df["pop"] # thousands of US dollars
variabs = np.append(variabs, ["gdp","pop","gdp_pc"])

# make cross section
df = df.groupby("ccodealp")[variabs].mean().reset_index().dropna()

# put ones into data
df["ones"] = 1

# drop outliers quick and dirty
df = df.loc[df["gdp_pc"] < 80,:]

# maybe logs?
df["ln_gdp_pc"] = np.log(df["gdp_pc"])
df["ln_edgar_co2pc"] = np.log(df["edgar_co2pc"])

## Exercise 1: Function for OLS coefficients <a name="Exercise-1:-Function-for-OLS-coefficients"></a>

- First, use the function `sm.OLS.from_formula` to regress `ln_edgar_co2pc` on `ln_gdp_pc`, save the results in an object called `ols_canned_results`, save the table with the regression results in an object called `ols_canned_results_table`, and display the table.

In [None]:
# your code here ...


- OK so the canned routine gives us the OLS estimates for $\beta_0$ and $\beta_1$, plus a bunch of other related statistics. That's convenient, though we'd like to understand what is behind all these estimates and numbers, wouldn't we? Of couuuurse. So let's use our knowledge of Python, our knowledge of the Python application of the sample-mean estimator and related statistics seen in the last class, and our new knowledge of the OLS equations to figure it out.
- In the exercises of last class, we built some functions for the **sample-mean estimator** using matrix notation. The key formula is $\hat{\beta}_{SM} = (\boldsymbol{x}'\boldsymbol{x})^{-1}(\boldsymbol{x}'\boldsymbol{y})$, which in Python is written as `beta_hat_SM = (inv(xdata.T @ xdata)) @ (xdata.T @ ydata)`.
- Here is what we used:

In [None]:
# function to transform pandas Series / DataFrame columns into vectors / matrices
def data_to_matrix(data, variab_name):
    
    """ My Data to Matrix Function """
    
    # store in matrixes
    matrix = data.loc[:,variab_name].to_numpy()
    
    # make column vectors for arrays with less than 2 dimensions
    if len(matrix.shape) == 1:
        matrix = np.atleast_2d(matrix).T
        
    # return result
    return matrix

# define sample mean function
def sample_mean_estimator(data, y, x):
    
    """ My Sample Mean Function """
    
    # store in matrixes
    ydata = data_to_matrix(data, variab_name = y)
    xdata = data_to_matrix(data, variab_name = x)

    # get sample mean
    beta_hat_SM = (inv(xdata.T @ xdata)) @ (xdata.T @ ydata)

    # return
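    # .item() extracts the single element of the 1x1 array as a Python float
    # (calling float() on a 2-D array is deprecated in recent NumPy versions)
    return beta_hat_SM.item()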
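
To see why this matrix formula returns the sample mean, here is a minimal sketch on made-up numbers (the values in `ydata` are hypothetical and not taken from QoG). With a column of ones as `xdata`, $\boldsymbol{x}'\boldsymbol{x}$ is the number of observations and $\boldsymbol{x}'\boldsymbol{y}$ is the sum of the observations, so their ratio is the mean.

```python
import numpy as np
from numpy.linalg import inv

# hypothetical toy data: five observations stored as a column vector
ydata = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])

# column vector of ones, one entry per observation
xdata = np.ones((ydata.shape[0], 1))

# sample-mean estimator in matrix form: (x'x)^{-1} (x'y)
beta_hat_SM = inv(xdata.T @ xdata) @ (xdata.T @ ydata)

# the 1x1 result coincides with the ordinary mean of the data
sample_mean = beta_hat_SM.item()
print(sample_mean, ydata.mean())
```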

- Test these functions to compute the sample mean of `edgar_co2pc` and `gdp_pc`, and print the results:

In [1]:
# your code here ...

- Why is the sample-mean estimate for CO2 emissions per capita different from 5.02 (the one we computed during last class with the same QoG variable)?

- Your answer here ...


- Let's now write a function for the **OLS estimator**. As we have seen, the OLS is simply a generalization of the sample-mean estimator, as we move from a data vector $x$ with only ones to a data matrix $X$ with ones and the realizations of a random variable (GDP per capita). 
- Since we have been super good and we wrote the function for the sample-mean estimator already in matrix form and super generalized, we don't have to change much for the one of the OLS estimator. 
- Starting from the function `sample_mean_estimator`, write a function called `OLS_estimator_simple` to estimate the OLS coefficients. Tips:
    - Instead of the argument `x`, let's follow the notation for matrices and put `X`
    - Instead of naming the output `beta_hat_SM`, name it `beta_hat_OLS`
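
The step from a vector of ones to a data matrix can be sketched on made-up data as follows. This is only an illustration of the matrix formula, not the requested function: the arrays `x`, `ydata` and `Xdata` are hypothetical, and the data are generated without noise so that the estimates can be checked by eye.

```python
import numpy as np
from numpy.linalg import inv

# hypothetical toy data generated exactly from y = 1 + 2x (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ydata = np.atleast_2d(1.0 + 2.0 * x).T

# data matrix X: a column of ones next to the column with the regressor
Xdata = np.column_stack([np.ones_like(x), x])

# same formula as for the sample mean, now with a matrix instead of a vector
beta_hat_OLS = inv(Xdata.T @ Xdata) @ (Xdata.T @ ydata)

# first entry is the intercept estimate, second entry is the slope estimate
print(beta_hat_OLS.ravel())
```

Because the toy data lie exactly on a line, the intercept and slope are recovered up to floating-point error.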

In [2]:
# your code here ...


- Test the function by estimating the coefficients of the regression of `ln_edgar_co2pc` on `ln_gdp_pc`. Tips:
    - The arguments of your function should take up these 2 variable names
    - Remember that you need to estimate both $\beta_0$ and $\beta_1$ (remember the inputs you gave to the function for the sample mean)

In [3]:
# Your code here ...

- Are these coefficients the same ones you have obtained with the canned routine?

In [4]:
# your code here ...

## Exercise 2: Make a nice output table <a name="Exercise-2:-Make-a-nice-output-table"></a>

- OK but your results don't look nearly as cool as the ones of the canned method, which uses a `SimpleTable` to store them and display them. We don't know SimpleTables, but we do know pandas dataframes! We could try to get a similar output by storing our results in a pandas dataframe. To do that, we'll add a little chunk of code at the end of our `OLS_estimator_simple` function. 
- Write a new function called `OLS_estimator` by adding 2 chunks of code to the function `OLS_estimator_simple`:
    1. Chunk that stores (i) the number of observations, (ii) the number of parameters, (iii) the degrees of freedom. Tip: this chunk will be between the chunks `store in matrixes` and `get OLS estimate`.
    2. Chunk that stores the results in a dataframe and gives it back to us, in the form of the picture below (note that coefficient estimates are rounded to 4 decimals). Tip: this chunk will be after the chunk `get OLS estimate`.
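
One possible way to build such a table is sketched below, assuming the coefficient estimates are already available as a column vector. All numbers are made up for illustration, the names `X_names`, `n_obs`, `n_params`, `dof` and `table` are hypothetical, and the layout in the picture may differ in its details.

```python
import numpy as np
import pandas as pd

# hypothetical inputs: regressor names and a (k x 1) vector of estimates
X_names = ["ones", "ln_gdp_pc"]
beta_hat_OLS = np.array([[-1.23456789], [0.98765432]])  # made-up values

# hypothetical sample size, used to illustrate the degrees of freedom
n_obs = 150
n_params = beta_hat_OLS.shape[0]
dof = n_obs - n_params

# one row per coefficient, estimates rounded to 4 decimals
table = pd.DataFrame(
    {"coef": np.round(beta_hat_OLS.ravel(), 4)},
    index=X_names,
)
print(table)
```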

<img src="https://i.ibb.co/wyt7tpz/Screen-Shot-2023-10-24-at-15-34-15.png" width="200">

In [5]:
# Your code here ...


- Test the function by regressing `ln_edgar_co2pc` on a constant and `ln_gdp_pc` and display the regression coefficients. In addition, display the canned results again to make sure your numbers match the numbers in the first column of the canned routine's table.

In [6]:
# your code here ...


In [7]:
# your code here


## Exercise 3: Add statistics to the output table <a name="Exercise-3:-Add-statistics-to-the-output-table"></a>

- Now, you must have noticed that your table is a little smaller than the canned routine's, as you are missing all the nice statistics for statistical inference. Let's add them, shall we? We can start with the estimates for the standard errors of the OLS coefficient estimators. 
- As we have seen, the variance-covariance matrix of the OLS estimator is $\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}$, and its estimator is $\hat{\sigma}^2_{OLS}(\boldsymbol{X}'\boldsymbol{X})^{-1}$. The estimators for the standard errors are the square roots of the diagonal elements of the estimator for the variance-covariance matrix, i.e. $\sqrt{\hat{\sigma}^2_{OLS}S^{11}}$ for $\hat{\beta}_0$ and $\sqrt{\hat{\sigma}^2_{OLS}S^{22}}$ for $\hat{\beta}_1$, where $S^{jj}$ denotes the $j$-th diagonal element of $(\boldsymbol{X}'\boldsymbol{X})^{-1}$.
- Add a chunk of code to your function `OLS_estimator` to compute the estimates for the standard errors of the OLS estimates and store them in an added column of the output table titled `std err`. Tips:
    - The chunk should be between `get OLS estimate` and `get table`, and the chunk `get table` should also be updated.
    - In Python you can write $\hat{\sigma}^2_{OLS}(\boldsymbol{X}'\boldsymbol{X})^{-1}$ as `beta_hat_OLS_vcov = sigma2_hat_OLS * inv(xdata.T @ xdata)` (where `sigma2_hat_OLS` is obtained with the OLS residuals and the estimator of the model's variance), and you can create the vector of estimates of the standard errors as `se_hat_OLS = np.atleast_2d(np.sqrt(beta_hat_OLS_vcov.diagonal())).T`.
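
The chain from residuals to standard errors can be sketched on made-up data as follows. The toy arrays are hypothetical, and the sketch assumes that the model variance is estimated as the sum of squared residuals divided by the degrees of freedom (observations minus parameters), which is the convention statsmodels uses for OLS.

```python
import numpy as np
from numpy.linalg import inv

# hypothetical toy data: five observations, one regressor plus a constant
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ydata = np.atleast_2d(np.array([2.0, 4.1, 5.9, 8.2, 9.8])).T
xdata = np.column_stack([np.ones_like(x), x])

# observations, parameters and degrees of freedom
n_obs, n_params = xdata.shape
dof = n_obs - n_params

# OLS estimates and residuals
beta_hat_OLS = inv(xdata.T @ xdata) @ (xdata.T @ ydata)
residuals = ydata - xdata @ beta_hat_OLS

# assumed variance estimator: sum of squared residuals over degrees of freedom
sigma2_hat_OLS = (residuals.T @ residuals).item() / dof

# estimated variance-covariance matrix and standard errors (column vector)
beta_hat_OLS_vcov = sigma2_hat_OLS * inv(xdata.T @ xdata)
se_hat_OLS = np.atleast_2d(np.sqrt(beta_hat_OLS_vcov.diagonal())).T
print(se_hat_OLS.ravel())
```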

In [8]:
# your code here ...


- Test the new version of the function and compare the results with the canned routine (all numbers should match):

In [9]:
# your code here ...


In [10]:
# your code here ...


- Finally, update your function `OLS_estimator` to also compute test statistics, p-values and confidence intervals for your OLS estimates and add them to your output table in 4 new columns. The function should give an output table that is very similar to the one of the canned method. Tips:
    - You can leverage your knowledge of functions for (i) test statistics, (ii) p-values and (iii) confidence intervals for the sample-mean estimates and apply it for OLS. The only difference is that now you need to obtain them for 2 coefficients, rather than only one. So you'll have to create a loop. 
    - Also, in the exercises of last class on the sample mean, we used a large-sample version of the t-statistic, in which we assumed it to be normally distributed. As the canned routine does not make this assumption and uses Student's t distribution, with the related degrees-of-freedom correction, let's also do it for our function. In the exercises of 2 classes ago, we have seen this test for the sample-mean estimator (it was called the one-sample t-test for the sample mean). Here you need to do the same thing, just for the OLS estimator. Remember that, as we are using Student's t distribution, the Python function that gives you the value of the cumulative distribution function (the area under the density to the left of a given t-statistic value) is `stats.t.cdf`. Also, the Python function that gives you the critical value for a given probability (the inverse of `stats.t.cdf`) is `stats.t.ppf`.
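
A minimal sketch of the loop described in these tips is given below. The estimates, standard errors and degrees of freedom are made-up numbers, and the sketch assumes a two-sided test of the null hypothesis that each coefficient equals zero at the 5% level, which matches what the canned table reports by default.

```python
import numpy as np
import scipy.stats as stats

# hypothetical estimates, standard errors and degrees of freedom
beta_hat_OLS = np.array([[0.5], [1.2]])
se_hat_OLS = np.array([[0.25], [0.1]])
dof = 28
alpha = 0.05  # assumed significance level

# critical value of Student's t for a two-sided interval
t_crit = stats.t.ppf(1 - alpha / 2, dof)

t_stats, p_values, ci_low, ci_high = [], [], [], []
for b, se in zip(beta_hat_OLS.ravel(), se_hat_OLS.ravel()):
    t = b / se                                # t-statistic under H0: coefficient = 0
    p = 2 * (1 - stats.t.cdf(abs(t), dof))    # two-sided p-value
    t_stats.append(t)
    p_values.append(p)
    ci_low.append(b - t_crit * se)            # lower bound of the interval
    ci_high.append(b + t_crit * se)           # upper bound of the interval

print(t_stats, p_values, ci_low, ci_high)
```

The four lists could then become the four extra columns of the output table.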

In [11]:
# your code here ...

- Test the updated version of the function `OLS_estimator` and compare the results with the canned routine (all numbers should match):

In [12]:
# your code here ...

In [13]:
# your code here ...