# Linear Models for Regression

In [None]:
import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from scipy import stats

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

from sklearn.linear_model import ElasticNet, Lasso, Ridge, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures


from IPython.display import display, HTML

display(HTML("<style>.container { width:90% !important; }</style>"))

In [None]:
data = pd.read_csv("data/Rv_daily_lec4.csv", index_col=0)
df = data.copy()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.isnull().sum()

In [None]:
var = "RV"

In [None]:
# Scatter plot the data.
df.plot.scatter(x="Return_close", y=var, title="SP500 ret vs {}".format(var));

In [None]:
# Scatter plot the data.
fig, ax = plt.subplots()
df.plot.scatter(x="Return_close", y=var, title="SP500 ret vs {}".format(var), ax=ax)
ax.set_xlim(-10, 10);

In [None]:
df.columns

In [None]:
# Look at the Pearson correlation coefficient.
df[["Return_close", 'RV']].corr(method="pearson")

IMPORTANT: A linear correlation coefficient of -0.15 suggests a weak negative linear relationship (a nonlinear relationship cannot be ruled out). However, conclusions should not be drawn from this initial statistic alone, without further analysis or validation.

In [None]:
df['RV'].plot()

In [None]:
df['Return_close'].plot()

In [None]:
(r, P_value) = pearsonr(df["Return_close"], df['RV'])
print("correlation = %f" % r)
print("P_value = %f" % P_value)

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 200 or so.
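The interpretation above can be made concrete with a permutation test. The following is a minimal sketch on synthetic data (the series, sample size, and number of permutations are illustrative assumptions, not the notebook's dataset): shuffling one series mimics an uncorrelated system, and the p-value is the share of shuffled correlations at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two series with a weak negative linear relationship.
n = 2000
x = rng.normal(size=n)
y = -0.15 * x + rng.normal(size=n)

r_obs = np.corrcoef(x, y)[0, 1]

# Shuffling y breaks any dependence on x, which mimics an "uncorrelated system".
n_perm = 2000
r_perm = np.array(
    [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)]
)

# Two-sided p-value: share of shuffled correlations at least as extreme as r_obs.
p_perm = np.mean(np.abs(r_perm) >= abs(r_obs))

print(f"observed r = {r_obs:.3f}, permutation p-value = {p_perm:.4f}")
```

This is a different procedure from the analytical p-value returned by `pearsonr`, but both answer the same question.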

In [None]:
df.columns

# Univariate Linear Regression

The first, and arguably the most straightforward, statistical model that we will face is the univariate linear regression model. We have one continuous predictor, also known as the independent variable, and one continuous dependent variable. The task changes depending on the values that the dependent variable can assume: if it is assumed to be unbounded, i.e., taking values across the whole domain of real numbers, we are solving a **regression** problem.

<div style="text-align:center; font-size:24px">
    <span style="color:red">What happens if the independent variable is categorical?</span>
</div>


Linear regression does not imply any causality; it is up to the model user to impose causal assumptions, i.e., which variable takes the role of the criterion and which variable is assigned as a predictor. It is unnecessary to set any such assumptions to obtain a valid linear regression model. However, it is very customary to have some hypothesized direction of causality to discuss prediction meaningfully.

Like any other statistical model, linear regression rests upon some assumptions. We will discuss the following more thoroughly and learn how to assess their validity during this session:

* **Linearity**: The relationship between the independent and dependent variables is linear. This also means that the effect of a change in the independent variable(s) on the dependent variable is constant.
* **Normal distribution of residuals (model errors)**: The errors (residuals of the model) follow a Normal distribution.
* **Constant variance (homoscedasticity)**: The variance of the errors is constant across all levels of the independent variables. This means that the 'spread' of the residuals should remain constant and not form a funnel-like shape.
* **Independence of errors**: The residuals are not autocorrelated.
* **No significant outliers or influential cases**.


Let's come up with our initial linear regression model:


1. One variable, denoted `x`, is regarded as the predictor, explanatory, or independent variable.
2. The other variable, denoted `y`, is regarded as the response, outcome, or dependent variable.

In this case, `x` could be any of the explanatory variables like `TBill1Y`, `TBill3M`, `Oil`, `RV`, `Gold`, `SP_volume`, `weekday` and `y` could be `Return_close`, if you're trying to predict returns based on these factors.

The simple linear regression model provides a coefficient estimate that quantifies the direction and strength of the relationship between the predictor variable and the response. The model has the equation:

$$y_t = \alpha + \beta x_t + \epsilon_t$$

$\alpha$ and $\beta$ are two unknown parameters that represent the intercept and slope terms in the linear model.
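For a single predictor, the least-squares estimates have the closed form $\hat{\beta} = \frac{\sum_t (x_t-\bar{x})(y_t-\bar{y})}{\sum_t (x_t-\bar{x})^2}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$. The sketch below checks this on synthetic data; the "true" parameter values and the noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data generated from y = alpha + beta * x + noise (illustrative values).
alpha_true, beta_true = 1.0, -0.5
x = rng.normal(size=1000)
y = alpha_true + beta_true * x + rng.normal(scale=0.3, size=1000)

# Least-squares estimates for the one-predictor case.
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

# Cross-check against NumPy's degree-1 polynomial fit (slope first, then intercept).
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
```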

We will use the `statsmodels` package to conduct simple linear regression. We choose the realized volatility (RV) as the response variable and the S&P500 returns as the predictor:

In [None]:
X = df["Return_close"]
y = df["RV"]

Before we perform the regression, we need to add a constant to our X variable:

In [None]:
X = sm.add_constant(X)

In [None]:
X

In [None]:
y

In [None]:
model = sm.OLS(y, X)
results = model.fit()

In [None]:
results.summary()

$$\hat{y} = 1.1432 - 0.0664 x$$

Look at the residuals

In [None]:
results.resid  # from stats object

In [None]:
residuals = results.resid

**Test for Error Normality**

One of the main assumptions for the inferential part of the regression (OLS, ordinary least squares) is that the errors follow a normal distribution. A first important verification is to check the compatibility of the residuals (the errors observed on the sample) with this assumption.

In [None]:
sns.displot(residuals)

In [None]:
mu = residuals.mean()
std = residuals.std()
x = residuals.sort_values()
plt.hist(residuals, bins=100, density=1)
plt.plot(x, stats.norm.pdf(x, mu, std), "red")
plt.xlabel("residuals")
plt.show()

In [None]:
# Skewness and excess kurtosis of the statsmodels residuals
print("skewness -> %f" % residuals.skew())
print("excess kurtosis -> %f" % residuals.kurt())

**Jarque-Bera normality test  (uses only skewness and kurtosis)**

If the data comes from a normal distribution, the JB statistic **asymptotically** has a **Chi-Squared** distribution with 2 degrees of freedom:
$$JB = \frac{n-k}{6}\left(\xi^2+\frac{1}{4}(\chi -3)^2\right)$$
where $\xi$ is the sample skewness, $\chi$ is the sample kurtosis, $n$ is the number of observations and $k$ is the number of regressors when examining the residuals of an equation.
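To see how the statistic reacts to non-normality, it can be computed by hand from the sample skewness and kurtosis. This is a minimal sketch, assuming raw samples (so $k = 0$) and synthetic data; the helper name `jarque_bera_stat` is hypothetical and is not part of `statsmodels`.

```python
import numpy as np

def jarque_bera_stat(sample, k=0):
    """JB = (n - k) / 6 * (skew^2 + (kurt - 3)^2 / 4); k = number of regressors."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    centered = sample - sample.mean()
    m2 = np.mean(centered**2)
    skew = np.mean(centered**3) / m2**1.5
    kurt = np.mean(centered**4) / m2**2  # equals 3 for a normal distribution
    return (n - k) / 6.0 * (skew**2 + 0.25 * (kurt - 3.0) ** 2)

rng = np.random.default_rng(2)
jb_normal = jarque_bera_stat(rng.normal(size=5000))
jb_heavy = jarque_bera_stat(rng.laplace(size=5000))  # heavier tails than the normal

print(f"JB (normal sample)  = {jb_normal:.2f}")
print(f"JB (Laplace sample) = {jb_heavy:.2f}")
```

A heavy-tailed sample produces a much larger statistic than a normal one, which is what leads to a rejection of normality.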

In [None]:
JB, JBpv, skw, kurt = sm.stats.stattools.jarque_bera(residuals)
print(JB, JBpv, skw, kurt)

Here we reject the null hypothesis that the errors follow a normal distribution.

**Q-Q plot**


In [None]:
sm.qqplot(residuals, line="s")

**K-S test**

The Kolmogorov–Smirnov statistic for a given cumulative distribution function $F(x)$ is:
$$D_n = \sup_x |F_n(x)- F(x)|$$
Asymptotically, $\sqrt{n}D_{n}$ converges to the Kolmogorov distribution, which does not depend on $F$.
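The statistic $D_n$ can be computed directly from the sorted sample. The sketch below is an illustration on synthetic data (the helper `ks_statistic_normal` is a hypothetical name, not a SciPy function). It also shows that the result depends on which reference distribution $F$ is used: a series with standard deviation 3 looks far from a standard normal, but close to a normal with its own mean and standard deviation.

```python
import math
import numpy as np

def ks_statistic_normal(sample, mean=0.0, std=1.0):
    """D_n = sup_x |F_n(x) - F(x)| against a normal CDF with given mean and std."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    # Normal CDF evaluated through the error function.
    cdf = np.array(
        [0.5 * (1.0 + math.erf((v - mean) / (std * math.sqrt(2.0)))) for v in x]
    )
    # The empirical CDF jumps at each data point, so check both sides of each jump.
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(3)
z = rng.normal(loc=0.0, scale=3.0, size=2000)  # residual-like series with std != 1

d_raw = ks_statistic_normal(z)  # reference: N(0, 1)
d_fit = ks_statistic_normal(z, mean=z.mean(), std=z.std(ddof=1))  # fitted normal

print(f"D_n vs N(0, 1)       = {d_raw:.3f}")
print(f"D_n vs fitted normal = {d_fit:.3f}")
```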

In [None]:
x

In [None]:
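# Compare the residuals with a normal distribution that has their own mean and
# standard deviation; kstest's default "norm" reference is the standard normal N(0, 1).
# Because the parameters are estimated from the same data, the p-value is approximate.
(KS, p_V) = stats.kstest(residuals, "norm", args=(residuals.mean(), residuals.std()))
print("KS -> %f" % (KS))
print("p_V -> %f" % (p_V))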

**Homoskedasticity**

In [None]:
plt.scatter(residuals, X.iloc[:, 1])
plt.xlabel("Residuals")
plt.ylabel("Returns")

**Durbin-Watson test**

The null hypothesis of the test is that there is no serial correlation. The Durbin-Watson test statistic is defined as:

$$DW = \frac{\sum_{t=2}^{T} (e_t-e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$
where $T$ is the number of residuals. The test statistic is approximately equal to $2(1-r)$, where $r$ is the sample autocorrelation of the residuals. Thus, for $r = 0$, indicating no serial correlation, the test statistic equals $2$. This statistic will always be between 0 and 4. The closer to 0 the statistic, the more evidence for positive serial correlation. The closer to 4, the more evidence for negative serial correlation.
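The behaviour of the statistic can be checked on simulated series. This is a minimal sketch with synthetic data: a white-noise series should give a value near 2, while an AR(1) series with autocorrelation 0.7 (an illustrative choice) should give a value near $2(1-0.7)=0.6$. The helper name `durbin_watson_stat` is hypothetical.

```python
import numpy as np

def durbin_watson_stat(e):
    """Sum of squared first differences divided by the sum of squares."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e**2)

rng = np.random.default_rng(4)
n = 5000
white = rng.normal(size=n)

# AR(1) series e_t = rho * e_{t-1} + u_t with positive serial correlation.
rho = 0.7
ar1 = np.empty(n)
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = rho * ar1[t - 1] + white[t]

dw_white = durbin_watson_stat(white)
dw_ar1 = durbin_watson_stat(ar1)

print(f"DW white noise = {dw_white:.3f}")
print(f"DW AR(1)       = {dw_ar1:.3f}")
```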

In [None]:
X["res"] = results.resid
E = X.sort_values(by="Return_close").res
print("DW_e -> %f" % durbin_watson(E))
print("DW_e^2 -> %f" % durbin_watson(E**2))

If the assumptions of linearity, independence, homoscedasticity, and normality are violated, you might need to consider data transformations, adding interaction terms or applying a more suitable modeling technique.

Please note that real-world data often violate the assumptions to some degree but still result in useful models. It's the degree of violation that determines whether we can overlook the violation or need to address it.

# Multivariate Linear Regression

Multivariate linear regression is an extension of univariate linear regression used to predict an outcome variable (Y) based on multiple distinct predictor variables (X). With three or more variables involved, the data is modelled as a hyperplane in multidimensional space.

The multivariate linear regression equation is as follows:

$$y_{t} = \alpha + \beta_1 x_{t1} + \beta_2 x_{t2} + \dots + \beta_n x_{tn} + \epsilon_t $$


Where:
- $y_t$ is the dependent variable.
- $\alpha$ is the intercept and $\beta_1$, ..., $\beta_n$ are the regression coefficients. Each $\beta_j$ represents the change in the dependent variable for a one-unit change in the corresponding independent variable, assuming all other variables are held constant.
- $x_{t1}$, $x_{t2}$, ..., $x_{tn}$ are the independent variables.
- $\epsilon_t$ is the error term.

Just as with univariate linear regression, the assumptions for multivariate linear regression are linearity, independence, homoscedasticity, and normality of residuals.

In a more compact form one can write

$$\mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}$$


Remember, when using multivariate linear regression, multicollinearity can be a problem. Multicollinearity is when predictor variables are correlated with each other. This can be checked by examining the correlation matrix of the variables. If multicollinearity is found, you might need to remove one of the correlated variables or perform dimensionality reduction.
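Besides inspecting the correlation matrix, a common complementary diagnostic is the variance inflation factor, $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below is an addition to the approach described above, written with plain NumPy on synthetic data; the helper `vif` is a hypothetical name. Values far above 1 flag a predictor that is largely explained by the others.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (X has no constant column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on a constant plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)  # nearly a copy of a
c = rng.normal(size=1000)            # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```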


<div style="text-align:center; font-size:24px">
    <span style="color:red">Can I have both continuous and categorical predictors in a multivariate linear regression?</span>
</div>



In [None]:
df

In [None]:
df['TBill3M_ret'] = df['TBill3M'].pct_change()

In [None]:
df['TBill3M_ret'].plot()

In [None]:
# Compute the correlation matrix
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(10, 7))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
    corr,
    mask=mask,
    cmap=cmap,
    vmax=0.3,
    center=0,
    square=True,
    annot=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.5},
)

<div style="text-align:center; font-size:24px">
    <span style="color:red">Why does TBill3M_ret have a very low correlation with the rest of the dataset?</span>
</div>

In [None]:
col_to_transform = ["TBill3M", "TBill1Y", "Oil", "Gold", "SP_volume"]
for c in col_to_transform:
    df["{}_ret".format(c)] = df[c].pct_change(1) * 100

df = df.dropna().copy()

In [None]:
# Compute the correlation matrix
corr = df.corr()

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(15, 10))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the full heatmap (no mask this time) with the correct aspect ratio
sns.heatmap(
    corr,
    cmap=cmap,
    vmax=0.3,
    center=0,
    square=True,
    annot=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.5},
)

In [None]:
# Define predictor variables and the response variable
X = df[["RV", "TBill1Y_ret"]]
y = df["Return_close"]

# Add a constant to the predictor variables
X = sm.add_constant(X)

# Build the model
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

# Print a summary of the results
print(results.summary())

## Autoregressive Components and Predictive Modeling

<div style="text-align:center; font-size:24px">
    <span style="color:red">What about including an autoregressive component in the model?</span>
</div>


In [None]:
# shift down, create lags
df['Return_close_t-1'] = df['Return_close'].shift(1)

In [None]:
df['RV']

In [None]:
# shift up, create future targets
df['RV_t+1'] = df['RV'].shift(-1)

In [None]:
df.dropna()

In [None]:
df.isnull().sum()

In [None]:
df[['RV','RV_t+1']]

In [None]:
def add_lags(df, columns, n_lags=1):
    """
    Add lags to specific columns in a DataFrame.

    Parameters:
    - df (DataFrame): Original DataFrame.
    - columns (list): List of column names for which to create lags.
    - n_lags (int): Number of lags to create for each column.

    Returns:
    - DataFrame: Updated DataFrame with lag columns.
    """
    df_copy = df.copy()
    for column in columns:
        for lag in range(1, n_lags + 1):
            df_copy.loc[:,f"{column}_lag{lag}"] = df_copy.loc[:,column].shift(lag)
    return df_copy

In [None]:
df = add_lags(df, ["Return_close"], n_lags=3)

In [None]:
df.head()

In [None]:
df = df.dropna()

In [None]:
df = df.replace([np.inf, -np.inf], 0)

Using `RV` and `TBill1Y_ret` as regressors will not make the estimated model a predictive one. 

<div style="text-align:center; font-size:24px">
    <span style="color:red">Why will the model not be predictive?</span>
</div>

In [None]:
X_cols = [
    # "RV",
    # "TBill1Y_ret",
    "Return_close_lag1",
    "Return_close_lag2",
    "Return_close_lag3",
]
X = df[X_cols]
y = df["Return_close"]

X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

In [None]:
X_cols = ["Return_close_lag1", "Return_close_lag2"]
X = df[X_cols]
y = df["Return_close"]

X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

## Performance Evaluation

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
X_cols = ["Return_close_lag1", "Return_close_lag2"]
X = df[X_cols]
y = df["Return_close"]

In [None]:
# Calculate the index for the 80/20 train-test split
train_size = int(0.8 * len(df))

In [None]:
# Split the data into training and testing sets
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

In [None]:
# Fit the model
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

model = sm.OLS(y_train, X_train)
results = model.fit()

In [None]:
y_train_pred = results.predict(X_train)
y_test_pred = results.predict(X_test)

In [None]:
# Calculate MSE on training and testing data
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

print(f"Train MSE: {mse_train}")
print(f"Test MSE: {mse_test}")
print(results.summary())

In [None]:
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

print(f"Train RMSE: {rmse_train}")
print(f"Test RMSE: {rmse_test}")

In [None]:
y_test.std()

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(y_test.index, y_test, label='Actual')
plt.plot(y_test.index, y_test_pred, label='Predicted')
plt.xlabel('Time')
plt.ylabel('Return_close')
plt.title('Actual vs. Predicted Returns on Test Set')
plt.legend()
plt.xticks(y_test.index[::200]);

<div style="text-align:center; font-size:24px">
    <span style="color:red">How can this model be improved?</span>
</div>

# Polynomial Regression

Polynomial Regression is a form of linear regression in which the relationship between $X$ and $Y$ is modeled as an $n$-th degree polynomial. Polynomial regression fits a nonlinear relationship between the independent variable and the dependent variable.

Even though it models a nonlinear relationship, Polynomial Regression is still considered a linear model because the regression function is linear in terms of the coefficients:

$$ y = \alpha + \beta_1 x_1 + \beta_2 x_1^2 + \dots + \beta_n x_1^n $$

Here:
- $\alpha, \beta_1, \dots, \beta_n$ are the coefficients.
- $x_1, x_1^2, \ldots, x_1^n$ are the powers of the independent variable $x_1$, each treated as a separate regressor.

As the degree of the polynomial increases, the model can fit a wider range of curvatures, making it more flexible. However, high-degree polynomials might lead to overfitting, where the model performs poorly on new, unseen data.

With more than one predictor, a polynomial expansion also includes cross-product (interaction) terms between the variables, which allows for a more complex interplay between them.
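The trade-off between flexibility and overfitting can be illustrated by fitting polynomials of increasing degree and comparing the error on data used for fitting with the error on held-out data. This is a minimal sketch on synthetic data: the quadratic signal, the noise level, and the degrees tried are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data: a quadratic signal plus noise (illustrative only).
x = rng.uniform(-1.0, 1.0, size=60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.5, size=60)

# The points are i.i.d. draws, so a simple positional split is adequate here.
x_train, x_test = x[:40], x[40:]
y_train, y_test = y[:40], y[40:]

train_mse, test_mse = {}, {}
for degree in (1, 2, 10):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(
        f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
        f"test MSE = {test_mse[degree]:.3f}"
    )
```

The training error can only go down as the degree grows, because each polynomial nests the lower-degree ones; the held-out error carries no such guarantee.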

In [None]:
# Create polynomial features
poly = PolynomialFeatures(degree=2)  
X_poly = poly.fit_transform(X)


# Fit a Polynomial Linear Regression model 
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)

# Get the model parameters
intercept = poly_reg.intercept_
coefficients = poly_reg.coef_

print("Intercept: \n", intercept)
print("Coefficients: \n", coefficients)

# Evaluate the model
y_pred = poly_reg.predict(X_poly)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error: \n", mse)

In [None]:
coefficients.shape

# Regularized Linear Models

## Ridge Regression

Ridge Regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated). By adding a degree of bias to the regression estimates, Ridge Regression reduces their standard errors.

This technique works by adding the squared magnitude of the coefficients as a penalty term to the loss function.

**Ridge Regression** aims to minimize the following objective function:

$$L(\beta) = ||Y - X\beta||^2 + \lambda||\beta||^{2}_{2} $$

where:
- $Y$ is the response variable.
- $X$ is the design matrix.
- $\beta$  is the vector of coefficients.
- $\lambda$ is the Ridge regularization parameter.

The term $\lambda||\beta||^{2}_{2}$ is the L2 penalty term that "penalizes" the size of the coefficients. While $\lambda$ can take any value between 0 and $\infty$, note that:
- When $\lambda = 0$, Ridge Regression will produce the same coefficients as an ordinary linear regression.
- As $\lambda \to \infty$, all coefficients shrink toward zero because of the infinite penalty.
- When $0 < \lambda < \infty$, the magnitude of $\lambda$ determines how strongly the coefficients are shrunk.

**Differences in Optimization Compared to OLS**:

Ordinary Least Squares (OLS) aims to minimize just the residual sum of squares:

$$L_{OLS}(\beta) = ||Y - X\beta||^2$$

As you can see, OLS does not have the regularization term that Ridge regression does. The L2 penalty in Ridge Regression shrinks the coefficients, especially when the regularization parameter $\lambda$ is large, which can help prevent overfitting especially in scenarios where multicollinearity is present. One of the significant advantages of ridge regression is coefficient shrinkage and reducing model complexity.


Here, $\lambda$ is a tuning parameter (also known as regularization parameter) that decides how much we want to penalize the flexibility of our model.
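Setting the gradient of the Ridge objective to zero gives $-2X'(Y - X\beta) + 2\lambda\beta = 0$, i.e. $\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1}X'Y$. The sketch below evaluates this closed form on synthetic data with two strongly correlated predictors. It assumes centered data and no intercept (a simplification for illustration), and `ridge_coefficients` is a hypothetical helper name.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Minimizer of ||y - X b||^2 + lam * ||b||_2^2 (no intercept term)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # strongly correlated with x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Center so that leaving out the intercept is harmless.
X = X - X.mean(axis=0)
y = y - y.mean()

lambdas = [0.0, 1.0, 10.0, 100.0]
betas = {lam: ridge_coefficients(X, y, lam) for lam in lambdas}
norms = [np.linalg.norm(betas[lam]) for lam in lambdas]

# With lam = 0 the ridge solution coincides with ordinary least squares.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:6.1f}: beta = {np.round(betas[lam], 3)}, norm = {norm:.3f}")
```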



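Generally, we use `sklearn` for Ridge estimation, where the regularization parameter $\lambda$ is passed as `alpha`. The `statsmodels` library does not have a dedicated Ridge Regression class similar to `sklearn`'s, although its `OLS.fit_regularized` method fits elastic-net-type penalties.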

In [None]:
X_cols = [
    "RV",
    "TBill3M_ret",
    "TBill1Y_ret",
    "Oil_ret",
    "Gold_ret",
    "SP_volume_ret",
    "Return_close_lag1",
    "Return_close_lag2",
    "Return_close_lag3",
]
X = df[X_cols]
y = df["Return_close"]

# Fit a Ridge regression model
ridge_reg = Ridge(alpha=0.5)  
ridge_reg.fit(X, y)

In [None]:
# Get the model parameters
intercept = ridge_reg.intercept_
coefficients = ridge_reg.coef_

print("Intercept: \n", intercept)
print("Coefficients: \n", coefficients)

In [None]:
# Evaluate the model
y_pred = ridge_reg.predict(X)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error: \n", mse)

## Lasso Regression

**Lasso (Least Absolute Shrinkage and Selection Operator) Regression** is another regularization technique. It's useful when dealing with feature selection in a model where we have a large number of features.

Like Ridge Regression, Lasso also adds a penalty for non-zero coefficients, but unlike Ridge Regression, which penalizes the sum of squared coefficients (the L2 penalty), Lasso penalizes the sum of their absolute values (the L1 penalty). As a result, for high values of $\lambda$, many coefficients are set exactly to zero under Lasso, which generally does not happen under Ridge.

**Objective Function for Lasso Regression**:

$$L(\beta) = ||Y - X\beta||^2 +  \lambda  || \beta ||_{1} $$

Where:
- $ \lambda $ is the Lasso regularization parameter.
- $ \beta $ is the vector of model coefficients $ \beta_j $.

While $ \lambda $ can take any value between 0 and $\infty$, note that:
- When $ \lambda = 0 $, Lasso produces the same coefficients as a simple linear regression.
- When $ \lambda = \infty $, all coefficients are zero because of infinite penalty.
- When $ 0 < \lambda < \infty $, the magnitude of $ \lambda $ will decide how the model balances fit with complexity.

The key difference from Ridge Regression is that the L1 penalty can lead to zero coefficients, i.e. some of the features are completely eliminated, which amounts to a form of feature selection. This is a useful property for machine learning applications where feature selection is important.
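The exact zeros come from the soft-thresholding step that appears when the Lasso objective is minimized one coefficient at a time: for feature $j$ the update is $\beta_j = S(x_j' r_j, \lambda/2) / (x_j' x_j)$, where $r_j$ is the residual with feature $j$ removed and $S(z, t) = \operatorname{sign}(z)\max(|z| - t, 0)$. The sketch below is a plain-NumPy coordinate descent on synthetic data, written only to illustrate the mechanism; it is not how `sklearn` is implemented in detail, and the function names and the value of $\lambda$ are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])  # only two relevant features
y = X @ beta_true + rng.normal(scale=0.5, size=200)

beta_lasso = lasso_coordinate_descent(X, y, lam=100.0)
print(np.round(beta_lasso, 3))
```

The irrelevant features end up with coefficients that are exactly zero, while the relevant ones are kept but shrunk.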

In [None]:
X

In [None]:
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)

# Get the model parameters
intercept = lasso_reg.intercept_
coefficients = lasso_reg.coef_

print("Intercept: \n", intercept)
print("Coefficients: \n", coefficients)

# Evaluate the model
y_pred = lasso_reg.predict(X)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error: \n", mse)

## Elastic Net Regression


Elastic Net is a middle ground between Ridge Regression and Lasso. It incorporates penalties from both Lasso and Ridge to get the best of both worlds. Elastic Net aims at minimizing the following loss function:

$$L(\beta) = ||Y - X\beta||^2 +  \lambda_1 || \beta ||_{1} + \lambda_2  ||\beta||^{2}_{2}  $$ 

Where:
- $\lambda_1$ is the coefficient of L1 penalty, similar to the one used in Lasso.
- $\lambda_2$ is the coefficient of L2 penalty, similar to the one used in Ridge.

In other words, Elastic Net is a hybrid of Ridge Regression and Lasso. It works by penalizing the model using both the L2-norm (Ridge) and the L1-norm (Lasso). 

The key takeaway is that Elastic Net is useful when there are multiple features which are correlated. Lasso might arbitrarily pick one of these, whereas Elastic Net tends to keep both. However, it does have a computational cost, as it adds an extra hyperparameter to tune.
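When moving from the formula to `sklearn`, note that `ElasticNet` is parameterized by `alpha` and `l1_ratio` rather than by $\lambda_1$ and $\lambda_2$. Its documented objective is $\frac{1}{2n}||Y - X\beta||^2 + \texttt{alpha}\cdot\texttt{l1\_ratio}\,||\beta||_1 + \frac{1}{2}\texttt{alpha}(1-\texttt{l1\_ratio})||\beta||_2^2$; multiplying by $2n$ gives $\lambda_1 = 2n\cdot\texttt{alpha}\cdot\texttt{l1\_ratio}$ and $\lambda_2 = n\cdot\texttt{alpha}\,(1-\texttt{l1\_ratio})$. The helper names and the sample size below are hypothetical and serve only to show the arithmetic.

```python
def sklearn_to_lambdas(alpha, l1_ratio, n_samples):
    """Map sklearn's (alpha, l1_ratio) to (lambda_1, lambda_2) of the loss above."""
    lambda_1 = 2.0 * n_samples * alpha * l1_ratio
    lambda_2 = n_samples * alpha * (1.0 - l1_ratio)
    return lambda_1, lambda_2

def lambdas_to_sklearn(lambda_1, lambda_2, n_samples):
    """Inverse mapping: recover (alpha, l1_ratio) from (lambda_1, lambda_2)."""
    l1_part = lambda_1 / (2.0 * n_samples)  # equals alpha * l1_ratio
    l2_part = lambda_2 / n_samples          # equals alpha * (1 - l1_ratio)
    alpha = l1_part + l2_part
    return alpha, l1_part / alpha

# Hypothetical sample size; alpha and l1_ratio are illustrative values.
lam1, lam2 = sklearn_to_lambdas(alpha=0.1, l1_ratio=0.5, n_samples=1000)
alpha_back, ratio_back = lambdas_to_sklearn(lam1, lam2, n_samples=1000)

print(f"lambda_1 = {lam1:.1f}, lambda_2 = {lam2:.1f}")
print(f"alpha = {alpha_back:.3f}, l1_ratio = {ratio_back:.3f}")
```

Because the mapping depends on the number of observations, the same `alpha` implies a different $\lambda$ on datasets of different size.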


In [None]:
# Fit an Elastic Net model
elastic_reg = ElasticNet(
    alpha=0.1, l1_ratio=0.5
)
elastic_reg.fit(X, y)

# Get the model parameters
intercept = elastic_reg.intercept_
coefficients = elastic_reg.coef_

print("Intercept: \n", intercept)
print("Coefficients: \n", coefficients)

# Evaluate the model
y_pred = elastic_reg.predict(X)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error: \n", mse)

# Appendix: Background Theory

These models come from econometrics, which is the intersection of Economics and Statistics. An econometric model is an association between $y_{i}$ and $x_{i}$, e.g.:
- personal income $y_{i}$ and personal IQ $x_{i}$
- stock return $y_{i}$ and market return $x_{i}$
- current return $y_{t}$ and past returns $y_{t-h}$

The econometric model provides an "approximate," i.e., a probabilistic description of the association. The relation will be stochastic and not deterministic. Econometrics provides estimation methods for the parametric model.

**Ordinary least squares (OLS): a first linear model**

Linear model
$$
\begin{aligned}
y_{i} &=f\left(x_{i 1}, x_{i 2}, \ldots, x_{i, k-1}\right)+\varepsilon_{i} \\
&=\beta_{0}+\beta_{1} x_{i 1}+\beta_{2} x_{i 2}+\cdots+\beta_{k-1} x_{i, k-1}+\varepsilon_{i} \quad i=1, \ldots, n
\end{aligned}
$$
where
- $y_{i}:$ dependent or explained variable (observed)
- $x_{i}:$ regressors or covariates or explanatory variables (observed)
- $\varepsilon_{i}:$ error term or random disturbance (unobserved)
- $\beta_{j}:$ unknown parameters or regression coefficients (unobserved)

$$
y_{i}=\beta_{0}+\beta_{1} x_{i 1}+\beta_{2} x_{i 2}+\cdots+\beta_{k-1} x_{i, k-1}+\varepsilon_{i}
$$
can be written in vector notation
$$
y_{i}=\underbrace{x_{i}^{\prime}}_{1 \times k} \underbrace{\beta}_{k \times 1}+\varepsilon_{i}
$$
and in the even more compact matrix notation
$$
\underbrace{Y}_{n \times 1}=\underbrace{X}_{n \times k} \underbrace{\beta}_{k \times 1}+\varepsilon
$$
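with $Y=\left(y_{1}, \ldots, y_{n}\right)^{\prime}$, $X$ the matrix whose $i$-th row is $x_{i}^{\prime}$, and $\varepsilon=\left(\varepsilon_{1}, \ldots, \varepsilon_{n}\right)^{\prime}$.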

**OLS assumptions**

Standard OLS Assumptions:
- H.1 Strict exogeneity of regressors: $\mathbb{E}[\varepsilon \mid X]=0$
Note: $\varepsilon_{i}$ does not depend on any $x_{j}$, neither past nor future $x$ s
$$
\begin{aligned}
&\mathbb{E}[\varepsilon \mid X]=0 \Rightarrow \mathbb{E}[\varepsilon]=0 \text { (by Law of Total Exp } \mathbb{E}[\mathbb{E}[\varepsilon \mid X]]=\mathbb{E}[\varepsilon]) \\
&\mathbb{E}[\varepsilon \mid X]=0 \Rightarrow \mathbb{E}(X \varepsilon)=\underbrace{\mathbb{E}[\mathbb{E}(X \varepsilon \mid X)]}_{\text {Law of Total Exp }}=\underbrace{\mathbb{E}[X \mathbb{E}(\varepsilon \mid X)]}_{\text {is measurable }}=\underbrace{0}_{\mathbb{E}(\varepsilon \mid X)=0}
\end{aligned}
$$
$\mathbb{E}[\varepsilon \mid X]=0 \Rightarrow \mathbb{E}[y \mid X]=X \beta \quad$ i.e. $X \beta$ is the conditional mean of $y \mid X$.
- H.2 Identification condition: $X$ is $n \times k$ with rank $k$ with probability 1
- H.3 Spherical errors: $\operatorname{Var}[\varepsilon \mid X]=\sigma^{2} I_{n}$
$\Rightarrow$ homoscedastic errors: $\operatorname{Var}\left[\varepsilon_{i} \mid X\right]=\sigma^{2}, \quad \forall i=1, \ldots, n$ and
uncorrelated errors: $\operatorname{Cov}\left[\varepsilon_{i}, \varepsilon_{j} \mid X\right]=0 \quad \forall i \neq j$


**OLS estimation**

Goal: statistical inference on $\beta$, e.g. estimate $\beta$.

Least squares finds the $\beta$ that minimizes the sum of squared residuals in $Y=X \beta+\varepsilon$:
$$
\begin{aligned}
S S &=\sum_{i=1}^{n} \varepsilon_{i}^{2}=\varepsilon^{\prime} \varepsilon \\
&=(Y-X \beta)^{\prime}(Y-X \beta) \\
&=Y^{\prime} Y-2 \beta^{\prime} X^{\prime} Y+\beta^{\prime} X^{\prime} X \beta \\
\text { F.O.C.: } \quad &-2 X^{\prime} Y+2 X^{\prime} X \beta=0 \\
\Rightarrow \quad & X^{\prime}(Y-X \beta)=0 \\
\Rightarrow \quad & X^{\prime} X \beta=X^{\prime} Y
\end{aligned}
$$
$\Rightarrow$ OLS estimator:
$$
\begin{aligned}
\hat{\beta} &=\left(X^{\prime} X\right)^{-1} X^{\prime} Y \\
&=\left(\sum_{i=1}^{n} x_{i} x_{i}^{\prime}\right)^{-1}\left(\sum_{i=1}^{n} x_{i} y_{i}\right)
\end{aligned}
$$
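The estimator can be verified numerically. This is a minimal sketch on synthetic data (the coefficient values and noise level are illustrative assumptions): it solves the normal equations $X'X\beta = X'Y$, compares the result with NumPy's least-squares routine, and checks the first-order condition $X'(Y - X\hat{\beta}) = 0$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 500, 3

# Design matrix with a column of ones for the intercept (synthetic data).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([0.5, 1.5, -2.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve X'X beta = X'Y directly rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Same estimate from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The first-order condition says the residuals are orthogonal to every regressor.
residuals = Y - X @ beta_hat
foc = X.T @ residuals

print("beta_hat =", np.round(beta_hat, 4))
print("max |X'e| =", np.max(np.abs(foc)))
```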

- Unbiasedness: $\mathbb{E}[\hat{\beta} \mid X]=\beta$
$$
\hat{\beta}=\left(X^{\prime} X\right)^{-1} X^{\prime}(X \beta+\varepsilon)=\beta+\left(X^{\prime} X\right)^{-1} X^{\prime} \varepsilon
$$
Then
$$
\mathbb{E}[\hat{\beta} \mid X]=\beta+\left(X^{\prime} X\right)^{-1} X^{\prime} \underbrace{\mathbb{E}[\varepsilon \mid X]}_{=0(H .1)}=\beta
$$
- Variance: $\operatorname{Var}(\hat{\beta} \mid X)=\sigma^{2}\left(X^{\prime} X\right)^{-1}$
$$
\operatorname{Var}[\hat{\beta} \mid X]=\left(X^{\prime} X\right)^{-1} X^{\prime} \underbrace{\operatorname{Var}[\varepsilon \mid X]}_{\sigma^{2} I_{n}(H .3)} X\left(X^{\prime} X\right)^{-1}=\sigma^{2}\left(X^{\prime} X\right)^{-1}
$$
- Efficiency (Gauss-Markov Theorem): $\hat{\beta}$ is BLUE (best linear unbiased estimator), i.e. $\operatorname{Var}(\hat{\beta} \mid X) \leq \operatorname{Var}(\tilde{\beta} \mid X)$ for every linear unbiased estimator $\tilde{\beta}$ (prove it)
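The unbiasedness and variance results can be checked by simulation. This is a minimal Monte Carlo sketch under assumptions H.1 to H.3 with normal errors (the normality, the sample size, and the number of replications are illustrative choices): $X$ is held fixed, the error vector is redrawn many times, and the empirical mean and covariance of $\hat{\beta}$ are compared with $\beta$ and $\sigma^{2}(X'X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(10)
n, sigma, n_sims = 50, 1.0, 20000

# X is held fixed across simulations, matching the conditioning on X.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Each column of E is one draw of the error vector; B holds one estimate per column.
E = rng.normal(scale=sigma, size=(n, n_sims))
Y = (X @ beta)[:, None] + E
B = XtX_inv @ X.T @ Y

mean_beta_hat = B.mean(axis=1)
empirical_cov = np.cov(B)  # rows of B are the two coefficients
theoretical_cov = sigma**2 * XtX_inv

print("mean of beta_hat:", np.round(mean_beta_hat, 4))
print("empirical cov:\n", np.round(empirical_cov, 5))
print("sigma^2 (X'X)^-1:\n", np.round(theoretical_cov, 5))
```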

**Goodness of fit**


Since the fitted values $\hat{Y}=X\hat{\beta}$ are orthogonal to the residuals $e=Y-\hat{Y}$, i.e. $\hat{Y} \perp e$, then
$$
\begin{aligned}
\operatorname{Var}(Y) &=\operatorname{Var}(\hat{Y})+\operatorname{Var}(e) \\
\frac{T S S}{n} &=\frac{E S S}{n}+\frac{R S S}{n} \\
\text { Total Var } &=\text { Explained Var + Residual Var }
\end{aligned}
$$
A common measure of goodness of fit is the coefficient of determination $R^{2}$ :
$$
R^{2}=\frac{\text { Explained Var }}{\text { Total Var }}=1-\frac{\text { Residual Var }}{\text { Total Var }}=1-\frac{R S S}{T S S}
$$
Since $R^{2}$ never decreases when a regressor is added (even if it is unrelated to $y$), the adjusted $R^{2}$ penalizes the number of regressors $k$:
$$
\text { Adjusted } R^{2}=1-\frac{R S S /(n-k)}{T S S /(n-1)}
$$
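The effect of adding an irrelevant regressor can be seen in a small experiment. This is a minimal sketch on synthetic data; the helper `r2_and_adjusted` is a hypothetical name, and the extra regressor is pure noise by construction.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include the constant column."""
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (rss / (n - k)) / (tss / (n - 1))
    return r2, adj_r2

rng = np.random.default_rng(11)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
# Add a regressor that is pure noise, unrelated to y by construction.
X_big = np.column_stack([X_small, rng.normal(size=n)])

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)

print(f"one regressor : R2 = {r2_small:.4f}, adjusted R2 = {adj_small:.4f}")
print(f"plus noise    : R2 = {r2_big:.4f}, adjusted R2 = {adj_big:.4f}")
```

The plain $R^{2}$ cannot fall when the noise column is added, whereas the adjusted version may fall because of the degrees-of-freedom correction.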

