# Topics in Regression Statistics



## Properties of a good estimate
* **Unbiased** - $E[\hat{X}] = X$, on average the estimate equals the true value
* **Efficient** - $\hat{X}$ is the minimum variance estimate of $X$ (among unbiased estimates)
* **Consistent** - $\lim_{T \to \infty} Pr[|\hat{\beta} - \beta| > \gamma] = 0,  \forall \gamma > 0$. As the sample size tends to infinity, the probability that the estimate lies farther than any fixed distance $\gamma$ from the true value goes to zero
* NOTE: 
  * OLS is BLUE (best linear unbiased estimate) when the Gauss-Markov assumptions hold
  * MLE is based on a distribution assumption
  * OLS is a mathematical estimate that does NOT assume a distribution
    * what links OLS to statistics/distribution is the Gauss-Markov Theorem
  * When you do a regression you have to check all six GM assumptions. If the six assumptions are satisfied (and the errors are normally distributed), then the OLS estimate = MLE estimate



## Regression

### General steps when doing regression
* graph the data 
* try to assess relationships between the data
  * example (Cobb-Douglas variant) $Q=\beta_0 N^{\beta_1}K^{\beta_2}\epsilon$
  * betas are elasticities, N is the number of laborers, K is the amount of capital
* Rule of thumb is to never go above a cubic model. Begin with the cubic, assess the significance of the coefficients, and if the cubic term is not significant go to a quadratic model.
* For forecasting the % change in Q, drop the intercept and multiply each elasticity by the % change in its variable:
  * $\%\Delta Q = -1.5(.10) + .8(.04) = -.15 + .032 = -.118$, i.e. an 11.8% decrease
* Confidence interval
  * $P[\bar{x} - Z_{\alpha/2}*SE \le \mu \le \bar{x} + Z_{\alpha/2}*SE] = (1-\alpha)$
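
The two calculations above can be sketched in a few lines of Python. This is only an illustration: the elasticities and percentage changes are the ones from the forecasting bullet, while the `sample` values and the function name `z_confidence_interval` are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Elasticities and % changes taken from the forecasting bullet above.
elasticities = [-1.5, 0.8]
pct_changes = [0.10, 0.04]
pct_change_q = sum(b * d for b, d in zip(elasticities, pct_changes))


def z_confidence_interval(sample, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean: x_bar -/+ Z_{alpha/2} * SE."""
    n = len(sample)
    x_bar = mean(sample)
    se = stdev(sample) / sqrt(n)             # standard error of the mean
    z = NormalDist().inv_cdf(1 - alpha / 2)  # normal critical value Z_{alpha/2}
    return x_bar - z * se, x_bar + z * se


sample = [72.0, 75.5, 78.1, 74.3, 76.8, 73.9, 77.2, 75.0]  # made-up data
low, high = z_confidence_interval(sample)
print(f"% change in Q: {pct_change_q:.3f}")
print(f"95% interval for the mean: ({low:.2f}, {high:.2f})")
```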


### Regression results
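$TSS = RSS + ESS$, total sum of squares = regression sum of squares + error sum of squares
<br>$1 = \frac{\beta_1^2\sum{x^2}}{\sum{y^2}} + \frac{\sum{\epsilon^2}}{\sum{y^2}} = \text{explained share} + \text{unexplained share}$ (divide through by $TSS = \sum{y^2}$, with $x$ and $y$ measured as deviations from their means)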
<br>
* $R^2 = .85$ means 85% of variance in y is explained by x
<br>

*Standard Error*
<br>$SE = \sqrt{\frac{\sum{y^2}-\beta_1^2\sum{x^2}}{n-2}}$
* High $R^2$ implies LOW SE

Say you have your forecast<br>
$y^F = \hat{\beta_0} + \hat{\beta_1}X_0$
<br>$SE^F = SE \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum{x^2}}}$
<br> 
* if you want a low $SE^F$, you want a low SE so you want a high $R^2$
* you want a high value of n
* you want a low value for $X_0 - \bar{X}$
* you want a wide range of values of x (a large $\sum{x^2}$). Imagine a small range of x = [1,2]. This would make your model only applicable in that area. You want a large range x = [0, 1e6]
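
A minimal sketch of the standard error and forecast standard error formulas above, for a one-variable regression. The data and the helper names `simple_ols` and `forecast_se` are hypothetical; `sxx` and `syy` are the sums of squared deviations written as $\sum{x^2}$ and $\sum{y^2}$ in the notes.

```python
from math import sqrt


def simple_ols(x, y):
    """Fit y = b0 + b1*x and return the pieces needed for SE and SE^F."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # sum of x^2 in deviation form
    syy = sum((yi - y_bar) ** 2 for yi in y)  # sum of y^2 in deviation form
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    se = sqrt((syy - b1 ** 2 * sxx) / (n - 2))  # regression standard error
    return b0, b1, se, x_bar, sxx


def forecast_se(se, n, x0, x_bar, sxx):
    """Standard error of the forecast at X_0."""
    return se * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, se, x_bar, sxx = simple_ols(x, y)
near = forecast_se(se, len(x), x_bar, x_bar, sxx)  # forecast at the mean of x
far = forecast_se(se, len(x), 10.0, x_bar, sxx)    # forecast far from the mean
print(b0, b1, se, near, far)
```

Forecasting at the sample mean gives the smallest $SE^F$; the further $X_0$ moves from $\bar{X}$, the larger it becomes.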

* Adjusted $R^2$, goodness of fit
    * $\bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-k}$, where k is the # of estimated parameters
      * any time that you add an independent variable to a multiple regression model your $R^2$ will go up (or at least not fall) even if the variable is nonsensical; the adjusted $R^2$ penalizes the extra parameter.
* $\text{Coefficient of Variation} = \frac{\sigma_x}{\bar{x}}$
* $\text{Sharpe Ratio} = \frac{\bar{R_a} - r_f}{\sigma_a}$
* $\text{Treynor Ratio} = \frac{\bar{R_A} - \bar{r_f}}{\beta_A}$
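
As a quick illustration of the adjusted $R^2$ formula, the sketch below holds $R^2$ fixed and varies the number of estimated parameters. The function name and the input values are assumptions for the example.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k estimated parameters."""
    return 1 - (1 - r2) * (n - 1) / (n - k)


# Same R^2 and sample size, more parameters: the adjusted value falls.
few_params = adjusted_r2(0.85, n=50, k=3)
many_params = adjusted_r2(0.85, n=50, k=10)
print(few_params, many_params)
```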
      

### Event Studies
Ex-post or in-sample
* regress using most data (n-10) and forecast using the last 10
* Calculate abnormal returns, $R_{actual} - R_{estimated}$ 
* t-statistic = AR/SE (you could use the individual SE or the regression SE)
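
The abnormal-return calculation can be sketched as follows. The returns, the standard error, and the 1.96 cutoff (a two-tailed 5% normal critical value) are assumptions chosen for illustration, not values from the notes.

```python
def abnormal_returns(actual, estimated, se):
    """Return (AR, t) pairs, where AR = actual - estimated and t = AR / SE."""
    return [(a - e, (a - e) / se) for a, e in zip(actual, estimated)]


actual = [0.012, -0.004, 0.035]     # made-up realized returns
estimated = [0.010, 0.001, 0.008]   # made-up model forecasts
regression_se = 0.01                # assumed regression standard error

results = abnormal_returns(actual, estimated, regression_se)
for ar, t in results:
    label = "significant" if abs(t) > 1.96 else "not significant"
    print(f"AR = {ar:+.4f}, t = {t:+.2f} ({label})")
```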

Ex-ante or out-of-sample (regress using the entire n samples, forecast the future)


### Dummy Trap
For n = 12 categories you cannot include 12 dummy variables alongside the intercept (together they are perfectly collinear with it); use n-1 = 11.


### Measures of forecasting accuracy
* MAD or MAE (mean absolute deviation, mean absolute error)
  * $=\frac{\sum{|\hat{e}|}}{n}$
  * cannot be used by itself, can be used to compare models
* MAPE
  * $=\frac{\sum{|\frac{\hat{e}}{y}|}}{n}$
  * can be used on its own since it gives you a percentage
  * want < 10%
  * sensitive to y values close to 0, those blow up your estimate
* RMSE
  * $\sqrt{\frac{\sum{\hat{e}^2}}{n}}$
  * Most commonly used: squaring penalizes large errors more heavily, and the square root returns the result to the units of y
* MASE (mean absolute scaled error)
  * $< 1$: this is what you want 
  * $> 1$: a random walk (naive forecast) would do a better job than your model
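
A minimal sketch of the four measures. MAE, MAPE and RMSE follow the formulas above; the notes give no formula for MASE, so the version here assumes the common definition (forecast MAE divided by the in-sample MAE of a one-step random-walk forecast). All data are made up.

```python
from math import sqrt


def mae(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    # Blows up when an actual value is close to 0.
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mase(actual, forecast, history):
    """Assumed definition: forecast MAE scaled by the naive (random-walk) MAE."""
    naive_errors = [abs(history[t] - history[t - 1]) for t in range(1, len(history))]
    return mae(actual, forecast) / (sum(naive_errors) / len(naive_errors))


history = [80.0, 90.0, 100.0]     # made-up in-sample observations
actual = [100.0, 110.0, 120.0]    # made-up out-of-sample observations
forecast = [90.0, 110.0, 130.0]   # made-up forecasts

print(mae(actual, forecast), mape(actual, forecast),
      rmse(actual, forecast), mase(actual, forecast, history))
```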


### Lagging/Differencing
$diff(P) = P_t - P_{t-1}$ is a one-period difference, diff(P,2) = $P_t - P_{t-2}$ is a two-period difference (the difference taken at lag 2)
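
A small Python sketch of the same operation, mirroring the `diff(P)` and `diff(P,2)` notation above. The function and the price series are illustrative only.

```python
def diff(series, lag=1):
    """Return P_t - P_{t-lag} for every t where the lagged value exists."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


prices = [10, 12, 15, 19]   # made-up series
one_period = diff(prices)       # P_t - P_{t-1}
two_period = diff(prices, 2)    # P_t - P_{t-2}
print(one_period, two_period)
```

Each differencing step shortens the series by `lag` observations.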

### De-trending
Once you pick your model you will arrive at
* $Y_t = \hat{\beta_0} + \hat{\beta_1}t +\epsilon$
* The $\epsilon$ denotes the remaining factors, i.e. $S*C*A*R$ from above
  * (S)easonality, (C)yclicality due to industry cycle, (A)utocorrelation, (R)andomness 
* As you account for these factors by adding them to the regression you are 'detrending' the variables


### Example - CAPM
$R_a - R_f = \alpha + \beta(R_{sp} - R_f)$
<br> $\beta$ is the systematic (undiversifiable) risk, $R_a - R_f$ is the 'asset risk premium'




### Hypothesis Testing

**1-Tailed**
<br>$H_0: \mu \le 75K$ (equality is part of the null hypothesis)
<br>$H_a: \mu \gt 75K$

There are four rules for 1-tailed hypothesis testing:
* equality is part of the null hypothesis, NEVER of $H_a$
* always bound the null, so $H_0: \mu \le 4.3$% unemployment is correct instead of $H_0: \mu \ge 4.3$% unemployment
* the rejection area is defined by $H_a$ (this one is obvious)
* the full confidence level (95% or whatever you chose) lies on the other side of the rejection area
  * This changes your Z value. Whereas in 2-tailed it was 1.96 at the 95% confidence level, the 1-tailed Z value at the 95% level is 1.645
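
The 1-tailed and 2-tailed critical values can be checked with the standard normal distribution in Python's `statistics` module. The helper name `z_critical` is an assumption for this sketch.

```python
from statistics import NormalDist


def z_critical(confidence=0.95, two_tailed=True):
    """Normal critical value: alpha is split across both tails only if two_tailed."""
    alpha = 1 - confidence
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


z_two = z_critical(0.95, two_tailed=True)
z_one = z_critical(0.95, two_tailed=False)
print(round(z_two, 3), round(z_one, 3))
```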


**2-Tailed**
<br>Average salary is 75K, coefficient $\beta = 0$
<br>$H_0: \mu = 75K$
<br>$H_a: \mu \ne 75K$ 
$$t \text{ or } z = \frac{\bar{x}-c}{S_{\bar{x}}}, \quad S_{\bar{x}}=\frac{S_x}{\sqrt{n}}$$
**If you want to test equality of means and...**
If the population variances are **unequal**, i.e. $\sigma_1^2 \ne \sigma_2^2$.
<br>$H_0: \mu_1 = \mu_2$ 
<br>$H_a: \mu_1 \ne \mu_2$ 
<br>
$$t_d (or\space z_d) = \frac{\bar{X_1}-\bar{X_2}}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}$$
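* where $n_1 + n_2 - 2$ are your degrees of freedom and should be greater than 35 (strictly, with unequal variances the degrees of freedom come from the Welch-Satterthwaite approximation, which is at most $n_1 + n_2 - 2$)
* if $|t_d| \gt t_{critical}$ then you reject the null
* to decide whether the variances are unequal you may have to do an F-test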
<br>

If the population variances are **equal** then 
$$t_d (or\space z_d) = \frac{\bar{X_1}-\bar{X_2}}{\sqrt{S_p^2[\frac{1}{n_1}+\frac{1}{n_2}]}}$$
<br>

where $S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$ 
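
Both versions of the test statistic can be sketched with the standard library. The samples and function names are hypothetical; `variance` is the sample variance $S^2$ (divisor $n-1$).

```python
from math import sqrt
from statistics import mean, variance


def t_unequal_variances(x1, x2):
    """Test statistic when the two variances are treated as unequal."""
    n1, n2 = len(x1), len(x2)
    return (mean(x1) - mean(x2)) / sqrt(variance(x1) / n1 + variance(x2) / n2)


def t_equal_variances(x1, x2):
    """Test statistic using the pooled variance S_p^2."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))


group_a = [5.1, 4.9, 5.6, 5.8, 6.0]   # made-up samples
group_b = [4.1, 4.5, 4.0, 4.8, 4.6]
t_unequal = t_unequal_variances(group_a, group_b)
t_pooled = t_equal_variances(group_a, group_b)
print(t_unequal, t_pooled)
```

With equal sample sizes the two statistics coincide; they differ only when $n_1 \ne n_2$ (and the degrees of freedom used for the critical value may still differ).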

**In general, an F-statistic is a ratio of two quantities that are expected to be roughly equal under the null hypothesis**
![F-distribution](https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson23/plot_02.gif)

**F-test (to test if sample variances are the same)**
<br>$H_0: \sigma_1^2 = \sigma_2^2$, $H_a: \sigma_1^2 \ne \sigma_2^2$ 
<br>$F=\frac{S_1^2}{S_2^2}$

**F-test aka Joint Hypothesis Test (to test if sample means are the same)**
<br>$H_0: \mu_1 = \mu_2 = \mu_3$, $H_a:$ at least one $\mu$ is different

Say that $\bar{\bar{x}}$ is the grand mean of all observations in the groups $x_1, x_2, x_3$ (with equal group sizes this is the average of the three group means)<br>
* Sum Square Treatment (SST) = $n_1(\bar{x_1}-\bar{\bar{x}})^2 + n_2(\bar{x_2}-\bar{\bar{x}})^2 + n_3(\bar{x_3}-\bar{\bar{x}})^2$, where $n_j$ is the number of observations in group $j$
* Sum Square Error (SSE) = $\sum{(x_{1_i}-\bar{x_1})^2} + \sum{(x_{2_i}-\bar{x_2})^2} + \sum{(x_{3_i}-\bar{x_3})^2}$
* F = $\frac{MST}{MSE} = \frac{SST/(k-1)}{SSE/(n-k)}$, where k is the number of groups and n is the total number of observations
  * if F is large you reject the null (the group means are too far away from the mean of the means $\bar{\bar{x}}$), i.e. the means are not all the same
* to find out which mean is not the same as the others use the **TukeyHSD** test
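
A minimal sketch of the F statistic for equality of means. It assumes the usual one-way ANOVA convention in which each squared deviation in the treatment sum of squares is weighted by its group size; the groups are made up.

```python
def one_way_anova_f(groups):
    """F = MST / MSE for a list of groups (each a list of observations)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Treatment sum of squares: group means around the grand mean.
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error sum of squares: observations around their own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (sst / (k - 1)) / (sse / (n - k))


groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # made-up data
f_stat = one_way_anova_f(groups)
print(f_stat)
```

The statistic is then compared with an F critical value with $k-1$ and $n-k$ degrees of freedom.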

**F-statistic (in multiple regression to test if all of the coefficients are jointly = 0)**
<br>$H_0: \beta_1 = \beta_2 = \beta_3 = 0$ 
* The rule of thumb is that if F>4, then the coefficients are NOT jointly = 0, and so the regression seems significant and valid. 
* The F-statistic DOES NOT comment on the intercept, so if you fail to reject null, then all that is left is the intercept


**A quick word on the number of samples**
* Use the Normal distribution - for large sample size ($n > 35$)
* Use the t distribution (Student's t) - for small sample size ($n \le 15$)
* for n between 15 and 35: if the variance is known use the normal, if the variance is unknown use the t distribution
   


### Linear vs. Non-linear models


As an example consider a demand model and note that an elasticity $|\epsilon| < 1$ is inelastic (necessary good) and $|\epsilon| > 1$ is elastic (luxury, unnecessary good)

* Linear $Q=\beta_0 + \beta_1P + \beta_2Y + \beta_3A + \epsilon$
  * Price elasticity = $ \epsilon_p = \frac{\text{percentage change in quantity demanded}}{\text{percentage change in price}} = \frac{\%\Delta_Q}{\%\Delta_P} = \frac{\Delta{Q}/Q}{\Delta{P}/P} = \frac{\Delta{Q}}{Q}\frac{P}{\Delta{P}} = \frac{\Delta{Q}}{\Delta{P}}\frac{P}{Q}$
  * $\epsilon_p = \hat{\beta_1}\frac{\bar{P}}{\bar{Q}}$ (price elasticity)
    * $\frac{\partial Q}{\partial P}\frac{P}{Q} = \frac{\partial}{\partial P}[\beta_0 + \beta_1P + \beta_2Y + \beta_3A + \epsilon]\frac{P}{Q} = \hat{\beta_1}\frac{\bar{P}}{\bar{Q}}$ (evaluated at the sample means)
  * $\epsilon_y = \hat{\beta_2}\frac{\bar{Y}}{\bar{Q}}$ (income elasticity)
  * $\epsilon_a = \hat{\beta_3}\frac{\bar{A}}{\bar{Q}}$ (advertising elasticity)
* Non-Linear $Q=\beta_0P^{\beta_1}Y^{\beta_2}A^{\beta_3}\epsilon$ (Cobb-Douglas demand model), the betas are the elasticities
  * $\epsilon_p = \hat{\beta_1}$ (price elasticity)
    * $\frac{\partial Q}{\partial P}\frac{P}{Q} = \beta_0Y^{\beta_2}A^{\beta_3}\epsilon \frac{\partial}{\partial P}[P^{\beta_1}] \frac{P}{Q}$
    * $= \frac{\beta_0Y^{\beta_2}A^{\beta_3}\epsilon}{\beta_0P^{\beta_1}Y^{\beta_2}A^{\beta_3}\epsilon} \beta_1P^{\beta_1 - 1}P$ 
    * $= \frac{\beta_1P^{\beta_1}}{P^{\beta_1}} = \beta_1$ 
  * $\epsilon_y = \hat{\beta_2}$ (income elasticity)
  * $\epsilon_a = \hat{\beta_3}$ (advertising elasticity)
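
The result that the exponent is the elasticity can be checked numerically. The sketch below uses made-up coefficients and a central finite difference for $\frac{\partial Q}{\partial P}$; the function names are hypothetical and the error term is left out.

```python
def demand(p, y=50.0, a=10.0, b0=2.0, b1=-1.5, b2=0.8, b3=0.3):
    """Cobb-Douglas demand Q = b0 * P^b1 * Y^b2 * A^b3 (made-up coefficients)."""
    return b0 * p ** b1 * y ** b2 * a ** b3


def price_elasticity(f, p, h=1e-6):
    """(dQ/dP) * (P/Q), with the derivative taken by central difference."""
    dq_dp = (f(p + h) - f(p - h)) / (2 * h)
    return dq_dp * p / f(p)


low_price = price_elasticity(demand, 4.0)
high_price = price_elasticity(demand, 9.0)
print(low_price, high_price)  # both approximately b1, whatever the price level
```

In the linear model the same calculation would give a different elasticity at every price, which is why it is evaluated at $\bar{P}$ and $\bar{Q}$.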

Cannot linearize (via common math): $Y = \beta_0 + \beta_1X^{\beta_2}+\epsilon$
* Use a Taylor Series Expansion to arrive at an 
  * Example used is the Keynesian consumption model $C=\beta_0+\beta_1Y+\epsilon$ where $\beta_1$ is the MPC (marginal propensity to consume) = $\frac{dC}{dY}$
    * There is causality from Y to C and from C to Y
    * wealthy countries or people have low MPC. Poor have high MPC



## Gauss-Markov Theorem (CLRM)
If a property (biased, inefficient, inconsistent) is not mentioned under a violation, that is because the violation does not cause it. 
*  $E(\epsilon) = 0$
  * if violated: $\hat{\beta}$ are biased and inefficient, standard error is not the right one
  * causes: no intercept $\beta_0$ defined
  * solutions: add intercept
  
  
*  $E(\epsilon x)=0$ (covariance, correlation should be 0) OLS by nature makes this = 0
  * macroeconomic model that has 2-way causality
  * case 1: aggregate data usually has this two way causality
    * $C=\beta_0 + \beta_1Y + \epsilon$, i.e. *Y* causes *C* and the other way around
  * case 2: unobservable variables $i = r + \pi^e$, interest rate = real interest rate + expectation[inflation]
  * if violated: $\hat{\beta}$ are biased, inefficient, and inconsistent
  * correction: via IV(instrumental variable) estimate


*  $E(\epsilon_t \epsilon_{t-1}) = 0$ (errors are time independent, no **serial/auto correlation**)
  * happens only with time series data
  * if violated: the $\hat{\beta}$ coefficients will be $\color{red}{unbiased}$ and inefficient.
  * tests: Durbin Watson (1st order), Breusch-Godfrey (2nd and 3rd order)
  * correction: use Cochrane-Orcutt Methodology 
    * graph the residuals of the regression to see if there is a pattern
    * Steps of the **Cochrane-Orcutt Methodology**
      * You have regression $Y_t=\beta_0 + \beta_1X_t + \epsilon_t$
      * $\epsilon_t =\rho\epsilon_{t-1} + e_t$
        * $\rho$ is the correlation between $\epsilon_t$ and $\epsilon_{t-1}$. 
      * first take the lag: $Y_{t-1}= \beta_0 + \beta_1  X_{t-1} + \epsilon_{t-1}$
      * second multiply by $\rho$: $\rho Y_{t-1}=\rho \beta_0 + \beta_1 \rho X_{t-1} + \rho \epsilon_{t-1}$
      * third subtract the lagged equation from the regression, 
    $$Y_t-\rho Y_{t-1}=\beta_0 (1-\rho) + \beta_1(X_t-\rho X_{t-1}) + \epsilon_t - \rho \epsilon_{t-1}$$ where the last term, $\epsilon_t - \rho \epsilon_{t-1}$, equals $e_t$
    $$ \color{red}{Y_t^* = b_0 + \beta_1 X_t^* + e_t}$$ with $Y_t^* = Y_t-\rho Y_{t-1}$, $X_t^* = X_t-\rho X_{t-1}$ and $b_0 = \beta_0 (1-\rho)$
      * The trick here is that $\rho$ is chosen s.t. the sum of squared errors is minimized, i.e. $\rho$ is argmin $\sum{e_t^2}$
  * **Statistical Tests for Serial Correlation**
    * **DW, Durbin-Watson** test (for first order only)
      * DW $\approx 2(1-\rho)$
      * $\rho$ ranges from -1 to +1, so DW ranges from 0 to 4
        * if DW < 1.4, positive serial correlation is considered present
        * if DW is between 1.4 and 1.6, that is a grey area and a correction is recommended
        * if DW is between 1.6 and 2.4 (note that for rho = 0, DW = 2) then you have no serial correlation. 
        * if DW is between 2.4 and 2.6, again a grey area
        * if DW is between 2.6 and 4, you have negative serial correlation
![durbin watson test](lecture09_01.png)
    * **Breusch-Godfrey** (for second and third order)


*  $E(\epsilon^2) = \sigma^2$ (variance of the error is constant, **homoskedastic** regression)
  * happens only with cross-sectional data
  * if violated: beta coefficients will be unbiased and inefficient (t-values will be misleading)
  * causes: variance is not constant  
  * **Statistical Tests for Heteroskedasticity**
    * For heteroskedasticity tests the null is $H_0: \sigma_1^2 = ... = \sigma_n^2$ (homoskedastic)
      * so if you reject the null, then you have heteroskedasticity and hence a problem
    * **Breusch-Pagan** (bptest) - use if you don't know which X is causing the issue
      * Not to be confused with the bgtest (for serial correlation)
    * **White test**
    * **Goldfeld-Quandt** - when you know which X is causing the issue
      * compute $\hat{\epsilon}^2$ 
      * sort from smallest to largest based on the variable that is causing heteroskedasticity
        * here you regress the squared error on the ind. vars. $$\hat{\epsilon}_{X_1}^2 = \alpha_0 + \alpha_1 X_1$$ $$\hat{\epsilon}_{X_2}^2 = \alpha_2 + \alpha_3 X_2$$
      * F-test on the top and bottom 40% of the $\hat{\epsilon}^2$ (where $\hat{\epsilon}$ comes from the original regression)
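      ![Goldfeld-Quandt test](lecture12.png)
      * Correction is to divide every variable by the variable causing heteroskedasticity. Say that $X_2$ is the cause in, e.g., $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$; then $$\frac{Y}{X_2} = \beta_0 \frac{1}{X_2} + \beta_1 \frac{X_1}{X_2} + \beta_2 + \frac{\epsilon}{X_2}$$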
  * correction:  

*  No severe multi-collinearity
  * if violated: **does not result in biased or inefficient** estimates, but the **WHOLE** regression is off
  * causes: a number of ind. variables are highly correlated with each other
  * common signs: 
    * Beta coefficients having wrong signs from what is expected, i.e. inversely related coefficients having same sign.
    * Inconsistent regression statistics. Ex: significant coefficients, high $R^2$ > .7 and a low F. If you have significant coefficients and a high $R^2$ you would expect an F > 4. In fact, if all are significant F > 50. 
  * solutions: drop least important variable
  * **Statistical Tests for Multicollinearity**
    * Corr Test among all independent variables
      * if $|r| > .9$ that is usually a sign of multicollinearity 
    * VIF (Variance inflation factor)
      * regression is $X_1=\beta_0 + \beta_1 X_2 + \beta_2 X_3 + \epsilon$
      * $VIF = \frac{1}{(1-R^2)}$, with $R^2$ taken from that regression; if $VIF > 5$ you have multicollinearity
        * keep the most important variable from economic theory, drop the other one
        * Sometimes a solution is to combine both correlated variables as a ratio in the regression



*  Stationary variables - mean, variance, and covariance are constant over time (all three conditions must be met)
  * if violated: 
  * causes: 
  * solutions: 
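
The serial-correlation diagnostics and the Cochrane-Orcutt transformation described earlier in this list can be sketched as follows. The Durbin-Watson formula used here, $\sum{(e_t - e_{t-1})^2} / \sum{e_t^2}$, is the standard definition behind the $2(1-\rho)$ approximation; the residuals, the series and the function names are made up, and $\rho$ is estimated once from the residuals rather than iterated.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); close to 2 when rho is near 0."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def estimate_rho(residuals):
    """Slope of a no-intercept regression of e_t on e_{t-1}."""
    num = sum(residuals[t] * residuals[t - 1] for t in range(1, len(residuals)))
    den = sum(residuals[t - 1] ** 2 for t in range(1, len(residuals)))
    return num / den


def quasi_difference(series, rho):
    """Cochrane-Orcutt transform: Z*_t = Z_t - rho * Z_{t-1}."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


residuals = [0.5, 0.6, 0.4, 0.3, -0.2, -0.4, -0.5, -0.3, 0.1, 0.4]  # made-up
dw = durbin_watson(residuals)
rho = estimate_rho(residuals)
y_star = quasi_difference([1.0, 2.0, 3.0], 0.5)  # apply the same to Y and X
print(dw, rho, y_star)
```

After transforming both $Y$ and $X$, the regression is re-run on the starred variables and the residuals are checked again.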
  
### Further notes on the GM Theorem and Model Specifications

* **Influential observations** - Major outliers
  * **Cook's D** test tells you which are the influential observations
* **Chow test** (type of F test) - test of restriction of coefficients of the regression 
  * Uses the F-distribution
  * take a linear regression 
  * $\beta_1 + \beta_2 = 1$ - constant returns to scale industry. If you increase all inputs (labor and capital) by 2x, production increases 2x
  * $\beta_1 + \beta_2 \gt 1$ - increasing returns to scale industry. If you increase all inputs by 2x, production increases by more than 2x
  * $\beta_1 + \beta_2 \lt 1$ - decreasing returns to scale industry. If you increase all inputs by 2x, production increases by less than 2x
  * Get the $R^2$ of the 1st regression (called an unrestricted $R^2$)
  * Then incorporate restriction in the 2nd regression
    * $Y = \beta_0 + \beta_1X_1 + (1-\beta_1)X_2 +\beta_3X_3 + \epsilon$
      * move the $X_2$ to the LHS, name $Y-X_2$ nY and $X_1-X_2$ nX
    * $Y - X_2 = \beta_0 + \beta_1(X_1 - X_2) +\beta_3X_3 + \epsilon$
    * $nY = \beta_0 + \beta_1(nX) +\beta_3X_3 + \epsilon$
    * Take the $R^2$ of this regression; this is the restricted $R^2$
  * Take F test, $F = \frac{R_u^2 - R_r^2}{1-R_u^2}\frac{n-k}{m}$, 
    * where n is the number of samples
    * k are the number of slopes (not including the intercept)
    * m are the number of restrictions (number of = signs)
    * $H_0$: Restriction holds, i.e. $\beta_1 + \beta_2 =1$
      * R: linearHypothesis(reg, "beta1+beta2=1") from the car package, where reg is the original (1st) regression and beta1, beta2 stand for the coefficient names in reg
      * R: linearHypothesis(reg, c("beta1+beta2=1","beta3=2"))
* **Wald test**
  * Like a Chow test but uses the $\chi^2$ distribution
* **Ramsey RESET Test**
  * Test of specification of the model. Whether you have correctly defined the model (mathematical properties)
  * Suppose you have $Y = \beta_0+\beta_1X_1+\beta_2X_2+\epsilon$ 
  * But the relation is not linear, it is $Y=\beta_0 X_1^{\beta_1} X_2^{\beta_2} \epsilon$
  * Ramsey test $H_0$: specification is correct, $H_a$: is not correct
  * If the model is not correct, you will have to take the log of the non-linear model
    * If none of your models are correct you have an 'omitted' variable. 
* If you have an omitted variable - your coefficients are going to be biased, inefficient and inconsistent. 
* **Shapiro** test of normality 
  * Null is that the variable is normally distributed