# correlation and regression 
---
__ correlation__
---
``Whether returns to different stock market indexes are related and, if so, in what way. ``

  1. __``(correlation coefficient)``__ measures the direction and extent of linear association between two variables. 
  2. __``(scatter plot)``__  with a correlation of +1 (-1), all points lie on a straight line with a positive (negative) slope; the correlation stays at +1 (-1) whatever the steepness of that line, as long as the slope remains positive (negative). 
  3. __``(0)``__  no __``linear``__ relation. 
  
* __``Disadvantages``__: 
  * correlation measures the __``linear``__ association between two variables, but it may not always be reliable. Two variables can have a strong __``nonlinear relation``__ and still have a very low correlation. For example, if $X\sim N(0,1)$ and $Y=X^2$, then $Y$ is fully determined by $X$, yet $Cov(X,Y)=E[X^3]=0$.
  
  * correlation also may be an unreliable measure when outliers are present in one or both of the series. __``How to deal with outliers?``__
  
    * (step 1) determine whether a computed sample correlation changes greatly by removing a few outliers.
    * (step 2) use __``judgement``__ to determine whether those outliers contain information about the two variables' relationship (and should thus be included in the correlation analysis) or contain no information (and should thus be excluded). 
  
 * correlation does not imply __``causation``__. Even if two variables are highly correlated, one does not necessarily cause the other in the sense that certain values of one variable bring about the occurrence of certain values of the other. 
 
 * correlation can be spurious in the sense of misleadingly pointing towards associations between variables: 
    $$ height - age$$
    $$ age - vocabulary$$
    $$\to height \quad ?\quad  vocabulary$$
    
   investment professionals must be cautious in basing investment strategies on high correlations. Spurious correlation may suggest investment strategies that appear profitable but actually would not be so, if implemented. 
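A minimal simulation sketch of the $Y=X^2$ example from the list above, using only the Python standard library. The seed, the sample size, and the helper name `pearson_r` are illustrative choices, not part of the original notes.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # arbitrary seed, only for reproducibility
x = [random.gauss(0.0, 1.0) for _ in range(10_000)]
y = [v * v for v in x]  # Y is an exact (nonlinear) function of X

r_nonlinear = pearson_r(x, y)                    # close to 0
r_linear = pearson_r(x, [2 * v + 1 for v in x])  # exact linear relation
```

Here `r_nonlinear` comes out near 0 even though $Y$ is completely determined by $X$, while the linear transform gives a correlation of 1 up to floating point error.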
    
* __`` Significance of the Correlation Coefficient``__: assess whether apparent relationships between random variables are the result of chance. If we decide that the relationships do not result from chance, we will be inclined to use this information in predictions because a good prediction of one variable will help us predict the other variable. 
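   * if the underlying variables are normally distributed, test the null hypothesis that the population correlation is 0 with
     $$ t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
     where $r$ is the sample correlation and $n$ is the number of observations; under the null hypothesis, $t$ follows a t-distribution with $n-2$ degrees of freedom.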
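The test statistic above can be sketched as a small function. The function name and the two sample sizes are hypothetical; the point is that the same sample correlation can be insignificant in a small sample and significant in a large one.

```python
import math

def correlation_t_stat(r, n):
    """t-statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if abs(r) >= 1:
        raise ValueError("statistic is undefined for |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

t_small = correlation_t_stat(0.3, 12)   # about 0.99
t_large = correlation_t_stat(0.3, 102)  # about 3.14
```

With a two-tailed 5% test, the critical values from a standard t table are 2.228 for 10 degrees of freedom and 1.984 for 100 degrees of freedom, so $r=0.3$ is not significant with 12 observations but is significant with 102.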
  





---
__ Regression__
---
``Test the hypothesis that X helps to explain Y (single independent variable).``
 

* when the __``linear relationship``__ between the two variables is significant, linear regression provides a simple model for forecasting the value of one variable, known as the dependent variable, given the value of the second variable, known as the independent variable. 
   * regression analysis uses two principal types of data: 
      * cross-sectional 
      * time series. 

   Cross-sectional data involve many observations on $X$ and $Y$ for the same time period. Those observations could come from different companies, asset classes, investment funds, people, countries, or other entities, depending on the regression model.
   
   Time-series data use many observations from different time periods for the same company, asset class, investment fund, person, country, or other entity, depending on the regression model. 
   
   
   * __``X explains Y``__: One estimate of $Y$ is $\bar{Y}$, the average of $Y$. If a regression of $Y$ on $X$ tends to give more accurate estimates of $Y$ than $\bar{Y}$, we say that the independent variable helps ``explain`` $Y$ because using that independent variable improves our estimates. 
   
   
   * Linear regression, also known as __``linear least squares``__, computes a line that best fits the observations.
     * slope: $\hat{b}_1=\frac{Cov(X, Y)}{Var(X)}$
     * intercept: $(\bar{X}, \bar{Y})$ lies on the line $\to \hat{b}_0=\bar{Y}-\hat{b}_1\bar{X}$.
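The two least-squares relations above (slope from covariance over variance, intercept from the point of means) can be sketched in plain Python. The data values and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single independent variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x          # slope
    b0 = mean_y - b1 * mean_x    # line passes through (mean_x, mean_y)
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(x, y)  # approximately 0.05 and 1.99
```

Evaluating the fitted line at the mean of `x` (3.0) returns the mean of `y` (6.02), which is the second relation in the list.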
     
     
   * __``Assumptions``__: 
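      1. linear in parameters $(b_0, b_1)$
         * $Y=b_0+b_1X^2+\epsilon$ is allowed: it is nonlinear in $X$ but linear in the parameters
         * $Y=b_0 e^{b_1X}+\epsilon$ is not allowed: it is nonlinear in $b_1$
      2. $X$ is not random
      3. $E[\epsilon_t]=0$
      4. $Var(\epsilon_t)=\sigma^2$
      5. $E[\epsilon_t\epsilon_s]=0, t\ne s$
      6. $\epsilon_t \sim N(0,\sigma^2)$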
      
   * __``How well does a given linear regression model capture the relationship between X and Y?``__
      * A strong relation $\to$ we can be reasonably certain that the regression model can be used to forecast $Y$ using $X$.
      * A weak relation $\to$ an inaccurate forecast. 
      
      * __``The standard error of estimate (SEE)``__  measures the uncertainty:
      $$
      SEE=\left(\frac{\sum_{i=1}^{n}\hat{\epsilon}_i^2}{n-2}\right)^{1/2}
      $$
      
      $n-2$: the degrees of freedom, i.e. $n$ observations minus the 2 estimated parameters $(\hat{b}_0, \hat{b}_1)$.
      
      * __``The Coefficient of Determination (R^2)``__ measures the fraction of the total variation in the dependent variable that is explained by the independent variable. 
      
         *  __``(one independent variable)``__ $$R^2=r^2$$
         * __``(general cases)``__ If we did not know the regression relationship, the best guess for the value of any particular observation of the dependent variable would simply be $\bar{Y}$, and the sample variance of $Y$ measures the error of that guess:
         $$
         \sum_{i=1}^n\frac{(Y_i-\bar{Y})^2}{n-1}
         $$
         An alternative to using $\bar{Y}$ to predict $Y$ is using the regression relationship to make that prediction:
         $$
         \hat{Y}_i=\hat{b}_0+\hat{b}_1 X_i.
         $$
         If the regression relationship works well, the error in predicting $Y$ using $X$ __``(unexplained variation)``__:
         $$
         \sum_{i=1}^n(Y_i-\hat{Y}_i)^2
         $$
         should be much smaller than the error in predicting $Y$ using $\bar{Y}$ __``(total variation)``__:
         $$
         \sum_{i=1}^n(Y_i-\bar{Y})^2.
         $$
         $$\to R^2=1-\frac{Unexplained}{Total}$$

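      * __``Hypothesis Testing``__ 
$$ t=\frac{\hat{b}_1-b_1}{s_{\hat{b}_1}}$$
with $n-2$ degrees of freedom, where $b_1$ is the hypothesized value of the slope and $s_{\hat{b}_1}$ is the standard error of $\hat{b}_1$.

      * __``Analysis of Variance``__  
$$
F=\frac{RSS/p}{SSE/(n-p-1)}, \qquad F=t^2 \quad(p=1)
$$
where $RSS$ is the regression (explained) sum of squares, $SSE$ is the sum of squared errors (unexplained), and $p$ is the number of slope coefficients.

      * __``Estimate Error``__  
$$
s_f^2=s^2\left(1+\frac{1}{n}+\frac{(X-\bar{X})^2}{(n-1)s_x^2}\right)
$$
where $s_f^2$ is the estimated variance of the prediction error, $s$ is the SEE, $s_x^2$ is the sample variance of the independent variable, and $X$ is the value used for the forecast.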
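The fit and inference quantities of this section (SEE, $R^2$, the slope t-statistic, the F-statistic, and the variance of the prediction error) can be computed together for a single independent variable. The sketch below uses made-up data and hypothetical function names; with one slope coefficient the F-statistic has $n-2$ denominator degrees of freedom and equals the squared t-statistic for $H_0: b_1=0$.

```python
import math

def simple_regression(xs, ys):
    """OLS of ys on xs plus the fit and inference statistics from the notes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)  # equals (n - 1) * s_x^2
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - mean_y) ** 2 for y in ys)                   # total
    see = math.sqrt(sse / (n - 2))
    se_b1 = see / math.sqrt(sxx)
    return {
        "b0": b0, "b1": b1, "see": see,
        "r2": 1 - sse / sst,
        "t_b1": b1 / se_b1,                  # tests H0: b1 = 0
        "f": (sst - sse) / (sse / (n - 2)),  # one slope coefficient
        "mean_x": mean_x, "sxx": sxx, "n": n,
    }

def forecast_variance(x_new, fit):
    """Estimated variance of the prediction error at x_new."""
    return fit["see"] ** 2 * (
        1 + 1 / fit["n"] + (x_new - fit["mean_x"]) ** 2 / fit["sxx"]
    )

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_regression(x, y)
var_inside = forecast_variance(3.0, fit)    # at the sample mean of x
var_outside = forecast_variance(10.0, fit)  # far outside the observed range
```

The forecast variance is smallest at the sample mean of $X$ and grows with the squared distance from it, which is why `var_outside` exceeds `var_inside`.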

   * __``Limitations``__
       * Regression relations can change over time 
       * A limitation specific to investment contexts is that public knowledge of regression relationships may negate their future usefulness. 
       
       
       
---
`` Multiple ``
* Compare a regression on $X_1$ alone with a regression on $X_1$ and $X_2$:
$$Y=0.50+0.75X_1$$
$$Y=?+??X_1+???X_2$$
    * we would generally find that $?? \ne 0.75$ unless the second independent variable were uncorrelated with $X_1$

    * if $??=0.6$, can we say that for every 1-unit increase in $X_1$, we expect $Y$ to increase by 0.6 units? Only if $X_2$ is held constant. When $X_2$ is not held constant, we still expect $Y$ to increase by 0.75 units. 
    * if $X_1=\alpha+\beta X_2+Z$, then regressing $Y$ on $Z$ gives $Y=?+0.6 Z$. 
    * 0.6 would represent the expected effect on $Y$ of a 1-unit increase in $X_1$ after removing the part of $X_1$ that is correlated with $X_2$. 
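A small numerical sketch of the last two points, with made-up data in which $Y$ is an exact linear function of two correlated variables. All names and values are illustrative (the simple-regression slope here is about 0.85 rather than 0.75). The multiple-regression slopes are obtained from the 2x2 normal equations on demeaned data.

```python
def mean(values):
    return sum(values) / len(values)

def simple_slope(xs, ys):
    """Slope of an OLS regression of ys on xs with an intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def two_regressor_slopes(x1, x2, y):
    """Slopes of y on (x1, x2) with an intercept, via the normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [a - m1 for a in x1]
    d2 = [b - m2 for b in x2]
    dy = [c - my for c in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12  # zero only under exact collinearity
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

x1 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # positively correlated with x1
y = [0.5 + 0.6 * a + 0.3 * b for a, b in zip(x1, x2)]

slope_alone = simple_slope(x1, y)                     # about 0.85
b1_multi, b2_multi = two_regressor_slopes(x1, x2, y)  # 0.6 and 0.3

# Z is the part of x1 that is not explained by x2.
beta = simple_slope(x2, x1)
alpha = mean(x1) - beta * mean(x2)
z = [a - (alpha + beta * b) for a, b in zip(x1, x2)]
slope_on_z = simple_slope(z, y)                       # 0.6 again
```

The slope of $Y$ on $Z$ reproduces the multiple-regression coefficient on $X_1$, while the slope of $Y$ on $X_1$ alone also picks up the effect of the correlated $X_2$.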
   
* __``Assumptions``__
    * ... +
    * No exact linear relation exists between two or more of the independent variables. 
    
* __``Notice!``__
   * We should be confident that the assumptions of the regression model are met
   * We should be cautious about predictions based on values of the independent variables that are outside the range of the data on which the model was estimated; such predictions are often unreliable. 
   * When predicting the dependent variable using a linear regression model, we encounter two types of uncertainty: 
      * uncertainty in the regression model itself
      * uncertainty about the estimates of the regression model's parameters. For multiple regression, computing a prediction interval to properly incorporate both types of uncertainty requires matrix algebra. 
   * __``t_test``__: test the significance of coefficients individually
   * __``F_test``__: test the significance of the regression as a whole. 
       * In a multiple regression, we cannot test the null hypothesis that all slope coefficients equal 0 based on t_tests that each individual slope coefficient equals 0, because the individual tests do not account for the effects of interactions among the independent variables. 
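   * __``Adjusted R^2``__: If we add regression variables to the model, the amount of unexplained variation (SSE) will decrease, and the explained variation (RSS, the regression sum of squares) will increase, if the new independent variable explains any of the unexplained variation in the model. Such a reduction occurs when the new independent variable is even __``slightly``__ correlated with the dependent variable and is not a linear combination of other independent variables in the regression. $R^2$ therefore cannot fall when a variable is added; adjusted $R^2$ corrects for the number of independent variables: 
     $$
     \bar{R}^2=1-\left(\frac{n-1}{n-k-1}\right)(1-R^2)
     $$
     where $n$ is the number of observations and $k$ is the number of independent variables.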
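The adjustment formula is simple enough to check by hand. In this sketch the $R^2$ values, sample size, and variable counts are hypothetical numbers chosen to show that $R^2$ can rise while adjusted $R^2$ falls.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

base = adjusted_r2(0.60, n=30, k=2)      # about 0.570
extended = adjusted_r2(0.61, n=30, k=5)  # about 0.529
```

Adding three variables raised $R^2$ from 0.60 to 0.61, but the adjusted value dropped, which suggests the extra variables do not earn their keep in this made-up case.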

* __``Violations``__
    * __``Heteroskedasticity``__
        * __``Consequences``__: Although heteroskedasticity does not affect the __``consistency``__ of the regression parameter estimator, it can lead to mistakes in inference [introduce bias into estimators of the standard error of regression coefficients] 
            * F_test is unreliable
            * t_tests are unreliable
            
         When we ignore heteroskedasticity, we tend to find significant relationships where none actually exist. 
         
         * __``Unconditional``__ the error variance is not correlated with $X$ $\to$ creates no major problems for statistical inference 
         
         * __``Conditional``__ $\to$ causes the most problems for statistical inference
         
        * __``Testing``__ 
             * ``Breusch and Pagan``: under the null hypothesis of no conditional heteroskedasticity, $nR^2$ (with $R^2$ taken from the regression of the squared residuals on the independent variables of the original regression) will be a $\chi^2$ random variable with the number of degrees of freedom equal to the number of independent variables in the regression. 
             
        * __``Correcting``__
           * __``robust standard errors``__: corrects the standard errors of the linear regression model's estimated coefficients to account for the conditional heteroskedasticity
           * __``generalized least squares``__: modifies the original equation in an attempt to eliminate the heteroskedasticity. 
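A minimal sketch of the Breusch-Pagan $nR^2$ statistic for a regression with one independent variable. As a simplifying assumption, the residual series below are hand-made stand-ins rather than residuals from a fitted regression, and the function names are hypothetical. The value 3.841 is the 5% critical value of a $\chi^2$ distribution with 1 degree of freedom.

```python
def r_squared(xs, ys):
    """R^2 of an OLS regression of ys on xs with an intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # nothing to explain
    return sxy * sxy / (sxx * syy)

def breusch_pagan_stat(xs, residuals):
    """n * R^2 from regressing squared residuals on the independent variable."""
    squared = [e * e for e in residuals]
    return len(xs) * r_squared(xs, squared)

CHI2_CRIT_5PCT_1DF = 3.841

x = [float(i) for i in range(1, 21)]
# Residual size grows with x: conditional heteroskedasticity.
hetero = [(-1) ** i * 0.1 * xi for i, xi in enumerate(x)]
# Residual size is constant: no heteroskedasticity.
homo = [(-1) ** i * 1.0 for i in range(20)]

bp_hetero = breusch_pagan_stat(x, hetero)  # about 18.9, rejects the null
bp_homo = breusch_pagan_stat(x, homo)      # 0.0, does not reject
```

With more independent variables the auxiliary regression becomes a multiple regression and the degrees of freedom increase accordingly.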
                     
    * __``Serial Correlation``__
       * __``Consequences``__: As long as none of $X_i$ is a lagged value of $Y$, then the estimated parameters themselves will be __``consistent``__ and need not be adjusted for the effects of serial correlation; otherwise, serial correlation will cause all the parameter estimates to be __``inconsistent``__ and they will not be valid estimates of the true parameters. 
       
            * __``Positive serial correlation``__: 
                 underestimate the population error variance, so the coefficient standard errors are too small and the t_statistics are inflated. 
                 
       * __``Testing``__
          * __``Durbin and Watson``__: 
             $$
             DW=\frac{\sum_{t=2}^{T}\left(\hat{\epsilon}_t-\hat{\epsilon}_{t-1}\right)^2}{\sum_{t=1}^T\hat{\epsilon}_t^2}
             $$
             
        * __``Correcting``__
           * Adjust the coefficient standard errors 
               * ``Newey-West``
           * Modify the regression equation
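The Durbin-Watson statistic above is a direct ratio of sums and can be sketched in a few lines. The residual series are toy inputs chosen so the result can be verified by hand; the function name is hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

dw_persistent = durbin_watson([1.0, 1.0, 1.0, 1.0])     # 0.0
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # 3.0
```

For large samples the statistic is approximately $2(1-r)$, where $r$ is the first-order autocorrelation of the residuals: values near 2 indicate no serial correlation, values well below 2 indicate positive serial correlation, and values well above 2 indicate negative serial correlation.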
           
 * __``Multicollinearity``__ With multicollinearity we can estimate the regression, but the interpretation of the regression output becomes problematic.  
     * __``Consequence``__: Although the presence of multicollinearity does not affect the  __``consistency``__ of the OLS estimates of the regression coefficients, the estimates become extremely imprecise and unreliable. It becomes impossible to distinguish the individual impacts of the independent variables on the dependent variable. The standard errors of the coefficients are inflated, so the t_statistics shrink and the t_tests have little power to reject the null hypothesis.
    
    * __``Testing``__ 
       * A high R^2 + low t_statistics       
          
    * __`` Correcting``__
       * excluding one or more of the regression variables. 
       * often no solution based in theory
           
           
           
     
        

--- 
Model Specification & Errors in Specification
---

* __`` Misspecified Functional Form``__
   * One or more important variables could be omitted from regression 
   * One or more of the regression variables may need to be transformed before estimating the regression 
   * The regression model pools data from different samples that should not be pooled. 
   
   
* __``Time-Series Misspecification (Independent Variables Correlated with Errors)``__: violate regression assumption 3 that the error term has mean 0, conditioned on the independent variables. 
   * If this assumption is violated, the estimated regression coefficients will be biased and inconsistent. 
   * __``Cases``__
      * including lagged dependent variables as independent variables in regressions with serially correlated errors;
      * including a function of a dependent variable as an independent variable, sometimes as a result of the incorrect dating of variables; and 
      * independent variables that are measured with error










---
Time-Series Analysis
---

Apply linear regression to a given time series. 

#### AR(1)
$$ x_t=b_0+b_1x_{t-1}+\epsilon_t$$

__``Issues``__: the assumptions of the linear regression model are not satisfied. 

*  The residual errors are correlated instead of being uncorrelated $\to$ causes estimates of $(b_0, b_1)$ to be __``inconsistent``__

* The mean and/or variance of the time series changes over time $\to$ regression results are __``invalid``__


 
* __``Model Calibration Issue``__: $x_{t-1}$ is a random variable! If we use OLS to estimate the model, our statistical inference may be __``invalid``__. To conduct valid statistical inference, we must make a key assumption in time-series analysis: __``We must assume that the time series we are modeling is covariance stationary``__
   * $E[x_t]=\mu, \quad |\mu|<\infty$
   * $Cov(x_t, x_{t-s})=\lambda_s, \quad |\lambda_s|<\infty$;

  __``and that the errors are uncorrelated``__
  
__``Detecting Serially Correlated Errors in AR Model``__ The DW statistic is invalid in this setting, because the independent variable is a lagged value of the dependent variable.
* __``t_test``__ involving a residual autocorrelation and the standard error of the residual autocorrelation.
* If a time series comes from an AR(1) model, then to be covariance stationary the absolute value of the lag coefficient must be less than 1.0.
   * __``Dickey-Fuller test``__
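A sketch of the residual-autocorrelation check described above: fit the AR(1) model by OLS, then compare each residual autocorrelation with its standard error, taken here as $1/\sqrt{T}$ for $T$ residuals. The simulated series, the seed, the true parameters, and the function names are all assumptions made for illustration.

```python
import math
import random

def fit_ar1(series):
    """OLS of x_t on x_{t-1}; returns (b0, b1, residuals)."""
    lagged, current = series[:-1], series[1:]
    n = len(current)
    mean_l, mean_c = sum(lagged) / n, sum(current) / n
    sxx = sum((v - mean_l) ** 2 for v in lagged)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(lagged, current))
    b1 = sxy / sxx
    b0 = mean_c - b1 * mean_l
    residuals = [c - b0 - b1 * l for l, c in zip(lagged, current)]
    return b0, b1, residuals

def autocorr_t_stats(residuals, max_lag):
    """(lag, autocorrelation, t-statistic) using 1/sqrt(T) as standard error."""
    T = len(residuals)
    m = sum(residuals) / T
    denom = sum((e - m) ** 2 for e in residuals)
    out = []
    for k in range(1, max_lag + 1):
        num = sum((residuals[t] - m) * (residuals[t - k] - m) for t in range(k, T))
        rho = num / denom
        out.append((k, rho, rho * math.sqrt(T)))
    return out

random.seed(1)  # arbitrary seed, only for reproducibility
b0_true, b1_true = 1.0, 0.5  # |b1| < 1, so the series is covariance stationary
x = [b0_true / (1 - b1_true)]  # start at the long-run mean b0 / (1 - b1)
for _ in range(2000):
    x.append(b0_true + b1_true * x[-1] + random.gauss(0.0, 1.0))

b0_hat, b1_hat, resid = fit_ar1(x)
stats = autocorr_t_stats(resid, max_lag=4)
```

If any of the t-statistics in `stats` is large in absolute value (roughly above 2), the residuals are serially correlated and the AR model needs more lags or a different specification.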


__``Random Walk``__
We cannot use the regression methods we have discussed so far to estimate an AR(1) model on a time series that is actually a random walk. 
* No __``mean reversion``__
* Not a covariance-stationary time series


#### MA(1)
$$
x_t=\epsilon_t+\theta \epsilon_{t-1}
$$
* $\to$ ARMA $\to$ ARIMA 
* __``Issues``__: 
   * The parameters in ARMA models can be very unstable. 
      * Slight changes in the data sample or the initial guesses for the values of the ARMA parameters can result in very different final estimates of the ARMA parameters
   * Choosing the right ARMA model is more of an art than a science
      * The criteria for deciding on p and q are far from perfect
   * Even after a model is selected, that model may not forecast well. 
   
   

#### Other Issues 
* __``Seasonality``__
  * add a seasonality lag 
  
* __``AR Conditional Heteroskedasticity (ARCH)``__ The variance of the error in a particular time-series model in one period depends on the variance of the error in previous periods. 
   * ARCH(1):
     $$
     \epsilon_t\sim N(0, a_0+a_1\epsilon_{t-1}^2)
     $$
     
   * __``Engle``__ shows that we can test whether a time series is ARCH(1) by regressing the squared residuals from a previously estimated time-series model (AR, MA or ARMA) on a constant and one lag of the squared residuals. For example, 
      $$
      \hat{\epsilon}_t^2=a_0+a_1\hat{\epsilon}_{t-1}^2+u_t
      $$
      
   * __``consequence``__: the standard errors for the regression parameters will not be correct. We will need to use __``generalized least squares``__ or other methods that correct for heteroskedasticity to correctly estimate the standard error of the parameters in the time-series model. 
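The ARCH(1) test regression above can be sketched as an OLS of squared residuals on their own first lag. As a simplification, the sketch simulates ARCH(1) errors directly instead of taking residuals from a previously estimated AR, MA, or ARMA model; the parameters, seed, and function name are illustrative assumptions.

```python
import math
import random

def arch1_regression(residuals):
    """Regress squared residuals on a constant and one lag of themselves.

    Returns (a0_hat, a1_hat, t-statistic of a1_hat).
    """
    sq = [e * e for e in residuals]
    lagged, current = sq[:-1], sq[1:]
    n = len(current)
    mean_l, mean_c = sum(lagged) / n, sum(current) / n
    sxx = sum((v - mean_l) ** 2 for v in lagged)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(lagged, current))
    a1 = sxy / sxx
    a0 = mean_c - a1 * mean_l
    sse = sum((c - a0 - a1 * l) ** 2 for l, c in zip(lagged, current))
    se_a1 = math.sqrt(sse / (n - 2) / sxx)
    return a0, a1, a1 / se_a1

random.seed(2)  # arbitrary seed, only for reproducibility
a0_true, a1_true = 1.0, 0.3
eps = [0.0]
for _ in range(5000):
    variance_t = a0_true + a1_true * eps[-1] ** 2  # conditional variance
    eps.append(random.gauss(0.0, math.sqrt(variance_t)))

a0_hat, a1_hat, t_a1 = arch1_regression(eps[1:])
```

A statistically significant `a1_hat` indicates ARCH(1) errors; if it is not significantly different from 0, the error variance does not appear to depend on the previous squared error.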
   
* __``Regressions with more than one time series``__ A time series that contains a unit root is not covariance stationary. If any time series in a linear regression contains a unit root, OLS estimates of regression test statistics may be invalid.
  * __``Dickey-Fuller test``__ 
     * Neither X nor Y has a unit root $\to$ OK!
     * Either X or Y has a unit root but not both $\to$ not covariance stationary $\to$ __``inconsistent results``__
     * Both X and Y have a unit root $\to$ test for __``cointegration``__
        * No $\to$ __``inconsistent results``__
        * Yes $\to$ OK!
     
   * __``Cointegrated:``__ a __``long-term``__ financial or economic relationship exists between them such that they do not diverge from each other without bound in the long run. 
     * __``Cautious``__ in interpreting the results of a regression with cointegrated variables. The cointegrated regression estimates the __``long-term``__ relation between the two series but may not be the best model of the __``short-term``__ relation between the two series. 
     * __``Testing``__
        * Estimate the regression $y_t=b_0+b_1x_t+\epsilon_t$
        * test whether the error term from the regression has a unit root using __``(Engle-Granger) Dickey-Fuller test``__
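        * if the test fails to reject the null hypothesis of a unit root in the error term $\to$ not cointegrated 
        * if the test rejects the null hypothesis of a unit root $\to$ cointegrated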
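The two testing steps above can be sketched on simulated data: a random walk $x_t$ and a series $y_t$ that is tied to it by construction. The data-generating values, the seed, and the function names are assumptions for illustration, and the unit-root regression is the simplest Dickey-Fuller form without an intercept or lagged differences.

```python
import math
import random

def ols(xs, ys):
    """OLS of ys on xs with an intercept; returns (b0, b1, residuals)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1, [y - b0 - b1 * x for x, y in zip(xs, ys)]

def dickey_fuller_t(series):
    """t-statistic on g in: (e_t - e_{t-1}) = g * e_{t-1} + u_t."""
    lagged = series[:-1]
    delta = [b - a for a, b in zip(series[:-1], series[1:])]
    sxx = sum(v * v for v in lagged)
    g = sum(l * d for l, d in zip(lagged, delta)) / sxx
    sse = sum((d - g * l) ** 2 for l, d in zip(lagged, delta))
    s2 = sse / (len(delta) - 1)
    return g / math.sqrt(s2 / sxx)

random.seed(3)  # arbitrary seed, only for reproducibility
x = [0.0]
for _ in range(500):
    x.append(x[-1] + random.gauss(0.0, 1.0))           # random walk: unit root
y = [1.0 + 2.0 * v + random.gauss(0.0, 1.0) for v in x]  # tied to x

b0_hat, b1_hat, resid = ols(x, y)  # step 1: cointegrating regression
t_df = dickey_fuller_t(resid)      # step 2: unit-root test on the residuals
```

A strongly negative `t_df` rejects a unit root in the residuals, which points to cointegration. Because the residuals come from an estimated regression, the statistic should be compared with Engle-Granger critical values rather than standard t or Dickey-Fuller tables; those values are not reproduced here.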