# Linear Regression

## Assumptions
1. __Linearity and additivity__: (1) The relationship between the dependent variable and each independent variable should be linear. Check for outliers as well, since linear regression is sensitive to them. (2) The effects of the independent variables on the dependent variable should be additive: the effect of changing one independent variable should not depend on the values of the others.

  __How to diagnose__: The linearity assumption can best be tested with scatter plots.
  
  __How to fix__: Consider applying a nonlinear transformation to the dependent and/or independent variables. Another possibility to consider is adding another regressor that is a nonlinear function of one of the other variables.
      
2. __Statistical independence of errors__: The errors should be independent of one another. In particular, for time series data there should be no correlation between consecutive errors.

   __How to diagnose__: To test for non-time-series violations of independence, you can look at plots of the residuals versus independent variables. The residuals should be randomly and symmetrically distributed around zero.
   
   __How to fix__: The problem may stem from a violation of the linearity assumption or from bias that is explainable by omitted variables (say, interaction terms or dummies for identifiable conditions). Correct the functional form or add the omitted variables.

3. __Normality of residuals__: The residuals should be normally distributed.

  __How to diagnose__: This assumption can best be checked with a histogram or a Q-Q plot. 

  __How to fix__: Violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of variables might cure both problems.

  __Note__:  1). The dependent and independent variables in a regression model do not need to be normally distributed by themselves--only the prediction errors need to be normally distributed. But if the distributions of some of the variables that are random are extremely asymmetric or long-tailed, it may be hard to fit them into a linear model whose errors will be normally distributed. 
  2). $y_i=x_i\hat{\beta}+\hat{\epsilon_i}$, where the residual $\hat{\epsilon_i}$ is an observed quantity. The residuals $\hat{\epsilon_1},...,\hat{\epsilon_n}$ should follow a normal distribution.
  
4. __Homoscedasticity__: The variance of the errors should be constant with respect to (a) time, (b) the predictions, and (c) the values of the independent variables.

  __How to diagnose__: Look at a plot of residuals versus predicted values and, in the case of time series data, a plot of residuals versus time. To be really thorough, you should also generate plots of residuals versus independent variables to look for consistency there as well. 
  
  __How to fix__: If the dependent variable is strictly positive and if the residual-versus-predicted plot shows that the size of the errors is proportional to the size of the predictions, a log transformation applied to the dependent variable may be appropriate. 
  
  __Note__: $y_i=x_i\beta+\epsilon_i$, where $\epsilon_i$ is a random variable with variance $\sigma^2_i$. Homoscedasticity means that $\sigma^2_i=\sigma^2$ for all $i$.

5. __No multicollinearity__: Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.

  __How to diagnose__:  
   * Correlation matrix 
   * Tolerance: With $T < 0.1$ there might be multicollinearity in the data, and with $T < 0.01$ there almost certainly is.
   $$T_k=1 - R_k^2$$
   * Variance Inflation Factor (VIF): A VIF above 5 to 10 indicates that multicollinearity is likely present, and you should consider dropping the variable.
   $$VIF_k = \frac{1}{1 - R_k^2}$$ where $R^2_k$ is the $R^2$-value obtained by regressing the kth predictor on the remaining predictors.

  __How to fix__: Centering the data (that is, subtracting the mean of the variable from each score) might help, particularly when the collinearity comes from polynomial or interaction terms. The simplest way to address the problem is to remove independent variables with high VIF values.
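
The VIF diagnostic above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the helper name `vif` and the synthetic predictors `x1`, `x2`, `x3` are assumptions made for the example, with `x3` deliberately built as a near copy of `x1`.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X (n samples x k predictors)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual @ residual / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Hypothetical data: x3 is x1 plus a little noise, x2 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The first and third predictors should show large VIF values while the second stays close to 1, which matches the advice to drop one of the highly correlated variables.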

## Analysis of Variance (ANOVA)
When there is no association between Y and X, the best predictor of each observation is $\bar{y}=\hat{\beta_0}$. In this case, the total variation can be denoted as $TSS=\sum(y_i-\bar{y})^2$, the __Total Sum of Squares__.


When there is an association between Y and X, the best predictor of each observation is $\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i$. In this case, the error variation can be denoted as $SSE=\sum(y_i-\hat{y_i})^2$, the __Error Sum of Squares__.


The difference between TSS and SSE is the variation "explained" by the regression of Y on X. It represents the difference between the fitted values and the mean: $SSR=\sum(\hat{y_i}-\bar{y})^2$, the __Regression Sum of Squares__.


For a least-squares fit with an intercept, the relationship among them is $TSS=SSE+SSR$, that is, $\sum(y_i-\bar{y})^2=\sum(y_i-\hat{y_i})^2+\sum(\hat{y_i}-\bar{y})^2$
 

A common way to summarize how well a linear regression model fits the data is via the coefficient of determination or $R^2$. It is the proportion of variation in the forecast variable that is explained by the regression model. It can be calculated as:$$R^2 = \frac{\sum(\hat{y_i}-\bar{y})^2}{\sum(y_i-\bar{y})^2}=\frac{SSR}{TSS}$$
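
As a quick numerical check of the decomposition and of $R^2$, the following sketch fits a simple linear regression with the usual closed-form estimates. The data are synthetic and the variable names are assumptions chosen for the illustration.

```python
import numpy as np

# Hypothetical data with a linear trend plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)

# Least-squares estimates for simple linear regression.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

tss = np.sum((y - y.mean()) ** 2)      # total variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
r2 = ssr / tss

print(f"TSS={tss:.3f}  SSE+SSR={sse + ssr:.3f}  R^2={r2:.3f}")
```

With a single predictor, $R^2$ also equals the squared sample correlation between x and y, which gives a second way to verify the result.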

The __$R^2$ value__ is commonly used, often incorrectly, in forecasting. There are no set rules of what a good $R^2$ value is and typical values of $R^2$ depend on the type of data used. Validating a model’s out-of-sample forecasting performance is much better than measuring the in-sample $R^2$ value.

  
The __Adjusted $R^2$__ is a modified version of $R^2$ that has been adjusted for the number of predictors in the model. The adjusted $R^2$ increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted $R^2$ can be negative, but it usually is not. It is never higher than the $R^2$.
$$R_{adj}^2=1-\frac{(1-R^2)(n-1)}{n-k-1}$$
where:
   * n is the number of points in your data sample  
   * k is the number of variables in your model, excluding the constant.
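
The adjustment is easy to express as a small function. The sketch below assumes the formula above. The function name `adjusted_r2` and the input values are made up for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors (excluding the constant) and n points."""
    if n - k - 1 <= 0:
        raise ValueError("need more data points than parameters (n > k + 1)")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical cases: a decent fit, and a weak fit with many predictors.
good_fit = adjusted_r2(0.80, n=50, k=3)
weak_fit = adjusted_r2(0.05, n=20, k=5)
print(good_fit, weak_fit)
```

The second case comes out negative, which shows how a weak model with several predictors and few data points is penalized.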

__Mean Squared Error (MSE)__ measures the average of the squares of the errors or deviations. If $\hat{y}$ is a vector of n predictions, and $y$ is the vector of observed values corresponding to the inputs to the function which generated the predictions, then the MSE of the predictor can be estimated by $$MSE=\frac{1}{n}\sum_{i=1}^n(\hat{y_i}-y_i)^2$$
Note that the ANOVA table below divides SSE by its degrees of freedom, $n-p-1$, rather than by $n$. That version is the unbiased estimate of the error variance.


__ANOVA__ calculations are displayed in an analysis of variance table, which has the following format for multiple linear regression:

| Source | Degrees of Freedom | Sum of Squares | Mean Square | $F_{obs}$ | P-value |
| :----- | :-----: | :----- | :----- | :-----: | :-----: |
| Model | p | $SSR=\sum(\hat{y_i}-\bar{y})^2$ | $MSR=\frac{SSR}{p}$ | $F=\frac{MSR}{MSE}$ | $P(F_{p, n-p-1} \ge F_{obs})$ |
| Error | n-p-1 | $SSE=\sum(y_i-\hat{y_i})^2$ | $MSE=\frac{SSE}{n-p-1}$ | &nbsp; | &nbsp; |
| Total | n-1 | $TSS=\sum(y_i-\bar{y})^2$ | &nbsp; | &nbsp; | &nbsp; |
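
The entries of the table can be computed directly from a fitted model. The following is a sketch with NumPy on synthetic data (the coefficients and noise level are assumptions). The P-value column is left out because it needs an F-distribution routine, for example from SciPy.

```python
import numpy as np

# Hypothetical data: n observations, p predictors.
rng = np.random.default_rng(2)
n, p = 60, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([1.5, -0.7]) + rng.normal(scale=0.5, size=n)

design = np.column_stack([np.ones(n), X])   # intercept column plus p predictors
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ beta_hat

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

msr = ssr / p                 # mean square for the model
mse = sse / (n - p - 1)       # mean square for the error
f_obs = msr / mse

print(f"{'Source':<8}{'df':>4}{'SS':>12}{'MS':>12}{'F':>12}")
print(f"{'Model':<8}{p:>4}{ssr:>12.3f}{msr:>12.3f}{f_obs:>12.3f}")
print(f"{'Error':<8}{n - p - 1:>4}{sse:>12.3f}{mse:>12.3f}")
print(f"{'Total':<8}{n - 1:>4}{tss:>12.3f}")
```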

## Hypotheses Tests for the Slopes
__F test: testing all slope parameters equal 0__

For a multiple regression model with intercept, we want to test the following null hypothesis and alternative hypothesis:

$H_0: \beta_1 = \beta_2 = ... = \beta_p = 0$

$H_1: \beta_j \neq 0$, for at least one value of $j$

This test is known as the overall __F-test for regression__. The test statistic is $F=\frac{MSR}{MSE}$, which under $H_0$ follows an F distribution with $(p, n-p-1)$ degrees of freedom. For a significance level $\alpha$, find the critical value $F_{\alpha;\,p,\,n-p-1}$ using an F table. Reject the null hypothesis if $F > F_{\alpha;\,p,\,n-p-1}$. Otherwise, fail to reject it.
   
__T test: testing one slope parameter is 0__

The t-test is used to check the significance of individual regression coefficients in the multiple linear regression model. The hypothesis statements to test the significance of a particular regression coefficient $\beta_j$ are:

$H_0: \beta_j = 0$

$H_1: \beta_j \neq 0$

The test statistic for this test is based on the t distribution: $T_0 = \frac{\hat{\beta_j}}{se(\hat{\beta_j})}$, which under $H_0$ follows a t distribution with $n-p-1$ degrees of freedom. Reject $H_0$ when $|T_0|$ exceeds the critical value $t_{\alpha/2,\,n-p-1}$.
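
Both tests can be sketched with NumPy and SciPy's `scipy.stats.f` and `scipy.stats.t` distributions. The data below are synthetic, and the third predictor is deliberately given no effect. The standard errors use the standard OLS result that the covariance of $\hat{\beta}$ is $MSE \cdot (X^TX)^{-1}$, which is an addition not spelled out in the text above.

```python
import numpy as np
from scipy import stats

# Hypothetical data: the third predictor has no effect on y.
rng = np.random.default_rng(3)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta_hat
resid = y - fitted
df_error = n - p - 1
mse = resid @ resid / df_error

# Overall F-test: H0 says all p slopes are zero.
ssr = np.sum((fitted - y.mean()) ** 2)
f_obs = (ssr / p) / mse
f_pvalue = stats.f.sf(f_obs, p, df_error)

# t-tests: se(beta_j) is the square root of the j-th diagonal entry of MSE * (X'X)^-1.
cov = mse * np.linalg.inv(design.T @ design)
se = np.sqrt(np.diag(cov))
t_obs = beta_hat / se
t_pvalues = 2 * stats.t.sf(np.abs(t_obs), df_error)   # two-sided p-values

print(f"F = {f_obs:.2f}, p-value = {f_pvalue:.3g}")
for name, t, pv in zip(["const", "x1", "x2", "x3"], t_obs, t_pvalues):
    print(f"{name}: t = {t:.2f}, p-value = {pv:.3g}")
```

Comparing the p-value with $\alpha$ is equivalent to comparing the statistic with the critical value from a table.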

## Variable Selection Criteria
__Akaike's Information Criterion (AIC)__

A simple formula for the calculation of the AIC in the OLS framework is:
$$AIC=n\log(\frac{SSE}{n})+2k$$
where SSE is the Sum of Squared Errors $\sum(y_i-\hat{y_i})^2$, $\log$ is the natural logarithm, n is the sample size, and k is the number of predictors in the model plus one for the intercept.

__Bayes Information Criterion (BIC)__

A simple formula for the calculation of the BIC in the OLS framework is:
$$BIC=n\log(\frac{SSE}{n})+k\log(n)$$
For both criteria, lower values indicate a better model. Larger models will fit better and so have smaller SSE but use more parameters. Thus the best choice of model will balance fit with model size. BIC penalizes larger models more heavily (since $\log(n) > 2$ once $n \ge 8$) and so will tend to prefer smaller models in comparison to AIC. AIC and BIC can be used as selection criteria for other types of model too.
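
A short sketch of both formulas follows. The function names and the SSE values are hypothetical, chosen so that the two criteria disagree: AIC picks the larger model and BIC picks the smaller one.

```python
import math

def aic(sse, n, k):
    """AIC in the OLS framework: n * log(SSE / n) + 2k."""
    return n * math.log(sse / n) + 2 * k

def bic(sse, n, k):
    """BIC in the OLS framework: n * log(SSE / n) + k * log(n)."""
    return n * math.log(sse / n) + k * math.log(n)

# Hypothetical comparison on n = 100 points:
# a small model (k = 3) with SSE = 120 and a larger model (k = 6) with SSE = 110.
n = 100
aic_small, aic_large = aic(120.0, n, 3), aic(110.0, n, 6)
bic_small, bic_large = bic(120.0, n, 3), bic(110.0, n, 6)

print(f"AIC small={aic_small:.2f} large={aic_large:.2f}")
print(f"BIC small={bic_small:.2f} large={bic_large:.2f}")
```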


http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf

##  Learning Algorithms Used to Estimate the Coefficients
* __Simple Linear Regression__

When there is a single input variable (x), the method is referred to as simple linear regression. 

* __Ordinary Least Squares__

When we have more than one input we can use Ordinary Least Squares to estimate the values of the coefficients. The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals.

* __Gradient Descent__

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

* __Regularization__

Regularization methods extend ordinary least squares: they seek to minimize the sum of squared errors of the model on the training data while also reducing the complexity of the model (for example, the number of nonzero coefficients or the total absolute or squared size of the coefficients).
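
To make the relationship between Ordinary Least Squares and Gradient Descent concrete, the sketch below estimates the same coefficients both ways on synthetic data. The learning rate and iteration count are assumptions that happen to work for this well-scaled data and are not general recommendations.

```python
import numpy as np

# Hypothetical data with an intercept of 3 and slopes 1 and -2.
rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + X @ np.array([1.0, -2.0]) + rng.normal(scale=0.3, size=n)
design = np.column_stack([np.ones(n), X])

# Ordinary Least Squares: direct least-squares solve.
theta_ols, *_ = np.linalg.lstsq(design, y, rcond=None)

# Batch gradient descent on MSE(theta) = mean((design @ theta - y)^2).
theta_gd = np.zeros(design.shape[1])
learning_rate = 0.1   # assumed step size; too large a value makes the iteration diverge
for _ in range(1000):
    gradient = (2.0 / n) * design.T @ (design @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print(theta_ols)
print(theta_gd)
```

Both approaches minimize the same cost, so the estimates agree. Gradient descent becomes attractive when the data are too large for a direct solve.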

## Prepare Data for Linear Regression
* Linear Assumption (log transform)
* Remove Noise (remove outliers)
* Remove Collinearity (highly correlated inputs can lead to overfitting)
* Gaussian Distributions (log or Box-Cox transform)
* Rescale Inputs (standardization or normalization)
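
A few of these steps can be sketched with NumPy. The right-skewed `income` variable is hypothetical, and the cutoff of three standard deviations for outliers is an assumed rule of thumb, not something prescribed by the list above.

```python
import numpy as np

def skewness(v):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 3)

# Hypothetical right-skewed input variable.
rng = np.random.default_rng(5)
income = rng.lognormal(mean=10.0, sigma=1.0, size=500)

# Log transform makes the distribution closer to Gaussian.
log_income = np.log(income)

# Standardization: zero mean and unit variance.
standardized = (log_income - log_income.mean()) / log_income.std()

# Outlier removal with an assumed cutoff of 3 standard deviations.
cleaned = standardized[np.abs(standardized) < 3.0]

print(f"skewness before: {skewness(income):.2f}, after log: {skewness(log_income):.2f}")
print(f"kept {cleaned.size} of {standardized.size} points")
```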


https://machinelearningmastery.com/linear-regression-for-machine-learning/

## Regression  Diagnostics
https://medium.com/@emredjan/emulating-r-regression-plots-in-python-43741952c034

## Regularized Linear Models
### Ridge Regression
Ridge Regression cost function

$J(\theta) = MSE(\theta) + \alpha\frac{1}{2}\sum_{i=1}^{n} \theta_i^2$
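
Ridge regression has a closed-form solution, $\theta = (X^TX + \lambda I)^{-1}X^Ty$, which the sketch below implements. Here $\lambda$ stands for $\alpha$ rescaled by the constants in the cost function (the sample size from the MSE and any factor of 1/2). The sketch also assumes centered data so that no intercept is needed, and the function name `ridge_fit` is made up for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X'X + lam * I) theta = X'y for centered X and y (no intercept term)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Hypothetical data, centered so the intercept drops out.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)
X = X - X.mean(axis=0)
y = y - y.mean()

theta_ols = ridge_fit(X, y, lam=0.0)      # no penalty: plain least squares
theta_ridge = ridge_fit(X, y, lam=50.0)   # penalty shrinks the coefficients

print(theta_ols)
print(theta_ridge)
```

The penalized coefficients are pulled toward zero but, unlike with Lasso, generally do not become exactly zero.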

### Lasso Regression
Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm.

Lasso Regression cost function

$J(\theta) = MSE(\theta) + \alpha\sum_{i=1}^{n} |\theta_i|$
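
Lasso has no closed-form solution in general. One common way to minimize the cost function above is coordinate descent with soft-thresholding, sketched below. This is an illustrative implementation under assumptions (centered data, no intercept, a fixed number of sweeps), not the method of any particular library, and the function names are made up for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(X, y, alpha, n_iter=200):
    """Coordinate descent for mean((y - X theta)^2) + alpha * sum(|theta|), no intercept."""
    n, k = X.shape
    theta = np.zeros(k)
    for _ in range(n_iter):
        for j in range(k):
            # Residual with feature j's current contribution removed.
            partial = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial
            theta[j] = soft_threshold(rho, n * alpha / 2.0) / (X[:, j] @ X[:, j])
    return theta

# Hypothetical data: the second feature has a true coefficient of zero.
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)
X = X - X.mean(axis=0)
y = y - y.mean()

theta_unpenalized = lasso_fit(X, y, alpha=0.0)
theta_lasso = lasso_fit(X, y, alpha=1.0)
print(theta_unpenalized)
print(theta_lasso)
```

With a sufficiently large $\alpha$ the irrelevant coefficient is set exactly to zero, which is why Lasso is also used for feature selection.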

### Elastic Net
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.

Elastic Net cost function

$J(\theta) = MSE(\theta) + r\alpha\sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2}\alpha\sum_{i=1}^{n}\theta_i^2$
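
The mixing behavior can be checked numerically by writing out the three penalty terms. The sketch assumes the Ridge penalty carries the factor 1/2 (half the square of the ℓ2 norm, as in the Lasso description), and the weight vector and $\alpha$ are arbitrary illustrative values.

```python
import numpy as np

def ridge_penalty(theta, alpha):
    """alpha times half the squared l2 norm of the weights."""
    return alpha * 0.5 * np.sum(theta ** 2)

def lasso_penalty(theta, alpha):
    """alpha times the l1 norm of the weights."""
    return alpha * np.sum(np.abs(theta))

def elastic_net_penalty(theta, alpha, r):
    """Mix of the two penalties with mix ratio r in [0, 1]."""
    return r * lasso_penalty(theta, alpha) + (1 - r) * ridge_penalty(theta, alpha)

theta = np.array([0.5, -2.0, 0.0, 1.5])   # arbitrary example weights
alpha = 0.3

print(elastic_net_penalty(theta, alpha, r=0.0), ridge_penalty(theta, alpha))
print(elastic_net_penalty(theta, alpha, r=1.0), lasso_penalty(theta, alpha))
print(elastic_net_penalty(theta, alpha, r=0.5))
```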

## Linear Models in Python
* __scipy.stats.linregress__ only handles the case of a single explanatory variable with specialized code and calculates a few extra statistics.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

* __statsmodels.OLS__ (available as `statsmodels.api.OLS`) is a generic linear model (OLS) estimation class. It doesn't prespecify what the explanatory variables are and can handle any multivariate array of explanatory variables, or formulas and pandas DataFrames. It not only returns the estimated parameters, but also a large set of results statistics and methods for statistical inference and prediction.

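* __sklearn.linear_model__ provides the `LinearRegression` estimator as well as regularized variants such as `Ridge`, `Lasso`, and `ElasticNet`.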

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

## Reference
Assumption: http://people.duke.edu/~rnau/testing.htm

ANOVA: http://www.stat.ufl.edu/~winner/statnotescomp/regression.pdf

ANOVA Table: http://www.stat.yale.edu/Courses/1997-98/101/anovareg.htm

Hypothesis Test: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis#Test_on_Individual_Regression_Coefficients_.28t__Test.29