## Contents

* Regression
* The assumptions made in Regression
* How to check if these assumptions are violated
* How to improve the accuracy of a Regression Model
* How to assess the fit of a Regression Model
* Solving a Regression Problem

### Regression 

The figure below shows a regression line. Given paired observations of x and y, we can fit a model that generalizes the relationship between them.

![](reg.png)

The best line is the one with the minimum error. 
Some errors are positive and some are negative, so simply summing them lets them cancel each other out.
The sum of squared errors avoids this and is convenient to minimize. Minimizing the sum of squared errors is called the least squares method of regression.

The ordinary least squares (OLS) technique minimizes the sum of squared errors ∑[Actual(y) - Predicted(y')]² by finding the best possible values of the regression coefficients (b0, b1, etc.).
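
As a minimal sketch of what "finding the best possible values" means for a single predictor, the snippet below uses the standard closed-form least squares solution. The data and the function names (`fit_simple_ols`, `sum_squared_errors`) are made up for illustration and are not part of the original material.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize sum((y - (b0 + b1 * x)) ** 2)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_mean, y_mean)
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """Sum of squared differences between actual y and predicted y'."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_ols(x, y)
best_sse = sum_squared_errors(x, y, b0, b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {best_sse:.2f}")
# prints: b0 = 2.20, b1 = 0.60, SSE = 2.40

# Nudging the intercept away from the OLS value increases the squared error
nudged_sse = sum_squared_errors(x, y, b0 + 0.5, b1)
print(nudged_sse > best_sse)  # prints: True
```

With more than one predictor the same idea applies, but the coefficients are usually obtained with a linear algebra routine rather than by hand.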

In OLS, the variation in y is described by three sums of squares. For a model with an intercept, SST = SSR + SSE:

**Sum of Squares Error (SSE)** - ∑[Actual(y) - Predicted(y')]²

**Sum of Squares Regression (SSR)** - ∑[Predicted(y') - Mean(ymean)]²

**Sum of Squares Total (SST)** - ∑[Actual(y) - Mean(ymean)]²

![](anat.png)

The most important use of these sums of squares is in calculating the Coefficient of Determination (R²).

The R² metric tells us the proportion of the variance in the dependent variable that is explained by the independent variables in the model: R² = SSR / SST = 1 - SSE / SST.
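
The three sums of squares and R² can be checked numerically. This is a small sketch on made-up data with a single predictor; the variable names are illustrative assumptions.

```python
# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form OLS coefficients for one predictor (model includes an intercept)
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean
y_pred = [b0 + b1 * xi for xi in x]

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # unexplained variation
ssr = sum((pi - y_mean) ** 2 for pi in y_pred)          # explained variation
sst = sum((yi - y_mean) ** 2 for yi in y)               # total variation

r_squared = 1 - sse / sst

print(f"SSE = {sse:.2f}, SSR = {ssr:.2f}, SST = {sst:.2f}")
# prints: SSE = 2.40, SSR = 3.60, SST = 6.00
print(f"R^2 = {r_squared:.2f}")
# prints: R^2 = 0.60
```

Here SSR + SSE equals SST (up to floating-point rounding), so SSR / SST and 1 - SSE / SST give the same R².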

### The assumptions made in regression 

The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g. ordinary least squares):

* ***Linearity:*** The response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. In other words, a 1-unit change in an independent variable (IV) changes the dependent variable (DV) by a constant amount.

* ***Lack of perfect multicollinearity:*** For standard least squares estimation, no independent variable may be an exact linear combination of the others. Strong correlation among independent variables, even when it is not perfect, is called multicollinearity. When variables are highly correlated, it becomes extremely difficult for the model to determine the true effect of each X on Y.

* ***Constant variance (Homoscedasticity):*** The errors have the same variance regardless of the values of the predictor variables. The absence of constant variance is called heteroskedasticity.

* ***No autocorrelation:*** The error terms must be uncorrelated, i.e. the error at time t (εt) must not indicate the error at time t+1 (εt+1). Correlation among the error terms is known as autocorrelation. It makes the estimated standard errors unreliable and the coefficient estimates less efficient, since both rest on the assumption of uncorrelated error terms.

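* ***Normality of errors:*** The residuals should be approximately normally distributed. This matters mainly for confidence intervals and hypothesis tests; the dependent variable itself does not need to be normally distributed.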



### How to check if these assumptions are violated
 * **Residual vs. Fitted Values Plot** - This plot shouldn't show any pattern. If you see a shape (a curve or a U shape), it suggests non-linearity in the data set. If you see a funnel-shaped pattern, it suggests your data is suffering from heteroskedasticity, i.e. the error terms have non-constant variance.
 
 ![](het1.png)
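
To illustrate how a pattern appears, the sketch below fits a straight line to made-up data that is actually quadratic and prints the (fitted, residual) pairs that such a plot would show. The data and variable names are assumptions for illustration; plotting these pairs with any charting library gives the residual vs. fitted values plot.

```python
# Made-up data with a truly quadratic relationship: y = x^2
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Fit a straight line anyway (closed-form OLS for one predictor)
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print("fitted  residual")
for f, r in zip(fitted, residuals):
    print(f"{f:6.1f}  {r:8.1f}")
# The residuals are 2, -1, -2, -1, 2: positive at both ends and negative in
# the middle, which is the U shape that signals non-linearity.
```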

If you prefer numbers to plots, you can also run quick tests to check for assumption violations:

* Durbin-Watson statistic (DW) - This test checks for autocorrelation. Its value lies between 0 and 4. DW = 2 indicates no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 implies negative autocorrelation.
* Variance Inflation Factor (VIF) - This metric checks for multicollinearity. As a rule of thumb, VIF <= 4 implies no serious multicollinearity, while VIF >= 10 suggests high multicollinearity. Alternatively, you can look at the tolerance (1/VIF) to gauge correlation among the IVs, or create a correlation matrix to identify collinear variables.
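
Both statistics are short to compute by hand. The sketch below assumes NumPy is available and uses the textbook definitions: DW is the sum of squared successive residual differences divided by the sum of squared residuals, and the VIF of a predictor is 1 / (1 - R²) from regressing it on the other predictors. The function names and the data are made up for illustration.

```python
import numpy as np


def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); always between 0 and 4."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


def variance_inflation_factors(X):
    """VIF for each column of X, regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column to the auxiliary regression
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2)))  # tolerance is 1 / VIF
    return vifs


# Residuals that flip sign every step: negative autocorrelation (DW > 2)
dw_alternating = durbin_watson([1, -1, 1, -1])
# Residuals that keep their sign for a while: positive autocorrelation (DW < 2)
dw_runs = durbin_watson([1, 1, -1, -1])
print(dw_alternating, dw_runs)  # prints: 3.0 1.0

# Made-up predictors: x2 is almost a copy of x1, x3 is unrelated noise
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
x3 = [2, 5, 1, 4, 6, 3]
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])  # x1 and x2 come out far above 10
```

If one predictor is an exact linear combination of the others, R² is 1 and the division breaks down, which is the perfect multicollinearity case.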


### How to improve the accuracy of a regression model 

Your options are limited when your data violates regression assumptions, but the following tips can help:

* If your data suffers from non-linearity, transform the IVs using sqrt, log, square, etc.
* If your data suffers from heteroskedasticity, transform the DV using sqrt, log, square, etc. You can also use the weighted least squares method to tackle this problem.
* If your data suffers from multicollinearity, use a correlation matrix to find the correlated variables. Say variables A and B are highly correlated. Instead of removing one of them arbitrarily, find the average correlation of A and of B with the rest of the variables, and remove whichever has the higher average. Alternatively, you can use penalized regression methods such as lasso, ridge, elastic net, etc.
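
As one illustration of the first tip, the sketch below compares a straight-line fit on a raw predictor with a fit on its log transform. The data is constructed so that y depends on log(x), which is an assumption made purely for the example; the helper name `r_squared_of_line` is also made up.

```python
import math


def r_squared_of_line(x, y):
    """Fit y = b0 + b1 * x by OLS and return the R^2 of that fit."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    b0 = y_mean - b1 * x_mean
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - sse / sst


# Made-up data: y is linear in log(x), not in x itself
x = [1, 2, 4, 8, 16]
y = [2 + 3 * math.log(xi) for xi in x]

r2_raw = r_squared_of_line(x, y)                          # line through (x, y)
r2_log = r_squared_of_line([math.log(xi) for xi in x], y)  # line through (log x, y)

print(f"R^2 with raw x: {r2_raw:.3f}")
print(f"R^2 with log x: {r2_log:.3f}")
```

On this data the transformed predictor gives an essentially perfect fit, while the raw predictor leaves curvature in the residuals. On real data the right transform is not known in advance, so it is worth comparing a few candidates.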


### How to assess the fit of a regression model

What counts as a good value for a fit metric depends on the type of data. Following are some metrics you can use to evaluate your regression model:

* **R Square (Coefficient of Determination)** - As explained above, this metric gives the proportion of variance explained by the covariates in the model. It ranges between 0 and 1. Usually, higher values are desirable, but what is acceptable depends on the data quality and the domain. For example, if the data is noisy, you may be happy to accept a model with a low R². It is good practice to consider adjusted R² rather than R² when judging model fit.
* **Adjusted R²** - The problem with R² is that it never decreases as you add variables, regardless of whether the new variable actually adds information to the model. To overcome that, we use adjusted R², which does not increase (it stays the same or decreases) unless the newly added variable is truly useful.
![](adjr2.gif)

* **RMSE / MSE / MAE** - Error metrics are the crucial evaluation numbers we must check. Since all of these measure error, the lower the number, the better the model. Let's look at them one by one:
    * MSE - This is the mean squared error. It tends to amplify the impact of outliers on the model's accuracy. For example, if the actual y is 10 and the predicted y is 30, the resulting MSE is (30-10)² = 400.
    * MAE - This is the mean absolute error. It is more robust against the effect of outliers. Using the previous example, the resulting MAE is |30-10| = 20.
    * RMSE - This is the root mean squared error. It is interpreted as how far, on average, the residuals are from zero. Taking the square root undoes the squaring in MSE and gives the result in the original units of the data. Here, the resulting RMSE is √((30-10)²) = 20. Don't be baffled that MAE and RMSE have the same value here: that happens only because the example has a single observation. Usually, we calculate these numbers by averaging over all the (actual - predicted) errors in the data.
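
The metrics in this section can be sketched in a few lines. The function names and the four-point data set are made up for illustration, and `adjusted_r2` assumes the standard formula 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    return math.sqrt(mse(actual, predicted))


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (standard formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# The single-observation example from the text: MAE and RMSE coincide
print(mse([10], [30]), mae([10], [30]), rmse([10], [30]))
# prints: 400.0 20.0 20.0

# Made-up data with several observations: RMSE is now larger than MAE
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 45]
print(mse(actual, predicted), mae(actual, predicted))
# prints: 10.5 3.0
print(round(rmse(actual, predicted), 2))
# prints: 3.24

# R^2 = 0.6 from 5 observations and 1 predictor
print(round(adjusted_r2(0.6, n=5, p=1), 3))
# prints: 0.467
```

RMSE exceeds MAE in the second example because squaring gives the largest error (5) more weight than the smaller ones.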