# Table of Contents:

## [1. Definition](#1)
## [2. Hypothesis](#2)
## [3. Cost Function](#3)
## [4. Normal Equation](#4)
## [5. Gradient Descent](#5)
## [6. Feature Scaling](#6)
## [7. Polynomial Regression](#7)
## [8. Assumptions of Linear Regression](#8)
## [9. Hyperparameter Optimization](#9)
## [10. Solutions to Overfitting](#10)
## [11. Solutions to Underfitting](#11)
## [12. Pros vs Cons](#12)

## Definition <a class="anchor" id="1"></a>

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, the output variable (y) can be calculated from a linear combination of the input variables (x).

More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term).

## Hypothesis <a class="anchor" id="2"></a>

![image.png](attachment:image.png)

$θ$ is the model's parameter vector containing the bias term $θ_{0}$ and the feature weights $θ_{1}$ to $θ_{n}$

$θ^{T}$ is the transpose of the parameter vector (a row vector instead of a column vector)

$h_{θ}$ is the hypothesis function, parameterized by $θ$

$X$ is the instance's feature vector containing $X_{0}$ to $X_{n}$
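As a minimal sketch of the definitions above, the hypothesis can be computed as a dot product between the parameter vector and the feature vector. This assumes the usual convention that $X_{0} = 1$, so that $θ_{0}$ acts as the bias term; the function name `hypothesis` and the numbers are made up for illustration.

```python
import numpy as np

def hypothesis(theta, x):
    """Return h_theta(x) = theta^T x for a single instance.

    theta: parameter vector [theta_0, theta_1, ..., theta_n]
    x:     feature vector [x_0, x_1, ..., x_n], with x_0 = 1 (assumed convention)
    """
    return float(np.dot(theta, x))

# Hypothetical values, chosen only for illustration.
theta = np.array([1.0, 2.0, 3.0])  # bias 1, feature weights 2 and 3
x = np.array([1.0, 4.0, 5.0])      # x_0 = 1, followed by two feature values

prediction = hypothesis(theta, x)  # 1*1 + 2*4 + 3*5 = 24
```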

## Cost Function <a class="anchor" id="3"></a>

<img src="https://github.com/trekhleb/homemade-machine-learning/raw/master/images/linear_regression/cost-function.svg">

$X_{i}$ - input (features) of ith training example

$Y_{i}$ - output of ith training example

$m$ - number of training examples

https://www.youtube.com/watch?v=K_EH2abOp00 - Has derivation of simple and multiple linear regression

To train a Linear Regression model, you need to find the value of $θ$ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Squared Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a non-negative function also minimizes its square root).
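A small sketch of the MSE in NumPy may make the relationship with the RMSE concrete. It assumes the design matrix `X` has a leading column of ones for the bias term; `mse_cost` and the toy data are illustrative names and values, not part of the original notes. Some presentations divide by $2m$ instead of $m$ to simplify the derivative; the minimizing $θ$ is the same either way.

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error of the predictions X @ theta against the targets y."""
    residuals = X @ theta - y
    return float(np.mean(residuals ** 2))

# Hypothetical toy data; the first column of ones is the bias input x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

perfect = mse_cost(np.array([0.0, 2.0]), X, y)  # y = 2x fits this data exactly
worse = mse_cost(np.array([0.0, 1.0]), X, y)    # y = x underestimates every target
rmse_worse = np.sqrt(worse)                     # RMSE is the square root of the MSE
```

Because the square root is monotonically increasing, any $θ$ with a lower MSE also has a lower RMSE, so both costs rank candidate parameters in the same order.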

## Normal Equation <a class="anchor" id="4"></a>

To find the value of $θ$ that minimizes the cost function, there is a closed-form solution, in other words, a mathematical equation that gives the result directly. This is called the Normal Equation.

![image.png](attachment:image.png)

where the $\theta$ obtained is the value of $\theta$ that minimizes the cost function.
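As a sketch, assuming the Normal Equation takes its usual form $θ = (X^{T}X)^{-1}X^{T}y$, it can be evaluated with NumPy as follows. The helper name `normal_equation` and the data are hypothetical. The code solves the linear system rather than forming the inverse explicitly, which is generally more stable numerically.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta.

    Assumes X^T X is invertible, i.e. no perfectly collinear columns.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free data generated from y = 4 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # prepend x_0 = 1 for the bias term
y = 4.0 + 3.0 * x

theta = normal_equation(X, y)  # approximately [4, 3]
```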

## Gradient Descent <a class="anchor" id="5"></a>

Gradient descent is an iterative optimization algorithm for finding the minimum of a cost function for a wide range of problems.

It measures the local gradient of the error function with regard to the parameter vector $\theta$, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum (for a convex cost function such as the MSE of linear regression, this is the global minimum).

You start by filling $\theta$ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.

https://youtu.be/sDv4f4s2SB8 - Explains gradient descent very well

<div>
<img src="attachment:image.png" width=700/>
</div>

When we use the term "batch" for gradient descent, it means that each step of gradient descent uses all the training examples (as you can see from the formula above).
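The loop below is a minimal sketch of batch gradient descent for the MSE cost, not a definitive implementation. It assumes the cost is the plain MSE (with a $1/m$ factor), so the gradient is $\frac{2}{m}X^{T}(Xθ - y)$; the learning rate, iteration count, seed, and data are hypothetical choices that happen to converge for this toy problem.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, n_iterations=2000, seed=0):
    """Minimize the MSE with batch gradient descent and return theta."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n)  # random initialization
    for _ in range(n_iterations):
        # Gradient of the MSE; note that it uses all m training examples.
        gradient = (2.0 / m) * X.T @ (X @ theta - y)
        theta = theta - learning_rate * gradient  # one small step downhill
    return theta

# Hypothetical noise-free data generated from y = 4 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # leading column of ones for the bias
y = 4.0 + 3.0 * x

theta = batch_gradient_descent(X, y)  # approaches [4, 3]
```

If the learning rate is too large the updates diverge, and if it is too small convergence takes many more iterations, so in practice this value needs tuning.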

## Feature Scaling <a class="anchor" id="6"></a>

To make gradient descent converge quickly and reliably when fitting linear regression, we need to make sure that the features are on a similar scale.

<img src="https://github.com/trekhleb/homemade-machine-learning/raw/master/images/linear_regression/feature-scaling.svg">

In order to scale the features, we need to do mean normalization:

<div>
<img src="attachment:image.png" width=400>
</div>
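A short sketch of mean normalization follows. It assumes the formula is $(x - μ) / s$ applied per feature, with $s$ taken to be the standard deviation (dividing by the range, max minus min, is another common choice). The function name and the house-size data are made up for illustration.

```python
import numpy as np

def mean_normalize(X):
    """Scale each column to zero mean and unit standard deviation.

    Assumes no column is constant (otherwise its standard deviation is 0).
    Returns the scaled data plus the statistics needed to scale new data.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical features on very different scales: house size and number of rooms.
X = np.array([[2100.0, 3.0],
              [1600.0, 2.0],
              [2400.0, 4.0],
              [1400.0, 3.0]])

X_scaled, mu, sigma = mean_normalize(X)

# New instances must be scaled with the training statistics, not their own.
x_new_scaled = (np.array([2000.0, 3.0]) - mu) / sigma
```

The bias column of ones, if present, should be added after scaling, since a constant column cannot be normalized.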

## Polynomial Regression <a class="anchor" id="7"></a>

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.

Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the hypothesis function is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

<div>
    <img src="https://camo.githubusercontent.com/86e425e6c3ce1b8dec10076c1660936f62a726fb/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f7468756d622f382f38622f506f6c797265675f736368656666652e7376672f36353070782d506f6c797265675f736368656666652e7376672e706e67" width=400/>
</div>

Example of a cubic polynomial regression, which is a type of linear regression.

The equation of the above graph would be:

<div>
<img src="attachment:image.png" width=300/>
</div>
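To illustrate why polynomial regression is still linear in its parameters, the sketch below builds the columns $1, x, x^2, x^3$ and then fits them with ordinary least squares, exactly as in multiple linear regression. The cubic used to generate the data is hypothetical and is not the curve in the figure.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return the columns [1, x, x^2, ..., x^degree] for a 1-D input array."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical noise-free cubic: y = 1 - 2x + 0.5x^3.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x ** 3

X_poly = polynomial_features(x, degree=3)

# Ordinary least squares on the expanded features recovers the coefficients.
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)  # approximately [1, -2, 0, 0.5]
```

Only the features are non-linear in `x`; the model remains a weighted sum of its inputs, which is why the same solvers apply.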

## Assumptions of Linear Regression <a class="anchor" id="8"></a>

Source: https://towardsdatascience.com/verifying-the-assumptions-of-linear-regression-in-python-and-r-f4cd2907d4c0

According to the Gauss–Markov theorem, in a linear regression model the ordinary least squares (OLS) estimator gives the best linear unbiased estimator (BLUE) of the coefficients, provided that:

- the expectation of errors (residuals) is 0

- the errors are uncorrelated

- the errors have equal variance — homoscedasticity of errors

Also, ‘best’ in BLUE means resulting in the lowest variance of the estimate, in comparison to other unbiased, linear estimators.


<b> Note on OLS: </b>

Ordinary least squares, or linear least squares, estimates the parameters in a regression model by minimizing the sum of the squared residuals. This method draws a line through the data points that minimizes the sum of the squared differences between the observed values and the corresponding fitted values.

https://youtu.be/PaFPbb66DxQ - Concept of "least squares" explained well

<img src="https://i1.wp.com/statisticsbyjim.com/wp-content/uploads/2017/04/residuals.png?resize=300%2C186&ssl=1" width=300>

Source: https://jeffmacaluso.github.io/post/LinearRegressionAssumptions/

#### 1. Linearity of the model: 
The response variable y should be linearly related to the explanatory variables X.
#### 2. Normality of error terms:
This assumes that the error terms of the model are normally distributed.
#### 3. No (perfect) multicollinearity:
The independent variables should not be highly correlated with each other. The presence of such correlation is known as multicollinearity.
#### 4. Residual errors should be homoscedastic: 
The residual errors should have constant variance.
#### 5. No Autocorrelation of the Error Terms
There should be no correlation between the residual (error) terms. The presence of such correlation is known as autocorrelation.

### 1. Linearity of the model

The dependent variable (y) is assumed to be a linear function of the independent variables (X, features) specified in the model. The specification must be linear in its parameters. Fitting a linear model to data with non-linear patterns results in serious prediction errors.

To detect nonlinearity one can inspect plots of observed vs. predicted values or residuals vs. predicted values. The desired outcome is that points are symmetrically distributed around a diagonal line in the former plot or around a horizontal line in the latter one.

<u>Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.</u>

<img src="https://raw.githubusercontent.com/JeffMacaluso/JeffMacaluso.github.io/master/_posts/LinearRegressionAssumptions_files/11_1.png" width=400>

<b>What to do if that is not the case?</b>

- non-linear transformations to dependent/independent variables

- adding extra features which are a transformation of the already used ones (for example squared version)

- adding features that were not considered before

### 2. Normality of the Error Terms

This assumes that the error terms of the model are normally distributed. There are a variety of ways to check for this, but we’ll look at both a histogram and the p-value from the Anderson-Darling test for normality.

<u>The histogram below shows how the distribution of the residuals should look: the residuals are normally distributed.</u>
<img src="https://raw.githubusercontent.com/JeffMacaluso/JeffMacaluso.github.io/master/_posts/LinearRegressionAssumptions_files/18_2.png" width=600>
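The Anderson-Darling test mentioned above normally comes from a statistics library. As a lighter stand-in (a different test, swapped in here only because it needs nothing beyond NumPy), the sketch below computes the Jarque-Bera statistic $\frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$ from the sample skewness $S$ and kurtosis $K$ of the residuals. Under normality it is approximately chi-squared with 2 degrees of freedom, whose 5% critical value is 5.991. The two residual samples are simulated, not real model output.

```python
import numpy as np

def jarque_bera(residuals):
    """Jarque-Bera statistic: large values suggest non-normal residuals."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    centered = e - e.mean()
    variance = np.mean(centered ** 2)
    skewness = np.mean(centered ** 3) / variance ** 1.5
    kurtosis = np.mean(centered ** 4) / variance ** 2
    return float(n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0))

CHI2_95_DF2 = 5.991  # 5% critical value of the chi-squared distribution, 2 d.o.f.

# Simulated residuals (hypothetical): one normal sample, one long-tailed sample.
rng = np.random.default_rng(0)
normal_residuals = rng.standard_normal(500)
skewed_residuals = rng.exponential(scale=1.0, size=500)

jb_normal = jarque_bera(normal_residuals)
jb_skewed = jarque_bera(skewed_residuals)

# The long-tailed sample should land far above the critical value.
reject_normality_for_skewed = jb_skewed > CHI2_95_DF2
```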

<b> What to do if that is not the case?</b>
It depends on the root cause, but there are a few options. Nonlinear transformations of the variables, excluding specific variables (such as long-tailed variables), or removing outliers may solve this problem.

### 3. No (perfect) multicollinearity

The independent variables should not be highly correlated with each other. The presence of such correlation is known as multicollinearity.

In other words, the features should be linearly independent. What does that mean in practice? We should not be able to use a linear model to accurately predict one feature using another one. Let’s take X1 and X2 as examples of features. It could happen that X1 = 2 + 3 * X2, which violates the assumption.

We can detect multicollinearity using the variance inflation factor (VIF). The square root of a given variable’s VIF shows how much larger the standard error is, compared with what it would be if that predictor were uncorrelated with the other features in the model. If no features are correlated, then all values for VIF will be 1.

Example of VIF:

<img src="https://miro.medium.com/max/945/1*5ZIgOYIoQOXxBxBiSN3NHg.png" width=800>

<b> What to do if that is not the case?</b>

To deal with multicollinearity we should iteratively remove features with high values of VIF. A rule of thumb for removal could be VIF larger than 10 (5 is also common). Another possible solution is to use PCA to reduce features to a smaller set of uncorrelated components.
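The VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ is obtained by regressing feature $j$ on all the other features. The sketch below computes it directly with NumPy under that definition; the function name and the simulated features (with `x3` deliberately built as a near copy of `x1`) are assumptions made for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of every column of X (X must not contain a bias column)."""
    m, n = X.shape
    vifs = []
    for j in range(n):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(m), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Simulated features (hypothetical): x3 is almost a copy of x1, x2 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
x2 = rng.standard_normal(200)
x3 = x1 + 0.1 * rng.standard_normal(200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
high_vif = vifs > 10  # rule-of-thumb threshold from the text
```

With perfectly collinear features $R_j^2$ equals 1 and the VIF is undefined (division by zero), which is the extreme case the assumption rules out.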

### 4. Homoscedasticity (equal variance) of residuals

When residuals do not have constant variance (they exhibit heteroscedasticity), it is difficult to determine the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide/narrow.

We can use two statistical tests: Breusch-Pagan and Goldfeld-Quandt. In both of them, the null hypothesis assumes homoscedasticity, and a p-value below a certain level (like 0.05) indicates we should reject the null in favor of heteroscedasticity.
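As a rough sketch of the idea behind the Breusch-Pagan test, the code below computes its studentized (Koenker) form: fit OLS, regress the squared residuals on the same features, and use $n R^2$ of that auxiliary regression as the statistic. With one non-constant feature the statistic is compared against the chi-squared distribution with 1 degree of freedom, whose 5% critical value is 3.841. The helper name and the simulated data are hypothetical, and a statistics library would normally be used to obtain the p-value.

```python
import numpy as np

def breusch_pagan_lm(X, y):
    """Return n * R^2 from regressing squared OLS residuals on X.

    X must include a leading column of ones.
    """
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    squared_resid = (y - X @ theta) ** 2
    gamma, *_ = np.linalg.lstsq(X, squared_resid, rcond=None)
    ss_res = np.sum((squared_resid - X @ gamma) ** 2)
    ss_tot = np.sum((squared_resid - squared_resid.mean()) ** 2)
    return float(len(y) * (1.0 - ss_res / ss_tot))

CHI2_95_DF1 = 3.841  # 5% critical value of the chi-squared distribution, 1 d.o.f.

# Simulated data (hypothetical): same line, constant vs. growing noise.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
X = np.column_stack([np.ones_like(x), x])
noise = rng.standard_normal(200)
y_homo = 2.0 + 3.0 * x + noise        # constant error variance
y_hetero = 2.0 + 3.0 * x + x * noise  # error spread grows with x

lm_homo = breusch_pagan_lm(X, y_homo)
lm_hetero = breusch_pagan_lm(X, y_hetero)
reject_homoscedasticity = lm_hetero > CHI2_95_DF1
```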

Example of dataset showing homoscedastic variance

<img src="https://miro.medium.com/max/945/1*oy8od7F8VaXARwHyM3DhOw.png" width=400>

Example of dataset showing heteroscedastic variance

<img src="https://miro.medium.com/max/945/1*Ij8bs7kdgkF15x33A1dPUA.png" width=400>

<b>What to do if that is not the case?</b>

- Transform the dependent variable so as to linearize it and dampen the heteroscedastic variance. Commonly used transforms are log(y) and square-root(y).

- Use ARCH (auto-regressive conditional heteroscedasticity) models to model the error variance. An example might be the stock market, where data can exhibit periods of increased or decreased volatility over time (volatility clustering).

- Identify important variables that may be missing from the model and that are causing the variance in the errors to develop a pattern, and add those variables to the model. Alternatively, stop using the linear model and switch to a completely different model, such as a Generalized Linear Model or a neural network.

- Simply accept the heteroscedasticity present in the residual errors.

### 5. No Autocorrelation of the Error Terms

This assumes no autocorrelation of the error terms. Autocorrelation being present typically indicates that we are missing some information that should be captured by the model. 

We can detect it by performing a Durbin-Watson test to determine if either positive or negative correlation is present. 
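The Durbin-Watson statistic itself is simple to compute: $DW = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$. It ranges from 0 to 4; values near 2 indicate no lag-1 autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation. The sketch below uses two made-up residual sequences to show both directions.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic of a residual sequence ordered in time."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Hypothetical residual sequences.
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # sign flips every step
trending = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])     # neighbours are similar

dw_alternating = durbin_watson(alternating)  # 20/6, above 2: negative autocorrelation
dw_trending = durbin_watson(trending)        # 8/28, below 2: positive autocorrelation
```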

<b> What to do if that is not the case?</b>
Adding lag variables is a simple fix for this problem (it can only be applied to time series problems). Alternatively, interaction terms, additional variables, or additional transformations may fix this.

## Hyperparameter Optimization <a class="anchor" id="9"></a>

Source: https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Vanilla linear regression doesn’t have any hyperparameters. But variants of linear regression do. Ridge regression and lasso both add a regularization term to linear regression; the weight for the regularization term is called the regularization parameter.
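To show what the regularization parameter does, here is a minimal sketch of ridge regression in closed form, $θ = (X^{T}X + αI)^{-1}X^{T}y$, with the bias term left unpenalized (a common convention, assumed here). The name `alpha`, the helper function, and the data are hypothetical; in practice the value of `alpha` is usually chosen by cross-validation.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Closed-form ridge regression; column 0 of X is assumed to be the bias column."""
    n = X.shape[1]
    penalty = alpha * np.eye(n)
    penalty[0, 0] = 0.0  # do not penalize the bias term
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Hypothetical noise-free data generated from y = 4 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 4.0 + 3.0 * x

theta_ols = ridge_closed_form(X, y, alpha=0.0)     # [4, 3], plain least squares
theta_ridge = ridge_closed_form(X, y, alpha=10.0)  # [7, 1], slope shrunk toward 0
```

With `alpha=0` the result matches ordinary least squares; increasing `alpha` shrinks the feature weight, trading a worse fit on the training data for a simpler model.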


## Solutions to Overfitting <a class="anchor" id="10"></a>

1. Regularization: Use Ridge / Elastic Net / Lasso (regularized linear models), which counteract overfitting by penalizing large weights.

2. Reduce the number of features: Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html

3. Clean and scale the data: Perform log transformations and scale the features, for example to zero mean and unit variance with StandardScaler, or with RobustScaler, which centers on the median and scales by the interquartile range and is therefore less sensitive to outliers. Linear models are affected by differences in scale.

4. If there is a mildly positive R^2 value on test data, go for Kernel Ridge Regression: After all this (cleaning the data and reducing the dimension, either via one of the methods above or just with Lasso regression, until you get to certainly fewer than 100 dimensions, possibly fewer than 20; remember that this includes any categorical data!), you should consider non-linear methods to further improve your results. That is useless, however, until your linear model gives you at least a mildly positive R^2 value on test data. sklearn provides a lot of them: Kernel Ridge Regression (http://scikit-learn.org/stable/modules/kernel_ridge.html) is the easiest non-linear method to use out of the box (it also does regularization), but it might be too slow to use.

5. Try other non-linear techniques like trees: Going truly beyond linear models, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm; gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better).

6. Finally, try Neural Networks: State-of-the-art results are indeed generally obtained via neural networks; see dedicated frameworks (TensorFlow, Caffe, PyTorch, etc.).

## Solutions to Underfitting <a class="anchor" id="11"></a>

1. Add more parameters. If you have variables a, b, etc., adding their polynomial features, i.e. a^2, a^3, ..., b^2, b^3, etc., may help. If you add enough polynomial features you should be able to overfit, although that doesn't necessarily mean the model will have a good fit on the test set (RMSE value).

2. Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (e.g. a logarithmic relationship).

3. Do you know anything about the data? Perhaps a variable that is the product or the ratio of two variables may be a good indicator.

4. If you are regularizing your regression (or if the software is applying regularization automatically), try reducing the regularization parameter.

## Pros vs Cons <a class="anchor" id="12"></a>


Source: https://medium.com/@satyavishnumolakala/linear-regression-pros-cons-62085314aef0

#### Pros

1. Simple model: The Linear regression model is the simplest equation with which the relationship between multiple predictor variables and the predicted variable can be expressed.


2. Computationally efficient: Linear regression is fast to fit because it does not require complicated calculations, and it makes predictions quickly even when the amount of data is large.


3. Interpretability of the output: The ability of Linear regression to determine the relative influence of one or more predictor variables on the predicted value, when the predictors are independent of each other, is one of the key reasons for its popularity. The fitted model expresses how much the predicted (target) variable changes for a given change in a predictor variable.

#### Cons

1. Overly simplistic: The Linear regression model is often too simplistic to capture real-world complexity.


2. Linearity assumption: Linear regression makes the strong assumption that the predictor (independent) and predicted (dependent) variables are linearly related, which may not be the case.


3. Severely affected by Outliers: Outliers can have a large effect on the output, as the Best Fit Line tries to minimize the MSE for the outlier points as well, resulting in a model that is not able to capture the information in the data.


4. Independence of variables: Assumes that the predictor variables are not correlated, which is rarely true. It is therefore important to remove multicollinearity (using dimensionality reduction techniques), because the technique assumes that there is no relationship among the independent variables. In cases of high multicollinearity, two features that have high correlation will influence each other’s weight and result in an unreliable model.


5. Assumes homoscedasticity: Linear regression models the relationship between the mean of the predicted (dependent) variable and the predictor (independent) variables, and assumes constant variance around the mean, which is unrealistic in most cases.


6. Inability to determine feature importance: As discussed in the “Independence of variables” point, in cases of high multicollinearity, two features that have high correlation will affect each other’s weight. If we run stochastic (random) linear regression multiple times, we may get different weights each time for these two features. So we cannot really interpret the importance of these features.