Linear Regression is a technique to measure the relationship between a dependent variable and one or more features.

It does so by fitting a straight line ($y = Mx + B$) through the data, finding the values of $M$ and $B$ that minimize the vertical distance between each point's $y$ value and the line at that point's $x$ value (aka the loss).

Often the distance between each point's $y$ value and the line is squared:

$(y_i - \hat{y}_i)^2$ ($\hat{y}_i$ means the predicted value)

Then the square root of the mean of all of these squared distances is calculated:

$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

This is the loss function, a measure of how well the line describes the data (lower is better). There are multiple types of loss functions, and this is one of the most common for regression. It's fittingly called Root Mean Squared Error (or RMSE).

Here is that same equation, with $\hat{y}_i$ expressed in terms of $M$ and $B$ (the parameters that describe the line):

$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - (Mx_i + B))^2}$


The values of $M$ and $B$ that minimize this function define the best-fitting line.
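To make the loss concrete, here is a minimal sketch that evaluates the RMSE formula above for a candidate line. The function name `rmse` and the data points are made up for illustration; they are not taken from any dataset in this notebook.

```python
import math

def rmse(M, B, xs, ys):
    """Root Mean Squared Error of the line y = M*x + B on the points (xs, ys)."""
    n = len(xs)
    squared_errors = [(y - (M * x + B)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / n)

# Illustrative (made-up) data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

perfect = rmse(2.0, 1.0, xs, ys)  # this line passes through every point
worse = rmse(1.5, 1.0, xs, ys)    # slope too shallow, so the loss is larger
print(perfect, worse)
```

A line that passes through every point has a loss of zero; any other choice of $M$ and $B$ gives a larger value, which is what the minimization is searching over.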

One could use Gradient Descent to estimate this, but for basic Linear Regression the minimum is often calculated directly. This is a matter of calculus: take the partial derivatives of the loss with respect to $M$ and $B$, set them equal to 0, and rearrange the terms. Using MSE (which always has its minimum at the same $M$ and $B$ as RMSE), we obtain these two equations for $M$ and $B$:

$M = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ ($\bar{x}$ means the average $x$)

$B = \bar{y} - M\bar{x}$

These equations yield the values of $M$ and $B$ that correspond to the line with the minimum error. Then it's just a matter of plugging the data into the equations. This takes $O(n)$ time, where $n$ is the number of data points.
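The two equations above can be sketched in plain Python as follows, alongside a simple gradient descent loop on the MSE for comparison. This is a sketch under stated assumptions rather than a definitive implementation: the function names, the learning rate, the step count, and the data are all made up for illustration.

```python
def fit_closed_form(xs, ys):
    """Return (M, B) minimizing MSE, using the two closed-form equations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    M = numerator / denominator  # undefined if every x value is identical
    B = y_bar - M * x_bar
    return M, B

def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Approximate (M, B) by repeatedly stepping against the MSE gradient."""
    n = len(xs)
    M, B = 0.0, 0.0
    for _ in range(steps):
        residuals = [y - (M * x + B) for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to M and B
        grad_M = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_B = -2.0 / n * sum(residuals)
        M -= learning_rate * grad_M
        B -= learning_rate * grad_B
    return M, B

# Illustrative (made-up) data, roughly following y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

M_exact, B_exact = fit_closed_form(xs, ys)
M_gd, B_gd = fit_gradient_descent(xs, ys)
print(M_exact, B_exact)
print(M_gd, B_gd)
```

The closed-form version makes a fixed number of passes over the data, while gradient descent needs many iterations and a learning rate small enough for this data to converge; on this small example both should arrive at essentially the same line.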

For Linear Regression with 

In [None]:
#Loading Data

In [None]:
#Single-Variable Linear Regression

In [None]:
#Multi-Variable Linear Regression

In [None]:
#Built-In Library Uses

Notes:

* Linear Regression estimates a line. To predict the relationship between building height and number of rooms, this model makes sense, in that there is a roughly linear relationship. For a relationship between 

* MSE, Mean Squared Error, is the primary error function used to find the optimal fit, either through direct analysis or gradient descent. It has its minimum at exactly the same $M$ and $B$ as RMSE: the square root is an increasing function on non-negative values (and the MSE can never be negative), so whatever minimizes the MSE also minimizes its square root.

* RMSE is typically used to express model accuracy, rather than MSE, because it brings the error back to the same units as the original $y$ values (rather than squared units), which yields a more intuitive measure of accuracy.

* The primary alternative to MSE/RMSE is taking the absolute difference, rather than the squared difference, between the observed and predicted $y$ values:

$|y_i - \hat{y}_i|$

The mean of these absolute differences (commonly called Mean Absolute Error, or MAE) weights all errors linearly, whereas MSE pays more attention to larger errors. Which to use depends on the nature of the data at hand, though weighting larger errors more heavily by squaring the residuals is often preferred.

* Squaring the residuals effectively serves a dual purpose: it removes the sign of the difference (as large positive and large negative $y_i - \hat{y}_i$ values both mean a large error), and it pays more attention to these differences when they are larger. The first is basically always necessary, and the second is often preferred.
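The difference between squared and absolute error described in the notes above can be sketched with a small comparison. The helper names `mse` and `mae` and the numbers are made up for illustration: two sets of predictions have the same total absolute error, but one spreads it evenly and the other concentrates it in a single point.

```python
def mse(ys, preds):
    """Mean Squared Error between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def mae(ys, preds):
    """Mean Absolute Error between observed and predicted values."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

ys_observed = [3.0, 5.0, 7.0, 9.0]
preds_even = [4.0, 6.0, 8.0, 10.0]     # every prediction is off by 1
preds_outlier = [3.0, 5.0, 7.0, 13.0]  # one prediction is off by 4

mae_even = mae(ys_observed, preds_even)
mae_outlier = mae(ys_observed, preds_outlier)
mse_even = mse(ys_observed, preds_even)
mse_outlier = mse(ys_observed, preds_outlier)
print(mae_even, mae_outlier)
print(mse_even, mse_outlier)
```

The absolute error treats the two cases identically, while the squared error penalizes the single large miss more heavily, which is the trade-off to weigh when choosing between them.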