# Linear Regression

Linear regression is one of the simplest algorithms used in machine learning. Given a set of input features, it predicts the expected output as a weighted sum of those inputs. A simple regression model, with a single feature, can be written as:

$$\hat{y} = \theta_{0} + \theta_{1}x_{1}$$

Here, $\hat{y}$ is the predicted value, $\theta_{0}$ is the bias term, and $\theta_{1}$ is the parameter (weight) applied to the first feature value, $x_{1}$. Linear regression models can have more than one feature; for $n$ features, the model can be rewritten as:

$$\hat{y} = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} + \dots + \theta_{n}x_{n}$$

This expression can be simplified to its vectorized form:

$$\hat{y} = \theta^{T} \cdot x$$

The vector $\theta$ contains all the parameters of the linear regression model, so that $\theta^{T} = [\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}, \dots, \theta_{n}]$. The transposition turns $\theta$ into a row vector of shape $[1, n + 1]$, which is what makes the product with $x$, a column vector of shape $[n + 1, 1]$, well defined. The result of this product is a single value, $\hat{y}$. Here, $x$ holds the feature values of a single instance, with a constant $x_{0} = 1$ prepended so that the bias term $\theta_{0}$ is always added, regardless of the feature values. Then, $x$ would be $x^{T} = [1, x_{1}, x_{2}, x_{3}, \dots, x_{n}]$.
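The vectorized form can be sketched in a few lines of NumPy. The parameter and feature values below are made up purely for illustration; the point is only that prepending $x_{0} = 1$ lets a single dot product reproduce the expanded sum.

```python
import numpy as np

# Hypothetical parameters for a model with n = 3 features (theta[0] is the bias).
theta = np.array([0.5, 2.0, -1.0, 3.0])

# Feature values x_1, x_2, x_3 of a single instance (illustrative numbers).
features = np.array([1.0, 4.0, 2.0])

# Prepend x_0 = 1 so that the bias is picked up by the dot product.
x = np.concatenate(([1.0], features))

# Vectorized form: y_hat = theta^T . x
y_hat = theta @ x

# Expanded form, written term by term, for comparison.
y_hat_expanded = (theta[0]
                  + theta[1] * features[0]
                  + theta[2] * features[1]
                  + theta[3] * features[2])

print(y_hat, y_hat_expanded)
```

Both expressions give the same prediction, which is why the vectorized form is usually preferred once the number of features grows.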

The most common loss function used in regression models is the Root Mean Square Error (RMSE). However, to make calculations easier, the Mean Square Error (MSE) is usually minimized instead in linear regression. Both lead to the same result: the RMSE is the square root of the MSE, and the square root is monotonically increasing, so the parameter values that minimize the MSE also minimize the RMSE.

The MSE can be calculated as:

$$MSE = \frac{1}{m} \sum^{m}_{i = 1} (\theta^{T} \cdot x^{(i)} - y^{(i)})^{2}$$

Here, $\hat{y}^{(i)} = \theta^{T} \cdot x^{(i)}$ is the value predicted by the model for the $i$-th instance, $y^{(i)}$ is the true value associated with the inputs $x^{(i)}$, and $m$ is the number of instances.
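As a minimal sketch, the MSE formula can be computed with NumPy as follows. The function name `mse`, the toy data, and the candidate parameter values are assumptions made for this example. The last lines also check, on a small grid of candidate slopes, that the MSE and the RMSE are minimized by the same parameter value.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of the linear model theta over the rows of X."""
    residuals = X @ theta - y  # theta^T . x^(i) - y^(i) for every instance i
    return np.mean(residuals ** 2)

# Illustrative data: m = 4 instances, one feature; the first column is x_0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # generated by y = 1 + 2 * x_1

perfect = np.array([1.0, 2.0])  # the parameters that generated the data
off = np.array([3.0, 2.0])      # bias too large by 2

mse_perfect = mse(perfect, X, y)
mse_off = mse(off, X, y)
rmse_off = np.sqrt(mse_off)

# Grid over candidate slopes (bias fixed at 1): both losses pick the same one.
slopes = np.linspace(0.0, 4.0, 41)
mses = np.array([mse(np.array([1.0, s]), X, y) for s in slopes])
rmses = np.sqrt(mses)
best_by_mse = slopes[np.argmin(mses)]
best_by_rmse = slopes[np.argmin(rmses)]

print(mse_perfect, mse_off, rmse_off, best_by_mse, best_by_rmse)
```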

The optimal $\theta$, denoted $\theta^{*}$, is the set of parameters that results in the smallest MSE. Since the MSE is our loss function (the value we seek to minimize), $\theta^{*}$ can be written as:

$$\theta^{*} = \arg\min_{\theta} \text{MSE} = \arg\min_{\theta} \left(\frac{1}{m} \sum^{m}_{i = 1} (\theta^{T} \cdot x^{(i)} - y^{(i)})^{2}\right)$$

Because the MSE is a convex function of $\theta$, the point where the loss is minimal is the point at which its derivative with respect to $\theta$ is equal to zero. Therefore:

$$
\begin{aligned}
\frac{\partial MSE}{\partial \theta} = 0 \\
\frac{\partial (\frac{1}{m} \sum^{m}_{i = 1} (\theta^{T} \cdot x^{(i)} - y^{(i)})^{2})}{\partial \theta} = 0
\end{aligned}
$$
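The zero-derivative condition can be checked numerically. The sketch below assumes the standard closed form of the MSE gradient for a linear model, $\frac{2}{m} X^{T}(X\theta - y)$, where $X$ stacks the instances $x^{(i)}$ as rows. It compares that expression against finite differences and confirms that it vanishes at the parameters that generated the (noise-free, synthetic) data. All function names and data here are illustrative assumptions.

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def mse_gradient(theta, X, y):
    """Analytic gradient of the MSE: (2 / m) * X^T (X theta - y)."""
    m = X.shape[0]
    return (2.0 / m) * (X.T @ (X @ theta - y))

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Synthetic, noise-free data so that the minimum sits exactly at true_theta.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta

theta = np.zeros(3)  # an arbitrary starting point
analytic = mse_gradient(theta, X, y)
numeric = numerical_gradient(lambda t: mse(t, X, y), theta)
grad_at_min = mse_gradient(true_theta, X, y)

print(analytic, numeric, grad_at_min)
```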

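If we treat $\theta$ as a single scalar variable, as in single-variable calculus, instead of a vector (that is, a model with one feature and no bias term), the condition $\frac{\partial MSE}{\partial \theta} = 0$ simplifies to: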

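$$
\begin{aligned}
\frac{1}{m} \sum^{m}_{i = 1} -2(y^{(i)} - \hat{\theta} x^{(i)})x^{(i)} = 0 \\
-\frac{2}{m} \sum^{m}_{i = 1} (y^{(i)} - \hat{\theta} x^{(i)})x^{(i)} = 0 \\
\hat{\theta} \sum^{m}_{i = 1} (x^{(i)})^{2} - \sum^{m}_{i = 1} x^{(i)}y^{(i)} = 0 \\
\hat{\theta} = \frac{\sum^{m}_{i = 1} x^{(i)}y^{(i)}}{\sum^{m}_{i = 1} (x^{(i)})^{2}}
\end{aligned}
$$

Here $\hat{\theta}$ denotes the value of $\theta$ that satisfies the condition, that is, the optimal $\theta^{*}$. Applying the same steps with $\theta$ as a vector gives the general solution, known as the normal equation, where $X$ is the matrix whose rows are the instances $(x^{(i)})^{T}$ and $Y$ is the vector of true values $y^{(i)}$. It requires $X^{T} \cdot X$ to be invertible:

$$\hat{\theta} = (X^{T} \cdot X)^{-1} \cdot X^{T} \cdot Y$$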
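The closed-form solution can be tried out on synthetic data. The sketch below is one possible way to do it, not the only one: the data-generating values (bias 4, slope 3, Gaussian noise) are assumptions chosen for the example, and `np.linalg.solve` is used instead of an explicit matrix inverse, which computes the same $\hat{\theta}$ and is generally better behaved numerically.

```python
import numpy as np

# Synthetic data (illustrative): y = 4 + 3 * x_1 + noise.
rng = np.random.default_rng(42)
m = 50
x1 = rng.uniform(0, 10, size=m)
y = 4.0 + 3.0 * x1 + rng.normal(scale=0.5, size=m)

# Matrix X: one row per instance, with x_0 = 1 in the first column.
X = np.column_stack([np.ones(m), x1])

# Normal equation, theta_hat = (X^T X)^{-1} X^T Y, solved as a linear system.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Scalar special case (one feature, no bias): sum(x * y) / sum(x^2).
theta_scalar = np.sum(x1 * y) / np.sum(x1 ** 2)

# The same special case through the matrix formula, with X of shape [m, 1].
X_nobias = x1.reshape(-1, 1)
theta_scalar_matrix = np.linalg.solve(X_nobias.T @ X_nobias, X_nobias.T @ y)

print(theta_hat, theta_lstsq, theta_scalar, theta_scalar_matrix)
```

With a single column in $X$, the matrix expression reduces to the scalar ratio, which is the link between the two forms of $\hat{\theta}$.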