# Training Models
Two different ways to train linear regression models:
1. Use a "closed-form" equation that directly computes the model parameters that best fit the model to the training set (i.e. the model parameters that minimise the cost function over the training set).
2. Use an iterative optimisation approach called _gradient descent_ that gradually tweaks the model parameters to minimise the cost function over the training set, eventually converging to the same set of parameters as the first method.
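
As a rough illustration of the two approaches above, the sketch below fits the same small dataset both ways and prints the two parameter vectors side by side. The synthetic data, the learning rate `eta` and the iteration count are all assumptions chosen for the example, not values from these notes.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(0)
m = 100
x = 2 * rng.random((m, 1))
y = 4 + 3 * x + rng.normal(0, 1, (m, 1))

# Add x_0 = 1 to every instance so the bias term is part of theta
X_b = np.c_[np.ones((m, 1)), x]

# 1. Closed-form solution: computes the parameters in one step
theta_closed = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# 2. Batch gradient descent: start from zeros and step against
#    the gradient of the MSE cost function
eta = 0.1            # learning rate (assumed value)
n_iterations = 5000  # assumed value, large enough to converge here
theta_gd = np.zeros((2, 1))
for _ in range(n_iterations):
    gradients = (2 / m) * X_b.T @ (X_b @ theta_gd - y)
    theta_gd -= eta * gradients

print("closed form:     ", theta_closed.ravel())
print("gradient descent:", theta_gd.ravel())
```

With these settings the two results should agree to several decimal places. A learning rate that is too large makes gradient descent diverge instead.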

Polynomial regression is a more complex model that can fit non-linear datasets. Since it has more parameters than Linear Regression, it is more prone to overfitting the training data. _Learning curves_ can be used to detect whether this is the case, and _regularisation techniques_ can reduce the risk of overfitting.

Two common models used for classification are: Logistic Regression and Softmax Regression.

### Linear Regression
The first linear regression model of life satisfaction was given by the equation 

$$life\_satisfaction = \theta_{0} + \theta_{1} \times GDP\_per\_capita$$ 

This model is a linear function of the input feature `GDP_per_capita`. $\theta_{0}$ and $\theta_{1}$ are the model's parameters. More generally, a linear model makes a prediction by computing a weighted sum of the input features plus a constant called the _bias term_ (or the _intercept term_).

$$\hat{y} = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$$

Where $\hat{y}$ is the predicted value, $n$ is the number of features, $x_{i}$ is the $i^{th}$ feature value, and $\theta_{j}$ is the $j^{th}$ model parameter (including the bias term $\theta_{0}$ and the feature weights $\theta_{1}, \theta_{2}, ..., \theta_{n}$).

The equivalent vectorised form of the above equation is

$$\hat{y} = h_{\theta}(x) = \mathbf{\theta} \bullet \mathbf{x} $$

Where $\theta$ is the model's _parameter vector_, containing the bias term $\theta_{0}$ and the feature weights $\theta_{1}$ to $\theta_{n}$. $\mathbf{x}$ is the instance's _feature vector_, containing $x_{0}$ to $x_{n}$, with $x_{0}$ always equal to 1. $\theta \bullet \mathbf{x}$ is the dot product of the vectors $\theta$ and $\mathbf{x}$, which is equal to $\theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$, and $h_{\theta}$ is the _hypothesis function_ using the model parameters $\theta$.

Note that in machine learning, vectors are usually stored as _column vectors_. When $\theta$ and $\mathbf{x}$ are both column vectors, the prediction is written as a matrix multiplication in which $\theta$ is transposed: $\hat{y} = \theta^{T}\mathbf{x}$.
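
The equivalence between the weighted sum and the vectorised form can be checked with a minimal NumPy sketch. The parameter and feature values below are hypothetical, picked only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical values: a bias term and two feature weights, as a column vector
theta = np.array([[4.0], [3.0], [-2.0]])

# Feature vector for one instance, with x_0 = 1 for the bias term
x = np.array([[1.0], [0.5], [2.0]])

# Weighted sum written out term by term: theta_0*x_0 + theta_1*x_1 + theta_2*x_2
y_hat_sum = sum(theta[j, 0] * x[j, 0] for j in range(len(theta)))

# Vectorised form: theta^T x gives a 1x1 matrix, so extract the scalar
y_hat_vec = (theta.T @ x).item()

print(y_hat_sum, y_hat_vec)
```

Both expressions compute the same prediction. The vectorised form is usually preferred because it extends directly to a whole matrix of instances.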

Recall that a linear regression model is commonly evaluated with the Root Mean Square Error (RMSE), so training the model means finding the value of $\theta$ that minimises the RMSE. In practice, the Mean Squared Error (MSE) is easier to minimise than the RMSE and leads to the same result (the value that minimises the MSE also minimises its square root).

The MSE of a Linear Regression model with hypothesis $h_{\theta}$ on a training set $\mathbf {X}$ is calculated by:
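$$ MSE(\mathbf{X}, h_{\theta}) = \frac{1}{m} \sum^{m}_{i=1} (\theta^{T}\mathbf{x}^{(i)}-y^{(i)})^2$$

Where $m$ is the number of instances in the training set, $\mathbf{x}^{(i)}$ is the feature vector of the $i^{th}$ instance and $y^{(i)}$ is its target value.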
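
The MSE formula above can be sketched in a few lines of NumPy. The helper name `mse` and the tiny dataset are assumptions made for this example. Each row of `X_b` is one instance with $x_{0} = 1$ already included.

```python
import numpy as np

def mse(X_b, y, theta):
    """Mean squared error of a linear model (hypothetical helper).

    X_b   : (m, n+1) matrix, one instance per row, first column all ones
    y     : (m, 1) column vector of target values
    theta : (n+1, 1) column vector of model parameters
    """
    m = len(y)
    errors = X_b @ theta - y  # theta^T x - y for every instance at once
    return float((errors ** 2).sum() / m)

# Tiny made-up dataset that follows y = 1 + 2x exactly
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])

theta_exact = np.array([[1.0], [2.0]])  # reproduces the targets exactly
theta_off = np.array([[0.0], [2.0]])    # every prediction is too low by 1

print(mse(X_b, y, theta_exact))
print(mse(X_b, y, theta_off))
print(np.sqrt(mse(X_b, y, theta_off)))  # the RMSE is the square root of the MSE
```

Because the square root is an increasing function, the parameter vector with the lower MSE also has the lower RMSE.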

### The Normal Equation
The value of $\theta$ that minimises the cost function can be found with a _closed-form solution_ (a mathematical equation that gives the result directly). This equation is called the _Normal Equation_:
$$ \hat{\theta} = (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T} \mathbf{y} $$
Where \hat