# Week 2 - Multivariable Linear Regression

## Hypothesis

> $h_{\theta}(\mathbf{x}) = \theta^{\top}\mathbf{x} = \theta_{0}\mathbf{x}_{0} + \theta_{1}\mathbf{x}_{1} + \theta_{2}\mathbf{x}_{2} + \cdots + \theta_{n}\mathbf{x}_{n}$
>
> where $x_{0} = 1$ by convention, so $\theta_{0}$ acts as the intercept term.
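
As a minimal sketch of the hypothesis in vectorized form, assuming NumPy and a design matrix with one training example per row: the helper names `add_bias_column` and `hypothesis` are illustrative and not part of the notes.

```python
import numpy as np

def add_bias_column(X):
    """Prepend the constant feature x_0 = 1 to every example (row) of X."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(theta, X_b):
    """Return theta^T x for every row of X_b (X_b already contains x_0)."""
    return X_b @ np.asarray(theta, dtype=float)

# Two examples with n = 2 features each (made-up numbers).
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
theta = np.array([1.0, 0.5, -1.0])  # theta_0, theta_1, theta_2

predictions = hypothesis(theta, add_bias_column(X))
print(predictions)  # [-1. -2.]
```

Computing all predictions with a single matrix product avoids an explicit loop over examples.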

## Cost Function

> $\displaystyle J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(\mathbf{x}^{(i)}) - y^{(i)})^2$ 
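
The cost can be sketched in a few lines, assuming NumPy and a matrix `X_b` that already includes the $x_0 = 1$ column; `compute_cost` is a hypothetical name and the data is made up for illustration.

```python
import numpy as np

def compute_cost(X_b, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = y.shape[0]
    residuals = X_b @ theta - y
    return float(residuals @ residuals) / (2 * m)

# Three examples lying exactly on the line y = x.
X_b = np.array([[1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

cost_perfect = compute_cost(X_b, y, np.array([0.0, 1.0]))  # perfect fit
cost_zero = compute_cost(X_b, y, np.array([0.0, 0.0]))     # predicts 0 everywhere
print(cost_perfect, cost_zero)
```

With $\theta = (0, 1)$ every residual is zero, so the cost is $0$; with $\theta = (0, 0)$ it is $(1 + 4 + 9)/6 = 7/3$.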

## Gradient Descent Algorithm

Repeat until convergence, updating all $\theta_{j}$ simultaneously:

> $\displaystyle \theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial \theta_{j}} J(\theta_{0},\theta_{1},\dots,\theta_{n}), \quad j \in \{0,1,\dots,n\}$
>
> $\alpha$ is the **learning rate**.

or, with the partial derivative of the squared-error cost written out:

> $\displaystyle \theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(\mathbf{x}^{(i)}) -y^{(i)})\mathbf{x}_{j}^{(i)}$
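
The update rule can be sketched as a vectorized loop, assuming NumPy. The function name `gradient_descent`, the fixed iteration count (used here in place of a convergence test), and the toy data are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X_b, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent for linear regression.

    X_b must already contain the x_0 = 1 column. Returns the final theta
    and the cost measured just before each update.
    """
    m = y.shape[0]
    theta = np.zeros(X_b.shape[1])
    cost_history = []
    for _ in range(num_iters):
        residuals = X_b @ theta - y                  # h(x^(i)) - y^(i) for all i
        cost_history.append(float(residuals @ residuals) / (2 * m))
        gradient = X_b.T @ residuals / m             # all partial derivatives at once
        theta = theta - alpha * gradient             # simultaneous update of every theta_j
    return theta, cost_history

# Toy data generated from y = 1 + 2x.
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta, cost_history = gradient_descent(X_b, y)
print(theta)  # approximately [1. 2.]
```

Because the whole gradient vector is computed before `theta` is reassigned, every $\theta_{j}$ is updated from the same old values, which is what the simultaneous update requires.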

## Feature Scaling

> Get every feature into approximately the $-1 \leqslant x_{i} \leqslant +1$ range, which helps gradient descent converge faster.

## Mean Normalization

Make each feature have approximately zero mean and roughly a $-0.5 \leqslant x_{i} \leqslant +0.5$ range.

> $\displaystyle x_{i} \leftarrow \frac{x_{i} - \mu_{i}}{s_{i}}$
>
> where $\mu_{i}$ is the average value of $x_{i}$ in the training set,
>
> and $s_{i}$ is either the range (max - min) of the values of $x_{i}$, or its **standard deviation** when using the **standard score (z-score)**.

We **never** normalize $x_0$!
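
A minimal sketch of mean normalization, assuming NumPy and a feature matrix without the $x_0$ column (which is added only afterwards). The name `mean_normalize`, the `use_std` flag, and the sample values are hypothetical; the sketch also assumes no feature is constant, since that would make $s_{i} = 0$.

```python
import numpy as np

def mean_normalize(X, use_std=False):
    """Return (X - mu) / s per column, together with mu and s.

    s is the range (max - min) by default, or the standard deviation
    when use_std=True (z-score).
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

# Two features on very different scales (made-up values).
X = np.array([[1000.0, 1.0],
              [2000.0, 3.0],
              [3000.0, 5.0]])

X_range, mu, s = mean_normalize(X)
X_z, _, _ = mean_normalize(X, use_std=True)

# x_0 is prepended after normalization, so it stays exactly 1.
X_b = np.hstack([np.ones((X.shape[0], 1)), X_range])
print(X_range)
```

Returning `mu` and `s` lets the same transformation be applied later to new inputs before making predictions.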

## Learning Rate

Plot the cost $J(\theta)$ against the number of iterations; with a suitable $\alpha$ it should decrease on every iteration.

- if $\alpha$ is too small: slow convergence.
- if $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration; may not converge.

To choose $\alpha$, try: $0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \dots$
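
One way to compare candidate learning rates is to record the cost curve for each and check whether it ever increases. The sketch below assumes NumPy; `cost_curve`, the 50-iteration budget, and the toy data are illustrative choices. In practice the curves would be plotted rather than reduced to a flag.

```python
import numpy as np

def cost_curve(X_b, y, alpha, num_iters=50):
    """Run gradient descent from theta = 0 and return J(theta) per iteration."""
    m = y.shape[0]
    theta = np.zeros(X_b.shape[1])
    costs = []
    for _ in range(num_iters):
        residuals = X_b @ theta - y
        costs.append(float(residuals @ residuals) / (2 * m))
        theta = theta - alpha * (X_b.T @ residuals) / m
    return np.array(costs)

# Toy data generated from y = 1 + 2x.
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

is_decreasing = {}
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    costs = cost_curve(X_b, y, alpha)
    # True when the cost never goes up between consecutive iterations.
    is_decreasing[alpha] = bool(np.all(np.diff(costs) <= 0))
    print(f"alpha={alpha}: final cost={costs[-1]:.3g}, decreasing={is_decreasing[alpha]}")
```

On this particular data set the cost grows without bound for $\alpha = 1$, while the smaller candidates all decrease, the smallest ones only very slowly.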

## Polynomial Regression

> $h_{\theta}(x) = \theta_{0} + \theta_{1}x + \theta_{2}x^{2} + \theta_{3}x^3$

**Always** apply feature scaling after adding polynomial features, since powers of $x$ span very different ranges, but never scale $x_{0}$!
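
A sketch of building the cubic features and scaling them, assuming NumPy; `polynomial_features` is a hypothetical helper, and mean normalization by range is just one of the scaling options described earlier.

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Stack x, x^2, ..., x^degree as columns (x_0 is not included)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = polynomial_features(x)   # columns: x, x^2, x^3

# Scale every polynomial feature, then prepend the unscaled x_0 = 1 column.
mu = X_poly.mean(axis=0)
s = X_poly.max(axis=0) - X_poly.min(axis=0)
X_scaled = (X_poly - mu) / s
X_b = np.hstack([np.ones((x.size, 1)), X_scaled])

print(X_poly.max(axis=0))  # [ 4. 16. 64.]
```

The unscaled columns already differ by more than an order of magnitude for inputs as small as 4, which is why scaling matters here.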

## Normal Equation

> $\theta = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}, \quad \mathbf{X} \in \mathbb{R}^{m \times (n+1)}$
>
> $m$ is the number of training examples.
>
> $n$ is the number of features.
>
> $\displaystyle \mathbf{X} = \begin{bmatrix}
    1 & \mathbf{x}_{1}^{(1)} & \mathbf{x}_{2}^{(1)} & \cdots & \mathbf{x}_{n}^{(1)} \\
    1 & \mathbf{x}_{1}^{(2)} & \mathbf{x}_{2}^{(2)} & \cdots & \mathbf{x}_{n}^{(2)} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    1 & \mathbf{x}_{1}^{(m)} & \mathbf{x}_{2}^{(m)} & \cdots & \mathbf{x}_{n}^{(m)}
    \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$
>
> $\displaystyle \mathbf{y} = \begin{bmatrix}
    y^{(1)} \\
    y^{(2)} \\
    \vdots \\
    y^{(m)}
    \end{bmatrix} \in \mathbb{R}^{m}$
>    
> $\displaystyle \theta = \begin{bmatrix}
    \theta_{0} \\
    \theta_{1} \\
    \theta_{2} \\
    \vdots \\
    \theta_{n}
    \end{bmatrix} \in \mathbb{R}^{n+1}$
    
Feature scaling is not needed when using the normal equation.

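Prefer **gradient descent** over the **normal equation** when there are more than roughly 1,000 to 10,000 features, because inverting the $(n+1) \times (n+1)$ matrix $\mathbf{X}^{\top}\mathbf{X}$ costs about $O(n^{3})$.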

If $\mathbf{X}^{\top}\mathbf{X}$ is non-invertible:

- Remove redundant (linearly dependent) features.
- Delete features or use regularization if there are too many features ($m \leqslant n$).
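
A minimal sketch of the normal equation, assuming NumPy. It uses the pseudo-inverse (`np.linalg.pinv`) in place of a plain inverse so that a result is still returned when $\mathbf{X}^{\top}\mathbf{X}$ is non-invertible; the function name `normal_equation`, the `rcond` cutoff, and the toy data are assumptions made for illustration.

```python
import numpy as np

def normal_equation(X_b, y):
    """Solve theta = pinv(X^T X) X^T y.

    X_b must already contain the x_0 = 1 column. The explicit rcond
    treats tiny singular values as zero, so a singular X^T X still
    yields a (minimum-norm) least-squares solution.
    """
    return np.linalg.pinv(X_b.T @ X_b, rcond=1e-10) @ X_b.T @ y

# Toy data generated from y = 1 + 2x.
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X_b, y)
print(theta)  # approximately [1. 2.]

# Duplicating the feature makes X^T X singular (linearly dependent columns).
X_dup = np.hstack([X_b, X_b[:, [1]]])
theta_dup = normal_equation(X_dup, y)
print(X_dup @ theta_dup)  # predictions still match y
```

With the duplicated feature the individual parameters are no longer unique, but the predictions are unchanged, which illustrates why removing the redundant column loses nothing.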