# Single variable linear regression

__Notation:__
* m $\rightarrow$ number of training examples
* x $\rightarrow$ features
* y $\rightarrow$ target
* (x,y) $\rightarrow$ one training example
* (x$^{(i)}$,y$^{(i)}$) $\rightarrow$ i$^{th}$ training example ($i$ is an index into the training set)

TRAINING SET $\rightarrow$ LEARNING ALGORITHM $\rightarrow$ HYPOTHESIS

x $\rightarrow$ HYPOTHESIS $\rightarrow$ estimated value of y

__Hypothesis__: $h_\theta(x) = \theta_0 + \theta_1x$ (linear function)

The model above is called __univariate linear regression__.
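As a minimal sketch, the hypothesis can be written as a small Python function. The function name `hypothesis` and the parameter values below are assumptions chosen for illustration; they do not come from the notes.

```python
def hypothesis(theta0: float, theta1: float, x: float) -> float:
    """Univariate linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Illustrative parameters (assumed): intercept 1, slope 2.
prediction = hypothesis(1.0, 2.0, 3.0)  # 1 + 2 * 3
```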

## Cost function


__Notation:__
* $\theta$ $\rightarrow$ parameters

The cost function measures how well a given choice of parameters fits the training set; minimising it yields the values of the parameters.

Choose $\theta_0$ and $\theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$

More precisely, choose $\theta_0$ and $\theta_1$ to minimise the average squared error between the predictions $h_\theta(x^{(i)})$ on the training set and the actual values $y^{(i)}$ (e.g. the values of the houses). The average contributes the factor $\frac{1}{m}$; an extra factor of $\frac{1}{2}$ is included for easier math, since it cancels when differentiating.

$$ \min_{\theta_0, \theta_1} \frac{1}{2m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)} )^2 $$

($m$ = # of training examples)

__Cost function__: 

$$ J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits^m_{i=1} (h_\theta(x^{(i)}) -  y^{(i)} )^2 $$

Also called the __squared error cost function__ (most common cost function)

Obtain $\theta_0$, $\theta_1$, such that $J$ is minimised
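The cost function can be evaluated directly from its definition. The sketch below assumes a made-up training set lying exactly on the line $y = 2x$; the function name `cost` and the data are illustrative assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # h_theta(x) - y
        total += error ** 2
    return total / (2 * m)


# Assumed toy training set on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
perfect_fit = cost(0.0, 2.0, xs, ys)  # every error is zero, so J = 0
poor_fit = cost(0.0, 0.0, xs, ys)     # predicts 0 everywhere, so J > 0
```

Parameters that reproduce the data give $J = 0$; any other choice gives a strictly larger value on this data set.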

## Cost function intuition

This section is intended to give a better understanding of $h_\theta(x)$ and $J(\theta)$. Methodology used:
1. Reduce the hypothesis to $h_\theta(x) = \theta_1x$
2. Manually calculate h and J for a range of problems
3. Repeat for the hypothesis $h_\theta(x) = \theta_0 + \theta_1x$
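The first two steps can be reproduced in a few lines. This is a sketch under assumed toy data (three points on the line $y = x$); the name `cost_slope_only` is hypothetical.

```python
def cost_slope_only(theta1, xs, ys):
    """J(theta1) for the reduced hypothesis h(x) = theta1 * x."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed toy data on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Evaluate J for a few slopes to see the bowl shape of the cost.
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {theta1:.1f}  J = {cost_slope_only(theta1, xs, ys):.4f}")
```

On this data $J$ is zero at $\theta_1 = 1$ and grows symmetrically on either side, which is the bowl shape that gradient descent exploits.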

# Gradient descent

Some function: $J(\theta_0, \theta_1)$

Requirement: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$


Algorithm:

repeat until convergence{

$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1) $$
(for $j=0$ and $j=1$)
}

Where $\alpha$ is the learning rate

Both parameters must be __simultaneously updated__: compute both right-hand sides from the old values of $\theta_0$ and $\theta_1$ before assigning either.

temp0 $:= \theta_0 - \alpha \frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1)$

temp1 $:= \theta_1 - \alpha \frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1)$

$\theta_0$:= temp0

$\theta_1$:= temp1
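A minimal sketch of one simultaneous update step follows. The cost $J = \theta_0^2 + \theta_0\theta_1 + \theta_1^2$ and the helper names are assumptions picked so that each partial derivative depends on both parameters.

```python
def simultaneous_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One gradient descent step; both partials use the OLD parameter values."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1  # assign only after both temporaries are computed


# Partial derivatives of the assumed cost J = theta0^2 + theta0*theta1 + theta1^2.
def dJ_dtheta0(t0, t1):
    return 2 * t0 + t1


def dJ_dtheta1(t0, t1):
    return t0 + 2 * t1


theta0, theta1 = simultaneous_step(1.0, 1.0, 0.1, dJ_dtheta0, dJ_dtheta1)
```

Starting from $(1, 1)$ with $\alpha = 0.1$, both parameters move to about $0.7$. Overwriting $\theta_0$ before computing the second partial would instead give about $0.73$ for $\theta_1$, which is no longer a step along the gradient.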


If the slope of $J$ with respect to a parameter is positive, the update makes that parameter smaller. Conversely, if the slope is negative, the update makes it larger. In both cases the parameter moves towards a minimum of $J$.

Beware of slow convergence ($\alpha$ too small) and overshooting ($\alpha$ too large).
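The effect of the learning rate can be seen on a one-parameter toy problem. The sketch assumes the cost $J(\theta) = \theta^2$ (derivative $2\theta$, minimum at $0$); the specific values of $\alpha$ are illustrative.

```python
def run_gradient_descent(alpha, steps=10, theta=1.0):
    """Minimise the assumed toy cost J(theta) = theta^2 (derivative 2*theta)."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


slow = run_gradient_descent(alpha=0.01)      # alpha too small: barely moves
good = run_gradient_descent(alpha=0.4)       # reasonable alpha: close to 0
overshoot = run_gradient_descent(alpha=1.1)  # alpha too large: jumps past 0 and grows
```

After the same ten steps the small rate is still far from the minimum, the moderate rate is essentially at it, and the large rate has moved further away than where it started.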

(outside note)
Scaling the features evenly helps to equalise the effect of each feature in gradient descent.

Working out the partial terms which are used to update $\theta_0$ and $\theta_1$, the following is obtained:

$$\frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum\limits^m_{i=1} (h_\theta(x^{(i)}) -  y^{(i)} )^2 $$

$$\frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1) = \frac{1}{m}\sum\limits^m_{i=1} ( h_\theta(x^{(i)}) - y^{(i)} ) $$

$$\frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1) = \frac{1}{m}\sum\limits^m_{i=1} ( h_\theta(x^{(i)}) - y^{(i)} ) x^{(i)} $$
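The two results above follow from the chain rule. As a worked intermediate step, using only the definitions already given:

$$\frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)} )^2 = \frac{1}{m} \sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)} ) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) $$

Since $h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$, the inner derivative is $1$ for $j=0$ and $x^{(i)}$ for $j=1$. The factor $2$ from differentiating the square cancels the $\frac{1}{2}$ in the cost function.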

Substituting these derivatives into the algorithm gives:

repeat until convergence{

temp0 $:= \theta_0 - \alpha \frac{1}{m}\sum\limits^m_{i=1} ( h_\theta(x^{(i)}) - y^{(i)} )$

temp1 $:= \theta_1 - \alpha \frac{1}{m}\sum\limits^m_{i=1} ( h_\theta(x^{(i)}) - y^{(i)} ) x^{(i)}$

$\theta_0$:= temp0

$\theta_1$:= temp1

}

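For linear regression the squared error cost function is a __convex function__: it has a single global minimum and no other local optima, so gradient descent with a suitable $\alpha$ converges to the global minimum.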

__"Batch" Gradient Descent__

Each step of gradient descent uses all the training examples
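Putting the update rules together, batch gradient descent for the univariate case can be sketched as follows. The data (generated from $y = 1 + 2x$), the learning rate, and the iteration count are assumptions for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x with batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h(x_i) - y_i over the WHOLE training set (hence "batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Assumed toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
```

With these settings the parameters approach the generating values $\theta_0 = 1$ and $\theta_1 = 2$.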

# Multi variable linear regression

Using a larger number of features can allow better predictions.

__Notation:__
* $x_1$, $x_2$, ..., $x_n$ $\rightarrow$  Features
* $y$ $\rightarrow$ Variable to predict
* $n$ $\rightarrow$ Number of features
* $m$ $\rightarrow$ Number of samples
* $x^{(i)}_j$ $\rightarrow$ Value of feature $j$ of $i^{th}$ training example


## Hypothesis

$$h_{\theta}(x) =  (\theta_0 + \theta_1x_1 + ... + \theta_nx_n)$$

$\theta_j$ can be thought of as the effect of feature $j$ on the predicted variable $y$

For convenience, an $x_0$ term is defined: $$ x_0  = 1 \rightarrow x^{(i)}_0 = 1$$ for all training samples


The feature and parameter vectors will be:
$$
\underline{x} =
\begin{bmatrix}
    x_{0} \\
    x_{1} \\
    x_{2} \\
    \vdots\\
    x_{n} 
\end{bmatrix}
\in \mathbb{R}^{n+1}\;\; ; \;\;
\underline{\theta} =
\begin{bmatrix}
    \theta_{0} \\
    \theta_{1} \\
    \theta_{2} \\
    \vdots\\
    \theta_{n} 
\end{bmatrix}
\in \mathbb{R}^{n+1}$$

Since $x_0 = 1$, the hypothesis can be reformulated as:

$$h_{\theta}(x) = \theta_0x_0 + \theta_1x_1 + ... + \theta_nx_n = \sum^n_{j=0} \theta_jx_j$$

The above can be rewritten in vector notation as:

$$h_{\theta}(x) = \underline{\theta}^T \underline{x}$$

All of this can also be referred to as __multivariate linear regression__
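The vector form of the hypothesis is a dot product once $x_0 = 1$ is prepended. A minimal sketch, with assumed parameter and feature values:

```python
def hypothesis(theta, features):
    """h_theta(x) = theta^T x, with x_0 = 1 prepended to the raw features."""
    x = [1.0] + list(features)  # add the convenience term x_0 = 1
    if len(theta) != len(x):
        raise ValueError("theta needs one entry per feature plus theta_0")
    return sum(t * xj for t, xj in zip(theta, x))


# Illustrative values (assumed): theta = [theta_0, theta_1, theta_2], two features.
theta = [1.0, 2.0, 3.0]
prediction = hypothesis(theta, [4.0, 5.0])  # 1*1 + 2*4 + 3*5
```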

# Gradient descent for multiple variables

__Hypothesis__: $h_{\theta}(x) = \underline{\theta}^T \underline{x}$

__Cost function__: $J(\vec{\theta}) =\frac{1}{2m} \sum\limits^{m}_{i=1}(h_\theta(x^{(i)}) - y^{(i)} )^2 $

__Algorithm__:

Repeat until convergence{

$ \theta_j := \theta_j - \alpha\frac{1}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)} )x^{(i)}_j $

(simultaneously update for j=0, 1, ..., n)

}
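The same algorithm for any number of features can be sketched in plain Python. The data below is an assumption (generated from $y = 1 + 2x_1 + 3x_2$), as are the learning rate and iteration count.

```python
def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for h(x) = theta^T x.

    X is a list of rows, each already starting with x_0 = 1.
    """
    m, n_plus_1 = len(X), len(X[0])
    theta = [0.0] * n_plus_1
    for _ in range(iterations):
        # Error h(x_i) - y_i for every training example.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        # Partial derivative of J with respect to each theta_j.
        gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m
                    for j in range(n_plus_1)]
        # Building a new list updates every theta_j simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Assumed toy data from y = 1 + 2*x1 + 3*x2 (first column is x_0 = 1).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
theta = gradient_descent(X, y)
```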

## Feature scaling

Make sure features are on a similar scale; gradient descent then tends to converge in fewer iterations.

## Mean normalization

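Replace $x_i$ with $(x_i - \mu_i)$ to make features have approximately zero mean (do not apply this to $x_0 = 1$). Dividing by the spread also puts the features on a similar scale:

$$ x_i := \frac{x_i - \mu_i}{\sigma_i} $$

where $\mu_i$ is the mean and $\sigma_i$ is the standard deviation of feature $i$ over the training set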
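A short sketch of this normalisation for one feature column, using only the standard library. The column of values is made up, and the population standard deviation is an assumed choice for $\sigma_i$.

```python
from statistics import mean, pstdev


def normalize_feature(values):
    """Return (x - mu) / sigma for one feature column, plus mu and sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; assumed non-zero
    return [(v - mu) / sigma for v in values], mu, sigma


# Made-up feature column (e.g. house sizes); not data from the notes.
sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
scaled, mu, sigma = normalize_feature(sizes)
```

The returned `mu` and `sigma` should be kept, because any new input has to be transformed with the same values before it is passed to the hypothesis.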

## Convergence of gradient descent

The value of each of the parameters should converge as the number of iterations increases. 

Furthermore the cost function should also decrease with the number of iterations.

(personal note) Can be automated by checking the change in the cost function between successive iterations, e.g. stopping once the decrease falls below a small threshold.
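One way to automate the convergence check is to record $J$ after every iteration and stop once it decreases by less than a small threshold. The sketch below assumes toy data from $y = 1 + 2x$; the threshold `epsilon` and the function names are hypothetical choices.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def descend_until_converged(xs, ys, alpha=0.1, epsilon=1e-9, max_iterations=100_000):
    """Run gradient descent until J decreases by less than epsilon in one step."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = [cost(theta0, theta1, xs, ys)]
    for _ in range(max_iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
        if history[-2] - history[-1] < epsilon:  # also stops if J increases
            break
    return theta0, theta1, history


# Assumed toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1, history = descend_until_converged(xs, ys)
```

Plotting `history` against the iteration number gives the usual diagnostic curve; if it ever rises, $\alpha$ is probably too large.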

## Choosing $\alpha$

Try values on a roughly threefold scale:

0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
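A simple way to compare candidate learning rates is to run a fixed number of iterations with each and look at the resulting cost. This is a sketch on assumed toy data; the iteration budget of 50 and the function name are illustrative.

```python
def cost_after_training(alpha, xs, ys, iterations=50):
    """Run a fixed number of gradient descent steps and return the final J."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
results = {alpha: cost_after_training(alpha, xs, ys) for alpha in candidates}
best_alpha = min(results, key=results.get)
```

On this data the largest candidate diverges, the smallest barely reduces the cost, and the best result comes from one of the values in between. The best value depends on the data and on feature scaling, so the sweep has to be repeated per problem.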

# Features & Polynomial regression

It may be useful to create a new feature to aid the linear regression.

Again, it may be useful to modify the model (linear, quadratic, exponential, ...) to better fit the data. With higher order terms, feature scaling becomes key.
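Polynomial regression can reuse the linear machinery by treating powers of a feature as new features. The sketch below, with made-up values, also shows why scaling matters: the ranges of the columns differ by orders of magnitude.

```python
def polynomial_features(x, degree):
    """Map a scalar x to the feature list [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]


# Made-up raw feature values (assumed for illustration).
sizes = [10.0, 20.0, 30.0]
rows = [polynomial_features(s, 3) for s in sizes]

# (min, max) of each new feature column: x, x^2 and x^3.
ranges = [(min(col), max(col)) for col in zip(*rows)]
```

Here $x$ spans 10 to 30 while $x^3$ spans 1000 to 27000, so the columns should be normalised before running gradient descent.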

## Normal equation

Solve analytically for the $\vec{\theta}$ that minimises $J(\vec{\theta})$, where $\vec{\theta} \in \mathbb{R}^{n+1}$ 

__Cost function__: $ J(\theta_0, \theta_1, ..., \theta_n) = \frac{1}{2m} \sum\limits^m_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})^2$

We define the design matrix $\underline{X}$ and the vector $\vec{y}$. $\underline{X}$ has a size of $m \times (n+1)$ (# of training examples $\times$ # of features + $x_0$). $\vec{y}$ is an $m$ dimensional vector (# of training examples)

For the case of $m$ training examples $(x^{(1)}, y^{(1)})$, ..., $(x^{(m)}, y^{(m)})$

And $n$ features $ x^{(i)} = 
\begin{bmatrix}
    x^{(i)}_{0} \\
    x^{(i)}_{1} \\
    x^{(i)}_{2} \\
    \vdots\\
    x^{(i)}_{n} 
\end{bmatrix}
\in \mathbb{R}^{n+1} $

The design matrix $\underline{X}$ and the vector $\vec{y}$ would be:

$$
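\underline{X} = 
\begin{bmatrix}
    (x^{(1)})^T \\
    (x^{(2)})^T \\
    \vdots\\
    (x^{(m)})^T
\end{bmatrix} 
=
\begin{bmatrix}
    x^{(1)}_{0} & x^{(1)}_{1} & \dots & x^{(1)}_{n} \\
    x^{(2)}_{0} & x^{(2)}_{1} & \dots & x^{(2)}_{n} \\
    \vdots & \vdots & \ddots & \vdots \\
    x^{(m)}_{0} & x^{(m)}_{1} & \dots & x^{(m)}_{n}
\end{bmatrix} 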
\;\; ; \;\;
\vec{y} = 
\begin{bmatrix}
    y^{(1)} \\
    y^{(2)} \\
    y^{(3)} \\
    \vdots\\
    y^{(m)} 
\end{bmatrix} 
$$

By calculating $ \vec{\theta} = (\underline{X}^{T} \underline{X})^{-1} \underline{X}^{T} \vec{y}$, the value of $\vec{\theta}$ which minimises the cost function is obtained.
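As a minimal sketch, the normal equation can be evaluated in plain Python. Note the swapped-in technique: instead of forming the inverse explicitly, the code solves the linear system $(\underline{X}^{T} \underline{X})\vec{\theta} = \underline{X}^{T} \vec{y}$ by Gauss-Jordan elimination, which gives the same $\vec{\theta}$ when $\underline{X}^{T} \underline{X}$ is invertible. The function name and the toy data are assumptions.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination.

    X is a list of m rows, each with n+1 entries starting with x_0 = 1.
    Assumes X^T X is invertible (e.g. no redundant features).
    """
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y], built entry by entry.
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Assumed toy data from y = 1 + 2*x1 + 3*x2 (first column is x_0 = 1).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
theta = normal_equation(X, y)
```

With NumPy arrays the same computation is usually written as `np.linalg.solve(X.T @ X, X.T @ y)`. Unlike gradient descent, no learning rate or iteration count is needed.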