### Linear regression

#### The hypothesis function

$$h_\theta (x) = \theta_0 + \theta_1 x$$

#### Cost function

$$J(\theta_{0},\theta_{1})=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2}$$
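As a minimal sketch of how this cost could be evaluated, the snippet below uses a made-up three-point data set that lies exactly on $y = 1 + 2x$. The helper name `computeCost` and the data are assumptions for illustration only.

```matlab
% Toy data (made up for illustration): y = 1 + 2x exactly
x = [1; 2; 3];
y = [3; 5; 7];
m = length(y);
X = [ones(m, 1), x];   % column of ones multiplies theta_0

% Hypothetical helper: J(theta) = 1/(2m) * sum((X*theta - y).^2)
computeCost = @(X, y, theta) sum((X * theta - y) .^ 2) / (2 * length(y));

J_fit  = computeCost(X, y, [1; 2]);   % perfect fit, so the cost is 0
J_zero = computeCost(X, y, [0; 0]);   % (9 + 25 + 49) / 6
```

The cost is zero only when every prediction matches its target, and grows quadratically with the residuals otherwise.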

#### Gradient descent
Update $\theta_0$ and $\theta_1$ simultaneously, until convergence:
$$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1})$$

With multiple variables, evaluating the partial derivative gives the following update rule.
Repeat until convergence: {
$$\theta_{j}:=\theta_{j}-\frac{\alpha}{m}\sum_{i=1}^{m}\left(\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}\right)$$
}

Here, $i = 1, \dots, m$ indexes the $m$ observations, $j$ indexes the variables (the first one is the constant 1 that multiplies the intercept), and $\alpha$ is the learning rate (step size).
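The differentiation step can be written out as follows, assuming the general linear hypothesis $h_\theta(x)=\sum_k \theta_k x_k$ with the constant feature equal to 1 (this notation is an assumption, it is not spelled out in these notes):

$$\frac{\partial}{\partial\theta_{j}}J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}2\left(h_{\theta}(x^{(i)})-y^{(i)}\right)\frac{\partial}{\partial\theta_{j}}h_{\theta}(x^{(i)})=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}$$

The factor 2 from the square cancels the $\frac{1}{2}$ in the cost function, and $\partial h_\theta(x)/\partial\theta_j = x_j$ because $h_\theta$ is linear in $\theta$.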

Variable (feature) **scaling** can help gradient descent converge faster.
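One common way to scale features is standardization (subtract the column mean, divide by the column standard deviation). The sketch below assumes that choice and uses a made-up feature matrix; the notes do not say which scaling method is intended.

```matlab
% Made-up feature matrix: two columns on very different scales
X = [2104 3; 1600 3; 2400 4; 1416 2];

mu    = mean(X);   % per-column mean (1 x n)
sigma = std(X);    % per-column standard deviation (1 x n)

% bsxfun applies the row vectors mu and sigma to every row of X
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
```

After this step every column has mean 0 and standard deviation 1. The same `mu` and `sigma` would need to be reused to scale any new input before making a prediction.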

```
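% MATLAB code
% Assumes X is m x 2 (first column all ones), y is m x 1, and that
% alpha (learning rate) and theta (2 x 1 initial guess) are already defined.
m = length(y); % number of training examples
num_iters = 1000;

% gradient descent to find the thetas that minimize the cost
J_history = zeros(num_iters, 1); % cost after each update of theta
for iter = 1:num_iters
    diffe = X * theta - y; % residuals, computed once from the current theta
    % diffe is fixed before either update, so both thetas are updated simultaneously
    theta(1) = theta(1) - (alpha/m) * sum(diffe .* X(:, 1));
    theta(2) = theta(2) - (alpha/m) * sum(diffe .* X(:, 2));
    J_history(iter) = sum((X * theta - y).^2)/(2*m);
end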
```
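The per-parameter updates can also be collapsed into one matrix expression, which works for any number of variables. The following is a sketch on made-up data that lies exactly on $y = 1 + 2x$; the values of `alpha` and `num_iters` are assumptions picked so that this small example converges.

```matlab
% Toy data (made up for illustration): y = 1 + 2x exactly
x = [1; 2; 3];
y = [3; 5; 7];
m = length(y);
X = [ones(m, 1), x];       % first column of ones for the intercept

theta = zeros(2, 1);       % initial guess
alpha = 0.1;               % assumed learning rate
num_iters = 5000;          % assumed iteration count
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    % X' * (X*theta - y) stacks the sums over i for every j at once
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    J_history(iter) = sum((X * theta - y) .^ 2) / (2 * m);
end
```

On this data `theta` approaches `[1; 2]`, and plotting `J_history` is a simple way to check that the cost keeps decreasing for the chosen `alpha`.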

**Normal equation**: $\theta = (X^TX)^{-1}X^Ty$.

When the number of features $n$ is large, the normal equation becomes very slow because it requires inverting the $n \times n$ matrix $X^TX$ (roughly $O(n^3)$ work). Gradient descent still works well in that case.
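A minimal sketch of the normal equation on made-up data is shown below. Using `pinv` instead of `inv` is an assumption made here so the code still returns a result when $X^TX$ is singular; the backslash operator is shown as the usual least-squares alternative.

```matlab
% Toy data (made up for illustration): y = 1 + 2x exactly
x = [1; 2; 3];
y = [3; 5; 7];
X = [ones(length(y), 1), x];

% Normal equation: theta = (X'X)^(-1) X'y
theta = pinv(X' * X) * X' * y;

% Equivalent least-squares solve without forming an explicit inverse
theta_ls = X \ y;
```

Both results should match the parameters that gradient descent converges to, with no learning rate and no iterations to tune.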

### Logistic regression
The outcomes are either 1 or 0, so we need $0\leq h_\theta(x) \leq 1$. Logistic regression achieves this by passing $\theta^Tx$ through the sigmoid (logistic) function $g$:
$$h_{\theta}(x)=g(\theta_0 + \theta_1x_1 + \theta_2x_2 +\cdots) = g(\theta^{T}x)=P(y=1|x;\theta)$$
$$g(z)=\frac{1}{1+e^{-z}}$$
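The sigmoid can be sketched as a one-line anonymous function. The name `sigmoid` is chosen here for illustration.

```matlab
% Element-wise logistic function g(z) = 1 / (1 + e^(-z))
sigmoid = @(z) 1 ./ (1 + exp(-z));

g0 = sigmoid(0);            % exactly 0.5
g  = sigmoid([-10 0 10]);   % close to 0, 0.5, close to 1
```

Because the output always lies strictly between 0 and 1, it can be read as the probability $P(y=1|x;\theta)$.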

#### Cost function
If we reuse the squared-error cost function from linear regression, $J(\theta)$ becomes non-convex in $\theta$ and can have many local minima, so gradient descent is not guaranteed to reach the global minimum.

Logistic regression cost function:
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\textrm{Cost}(h_{\theta}(x^{(i)}),y^{(i)})$$

$$\textrm{Cost}(h_{\theta}(x),y)=\begin{cases}
-\log(h_{\theta}(x)) & \text{if } y=1\\
-\log(1-h_{\theta}(x)) & \text{if } y=0
\end{cases}$$

This way, if $y = 1$ but the prediction $h_\theta(x)$ approaches 0, the cost approaches infinity (and likewise if $y = 0$ but $h_\theta(x)$ approaches 1).

Put both conditions together (y can only be either 1 or 0):

$$\textrm{Cost}(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))$$

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$$
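Averaging the per-example cost over the training set can be sketched as below. The helper names `sigmoid` and `costFn` and the data are assumptions for illustration.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical helper: mean of -y*log(h) - (1-y)*log(1-h) over all examples
costFn = @(theta, X, y) -mean(y .* log(sigmoid(X * theta)) ...
                              + (1 - y) .* log(1 - sigmoid(X * theta)));

% Made-up data: first column of ones, one feature, binary labels
X = [1 0.5; 1 1.5; 1 3.0];
y = [0; 0; 1];

J0 = costFn([0; 0], X, y);   % with theta = 0 every h is 0.5, so J0 = log(2)
```

The value $\log 2 \approx 0.693$ is a handy sanity check: it is the cost of a model that predicts probability 0.5 for every example.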

#### Gradient descent
Our **goal** is to find the $\theta$ that minimizes $J(\theta)$: $\min_\theta J(\theta)$.

We need code that can compute $J(\theta)$ and $\partial J(\theta)/\partial \theta_j$.

Repeat until convergence, simultaneously updating all $\theta_j$:
$$\theta_{j}:=\theta_{j}-\frac{\alpha}{m}\sum_{i=1}^{m}\left(\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}\right)$$

It looks identical to the linear regression update, but here $h_\theta (x) = \frac{1}{1+e^{-\theta^Tx}}$ instead of $\theta^Tx$.
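A minimal sketch of this update loop on made-up data follows. The data, `alpha`, and `num_iters` are assumptions chosen only so that the example runs and the cost decreases.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up, non-separable data with one feature
x = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0];
y = [0; 0; 1; 0; 1; 1];
m = length(y);
X = [ones(m, 1), x];

cost = @(theta) -mean(y .* log(sigmoid(X * theta)) ...
                      + (1 - y) .* log(1 - sigmoid(X * theta)));

theta = zeros(2, 1);
alpha = 0.1;        % assumed learning rate
num_iters = 2000;   % assumed iteration count
J_start = cost(theta);

for iter = 1:num_iters
    h = sigmoid(X * theta);           % predictions in (0, 1)
    grad = (X' * (h - y)) / m;        % same form as linear regression
    theta = theta - alpha * grad;     % simultaneous update of all theta_j
end
J_end = cost(theta);
```

The only line that differs from the linear regression loop is the computation of `h`.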

#### Optimization algorithms

- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

The last three (conjugate gradient, BFGS, and L-BFGS) do not require manually picking $\alpha$ and are often faster than gradient descent, but they are more complex.
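In MATLAB (with the Optimization Toolbox) or Octave, one way to use such a solver is `fminunc`, which picks its own step sizes. The sketch below is an assumption about how these notes might apply it: it passes only the cost function and lets the solver approximate the gradient numerically, and the data are made up. Supplying an analytic gradient through the `'GradObj'` option is also possible but needs a function file with two outputs.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up, non-separable data with one feature
x = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0];
y = [0; 0; 1; 0; 1; 1];
X = [ones(length(y), 1), x];

cost = @(theta) -mean(y .* log(sigmoid(X * theta)) ...
                      + (1 - y) .* log(1 - sigmoid(X * theta)));

options = optimset('MaxIter', 400);        % cap the number of iterations
initial_theta = zeros(2, 1);
[theta, J] = fminunc(cost, initial_theta, options);
```

No learning rate appears anywhere in the call, which is the practical advantage over hand-written gradient descent.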

### Multiclass classification
E.g. weather: sunny (1), cloudy (2), rain (3), snow (4), etc.

#### One-vs-all (one-vs-rest)